Duplicate Medical Studies Identification


Tech Stack: Databricks, Python, PySpark, SQL, Pandas, Snorkel, Transformers, PyTorch, Scikit-learn


  • Developed an algorithm that checks whether a proposed medical study overlaps with any study already in the database, saving approximately $330,000 for each duplicate proposal identified.
  • Used Snorkel, a weak supervision framework, to encode business logic as labeling functions and generate an initial set of study-pair similarity labels from fields such as study title, primary objective, and endpoints.
  • Trained a baseline model on features derived from TF-IDF and medical BERT, then refined it through active learning iterations to accurately detect duplicate studies.
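
The weak-labeling step can be illustrated with a minimal sketch. It is not the project's actual code: the field names (`title`, `primary_objective`, `endpoints`), the thresholds, and the three labeling functions are hypothetical, and a plain majority vote stands in for Snorkel's label aggregation so that the example runs with the standard library alone.

```python
import re

# Label values follow the usual weak-supervision convention.
DUPLICATE, NOT_DUPLICATE, ABSTAIN = 1, 0, -1


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (0.0 if either is empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def overlap_vote(score, high=0.6, low=0.1):
    """Map a similarity score to a vote; thresholds are illustrative assumptions."""
    if score >= high:
        return DUPLICATE
    if score <= low:
        return NOT_DUPLICATE
    return ABSTAIN


# Hypothetical labeling functions, one per study field.
def lf_title(pair):
    return overlap_vote(jaccard(pair[0]["title"], pair[1]["title"]))


def lf_objective(pair):
    return overlap_vote(
        jaccard(pair[0]["primary_objective"], pair[1]["primary_objective"])
    )


def lf_endpoints(pair):
    a = {e.lower() for e in pair[0]["endpoints"]}
    b = {e.lower() for e in pair[1]["endpoints"]}
    if not a or not b:
        return ABSTAIN  # nothing to compare
    return DUPLICATE if a & b else NOT_DUPLICATE


LABELING_FUNCTIONS = [lf_title, lf_objective, lf_endpoints]


def weak_label(pair):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [lf(pair) for lf in LABELING_FUNCTIONS]
    dup, non = votes.count(DUPLICATE), votes.count(NOT_DUPLICATE)
    if dup == non:
        return ABSTAIN
    return DUPLICATE if dup > non else NOT_DUPLICATE


# Made-up studies, used only to exercise the sketch.
study_a = {
    "title": "Efficacy of Drug X in Type 2 Diabetes",
    "primary_objective": "Assess change in HbA1c after 24 weeks of Drug X",
    "endpoints": ["HbA1c change", "Fasting glucose"],
}
study_b = {
    "title": "Efficacy of Drug X in Adults with Type 2 Diabetes",
    "primary_objective": "Assess change in HbA1c after 24 weeks of Drug X treatment",
    "endpoints": ["HbA1c change"],
}
study_c = {
    "title": "Safety of Vaccine Y in Infants",
    "primary_objective": "Measure adverse event rate within 30 days",
    "endpoints": ["Adverse event rate"],
}

print(weak_label((study_a, study_b)))  # 1 (DUPLICATE)
print(weak_label((study_a, study_c)))  # 0 (NOT_DUPLICATE)
```

Pairs on which the vote abstains are the natural candidates to send for manual review, which is where an active learning loop like the one described above would pick up.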