Duplicate Medical Studies Identification

Tech Stack: Databricks, Python, PySpark, SQL, Pandas, Snorkel, Transformers, PyTorch, Scikit-learn

Developed an algorithm to assess whether a proposed medical study overlaps with any study already in the database, saving approximately $330,000 for every duplicate proposal identified.
Used Snorkel, a weak supervision framework, to encode business logic and generate an initial set of similarity labels for study pairs, based on fields such as study title, primary objective, and endpoints.
Trained a baseline model on features derived from TF-IDF and a medical BERT model, then refined it through active learning iterations to detect duplicate studies accurately.
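
Illustrative sketch of the weak-labeling step, in plain Python rather than the Snorkel API. The three rules, their thresholds, the field names (`title`, `objective`, `endpoints`), and the sample studies are hypothetical stand-ins for the project's business logic, and a simple majority vote takes the place of Snorkel's label model:

```python
from typing import Callable, Dict, List, Set

# Label values for a study pair. ABSTAIN means "this rule has no opinion".
DUPLICATE, DISTINCT, ABSTAIN = 1, 0, -1

Study = Dict[str, str]


def tokens(text: str) -> Set[str]:
    """Lowercase whitespace tokenization; a stand-in for real text cleaning."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]; callers must pass non-empty sets."""
    return len(a & b) / len(a | b)


def lf_title_overlap(x: Study, y: Study) -> int:
    # Hypothetical rule: near-identical titles suggest a duplicate.
    tx, ty = tokens(x["title"]), tokens(y["title"])
    if not tx or not ty:
        return ABSTAIN
    return DUPLICATE if jaccard(tx, ty) >= 0.8 else ABSTAIN


def lf_objective_mismatch(x: Study, y: Study) -> int:
    # Hypothetical rule: objectives sharing no words suggest distinct studies.
    ox, oy = tokens(x["objective"]), tokens(y["objective"])
    if not ox or not oy:
        return ABSTAIN
    return DISTINCT if jaccard(ox, oy) == 0.0 else ABSTAIN


def lf_endpoints_match(x: Study, y: Study) -> int:
    # Hypothetical rule: identical endpoint word sets suggest a duplicate.
    ex, ey = tokens(x["endpoints"]), tokens(y["endpoints"])
    return DUPLICATE if ex and ex == ey else ABSTAIN


LabelingFunction = Callable[[Study, Study], int]
LABELING_FUNCTIONS: List[LabelingFunction] = [
    lf_title_overlap,
    lf_objective_mismatch,
    lf_endpoints_match,
]


def weak_label(x: Study, y: Study) -> int:
    """Majority vote over non-abstaining rules; ABSTAIN on a tie or no votes."""
    votes = [lf(x, y) for lf in LABELING_FUNCTIONS]
    dup, dis = votes.count(DUPLICATE), votes.count(DISTINCT)
    if dup == dis:
        return ABSTAIN
    return DUPLICATE if dup > dis else DISTINCT


# Made-up studies for illustration only.
study_a = {"title": "Efficacy of drug X in type 2 diabetes",
           "objective": "assess glycemic control with drug X",
           "endpoints": "hba1c fasting glucose"}
study_b = {"title": "Efficacy of drug X in type 2 diabetes",
           "objective": "evaluate glycemic control under drug X",
           "endpoints": "fasting glucose hba1c"}
study_c = {"title": "Surgical outcomes after knee replacement",
           "objective": "compare recovery time between implants",
           "endpoints": "pain score mobility"}

label_ab = weak_label(study_a, study_b)  # title and endpoint rules both fire
label_ac = weak_label(study_a, study_c)  # only the objective rule fires
```

Pairs that end up as `ABSTAIN` are the ones the rules cannot decide, which makes them natural candidates for manual review in later active learning rounds.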