Duplicate Medical Studies Identification
Tech Stack: Databricks, Python, PySpark, SQL, Pandas, Snorkel, Transformers, PyTorch, Scikit-learn
- Developed an algorithm to assess whether a proposed medical study overlaps with a study already in the database, saving approximately $330,000 for every duplicate proposal identified.
- Used Snorkel, a weak supervision framework, to encode business logic as labeling functions and generate an initial set of study-pair similarity labels from fields such as study title, primary objective, and endpoints.
- Trained a baseline model on TF-IDF and medical BERT features, then refined it through active learning iterations to accurately detect duplicate studies.
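
The weak-labeling and active-learning steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's code: the labeling functions, thresholds, and field names are hypothetical, a simple majority vote stands in for Snorkel's label model, and `most_uncertain` shows uncertainty sampling as one common active-learning selection rule (assumed here, since the selection strategy is not stated).

```python
from collections import Counter

DUPLICATE, DISTINCT, ABSTAIN = 1, 0, -1


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical labeling functions; the thresholds are illustrative only.
def lf_title_overlap(x: dict, y: dict) -> int:
    return DUPLICATE if jaccard(x["title"], y["title"]) >= 0.6 else ABSTAIN


def lf_objective_mismatch(x: dict, y: dict) -> int:
    return DISTINCT if jaccard(x["objective"], y["objective"]) <= 0.1 else ABSTAIN


def lf_same_endpoints(x: dict, y: dict) -> int:
    return DUPLICATE if set(x["endpoints"]) == set(y["endpoints"]) else ABSTAIN


LABELING_FUNCTIONS = [lf_title_overlap, lf_objective_mismatch, lf_same_endpoints]


def weak_label(x: dict, y: dict) -> int:
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [v for v in (lf(x, y) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    if counts[DUPLICATE] == counts[DISTINCT]:
        return ABSTAIN
    return counts.most_common(1)[0][0]


def most_uncertain(probabilities: list, k: int) -> list:
    """Indices of the k pairs whose duplicate probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]
```

In a loop of this shape, the weak labels would seed the baseline model, and the pairs returned by `most_uncertain` would be sent for manual review before retraining.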