Record Linking & Site Status Extraction

Tech Stack: Kedro, Python, PySpark, SQL, Pandas, Scikit-learn

Developed a web scraping module that extracts study objectives, site names, cities, and recruitment statuses from ClinicalTrials.gov.
Designed and implemented a pipeline that links client database records to the web-scraped data, making it possible to track changes in trial recruitment status and streamlining the patient enrollment process.
Mapped internal records to web-scraped records using fuzzy string matching followed by a decision tree classifier.
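
A minimal sketch of how the matching step described above could look, not the project's actual code. It assumes the fuzzy similarity scores are used as input features for the decision tree. The standard-library difflib stands in for whichever fuzzy matcher was used, and the field names (site, city), the toy labelled pairs, and the helper functions are hypothetical.

```python
from difflib import SequenceMatcher

from sklearn.tree import DecisionTreeClassifier


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (difflib as a stand-in fuzzy matcher)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pair_features(internal: dict, scraped: dict) -> list[float]:
    """One feature per compared field: site-name similarity and city similarity."""
    return [
        similarity(internal["site"], scraped["site"]),
        similarity(internal["city"], scraped["city"]),
    ]


# Hypothetical labelled pairs: (internal record, scraped record, 1 = same site).
labelled_pairs = [
    ({"site": "St. Mary's Hospital", "city": "Boston"},
     {"site": "St Marys Hospital", "city": "Boston"}, 1),
    ({"site": "Mayo Clinic", "city": "Rochester"},
     {"site": "The Mayo Clinic", "city": "Rochester"}, 1),
    ({"site": "City General Hospital", "city": "Austin"},
     {"site": "City General Hosp.", "city": "Austin"}, 1),
    ({"site": "St. Mary's Hospital", "city": "Boston"},
     {"site": "The Mayo Clinic", "city": "Rochester"}, 0),
    ({"site": "Mayo Clinic", "city": "Rochester"},
     {"site": "City General Hosp.", "city": "Austin"}, 0),
    ({"site": "City General Hospital", "city": "Austin"},
     {"site": "The Mayo Clinic", "city": "Rochester"}, 0),
]

X = [pair_features(internal, scraped) for internal, scraped, _ in labelled_pairs]
y = [label for _, _, label in labelled_pairs]

# A shallow tree learns similarity thresholds instead of relying on a hand-picked cutoff.
classifier = DecisionTreeClassifier(max_depth=2, random_state=0)
classifier.fit(X, y)


def is_match(internal: dict, scraped: dict) -> bool:
    """Predict whether two records refer to the same trial site."""
    return bool(classifier.predict([pair_features(internal, scraped)])[0] == 1)
```

For example, is_match can be called on a candidate pair such as an internal "St. Mary's Hospital" record and a scraped "Saint Mary's Hospital" record in the same city. In a real pipeline the candidate pairs would typically be pre-filtered (for example by trial identifier) before scoring, which is also an assumption here.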