Record Linking & Site Status Extraction


Tech Stack: Kedro, Python, PySpark, SQL, Pandas, Scikit-learn


  • Developed a web scraping module that extracts study objectives, site names, cities, and recruitment statuses from ClinicalTrials.gov.
  • Designed and implemented a pipeline that links client database records to the web-scraped data, making it possible to track changes in trial recruitment status and support the patient enrollment process.
  • Used fuzzy string matching to score candidate record pairs, followed by a decision tree classifier, to map records between the internal database and the web-scraped data.
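
The matching approach in the last bullet can be illustrated with a minimal sketch. It is not the project's actual code: the field names (`site`, `city`), the sample records, and the helpers `similarity`, `features`, `fit_stump`, and `predict` are all hypothetical. To stay dependency-free it uses `difflib.SequenceMatcher` for the fuzzy scores and a hand-written depth-1 decision tree (a stump) in place of a library classifier; a real pipeline would more likely train a full decision tree on many labeled pairs.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def features(internal: dict, scraped: dict) -> list:
    """Turn one candidate record pair into a vector of fuzzy-match scores."""
    return [
        similarity(internal["site"], scraped["site"]),
        similarity(internal["city"], scraped["city"]),
    ]


def fit_stump(X: list, y: list) -> tuple:
    """Fit a depth-1 decision tree: choose the (feature, threshold) split
    with the fewest training errors, predicting 'match' above the threshold."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((row[j] > t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    if best is None:
        raise ValueError("no split possible: every feature is constant")
    return best[1], best[2]


def predict(stump: tuple, x: list) -> int:
    """Return 1 (same site) or 0 (different site) for one feature vector."""
    j, t = stump
    return int(x[j] > t)


# Hypothetical labeled pairs: (internal record, scraped record, is_match).
pairs = [
    ({"site": "St. Mary's Hospital", "city": "Boston"},
     {"site": "Saint Marys Hospital", "city": "Boston"}, 1),
    ({"site": "Mayo Clinic", "city": "Rochester"},
     {"site": "Mayo Clinic Rochester", "city": "Rochester"}, 1),
    ({"site": "City Cancer Center", "city": "Austin"},
     {"site": "Lakeside Heart Institute", "city": "Denver"}, 0),
    ({"site": "University Medical Center", "city": "Dallas"},
     {"site": "Riverside Eye Clinic", "city": "Miami"}, 0),
]

X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
stump = fit_stump(X, y)
```

Calling `predict(stump, features(internal_record, scraped_record))` then classifies a new candidate pair. Splitting the work this way keeps the fuzzy scores as interpretable inputs while letting the classifier learn the cut-off from labeled examples instead of relying on a hand-picked similarity threshold.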