https://github.com/ahmedshahriar/bd-medicine-scraper

Scrapy-Django PostgreSQL integrated API with Proxy IP configuration that scrapes all medicine data (meds, prices, generics, companies, indications) from Bangladesh (30k+ pages)

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.3%) to scientific vocabulary

Keywords

django django-rest-framework drug manufacturer medicine medicine-database postgresql proxy-ip python python3 rest-api scrapy web-scraping
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ahmedshahriar
  • License: apache-2.0
  • Language: Python
  • Default Branch: dev
  • Homepage:
  • Size: 133 KB
Statistics
  • Stars: 55
  • Watchers: 1
  • Forks: 14
  • Open Issues: 12
  • Releases: 0
Topics
django django-rest-framework drug manufacturer medicine medicine-database postgresql proxy-ip python python3 rest-api scrapy web-scraping
Created about 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

bd-medicine-scraper

Overview

Welcome to the bd-medicine-scraper repository!

In this project, I scraped medicine data from medex.com.bd using Scrapy and integrated it with Django REST Framework. The data is stored in a PostgreSQL database. I designed the scraper to preserve the relations between models.
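As a rough illustration of what preserving relations involves, the sketch below stores each scraped medicine only after resolving its generic and manufacturer to existing rows, creating them when they are missing. It is not the repository's code: the table names, column names, and sample items are assumptions, and SQLite stands in for PostgreSQL so the example runs without a database server.

```python
import sqlite3

# Stand-in schema: table and column names are assumptions for illustration,
# and SQLite replaces PostgreSQL so the sketch runs without a server.
SCHEMA = """
CREATE TABLE generic (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE manufacturer (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE medicine (
    id INTEGER PRIMARY KEY,
    brand_name TEXT,
    generic_id INTEGER REFERENCES generic(id),
    manufacturer_id INTEGER REFERENCES manufacturer(id)
);
"""


def get_or_create(conn, table, name):
    """Return the id of the row called `name`, inserting it if it is missing."""
    # `table` only ever comes from the fixed names in SCHEMA, never from scraped input.
    row = conn.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    if row is not None:
        return row[0]
    return conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,)).lastrowid


def save_medicine(conn, item):
    """Store one scraped item, linking it to its generic and manufacturer rows."""
    generic_id = get_or_create(conn, "generic", item["generic"])
    manufacturer_id = get_or_create(conn, "manufacturer", item["manufacturer"])
    conn.execute(
        "INSERT INTO medicine (brand_name, generic_id, manufacturer_id) VALUES (?, ?, ?)",
        (item["brand_name"], generic_id, manufacturer_id),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample items, not real scraped data.
for item in [
    {"brand_name": "Brand A", "generic": "Generic X", "manufacturer": "Example Pharma"},
    {"brand_name": "Brand B", "generic": "Generic X", "manufacturer": "Other Pharma"},
]:
    save_medicine(conn, item)
```

Both sample medicines end up pointing at the same generic row, which is the property a relation-preserving scraper needs. In a Django project the same idea is usually expressed with the ORM's get_or_create on the related models.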

I also customized the Django admin panels with additional features:
  • autocomplete lookup for relational fields
  • custom filtering (alphabetical, model property)
  • bulk actions (export to CSV)

Other customizations:
  • a custom Scrapy command to run Scrapy spiders from the Django command line (e.g. python manage.py <spider_name>)
  • custom Django commands
    • to export models to CSV (python manage.py <export_model_name> <export_data_path>), e.g. python manage.py export_medicine_data /home/ahmed/Desktop/medicine_data.csv
    • to export generic monograph PDFs: python manage.py exportgenericsmonograph

I also added proxy configuration to Scrapy.
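The README does not say how the proxies are wired in, so the following is only one common approach, not necessarily the one used here. Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"], so a small downloader middleware can pick a proxy for each request. The class name RandomProxyMiddleware and the PROXY_LIST setting are hypothetical names introduced for this sketch.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that assigns a proxy to each request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, not a built-in Scrapy setting.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy was already chosen elsewhere.
        if self.proxies and "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None tells Scrapy to continue processing the request normally.
        return None
```

To enable a middleware like this, it would be listed in DOWNLOADER_MIDDLEWARES in settings.py under its own module path, with an order below 750 so that it runs before the built-in HttpProxyMiddleware, and PROXY_LIST would hold proxy URLs such as "http://host:port".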

Run

Create a Python virtual environment and run this command from the root directory:

pip install -r requirements.txt

To run the Django app:

python manage.py runserver

NB: Migrate before running the app:

python manage.py makemigrations && python manage.py migrate

To run all spiders:

python run_crawler.py

To run a specific spider:

python manage.py <spider_name>

For example:

python manage.py med

Data Analytics

Dataset

The scraped dataset is available on Kaggle: Assorted Medicine Dataset of Bangladesh

The dataset has 6 CSV files. Here is a list of the files with their featured columns:

  1. medicine.csv (21k+ entries)
    • brand name
    • medicine type (allopathic or herbal)
    • generic
    • strength
    • manufacturer
    • package container (unit price and pack info)
    • package size (unit price)
  2. manufacturer.csv (245 entries)
    • name
  3. indication.csv (2k+ entries)
    • name
  4. generic.csv (~1700-1800 entries)
    • name
    • monographic link (PDF URL)
    • drug class
    • indication
    • generic details such as "Indication description", "Pharmacology description", "Dosage & Administration description" etc.
  5. drug class.csv (~400 entries)
    • name
  6. dosage form.csv (~120 entries)
    • name
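Once downloaded, the files can be explored with the standard library alone. The sketch below counts rows per value of a column. The header names "brand name" and "manufacturer" and the inline sample rows are assumptions for illustration, so check the actual headers in the Kaggle files before using it.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up stand-in for medicine.csv; the header names are assumed.
sample = io.StringIO(
    "brand name,manufacturer\n"
    "Brand A,Example Pharma\n"
    "Brand B,Example Pharma\n"
    "Brand C,Other Pharma\n"
)

counts = count_by_column(sample, "manufacturer")
print(counts.most_common(1))
```

For the real file, replace the in-memory sample with open("medicine.csv", newline="", encoding="utf-8") in a with block.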

Analytics

Bangladesh Medicine Analytics - Notebook on Kaggle

Tests

Workflow script - django-ci.yml

Run the tests with coverage:

coverage run --omit='*/venv/*' manage.py test

or without it:

python manage.py test

Generate the HTML coverage report:

coverage html

Built With

  • Django==3.2.12
  • djangorestframework==3.12.2
  • django-admin-autocomplete-filter==0.7.1
  • django-filter==21.1
  • coverage==6.2
  • Scrapy==2.4.1
  • scrapy-djangoitem==1.1.1
  • psycopg2==2.9.3

Preview

django_admin_generics

django_admin_medicine

django_admin_dosage_form

django_admin_manufacturer

Owner

  • Name: Ahmed Shahriar Sakib
  • Login: ahmedshahriar
  • Kind: user
  • Location: Ontario, Canada
  • Company: @criticalml-uw

Software Engineer, an expert in web scraping & automation, data analytics, and machine learning. Kaggle Master.

GitHub Events

Total
  • Issues event: 1
  • Watch event: 18
  • Fork event: 6
Last Year
  • Issues event: 1
  • Watch event: 18
  • Fork event: 6

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 129
  • Total Committers: 1
  • Avg Commits per committer: 129.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Ahmed Shahriar Sakib a****b@g****m 129

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 1
  • Total pull requests: 12
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.08
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 10
Past Year
  • Issues: 1
  • Pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.11
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 9
Top Authors
Issue Authors
  • ohiduzzamansiam1 (1)
  • asrafulattare (1)
Pull Request Authors
  • dependabot[bot] (18)
  • ahmedshahriar (2)
  • tafhimulkabir (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (18)