https://github.com/ahmedshahriar/bd-medicine-scraper
Scrapy-Django PostgreSQL integrated API with Proxy IP configuration that scrapes all medicine data (meds, prices, generics, companies, indications) from Bangladesh (30k+ pages)
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.3%) to scientific vocabulary
Statistics
- Stars: 55
- Watchers: 1
- Forks: 14
- Open Issues: 12
- Releases: 0
Metadata Files
README.md
bd-medicine-scraper
Overview
Welcome to the bd-medicine-scraper repository!
In this project, I scraped medicine data from medex.com.bd using Scrapy and integrated it with Django REST Framework. The data is stored in a PostgreSQL database. I designed the scraper to preserve the relations between models.
I also customized the Django admin panels and added features such as:
- autocomplete lookup for relational fields
- custom filtering (alphabetical, model property)
- bulk actions (export to CSV)
Other Customizations:
- a custom Scrapy command to run spiders from the Django command line (e.g. python manage.py <spider_name>)
- custom Django commands
- to export models to CSV (python manage.py <export_model_name> <export_data_path>)
python manage.py export_medicine_data /home/ahmed/Desktop/medicine_data.csv
- to export generic monograph PDFs
python manage.py exportgenericsmonograph
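The core of a CSV export command like the one above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual implementation: the function name `export_rows_to_csv` and the field names in the usage note are hypothetical. Inside a Django command the rows would presumably come from a queryset, for example via `Model.objects.values(*fieldnames)`.

```python
import csv
from pathlib import Path
from typing import Iterable, Mapping, Sequence


def export_rows_to_csv(rows: Iterable[Mapping[str, object]],
                       fieldnames: Sequence[str],
                       export_path: str) -> int:
    """Write dict-like rows to a CSV file and return how many rows were written.

    Hypothetical helper: keys not listed in `fieldnames` are ignored,
    and missing keys are written as empty cells.
    """
    path = Path(export_path)
    count = 0
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Calling `export_rows_to_csv(rows, ["brand_name", "strength"], "medicine_data.csv")` would write a header line followed by one line per row. The column names here are placeholders, not the project's model fields.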
I also added proxy configuration to scrapy.
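One common way to add proxies in Scrapy is a small downloader middleware that sets `request.meta["proxy"]`, which Scrapy's built-in `HttpProxyMiddleware` then honors. The sketch below shows that pattern under stated assumptions: the class name `RandomProxyMiddleware` and the `PROXY_LIST` setting are hypothetical, and the repository may configure its proxy differently.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that assigns a proxy to each request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting name, not a Scrapy built-in.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy was already chosen for them.
        if self.proxies and "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

To take effect, a middleware like this would be registered in the `DOWNLOADER_MIDDLEWARES` setting with an order value below that of `HttpProxyMiddleware` (750 by default), so the proxy is set before the built-in middleware runs.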
Run
Create a Python virtual environment and run these commands from the root directory:
pip install -r requirements.txt
Run the Django app:
python manage.py runserver
NB: Apply migrations before running the app:
python manage.py makemigrations && python manage.py migrate
To run all spiders:
python run_crawler.py
To run a specific spider:
python manage.py <spider_name>
e.g. python manage.py med
Data Analytics
Dataset
The scraped dataset is available on Kaggle: Assorted Medicine Dataset of Bangladesh
The dataset has 6 CSV files. Here is a list of the files with their featured columns:
- medicine.csv (21k+ entries)
- brand name
- medicine type (allopathic or herbal)
- generic
- strength
- manufacturer
- package container (unit price and pack info)
- package size (unit price)
- manufacturer.csv (245 entries)
- name
- indication.csv (2k+ entries)
- name
- generic.csv (~1700-1800 entries)
- name
- monographic link (PDF URL)
- drug class
- indication
- generic details such as "Indication description", "Pharmacology description", "Dosage & Administration description" etc.
- drug class.csv (~400 entries)
- name
- dosage form.csv (~120 entries)
- name
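As a quick way to explore files like these, the sketch below counts rows per value of one column using only the standard library. The in-memory sample stands in for medicine.csv, and its header names (`brand name`, `generic`, `manufacturer`) are assumptions based on the column list above; the real files may use different headers.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny in-memory sample standing in for medicine.csv; header names are assumed.
sample = io.StringIO(
    "brand name,generic,manufacturer\n"
    "BrandA,GenericX,MakerOne\n"
    "BrandB,GenericX,MakerTwo\n"
    "BrandC,GenericY,MakerOne\n"
)
counts = count_by_column(sample, "manufacturer")
print(counts.most_common(1))  # [('MakerOne', 2)]
```

For the real dataset, the same function can be given a file opened with `open(path, newline="", encoding="utf-8")` in place of the sample.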
Analytics
Bangladesh Medicine Analytics - Notebook on Kaggle
Tests
Workflow script - django-ci.yml
Run the tests using:
coverage run --omit='*/venv/*' manage.py test
or
python manage.py test
Check the coverage
coverage html
Built With
Django==3.2.12
djangorestframework==3.12.2
django-admin-autocomplete-filter==0.7.1
django-filter==21.1
coverage==6.2
Scrapy==2.4.1
scrapy-djangoitem==1.1.1
psycopg2==2.9.3
Owner
- Name: Ahmed Shahriar Sakib
- Login: ahmedshahriar
- Kind: user
- Location: Ontario, Canada
- Company: @criticalml-uw
- Website: https://ahmedshahriar.com
- Twitter: ahmed__shahriar
- Repositories: 5
- Profile: https://github.com/ahmedshahriar
Software Engineer, an expert in web scraping & automation, data analytics, and machine learning. Kaggle Master.
GitHub Events
Total
- Issues event: 1
- Watch event: 18
- Fork event: 6
Last Year
- Issues event: 1
- Watch event: 18
- Fork event: 6
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Ahmed Shahriar Sakib | a****b@g****m | 129 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 12
- Average time to close issues: N/A
- Average time to close pull requests: about 1 month
- Total issue authors: 1
- Total pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.08
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 10
Past Year
- Issues: 1
- Pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: about 1 month
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.11
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 9
Top Authors
Issue Authors
- ohiduzzamansiam1 (1)
- asrafulattare (1)
Pull Request Authors
- dependabot[bot] (18)
- ahmedshahriar (2)
- tafhimulkabir (1)