dhscrapers
A unified interface for scrapers for Digital Humanities resources
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.0%) to scientific vocabulary
Keywords
Repository
A unified interface for scrapers for Digital Humanities resources
Basic Info
Statistics
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 3
- Releases: 1
Topics
Metadata Files
README.md
DHScrapers
This repo attempts to collect various scrapers into one place, and to create some re-usable base code in the process. As such, it is an attempt to make life easier for future developers who need to do some scraping (quickly). The idea is that each specific scraper becomes its own module, with classes in it that inherit from the base classes.
Installation
Make sure you have Python 3 installed (any version should do; tested with 3.6) and create a virtualenv. There are two ways to install; which one you need depends on the scraper you intend to use. If that scraper does not have a requirements.txt in its module, the general dependencies will do: just run pip install -r requirements.txt from the project root folder.
In the other case, do the equivalent of pip install -r iis/requirements.txt, where iis is the name of the module (i.e. scraper) you want to use.
Once this is done, start the virtualenv, and you're good to go!
Overview
Scraper
Scraping stuff from the web is divided into three (optional) steps by this module:
- COLLECT the html from a webpage
- PARSE the html into entities, i.e. extract the info we need
- EXPORT these entities into the format(s) we want
These steps translate into the three respective base classes. In addition, there is the idea of a base_entity and some children of that (book_edition and book_review at the time of writing). These allow us to present the info that we extracted from any html in a uniform way to the exporter.
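The three steps above can be sketched roughly as follows. All class names, method names, and data in this sketch are illustrative placeholders, not the module's actual API:

```python
# Hypothetical sketch of the COLLECT -> PARSE -> EXPORT pipeline.
from dataclasses import dataclass


@dataclass
class BaseEntity:
    """Uniform container for info extracted from a page."""

    def to_dict(self):
        return vars(self)


@dataclass
class BookReview(BaseEntity):
    author: str
    rating: int


class Collector:
    def collect(self, url):
        # in a real scraper this would perform an HTTP request
        return "<html>...</html>"


class Parser:
    def parse(self, html):
        # extract the info we need from the html
        return [BookReview(author="A. Reader", rating=4)]


class Exporter:
    def export(self, entities):
        # serialize each entity's dict into the desired format
        return [e.to_dict() for e in entities]


html = Collector().collect("https://example.com/review/1")
entities = Parser().parse(html)
rows = Exporter().export(entities)
```

Because every entity exposes the same `to_dict()` interface, the exporter never needs to know which concrete entity it is handling.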
General
Calling a specific scraper should be as easy as python -m goodreads --whatever_args_we_expect. Therefore, any new module should have a __main__.py entrypoint. If the scraper should be re-usable, it might be nice to create a neat command line interface for it here. Next to __main__.py should be a script that is the main entrypoint for non-commandline use (e.g. goodreads.py). This script can be as simple or as complex as you want: it could implement the steps described above directly, or import them from other scripts.
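A minimal entrypoint along those lines might look like this; the argument names and the `run()` helper are invented for the example and are not part of the actual modules:

```python
# Sketch of a scraper module's __main__.py; names are illustrative.
import argparse


def run(author, out_format):
    # a real scraper would collect, parse, and export here
    return f"scraping reviews for {author} -> {out_format}"


def main(argv=None):
    parser = argparse.ArgumentParser(prog="goodreads")
    parser.add_argument("--author", required=True)
    parser.add_argument("--format", default="csv")
    args = parser.parse_args(argv)
    return run(args.author, args.format)


if __name__ == "__main__":
    print(main())
```

With such an entrypoint in place, `python -m goodreads --author "Austen"` would invoke `main()` with the command-line arguments.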
Base classes
| class | features |
| ----- | ----- |
| collector | handle actual requests, some url utility functions |
| parser | create BeautifulSoup and GenderDetector instances, some whitespace utility functions |
| exporter | export collected entities into different formats (CSV, TXT, XML) |
Entities
Typically, we either need the entire page, for example when each page is a data instance (e.g. each page is XML containing one entity / instance), or we need to parse the HTML to extract the info we need. For the second case, this module offers base entities, whose most important feature is presenting the extracted info uniformly, enabling re-use of the exporter class.
The important thing with entities is that they will be translated into dicts. By default, to_dict() returns the output of vars() (which reads __dict__ under the hood). If need be, this can be customized by overriding to_dict, for example when one of the field values needs to be calculated from others, or when specific formatting is required.
Tip: the order of fields in the export(s) can be influenced by listing the fields of the entity in the class' constructor in the desired order (see goodreads.entities.review.py for an example).
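Both points can be illustrated with a hypothetical entity; the field names and the rating format below are invented for the example:

```python
# Default to_dict() via vars(), plus a customized override.
class BaseEntity:
    def to_dict(self):
        # vars() returns the instance's __dict__
        return vars(self)


class BookReview(BaseEntity):
    def __init__(self, author, rating, max_rating=5):
        # assignment order in the constructor determines field
        # order in the export (dicts preserve insertion order)
        self.author = author
        self.rating = rating
        self.max_rating = max_rating

    def to_dict(self):
        # customized: derive one field from two others
        d = vars(self).copy()
        d["rating"] = f"{d.pop('rating')}/{d.pop('max_rating')}"
        return d


review = BookReview("A. Reader", 4)
print(review.to_dict())  # {'author': 'A. Reader', 'rating': '4/5'}
```

Swapping the two assignments in `__init__` would swap the column order in the resulting export.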
Utilities / Logging
There is currently one utility in the utilities module: an initializer for a logger. Calling this function will give you a logger that will log INFO (and above) to the console, and DEBUG and above to a file. Simply call this function before doing anything else, and you can import the logger (logger = logging.getLogger(__name__)) in each script (parser, collector, etc) and start logging.
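A minimal sketch of what such an initializer could look like; the `init_logger` name and log file path are assumptions, and the real utility's signature may differ:

```python
# Root-logger setup: INFO+ to console, DEBUG+ to a file.
import logging


def init_logger(logfile="scraper.log"):
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)       # INFO and above to console

    filehandler = logging.FileHandler(logfile)
    filehandler.setLevel(logging.DEBUG)  # DEBUG and above to file

    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (console, filehandler):
        handler.setFormatter(fmt)
        root.addHandler(handler)
```

Because the handlers are attached to the root logger, `logging.getLogger(__name__)` in any script propagates its records to both targets.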
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: DHScrapers
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- name: >-
Research Software Lab, Centre for Digital Humanities,
Utrecht University
city: Utrecht
country: NL
website: >-
https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/
repository-code: 'https://github.com/CentreForDigitalHumanities/DHScrapers'
abstract: >-
This software provides an interface to facilitate and
unify scrapers for Digital Humanities research.
keywords:
- scraping
- digital humanities
- python
- parsing
license: MIT
version: 0.1.0
date-released: '2024-12-12'
GitHub Events
Total
- Create event: 3
- Release event: 1
- Issues event: 2
- Member event: 2
- Issue comment event: 4
- Push event: 137
- Pull request event: 1
Last Year
- Create event: 3
- Release event: 1
- Issues event: 2
- Member event: 2
- Issue comment event: 4
- Push event: 137
- Pull request event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alex Hebing | a****g@u****l | 88 |
| BeritJanssen | b****n@g****m | 14 |
| unknown | r****t@c****l | 9 |
| Giorgos Damaskos | g****s@u****l | 7 |
| R. Loeber | r****r@u****l | 6 |
| José de Kruif | J****f@u****l | 3 |
| Luka van der Plas | l****s@u****l | 3 |
| Alex Hebing | a****g@g****m | 2 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 6
- Total pull requests: 2
- Average time to close issues: about 6 hours
- Average time to close pull requests: about 19 hours
- Total issue authors: 4
- Total pull request authors: 2
- Average comments per issue: 1.33
- Average comments per pull request: 2.0
- Merged pull requests: 1
- Bot issues: 2
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 1
- Average time to close issues: 2 minutes
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 1.33
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 2
- Bot pull requests: 0
Top Authors
Issue Authors
- lukavdplas (2)
- JeltevanBoheemen (1)
- BeritJanssen (1)
- github-actions[bot] (1)
Pull Request Authors
- BeritJanssen (2)
- lukavdplas (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- lxml *
- beautifulsoup4 *
- dicttoxml *
- gender_guesser *
- iso-639 *
- langdetect *
- pytest *
- requests *
- selenium *
- attrs ==19.3.0
- beautifulsoup4 ==4.8.2
- certifi ==2019.11.28
- chardet ==3.0.4
- dicttoxml ==1.7.4
- gender-guesser ==0.4.0
- idna ==2.9
- importlib-metadata ==1.5.0
- iso-639 ==0.4.5
- langdetect ==1.0.8
- more-itertools ==8.2.0
- packaging ==20.3
- pluggy ==0.13.1
- py ==1.8.1
- pyparsing ==2.4.6
- pytest ==5.4.1
- requests ==2.23.0
- selenium ==3.141.0
- six ==1.14.0
- soupsieve ==2.0
- urllib3 ==1.25.8
- wcwidth ==0.1.8
- zipp ==3.1.0