dhscrapers
A unified interface for scrapers for Digital Humanities resources
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.0%) to scientific vocabulary
Keywords
Repository
A unified interface for scrapers for Digital Humanities resources
Basic Info
Statistics
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 3
- Releases: 1
Topics
Metadata Files
README.md
DHScrapers
This repo attempts to collect various scrapers into one place, and to create some re-usable base code in the process. As such, it is an attempt to make life easier for future developers who need to do some scraping (quickly). The idea is that each specific scraper becomes its own module, with classes in it that inherit from the base classes.
Installation
Make sure you have Python 3 installed (any version should do; tested with 3.6) and create a virtualenv. There are two ways to install; which one you need depends on the scraper you intend to use. If that scraper does not have a requirements.txt in its module, the general dependencies will do: just run pip install -r requirements.txt from the project root folder.
In the other case, do the equivalent of pip install -r iis/requirements.txt, where iis is the name of the module (i.e. scraper) you want to use.
Once this is done, start the virtualenv, and you're good to go!
Overview
Scraper
Scraping stuff from the web is divided into three (optional) steps by this module:
- COLLECT the html from a webpage
- PARSE the html into entities, i.e. extract the info we need
- EXPORT these entities into the format(s) we want
These steps translate into the three respective base classes. In addition, there is the idea of a base_entity and some children of that (book_edition and book_review at the time of writing). These allow us to present the info that we extracted from any html in a uniform way to the exporter.
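The three steps above can be sketched roughly as follows. All class names, method names, and data in this sketch are illustrative placeholders, not the module's actual API:

```python
# Hypothetical sketch of the COLLECT -> PARSE -> EXPORT pipeline.
from dataclasses import dataclass


@dataclass
class BaseEntity:
    """Uniform container for info extracted from a page."""

    def to_dict(self):
        return vars(self)


@dataclass
class BookReview(BaseEntity):
    author: str
    rating: int


class Collector:
    def collect(self, url):
        # in a real scraper this would perform an HTTP request
        return "<html>...</html>"


class Parser:
    def parse(self, html):
        # extract the info we need from the html
        return [BookReview(author="A. Reader", rating=4)]


class Exporter:
    def export(self, entities):
        # serialize each entity's dict into the desired format
        return [e.to_dict() for e in entities]


html = Collector().collect("https://example.com/review/1")
entities = Parser().parse(html)
rows = Exporter().export(entities)
```

Because every entity exposes the same `to_dict()` interface, the exporter never needs to know which concrete entity it is handling.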
General
Calling a specific scraper should be as easy as python -m goodreads --whatever_args_we_expect. Therefore, any new module should have a __main__.py entrypoint. If the scraper should be re-usable, it might be nice to create a neat command line interface for it here. Next to __main__.py should be a script that is the main entrypoint for non-commandline use (e.g. goodreads.py). This script can be as simple or as complex as you want: it could implement the steps described above directly, or import them from other scripts.
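A minimal entrypoint along those lines might look like this; the argument names and the `run()` helper are invented for the example and are not part of the actual modules:

```python
# Sketch of a scraper module's __main__.py; names are illustrative.
import argparse


def run(author, out_format):
    # a real scraper would collect, parse, and export here
    return f"scraping reviews for {author} -> {out_format}"


def main(argv=None):
    parser = argparse.ArgumentParser(prog="goodreads")
    parser.add_argument("--author", required=True)
    parser.add_argument("--format", default="csv")
    args = parser.parse_args(argv)
    return run(args.author, args.format)


if __name__ == "__main__":
    print(main())
```

With such an entrypoint in place, `python -m goodreads --author "Austen"` would invoke `main()` with the command-line arguments.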
Base classes
| class | features |
| ----- | ----- |
| collector | handle actual requests, some url utility functions |
| parser | create BeautifulSoup and GenderDetector instances, some whitespace utility functions |
| exporter | export collected entities into different formats (CSV, TXT, XML) |
Entities
Typically, we either need the entire page, for example when each page is a data instance (e.g. each page is XML containing one entity / instance), or we need to parse the HTML to extract the info we need. For the second case, this module offers base entities, whose most important feature is presenting the extracted info uniformly, enabling re-use of the exporter class.
The important thing with entities is that they will be translated into dicts. By default, to_dict() returns the output of vars() (which reads __dict__ under the hood). If need be, this can be customized by overriding to_dict, for example when one of the field values needs to be calculated from others, or when specific formatting is required.
Tip: the order of fields in the export(s) can be influenced by listing the fields of the entity in the class' constructor in the desired order (see goodreads.entities.review.py for an example).
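Both points can be illustrated with a hypothetical entity; the field names and the rating format below are invented for the example:

```python
# Default to_dict() via vars(), plus a customized override.
class BaseEntity:
    def to_dict(self):
        # vars() returns the instance's __dict__
        return vars(self)


class BookReview(BaseEntity):
    def __init__(self, author, rating, max_rating=5):
        # assignment order in the constructor determines field
        # order in the export (dicts preserve insertion order)
        self.author = author
        self.rating = rating
        self.max_rating = max_rating

    def to_dict(self):
        # customized: derive one field from two others
        d = vars(self).copy()
        d["rating"] = f"{d.pop('rating')}/{d.pop('max_rating')}"
        return d


review = BookReview("A. Reader", 4)
print(review.to_dict())  # {'author': 'A. Reader', 'rating': '4/5'}
```

Swapping the two assignments in `__init__` would swap the column order in the resulting export.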
Utilities / Logging
There is currently one utility in the utilities module: an initializer for a logger. Calling this function will give you a logger that will log INFO (and above) to the console, and DEBUG and above to a file. Simply call this function before doing anything else, and you can import the logger (logger = logging.getLogger(__name__)) in each script (parser, collector, etc) and start logging.
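A minimal sketch of what such an initializer could look like; the `init_logger` name and log file path are assumptions, and the real utility's signature may differ:

```python
# Root-logger setup: INFO+ to console, DEBUG+ to a file.
import logging


def init_logger(logfile="scraper.log"):
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)       # INFO and above to console

    filehandler = logging.FileHandler(logfile)
    filehandler.setLevel(logging.DEBUG)  # DEBUG and above to file

    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (console, filehandler):
        handler.setFormatter(fmt)
        root.addHandler(handler)
```

Because the handlers are attached to the root logger, `logging.getLogger(__name__)` in any script propagates its records to both targets.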
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: DHScrapers
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- name: >-
Research Software Lab, Centre for Digital Humanities,
Utrecht University
city: Utrecht
country: NL
website: >-
https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/
repository-code: 'https://github.com/CentreForDigitalHumanities/DHScrapers'
abstract: >-
This software provides an interface to facilitate and
unify scrapers for Digital Humanities research.
keywords:
- scraping
- digital humanities
- python
- parsing
license: MIT
version: 0.1.0
date-released: '2024-12-12'
GitHub Events
Total
- Create event: 3
- Release event: 1
- Issues event: 2
- Member event: 2
- Issue comment event: 4
- Push event: 137
- Pull request event: 1
Last Year
- Create event: 3
- Release event: 1
- Issues event: 2
- Member event: 2
- Issue comment event: 4
- Push event: 137
- Pull request event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alex Hebing | a****g@u****l | 88 |
| BeritJanssen | b****n@g****m | 14 |
| unknown | r****t@c****l | 9 |
| Giorgos Damaskos | g****s@u****l | 7 |
| R. Loeber | r****r@u****l | 6 |
| José de Kruif | J****f@u****l | 3 |
| Luka van der Plas | l****s@u****l | 3 |
| Alex Hebing | a****g@g****m | 2 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 6
- Total pull requests: 2
- Average time to close issues: about 6 hours
- Average time to close pull requests: about 19 hours
- Total issue authors: 4
- Total pull request authors: 2
- Average comments per issue: 1.33
- Average comments per pull request: 2.0
- Merged pull requests: 1
- Bot issues: 2
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 1
- Average time to close issues: 2 minutes
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 1.33
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 2
- Bot pull requests: 0
Top Authors
Issue Authors
- lukavdplas (2)
- JeltevanBoheemen (1)
- BeritJanssen (1)
- github-actions[bot] (1)
Pull Request Authors
- BeritJanssen (2)
- lukavdplas (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- lxml *
- beautifulsoup4 *
- dicttoxml *
- gender_guesser *
- iso-639 *
- langdetect *
- pytest *
- requests *
- selenium *
- attrs ==19.3.0
- beautifulsoup4 ==4.8.2
- certifi ==2019.11.28
- chardet ==3.0.4
- dicttoxml ==1.7.4
- gender-guesser ==0.4.0
- idna ==2.9
- importlib-metadata ==1.5.0
- iso-639 ==0.4.5
- langdetect ==1.0.8
- more-itertools ==8.2.0
- packaging ==20.3
- pluggy ==0.13.1
- py ==1.8.1
- pyparsing ==2.4.6
- pytest ==5.4.1
- requests ==2.23.0
- selenium ==3.141.0
- six ==1.14.0
- soupsieve ==2.0
- urllib3 ==1.25.8
- wcwidth ==0.1.8
- zipp ==3.1.0