dhscrapers

A unified interface for scrapers for Digital Humanities resources

https://github.com/centrefordigitalhumanities/dhscrapers

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.0%) to scientific vocabulary

Keywords

digtial-humanities python scraping
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: CentreForDigitalHumanities
  • License: MIT
  • Language: Python
  • Default Branch: develop
  • Homepage:
  • Size: 4.34 MB
Statistics
  • Stars: 0
  • Watchers: 4
  • Forks: 0
  • Open Issues: 3
  • Releases: 1
Topics
digtial-humanities python scraping
Created over 5 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

DHScrapers

This repo collects various scrapers in one place and builds some re-usable base code in the process. As such, it is an attempt to make life easier for future developers who need to do some scraping (quickly). The idea is that each specific scraper becomes its own module, with classes in it that inherit from the base classes.

Installation

Make sure you have Python 3 installed (any version should do; tested with 3.6) and create a virtualenv. There are two ways to install, and which one you need depends on the scraper you intend to use. If the scraper you want to use does not have its own requirements.txt in the module, the general dependencies will do: just run pip install -r requirements.txt from the project root folder.

In the other case, run the equivalent of pip install -r iis/requirements.txt, where iis is the name of the module (i.e. scraper) you want to use.

Once this is done, activate the virtualenv, and you're good to go!

Overview

Scraper

Scraping stuff from the web is divided into three (optional) steps by this module:

  1. COLLECT the html from a webpage
  2. PARSE the html into entities, i.e. extract the info we need
  3. EXPORT these entities into the format(s) we want

These steps translate into the three respective base classes. In addition, there is the idea of a base_entity and some children of that (book_edition and book_review at the time of writing). These allow us to present the info that we extracted from any html in a uniform way to the exporter.
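The three steps above can be sketched as follows. This is an illustrative outline only: the class and method names are assumptions, not the repo's actual API, and the "parsing" is deliberately trivial.

```python
# Sketch of the COLLECT -> PARSE -> EXPORT pipeline.
# Class and method names are illustrative, not the repo's actual API.

class Collector:
    """Step 1: COLLECT -- fetch raw HTML for a URL."""
    def collect(self, url):
        # A real collector would use requests.get(url).text;
        # we return a canned page to keep the sketch self-contained.
        return "<html><body><h1>Some Title</h1></body></html>"

class Parser:
    """Step 2: PARSE -- turn HTML into entities (plain dicts here)."""
    def parse(self, html):
        # A real parser would build a BeautifulSoup tree instead
        # of naive string splitting.
        title = html.split("<h1>")[1].split("</h1>")[0]
        return [{"title": title}]

class Exporter:
    """Step 3: EXPORT -- write entities to the desired format."""
    def export(self, entities):
        return "\n".join(str(entity) for entity in entities)

html = Collector().collect("https://example.com")
entities = Parser().parse(html)
print(Exporter().export(entities))
```

A concrete scraper module would subclass these and override only the step it does differently.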

General

Calling a specific scraper should be as easy as python -m goodreads --whatever_args_we_expect. Therefore, any new module should have a __main__.py entrypoint. If the scraper should be re-usable, it might be nice to create a neat command line interface for it here. Next to __main__.py should be a script that is the main entrypoint for non-commandline use (e.g. goodreads.py). This script can be as simple or as complex as you want. It could implement the steps described above directly, or import them from other scripts.
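A minimal __main__.py for such a module might look like the sketch below. The argument names (--url, --format) are hypothetical, not the module's actual CLI.

```python
# Hypothetical goodreads/__main__.py sketch; the flags are
# illustrative, not the module's actual command line interface.
import argparse

def main(argv=None):
    parser = argparse.ArgumentParser(
        prog="goodreads", description="Scrape book data")
    parser.add_argument("--url", required=True, help="page to scrape")
    parser.add_argument("--format", default="csv",
                        choices=["csv", "txt", "xml"],
                        help="export format")
    args = parser.parse_args(argv)
    # In the real module this would delegate to goodreads.py,
    # the non-commandline entrypoint.
    message = f"scraping {args.url} -> {args.format}"
    print(message)
    return message

if __name__ == "__main__":
    main()
```

With this in place, python -m goodreads --url ... works, while other scripts can import and call the non-commandline entrypoint directly.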

Base classes

| class | features |
| ----- | ----- |
| collector | handle actual requests, some url utility functions |
| parser | create BeautifulSoup and GenderDetector instances, some whitespace utility functions |
| exporter | export collected entities into different formats (CSV, TXT, XML) |

Entities

Typically, we either need the entire page, for example when each page is a data instance (e.g. each page is XML containing one entity / instance), or we need to parse the HTML to extract the info we need. For the second case, this module offers base entities, the most important feature of which is a uniform representation that enables re-use of the exporter class.

The important thing with entities is that they will be translated into dicts. By default, to_dict() returns the output of vars() (which reads __dict__ under the hood). If need be, this can be customized by overriding to_dict, for example if one of the field values needs to be calculated on the basis of others, or specific formatting is required.

Tip: the order of fields in the export(s) can be influenced by listing the entity's fields in the constructor in the desired order (see goodreads.entities.review.py for an example).
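Both ideas, default to_dict() via vars() and a customized override with a derived field, can be sketched like this. BaseEntity, BookReview, and the field names are hypothetical stand-ins for the repo's actual classes.

```python
# Hypothetical entity sketch; the class and field names are
# illustrative, not the repo's actual base_entity / book_review.

class BaseEntity:
    def to_dict(self):
        # vars() returns the instance's __dict__, which preserves
        # the insertion order of attributes set in __init__ --
        # this is what makes the constructor order control
        # the column order in the export.
        return vars(self)

class BookReview(BaseEntity):
    def __init__(self, author, rating, text):
        # Fields listed in the desired export order.
        self.author = author
        self.rating = rating
        self.text = text

    def to_dict(self):
        # Customized: add a field derived from another field,
        # on top of the default dict.
        d = super().to_dict().copy()
        d["is_positive"] = self.rating >= 4
        return d

review = BookReview("Ada", 5, "Loved it")
print(review.to_dict())
```

Because the derived field is appended last, it also appears last in any export that iterates over the dict.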

Utilities / Logging

There is currently one utility in the utilities module: an initializer for a logger. Calling this function will give you a logger that logs INFO (and above) to the console, and DEBUG (and above) to a file. Simply call this function before doing anything else; then you can import the logger (logger = logging.getLogger(__name__)) in each script (parser, collector, etc.) and start logging.
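The setup described above can be sketched with the standard library's logging module. The function name and handler configuration here are assumptions, not the repo's actual initializer.

```python
# Sketch of a logger initializer like the one described:
# INFO+ to the console, DEBUG+ to a file. Name and details
# are assumptions, not the repo's actual utilities code.
import logging

def init_logger(logfile="scraper.log"):
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    # INFO and above to the console
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    root.addHandler(console)
    # DEBUG and above to a file
    filehandler = logging.FileHandler(logfile)
    filehandler.setLevel(logging.DEBUG)
    root.addHandler(filehandler)

init_logger()
# Each script (parser, collector, ...) then simply does:
logger = logging.getLogger(__name__)
logger.info("scraper started")
```

Because the handlers hang off the root logger, every module-level logger obtained via getLogger(__name__) inherits them automatically.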

Owner

  • Name: Centre for Digital Humanities
  • Login: CentreForDigitalHumanities
  • Kind: organization
  • Email: cdh@uu.nl
  • Location: Netherlands

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: DHScrapers
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: >-
      Research Software Lab, Centre for Digital Humanities,
      Utrecht University
    city: Utrecht
    country: NL
    website: >-
      https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/
repository-code: 'https://github.com/CentreForDigitalHumanities/DHScrapers'
abstract: >-
  This software provides an interface to facilitate and
  unify scrapers for Digital Humanities research.
keywords:
  - scraping
  - digital humanities
  - python
  - parsing
license: MIT
version: 0.1.0
date-released: '2024-12-12'

GitHub Events

Total
  • Create event: 3
  • Release event: 1
  • Issues event: 2
  • Member event: 2
  • Issue comment event: 4
  • Push event: 137
  • Pull request event: 1
Last Year
  • Create event: 3
  • Release event: 1
  • Issues event: 2
  • Member event: 2
  • Issue comment event: 4
  • Push event: 137
  • Pull request event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 132
  • Total Committers: 8
  • Avg Commits per committer: 16.5
  • Development Distribution Score (DDS): 0.333
Past Year
  • Commits: 13
  • Committers: 2
  • Avg Commits per committer: 6.5
  • Development Distribution Score (DDS): 0.462
Top Committers
Name Email Commits
Alex Hebing a****g@u****l 88
BeritJanssen b****n@g****m 14
unknown r****t@c****l 9
Giorgos Damaskos g****s@u****l 7
R. Loeber r****r@u****l 6
José de Kruif J****f@u****l 3
Luka van der Plas l****s@u****l 3
Alex Hebing a****g@g****m 2

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 6
  • Total pull requests: 2
  • Average time to close issues: about 6 hours
  • Average time to close pull requests: about 19 hours
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.33
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 2
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 1
  • Average time to close issues: 2 minutes
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 1.33
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 2
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lukavdplas (2)
  • JeltevanBoheemen (1)
  • BeritJanssen (1)
  • github-actions[bot] (1)
Pull Request Authors
  • BeritJanssen (2)
  • lukavdplas (1)

Dependencies

epidat/requirements.txt pypi
  • lxml *
iis/requirements.txt pypi
  • lxml *
requirements.in pypi
  • beautifulsoup4 *
  • dicttoxml *
  • gender_guesser *
  • iso-639 *
  • langdetect *
  • pytest *
  • requests *
  • selenium *
requirements.txt pypi
  • attrs ==19.3.0
  • beautifulsoup4 ==4.8.2
  • certifi ==2019.11.28
  • chardet ==3.0.4
  • dicttoxml ==1.7.4
  • gender-guesser ==0.4.0
  • idna ==2.9
  • importlib-metadata ==1.5.0
  • iso-639 ==0.4.5
  • langdetect ==1.0.8
  • more-itertools ==8.2.0
  • packaging ==20.3
  • pluggy ==0.13.1
  • py ==1.8.1
  • pyparsing ==2.4.6
  • pytest ==5.4.1
  • requests ==2.23.0
  • selenium ==3.141.0
  • six ==1.14.0
  • soupsieve ==2.0
  • urllib3 ==1.25.8
  • wcwidth ==0.1.8
  • zipp ==3.1.0