corpus-scraper

Automated corpus data collection for several online word-in-context corpuses

https://github.com/centrefordigitalhumanities/corpus-scraper

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.7%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Automated corpus data collection for several online word-in-context corpuses

Basic Info

Host: GitHub
Owner: CentreForDigitalHumanities
License: mit
Language: JavaScript
Default Branch: develop
Homepage:
Size: 80.1 KB

Statistics

Stars: 4
Watchers: 11
Forks: 1
Open Issues: 1
Releases: 1

Created almost 12 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

CorpusScraper

By Research Software Lab, Centre for Digital Humanities, Utrecht University

Motivation

Some online text concordance corpora, such as Corpus del Español, do not offer a button to download your search results in a practical format, effectively expecting you to copy, paste and edit the data manually page by page. CorpusScraper is a bookmarklet that automates that work for you, so saving your search results becomes nearly as easy as if there would be such a button.

See https://centrefordigitalhumanities.github.io/corpus-scraper/ for a full explanation of the usage.

What’s included

The bookmarklet is written for Chrome and appears to work in Safari and Firefox as well. Internet Explorer seems not to work for most corpora, but your mileage may vary (either way, you should have at least version 9 installed).

Currently, the script can scrape data from all byu.edu corpora and from Real Academia Española (CREA/CORDE). As of version 2.0, corpora from the Fundación Rafael Lapesa are supported as well (CORPES, CREA, CNDHE). It is written such that you can add your own implementations for other online concordance corpora.

How to add support for a new online text corpus

The top anonymous function of the script contains a variable declaration that looks like this:

var domains = {
    ...
};

containing key-value pairs that look like this:

'corpus.example.org': {...}

add such a key-value pair for each domain that you want to add support for, using an existing pair as an example. Note that you can create aliases afterwards if the same implementation works on multiple domains, like in the following line:

domains['www.corpusdelespanol.org'] = domains['corpus.byu.edu'];

At the very least, the value part should contain the following four members:

columns: [...]

An array of strings, representing the names of the columns that will be extracted. This is considered a promise. Every data row that your domain implementation outputs, should have the same length as this array and contain the corresponding fields in the same order.

init: function ( ) {...}

Initializes the target and progressSteps top-level variables. target must be set to the innermost frame or window that contains the first page of data, while progressSteps should be set to the number of pages to extract from. Make sure to have a special case for when there is only a single page of results with no navigation available (see the Corpus del Español example).

getNext: function (doc) {...}

Returns either the URL to the next page in doc, which is a Document object, or a function which accepts a callback, fetches the next document and finally calls the callback with the next document. If there is no next page it returns undefined. Note that you cannot return the URL simply as anchor.href, because Chrome does not add the href property to anchor elements when the containing document is parsed in the background without rendering it. Instead, you should return anchor.getAttribute('href').

scrape1page: function (doc) {...}

Extracts the data from the page represented by doc (again a Document object) and appends those data to the top-level data array, one nested array per row (so data becomes an array of arrays of strings or numbers). Note that each row should contain the same number of fields in the same order as described in your columns member.

Please take note that your custom domain implementation should only use the target, data and progressSteps identifiers from the top-level function. Do not use window.document or window.jQuery, because that will probably not work out as you expect. You can however use doc.querySelector and doc.querySelectorAll instead. See http://www.w3.org/TR/selectors-api/ for a specification of these selectors.

Owner

Name: Centre for Digital Humanities
Login: CentreForDigitalHumanities
Kind: organization
Email: cdh@uu.nl
Location: Netherlands

Website: https://cdh.uu.nl/
Repositories: 39
Profile: https://github.com/CentreForDigitalHumanities

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

GitHub Events

Total

Member event: 2

Last Year

Member event: 2

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 7
Total pull requests: 1
Average time to close issues: 2 months
Average time to close pull requests: about 17 hours
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 0.57
Average comments per pull request: 3.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: about 17 hours
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 3.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science