https://github.com/alan-turing-institute/defoe

Code to analyse books and newspapers data using Apache Spark.

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: alan-turing-institute
License: mit
Language: Lex
Default Branch: master
Homepage:
Size: 117 MB

Statistics

Stars: 17
Watchers: 11
Forks: 3
Open Issues: 16
Releases: 0

Created over 7 years ago · Last pushed over 4 years ago

Metadata Files

Readme License

"Defoe" - analysis of historical books and newspapers data

This repository contains deprecated code to analyse historical books and newspapers datasets using Apache Spark.

The current live repository is at https://github.com/defoe-code

Supported datasets

British Library Books

This dataset consists of ~1TB of digitised versions of ~68,000 books from the 16th to the 19th centuries. The books have been scanned into a collection of XML documents. Each book has one XML document per page plus one XML document for metadata about the book as a whole. The XML documents for each book are held within a compressed ZIP file. Each ZIP file holds the XML documents for a single book (the exception is 1880-1889's 000000037_0_1-42pgs__944211_dat.zip, which holds the XML documents for 2 books). These ZIP files occupy ~224GB.
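As a rough illustration of the per-book layout described above, the following sketch separates the metadata document from the page documents inside one archive. It is not taken from the defoe code base: the helper names `split_book_archive` and `build_demo_archive` are hypothetical, and the naming pattern (`<id>_metadata.xml` for metadata, `ALTO/<id>_<page>.xml` for pages) is an assumption inferred from the fixture file names listed under "Third-party data".

```python
import io
import zipfile


def split_book_archive(zip_bytes):
    """Return (metadata_names, page_names) for one book archive.

    Assumes metadata documents end in "_metadata.xml" and that every
    other XML document in the archive is a page document.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        xml_names = [n for n in archive.namelist() if n.lower().endswith(".xml")]
    metadata = sorted(n for n in xml_names if n.endswith("_metadata.xml"))
    pages = sorted(n for n in xml_names if not n.endswith("_metadata.xml"))
    return metadata, pages


def build_demo_archive():
    """Build a tiny in-memory ZIP that mimics the assumed layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("000000037_metadata.xml", "<metadata/>")
        archive.writestr("ALTO/000000037_000001.xml", "<alto/>")
        archive.writestr("ALTO/000000037_000002.xml", "<alto/>")
    return buffer.getvalue()


metadata, pages = split_book_archive(build_demo_archive())
print(len(metadata), "metadata document(s),", len(pages), "page document(s)")
```

An archive that holds two books, like the exception noted above, would simply yield two metadata names, so counting metadata documents is one plausible way to detect it.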

This dataset is available under an open, public domain, licence. See Datasets for content mining and BL Labs Flickr Data: Book data and tag history (Dec 2013 - Dec 2014). For links to the data itself, see Digitised Books largely from the 19th Century. The data is provided by Gale, a division of CENGAGE.

British Library Newspapers

This dataset consists of ~1TB of digitised versions of newspapers from the 18th to the early 20th century. Each newspaper has an associated folder of XML documents where each XML document corresponds to a single issue of the newspaper. Each XML document conforms to a British Library-specific XML schema.

This dataset is available, under licence, from Gale, a division of CENGAGE. The dataset is in 5 parts, e.g. Part I: 1800-1900. For links to all 5 parts, see British Library Newspapers.

Times Digital Archive

The code can also handle the Times Digital Archive (TDA).

This dataset is available, under licence, from Gale, a division of CENGAGE.

The code was used with papers from 1785-2009.

Find My Past Newspapers

This dataset is available, under licence, from Find My Past. To run queries with this dataset we can choose to use either:

- ALTO model: for running queries at page level. These are the same queries as for the BL books.
- FMP model: for running queries at article level.

Papers Past New Zealand and Pacific newspapers

Papers Past provide digitised New Zealand and Pacific newspapers from the 19th and 20th centuries.

Data can be accessed via API calls which return search results in the form of XML documents. Each XML document holds one or more articles.

This dataset is available, under licence, from Papers Past.

National Library of Scotland - Encyclopaedia Britannica, 1768-1860

The National Library of Scotland provides the digitised Encyclopaedia Britannica from the 18th and 19th centuries.

Data can be downloaded as a zip file: 155,388 ALTO XML files; 195 METS files; 155,388 image files. See copyright restrictions.

Get started

Set up (local):

Set up local environment

Set up (Urika):

Set up Urika environment
Import data into Urika
Import British Library Books and Newspapers data into Urika (Alan Turing Institute-Scottish Enterprise Data Engineering Program University of Edinburgh project members only)

Run queries:

Available queries:

ALTO documents (British Library Books and Find My Past Newspapers (at page level))
British Library Newspapers (these can also be run on the Times Digital Archive)
FMP newspapers (Find My Past Newspapers datasets at article level)
Papers Past New Zealand and Pacific newspapers
Generic XML document queries (these can be run on arbitrary XML documents)
NLS queries (these can be run on the Encyclopaedia Britannica dataset)
HDFS queries - running queries against HDFS files - for interoperability across models.
PostgreSQL queries - running queries against PostgreSQL database - for interoperability across models.
ES queries - running queries against ES - for interoperability across models.

Developers:

Origins and acknowledgements

British Library Books analysis code

The code to analyse the British Library Books dataset has its origins in the first phase of 'Enabling Complex Analysis of Large Scale Digital Collections', a project funded by the Jisc Research Data Spring in 2015.

The project team included: Melissa Terras (UCL), James Baker (British Library), David Beavan (UCL), James Hetherington (UCL), Martin Zaltz Austwick (UCL), Oliver Duke-Williams (UCL), Will Finley (University of Sheffield), Helen O'Neill (UCL), Anne Welsh (UCL).

The code originated from the GitHub repository UCL-dataspring/cluster-code:

Branch: sparkrods.
Commit: 08d8bfd0a6cf37f7e4408a9475b38d6747c0cfeb (10 November 2016).
Developers: James Hetherington (UCL), James Baker (BL)

Times Digital Archive and British Library Newspapers analysis code

The code to analyse the Times Digital Archive and British Library Newspapers datasets has its origins in code developed by UCL to analyse the Times Digital Archive. This work took place from 2016 to 2018.

The project team included: James Hetherington (UCL), Raquel Alegre (UCL), Roma Klapaukh (UCL).

The code originated from the GitHub repository UCL/inewspaperrods:

Branch: master.
Commit: ffe58042b7c4655274aa6b99fbdd6f6b0304f7ff (22 June 2018)
Developers: James Hetherington (UCL), Raquel Alegre (UCL), Roma Klapaukh (UCL).

Analysing humanities data using Cray Urika-GX

Both of the above codes were updated and extended by EPCC as part of the Research Engineering Group of The Alan Turing Institute. The work focused on running both codes on the Alan Turing Institute Cray Urika-GX Service and analysing the British Library Books, British Library Newspapers and Papers Past New Zealand and Pacific newspapers datasets.

This work was done in conjunction with Melissa Terras, College of Arts, Humanities and Social Sciences (CAHSS), The University of Edinburgh. The work was funded by Scottish Enterprise as part of the Alan Turing Institute-Scottish Enterprise Data Engineering Program. This work runs from 2018 to 2019 and is ongoing at present, using this repository.

The project team includes: Rosa Filgueira (EPCC), Mike Jackson (EPCC), Anna Roubickova (EPCC).

The code originated from the GitHub repositories:

alan-turing-institute/cluster-code
- Branch: epcc-sparkrods
- Commit: 00561bff61030fdff131a20fe45ede97897c4743 (21 December 2018)
alan-turing-institute/inewspaperrods
- Branch: epcc-master
- Commit: b9c89764f97987ff1600a35cc3d3bc7bb68da79f (28 January 2019).
alan-turing-institute/inewspaperrods
- Branch: other-archives
- Commit: 43748ccd3839b71347660f4375e9a18c45648118 (13 February 2019).
Developers: Rosa Filgueira (EPCC), Mike Jackson (EPCC), Anna Roubickova (EPCC).

Living With Machines

The code to analyse the Find My Past Newspapers dataset and to support blobs on Azure was developed by David Beavan (The Alan Turing Institute) as part of Living With Machines funded by UKRI's Strategic Priorities Fund and led by the Arts and Humanities Research Council (AHRC). Living With Machines runs from 2018-2023 and is ongoing at present using this repository.

The development team includes: David Beavan (Alan Turing Institute), Rosa Filgueira (EPCC), Mike Jackson (EPCC).

The code originated from the GitHub repositories:

DavidBeavan/cluster-code
- Branch: epcc-sparkrods
- Commit: 8e37fdaa0a57e164aecbdadaa4981b5b225a3932 (15 January 2019)
DavidBeavan/cluster-code
- Branch: azure-sparkrods
- Commit: 8110fb498631edcc5b385029cf5a45dd91d216fc (23 November 2018)
Developer: David Beavan (Alan Turing Institute)

Name

The code is called "defoe" after Daniel Defoe, writer, journalist and pamphleteer of the 17th and 18th centuries.

Copyright and licence

All code is available for use and reuse under an MIT Licence. See LICENSE.

Third-party data

defoe/test/books/fixtures/000000037_0_1-42pgs__944211_dat_modified.zip

A modified copy of the file 000000037_0_1-42pgs__944211_dat.zip from OCR text derived from digitised books published 1880 - 1889 in ALTO XML (doi: 10.21250/db11), which is licensed under CC0 1.0 Public Domain.

The modifications are as follows:

000000037_metadata.xml:

- <MODS:placeTerm type="text">Manchester</MODS:placeTerm>
+ <MODS:placeTerm type="text">Manchester [1823]</MODS:placeTerm>

000000218_metadata.xml:

- <MODS:placeTerm type="text">London</MODS:placeTerm>
+ <MODS:placeTerm type="text">London [1823]</MODS:placeTerm>
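The kind of change listed above (appending a bracketed year to each textual place term) could be reproduced with a short script along these lines. This is only a sketch, not how the fixture was actually produced: the function name `append_to_place_terms` is hypothetical, the sample document is a made-up minimal fragment, and the namespace URI bound to the `MODS` prefix is assumed to be the standard MODS v3 one.

```python
import xml.etree.ElementTree as ET

# Assumption: the MODS prefix in the metadata files is bound to the
# standard MODS v3 namespace URI.
MODS_NS = {"MODS": "http://www.loc.gov/mods/v3"}

# Hypothetical minimal fragment; real metadata files contain far more.
SAMPLE = """<MODS:mods xmlns:MODS="http://www.loc.gov/mods/v3">
  <MODS:originInfo>
    <MODS:place>
      <MODS:placeTerm type="text">Manchester</MODS:placeTerm>
    </MODS:place>
  </MODS:originInfo>
</MODS:mods>"""


def append_to_place_terms(xml_text, suffix):
    """Append `suffix` to every textual placeTerm and return the root."""
    root = ET.fromstring(xml_text)
    for term in root.iterfind(".//MODS:placeTerm[@type='text']", MODS_NS):
        term.text = f"{term.text} {suffix}"
    return root


root = append_to_place_terms(SAMPLE, "[1823]")
places = [t.text for t in root.iterfind(".//MODS:placeTerm", MODS_NS)]
print(places)
```

The same lookup, without the modification step, is a plausible way to read the place of publication back out of a metadata document.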

defoe/test/alto/fixtures/000000037_000005.xml

A copy of the file ALTO/000000037_000005.xml from the above file.

defoe/test/papers/fixtures/1912_11_10.xml

A copy of the file newsrods/test/fixtures/20000424.xml from UCL/inewspaperrods. The file has been renamed, most of its content removed, and its data replaced by dummy data.

Owner

Name: The Alan Turing Institute
Login: alan-turing-institute
Kind: organization
Email: info@turing.ac.uk

Website: https://turing.ac.uk
Repositories: 477
Profile: https://github.com/alan-turing-institute

The UK's national institute for data science and artificial intelligence.

GitHub Events

Total

Issues event: 1
Issue comment event: 1

Last Year

Issues event: 1
Issue comment event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 25
Total pull requests: 4
Average time to close issues: 12 days
Average time to close pull requests: about 1 hour
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.52
Average comments per pull request: 0.5
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: 15 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 13.0%
