trove-newspaper-harvester
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: wragge
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Size: 572 KB
Statistics
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 4
- Releases: 6
Metadata Files
README.md
trove-newspaper-harvester
The Trove Newspaper (& Gazette) Harvester makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs. While the web interface will only show you the first 2,000 results matching your search, the Newspaper Harvester will get everything.
No installation required!
If you want to use the harvester without installing anything, just head over to the Trove Newspaper Harvester section in my GLAM Workbench.
Installation
```sh
pip install trove-newspaper-harvester
```
Before you do any harvesting you need to get yourself a Trove API key.
Use as a library
```python
from trove_newspaper_harvester.core import prepare_query, Harvester
```
Generate a set of query parameters using `prepare_query`.
```python
my_query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
my_api_key = "mYSecREtkEy"
my_query_params = prepare_query(query=my_query)
```
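To see what a search URL copied from the Trove web interface actually carries, the sketch below splits it into its query-string parameters using only the standard library. It is an illustration, not how `prepare_query` works internally; the helper name `search_url_params` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def search_url_params(url: str) -> dict[str, list[str]]:
    """Return the query-string parameters of a search URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


my_query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
print(search_url_params(my_query))
# {'keyword': ['wragge']}
```

Inspecting the URL this way can be a quick check that you copied the search you intended before handing it to `prepare_query`.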
Initialise the `Harvester` with your query parameters and API key.
```python
harvester = Harvester(query_params=my_query_params, key=my_api_key)
```
Start the harvest!
```python
harvester.harvest()
```
If the harvest fails, just run `Harvester.harvest` again.
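If you would rather not re-run a failed harvest by hand, a small retry wrapper is one option. The sketch below is generic: `run_with_retries` is a hypothetical helper, and it assumes a failed harvest raises an exception and that calling the method again is safe, as the note above suggests.

```python
import time
from typing import Callable


def run_with_retries(
    harvest: Callable[[], object],
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
) -> int:
    """Call `harvest` until it succeeds and return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            harvest()
            return attempt
        except Exception as error:  # assumption: failures surface as exceptions
            if attempt == max_attempts:
                raise  # give up and let the caller see the last error
            print(f"Attempt {attempt} failed ({error}); retrying...")
            time.sleep(wait_seconds)
    raise ValueError("max_attempts must be at least 1")
```

With a harvester initialised as above, this would be called as `run_with_retries(harvester.harvest)`.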
See the core module documentation for more options and examples.
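Once a harvest has finished, the metadata CSV can be inspected with the standard library. This is a minimal sketch: `describe_results` is a hypothetical helper, and the location and column names of the CSV are not specified here, so the function simply reports whatever columns it finds.

```python
import csv


def describe_results(csv_path: str) -> tuple[list[str], int]:
    """Return the column names and the number of data rows in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)  # each row is one harvested article
        return list(reader.fieldnames or []), row_count
```

Call it with the path to the CSV that your harvest produced, for example `columns, total = describe_results(path_to_csv)`.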
Use as a command-line tool
There are three basic commands:
- start – start a new harvest
- restart – restart a stalled harvest
- report – view harvest details
Start a harvest
To start a new harvest you can just do:
```sh
troveharvester start "[Trove query]" [Trove API key]
```
The Trove query can either be a URL copied and pasted from a search in the Trove web interface, or a Trove API query URL constructed using something like the Trove API Console. Enclose the URL in double quotes so the shell does not interpret characters such as `?` and `&`.
See the CLI module documentation for more details.
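If you launch the command-line tool from a Python script, passing the arguments as a list sidesteps shell quoting altogether. The sketch below builds the `start` command shown above; `build_start_command` and `start_harvest` are hypothetical helper names, and only the `troveharvester start <query> <key>` form given in this README is assumed.

```python
import shlex
import subprocess


def build_start_command(query_url: str, api_key: str) -> list[str]:
    """Build the argument list for `troveharvester start`."""
    return ["troveharvester", "start", query_url, api_key]


def start_harvest(query_url: str, api_key: str) -> None:
    """Run the harvest; requires trove-newspaper-harvester to be installed."""
    subprocess.run(build_start_command(query_url, api_key), check=True)


command = build_start_command(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    "mYSecREtkEy",
)
# Print a shell-safe version of the command for copying into a terminal.
print(shlex.join(command))
```

Because the URL is a single list element, characters such as `?` and `&` reach the harvester unchanged without any extra quoting.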
Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.
Owner
- Name: Tim Sherratt
- Login: wragge
- Kind: user
- Website: https://timsherratt.org
- Repositories: 209
- Profile: https://github.com/wragge
Citation (citation.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Sherratt"
    given-names: "Tim"
    orcid: "https://orcid.org/0000-0001-7956-4498"
title: "trove-newspaper-harvester"
version: 0.6.4
doi: 10.5281/zenodo.7103174
date-released: 2022-10-11
url: "https://github.com/wragge/trove-newspaper-harvester"
```
GitHub Events
Total
- Issues event: 1
Last Year
- Issues event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Tim Sherratt | t****m@d****u | 21 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 9
- Total pull requests: 0
- Average time to close issues: 2 months
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 0.22
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- wragge (8)
- 5p4r74cu5 (1)
Packages
- Total packages: 1
- Total downloads: 23 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 10
- Total maintainers: 1
pypi.org: trove-newspaper-harvester
Tool for bulk harvests of digitised newspaper articles from Trove
- Homepage: https://github.com/wragge/trove-newspaper-harvester
- Documentation: https://trove-newspaper-harvester.readthedocs.io/
- License: MIT License
- Latest release: 0.7.2 (published over 2 years ago)
Dependencies
- fastai/workflows/quarto-ghp master composite