https://github.com/centre-for-humanities-computing/krak-phonebook-parser
https://github.com/centre-for-humanities-computing/krak-phonebook-parser
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Committers with academic emails
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: centre-for-humanities-computing
- Language: JavaScript
- Default Branch: master
- Size: 8.95 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Krak phonebook parser
A tool for extracting names, addresses and phone numbers from PDFs and parsing them according to a set of rules.
Created to extract demographic information from Krak's residency registry.
The OCR scanned PDFs can be located at https://bibliotek.kk.dk/temaer/krak/1990-2007/08
CLI installation
- Install Node.js (version?)
- Requires
pdftotextin the path (https://www.xpdfreader.com/pdftotext-man.html) - Clone this repository
- Navigate to the root of the repository and run
npm install
CLI usage
- Navigate to the root of the repository and run
node ./cli.js -s [input directory or file] -d [output directory]
- The input must be a valid PDF file or folder. If provided a folder, the extracter will read all PDFs in that folder and extract text from them.
- As an intermediate step, the script creates a temporary folder in the output directory (/temp/) with the raw text extracts. They are deleted after the parsing, unless flag
--keepis given. - If the output folder does not exist, it will be created.
- Note: All files must have their year in the filename in the format: YYYY. The script will take the first valid year in the filename as the basis for the output.
CLI options
-s, -source <path>[required]: A path to the file or folder containing the PDFs to be read and parsed. If the source path resolves to a file, extracts and parses only text from that file.-d, destination <directory>[required]: A path to the folder where data should be saved.-f, --file[optional]: Outputs parser statistics to file in the destination folder (stats/stats.txt). Otherwise prints to terminal.-t, --threshold <integer>[optional]: Defines a minimum line length (= number of characters) to consider when parsing. Disregards anything below this threshold. Default is5.-p, --parse[optional]: Parse only (skips the extraction of text from PDF). Useful when experimenting with different rules and thresholds.-k, --keep[optional]: Keeps the temporary folder and the raw text extracted from the PDFs. Must be used at least once before using the-p, --parseargument.-b, --debug[optional]: Debug mode. Prints a lot of things in the terminal while running.
Terminal output
When the parser has run, an estimate of success will be printed to the terminal (or to a file if flag --file is set).
Notes
Currently runs synchronously. Needs refactoring to extract text from multiple files simultaneously.
Finding unique names
unique.js is a simple script that finds all unique names in either a given year or for all years.
Arguments
<path>[required]: An absolute path to the directory that contains the output from the main CLI program. The data must be in the form produced by the CLI program (YYYY.ndjson).<integer>[optional]: A year in the format YYYY. If specified, generates a list of unique names only for that year provided the data for that year exists.
Usage
node ./unique.js [required: path to directory] [optional: year]
Example:
node ./unique.js /Users/me/data 2004
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
GitHub Events
Total
Last Year
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0