https://github.com/centre-for-humanities-computing/krak-phonebook-parser

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 12.7%, to scientific vocabulary)

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: centre-for-humanities-computing
Language: JavaScript
Default Branch: master
Size: 8.95 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed over 3 years ago

Metadata Files

Readme

README.md

Krak phonebook parser

A tool for extracting names, addresses and phone numbers from PDFs and parsing them according to a set of rules.

Created to extract demographic information from Krak's residency registry.

The OCR-scanned PDFs can be found at https://bibliotek.kk.dk/temaer/krak/1990-2007/08

CLI installation

Install Node.js (the required version is not specified)
Make sure pdftotext is available on your PATH (https://www.xpdfreader.com/pdftotext-man.html)
Clone this repository
Navigate to the root of the repository and run:

npm install

CLI usage

Navigate to the root of the repository and run

node ./cli.js -s [input directory or file] -d [output directory]

The input must be a valid PDF file or a folder. If given a folder, the extractor reads all PDFs in that folder and extracts text from them.
As an intermediate step, the script creates a temporary folder in the output directory (/temp/) containing the raw text extracts. These are deleted after parsing unless the --keep flag is given.
If the output folder does not exist, it will be created.
Note: All files must have their year in the filename in the format: YYYY. The script will take the first valid year in the filename as the basis for the output.
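
The filename rule above can be illustrated with a minimal sketch. The helper name firstYearInFilename is hypothetical and not taken from the repository, and two assumptions are made here: a "valid" year is taken to mean 1900 to 2099, and the year is assumed to stand alone rather than being part of a longer run of digits. The actual parser may apply different rules.

```javascript
const path = require('node:path');

// Hypothetical helper, not part of the repository.
// Returns the first plausible four-digit year in a file name, or null if none is found.
// Assumption: "valid" means 1900-2099; the real parser may use another range.
function firstYearInFilename(filePath) {
  const base = path.basename(filePath);
  // Collect every stand-alone four-digit run, in order of appearance.
  const candidates = base.match(/(?<!\d)\d{4}(?!\d)/g) || [];
  for (const candidate of candidates) {
    const year = Number(candidate);
    if (year >= 1900 && year <= 2099) return year;
  }
  return null;
}
```

With this sketch, a file named krak_1995_del2.pdf would map to the year 1995, while a file with no four-digit year in its name yields null and could be skipped or reported.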

CLI options

-s, --source <path> [required]: A path to the file or folder containing the PDFs to be read and parsed. If the source path resolves to a file, only that file is extracted and parsed.
-d, --destination <directory> [required]: A path to the folder where data should be saved.
-f, --file [optional]: Writes parser statistics to a file in the destination folder (stats/stats.txt). Otherwise they are printed to the terminal.
-t, --threshold <integer> [optional]: Defines a minimum line length (= number of characters) to consider when parsing. Disregards anything below this threshold. Default is 5.
-p, --parse [optional]: Parse only (skips the extraction of text from PDF). Useful when experimenting with different rules and thresholds.
-k, --keep [optional]: Keeps the temporary folder and the raw text extracted from the PDFs. Must be used at least once before using the -p, --parse argument.
-b, --debug [optional]: Debug mode. Prints a lot of things in the terminal while running.
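
As a rough illustration of what the --threshold option describes, the sketch below drops every line shorter than a minimum length, with the default of 5 taken from the option description. The function name filterShortLines is hypothetical, and measuring length on the trimmed line is an assumption; the parser itself may count raw characters.

```javascript
// Hypothetical helper, not part of the repository.
// Keeps only lines whose length is at least `threshold` characters.
// Assumption: whitespace is trimmed before measuring; the real parser may differ.
function filterShortLines(text, threshold = 5) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length >= threshold);
}
```

Raising the threshold discards more short fragments (page numbers, stray OCR characters) at the risk of dropping short but genuine entries, which is presumably why the option pairs well with --parse when experimenting.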

Terminal output

When the parser has finished, an estimate of its success rate is printed to the terminal (or written to a file if the --file flag is set).

Notes

The parser currently runs synchronously. It needs refactoring before it can extract text from multiple files simultaneously.

Finding unique names

unique.js is a simple script that finds all unique names in either a given year or for all years.

Arguments

<path> [required]: An absolute path to the directory that contains the output from the main CLI program. The data must be in the form produced by the CLI program (YYYY.ndjson).
<integer> [optional]: A year in the format YYYY. If specified, generates a list of unique names for that year only, provided data for that year exists.

Usage

node ./unique.js [required: path to directory] [optional: year]

Example:

node ./unique.js /Users/me/data 2004
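
The idea behind unique.js can be sketched as follows. This is not the script's actual code: the function name uniqueNames is hypothetical, and the field name `name` is an assumption about the shape of each record in the YYYY.ndjson files.

```javascript
// Hypothetical sketch, not the repository's unique.js.
// Collects the distinct names found in the contents of an NDJSON file.
// Assumption: each non-empty line is a JSON object with a string `name` field.
function uniqueNames(ndjsonText) {
  const names = new Set();
  for (const line of ndjsonText.split(/\r?\n/)) {
    if (line.trim() === '') continue; // skip blank lines, e.g. a trailing newline
    const record = JSON.parse(line);
    if (typeof record.name === 'string') names.add(record.name);
  }
  return [...names].sort();
}
```

To cover all years rather than one, the same Set could be filled from every YYYY.ndjson file in the data directory before sorting.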

Owner

Name: Center for Humanities Computing Aarhus
Login: centre-for-humanities-computing
Kind: organization
Email: chcaa@cas.au.dk
Location: Aarhus, Denmark

Website: https://chc.au.dk/
Repositories: 130
Profile: https://github.com/centre-for-humanities-computing

Committers

Last synced: 12 months ago

All Time

Total Commits: 29
Total Committers: 1
Avg Commits per committer: 29.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
donbjarkone	4****e	29

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
