https://github.com/centre-for-humanities-computing/krak-phonebook-parser

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: centre-for-humanities-computing
  • Language: JavaScript
  • Default Branch: master
  • Size: 8.95 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed over 3 years ago
Metadata Files
Readme

README.md

Krak phonebook parser

A tool for extracting names, addresses and phone numbers from PDFs and parsing them according to a set of rules.

Created to extract demographic information from Krak's residency registry.

The OCR-scanned PDFs are available at https://bibliotek.kk.dk/temaer/krak/1990-2007/08

CLI installation

  • Install Node.js (the required version is not specified)
  • Install pdftotext and make sure it is available on your PATH (https://www.xpdfreader.com/pdftotext-man.html)
  • Clone this repository
  • Navigate to the root of the repository and run

npm install

CLI usage

  • Navigate to the root of the repository and run

node ./cli.js -s [input directory or file] -d [output directory]

  • The input must be a valid PDF file or a folder. If given a folder, the extractor reads every PDF in that folder and extracts its text.
  • As an intermediate step, the script creates a temporary folder (/temp/) in the output directory to hold the raw text extracts. These are deleted after parsing unless the --keep flag is given.
  • If the output folder does not exist, it will be created.
  • Note: Every filename must contain its year in the format YYYY. The script uses the first valid year in the filename as the basis for the output.
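The year rule above can be illustrated with a minimal sketch. The helper name `firstYearInFilename` is hypothetical, and the README does not define what counts as a "valid" year, so the range 1900 to 2099 used here is an assumption rather than the parser's actual behaviour.

```javascript
const path = require('node:path');

// Hypothetical helper, not taken from the repository.
// Assumption: a "valid" year is a standalone four-digit number in 1900-2099.
function firstYearInFilename(filePath) {
  const name = path.basename(filePath);
  // The lookarounds keep the pattern from matching inside longer digit runs.
  const match = name.match(/(?<!\d)(?:19|20)\d{2}(?!\d)/);
  return match ? Number(match[0]) : null;
}

console.log(firstYearInFilename('/data/krak_1995_part2.pdf')); // 1995
console.log(firstYearInFilename('krak-2004-2005.pdf'));        // 2004
console.log(firstYearInFilename('scan.pdf'));                  // null
```

A filename without a recognisable year returns `null` here; how the real CLI handles that case is not documented.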

CLI options

  • -s, --source <path> [required]: A path to the file or folder containing the PDFs to be read and parsed. If the source path resolves to a file, only that file is extracted and parsed.
  • -d, --destination <directory> [required]: A path to the folder where data should be saved.
  • -f, --file [optional]: Outputs parser statistics to file in the destination folder (stats/stats.txt). Otherwise prints to terminal.
  • -t, --threshold <integer> [optional]: Defines a minimum line length (= number of characters) to consider when parsing. Disregards anything below this threshold. Default is 5.
  • -p, --parse [optional]: Parse only (skips the extraction of text from PDF). Useful when experimenting with different rules and thresholds.
  • -k, --keep [optional]: Keeps the temporary folder and the raw text extracted from the PDFs. Run with this flag at least once before using -p, --parse, which skips extraction and therefore needs the kept raw text.
  • -b, --debug [optional]: Debug mode. Prints verbose diagnostic output to the terminal while running.
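To make the --threshold option concrete, here is a minimal sketch of a length filter. The function name `filterShortLines` is hypothetical, and trimming whitespace before measuring is an assumption; the README only states that lines below the threshold are disregarded and that the default is 5.

```javascript
// Hypothetical helper, not taken from the repository.
// Keeps only lines with at least `threshold` characters (default 5).
// Assumption: surrounding whitespace is trimmed before the length check.
function filterShortLines(rawText, threshold = 5) {
  return rawText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length >= threshold);
}

const sample = 'Hansen Jens Vestergade 12 33 12 34 56\n12\n\nabcde';
console.log(filterShortLines(sample));
// [ 'Hansen Jens Vestergade 12 33 12 34 56', 'abcde' ]
```

Raising the threshold discards more OCR noise but may also drop short legitimate entries, which is why combining -p with different -t values is useful when experimenting.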

Terminal output

When the parser has finished, an estimate of its success rate is printed to the terminal (or written to a file if the --file flag is set).

Notes

The tool currently runs synchronously, processing one file at a time. Extracting text from multiple files simultaneously would require refactoring.
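One possible direction for that refactoring is to run several pdftotext processes at once behind a concurrency limit. The sketch below is not the repository's code: `mapWithLimit` and `extractText` are hypothetical names, and it assumes the plain `pdftotext <PDF-file> <text-file>` invocation.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');
const execFileAsync = promisify(execFile);

// Hypothetical helper: runs `worker` over `items` with at most `limit`
// tasks in flight, and returns the results in input order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    // Each runner repeatedly claims the next unprocessed index.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}

// Hypothetical wrapper: converts one PDF to text with pdftotext (must be on PATH).
function extractText(pdfPath, txtPath) {
  return execFileAsync('pdftotext', [pdfPath, txtPath]);
}

// Example (not executed here): extract up to four PDFs at a time.
// mapWithLimit(pdfPaths, 4, (pdf) => extractText(pdf, pdf.replace(/\.pdf$/i, '.txt')));
```

A limit is used instead of a bare `Promise.all` so that a folder with many PDFs does not spawn one process per file simultaneously.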


Finding unique names

unique.js is a simple script that finds all unique names, either for a given year or across all years.

Arguments

  • <path> [required]: An absolute path to the directory that contains the output from the main CLI program. The data must be in the form produced by the CLI program (YYYY.ndjson).
  • <integer> [optional]: A year in the format YYYY. If specified, generates a list of unique names for that year only, provided that data for that year exists.

Usage

node ./unique.js [required: path to directory] [optional: year]

Example:

node ./unique.js /Users/me/data 2004
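The general approach of such a script can be sketched as follows. This is not the code of unique.js: the function name `uniqueNames` is hypothetical, and the assumption that each NDJSON record has a string `name` field is a guess, since the README does not document the output schema.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical sketch, not the repository's unique.js.
// Assumption: every NDJSON line is a JSON object with a string `name` field.
function uniqueNames(dataDir, year) {
  const files = fs.readdirSync(dataDir)
    .filter((f) => /^\d{4}\.ndjson$/.test(f))                     // YYYY.ndjson only
    .filter((f) => year === undefined || f === `${year}.ndjson`); // optional year filter
  const names = new Set();
  for (const file of files) {
    const lines = fs.readFileSync(path.join(dataDir, file), 'utf8').split(/\r?\n/);
    for (const line of lines) {
      if (line.trim() === '') continue; // skip blank lines, e.g. a trailing newline
      const record = JSON.parse(line);
      if (typeof record.name === 'string') names.add(record.name);
    }
  }
  return [...names].sort();
}

// Example (not executed here): uniqueNames('/Users/me/data', 2004)
```

Omitting the year collects names from every YYYY.ndjson file in the directory; a year with no matching file yields an empty list.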

Owner

  • Name: Center for Humanities Computing Aarhus
  • Login: centre-for-humanities-computing
  • Kind: organization
  • Email: chcaa@cas.au.dk
  • Location: Aarhus, Denmark

Committers

Last synced: 12 months ago

All Time
  • Total Commits: 29
  • Total Committers: 1
  • Avg Commits per committer: 29.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
donbjarkone 4****e 29

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0