lines_merge_ocr
This script merges lines in plain text extracted from a PDF produced by OCR
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.0%) to scientific vocabulary
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
README.md
Lines merger
What for?
An OCR program such as FineReader recognizes the text in an image and creates a PDF file whose line breaks match those of the original image. Where a word is hyphenated across a line break, the program inserts a special character called a "soft hyphen".
The program in this repository takes the text extracted from the pdf and stitches the lines together when it sees an intra-word hyphenation.
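The stitching step can be sketched roughly as follows. This is a minimal illustration, not the code from `merge.py`: the function name `merge_hyphenated_lines` is hypothetical, and it assumes the soft hyphen survives text extraction as the Unicode character U+00AD at the end of a line.

```python
import re

# Assumption: the soft hyphen appears in the extracted text as U+00AD.
SOFT_HYPHEN = "\u00ad"

def merge_hyphenated_lines(text: str) -> str:
    """Join a line ending in a soft hyphen with the line that follows it."""
    # Remove the soft hyphen, any trailing spaces, the line break,
    # and any indentation at the start of the next line.
    return re.sub(SOFT_HYPHEN + r"[ \t]*\r?\n[ \t]*", "", text)

if __name__ == "__main__":
    sample = "раз\u00ad\nговоръ шелъ\nдальше"
    print(merge_hyphenated_lines(sample))
```

Ordinary line breaks, which are not preceded by a soft hyphen, are left untouched in this sketch.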
In addition, since this program was written for a Russian novel digitization project, it removes characters that are improbable in 19th-century Russian text.
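One way such a cleanup could work is a whitelist filter. The sketch below is only an assumption about the approach: the README does not say which characters the script treats as improbable, so the allowed set here (the Cyrillic Unicode block, digits, whitespace, and some punctuation) and the name `drop_improbable` are hypothetical.

```python
import re

# Hypothetical whitelist of punctuation; the real script may use a different set.
# \u2014 and \u2013 are the em dash and en dash.
ALLOWED_PUNCT = ".,;:!?()[]«»\"'\u2014\u2013-"

# Everything outside the Cyrillic block (U+0400..U+04FF, which includes
# pre-reform letters such as ѣ, і, ѳ, ѵ), digits, whitespace, and the
# punctuation above is treated as improbable.
IMPROBABLE = re.compile("[^\u0400-\u04FF0-9\\s" + re.escape(ALLOWED_PUNCT) + "]")

def drop_improbable(text: str) -> str:
    """Delete characters that fall outside the assumed whitelist."""
    return IMPROBABLE.sub("", text)

if __name__ == "__main__":
    print(drop_improbable("вѣ#ра и миръ~"))
```

A whitelist is stricter than a blacklist: any OCR noise not foreseen in advance is removed, at the cost of also deleting legitimate Latin-script passages such as French quotations.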
How to?
- First, extract the text from the PDF file. One way to do this:
```python
import pdftotext

# Load your PDF
with open("file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file, with a marker line between pages.
with open("file.txt", "w", encoding="utf-8") as f:
    f.write("\n====page====\n".join(pdf))
```
- Run the script from the command line, passing the folder that contains the txt files as a parameter:
python merge.py plain_text
- The script creates a folder with the same name plus the suffix _merged, containing the processed files.
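The folder handling described in the steps above might look like the sketch below. It is not the contents of `merge.py`: the function name `merge_folder` is hypothetical, and it assumes that the suffix applies to the folder name, that input files end in `.txt` and are UTF-8 encoded, and that the soft hyphen is U+00AD.

```python
from pathlib import Path

SOFT_HYPHEN = "\u00ad"  # assumption: soft hyphen as extracted from the PDF

def merge_folder(src_dir: str) -> Path:
    """Process every .txt file in src_dir into a sibling '<name>_merged' folder."""
    src = Path(src_dir)
    dst = src.with_name(src.name + "_merged")
    dst.mkdir(exist_ok=True)
    for txt in sorted(src.glob("*.txt")):
        text = txt.read_text(encoding="utf-8")
        # Stitch together words split by a soft hyphen at the end of a line.
        merged = text.replace(SOFT_HYPHEN + "\n", "")
        (dst / txt.name).write_text(merged, encoding="utf-8")
    return dst
```

Calling `merge_folder("plain_text")` would write its results to `plain_text_merged` next to the input folder and leave the original files unchanged.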
Examples
Owner
- Name: Boris Orekhov
- Login: nevmenandr
- Kind: user
- Location: Moscow
- Website: https://nevmenandr.github.io
- Twitter: nevmenandr
- Repositories: 42
- Profile: https://github.com/nevmenandr
Digital humanities researcher
Citation (CITATION.cff)
cff-version: 1.2.0
title: Lines merger
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Boris
    family-names: Orekhov
    email: nevmenandr@gmail.com
    affiliation: HSE University
    orcid: 'https://orcid.org/0000-0002-9099-0436'
identifiers:
  - type: doi
    value: 10.5281/zenodo.12814092
repository-code: 'https://github.com/nevmenandr/lines_merge_OCR'
abstract: >-
  An OCR program such as FineReader recognizes the text in
  the image and creates a pdf file where the line breaks
  correspond to the line breaks in the original image. In
  the hyphenation places, which are located inside the word,
  the program puts a special character called "soft hyphen".
  The program in this repository takes the text extracted
  from the pdf and stitches the lines together when it sees
  an intra-word hyphenation.
  In addition, since this program is designed to work on a
  project of digitization of Russian novel, the program
  removes improbable characters for 19th century Russian
  text.
keywords:
  - pdf
  - ocr
license: GPL-3.0
commit: a60f01202c42782203fb3e3797ac7722ecffa683
version: 1.0.0
date-released: '2022-10-10'
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0