lines_merge_ocr

This script merges lines in plain text extracted from an OCR-produced PDF

https://github.com/nevmenandr/lines_merge_ocr

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.0%) to scientific vocabulary

Keywords

ocr pdf
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: nevmenandr
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 190 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
ocr pdf
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Lines merger

What for?

An OCR program such as FineReader recognizes the text in an image and creates a PDF file whose line breaks match the line breaks in the original image. Where a line break falls inside a word, the program inserts a special character called a "soft hyphen".

The program in this repository takes the text extracted from the PDF and stitches lines together wherever it finds an intra-word hyphenation.
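The stitching step can be illustrated with a minimal sketch. It is not the code from `merge.py`. It assumes the soft hyphen appears in the extracted text as U+00AD at the end of a line, and the function name `merge_soft_hyphen_lines` is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # assumption: the extractor emits U+00AD at intra-word breaks


def merge_soft_hyphen_lines(text: str) -> str:
    """Join every line that ends in a soft hyphen with the line after it."""
    merged = []
    carry = ""  # first half of a word split across lines
    for line in text.split("\n"):
        if carry:
            # Glue the continuation onto the saved word fragment.
            line = carry + line.lstrip()
            carry = ""
        stripped = line.rstrip()
        if stripped.endswith(SOFT_HYPHEN):
            # Drop the soft hyphen and wait for the next line.
            carry = stripped[:-1]
        else:
            merged.append(line)
    if carry:
        # The text ended on a hyphenated fragment; keep what we have.
        merged.append(carry)
    return "\n".join(merged)
```

Lines that do not end in a soft hyphen are left untouched, so ordinary line breaks and visible hyphens survive.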

In addition, because this program was written for a project digitizing the Russian novel, it removes characters that are improbable in 19th-century Russian text.
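One way to picture this kind of filtering is a whitelist of expected characters. The README does not say which characters the script treats as improbable, so the set below is purely an assumption for illustration: the Cyrillic block (which includes pre-reform letters such as ѣ), Latin letters for foreign-language passages, digits, whitespace, and common punctuation. The names `ALLOWED_PUNCT`, `IMPROBABLE`, and `drop_improbable_chars` are hypothetical.

```python
import re

# Hypothetical whitelist; the real script may use a different set.
ALLOWED_PUNCT = ".,;:!?()[]«»\"'-\u2013\u2014\u2026"

# Matches any single character that is NOT in the whitelist.
IMPROBABLE = re.compile(
    "[^\u0400-\u04FF"      # Cyrillic block, including pre-reform letters
    "A-Za-z\u00C0-\u00FF"  # Latin letters, including accented ones
    "0-9\\s"               # digits and whitespace
    + re.escape(ALLOWED_PUNCT)
    + "]"
)


def drop_improbable_chars(text: str) -> str:
    """Remove every character outside the assumed whitelist."""
    return IMPROBABLE.sub("", text)
```

A whitelist is stricter than a blacklist: anything the OCR engine hallucinates is dropped unless it was explicitly expected, at the cost of having to list every legitimate character up front.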

How to?

  1. Extract the text from the PDF file. One way to do this:

```python
import pdftotext

# Load your PDF
with open("file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('file.txt', 'w') as f:
    f.write("\n====page====\n".join(pdf))
```

  2. Run the script from the command line, passing the folder that contains the txt files as a parameter:

     python merge.py plain_text

  3. The script creates a folder with the same name plus the suffix `_merged` and writes the processed files there.
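The folder-in, folder-out flow described in these steps can be sketched as follows. This is not the contents of `merge.py`. It assumes UTF-8 text files with a `.txt` extension and a soft hyphen encoded as U+00AD directly before the line break, and `process_folder` is a hypothetical name.

```python
from pathlib import Path

SOFT_HYPHEN = "\u00ad"  # assumption: intra-word breaks are marked with U+00AD


def process_folder(src_dir, suffix="_merged"):
    """Write a processed copy of every .txt file into '<src_dir><suffix>'."""
    src = Path(src_dir)
    dst = src.with_name(src.name + suffix)  # e.g. plain_text -> plain_text_merged
    dst.mkdir(exist_ok=True)
    for txt in sorted(src.glob("*.txt")):
        text = txt.read_text(encoding="utf-8")
        # Simplified join: drop a soft hyphen together with the line break after it.
        merged = text.replace(SOFT_HYPHEN + "\n", "")
        (dst / txt.name).write_text(merged, encoding="utf-8")
    return dst
```

Called as `process_folder("plain_text")`, this would leave the input folder untouched and put the results in `plain_text_merged` next to it.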

Examples

See source and target.

Owner

  • Name: Boris Orekhov
  • Login: nevmenandr
  • Kind: user
  • Location: Moscow

Digital humanities researcher

Citation (CITATION.cff)

cff-version: 1.2.0
title: Lines merger
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Boris
    family-names: Orekhov
    email: nevmenandr@gmail.com
    affiliation: HSE University
    orcid: 'https://orcid.org/0000-0002-9099-0436'
identifiers:
  - type: doi
    value: 10.5281/zenodo.12814092
repository-code: 'https://github.com/nevmenandr/lines_merge_OCR'
abstract: >-
  An OCR program such as FineReader recognizes the text in
  the image and creates a pdf file where the line breaks
  correspond to the line breaks in the original image. In
  the hyphenation places, which are located inside the word,
  the program puts a special character called "soft hyphen".

  The program in this repository takes the text extracted
  from the pdf and stitches the lines together when it sees
  an intra-word hyphenation.

  In addition, since this program is designed to work on a
  project of digitization of Russian novel, the program
  removes improbable characters for 19th century Russian
  text.
keywords:
  - pdf
  - ocr
license: GPL-3.0
commit: a60f01202c42782203fb3e3797ac7722ecffa683
version: 1.0.0
date-released: '2022-10-10'

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0