final-project-brenda

https://github.com/s33btorr/final-project-brenda

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org, zenodo.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (15.9%) to scientific vocabulary

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 60% confidence

Engineering Computer Science - 40% confidence

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: s33btorr
License: mit
Language: Python
Default Branch: main
Size: 4.28 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 12 months ago

Metadata Files

Readme License Citation

Detecting and Cleaning Table Data from PDFs Using Deep Learning and PyTesseract

MIT license

This project focuses on:

Training a deep learning model to detect tabular data in PDFs.
Detecting and extracting the data from a specific PDF file with complex tables.
Cleaning the extracted data.

It uses the "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml" model from Detectron2 as the basis for training. I built the training dataset myself; it contains 25 images in which the columns have been marked. PyTesseract is used to detect the text.

[!NOTE] PyTesseract can sometimes misread words, so the OCR output may require manual verification before serious use.

How to Run the Project

To run this project, follow these steps:

1. Clone this repository

2. Create and activate environment

$ mamba env create -f environment.yml
$ conda activate final_project_btb

3. Download data

You have two options to download the data:

Via Google Drive: Click on this link ([https://drive.google.com/file/d/1ha7JIu2NRsnpCufi6PjMHNMyqfmNB_z8/view?usp=sharing]) and then click "Download".
Via Dropbox: Click on this link ([https://www.dropbox.com/s/j3k3kkl97sw9ocy/model_final-2.pth?st=3nt83ul7&dl=0]) and then click "Download".

4. Place data in the data folder of src/finalprojectbtb

Path: final-project-s33btorr/src/finalprojectbtb/data

5. Run the pytask command

$ pytask
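If pytask cannot find the model, a quick check of the download location may help. The sketch below assumes the downloaded file keeps the name model_final-2.pth and that the data folder sits at src/finalprojectbtb/data inside the cloned repository, as described in step 4. The helper names `expected_weights_path` and `weights_ready` are hypothetical and are not part of the project.

```python
from pathlib import Path

# Assumption: the downloaded file keeps this name (taken from the Dropbox link).
WEIGHTS_NAME = "model_final-2.pth"


def expected_weights_path(repo_root):
    """Build the path where step 4 says the downloaded file should be placed."""
    return Path(repo_root) / "src" / "finalprojectbtb" / "data" / WEIGHTS_NAME


def weights_ready(repo_root):
    """Return True if the weights file exists and is not empty."""
    path = expected_weights_path(repo_root)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    status = "found" if weights_ready(".") else "missing"
    print(f"{expected_weights_path('.')}: {status}")
```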

Short explanation of the project

Motivation

My motivation for this project stems from the fact that I could not find any pre-trained model, software, or package that could accurately read the table I needed given its complexity. Therefore, I trained a model using images similar to those I need to extract, allowing me to automate the extraction of a large number of pages in the future. With other programs, this process would take hours and result in a significant number of errors.

Overview

In this project, I have trained a deep learning model to detect the columns of a table from scanned PDFs using the Roboflow dataset I generated. After training, the model can identify the positions of different table columns. The extracted data is then processed and cleaned for analysis.

Dataset

You can access the dataset used for training via the following link: Roboflow Dataset

Training the Model

To train the model, I used the following approach:

The model was trained using a GPU provided by Google Colab.
The model was saved after training as model_final-2.pth.
The model is capable of detecting the columns in the table from a specific scanned PDF.
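For readers who have not used Detectron2, the training setup can be sketched roughly as follows. This is not the code from the notebook. The dataset name `table_columns_train`, the single class, and the solver values are placeholders chosen for illustration; only the base configuration name comes from this README. The Detectron2 imports sit inside the functions so the file can be read and imported without Detectron2 installed.

```python
import os

# Base configuration named in this README.
BASE_CONFIG = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def training_overrides(dataset_name="table_columns_train", num_classes=1, max_iter=1000):
    """Return alternating key/value pairs in the form cfg.merge_from_list accepts.

    All default values here are illustrative assumptions, not the settings
    used to produce model_final-2.pth.
    """
    return [
        "DATASETS.TRAIN", (dataset_name,),
        "DATASETS.TEST", (),
        "MODEL.ROI_HEADS.NUM_CLASSES", num_classes,
        "SOLVER.IMS_PER_BATCH", 2,
        "SOLVER.MAX_ITER", max_iter,
    ]


def build_cfg(overrides):
    """Start from the COCO-pretrained Faster R-CNN config and apply overrides."""
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(BASE_CONFIG))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(BASE_CONFIG)
    cfg.merge_from_list(overrides)
    return cfg


def train(cfg):
    """Fine-tune the model; the dataset in cfg.DATASETS.TRAIN must already be
    registered, for example with detectron2.data.datasets.register_coco_instances."""
    from detectron2.engine import DefaultTrainer

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
```

With Detectron2 installed and the dataset registered, `train(build_cfg(training_overrides()))` would start a run and write checkpoints to `cfg.OUTPUT_DIR`.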

You can view, download, and modify the code used to train the model in this notebook: Training Model Notebook

Making Predictions

Once the model is trained, it is saved as model_final-2.pth. Predictions then combine this file with PyTesseract output:

Extract the text using PyTesseract. I noticed that PyTesseract leaves a blank cell whenever the text goes to a new line. This can be used to determine the boundaries of each row in the table.
Predict column positions in new PDF tables with a similar structure. Use these predictions to determine which column each piece of text belongs to.
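The two steps above can be combined roughly as in the sketch below. It is an illustration, not the project's own code. It assumes the OCR result is a dictionary shaped like the output of `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` (with `text`, `left`, and `width` lists), that column boxes are `(x1, y1, x2, y2)` tuples as produced by an object detector, and that a blank text entry marks a row break. The function names are hypothetical.

```python
def column_index(x_center, column_boxes):
    """Return the index of the column whose horizontal span contains x_center."""
    for idx, (x1, _y1, x2, _y2) in enumerate(column_boxes):
        if x1 <= x_center <= x2:
            return idx
    return None  # the word falls outside every predicted column


def words_to_rows(ocr, column_boxes):
    """Group OCR words into rows, with one joined string per column.

    Assumption: a blank entry in ocr["text"] signals the start of a new row.
    """
    columns = sorted(column_boxes, key=lambda box: box[0])  # left to right
    rows = []
    current = [[] for _ in columns]
    for text, left, width in zip(ocr["text"], ocr["left"], ocr["width"]):
        if not text.strip():
            if any(current):  # close the row only if it holds some words
                rows.append([" ".join(cell) for cell in current])
                current = [[] for _ in columns]
            continue
        idx = column_index(left + width / 2, columns)
        if idx is not None:
            current[idx].append(text)
    if any(current):
        rows.append([" ".join(cell) for cell in current])
    return rows


# Toy input standing in for real OCR output and predicted column boxes.
ocr = {
    "text": ["", "Item", "A", "12", "", "Item", "B", "7"],
    "left": [0, 10, 60, 210, 0, 10, 60, 215],
    "width": [0, 40, 10, 20, 0, 40, 10, 10],
}
columns = [(200, 0, 300, 500), (0, 0, 150, 500)]
print(words_to_rows(ocr, columns))  # [['Item A', '12'], ['Item B', '7']]
```

Assigning words by their horizontal center keeps the example short; words that straddle two predicted boxes would need a different rule, such as largest overlap.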

Cleaning the Data

After extracting the data, the following cleaning steps were performed:

Handling Missing Values: If the first column is empty, it means that row is part of the previous one. These rows are merged accordingly.
Numeric Fields: Cleaned numeric fields to ensure they are in a format suitable for analysis (e.g., removing non-numeric characters).
CSV and Graph Generation: Generated the CSV file and some basic graphs to visualize the cleaned data.
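The cleaning steps above can be illustrated with a small standard-library sketch. It is not the project's implementation. It assumes each row is a list of strings, that numbers use a dot as the decimal separator and commas only as thousands separators, and the function names and sample values are made up for the example.

```python
import csv
import io
import re


def merge_continuation_rows(rows):
    """Merge rows whose first column is empty into the previous row."""
    merged = []
    for row in rows:
        if merged and not row[0].strip():
            previous = merged[-1]
            merged[-1] = [
                " ".join(part for part in (a.strip(), b.strip()) if part)
                for a, b in zip(previous, row)
            ]
        else:
            merged.append([cell.strip() for cell in row])
    return merged


def clean_number(raw):
    """Strip non-numeric characters and parse a float; None if nothing parses.

    Assumption: '.' is the decimal separator and ',' only groups thousands.
    """
    digits = re.sub(r"[^0-9.\-]", "", raw)
    try:
        return float(digits)
    except ValueError:
        return None


def rows_to_csv(header, rows):
    """Serialize the header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


raw_rows = [
    ["Widget", "large blue", "1,200"],
    ["", "with lid", ""],  # continuation of the row above
    ["Gadget", "small", "$35.5"],
]
cleaned = [
    [name, description, clean_number(amount)]
    for name, description, amount in merge_continuation_rows(raw_rows)
]
print(rows_to_csv(["name", "description", "amount"], cleaned))
```

Returning `None` for unparseable values keeps OCR failures visible in the output instead of silently turning them into zeros.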

References

Code for training model:

Shen, Zejiang, Zhang, Kaixuan, & Dell, Melissa. (2020). "A Large Dataset of Historical Japanese Documents with Complex Layouts" arXiv:2004.08686

Model used for training:

Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. Retrieved from https://github.com/facebookresearch/detectron2.

Owner

Login: s33btorr
Kind: user

Repositories: 1
Profile: https://github.com/s33btorr

Citation (CITATION)

@Unpublished{DOE2024,
    Title  = "EXAMPLE PROJECT",
    Author = "BRENDA TORRES BLOMER",
    Year   = "2024",
    Url    = "https://github.com/opensourceeconomics/econ-project-templates",
}

GitHub Events

Total

Push event: 2

Last Year

Push event: 2

Dependencies

.github/workflows/main.yml actions

actions/checkout v4 composite
codecov/codecov-action v3 composite
mamba-org/setup-micromamba v1 composite
r-lib/actions/setup-tinytex v2 composite

environment.yml pypi

kaleido *
pdbp *
supervision *

pyproject.toml pypi
