nlp_behavior_analytic_journals

https://github.com/jakesosine/nlp_behavior_analytic_journals

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: jakesosine
Language: Python
Default Branch: master
Size: 13.9 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 1

Created almost 6 years ago · Last pushed about 4 years ago

Metadata Files

Readme Citation

JABA Preprocessing and NLP

Introduction

This project uses optical character recognition (OCR), via tesseract-ocr, to extract written text from articles in the Journal of Applied Behavior Analysis (JABA).

Installation

Install tesseract to your system
Locate and add the path to pytesseract from your system to
Install the required Python packages, or create a virtual environment and install them from requirements.txt
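The path step above is usually handled by pointing pytesseract at the Tesseract executable through its `pytesseract.pytesseract.tesseract_cmd` attribute. The following is a minimal sketch, not code from this repository: `find_tesseract` and `configure_pytesseract` are hypothetical helper names, and the candidate paths are common install locations given as assumptions.

```python
import os
import shutil

# Common install locations (assumptions, adjust for your system).
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return the path to the tesseract executable, or None if not found."""
    # Prefer whatever is already on PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the candidate locations.
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def configure_pytesseract(path):
    """Tell pytesseract which executable to call."""
    # Imported here so find_tesseract works even before pytesseract is installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = path


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract not found; install it or pass its full path manually")
    else:
        configure_pytesseract(location)
        print(f"pytesseract will use {location}")
```

If Tesseract is already on your PATH, pytesseract normally finds it without any configuration, so this step matters mostly on Windows or for non-standard installs.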

Description of files and folders

articles folder is where you will copy and paste articles that will be
jpegs - This folder stores an image of each page of every PDF that has been processed.
txt_files - This is where the .txt files are saved.
.gitignore - Lists the files that will not be uploaded to GitHub.
ocr_api_interaction.py - The original implementation, which uses the OCR SPACE API with a free API key. Pricing models vary based on volume.
OCR_article_to_text.py - Use for slow extraction of text from research-article PDFs.
preprocessing_text.py - Functions that load text into memory, parse it, and clean it.
resize_image_functions.py - A variety of image-processing functions for analyzing their effects on sample pictures.
tesseract_info.py - Information on how to get started with OCR.
test.py

How it works

....
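Going only by the file and folder descriptions above, the flow from a PDF to a cleaned .txt file could be sketched as follows. This is an illustration under assumptions, not the code in OCR_article_to_text.py or preprocessing_text.py: the function names, the output file naming, the 200 dpi setting, and the cleaning rules are all hypothetical. It relies on `convert_from_path` from pdf2image and `image_to_string` from pytesseract, both listed in requirements.txt.

```python
import re
from pathlib import Path


def clean_ocr_text(raw: str) -> str:
    """Light cleanup of OCR output (illustrative rules, not the repository's)."""
    # Rejoin words split by a hyphen at a line break: "behav-\nior" -> "behavior".
    # Note: this also joins genuine hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Replace single newlines with spaces; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def ocr_pdf_to_text(pdf_path, jpeg_dir="jpegs", txt_dir="txt_files", dpi=200):
    """Render each PDF page to a JPEG, OCR it, and write one cleaned .txt file."""
    # Third-party imports are deferred so clean_ocr_text works without them.
    from pdf2image import convert_from_path
    import pytesseract

    pdf_path = Path(pdf_path)
    jpeg_dir, txt_dir = Path(jpeg_dir), Path(txt_dir)
    jpeg_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)

    pages = []
    for number, image in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        # Keep a copy of every rendered page, as the jpegs folder description suggests.
        image.save(jpeg_dir / f"{pdf_path.stem}_page{number}.jpg", "JPEG")
        pages.append(pytesseract.image_to_string(image))

    out_path = txt_dir / f"{pdf_path.stem}.txt"
    out_path.write_text(clean_ocr_text("\n\n".join(pages)), encoding="utf-8")
    return out_path
```

A call such as `ocr_pdf_to_text("articles/year1/article1.pdf")` would then leave page images in jpegs and the text in txt_files. Rendering at a higher dpi tends to improve recognition at the cost of speed, which fits the "slow extraction" note above.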

Recommended format of pdf files in library

articles
├── year1
│   ├── article1.pdf
│   ├── article2.pdf
│   └── article3.pdf
├── year2
│   ├── article1.pdf
│   ├── article2.pdf
│   └── article3.pdf
└── year3
    ├── article1.pdf
    ├── article2.pdf
    └── article3.pdf
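With the library laid out as one folder per year, the PDFs can be collected with a short directory walk. The sketch below uses only the standard library; `list_articles` is a hypothetical helper name and is not part of this repository.

```python
from pathlib import Path


def list_articles(root="articles"):
    """Return {year_folder_name: [pdf paths]} for a root/<year>/<file>.pdf layout."""
    library = {}
    root = Path(root)
    if not root.is_dir():
        return library
    # Each immediate subfolder is treated as one year.
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        pdfs = sorted(year_dir.glob("*.pdf"))
        if pdfs:
            library[year_dir.name] = pdfs
    return library


if __name__ == "__main__":
    for year, pdfs in list_articles().items():
        print(year, [p.name for p in pdfs])
```

Grouping by year folder keeps the publication year attached to each article without having to parse it from the OCR output.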

Owner

Name: Jacob Sosine
Login: jakesosine
Kind: user
Location: Clayton, California

Website: https://peaceful-baklava-eaba61.netlify.app/#
Twitter: 87Jts
Repositories: 14
Profile: https://github.com/jakesosine

Software Engineer, Data Scientist, Behavior Analyst.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Sosine"
  given-names: "Jacob"
  orcid: "https://orcid.org/0000-0000-0000-0000"
- family-names: "Cox"
  given-names: "David J."
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "NLP_behavior_analytic_journals"
version: 1.0.0
doi: 10.5281/zenodo.6536543
date-released: 2022-05-09
url: "https://github.com/jakesosine/NLP_behavior_analytic_journals"


Dependencies

requirements.txt pypi

Pillow ==7.2.0
PyPDF2 ==1.26.0
certifi ==2020.6.20
chardet ==3.0.4
idna ==2.10
numpy ==1.18.5
opencv-python ==4.4.0.42
pdf2image ==1.14.0
pytesseract ==0.3.5
requests ==2.24.0
urllib3 ==1.25.10
