nlp_behavior_analytic_journals
https://github.com/jakesosine/nlp_behavior_analytic_journals
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: jakesosine
- Language: Python
- Default Branch: master
- Size: 13.9 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
- Created: over 5 years ago
- Last pushed: almost 4 years ago
Metadata Files
Readme
Citation
README.md
JABA Preprocessing and NLP
Introduction
This project uses optical character recognition (OCR) to extract the written text of articles from the Journal of Applied Behavior Analysis (JABA). For background on the OCR engine, see the tesseract-ocr information.
Installation
- Install tesseract on your system
- Locate and add the path to pytesseract from your system to
- Install the required Python packages, or create a virtual environment and install them from requirements.txt
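The path step above can be sketched as follows. This is a minimal sketch, not the repository's code: it assumes the tesseract executable is discoverable on your PATH, and the helper names `find_tesseract` and `configure_pytesseract` are hypothetical. `pytesseract.pytesseract.tesseract_cmd` is the attribute pytesseract reads to locate the binary.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def configure_pytesseract(tesseract_path: Optional[str]) -> bool:
    """Point pytesseract at the given binary; return True if it was configured."""
    if tesseract_path is None:
        return False
    try:
        # Imported lazily so this helper still runs when pytesseract is missing.
        import pytesseract
    except ImportError:
        return False
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    return True


if __name__ == "__main__":
    path = find_tesseract()
    print("tesseract found at:", path)
    print("pytesseract configured:", configure_pytesseract(path))
```

If `find_tesseract()` returns `None`, tesseract is either not installed or installed outside PATH, in which case the full path to the executable would have to be supplied by hand.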
Description of files and folders
- articles folder is where you will copy and paste articles that will be
- jpegs - Each page of a processed PDF is saved in this folder as an image.
- txt_files - This is where the .txt files will be saved.
- .gitignore - Indicates which files will not be uploaded to GitHub.
- ocr_api_interaction.py - The original interaction with the OCR SPACE API, using a free API key. Pricing models vary based on volume.
- OCR_article_to_text.py - Use for slow extraction of text from PDFs of research articles.
- preprocessing_text.py - The functions that load text into memory, parse it, and clean it.
- resize_image_functions.py - A variety of image-processing functions, used to analyze their effects on sample pictures.
- tesseract_info.py - Information on how to get started with OCR.
- test.py
How it works
- ....
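As an illustration of the kind of cleaning described for preprocessing_text.py, the sketch below normalizes raw OCR output. It is not the repository's implementation: the function name `clean_ocr_text` and the specific rules (rejoining words hyphenated across line breaks, collapsing whitespace, keeping paragraph breaks) are assumptions chosen because they are common first steps for OCR text.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Normalize text produced by OCR of a journal page."""
    # Use "\n" as the only line ending.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "behav-\nior" -> "behavior".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines are just line wraps.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # Collapse every run of whitespace (including line wraps) to one space.
        flat = re.sub(r"\s+", " ", paragraph).strip()
        if flat:
            cleaned.append(flat)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "Applied behav-\nior analysis\nis a science.\n\n\nSecond   paragraph.\n"
    print(clean_ocr_text(sample))
```

One caveat of the hyphen rule: a genuine compound that happens to wrap at its hyphen (for example "self-" followed by "control" on the next line) is also joined, so a dictionary check may be worth adding for stricter use.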
Recommended format of pdf files in library
articles
├── year1
│ ├── article1.pdf
│ ├── article2.pdf
│ └── article3.pdf
├── year2
│ ├── article1.pdf
│ ├── article2.pdf
│ └── article3.pdf
└── year3
  ├── article1.pdf
  ├── article2.pdf
  └── article3.pdf
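With this layout, a library can be walked one year folder at a time. The sketch below is a hedged illustration rather than the repository's code: the function names are hypothetical, and mirroring the year folders under the text output folder is an assumption. It uses `pdf2image.convert_from_path` and `pytesseract.image_to_string`, both listed in requirements.txt, and imports them lazily because they depend on the external poppler and tesseract binaries.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_articles(library: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (year_folder_name, pdf_path) for every PDF one level below `library`."""
    for year_dir in sorted(p for p in library.iterdir() if p.is_dir()):
        for pdf_path in sorted(year_dir.glob("*.pdf")):
            yield year_dir.name, pdf_path


def text_output_path(txt_root: Path, year: str, pdf_path: Path) -> Path:
    """Map articles/<year>/<name>.pdf to <txt_root>/<year>/<name>.txt (assumed layout)."""
    return txt_root / year / (pdf_path.stem + ".txt")


def ocr_pdf(pdf_path: Path, dpi: int = 300) -> str:
    """Render each page of one PDF to an image, OCR it, and join the page texts."""
    # Lazy imports: these need poppler and tesseract installed on the system.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(str(pdf_path), dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def process_library(library: Path, txt_root: Path) -> None:
    """OCR every article in the library and write one .txt file per PDF."""
    for year, pdf_path in iter_articles(library):
        out_path = text_output_path(txt_root, year, pdf_path)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(ocr_pdf(pdf_path), encoding="utf-8")
```

Calling `process_library(Path("articles"), Path("txt_files"))` would then process every year folder in turn. The 300 dpi default is an assumed starting point, not a value taken from the repository.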
Owner
- Name: Jacob Sosine
- Login: jakesosine
- Kind: user
- Location: Clayton, California
- Website: https://peaceful-baklava-eaba61.netlify.app/#
- Twitter: 87Jts
- Repositories: 14
- Profile: https://github.com/jakesosine
Software Engineer, Data Scientist, Behavior Analyst.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Sosine"
    given-names: "Jacob"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "David J."
    given-names: "Cox"
    orcid: "https://orcid.org/0000-0000-0000-0000"
title: "NLP_behavior_analytic_journals"
version: 1.0.0
doi: DOI: 10.5281/zenodo.6536543
date-released: 2022-05-09
url: "https://github.com/jakesosine/NLP_behavior_analytic_journals"
Dependencies
requirements.txt
pypi
- Pillow ==7.2.0
- PyPDF2 ==1.26.0
- certifi ==2020.6.20
- chardet ==3.0.4
- idna ==2.10
- numpy ==1.18.5
- opencv-python ==4.4.0.42
- pdf2image ==1.14.0
- pytesseract ==0.3.5
- requests ==2.24.0
- urllib3 ==1.25.10