Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: jakesosine
  • Language: Python
  • Default Branch: master
  • Size: 13.9 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 5 years ago · Last pushed almost 4 years ago
Metadata Files
Readme Citation

README.md

JABA Preprocessing and NLP

Introduction

This project uses optical character recognition (OCR) to extract written text from articles in the Journal of Applied Behavior Analysis (JABA). OCR is performed with tesseract-ocr.

Installation

  1. Install tesseract on your system
  2. Locate and add the path to pytesseract from your system to
  3. Install the required Python packages, or create a virtual environment and install them from requirements.txt
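
As a rough illustration of step 2, the sketch below searches the system PATH for the tesseract executable and, if pytesseract is installed, points pytesseract at it through its `tesseract_cmd` setting. The helper name `configure_tesseract` is hypothetical and not part of this repository. On systems where tesseract is not on the PATH, pass the full path explicitly.

```python
import shutil
from typing import Optional


def configure_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the tesseract executable path and register it with pytesseract.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Prefer an explicit path; otherwise search the system PATH.
    cmd = explicit_path or shutil.which("tesseract")
    if cmd is None:
        return None  # tesseract is not installed or not on PATH

    try:
        import pytesseract  # imported lazily so the lookup works without it
    except ImportError:
        return cmd

    # Tell pytesseract which executable to call.
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_tesseract()` once at the start of a script should be enough; it returns `None` when no executable can be found, which is a convenient point to stop with an installation hint.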

Description of files and folders

  • articles - this is the folder where you copy and paste the articles that will be processed.
  • jpegs - each page of a processed PDF is saved in this folder.
  • txt_files - the .txt files are saved in this folder.
  • .gitignore - lists the files that will not be uploaded to GitHub.
  • ocr_api_interaction.py - the original implementation, which uses the OCR SPACE API with a free API key. Pricing models vary based on volume.
  • OCR_article_to_text.py - use for slow text extraction from research-article PDFs.
  • preprocessing_text.py - functions that load text into memory, parse it, and clean it.
  • resize_image_functions.py - a variety of image-processing functions for analyzing their effects on sample pictures.
  • tesseract_info.py - information on how to get started with OCR.
  • test.py

How it works

  • ....
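
The files and folders described above suggest a PDF-to-text flow: render each page to a JPEG, run OCR on it, clean the result, and save a .txt file. The sketch below is one way that flow could look, not the repository's actual code. `pdf2image.convert_from_path` and `pytesseract.image_to_string` are real library calls (pdf2image also needs poppler installed), while the function names `clean_text` and `ocr_pdf`, the file-naming scheme, and the specific clean-up rules are assumptions made for illustration.

```python
import re
from pathlib import Path


def clean_text(raw: str) -> str:
    """Apply a few common OCR clean-up steps (assumed, not the repo's exact rules)."""
    text = re.sub(r"-\s*\n\s*", "", raw)  # rejoin words hyphenated across line breaks
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace and newlines
    return text.strip()


def ocr_pdf(pdf_path: Path, jpeg_dir: Path, txt_dir: Path) -> Path:
    """Save each page as a JPEG, OCR it, and write the cleaned text to a .txt file."""
    # Third-party imports are local so clean_text can be used without them.
    from pdf2image import convert_from_path
    import pytesseract

    jpeg_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)

    page_texts = []
    for number, page in enumerate(convert_from_path(str(pdf_path)), start=1):
        # Hypothetical naming scheme: <article>_page<N>.jpg
        jpeg_path = jpeg_dir / f"{pdf_path.stem}_page{number}.jpg"
        page.save(jpeg_path, "JPEG")
        page_texts.append(pytesseract.image_to_string(page))

    out_path = txt_dir / f"{pdf_path.stem}.txt"
    out_path.write_text(clean_text("\n".join(page_texts)), encoding="utf-8")
    return out_path
```

Note that rejoining hyphenated line breaks also merges genuine compounds that happen to end a line, so a real pipeline may want a dictionary check before joining.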

Recommended format of pdf files in library

articles
├── year1
│   ├── article1.pdf
│   ├── article2.pdf
│   └── article3.pdf
├── year2
│   ├── article1.pdf
│   ├── article2.pdf
│   └── article3.pdf
└── year3
    ├── article1.pdf
    ├── article2.pdf
    └── article3.pdf
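
With the library organized by year as recommended, the PDFs can be gathered with a short directory walk. The sketch below uses only the standard library; the function name `collect_articles` is hypothetical, and it assumes every subfolder directly under the articles folder represents one year.

```python
from pathlib import Path
from typing import Dict, List


def collect_articles(root: Path) -> Dict[str, List[Path]]:
    """Map each year folder name to its sorted list of PDF files."""
    library: Dict[str, List[Path]] = {}
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        pdfs = sorted(year_dir.glob("*.pdf"))
        if pdfs:  # skip year folders that contain no PDFs
            library[year_dir.name] = pdfs
    return library
```

Grouping by year keeps the publication year attached to each article without having to parse it from the PDF itself.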

Owner

  • Name: Jacob Sosine
  • Login: jakesosine
  • Kind: user
  • Location: Clayton, California

Software Engineer, Data Scientist, Behavior Analyst.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Sosine"
  given-names: "Jacob"
  orcid: "https://orcid.org/0000-0000-0000-0000"
- family-names: "Cox"
  given-names: "David J."
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "NLP_behavior_analytic_journals"
version: 1.0.0
doi: 10.5281/zenodo.6536543
date-released: 2022-05-09
url: "https://github.com/jakesosine/NLP_behavior_analytic_journals"

Dependencies

requirements.txt pypi
  • Pillow ==7.2.0
  • PyPDF2 ==1.26.0
  • certifi ==2020.6.20
  • chardet ==3.0.4
  • idna ==2.10
  • numpy ==1.18.5
  • opencv-python ==4.4.0.42
  • pdf2image ==1.14.0
  • pytesseract ==0.3.5
  • requests ==2.24.0
  • urllib3 ==1.25.10