corpus_text_processor

A desktop application for preparing files for use in a corpus

https://github.com/writecrow/corpus_text_processor

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 3 committers (66.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

corpora corpus-linguistics desktop-app text-processing
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: writecrow
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 27.4 MB
Statistics
  • Stars: 8
  • Watchers: 5
  • Forks: 4
  • Open Issues: 4
  • Releases: 20
Topics
corpora corpus-linguistics desktop-app text-processing
Created over 6 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

Corpus Text Processor

Getting started

The Corpus Text Processor is a downloadable application for Windows and Mac that provides batched (multiple-files-at-a-time) operations for common corpus processing tasks. The screenshot of the application below shows the four tasks currently available:

Screenshot of application UI

After running operations on a set of files, Corpus Text Processor provides debugging output that indicates how many files were processed and which files, if any, had issues.

Installation

Download the latest version of the Corpus Text Processor for Windows or macOS at https://github.com/writecrow/corpus_text_processor/releases

Trouble installing?

If you get error messages related to application security upon installation, please review one of these pages based on your operating system:

Preparing files to be processed

Before running the tool, place all files to be processed in a single folder; this folder can include sub-directories. We recommend that you then create a new top-level folder to save your processed files to, at the same level as the top-level parent directory of the files you wish to process. You don’t need to recreate the sub-directory structure because the corpus processing tool will do this for you.

You may have files in a variety of formats in the "Original" folder, such as .doc, .pdf, etc. Each file, however, needs a unique name regardless of its extension. If multiple files in different formats share the same name (for example, essay.docx and essay.pdf), they will overwrite each other when converted to .txt format, and you will get only one .txt file in the "Converted" folder.
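The collision described above can be checked before running the tool. The following is a minimal sketch, not part of the Corpus Text Processor itself; the function name `find_name_collisions` is hypothetical, and it assumes that each converted file keeps its folder and base name and only changes its extension to .txt.

```python
from collections import defaultdict
from pathlib import Path


def find_name_collisions(input_dir):
    """Return {would-be .txt path: [source files]} for names used more than once."""
    root = Path(input_dir)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Assumption: two files collide when they share a folder and a
            # base name, e.g. essay.docx and essay.pdf both become essay.txt.
            relative = path.relative_to(root)
            groups[relative.with_suffix(".txt")].append(relative)
    return {target: sources for target, sources in groups.items() if len(sources) > 1}


if __name__ == "__main__":
    for target, sources in find_name_collisions("Original").items():
        print(f"{target} would be written by: {', '.join(map(str, sources))}")
```

Renaming one file in each reported group before processing avoids losing output.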

Processing a folder of files

To run the tool, choose the folder that you want to process (below, “Original”) and your output folder to save the processed data to (below, “Converted”). You can only process a folder, not individual files. If you have multiple sub-directories to convert, just choose the parent directory. The corpus processing tool will process all the sub-directories and recreate the directory structure in the output folder. DO NOT select the same folder to read and write the files.
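To illustrate what "recreate the directory structure" means, here is a hedged sketch of how input paths could be mapped to output paths. It is an illustration only, not the application's actual code; `plan_output_paths` and its `new_suffix` parameter are hypothetical names.

```python
from pathlib import Path


def plan_output_paths(input_dir, output_dir, new_suffix=".txt"):
    """Pair every input file with a mirrored path under the output folder."""
    source_root = Path(input_dir).resolve()
    target_root = Path(output_dir).resolve()
    # Reading from and writing to the same folder is the case the README warns about.
    if source_root == target_root:
        raise ValueError("Input and output folders must be different.")
    plan = []
    for source in sorted(source_root.rglob("*")):
        if source.is_file():
            # Keep the sub-directory part of the path; only the extension changes.
            relative = source.relative_to(source_root)
            plan.append((source, target_root / relative.with_suffix(new_suffix)))
    return plan


if __name__ == "__main__":
    for source, target in plan_output_paths("Original", "Converted"):
        # Create the mirrored sub-directory before anything is written to it.
        target.parent.mkdir(parents=True, exist_ok=True)
        print(f"{source} -> {target}")
```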

We recommend running the processes sequentially, in this order: Convert to plaintext, Encode to UTF-8 encoding, Standardize non-ASCII characters and remove non-English characters, Remove PDF metadata. Note that the "Standardize non-ASCII characters and remove non-English characters" process is designed to work only with files already in UTF-8 format. The one exception to the recommended sequence is that you do not generally need to run the “Remove PDF metadata” process. We use this for de-identifying files that will go into the repository as PDFs as well as plain text.
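The reason the encoding step comes before standardization is that later steps need to read every file as UTF-8. The sketch below shows the general idea of re-encoding bytes to UTF-8. It is a simplification: it tries a short, fixed list of candidate encodings rather than detecting the encoding with the Chardet library that the tool relies on, and the function name `to_utf8` and the `fallbacks` list are assumptions made for illustration.

```python
def to_utf8(raw, fallbacks=("cp1252", "latin-1")):
    """Decode bytes with the first encoding that works and re-encode as UTF-8.

    Returns a tuple of (utf8_bytes, encoding_that_was_used).
    """
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte order mark.
    for encoding in ("utf-8-sig",) + tuple(fallbacks):
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8"), encoding
    raise ValueError("Could not decode input with any candidate encoding.")


if __name__ == "__main__":
    legacy_bytes = "café".encode("cp1252")
    converted, used = to_utf8(legacy_bytes)
    print(used, converted.decode("utf-8"))
```

A fixed candidate list can pick the wrong encoding for files that are valid in several encodings, which is why statistical detection is preferable for mixed collections.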

Reviewing the results

Once the files have been processed (after each processing step), inspect the documents to ensure proper conversion:

  1. Check for errors in the program’s debug window.

  2. Determine whether you have the same number of input and output files:

a. If you open a folder in Windows File Explorer, you can see the number of files in the bottom left corner.

Screenshot of Windows text number

b. On a Mac, right-click the folder and choose “Get Info” to see the number of files.

Screenshot of Mac text number

  3. Check whether any files haven’t been processed correctly. Look for files of size 0 bytes or 1 byte. Change your view to list view in order to see the file sizes of all the files in a folder.

  4. For any files that have failed to process, troubleshoot the following items:

    1. Is it read only? If so, try opening and saving the input file in Word before running the program again.
    2. Is the file name too long? Try renaming the file, retaining important information in the filename (e.g., name or group number).
    3. Are there special characters in the file name? If a file doesn’t convert, check for special characters in the file name: colons and other symbols, other language characters, etc. Remove these from the file name and try again.
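The manual checks above can also be scripted. This is a minimal sketch under stated assumptions, not a feature of the application: `review_results` and `filename_warnings` are hypothetical helpers, the 100-character length limit is an arbitrary illustrative threshold, and "special characters" is taken to mean anything other than ASCII letters, digits, spaces, dots, underscores, and hyphens.

```python
import re
from pathlib import Path

# Assumption: names made only of these characters are considered safe.
SAFE_NAME = re.compile(r"[A-Za-z0-9._ -]+")


def review_results(input_dir, output_dir, max_suspect_bytes=1):
    """Compare file counts and list output files of 0 or 1 byte."""
    inputs = [p for p in Path(input_dir).rglob("*") if p.is_file()]
    outputs = [p for p in Path(output_dir).rglob("*") if p.is_file()]
    suspects = sorted(p for p in outputs if p.stat().st_size <= max_suspect_bytes)
    return {
        "input_count": len(inputs),
        "output_count": len(outputs),
        "counts_match": len(inputs) == len(outputs),
        "suspect_files": suspects,
    }


def filename_warnings(name, max_length=100):
    """Return reasons why a file name might prevent conversion."""
    warnings = []
    if len(name) > max_length:
        warnings.append("name is long")
    if not SAFE_NAME.fullmatch(name):
        warnings.append("name has special characters")
    return warnings


if __name__ == "__main__":
    report = review_results("Original", "Converted")
    print(report["input_count"], report["output_count"], report["counts_match"])
    for path in report["suspect_files"]:
        print("Possibly failed:", path)
```

Running `filename_warnings` on the input file that corresponds to each suspect output is a quick way to narrow down which of the troubleshooting items applies.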

Advantages of our Corpus Text Processor

  • All three steps are in one package. Other programs are available that do one of these steps but the Corpus Text Processor allows you to do all three steps with one program.
  • Like some (but not all) other programs, the Corpus Text Processor attempts to alleviate some of the problems with converting PDFs by first converting the PDFs to Word.
  • We have also added logic to remove extra line breaks in cases where we are able to reliably remove them.
  • The Corpus Text Processor recreates the folder structure of the input folder and writes the processed copies to a new location.
  • Our team is actively using our tool, meaning that we are updating it as we encounter issues with our own corpus building.

Known limitations

  • Text-as-image PDFs are not currently supported for conversion to plaintext. We are in alpha development for a script that uses the Textract OCR library for converting text-as-image PDFs. You are free to use that project, as-is, at https://github.com/writecrow/ocr2text
  • The tool can detect and convert all of the encoding types listed on the Chardet library page. The tool will make a best-effort attempt to convert other encoding types, but success is not guaranteed.
  • Text files that consist primarily of non-Romanized characters may interfere with the tool's ability to identify the encoding type and may not convert.
  • Converting files of type .doc to plaintext is not currently supported. We recommend using a batch utility to convert .doc files to .docx format, which this application can convert. Here are recommended utilities for converting .doc files:
    • macOS: the built-in textutil utility can do this. See the tutorial at https://www.chriswrites.com/convert-txt-rtf-doc-and-docx-files-with-textutil/
    • Windows: Zilla word-to-text is a free application. Download at https://download.cnet.com/Zilla-Word-To-Text-Converter/3000-2079_4-75118863.html

Video presentation

A video version of this content is available on our YouTube channel.

Video: Corpus Text Processor Demonstration

Owner

  • Name: Corpus & Repository of Writing (Crow)
  • Login: writecrow
  • Kind: organization
  • Email: collaborate@writecrow.org
  • Location: Purdue University | University of Arizona

Crow brings together researchers at Purdue, Arizona, and other universities to create a web-based archive for writing studies.

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Staples"
  given-names: "Shelley"
- family-names: "Dilger"
  given-names: "Bradley"
title: "Corpus Text Processor"
version: 1.0.8
date-released: 2019-07-09
url: "https://github.com/writecrow/corpus_text_processor"

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 67
  • Total Committers: 3
  • Avg Commits per committer: 22.333
  • Development Distribution Score (DDS): 0.209
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Mark Fullmer m****r@g****m 53
jmf3658 j****r@a****u 13
Shelley Staples s****s@e****u 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 3
  • Average time to close issues: 17 days
  • Average time to close pull requests: 1 day
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.63
  • Average comments per pull request: 0.67
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • markfullmer (7)
  • unidonburi (1)
Pull Request Authors
  • markfullmer (2)
  • dependabot[bot] (1)
Top Labels
Issue Labels
bug (1)
Pull Request Labels
dependencies (1)

Dependencies

requirements.txt pypi
  • PyPDF3 ==1.0.1
  • PySimpleGUI ==4.38.0
  • PySimpleGUIQt ==0.28.0
  • beautifulsoup4 ==4.8.0
  • chardet ==3.0.4
  • docx2txt ==0.8
  • pdf2docx ==0.5.2
  • pdf2image ==1.9.0
  • pdfminer.six ==20181108
  • python-pptx ==0.6.18
  • six ==1.12.0
  • striprtf ==0.0.8
  • tabulate ==0.8.6
setup.py pypi
  • PySide2 *
  • PySimpleGUI *
  • antiword *
  • https *
  • python-pptx ==0.6.6