corpus_text_processor

A desktop application for preparing files for use in a corpus

https://github.com/writecrow/corpus_text_processor

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 2 of 3 committers (66.7%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary

Keywords

corpora corpus-linguistics desktop-app text-processing

Last synced: 10 months ago

Repository

A desktop application for preparing files for use in a corpus

Basic Info

Host: GitHub
Owner: writecrow
License: mit
Language: Python
Default Branch: main
Size: 27.4 MB

Statistics

Stars: 8
Watchers: 5
Forks: 4
Open Issues: 4
Releases: 20

Topics

corpora corpus-linguistics desktop-app text-processing

Created about 7 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

Corpus Text Processor

Getting started
Installation
Preparing files to be processed
Processing a folder of files
Reviewing the results
Advantages of our Corpus Text Processor
Known limitations
Video presentation

Getting started

The Corpus Text Processor is a downloadable application for Windows and Mac that provides batched (multiple-files-at-a-time) operations for common corpus processing tasks. Four tasks are currently available: Convert to plaintext, Encode to UTF-8 encoding, Standardize non-ASCII characters and remove non-English characters, and Remove PDF metadata.

After running operations on a set of files, Corpus Text Processor provides debugging output that indicates how many files were processed and which files, if any, had issues.

Installation

Download the latest version of the Corpus Text Processor for Windows or macOS at https://github.com/writecrow/corpus_text_processor/releases

Trouble installing?

If you get error messages related to application security upon installation, please review one of these pages based on your operating system:

See Windows Installation
See Mac Installation

Preparing files to be processed

Before running the tool, place all files to be processed in a single folder; this folder can include sub-directories. We recommend that you then create a new top-level folder for the processed files, at the same level as the top-level parent directory of the files you wish to process. You don’t need to recreate the sub-directory structure because the corpus processing tool will do this for you.

You may have files in a variety of formats in the "Original" folder, such as .doc, .pdf, etc. However, files in the same folder need names that differ apart from the extension. If multiple files in different formats have the same name, they will overwrite each other when converted to .txt format, and you will get only one .txt file in the "Converted" folder.
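A pre-check for such collisions could look like the sketch below. It is not part of the Corpus Text Processor; the function name `find_name_collisions` is hypothetical, and the sketch assumes that two files collide only when they share both a folder and a base name.

```python
from collections import defaultdict
from pathlib import Path


def find_name_collisions(source_dir):
    """Group files that would map to the same .txt output path."""
    root = Path(source_dir)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Same folder and same base name means the same .txt target,
            # e.g. essays/draft1.docx and essays/draft1.pdf.
            key = path.relative_to(root).with_suffix("")
            groups[key].append(path.name)
    return {str(key): names for key, names in groups.items() if len(names) > 1}
```

Running it on the "Original" folder before conversion returns a dictionary of colliding names, which is empty when every file will get its own .txt output.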

Processing a folder of files

To run the tool, choose the folder that you want to process (below, “Original”) and your output folder to save the processed data to (below, “Converted”). You can only process a folder, not individual files. If you have multiple sub-directories to convert, just choose the parent directory. The corpus processing tool will process all the sub-directories and recreate the directory structure in the output folder. DO NOT select the same folder to read and write the files.
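The folder mirroring described above can be sketched as follows. This is an illustration under stated assumptions, not the application's code: `plan_outputs` is a hypothetical name, every output is assumed to be a .txt file, and the check that also rejects an output folder nested inside the input folder goes beyond what the text requires.

```python
from pathlib import Path


def plan_outputs(source_dir, output_dir):
    """Map every input file to a .txt path that mirrors its sub-directory."""
    source = Path(source_dir).resolve()
    output = Path(output_dir).resolve()
    # Reading and writing in the same tree would mix originals with results.
    if output == source or source in output.parents:
        raise ValueError("Choose an output folder outside the input folder.")
    plan = {}
    for path in sorted(source.rglob("*")):
        if path.is_file():
            # Keep the relative sub-directory, swap the extension for .txt.
            plan[path] = output / path.relative_to(source).with_suffix(".txt")
    return plan
```

Before writing each result, a caller would create the target's folder with `target.parent.mkdir(parents=True, exist_ok=True)`.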

We recommend running the processes sequentially, in this order: Convert to plaintext, Encode to UTF-8 encoding, Standardize non-ASCII characters and remove non-English characters, Remove PDF metadata. The order matters: the "Standardize non-ASCII characters and remove non-English characters" process is designed to work only with files already in UTF-8 format. The one exception is the “Remove PDF metadata” process, which you do not generally need to run. We use it for de-identifying files that will go into the repository as PDFs as well as plain text.
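To give a sense of what a standardization step involves, here is a minimal sketch that works on text already decoded from UTF-8. The substitution table and the choice to drop every character that remains outside ASCII are assumptions made for illustration; the application's actual rules are not documented here and may differ.

```python
import unicodedata

# Hypothetical mapping; the tool's real substitution table may differ.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
})


def standardize(text):
    """Replace typographic punctuation, then drop what is still non-ASCII."""
    text = text.translate(PUNCTUATION)
    # NFKD splits accented letters into a base letter plus a combining mark,
    # so the base letter survives the ASCII filter below.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Because the function expects a Python string, the file must be decoded first, which is why the encoding step has to come before this one.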

Reviewing the results

Once the files have been processed (after each processing step), inspect the documents to ensure proper conversion:

1. Check for errors in the program’s debug window.
2. Determine whether you have the same number of input and output files:
   a. On Windows, open the folder in File Explorer; the number of files appears in the bottom left corner.
   b. On a Mac, right-click the folder and choose “Get Info” to see the number of files.
3. Check whether any files haven’t been processed correctly. Look for files of size 0 bytes or 1 byte. Change your view to list view in order to see the file sizes of all the files in a folder.
4. For any files that have failed to process, check the following:
   1. Is it read only? If so, try opening and saving the input file in Word before running the program again.
   2. Is the file name too long? Try renaming the file, retaining important information in the filename (e.g., name or group number).
   3. Are there special characters in the file name? If a file doesn’t convert, check for special characters in the file name: colons and other symbols, other language characters, etc. Remove these from the file name and try again.
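The count and file-size checks can also be scripted. The sketch below is a hypothetical helper, not a feature of the application; it assumes every file in either folder counts, so hidden system files such as .DS_Store would be included in the totals.

```python
from pathlib import Path


def review_output(source_dir, output_dir, max_suspect_bytes=1):
    """Compare file counts and list output files of 0 or 1 byte."""
    source, output = Path(source_dir), Path(output_dir)
    inputs = [p for p in source.rglob("*") if p.is_file()]
    outputs = [p for p in output.rglob("*") if p.is_file()]
    # Very small outputs usually mean the conversion produced no text.
    suspect = sorted(
        str(p.relative_to(output))
        for p in outputs
        if p.stat().st_size <= max_suspect_bytes
    )
    return {"inputs": len(inputs), "outputs": len(outputs), "suspect": suspect}
```

If the two counts differ, or the `suspect` list is not empty, the listed files are the ones to troubleshoot.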

Advantages of our Corpus Text Processor

- All three steps are in one package. Other programs are available that do one of these steps but the Corpus Text Processor allows you to do all three steps with one program.
- Like some (but not all) other programs, the Corpus Text Processor attempts to alleviate some of the problems with converting PDFs by first converting the PDFs to Word.
- We have also added logic to remove extra line breaks in cases where we are able to reliably remove them.
- The Corpus Text Processor recreates the folder structure (and copies into a new location).
- Our team is actively using our tool, meaning that we are updating it as we encounter issues with our own corpus building.
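As an illustration of what removing extra line breaks can mean, the sketch below joins hard-wrapped lines and keeps blank lines as paragraph breaks. The heuristic is an assumption chosen for this example, not the application's logic, and it does not rejoin words hyphenated across lines.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # A newline with no newline on either side is a wrap, so it becomes a space.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```

For example, `unwrap_lines("The quick\nbrown fox.\n\n\n\nNext para\r\nhere.")` returns `"The quick brown fox.\n\nNext para here."`.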

Known limitations

Text-as-image PDFs are not currently supported for conversion to plaintext. We are in alpha development for a script that uses the Textract OCR library for converting text-as-image PDFs. You are free to use that project, as-is, at https://github.com/writecrow/ocr2text
The tool can detect and convert all of the encoding types listed on the Chardet library page. It makes a best-effort attempt to convert other encoding types, but success is not guaranteed.
Text files that consist primarily of non-Romanized characters may interfere with the tool's ability to identify the encoding type and may not convert.
Converting files of type .doc to plaintext is not currently supported. We recommend using a batch utility to convert .doc files to .docx format, which this application can convert. Here are recommended utilities for converting .doc files:
- MacOS: the built-in textutil utility can do this. See the tutorial at https://www.chriswrites.com/convert-txt-rtf-doc-and-docx-files-with-textutil/
- Windows: Zilla word-to-text is a free application. Download at https://download.cnet.com/Zilla-Word-To-Text-Converter/3000-2079_4-75118863.html
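The encoding limitations above follow from how detection works: the encoding is guessed from the bytes themselves. The sketch below shows one way to combine `chardet.detect` with a fallback. It is an assumption-laden illustration, not the application's code: `to_utf8` is a hypothetical name, and the fallback order (UTF-8 first, Latin-1 last) is a choice made for this example.

```python
import codecs


def to_utf8(raw):
    """Return `raw` bytes re-encoded as UTF-8, plus the encoding that was read."""
    try:
        # Valid UTF-8 (including plain ASCII) needs no detection;
        # "utf-8-sig" also strips a leading byte-order mark.
        return raw.decode("utf-8-sig").encode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    candidates = []
    try:
        import chardet  # third-party; listed in requirements.txt
        candidates.append(chardet.detect(raw)["encoding"])
    except ImportError:
        # Without chardet, only a UTF-16 byte-order mark can be recognised.
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            candidates.append("utf-16")
    candidates.append("latin-1")  # last resort: never fails, may mis-map
    for encoding in candidates:
        if encoding is None:
            continue
        try:
            return raw.decode(encoding).encode("utf-8"), encoding
        except (UnicodeDecodeError, LookupError):
            continue
```

The Latin-1 fallback always produces output, so a wrong guess shows up as garbled characters rather than an error. That is one reason to inspect converted files by hand.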

Video presentation

A video version of this content is available on our YouTube channel.

Video: Corpus Text Processor Demonstration

Owner

Name: Corpus & Repository of Writing (Crow)
Login: writecrow
Kind: organization
Email: collaborate@writecrow.org
Location: Purdue University | University of Arizona

Website: https://writecrow.org
Twitter: writecroworg
Repositories: 16
Profile: https://github.com/writecrow

Crow brings together researchers at Purdue, Arizona, and other universities to create a web-based archive for writing studies.

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Staples"
  given-names: "Shelley"
- family-names: "Dilger"
  given-names: "Bradley"
title: "Corpus Text Processor"
version: 1.0.8
date-released: 2019-07-09
url: "https://github.com/writecrow/corpus_text_processor"

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Committers

Last synced: 12 months ago

All Time

Total Commits: 67
Total Committers: 3
Avg Commits per committer: 22.333
Development Distribution Score (DDS): 0.209
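
The two derived figures can be reproduced from the commit counts listed under Top Committers. The sketch assumes DDS is defined as one minus the top committer's share of all commits; this page does not state the formula, so treat that definition as an assumption.

```python
# Commit counts as listed under Top Committers.
commits = {"Mark Fullmer": 53, "jmf3658": 13, "Shelley Staples": 1}

total = sum(commits.values())              # 67 commits in total
average = total / len(commits)             # commits per committer
dds = 1 - max(commits.values()) / total    # assumed DDS definition

print(round(average, 3), round(dds, 3))    # prints: 22.333 0.209
```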

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Mark Fullmer	m**r@g**m	53
jmf3658	j**r@a**u	13
Shelley Staples	s**s@e**u	1

Committer Domains (Top 20 + Academic)

email.arizona.edu: 1
austin.utexas.edu: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 8
Total pull requests: 3
Average time to close issues: 17 days
Average time to close pull requests: 1 day
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 0.63
Average comments per pull request: 0.67
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

markfullmer (7)
unidonburi (1)

Pull Request Authors

markfullmer (2)
dependabot[bot] (1)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

dependencies (1)

Dependencies

requirements.txt pypi

PyPDF3 ==1.0.1
PySimpleGUI ==4.38.0
PySimpleGUIQt ==0.28.0
beautifulsoup4 ==4.8.0
chardet ==3.0.4
docx2txt ==0.8
pdf2docx ==0.5.2
pdf2image ==1.9.0
pdfminer.six ==20181108
python-pptx ==0.6.18
six ==1.12.0
striprtf ==0.0.8
tabulate ==0.8.6

setup.py pypi

PySide2 *
PySimpleGUI *
antiword *
https *
python-pptx ==0.6.6
