tidytext
tidytext: Text Mining and Analysis Using Tidy Data Principles in R - Published in JOSS (2016)
Fast, Consistent Tokenization of Natural Language Text
Fast, Consistent Tokenization of Natural Language Text - Published in JOSS (2018)
LISC
LISC: A Python Package for Scientific Literature Collection and Analysis - Published in JOSS (2019)
jstor
jstor: Import and Analyse Data from Scientific Texts - Published in JOSS (2018)
TRUNAJOD
TRUNAJOD: A text complexity library to enhance natural language processing - Published in JOSS (2021)
seesus
seesus: a social, environmental, and economic sustainability classifier for Python - Published in JOSS (2024)
Arabica
Arabica: A Python package for exploratory analysis of text data - Published in JOSS (2024)
ldaPrototype
ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations - Published in JOSS (2020)
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
SDGdetector
SDGdetector: an R-based text mining tool for quantifying efforts toward Sustainable Development Goals - Published in JOSS (2023)
EndoMineR for the extraction of endoscopic and associated pathology data from medical reports
EndoMineR for the extraction of endoscopic and associated pathology data from medical reports - Published in JOSS (2018)
edsnlp
Modular, fast NLP framework, compatible with PyTorch and spaCy, offering tailored support for French clinical notes.
cntext
Text analysis package supporting multiple methods, including word count, readability, document similarity, sentiment analysis, Word2Vec/GloVe, and Large Language Models (LLMs).
scattertext
Beautiful visualizations of how language differs among document types.
uk.ac.cam.ch.wwmm.oscar
OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles.
CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database
CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database - Published in JOSS (2016)
textexplorer
A tool designed for the exploration, analysis, and comparison of textual data variants.
https://github.com/bluebrain/search
Blue Brain text mining toolbox for semantic search and structured information extraction
qdap
Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
https://github.com/cthoyt/onto2nx
Converts OWL ontologies and OBO to NetworkX Graphs
https://github.com/cran-task-views/naturallanguageprocessing
CRAN Task View: Natural Language Processing
https://github.com/ggnowayback/cathodedataextractor
A document-level information extraction pipeline for layered cathode materials for sodium-ion batteries.
https://github.com/caimeng2/uniscraper
A universal scraper that grabs text from multiple types of webpages.
chemdataextractor
Automatically extract chemical information from scientific documents
pubchunks
ARCHIVED: Get chunks of XML-format scholarly articles
https://github.com/adbar/german-nlp
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
snap-umls-clusters
Master's thesis project at Arab American University Palestine with the Palestinian Neuro Initiative Educational Research Center: clustering medical sentences based on the Unified Medical Language System (UMLS) terms and expanded UMLS terms present in them
acep
Análisis Computacional de Eventos de Protesta (ACEP). Computer-Aided Protest Event Analysis (CAPEA)
orange-story-navigator
Add-on to the Orange3 data mining toolkit with text processing widgets from the project Navigating Stories
https://github.com/cedergrouphub/materialparser
Utility to compile a string of chemical terms into a data structure with chemical formula and composition
Jabberwocky
Jabberwocky: an ontology-aware toolkit for manipulating text - Published in JOSS (2020)
https://github.com/brucewlee/wiki-text-summarizer-keyword-extractor
Uses Beautiful Soup to read Wiki pages, Gensim to summarize, and NLTK to process, and extracts keywords based on entropy, all in a single script. A simple but effective approach to extractive text summarization.
semantic-outlier-removal
Code and data for SORE (ACL 2025), a semantic boilerplate remover.
architxt
ArchiTXT is an open source Python library that transforms unstructured text into structured, searchable, and AI-ready data. It enables automated database generation and seamless data integration.
corpusexplorer.terminal.console
Allows other programs/programming languages to access analyses/data from CorpusExplorer v2.0
authorship_clustering_code_repo
LAC: Latent Authorial Clustering of Shorter Texts