Projects | Open Source Science

Scientific Software

Updated 10 months ago

tidytext — Peer-reviewed • Rank 22.7 • Science 95%

tidytext: Text Mining and Analysis Using Tidy Data Principles in R - Published in JOSS (2016)

natural-language-processing r text-mining tidy-data tidyverse

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

Fast, Consistent Tokenization of Natural Language Text — Peer-reviewed • Rank 19.9 • Science 95%

Fast, Consistent Tokenization of Natural Language Text - Published in JOSS (2018)

nlp peer-reviewed r r-package rstats text-mining tokenizer

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

LISC — Peer-reviewed • Rank 11.2 • Science 100%

LISC: A Python Package for Scientific Literature Collection and Analysis - Published in JOSS (2019)

literature-mining literature-review meta-analysis scientific-publications text-mining web-scraping

Mathematics

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

jstor — Peer-reviewed • Rank 16.6 • Science 93%

jstor: Import and Analyse Data from Scientific Texts - Published in JOSS (2018)

jstor peer-reviewed r r-package rstats text-analysis text-mining

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

TRUNAJOD — Peer-reviewed • Rank 12.2 • Science 95%

TRUNAJOD: A text complexity library to enhance natural language processing - Published in JOSS (2021)

coherence cohesion entity-graph lexical-diversity natural-language-processing readability-metrics semantic-measurements spacy spacy-extensions text-analysis text-mining text-processing ttr type-token-ratio

Engineering

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

seesus — Peer-reviewed • Rank 6.9 • Science 100%

seesus: a social, environmental, and economic sustainability classifier for Python - Published in JOSS (2024)

classification regular-expressions sdg sustainability sustainability-developoment-goals text-mining

Engineering

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

Arabica — Peer-reviewed • Rank 12.3 • Science 93%

Arabica: A Python package for exploratory analysis of text data - Published in JOSS (2024)

exploratory-data-analysis nlp text-mining

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

ldaPrototype — Peer-reviewed • Rank 8.4 • Science 95%

ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations - Published in JOSS (2020)

latent-dirichlet-allocation lda model-selection modelselection reliability text-mining textdata topic-model topic-models topic-similarities topicmodeling topicmodelling

Engineering

Scientific Software · Peer-reviewed

Updated 10 months ago

trafilatura • Rank 26.3 • Science 77%

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

article-extractor corpus-builder corpus-tools crawler html-to-markdown html2text llm news-aggregator news-crawler nlp rag readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping

Scientific Software

Updated 10 months ago

SDGdetector — Peer-reviewed • Rank 9.3 • Science 93%

SDGdetector: an R-based text mining tool for quantifying efforts toward Sustainable Development Goals - Published in JOSS (2023)

cran r r-package sdg sdgs sustainability sustainable-development-goals text-mining

Scientific Software · Peer-reviewed

Scientific Software

Updated 10 months ago

EndoMineR for the extraction of endoscopic and associated pathology data from medical reports — Peer-reviewed • Rank 4.2 • Science 93%

EndoMineR for the extraction of endoscopic and associated pathology data from medical reports - Published in JOSS (2018)

endoscopy gastroenterology peer-reviewed r r-package rstats semi-structured-data text-mining

Scientific Software · Peer-reviewed

Updated 10 months ago

edsnlp • Rank 17.2 • Science 77%

Modular, fast NLP framework, compatible with PyTorch and spaCy, offering tailored support for French clinical notes.

clinical-data-warehouse deep-learning fast french medical multi-task nlp pytorch rule-based spacy text-mining

Updated 10 months ago

wpextract • Rank 6.1 • Science 85%

Create datasets from WordPress sites for research or archiving

corpus crawler nlp text-extraction text-mining web-scraping wordpress

Updated 10 months ago

cntext • Rank 12.3 • Science 67%

Text analysis package supporting multiple methods, including word count, readability, document similarity, sentiment analysis, Word2Vec/GloVe, and Large Language Models (LLMs).

chinese content-analysis discourse-analysis glove llm nlp semantic-analysis sentiment-analysis social-science text-analysis text-mining word2vec

Updated 10 months ago

huspacy • Rank 7.3 • Science 64%

HuSpaCy: industrial-strength Hungarian natural language processing

dependency-parsing hungarian hunlp huspacy information-extraction lemmatization machine-learning morphological-analysis named-entity-recognition natural-language-processing ner nlp pos-tagger python spacy spacy-models spacy-pipeline text-mining universal-dependencies

Updated 10 months ago

scattertext • Rank 19.4 • Science 49%

Beautiful visualizations of how language differs among document types.

computational-social-science d3 eda exploratory-data-analysis japanese-language machine-learning natural-language-processing nlp scatter-plot semiotic-squares sentiment stylometric stylometry text-as-data text-mining text-visualization topic-modeling visualization word-embeddings word2vec

Updated 10 months ago

@stdlib/datasets-afinn-96 • Rank 4.8 • Science 57%

A list of English words rated for valence.

data dataset datasets emotion emotive javascript list negative node node-js nodejs opinion positive sample sentiment stdlib subjectivity text-mining valence words

Updated 10 months ago

@stdlib/datasets-afinn-111 • Rank 4.2 • Science 57%

A list of English words rated for valence.

data dataset datasets emotion emotive javascript list negative node node-js nodejs opinion positive sample sentiment stdlib subjectivity text-mining valence words

Updated 10 months ago

uk.ac.cam.ch.wwmm.oscar • Rank 11.0 • Science 49%

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles.

blueobelisk chemistry text-mining

Updated 10 months ago

nlppln • Rank 4.5 • Science 54%

NLP pipeline software using common workflow language

cwl nlp pipeline text-mining workflow

Scientific Software

Updated 10 months ago

CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database — Peer-reviewed • Rank 8.8 • Science 49%

CAZy-parser a way to extract information from the Carbohydrate-Active enZYmes Database - Published in JOSS (2016)

carbohydrates cazy data-mining enzymes scrapper text-mining

Biology (34%)

Scientific Software · Peer-reviewed

Updated 10 months ago

@stdlib/nlp-tokenize • Rank 11.7 • Science 44%

Tokenize a string.

javascript nlp node node-js nodejs separate split stdlib text-mining tokenizer tokens util utilities utility utils word

Updated 10 months ago

@stdlib/nlp-sentencize • Rank 11.0 • Science 44%

Split a string into an array of sentences.

javascript nlp node node-js nodejs sentence sentences separate split stdlib text-mining tokenizer util utilities utility utils

Updated 10 months ago

@stdlib/nlp-lda • Rank 5.5 • Science 44%

Latent Dirichlet Allocation via collapsed Gibbs sampling.

bayesian clustering javascript learning mcmc mixed-membership model nlp node node-js nodejs stdlib text-mining topic unsupervised

Updated 10 months ago

packFinder • Rank 12.6 • Science 33%

A package for the de novo discovery of pack-TYPE transposons

bioinformatics r text-mining

Updated 10 months ago

textexplorer • Rank 1.4 • Science 44%

A tool designed for the exploration, analysis, and comparison of textual data variants.

compose-multiplatform kotlin text-comparison text-mining text-search textual-analysis variant-analysis

Updated 10 months ago

https://github.com/bluebrain/search • Rank 11.4 • Science 33%

Blue Brain text mining toolbox for semantic search and structured information extraction

deep-learning machine-learning natural-language-processing nlp python text-mining

Updated 10 months ago

rmdl • Rank 11.4 • Science 33%

RMDL: Random Multimodel Deep Learning for Classification

classification cnn convolutional-neural-networks data-mining deep-learning deep-neural-networks dnn ensemble-learning image-classification information-retrieval keras machine-learning multimodel recurrent-neural-networks rnn tensorflow text-classification text-mining

Updated 10 months ago

R.temis • Rank 17.5 • Science 26%

R.TeMiS: R Text Mining Solution

r text-mining

Updated 10 months ago

qdap • Rank 18.4 • Science 23%

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis

qdap quantitative-discourse-analysis text-analysis text-mining text-plotting

Updated 10 months ago

https://github.com/cthoyt/onto2nx • Rank 7.4 • Science 33%

Converts OWL ontologies and OBO to NetworkX Graphs

ontologies terminologies text-mining

Updated 10 months ago

https://github.com/cran-task-views/naturallanguageprocessing • Rank 4.0 • Science 36%

CRAN Task View: Natural Language Processing

cran natural-language-processing r rstats task-views text-mining

Updated 10 months ago

LDAvis • Rank 15.9 • Science 23%

R package for web-based interactive topic model visualization.

javascript r text-mining topic-modeling visualization

Updated 10 months ago

https://github.com/ggnowayback/cathodedataextractor • Rank 5.2 • Science 23%

A document-level information extraction pipeline for layered cathode materials for sodium-ion batteries.

battery-information electrochemistry information-extraction materials-science nature-inspired-algorithms nature-language-process relation-extraction synthesis-parameters text-mining

Updated 10 months ago

https://github.com/caimeng2/uniscraper • Rank 4.7 • Science 23%

A universal scraper that grabs text from multiple types of webpages.

text-mining web-scraper

Updated 10 months ago

ngram • Rank 16.6 • Science 10%

Fast n-Gram Tokenization

ngram r text text-mining

Updated 10 months ago

chemdataextractor • Rank 13.4 • Science 10%

Automatically extract chemical information from scientific documents

chemistry information-extraction natural-language-processing nlp python text-mining

Updated 10 months ago

textstem • Rank 12.7 • Science 10%

Tools for fast text stemming & lemmatization

lemmatization r stemming text-mining

Updated 10 months ago

sentimentpy • Rank 2.3 • Science 18%

A Python port of the #rstats sentimentr package

emotion nlp polarity sentiment text-mining

Updated 10 months ago

quran • Rank 8.7 • Science 10%

📖 An R package for the complete text of the Qur'an

islam quran r text-mining tidytext

Updated 10 months ago

scripturs • Rank 8.6 • Science 10%

📖 An R package for the complete LDS Scriptures

lds lds-scriptures r text-mining tidytext

Updated 10 months ago

hcandersenr • Rank 7.7 • Science 10%

An R Package for H.C. Andersen's fairy tales

andersens-fairy-tales r text-mining

Updated 10 months ago

pubchunks • Rank 3.2 • Science 13%

ARCHIVED: Get chunks of XML-format scholarly articles

literature open-access r r-package rstats text-mining xml

Updated 9 months ago

https://github.com/dcavar/julia_nlp_notebooks • Rank 0.7 • Science 13%

Julia NLP Notebooks

ai julia jupyter ml nlp text-mining text-processing

Updated 10 months ago

acep • Science 23%

Análisis Computacional de Eventos de Protesta (ACEP). Computer-Aided Protest Event Analysis (CAPEA)

computer-aided-detection conflict-analysis conflict-detection dictionaries nlp-keywords-extraction package protest-events r rstats text-mining visualization

Updated 10 months ago

https://github.com/cedergrouphub/materialparser • Science 23%

Utility to compile a string of chemical terms into a data structure with chemical formula and composition

chemical-compounds chemical-terms composition material-parser materials-science natural-language-processing python text-mining

Updated 10 months ago

https://github.com/brucewlee/wiki-text-summarizer-keyword-extractor • Science 13%

Uses Beautiful Soup to read Wiki pages, Gensim to summarize, NLTK to process, and extracts keywords based on entropy: everything in one beautiful code. A simple but effective solution to extractive text summarization.

gensim gensim-model keyword-extraction keyword-identification nltk simple-summarizer text-mining text-summarization text-summarizer wikipedia-summarizer

Updated 10 months ago

https://github.com/adbar/german-nlp • Science 36%

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

computational-linguistics corpus-linguistics german-language natural-language-processing nlp text-mining

Updated 10 months ago

supermat • Science 57%

Superconductors material dataset

material-informatics superconductors tdm text-mining

Updated 10 months ago

qtm • Science 49%

QTLTableMiner++ tool for mining tables in scientific articles

candidate-genes europe-pmc ontologies qtl scientific-articles solr text-mining

Updated 10 months ago

corpusexplorer.terminal.console • Science 44%

Allows other programs and programming languages to access the analyses and data of CorpusExplorer v2.0

api corpus-linguistics corpusexplorer linguistic nlp text-mining

Updated 10 months ago

architxt • Science 44%

ArchiTXT is an open source Python library that transforms unstructured text into structured, searchable, and AI-ready data. It enables automated database generation and seamless data integration.

architxt data-analysis database nlp open-source python python-library research structured-data text-analysis text-mining

Scientific Software

Updated 10 months ago

Jabberwocky — Peer-reviewed • Science 93%

Jabberwocky: an ontology-aware toolkit for manipulating text - Published in JOSS (2020)

annotation grep jabberwocky ontology plotting python synonyms text-mining textual-data tfidf

Artificial Intelligence and Machine Learning

Scientific Software · Peer-reviewed

Updated 10 months ago

semantic-outlier-removal • Science 54%

Code and data for SORE (ACL 2025), a semantic boilerplate remover.

article-extractor crawler embedding html-to-text html2text llm nlp outlier-removal preprocessing readability scraping text-extraction text-mining web-scraping

Updated 10 months ago

tall • Science 26%

Text Analysis for aLL

r-shiny text-analysis-and-sentiment-analysis text-classification text-mining textual-analysis

Updated 10 months ago

authorship_clustering_code_repo • Science 41%

LAC: Latent Authorial Clustering of Shorter Texts

authorship-analysis authorship-clustering authorship-verification clustering text-mining topic-modeling

Updated 10 months ago

orange-story-navigator • Science 67%

Add-on to the Orange3 data mining toolkit with text processing widgets from the project Navigating Stories

data-analysis orange3 stories storytelling text-mining

Updated 10 months ago

iramuteqlike • Science 26%

💬⛏️ IRaMuTeQ Software Analyses in R

iramuteq qualitative-analysis r r-package rstats text-analysis text-mining

Updated 10 months ago

snap-umls-clusters • Science 36%

Master's thesis project at Arab American University Palestine with the Palestinian Neuro Initiative Educational Research Center: clustering medical sentences based on the Unified Medical Language System (UMLS) terms and expanded UMLS terms present in them

deep-neural-networks knowledge-graph language-model machine-learning natural-language-processing text-mining

Updated 10 months ago

corpusexplorer.sdk • Science 44%

Corpus linguistics has never been so easy...

big-data cleaning-data cooccurrence corpus-linguistics corpus-processing data-minig data-mining data-science datajournalism journalism linguistics natural-language-processing natural-language-understanding nlp sdk tagger text-analysis text-mining text-processing visualization