Pubmed Parser
Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset - Published in JOSS (2020)
htmldate
htmldate: A Python package to extract publication dates from web pages - Published in JOSS (2020)
TextDescriptives
TextDescriptives: A Python package for calculating a large variety of metrics from text - Published in JOSS (2023)
Fast, Consistent Tokenization of Natural Language Text
Fast, Consistent Tokenization of Natural Language Text - Published in JOSS (2018)
Augmenty
Augmenty: A Python Library for Structured Text Augmentation - Published in JOSS (2024)
textnets
textnets: A Python package for text analysis with networks - Published in JOSS (2020)
WordTokenizers.jl
WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)
Jury
Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)
giotto-deep
giotto-deep: A Python Package for Topological Deep Learning - Published in JOSS (2022)
Mordecai
Mordecai: Full Text Geoparsing and Event Geocoding - Published in JOSS (2017)
Arabica
Arabica: A Python package for exploratory analysis of text data - Published in JOSS (2024)
Shekar: A Python Toolkit for Persian Natural Language Processing
Shekar: A Python Toolkit for Persian Natural Language Processing - Published in JOSS (2025)
gobbli
gobbli: A uniform interface to deep learning for text in Python - Published in JOSS (2021)
trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
transformers
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
lazyllm-llamafactory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
edsnlp
Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP
KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP - Published in JOSS (2026)
txtai
💡 All-in-one open-source AI framework for semantic search, LLM orchestration and language model workflows
flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
pydata-wrangler
Wrangle messy numerical, image, and text data into consistent well-organized formats
ammico
AI-based Media and Misinformation Content Analysis Tool: Analyze text and images
contextualspellcheck
✔️ Contextual word checker for better suggestions (not actively maintained)
farm-haystack
AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
lexicalrichness
😸 💬 A module to compute textual lexical richness (aka lexical diversity).
deepke
[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
cntext
A text analysis package supporting multiple methods, including word count, readability, document similarity, sentiment analysis, Word2Vec/GloVe, and Large Language Models (LLMs).
adapters
A Unified Library for Parameter-Efficient and Modular Transfer Learning
hanlp
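Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified-traditional Chinese conversion, natural language processing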
mlconjug3
A Python library to conjugate verbs in French, English, Spanish, Italian, Portuguese and Romanian (more soon) using Machine Learning techniques.
learn_prompting
Prompt Engineering, Generative AI, and LLM Guide by Learn Prompting | Join our discord for the largest Prompt Engineering learning community
py-torchtext
Models, data loaders and abstractions for language processing, powered by PyTorch
tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
finmeter
Tools for assessing Finnish poetry: rhymes, meter, hyphenation of Finnish and so on.
vulntrain
A tool to generate datasets and models based on vulnerability descriptions from @Vulnerability-Lookup.
harmony
The Harmony Python library: a research tool for psychologists to harmonise data and questionnaire items. Open source.
torchdistill
A coding-free framework built on PyTorch for reproducible deep learning studies. PyTorch Ecosystem. 🏆26 knowledge distillation methods presented at CVPR, ICLR, ECCV, NeurIPS, ICCV, etc. are implemented so far. 🎁 Trained models, training logs and configurations are available to ensure reproducibility and support benchmarking.
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
transformers-interpret
Model explainability that works seamlessly with 🤗 transformers. Explain your transformers model in just 2 lines of code.
text2vec
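text2vec, text to vector. A text vector representation tool that converts text into vector matrices; it implements text representation and text similarity models such as Word2Vec, RankBM25, Sentence-BERT, and CoSENT, ready to use out of the box.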
deepsearch-toolkit
Interact with the Deep Search platform for new knowledge explorations and discoveries
emnlp23-paraphrase-types
The official implementation of the EMNLP 2023 paper "Paraphrase Types for Generation and Detection"
english-text-normalization
Command-line interface (CLI) and library to normalize English texts.
scattertext
Beautiful visualizations of how language differs among document types.
tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
noisy-sentences-dataset
550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.
negativas
negativas, a tool to assist in searching for and classifying sentential negations in Brazilian Portuguese.
https://github.com/google-research/retvec
RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.
classy-classification
This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
pytextclassifier
pytextclassifier is a toolkit for text classification. It implements classification models including LR, Xgboost, TextCNN, FastText, TextRNN, and BERT, ready to use out of the box.
thinc
🔮 A refreshing functional take on deep learning, compatible with your favorite libraries
summertime
An open-source text summarization toolkit for non-experts. EMNLP'2021 Demo
banks
LLM prompt language based on Jinja. Banks provides tools and functions to build prompt text and chat messages from generic blueprints. It allows attaching metadata to prompts to ease their management, and versioning is a first-class citizen. Banks provides ways to store prompts on disk along with their metadata.
detoxify
Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unitary.ai.
chinese-llama-alpaca-2
Chinese LLaMA-2 & Alpaca-2 LLMs, phase two of the project, with 64K long-context models
transformer-srl
Reimplementation of a BERT-based model (Shi et al, 2019), currently the state-of-the-art for English SRL. This model also implements predicate disambiguation.
https://github.com/datamade/usaddress
🇺🇸 a python library for parsing unstructured United States address strings into address components
turkish-question-generation
Automated question generation and question answering from Turkish texts using text-to-text transformers