htmldate
htmldate: A Python package to extract publication dates from web pages - Published in JOSS (2020)
tidytext
tidytext: Text Mining and Analysis Using Tidy Data Principles in R - Published in JOSS (2016)
Augmenty
Augmenty: A Python Library for Structured Text Augmentation - Published in JOSS (2024)
Talisman
Talisman: a JavaScript archive of fuzzy matching, information retrieval and record linkage building blocks - Published in JOSS (2020)
Jury
Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)
TRUNAJOD
TRUNAJOD: A text complexity library to enhance natural language processing - Published in JOSS (2021)
pygamma-agreement
pygamma-agreement: Gamma γ measure for inter/intra-annotator agreement in Python - Published in JOSS (2021)
Shekar
Shekar: A Python Toolkit for Persian Natural Language Processing - Published in JOSS (2025)
transformers
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
flaml
A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
contextualspellcheck
✔️ Contextual word checker for better suggestions (not actively maintained)
nlpo3
Thai natural language processing library in Rust, with Python and Node bindings.
lexicalrichness
😸 💬 A module to compute textual lexical richness (aka lexical diversity).
adapters
A Unified Library for Parameter-Efficient and Modular Transfer Learning
hanlp
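Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing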
catalyst
Accelerated deep learning R&D
harmony
The Harmony Python library: a research tool for psychologists to harmonise data and questionnaire items. Open source.
torchdistill
A coding-free framework built on PyTorch for reproducible deep learning studies. PyTorch Ecosystem. 🏆 26 knowledge distillation methods presented at CVPR, ICLR, ECCV, NeurIPS, ICCV, etc. are implemented so far. 🎁 Trained models, training logs, and configurations are available to ensure reproducibility and support benchmarking.
transformers-interpret
Model explainability that works seamlessly with 🤗 transformers. Explain your transformers model in just 2 lines of code.
nlp-progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
scattertext
Beautiful visualizations of how language differs among document types.
tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
noisy-sentences-dataset
550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.
https://github.com/google-research/retvec
RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.
classy-classification
This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Hugging Face.
thinc
🔮 A refreshing functional take on deep learning, compatible with your favorite libraries
transformer-srl
Reimplementation of a BERT-based model (Shi et al., 2019), currently the state of the art for English SRL. This model also implements predicate disambiguation.
ml-visuals
🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.
shiba-model
PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
matchzoo
Facilitating the design, comparison and sharing of deep text matching models.
https://github.com/datamade/usaddress
🇺🇸 A Python library for parsing unstructured United States address strings into address components
zensols-mimicsid
MIMIC-III corpus parsing and section prediction with MedSecId (COLING paper)
obsei
Obsei is a low-code, AI-powered automation tool. It can be used in various business flows such as social listening, AI-based alerting, brand image analysis, comparative study, and more.
asent
Asent is a Python library for performing efficient and transparent sentiment analysis using spaCy.
forte
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
conversationalign
An R package for analyzing linguistic alignment between partners in conversation transcripts
odin-slides
This is an advanced Python tool that lets you draft customizable PowerPoint slides using the Generative Pre-trained Transformer (GPT) of your choice. Leveraging the capabilities of Large Language Models (LLMs), odin-slides enables you to turn the lengthiest Word documents into well-organized presentations.
zensols-deepnlp
Deep learning utility library for natural language processing (NLP-OSS paper)
concise-concepts
This repository contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
crosslingual-coreference
A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.
argilla
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
odinrunes
Odin Runes, a Java-based GPT client, facilitates interaction with your preferred GPT model right from your favorite text editor. It also facilitates prompt engineering by extracting context from diverse sources using technologies such as OCR, enhancing overall productivity and saving costs.
natural-language-processing
Fundamentals of natural language processing with Python
odin-tabs
The Odin Tabs extension is a browser extension that allows you to navigate through your browser tabs using speech recognition and the Large Language Model (LLM) of your choice.
knowurenvironment
Official release of KnowUREnvironment, a knowledge graph on climate change and related environmental issues. Paper link: https://www.climatechange.ai/papers/aaaifss2022/3
https://github.com/cedrickchee/awesome-transformer-nlp
A curated list of NLP resources focused on Transformer networks, attention mechanism, GPT, BERT, ChatGPT, LLMs, and transfer learning.
pyonmttok
Fast and customizable text tokenization library with BPE and SentencePiece support
https://github.com/bluebrain/search
Blue Brain text mining toolbox for semantic search and structured information extraction
https://github.com/brucewlee/lingfeat
[EMNLP 2021] LingFeat - A Comprehensive Linguistic Features Extraction ToolKit for Readability Assessment
https://github.com/alexeyev/awesome-kyrgyz-nlp
Kyrgyz language processing software, models and datasets.
https://github.com/alexeyev/awesome-azerbaijani-nlp
Azerbaijani language processing software, models and datasets.
https://github.com/cran-task-views/naturallanguageprocessing
CRAN Task View: Natural Language Processing
spacy-wrap
spaCy-wrap is a wrapper library for spaCy that lets you include existing fine-tuned transformers from Hugging Face in your spaCy pipeline.
https://github.com/dair-ai/ml-course-notes
🎓 Sharing machine learning course / lecture notes.
https://github.com/bramvanroy/spacy_conll
Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.
citation-report
Parse legal citations having the publisher format - i.e. SCRA, PHIL, OFFG - referring to Philippine Supreme Court decisions.
https://github.com/agamiko/100-days-of-code
My 100-day coding journey to improve my machine learning, deep learning, and data science skills
ucto
Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
https://github.com/asyml/texar-pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
https://github.com/brucewlee/lftk
[BEA @ ACL 2023] General-purpose tool for linguistic features extraction; Tested on readability assessment, essay scoring, fake news detection, hate speech detection, etc.