Scientific Software
Updated 9 months ago

Pubmed Parser — Peer-reviewed • Rank 19.4 • Science 100%

Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset - Published in JOSS (2020)

Scientific Software
Updated 9 months ago

htmldate — Peer-reviewed • Rank 23.6 • Science 95%

htmldate: A Python package to extract publication dates from web pages - Published in JOSS (2020)

Scientific Software
Updated 9 months ago

TextDescriptives — Peer-reviewed • Rank 17.6 • Science 98%

TextDescriptives: A Python package for calculating a large variety of metrics from text - Published in JOSS (2023)

Scientific Software
Updated 9 months ago

Fast, Consistent Tokenization of Natural Language Text — Peer-reviewed • Rank 19.9 • Science 95%

Fast, Consistent Tokenization of Natural Language Text - Published in JOSS (2018)

Scientific Software
Updated 9 months ago

Augmenty — Peer-reviewed • Rank 15.9 • Science 98%

Augmenty: A Python Library for Structured Text Augmentation - Published in JOSS (2024)

Scientific Software
Updated 9 months ago

textnets — Peer-reviewed • Rank 13.9 • Science 100%

textnets: A Python package for text analysis with networks - Published in JOSS (2020)

Scientific Software
Updated 9 months ago

WordTokenizers.jl — Peer-reviewed • Rank 13.2 • Science 95%

WordTokenizers.jl: Basic tools for tokenizing natural language in Julia - Published in JOSS (2020)

Engineering (40%) Earth and Environmental Sciences (40%)
Scientific Software
Updated 9 months ago

Jury — Peer-reviewed • Rank 14.8 • Science 93%

Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)

Scientific Software
Updated 9 months ago

giotto-deep — Peer-reviewed • Rank 10.6 • Science 95%

giotto-deep: A Python Package for Topological Deep Learning - Published in JOSS (2022)

Sociology
Scientific Software
Updated 9 months ago

Mordecai — Peer-reviewed • Rank 12.4 • Science 93%

Mordecai: Full Text Geoparsing and Event Geocoding - Published in JOSS (2017)

Sociology (40%)
Scientific Software
Updated 9 months ago

Arabica — Peer-reviewed • Rank 12.3 • Science 93%

Arabica: A Python package for exploratory analysis of text data - Published in JOSS (2024)

Scientific Software
Updated 9 months ago

gobbli — Peer-reviewed • Rank 10.5 • Science 93%

gobbli: A uniform interface to deep learning for text in Python - Published in JOSS (2021)

Scientific Software · Peer-reviewed
Updated 9 months ago

trafilatura • Rank 26.3 • Science 77%

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated 9 months ago

transformers • Rank 38.7 • Science 64%

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Updated 9 months ago

datasets • Rank 34.4 • Science 64%

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

Updated 9 months ago

edsnlp • Rank 17.2 • Science 77%

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.

Updated 9 months ago

wpextract • Rank 6.1 • Science 85%

Create datasets from WordPress sites for research or archiving

Updated 3 months ago

KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP • Rank 1.4 • Science 87%

KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP - Published in JOSS (2026)

Updated 9 months ago

fugashi • Rank 22.1 • Science 64%

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Updated 9 months ago

metanno • Rank 8.4 • Science 77%

Annotator building tool for Jupyter

Updated 9 months ago

pydata-wrangler • Rank 8.2 • Science 77%

Wrangle messy numerical, image, and text data into consistent well-organized formats

Updated 9 months ago

ammico • Rank 10.1 • Science 75%

AI-based Media and Misinformation Content Analysis Tool: Analyze text and images

Updated 9 months ago

dolma • Rank 19.6 • Science 64%

Data and tools for generating and inspecting OLMo pre-training data.

Updated 9 months ago

farm-haystack • Rank 28.7 • Science 54%

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Updated 9 months ago

lexicalrichness • Rank 15.5 • Science 67%

😸 💬 A module to compute textual lexical richness (aka lexical diversity).

Updated 9 months ago

promptsource • Rank 17.8 • Science 64%

Toolkit for creating, sharing and using natural language prompts.

Updated 9 months ago

flexrag • Rank 12.8 • Science 67%

FlexRAG: A RAG Framework for Information Retrieval and Generation.

Updated 9 months ago

cntext • Rank 12.3 • Science 67%

Text analysis package supporting multiple methods, including word count, readability, document similarity, sentiment analysis, Word2Vec/GloVe, and Large Language Models (LLMs).

Updated 9 months ago

hanlp • Rank 24.3 • Science 54%

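Natural language processing: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion.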

Updated 9 months ago

seb • Rank 11.3 • Science 67%

A Scandinavian Benchmark for sentence embeddings

Updated 9 months ago

mlconjug3 • Rank 14.1 • Science 64%

A Python library to conjugate verbs in French, English, Spanish, Italian, Portuguese and Romanian (more soon) using Machine Learning techniques.

Updated 9 months ago

learn_prompting • Rank 13.3 • Science 64%

Prompt Engineering, Generative AI, and LLM Guide by Learn Prompting | Join our discord for the largest Prompt Engineering learning community

Updated 9 months ago

py-torchtext • Rank 30.7 • Science 46%

Models, data loaders and abstractions for language processing, powered by PyTorch

Updated 9 months ago

finmeter • Rank 7.8 • Science 67%

Tools for assessing Finnish poetry: rhymes, meter, hyphenation of Finnish and so on.

Updated 9 months ago

vulntrain • Rank 7.6 • Science 67%

A tool to generate datasets and models based on vulnerability descriptions from @Vulnerability-Lookup.

Updated 9 months ago

pymusas • Rank 12.1 • Science 62%

Python Multilingual Ucrel Semantic Analysis System

Updated 9 months ago

torchdistill • Rank 15.0 • Science 59%

A coding-free framework built on PyTorch for reproducible deep learning studies. PyTorch Ecosystem. 🏆26 knowledge distillation methods presented at CVPR, ICLR, ECCV, NeurIPS, ICCV, etc. are implemented so far. 🎁 Trained models, training logs and configurations are available for ensuring reproducibility and benchmarking.

Updated 9 months ago

transformers-interpret • Rank 19.3 • Science 54%

Model explainability that works seamlessly with 🤗 transformers. Explain your transformers model in just 2 lines of code.

Updated 9 months ago

bio-epidemiology-ner • Rank 5.9 • Science 67%

Recognize bio-medical entities from a text corpus

Updated 9 months ago

text2vec • Rank 17.9 • Science 54%

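text2vec, text to vector. A text vector representation tool that converts text into vector matrices; it implements text representation and text similarity models such as Word2Vec, RankBM25, Sentence-BERT, and CoSENT, ready to use out of the box.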

Updated 9 months ago

tango • Rank 17.4 • Science 54%

Organize your experiments into discrete steps that can be cached and reused throughout the lifetime of your research project.

Updated 9 months ago

span-marker • Rank 17.2 • Science 54%

SpanMarker for Named Entity Recognition

Updated 9 months ago

deepsearch-toolkit • Rank 16.5 • Science 54%

Interact with the Deep Search platform for new knowledge explorations and discoveries

Updated 9 months ago

emnlp23-paraphrase-types • Rank 2.6 • Science 67%

The official implementation of the EMNLP 2023 paper "Paraphrase Types for Generation and Detection"

Updated 9 months ago

english-text-normalization • Rank 1.8 • Science 67%

Command-line interface (CLI) and library to normalize English texts.

Updated 9 months ago

tokenizers • Rank 14.0 • Science 54%

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Updated 9 months ago

subtitle-word-frequencies • Rank 0.7 • Science 67%

Analyse word frequencies from webVTT subtitles

Updated 9 months ago

noisy-sentences-dataset • Rank 0.7 • Science 67%

550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.

Updated 9 months ago

Egret • Rank 18.2 • Science 49%

Tools for building power systems optimization problems

Updated 9 months ago

negativas • Rank 0.0 • Science 67%

negativas, a tool to assist in searching for and classifying sentential negations in Brazilian Portuguese.

Updated 8 months ago

https://github.com/google-research/retvec • Rank 12.3 • Science 54%

RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.

Updated 9 months ago

classy-classification • Rank 12.3 • Science 54%

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.

Updated 9 months ago

pytextclassifier • Rank 12.2 • Science 54%

pytextclassifier is a toolkit for text classification. It implements classification models including LR, Xgboost, TextCNN, FastText, TextRNN, and BERT, ready to use out of the box.

Updated 9 months ago

summertime • Rank 11.8 • Science 54%

An open-source text summarization toolkit for non-experts. EMNLP'2021 Demo

Updated 9 months ago

banks • Rank 21.3 • Science 44%

LLM prompt language based on Jinja. Banks provides tools and functions to build prompt text and chat messages from generic blueprints. It allows attaching metadata to prompts to ease their management, and versioning is a first-class citizen. Banks provides ways to store prompts on disk along with their metadata.

Updated 9 months ago

detoxify • Rank 21.0 • Science 44%

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unitary.ai.

Updated 9 months ago

chinese-llama-alpaca-2 • Rank 11.0 • Science 54%

Chinese LLaMA-2 & Alpaca-2 LLMs, phase two of the project, with 64K long-context models

Updated 9 months ago

frame-semantic-transformer • Rank 10.9 • Science 54%

Frame Semantic Parser based on T5 and FrameNet

Updated 9 months ago

transformer-srl • Rank 10.4 • Science 54%

Reimplementation of a BERT-based model (Shi et al., 2019), currently the state-of-the-art for English SRL. The model also implements predicate disambiguation.

Updated 9 months ago

dkpro-cassis • Rank 7.0 • Science 57%

UIMA CAS processing library written in Python

Updated 9 months ago

polydedupe • Rank 6.4 • Science 57%

PolyDeDupe: Multi-Lingual Data Deduplication

Updated 9 months ago

gismo • Rank 9.2 • Science 54%

GISMO is an NLP tool to rank and organize a corpus of documents according to a query.

Updated 9 months ago

textblob • Rank 27.2 • Science 36%

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

Updated 9 months ago

wn • Rank 17.8 • Science 44%

A modern, interlingual wordnet interface for Python

Updated 9 months ago

text • Rank 15.7 • Science 46%

Using Transformers from HuggingFace in R

Updated 9 months ago

chinese-mixtral • Rank 7.1 • Science 54%

Chinese Mixtral mixture-of-experts LLMs (Chinese Mixtral MoE LLMs)

Updated 9 months ago

https://github.com/datamade/usaddress • Rank 24.9 • Science 36%

🇺🇸 A Python library for parsing unstructured United States address strings into address components

Updated 9 months ago

turkish-question-generation • Rank 3.9 • Science 57%

Automated question generation and question answering from Turkish texts using text-to-text transformers