Scientific Software
Updated 6 months ago

htmldate — Peer-reviewed • Rank 23.6 • Science 95%

htmldate: A Python package to extract publication dates from web pages - Published in JOSS (2020)

Scientific Software
Updated 6 months ago

tidytext — Peer-reviewed • Rank 22.7 • Science 95%

tidytext: Text Mining and Analysis Using Tidy Data Principles in R - Published in JOSS (2016)

Scientific Software
Updated 6 months ago

Augmenty — Peer-reviewed • Rank 15.9 • Science 98%

Augmenty: A Python Library for Structured Text Augmentation - Published in JOSS (2024)

Scientific Software
Updated 6 months ago

Talisman — Peer-reviewed • Rank 20.9 • Science 93%

Talisman: a JavaScript archive of fuzzy matching, information retrieval and record linkage building blocks - Published in JOSS (2020)

Scientific Software
Updated 6 months ago

Jury — Peer-reviewed • Rank 14.8 • Science 93%

Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)

Scientific Software
Updated 6 months ago

TRUNAJOD — Peer-reviewed • Rank 12.2 • Science 95%

TRUNAJOD: A text complexity library to enhance natural language processing - Published in JOSS (2021)

Scientific Software
Updated 6 months ago

pygamma-agreement — Peer-reviewed • Rank 11.7 • Science 93%

pygamma-agreement: Gamma γ measure for inter/intra-annotator agreement in Python - Published in JOSS (2021)

Updated 6 months ago

transformers • Rank 38.7 • Science 64%

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Updated 6 months ago

datasets • Rank 34.4 • Science 64%

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

Updated 6 months ago

nlpo3 • Rank 15.7 • Science 67%

Thai natural language processing library in Rust, with Python and Node bindings.

Updated 6 months ago

lexicalrichness • Rank 15.5 • Science 67%

😸 💬 A module to compute textual lexical richness (aka lexical diversity).

Updated 6 months ago

promptsource • Rank 17.8 • Science 64%

Toolkit for creating, sharing and using natural language prompts.

Updated 6 months ago

hanlp • Rank 24.3 • Science 54%

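Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing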

Updated 6 months ago

seb • Rank 11.3 • Science 67%

A Scandinavian Benchmark for sentence embeddings

Updated 6 months ago

pymusas • Rank 12.1 • Science 62%

Python Multilingual Ucrel Semantic Analysis System

Updated 6 months ago

torchdistill • Rank 15.0 • Science 59%

A coding-free framework built on PyTorch for reproducible deep learning studies. PyTorch Ecosystem. 🏆26 knowledge distillation methods presented at CVPR, ICLR, ECCV, NeurIPS, ICCV, etc. are implemented so far. 🎁 Trained models, training logs and configurations are available for ensuring reproducibility and benchmarking.

Updated 6 months ago

transformers-interpret • Rank 19.3 • Science 54%

Model explainability that works seamlessly with 🤗 transformers. Explain your transformers model in just 2 lines of code.

Updated 6 months ago

nlp-progress • Rank 15.8 • Science 54%

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Updated 6 months ago

mishkal • Rank 15.3 • Science 54%

Mishkal is Arabic text vocalization software

Updated 6 months ago

tokenizers • Rank 14.0 • Science 54%

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Updated 6 months ago

noisy-sentences-dataset • Rank 0.7 • Science 67%

550K sentences in 5 European languages augmented with noise for training and evaluating spell correction tools or machine learning models.

Updated 4 months ago

https://github.com/google-research/retvec • Rank 12.3 • Science 54%

RETVec is an efficient, multilingual, and adversarially-robust text vectorizer.

Updated 6 months ago

classy-classification • Rank 12.3 • Science 54%

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.

Updated 6 months ago

transformer-srl • Rank 10.4 • Science 54%

Reimplementation of a BERT-based model (Shi et al., 2019), currently the state of the art for English SRL. This model also implements predicate disambiguation.

Updated 6 months ago

ml-visuals • Rank 10.4 • Science 54%

🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.

Updated 5 months ago

textblob • Rank 27.2 • Science 36%

Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

Updated 6 months ago

shiba-model • Rank 8.7 • Science 54%

PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.

Updated 5 months ago

matchzoo • Rank 15.2 • Science 46%

Facilitating the design, comparison and sharing of deep text matching models.

Updated 5 months ago

https://github.com/datamade/usaddress • Rank 24.9 • Science 36%

🇺🇸 A Python library for parsing unstructured United States address strings into address components

Updated 6 months ago

zensols-mimicsid • Rank 5.4 • Science 54%

MIMIC-III corpus parsing and section prediction with MedSecId (COLING paper)

Updated 6 months ago

asent • Rank 14.9 • Science 44%

Asent is a Python library for performing efficient and transparent sentiment analysis using spaCy.

Updated 6 months ago

dacy • Rank 13.3 • Science 44%

DaCy: The State-of-the-Art Danish NLP pipeline using spaCy

Updated 6 months ago

forte • Rank 16.2 • Science 41%

Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/

Updated 5 months ago

conversationalign • Rank 9.6 • Science 46%

An R package for analyzing linguistic alignment between partners in conversation transcripts

Updated 6 months ago

odin-slides • Rank 7.8 • Science 44%

This is an advanced Python tool that empowers you to effortlessly draft customizable PowerPoint slides using the Generative Pre-trained Transformer (GPT) of your choice. Leveraging the capabilities of Large Language Models (LLMs), odin-slides enables you to turn the lengthiest Word documents into well-organized presentations.

Updated 5 months ago

zensols-deepnlp • Rank 7.5 • Science 44%

Deep learning utility library for natural language processing (NLP-OSS paper)

Updated 6 months ago

concise-concepts • Rank 7.3 • Science 44%

This repository contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.

Updated 6 months ago

crosslingual-coreference • Rank 6.4 • Science 44%

A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Updated 6 months ago

odinrunes • Rank 4.5 • Science 44%

Odin Runes, a Java-based GPT client, facilitates interaction with your preferred GPT model right through your favorite text editor. There is more: it also facilitates prompt engineering by extracting context from diverse sources using technologies such as OCR, enhancing overall productivity and saving costs.

Updated 6 months ago

knowurenvironment • Rank 2.4 • Science 44%

Official release of KnowUREnvironment, a knowledge graph on climate change and related environmental issues. Paper link: https://www.climatechange.ai/papers/aaaifss2022/3

Updated 5 months ago

pyonmttok • Rank 18.5 • Science 26%

Fast and customizable text tokenization library with BPE and SentencePiece support

Updated 5 months ago

https://github.com/bluebrain/search • Rank 11.4 • Science 33%

Blue Brain text mining toolbox for semantic search and structured information extraction

Updated 6 months ago

spacy-wrap • Rank 12.4 • Science 26%

spaCy-wrap is a wrapper library for spaCy that lets you include existing fine-tuned transformers from Huggingface in your spaCy pipeline.

Updated 5 months ago

word2vec • Rank 15.1 • Science 23%

Distributed Representations of Words using word2vec

Updated 5 months ago

ai • Rank 10.9 • Science 26%

AI: a collection of artificial intelligence tools, covering machine learning, deep learning, and natural language processing

Updated 5 months ago

https://github.com/bramvanroy/spacy_conll • Rank 12.6 • Science 23%

Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.

Updated 6 months ago

citation-report • Rank 4.5 • Science 31%

Parse legal citations having the publisher format - i.e. SCRA, PHIL, OFFG - referring to Philippine Supreme Court decisions.

Updated 6 months ago

ucto • Rank 9.3 • Science 26%

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto

Updated 5 months ago

stringx • Rank 9.2 • Science 26%

Drop-in replacements for base R string functions powered by stringi

Updated 5 months ago

https://github.com/asyml/texar-pytorch • Rank 14.5 • Science 20%

Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

Updated 5 months ago

https://github.com/brucewlee/lftk • Rank 11.0 • Science 23%

[BEA @ ACL 2023] General-purpose tool for linguistic feature extraction; tested on readability assessment, essay scoring, fake news detection, hate speech detection, etc.