corporaexplorer
corporaexplorer: An R package for dynamic exploration of text collections - Published in JOSS (2019)
ms3
ms3: A parser for MuseScore files, serving as data factory for annotated music corpora - Published in JOSS (2023)
colibri-core
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
https://github.com/alexeyev/awesome-kyrgyz-nlp
Kyrgyz language processing software, models and datasets.
https://github.com/dcavar/sociallenseonline.github.io
social-lense.online website.
https://github.com/compnet/wikisynch
Synchronization between two Wikipedia-based Corpora
side17-html
SIDE 17: Genre Analysis and Corpus Design. Nineteenth-Century Spanish-American Novels (1830–1910)
https://github.com/hidadeng/chinese-pretrained-word-embeddings
A collection of resources on Chinese text analysis tools, corpora, and pretrained models.
dahncorpus
Ground-truth dataset for OCR of 20th-century French typewritten documents, produced by the DAHN project
scisynthesis
Prompts, dataset, and code addressing the task of scientific synthesis
roberta-legal-portuguese
Resources related to the paper RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese.
https://github.com/cyberagentailab/adparaphrase
This repository contains data for our paper "AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts".
https://github.com/chartes/of3c
Old French Collective Corpus of the École des chartes
corpws-meincnodi-rhannau-ymadrodd
Corpws ar gyfer meincnodi tagwyr rhannau ymadrodd Cymraeg | A corpus for benchmarking Welsh part-of-speech taggers
corpws-cc0
Corpws o frawddegau o destun Cymraeg wedi'u trwyddedu o dan drwydded CC0 | A corpus of Welsh texts licensed under the CC0 licence
quanteda
quanteda: An R package for the quantitative analysis of textual data - Published in JOSS (2018)
wovensnips
WovenSnips: A Lightweight, Free, and Open-source Implementation of Retrieval-Augmented Generation (RAG) using Straico API
roman18
Collection de romans français du dix-huitième siècle (1751-1800) / Collection of Eighteenth-Century French Novels (1751-1800)