Scientific Software
Updated 6 months ago

corporaexplorer — Peer-reviewed • Rank 14.8 • Science 93%

corporaexplorer: An R package for dynamic exploration of text collections - Published in JOSS (2019)

Scientific Software
Updated 6 months ago

ms3 — Peer-reviewed • Rank 12.4 • Science 95%

ms3: A parser for MuseScore files, serving as data factory for annotated music corpora - Published in JOSS (2023)

Updated 6 months ago

wpextract • Rank 6.1 • Science 85%

Create datasets from WordPress sites for research or archiving

Updated 6 months ago

colibri-core • Rank 6.9 • Science 49%

Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e., patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

Updated 6 months ago

instruction_ja • Rank 3.2 • Science 44%

Japanese instruction data (日本語指示データ)

Updated 6 months ago

https://github.com/compnet/wikisynch • Rank 0.0 • Science 13%

Synchronization between two Wikipedia-based Corpora

Updated 6 months ago

side17-html • Science 36%

SIDE 17: Genre Analysis and Corpus Design. Nineteenth-Century Spanish-American Novels (1830–1910)

Updated 6 months ago

wisesight-sentiment • Science 67%

Thai social media text sentiment dataset

Updated 6 months ago

corpus • Science 57%

Corpus in Tunisian dialect

Updated 6 months ago

dahncorpus • Science 36%

Ground-truth dataset for OCR of 20th-century French typewritten documents, produced by the DAHN project

Updated 6 months ago

jrte-corpus • Science 44%

Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)

Updated 6 months ago

roberta-legal-portuguese • Science 41%

Resources related to the paper RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese.

Updated 6 months ago

https://github.com/cyberagentailab/adparaphrase • Science 49%

This repository contains data for our paper "AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts".

Updated 6 months ago

https://github.com/chartes/of3c • Science 23%

Old French Collective Corpus of the École des chartes

Updated 6 months ago

corpws-meincnodi-rhannau-ymadrodd • Science 65%

Corpws ar gyfer meincnodi tagwyr rhannau ymadrodd Cymraeg | A corpus for benchmarking Welsh part-of-speech taggers

Updated 6 months ago

corpws-cc0 • Science 62%

Corpws o frawddegau o destun Cymraeg wedi'u trwyddedu o dan drwydded CC0 | A corpus of Welsh texts licensed under the CC0 licence

Scientific Software
Updated 6 months ago

quanteda — Peer-reviewed • Science 95%

quanteda: An R package for the quantitative analysis of textual data - Published in JOSS (2018)

Updated 6 months ago

galmisocorpus2023 • Science 44%

Galician corpus for misogyny detection

Updated 6 months ago

wovensnips • Science 44%

WovenSnips: A Lightweight, Free, and Open-source Implementation of Retrieval-Augmented Generation (RAG) using Straico API

Updated 6 months ago

quasi_japanese_reviews • Science 44%

Quasi Japanese Reviews (擬似レビューデータ)

Updated 6 months ago

roman18 • Science 49%

Collection de romans français du dix-huitième siècle (1751-1800) / Collection of Eighteenth-Century French Novels (1751-1800)