portuguese-ngram-frequencies

Frequency counts for ngrams of the Portuguese Language

https://github.com/superar/portuguese-ngram-frequencies

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.5%) to scientific vocabulary

Last synced: 10 months ago

Repository

Frequency counts for ngrams of the Portuguese Language

Basic Info

Host: GitHub
Owner: Superar
Language: Python
Default Branch: master
Size: 262 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme Citation

N-gram Frequency Counts for the Portuguese Language

This repository contains the code used to generate n-gram frequency counts for the Portuguese language. The counts were generated from the BrWaC corpus. The project is based on the work of orgtre/google-books-ngram-frequency.

This repository contains the following files:

Portuguese-Ngram-Frequencies
├─── results    [Folder with the n-gram frequency counts]
└─── scripts    [Folder with the scripts used to generate the n-gram frequency counts]

How to install

You can install the required dependencies using uv. To do so, run the following commands:

    uv venv
    uv sync

If you want to install the dependencies using pip, you can run the following command:

    pip install -r requirements.txt

How to run

To generate the n-gram frequency counts, you first need to place the BrWaC corpus in a folder named data in the root of the project. The BrWaC corpus can be downloaded from here.

Then run the scripts/preprocess_brwac.sh script to preprocess the BrWaC corpus. This script generates a one-sentence-per-line version of the corpus in the data folder, named brwac_awk.txt.
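To illustrate what "one sentence per line" means, here is a minimal Python sketch of that kind of conversion. It is not the repository's preprocessing script. It assumes a hypothetical input layout with one token per line and sentences wrapped in `<s>` and `</s>` markers; the function name `sentences_from_vertical` is made up for this example, and the real corpus files may be laid out differently.

```python
from typing import Iterable, Iterator


def sentences_from_vertical(lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined sentence per <s>...</s> block.

    Assumed (hypothetical) input layout: one token per line, sentences
    closed by </s>, and every other markup line starting with '<'.
    """
    tokens: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "</s>":
            # End of sentence: emit the collected tokens as one line.
            if tokens:
                yield " ".join(tokens)
            tokens = []
        elif line.startswith("<"):
            # Skip structural markup such as <doc ...>, <p> or <s>.
            # Simplification: a token that itself starts with '<' is dropped too.
            continue
        else:
            tokens.append(line)
    if tokens:
        # Flush a trailing sentence that was never closed.
        yield " ".join(tokens)


if __name__ == "__main__":
    sample = ["<doc>", "<s>", "O", "gato", "dorme", ".", "</s>",
              "<s>", "Bom", "dia", "</s>", "</doc>"]
    for sentence in sentences_from_vertical(sample):
        print(sentence)
```

Working on an iterable of lines rather than a whole file keeps memory use flat, which matters for a corpus that is too large to load at once.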

After that, run the scripts/run_ngrams.sh script to generate the n-gram frequency counts. The counts are written to the results folder.
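The idea behind the counting step can be sketched in a few lines of standard-library Python. This is an illustration only, not the repository's pipeline. The function name `count_ngrams`, the whitespace tokenization, and the lowercasing are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count word n-grams without crossing sentence (line) boundaries."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for sentence in sentences:
        # Assumption: tokens are whitespace-separated and case is ignored.
        tokens = sentence.lower().split()
        # Slide a window of width n over the tokens of this sentence only.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    corpus = ["o gato dorme", "o gato come", "o cão dorme"]
    for ngram, freq in count_ngrams(corpus, 2).most_common(3):
        print(" ".join(ngram), freq)
```

Because each input line holds exactly one sentence, processing lines independently guarantees that no n-gram spans two sentences. An in-memory `Counter` is only practical for small inputs; a corpus-scale run would need a distributed or out-of-core approach.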

Citation

If you use the n-gram frequency counts generated by this project, please cite the following paper:

@misc{Inacio2024Ngram,
  author = {Lima Inácio, Marcio and Oliveira, Hugo Gonçalo},
  month = oct,
  title = {{N-gram Frequency Counts for the Portuguese Language}},
  url = {https://github.com/Superar/Portuguese-Ngram-Frequencies},
  year = {2024}
}

Owner

Name: Marcio Lima
Login: Superar
Kind: user
Location: Coimbra - Portugal

Twitter: limma_
Repositories: 29
Profile: https://github.com/Superar

PhD student @NLP-CISUC. MSc from @nilc-nlp. BSc from @LALIC-UFSCar.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: N-gram Frequency Counts for the Portuguese Language
message: >-
  If you use the n-gram frequency counts generated by this
  project, please cite us.
type: dataset
authors:
  - given-names: Marcio
    family-names: Lima Inácio
    email: mlinacio@dei.uc.pt
    affiliation: >-
      Centre for Informatics and Systems of the University
      of Coimbra
    orcid: 'https://orcid.org/0000-0002-0875-4574'
  - given-names: Hugo Gonçalo
    family-names: Oliveira
    email: hroliv@dei.uc.pt
    orcid: 'https://orcid.org/0000-0002-5779-8645'
    affiliation: >-
      Centre for Informatics and Systems of the University
      of Coimbra
identifiers:
  - type: url
    value: >-
      https://github.com/Superar/Portuguese-Ngram-Frequencies
    description: Github Repository
repository-code: 'https://github.com/Superar/Portuguese-Ngram-Frequencies'
abstract: >-
  This repository contains the code used to generate the
  n-gram frequency counts for the Portuguese language. The
  n-gram frequency counts were generated using BrWaC corpus.
keywords:
  - N-gram distribution
  - Corpus
  - Portuguese
license: CC-BY-SA-4.0
commit: 3072e8eedab7dcb93d0cde3abef5d59dfacef02d
version: '1.0'
date-released: '2024-10-03'

GitHub Events

Total

Public event: 1

Last Year

Public event: 1

Dependencies

pyproject.toml pypi

conllu >=5.0.1
cupy-cuda11x >=13.3.0
datasets >=3.0.0
numpy ==1.26.4
pip >=24.2
polars >=1.7.0
pyspark >=3.5.3
setuptools >=74.1.2
spacy[transformers] >=3.4.1
tqdm >=4.66.5
wheel >=0.44.0

requirements.txt pypi

aiohappyeyeballs ==2.4.0
aiohttp ==3.10.5
aiosignal ==1.3.1
async-timeout ==4.0.3
attrs ==24.2.0
blis ==0.7.11
catalogue ==2.0.10
certifi ==2024.8.30
charset-normalizer ==3.3.2
click ==8.1.7
colorama ==0.4.6
confection ==0.1.5
conllu ==5.0.1
cupy-cuda11x ==13.3.0
cymem ==2.0.8
datasets ==3.0.0
dill ==0.3.8
fastrlock ==0.8.2
filelock ==3.16.0
frozenlist ==1.4.1
fsspec ==2024.6.1
huggingface-hub ==0.24.6
idna ==3.8
isort ==5.13.2
jinja2 ==3.1.4
langcodes ==3.4.0
language-data ==1.2.0
marisa-trie ==1.2.0
markupsafe ==2.1.5
mpmath ==1.3.0
multidict ==6.1.0
multiprocess ==0.70.16
murmurhash ==1.0.10
networkx ==3.3
numpy ==1.26.4
nvidia-cublas-cu12 ==12.1.3.1
nvidia-cuda-cupti-cu12 ==12.1.105
nvidia-cuda-nvrtc-cu12 ==12.1.105
nvidia-cuda-runtime-cu12 ==12.1.105
nvidia-cudnn-cu12 ==9.1.0.70
nvidia-cufft-cu12 ==11.0.2.54
nvidia-curand-cu12 ==10.3.2.106
nvidia-cusolver-cu12 ==11.4.5.107
nvidia-cusparse-cu12 ==12.1.0.106
nvidia-nccl-cu12 ==2.20.5
nvidia-nvjitlink-cu12 ==12.6.68
nvidia-nvtx-cu12 ==12.1.105
packaging ==24.1
pandas ==2.2.2
pathlib-abc ==0.1.1
pathy ==0.11.0
pip ==24.2
polars ==1.7.0
preshed ==3.0.9
py4j ==0.10.9.7
pyarrow ==17.0.0
pydantic ==1.9.2
pyspark ==3.5.3
python-dateutil ==2.9.0.post0
pytz ==2024.2
pyyaml ==6.0.2
regex ==2024.7.24
requests ==2.32.3
setuptools ==74.1.2
six ==1.16.0
smart-open ==6.4.0
spacy ==3.4.1
spacy-alignments ==0.9.1
spacy-legacy ==3.0.12
spacy-loggers ==1.0.5
spacy-transformers ==1.1.9
srsly ==2.4.8
sympy ==1.13.2
thinc ==8.1.12
tokenizers ==0.13.3
torch ==2.4.1
tqdm ==4.66.5
transformers ==4.25.1
triton ==3.0.0
typer ==0.4.2
typing-extensions ==4.12.2
tzdata ==2024.1
urllib3 ==2.2.2
wasabi ==0.10.1
wheel ==0.44.0
xxhash ==3.5.0
yarl ==1.11.1
