portuguese-ngram-frequencies
Frequency counts for ngrams of the Portuguese Language
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Superar
- Language: Python
- Default Branch: master
- Size: 262 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
N-gram Frequency Counts for the Portuguese Language
This repository contains the code used to generate n-gram frequency counts for the Portuguese language. The counts were generated from the BrWaC corpus. The project is based on the work of orgtre/google-books-ngram-frequency.
This repository contains the following files:
Portuguese-Ngram-Frequencies
├─── results [Folder with the n-gram frequency counts]
└─── scripts [Folder with the scripts used to generate the n-gram frequency counts]
How to install
You can install the required dependencies with uv. To do so, run the following commands:
bash
uv venv
uv sync
If you want to install the dependencies using pip, you can run the following command:
bash
pip install -r requirements.txt
How to run
To generate the n-gram frequency counts, you first need the BrWaC corpus in a folder named data at the root of the project. The BrWaC corpus can be downloaded from here.
First, run the script/preprocess_brwac.sh script to preprocess the BrWaC corpus. It writes a one-sentence-per-line version of the corpus to the data folder, named brwac_awk.txt.
Then, run the script/run_ngrams.sh script to generate the n-gram frequency counts. It writes the counts to the results folder.
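To make the counting step concrete, here is a minimal Python sketch of how n-gram frequencies can be computed from one-sentence-per-line text. It is not the code in this repository's scripts: the function names `ngrams` and `count_ngrams`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter
from collections.abc import Iterable, Iterator


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield every contiguous window of n tokens (hypothetical helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count n-grams sentence by sentence, so no n-gram spans two sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace; the real pipeline
        # may tokenize and normalize differently.
        tokens = sentence.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Two toy sentences standing in for lines of the preprocessed corpus.
sample = [
    "o gato dorme no sofá",
    "o gato come no sofá",
]
bigrams = count_ngrams(sample, 2)
print(bigrams.most_common(2))
```

Because `count_ngrams` accepts any iterable of strings, an open file handle on data/brwac_awk.txt could be passed in directly, which would stream the corpus line by line instead of loading it into memory. A corpus of this size will likely still need a more scalable approach than a single in-memory `Counter`.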
Citation
If you use the n-gram frequency counts generated by this project, please cite it as follows:
bibtex
@misc{Inacio2024Ngram,
author = {Lima Inácio, Marcio and Oliveira, Hugo Gonçalo},
month = oct,
title = {{N-gram Frequency Counts for the Portuguese Language}},
url = {https://github.com/Superar/Portuguese-Ngram-Frequencies},
year = {2024}
}
Owner
- Name: Marcio Lima
- Login: Superar
- Kind: user
- Location: Coimbra - Portugal
- Twitter: limma_
- Repositories: 29
- Profile: https://github.com/Superar
PhD student @NLP-CISUC. MSc from @nilc-nlp. BSc from @LALIC-UFSCar.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: N-gram Frequency Counts for the Portuguese Language
message: >-
  If you use the n-gram frequency counts generated by this
  project, please cite us.
type: dataset
authors:
  - given-names: Marcio
    family-names: Lima Inácio
    email: mlinacio@dei.uc.pt
    affiliation: >-
      Centre for Informatics and Systems of the University
      of Coimbra
    orcid: 'https://orcid.org/0000-0002-0875-4574'
  - given-names: Hugo Gonçalo
    family-names: Oliveira
    email: hroliv@dei.uc.pt
    orcid: 'https://orcid.org/0000-0002-5779-8645'
    affiliation: >-
      Centre for Informatics and Systems of the University
      of Coimbra
identifiers:
  - type: url
    value: >-
      https://github.com/Superar/Portuguese-Ngram-Frequencies
    description: Github Repository
repository-code: 'https://github.com/Superar/Portuguese-Ngram-Frequencies'
abstract: >-
  This repository contains the code used to generate the
  n-gram frequency counts for the Portuguese language. The
  n-gram frequency counts were generated using BrWaC corpus.
keywords:
  - N-gram distribution
  - Corpus
  - Portuguese
license: CC-BY-SA-4.0
commit: 3072e8eedab7dcb93d0cde3abef5d59dfacef02d
version: '1.0'
date-released: '2024-10-03'
GitHub Events
Total
- Public event: 1
Last Year
- Public event: 1
Dependencies
- conllu >=5.0.1
- cupy-cuda11x >=13.3.0
- datasets >=3.0.0
- numpy ==1.26.4
- pip >=24.2
- polars >=1.7.0
- pyspark >=3.5.3
- setuptools >=74.1.2
- spacy[transformers] >=3.4.1
- tqdm >=4.66.5
- wheel >=0.44.0
- aiohappyeyeballs ==2.4.0
- aiohttp ==3.10.5
- aiosignal ==1.3.1
- async-timeout ==4.0.3
- attrs ==24.2.0
- blis ==0.7.11
- catalogue ==2.0.10
- certifi ==2024.8.30
- charset-normalizer ==3.3.2
- click ==8.1.7
- colorama ==0.4.6
- confection ==0.1.5
- conllu ==5.0.1
- cupy-cuda11x ==13.3.0
- cymem ==2.0.8
- datasets ==3.0.0
- dill ==0.3.8
- fastrlock ==0.8.2
- filelock ==3.16.0
- frozenlist ==1.4.1
- fsspec ==2024.6.1
- huggingface-hub ==0.24.6
- idna ==3.8
- isort ==5.13.2
- jinja2 ==3.1.4
- langcodes ==3.4.0
- language-data ==1.2.0
- marisa-trie ==1.2.0
- markupsafe ==2.1.5
- mpmath ==1.3.0
- multidict ==6.1.0
- multiprocess ==0.70.16
- murmurhash ==1.0.10
- networkx ==3.3
- numpy ==1.26.4
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu12 ==9.1.0.70
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-nccl-cu12 ==2.20.5
- nvidia-nvjitlink-cu12 ==12.6.68
- nvidia-nvtx-cu12 ==12.1.105
- packaging ==24.1
- pandas ==2.2.2
- pathlib-abc ==0.1.1
- pathy ==0.11.0
- pip ==24.2
- polars ==1.7.0
- preshed ==3.0.9
- py4j ==0.10.9.7
- pyarrow ==17.0.0
- pydantic ==1.9.2
- pyspark ==3.5.3
- python-dateutil ==2.9.0.post0
- pytz ==2024.2
- pyyaml ==6.0.2
- regex ==2024.7.24
- requests ==2.32.3
- setuptools ==74.1.2
- six ==1.16.0
- smart-open ==6.4.0
- spacy ==3.4.1
- spacy-alignments ==0.9.1
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- spacy-transformers ==1.1.9
- srsly ==2.4.8
- sympy ==1.13.2
- thinc ==8.1.12
- tokenizers ==0.13.3
- torch ==2.4.1
- tqdm ==4.66.5
- transformers ==4.25.1
- triton ==3.0.0
- typer ==0.4.2
- typing-extensions ==4.12.2
- tzdata ==2024.1
- urllib3 ==2.2.2
- wasabi ==0.10.1
- wheel ==0.44.0
- xxhash ==3.5.0
- yarl ==1.11.1