portuguese-ngram-frequencies

Frequency counts for ngrams of the Portuguese Language

https://github.com/superar/portuguese-ngram-frequencies

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.5%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: Superar
  • Language: Python
  • Default Branch: master
  • Size: 262 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.md

N-gram Frequency Counts for the Portuguese Language

This repository contains the code used to generate n-gram frequency counts for the Portuguese language. The counts were generated from the BrWaC corpus. The project is based on the work of orgtre/google-books-ngram-frequency.

This repository contains the following files:

    Portuguese-Ngram-Frequencies
    ├─── results    [Folder with the n-gram frequency counts]
    └─── scripts    [Folder with the scripts used to generate the n-gram frequency counts]

How to install

You can install the required dependencies with uv. To do so, run the following commands:

    uv venv
    uv sync

If you want to install the dependencies using pip, you can run the following command:

    pip install -r requirements.txt

How to run

To generate the n-gram frequency counts, you first need to place the BrWaC corpus in a folder named data in the root of the project. The BrWaC corpus can be downloaded from here.

Then run the scripts/preprocess_brwac.sh script to preprocess the BrWaC corpus. It writes a one-sentence-per-line version of the corpus to the data folder, named brwac_awk.txt.
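To illustrate what "one sentence per line" means here, the following is a minimal Python sketch, not the repository's preprocessing script. It assumes the input is a vertical-format file with one token per line and sentences delimited by `<s>` and `</s>` tags; that input layout and the function name `sentences_from_vertical` are assumptions made for this example.

```python
from typing import Iterable, Iterator


def sentences_from_vertical(lines: Iterable[str]) -> Iterator[str]:
    """Yield one space-joined sentence per <s>...</s> block.

    Assumes one token per line and structural tags on their own lines.
    """
    tokens: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "</s>":
            # End of sentence: emit the collected tokens, if any.
            if tokens:
                yield " ".join(tokens)
            tokens = []
        elif line.startswith("<") and line.endswith(">"):
            # Skip other structural tags such as <doc ...>, <p> or <s>.
            continue
        else:
            tokens.append(line)
    # Emit a trailing sentence that was not closed by </s>.
    if tokens:
        yield " ".join(tokens)
```

Because the function is a generator over an iterable of lines, it can be fed an open file handle and will process a large corpus without loading it into memory.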

After that, run the scripts/run_ngrams.sh script to generate the n-gram frequency counts. It writes the counts to the results folder.
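As a rough illustration of the counting step, here is a minimal Python sketch that counts n-grams from one-sentence-per-line text. It is not the repository's implementation: the function name `count_ngrams`, the whitespace tokenization, and the choice never to form n-grams across sentence boundaries are all assumptions for this example.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count n-grams of whitespace-separated tokens, sentence by sentence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Slide a window of size n over the sentence; sentences shorter
        # than n contribute nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    example = ["o gato dorme", "o gato come"]
    bigrams = count_ngrams(example, 2)
    print(bigrams.most_common(1))  # [(('o', 'gato'), 2)]
```

An in-memory `Counter` is only practical for small inputs. For a corpus the size of BrWaC, a distributed or out-of-core approach would likely be needed.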

Citation

If you use the n-gram frequency counts generated by this project, please cite the following paper:

    @misc{Inacio2024Ngram,
      author = {Lima Inácio, Marcio and Oliveira, Hugo Gonçalo},
      month = oct,
      title = {{N-gram Frequency Counts for the Portuguese Language}},
      url = {https://github.com/Superar/Portuguese-Ngram-Frequencies},
      year = {2024}
    }

Owner

  • Name: Marcio Lima
  • Login: Superar
  • Kind: user
  • Location: Coimbra - Portugal

PhD student @NLP-CISUC. MSc from @nilc-nlp. BSc from @LALIC-UFSCar.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: N-gram Frequency Counts for the Portuguese Language
message: >-
  If you use the n-gram frequency counts generated by this
  project, please cite us.
type: dataset
authors:
  - given-names: Marcio
    family-names: Lima Inácio
    email: mlinacio@dei.uc.pt
    affiliation: >-
      Centre for Informatics and Systems of the University
      of Coimbra
    orcid: 'https://orcid.org/0000-0002-0875-4574'
  - given-names: Hugo Gonçalo
    family-names: Oliveira
    email: hroliv@dei.uc.pt
    orcid: 'https://orcid.org/0000-0002-5779-8645'
    affiliation: >-
      Centre for Informatics and Systems of the University
      of Coimbra
identifiers:
  - type: url
    value: >-
      https://github.com/Superar/Portuguese-Ngram-Frequencies
    description: Github Repository
repository-code: 'https://github.com/Superar/Portuguese-Ngram-Frequencies'
abstract: >-
  This repository contains the code used to generate the
  n-gram frequency counts for the Portuguese language. The
  n-gram frequency counts were generated using BrWaC corpus.
keywords:
  - N-gram distribution
  - Corpus
  - Portuguese
license: CC-BY-SA-4.0
commit: 3072e8eedab7dcb93d0cde3abef5d59dfacef02d
version: '1.0'
date-released: '2024-10-03'

GitHub Events

Total
  • Public event: 1
Last Year
  • Public event: 1

Dependencies

pyproject.toml pypi
  • conllu >=5.0.1
  • cupy-cuda11x >=13.3.0
  • datasets >=3.0.0
  • numpy ==1.26.4
  • pip >=24.2
  • polars >=1.7.0
  • pyspark >=3.5.3
  • setuptools >=74.1.2
  • spacy [transformers]>=3.4.1
  • tqdm >=4.66.5
  • wheel >=0.44.0
requirements.txt pypi
  • aiohappyeyeballs ==2.4.0
  • aiohttp ==3.10.5
  • aiosignal ==1.3.1
  • async-timeout ==4.0.3
  • attrs ==24.2.0
  • blis ==0.7.11
  • catalogue ==2.0.10
  • certifi ==2024.8.30
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • colorama ==0.4.6
  • confection ==0.1.5
  • conllu ==5.0.1
  • cupy-cuda11x ==13.3.0
  • cymem ==2.0.8
  • datasets ==3.0.0
  • dill ==0.3.8
  • fastrlock ==0.8.2
  • filelock ==3.16.0
  • frozenlist ==1.4.1
  • fsspec ==2024.6.1
  • huggingface-hub ==0.24.6
  • idna ==3.8
  • isort ==5.13.2
  • jinja2 ==3.1.4
  • langcodes ==3.4.0
  • language-data ==1.2.0
  • marisa-trie ==1.2.0
  • markupsafe ==2.1.5
  • mpmath ==1.3.0
  • multidict ==6.1.0
  • multiprocess ==0.70.16
  • murmurhash ==1.0.10
  • networkx ==3.3
  • numpy ==1.26.4
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu12 ==9.1.0.70
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-nccl-cu12 ==2.20.5
  • nvidia-nvjitlink-cu12 ==12.6.68
  • nvidia-nvtx-cu12 ==12.1.105
  • packaging ==24.1
  • pandas ==2.2.2
  • pathlib-abc ==0.1.1
  • pathy ==0.11.0
  • pip ==24.2
  • polars ==1.7.0
  • preshed ==3.0.9
  • py4j ==0.10.9.7
  • pyarrow ==17.0.0
  • pydantic ==1.9.2
  • pyspark ==3.5.3
  • python-dateutil ==2.9.0.post0
  • pytz ==2024.2
  • pyyaml ==6.0.2
  • regex ==2024.7.24
  • requests ==2.32.3
  • setuptools ==74.1.2
  • six ==1.16.0
  • smart-open ==6.4.0
  • spacy ==3.4.1
  • spacy-alignments ==0.9.1
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • spacy-transformers ==1.1.9
  • srsly ==2.4.8
  • sympy ==1.13.2
  • thinc ==8.1.12
  • tokenizers ==0.13.3
  • torch ==2.4.1
  • tqdm ==4.66.5
  • transformers ==4.25.1
  • triton ==3.0.0
  • typer ==0.4.2
  • typing-extensions ==4.12.2
  • tzdata ==2024.1
  • urllib3 ==2.2.2
  • wasabi ==0.10.1
  • wheel ==0.44.0
  • xxhash ==3.5.0
  • yarl ==1.11.1