calamancy

NLP pipelines for Tagalog using spaCy

https://github.com/ljvmiranda921/calamancy

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.2%) to scientific vocabulary

Keywords

computational-linguistics low-resource-languages low-resource-nlp machine-learning natural-language-processing ner nlp spacy

Last synced: 6 months ago

Repository

NLP pipelines for Tagalog using spaCy

Basic Info

Host: GitHub
Owner: ljvmiranda921
License: mit
Language: Python
Default Branch: master
Homepage: https://ljvmiranda921.github.io/calamanCy/
Size: 1.32 MB

Statistics

Stars: 62
Watchers: 4
Forks: 5
Open Issues: 1
Releases: 2

Topics

computational-linguistics low-resource-languages low-resource-nlp machine-learning natural-language-processing ner nlp spacy

Created over 3 years ago · Last pushed 7 months ago

Metadata Files

Readme License Citation

calamanCy: NLP pipelines for Tagalog

calamanCy is a Tagalog natural language preprocessing framework made with spaCy. Its goal is to provide pipelines and datasets for downstream NLP tasks. This repository contains material for using calamanCy, reproduction of results, and guides on usage.

calamanCy takes inspiration from other language-specific spaCy Universe frameworks such as DaCy, huSpaCy, and graCy. The name comes from calamansi, a citrus fruit native to the Philippines and used in traditional Filipino cuisine.

Website: https://ljvmiranda921.github.io/calamanCy

News

[2025-05-15] UD-NewsCrawl, a work that currently powers calamanCy v2 and where I'm one of the lead authors, has been accepted at ACL 2025! I will be presenting this work in Vienna on July 29! See you :)
[2025-01-19] Released v0.2.0 models with significantly improved performance on syntactic parsing and NER! All thanks to the newly-released UD-NewsCrawl treebank! See full changes in this blogpost.
[2024-08-01] Released new NER-only models based on GLiNER! You can find the models in this HuggingFace collection. Span-Marker and calamanCy models are still superior, but GLiNER offers a lot of extensibility on unseen entity labels. You can find the training pipeline here.
[2024-07-02] I talked about calamanCy during my guest lecture, "Artisanal Filipino NLP Resources in the time of Large Language Models," @ DLSU Manila. You can find the slides (and an accompanying blog post) here.
[2023-12-05] We released the paper calamanCy: A Tagalog Natural Language Processing Toolkit, which will be presented at the NLP-OSS workshop at EMNLP 2023! Feel free to check out the Tagalog NLP collection on HuggingFace.
[2023-11-01] The named entity recognition (NER) dataset used to train the NER component of calamanCy now has a corresponding paper: Developing a Named Entity Recognition Dataset for Tagalog! It will be presented at the SEALP workshop at IJCNLP-AACL 2023! The dataset is also available on HuggingFace. I've also shared my thoughts on the annotation process in my blog.
[2023-08-01] First release of calamanCy! Please check out this blog post to learn more and read some of my preliminary work back in February here.

Installation

To get started with calamanCy, simply install it using pip by running the following line in your terminal:

```sh
pip install calamanCy
```

Development

If you are developing calamanCy, first clone the repository:

```sh
git clone git@github.com:ljvmiranda921/calamanCy.git
```

Then, create a virtual environment and install the dependencies:

```sh
python -m venv .venv
.venv/bin/pip install -e .  # requires pip>=23.0
.venv/bin/pip install .[dev]
# Activate the virtual environment
source .venv/bin/activate
```

[Experimental] If you want to use uv, then install via the following commands:

```sh
uv sync --dev
# Activate the virtual environment
source .venv/bin/activate
```

We also require using pre-commit hooks to standardize formatting. The pre-commit dependency should be in your virtual environment if the installation steps above were successful:

```sh
pre-commit install
```

Running the tests

We use pytest as our test runner:

```sh
python -m pytest --pyargs calamancy
```

Usage

To use calamanCy you first have to download either the medium, large, or transformer model. To see a list of all available models, run:

```python
import calamancy

for model in calamancy.models():
    print(model)

# ..
# tl_calamancy_md-0.1.0
# tl_calamancy_lg-0.1.0
# tl_calamancy_trf-0.1.0
```

To download and load a model, run:

```python
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Ako si Juan de la Cruz")
```

The nlp object is an instance of spaCy's Language class and you can use it as any other spaCy pipeline. You can also access these models on Hugging Face.
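Because the loaded pipeline behaves like any spaCy pipeline, the resulting Doc can be read through the usual token and entity attributes. The sketch below is illustrative only: `summarize_doc` and `run_with_calamancy` are hypothetical helper names, and the second function assumes the model identifier shown above is available for download.

```python
from typing import Any, Dict, List


def summarize_doc(doc: Any) -> Dict[str, List[tuple]]:
    """Collect per-token annotations and entity spans from a spaCy Doc."""
    # token.text, token.pos_ and token.dep_ are standard spaCy Token attributes.
    tokens = [(token.text, token.pos_, token.dep_) for token in doc]
    # doc.ents holds the entity spans; each span has .text and .label_.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"tokens": tokens, "entities": entities}


def run_with_calamancy(text: str = "Ako si Juan de la Cruz") -> Dict[str, List[tuple]]:
    """Hypothetical end-to-end call; needs calamancy installed and a model download."""
    import calamancy  # imported lazily so the helper above works without it

    nlp = calamancy.load("tl_calamancy_md-0.1.0")
    return summarize_doc(nlp(text))
```

Calling `run_with_calamancy()` returns a dictionary with one list of `(text, pos, dep)` tuples and one list of `(text, label)` tuples; the actual values depend on the model version.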

Models and Datasets

calamanCy provides Tagalog models and datasets that you can use in your spaCy pipelines. You can download them directly or use the calamancy Python library to access them. The training procedure for each pipeline can be found in the models/ directory. They are further subdivided into versions. Each folder is an instance of a spaCy project.

Here are the models for the latest release:

| Model | Pipelines | Description |
| --- | --- | --- |
| tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Using floret vectors (50k keys) |
| tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k) |
| tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |

API

The calamanCy library contains utility functions that help you load its models and infer on your text. You can think of these functions as "syntactic sugar" to the spaCy API. We highly recommend checking out the spaCy Doc object, as it provides the most flexibility.

Loaders

The loader functions provide an easier interface to download calamanCy models. These models are hosted on HuggingFace so you can try them out first before downloading.

`function` `get_latest_version`

Return the latest version of a calamanCy model.

| Argument | Type | Description |
| --- | --- | --- |
| model | str | The string indicating the model. |
| RETURNS | str | The latest version of the model. |
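To make the idea of a "latest version" concrete, here is a minimal sketch of how one could pick the highest version from a list of identifiers. It is not the library's implementation: `version_key` and `pick_latest` are hypothetical names, and the code assumes identifiers follow the `<name>-<version>` pattern seen in the Usage section, with purely numeric dotted versions.

```python
from typing import Iterable, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '0.2.0' into (0, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_latest(model: str, available: Iterable[str]) -> str:
    """Return the highest version of `model` among '<name>-<version>' ids."""
    versions = []
    for model_id in available:
        # Assumption: the version is everything after the last hyphen.
        name, _, version = model_id.rpartition("-")
        if name == model:
            versions.append(version)
    if not versions:
        raise ValueError(f"No versions found for {model!r}")
    return max(versions, key=version_key)


# Hand-written identifiers for illustration, not the output of calamancy.models().
sample_ids = ["tl_calamancy_md-0.1.0", "tl_calamancy_md-0.2.0", "tl_calamancy_lg-0.1.0"]
print(pick_latest("tl_calamancy_md", sample_ids))
```

Comparing integer tuples rather than raw strings avoids ranking a version such as 0.9.0 above 0.10.0.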

`function` `models`

Get a list of valid calamanCy models.

| Argument | Type | Description |
| --- | --- | --- |
| RETURNS | List[str] | List of valid calamanCy models |

`function` `load`

Load a calamanCy model as a spaCy language pipeline.

| Argument | Type | Description |
| --- | --- | --- |
| model | str | The model to download. See the available models at calamancy.models(). |
| force | bool | Force download the model. Defaults to False. |
| **kwargs | dict | Additional arguments to spacy.load(). |
| RETURNS | Language | A spaCy language pipeline. |

Inference

Below are lightweight utility classes for users who are not familiar with spaCy's primitives. They are only useful for inference and not for training. If you wish to train on top of these calamanCy models (e.g., text categorization, task-specific NER, etc.), we advise you to follow the standard spaCy training workflow.

General usage: first, you need to instantiate a class with the name of a model. Then, you can use the __call__ method to perform the prediction. The output is of the type Iterable[Tuple[str, Any]] where the first part of the tuple is the token and the second part is its label.

`method` `EntityRecognizer.__call__`

Perform named entity recognition (NER). By default, it uses v0.1.0 of TLUnified-NER with the following entity labels: PER (Person), ORG (Organization), LOC (Location).

| Argument | Type | Description |
| --- | --- | --- |
| text | str | The text to get the entities from. |
| YIELDS | Iterable[Tuple[str, str]] | The token and its entity in IOB format. |
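Token-level IOB output is often easier to consume once it is merged into whole entity spans. The following is a minimal sketch of that post-processing step. It assumes tags are written as `B-<LABEL>`, `I-<LABEL>`, and `O`, which the description above does not spell out; `iob_to_spans` is a hypothetical helper, and the sample pairs are hand-written rather than model output.

```python
from typing import Iterable, List, Tuple


def iob_to_spans(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (token, IOB tag) pairs into (entity text, label) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = ""

    for token, tag in pairs:
        # "B-PER" -> ("B", "PER"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current_tokens and label == current_label:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], ""
        if prefix in ("B", "I") and label:
            # "B-" starts a span; a stray "I-" is treated as a start too.
            current_tokens, current_label = [token], label

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-written sample in the assumed tag format.
sample = [
    ("Ako", "O"), ("si", "O"),
    ("Juan", "B-PER"), ("de", "I-PER"), ("la", "I-PER"), ("Cruz", "I-PER"),
]
print(iob_to_spans(sample))
```

Joining tokens with a single space is a simplification; when exact character offsets matter, working with `doc.ents` on the spaCy Doc is the more reliable route.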

`method` `Tagger.__call__`

Perform part-of-speech tagging. It uses the annotations from the TRG and Ugnayan treebanks with the following tags: ADJ, ADP, ADV, AUX, DET, INTJ, NOUN, PART, PRON, PROPN, PUNCT, SCONJ, VERB.

| Argument | Type | Description |
| --- | --- | --- |
| text | str | The text to get the POS tags from. |
| YIELDS | Iterable[Tuple[str, Tuple[str, str]]] | The token and its coarse- and fine-grained POS tag. |
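The nested tuple in the yielded type can be unpacked directly when iterating. As a small sketch, the helper below tallies coarse-grained tags. `count_coarse_tags` is a hypothetical name, the code assumes the inner tuple is ordered (coarse, fine) as the description suggests, and the sample pairs are hand-written rather than produced by a model.

```python
from collections import Counter
from typing import Iterable, Tuple

TaggedToken = Tuple[str, Tuple[str, str]]


def count_coarse_tags(tagged: Iterable[TaggedToken]) -> Counter:
    """Count how often each coarse-grained POS tag occurs."""
    # Unpack (token, (coarse, fine)) and keep only the coarse tag.
    return Counter(coarse for _token, (coarse, _fine) in tagged)


# Hand-written sample; the fine-grained tags here simply repeat the coarse ones.
sample = [
    ("Ako", ("PRON", "PRON")),
    ("si", ("ADP", "ADP")),
    ("Juan", ("PROPN", "PROPN")),
    ("Cruz", ("PROPN", "PROPN")),
]
print(count_coarse_tags(sample).most_common())
```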

`method` `Parser.__call__`

Perform syntactic dependency parsing. It uses the annotations from the TRG and Ugnayan treebanks.

| Argument | Type | Description |
| --- | --- | --- |
| text | str | The text to get the dependency relations from. |
| YIELDS | Iterable[Tuple[str, str]] | The token and its dependency relation. |
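Since the parser yields flat (token, relation) pairs, a common follow-up is to index tokens by their relation. The sketch below shows one way to do that. `group_by_relation` is a hypothetical helper, and the relation labels in the sample are hand-written Universal Dependencies style names, not actual model output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_relation(parsed: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each dependency relation to the tokens that carry it, in order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for token, relation in parsed:
        groups[relation].append(token)
    # Return a plain dict so missing keys raise KeyError instead of being created.
    return dict(groups)


# Hand-written sample pairs for illustration only.
sample = [("Ako", "root"), ("si", "case"), ("Juan", "nsubj"), ("Cruz", "flat")]
print(group_by_relation(sample))
```

Note that the pairs alone do not identify each token's head; for full tree navigation, `token.head` on the spaCy Doc is still needed.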

Reporting Issues

If you have questions regarding the usage of calamanCy, bug reports, or just want to give us feedback after giving it a spin, please use the Issue tracker. Thank you!

Citation

If you are citing the open-source software, please use:

```bib
@inproceedings{miranda-2023-calamancy,
    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
    author = "Miranda, Lester James",
    editor = "Tan, Liling and Milajevs, Dmitrijs and Chauhan, Geeticka and Gwinnup, Jeremy and Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.1",
    pages = "1--7",
    abstract = "We introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks. calamanCy addresses the development gap by providing a consistent API for building NLP applications and offering general-purpose multitask models with out-of-the-box support for dependency parsing, parts-of-speech (POS) tagging, and named entity recognition (NER). calamanCy aims to accelerate the progress of Tagalog NLP by consolidating disjointed resources in a unified framework. The calamanCy toolkit is available on GitHub: https://github.com/ljvmiranda921/calamanCy.",
}
```

If you are citing the NER dataset, please use:

```bib
@inproceedings{miranda-2023-developing,
    title = "Developing a Named Entity Recognition Dataset for {T}agalog",
    author = "Miranda, Lester James",
    editor = "Wijaya, Derry and Aji, Alham Fikri and Vania, Clara and Winata, Genta Indra and Purwarianti, Ayu",
    booktitle = "Proceedings of the First Workshop in South East Asian Language Processing",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sealp-1.2",
    doi = "10.18653/v1/2023.sealp-1.2",
    pages = "13--20",
}
```

Owner

Name: Lj Miranda
Login: ljvmiranda921
Kind: user
Company: @explosion

Website: https://ljvmiranda921.github.io/
Twitter: ljvmiranda
Repositories: 40
Profile: https://github.com/ljvmiranda921

Machine Learning Engineer at @explosion 💥

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Miranda"
    given-names: "Lester James Validad"
    orcid: "https://orcid.org/0000-0002-7872-6464"
title: "calamanCy: A Tagalog Natural Language Processing Toolkit"
version: 0.1.0
doi: 10.48550/arXiv.2311.07171
date-released: 2023-07-02
url: "https://github.com/ljvmiranda921/calamanCy"

GitHub Events

Total

Create event: 14
Release event: 1
Issues event: 12
Watch event: 16
Delete event: 13
Issue comment event: 12
Push event: 117
Pull request event: 20
Fork event: 1

Last Year

Create event: 14
Release event: 1
Issues event: 12
Watch event: 16
Delete event: 13
Issue comment event: 12
Push event: 117
Pull request event: 20
Fork event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 283
Total Committers: 4
Avg Commits per committer: 70.75
Development Distribution Score (DDS): 0.011

Past Year

Commits: 65
Committers: 3
Avg Commits per committer: 21.667
Development Distribution Score (DDS): 0.031

Top Committers

Name	Email	Commits
Lj Miranda	l**a@g**m	280
doge	3****9	1
Ikko Eltociear Ashimine	e**r@g**m	1
root	r**t@a**n	1

Committer Domains (Top 20 + Academic)

allennlp-cs-aus-273.reviz.ai2.in: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 25
Total pull requests: 42
Average time to close issues: 2 months
Average time to close pull requests: 4 days
Total issue authors: 11
Total pull request authors: 3
Average comments per issue: 1.16
Average comments per pull request: 0.12
Merged pull requests: 42
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 6
Pull requests: 19
Average time to close issues: 10 days
Average time to close pull requests: 1 day
Issue authors: 4
Pull request authors: 1
Average comments per issue: 1.5
Average comments per pull request: 0.0
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

ljvmiranda921 (10)
stickykeys99 (4)
andreo-serrano (2)
wadid (1)
abraham-harris (1)
woodthom2 (1)
Nirupam2016 (1)
weixiu00 (1)
Tutuldot (1)
herbiel (1)
lamrongol (1)

Pull Request Authors

ljvmiranda921 (42)
stickykeys99 (2)
eltociear (1)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

enhancement (2) bug (2) documentation (1)

Packages

Total packages: 1
Total downloads:
- pypi 252 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 9
Total maintainers: 1

pypi.org: calamancy

NLP Pipelines for Tagalog

Documentation: https://calamancy.readthedocs.io/
License: MIT License
Latest release: 0.2.2
published about 1 year ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 252 Last month

Rankings

Dependent packages count: 10.0%

Stargazers count: 12.3%

Average: 17.2%

Forks count: 19.1%

Dependent repos count: 21.8%

Downloads: 22.9%

Maintainers (1)

ljvmiranda

Last synced: 6 months ago

Dependencies

.github/workflows/publish.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite
pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite

.github/workflows/test.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

models/v0.1.0/meta.json cpan

datasets/tl_calamancy_gold_corpus/requirements.txt pypi

plotext ==4.2.0
spacy >=3.4.0
spacy-transformers *
srsly *
tqdm *
typer *
wandb *
wasabi *

datasets/tl_calamancy_silver_corpus/requirements.txt pypi

datasets *
spacy *
tqdm *

models/v0.1.0/requirements.txt pypi

pathy *
spacy >=3.5.0
spacy-huggingface-hub >=0.0.9
spacy-transformers >=1.2.5
typer *
wasabi *

pyproject.toml pypi

spacy >=3.5.0
typer >=0.4.2
wasabi >=0.9.1

reports/software/benchmark/requirements.txt pypi

calamancy ==0.1.0
pathy *
scipy *
spacy >=3.5.0
spacy-transformers *
typer *
wasabi *
