https://github.com/camel-lab/barec_analyzer


Repository

Basic Info

Host: GitHub
Owner: CAMeL-Lab
License: MIT
Language: Python
Default Branch: main
Size: 24.4 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 12 months ago


BAREC Analyzer

This repository contains scripts for preprocessing, training, and evaluating the models in our paper A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment. The BAREC corpus is available on Hugging Face.

Repository Structure

scripts/preprocess.py: Processes raw texts into our tokenized input variants (Word, D3Tok, Lex, and D3Lex). You DO NOT need to run this script to process the BAREC corpus as we already provide these input variants for the full corpus.
scripts/train.py: Script for fine-tuning pre-trained models using the BAREC corpus. The script supports different loss functions and input variants. It also generates results and saves trained models.
scripts/collect_results.py: Aggregates evaluation results from multiple trained models and exports them as CSV files for further analysis.

Setup

Install Dependencies

To run scripts/preprocess.py, you need to install CAMeL Tools and get the CAMeLBERT MSA morphosyntactic tagger from camel_data:

```sh
git clone https://github.com/CAMeL-Lab/camel_tools.git
cd camel_tools

conda create -n cameltools python=3.9
conda activate cameltools

pip install -e .
camel_data -i disambig-bert-unfactored-msa
```

To run scripts/train.py and scripts/collect_results.py:

```sh
git clone https://github.com/CAMeL-Lab/barec_analyzer.git
cd barec_analyzer

conda create -n barec python=3.9
conda activate barec

pip install -r requirements.txt
```

Usage

Preprocessing

Preprocess raw text into one of the input variants. You DO NOT need this script to train on the BAREC corpus, because we already provide these input variants for the full corpus.

```sh
python scripts/preprocess.py \
    --input <INPUT_TXT_PATH> \
    --input_var <INPUT_VARIANT> \
    --db <MORPHOLOGY_DATABASE> \
    --output <OUTPUT_TXT_PATH>
```

--input: Path to input text file containing raw text data
--input_var: Input variant (Word, D3Tok, Lex, or D3Lex)
--db (optional): Path to morphological database to use for processing
--output: Path to output file to save processed text data
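The flags above can be pictured as a small `argparse` interface. The following is only a sketch of how such a command line could be declared, not the actual contents of scripts/preprocess.py: the function name `build_parser`, the help strings, the choice to mark three flags as required, and the file names in the usage lines are all assumptions made for illustration.

```python
import argparse

# Variant names are taken from the README; everything else is illustrative.
INPUT_VARIANTS = ("Word", "D3Tok", "Lex", "D3Lex")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the documented preprocess.py flags."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented preprocessing CLI"
    )
    # Assumption: --input, --input_var and --output are mandatory.
    parser.add_argument("--input", required=True,
                        help="Path to input text file containing raw text data")
    parser.add_argument("--input_var", required=True, choices=INPUT_VARIANTS,
                        help="Input variant to produce")
    # --db is documented as optional; None here stands for the default analyzer.
    parser.add_argument("--db", default=None,
                        help="Path to morphological database to use")
    parser.add_argument("--output", required=True,
                        help="Path to output file for the processed text")
    return parser


# Usage: parse a sample argument list instead of sys.argv.
# "raw.txt" and "d3tok.txt" are placeholder file names.
args = build_parser().parse_args(
    ["--input", "raw.txt", "--input_var", "D3Tok", "--output", "d3tok.txt"]
)
print(args.input_var, args.db)
```

Restricting `--input_var` with `choices` means a misspelled variant name is rejected before any processing starts.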

Important Note: The default morphological analyzer used in the preprocessing script is not the same as the one in the paper, which is licensed by LDC. To download the same morphological analyzer, you need to:

1. Obtain the morphological analyzer from LDC (LDC2010L01).
2. Download the muddled version of the analyzer from here.
3. Install Muddler, a tool for sharing derived data, and use it to unmuddle the encrypted file.

   ```sh
   pip install muddler
   muddler unmuddle -s /PATH/TO/LDC2010L01.tgz -m /PATH/TO/analyzer-msa.muddle /PATH/TO/almor-s31.db.utf8
   ```
4. To use this analyzer in scripts/preprocess.py, pass it as a parameter (--db "/PATH/TO/almor-s31.db.utf8").

Training a Model

Run the training script on the BAREC corpus with configurable parameters:

```sh
python scripts/train.py \
    --loss <LOSS_TYPE> \
    --model <MODEL_CHECKPOINT> \
    --input_var <INPUT_TYPE> \
    --save_dir <MODEL_SAVE_BASE_DIR> \
    --output_path <OUTPUT_XLSX_DIR>
```

--loss: Loss function (e.g., CE, EMD, OLL1, etc.)
--model: Model checkpoint (e.g., HuggingFace model name or path)
--input_var: Input variant (Word, D3Tok, Lex, or D3Lex)
--save_dir: Base directory for saving trained model folders
--output_path: Directory to save output XLSX files
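When comparing several loss functions and input variants, it can help to generate the training commands programmatically. The sketch below only builds and prints command lines from the flags documented above; it does not run anything. The helper names `build_train_command` and `build_grid`, the directory names `models` and `results`, and the restriction to the three loss names mentioned above are assumptions for illustration, and `<MODEL_CHECKPOINT>` is left as a placeholder.

```python
import itertools
import shlex

# Loss names and input variants are the examples listed in this README.
LOSSES = ["CE", "EMD", "OLL1"]
INPUT_VARIANTS = ["Word", "D3Tok", "Lex", "D3Lex"]


def build_train_command(loss, model, input_var, save_dir, output_path):
    """Return the argument list for one scripts/train.py invocation."""
    return [
        "python", "scripts/train.py",
        "--loss", loss,
        "--model", model,
        "--input_var", input_var,
        "--save_dir", save_dir,
        "--output_path", output_path,
    ]


def build_grid(model, save_dir, output_path):
    """Build one command per (loss, input variant) combination."""
    return [
        build_train_command(loss, model, variant, save_dir, output_path)
        for loss, variant in itertools.product(LOSSES, INPUT_VARIANTS)
    ]


# "models" and "results" are hypothetical directory names.
commands = build_grid("<MODEL_CHECKPOINT>", "models", "results")
for cmd in commands:
    # shlex.join quotes arguments so each line can be pasted into a shell.
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once the placeholder is replaced with a real checkpoint name or path.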

Collecting Results

After training multiple models, aggregate their results:

```sh
python scripts/collect_results.py \
    --models_path <MODELS_DIR> \
    --output_path <RESULTS_CSV_DIR>
```

--models_path: Directory containing all trained model folders
--output_path: Directory to save the aggregated CSV files

Citation

```
@inproceedings{elmadani-etal-2025-readability,
    title = "A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment",
    author = "Elmadani, Khalid N. and Habash, Nizar and Taha-Thomure, Hanada",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics"
}
```

License

See the LICENSE file for license information.

Owner

Name: CAMeL Lab
Login: CAMeL-Lab
Kind: organization
Location: Abu Dhabi, UAE

Website: http://camel-lab.com
Repositories: 22
Profile: https://github.com/CAMeL-Lab

The Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi

GitHub Events

Total

Watch event: 1
Push event: 1
Public event: 1

Last Year

Watch event: 1
Push event: 1
Public event: 1

Dependencies

requirements.txt pypi

accelerate ==1.0.0
datasets ==3.0.1
fastjsonschema ==2.20.0
jsonschema ==4.23.0
jsonschema-specifications ==2024.10.1
numpy ==1.26.4
openpyxl ==3.1.5
pandas ==2.2.3
scikit-learn ==1.5.2
torch ==2.4.1
transformers ==4.43.4
