multilang-probe

A solution to detect languages and type characters in a multilingual setting.

https://github.com/floriancafiero/multilang-probe

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.9%) to scientific vocabulary

Keywords

ascii languages multilingual

Last synced: 6 months ago

Repository

A solution to detect languages and type characters in a multilingual setting.

Basic Info

Host: GitHub
Owner: floriancafiero
Language: Python
Default Branch: main
Homepage: https://pypi.org/project/multilang-probe/
Size: 76.2 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Topics

ascii languages multilingual

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme Citation

multilang-probe

A Python package for analyzing multilingual text.

Overview

multilang-probe is a toolkit for classifying character sets, detecting languages in text files, and extracting specific multilingual passages. It identifies characters from a wide range of writing systems (e.g., Latin, Japanese, Cyrillic, Arabic, Devanagari) using Unicode script properties, and it uses a pre-trained FastText model for language detection.

Whether you are analyzing large corpora or extracting specific language data, multilang-probe simplifies the process with an easy-to-use API.

Features

Character Set Classification:

Detect and calculate proportions of character types (e.g., Latin, Japanese, Cyrillic, Arabic, Devanagari) in text.
Uses regex with Unicode script properties (\p{Script}) for more accurate classification.
Special handling for Japanese vs Chinese characters (Han script).

Example: Character Detection

```python
from multilang_probe.character_detection import classify_text_with_proportions

# Sample text with multiple languages/scripts
text = "これは日本語です。Привет мир! Ελληνικά και हिन्दी।"

# Classify the text
proportions = classify_text_with_proportions(text)

# Print the proportions
print("Character script proportions:")
print(proportions)
```

Expected outcome:

```plaintext
Character script proportions:
{'japanese': 19.51, 'cyrillic': 21.95, 'greek': 26.83, 'devanagari': 14.63, 'other': 17.07}
```

Explanation:
- If the text contains Hiragana/Katakana, Han characters are considered Japanese Kanji.
- Otherwise, Han characters are considered Chinese.
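
The Han rule above can be illustrated with a small standalone sketch. This is not the package's implementation: it swaps the `\p{Script}` regex approach for a lookup on Unicode character names from the standard library's `unicodedata` module, and the names `NAME_PREFIXES`, `script_of`, and `script_proportions` are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode character-name prefixes to script labels.
# This approximates script detection; it is not the Unicode Script property.
NAME_PREFIXES = {
    "HIRAGANA": "japanese",
    "KATAKANA": "japanese",
    "CJK UNIFIED IDEOGRAPH": "han",
    "HANGUL": "korean",
    "CYRILLIC": "cyrillic",
    "GREEK": "greek",
    "LATIN": "latin",
    "ARABIC": "arabic",
    "HEBREW": "hebrew",
    "DEVANAGARI": "devanagari",
}


def script_of(char):
    """Return a script label for one character, or 'other' if none matches."""
    name = unicodedata.name(char, "")
    for prefix, label in NAME_PREFIXES.items():
        if name.startswith(prefix):
            return label
    return "other"


def script_proportions(text):
    """Return the percentage of characters per script, rounded to 2 decimals."""
    if not text:
        return {}
    counts = Counter(script_of(c) for c in text)
    # Han rule: count Han characters as Japanese if any kana is present,
    # otherwise as Chinese.
    han = counts.pop("han", 0)
    if han:
        target = "japanese" if counts.get("japanese") else "chinese"
        counts[target] += han
    total = len(text)
    return {label: round(100 * n / total, 2) for label, n in counts.items()}
```

Because this sketch matches on character names rather than the Script property, its numbers can differ slightly from the package's output. For instance, the Devanagari danda (।) is named DEVANAGARI DANDA but carries the Script property Common.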

Language Detection:

Identify top languages in text using Facebook's FastText pre-trained model.

Example: Language Detection

```python
from charlang_detect.language_detection import detect_language_fasttext

text = "Ceci est un texte en français."
languages = detect_language_fasttext(text)
print(languages)

# Output example: "fr: 99.2%, en: 0.8%"
```
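
For readers who want to see what such a call might do underneath, here is a minimal sketch built directly on the `fasttext` Python bindings. It assumes the pre-trained language identification model `lid.176.bin` has already been downloaded to the working directory. The helper names `format_predictions` and `detect_top_languages` are hypothetical and not part of multilang-probe.

```python
def format_predictions(labels, probabilities):
    """Turn fastText labels and probabilities into 'fr: 99.2%, en: 0.8%'."""
    parts = []
    for label, prob in zip(labels, probabilities):
        code = label.replace("__label__", "")  # fastText prefixes every label
        parts.append(f"{code}: {prob * 100:.1f}%")
    return ", ".join(parts)


def detect_top_languages(text, model_path="lid.176.bin", k=2):
    """Return the k most likely languages for text as a formatted string.

    Assumes the model file at model_path exists locally.
    """
    import fasttext  # third-party; imported lazily so the formatter works without it

    model = fasttext.load_model(model_path)
    # predict() handles one line at a time, so newlines are replaced first.
    labels, probabilities = model.predict(text.replace("\n", " "), k=k)
    return format_predictions(labels, probabilities)
```

In real use the model would be loaded once and reused across calls; it is loaded inside the function here only to keep the sketch short.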

Corpus Analysis:

Analyze all .txt files in a folder to detect multilingual passages and language distributions.
Character-based filtering: Identify and filter text lines containing specific character sets (e.g., Japanese, Cyrillic, Arabic).
Language-based filtering: Extract passages in a specific language, with customizable confidence thresholds (e.g., 70%).
Targeted extraction: Extract lines of text meeting both minimum length requirements and language detection accuracy.
Calculate language proportions: Aggregate detected languages across files and calculate their proportions.

Example: Analyze and Detect Multilingual Passages

```python
from charlang_detect.corpus_analysis import analyze_corpus_with_fasttext

folder_path = "path/to/corpus/"
results = analyze_corpus_with_fasttext(folder_path)
for filename, langs in results.items():
    print(filename, langs)
```
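
The "calculate language proportions" step listed above can be sketched independently of the package. In this sketch, `language_proportions` is a hypothetical name, and `detect` stands for any callable that maps a line of text to a language code (for example, a wrapper around a FastText model). Counting one vote per non-empty line is an assumption, not necessarily how the package aggregates.

```python
from collections import Counter
from pathlib import Path


def language_proportions(folder_path, detect):
    """Aggregate detected languages over all .txt files in folder_path.

    detect is a hypothetical callable: line of text -> language code.
    Returns percentages per language, rounded to 2 decimals.
    """
    counts = Counter()
    for path in sorted(Path(folder_path).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():  # skip blank lines
                counts[detect(line)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}
```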

Example: Filter Passages by Character Types
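
One way to approach character-based filtering, sketched with the standard library only. The function `lines_with_script` and its parameters are hypothetical and not part of the multilang-probe API. It matches on Unicode character names, which only approximates the script-property matching the package describes.

```python
import unicodedata


def lines_with_script(lines, name_prefix, min_chars=1):
    """Keep lines containing at least min_chars characters whose Unicode
    name starts with name_prefix (e.g. "CYRILLIC", "ARABIC", "HIRAGANA")."""
    kept = []
    for line in lines:
        hits = sum(
            1 for c in line if unicodedata.name(c, "").startswith(name_prefix)
        )
        if hits >= min_chars:
            kept.append(line)
    return kept
```

For example, `lines_with_script(lines, "CYRILLIC", min_chars=2)` keeps only the lines that contain at least two Cyrillic characters.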

Example: Extract Passages by Language with Threshold

```python
from charlang_detect.corpus_analysis import filter_passages_by_language

# `results` is the output of analyze_corpus_with_fasttext from the earlier example
folder_path = "path/to/corpus/"
target_languages = ["fr", "en"]
threshold = 70
filtered = filter_passages_by_language(results, target_languages, folder_path, threshold)
for filename, passages in filtered.items():
    print(filename, passages)
```
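
The combination of a confidence threshold and a minimum line length described under Corpus Analysis can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `filter_lines` is a hypothetical name, `detect` is any callable returning a `(language_code, confidence_percent)` pair, and the default `min_length=20` is an arbitrary choice.

```python
def filter_lines(lines, detect, target_languages, threshold=70.0, min_length=20):
    """Keep lines that are long enough and confidently in a target language.

    detect is a hypothetical callable: text -> (language_code, confidence_percent).
    Returns a list of (line, language_code, confidence_percent) tuples.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        if len(stripped) < min_length:
            continue  # too short for a reliable detection
        lang, confidence = detect(stripped)
        if lang in target_languages and confidence >= threshold:
            kept.append((stripped, lang, confidence))
    return kept
```

Checking the length before calling the detector avoids running the model on fragments that would be discarded anyway.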

Supported Character Sets

Japanese (Hiragana, Katakana)
Han (Kanji; considered Japanese if Hiragana/Katakana present, else Chinese)
Korean (Hangul)
Cyrillic (for languages like Russian, Bulgarian, etc.)
Arabic
Hebrew
Greek
Latin (basic and extended)
Devanagari (e.g., Hindi, Sanskrit)
Tamil, Bengali, Thai
Extendable via Unicode scripts
"other" category for characters not belonging to known scripts

Dependencies

Python 3.7+
FastText
Regex (for Unicode script classification)

License

This project is licensed under the MIT License. While the MIT License allows unrestricted use, modification, and distribution of this software, I kindly request that proper credit be given when this project is used in academic, research, or published work. For citation purposes, please refer to the following:

CAFIERO Florian, 'multilang-probe', 2024, [https://github.com/floriancafiero/multilang-probe].

Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Author

Florian Cafiero
GitHub: floriancafiero
Email: florian.cafiero@chartes.psl.eu

Future Features

Support for other pre-trained language models (e.g., spaCy).
Detection of mathematical language.
Visualization tools for multilingual analysis.
CLI (Command-Line Interface) for easy usage without writing code.

Owner

Name: Florian Cafiero
Login: floriancafiero
Kind: user
Location: Paris
Company: CNRS

Website: https://www.lerobert.com/autour-des-mots/francais/affaires-de-style-du-cas-moliere-a-l-affaire-gregory-la-stylometrie-mene-l-enquete-9782321017349.html
Twitter: F_Cafiero
Repositories: 1
Profile: https://github.com/floriancafiero

Statistician, CNRS / Paris-Sorbonne; visiting scholar, Columbia University; lecturer, Ecole nationale des chartes / PSL.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this package, please cite the following information:"
authors:
  - family-names: "Cafiero"
    given-names: "Florian"
    affiliation: "Paris Sciences et Lettres (PSL)"
title: "Multilang-Probe: a Python tool for multilingual analysis"
version: "0.1.3"
# doi: "10.1234/xyz.123456" 
date-released: "2024-12-12"  
repository-code: "https://github.com/floriancafiero/multilang-probe"
license: "MIT"

GitHub Events

Total

Release event: 2
Watch event: 1
Push event: 34
Create event: 5

Last Year

Release event: 2
Watch event: 1
Push event: 34
Create event: 5

Committers

Last synced: about 1 year ago

All Time

Total Commits: 41
Total Committers: 1
Avg Commits per committer: 41.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 41
Committers: 1
Avg Commits per committer: 41.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Florian Cafiero	f**o@p**u	41

Committer Domains (Top 20 + Academic)

polytechnique.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Packages

Total packages: 1
Total downloads:
- pypi 34 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 4
Total maintainers: 1

pypi.org: multilang-probe

A Python package for analyzing multilingual text.

Documentation: https://multilang-probe.readthedocs.io/
License: MIT
Latest release: 0.1.7
published about 1 year ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 34 Last month

Rankings

Dependent packages count: 9.9%

Average: 32.8%

Dependent repos count: 55.7%

Maintainers (1)

FlorianCafiero

Last synced: 6 months ago
