Tashaphyne

Tashaphyne: A Python package for Arabic Light Stemming - Published in JOSS (2024)

https://github.com/linuxscout/tashaphyne

Scientific Fields

Engineering Computer Science - 40% confidence

Last synced: 6 months ago

Repository

Tashaphyne: Arabic Light Stemmer

Basic Info

Host: GitHub
Owner: linuxscout
License: gpl-3.0
Language: Python
Default Branch: master
Size: 842 KB

Statistics

Stars: 100
Watchers: 8
Forks: 21
Open Issues: 0
Releases: 2

Created about 9 years ago · Last pushed over 1 year ago

Metadata Files

Readme Funding License Support

Tashaphyne


Tashaphyne: Arabic Light Stemmer

Tashaphyne is an Arabic light stemmer and segmentor. It mainly supports light stemming (removing prefixes and suffixes) and gives all possible segmentations. It uses a modified finite state automaton, which allows it to generate all segmentations.

It offers stemming and root extraction at the same time, unlike the Khoja stemmer, ISRI stemmer, Assem stemmer, and Farasa stemmer.

Tashaphyne comes with default prefix and suffix lists and also accepts customized ones, which lets it handle more morphological aspects and build customized stemmers without changing the code.

Tashaphyne is a Python library. It is available as a demo on Mishkal (choose Tools/Analysis) and as source code on GitHub.

Developers: Taha Zerrouki: http://tahadz.com taha dot zerrouki at gmail dot com

| Features | Value |
|----------|-------|
| Authors | Authors.md |
| Release | 0.3.7 |
| License | GPL |
| Tracker | linuxscout/tashaphyne/Issues |
| Website | https://pypi.python.org/pypi/Tashaphyne |
| Doc | Package Documentation |
| Source | Github |
| Download | sourceforge |
| Feedbacks | Comments |
| Accounts | @Twitter @Sourceforge |

Citation

If you cite Tashaphyne in academic work, please use one of the following citations:

T. Zerrouki, Tashaphyne, Arabic light stemmer, https://pypi.python.org/pypi/Tashaphyne/0.2
Zerrouki, T. (2024). Tashaphyne: A python package for arabic light stemming. Journal of Open Source Software, 9(93), 6063. doi: http://doi.org/10.21105/joss.06063
Alkhatib, R. M., Zerrouki, T., Shquier, M. M. A., & Balla, A. (2023). Tashaphyne0.4: A new arabic light stemmer based on rhyzome modeling approach. Information Retrieval Journal, 26(14). doi: https://doi.org/10.1007/s10791-023-09429-y
Alkhatib, R. M., Zerrouki, T., Shquier, M. M. A., Balla, A., & Al-Khateeb, A. (2021). A new enhanced arabic light stemmer for ir in medical documents. CMC-COMPUTERS MATERIALS & CONTINUA, 68(1), 1255–1269.

or in BibTeX format:

```bibtex
@misc{zerrouki2012tashaphyne,
  title={Tashaphyne, Arabic light stemmer},
  author={Zerrouki, Taha},
  url={https://pypi.python.org/pypi/Tashaphyne/0.2},
  year={2012}
}

@article{Zerrouki2024,
  title = {Tashaphyne: A Python package for Arabic Light Stemming},
  author = {Taha Zerrouki},
  year = 2024,
  journal = {Journal of Open Source Software},
  publisher = {The Open Journal},
  volume = 9,
  number = 93,
  pages = 6063,
  doi = {10.21105/joss.06063},
  url = {https://doi.org/10.21105/joss.06063}
}

@article{raed20223,
  title={Tashaphyne0.4: a new arabic light stemmer based on rhyzome modeling approach},
  author={Alkhatib, Raed M and Zerrouki, Taha and Shquier, Mohammed M Abu and Balla, Amar},
  journal={Information Retrieval Journal},
  year={2023},
  pages={},
  volume={26},
  number={14},
  doi={https://doi.org/10.1007/s10791-023-09429-y}
}

@article{raed2021,
  title={A New Enhanced Arabic Light Stemmer for IR in Medical Documents},
  author={Alkhatib, Raed M and Zerrouki, Taha and Shquier, Mohammed M Abu and Balla, Amar and Al-Khateeb, Asef},
  journal={CMC-COMPUTERS MATERIALS & CONTINUA},
  year={2021},
  pages={1255-1269},
  volume={68},
  number={1}
}
```

Features

Arabic word light stemming: reduces an Arabic word to the simplest possible stem.
Root extraction.
Word segmentation into all possible cases.
Word normalization (unifying letters that have several written forms).
Default list of Arabic affixes and of extra (augmentation) letters.
A customizable light stemmer: the stemmer and segmenter options and data can be changed by editing the affix lists.
Data-independent stemmer.

Applications

Stemming texts
Text Classification and categorization
Sentiment Analysis
Named Entities Recognition

Installation

pip install tashaphyne

Usage

Tashaphyne is a stemmer based on a finite state automaton; it strips affixes (prefixes and suffixes) taken from a predefined affix list.

It extracts all possible affixations of a word and lists every segmentation that results from them.
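To make the automaton idea concrete, here is a minimal sketch that stores a prefix list in a trie (a simple finite state automaton) and reports every position at which a listed prefix of the word ends. It is only an illustration of the general technique, not Tashaphyne's actual automaton; `build_trie`, `prefix_end_positions`, and the short prefix list are hypothetical names and data chosen for this example.

```python
def build_trie(affixes):
    """Build a trie from a list of affixes; '$end' marks an accepting state."""
    root = {}
    for affix in affixes:
        node = root
        for ch in affix:
            node = node.setdefault(ch, {})
        node['$end'] = True
    return root


def prefix_end_positions(word, trie):
    """Return every index i such that word[:i] is one of the listed prefixes."""
    positions = []
    node = trie
    if '$end' in node:          # the empty prefix is in the list
        positions.append(0)
    for i, ch in enumerate(word, start=1):
        node = node.get(ch)
        if node is None:        # no listed prefix continues with this letter
            break
        if '$end' in node:
            positions.append(i)
    return positions


# Hypothetical, shortened prefix list used only for this illustration.
trie = build_trie(['', 'ب', 'بال', 'ال', 'ل'])
print(prefix_end_positions('بالمدرستين', trie))
```

A single left-to-right pass over the word is enough to collect all candidate prefix boundaries, which is why an automaton suits the task of producing every segmentation rather than only the longest match.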

Functions

Stemming a word

The stemming function stems an Arabic word and returns a stem. It stores the stemming positions (left, right) in the stemmer instance, so other derived attributes such as the stem, prefix, suffix, and root can then be retrieved with the corresponding functions.
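The example outputs in this section are consistent with reading `left` and `right` as plain slice positions: the prefix ends at `left` and the suffix starts at `right`. The sketch below shows that reading with ordinary Python slicing. It is an assumption drawn from those outputs, not the library's code, and `split_word` is a hypothetical helper.

```python
def split_word(word, left, right):
    """Split a word into (prefix, stem, suffix) at the given positions."""
    return word[:left], word[left:right], word[right:]


word = 'أفتضاربانني'

# Positions reported by the README example: left = 3, right = 7.
prefix, stem, suffix = split_word(word, 3, 7)
print(prefix, stem, suffix)

# A specific index selects a shorter prefix or suffix of the same word.
print(word[:2])    # prefix ending at index 2
print(word[10:])   # suffix starting at index 10
```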

```python
>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = 'أفتضاربانني'
>>> # stem the word
>>> stem = ArListem.light_stem(word)
>>> # extract the stem
>>> print(ArListem.get_stem())
ضارب
>>> # extract the root
>>> print(ArListem.get_root())
ضرب
>>> # get the prefix position index
>>> print(ArListem.get_left())
3
>>> # get the prefix
>>> print(ArListem.get_prefix())
أفت
>>> # get the prefix with a specific index
>>> print(ArListem.get_prefix(2))
أف
>>> # get the suffix position index
>>> print(ArListem.get_right())
7
>>> # get the suffix
>>> print(ArListem.get_suffix())
انني
>>> # get the suffix with a specific index
>>> print(ArListem.get_suffix(10))
ي
>>> # get the affix
>>> print(ArListem.get_affix())
أفت-انني
>>> # get the affix tuple
>>> print(ArListem.get_affix_tuple())
{'prefix': 'أفت', 'root': '', 'stem': '', 'suffix': 'أفتضاربانني'}
>>> # starred word
>>> print(ArListem.get_starword())
أفتا*انني
>>> # starred stem
>>> print(ArListem.get_starstem())
ا*
>>> # get the unvocalized word
>>> print(ArListem.get_unvocalized())
أفتضاربانني
```

| Function | Description |
|----------|-------------|
| get_root() | Get the root of the word treated by the stemmer. |
| get_stem() | Get the stem of the word treated by the stemmer. The default stem is returned directly; to get a specific stem, give the prefix index and the suffix index. |
| get_left() | Get the prefix end position. |
| get_right() | Get the suffix start position. |
| get_prefix() | Get the default prefix, or the prefix at a given position. |
| get_suffix() | Get the default suffix, or the suffix at a given suffix index. |
| get_affix() | Get the default affix, or a specific one given by left and right indexes. |
| get_affix_tuple() | Get the affix tuple with its details. |
| get_starword() | Get the starred word: radical letters are replaced by "*". |
| get_starstem() | Get the starred stem: radical letters are replaced by "*". |
| get_unvocalized() | Return the unvocalized form of the word treated by the stemmer. Harakat are stripped. |
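As an illustration of what "unvocalized" means, the sketch below strips harakat with a regular expression over the Unicode range U+064B to U+0652 (fathatan through sukun). This is a stand-alone approximation under that assumption, not the code Tashaphyne uses; `strip_harakat` is a hypothetical helper name.

```python
import re

# Arabic diacritics from fathatan (U+064B) to sukun (U+0652).
HARAKAT = re.compile('[\u064B-\u0652]')


def strip_harakat(text):
    """Return the text with Arabic short-vowel marks removed."""
    return HARAKAT.sub('', text)


# A fully vocalized three-letter word (kaf, ta, ba, each with a fatha),
# written with escapes so the marks are visible in the source.
vocalized = '\u0643\u064E\u062A\u064E\u0628\u064E'
print(strip_harakat(vocalized))
```

A word that carries no harakat, such as the example word used in this section, passes through unchanged.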

Extracting all possible segmentations

Generate a list of all possible segmentation positions (left, right) of the word treated by the stemmer.

```python
>>> word = 'أفتضاربانني'
>>> # detect all possible segmentations
>>> print(ArListem.segment(word))
set([(2, 7), (3, 8), (0, 8), (2, 9), (2, 8), (3, 10), (2, 11), (1, 8), (0, 7), (2, 10), (3, 11), (1, 10), (0, 11), (3, 9), (0, 10), (1, 7), (0, 9), (3, 7), (1, 11), (1, 9)])
>>> # get all segments
>>> print(ArListem.get_segment_list())
set([(2, 7), (3, 8), (0, 8), (2, 9), (2, 8), (3, 10), (2, 11), (1, 8), (0, 7), (2, 10), (3, 11), (1, 10), (0, 11), (3, 9), (0, 10), (1, 7), (0, 9), (3, 7), (1, 11), (1, 9)])
>>> # get the affix list
>>> print(ArListem.get_affix_list())
[{'prefix': 'أف', 'root': 'ضرب', 'stem': 'تضارب', 'suffix': 'انني'}, {'prefix': 'أفت', 'root': 'ضرب', 'stem': 'ضاربا', 'suffix': 'نني'}, {'prefix': '', 'root': 'أفضرب', 'stem': 'أفتضاربا', 'suffix': 'نني'}, {'prefix': 'أف', 'root': 'ضربن', 'stem': 'تضاربان', 'suffix': 'ني'}, {'prefix': 'أف', 'root': 'ضرب', 'stem': 'تضاربا', 'suffix': 'نني'}, {'prefix': 'أفت', 'root': 'ضربنن', 'stem': 'ضاربانن', 'suffix': 'ي'}, ...]
```

* segment() / get_segment_list(): return a list of segmentation positions (left, right) of the word treated by the stemmer.
* get_affix_list(): return a list of affix tuples of the word treated by the stemmer.

Customized Affix list

You can customize the prefix and suffix lists to get better results for a given context.

The following example first uses the Tashaphyne stemmer with its default lists, then builds a second stemmer that gives different results by customizing the prefix and suffix lists.

```python
>>> import tashaphyne.stemming
>>> CUSTOM_PREFIX_LIST = ['كال', 'أفبال', 'أفك', 'فك', 'أولل', '', 'أف', 'ول', 'أوال', 'ف', 'و', 'أو', 'ولل', 'فب', 'أول', 'ألل', 'لل', 'ب', 'وكال', 'أوب', 'بال', 'أكال', 'ال', 'أب', 'وب', 'أوبال', 'أ', 'وبال', 'أك', 'فكال', 'أوك', 'فلل', 'وك', 'ك', 'أل', 'فال', 'وال', 'أوكال', 'أفلل', 'أفل', 'فل', 'أال', 'أفكال', 'ل', 'أبال', 'أفال', 'أفب', 'فبال']
>>> CUSTOM_SUFFIX_LIST = ['كما', 'ك', 'هن', 'ي', 'ها', '', 'ه', 'كم', 'كن', 'هم', 'هما', 'نا']
>>> # simple stemmer with the default affix lists
>>> simple_stemmer = tashaphyne.stemming.ArabicLightStemmer()
>>> # create a customized stemmer object for stemming enclitics and proclitics
>>> custom_stemmer = tashaphyne.stemming.ArabicLightStemmer()
>>> # configure the stemmer object
>>> custom_stemmer.set_prefix_list(CUSTOM_PREFIX_LIST)
>>> custom_stemmer.set_suffix_list(CUSTOM_SUFFIX_LIST)
>>> word = "بالمدرستين"
>>> # segment the word with each stemmer
>>> simple_stemmer.segment(word)
set([(4, 10), (4, 7), (4, 9), (4, 8), (3, 10), (0, 7), (3, 8), (1, 10), (1, 8), (3, 9), (0, 10), (1, 7), (0, 9), (3, 7), (0, 8), (1, 9)])
>>> print(simple_stemmer.get_affix_list())
[{'prefix': 'بالم', 'root': 'درستين', 'stem': 'درستين', 'suffix': ''}, {'prefix': 'بالم', 'root': 'درس', 'stem': 'درس', 'suffix': 'تين'}, {'prefix': 'بالم', 'root': 'درستي', 'stem': 'درستي', 'suffix': 'ن'}, {'prefix': 'بالم', 'root': 'درست', 'stem': 'درست', 'suffix': 'ين'}, {'prefix': 'بال', 'root': 'مدرستين', 'stem': 'مدرستين', 'suffix': ''}, {'prefix': '', 'root': 'بالمدرس', 'stem': 'بالمدرس', 'suffix': 'تين'}, ...]
>>> custom_stemmer.segment(word)
set([(1, 10), (3, 10), (0, 10)])
>>> print(custom_stemmer.get_affix_list())
[{'prefix': 'ب', 'root': 'المدرستين', 'stem': 'المدرستين', 'suffix': ''}, {'prefix': 'بال', 'root': 'مدرستين', 'stem': 'مدرستين', 'suffix': ''}, {'prefix': '', 'root': 'بالمدرستين', 'stem': 'بالمدرستين', 'suffix': ''}]
```

Calling set_prefix_list and set_suffix_list rebuilds the finite state automaton so that it takes the new affix lists into account.
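Why the customized stemmer returns only three segmentations can be seen with a naive enumeration: every listed prefix that starts the word gives a left position, every listed suffix that ends it gives a right position, and each pair is a candidate. The sketch below is a simplified stand-in written for this explanation, not Tashaphyne's algorithm. `naive_segment` and `affix_list` are hypothetical helpers, the lists are a shortened subset of the custom lists above, and the resulting dictionaries omit the root.

```python
def naive_segment(word, prefixes, suffixes):
    """Return all (left, right) pairs allowed by the given affix lists."""
    n = len(word)
    lefts = [len(p) for p in prefixes if word.startswith(p)]
    rights = [n - len(s) for s in suffixes if word.endswith(s)]
    return {(left, right) for left in lefts for right in rights if left < right}


def affix_list(word, positions):
    """Turn (left, right) positions into prefix/stem/suffix dictionaries."""
    return [
        {'prefix': word[:left], 'stem': word[left:right], 'suffix': word[right:]}
        for left, right in sorted(positions)
    ]


# Shortened subset of the custom lists, for illustration only.
prefixes = ['', 'ب', 'ال', 'بال', 'ل', 'و', 'ف', 'ك']
suffixes = ['', 'ه', 'ها', 'هم', 'ي', 'ك', 'نا']

word = 'بالمدرستين'
positions = naive_segment(word, prefixes, suffixes)
print(sorted(positions))
for entry in affix_list(word, positions):
    print(entry)
```

Because none of the listed suffixes ends this word except the empty one, the right position is always 10, and only the three matching prefixes (the empty string, 'ب', and 'بال') remain. This agrees with the three positions the customized stemmer reports above.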

Stemming a text

To stem all words in a text, tokenize it first:

```python
>>> import pyarabic.araby as araby
>>> from tashaphyne.stemming import ArabicLightStemmer
>>> stemmer = ArabicLightStemmer()
>>> text = "الأطفال يستريحون في المكتبة للمطالعة"
>>> tokens = araby.tokenize(text)
>>> tokens
['الأطفال', 'يستريحون', 'في', 'المكتبة', 'للمطالعة']
>>> for tok in tokens:
...     stem = stemmer.light_stem(tok)
...     print(tok, stem)
...
الأطفال أطفال
يستريحون يستريح
في في
المكتبة مكتب
للمطالعة مطالع
```

Package Documentation

Files

| file/directory | category | description |
|----------------|----------|-------------|
| docs/ | docs | documentation |
| pyarabic | support | basic Arabic library |
| output/ | test | test output |
| samples/ | test | sample files |
| tools/ | test | script to use tashaphyne |


Owner

Name: Taha Zerrouki (طه زروقي )
Login: linuxscout
Kind: user
Location: Bouira, Algeria
Company: Bouira University

Website: tahadz.com
Twitter: linuxscout
Repositories: 22
Profile: https://github.com/linuxscout

PhD, Computer Science Professor, Interest : Arabic Natural Language processing

JOSS Publication

Tashaphyne: A Python package for Arabic Light Stemming

Published

January 30, 2024

DOI

10.21105/joss.06063

Volume 9, Issue 93, Page 6063

Authors

Taha Zerrouki

Bouira University, Bouira, Algeria

Editor

Samuel Forbes

GitHub Events

Total

Watch event: 8
Fork event: 1

Last Year

Watch event: 8
Fork event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 61
Total Committers: 3
Avg Commits per committer: 20.333
Development Distribution Score (DDS): 0.033

Past Year

Commits: 1
Committers: 1
Avg Commits per committer: 1.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
linuxscout	t**i@h**m	59
Moutaz Al Khatib	m****z	1
et	b**7@a**u	1

Committer Domains (Top 20 + Academic)

aus.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 8
Total pull requests: 3
Average time to close issues: about 2 months
Average time to close pull requests: about 15 hours
Total issue authors: 6
Total pull request authors: 3
Average comments per issue: 2.13
Average comments per pull request: 0.67
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

SamHames (3)
hamzam0n (1)
maryam1ou (1)
ahalghamdi (1)
robertmiles3 (1)
MagedSaeed (1)

Pull Request Authors

arfon (2)
ELHoussineT (1)
muotaz (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 2,434 last-month

Total dependent packages: 2
Total dependent repositories: 5
Total versions: 9
Total maintainers: 1

pypi.org: tashaphyne

Tashaphyne Arabic Light Stemmer

Homepage: http://github.com/linuxscout/tashaphyne/
Documentation: https://tashaphyne.readthedocs.io/
License: GPL
Latest release: 0.3.6
published about 4 years ago

Versions: 9
Dependent Packages: 2
Dependent Repositories: 5
Downloads: 2,434 Last month

Rankings

Dependent packages count: 4.7%

Dependent repos count: 6.7%

Downloads: 6.7%

Average: 6.9%

Stargazers count: 7.5%

Forks count: 8.9%

Maintainers (1)

linuxscout

Last synced: 6 months ago

Tashaphyne

Science Score: 95.0%
