persian-summarization

Statistical and Semantical Text Summarizer in Persian Language

https://github.com/minasmz/persian-summarization

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.4%) to scientific vocabulary

Keywords

doc2vec-model gensim nlp persian-language persian-nlp text-summarization textrank-algorithm

Repository

Statistical and Semantical Text Summarizer in Persian Language

Basic Info
  • Host: GitHub
  • Owner: minasmz
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 14.6 MB
Statistics
  • Stars: 59
  • Watchers: 2
  • Forks: 14
  • Open Issues: 1
  • Releases: 3
Created about 8 years ago · Last pushed about 3 years ago
Metadata Files
Readme Citation

README.md

For more details, you can refer to the paper at the following link. If you find this repository helpful, please cite the paper.

Persian-Summarization

Statistical and semantical text summarizer in Persian language

This is a project for text summarization in the Persian language. It uses the text summarization module of the Gensim Python library to implement the TextRank algorithm. This algorithm treats each sentence as a node in a graph and returns the nodes that are most strongly related to the other nodes (sentences). In other words, it selects the most important sentences through statistical calculation alone and takes no account of their semantics. For instance, if different words are used to express the same meaning, the algorithm will not recognize this and will treat them as unrelated.

To solve this problem and bring semantics into the result, I trained a doc2vec model with doc2vec.py in Gensim, using the Hamshahri corpus as the training set. The doc2vec model is included in the repository (mymodelsentsfromres2.doc2vec). I use this model to calculate the similarity of two sentences, and that similarity becomes the weight of the graph edge between them (instead of the tf-idf-style weighting used in Gensim). The TextRank algorithm then returns the result.
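The idea of ranking sentences over a similarity-weighted graph can be illustrated with a minimal, dependency-free sketch. It is not the repository's code: the function names `cosine` and `textrank_scores`, the damping factor, the iteration count, and the toy vectors are all assumptions made for illustration. In the real project the vectors would come from the trained doc2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def textrank_scores(vectors, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration on a weighted graph.

    `vectors` holds one embedding per sentence (hypothetical stand-ins for
    doc2vec vectors). Edge weights are cosine similarities, with negative
    values clipped to zero and no self-loops.
    """
    n = len(vectors)
    if n == 0:
        return []
    weights = [
        [max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0.0
            )
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores


# Toy example: sentences 0 and 1 are close in meaning, sentence 2 is not.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_scores = textrank_scores(toy_vectors)
```

In this toy case the sentence that is similar to both of the others receives the highest score, which is the behaviour the semantic edge weights are meant to produce.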

Some modifications were made to the Gensim library to make it compatible with the Persian language. I used the Hazm library for text normalization, sentence tokenization, and POS tagging.

Python package versions you need to install on your device

pip install six==1.11.0

pip install gensim==3.1.0

pip install numpy==1.11.3

pip install scipy==1.0.0

pip install hazm==0.5.2

How to start

Copy the summarization file from this repository and use it to replace the one in your installed Gensim library. play.py shows an example of text summarization using the call below:

summarize(text, ratio, word_count)

By default, ratio is 0.2 and word_count is None. ratio is the fraction of the input text to keep in the summary, and word_count specifies how many words you want in the resulting summary.
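To make the two parameters more concrete, here is a small sketch of how a summarizer might turn ranked sentences into a summary. The helper `select_sentences` is hypothetical and is not the repository's function. It assumes that ratio is applied to the number of sentences and that word_count, when given, takes precedence and is treated as an approximate target. The modified summarization file in this project may behave differently.

```python
def select_sentences(sentences, scores, ratio=0.2, word_count=None):
    """Pick summary sentences from scored sentences (illustrative only).

    `sentences` is a list of strings and `scores` a parallel list of
    importance scores, for example from a TextRank-style ranking.
    """
    # Rank sentence indices from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    if word_count is None:
        # Keep the top fraction of sentences.
        chosen = ranked[:int(len(sentences) * ratio)]
    else:
        chosen, length = [], 0
        for i in ranked:
            n_words = len(sentences[i].split())
            # Stop once adding the next sentence would move the total
            # further away from the requested word count.
            if abs(word_count - length - n_words) > abs(word_count - length):
                break
            chosen.append(i)
            length += n_words
    # Restore the original document order.
    return [sentences[i] for i in sorted(chosen)]


demo_sentences = ["a b c", "d e", "f g h i", "j", "k l m n o"]
demo_scores = [0.1, 0.5, 0.3, 0.05, 0.4]
by_ratio = select_sentences(demo_sentences, demo_scores, ratio=0.4)
by_words = select_sentences(demo_sentences, demo_scores, word_count=2)
```

With these placeholder inputs, `by_ratio` keeps the two highest-scoring sentences in their original order, and `by_words` keeps only the single two-word sentence.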

You can train your own doc2vec model and load it in your project instead of the file included in the repository. The same applies to the POS tagger model in the resource folder. The stopwords in the STOPWORD file were obtained from persian-stopwords.

Thanks

I developed this project at Irsapardaz Pasargad. Thanks to Mr. Amin Mozhgani for his selfless help during this project.

Contact

mina.smz2016@gmail.com

Owner

  • Name: Mina Samizadeh
  • Login: minasmz
  • Kind: user
  • Location: United States

Ph.D. Computer Science Student at University of Delaware

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Samizadeh"
  given-names: "Mina"
  orcid: "https://orcid.org/0000-0002-1082-966X"
title: "Persian-Summarization"
version: 1.0.0
doi: 10.5281/zenodo.6862128
date-released: 2017-11-28
url: "https://github.com/minasmz/Persian-Summarization"

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4