tlaf
A comprehensive tool for linguistic analysis of communities
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 1 DOI reference in README
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.7%) to scientific vocabulary
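As an illustration only (the scoring tool's actual implementation and weighting are not documented here), a minimal sketch of how the file-based indicators above might be checked:
```
# Hypothetical sketch; check_metadata_files is illustrative, not the scorer's real code.
from pathlib import Path

INDICATOR_FILES = ["CITATION.cff", "codemeta.json", ".zenodo.json"]

def check_metadata_files(repo_root: str) -> dict:
    """Report which science-metadata files exist in a checked-out repository."""
    root = Path(repo_root)
    return {name: (root / name).exists() for name in INDICATOR_FILES}

print(check_metadata_files("."))  # e.g. {'CITATION.cff': False, 'codemeta.json': False, ...}
```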
Keywords
Repository
A comprehensive tool for linguistic analysis of communities
Basic Info
- Host: GitHub
- Owner: tusharsarkar3
- License: MIT
- Language: Python
- Default Branch: master
- Homepage: https://pypi.org/project/TLAF/
- Size: 6.56 MB
Statistics
- Stars: 49
- Watchers: 1
- Forks: 9
- Open Issues: 1
- Releases: 0
Topics
Metadata Files
README.md
TLA - Twitter Linguistic Analysis
Tool for linguistic analysis of communities
TLA is built with PyTorch, Transformers, and several other state-of-the-art machine learning techniques. It aims to expedite and structure the cumbersome process of collecting, labeling, and analyzing Twitter data across a corpus of languages, while providing detailed labeled datasets for each language. The analysis TLA provides can also help in understanding the sentiments of different linguistic communities and in devising new solutions to their problems based on that analysis.
The languages supported by the library are listed below:

| Language | Code | Language | Code |
| --- | --- | --- | --- |
| English | en | Hindi | hi |
| Swedish | sv | Thai | th |
| Dutch | nl | Japanese | ja |
| Turkish | tr | Urdu | ur |
| Indonesian | id | Portuguese | pt |
| French | fr | Chinese | zn-ch |
| Spanish | es | Persian | fa |
| Romanian | ro | Russian | ru |
Features
- Provides 16 labeled datasets in different languages for analysis.
- Implements a BERT-based architecture to identify languages.
- Provides functionality to extract, process, and label tweets from Twitter.
- Provides a Random Forest classifier for sentiment analysis on any string.
Installation
```
pip install --upgrade git+https://github.com/tusharsarkar3/TLA.git
```
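Since releases are also published on PyPI under the name TLAF (see the package metadata below), installing the released package via `pip install TLAF` should presumably work as well.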
Overview
Extract data
```
from TLA.Data.get_data import store_data

# 'en' selects the language; False presumably skips pre-processing (cf. the --process flag below).
store_data('en', False)
```
This will extract the unlabeled data and store it in a new directory named datasets inside Data.
Label data
```
from TLA.Datasets.get_lang_data import language_data

df = language_data('en')  # load the bundled labeled dataset for English
print(df)
```
This will print the labeled data that has already been collected.
Classify languages
Training
Training can be done as follows:
```
from TLA.Lang_Classify.train import train_lang

train_lang(path_to_dataset, epochs)
```
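For instance, a hypothetical call using one of the bundled dataset paths shown later in this README (the path and epoch count are illustrative, not defaults):
```
from TLA.Lang_Classify.train import train_lang

# Illustrative arguments: dataset path from the Loading Dataset section, 4 epochs as in the CLI example.
train_lang("TLA/TLA/Datasets/get_data_en.csv", 4)
```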
Prediction
Inference is done as follows:
```
from TLA.Lang_Classify.predict import predict, get_model  # get_model assumed to be exported here

model = get_model(path_to_weights)
preds = predict(dataframe_to_be_used, model)
```
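As a concrete (hypothetical) example, assuming a single-column dataframe of tweets; the column name and weights filename are assumptions, not documented defaults:
```
import pandas as pd
from TLA.Lang_Classify.predict import predict, get_model  # get_model assumed to be exported here

df = pd.DataFrame({"tweet": ["Bonjour tout le monde"]})  # column name is an assumption
model = get_model("saved_wieghts_full.pt")               # weights file produced by train.py (name per README)
print(predict(df, model))                                # expected: the predicted language
```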
Analyse
Training
Training can be done as follows:
```
from TLA.Analyse.train_rf import train_rf

train_rf(path_to_dataset)
```
This will store all the vectorizers and models in separate directories named saved_rf and saved_vec inside the Analysis directory. Further instructions for training multiple languages are given in the next section, which shows how to run the commands from the CLI.
Final Analysis
Analysis is done as follows:
```
from TLA.Analysis.analyse import analyse_data

analyse_data(path_to_weights)
```
This will store the final analysis as .csv files inside a new directory named analysis.
Overview with Git
Installation (alternative method)
```
git clone https://github.com/tusharsarkar3/TLA.git
```
Extract data
Navigate to the required directory:
```
cd Data
```
Run the following command:
```
python get_data.py --lang en --process True
```
The --lang flag specifies the language of the dataset to fetch, and the --process flag controls whether pre-processing is applied before the data is returned. Use the language codes from the table above with the --lang flag.
Loading Dataset
To load a dataset, run the following in Python:
```
import pandas as pd

df = pd.read_csv("TLA/TLA/Datasets/get_data_en.csv")
```
This returns a dataframe with the data for the requested language. In the name get_data_en, en can be substituted with the desired language code to load the dataframe for that language.
Pre-Processing
To preprocess a given string, first run:
```
cd Data
```
then run the following in Python:
```
from TLA.Data import Pre_Process_Tweets

df = Pre_Process_Tweets.pre_process_tweet(df)
```
The function pre_process_tweet takes a dataframe of tweets as input and returns a dataframe with each tweet's list of preprocessed words next to the tweet itself.
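For example, with a toy dataframe (the column name "tweet" is an assumption about the expected schema, not a documented default):
```
import pandas as pd
from TLA.Data import Pre_Process_Tweets

# Toy input; the column name is an assumption, check the schema produced by get_data.py.
df = pd.DataFrame({"tweet": ["Loving the new release!! #TLA https://t.co/xyz"]})
df = Pre_Process_Tweets.pre_process_tweet(df)
print(df)  # expected: each tweet alongside its list of preprocessed words
```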
Analysis
Training
To train a Random Forest classifier for sentiment analysis, run the following in your terminal:
```
cd Analysis
```
then:
```
python train_rf.py --path "path to your datafile" --train_all_datasets False
```
The --path flag is the path to the dataset you want to train the Random Forest classifier on; the --train_all_datasets flag is a boolean that trains the model on multiple datasets at once. The trained model is saved as a .pkl file at "TLA\Analysis\saved_rf\{}.pkl", and the fitted vectorizer is saved as a .pkl file at "TLA\Analysis\saved_vec\{}.pkl".
Get Sentiment
To get the sentiment of any string, type the following in your terminal:
```
cd Analysis
```
then:
```
python get_sentiment.py --prediction "Your string for prediction to be made upon" --lang "en"
```
The --prediction flag takes the string whose sentiment you want, and the --lang flag takes the language code of the language the string is written in. The output is a sentiment, either positive or negative, depending on your string.
Statistics
To get comprehensive statistics on dataset sentiment, type the following in your terminal:
```
cd Analysis
```
then:
```
python analyse.py
```
This produces a table1.csv file at 'TLA\Analysis\analysis\table1.csv' with statistics on the percentage of positive and negative tweets for each language dataset, and a table2.csv file at 'TLA\Analysis\analysis\table2.csv' with statistics for all languages combined.
Language Classification
Training
To train a model for language classification on a given dataset, run the following in your terminal:
```
cd Lang_Classify
```
then:
```
python train.py --data "path for your dataset" --model "path to weights if pretrained" --epochs 4
```
The --data flag takes the path to your training dataset, the --model flag takes the path to the weights you want to start from, and the --epochs flag sets the number of epochs to train for. The output is a .pt file named saved_wieghts_full.pt containing the trained weights.
Prediction
To make a prediction on any given string, type the following in your terminal:
```
cd Lang_Classify
```
then:
```
python predict.py --predict "Text/DataFrame whose language is to be predicted" --weights "path to the stored weights of your model"
```
The --predict flag takes the string whose language you want to identify, and the --weights flag takes the path to the stored weights to load the model from. The output is the language the string was typed in.
Results

Figure: Performance of TLA (loss vs. epochs)
| Language | Total Tweets | Positive Tweets (%) | Negative Tweets (%) |
| --- | --- | --- | --- |
|English | 500 | 66.8 | 33.2 |
|Spanish | 500 | 61.4 | 38.6 |
|Persian | 50 | 52 | 48 |
|French | 500 | 53 | 47 |
|Hindi | 500 | 62 | 38 |
|Indonesian | 500 | 63.4 | 36.6|
|Japanese | 500 | 85.6 | 14.4 |
|Dutch | 500 | 84.2 | 15.8 |
|Portuguese| 500 | 61.2 | 38.8|
|Romanian| 457 | 85.55 | 14.44|
|Russian| 213 | 62.91 | 37.08 |
|Swedish| 420 | 80.23 | 19.76 |
|Thai| 424 | 71.46 | 28.53 |
|Turkish| 500 | 67.8 | 32.2 |
|Urdu| 42 | 69.04 | 30.95 |
|Chinese| 500 | 80.6 | 19.4 |
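For reference, combined figures in the spirit of what analyse.py writes to table2.csv can be approximated from the per-language rows above; a minimal pandas sketch (column names are mine, and only a subset of rows is shown):
```
import pandas as pd

# Subset of the results table above.
df = pd.DataFrame({
    "language": ["English", "Spanish", "Persian"],
    "total_tweets": [500, 500, 50],
    "positive_pct": [66.8, 61.4, 52.0],
})

# Overall positive share, weighted by the number of tweets per language.
overall = (df["positive_pct"] * df["total_tweets"]).sum() / df["total_tweets"].sum()
print(f"{overall:.2f}% positive overall")
```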
Reference
```
@misc{sarkar2021tla,
  title={TLA: Twitter Linguistic Analysis},
  author={Tushar Sarkar and Nishant Rajadhyaksha},
  year={2021},
  eprint={2107.09710},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{640cba8b-35cb-475e-ab04-62d079b74d13,
  title={TLA: Twitter Linguistic Analysis},
  author={Tushar Sarkar and Nishant Rajadhyaksha},
  journal={Software Impacts},
  doi={10.24433/CO.6464530.v1},
  howpublished={\url{https://www.codeocean.com/}},
  year={2021},
  month={6},
  version={v1}
}
```
Features to be added
- Access to more languages
- Creating a GUI-based system for better accessibility
- Improving performance of the baseline model
Developed by Tushar Sarkar and Nishant Rajadhyaksha
Owner
- Name: Tushar Sarkar
- Login: tusharsarkar3
- Kind: user
- Repositories: 3
- Profile: https://github.com/tusharsarkar3
I love solving problems with data
GitHub Events
Total
Last Year
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 41
- Total Committers: 3
- Avg Commits per committer: 13.667
- Development Distribution Score (DDS): 0.293
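For reference, the Development Distribution Score is commonly defined as one minus the share of commits made by the busiest committer; assuming that definition, the commit counts in the table below reproduce the 0.293 figure:
```
# DDS = 1 - (top committer's commits / total commits); assumed definition.
commits = [29, 10, 2]                  # per-committer commits from the table below
dds = 1 - max(commits) / sum(commits)
print(round(dds, 3))                   # 0.293
```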
Top Committers
| Name | Email | Commits |
|---|---|---|
| Nishant Rajadhyaksha | 7****1@u****m | 29 |
| tusharsarkar3 | t****r@s****u | 10 |
| Tushar Sarkar | 5****3@u****m | 2 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bruel-gabrielsson (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 10 last-month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 5
- Total maintainers: 1
pypi.org: tlaf
TLA is built using PyTorch, Transformers and several other State-of-the-Art machine learning techniques and it aims to expedite and structure the cumbersome process of collecting, labeling, and analyzing data from Twitter for a corpus of languages while providing detailed labeled datasets for all the languages.
- Homepage: https://github.com/tusharsarkar3/TLA
- Documentation: https://tlaf.readthedocs.io/
- License: MIT
- Latest release: 0.1.2 (published over 4 years ago)
Rankings
Maintainers (1)
Dependencies
- Pillow ==8.3.1
- PySocks ==1.7.1
- PyYAML ==5.4.1
- beautifulsoup4 ==4.9.3
- certifi ==2021.5.30
- charset-normalizer ==2.0.3
- click ==8.0.1
- colorama ==0.4.4
- cycler ==0.10.0
- filelock ==3.0.12
- huggingface-hub ==0.0.12
- idna ==3.2
- joblib ==1.0.1
- kiwisolver ==1.3.1
- lxml ==4.6.3
- matplotlib ==3.4.2
- nltk ==3.6.2
- numpy ==1.21.1
- packaging ==21.0
- pandas ==1.3.0
- pyparsing ==2.4.7
- python-dateutil ==2.8.2
- pytz ==2021.1
- regex ==2021.7.6
- requests ==2.26.0
- sacremoses ==0.0.45
- scikit-learn ==0.24.2
- scipy ==1.7.0
- six ==1.16.0
- sklearn ==0.0
- snscrape ==0.3.4
- soupsieve ==2.2.1
- threadpoolctl ==2.2.0
- tokenizers ==0.10.3
- torch ==1.9.0
- tqdm ==4.61.2
- transformers ==4.8.2
- typing-extensions ==3.10.0.0
- urllib3 ==1.26.6