https://github.com/argilla-io/get_started_with_deep_learning_for_text_with_allennlp
Getting started with AllenNLP and PyTorch by training a tweet classifier
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity: 13.5%)
Keywords
Repository
Getting started with AllenNLP and PyTorch by training a tweet classifier
Basic Info
Statistics
- Stars: 66
- Watchers: 5
- Forks: 17
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Introduction
This repository contains code and experiments using PyTorch, AllenNLP, and spaCy. It is intended as a learning resource for getting started with these libraries and with deep learning for NLP.
In particular, it contains:
- Custom modules for defining a SequenceClassifier and its Predictor.
- A basic custom DataReader for reading CSV files.
- An experiments folder containing several experiment JSON files to show how to define a baseline and refine it with more sophisticated approaches.
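Stripped of AllenNLP specifics, the core of such a CSV DataReader is just turning each row into a (tokens, label) pair. A minimal plain-Python sketch of that idea (the column names and the whitespace tokenizer are illustrative assumptions, not the repository's actual reader):

```python
import csv
import io

def read_instances(csv_file):
    """Yield (tokens, label) pairs from a CSV with 'text' and 'label' columns.

    In an AllenNLP DatasetReader, this logic would live in _read(), with
    each pair wrapped into an Instance via text_to_instance().
    """
    for row in csv.DictReader(csv_file):
        tokens = row["text"].split()  # a real reader would use a spaCy tokenizer
        yield tokens, row["label"]

# Tiny in-memory example
sample = io.StringIO("text,label\nhola mundo,politics\nbuenos dias,sports\n")
instances = list(read_instances(sample))
```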
The overall goal is to classify Spanish-language tweets from the COSET challenge dataset, a collection of tweets about a recent Spanish election. The challenge's winning approach is described in the following paper: http://ceur-ws.org/Vol-1881/COSETpaper7.pdf.
Setup
Use a virtual environment, for example with Conda (the PyTorch wheel below targets Python 3.6):

```shell
conda create -n allennlp_spacy python=3.6
source activate allennlp_spacy
```
Install PyTorch for your platform (the wheel below is for macOS and Python 3.6; for other platforms, pick the matching wheel from http://pytorch.org):

```shell
pip install http://download.pytorch.org/whl/torch-0.2.0.post3-cp36-cp36m-macosx_10_7_x86_64.whl
```
Install the spaCy Spanish model:

```shell
python -m spacy download es
```
Install AllenNLP and the other dependencies:

```shell
pip install -r requirements.txt
```
Install the custom module so the AllenNLP commands can find the custom models:

```shell
python setup.py develop
```
Install TensorBoard:

```shell
pip install tensorboard
```
Download and prepare the pre-trained word vectors from the fastText project:

```shell
./download_prepare_fasttext.sh
```
Goals
- Understand the basic components of AllenNLP and PyTorch.
- Understand how to configure AllenNLP to use spaCy models in different languages, in this case the Spanish model.
- Understand how to create custom models with AllenNLP and plug them into its command line.
- Design and compare several experiments on a simple tweet classification task in Spanish: start by defining a simple baseline and progressively move to more complex models.
- Use TensorBoard to monitor the experiments.
- Compare your results with the existing literature (i.e., the results of the COSET tweet classification challenge).
- Learn how to prepare and use external pre-trained word embeddings, in this case fastText's Wikipedia-based word vectors.
Exercises
Inspecting Seq2VecEncoders and understanding the basic building blocks of AllenNLP:
Check the basic structure of these modules in AllenNLP.
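Conceptually, a Seq2VecEncoder collapses a variable-length sequence of token vectors into a single fixed-size vector; the bag-of-embeddings encoder used by the baseline below does this by averaging. A dependency-free sketch of that idea (toy 2-dimensional embeddings, illustrative only):

```python
def bag_of_embeddings(token_vectors):
    """Average equal-length token vectors into one fixed-size vector,
    mimicking what an averaging BagOfEmbeddingsEncoder does."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three 2-d "token embeddings" standing in for a tokenized tweet
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = bag_of_embeddings(sentence)
```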
Defining and running our baseline:
In the folder experiments/definitions/ you will find the definition of our baseline, which uses a BagOfEmbeddingsEncoder.
Run the experiment with:

```shell
python -m recognai.run train experiments/definitions/baseline_boe_classifier.json -s experiments/output/baseline
```
Monitor your experiments using TensorBoard:
You can monitor your experiments by running TensorBoard and pointing it at the experiments output folder:

```shell
tensorboard --logdir=experiments/output
```
Defining and running a CNN classifier:
In the folder experiments/definitions/ you can find the definition of a CNN classifier. As you can see, we only need to configure a new encoder that uses a CNN.
Run the experiment with:

```shell
python -m recognai.run train experiments/definitions/cnn_classifier.json -s experiments/output/cnn
```
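The relevant change in the experiment JSON is the encoder section: swap the bag-of-embeddings encoder for a CNN one. Schematically, it might look like this (field names follow AllenNLP-style conventions; the values are illustrative, not the repository's actual settings):

```json
"encoder": {
  "type": "cnn",
  "embedding_dim": 300,
  "num_filters": 100,
  "ngram_filter_sizes": [2, 3, 4, 5]
}
```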
Using pre-trained word embeddings:
Facebook's fastText team has made pre-trained word embeddings available for 294 languages (see https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). Using the download_prepare_fasttext.sh script, you can download the Spanish vectors and use them as pre-trained weights in either of the models.
To use pre-trained embeddings with fixed weights, run:

```shell
python -m recognai.run train experiments/definitions/cnn_classifier_fasttext_embeddings_fixed.json -s experiments/output/cnn_embeddings_fixed
```
Or use pre-trained embeddings and let the network fine-tune their weights:

```shell
python -m recognai.run train experiments/definitions/cnn_classifier_fasttext_embeddings_tunable.json -s experiments/output/cnn_embeddings_tuned
```
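In the experiment JSON, pre-trained vectors are typically wired in through the token embedder configuration. A schematic fragment (AllenNLP-style field names; the path and dimensions are illustrative assumptions):

```json
"text_field_embedder": {
  "tokens": {
    "type": "embedding",
    "embedding_dim": 300,
    "pretrained_file": "data/wiki.es.vec",
    "trainable": false
  }
}
```

Under this scheme, the _fixed and _tunable experiment definitions would differ essentially in the trainable flag.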
Extra:
Check https://github.com/recognai/custommodelsallennlp/tree/master/experiments/tweet-classification-spanish and run an RNN classifier. How do the results compare? Tip: initialization is key when training LSTMs.
The network quickly overfits; what strategies would you try to address this?
Owner
- Name: Argilla
- Login: argilla-io
- Kind: organization
- Email: contact@argilla.io
- Website: https://argilla.io
- Twitter: argilla_io
- Repositories: 12
- Profile: https://github.com/argilla-io
Building the open-source tool for data-centric NLP
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- scipy *
- sklearn *