Embeddings.jl
Embeddings.jl: easy access to pretrained word embeddings from Julia - Published in JOSS (2019)
Published in Journal of Open Source Software
Functions and data dependencies for loading various word embeddings (Word2Vec, FastText, GLoVE)
Basic Info
Statistics
- Stars: 81
- Watchers: 7
- Forks: 19
- Open Issues: 11
- Releases: 7
Metadata Files
README.md
Embeddings
Introduction
Word embeddings represent words as high-dimensional vectors, where every dimension corresponds to some latent feature [1]. This makes it possible to apply mathematical operations to words and so discover semantic relationships between them. For example, when using Word2Vec embeddings and cosine similarity between vectors, the calculation vector(“Madrid”) - vector(“Spain”) + vector(“France”) yields a vector whose nearest neighbour is the vector for the word “Paris” [2].
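As a rough illustration of this analogy arithmetic, the sketch below uses a tiny hand-made table instead of real Word2Vec vectors. The 3-dimensional numbers and the helper names `cosine_similarity`, `vec_of` and `nearest_word` are invented for this example and are not part of Embeddings.jl. With real pretrained embeddings the same steps would apply to the columns of the loaded matrix.

```julia
using LinearAlgebra  # provides dot and norm

# Toy data (made-up numbers): one column per word, in the same order as toy_vocab.
toy_vocab = ["Madrid", "Spain", "Paris", "France", "banana"]
toy_embeddings = Float32[
    1.0  0.0  1.0  0.0  0.1
    1.0  1.0  0.0  0.0  0.5
    0.0  0.0  1.0  1.0  0.1
]

cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

word2col = Dict(word => i for (i, word) in enumerate(toy_vocab))
vec_of(word) = toy_embeddings[:, word2col[word]]

# Return the vocabulary word whose vector is most similar to `query`,
# ignoring any words listed in `exclude`.
function nearest_word(query; exclude = String[])
    best_word, best_score = "", -Inf
    for (i, word) in enumerate(toy_vocab)
        word in exclude && continue
        score = cosine_similarity(query, toy_embeddings[:, i])
        if score > best_score
            best_word, best_score = word, score
        end
    end
    return best_word
end

query = vec_of("Madrid") - vec_of("Spain") + vec_of("France")
answer = nearest_word(query; exclude = ["Madrid", "Spain", "France"])
println(answer)  # prints "Paris" for this toy table
```

Excluding the query words from the search is a common convention for analogy tasks, since one of the inputs is otherwise often the nearest neighbour.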
Pretrained word embeddings are commonly used to initialize the bottom layer of a more advanced NLP model, such as an LSTM [3].
Simply summing the embeddings in a sentence or phrase can in and of itself be a surprisingly powerful way to represent the sentence/phrase, and the result can be used as an input to simple ML models such as an SVM [4].
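A minimal sketch of the summing idea, again on a made-up two-dimensional table rather than real pretrained vectors. The names `toy_vocab`, `toy_embeddings` and `sum_embedding` are hypothetical, and skipping unknown words is just one possible policy.

```julia
# Toy data (made-up numbers): one column per word.
toy_vocab = ["the", "sky", "is", "blue"]
toy_embeddings = Float32[
    0.1  0.7  0.0  0.2
    0.0  0.1  0.3  0.9
]
word2col = Dict(word => i for (i, word) in enumerate(toy_vocab))

# Sum the embeddings of all known words; words missing from the vocabulary are skipped.
function sum_embedding(words)
    total = zeros(Float32, size(toy_embeddings, 1))
    for word in words
        col = get(word2col, word, nothing)
        col === nothing && continue
        total .+= view(toy_embeddings, :, col)
    end
    return total
end

# "today" is not in the toy vocabulary, so it contributes nothing.
sentence_vector = sum_embedding(["the", "sky", "is", "blue", "today"])
```

The resulting vector has the same length as a single word embedding, so sentences of any length map to fixed-size inputs for a downstream model.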
This package gives access to pretrained embeddings. In its current state it includes the following word embeddings: Word2Vec (English), GloVe (English), and FastText (hundreds of languages).
Installation
The package can be installed using the Julia package manager in the normal way.
Open the REPL, press ] to enter package mode, and then:
pkg> add Embeddings
There are no further steps.
Pretrained embeddings will be downloaded the first time you use them.
Details
load_embeddings
load_embeddings(EmbeddingSystem, [embedding_file|default_file_number])
load_embeddings(EmbeddingSystem{:lang}, [embedding_file|default_file_number])
Loads the embeddings from an embedding file. The file should be in the format given by the embedding system.
If the embedding file is not provided, a default embedding file will be used.
(It will be downloaded automatically if required.)
EmbeddingSystems have a language type parameter, for example FastText_Text{:fr} or Word2Vec{:en}. If that language parameter is not given, it defaults to English.
(I am sorry for the poor state of the NLP field that many embedding formats are only available pretrained in English.)
Using this, the correct default embedding file will be installed for that language.
For some languages and embedding systems there are multiple possible files.
You can list them using, for example, language_files(FastText_Text{:de}).
The first is nominally the most popular, but if you want to use another you can do so by passing its position in that list as default_file_number (the second argument).
load_embeddings returns an EmbeddingTable object, which has two fields:
- embeddings is a matrix; each column is the embedding for a word.
- vocab is a vector of strings, ordered as per the columns of embeddings, such that the first string in vocab corresponds to the first column of embeddings, and so on.
We do not include a method for getting the index of a column from a word.
This is trivial to define in code (vocab2ind(vocab)=Dict(word=>ii for (ii,word) in enumerate(vocab))),
and you might prefer to do this in a more consistent way, e.g. using MLLabelUtils.jl,
or to build a much faster Dict solution on top of InternedStrings.jl.
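The one-liner above can be tried out without downloading anything. In this sketch a NamedTuple with made-up values stands in for a loaded EmbeddingTable. It only mimics the two field names described above, and `toy_table` is a hypothetical name.

```julia
# Stand-in for a loaded EmbeddingTable: same two field names, made-up values.
toy_table = (
    embeddings = Float32[0.1 0.4 0.7; 0.2 0.5 0.8; 0.3 0.6 0.9],
    vocab = ["red", "green", "blue"],
)

# Map each word to the index of its column in the embedding matrix.
vocab2ind(vocab) = Dict(word => ii for (ii, word) in enumerate(vocab))

ind = vocab2ind(toy_table.vocab)

# Column 2 of the matrix is the embedding for "green".
green_vec = toy_table.embeddings[:, ind["green"]]
```

Building the Dict once and reusing it avoids a linear scan over vocab on every lookup.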
Configuration
This package is built on top of DataDeps.jl. To configure, e.g., where downloaded files are saved to and read from (and to understand how that works), see the DataDeps.jl readme.
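As a hedged sketch of what such configuration can look like: to the best of my understanding, DataDeps.jl reads the environment variables `DATADEPS_ALWAYS_ACCEPT` and `DATADEPS_LOAD_PATH`. The directory name below is a placeholder chosen for this example. Treat the DataDeps.jl readme as the authoritative description of these settings.

```julia
# Set these before the first download is triggered.

# Assumption: this skips the interactive confirmation prompt before a download,
# which is mainly useful in scripts and CI.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

# Assumption: paths listed here are prepended to the locations DataDeps.jl
# saves to and loads from. The directory is a placeholder, not a recommendation.
ENV["DATADEPS_LOAD_PATH"] = joinpath(homedir(), "embedding_data")
```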
Examples
Load the package with
julia> using Embeddings
Basic example
The following script loads the embeddings,
defines a Dict that maps each vocabulary word to its column index in the embedding matrix,
and defines a function that uses it to get an embedding vector.
This is a basic way to access the embedding for a word.
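```julia
using Embeddings

# Load the default Word2Vec table (downloaded on first use).
const embtable = load_embeddings(Word2Vec) # or load_embeddings(FastText_Text) or ...

# Map each vocabulary word to the index of its column in the embedding matrix.
const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))

function get_embedding(word)
    ind = get_word_index[word]
    emb = embtable.embeddings[:,ind]
    return emb
end
```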
This can be used like so:
julia> get_embedding("blue")
300-element Array{Float32,1}:
0.01540828
0.03409082
0.0882124
0.04680265
-0.03409082
...
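Because get_embedding indexes the Dict directly, a word that is not in the vocabulary raises a KeyError. The variant below is one possible way to handle that, shown on made-up stand-in data so that it runs without a download. `try_get_embedding` and the `toy_` names are hypothetical and not part of the package.

```julia
# Made-up stand-ins for embtable.vocab and embtable.embeddings.
toy_vocab = ["red", "green", "blue"]
toy_embeddings = Float32[0.1 0.4 0.7; 0.2 0.5 0.8]
toy_word_index = Dict(word => ii for (ii, word) in enumerate(toy_vocab))

# Hypothetical helper: returns `nothing` for unknown words instead of throwing.
function try_get_embedding(word)
    ind = get(toy_word_index, word, nothing)
    ind === nothing && return nothing
    return toy_embeddings[:, ind]
end

known = try_get_embedding("blue")      # the third column of the toy matrix
unknown = try_get_embedding("mauve")   # nothing, since "mauve" is not in toy_vocab
```

Whether to return `nothing`, a zero vector, or a dedicated unknown-word vector depends on the downstream model.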
Loading different Embeddings
Load up the default Word2Vec embeddings:
julia> load_embeddings(Word2Vec)
Embeddings.EmbeddingTable{Array{Float32,2},Array{String,1}}(Float32[0.0673199 0.0529562 … -0.21143 0.0136373; -0.0534466 0.0654598 … -0.0087888 -0.0742876; … ; -0.00733469 0.0108946 … -0.00405157 0.0156112; -0.00514565 -0.0470722 … -0.0341579 0.0396559], String["</s>", "in", "for", "that", "is", "on", "##", "The", "with", "said" … "#-###-PA-PARKS", "Lackmeyer", "PERVEZ", "KUNDI", "Budhadeb", "Nautsch", "Antuane", "tricorne", "VISIONPAD", "RAFFAELE"])
Load up the first 100 embeddings from the default French FastText embeddings:
julia> load_embeddings(FastText_Text{:fr}; max_vocab_size=100)
Embeddings.EmbeddingTable{Array{Float32,2},Array{String,1}}(Float32[0.0058 -0.0842 … -0.062 -0.0687; 0.0478 -0.0388 … 0.0064 -0.339; … ; 0.023 -0.0106 … -0.022 -0.1581; 0.0378 0.0579 … 0.0417 0.0714], String[",", "de", ".", "</s>", "la", "et", ":", "à", "le", "\"" … "faire", "c'", "aussi", ">", "leur", "%", "si", "entre", "qu", "€"])
List all the default files for FastText in English:
julia> language_files(FastText_Text{:en}) # List all the possible default files for FastText in English
3-element Array{String,1}:
"FastText Common Crawl/crawl-300d-2M.vec"
"FastText Wiki News/wiki-news-300d-1M.vec"
"FastText en Wiki Text/wiki.en.vec"
From the second of those default files, load the embeddings just for "red", "green", and "blue":
julia> load_embeddings(FastText_Text{:en}, 2; keep_words=Set(["red", "green", "blue"]))
Embeddings.EmbeddingTable{Array{Float32,2},Array{String,1}}(Float32[-0.0054 0.0404 -0.0293; 0.0406 0.0621 0.0224; … ; 0.218 0.1542 0.2256; 0.1315 0.1528 0.1051], String["red", "green", "blue"])
List all the default files for GloVe in English:
julia> language_files(GloVe{:en})
10-element Array{String,1}:
"glove.6B/glove.6B.50d.txt"
"glove.6B/glove.6B.100d.txt"
"glove.6B/glove.6B.200d.txt"
"glove.6B/glove.6B.300d.txt"
"glove.42B.300d/glove.42B.300d.txt"
"glove.840B.300d/glove.840B.300d.txt"
"glove.twitter.27B/glove.twitter.27B.25d.txt"
"glove.twitter.27B/glove.twitter.27B.50d.txt"
"glove.twitter.27B/glove.twitter.27B.100d.txt"
"glove.twitter.27B/glove.twitter.27B.200d.txt"
Load the 200d GloVe embedding matrix for the top 10000 words trained on 6B words:
```julia
julia> glove = load_embeddings(GloVe{:en}, 3, max_vocab_size=10000)
Embeddings.EmbeddingTable{Array{Float32,2},Array{String,1}}(Float32[-0.071549 0.17651 … 0.19765 -0.22419; 0.093459 0.29208 … -0.31357 0.039311; … ; 0.030591 -0.23189 … -0.72917 0.49645; 0.25577 -0.10814 … 0.07403 0.41581], ["the", ",", ".", "of", "to", "and", "in", "a", "\"", "'s" … "slashed", "23-year", "communique", "hawk", "necessity", "petty", "stretching", "taxpayer", "resistant", "quinn"])

julia> size(glove)
(200, 10000)
```
Contributing
Contributions in the form of bug reports, pull requests, and additional documentation are encouraged. They can be made to the GitHub repository.
All contributions and communications should abide by the Julia Community Standards.
The following software contributions would particularly be appreciated:
- Provide Hashstrings: I have only filled in the checksums for the FastText embeddings that I have downloaded, which is only a small fraction. If you are using embedding files for a language that doesn't have its hashstring set, then DataDeps.jl will tell you the hashstring that needs to be added to the file. It is a quick and easy PR.
- Provide Implementations of other loaders: If you have implementations of code to load another format (e.g. Binary FastText) it would be great if you could contribute them. I know I have a few others kicking around somewhere.
Software contributions should follow the prevailing style within the code-base. If your pull requests (or issues) are not getting responses within a few days, do not hesitate to "bump" them by posting a comment such as "Any update on the status of this?". Sometimes GitHub notifications get lost.
Support
Feel free to ask for help on the Julia Discourse forum,
or in the #natural-language channel on julia-slack. (Which you can join here).
You can also raise issues in this repository to request improvements to the documentation.
Sources
Owner
- Name: JuliaText
- Login: JuliaText
- Kind: organization
- Repositories: 12
- Profile: https://github.com/JuliaText
JuliaLang Organisation for Natural Language Processing, (textual) Information Retrieval, and Computational Linguistics
JOSS Publication
Embeddings.jl: easy access to pretrained word embeddings from Julia
Tags
julialang opendata NLP word embeddings machine learning
GitHub Events
Total
- Watch event: 1
- Issue comment event: 1
Last Year
- Watch event: 1
- Issue comment event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Lyndon White | o****x@u****u | 52 |
| Tejas Vaidhya | 3****v | 13 |
| David Ellison | d****n@g****m | 7 |
| theogf | t****u@g****m | 3 |
| Ayushk4 | a****4@g****m | 3 |
| femtocleaner[bot] | f****] | 1 |
| Thiago Galery | t****y@g****m | 1 |
| Logan Kilpatrick | 2****3@g****m | 1 |
| Leevi Rantala | l****a@g****m | 1 |
| Julia TagBot | 5****t | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 23
- Total pull requests: 25
- Average time to close issues: 19 days
- Average time to close pull requests: 12 days
- Total issue authors: 14
- Total pull request authors: 11
- Average comments per issue: 3.13
- Average comments per pull request: 2.44
- Merged pull requests: 22
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- oxinabox (7)
- aviks (2)
- SeniorCtrlPlayer (2)
- robertfeldt (2)
- RasmusEdvardsen (1)
- Vetii (1)
- ssfrr (1)
- MariaHei (1)
- deno2 (1)
- Jakobhenningjensen (1)
- SebastianCallh (1)
- dimka11 (1)
- tgalery (1)
Pull Request Authors
- oxinabox (12)
- dellison (4)
- Ayushk4 (2)
- tejasvaidhyadev (1)
- femtocleaner[bot] (1)
- Lrantala (1)
- JuliaTagBot (1)
- theogf (1)
- zgornel (1)
- logankilpatrick (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 3
- Total downloads: julia 9 total
- Total dependent packages: 3 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 22
proxy.golang.org: github.com/juliatext/embeddings.jl
- Documentation: https://pkg.go.dev/github.com/juliatext/embeddings.jl#section-documentation
- License: mit
- Latest release: v0.4.2 (published about 6 years ago)
Rankings
proxy.golang.org: github.com/JuliaText/Embeddings.jl
- Documentation: https://pkg.go.dev/github.com/JuliaText/Embeddings.jl#section-documentation
- License: mit
- Latest release: v0.4.2 (published about 6 years ago)
Rankings
juliahub.com: Embeddings
Functions and data dependencies for loading various word embeddings (Word2Vec, FastText, GLoVE)
- Documentation: https://docs.juliahub.com/General/Embeddings/stable/
- License: MIT
- Latest release: 0.4.6 (published almost 2 years ago)
Rankings
Dependencies
- AutoHashEquals *
- DataDeps 0.5.1
- GoogleDrive *
- julia 0.7
- JuliaRegistries/TagBot v1 composite
