dutch-word-embeddings

Dutch word embeddings, trained on a large collection of Dutch social media messages and news/blog/forum posts.

https://github.com/coosto/dutch-word-embeddings

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary

Keywords

coosto dutch nlp word2vec word2vec-model wordembeddings
Last synced: 6 months ago · JSON representation ·

Repository

Dutch word embeddings, trained on a large collection of Dutch social media messages and news/blog/forum posts.

Basic Info
  • Host: GitHub
  • Owner: coosto
  • License: other
  • Language: Python
  • Default Branch: master
  • Size: 13.7 KB
Statistics
  • Stars: 44
  • Watchers: 5
  • Forks: 3
  • Open Issues: 1
  • Releases: 1
Topics
coosto dutch nlp word2vec word2vec-model wordembeddings
Created over 7 years ago · Last pushed almost 4 years ago
Metadata Files
Readme License Citation

README.md

Dutch Word2Vec Model

This repository contains a Word2Vec model trained on a large Dutch corpus, comprised of social media messages and posts from Dutch news, blog and fora. Finding pre-trained Dutch models online can often be quite difficult, especially since most online models are trained on neatly written texts like Wikipedia or newspaper archives. When working with noisy text sources these models usually underperform due to the large number of out-of-vocabulary words used on social platforms and their short-message writing style. By training on a combination of both large and short texts from multiple online sources we've tried to create a model more suited for these types of texts.

We are sharing this model to help research using Dutch data sources, so feel free to use it for your projects! If you would like a more up-to-date model, or a model with specific preprocessing steps, we'd be happy to help! Please contact the current maintainer (@Alexander Nieuwenhuijse) if you'd like to use this model for commercial products.

Installation

The model can be downloaded using the provided utils script sh $ git clone https://github.com/coosto/dutch-word-embeddings.git $ cd dutch-word2vec-model $ python3 utils.py download Or directly as an asset from release page.

Usage

To run a demo (using gensim) for the downloaded model run following command. It will first output some example analogies and afterward present an interactive prompt to query for nearest neighbour terms. sh $ python3 utils.py demo --model model.bin Loading model... Model loaded ... `

Examples

If we query for "tomaat" (Dutch for tomato) we get a lot of Dutch vegetables:

Enter word or sentence: tomaat

Term | Distance

paprika |0.8452869653701782 (Bell pepper) komkommer |0.7932491898536682 (Cucumber) courgette |0.771128237247467 (Zucchini) spinazie |0.7697550058364868 (Spinach) aubergine |0.7646535634994507 (Eggplant) rucola |0.7631270885467529 (Arugula) avocado |0.7610437273979187 (Avocado) radijs |0.7554484605789185 (Radish) tomaatjes |0.7549760341644287 (Tomatoes) tomaten |0.7525067925453186 (Tomatoes)

Another interesting case is "lidl" (A supermarket chain), which returns a list of other supermarkets:

Enter word or sentence: lidl

Term | Distance

aldi |0.9053274989128113 jumbo |0.7790680527687073 albert_heijn |0.7582876086235046 supermarkt |0.7522222995758057

lidl |0.7320866584777832

alberthein |0.7161234617233276 albertheyn |0.7076783180236816 ekoplaza |0.6897568702697754 vomar |0.6830324530601501 nettorama |0.6739382147789001

This example shows some common typographical errors for a supermarket chain called Albert Heijn. This type of errors would not be in the model is it was trained only on neatly written text, like the Dutch Wikipedia data or newspaper articles, but is included in this model because these errors are made a lot on social media.


Data selection & Modeling

The model was trained using ~600 million individual messages, comprised of Dutch social media messages (624 million messages) and Dutch news, blog and fora posts (36 million messages). All messages were published between 01/01/2017 and 31/12/2017. To improve the quality of the model some basic preprocessing was applied to all the messages, described below.

Splitting sentences

Every individual message was split into separate sentences by searching for punctuation marks, which are only considered to be an end-of-line character if it is not used in the following exceptions: * (Roman) Numerals * Single letters * Abbriviations

Clean up

Given that social media messages are usually not neatly formatted some cleaning of the text is applied: 1. Converting to lowercase 2. Removing HTML/XML tags 3. Replacing URLs with the <url> token 4. Replacing @-mentions with the <mention> token 5. Removing punctuation marks, emojis and unwanted unicode characters 7. Removing sentences with less than 5 tokens.

The URLs and @-mentions are removed for both privacy and performance reasons. The last step is applied to remove very small text messages, since they usually do not provide enough relevant context to learn from.

Deduplication

Lastly the entire set of cleaned up sentences is removed from any duplicate training examples and shuffeled into a random order. This results in a training set of 490 million unique preprocessed sentences.

Training

The model was trained using Google's Word2Vec implementation. We've selected the Continuous Bag-of-Words (CBOW) model and generate vectors of size 300. The min-count parameter was chosen based on manual inspection of the vocabulary and to limit the size of the model. sh word2vec -train input.txt -output model.bin -size 300 -window 10 -negative 10 -hs 0 -cbow 1 -sample 1e-5 -iter 5 -min-count 300 The resulting model contains 250479 vectors and was not pruned or altered in any way.

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License .

Please contact the current maintainer (@Alexander Nieuwenhuijse) if you wish to use this model with a different license.

Owner

  • Name: Coosto
  • Login: coosto
  • Kind: organization
  • Location: Eindhoven

Citation (CITATION.cff)

cff-version: 1.2.0
title: "Coosto - Dutch Word Embeddings"
message: "If you use this dataset, please cite it using the metadata from this file."
type: dataset
date-released: 2018-05-05
url: "https://github.com/coosto/dutch-word-embeddings"
authors:
  - given-names: Alexander
    family-names: Nieuwenhuijse
    email: alexander.nieuwenhuijse@coosto.com
    affiliation: Coosto

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1