https://github.com/adbar/py3langid

Faster, modernized fork of the language identification tool langid.py

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 11 committers (9.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

detect-language langid language-detection language-identification language-recognition nlp whatlang

Keywords from Contributors

transformation datetime lemmatization tokenization
Last synced: 6 months ago

Repository

Faster, modernized fork of the language identification tool langid.py

Basic Info
Statistics
  • Stars: 56
  • Watchers: 2
  • Forks: 9
  • Open Issues: 2
  • Releases: 0
Fork of saffsd/langid.py
Topics
detect-language langid language-detection language-identification language-recognition nlp whatlang
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.rst

=============
``py3langid``
=============


``py3langid`` is a fork of the standalone language identification tool ``langid.py`` by Marco Lui.

Original license: BSD-2-Clause. Fork license: BSD-3-Clause.



Changes in this fork
--------------------

Execution speed has been improved and the code base has been optimized for Python 3.6+:

- Import: Loading the package (``import py3langid``) is about 30% faster
- Startup: Loading the default classification model is 25-30x faster
- Execution: Language detection with ``langid.classify`` is 5-6x faster on paragraphs (less on longer texts)

For implementation details, see the blog post *How to make language detection with langid.py faster*.

For more information and older Python versions, see the changelog.


Usage
-----

Drop-in replacement
~~~~~~~~~~~~~~~~~~~


1. Install the package:

   * ``pip3 install py3langid`` (or ``pip`` where applicable)

2. Use it:

   * with Python: ``import py3langid as langid``
   * on the command-line: ``langid``


With Python
~~~~~~~~~~~

Basics:

.. code-block:: python

    >>> import py3langid as langid
    
    >>> text = 'This text is in English.'
    # identified language and probability
    >>> langid.classify(text)
    ('en', -56.77429)
    # unpack the result tuple in variables
    >>> lang, prob = langid.classify(text)
    # all potential languages
    >>> langid.rank(text)


More options:

.. code-block:: python

    >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE

    # subset of target languages
    >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE)
    >>> identifier.set_languages(['de', 'en', 'fr'])
    # this won't work well...
    >>> identifier.classify('这样不好')
    ('en', -81.831665)

    # normalization of probabilities to an interval between 0 and 1
    >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
    >>> identifier.classify('This should be enough text.')
    ('en', 1.0)


Note: the NumPy data type of the feature vector has been changed to optimize for speed. If results are inconsistent, try restoring the original setting:

.. code-block:: python

    >>> langid.classify(text, datatype='uint32')


On the command-line
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # basic usage with probability normalization
    $ echo "This should be enough text." | langid -n
    ('en', 1.0)

    # define a subset of target languages
    $ echo "This won't be recognized properly." | langid -n -l fr,it,tr
    ('it', 0.97038305)


Legacy documentation
--------------------


**The documentation below is provided for reference; only part of the functionality is currently tested and maintained.**


Introduction
------------

``langid.py`` is a standalone Language Identification (LangID) tool.

The design principles are as follows:

1. Fast
2. Pre-trained over a large number of languages (currently 97)
3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
4. Single .py file with minimal dependencies
5. Deployable as a web service

All that is required to run ``langid.py`` is Python >= 3.6 and NumPy.

The accompanying training tools are still Python2-only.

``langid.py`` is WSGI-compliant. It will use ``fapws3`` as a web server if
available, and default to ``wsgiref.simple_server`` otherwise.

``langid.py`` comes pre-trained on 97 languages (ISO 639-1 codes given):

    af, am, an, ar, as, az, be, bg, bn, br, 
    bs, ca, cs, cy, da, de, dz, el, en, eo, 
    es, et, eu, fa, fi, fo, fr, ga, gl, gu, 
    he, hi, hr, ht, hu, hy, id, is, it, ja, 
    jv, ka, kk, km, kn, ko, ku, ky, la, lb, 
    lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, 
    nb, ne, nl, nn, no, oc, or, pa, pl, ps, 
    pt, qu, ro, ru, rw, se, si, sk, sl, sq, 
    sr, sv, sw, ta, te, th, tl, tr, ug, uk, 
    ur, vi, vo, wa, xh, zh, zu

The training data was drawn from 5 different sources:

* JRC-Acquis 
* ClueWeb 09
* Wikipedia
* Reuters RCV2
* Debian i18n


Usage
-----

    langid [options]

optional arguments:
  -h, --help            show this help message and exit
  -s, --serve           launch web service
  --host=HOST           host/ip to bind to
  --port=PORT           port to listen on
  -v                    increase verbosity (repeat for greater effect)
  -m MODEL              load model from file
  -l LANGS, --langs=LANGS
                        comma-separated set of target ISO639 language codes
                        (e.g en,de)
  -r, --remote          auto-detect IP address for remote access
  -b, --batch           specify a list of files on the command line
  -d, --dist            show full distribution over languages
  -u URL, --url=URL     identify the language of the document at URL
  --line                process pipes line-by-line rather than as a document
  -n, --normalize       normalize confidence scores to probability values


The simplest way to use ``langid.py`` is as a command-line tool; invoke it with
``python langid.py``. If you installed ``langid.py`` as a Python module
(e.g. via ``pip install langid``), you can invoke ``langid`` instead of
``python langid.py`` (the two are equivalent). This will display a prompt.
Enter text to identify, and hit enter::

  >>> This is a test
  ('en', -54.41310358047485)
  >>> Questa e una prova
  ('it', -35.41771221160889)


``langid.py`` can also detect when its input is redirected (only tested under Linux), in which
case it will process until EOF rather than until newline as in interactive mode::

  python langid.py < README.rst 
  ('en', -22552.496054649353)


The value returned is the unnormalized probability estimate for the language. Calculating 
the exact probability estimate is disabled by default, but can be enabled through a flag::

  python langid.py -n < README.rst 
  ('en', 1.0)

More details are provided in this README in the section on `Probability Normalization`.

You can also use ``langid.py`` as a Python library::

  # python
  Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
  [GCC 4.6.1] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import langid
  >>> langid.classify("This is a test")
  ('en', -54.41310358047485)
  
Finally, ``langid.py`` can use Python's built-in ``wsgiref.simple_server`` (or ``fapws3`` if available) to
provide language identification as a web service. To do this, launch ``python langid.py -s``, and
access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed
with no data, a simple HTML forms interface is displayed.

The response is generated in JSON; here is an example::

  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

A utility such as curl can be used to access the web service::

  # curl -d "q=This is a test" localhost:9008/detect
  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

You can also use HTTP PUT::

  # curl -T readme.rst localhost:9008/detect
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
  {"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no "q=XXX" key-value pair is present in the HTTP POST payload, ``langid.py`` will interpret the entire
file as a single query. This allows for redirection via curl::

  # echo "This is a test" | curl -d @- localhost:9008/detect
  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
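The same query can be issued programmatically. Here is a minimal sketch using only the standard library; it assumes a server launched with ``python langid.py -s`` on the default port 9008, and the network call is guarded since no server may be running:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Build a POST request carrying the text in the "q" field,
# mirroring the curl examples above.
data = urlencode({"q": "This is a test"}).encode()
req = Request("http://localhost:9008/detect", data=data)

# Sending the request requires a running langid.py server;
# the response body is the JSON document shown above.
if __name__ == "__main__":
    try:
        with urlopen(req, timeout=5) as resp:
            print(resp.read().decode())
    except OSError:
        print("no langid.py server reachable on localhost:9008")
```

Passing ``data`` to ``Request`` makes it a POST, matching the ``curl -d`` invocation.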

``langid.py`` will attempt to discover the host IP address automatically. Often, this resolves
to localhost (127.0.1.1) even though the machine has a different external IP address. To have
``langid.py`` attempt to discover the external IP address instead, start it with the ``-r`` flag.

``langid.py`` supports constraining of the output language set using the ``-l`` flag and a comma-separated list of ISO639-1 
language codes (the ``-n`` flag enables probability normalization)::

  # python langid.py -n -l it,fr
  >>> Io non parlo italiano
  ('it', 0.99999999988965627)
  >>> Je ne parle pas français
  ('fr', 1.0)
  >>> I don't speak english
  ('it', 0.92210605672341062)

When using ``langid.py`` as a library, the ``set_languages`` method can be used to constrain the language set::

  # python
  Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
  [GCC 4.6.1] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import langid
  >>> langid.classify("I do not speak english")
  ('en', 0.57133487679900674)
  >>> langid.set_languages(['de','fr','it'])
  >>> langid.classify("I do not speak english")
  ('it', 0.99999835791478453)
  >>> langid.set_languages(['en','it'])
  >>> langid.classify("I do not speak english")
  ('en', 0.99176190378750373)


Batch Mode
----------

``langid.py`` supports batch mode processing, which can be invoked with the ``-b`` flag.
In this mode, ``langid.py`` reads a list of paths to files to classify as arguments.
If no arguments are supplied, ``langid.py`` reads the list of paths from ``stdin``;
this is useful for combining ``langid.py`` with UNIX utilities such as ``find``.

In batch mode, ``langid.py`` uses ``multiprocessing`` to invoke multiple instances of
the classifier, utilizing all available CPUs to classify documents in parallel. 


Probability Normalization
-------------------------

The probabilistic model implemented by ``langid.py`` involves the multiplication of a
large number of probabilities. For computational reasons, the actual calculations are
implemented in the log-probability space (a common numerical technique for dealing with
vanishingly small probabilities). One side-effect of this is that it is not necessary to
compute a full probability in order to determine the most probable language in a set
of candidate languages. However, users sometimes find it helpful to have a "confidence"
score for the probability prediction. Thus, ``langid.py`` implements a re-normalization
that produces an output in the 0-1 range.
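The renormalization can be illustrated with a small log-sum-exp computation. This is a sketch of the general technique for turning per-language log-probabilities into values in the 0-1 range, not the library's exact code; the scores below are hypothetical:

```python
import math

def norm_probs(log_scores):
    # Shift by the maximum before exponentiating (the log-sum-exp
    # trick) so that vanishingly small probabilities do not underflow.
    m = max(log_scores)
    total = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - total) for s in log_scores]

# Hypothetical log-probabilities for three candidate languages.
scores = [-54.4, -61.2, -70.3]
probs = norm_probs(scores)
# The normalized values sum to 1 and the best-scoring language dominates.
```

Note that the argmax is unchanged by normalization, which is why the most probable language can be determined without this step.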

``langid.py`` disables probability normalization by default. For
command-line usage of ``langid.py``, it can be enabled by passing the ``-n`` flag. For
probability normalization in library use, the user must instantiate their own 
``LanguageIdentifier``. An example of such usage is as follows::
  
  >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
  >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
  >>> identifier.classify("This is a test")
  ('en', 0.9999999909903544)


Training a model
----------------

So far Python 2.7 only; see the original instructions.


Read more
---------

``langid.py`` is based on published research. [1] describes the LD feature selection technique in detail,
and [2] provides more detail about the module ``langid.py`` itself.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, 
In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), 
Chiang Mai, Thailand, pp. 553–561. Available from http://www.aclweb.org/anthology/I11-1062

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, 
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 
Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005

Owner

  • Name: Adrien Barbaresi
  • Login: adbar
  • Kind: user
  • Location: Berlin
  • Company: Berlin-Brg. Academy of Sciences (BBAW)

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

GitHub Events

Total
  • Issues event: 3
  • Watch event: 9
  • Delete event: 1
  • Issue comment event: 2
  • Push event: 2
  • Pull request event: 2
  • Fork event: 1
  • Create event: 1
Last Year
  • Issues event: 3
  • Watch event: 9
  • Delete event: 1
  • Issue comment event: 2
  • Push event: 2
  • Pull request event: 2
  • Fork event: 1
  • Create event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 261
  • Total Committers: 11
  • Avg Commits per committer: 23.727
  • Development Distribution Score (DDS): 0.192
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Marco Lui s****d@g****m 211
Adrien Barbaresi b****i@b****e 35
Aitzol Naberan a****n@c****m 6
Joel Nothman j****n@g****m 2
Marco Lui m****i@n****u 1
Bartek Ćwikłowski p****a@g****m 1
Sourcery AI 1
Martin Thurau m****u@g****m 1
tripleee e****d@i****i 1
Quentin Pradet q****n@c****m 1
Marco Lui m****i@m****) 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 13
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 14 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.25
  • Average comments per pull request: 0.46
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 6
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: 3 months
  • Average time to close pull requests: 26 days
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Aprilistic (1)
  • debuggio (1)
  • aleksandra-miletic (1)
  • bestlee666 (1)
Pull Request Authors
  • adbar (12)
  • sourcery-ai[bot] (6)
Top Labels
Issue Labels
bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 174,258 last-month
  • Total docker downloads: 67
  • Total dependent packages: 10
  • Total dependent repositories: 10
  • Total versions: 7
  • Total maintainers: 1
pypi.org: py3langid

Fork of the language identification tool langid.py, featuring a modernized codebase and faster execution times.

  • Versions: 7
  • Dependent Packages: 10
  • Dependent Repositories: 10
  • Downloads: 174,258 Last month
  • Docker Downloads: 67
Rankings
Downloads: 1.6%
Dependent packages count: 1.8%
Docker downloads count: 3.0%
Dependent repos count: 4.6%
Average: 6.2%
Stargazers count: 13.1%
Forks count: 13.3%
Maintainers (1)
Last synced: 6 months ago