https://github.com/adbar/py3langid

Faster, modernized fork of the language identification tool langid.py

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 11 committers (9.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

detect-language langid language-detection language-identification language-recognition nlp whatlang

Keywords from Contributors

transformation datetime lemmatization tokenization
Last synced: 6 months ago

Repository

Faster, modernized fork of the language identification tool langid.py

Basic Info
Statistics
  • Stars: 56
  • Watchers: 2
  • Forks: 9
  • Open Issues: 2
  • Releases: 0
Fork of saffsd/langid.py
Topics
detect-language langid language-detection language-identification language-recognition nlp whatlang
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.rst

=============
``py3langid``
=============


``py3langid`` is a fork of the standalone language identification tool ``langid.py`` by Marco Lui.

Original license: BSD-2-Clause. Fork license: BSD-3-Clause.



Changes in this fork
--------------------

Execution speed has been improved and the code base has been optimized for Python 3.6+:

- Import: Loading the package (``import py3langid``) is about 30% faster
- Startup: Loading the default classification model is 25-30x faster
- Execution: Language detection with ``langid.classify`` is 5-6x faster on paragraphs (less on longer texts)

For implementation details, see the blog post *How to make language detection with langid.py faster*.

For more information and older Python versions, see the changelog.


Usage
-----

Drop-in replacement
~~~~~~~~~~~~~~~~~~~


1. Install the package:

   * ``pip3 install py3langid`` (or ``pip`` where applicable)

2. Use it:

   * with Python: ``import py3langid as langid``
   * on the command-line: ``langid``


With Python
~~~~~~~~~~~

Basics:

.. code-block:: python

    >>> import py3langid as langid
    
    >>> text = 'This text is in English.'
    # identified language and probability
    >>> langid.classify(text)
    ('en', -56.77429)
    # unpack the result tuple in variables
    >>> lang, prob = langid.classify(text)
    # all potential languages
    >>> langid.rank(text)


More options:

.. code-block:: python

    >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE

    # subset of target languages
    >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE)
    >>> identifier.set_languages(['de', 'en', 'fr'])
    # this won't work well...
    >>> identifier.classify('这样不好')
    ('en', -81.831665)

    # normalization of probabilities to an interval between 0 and 1
    >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
    >>> identifier.classify('This should be enough text.')
    ('en', 1.0)


Note: the NumPy data type of the feature vector has been changed to optimize for speed. If results are inconsistent, try restoring the original setting:

.. code-block:: python

    >>> langid.classify(text, datatype='uint32')


On the command-line
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # basic usage with probability normalization
    $ echo "This should be enough text." | langid -n
    ('en', 1.0)

    # define a subset of target languages
    $ echo "This won't be recognized properly." | langid -n -l fr,it,tr
    ('it', 0.97038305)


Legacy documentation
--------------------


**The documentation below is provided for reference; only part of the functionality is currently tested and maintained.**


Introduction
------------

``langid.py`` is a standalone Language Identification (LangID) tool.

The design principles are as follows:

1. Fast
2. Pre-trained over a large number of languages (currently 97)
3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
4. Single .py file with minimal dependencies
5. Deployable as a web service

All that is required to run ``langid.py`` is Python >= 3.6 and NumPy.

The accompanying training tools are still Python2-only.

``langid.py`` is WSGI-compliant. It will use ``fapws3`` as a web server if
available, and default to ``wsgiref.simple_server`` otherwise.

``langid.py`` comes pre-trained on 97 languages (ISO 639-1 codes given):

    af, am, an, ar, as, az, be, bg, bn, br, 
    bs, ca, cs, cy, da, de, dz, el, en, eo, 
    es, et, eu, fa, fi, fo, fr, ga, gl, gu, 
    he, hi, hr, ht, hu, hy, id, is, it, ja, 
    jv, ka, kk, km, kn, ko, ku, ky, la, lb, 
    lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, 
    nb, ne, nl, nn, no, oc, or, pa, pl, ps, 
    pt, qu, ro, ru, rw, se, si, sk, sl, sq, 
    sr, sv, sw, ta, te, th, tl, tr, ug, uk, 
    ur, vi, vo, wa, xh, zh, zu

The training data was drawn from 5 different sources:

* JRC-Acquis 
* ClueWeb 09
* Wikipedia
* Reuters RCV2
* Debian i18n


Usage
-----

    langid [options]

optional arguments:
  -h, --help            show this help message and exit
  -s, --serve           launch web service
  --host=HOST           host/ip to bind to
  --port=PORT           port to listen on
  -v                    increase verbosity (repeat for greater effect)
  -m MODEL              load model from file
  -l LANGS, --langs=LANGS
                        comma-separated set of target ISO639 language codes
                        (e.g en,de)
  -r, --remote          auto-detect IP address for remote access
  -b, --batch           specify a list of files on the command line
  -d, --dist            show full distribution over languages
  -u URL, --url=URL     identify the language of the document at URL
  --line                process pipes line-by-line rather than as a document
  -n, --normalize       normalize confidence scores to probability values


The simplest way to use ``langid.py`` is as a command-line tool; invoke it with
``python langid.py``. If you installed ``langid.py`` as a Python module
(e.g. via ``pip install langid``), you can invoke ``langid`` instead of
``python langid.py`` (the two are equivalent). This will display a prompt.
Enter text to identify, and hit enter::

  >>> This is a test
  ('en', -54.41310358047485)
  >>> Questa e una prova
  ('it', -35.41771221160889)


``langid.py`` can also detect when its input is redirected (only tested under Linux), in which
case it will process until EOF rather than until newline as in interactive mode::

  python langid.py < README.rst 
  ('en', -22552.496054649353)


The value returned is the unnormalized probability estimate for the language. Calculating 
the exact probability estimate is disabled by default, but can be enabled through a flag::

  python langid.py -n < README.rst 
  ('en', 1.0)

More details are provided in this README in the section on `Probability Normalization`.

You can also use ``langid.py`` as a Python library::

  # python
  Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
  [GCC 4.6.1] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import langid
  >>> langid.classify("This is a test")
  ('en', -54.41310358047485)
  
Finally, ``langid.py`` can use Python's built-in ``wsgiref.simple_server`` (or ``fapws3`` if available) to
provide language identification as a web service. To do this, launch ``python langid.py -s``, and
access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed
with no data, a simple HTML forms interface is displayed.

The response is generated in JSON; here is an example::

  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

A utility such as curl can be used to access the web service::

  # curl -d "q=This is a test" localhost:9008/detect
  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

You can also use HTTP PUT::

  # curl -T readme.rst localhost:9008/detect
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
  {"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no "q=XXX" key-value pair is present in the HTTP POST payload, ``langid.py`` will interpret the entire
file as a single query. This allows for redirection via curl::

  # echo "This is a test" | curl -d @- localhost:9008/detect
  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
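The same query can be issued programmatically. Here is a minimal sketch using only the standard library; it assumes a server launched with ``python langid.py -s`` on the default port 9008, and the network call is guarded since no server may be running:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Build a POST request carrying the text in the "q" field,
# mirroring the curl examples above.
data = urlencode({"q": "This is a test"}).encode()
req = Request("http://localhost:9008/detect", data=data)

# Sending the request requires a running langid.py server;
# the response body is the JSON document shown above.
if __name__ == "__main__":
    try:
        with urlopen(req, timeout=5) as resp:
            print(resp.read().decode())
    except OSError:
        print("no langid.py server reachable on localhost:9008")
```

Passing ``data`` to ``Request`` makes it a POST, matching the ``curl -d`` invocation.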

``langid.py`` will attempt to discover the host IP address automatically. Often, this resolves
to localhost (127.0.1.1) even though the machine has a different external IP address. To have
``langid.py`` attempt to discover the external IP address instead, start it with the ``-r`` flag.

``langid.py`` supports constraining of the output language set using the ``-l`` flag and a comma-separated list of ISO639-1 
language codes (the ``-n`` flag enables probability normalization)::

  # python langid.py -n -l it,fr
  >>> Io non parlo italiano
  ('it', 0.99999999988965627)
  >>> Je ne parle pas français
  ('fr', 1.0)
  >>> I don't speak english
  ('it', 0.92210605672341062)

When using ``langid.py`` as a library, the ``set_languages`` method can be used to constrain the language set::

  # python
  Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
  [GCC 4.6.1] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import langid
  >>> langid.classify("I do not speak english")
  ('en', 0.57133487679900674)
  >>> langid.set_languages(['de','fr','it'])
  >>> langid.classify("I do not speak english")
  ('it', 0.99999835791478453)
  >>> langid.set_languages(['en','it'])
  >>> langid.classify("I do not speak english")
  ('en', 0.99176190378750373)


Batch Mode
----------

``langid.py`` supports batch mode processing, which can be invoked with the ``-b`` flag.
In this mode, ``langid.py`` reads a list of paths to files to classify as arguments.
If no arguments are supplied, ``langid.py`` reads the list of paths from ``stdin``;
this is useful for combining ``langid.py`` with UNIX utilities such as ``find``.

In batch mode, ``langid.py`` uses ``multiprocessing`` to invoke multiple instances of
the classifier, utilizing all available CPUs to classify documents in parallel. 


Probability Normalization
-------------------------

The probabilistic model implemented by ``langid.py`` involves the multiplication of a
large number of probabilities. For computational reasons, the actual calculations are
implemented in the log-probability space (a common numerical technique for dealing with
vanishingly small probabilities). One side-effect of this is that it is not necessary to
compute a full probability in order to determine the most probable language in a set
of candidate languages. However, users sometimes find it helpful to have a "confidence"
score for the probability prediction. Thus, ``langid.py`` implements a re-normalization
that produces an output in the 0-1 range.
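The renormalization can be illustrated with a small log-sum-exp computation. This is a sketch of the general technique for turning per-language log-probabilities into values in the 0-1 range, not the library's exact code; the scores below are hypothetical:

```python
import math

def norm_probs(log_scores):
    # Shift by the maximum before exponentiating (the log-sum-exp
    # trick) so that vanishingly small probabilities do not underflow.
    m = max(log_scores)
    total = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - total) for s in log_scores]

# Hypothetical log-probabilities for three candidate languages.
scores = [-54.4, -61.2, -70.3]
probs = norm_probs(scores)
# The normalized values sum to 1 and the best-scoring language dominates.
```

Note that the argmax is unchanged by normalization, which is why the most probable language can be determined without this step.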

``langid.py`` disables probability normalization by default. For
command-line usage of ``langid.py``, it can be enabled by passing the ``-n`` flag. For
probability normalization in library use, the user must instantiate their own 
``LanguageIdentifier``. An example of such usage is as follows::
  
  >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
  >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
  >>> identifier.classify("This is a test")
  ('en', 0.9999999909903544)


Training a model
----------------

So far Python 2.7 only; see the original instructions.


Read more
---------

``langid.py`` is based on published research. [1] describes the LD feature selection technique in detail,
and [2] provides more detail about the module ``langid.py`` itself.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, 
In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), 
Chiang Mai, Thailand, pp. 553–561. Available from http://www.aclweb.org/anthology/I11-1062

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, 
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 
Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005

Owner

  • Name: Adrien Barbaresi
  • Login: adbar
  • Kind: user
  • Location: Berlin
  • Company: Berlin-Brg. Academy of Sciences (BBAW)

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

GitHub Events

Total
  • Issues event: 3
  • Watch event: 9
  • Delete event: 1
  • Issue comment event: 2
  • Push event: 2
  • Pull request event: 2
  • Fork event: 1
  • Create event: 1
Last Year
  • Issues event: 3
  • Watch event: 9
  • Delete event: 1
  • Issue comment event: 2
  • Push event: 2
  • Pull request event: 2
  • Fork event: 1
  • Create event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 261
  • Total Committers: 11
  • Avg Commits per committer: 23.727
  • Development Distribution Score (DDS): 0.192
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Marco Lui s****d@g****m 211
Adrien Barbaresi b****i@b****e 35
Aitzol Naberan a****n@c****m 6
Joel Nothman j****n@g****m 2
Marco Lui m****i@n****u 1
Bartek Ćwikłowski p****a@g****m 1
Sourcery AI 1
Martin Thurau m****u@g****m 1
tripleee e****d@i****i 1
Quentin Pradet q****n@c****m 1
Marco Lui m****i@m****) 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 13
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 14 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.25
  • Average comments per pull request: 0.46
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 6
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: 3 months
  • Average time to close pull requests: 26 days
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Aprilistic (1)
  • debuggio (1)
  • aleksandra-miletic (1)
  • bestlee666 (1)
Pull Request Authors
  • adbar (12)
  • sourcery-ai[bot] (6)
Top Labels
Issue Labels
bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 174,258 last-month
  • Total docker downloads: 67
  • Total dependent packages: 10
  • Total dependent repositories: 10
  • Total versions: 7
  • Total maintainers: 1
pypi.org: py3langid

Fork of the language identification tool langid.py, featuring a modernized codebase and faster execution times.

  • Versions: 7
  • Dependent Packages: 10
  • Dependent Repositories: 10
  • Downloads: 174,258 Last month
  • Docker Downloads: 67
Rankings
Downloads: 1.6%
Dependent packages count: 1.8%
Docker downloads count: 3.0%
Dependent repos count: 4.6%
Average: 6.2%
Stargazers count: 13.1%
Forks count: 13.3%
Maintainers (1)
Last synced: 6 months ago