html5
Standards-compliant library for parsing and serializing HTML documents and fragments in Python
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails (1 of 65 committers, 1.5%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.3%, to scientific vocabulary)
Repository
Statistics
- Stars: 1,203
- Watchers: 50
- Forks: 297
- Open Issues: 90
- Releases: 0
Metadata Files
README.rst
html5lib
========
.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
   :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml
html5lib is a pure-python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.
Usage
-----
Simple usage follows this pattern:
.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)
or:
.. code-block:: python

  import html5lib

  document = html5lib.parse("Hello World!")
By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
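As a rough illustration of what the default return value looks like, the sketch below parses a small piece of markup and inspects the resulting element. It assumes the default ``namespaceHTMLElements=True`` behaviour, under which HTML tag names carry the XHTML namespace; the ``XHTML`` constant is a name introduced here for readability only.

```python
import html5lib

# Namespace that html5lib attaches to HTML elements by default.
XHTML = "{http://www.w3.org/1999/xhtml}"

document = html5lib.parse("<p>Hello World!")

# parse() returns the root <html> element; the missing <head> and <body>
# are inferred by the parser, as a browser would do.
root_tag = document.tag
paragraph = document.find(".//" + XHTML + "p")
paragraph_text = paragraph.text

print(root_tag)
print(paragraph_text)
```

Because the result is an ordinary ``xml.etree`` element, the usual ``find`` and ``iter`` methods apply, provided the namespace is included in the tag name.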
Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:
.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When using html5lib with ``urllib2`` (Python 2), the charset from HTTP
should be passed into html5lib as follows:
.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))
When using html5lib with ``urllib.request`` (Python 3), the charset from
HTTP should be passed into html5lib as follows:
.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
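The effect of ``transport_encoding`` can also be tried without a network connection. The following is a minimal sketch that stands in an in-memory byte string for an HTTP response body; the sample markup and the ``XHTML`` constant are illustrative assumptions, not part of the library.

```python
import html5lib

XHTML = "{http://www.w3.org/1999/xhtml}"

# UTF-8 encoded bytes with no BOM and no <meta charset> to reveal the encoding.
raw = "<p>café".encode("utf-8")

# Pass the charset that the transport layer (e.g. an HTTP header) reported.
document = html5lib.parse(raw, transport_encoding="utf-8")
text = document.find(".//" + XHTML + "p").text

print(text)
```

Without the hint, the parser has to guess the encoding of byte input, so the non-ASCII character may be decoded differently.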
To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:
.. code-block:: python

  import html5lib

  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)
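One way to use strict mode is to check whether a piece of markup parses without any parse errors. The sketch below assumes the exception raised is ``html5lib.html5parser.ParseError``; the helper name ``parses_without_errors`` is hypothetical and introduced here for illustration.

```python
import html5lib
from html5lib.html5parser import ParseError


def parses_without_errors(markup):
    """Return True if a strict parser accepts the markup, False otherwise."""
    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parse(markup)
    except ParseError:
        return False
    return True


# A missing doctype is a parse error, so the first call reports False.
missing_doctype = parses_without_errors("<p>Hello World!")
complete = parses_without_errors(
    "<!DOCTYPE html><html><head><title>t</title></head>"
    "<body><p>ok</p></body></html>"
)
print(missing_doctype, complete)
```

Note that a document free of parse errors is not necessarily a fully conforming HTML document; strict mode only reports what the parser itself flags.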
When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:
.. code-block:: python

  import html5lib

  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")
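A tree built this way can be turned back into markup with ``html5lib.serialize``. The following sketch assumes the ``tree`` argument of ``serialize`` must name the same tree type that was built (``"dom"`` here); the variable names are illustrative.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# Walk the minidom tree and render it back to an HTML string.
markup = html5lib.serialize(minidom_document, tree="dom")

print(markup)
```

The exact output depends on serializer options such as whether optional tags are omitted, so it is printed rather than asserted here.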
More documentation is available at https://html5lib.readthedocs.io/.
Installation
------------
html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:
.. code-block:: bash

  $ pip install html5lib
The goal is to support a (non-strict) superset of the versions that pip
supports.
Optional Dependencies
---------------------
The following third-party libraries may be used for additional
functionality:
- ``lxml`` is supported as a tree format (for both building and
walking) under CPython (but *not* PyPy where it is known to cause
segfaults);
- ``genshi`` has a treewalker (but not builder); and
- ``chardet`` can be used as a fallback when character encoding cannot
be determined.
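Since these dependencies are optional, calling code may want to fall back to the standard library tree when ``lxml`` is not installed. This is one possible sketch; the helper ``pick_treebuilder`` is hypothetical and not part of html5lib.

```python
import importlib.util

import html5lib


def pick_treebuilder():
    """Prefer the lxml tree when lxml is importable, else use xml.etree."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "etree"


treebuilder_name = pick_treebuilder()
document = html5lib.parse("<p>Hello World!", treebuilder=treebuilder_name)

print(treebuilder_name)
```

The two treebuilders return different object types, so code that consumes ``document`` should only rely on the API that both share.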
Bugs
----
Please report any bugs on the issue tracker.
Tests
-----
Unit tests require the ``pytest`` and ``mock`` libraries and can be
run using the ``pytest`` command in the root directory.
Test data are contained in a separate html5lib-tests repository and
included as a submodule, so for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update
If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.
Questions?
----------
Check out `the docs <https://html5lib.readthedocs.io/>`_. Still
need help? Go to our GitHub Discussions.
You can also browse the archives of the html5lib-discuss mailing list.
Owner
- Name: html5lib
- Login: html5lib
- Kind: organization
- Repositories: 6
- Profile: https://github.com/html5lib
GitHub Events
Total
- Issues event: 3
- Watch event: 77
- Issue comment event: 7
- Fork event: 11
Last Year
- Issues event: 3
- Watch event: 77
- Issue comment event: 7
- Fork event: 11
Committers
Last synced: 12 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Geoffrey Sneddon | g****s@g****m | 380 |
| James Graham | j****s@h****k | 324 |
| Anne van Kesteren | a****k@a****l | 295 |
| Sam Ruby | r****s@i****t | 106 |
| James Graham | j****m@o****m | 75 |
| Thomas Broyer | t****r@l****t | 71 |
| Philip Taylor | p****p@z****k | 40 |
| Mark Pilgrim | m****k@d****g | 29 |
| lantis63 | l****3@g****m | 26 |
| Łukasz Langa | l****z@l****l | 26 |
| Tom Most | t****m@f****t | 23 |
| Jon Dufresne | j****e@g****m | 19 |
| Will Kahn-Greene | w****g | 17 |
| Hugo | h****k | 17 |
| John Vandenberg | j****b@g****m | 8 |
| Donald Stufft | d****d@s****o | 5 |
| Edward Z. Yang ext:(%22) | e****g@t****m | 4 |
| Andy Wingo | w****o@p****m | 4 |
| Gabi Davar | g****o@g****m | 4 |
| Lachlan Hunt | l****t@l****u | 4 |
| Philip Jägenstedt | p****p@f****g | 4 |
| Ritwik Gupta | R****a | 4 |
| Simon Pieters | z****n@g****m | 4 |
| Christian Clauss | c****s@m****m | 2 |
| taha | m****r@g****m | 2 |
| Simon Sapin | s****n@e****g | 2 |
| Michael[tm] Smith | m****e@w****g | 2 |
| Kovid Goyal | k****d@k****t | 2 |
| Vitalik Verhovodov | k****r@g****m | 2 |
| Ms2ger | M****r@g****m | 2 |
| and 35 more... | | |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 46
- Total pull requests: 74
- Average time to close issues: over 1 year
- Average time to close pull requests: about 1 year
- Total issue authors: 39
- Total pull request authors: 34
- Average comments per issue: 4.09
- Average comments per pull request: 1.95
- Merged pull requests: 30
- Bot issues: 0
- Bot pull requests: 11
Past Year
- Issues: 3
- Pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 3
- Pull request authors: 4
- Average comments per issue: 1.0
- Average comments per pull request: 1.6
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- gsnedders (4)
- kloczek (4)
- jvanasco (2)
- willkg (1)
- aqeelat (1)
- thomasrockhu (1)
- leonardr (1)
- annevk (1)
- jayaddison (1)
- uranusjr (1)
- hroncok (1)
- theRealProHacker (1)
- frenzymadness (1)
- Deimos (1)
- aT0ngMu (1)
Pull Request Authors
- dependabot-preview[bot] (10)
- gsnedders (9)
- hugovk (8)
- jayaddison (7)
- ambv (5)
- cclauss (4)
- Mic92 (3)
- twm (3)
- zcorpan (2)
- jdufresne (2)
- ashleysommer (2)
- hroncok (2)
- frenzymadness (2)
- kuvandjiev (1)
- eli-schwartz (1)
Packages
- Total packages: 24
- Total downloads:
  - pypi 29,916,678 last-month
- Total docker downloads: 4,350,558,777
- Total dependent packages: 538 (may contain duplicates)
- Total dependent repositories: 47,496 (may contain duplicates)
- Total versions: 65
- Total maintainers: 7
- Total advisories: 2
pypi.org: html5lib
HTML parser based on the WHATWG HTML specification
- Homepage: https://github.com/html5lib/html5lib-python
- Documentation: https://html5lib.readthedocs.io/
- License: MIT License
- Latest release: 1.0.1 (published over 8 years ago)
- Advisories: 2
alpine-v3.14: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r1 (published about 5 years ago)
- Maintainers: 1
alpine-v3.15: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r1 (published about 5 years ago)
- Maintainers: 1
alpine-v3.16: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r2 (published over 4 years ago)
- Maintainers: 1
alpine-v3.18: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r4 (published about 3 years ago)
- Maintainers: 1
alpine-v3.18: py3-html5lib-pyc
Precompiled Python bytecode for py3-html5lib
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r4 (published about 3 years ago)
- Maintainers: 1
pypi.org: html5
HTML parser based on the WHATWG HTML specification
- Homepage: https://github.com/html5lib/html5lib-python
- Documentation: https://html5.readthedocs.io/
- License: MIT License
- Latest release: 0.0.9 (published almost 10 years ago)
- Maintainers: 1
alpine-v3.12: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.0.1-r4 (published over 6 years ago)
- Maintainers: 1
alpine-v3.13: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r0 (published about 6 years ago)
- Maintainers: 1
alpine-v3.17: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r2 (published over 4 years ago)
- Maintainers: 1
alpine-edge: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r6 (published almost 2 years ago)
- Maintainers: 1
spack.io: py-html5lib
HTML parser based on the WHATWG HTML specification.
- Homepage: https://github.com/html5lib/html5lib-python
- License: []
- Latest release: 1.0.1 (published about 4 years ago)
- Maintainers: 1
conda-forge.org: html5lib
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.0.1 (published over 3 years ago)
alpine-edge: py3-html5lib-pyc
Precompiled Python bytecode for py3-html5lib
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r6 (published almost 2 years ago)
- Maintainers: 1
anaconda.org: html5lib
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.0.1 (published almost 8 years ago)
pypi.org: html5lib-modern
HTML parser based on the WHATWG HTML specification
- Homepage: https://github.com/html5lib/html5lib-python
- Documentation: https://html5lib-modern.readthedocs.io/
- License: MIT License (Copyright (c) 2006-2013 James Graham and other contributors)
- Latest release: 1.2 (published almost 2 years ago)
- Maintainers: 1
alpine-v3.22: py3-html5lib-pyc
Precompiled Python bytecode for py3-html5lib
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r6 (published almost 2 years ago)
- Maintainers: 1
alpine-v3.21: py3-html5lib-pyc
Precompiled Python bytecode for py3-html5lib
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r6 (published almost 2 years ago)
- Maintainers: 1
alpine-v3.20: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r5 (published about 2 years ago)
- Maintainers: 1
alpine-v3.21: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r6 (published almost 2 years ago)
- Maintainers: 1
alpine-v3.19: py3-html5lib-pyc
Precompiled Python bytecode for py3-html5lib
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r4 (published about 3 years ago)
alpine-v3.22: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r6 (published almost 2 years ago)
- Maintainers: 1
alpine-v3.20: py3-html5lib-pyc
Precompiled Python bytecode for py3-html5lib
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r5 (published about 2 years ago)
- Maintainers: 1
alpine-v3.19: py3-html5lib
A Python HTML parser
- Homepage: https://github.com/html5lib/html5lib-python
- License: MIT
- Latest release: 1.1-r4 (published about 3 years ago)
Dependencies
- chardet >=2.2
- genshi *
- lxml *
- coverage >=5.1,<6 test
- flake8 >=3.8.1,<3.9 test
- mock >=4.0.2,<5 test
- mock >=3.0.5,<4 test
- pytest >=5.4.2,<7 test
- pytest >=4.6.10,<5 test
- pytest-expect >=1.1.0,<2 test
- tox >=3.15.1,<4 test
- six >=1.9
- webencodings *
- six >=1.9
- webencodings *
- actions/checkout v2 composite
- actions/setup-python v2 composite
- chardet ==2.2.1
- coverage ==5.1
- flake8 ==5.0.4
- flake8 ==3.9.2
- genshi ==0.7.6
- genshi ==0.7.1
- lxml ==4.9.0
- lxml ==3.8.0
- mock ==4.0.2
- mock ==3.0.5
- pytest ==4.6.10
- pytest ==5.4.2
- pytest-expect ==1.1.0
- six ==1.9
- webencodings ==0.5.1