htmldate
htmldate: A Python package to extract publication dates from web pages - Published in JOSS (2020)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 15 DOI reference(s) in README and JOSS metadata
- ○ Academic publication links
- ✓ Committers with academic emails: 4 of 25 committers (16.0%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Fast and robust date extraction from web pages, with Python or on the command-line
Basic Info
- Host: GitHub
- Owner: adbar
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: https://htmldate.readthedocs.io
- Size: 30.1 MB
Statistics
- Stars: 138
- Watchers: 4
- Forks: 27
- Open Issues: 11
- Releases: 39
Metadata Files
README.md
Htmldate: Find the Publication Date of Web Pages

Find original and updated publication dates of any web page. It is often not possible to do it using just the URL or the server response.
On the command-line or with Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.
The package is used in production on millions of documents and integrated into thousands of projects.
In a nutshell

With Python
``` python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
```
On the command-line
``` bash
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
```
Features
- Flexible input: URLs, HTML files, or HTML trees can be used as input (including batch processing).
- Customizable output: Any date format (defaults to ISO 8601 YMD).
- Detection of both original and updated dates.
- Multilingual.
- Compatible with all recent versions of Python.
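The customizable output can be pictured with the standard library alone. The sketch below assumes that the desired format is given as a `strftime`-style pattern, which is how the `outputformat` argument of `find_date` is specified; the helper name `reformat` is hypothetical and not part of htmldate.

```python
from datetime import datetime


def reformat(iso_date, outputformat="%Y-%m-%d"):
    """Render an ISO 8601 (YMD) date string with a strftime-style pattern.

    Hypothetical helper for illustration only; it is not htmldate's code.
    """
    parsed = datetime.strptime(iso_date, "%Y-%m-%d")
    return parsed.strftime(outputformat)


print(reformat("2016-12-23"))              # 2016-12-23 (default, ISO 8601 YMD)
print(reformat("2016-12-23", "%d/%m/%Y"))  # 23/12/2016
```

Numeric directives are used here because names of months and weekdays (`%B`, `%A`) depend on the active locale.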
How it works
Htmldate operates by sifting through HTML markup and, if necessary, text elements. It features the following heuristics:
- Markup in header: Common patterns are used to identify relevant elements (e.g. `link` and `meta` elements) including Open Graph protocol attributes.
- HTML code: The whole document is searched for structural markers like `abbr` or `time` elements and a series of attributes (e.g. `postmetadata`).
- Bare HTML content: Heuristics are run on text and markup:
  - In `fast` mode the HTML page is cleaned and precise patterns are targeted.
  - In `extensive` mode all potential dates are collected and a disambiguation algorithm determines the best one.

Finally, the output is validated and converted to the chosen format.
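The cascade described above can be approximated in a few lines. What follows is a deliberately simplified sketch, not htmldate's implementation: it uses only the standard library, the attribute list `DATE_KEYS` is a short hypothetical sample, only ISO-like `YYYY-MM-DD` patterns are recognized, and a naive "most frequent candidate" rule stands in for the actual disambiguation algorithm.

```python
import re
from collections import Counter
from datetime import date
from html.parser import HTMLParser

# Hypothetical, shortened sample of date-bearing meta attributes (assumption).
DATE_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}
# ISO-like dates only; lookarounds avoid matching inside longer digit runs.
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


def validate(text, earliest=date(1995, 1, 1)):
    """Return an ISO date string if `text` holds a plausible date, else None."""
    match = ISO_DATE.search(text or "")
    if not match:
        return None
    try:
        found = date(*map(int, match.groups()))
    except ValueError:  # impossible dates such as month 13
        return None
    return found.isoformat() if earliest <= found <= date.today() else None


class MetaDateParser(HTMLParser):
    """Collect the content of <meta> elements whose key looks date-related."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = (attributes.get("property") or attributes.get("name") or "").lower()
        if key in DATE_KEYS:
            self.candidates.append(attributes.get("content"))


def guess_date(html):
    """Try header markup first, then fall back to patterns found in the page."""
    parser = MetaDateParser()
    parser.feed(html)
    for candidate in parser.candidates:
        result = validate(candidate)
        if result:
            return result
    # Fallback: collect every valid date-like pattern, keep the most frequent.
    found = (validate(m.group(0)) for m in ISO_DATE.finditer(html))
    counts = Counter(d for d in found if d)
    return counts.most_common(1)[0][0] if counts else None
```

With this sketch, a page carrying `<meta property="article:published_time" content="2017-09-01T10:00:00"/>` yields `2017-09-01` from the header alone, and the text fallback is only consulted when no usable header markup exists.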
Performance
1000 web pages containing identifiable dates (as of 2023-11-13 on Python 3.10)
| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate[all] 1.6.0 (fast) | 0.883 | 0.924 | 0.823 | 0.903 | 1x |
| htmldate[all] 1.6.0 (extensive) | 0.870 | 0.993 | 0.865 | 0.928 | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |
For the complete results and explanations see evaluation page.
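To read the F-Score column, it helps to see how it relates to precision and recall. The sketch below assumes that the column reports the usual F1 measure (the harmonic mean of the two), which is consistent with the figures in the table; the function name `f_score` is chosen for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for htmldate[all] 1.6.0 (fast).
print(round(f_score(0.883, 0.924), 3))  # 0.903
```

Because the table lists rounded precision and recall values, recomputing a row this way can differ from the published F-Score in the last digit.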
Installation
Htmldate is tested on Linux, macOS and Windows systems and is compatible with Python 3.8 upwards. It can notably be installed with pip (pip3 where applicable) from the PyPI package repository:
- `pip install htmldate`
- (optionally) `pip install htmldate[speed]`

The last version to support Python 3.6 and 3.7 is htmldate==1.8.1.
Documentation
For more details on installation, Python & CLI usage, please refer to the documentation: htmldate.readthedocs.io
License
This package is distributed under the Apache 2.0 license.
Versions prior to v1.8.0 are under GPLv3+ license.
Context and contributions
Initially launched to create text databases for research purposes at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units), this project continues to be maintained but its future development depends on community support.
If you value this software or depend on it for your product, consider sponsoring it and contributing to its codebase. Your support will help maintain and enhance this package. Visit the Contributing page for more information.
Reach out via the software repository or the contact page for inquiries, collaborations, or feedback.
``` bibtex
@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}
```
- Barbaresi, A. "htmldate: A Python package to extract publication dates from web pages", Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
- Barbaresi, A. "Generic Web Content Extraction with Open-Source Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "Efficient construction of metadata-enhanced web corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.
Acknowledgements
Kudos to the following software libraries:
- lxml, dateparser
- A few patterns are derived from the python-goose, metascraper, newspaper and articleDateExtractor libraries. This module extends their coverage and robustness significantly.
Owner
- Name: Adrien Barbaresi
- Login: adbar
- Kind: user
- Location: Berlin
- Company: Berlin-Brg. Academy of Sciences (BBAW)
- Website: adrien.barbaresi.eu
- Twitter: adbarbaresi
- Repositories: 37
- Profile: https://github.com/adbar
Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.
JOSS Publication
htmldate: A Python package to extract publication dates from web pages
Tags
metadata extraction, date parsing, web scraping, natural language processing
GitHub Events
Total
- Create event: 12
- Issues event: 8
- Release event: 2
- Watch event: 16
- Delete event: 11
- Issue comment event: 12
- Push event: 18
- Pull request event: 23
- Fork event: 2
Last Year
- Create event: 12
- Issues event: 8
- Release event: 2
- Watch event: 16
- Delete event: 11
- Issue comment event: 12
- Push event: 18
- Pull request event: 23
- Fork event: 2
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Adrien Barbaresi | b****i@b****e | 508 |
| Adrien Barbaresi | a****i@o****t | 65 |
| evolutionoftheuniverse | 6****e | 31 |
| DerKozmonaut | 5****t | 17 |
| Adrien Barbaresi | a****i@e****r | 15 |
| Corey Dockser | c****r@g****m | 9 |
| Radhi Fadlillah | m****f@g****m | 7 |
| dependabot[bot] | 4****] | 6 |
| Vincent Barbaresi | v****i@o****m | 2 |
| Daniel S. Katz | d****z@i****g | 2 |
| Rahul B | r****t@g****m | 2 |
| kernc | k****e@g****m | 2 |
| sourcery-ai[bot] | 5****] | 2 |
| Ashik Paul | a****7@g****m | 1 |
| Andrei Zhemaituk | a****k@g****m | 1 |
| Adam Hupp | 1****i | 1 |
| B3N | b@a****t | 1 |
| EkaterineSheshelidze | 8****e | 1 |
| Felipe Hertzer | f****r@g****m | 1 |
| Lawrence M Stewart | g****a | 1 |
| MSK1582 | 6****2 | 1 |
| Nada Ayesh | n****0@s****s | 1 |
| SalihTalha | 4****a | 1 |
| lgtm-com[bot] | 4****] | 1 |
| liulinlin90 | l****0@g****m | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 60
- Total pull requests: 119
- Average time to close issues: 2 months
- Average time to close pull requests: 3 days
- Total issue authors: 33
- Total pull request authors: 23
- Average comments per issue: 2.63
- Average comments per pull request: 1.49
- Merged pull requests: 81
- Bot issues: 0
- Bot pull requests: 40
Past Year
- Issues: 6
- Pull requests: 15
- Average time to close issues: 14 days
- Average time to close pull requests: about 9 hours
- Issue authors: 6
- Pull request authors: 3
- Average comments per issue: 0.83
- Average comments per pull request: 1.0
- Merged pull requests: 10
- Bot issues: 0
- Bot pull requests: 3
Top Authors
Issue Authors
- adbar (25)
- RadhiFadlillah (2)
- rahulbot (2)
- geoffbacon (1)
- evolutionoftheuniverse (1)
- Kulratis (1)
- PetroffSky (1)
- frenzymadness (1)
- zhemaituk (1)
- masylum (1)
- alroythalus (1)
- Ismael-Hery (1)
- eupattaro89 (1)
- arcombe012 (1)
- rolisz (1)
Pull Request Authors
- adbar (69)
- dependabot[bot] (43)
- sourcery-ai[bot] (9)
- DerKozmonaut (3)
- evolutionoftheuniverse (3)
- b3n4kh (2)
- danielskatz (2)
- nadasuhailAyesh12 (2)
- EkaterineSheshelidze (2)
- vbarbaresi (2)
- felipehertzer (2)
- mohmmadAyesh (2)
- kernc (2)
- zhemaituk (2)
- SalihTalha (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi 4,920,761 last-month
- Total docker downloads: 606
- Total dependent packages: 5
- Total dependent repositories: 50
- Total versions: 58
- Total maintainers: 1
pypi.org: htmldate
Fast and robust extraction of original and updated publication dates from URLs and web pages.
- Homepage: https://htmldate.readthedocs.io
- Documentation: https://htmldate.readthedocs.io/
- License: Apache 2.0
- Latest release: 1.9.3 (published 12 months ago)
Dependencies
- actions/checkout v3 composite
- github/codeql-action/analyze v2 composite
- github/codeql-action/autobuild v2 composite
- github/codeql-action/init v2 composite
- actions/cache v2 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- codecov/codecov-action v3 composite
- htmldate *
- sphinx >=7.2.6
- backports-datetime-fromisoformat *
- charset_normalizer *
- dateparser *
- lxml *
- python-dateutil *
- urllib3 *
- articleDateExtractor ==0.20 test
- date_guesser ==2.1.4 test
- goose3 ==3.1.17 test
- htmldate >=1.5.0 test
- news-please ==1.5.35 test
- newspaper3k ==0.2.8 test
- tabulate ==0.9.0 test
