Recent Releases of Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved Annotation Postprocessors
- Improved both the HTML and XML Annotation postprocessors.
- Fixed #94 (XML tags are now closed in the correct order).
- Breaking change:
  - the `XmlAnnotationProcessor` now uses a mandatory root element to ensure that the created XML is valid.
  - Impact:
    - The created XML will contain a `<content>` root element, which contains all annotations.
    - The name of the root element can be overwritten, if the optional `root_element` parameter is provided to the annotation processor call.
- Documentation fixes.
- Updated dependencies and added `pytest` as a build dependency.
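
The effect of the tag-order fix and the mandatory root element can be illustrated with a minimal sketch that uses only the Python standard library and not the Inscriptis API. The function name `annotate_xml`, the `(start, end, label)` tuple layout and the sample text are assumptions made for this illustration; only the `<content>` default and the `root_element` parameter name are taken from the notes above. The sketch also assumes that annotation spans are properly nested and never cross.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def annotate_xml(text, annotations, root_element="content"):
    """Wrap annotated spans in XML tags inside a single root element.

    Hypothetical helper: `annotations` is a list of (start, end, label)
    tuples whose spans are nested but never cross each other.
    """
    out = [f"<{root_element}>"]
    stack = []  # (end, label) of the tags that are currently open
    pos = 0
    # Outer spans first: sort by start offset, then by longest span.
    for start, end, label in sorted(annotations, key=lambda a: (a[0], -a[1])):
        # Close every tag that ends before this one starts, innermost first.
        while stack and stack[-1][0] <= start:
            close_end, close_label = stack.pop()
            out.append(escape(text[pos:close_end]))
            out.append(f"</{close_label}>")
            pos = close_end
        out.append(escape(text[pos:start]))
        out.append(f"<{label}>")
        stack.append((end, label))
        pos = start
    # Close the remaining tags in reverse opening order.
    while stack:
        close_end, close_label = stack.pop()
        out.append(escape(text[pos:close_end]))
        out.append(f"</{close_label}>")
        pos = close_end
    out.append(escape(text[pos:]))
    out.append(f"</{root_element}>")
    return "".join(out)


xml = annotate_xml(
    "Chur is in Switzerland",
    [(0, 4, "city"), (0, 4, "entity"), (11, 22, "country")],
)
# Without the root element the three top-level pieces would not form a
# well-formed XML document; with it, a standard parser accepts the result.
root = ET.fromstring(xml)
```

Passing a different `root_element` value to the sketch changes only the name of the outer tag.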
Scientific Software - Peer-reviewed
- Python
Published by AlbertWeichselbraun about 1 year ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Allow for newer requests versions.
This is a maintenance release which fixes a regression in the package dependencies introduced in 2.5.1.
- merged #91 to allow for newer requests versions.
Published by AlbertWeichselbraun over 1 year ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Python 3.13 support
- added Python 3.13 to the build pipeline.
- deprecated Python 3.8.
- updated dependencies.
- minor optimizations in the attribute handling.
Published by AlbertWeichselbraun over 1 year ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Custom HTML Handling and HTML engine improvements
- add working support for specifying custom html tags (fixes #81)
- improved html_engine.py
- improved typing across all modules
- added unittests for
  - inscript
  - inscriptis-api
- documentation update
Published by AlbertWeichselbraun over 2 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fix documentation build and update publish script.
- fix building documentation on readthedocs.org
- update publish script
Published by AlbertWeichselbraun over 2 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Code cleanup, improved Web service and distribution
- added official Python 3.12 support
- Inscriptis command line client
  - renamed `inscript.py` to `inscript` and install client via pip
  - added `--timeout` argument.
- Inscriptis Web service:
  - migrate the Web service to FastAPI and uvicorn
  - enable install as an extra using `pip install inscriptis[web-service]`
- code cleanup
  - migrate to `pyproject.toml` and poetry for package distribution
  - use black for code formatting
  - improved tox config and code checks
Published by AlbertWeichselbraun over 2 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Official Python 3.11 support
Maintenance release adding Python 3.11 to the build pipeline.
Published by AlbertWeichselbraun over 3 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fixed handling of invalid length specifications
This is a bugfix release correcting the handling of invalid length specifications (bug #63).
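
What tolerant handling of a length specification can look like is sketched below. This is not the code used in Inscriptis: the function name `parse_length`, the supported units, the fallback value and the `px_per_em` conversion factor are all assumptions chosen for the illustration.

```python
import re

# A signed decimal number directly followed by a supported unit.
_LENGTH = re.compile(r"^\s*(-?\d+(?:\.\d+)?)(em|px)\s*$")


def parse_length(value, default=0.0, px_per_em=16.0):
    """Return a CSS-like length in em, or `default` if it is invalid.

    Hypothetical helper: only `em` and `px` are understood here, and
    `px_per_em` is an assumed conversion factor.
    """
    if not isinstance(value, str):
        return default
    match = _LENGTH.match(value)
    if match is None:
        # Invalid specifications fall back instead of raising.
        return default
    number = float(match.group(1))
    return number if match.group(2) == "em" else number / px_per_em
```

Returning a default instead of raising means that a single malformed style attribute does not abort the conversion of the whole page.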
Published by AlbertWeichselbraun almost 4 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Correct handling of tail text in HTML comments
- fix: correctly handle HTML comments, which previously confused the HTML to text conversion (fixes #45).
- fix: updated unittests to correctly work with lxml in Ubuntu 22.04.
- add: updated and extended flake8 testing.
Published by AlbertWeichselbraun almost 4 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Support for custom HTML table separators and Python 3.10
- support custom HTML table separators (addresses #29).
- extended documentation on the command line client and added a link to the JOSS paper on inscriptis.
- officially support Python 3.10 and add it to the build pipeline.
- fixed dependency resolution for tox builds.
Published by AlbertWeichselbraun over 4 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Zenodo DOI and integrated feedback obtained through the Journal of Open Source Software review process
- improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
- the Inscriptis web service has been included into the Python package and can now be started with

```bash
export FLASK_APP="inscriptis.service.web"
python3 -m flask run
```
Published by AlbertWeichselbraun over 4 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Integrated feedback obtained through the Journal of Open Source Software review process
- improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
- the Inscriptis web service has been included into the Python package and can now be started with

```bash
export FLASK_APP="inscriptis.service.web"
python3 -m flask run
```
Published by AlbertWeichselbraun over 4 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved document model, parsing of borderline cases & HTML annotation support
Changes
HTML parsing:
- new: improved model for handling text blocks and lines
- chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
- chg: improved whitespace handling
- add: cover more borderline cases with unit tests
Inscriptis core:
- new: annotation support
- new: processing of annotation rules and annotation output
- new: type hints
- add: extended and improved documentation
Inscript command line client:
- new: added `--annotation-rules` option for annotation support.
- new: added `--post-processor` option to export and visualize annotations (HTML, XML and surface form export)
- chg: apply `--encoding` to Web URLs as well
Misc:
- chg: migrated to the semantic versioning schema described on https://semver.org/ for versioning.
Note
In terms of functionality, this release corresponds to Inscriptis 2.0rc2.
Published by AlbertWeichselbraun almost 5 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fixed annotations for borderline cases
Please refer to https://github.com/weblyzard/inscriptis/releases/tag/2.0rc1 for a list of all new features. This release candidate fixes the following issues in rc1:
- fixed annotations for some borderline cases
- improved documentation compared to 2.0rc2
Published by AlbertWeichselbraun almost 5 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved document model, parsing of borderline cases & HTML annotation support
HTML parsing:
- new: new model for handling blocks and lines
- chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
- chg: improved whitespace handling
- add: cover more borderline cases with unit tests
Inscriptis core:
- new: support for annotation rules and annotation output
- new: annotation post-processors (html, xml, surface form)
- new: type hints
- chg: extended and improved documentation
Inscript command line client:
- chg: apply `--encoding` to Web URLs as well
Published by AlbertWeichselbraun almost 5 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web -
- tables: add support for vertical (`valign`, `css: text-vertical-alignment`) and horizontal (`align`) cell alignment (fixes: #33)
- improved handling of HTML attributes and styles
- code cleanup
- migrated build from travis to github actions
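
Horizontal and vertical cell alignment in a plain-text table can be pictured with the following sketch. It does not reproduce the Inscriptis table renderer: the helper names `align_cell` and `valign_cell` and the accepted keyword values are assumptions made for this example.

```python
def align_cell(text, width, align="left"):
    """Pad a single cell line to `width` (hypothetical helper)."""
    if align == "right":
        return text.rjust(width)
    if align == "center":
        return text.center(width)
    return text.ljust(width)


def valign_cell(lines, height, valign="top"):
    """Pad a cell's lines with empty lines up to `height` rows."""
    missing = max(height - len(lines), 0)
    if valign == "bottom":
        above = missing
    elif valign == "middle":
        above = missing // 2
    else:
        above = 0
    return [""] * above + lines + [""] * (missing - above)
```

A cell with a single line in a row that is three lines high is then shifted down by two lines for `bottom` and by one line for `middle`.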
Published by AlbertWeichselbraun about 5 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved margin handling & more liberal licensing
- ignore top margins at the beginning of a document.
- more liberal licensing:
  - the license change has been triggered by another project that created a Java port of inscriptis.
  - to facilitate the free sharing of code and ideas between our two projects, we have (i) obtained the permission of all contributors for a license change, and (ii) changed the inscriptis license to the "Apache License 2.0".
Published by AlbertWeichselbraun over 5 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved testing and Python 3.9 support
- minor performance improvements and code optimizations
- added Python 3.9 test environment
- improved test coverage
- updated package metadata
- improved tox configuration
Published by AlbertWeichselbraun over 5 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved HTML rendering, command line client and Web service
- added support for rendering tags with the `white-space: pre` CSS attribute (e.g. `<pre>` which is often used for formatting code).
- API change: A `ParserConfig` object replaces the parameters `display_images`, `deduplicate_captions`, `display_links` and `indentation` in `get_text()` and for initializing the `Inscriptis` class.

```python
from lxml.html import fromstring
from inscriptis import Inscriptis
from inscriptis.model.config import ParserConfig

# example input; replace with the HTML document to convert
html = '<p>Hello <a href="https://example.org">world</a></p>'
htmltree = fromstring(html)

# optional parser configuration fine tuning
config = ParserConfig(display_links=True, display_anchors=True)
parser = Inscriptis(htmltree, config)
text = parser.get_text()
```

command line client:
- added option for displaying anchor links
- `--encoding` now sets the HTML and output encoding
- new `--version` option

Web service
- use the relaxed CSS profile per default
- added `version` call

Documentation fixes and improvements
Published by AlbertWeichselbraun about 6 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved performance and code structure, documentation and unit testing
- improved performance and code structure.
- use metadata published in `./inscriptis/__init__.py` for versioning and in setup.py.
- improved test coverage
- created sphinx API, usage and testing documentation which is published on https://inscriptis.readthedocs.org
- requires Python 3.5+ (dropped support for Python 2.7)
Published by AlbertWeichselbraun over 6 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Correct inscript.py default indentation strategy.
Use the extended indentation strategy per default as outlined in the README.md.
Published by AlbertWeichselbraun over 6 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved indentation and custom rendering styles
- improved indentation, if span and div tags are used
- support for custom rendering styles
- improved documentation
- use travis for auto CI
- requires Python 2.7+ or Python 3.5+ since lxml does not support Python 3 versions <3.5
Published by AlbertWeichselbraun over 6 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved table rendering (nested tables and line breaks in tables)
- Correctly handle nested tables and line breaks (e.g. due to enumerations, list or paragraph breaks) in tables.
- Improved content stripping.
Please take a look at the Rendering document for an overview of how Inscriptis renders different tables.
Published by AlbertWeichselbraun over 7 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Use the requests library for URL fetching
- use requests for URL fetching (this addresses #17 and prevents `403` responses with some Web servers).
Published by AlbertWeichselbraun over 7 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fixed handling of negative margins.
- correctly parse negative margins in CSS definitions.
- This fixes a bug that produced a very large number (>1000) of newlines between content on some pages.
Published by AlbertWeichselbraun over 7 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Use server encoding, if available in the inscript.py client.
This prevents encoding errors when using inscript.py for converting HTML pages to text.
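
The idea of preferring the encoding announced by the server can be sketched with the standard library alone. This is an illustration and not the client's actual code: the helper names `charset_from_content_type` and `decode_body`, as well as the UTF-8 fallback, are assumptions.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset from a Content-Type header value."""
    if not content_type:
        return default
    message = Message()
    message["Content-Type"] = content_type
    # Returns the lower-cased charset parameter, or `default` if absent.
    return message.get_content_charset(failobj=default)


def decode_body(body, content_type, default="utf-8"):
    """Decode response bytes using the server-declared encoding."""
    charset = charset_from_content_type(content_type, default)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back without raising.
        return body.decode(default, errors="replace")
```

For example, a body served with `text/html; charset=ISO-8859-1` is decoded as Latin-1 instead of being misread as UTF-8.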
Published by AlbertWeichselbraun over 7 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Decode HTML entities
Decode HTML entities such as `&Auml;`, `&Ouml;`, `&Uuml;` prior to returning the plain text version of the HTML page.
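
Entity decoding of this kind is available in the Python standard library. The snippet below is a generic illustration with a made-up sample string and does not show how Inscriptis performs the decoding internally.

```python
from html import unescape

# Named entities are replaced by the characters they stand for.
text = unescape("&Auml;rzte, &Ouml;l, &Uuml;bung")
```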
Published by AlbertWeichselbraun over 7 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved parsing and PyPI metadata
- improved handling of highly nested tables
- more comprehensive PyPI metadata
Published by AlbertWeichselbraun about 8 years ago
Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - flask web service and more reliable parsing
Changelog
- optional flask web service for converting html to text
- bug fixes
- allow infinitely nested lists
- fix a css parsing bug
- correctly handle empty documents
Published by AlbertWeichselbraun over 8 years ago