Recent Releases of Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved Annotation Postprocessors

  • Improved both the HTML and XML Annotation postprocessors.
  • Fixed #94 (XML tags are now closed in the correct order).
  • Breaking change:
    • the XmlAnnotationProcessor now uses a mandatory root element to ensure that the created XML is valid.
    • Impact:
      1. The created XML will contain a <content> root element, which contains all annotations.
      2. The name of the root element can be overridden by passing the optional root_element parameter to the annotation processor call.
  • Documentation fixes.
  • Updated dependencies and added pytest as a build dependency.
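
The effect of the mandatory root element and of the corrected closing order can be illustrated with a small standalone sketch. It does not use the Inscriptis code: the function `annotations_to_xml`, the `(start, end, label)` tuple layout and the requirement that annotations are properly nested are assumptions made for illustration only.

```python
from xml.sax.saxutils import escape


def annotations_to_xml(text, annotations, root_element="content"):
    """Wrap text in a root element and add one tag pair per annotation.

    Hypothetical helper: annotations are (start, end, label) tuples and
    are assumed to be properly nested (no partial overlaps).
    """
    opening, closing = {}, {}
    # Outer (longer) spans are opened first ...
    for start, end, label in sorted(annotations, key=lambda a: (a[0], -a[1])):
        opening.setdefault(start, []).append(f"<{label}>")
        # ... and therefore closed last: prepend, so inner tags close first.
        closing.setdefault(end, []).insert(0, f"</{label}>")

    parts = [f"<{root_element}>"]
    for pos in range(len(text) + 1):
        # Close tags ending here before opening tags that start here.
        parts.extend(closing.get(pos, []))
        parts.extend(opening.get(pos, []))
        if pos < len(text):
            parts.append(escape(text[pos]))
    parts.append(f"</{root_element}>")
    return "".join(parts)


result = annotations_to_xml(
    "Hello world", [(0, 11, "heading"), (6, 11, "emphasis")]
)
print(result)
# <content><heading>Hello <emphasis>world</emphasis></heading></content>
```

Without the root element, a text with two sibling annotations would yield two top-level elements, which is not a well-formed XML document.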

Scientific Software - Peer-reviewed - Python
Published by AlbertWeichselbraun about 1 year ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Allow for newer requests versions.

This is a maintenance release which fixes a regression in the package dependencies introduced in 2.5.1.

  • merged #91 to allow for newer requests versions.

Published by AlbertWeichselbraun over 1 year ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Python 3.13 support

  • added Python 3.13 to the build pipeline.
  • deprecated Python 3.8.
  • updated dependencies.
  • minor optimizations in the attribute handling.

Published by AlbertWeichselbraun over 1 year ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Custom HTML Handling and HTML engine improvements

  • add working support for specifying custom html tags (fixes #81)
  • improved html_engine.py
  • improved typing across all modules
  • added unittests for
    • inscript
    • inscriptis-api
  • documentation update

Published by AlbertWeichselbraun over 2 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fix documentation build and update publish script.

  • fix building documentation on readthedocs.org
  • update publish script

Published by AlbertWeichselbraun over 2 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Code cleanup, improved Web service and distribution

  • added official Python 3.12 support
  • Inscriptis command line client
    • renamed inscript.py to inscript and install client via pip
    • added --timeout argument.
  • Inscriptis Web service:
    • migrate the Web service to FastAPI and uvicorn
    • enable install as an extra using pip install inscriptis[web-service]
  • code cleanup
  • migrate to pyproject.toml and poetry for package distribution
  • use black for code formatting
  • improved tox config and code checks

Published by AlbertWeichselbraun over 2 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Official Python 3.11 support

Maintenance release adding Python 3.11 to the build pipeline.

Published by AlbertWeichselbraun over 3 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fixed handling of invalid length specifications

This is a bugfix release correcting the handling of invalid length specifications (bug #63).
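
What tolerant handling of a malformed length specification might look like is sketched below. This is not the Inscriptis implementation: the function name `length_to_em`, the regular expression, the fallback value and the conversion factor of 8 px per em are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: an optionally signed number followed by an optional unit.
LENGTH = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([a-z%]*)\s*$", re.IGNORECASE)


def length_to_em(spec: str, default: int = 0) -> int:
    """Convert a CSS-like length to whole em units.

    Invalid specifications such as "auto" or "1.2.3px" yield `default`
    instead of raising an exception.
    """
    match = LENGTH.match(spec)
    if match is None:
        return default
    value, unit = float(match.group(1)), match.group(2).lower()
    if unit in ("em", "rem"):
        return round(value)
    # Assumption: every other unit is treated like px, with 8 px per em.
    return round(value / 8)


print(length_to_em("2em"), length_to_em("16px"), length_to_em("auto"))
```

Returning a default rather than raising keeps a single malformed style attribute from aborting the conversion of an entire page.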

Published by AlbertWeichselbraun almost 4 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Correct handling of tail text in HTML comments

  • fix: correctly handle HTML comments, which previously confused the HTML to text conversion (fixes #45).
  • fix: updated unittests to correctly work with lxml in Ubuntu 22.04.
  • add: updated and extended flake8 testing.

Published by AlbertWeichselbraun almost 4 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Support for custom HTML table separators and Python 3.10

  • support custom HTML table separators (addresses #29).
  • extended documentation on the command line client and added a link to the JOSS paper on inscriptis.
  • officially support Python 3.10 and add it to the build pipeline.
  • fixed dependency resolution for tox builds.

Published by AlbertWeichselbraun over 4 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Zenodo DOI and integrated feedback obtained through the Journal of Open Source Software review process

  • improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
  • the Inscriptis web service has been included into the Python package and can now be started with:
      export FLASK_APP="inscriptis.service.web"
      python3 -m flask run

Published by AlbertWeichselbraun over 4 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Integrated feedback obtained through the Journal of Open Source Software review process

  • improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
  • the Inscriptis web service has been included into the Python package and can now be started with:
      export FLASK_APP="inscriptis.service.web"
      python3 -m flask run

Published by AlbertWeichselbraun over 4 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved document model, parsing of borderline cases & HTML annotation support

Changes

HTML parsing:

  • new: improved model for handling text blocks and lines
  • chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
  • chg: improved whitespace handling
  • add: cover more borderline cases with unit tests

Inscriptis core:

  • new: annotation support
  • new: processing of annotation rules and annotation output
  • new: type hints
  • add: extended and improved documentation

Inscript command line client:

  • new: added --annotation-rules option for annotation support.
  • new: added --post-processor option to export and visualize annotations (HTML, XML and surface form export)
  • chg: apply --encoding to Web URLs as well

Misc:

  • chg: migrated to the semantic versioning schema described on https://semver.org/ for versioning.

Note

In terms of functionality, this release corresponds to Inscriptis 2.0rc2.

Published by AlbertWeichselbraun almost 5 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fixed annotations for borderline cases

Please refer to https://github.com/weblyzard/inscriptis/releases/tag/2.0rc1 for a list of all new features. This release candidate fixes the following issues in rc1:

  • fixed annotations for some borderline cases
  • improved documentation compared to 2.0rc1

Published by AlbertWeichselbraun almost 5 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved document model, parsing of borderline cases & HTML annotation support

  1. HTML parsing:

    • new: new model for handling blocks and lines
    • chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
    • chg: improved whitespace handling
    • add: cover more borderline cases with unit tests
  2. Inscriptis core:

    • new: support for annotation rules and annotation output
    • new: annotation post-processors (html, xml, surface form)
    • new: type hints
    • chg: extended and improved documentation
  3. Inscript command line client:

    • chg: apply --encoding to Web URLs as well

Published by AlbertWeichselbraun almost 5 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

  • tables: add support for vertical (valign, css: vertical-align) and horizontal (align) cell alignment (fixes: #33)
  • improved handling of HTML attributes and styles
  • code cleanup
  • migrated build from travis to github actions
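
Cell alignment in a plain-text table amounts to padding each cell to the width and height of its column and row. The sketch below shows the idea with a hypothetical helper `align_cell`; its name, parameters and accepted values are assumptions and do not reflect the Inscriptis table model.

```python
def align_cell(lines, width, height, align="left", valign="top"):
    """Pad a cell's text lines to a width x height block of characters."""
    pad = {"left": str.ljust, "right": str.rjust, "center": str.center}[align]

    # Number of empty lines needed to reach the row height.
    missing = max(height - len(lines), 0)
    if valign == "top":
        above = 0
    elif valign == "bottom":
        above = missing
    else:  # "middle"
        above = missing // 2

    padded = [""] * above + list(lines) + [""] * (missing - above)
    return [pad(line, width) for line in padded]


for row in align_cell(["ab"], width=4, height=3, align="right", valign="bottom"):
    print(repr(row))
# '    '
# '    '
# '  ab'
```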

Published by AlbertWeichselbraun about 5 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved margin handling & more liberal licensing

  • ignore top margins at the beginning of a document.
  • more liberal licensing:
    • the license change has been triggered by another project that created a Java port of inscriptis.
    • to facilitate the free sharing of code and ideas between our two projects, we have (i) obtained the permission of all contributors for a license change, and (ii) changed the inscriptis license to the "Apache License 2.0".

Published by AlbertWeichselbraun over 5 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved testing and Python 3.9 support

  • minor performance improvements and code optimizations
  • added Python 3.9 test environment
  • improved test coverage
  • updated package metadata
  • improved tox configuration

Published by AlbertWeichselbraun over 5 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved HTML rendering, command line client and Web service

  1. added support for rendering tags with the white-space: pre CSS attribute (e.g. <pre> which is often used for formatting code).
  2. API change: A ParserConfig object replaces the parameters display_images, deduplicate_captions, display_links and indentation in get_text() and when initializing the Inscriptis class.

    ```python
    from lxml.html import fromstring
    from inscriptis.html_engine import Inscriptis
    from inscriptis.model.config import ParserConfig

    # `html` holds the HTML document to convert, as a string
    htmltree = fromstring(html)
    # optional fine-tuning of the parser configuration
    config = ParserConfig(display_links=True, display_anchors=True)
    parser = Inscriptis(htmltree, config)
    text = parser.get_text()
    ```

  3. command line client:

    • added option for displaying anchor links
    • --encoding now sets the HTML and output encoding
    • new --version option
  4. Web service

    • use the relaxed CSS profile by default
    • added version call
  5. Documentation fixes and improvements

Published by AlbertWeichselbraun about 6 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved performance and code structure, documentation and unit testing

  • improved performance and code structure.
  • use metadata published in ./inscriptis/__init__.py for versioning and in setup.py.
  • improved test coverage
  • created sphinx API, usage and testing documentation which is published on https://inscriptis.readthedocs.org
  • requires Python 3.5+ (dropped support for Python 2.7)

Published by AlbertWeichselbraun over 6 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Correct inscript.py default indentation strategy.

Use the extended indentation strategy by default, as outlined in the README.md.

Published by AlbertWeichselbraun over 6 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved indentation and custom rendering styles

  • improved indentation, if span and div tags are used
  • support for custom rendering styles
  • improved documentation
  • use travis for auto CI
  • requires Python 2.7+ or Python 3.5+ since lxml does not support Python 3 versions <3.5

Published by AlbertWeichselbraun over 6 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved table rendering (nested tables and line breaks in tables)

  • Correctly handle nested tables and line breaks (e.g. due to enumerations, list or paragraph breaks) in tables.
  • Improved content stripping.

Please take a look at the Rendering document for an overview of how Inscriptis renders different tables.

Published by AlbertWeichselbraun over 7 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Use the requests library for URL fetching

  • use requests for URL fetching (this addresses #17 and prevents 403 responses with some Web servers).

Published by AlbertWeichselbraun over 7 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Fixed handling of negative margins.

  • correctly parse negative margins in CSS definitions.
  • This fixes a bug that, on some pages, produced a very large number (>1000) of newlines between content blocks.

Published by AlbertWeichselbraun over 7 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Use server encoding, if available in the inscript.py client.

This prevents encoding errors when using inscript.py for converting HTML pages to text.
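
The idea of preferring the encoding announced by the server can be sketched with the standard library alone. The helper `charset_from_content_type` and its UTF-8 fallback are assumptions for illustration; the actual client may determine the encoding differently.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header, else `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


# Bytes as a server might send them, together with its Content-Type header.
body = "Zürich".encode("iso-8859-1")
encoding = charset_from_content_type("text/html; charset=ISO-8859-1")
print(encoding, body.decode(encoding))
# iso-8859-1 Zürich
```

Decoding the same bytes as UTF-8 would raise a `UnicodeDecodeError`, which is the kind of failure this release avoids.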

Published by AlbertWeichselbraun over 7 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Decode HTML entities

Decode HTML entities such as &Auml;, &Ouml; and &Uuml; prior to returning the plain text version of the HTML page.
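
For illustration, Python's standard library performs this kind of entity decoding with `html.unescape`. The snippet only demonstrates the effect; it is not a claim about how Inscriptis implements the decoding internally.

```python
from html import unescape

# Named entities are replaced by the characters they stand for.
text = unescape("Z&uuml;rich, &Ouml;sterreich &amp; &Auml;gypten")
print(text)
# Zürich, Österreich & Ägypten
```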

Published by AlbertWeichselbraun over 7 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - Improved parsing and PyPI metadata

  • improved handling of highly nested tables
  • more comprehensive PyPI metadata

Published by AlbertWeichselbraun about 8 years ago

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web - flask web service and more reliable parsing

Changelog

  1. optional flask web service for converting html to text
  2. bug fixes
    • allow infinitely nested lists
    • fix a css parsing bug
    • correctly handle empty documents

Published by AlbertWeichselbraun over 8 years ago