html2text-cleansing

https://github.com/bda-kts/html2text-cleansing

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: BDA-KTS
License: mit
Language: HTML
Default Branch: master
Size: 541 KB

Statistics

Stars: 0
Watchers: 0
Forks: 1
Open Issues: 1
Releases: 0

Created about 1 year ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License Citation

HTML to Text Cleansing

Description

When mining text from web pages, an important preprocessing step is to strip the respective HTML page of all structural elements (e.g., tags, scripts, styles) and extract the plain textual content. HTML Information Extraction is a Python-based toolkit that automates this process and provides a structured JSON object for each HTML file, containing key components like the title, description, and main text.

The resulting record gives a compact overview of each page's content. The toolkit offers two extraction methods, NewsPlease (aimed at news content) and Trafilatura (for general text extraction from HTML), so that different content types can be handled. Whether analyzing news articles, building text corpora, or conducting sentiment analysis, the tool serves as a preprocessing resource for social science researchers, computational linguists, and data scientists.

Use Cases

To study media bias in online news articles. The toolkit extracts the text from HTML files so that patterns of bias in reporting styles can be analyzed across different sources.
To investigate the evolution of public discourse on climate change. The textual data is collected as HTML files from various online news sources and extracted as plain text, facilitating sentiment analysis and topic modeling.
To build a corpus of web-scraped articles for training models on natural language understanding.

Input Data

Downloaded web pages as HTML files. Sample input files are in the html/ directory.

Output Data

html_json_trafilatura.json: A JSON file with the extracted main text and metadata.

A sample entry in the JSON output:

    {
        "title": "Sample News Article",
        "description": "A brief overview of the article content.",
        "main_text": "This is the main body of the article...",
        "filename": "sample.html"
    }
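A minimal sketch of reading this output for downstream analysis. It assumes the file holds a JSON array of entries shaped like the sample above; the README does not state the top-level structure, so treat that as an assumption. The helper name `load_entries` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_entries(path):
    """Load extracted entries, keeping only those with non-empty main text.

    Assumes the file contains a JSON array of objects with the keys shown
    in the sample entry (title, description, main_text, filename).
    """
    with Path(path).open(encoding="utf-8") as handle:
        entries = json.load(handle)
    # Entries whose main_text is missing, null, or whitespace are dropped.
    return [entry for entry in entries if (entry.get("main_text") or "").strip()]
```

Calling `load_entries("html_json_trafilatura.json")` would then return a list of dictionaries ready for tokenization, sentiment analysis, or topic modeling.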

Environment Setup

Python version: 3.6 or higher is required.
Install the dependencies listed in requirements.txt:

    pip install -r requirements.txt

Dependencies

NewsPlease: For extracting news content.
Trafilatura: For general text extraction from HTML.

How to Use

(a) Using NewsPlease

Place your HTML files in the html/ directory.
Run the script with the following command:

    python NewsPlease.py

Output:
- html_json_news.json: A JSON file containing extracted news information (title, description, main text).
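The extraction step behind this script can be sketched roughly as follows. This is an illustration, not the repository's NewsPlease.py: it assumes the news-please package's `NewsPlease.from_html` call and the `title`, `description`, and `maintext` attributes of the article it returns. The helper names `article_to_record` and `extract_news`, and the choice of output keys, are hypothetical.

```python
def article_to_record(article, filename):
    """Map an article object onto the keys used in the sample JSON entry."""
    return {
        "title": getattr(article, "title", None),
        "description": getattr(article, "description", None),
        # news-please names the body attribute `maintext` (no underscore).
        "main_text": getattr(article, "maintext", None),
        "filename": filename,
    }


def extract_news(html, filename):
    """Parse one HTML string with news-please and return a record."""
    # Imported lazily so the mapping helper works without the dependency.
    from newsplease import NewsPlease

    article = NewsPlease.from_html(html, url=None)
    return article_to_record(article, filename)
```

Keeping the mapping separate from the library call makes it easy to check the record shape without having news-please installed.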

(b) Using Trafilatura

Place your HTML files in the html/ directory.
Execute the script with:

    python Trafilatura.py

Output:
- html_json_trafilatura.json: A JSON file with the extracted main text and metadata.
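A comparable sketch for the Trafilatura route, again as an illustration rather than the repository's actual Trafilatura.py. It assumes trafilatura's `extract` (main text, or `None` when nothing is found) and `extract_metadata` (an object exposing `title` and `description`, or `None`). The helper names `build_record` and `extract_with_trafilatura` are hypothetical.

```python
def build_record(text, metadata, filename):
    """Combine main text and metadata into one entry of the JSON output."""
    return {
        # getattr tolerates metadata being None or lacking a field.
        "title": getattr(metadata, "title", None),
        "description": getattr(metadata, "description", None),
        "main_text": text,
        "filename": filename,
    }


def extract_with_trafilatura(html, filename):
    """Extract main text and metadata from one HTML string."""
    # Imported lazily so build_record can be used without the dependency.
    import trafilatura

    text = trafilatura.extract(html)
    metadata = trafilatura.extract_metadata(html)
    return build_record(text, metadata, filename)
```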

Error Handling

- Both scripts include basic error handling to skip files that cannot be processed.
- If issues arise, error messages are printed and the script continues processing the remaining files.
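The skip-and-continue behaviour described here can be sketched as a loop that catches failures per file. This is a hedged illustration under assumptions, not the scripts' actual code: `process_directory` is a hypothetical name, `extract` stands for any callable that takes an HTML string and a file name and returns a dictionary, and the output is assumed to be written as a JSON array.

```python
import json
from pathlib import Path


def process_directory(html_dir, extract, output_path):
    """Run `extract` on every .html file, skipping files that fail."""
    records = []
    for path in sorted(Path(html_dir).glob("*.html")):
        try:
            # Undecodable bytes are replaced rather than aborting the file.
            html = path.read_text(encoding="utf-8", errors="replace")
            records.append(extract(html, path.name))
        except Exception as error:  # broad on purpose: one bad file must not stop the run
            print(f"Skipping {path.name}: {error}")
    Path(output_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records
```

Catching the exception inside the loop, rather than around it, is what lets the remaining files be processed after a failure.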

Acknowledgements

Special thanks to the developers of NewsPlease and Trafilatura for providing powerful libraries that make HTML content extraction straightforward and efficient. If you use this tool for research purposes, please consider citing the respective libraries:
- NewsPlease
- Trafilatura

Disclaimer

This tool is provided as-is for academic and research purposes. Users are responsible for ensuring compliance with applicable laws and website terms of service when scraping content.

Contact Details

For further queries, please contact Po-Chun.Chang@gesis.org

Owner

Name: BDA-KTS
Login: BDA-KTS
Kind: organization

Repositories: 1
Profile: https://github.com/BDA-KTS

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Chang
    given-names: Po-Chun
    orcid: https://orcid.org/0009-0002-9371-8582
title: "html2text-cleansing"
version: 2.0.4
identifiers:
  - type: doi
    value: 
date-released: 2025-06-29

GitHub Events

Total

Issue comment event: 4
Push event: 6
Pull request event: 3
Fork event: 1
Create event: 2

Last Year

Issue comment event: 4
Push event: 6
Pull request event: 3
Fork event: 1
Create event: 2

Dependencies

requirements.txt pypi
