html2text-cleansing
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: BDA-KTS
- License: mit
- Language: HTML
- Default Branch: master
- Size: 541 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
HTML to Text Cleansing
Description
When mining text from web pages, an important preprocessing step is to strip the respective HTML page of all structural elements (e.g., tags, scripts, styles) and extract the plain textual content. HTML Information Extraction is a Python-based toolkit that automates this process and provides a structured JSON object for each HTML file, containing key components like the title, description, and main text.
By distilling the essence of a web page into a concise and meaningful representation, this toolkit offers a brief yet informative overview of the story or context behind the content. It employs two extraction methods—NewsPlease and Trafilatura—to ensure flexibility and accuracy across diverse content types. Whether analyzing news articles, building text corpora, or conducting sentiment analysis, this tool serves as a powerful preprocessing resource for social science researchers, computational linguists, and data scientists.
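To make the preprocessing step concrete, here is a minimal sketch that uses only Python's standard library. It is not the toolkit's own method (the toolkit relies on NewsPlease and Trafilatura); the names `TextExtractor` and `html_to_record` are hypothetical and exist only for this illustration. It drops `<script>` and `<style>` content and returns a record with the same fields as the toolkit's output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the title, meta description, and visible text of a page."""

    SKIPPED = {"script", "style"}  # elements whose content is never shown

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self.description = ""
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        chunk = data.strip()
        if not chunk:
            return
        if self._in_title:
            self.title_parts.append(chunk)
        else:
            self.text_parts.append(chunk)


def html_to_record(html, filename):
    """Return a dict shaped like one entry of the toolkit's JSON output."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "description": parser.description,
        "main_text": " ".join(parser.text_parts),
        "filename": filename,
    }
```

A naive parser like this keeps navigation menus, footers, and other page furniture in the text, which is the problem dedicated extractors such as NewsPlease and Trafilatura are designed to handle.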
Use Cases
- To study media bias in online news articles. The toolkit extracts text from HTML files so that patterns of bias in reporting styles can be compared across sources.
- To investigate the evolution of public discourse on climate change. Textual data collected as HTML files from various online news sources can be extracted for sentiment analysis and topic modeling.
- To build a corpus of web-scraped articles for training models on natural language understanding.
Input Data
Downloaded web pages as HTML files. Sample input files are in the html/ directory.
Output Data
html_json_trafilatura.json: A JSON file with the extracted main text and metadata.
A sample entry in the JSON output:
```json
{
  "title": "Sample News Article",
  "description": "A brief overview of the article content.",
  "main_text": "This is the main body of the article...",
  "filename": "sample.html"
}
```
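For downstream analysis, the output file can be read back with the standard `json` module. The sketch below assumes the file holds a list of entries shaped like the sample above; the README does not state the top-level structure, so a single object is also accepted. The helper names `load_records` and `non_empty_records` are hypothetical.

```python
import json
from pathlib import Path


def load_records(path):
    """Load extracted entries; wrap a single object in a list (assumption)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        data = [data]
    return data


def non_empty_records(records):
    """Keep only entries whose main_text is present and not blank."""
    return [r for r in records if (r.get("main_text") or "").strip()]
```

For example, `non_empty_records(load_records("html_json_trafilatura.json"))` would give the entries that are usable for text analysis.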
Environment Setup
- Python version: 3.6 or higher is required.
- Install the dependencies listed in requirements.txt:
```bash
pip install -r requirements.txt
```
Dependencies
- NewsPlease: For extracting news content.
- Trafilatura: For general text extraction from HTML.
How to Use
(a) Using NewsPlease
- Place your HTML files in the html/ directory.
- Run the script with the following command:
```bash
python NewsPlease.py
```
- Output: html_json_news.json, a JSON file containing the extracted news information (title, description, main text).
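For readers who want to see what the extraction looks like for a single file, the following is a minimal sketch built on news-please's `NewsPlease.from_html` call. It is not the contents of NewsPlease.py, which may differ; the function name `extract_with_newsplease` is hypothetical, and the mapping of article attributes to output fields is an assumption based on the sample entry.

```python
def extract_with_newsplease(html, filename):
    """Extract one record from an HTML string using news-please (sketch)."""
    # Third-party package, installed via requirements.txt.
    from newsplease import NewsPlease

    # The original URL is unknown for a locally stored file.
    article = NewsPlease.from_html(html, url=None)
    return {
        "title": article.title,
        "description": article.description,
        "main_text": article.maintext,
        "filename": filename,
    }
```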
(b) Using Trafilatura
- Place your HTML files in the html/ directory.
- Execute the script with:
```bash
python Trafilatura.py
```
- Output: html_json_trafilatura.json, a JSON file with the extracted main text and metadata.
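As a rough illustration of the per-file step, the sketch below uses Trafilatura's `extract` and `extract_metadata` functions. It is not the contents of Trafilatura.py; the function name `extract_with_trafilatura` is hypothetical, and reading `title` and `description` as attributes of the metadata result assumes a recent Trafilatura release.

```python
def extract_with_trafilatura(html, filename):
    """Extract one record from an HTML string using Trafilatura (sketch)."""
    # Third-party package, installed via requirements.txt.
    import trafilatura

    # extract() returns None when no main text can be identified.
    main_text = trafilatura.extract(html)
    metadata = trafilatura.extract_metadata(html)
    return {
        # getattr with a default tolerates a missing metadata result.
        "title": getattr(metadata, "title", None),
        "description": getattr(metadata, "description", None),
        "main_text": main_text,
        "filename": filename,
    }
```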
Error Handling
- Both scripts include basic error handling to skip files that cannot be processed.
- If issues arise, error messages are printed and the script continues processing the remaining files.
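The skip-and-continue behaviour could be sketched as follows. This is an illustration under assumptions, not the scripts' actual code: `process_directory` and its `extract` parameter (any callable that turns an HTML string into a dict) are hypothetical names, and the `*.html` file pattern is assumed.

```python
import json
from pathlib import Path


def process_directory(html_dir, extract, output_path):
    """Run extract() on every HTML file, skipping files that fail."""
    records = []
    for path in sorted(Path(html_dir).glob("*.html")):
        try:
            html = path.read_text(encoding="utf-8", errors="replace")
            record = extract(html)
            record["filename"] = path.name
            records.append(record)
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            print(f"Skipping {path.name}: {exc}")
            continue
    Path(output_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records
```

Catching a broad `Exception` is a deliberate trade-off here: it keeps a long batch run alive, at the cost of hiding the difference between a malformed file and a genuine bug, so the printed messages are worth reviewing.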
Acknowledgements
Special thanks to the developers of NewsPlease and Trafilatura for providing powerful libraries that make HTML content extraction straightforward and efficient. If you use this tool for research purposes, please consider citing the respective libraries:
- NewsPlease
- Trafilatura
Disclaimer
This tool is provided as-is for academic and research purposes. Users are responsible for ensuring compliance with applicable laws and website terms of service when scraping content.
Contact Details
For further queries, please contact Po-Chun.Chang@gesis.org
Owner
- Name: BDA-KTS
- Login: BDA-KTS
- Kind: organization
- Repositories: 1
- Profile: https://github.com/BDA-KTS
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Chang
    given-names: Po-Chun
    orcid: https://orcid.org/0009-0002-9371-8582
title: "html2text-cleansing"
version: 2.0.4
identifiers:
  - type: doi
    value:
date-released: 2025-06-29
GitHub Events
Total
- Issue comment event: 4
- Push event: 6
- Pull request event: 3
- Fork event: 1
- Create event: 2
Last Year
- Issue comment event: 4
- Push event: 6
- Pull request event: 3
- Fork event: 1
- Create event: 2