news-watch
news-watch: Indonesia's top news websites scraper
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary
Repository
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 10
Metadata Files
README.md
news-watch: Indonesia's top news websites scraper
news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research.
⚠️ Ethical Considerations & Disclaimer ⚠️
Purpose: For educational and research purposes only. Not designed for commercial use that could be detrimental to news source providers.
User Responsibility: Users must comply with each website's Terms of Service and robots.txt. Aggressive scraping may lead to IP blocking. Scrape responsibly and respect server limitations.
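As a rough illustration of the robots.txt point above, the sketch below uses the standard library's `urllib.robotparser` to check whether a URL may be fetched and what crawl delay a site requests. It is not part of news-watch. The robots.txt content, the `example.com` URLs, and the helper name `load_rules` are all hypothetical, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real news site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Article pages are not covered by the Disallow rule, search pages are.
allowed_article = rules.can_fetch("*", "https://example.com/news/article-1")
allowed_search = rules.can_fetch("*", "https://example.com/search?q=ihsg")

# Seconds the site asks crawlers to wait between requests, if declared.
delay = rules.crawl_delay("*")

print(allowed_article, allowed_search, delay)
```

Against a live site you would call `set_url(...)` and `read()` on the parser instead of passing the text in directly.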
Installation
```bash
pip install news-watch
playwright install chromium

# Development version
pip install git+https://github.com/okkymabruri/news-watch.git@dev
```
Performance Notes
⚠️ Works best locally. Cloud environments (Google Colab, servers) may experience degraded performance or blocking due to anti-bot measures.
Usage
To run the scraper from the command line:
```bash
newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v
```
Command-Line Arguments
| Argument | Description |
|----------|-------------|
| -k, --keywords | Required. Comma-separated keywords to scrape (e.g., "ojk,bank,npl") |
| -sd, --start_date | Required. Start date in YYYY-MM-DD format (e.g., 2025-01-01) |
| -s, --scrapers | Scrapers to use: specific names (e.g., "kompas,viva"), "auto" (default, platform-appropriate), or "all" (force all, may fail) |
| -of, --output_format | Output format: csv or xlsx (default: xlsx) |
| -v, --verbose | Show detailed logging output (default: silent) |
| --list_scrapers | List all supported scrapers and exit |
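To make the argument table concrete, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is an illustration only, not the package's actual CLI code: the helper names `split_csv`, `parse_date`, and `build_parser`, and the program name `newswatch-sketch`, are assumptions. `--list_scrapers` is left out to keep the sketch short.

```python
import argparse
from datetime import date, datetime


def split_csv(value: str) -> list[str]:
    """Split a comma-separated string such as "ojk,bank,npl" into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError on a bad format."""
    return datetime.strptime(value, "%Y-%m-%d").date()


def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults follow the table above; everything else is assumed.
    parser = argparse.ArgumentParser(prog="newswatch-sketch")
    parser.add_argument("-k", "--keywords", type=split_csv, required=True)
    parser.add_argument("-sd", "--start_date", type=parse_date, required=True)
    parser.add_argument("-s", "--scrapers", default="auto")
    parser.add_argument("-of", "--output_format", choices=["csv", "xlsx"], default="xlsx")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


args = build_parser().parse_args(
    ["-k", "ojk,bank,npl", "-sd", "2025-01-01", "-of", "csv"]
)
print(args.keywords, args.start_date, args.scrapers, args.output_format, args.verbose)
```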
Examples
```bash
# Basic usage
newswatch --keywords ihsg --start_date 2025-01-01

# Multiple keywords with specific scraper
newswatch -k "ihsg,bank" -s "detik" --output_format xlsx -v

# List available scrapers
newswatch --list_scrapers
```
Python API Usage
```python
import newswatch as nw

# Basic scraping - returns list of article dictionaries
articles = nw.scrape("ekonomi,politik", "2025-01-01")
print(f"Found {len(articles)} articles")

# Get results as pandas DataFrame for analysis
df = nw.scrape_to_dataframe("teknologi,startup", "2025-01-01")
print(df['source'].value_counts())

# Save directly to file
nw.scrape_to_file(
    keywords="bank,ihsg",
    start_date="2025-01-01",
    output_path="financial_news.xlsx"
)

# Quick recent news
recent_news = nw.quick_scrape("politik", days_back=3)

# Get available news sources
sources = nw.list_scrapers()
print("Available sources:", sources)
```
See the comprehensive guide for detailed usage examples and advanced patterns. For interactive examples, see the API reference notebook.
Run on Google Colab
You can run news-watch on Google Colab.
Output
The scraped articles are saved as a CSV or XLSX file in the current working directory with the format news-watch-{keywords}-YYYYMMDD_HH.
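The naming pattern can be sketched as follows. This is an assumption-laden illustration, not the package's code: it treats `HH` as the 24-hour clock hour, inserts the keyword string unchanged, and appends the output format as the file extension. The function name `build_output_name` is hypothetical.

```python
from datetime import datetime


def build_output_name(keywords: str, output_format: str, now: datetime) -> str:
    """Build a name like news-watch-{keywords}-YYYYMMDD_HH.{ext} (assumed layout)."""
    stamp = now.strftime("%Y%m%d_%H")  # date plus 24-hour clock hour
    return f"news-watch-{keywords}-{stamp}.{output_format}"


example_name = build_output_name("ihsg", "xlsx", datetime(2025, 1, 1, 9, 30))
print(example_name)
```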
The output file contains the following columns:
- title
- publish_date
- author
- content
- keyword
- category
- source
- link
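As a sketch of how a CSV export with these columns might be consumed, the example below validates the header and counts articles per source using only the standard library. The rows are placeholders invented for illustration, not real scraped articles, and the `publish_date` format shown is an assumption. The helper name `load_articles` is hypothetical.

```python
import csv
import io
from collections import Counter

# Column names as listed above.
EXPECTED_COLUMNS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]

# Placeholder rows for illustration only (not real scraped data).
SAMPLE_CSV = """\
title,publish_date,author,content,keyword,category,source,link
Placeholder A,2025-01-02 08:00:00,Author A,Body text A,ihsg,market,detik,https://example.com/a
Placeholder B,2025-01-03 09:30:00,Author B,Body text B,bank,finance,kompas,https://example.com/b
Placeholder C,2025-01-03 10:15:00,Author C,Body text C,ihsg,market,detik,https://example.com/c
"""


def load_articles(handle) -> list[dict]:
    """Read article rows and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


articles = load_articles(io.StringIO(SAMPLE_CSV))
per_source = Counter(row["source"] for row in articles)
print(per_source.most_common())
```

For a real export, replace the `io.StringIO` handle with `open(path, newline="", encoding="utf-8")`.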
Supported Websites
- Antaranews.com
- Bisnis.com
- Bloomberg Technoz
- CNBC Indonesia
- Detik.com
- Jawapos.com
- Katadata.co.id
- Kompas.com
- Kontan.co.id
- Media Indonesia
- Metrotvnews.com
- Okezone.com
- Tempo.co
- Viva.co.id
Note:
- On Linux platforms: Kontan, Jawapos, Katadata are automatically excluded due to compatibility issues. Use -s all to force (may cause errors)
- Limitation: Kontan scraper maximum 50 pages
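The platform rule in the note can be sketched as below. This is a guess at the selection logic described for the `auto` and `all` modes, not the package's implementation. The scraper list is a shortened, partly assumed set of names (only `kompas`, `viva`, and `detik` appear as scraper names in this README), and the function name `select_scrapers` is hypothetical.

```python
import platform
from typing import List, Optional

# Shortened list for illustration; the lowercase names for Kontan, Jawapos,
# and Katadata are assumptions derived from the site names.
ALL_SCRAPERS = ["kompas", "viva", "detik", "kontan", "jawapos", "katadata"]
LINUX_EXCLUDED = {"kontan", "jawapos", "katadata"}


def select_scrapers(mode: str, system: Optional[str] = None) -> List[str]:
    """Resolve "auto", "all", or a comma-separated list into scraper names."""
    system = system or platform.system()
    if mode == "all":
        # Force every scraper, even ones known to be problematic on this OS.
        return list(ALL_SCRAPERS)
    if mode == "auto":
        if system == "Linux":
            return [name for name in ALL_SCRAPERS if name not in LINUX_EXCLUDED]
        return list(ALL_SCRAPERS)
    requested = [name.strip() for name in mode.split(",") if name.strip()]
    unknown = [name for name in requested if name not in ALL_SCRAPERS]
    if unknown:
        raise ValueError(f"unknown scrapers: {unknown}")
    return requested


print(select_scrapers("auto", system="Linux"))
print(select_scrapers("all", system="Linux"))
```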
Contributing
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
License
This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.
Citation
```bibtex
@software{mabruri_newswatch,
  author = {Okky Mabruri},
  title = {news-watch},
  year = {2025},
  doi = {10.5281/zenodo.14908389}
}
```
Related Work
Owner
- Name: Okky Mabruri
- Login: okkymabruri
- Kind: user
- Location: Indonesia
- Company: @iData1011
- Website: okkymabruri.github.io
- Twitter: okkymbrur
- Repositories: 2
- Profile: https://github.com/okkymabruri
commit to pushing 1%
Citation (CITATION.cff)
cff-version: 1.2.0
title: "news-watch"
message: 'If you use this software, please cite it as below.'
authors:
  - name: "Okky Mabruri"
    website: "okkymabruri.github.io"
    orcid: "https://orcid.org/0000-0002-2103-1311"
version: 0.2.5
abstract: "news-watch: Indonesia's top news websites scraper. A Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research."
publisher: Zenodo
doi: "10.5281/zenodo.14908389"
license: MIT
license-url: "https://github.com/okkymabruri/news-watch/blob/main/LICENSE"
repository-code: "https://github.com/okkymabruri/news-watch"
keywords:
  - scraping
  - news
  - indonesian-news
  - newswatch
  - newsscraper
  - berita
  - scraping-berita
type: software
contact:
  - name: Okky Mabruri
    email: okkymbrur@gmail.com
GitHub Events
Total
- Release event: 7
- Watch event: 2
- Delete event: 3
- Public event: 1
- Push event: 45
- Pull request event: 15
- Create event: 10
Last Year
- Release event: 7
- Watch event: 2
- Delete event: 3
- Public event: 1
- Push event: 45
- Pull request event: 15
- Create event: 10
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.11
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.11
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
Pull Request Authors
- okkymabruri (8)
- dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - pypi: 107 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 13
- Total maintainers: 1
pypi.org: news-watch
news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research.
- Homepage: https://github.com/okkymabruri/news-watch
- Documentation: https://news-watch.readthedocs.io/
- License: MIT License
- Latest release: 0.3.0 (published 7 months ago)