news-watch

news-watch: Indonesia's top news websites scraper

https://github.com/okkymabruri/news-watch

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary

Keywords

berita indonesian-news news newsscraper newswatch scraping scraping-berita
Last synced: 6 months ago

Repository

news-watch: Indonesia's top news websites scraper

Basic Info
  • Host: GitHub
  • Owner: okkymabruri
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 231 KB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 10
Topics
berita indonesian-news news newsscraper newswatch scraping scraping-berita
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License Citation

README.md

news-watch: Indonesia's top news websites scraper


news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research.

⚠️ Ethical Considerations & Disclaimer ⚠️

Purpose: For educational and research purposes only. Not designed for commercial use that could be detrimental to news source providers.

User Responsibility: Users must comply with each website's Terms of Service and robots.txt. Aggressive scraping may lead to IP blocking. Scrape responsibly and respect server limitations.
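As a minimal sketch of the robots.txt check mentioned above, the standard-library `urllib.robotparser` module can evaluate a site's rules before a URL is requested. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are illustrative assumptions; they are not part of news-watch.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/news/article-1"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/search?q=ihsg"))
```

For a live site, the rules would first have to be downloaded from that site's `/robots.txt`; the Terms of Service still need to be read separately.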

Installation

```bash
pip install news-watch
playwright install chromium

# Development version
pip install git+https://github.com/okkymabruri/news-watch.git@dev
```

Performance Notes

⚠️ Works best locally. Cloud environments (Google Colab, servers) may experience degraded performance or blocking due to anti-bot measures.

Usage

To run the scraper from the command line:

```bash
newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v
```

Command-Line Arguments

| Argument | Description |
|----------|-------------|
| -k, --keywords | Required. Comma-separated keywords to scrape (e.g., "ojk,bank,npl") |
| -sd, --start_date | Required. Start date in YYYY-MM-DD format (e.g., 2025-01-01) |
| -s, --scrapers | Scrapers to use: specific names (e.g., "kompas,viva"), "auto" (default, platform-appropriate), or "all" (force all, may fail) |
| -of, --output_format | Output format: csv or xlsx (default: xlsx) |
| -v, --verbose | Show detailed logging output (default: silent) |
| --list_scrapers | List all supported scrapers and exit |

Examples

```bash
# Basic usage
newswatch --keywords ihsg --start_date 2025-01-01

# Multiple keywords with specific scraper
newswatch -k "ihsg,bank" -s "detik" --output_format xlsx -v

# List available scrapers
newswatch --list_scrapers
```
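When the CLI is driven from a script, it can help to assemble the argument list in one place and validate it first. The sketch below uses only the flags documented in the table above; the helper name `build_newswatch_command` is hypothetical and not part of the package.

```python
import shlex
from datetime import datetime


def build_newswatch_command(keywords, start_date, scrapers="auto",
                            output_format="xlsx", verbose=False):
    """Build an argv list for the newswatch CLI (hypothetical helper)."""
    if output_format not in ("csv", "xlsx"):
        raise ValueError("output_format must be 'csv' or 'xlsx'")
    # Raises ValueError if start_date cannot be parsed as a YYYY-MM-DD date.
    datetime.strptime(start_date, "%Y-%m-%d")

    command = [
        "newswatch",
        "-k", ",".join(keywords),
        "-sd", start_date,
        "-s", scrapers,
        "-of", output_format,
    ]
    if verbose:
        command.append("-v")
    return command


command = build_newswatch_command(
    ["ihsg", "bank"], "2025-01-01", scrapers="detik", verbose=True
)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

With news-watch installed, the resulting list could be passed to `subprocess.run(command, check=True)`.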

Python API Usage

```python
import newswatch as nw

# Basic scraping - returns list of article dictionaries
articles = nw.scrape("ekonomi,politik", "2025-01-01")
print(f"Found {len(articles)} articles")

# Get results as pandas DataFrame for analysis
df = nw.scrape_to_dataframe("teknologi,startup", "2025-01-01")
print(df['source'].value_counts())

# Save directly to file
nw.scrape_to_file(
    keywords="bank,ihsg",
    start_date="2025-01-01",
    output_path="financial_news.xlsx"
)

# Quick recent news
recent_news = nw.quick_scrape("politik", days_back=3)

# Get available news sources
sources = nw.list_scrapers()
print("Available sources:", sources)
```

See the comprehensive guide for detailed usage examples and advanced patterns. For interactive examples, see the API reference notebook.
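A common post-processing step on a list of article dictionaries is removing repeated entries. The sketch below assumes each dictionary uses the column names listed under Output (in particular `link`) and that the same article may be returned for more than one keyword; both are assumptions, and the sample records are made up.

```python
def deduplicate_by_link(articles):
    """Keep the first article seen for each link, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        link = article.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(article)
    return unique


# Made-up records standing in for scraped results.
sample_articles = [
    {"title": "Headline A", "keyword": "bank", "link": "https://example.com/a"},
    {"title": "Headline A", "keyword": "ihsg", "link": "https://example.com/a"},
    {"title": "Headline B", "keyword": "ihsg", "link": "https://example.com/b"},
]

unique_articles = deduplicate_by_link(sample_articles)
print(f"{len(sample_articles)} scraped, {len(unique_articles)} unique")
```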

Run on Google Colab

You can run news-watch on Google Colab.

Output

Scraped articles are saved as a CSV or XLSX file in the current working directory, named in the format news-watch-{keywords}-YYYYMMDD_HH.

The output file contains the following columns:

  • title
  • publish_date
  • author
  • content
  • keyword
  • category
  • source
  • link
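A CSV output with the columns above can be summarised using only the standard library. This is a minimal sketch: the sample rows, the `publish_date` format, and the helper name `count_by_source` are illustrative assumptions, and an in-memory string stands in for a real output file.

```python
import csv
import io
from collections import Counter

# Made-up rows using the documented column names.
SAMPLE_CSV = """\
title,publish_date,author,content,keyword,category,source,link
Headline A,2025-01-02 08:00:00,Reporter 1,Body text A,ihsg,market,detik,https://example.com/a
Headline B,2025-01-03 09:30:00,Reporter 2,Body text B,bank,finance,kompas,https://example.com/b
Headline C,2025-01-03 11:15:00,Reporter 3,Body text C,ihsg,market,detik,https://example.com/c
"""


def count_by_source(csv_file):
    """Count articles per news source in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["source"] for row in reader)


counts = count_by_source(io.StringIO(SAMPLE_CSV))
for source, total in counts.most_common():
    print(source, total)
```

For a real file, `open(path, newline="", encoding="utf-8")` would replace the `io.StringIO` object.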

Supported Websites

Note:

  • On Linux platforms: Kontan, Jawapos, Katadata are automatically excluded due to compatibility issues. Use -s all to force (may cause errors).
  • Limitation: the Kontan scraper is limited to a maximum of 50 pages.
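The platform-dependent behaviour of "auto" versus "all" can be pictured with the sketch below. It is an illustration of the rule described in the note, not the package's actual code: the function name `resolve_scrapers`, the lowercase scraper names, and the handling of unknown names are all assumptions.

```python
import platform

# Scrapers the note says are skipped on Linux (lowercase names assumed).
LINUX_EXCLUDED = {"kontan", "jawapos", "katadata"}


def resolve_scrapers(available, selection="auto", system=None):
    """Pick scrapers for a run, mimicking the documented auto/all behaviour."""
    system = system or platform.system()
    if selection == "all":
        # Force every scraper, even ones known to be problematic.
        return list(available)
    if selection == "auto":
        if system == "Linux":
            return [name for name in available if name not in LINUX_EXCLUDED]
        return list(available)
    # Otherwise treat the selection as comma-separated scraper names.
    requested = [name.strip() for name in selection.split(",") if name.strip()]
    return [name for name in requested if name in available]


available = ["detik", "kompas", "kontan", "katadata"]
print(resolve_scrapers(available, "auto", system="Linux"))
print(resolve_scrapers(available, "all", system="Linux"))
```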

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.

Citation


```bibtex
@software{mabruri_newswatch,
  author = {Okky Mabruri},
  title = {news-watch},
  year = {2025},
  doi = {10.5281/zenodo.14908389}
}
```

Related Work

Owner

  • Name: Okky Mabruri
  • Login: okkymabruri
  • Kind: user
  • Location: Indonesia
  • Company: @iData1011

commit to pushing 1%

Citation (CITATION.cff)

cff-version: 1.2.0
title: "news-watch"
message: 'If you use this software, please cite it as below.'
authors:
  - name: "Okky Mabruri"
    website: "okkymabruri.github.io"
    orcid: "https://orcid.org/0000-0002-2103-1311"
version: 0.2.5
abstract: "news-watch: Indonesia's top news websites scraper. A Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research."
publisher: Zenodo
doi: "10.5281/zenodo.14908389"
license: MIT
license-url: "https://github.com/okkymabruri/news-watch/blob/main/LICENSE"
repository-code: "https://github.com/okkymabruri/news-watch"
keywords:
  - scraping
  - news
  - indonesian-news
  - newswatch
  - newsscraper
  - berita
  - scraping-berita
type: software
contact:
  - name: Okky Mabruri
    email: okkymbrur@gmail.com

GitHub Events

Total
  • Release event: 7
  • Watch event: 2
  • Delete event: 3
  • Public event: 1
  • Push event: 45
  • Pull request event: 15
  • Create event: 10
Last Year
  • Release event: 7
  • Watch event: 2
  • Delete event: 3
  • Public event: 1
  • Push event: 45
  • Pull request event: 15
  • Create event: 10

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.11
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.11
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
Pull Request Authors
  • okkymabruri (8)
  • dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1) python (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 107 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 13
  • Total maintainers: 1
pypi.org: news-watch

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research.

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 107 Last month
Rankings
Dependent packages count: 10.1%
Average: 33.5%
Dependent repos count: 56.9%
Maintainers (1)
Last synced: 6 months ago