collectress

Collectress (/kəˈlɛktɹɪs/) is a Python tool designed for downloading web data feeds periodically and consistently.

https://github.com/stratosphereips/collectress

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.0%) to scientific vocabulary

Keywords

feeds feeds-downloader threat-intelligence web-downloader

Repository

Basic Info
  • Host: GitHub
  • Owner: stratosphereips
  • License: gpl-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 111 KB
Statistics
  • Stars: 5
  • Watchers: 3
  • Forks: 0
  • Open Issues: 1
  • Releases: 3
Created over 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme Contributing License Code of conduct Citation Security

README.md

Collectress is a Python tool designed for downloading web data feeds periodically and consistently. The feeds to download are specified in a YAML feed file. Each feed gets its own directory, and downloaded content is stored in subdirectories named after the current date.
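
As an illustration of the layout described above, the following minimal sketch builds an output path for one feed. The function name `feed_output_dir`, the feed name `example_feed`, and the exact nesting order (work directory, then feed, then date) are assumptions made for this example and are not taken from the Collectress source.

```python
from datetime import date
from pathlib import Path


def feed_output_dir(workdir, feed_name, day=None):
    """Return <workdir>/<feed_name>/YYYY/MM/DD for the given day.

    Hypothetical helper: the nesting order is an assumption based on the
    README description, not a copy of the tool's own code.
    """
    day = day or date.today()
    # date objects support strftime-style format specs in f-strings.
    return Path(workdir) / feed_name / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


if __name__ == "__main__":
    target = feed_output_dir("data_feeds", "example_feed", date(2024, 1, 5))
    target.mkdir(parents=True, exist_ok=True)  # create the dated directory
    print(target)
```

Zero-padded month and day components keep the directories in chronological order when sorted by name.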

Features

  • Downloads content from multiple feeds specified in a YAML file
  • Creates a directory for each feed
  • Content stored in a date-structured directory format (YYYY/MM/DD)
  • Handles errors gracefully, allowing the tool to continue even if a single operation fails
  • Command-line arguments for the input feed file, the output directory, and the cache file
  • Download optimisation through an ETag cache, so unchanged content is not downloaded again
  • Logs a JSON-formatted comprehensive activity summary per script run
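
The ETag-based optimisation listed above generally relies on HTTP conditional requests: the client sends the last ETag it saw in an `If-None-Match` header, and the server answers `304 Not Modified` when the content has not changed. The sketch below shows that general technique using only the standard library. The cache layout (a JSON object mapping URL to ETag) and all function names are assumptions for illustration; Collectress itself depends on `requests` and may store its cache differently.

```python
import json
import urllib.error
import urllib.request
from pathlib import Path


def load_cache(path):
    """Load the URL -> ETag mapping, or return an empty one if the file is missing."""
    cache_file = Path(path)
    return json.loads(cache_file.read_text()) if cache_file.exists() else {}


def save_cache(path, cache):
    """Write the URL -> ETag mapping back to disk as JSON."""
    Path(path).write_text(json.dumps(cache, indent=2))


def conditional_headers(cache, url):
    """Build request headers asking the server to skip unchanged content."""
    etag = cache.get(url)
    return {"If-None-Match": etag} if etag else {}


def fetch_if_changed(url, cache, timeout=30):
    """Return the response body, or None when the server replies 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(cache, url))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            etag = response.headers.get("ETag")
            if etag:
                cache[url] = etag  # remember the validator for the next run
            return body
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

A caller would load the cache once per run, call `fetch_if_changed` for each feed URL, and save the cache at the end so the next run can skip unchanged feeds.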

Usage

Collectress can be run from the command line as follows (a log.json will be created upon execution):

    python collectress.py -f data_feeds.yml -w data_feeds/ -e etag_cache.json

Parameters:

    -h, --help            show this help message and exit
    -e ECACHE, --ecache ECACHE
                          eTag cache for optimizing downloads
    -f FEED, --feed FEED  YAML file containing the feeds
    -w WORKDIR, --workdir WORKDIR
                          The root of the output directory
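
For readers who want to see how a command-line interface like this is typically declared, here is a minimal `argparse` sketch that reproduces the three documented options. The flag names and help strings come from the help output above; the function name `build_parser` is hypothetical, and whether the real tool marks any option as required or gives it a default is not stated in the README, so none are set here.

```python
import argparse


def build_parser():
    """Build a parser with the -e, -f and -w options documented in the README."""
    parser = argparse.ArgumentParser(prog="collectress.py")
    parser.add_argument("-e", "--ecache",
                        help="eTag cache for optimizing downloads")
    parser.add_argument("-f", "--feed",
                        help="YAML file containing the feeds")
    parser.add_argument("-w", "--workdir",
                        help="The root of the output directory")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.feed, args.workdir, args.ecache)
```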

Usage Docker

Collectress can be used through its Docker image:

    docker run --rm \
      -e TZ=$(readlink /etc/localtime | sed -e 's,/usr/share/zoneinfo/,,' ) \
      -v ${PWD}/data_feeds.yml:/collectress/data_feeds.yml \
      -v ${PWD}/log.json:/collectress/log.json \
      -v ${PWD}/etag_cache.json:/collectress/etag_cache.json \
      -v ${PWD}/data_output:/data \
      ghcr.io/stratosphereips/collectress:main \
      python collectress.py -f data_feeds.yml -e etag_cache.json -w /data

About

This tool was developed at the Stratosphere Laboratory at the Czech Technical University in Prague.

Owner

  • Name: Stratosphere IPS
  • Login: stratosphereips
  • Kind: organization
  • Location: Prague

Cybersecurity Research Laboratory at the Czech Technical University in Prague. Creators of Slips, a free software machine learning-based behavioral IDS/IPS.

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  Collectress: Automated Framework To Collect Web Data
message: 'If you use this software, please cite it as specified below.'
type: software
authors:
  - given-names: Veronica
    family-names: Valeros
    email: valerver@fel.cvut.cz
    affiliation: >-
      Stratosphere Laboratory, AIC, FEL, Czech
      Technical University in Prague
    orcid: 'https://orcid.org/0000-0003-2554-3231'

GitHub Events

Total
  • Watch event: 2
  • Delete event: 1
  • Push event: 1
  • Pull request event: 2
  • Create event: 2
Last Year
  • Watch event: 2
  • Delete event: 1
  • Push event: 1
  • Pull request event: 2
  • Create event: 2

Dependencies

.github/workflows/autotag.yml actions
  • actions/checkout v2 composite
  • anothrNick/github-tag-action 1.36.0 composite
.github/workflows/docker-hub.yml actions
  • actions/checkout v2 composite
  • docker/build-push-action v2 composite
  • docker/login-action v1 composite
  • docker/metadata-action v4 composite
.github/workflows/docker-publish.yml actions
  • actions/checkout v3 composite
  • docker/build-push-action ac9327eae2b366085ac7f6a2d02df8aa8ead720a composite
  • docker/login-action 28218f9b04b4f3f62068d7b6ce6ca5b26e35336c composite
  • docker/metadata-action 98669ae865ea3cffbcbaa878cf57c20bbf1c6c38 composite
  • docker/setup-buildx-action 79abd3f86f79a9d68a23c75a09a9a85889262adf composite
  • sigstore/cosign-installer v3.6.0 composite
.github/workflows/python-checks.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/validate-yml.yml actions
  • actions/checkout v2 composite
Dockerfile docker
  • python 3.11-alpine build
requirements.txt pypi
  • python-json-logger *
  • pyyaml *
  • requests *
  • tqdm *