Updated 6 months ago

trafilatura • Rank 26.3 • Science 77%

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated 6 months ago

wpextract • Rank 6.1 • Science 85%

Create datasets from WordPress sites for research or archiving

Updated 6 months ago

nebula • Rank 8.4 • Science 54%

🌌 A network agnostic DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.

Updated 6 months ago

mc-crawler • Rank 3.3 • Science 54%

A MobileCoin network crawler. Corresponding preprint available on arXiv (https://arxiv.org/pdf/2111.12364.pdf).

Updated 5 months ago

https://github.com/adbar/courlan • Rank 20.2 • Science 13%

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

Updated 6 months ago

metafinder • Rank 14.4 • Science 13%

Search for documents in a domain through search engines (Google, Bing and Baidu). The objective is to extract metadata.

Updated 5 months ago

https://github.com/buaadreamer/buaastar • Rank 1.4 • Science 23%

BUAA Star website: final project for the English-taught Python course at Beihang University (BUAA), summer semester 2021.

Updated 6 months ago

foot • Rank 0.0 • Science 13%

Foot is a library that fetches a list of URLs and silly walks through each site to gather information.

Updated 5 months ago

https://github.com/citiususc/polypus • Rank 0.0 • Science 13%

Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis

Updated 6 months ago

cs-insights-crawler • Science 31%

This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client to store data to the cs-insights-backend.

Updated 6 months ago

unidisk • Science 44%

A Crawler to search for keywords and compare the score

Updated 5 months ago

https://github.com/byt3n33dl3/thc-katanax • Science 26%

The next generation of Samurai blades: a crawling and spidering framework.

Updated 6 months ago

pacs-ris-crawler • Science 44%

Search the PACS and RIS

Updated 5 months ago

https://github.com/byt3n33dl3/crawler_v2 • Science 13%

Remote access Trojan (client), for use after the malware hits the kernel.

Updated 5 months ago

https://github.com/birkhofflee/blizzard_forum.js • Science 13%

An unofficial Node.js API for Blizzard Forums. (works in 2019)

Updated 6 months ago

qasports-dataset-scripts • Science 44%

Scripts used to generate the (Question-Answering) QASports2 Dataset

Updated 6 months ago

scrapy • Science 26%

Scrapy, a fast high-level web crawling & scraping framework for Python.

Updated 6 months ago

github-crawler • Science 44%

The GitHub Crawler is a Python-based project that utilizes the GitHub API to fetch and crawl data related to commits and pull requests from various repositories. It's a tool designed for developers who want to analyze the activity in a GitHub repository. The crawler can fetch data about commits, pull requests, pull commits, pull files, pull reviews

Updated 5 months ago

https://github.com/0xk1h0/phishing_alive_measurement • Science 26%

7 Days Later: Analyzing Phishing-Site Lifespan After Detected (WWW 2025)