trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
ghs
GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them
findpapers
Findpapers: A tool for helping researchers who are looking for related works
mc-crawler
A MobileCoin network crawler. Corresponding preprint available on arXiv (https://arxiv.org/pdf/2111.12364.pdf).
persian-news-crawler
A simple script to crawl data from Persian news agencies, including Fars and Mehr.
https://github.com/adbar/courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
https://github.com/commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
metafinder
Searches for documents in a domain through search engines (Google, Bing and Baidu) in order to extract their metadata.
https://github.com/amirzenoozi/insta-downloader
A script for downloading Instagram posts.
https://github.com/citiususc/polypus
Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis
cs-insights-crawler
This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client to store data to the cs-insights-backend.
https://github.com/byt3n33dl3/crawler_v2
Remote access Trojan (client side), run after the malware hits the kernel.
https://github.com/amirzenoozi/aparat-videos-dataset
Some simple information about Aparat videos, for data scientists.
https://github.com/birkhofflee/blizzard_forum.js
An unofficial Node.js API for Blizzard Forums. (works in 2019)
qasports-dataset-scripts
Scripts used to generate the QASports2 question-answering dataset.
semantic-outlier-removal
Code and data for SORE (ACL 2025), a semantic boilerplate remover.
github-crawler
The GitHub Crawler is a Python-based project that utilizes the GitHub API to fetch and crawl data related to commits and pull requests from various repositories. It's a tool designed for developers who want to analyze the activity in a GitHub repository. The crawler can fetch data about commits, pull requests, pull commits, pull files, pull reviews
https://github.com/0xk1h0/phishing_alive_measurement
7 Days Later: Analyzing Phishing-Site Lifespan After Detected (WWW 2025)