Projects

Updated 10 months ago

trafilatura • Rank 26.3 • Science 77%

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

article-extractor corpus-builder corpus-tools crawler html-to-markdown html2text llm news-aggregator news-crawler nlp rag readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping

Updated 10 months ago

wpextract • Rank 6.1 • Science 85%

Create datasets from WordPress sites for research or archiving

corpus crawler nlp text-extraction text-mining web-scraping wordpress

Updated 10 months ago

ghs • Rank 7.5 • Science 67%

GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them

bootstrap crawler csv-export dataset-generation docker-compose git github java-17 json-export mining-software-repositories msr mysql platform repository search-engine spring-boot spring-boot-application spring-boot-server sql-dump xml-export

Updated 10 months ago

findpapers • Rank 12.2 • Science 54%

Findpapers: A tool for helping researchers who are looking for related works

academic academic-publishing acm arxiv bibtex biorxiv crawler ieee medrxiv paper papers pubmed research scientific-papers scientific-publications scientific-publishing scopus scraper systematic-mapping systematic-review

Updated 10 months ago

nebula • Rank 8.4 • Science 54%

🌌 A network agnostic DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.

cid crawler filecoin golang hacktoberfest ipfs libp2p

Updated 10 months ago

decryptlogin • Rank 14.7 • Science 44%

DecryptLogin: APIs for logging in to some websites using requests.

12306 baidu baiduyun bilibili crawler jingdong login migu pypi python3 requests spider stackoverflow taobao tencent twitter weibo xiami xiaomi zhihu

Updated 10 months ago

mc-crawler • Rank 3.3 • Science 54%

A MobileCoin network crawler. Corresponding preprint available on arXiv (https://arxiv.org/pdf/2111.12364.pdf).

crawler mobilecoin rust

Updated 10 months ago

persian-news-crawler • Rank 2.3 • Science 54%

Simple Script To Crawl Data From Persian News Agencies, Including Fars and Mehr.

cli crawler database fars-news farsi-datasets kaggle-dataset mehr-news news news-agencies newspaper python python3 script shargh-news sqlite3 tensorflow tensorflow2

Updated 10 months ago

Rcrawler • Rank 16.7 • Science 23%

An R web crawler and scraper

crawler crawlers r rpackage scraper webcrawler webscraper webscraping webscrapping

Updated 10 months ago

https://github.com/adbar/courlan • Rank 20.2 • Science 13%

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

crawler crawling recon tld uri url url-checker url-normalization url-parser url-parsing url-validation

Updated 10 months ago

https://github.com/commoncrawl/news-crawl • Rank 6.6 • Science 26%

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Updated 10 months ago

metafinder • Rank 14.4 • Science 13%

Search for documents in a domain through Search Engines (Google, Bing and Baidu). The objective is to extract metadata

crawler metadata osint

Updated 10 months ago

https://github.com/buaadreamer/buaastar • Rank 1.4 • Science 23%

BUAA Star website: course project for the English-taught Python course at Beihang University (BUAA), summer term 2021

crawler css flask html javascript python

Updated 10 months ago

https://github.com/amirzenoozi/insta-downloader • Rank 2.6 • Science 13%

You Can Download Instagram Posts With This Script

crawler crawling downloader instagram

Updated 10 months ago

foot • Rank 0.0 • Science 13%

Foot is a library that fetches a list of URLs and silly walks through each site to gather information.

bugbounty crawler scraping

Updated 10 months ago

https://github.com/citiususc/polypus • Rank 0.0 • Science 13%

Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis

analytics bigdata crawler scraper sentiment-analysis twitter

Updated 10 months ago

qasports-dataset-scripts • Science 44%

Scripts used to generate the (Question-Answering) QASports2 Dataset

crawler dataset python question-answering sports

Updated 10 months ago

semantic-outlier-removal • Science 54%

Code and data for SORE (ACL 2025), a semantic boilerplate remover.

article-extractor crawler embedding html-to-text html2text llm nlp outlier-removal preprocessing readability scraping text-extraction text-mining web-scraping

Updated 10 months ago

https://github.com/0xk1h0/phishing_alive_measurement • Science 26%

7 Days Later: Analyzing Phishing-Site Lifespan After Detected (WWW 2025)

crawler crawler-js phishing webconf webconf2025 www

Updated 10 months ago

eyes • Science 54%

Public Opinion Mining System of Taiwanese Forums

crawler data-engineering data-mining data-science graphql javascript natural-language-processing opinion-mining public-opinion python react redux tailwindcss task-queue

Updated 10 months ago

unidisk • Science 44%

A Crawler to search for keywords and compare the score

comparison crawler nlp solr-client

Updated 10 months ago

pacs-ris-crawler • Science 44%

Search the PACS and RIS

crawler dicom pacs ris

Updated 10 months ago

https://github.com/birkhofflee/blizzard_forum.js • Science 13%

An unofficial Node.js API for Blizzard Forums. (works in 2019)

api crawler web

Updated 10 months ago

torbot • Science 44%

Dark Web OSINT Tool

algorithm crawler dark-web dedsec-inside deepweb go hacking hacktoberfest osint projects psnappz python python-web-crawler python3 security security-tools spider tor tor-network torbot

Updated 10 months ago

scrapegraph-ai • Science 54%

Python scraper based on AI

ai ai-scraping automated-scraper crawler html-to-markdown llm markdown rag scraping scraping-python web-crawler web-crawlers web-scraping

Updated 10 months ago

scrapy • Science 26%

Scrapy, a fast high-level web crawling & scraping framework for Python.

crawler crawling framework hacktoberfest python scraping web-scraping web-scraping-python

Updated 10 months ago

https://github.com/amirzenoozi/aparat-videos-dataset • Science 13%

Some Simple Information About Aparat Videos for DataScientists

aparat cli crawler data-science data-science-projects pandas python python3 sdk-python sqlite3 video

Updated 10 months ago

github-crawler • Science 44%

The GitHub Crawler is a Python-based project that utilizes the GitHub API to fetch and crawl data related to commits and pull requests from various repositories. It's a tool designed for developers who want to analyze the activity in a GitHub repository. The crawler can fetch data about commits, pull requests, pull commits, pull files, pull reviews

crawler github-api github-crawler python python-crawler

Updated 10 months ago

cs-insights-crawler • Science 31%

This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client to store data to the cs-insights-backend.

crawler dblp dblp-dataset nlp semanticscholar

Updated 10 months ago

https://github.com/byt3n33dl3/thc-katanax • Science 26%

The next generation of Samurai blades: a crawling and spidering framework.

cli crawler domain framework golang hacking http pentesting subdomain subfinder tls

Updated 10 months ago

https://github.com/byt3n33dl3/crawler_v2 • Science 13%

Remote access Trojan based (Client) After the Malware hits the Kernel.

compiler crawler exploit offensive-security pentesting rat
