Updated 5 months ago

https://github.com/commoncrawl/web-languages • Rank 7.7 • Science 26%

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

Updated 6 months ago

https://github.com/adbar/courlan • Rank 20.2 • Science 13%

Clean, filter and sample URLs to optimize data collection. Python library and command-line tool with deduplication, spam, content and language filters.

Updated 6 months ago

https://github.com/capjamesg/getsitemap • Rank 5.8 • Science 13%

A Python library that retrieves all URLs in the sitemaps on a website.

Updated 6 months ago

parallel-urls-classifier • Science 57%

Parallel URLs Classifier (PUC) infers from their URLs whether a pair of documents is parallel, i.e. has the same content in different languages.

Updated 6 months ago

scrapy • Science 26%

Scrapy, a fast high-level web crawling & scraping framework for Python.