Updated 5 months ago
https://github.com/commoncrawl/web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
Updated 6 months ago
https://github.com/adbar/courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Updated 6 months ago
https://github.com/capjamesg/getsitemap
A Python library that retrieves all URLs in the sitemaps on a website.
Updated 6 months ago
https://github.com/amirzenoozi/insta-downloader
Download Instagram posts with this script.
Updated 6 months ago
https://github.com/amirzenoozi/poster-finder
Download all posters for a movie, given its URL.
Updated 6 months ago
parallel-urls-classifier
Parallel URLs Classifier (PUC) infers the parallelness of a pair of documents from their URLs.