cc-cached-downloader
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: alfredtruong
- License: MIT
- Language: Python
- Default Branch: master
- Size: 332 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
introduction
what is Common Crawl?
The Common Crawl project is an "open repository of web crawl data that can be accessed and analyzed by anyone".
A search index is provided that lets you search at the domain level.
Search results contain a link and byte offset to a specific record in the AWS S3 buckets, enabling targeted downloads.
You can also query for records with AWS Athena without needing to download each index.
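To make the "byte offset" idea concrete, here is a minimal sketch of how a targeted download works: each index result names a WARC file plus an offset and length, and fetching just that slice is an HTTP Range request against the public data bucket. The record below is a hypothetical example (the field names follow the CDX JSON schema; the filename is a shortened placeholder):

```python
import json

# A hypothetical CDX index result record (field names follow the CDX JSON schema)
record = json.loads(
    '{"urlkey": "com,reddit)/r/machinelearning", "timestamp": "20240618000000", '
    '"filename": "crawl-data/CC-MAIN-2024-26/example.warc.gz", '
    '"offset": "12345", "length": "6789"}'
)

# Targeted download: request only this record's bytes from the public bucket
start = int(record["offset"])
end = start + int(record["length"]) - 1
url = f"https://data.commoncrawl.org/{record['filename']}"
headers = {"Range": f"bytes={start}-{end}"}
# requests.get(url, headers=headers) would return just this gzipped WARC record
```

This is why the index alone is enough to plan a download run: no WARC file ever needs to be fetched in full.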
what does this repo do?
- helps identify records of interest from Common Crawl
- can loop over said records to 1. download and 2. extract their contents
- has a caching mechanism to allow reruns (to ensure all records get downloaded)
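The caching idea can be sketched as follows. This is not the repo's actual implementation, just a minimal illustration of why reruns are safe: a record already on disk is never downloaded again, so an interrupted run can simply be restarted.

```python
from pathlib import Path
import tempfile

def fetch_record(cache_dir, record_id, download):
    # Hypothetical sketch of the caching mechanism: download a record only if
    # it is not already cached on disk, so reruns skip completed work.
    path = Path(cache_dir) / f"{record_id}.warc.gz"
    if path.exists():
        return path.read_bytes()   # cache hit: no network call
    data = download(record_id)     # cache miss: fetch and persist
    path.write_bytes(data)
    return data

# usage: the second call for the same record is served from the cache
calls = []
def fake_download(record_id):
    calls.append(record_id)
    return b"record-bytes"

with tempfile.TemporaryDirectory() as d:
    first = fetch_record(d, "rec1", fake_download)
    second = fetch_record(d, "rec1", fake_download)
```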
how to identify records of interest
you can gather these either by 1. searching at the domain-name level via the CDX index API (less flexible), or 2. using AWS Athena to query the index and dump a CSV specifying all records of interest (very flexible)
download usage
with CDX index api
```python
from comcrawl.core import IndexClient

# crawl of interest + output location
ic = IndexClient('2024-26', outdir='/home/alfred/nfs/common_crawl')

# identify records to scrape (populates ic.results, the list of records to download)
ic.init_results_with_url_filter("reddit.com/r/MachineLearning/*")
ic.init_results_with_url_filter('*.hk01.com')  # read / save

# download and extract each record
ic.populate_results()
first_record = ic.results[0]["content"]
```
with AWS Athena
```python
from comcrawl.core import IndexClient

# crawl of interest + output location
ic = IndexClient(outdir='/home/alfred/nfs/common_crawl')  # use athena csvs

# identify records to scrape (populates ic.results, the list of records to download);
# you need to update IndexClient.ATHENA_QUERY_EXECUTION_IDS with the AWS Athena query csv hash
ic.init_results_with_athena_query_csvs(index=INDEX, min_length=MIN_LENGTH, max_length=MAX_LENGTH)

# download and extract each record
ic.populate_results()
first_record = ic.results[0]["content"]
```
AWS Athena query usage
look at comcrawl/utils/athena.py
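For orientation, here is a hypothetical example of the kind of SQL such an Athena helper issues; the table and column names follow Common Crawl's documented `ccindex` columnar index, but the filter values are illustrative:

```python
# Hypothetical Athena query against Common Crawl's documented ccindex table;
# the crawl and domain values are illustrative placeholders.
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-26'
  AND subset = 'warc'
  AND url_host_registered_domain = 'hk01.com'
""".strip()
```

Dumping such a result set to CSV gives exactly the filename/offset/length triples needed for targeted downloads.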
Multithreading
You can multithread the search and/or download by specifying a number of threads (don't overdo this, as it could stress the Common Crawl servers; see the Code of Conduct).
```python
from comcrawl.core import IndexClient

client = IndexClient()
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*", threads=4)
client.populate_results(threads=4)
```
removing duplicates & saving
For example, use pandas to filter out duplicate results and persist them to disk:
```python
from comcrawl.core import IndexClient
import pandas as pd

client = IndexClient()
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*")

client.results = (
    pd.DataFrame(client.results)
    .sort_values(by="timestamp")
    .drop_duplicates("urlkey", keep="last")
    .to_dict("records")
)
client.populate_results()

pd.DataFrame(client.results).to_csv("results.csv")
```
The urlkey alone might not be sufficient here, so you might want to write a function that computes a custom ID from the results' properties for duplicate removal.
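One way to build such a custom ID is to hash several result properties together. This is a hypothetical helper, not part of the repo; it assumes the CDX-style `urlkey` and `digest` (content checksum) fields present in index results:

```python
import hashlib

def record_id(record):
    # Hypothetical dedup key combining urlkey with the content digest,
    # so two captures of the same URL with different content stay distinct.
    key = f"{record.get('urlkey')}|{record.get('digest')}"
    return hashlib.sha256(key.encode()).hexdigest()

# usage: identical content collapses, different content survives dedup
r1 = {"urlkey": "com,reddit)/r/machinelearning", "digest": "AAA"}
r2 = {"urlkey": "com,reddit)/r/machinelearning", "digest": "BBB"}
```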
Logging HTTP requests
You can enable logging to debug HTTP requests:
```python
from comcrawl.core import IndexClient

client = IndexClient(verbose=True)
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*")
client.populate_results()
```
Code of Conduct
Please be aware of the guidelines posted by the Common Crawl maintainers.
Owner
- Login: alfredtruong
- Kind: user
- Repositories: 9
- Profile: https://github.com/alfredtruong
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Truong"
    given-names: "Alfred Kar Yin"
    orcid: "https://orcid.org/0009-0002-1723-9854"
title: "cc_cached_downloader"
date-released: 2024-08-28
url: "https://github.com/alfredtruong/cc-cached-downloader"
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1