Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: alfredtruong
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 332 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

introduction

what is Common Crawl?

the Common Crawl project is an "open repository of web crawl data that can be accessed and analyzed by anyone"
a search index is provided that lets you search at the domain level
search results contain a link and byte offset to a specific record in an AWS S3 bucket, allowing targeted downloads
you can also query for records with AWS Athena, without needing to download each index file
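The byte-offset scheme above can be sketched with a plain HTTP Range request (a sketch only, assuming the public data.commoncrawl.org gateway; the filename, offset, and length values come from index search results):

```python
import gzip
import io
import urllib.request


def build_range_header(offset: int, length: int) -> dict:
    """HTTP Range header covering one WARC record (inclusive byte range)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Download a single gzipped WARC record and return its decompressed bytes."""
    url = f"https://data.commoncrawl.org/{warc_filename}"
    req = urllib.request.Request(url, headers=build_range_header(offset, length))
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    # each record in a WARC file is an independent gzip member,
    # so the downloaded slice decompresses on its own
    return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
```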

what does this repo do

helps identify records of interest from Common Crawl
can loop over said records to
1. download and 2. extract their contents

has a caching mechanism to allow reruns (ensuring all records get downloaded)
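The caching idea can be sketched as a skip-if-present check (illustrative only, not the repo's actual code; the on-disk filename layout here is an assumption):

```python
from pathlib import Path


def cached_download(record_id: str, outdir: str, fetch) -> bytes:
    """Return a record's content, downloading only if no cached copy exists.

    `fetch` is any callable returning the record bytes; on a rerun,
    records already on disk are read back instead of re-downloaded.
    """
    path = Path(outdir) / f"{record_id}.warc.gz"
    if path.exists():  # cache hit: a previous run already saved this record
        return path.read_bytes()
    content = fetch()  # cache miss: download, then persist for future reruns
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return content
```

Rerunning the loop then only fetches records that failed or were skipped last time.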

how to identify records of interest

you can gather these either by 1. searching at the domain-name level via the CDX index API (less flexible) or 2. using AWS Athena to query and dump a CSV specifying all records of interest (very flexible)

download usage

with CDX index api

```python
from comcrawl.core import IndexClient

# crawl of interest + output location
ic = IndexClient('2024-26', outdir='/home/alfred/nfs/common_crawl')

# identify records to scrape (populate ic.results, the list of records to download)
ic.init_results_with_url_filter("reddit.com/r/MachineLearning/*")
# ic.init_results_with_url_filter('*.hk01.com')  # read / save

# download and extract each record
ic.populate_results()
first_record = ic.results[0]["content"]
```

with AWS Athena

```python
from comcrawl.core import IndexClient

# crawl of interest + output location
ic = IndexClient(outdir='/home/alfred/nfs/common_crawl')  # use athena csvs

# identify records to scrape (populate ic.results, the list of records to download)
ic.init_results_with_athena_query_csvs(index=INDEX, min_length=MIN_LENGTH, max_length=MAX_LENGTH)
# you need to update IndexClient.ATHENA_QUERY_EXECUTION_IDS with the AWS Athena query csv hash

# download and extract each record
ic.populate_results()
first_record = ic.results[0]["content"]
```

AWS Athena query usage

look at comcrawl/utils/athena.py
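For reference, a query over Common Crawl's public columnar index in Athena typically looks like the sketch below (the `ccindex.ccindex` table and column names follow Common Crawl's published Athena setup; the crawl id and domain are placeholders):

```python
# sketch of the kind of SQL you would run in Athena to dump a CSV of records;
# each row carries the filename/offset/length needed for a targeted download
ATHENA_QUERY = """
SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2024-26'
  AND subset = 'warc'
  AND url_host_registered_domain = 'hk01.com'
"""
```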

Multithreading

the search and/or download can be multithreaded by specifying a number of threads (don't overdo this, as it could stress the Common Crawl servers; see the Code of Conduct section).

```python
from comcrawl.core import IndexClient

client = IndexClient()

client.init_results_with_url_filter("reddit.com/r/MachineLearning/*", threads=4)
client.populate_results(threads=4)
```

removing duplicates & saving

e.g. use pandas to filter out duplicate results and persist them to disk:

```python
import pandas as pd

from comcrawl.core import IndexClient

client = IndexClient()
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*")

client.results = (pd.DataFrame(client.results)
                  .sort_values(by="timestamp")
                  .drop_duplicates("urlkey", keep="last")
                  .to_dict("records"))
client.populate_results()

pd.DataFrame(client.results).to_csv("results.csv")
```

The urlkey alone might not be sufficient here, so you might want to write a function that computes a custom id from each result's properties for duplicate removal.
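One way to do that (a sketch; which fields exist on each result depends on what the index returns, so the keys below are assumptions) is to hash a few identifying properties together:

```python
import hashlib


def custom_id(record: dict) -> str:
    """Derive a stable dedup key from several record properties,
    rather than relying on `urlkey` alone."""
    parts = (record.get("urlkey", ""),
             record.get("digest", ""),
             record.get("mime", ""))
    return hashlib.sha1("|".join(parts).encode()).hexdigest()
```

Records sharing a hash are then duplicates, and you can keep the latest by timestamp as above.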

Logging HTTP requests

you can enable logging to debug HTTP requests

```python
from comcrawl.core import IndexClient

client = IndexClient(verbose=True)
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*")
client.populate_results()
```

Code of Conduct

please be aware of the guidelines posted by the Common Crawl maintainers

Owner

  • Login: alfredtruong
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Truong"
  given-names: "Alfred Kar Yin"
  orcid: "https://orcid.org/0009-0002-1723-9854"
title: "cc_cached_downloader"
date-released: 2024-08-28
url: "https://github.com/alfredtruong/cc-cached-downloader"

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1