cc-cached-downloader
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: alfredtruong
- License: MIT
- Language: Python
- Default Branch: master
- Size: 332 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
introduction
what is Common Crawl?
The Common Crawl project is an "open repository of web crawl data that can be accessed and analyzed by anyone".
A search index is provided that lets you search at the domain level.
Search results contain a link and byte offset to a specific record in the AWS S3 buckets, enabling targeted downloads.
You can also query for records with AWS Athena without needing to download each index.
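To make the "byte offset" idea concrete, here is a minimal sketch of how a targeted download works: each index result names a WARC file plus an offset and length, and fetching just that slice is an HTTP Range request against the public data bucket. The record below is a hypothetical example (the field names follow the CDX JSON schema; the filename is a shortened placeholder):

```python
import json

# A hypothetical CDX index result record (field names follow the CDX JSON schema)
record = json.loads(
    '{"urlkey": "com,reddit)/r/machinelearning", "timestamp": "20240618000000", '
    '"filename": "crawl-data/CC-MAIN-2024-26/example.warc.gz", '
    '"offset": "12345", "length": "6789"}'
)

# Targeted download: request only this record's bytes from the public bucket
start = int(record["offset"])
end = start + int(record["length"]) - 1
url = f"https://data.commoncrawl.org/{record['filename']}"
headers = {"Range": f"bytes={start}-{end}"}
# requests.get(url, headers=headers) would return just this gzipped WARC record
```

This is why the index alone is enough to plan a download run: no WARC file ever needs to be fetched in full.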
what does this repo do?
- helps identify records of interest from Common Crawl
- can loop over said records to 1. download and 2. extract their contents
- has a caching mechanism to allow reruns (to ensure all records get downloaded)
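The caching idea can be sketched as follows. This is not the repo's actual implementation, just a minimal illustration of why reruns are safe: a record already on disk is never downloaded again, so an interrupted run can simply be restarted.

```python
from pathlib import Path
import tempfile

def fetch_record(cache_dir, record_id, download):
    # Hypothetical sketch of the caching mechanism: download a record only if
    # it is not already cached on disk, so reruns skip completed work.
    path = Path(cache_dir) / f"{record_id}.warc.gz"
    if path.exists():
        return path.read_bytes()   # cache hit: no network call
    data = download(record_id)     # cache miss: fetch and persist
    path.write_bytes(data)
    return data

# usage: the second call for the same record is served from the cache
calls = []
def fake_download(record_id):
    calls.append(record_id)
    return b"record-bytes"

with tempfile.TemporaryDirectory() as d:
    first = fetch_record(d, "rec1", fake_download)
    second = fetch_record(d, "rec1", fake_download)
```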
how to identify records of interest
you can gather these either by 1. searching at the domain-name level via the CDX index API (less flexible), or 2. using AWS Athena to query the index and dump a CSV specifying all records of interest (very flexible)
download usage
with CDX index api
```python
from comcrawl.core import IndexClient

# crawl of interest + output location
ic = IndexClient('2024-26', outdir='/home/alfred/nfs/common_crawl')

# identify records to scrape (populates ic.results, the list of records to download)
ic.init_results_with_url_filter("reddit.com/r/MachineLearning/*")
ic.init_results_with_url_filter('*.hk01.com')  # read / save

# download and extract each record
ic.populate_results()
first_record = ic.results[0]["content"]
```
with AWS Athena
```python
from comcrawl.core import IndexClient

# crawl of interest + output location
ic = IndexClient(outdir='/home/alfred/nfs/common_crawl')  # use athena csvs

# identify records to scrape (populates ic.results, the list of records to download);
# you need to update IndexClient.ATHENA_QUERY_EXECUTION_IDS with the AWS Athena query csv hash
ic.init_results_with_athena_query_csvs(index=INDEX, min_length=MIN_LENGTH, max_length=MAX_LENGTH)

# download and extract each record
ic.populate_results()
first_record = ic.results[0]["content"]
```
AWS Athena query usage
look at comcrawl/utils/athena.py
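For orientation, here is a hypothetical example of the kind of SQL such an Athena helper issues; the table and column names follow Common Crawl's documented `ccindex` columnar index, but the filter values are illustrative:

```python
# Hypothetical Athena query against Common Crawl's documented ccindex table;
# the crawl and domain values are illustrative placeholders.
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-26'
  AND subset = 'warc'
  AND url_host_registered_domain = 'hk01.com'
""".strip()
```

Dumping such a result set to CSV gives exactly the filename/offset/length triples needed for targeted downloads.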
Multithreading
You can multithread the search and/or download by specifying a number of threads (don't overdo this, as it could stress the Common Crawl servers; see the Code of Conduct).
```python
from comcrawl.core import IndexClient

client = IndexClient()
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*", threads=4)
client.populate_results(threads=4)
```
removing duplicates & saving
For example, use pandas to filter out duplicate results and persist them to disk:
```python
from comcrawl.core import IndexClient
import pandas as pd

client = IndexClient()
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*")

client.results = (
    pd.DataFrame(client.results)
    .sort_values(by="timestamp")
    .drop_duplicates("urlkey", keep="last")
    .to_dict("records")
)
client.populate_results()

pd.DataFrame(client.results).to_csv("results.csv")
```
The urlkey alone might not be sufficient here, so you might want to write a function that computes a custom ID from the results' properties for duplicate removal.
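One way to build such a custom ID is to hash several result properties together. This is a hypothetical helper, not part of the repo; it assumes the CDX-style `urlkey` and `digest` (content checksum) fields present in index results:

```python
import hashlib

def record_id(record):
    # Hypothetical dedup key combining urlkey with the content digest,
    # so two captures of the same URL with different content stay distinct.
    key = f"{record.get('urlkey')}|{record.get('digest')}"
    return hashlib.sha256(key.encode()).hexdigest()

# usage: identical content collapses, different content survives dedup
r1 = {"urlkey": "com,reddit)/r/machinelearning", "digest": "AAA"}
r2 = {"urlkey": "com,reddit)/r/machinelearning", "digest": "BBB"}
```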
Logging HTTP requests
You can enable logging to debug HTTP requests:
```python
from comcrawl.core import IndexClient

client = IndexClient(verbose=True)
client.init_results_with_url_filter("reddit.com/r/MachineLearning/*")
client.populate_results()
```
Code of Conduct
Please be aware of the guidelines posted by the Common Crawl maintainers.
Owner
- Login: alfredtruong
- Kind: user
- Repositories: 9
- Profile: https://github.com/alfredtruong
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Truong"
    given-names: "Alfred Kar Yin"
    orcid: "https://orcid.org/0009-0002-1723-9854"
title: "cc_cached_downloader"
date-released: 2024-08-28
url: "https://github.com/alfredtruong/cc-cached-downloader"
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1