https://github.com/commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary

Keywords

commoncrawl downloader rust
Last synced: 5 months ago

Repository

A polite and user-friendly downloader for Common Crawl data

Basic Info
  • Host: GitHub
  • Owner: commoncrawl
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 144 KB
Statistics
  • Stars: 51
  • Watchers: 8
  • Forks: 3
  • Open Issues: 3
  • Releases: 5
Topics
commoncrawl downloader rust
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme Contributing License Code of conduct Security

README.md

CC-Downloader

This is an experimental polite downloader for Common Crawl data written in Rust. It is intended for use outside of AWS.

Todo

  • [ ] Add Python bindings
  • [ ] Add more tests
  • [ ] Handle unrecoverable errors

Installation

You can install cc-downloader via our pre-built binaries, or by compiling it from source.

Pre-built binaries

You can find our pre-built binaries on our GitHub releases page. They are available for Linux, macOS, and Windows, for both x86_64 and aarch64 architectures (Windows is x86_64 only). Select and download the binary that matches your system.

```bash
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].[COMPRESSION-FORMAT]
```

After downloading it, please verify the checksum of the binary. You can find the checksum file in the same location as the binary; it is generated using sha512sum. You can verify it by running the following commands:

```bash
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].sha512
sha512sum -c cc-downloader-[VERSION]-[ARCH]-[OS].sha512
```
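If you have not used sha512sum -c before, here is a self-contained demonstration of how the verification step behaves, using a throwaway file (the file names are illustrative, not release assets):

```bash
# Create a throwaway file and generate a checksum file for it.
echo "hello" > demo.bin
sha512sum demo.bin > demo.bin.sha512

# Verification prints "demo.bin: OK" when the file is intact,
# and "demo.bin: FAILED" (with a nonzero exit code) if it was corrupted.
sha512sum -c demo.bin.sha512

# Clean up the throwaway files.
rm demo.bin demo.bin.sha512
```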

If the checksum is valid, which will be indicated by an OK message, you can proceed to extract the binary. For tar.gz files, use the following command:

```bash
tar -xzf cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz
```

For zip files, use the following command:

```bash
unzip cc-downloader-[VERSION]-[ARCH]-[OS].zip
```

This will extract the binary, the licenses, and the README file into the current folder. After extracting, you can run the binary with:

```bash
./cc-downloader
```

If you want to use the binary from anywhere, move it to a folder in your PATH. For more information on how to do this, refer to the documentation of your operating system. For example, on Linux and macOS you can move it to ~/.bin:

```bash
mv cc-downloader ~/.bin
```

And then add the following line to your ~/.bashrc or ~/.zshrc file:

```bash
export PATH=$PATH:~/.bin
```

Then run the following command to apply the changes:

```bash
source ~/.bashrc
```

or

```bash
source ~/.zshrc
```

Then, you can run the binary from anywhere. If you want to update it, repeat the process with the new version, making sure to replace the binary stored in the folder you added to your PATH. If you want to remove the binary, simply delete it from that folder.
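Putting the steps above together, an install on an x86_64 Linux machine might look like the following sketch. The version string and asset name below are hypothetical; substitute the actual file names from the releases page:

```bash
# Hypothetical release tag and asset name -- check the releases page for the real ones.
VERSION="vX.Y.Z"
ASSET="cc-downloader-${VERSION}-x86_64-linux"

wget "https://github.com/commoncrawl/cc-downloader/releases/download/${VERSION}/${ASSET}.tar.gz"
wget "https://github.com/commoncrawl/cc-downloader/releases/download/${VERSION}/${ASSET}.sha512"
sha512sum -c "${ASSET}.sha512"   # expect an OK message before proceeding
tar -xzf "${ASSET}.tar.gz"
mkdir -p ~/.bin && mv cc-downloader ~/.bin
```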

Compiling from source

For this you need to have Rust installed. You can install Rust by following the instructions on the official website.

Or by running the following command:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Even if you have a system-wide Rust installation, we recommend the installation method linked above. A system-wide installation and a user installation can coexist without any problems.

When compiling from source, please make sure you have the latest version of Rust installed by running the following command:

```bash
rustup update
```

Now you can install the cc-downloader tool by running the following command:

```bash
cargo install cc-downloader
```

Usage

```text
➜ cc-downloader -h
A polite and user-friendly downloader for Common Crawl data.

Usage: cc-downloader [COMMAND]

Commands:
  download-paths  Download paths for a given crawl
  download        Download files from a crawl
  help            Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

➜ cc-downloader download-paths -h
Download paths for a given crawl

Usage: cc-downloader download-paths

Arguments:
  Crawl reference, e.g. CC-MAIN-2021-04 or CC-NEWS-2025-01
  Data type [possible values: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table]
  Destination folder

Options:
  -h, --help  Print help

➜ cc-downloader download -h
Download files from a crawl

Usage: cc-downloader download [OPTIONS]

Arguments:
  Path file
  Destination folder

Options:
  -f, --files-only  Download files without the folder structure. This only works for WARC/WET/WAT files
  -n, --numbered    Enumerate output files for compatibility with Ungoliant Pipeline. This only works for WET files
  -t, --threads     Number of threads to use [default: 10]
  -r, --retries     Maximum number of retries per file [default: 1000]
  -p, --progress    Print progress
  -h, --help        Print help
```
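As a sketch of a typical session: first fetch the path listing for a crawl, then download the files it lists. The folder names below are illustrative, and the exact name of the downloaded path file depends on the crawl and data type:

```bash
# 1. Fetch the list of WET file paths for one crawl into ./paths
cc-downloader download-paths CC-MAIN-2021-04 wet ./paths

# 2. Download the files listed in the resulting path file into ./data,
#    with a progress display ("wet.paths.gz" is the conventional listing name)
cc-downloader download ./paths/wet.paths.gz ./data -p
```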

Number of threads

The number of threads can be set using the -t flag; the default value is 10. It is advised to keep the default to avoid being blocked by the server: if you make too many requests in a short period of time, you will start receiving 403 errors, which are unrecoverable and cannot be retried by the downloader.
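For example, to be extra conservative you can pass an explicit thread count below the default (a sketch; the path file name is illustrative):

```bash
cc-downloader download ./paths/wet.paths.gz ./data -t 5 -p
```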

Owner

  • Name: Common Crawl Foundation
  • Login: commoncrawl
  • Kind: organization
  • Email: info@commoncrawl.org

Common Crawl provides an archive of webpages going back to 2007.

GitHub Events

Total
  • Fork event: 2
  • Create event: 10
  • Commit comment event: 9
  • Issues event: 11
  • Release event: 5
  • Watch event: 42
  • Delete event: 1
  • Issue comment event: 14
  • Member event: 1
  • Push event: 33
  • Pull request review comment event: 1
  • Pull request review event: 5
  • Pull request event: 10
Last Year
  • Fork event: 2
  • Create event: 10
  • Commit comment event: 9
  • Issues event: 11
  • Release event: 5
  • Watch event: 42
  • Delete event: 1
  • Issue comment event: 14
  • Member event: 1
  • Push event: 33
  • Pull request review comment event: 1
  • Pull request review event: 5
  • Pull request event: 10

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 69
  • Total Committers: 2
  • Avg Commits per committer: 34.5
  • Development Distribution Score (DDS): 0.029
Past Year
  • Commits: 67
  • Committers: 2
  • Avg Commits per committer: 33.5
  • Development Distribution Score (DDS): 0.03
Top Committers
Name Email Commits
Pedro Ortiz Suarez p****o@c****g 67
Greg Lindahl g****g@c****g 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 16
  • Average time to close issues: 26 days
  • Average time to close pull requests: 3 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.31
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 14
  • Average time to close issues: 26 days
  • Average time to close pull requests: 3 days
  • Issue authors: 4
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.36
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • pjox (3)
  • ssachin520280 (1)
  • jt55401 (1)
  • thunderpoot (1)
Pull Request Authors
  • pjox (15)
  • BorisQuanLi (1)
Top Labels
Issue Labels
enhancement (6) good first issue (1) bug (1)
Pull Request Labels
enhancement (9) bug (4) documentation (2)

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 7,954 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 11
  • Total maintainers: 1
crates.io: cc-downloader

A polite and user-friendly downloader for Common Crawl data.

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 7,954 Total
Rankings
Dependent repos count: 27.0%
Dependent packages count: 35.9%
Average: 53.1%
Downloads: 96.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

Cargo.toml cargo