https://github.com/commoncrawl/cc-downloader
A polite and user-friendly downloader for Common Crawl data
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.3%) to scientific vocabulary
Repository
A polite and user-friendly downloader for Common Crawl data
Statistics
- Stars: 51
- Watchers: 8
- Forks: 3
- Open Issues: 3
- Releases: 5
Metadata Files
README.md
CC-Downloader
This is an experimental polite downloader for Common Crawl data, written in Rust. This tool is intended for use outside of AWS.
Todo
- [ ] Add Python bindings
- [ ] Add more tests
- [ ] Handle unrecoverable errors
Installation
You can install cc-downloader via our pre-built binaries, or by compiling it from source.
Pre-built binaries
You can find our pre-built binaries on our GitHub releases page. They are available for Linux, macOS, and Windows, for the x86_64 and aarch64 architectures (Windows is only supported on x86_64). To use them, select and download the correct binary for your system.
```bash
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].[COMPRESSION-FORMAT]
```
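For example, a concrete invocation for the latest release at the time of writing (0.6.1) on x86_64 Linux might look like the following. This is a hedged sketch: it assumes the release tag is prefixed with v and that [ARCH]-[OS] expands to a Rust target triple, so check the releases page for the exact asset names before downloading:

```bash
# Hypothetical example — verify the exact file name on the releases page
wget https://github.com/commoncrawl/cc-downloader/releases/download/v0.6.1/cc-downloader-v0.6.1-x86_64-unknown-linux-gnu.tar.gz
```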
After downloading it, please verify the checksum of the binary. You can find the checksum file in the same location as the binary. The checksum is generated using sha512sum, and you can verify it by running the following commands:
```bash
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].sha512
sha512sum -c cc-downloader-[VERSION]-[ARCH]-[OS].sha512
```
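If the archive is intact, sha512sum prints a line like the following (the file name will match your download):

```text
cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz: OK
```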
Once the checksum has been verified, you can proceed to extract the binary. For tar.gz files you can use the following command:
```bash
tar -xzf cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz
```
For zip files you can use the following command:
```bash
unzip cc-downloader-[VERSION]-[ARCH]-[OS].zip
```
This will extract the binary, the licenses, and the README file into the current folder. After extracting the binary, you can run it by executing the following command:
```bash
./cc-downloader
```
If you want to use the binary from anywhere, you can move it to a folder in your PATH. For more information on how to do this, please refer to your operating system's documentation. For example, on Linux and macOS you can move it to ~/.bin:
```bash
mv cc-downloader ~/.bin
```
And then add the following line to your ~/.bashrc or ~/.zshrc file:
```bash
export PATH=$PATH:~/.bin
```
Then run the following command to apply the changes:

```bash
source ~/.bashrc
```

or

```bash
source ~/.zshrc
```
Then you can run the binary from anywhere. If you want to update the binary, you can repeat the process and download the new version; make sure to replace the binary stored in the folder that you added to your PATH. If you want to remove the binary, you can simply delete it from this folder.
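As a sketch, an update could look like this, assuming you installed to ~/.bin and downloaded a tar.gz archive (placeholders as above):

```bash
# Download and verify the new release
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].sha512
sha512sum -c cc-downloader-[VERSION]-[ARCH]-[OS].sha512

# Extract and overwrite the old binary in ~/.bin
tar -xzf cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz
mv cc-downloader ~/.bin
```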
Compiling from source
For this you need to have Rust installed. You can install Rust by following the instructions on the official website, or by running the following command:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
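Afterwards, in a new shell (or after sourcing ~/.cargo/env, as the installer suggests), you can confirm that the toolchain is available:

```bash
rustc --version
cargo --version
```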
Even if you have a system-wide Rust installation, we recommend the installation method above. A system-wide installation and a user installation can coexist without any problems.
When compiling from source, please make sure you have the latest version of Rust installed by running the following command:

```bash
rustup update
```
Now you can install the cc-downloader tool by running the following command:
```bash
cargo install cc-downloader
```
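Once installed by either method, you can confirm that the binary is on your PATH by printing its version:

```bash
cc-downloader -V
```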
Usage
```text
➜ cc-downloader -h
A polite and user-friendly downloader for Common Crawl data.

Usage: cc-downloader [COMMAND]

Commands:
  download-paths  Download paths for a given crawl
  download        Download files from a crawl
  help            Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

➜ cc-downloader download-paths -h
Download paths for a given crawl

Usage: cc-downloader download-paths

Arguments:

Options:
  -h, --help  Print help

➜ cc-downloader download -h
Download files from a crawl

Usage: cc-downloader download [OPTIONS]

Arguments:

Options:
  -f, --files-only  Download files without the folder structure. This only works for WARC/WET/WAT files
  -n, --numbered    Enumerate output files for compatibility with Ungoliant Pipeline. This only works for WET files
  -t, --threads     Number of threads
```
The number of threads can be set using the -t flag. The default value is 10, and it is advised to keep the default to avoid being blocked by the server. If you make too many requests in a short period of time, you will start receiving 403 errors, which are unrecoverable and cannot be retried by the downloader.
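For illustration, a polite download invocation keeping the default thread count could look like the following. The positional arguments shown are hypothetical placeholders (the help output above lost them to formatting), so run cc-downloader download -h for the authoritative list:

```bash
# [CRAWL] and [SUBSET] are hypothetical placeholders, not the tool's real
# argument names — check `cc-downloader download -h` before running.
cc-downloader download -t 10 [CRAWL] [SUBSET]
```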
Owner
- Name: Common Crawl Foundation
- Login: commoncrawl
- Kind: organization
- Email: info@commoncrawl.org
- Website: https://commoncrawl.org
- Twitter: commoncrawl
- Repositories: 50
- Profile: https://github.com/commoncrawl
Common Crawl provides an archive of webpages going back to 2007.
GitHub Events
Total
- Fork event: 2
- Create event: 10
- Commit comment event: 9
- Issues event: 11
- Release event: 5
- Watch event: 42
- Delete event: 1
- Issue comment event: 14
- Member event: 1
- Push event: 33
- Pull request review comment event: 1
- Pull request review event: 5
- Pull request event: 10
Last Year
- Fork event: 2
- Create event: 10
- Commit comment event: 9
- Issues event: 11
- Release event: 5
- Watch event: 42
- Delete event: 1
- Issue comment event: 14
- Member event: 1
- Push event: 33
- Pull request review comment event: 1
- Pull request review event: 5
- Pull request event: 10
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Pedro Ortiz Suarez | p****o@c****g | 67 |
| Greg Lindahl | g****g@c****g | 2 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 6
- Total pull requests: 16
- Average time to close issues: 26 days
- Average time to close pull requests: 3 days
- Total issue authors: 4
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.31
- Merged pull requests: 11
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 14
- Average time to close issues: 26 days
- Average time to close pull requests: 3 days
- Issue authors: 4
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.36
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- pjox (3)
- ssachin520280 (1)
- jt55401 (1)
- thunderpoot (1)
Pull Request Authors
- pjox (15)
- BorisQuanLi (1)
Packages
- Total packages: 1
- Total downloads: 7,954 (cargo)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 11
- Total maintainers: 1
crates.io: cc-downloader
A polite and user-friendly downloader for Common Crawl data.
- Homepage: https://commoncrawl.org
- Documentation: https://docs.rs/cc-downloader/
- License: MIT OR Apache-2.0
- Latest release: 0.6.1 (published 10 months ago)