https://github.com/commoncrawl/cc-webgraph

Tools to construct and process Common Crawl webgraphs

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

centrality-measures common-crawl commoncrawl pagerank webgraph webgraph-framework

Keywords from Contributors

archival projection profiles interactive sequences generic observability autograding hacking shellcodes
Last synced: 5 months ago

Repository

Tools to construct and process Common Crawl webgraphs

Basic Info
Statistics
  • Stars: 92
  • Watchers: 12
  • Forks: 5
  • Open Issues: 2
  • Releases: 0
Created over 8 years ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md

cc-webgraph

Tools to construct web graphs from Common Crawl data, and to process and explore them.

Compiling and Packaging Java Tools

Java 11 or later is required.

The Java tools are compiled and packaged with Maven. If Maven is installed, just run mvn package. The Java tools can then be run via java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <args>...

The assembly jar file also includes the WebGraph and LAW packages required to process the webgraphs and compute PageRank or Harmonic Centrality.

Javadocs

The Javadocs are generated by running mvn javadoc:javadoc. Afterwards, open the file target/site/apidocs/index.html in a browser.

Memory and Disk Requirements

Note that the webgraphs are usually multiple gigabytes in size. Processing them requires:

  • a sufficient Java heap size (Java option -Xmx)
  • enough disk space to store the graphs and temporary data

The exact requirements depend on the graph size and the task – graph exploration or ranking, etc.
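As a rough way to reason about the heap requirement, the following minimal sketch estimates the size of an int array with one entry per vertex and compares it with the maximum heap the JVM was started with. The class name, method names, and the vertex count are hypothetical and chosen only for illustration; real tasks need additional memory beyond such an array.

```java
/** Illustrative sketch only; not part of cc-webgraph. */
public class HeapCheckSketch {

    /** Approximate bytes for an int[] with one entry per vertex (array header ignored). */
    static long intArrayBytes(long vertices) {
        return vertices * Integer.BYTES;
    }

    /** True if the estimate does not exceed the maximum heap (as set via -Xmx). */
    static boolean fitsInHeap(long requiredBytes) {
        return requiredBytes <= Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        long vertices = 300_000_000L; // hypothetical graph size
        long needed = intArrayBytes(vertices);
        System.out.printf("int array: ~%d MiB, max heap: %d MiB, fits: %b%n",
                needed >> 20, Runtime.getRuntime().maxMemory() >> 20, fitsInHeap(needed));
    }
}
```

Running it with different -Xmx values shows how the reported maximum heap changes.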

Construction and Ranking of Host- and Domain-Level Web Graphs

Host-Level Web Graph

The host-level web graph is built with the help of PySpark. The corresponding code is found in the project cc-pyspark, and instructions are found in the script build_hostgraph.sh.

Domain-Level Web Graph

The domain-level web graph is distilled from the host-level graph by mapping host names to domain names. The ID mapping is kept in memory as an int array, or as a FastUtil big array if the host-level graph has more vertices than a Java array can hold (around 2³¹). The Java tool to fold the host graph is best run from the script host2domaingraph.sh.
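The folding idea can be pictured with a minimal sketch. It is not the cc-webgraph implementation: it assumes host names in reversed notation (e.g. com.example.www), replaces a proper public-suffix lookup with a naive, hypothetical rule (the first two labels form the domain), and drops self-loops as a choice made here for illustration. All class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative sketch only; not the cc-webgraph implementation. */
public class HostToDomainSketch {

    /** Naive stand-in for a public-suffix lookup: keep the first two labels
     *  of a reversed host name ("com.example.www" -> "com.example"). */
    static String naiveDomain(String reversedHost) {
        String[] labels = reversedHost.split("\\.");
        return labels.length < 2 ? reversedHost : labels[0] + "." + labels[1];
    }

    /** Builds the host-ID to domain-ID map as a plain int array.
     *  Domain names are appended to domainsOut in order of first appearance. */
    static int[] buildIdMap(List<String> hosts, List<String> domainsOut) {
        Map<String, Integer> domainIds = new HashMap<>();
        int[] hostToDomain = new int[hosts.size()];
        for (int hostId = 0; hostId < hosts.size(); hostId++) {
            String domain = naiveDomain(hosts.get(hostId));
            Integer id = domainIds.get(domain);
            if (id == null) {
                id = domainsOut.size();
                domainIds.put(domain, id);
                domainsOut.add(domain);
            }
            hostToDomain[hostId] = id;
        }
        return hostToDomain;
    }

    /** Folds host-level edges into distinct domain-level edges, sorted by
     *  (from, to); edges within the same domain are dropped. */
    static List<int[]> foldEdges(int[][] hostEdges, int[] hostToDomain) {
        TreeSet<Long> seen = new TreeSet<>();
        for (int[] edge : hostEdges) {
            int from = hostToDomain[edge[0]];
            int to = hostToDomain[edge[1]];
            if (from != to) {
                seen.add(((long) from << 32) | to); // pack the pair into one long
            }
        }
        List<int[]> result = new ArrayList<>();
        for (long key : seen) {
            result.add(new int[] { (int) (key >>> 32), (int) key });
        }
        return result;
    }
}
```

The int array works because host IDs are dense and start at zero, so the array index itself serves as the key. Once the number of hosts exceeds what a single Java array can address, a big-array structure is needed instead, as described above.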

Processing Graphs using the WebGraph Framework

To analyze the graph structure and calculate rankings you may further process the graphs using software from the Laboratory for Web Algorithmics (LAW) at the University of Milano, namely the WebGraph framework and the LAW library.
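To illustrate what such a ranking computes, here is a minimal sketch of exact harmonic centrality on a tiny graph: the score of a node x is the sum of 1/d(y, x) over all other nodes y that can reach x, where d is the shortest-path distance. It uses plain Java and a breadth-first search per node, not the WebGraph or LAW APIs, and the class name is hypothetical. This brute-force approach is only practical for small graphs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; does not use the WebGraph or LAW libraries. */
public class HarmonicCentralitySketch {

    /** Exact harmonic centrality for a directed graph with nodes 0..n-1. */
    static double[] harmonic(int n, int[][] edges) {
        // Transposed adjacency lists: for each node, the nodes linking to it.
        List<List<Integer>> predecessors = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            predecessors.add(new ArrayList<>());
        }
        for (int[] edge : edges) {
            predecessors.get(edge[1]).add(edge[0]);
        }

        double[] score = new double[n];
        int[] dist = new int[n];
        for (int x = 0; x < n; x++) {
            // BFS backwards from x yields d(y, x) for every y that reaches x.
            Arrays.fill(dist, -1);
            dist[x] = 0;
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            queue.add(x);
            while (!queue.isEmpty()) {
                int v = queue.poll();
                for (int u : predecessors.get(v)) {
                    if (dist[u] < 0) {
                        dist[u] = dist[v] + 1;
                        score[x] += 1.0 / dist[u];
                        queue.add(u);
                    }
                }
            }
        }
        return score;
    }
}
```

For the chain 0 → 1 → 2, node 2 is reached from node 1 at distance 1 and from node 0 at distance 2, so its score is 1 + 1/2.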

A couple of scripts that may help you run the WebGraph tools to build and process the graphs are provided in src/script/webgraph_ranking/. They are also used to prepare the Common Crawl web graph releases.

To process a webgraph and rank the nodes, first adapt the configuration to your graph and hardware setup:

    vi ./src/script/webgraph_ranking/webgraph_config.sh

Then run:

    ./src/script/webgraph_ranking/process_webgraph.sh graph_name vertices.txt.gz edges.txt.gz output_dir

Afterwards, output_dir/ should contain all generated files, e.g. graph_name.graph and graph_name-ranks.txt.gz.

The shell script can easily be adapted to your needs. For further information, please refer to the LAW dataset tutorial and the API docs of LAW and WebGraph.
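The layout of the vertices.txt.gz and edges.txt.gz input files is not spelled out here. As a sketch under the assumption that each line of the edges file holds two tab-separated numeric vertex IDs (source, then target), such a file could be read as follows. The class and method names are hypothetical, and collecting all edges in a list is only sensible for small samples.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Illustrative sketch only; the file layout is an assumption. */
public class EdgeListSketch {

    /** Parses one line of the assumed form "fromId<TAB>toId". */
    static long[] parseEdge(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
        }
        return new long[] { Long.parseLong(fields[0]), Long.parseLong(fields[1]) };
    }

    /** Reads all edges from a gzip-compressed text file, skipping empty lines. */
    static List<long[]> readEdges(Path gzFile) {
        List<long[]> edges = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    edges.add(parseEdge(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return edges;
    }
}
```

A call such as EdgeListSketch.readEdges(Path.of("edges.txt.gz")) would return the parsed pairs, provided the file really follows the assumed layout.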

Exploring Webgraph Data Sets

The Common Crawl webgraph data sets are announced on the Common Crawl web site.

For instructions on how to explore the webgraphs using JShell, please see the tutorial Interactive Graph Exploration. For an older approach using Jython and pyWebGraph, see the cc-notebooks project.

Credits

Thanks to the authors of the WebGraph framework used to process the graphs and compute PageRank and harmonic centrality. See also Sebastiano Vigna's projects webgraph and webgraph-big.

Owner

  • Name: Common Crawl Foundation
  • Login: commoncrawl
  • Kind: organization
  • Email: info@commoncrawl.org

Common Crawl provides an archive of webpages going back to 2007.

GitHub Events

Total
  • Issues event: 2
  • Watch event: 17
  • Delete event: 3
  • Issue comment event: 7
  • Push event: 26
  • Fork event: 1
Last Year
  • Issues event: 2
  • Watch event: 17
  • Delete event: 3
  • Issue comment event: 7
  • Push event: 26
  • Fork event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 122
  • Total Committers: 5
  • Avg Commits per committer: 24.4
  • Development Distribution Score (DDS): 0.139
Past Year
  • Commits: 29
  • Committers: 3
  • Avg Commits per committer: 9.667
  • Development Distribution Score (DDS): 0.31
Top Committers
Name Email Commits
Sebastian Nagel s****n@c****g 105
Thom Vaughan t****m@c****g 13
Julien Nioche j****n@d****m 2
dependabot[bot] 4****] 1
Ubuntu u****u@i****l 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 14
  • Total pull requests: 7
  • Average time to close issues: 6 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 9
  • Total pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.29
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 4
  • Pull requests: 1
  • Average time to close issues: 21 days
  • Average time to close pull requests: 2 months
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 1.25
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sebastian-nagel (5)
  • akul-goyal (2)
  • scd31 (1)
  • julesearl (1)
  • wumpus (1)
  • PeterCarragher (1)
  • liyucheng09 (1)
  • AbraGanz (1)
  • covuworie (1)
Pull Request Authors
  • sebastian-nagel (8)
  • dependabot[bot] (3)
Top Labels
Issue Labels
enhancement (2), bug (2), help wanted (1), question (1)
Pull Request Labels
dependencies (3)

Dependencies

pom.xml maven
  • org.junit:junit-bom 5.9.0 import
  • com.github.crawler-commons:crawler-commons 1.3
  • commons-cli:commons-cli 1.5.0
  • it.unimi.dsi:fastutil-core 8.5.7
  • it.unimi.dsi:law 2.7.2
  • it.unimi.dsi:webgraph 3.6.10
  • it.unimi.dsi:webgraph-big 3.7.0
  • org.apache.commons:commons-configuration2 2.8.0
  • org.slf4j:slf4j-api 2.0.0
  • org.junit.jupiter:junit-jupiter test
  • org.slf4j:slf4j-simple 2.0.0 test
.github/workflows/build.yml actions
  • actions/checkout v4 composite
  • actions/setup-java v4 composite