github-crawler

The GitHub Crawler is a Python-based project that uses the GitHub API to fetch and crawl data about commits and pull requests from repositories. It is a tool for developers who want to analyze the activity in a GitHub repository. The crawler can fetch commits, pull requests, pull commits, pull files, pull reviews, single commits, and pull review comments.

https://github.com/ehsan200/github-crawler

Science Score: 44.0%

This score indicates how likely the project is to be science-related, based on the following indicators (detected indicators are annotated with what was found):

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.4%) to scientific vocabulary

Keywords

crawler, github-api, github-crawler, python, python-crawler
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Ehsan200
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 21.5 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
crawler, github-api, github-crawler, python, python-crawler
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme · License · Citation

README.md

Github-crawler

This is a Python-based project that uses the GitHub API to crawl and fetch data related to commits and pull requests from various repositories.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You need to have Python and pip installed on your machine. Python can be downloaded from the official website (python.org); pip is included with Python 3.4 and later.

Installing

  1. Clone the repository to your local machine.
  2. Navigate to the project directory.
  3. Install the required packages using pip:

```bash
pip install -r requirements.txt
```
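
Taken together, a fresh setup might look like this (the clone URL comes from the repository's citation metadata):

```bash
# Clone, enter the project, and install dependencies.
git clone https://github.com/Ehsan200/Github-crawler.git
cd Github-crawler
pip install -r requirements.txt
```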

Environment Variables

Before running the project, you need to set the following environment variable:

  • GH_TOKEN: Your personal GitHub token. This is required to authenticate with the GitHub API and raises the rate limit. You can generate a personal access token in your GitHub account settings.

You can set the environment variable in your terminal like this:

```bash
export GH_TOKEN=your_token_here
```

Replace your_token_here with your actual GitHub token. This command needs to be run in the same terminal session before you start the application. If you close the terminal or start a new session, you will need to run the command again.

Please note that you should keep your tokens secret; do not commit them or share them online.
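
To illustrate what the token enables, here is a minimal, hypothetical sketch of an authenticated GitHub API call using GH_TOKEN. This is not the project's code; it assumes the requests library and the public GitHub REST API:

```python
import os

import requests  # assumed HTTP client; the project's own dependencies may differ

# Read the token from the environment, the same way the crawler expects it.
token = os.environ["GH_TOKEN"]

# Authenticated requests get a much higher GitHub API rate limit
# (5,000 requests/hour versus 60/hour unauthenticated).
resp = requests.get(
    "https://api.github.com/repos/Ehsan200/Github-crawler/commits",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
)
resp.raise_for_status()

# Print a short summary line per commit.
for commit in resp.json():
    print(commit["sha"][:7], commit["commit"]["message"].splitlines()[0])
```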

Usage

The project can be used to crawl and fetch data related to commits and pull requests from various repositories. The following commands are available:

  • To see help: `python crawl.py --help`
  • To crawl commits: `python crawl.py commits --r_owner <repository_owner> --r_name <repository_name>`
  • To crawl pull requests: `python crawl.py pull-requests --r_owner <repository_owner> --r_name <repository_name>`
  • To crawl pull commits: `python crawl.py pull-commits --r_owner <repository_owner> --r_name <repository_name>`
  • To crawl pull files: `python crawl.py pull-files --r_owner <repository_owner> --r_name <repository_name>`
  • To crawl pull reviews: `python crawl.py pull-reviews --r_owner <repository_owner> --r_name <repository_name>`
  • To crawl single commits: `python crawl.py single-commits --r_owner <repository_owner> --r_name <repository_name>`
  • To crawl pull review comments: `python crawl.py pull-reviews-comments --r_owner <repository_owner> --r_name <repository_name>`

Replace <repository_owner> and <repository_name> with the owner and name of the repository you want to crawl.
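
For example, to crawl the commits of this repository itself:

```bash
python crawl.py commits --r_owner Ehsan200 --r_name Github-crawler
```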

It's important to note that before crawling pull dependencies (pull-deps), you must first crawl the pull requests of the project. Similarly, before crawling single commits, ensure that you have crawled both the pull reviews and commits.
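
A run that respects these ordering constraints might look like this (placeholders as above):

```bash
# Crawl base data first, then data that depends on it.
python crawl.py commits --r_owner <repository_owner> --r_name <repository_name>
python crawl.py pull-requests --r_owner <repository_owner> --r_name <repository_name>
python crawl.py pull-reviews --r_owner <repository_owner> --r_name <repository_name>
python crawl.py single-commits --r_owner <repository_owner> --r_name <repository_name>
```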

Data Storage and Logs

The crawled data will be stored in a directory named crawled-data. The crawler will create it if it doesn't exist.

Logs related to the crawling process will be stored in a directory named logs. This includes information such as the start and end time of the crawl, any errors encountered, and the number of items crawled. The crawler will create it if it doesn't exist.
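
As a rough sketch of the directory handling described above (illustrative only; the actual implementation may differ):

```python
from pathlib import Path

# Create the output and log directories on first run if they are missing.
for directory in ("crawled-data", "logs"):
    Path(directory).mkdir(exist_ok=True)
```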

Owner

  • Login: Ehsan200
  • Kind: user

Citation (CITATION.cff)

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Movaffagh"
  given-names: "Ehsan"
title: "Github-crawler"
version: 1.0.0
doi: 10.5281/zenodo.10568926
date-released: 2024-01-25
url: "https://github.com/Ehsan200/Github-crawler"
```
