ghs

GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them

https://github.com/seart-group/ghs

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary

Keywords

bootstrap crawler csv-export dataset-generation docker-compose git github java-17 json-export mining-software-repositories msr mysql platform repository search-engine spring-boot spring-boot-application spring-boot-server sql-dump xml-export

Keywords from Contributors

mesh interpretability sequences generic projection interactive optim hacking network-simulation
Last synced: 4 months ago

Repository

GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them

Basic Info
  • Host: GitHub
  • Owner: seart-group
  • License: mit
  • Language: Java
  • Default Branch: master
  • Homepage: https://seart-ghs.si.usi.ch
  • Size: 41.4 MB
Statistics
  • Stars: 167
  • Watchers: 3
  • Forks: 21
  • Open Issues: 11
  • Releases: 45
Topics
bootstrap crawler csv-export dataset-generation docker-compose git github java-17 json-export mining-software-repositories msr mysql platform repository search-engine spring-boot spring-boot-application spring-boot-server sql-dump xml-export
Created almost 5 years ago · Last pushed 4 months ago
Metadata Files
Readme Contributing License Citation Codeowners

README.md

GitHub Search

This project consists of two components:

  1. A Spring Boot powered back-end, responsible for:
    1. Continuously crawling GitHub API endpoints for repository information, and storing it in a central database;
    2. Acting as an API for providing access to the stored data.
  2. A Bootstrap-styled and jQuery-powered web user interface, serving as an accessible front for the API.

Running Locally

Prerequisites

| Dependency | Version Requirement |
|------------|--------------------:|
| Java       | 17 |
| Maven      | 3.9 |
| MySQL      | 8.3 |
| Flyway     | 10.13 |
| cloc[^1]   | 2.00 |
| Git[^1]    | 2.43 |
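
Before proceeding, it may be worth confirming the locally installed versions against the table above; a quick check using each tool's standard version flag:

```shell
# Print the installed versions of the prerequisites listed above
java -version
mvn -version
mysql --version
cloc --version
git --version
```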

[^1]: Only required in versions prior to 1.7.0

Database

Before choosing whether to start with a clean slate or a pre-populated database, make sure the following requirements are met (a combined check is sketched after the list):

  1. The database timezone is set to +00:00. You can verify this via:

    ```sql
    SELECT @@global.time_zone, @@session.time_zone;
    ```

  2. The event scheduler is turned ON. You can verify this via:

    ```sql
    SELECT @@global.event_scheduler;
    ```

  3. The log_bin_trust_function_creators setting is set to 1, so that stored functions can be created while binary logging is enabled. You can verify this via:

    ```sql
    SELECT @@global.log_bin_trust_function_creators;
    ```

  4. The gse database exists. To create it:

    ```sql
    CREATE DATABASE gse CHARACTER SET utf8 COLLATE utf8_bin;
    ```

  5. The gseadmin user exists. To create one, run:

    ```sql
    CREATE USER IF NOT EXISTS 'gseadmin'@'%' IDENTIFIED BY 'Lugano2020';
    GRANT ALL ON gse.* TO 'gseadmin'@'%';
    ```
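
As a convenience, the first three settings can also be queried in a single statement from the command line; this is just a sketch and assumes a local account with sufficient privileges (root is used purely for illustration):

```shell
# Combined check for requirements 1-3 (adjust the user as needed)
mysql -u root -p -e "SELECT @@global.time_zone, @@global.event_scheduler, @@global.log_bin_trust_function_creators;"
```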

If you prefer to begin with an empty database, there is nothing more for you to do. The required tables will be generated through Flyway migrations during the initial startup of the server. However, if you would like your local database to be pre-populated with the data we've collected, you can use the compressed SQL dump we offer. We host this dump, along with the four previous iterations, on Dropbox. After choosing and downloading a database dump, you can import the data by executing:

```shell
gzcat < gse.sql.gz | mysql -u gseadmin -pLugano2020 gse
```
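
After the import finishes, a quick sanity check that the schema was populated (nothing project-specific is assumed beyond the credentials above):

```shell
# List the tables created by the imported dump
mysql -u gseadmin -pLugano2020 -e "SHOW TABLES;" gse
```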

Server

Before attempting to run the server, you should generate your own GitHub personal access token (PAT). The crawler relies on the GraphQL API, which is inaccessible without authentication. To access the information provided by the GitHub API, the token must include the repo scope.
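
If you want to confirm that a freshly generated token actually reaches the GraphQL API before starting the server, a minimal check against GitHub's public endpoint looks like this (the query simply echoes back the authenticated login):

```shell
# Should return {"data":{"viewer":{"login":"<your_username>"}}} for a valid token
curl -s -X POST https://api.github.com/graphql \
     -H "Authorization: bearer <your_access_token>" \
     -d '{"query": "query { viewer { login } }"}'
```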

Once that is done, you can run the server locally using Maven:

```shell
mvn spring-boot:run
```

If you want to make use of the token when crawling, specify it in the run arguments:

```shell
mvn spring-boot:run -Dspring-boot.run.arguments=--ghs.github.tokens=<your_access_token>
```

Alternatively, you can compile and run the JAR directly:

```shell
mvn clean package
ln target/ghs-application-*.jar target/ghs-application.jar
java -Dghs.github.tokens=<your_access_token> -jar target/ghs-application.jar
```

Below is the list of project-specific arguments supported by the application, as defined in application.properties (a launch example using a few of them follows the table):

| Variable Name | Type | Default Value | Description |
|---------------|------|---------------|-------------|
| ghs.github.tokens | List<String> | | List of GitHub personal access tokens (PATs) that will be used for mining the GitHub API. Must not contain blank strings. |
| ghs.github.api-version | String | 2022-11-28 | GitHub API version used across various operations. |
| ghs.git.username | String | | Git account login used to interact with the version control system. |
| ghs.git.password | String | | Password used to authenticate the specified Git account. |
| ghs.git.config | Map<String,String> | See application.properties | Git configurations specific to the application[^2]. |
| ghs.git.folder-prefix | String | ghs-clone- | Prefix used for the temporary directories into which analyzed repositories are cloned. Must not be blank. |
| ghs.git.ls-remote-timeout-duration | Duration | 1m | Maximum time allowed for listing remotes of Git repositories. |
| ghs.git.clone-timeout-duration | Duration | 5m | Maximum time allowed for cloning Git repositories. |
| ghs.cloc.max-file-size | DataSize | 25MB | Maximum file size threshold for analysis with cloc. |
| ghs.cloc.timeout-duration | Duration | 5m | Maximum time allowed for a cloc command to execute. |
| ghs.crawler.enabled | Boolean | true | Specifies if the repository crawling job is enabled. |
| ghs.crawler.minimum-stars | int | 10 | Inclusive lower bound for the number of stars a project needs to have in order to be picked up by the crawler. Must not be negative. |
| ghs.crawler.languages | List<String> | See application.properties | List of language names that will be targeted during crawling. Must not contain blank strings. To ensure proper operation, the names must match those specified in linguist. |
| ghs.crawler.start-date | Date | 2008-01-01T00:00:00Z | Default crawler start date: the earliest date for repository crawling in the absence of prior crawl jobs. Value format: yyyy-MM-ddTHH:MM:SSZ. |
| ghs.crawler.delay-between-runs | Duration | PT6H | Delay between successive crawler runs, expressed as a duration string. |
| ghs.analysis.enabled | Boolean | true | Specifies if the analysis job is enabled. |
| ghs.analysis.delay-between-runs | Duration | PT6H | Delay between successive analysis runs, expressed as a duration string. |
| ghs.analysis.max-pool-threads | int | 3 | Maximum number of live threads dedicated to concurrently analyzing repositories. Must be positive. |
| ghs.clean-up.enabled | Boolean | true | Specifies if the job responsible for removing unavailable repositories (clean-up) is enabled. |
| ghs.clean-up.cron | CronTrigger | 0 0 0 * * 1 | Schedule for repository clean-up runs, expressed as a Spring CRON expression. |

[^2]: We separate the application-level Git configurations from the ones used by the user to avoid any potential conflicts or confusion. As such, an application-specific configuration file is created in the temporary directory on startup. Settings added to the file depend on the ghs.git.config entries in the application.properties. Note that configuration subsections are currently not supported.
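
As an illustration of how these arguments are passed at launch, here is a sketch combining a few of them with the JAR invocation from above; the values are placeholders, not recommendations:

```shell
# Launch with two PATs, a higher star threshold, and a shorter crawl interval (illustrative values)
java -jar target/ghs-application.jar \
     --ghs.github.tokens=<token_1>,<token_2> \
     --ghs.crawler.minimum-stars=50 \
     --ghs.crawler.delay-between-runs=PT1H
```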

Web UI

The easiest way to launch the front-end is through the provided NPM script:

```shell
npm run dev
```

You can also use the built-in web server of your IDE, or any other web server of your choice. Regardless of which hosting method you choose, the back-end CORS configuration restricts you to ports 3030 and 7030.
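
For reference, any generic static file server will do as long as it binds to one of the allowed ports; a sketch using http-server, which is not part of the project tooling, with a hypothetical directory placeholder:

```shell
# Serve the front-end sources on one of the CORS-allowed ports
npx http-server <front-end-directory> -p 3030
```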

Dockerisation :whale:

The deployment stack consists of the following containers:

| Service/Container name | Image | Description | Enabled by Default |
|------------------------|:-----:|-------------|:------------------:|
| gse-database | mysql | Platform database | :white_check_mark: |
| gse-migration | flyway | Database schema migration executions | :white_check_mark: |
| gse-backup | tiredofit/db-backup | Automated database backups | :negative_squared_cross_mark: |
| gse-server | seart/ghs-server | Spring Boot server application | :white_check_mark: |
| gse-website | seart/ghs-website | NGINX web server acting as HTML supplier | :white_check_mark: |
| gse-watchtower | containrrr/watchtower | Automatic Docker image updates | :negative_squared_cross_mark: |

The service dependency chain can be represented as follows:

```mermaid
graph RL
    gse-migration --> |service_healthy| gse-database
    gse-backup --> |service_completed_successfully| gse-migration
    gse-server --> |service_completed_successfully| gse-migration
    gse-website --> |service_healthy| gse-server
    gse-watchtower --> |service_healthy| gse-website
```

Deploying is as simple as running the following command from the docker-compose directory:

```shell
docker-compose -f docker-compose.yml up -d
```
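
Once the stack is up, standard Compose commands can be used to confirm that the containers are running and to follow the server output:

```shell
# Check container status and tail the back-end logs
docker-compose -f docker-compose.yml ps
docker-compose -f docker-compose.yml logs -f gse-server
```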

It is important to note that the database setup steps explained in the preceding section aren't necessary when running with Docker. This is because the environment properties passed to the service will automatically create the MySQL user and database during the initial startup. However, this convenience doesn't extend to the database data, as the default deployment generates an empty database. If you wish to use existing data from the dumps, you will need to override the docker-compose deployment to employ a custom database image that includes the dump. To achieve this, create your docker-compose.override.yml file with the following contents:

```yaml
version: "3.9"
name: "gse"

services:
  gse-database:
    image: seart/ghs-database:latest
```

The above image will include the freshest database dump, at most 15 days behind the actual platform data. For a more specific database version, refer to the Docker Hub page. Remember to specify the override file during deployment:

```shell
docker-compose -f docker-compose.yml -f docker-compose.override.yml up -d
```

The database data itself is kept in the gse-data volume, while detailed back-end logs are kept in a local mount called logs. You can also use this override file to change the configurations of other services. For example, specifying your own PAT for the crawler:
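
To locate that volume and follow the logs from the host, something along these lines should work; the exact volume name may carry a Compose project prefix, and the log file names are not assumed here:

```shell
# Find the data volume and tail whatever the server writes to the local logs mount
docker volume ls | grep gse
tail -f logs/*
```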

```yaml
version: "3.9"
name: "gse"

services:
  # other services omitted...

  gse-server:
    environment:
      GHS_GITHUB_TOKENS: "A single or comma-separated list of token(s)"
      GHS_CRAWLER_ENABLED: "true"
```

Any of the Spring Boot properties or aforementioned application-specific properties can be overridden. Keep in mind that a property such as ghs.x.y corresponds to the GHS_X_Y service environment setting.

Another example is the automated database backup service, which is disabled by default. If you would like to enable it, add the following to the override file:

```yaml
version: "3.9"
name: "gse"

services:
  # other services omitted...

  gse-backup:
    restart: always
    entrypoint: "/init"
```

FAQ

How can I request a feature or ask a question?

If you have ideas for a feature you would like to see implemented or if you have any questions, we encourage you to create a new discussion. By initiating a discussion, you can engage with the community and our team, and we will respond promptly to address your queries or consider your feature requests.

How can I report a bug?

To report any issues or bugs you encounter, create a new issue. Providing detailed information about the problem you're facing will help us understand and address it more effectively. Rest assured, we're committed to promptly reviewing and responding to the issues you raise, working collaboratively to resolve any bugs and improve the overall user experience.

How do I contribute to the project?

Refer to CONTRIBUTING.md for more information.

How do I extend/modify the existing database schema?

To do that, you should be familiar with database migration tools and practices. This project uses Flyway by Redgate. The general rule for schema manipulation is: create new migrations, and refrain from editing existing ones.
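
In practice that means adding a new versioned migration file next to the existing ones rather than touching them; the path below is Flyway's default location under Spring Boot, and the file name is purely illustrative:

```shell
# New migrations follow Flyway's V<version>__<description>.sql naming convention
touch src/main/resources/db/migration/V<next_version>__<short_description>.sql
```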

Owner

  • Name: SEART - SoftwarE Analytics Research Team
  • Login: seart-group
  • Kind: organization
  • Location: Lugano, Switzerland

The SEART group is part of the Software Institute at the Università della Svizzera italiana, located in Lugano, Switzerland.

Citation (CITATION.bib)

@inproceedings{Dabic:msr2021data,
  author    = {Ozren Dabic and Emad Aghajani and Gabriele Bavota},
  title     = {Sampling Projects in GitHub for {MSR} Studies},
  booktitle = {18th {IEEE/ACM} International Conference on Mining Software Repositories, {MSR} 2021},
  pages     = {560--564},
  publisher = {{IEEE}},
  year      = {2021}
}

GitHub Events

Total
  • Issues event: 19
  • Watch event: 31
  • Delete event: 170
  • Issue comment event: 38
  • Push event: 184
  • Pull request review event: 166
  • Pull request event: 340
  • Fork event: 5
  • Create event: 180
Last Year
  • Issues event: 19
  • Watch event: 31
  • Delete event: 170
  • Issue comment event: 38
  • Push event: 184
  • Pull request review event: 166
  • Pull request event: 340
  • Fork event: 5
  • Create event: 180

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 2,713
  • Total Committers: 10
  • Avg Commits per committer: 271.3
  • Development Distribution Score (DDS): 0.211
Past Year
  • Commits: 349
  • Committers: 3
  • Avg Commits per committer: 116.333
  • Development Distribution Score (DDS): 0.372
Top Committers
| Name | Email | Commits |
|------|-------|--------:|
| dabico | d****o@u****h | 2,140 |
| dependabot[bot] | 4****] | 349 |
| Albert Cerfeda | c****t@g****m | 93 |
| Emad Aghajani | e****s@g****m | 92 |
| seart-bot | s****i@g****m | 30 |
| gbavota | g****a@g****m | 3 |
| Csaba Nagy | c****y@u****h | 3 |
| GitLab | r****t@l****t | 1 |
| Albert | 3****a | 1 |
| Emad Aghajani | e****s@E****t | 1 |
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 59
  • Total pull requests: 470
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 12
  • Total pull request authors: 5
  • Average comments per issue: 1.27
  • Average comments per pull request: 0.2
  • Merged pull requests: 402
  • Bot issues: 19
  • Bot pull requests: 395
Past Year
  • Issues: 11
  • Pull requests: 305
  • Average time to close issues: 27 days
  • Average time to close pull requests: 2 days
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.18
  • Average comments per pull request: 0.17
  • Merged pull requests: 261
  • Bot issues: 10
  • Bot pull requests: 303
Top Authors
Issue Authors
  • github-actions[bot] (19)
  • dabico (17)
  • wolfenmark (9)
  • emadpres (4)
  • dependabot[bot] (4)
  • marodev (2)
  • andrehora (1)
  • kasrahabib (1)
  • cerfedino (1)
  • kargaranamir (1)
  • jhs507 (1)
  • hsrain3 (1)
Pull Request Authors
  • dependabot[bot] (532)
  • dabico (73)
  • cerfedino (12)
  • emadpres (1)
  • lanpirot (1)
Top Labels
Issue Labels
dumps (19) bug (12) enhancement (9) dependencies (7) feature (4) good first issue (2) stale (2) refactoring (1) documentation (1) wontfix (1) question (1) docker (1) server (1)
Pull Request Labels
dependencies (533) enhancement (3) bug (1) documentation (1) java (1)

Dependencies

pom.xml maven
  • com.fasterxml.jackson.core:jackson-annotations 2.13.2
  • com.fasterxml.jackson.core:jackson-core 2.13.2
  • com.fasterxml.jackson.core:jackson-databind 2.13.2
  • com.fasterxml.jackson.dataformat:jackson-dataformat-csv 2.13.2
  • com.fasterxml.jackson.dataformat:jackson-dataformat-xml 2.13.2
  • com.google.code.gson:gson 2.8.9
  • com.google.guava:guava 31.0.1-jre
  • com.squareup.okhttp3:okhttp 4.9.3
  • javax.annotation:javax.annotation-api 1.3.2
  • mysql:mysql-connector-java 8.0.25
  • org.apache.commons:commons-io 1.3.2
  • org.apache.commons:commons-lang3 3.12.0
  • org.flywaydb:flyway-core
  • org.hibernate.javax.persistence:hibernate-jpa-2.1-api 1.0.2.Final
  • org.hibernate:hibernate-core 5.6.5.Final
  • org.hibernate:hibernate-jpamodelgen 5.6.5.Final
  • org.hibernate:hibernate-validator 6.1.2.Final
  • org.jetbrains.kotlin:kotlin-stdlib 1.6.10
  • org.projectlombok:lombok 1.18.22
  • org.springframework.boot:spring-boot-configuration-processor 2.6.4
  • org.springframework.boot:spring-boot-starter-data-jpa 2.6.4
  • org.springframework.boot:spring-boot-starter-hateoas 2.6.4
  • org.springframework.boot:spring-boot-starter-web 2.6.4
  • org.springframework.boot:spring-boot-starter-test 2.6.4 test