yegor256/cam
Classes and Metrics (CaM): a dataset of Java classes from public open-source GitHub repositories
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org, ieee.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Repository
Classes and Metrics (CaM): a dataset of Java classes from public open-source GitHub repositories
Basic Info
- Host: GitHub
- Owner: yegor256
- License: mit
- Language: Shell
- Default Branch: master
- Homepage: http://cam.yegor256.com
- Size: 2.83 MB
Statistics
- Stars: 26
- Watchers: 3
- Forks: 49
- Open Issues: 39
- Releases: 35
Metadata Files
README.md
Classes and Metrics (CaM)
This is a dataset of open source Java classes and some metrics on them. Every now and then I make a new version of it using the scripts in this repository. You are welcome to use it in your research. Each release has a fixed version. By referring to it in your research, you avoid ambiguity and guarantee repeatability of your experiments.
A more formal explanation of this project is available in PDF.
The latest ZIP archive with the dataset is here: cam-2024-03-02.zip (2.22Gb). There are 48 metrics calculated for 532,394 Java classes from 1000 GitHub repositories, including: lines of code (reported by cloc); NCSS; cyclomatic and cognitive complexity (by PMD); Halstead volume, effort, and difficulty; maintainability index; number of attributes, constructors, methods; number of Git authors; and others (see PDF).
Previous archives (took me a few days to build each of them, using a pretty big machine):
- cam-2024-03-02.zip (2.22Gb): 1000 repos, 48 metrics, 532K classes
- cam-2023-10-22.zip (2.19Gb): 1000 repos, 33 metrics, 863K classes
- cam-2023-10-11.zip (3Gb): 959 repos, 29 metrics, 840K classes
- cam-2021-08-04.zip (692Mb): 1000 repos, 15 metrics
- cam-2021-07-08.zip (387Mb): 1000 repos, 11 metrics
If you want to create a new dataset, just run the following command and the entire dataset will be built in the current directory (you need to have Docker installed), where 1000 is the number of repositories to fetch from GitHub and XXX is your personal access token:

```bash
docker run --detach --name=cam --rm --volume "$(pwd):/dataset" \
  -e "TOKEN=XXX" -e "TOTAL=1000" -e "TARGET=/dataset" \
  --oom-kill-disable --memory=16g --memory-swap=16g \
  yegor256/cam:0.9.3 "make -e >/dataset/make.log 2>&1"
```
This command will create a new Docker container, running in the background (run `docker ps -a` to see it). If you want to run Docker interactively and see all the logs, disable detached mode by removing the `--detach` option from the command.

The dataset will be created in the current directory (this may take some time, maybe a few days!), and a .zip archive will also be there. The Docker container runs in the background: you can safely close the console and come back when the dataset is ready and the container is deleted.
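While the container runs detached, its progress can be followed from the host. The sketch below is one possible way to do that, assuming the container name `cam` and the log file `make.log` from the command above. The helper names `container_state` and `show_progress` are made up for illustration and are not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for watching a detached build; not part of the repository.

# Prints "running" while Docker lists a running container whose name contains $1.
# Prints "finished" otherwise (with --rm the container disappears when the build
# ends); this is also the answer when the docker command itself is unavailable.
container_state() {
  if [ -n "$(docker ps --quiet --filter "name=$1" 2>/dev/null)" ]; then
    echo "running"
  else
    echo "finished"
  fi
}

# Prints the last $2 lines (20 by default) of the log file $1.
show_progress() {
  tail -n "${2:-20}" "$1"
}

# Example usage from the directory that holds the dataset:
#   container_state cam
#   show_progress make.log 50
```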
Make sure your server has enough swap memory (at least 32Gb) and free disk space (at least 512Gb) — without this, the dataset will have many errors. It's better to have multiple CPUs, since the entire build process is highly parallel: all CPUs will be utilized.
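Before starting a build that may run for days, it can help to check these requirements up front. The following is a minimal sketch for Linux, assuming the thresholds quoted above (32 for swap, 512 for disk) are read as GiB. The function names are hypothetical and the script only prints warnings.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; thresholds are taken from the text above.
required_swap_gib=32
required_disk_gib=512

# Succeeds when the first integer is at least the second one.
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# Total swap in whole GiB, read from /proc/meminfo (Linux only).
swap_gib() {
  awk '/^SwapTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free space in whole GiB on the filesystem that holds directory $1.
free_disk_gib() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d", $4 / 1024 / 1024 }'
}

# Prints a warning for every requirement that is not met in directory $1.
preflight() {
  local dir=${1:-.}
  meets_minimum "$(swap_gib)" "$required_swap_gib" \
    || echo "WARNING: less than ${required_swap_gib} GiB of swap"
  meets_minimum "$(free_disk_gib "$dir")" "$required_disk_gib" \
    || echo "WARNING: less than ${required_disk_gib} GiB of free disk space"
  echo "CPUs available: $(nproc)"
}

# Example usage: preflight "$(pwd)"
```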
If the script fails at some point, you can restart it without deleting previously created files. The process is incremental: it will understand where it stopped before. To rerun an entire step, delete the corresponding directory:

- `github/` to rerun `clone`
- `temp/jpeek-logs/` to rerun `jpeek`
- `measurements/` to rerun `measure`
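The step-to-directory mapping above could be wrapped in a small helper so that the wrong directory is not deleted by accident. This is only a sketch: `step_dir` and `reset_step` are hypothetical names, and the directories are assumed to sit directly inside the dataset directory.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the mapping itself comes from the list above.

# Maps a build step to the directory whose removal forces that step to rerun.
step_dir() {
  case "$1" in
    clone)   echo "github" ;;
    jpeek)   echo "temp/jpeek-logs" ;;
    measure) echo "measurements" ;;
    *)       echo "unknown step: $1" >&2; return 1 ;;
  esac
}

# Deletes the directory of step $1 inside dataset directory $2 (default: current).
reset_step() {
  local dir
  dir=$(step_dir "$1") || return 1
  rm -rf "${2:-.}/${dir}"
}

# Example usage: reset_step measure /dataset
```

After the directory is gone, restarting the build in the usual way should pick up from that step.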
You can also run it without Docker:

```bash
make clean
make TOTAL=100
```

This should work if you have all the dependencies installed, as suggested in the Dockerfile.
In order to analyze just a single repository, do this (yegor256/tojos as an example):

```bash
make clean
make REPO=yegor256/tojos
```
How to Contribute (e.g. by adding a new metric)
For example, you want to add a new metric to the script:

- Fork a repository.
- Ensure dependencies are installed and the upstream `master` branch works correctly:

  ```bash
  sudo make install
  make env test lint
  ```

  If you discover any errors in the `master` branch during this step, please create an issue to report it before proceeding with your changes.
- Create a new file in the `metrics/` directory, using one of the existing files as an example.
- Create a test for your metric, in the `tests/metrics/` directory.
- Run the entire test suite (this should take a few minutes to complete, without errors):

  ```bash
  sudo make test lint
  ```

  You can also test it with Docker:

  ```bash
  docker build . -t cam
  docker run --rm cam make test
  ```

  There is even a faster way to run all tests, with the help of Docker, if you don't change any installation scripts:

  ```bash
  docker run -v $(pwd):/c --rm yegor256/cam:0.9.3 make -C /c test
  ```
- Send us a pull request. We will review your changes and apply them to the `master` branch shortly, provided they don't violate our quality standards.
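To give a feel for what a new metric might look like, here is a rough sketch of a made-up metric called `imports`. It rests on an assumption that is not confirmed in this text: that a metric script receives the path to a `.java` file as its first argument and the path of an output file as its second, and writes `name value` lines. Check the existing files in `metrics/` for the real contract before copying anything from it.

```bash
#!/usr/bin/env bash
# Hypothetical metric "imports": the number of import statements in a Java file.
# ASSUMPTION: $1 is the .java file, $2 is the output file, and the expected
# output format is "name value". Verify this against the scripts in metrics/.
count_imports() {
  local java=$1 output=$2
  local count
  # grep -c prints 0 and exits with status 1 when nothing matches,
  # hence the "|| true" to keep the function usable under "set -e".
  count=$(grep -c -E '^[[:space:]]*import[[:space:]]' "$java" || true)
  echo "imports ${count}" > "$output"
}

# Example usage: count_imports src/Foo.java /tmp/Foo.m
```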
How to Calculate Additional Metrics
You may want to use this dataset as a basis, with an intent of adding your own metrics on top of it. It should be easy:
- Clone this repo into `cam/` directory
- Download ZIP archive
- Unpack it to the `cam/dataset/` directory
- Add a new script to the `cam/metrics/` directory (use `ast.py` as an example)
- Delete all other files except yours from the `cam/metrics/` directory
- Run `make` in the `cam/` directory: `sudo make install; make all`
`make` should understand that a new metric was added. It will apply this new metric to all .java files, generate new .csv reports, aggregate them with existing reports (in the `cam/dataset/data/` directory), and then the final .pdf report will also be updated.
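The steps above could be scripted roughly as follows. This is a sketch under stated assumptions: `my_metric.py` is a placeholder for your own script, the archive is assumed to unpack directly into `cam/dataset/`, and `keep_only` is a hypothetical helper that only touches regular files at the top level of the directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: removes every regular file in directory $1 except
# the one named $2. Subdirectories, if any, are left untouched.
keep_only() {
  find "$1" -maxdepth 1 -type f ! -name "$2" -delete
}

# Possible sequence, with placeholder names (not run here):
#   git clone https://github.com/yegor256/cam.git cam
#   unzip cam-2024-03-02.zip -d cam/dataset/   # ASSUMPTION about archive layout
#   cp my_metric.py cam/metrics/
#   keep_only cam/metrics my_metric.py
#   (cd cam && sudo make install && make all)
```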
How to Build a New Archive
When it's time to build a new archive, create a new m7i.2xlarge
server (8 CPU, 32Gb RAM, 512Gb disk) with Ubuntu 22.04 in AWS.
Then, install Docker into it:
```bash
sudo apt update -y
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) \
signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update -y
sudo apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo usermod -aG docker ${USER}
```
Then, add swap memory of 16Gb:
```bash
sudo dd if=/dev/zero of=/swapfile bs=1048576 count=16384
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
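The `count` in the `dd` command follows from the block size: `bs=1048576` is 1 MiB, so a 16 GiB file needs 16 × 1024 blocks. A small sketch of that arithmetic, in case a different swap size is wanted (`swap_blocks` is a hypothetical helper name):

```bash
#!/usr/bin/env bash
# Number of 1 MiB blocks (bs=1048576) that dd must write for a swap file of $1 GiB.
swap_blocks() {
  echo $(( $1 * 1024 ))
}

swap_blocks 16   # prints 16384, the count used above

# Once the swap file is enabled, "swapon --show" lists the active swap areas.
```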
Then, create a personal access token in GitHub, and run Docker as explained above.
Owner
- Name: Yegor Bugayenko
- Login: yegor256
- Kind: user
- Location: Russia
- Company: @zerocracy
- Website: https://www.yegor256.com
- Twitter: yegor256
- Repositories: 176
- Profile: https://github.com/yegor256
Author of "Elegant Objects" book series (buy them on Amazon); architect of @objectionary; founder of @zerocracy; creator of @zold-io
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
license: MIT
repository-code: https://github.com/yegor256/cam
abstract: |
  CAM is a dataset of open source Java classes and some metrics on them. Every now and then we make a new version of it using the scripts in this repository. You are welcome to use it in your researches. Each release has a fixed version. By referring to it in your research you avoid ambiguity and guarantees repeatability of your experiments.
authors:
  - family-names: "Bugayenko"
    given-names: "Yegor"
    orcid: "https://orcid.org/0000-0001-6370-0678"
title: "CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics"
version: 0.9.3
doi: 10.48550/arXiv.2403.08488
date-released: 2024-09-23
url: "https://arxiv.org/abs/2403.08488"
```
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 164
- Total Committers: 7
- Avg Commits per committer: 23.429
- Development Distribution Score (DDS): 0.213
Top Committers
| Name | Email | Commits |
|---|---|---|
| yegor256 | y****6@g****m | 129 |
| rocket | g****s@g****m | 14 |
| rliskunov | l****a@y****u | 12 |
| renovate[bot] | 2****]@u****m | 5 |
| Yaroslav Starostin | y****n@g****m | 2 |
| rocket-3 | 4****3@u****m | 1 |
| KHairullin_Aleksandr | 3****6@n****u | 1 |
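The derived figures in the "All Time" summary can be reproduced from this table. The sketch below assumes that DDS is defined as one minus the share of commits made by the top committer. That definition is not stated on this page, but it matches the reported value. `commit_stats` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Prints the average commits per committer and the DDS for:
#   $1 = total commits, $2 = number of committers, $3 = commits by the top committer.
# ASSUMPTION: DDS = 1 - (top committer commits / total commits).
commit_stats() {
  LC_ALL=C awk -v t="$1" -v n="$2" -v m="$3" 'BEGIN {
    printf "avg=%.3f dds=%.3f\n", t / n, 1 - m / t
  }'
}

commit_stats 164 7 129   # prints: avg=23.429 dds=0.213
```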
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 145
- Total pull requests: 186
- Average time to close issues: 4 months
- Average time to close pull requests: 14 days
- Total issue authors: 21
- Total pull request authors: 34
- Average comments per issue: 3.39
- Average comments per pull request: 2.61
- Merged pull requests: 140
- Bot issues: 1
- Bot pull requests: 89
Past Year
- Issues: 29
- Pull requests: 61
- Average time to close issues: 14 days
- Average time to close pull requests: 7 days
- Issue authors: 6
- Pull request authors: 16
- Average comments per issue: 3.07
- Average comments per pull request: 1.87
- Merged pull requests: 38
- Bot issues: 0
- Bot pull requests: 11
Top Authors
Issue Authors
- yegor256 (87)
- dzhovi (17)
- timur-harin (12)
- ilnarkhasanov (10)
- veledara (6)
- padjal (4)
- nai1ka (4)
- Nypiaka (3)
- h1alexbel (3)
- RuslanGaliullin (3)
- volodya-lombrozo (3)
- zaqbez39me (2)
- nikzor (2)
- sokanaid (2)
- howcanunot (1)
Pull Request Authors
- renovate[bot] (110)
- github-actions[bot] (27)
- timur-harin (24)
- zaqbez39me (16)
- veledara (13)
- ilnarkhasanov (11)
- lueFlake (10)
- nai1ka (10)
- IlnurHA (9)
- dzhovi (9)
- Raleksan (8)
- padjal (6)
- nikzor (6)
- h1alexbel (5)
- howcanunot (5)
Packages
- Total packages: 1
- Total downloads: unknown
- Total docker downloads: 680
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 35
github actions: yegor256/cam
Run full cycle with one repository
- Homepage: http://cam.yegor256.com
- License: mit
- Latest release: 0.9.3 (published over 1 year ago)
Dependencies
- flake8 ==3.9.2
- javalang ==0.13.0
- pygments ==2.9.0
- pylint ==2.9.3
- actions/checkout v3 composite
- yegor256/latexmk-action 0.7.1 composite
- actions/checkout v3 composite
- yegor256/cam master composite
- Dockerfile * docker
- yegor256/rultor-image 1.8.0 build
- actions/checkout 3df4ab11eba7bda6032a0b82a6bb43b11571feac composite
- peter-evans/create-pull-request v5 composite