code-lifetime

Tools for analyzing the lifetime of code lines and tokens

https://github.com/dspinellis/code-lifetime

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: dspinellis
  • License: apache-2.0
  • Language: Perl
  • Default Branch: master
  • Size: 24.4 KB
Statistics
  • Stars: 4
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed almost 2 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Tools for tracking the lifetime of code lines and tokens

The tools in this repository allow the precise tracking of when a specific code line or token is modified or removed. They have been used for conducting the studies described in the following paper.

Diomidis Spinellis, Panos Louridas, and Maria Kechagia. Software evolution: The lifetime of fine-grained elements. PeerJ Computer Science, 7:e372, February 2021. doi:10.7717/peerj-cs.372

This is the paper's abstract.

A model regarding the lifetime of individual source code lines or tokens can estimate maintenance effort, guide preventive maintenance, and, more broadly, identify factors that can improve the efficiency of software development. We present methods and tools that allow tracking of each line's or token's birth and death. Through them, we analyze 3.3 billion source code element lifetime events in 89 revision control repositories. Statistical analysis shows that code lines are durable, with a median lifespan of about 2.3 years, and that young lines are more likely to be modified or deleted, following a Weibull distribution with the associated hazard rate decreasing over time. This behavior appears to be independent from specific characteristics of lines or tokens, as we could not determine factors that influence significantly their longevity across projects. The programming language, and developer tenure and experience were not found to be significantly correlated with line or token longevity, while project size and project age showed only a slight correlation.

The following sections describe the tools included in this repository.

lifetime

The lifetime tool parses the output of successive git diff runs and, for every changed or deleted line, outputs a record containing the timestamps of the line's creation and deletion. Input can be supplied on standard input or as files specified as arguments. To help monitor progress on repositories with long histories, the tool also writes to standard error the SHA hash of each commit being processed. When all commits have been processed, it outputs the creation timestamp of each remaining line followed by alive NA.
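
The bookkeeping described above can be illustrated with a toy model. The sketch below is not the code of lifetime.pl; it assumes a single file, zero-context hunks (as produced with -U0) already parsed into hypothetical `(old_start, old_count, new_count)` tuples, and it ignores renames and merges. The function name `apply_commit` is likewise made up for this illustration.

```python
def apply_commit(births, timestamp, hunks):
    """Apply one commit's hunks to a single file's line-birth list.

    births    -- list with one birth timestamp per live line (modified in place)
    timestamp -- commit time of the commit being applied
    hunks     -- (old_start, old_count, new_count) tuples taken from
                 '@@ -old_start,old_count +new_start,new_count @@' headers
    Returns the (birth, death) records of the lines the commit removed.
    """
    records = []
    # Work from the bottom of the file upwards, so that positions expressed
    # in old-file coordinates remain valid while the list is being edited.
    for old_start, old_count, new_count in sorted(hunks, reverse=True):
        # With zero context, a pure insertion has old_count == 0 and
        # old_start names the line after which the new lines are placed.
        index = old_start if old_count == 0 else old_start - 1
        for birth in births[index:index + old_count]:
            records.append((birth, timestamp))
        # Removed lines die; the lines replacing them are born now.
        births[index:index + old_count] = [timestamp] * new_count
    return records


births = []
apply_commit(births, 100, [(0, 0, 3)])               # three lines created
dead = apply_commit(births, 200, [(2, 1, 2)])        # line 2 becomes two lines
dead += apply_commit(births, 300, [(1, 1, 0), (4, 0, 1)])

for birth, death in dead:
    print(birth, death)
for birth in births:
    print(birth, "alive NA")
```

A changed line is treated here as the death of the old line and the birth of a new one, which matches the record format described above.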

Example run

```
git log -M -m --pretty=tformat:'commit %H %ct' --topo-order --reverse -U0 | lifetime.pl
1516281718 1597482365
1514636783 1597482365
1591563588 1598358198
1601804488 1601809923
1601809923 1601810093
1601810093 1601821073
1601809923 1602450156
1601804488 1603903274
1601804488 1603903274
1601821073 1603903274
1601821073 1603903274
1525764676 alive NA
1587747980 alive NA
1587747980 alive NA
1587747980 alive NA
1586362490 alive NA
1586362490 alive NA
```
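
Records in this format are easy to post-process. The following is a minimal sketch, not part of the repository, that separates dead from live lines and computes the median lifespan of the dead ones; the function names are hypothetical, and the sample records are copied from the run above.

```python
from statistics import median


def parse_lifetime_output(lines):
    """Split lifetime records into (birth, death) pairs and live-line births."""
    dead, alive = [], []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            dead.append((int(fields[0]), int(fields[1])))
        elif len(fields) == 3 and fields[1:] == ["alive", "NA"]:
            alive.append(int(fields[0]))
        # Anything else (e.g. progress lines) is ignored.
    return dead, alive


def median_lifespan_seconds(dead):
    """Median of death minus birth over the lines that have died."""
    return median(death - birth for birth, death in dead)


sample = [
    "1516281718 1597482365",
    "1514636783 1597482365",
    "1591563588 1598358198",
    "1525764676 alive NA",
]
dead, alive = parse_lifetime_output(sample)
print(median_lifespan_seconds(dead) / 86400, "days;", len(alive), "line(s) still alive")
```

Note that a median over dead lines alone underestimates durability, because the lines still alive are censored observations; a survival-analysis method that accounts for them is more appropriate for real studies.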

The tool's operation can be modified through the following command-line arguments.

-c       Output in "compressed" format: commit, followed by birthday of deaths
-d       Report the LoC delta
-D opts  Debug as specified by the letters in opts
           C  Show commit set changes
           D  Show diff headers
           E  Show diff extended headers
           H  Show each commit SHA, timestamp header
           L  Show LoC change processing
           P  Show push to change set operations
           R  Reconstruct the repository contents from its log
           @  Show range headers
           S  Show results of splicing operations
           u  Run unit tests
-e SHA   End processing after the specified (full) SHA commit hash
-E       Redirect (debugging) output to stderr
-g file  Create a growth file with line count of live lines at every commit
-h       Print usage information and exit
-l       Associate with each line details about its composition
-q       Quiet; do not output commit and timestamp on normal processing
-s       Report only changes in source code files (based on their suffix)
-t       Show tokens with lifetime

daglp

The daglp program reduces a Git commit history to a single linear path, namely the one containing the most commits, using a graph longest-path algorithm. Given as input a topologically sorted list of each commit's parents, it outputs the longest path of the directed acyclic graph from the beginning (the oldest commit) to the end (the newest one). The input is expected to come from a command such as git log --topo-order --pretty=format:'%H %at %P'. The output is a set of "SHA identifier" lines.

Example run

$ git log --topo-order --pretty=format:'%H %at %P' | daglp
13af1997c687bb4462f97ab512e51e8c072a2858 1370686723
d8e85967adc0b188a49117b5db4f10cc6c7c36cb 1370688578
27a8ec806f16ae66a7eaa8563220f600c99b9ab9 1370688605
222f60c28228e189c0986f8c4e86cc5a07e69bfa 1370688896
a0759fa8d6838170e4b693d26d6edb5e0463c1d0 1370689181
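
The longest-path idea can be sketched in a few lines of Python. This is an illustration of the general technique, not daglp's actual implementation. It assumes that the input lists children before their parents (as git log --topo-order does) and that the first line is the newest commit, which is taken as the end of the path; the function name `longest_path` is hypothetical.

```python
def longest_path(log_lines):
    """Return 'SHA timestamp' lines of a longest path, oldest commit first.

    log_lines -- 'SHA timestamp [parent ...]' lines, children before parents.
    """
    parents, timestamps, order = {}, {}, []
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        sha, timestamp, *commit_parents = fields
        parents[sha] = commit_parents
        timestamps[sha] = timestamp
        order.append(sha)

    length = {}       # number of commits on the longest path ending at a commit
    best_parent = {}  # the parent through which that longest path arrives
    # Reversing the input visits every parent before any of its children.
    for sha in reversed(order):
        best = None
        for parent in parents[sha]:
            if parent in length and (best is None or length[parent] > length[best]):
                best = parent
        length[sha] = 1 + (length[best] if best is not None else 0)
        best_parent[sha] = best

    # Walk back from the newest commit along the recorded best parents.
    path = []
    sha = order[0] if order else None
    while sha is not None:
        path.append(sha)
        sha = best_parent[sha]
    path.reverse()
    return [f"{s} {timestamps[s]}" for s in path]


# Toy history: a <- b, a <- c <- d, and e merges b and d.
toy_log = ["e 5 b d", "d 4 c", "c 3 a", "b 2 a", "a 1"]
print("\n".join(longest_path(toy_log)))
```

In the toy history the path through c and d is chosen over the one through b, because it contains one more commit.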

difflog

The difflog tool produces a Git repository's log of changes in unified diff format, as required by the lifetime tool. This is the equivalent of running the following command.

git -c diff.renameLimit=30000 log -m -M -C --pretty=tformat:'commit %H %at' --topo-order --reverse -U0

However, the above command has been known to produce incorrect results, which difflog corrects. Any command-line options are passed as arguments to git diff.

tokenize

The tokenize tool converts the source code commits of a Git repository into equivalent ones containing one token per line, as proposed, for example, by cregit and used on the Linux kernel. The new repository can then be used for performing token-level diffs.

The tool supports code written in Java, C, C#, C++, PHP, and Python, as recognized by each file's suffix. The tool expects the separate tokenizer tool to be installed and available in its execution path. It is invoked with a Git repository directory and branch name as arguments. Its output is suitable for feeding into git fast-import. Each line in the new repository contains the token's type (KW for keyword, NUM for number, ID for identifier, and TOK for all other tokens), followed by the actual token.
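
To make the one-token-per-line format concrete, here is a toy Python sketch that emits lines in the same "type token" shape. It is not the tokenizer tool the repository relies on: the regular expression, the tiny keyword set, and the function name `tokenize_lines` are all simplifying assumptions, and real languages need a proper lexer (strings, comments, multi-character operators, and so on).

```python
import re

# A deliberately tiny, illustrative keyword subset (an assumption).
KEYWORDS = {"constexpr", "int", "return", "if", "else", "while", "for"}

TOKEN_RE = re.compile(
    r"""
      (?P<NUM>\d+(?:\.\d+)?)   # integer or simple decimal number
    | (?P<ID>[A-Za-z_]\w*)     # identifier or keyword
    | (?P<TOK>::|[^\s\w])      # scope operator or any other single character
    """,
    re.VERBOSE,
)


def tokenize_lines(source):
    """Return one 'TYPE token' string per token found in source."""
    lines = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "ID" and text in KEYWORDS:
            kind = "KW"
        lines.append(f"{kind} {text}")
    return lines


print("\n".join(tokenize_lines("constexpr int TokenId::NUMBER_END;")))
```

Because each token sits on its own line, an ordinary line-based diff of two such files reports changes at token granularity.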

Example run

```
$ git init tokenized-repo
$ tokenize.pl repo main | (cd tokenized-repo ; git fast-import)
/usr/lib/git-core/git-fast-import statistics:
Alloc'd objects:       5000
Total objects:          494 (        91 duplicates                  )
      blobs  :          243 (        87 duplicates        234 deltas of        237 attempts)
      trees  :          141 (         4 duplicates        138 deltas of        138 attempts)
      commits:          110 (         0 duplicates          0 deltas of          0 attempts)
      tags   :            0 (         0 duplicates          0 deltas of          0 attempts)
Total branches:           1 (         1 loads     )
      marks:           1024 (       440 unique    )
      atoms:             54
Memory total:          2344 KiB
       pools:          2110 KiB
     objects:           234 KiB
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 35184372088832
pack_report: pack_used_ctr            =         25
pack_report: pack_mmap_calls          =         10
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =     237444 /     237444

$ cd tokenized-repo
$ git show
commit 1004d9ad8074c774dfe60f8d0527d3eefd20a003 (HEAD -> master)
Author: Diomidis Spinellis <dds@aueb.gr>
Date:   Fri Feb 8 15:34:17 2019 +0200

    Handle numbers representing infinity

    Issue: #10

diff --git a/src/TokenId.cpp b/src/TokenId.cpp
index 35b8296..511e57a 100644
--- a/src/TokenId.cpp
+++ b/src/TokenId.cpp
@@ -37,6 +37,18 @@ KW constexpr
 KW int
 ID TokenId
 TOK ::
+ID NUMBER_INFINITE
+TOK ;
+KW constexpr
+KW int
+ID TokenId
+TOK ::
+ID NUMBER_NAN
+TOK ;
+KW constexpr
+KW int
+ID TokenId
+TOK ::
 ID NUMBER_END
 TOK ;
 KW constexpr
```

Owner

  • Name: Diomidis Spinellis
  • Login: dspinellis
  • Kind: user
  • Location: Athens, Greece
  • Company: Athens University of Economics and Business & Delft University of Technology

Professor of Software Engineering at AUEB and of Software Analytics at TU Delft, programmer, and technology author.

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0