codesim
Source Code Clone Detection Using Unsupervised Similarity Measures
Repository
Basic Info
- Host: GitHub
- Owner: jorge-martinez-gil
- License: MIT
- Language: Java
- Default Branch: main
- Homepage: https://link.springer.com/chapter/10.1007/978-3-031-56281-5_2
- Size: 201 KB
Statistics
- Stars: 9
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
This repository contains the source code needed to reproduce the results of the following paper: Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-031-56281-5_2.
The dataset was created by Oscar Karnalim: https://github.com/oscarkarnalim/sourcecodeplagiarismdataset
🌍 Overview
This project implements a collection of established unsupervised similarity measures and applies them to Java source code. The goal is to score the similarity of pairs of code snippets and, from those scores, identify duplicates (a.k.a. clones). The project compares the most promising measures by detection quality and execution time, with software maintenance and plagiarism detection as the intended applications.
📚 Reference
If you use this work, please cite:
@InProceedings{10.1007/978-3-031-56281-5_2,
author="Martinez-Gil, Jorge",
editor="Bludau, Peter
and Ramler, Rudolf
and Winkler, Dietmar
and Bergsmann, Johannes",
title="Source Code Clone Detection Using Unsupervised Similarity Measures",
booktitle="Software Quality as a Foundation for Security",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="21--37",
isbn="978-3-031-56281-5"
}
🛠️ How it Works
Each script java-sim-*-opt.py processes the Java code snippets from the IR-Plag dataset. There are 21 different methods implemented. Each script searches for a threshold on its similarity score that separates clones from non-clones.
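The threshold search can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: it assumes token-set Jaccard similarity as the measure, a handful of hand-written labeled pairs in place of the IR-Plag dataset, and a uniform grid of candidate thresholds scored by accuracy. The names `tokenize`, `jaccard`, `best_threshold`, and `pairs` are hypothetical.

```python
import re


def tokenize(code: str) -> set:
    """Split source text into a set of identifier, number, and symbol tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty snippets are treated as identical
    return len(ta & tb) / len(ta | tb)


def best_threshold(scores, labels, steps=100):
    """Return (threshold, accuracy) maximizing accuracy over a uniform grid.

    A pair is predicted to be a clone when its score is >= threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        t = i / steps
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical labeled pairs (snippet A, snippet B, is_clone); not from IR-Plag.
pairs = [
    ("int s = a + b; return s;", "int t = a + b; return t;", True),
    ("for (int i = 0; i < n; i++) sum += i;",
     "for (int j = 0; j < n; j++) sum += j;", True),
    ("int s = a + b; return s;", 'System.out.println("hi");', False),
    ("while (x > 0) x--;", "String name = user.getName();", False),
]

scores = [jaccard(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
threshold, accuracy = best_threshold(scores, labels)
print(f"threshold={threshold:.2f} accuracy={accuracy:.2f}")
```

Any other similarity function that returns a score in [0, 1] could be substituted for `jaccard` without changing the search loop. Optimizing for F-measure instead of accuracy would only require changing the scoring line inside `best_threshold`.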
📈 Performance Results
| Approach | Script | Accuracy | Precision | Recall | F-Measure | Execution Time (ms) |
|-----------------------------|---------------------------|----------|-----------|--------|-----------|-----------------------|
| Abstract Syntax Tree | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80907.37 |
| Bag-of-Words | java-sim-bow-opt.py | 0.77 | 0.79 | 0.66 | 0.72 | 57444.90 |
| Bag-of-Words II | java-sim-bow2-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 59961.69 |
| CodeBERT | java-sim-codebert-opt.py*| 0.54 | 0.75 | 0.34 | 0.47 | 868755.96 |
| Comment Sim. | java-sim-comments-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 983231.42 |
| Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1381335.16 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30303.88 |
| Fuzzy Matching | java-sim-fuzz-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 12778.62 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65076.91 |
| Rolling Hash | java-sim-hash-opt.py | 0.59 | 0.93 | 0.18 | 0.30 | 959157.60 |
| Perceptual Hash | java-sim-image-opt.py | 0.77 | 0.77 | 0.85 | 0.81 | 38152.71 |
| Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2066.13 |
| Longest Common Subsequence | java-sim-lcs-opt.py | 0.48 | 0.74 | 0.06 | 0.11 | 7268.67 |
| Levenshtein | java-sim-lev-opt.py | 0.77 | 0.80 | 0.66 | 0.72 | 10280.09 |
| Metrics comparison | java-sim-metrics-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 60508.62 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66635.25 |
| Program Dependence Graph | java-sim-pdg-opt.py | 0.65 | 0.85 | 0.39 | 0.53 | 40518.80 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225218.76 |
| Semantic Clone | java-sim-semclone-opt.py| 0.77 | 0.79 | 0.68 | 0.73 | 41543.53 |
| Semdiff method | java-sim-semdiff-opt.py | 0.77 | 0.79 | 0.38 | 0.51 | 26351.06 |
| TF-IDF | java-sim-tdf-opt.py | 0.77 | 0.77 | 0.99 | 0.87 | 68587.17 |
| Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77160.81 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104032.99 |
*CodeBERT is used without recalibration
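For readers interpreting the table, the four quality columns can be understood through the standard confusion-matrix definitions. The sketch below assumes clones are the positive class. The function names `classification_metrics` and `f_measure` and the example labels are hypothetical and are not taken from the repository's scripts.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f_measure) for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # clones found
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))  # non-clones kept apart
    fp = sum(not t and p for t, p in zip(y_true, y_pred))      # false alarms
    fn = sum(t and not p for t, p in zip(y_true, y_pred))      # missed clones

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f_measure(precision, recall)


# Hypothetical ground truth and predictions for five snippet pairs.
y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
print(classification_metrics(y_true, y_pred))

# Consistency check against one table row (Bag-of-Words II):
# precision 0.77 and recall 1.00 give an F-measure that rounds to 0.87.
print(round(f_measure(0.77, 1.00), 2))
```

Under these definitions, predicting every pair to be a clone yields a recall of 1.00 and makes accuracy equal to precision, which is worth keeping in mind when comparing rows.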
📄 License
These scripts are provided under the MIT License.
Owner
- Name: Jorge Martinez-Gil
- Login: jorge-martinez-gil
- Kind: user
- Location: Austria
- Website: https://www.jorgemar.com
- Repositories: 1
- Profile: https://github.com/jorge-martinez-gil
Computer Scientist living and working in Austria
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Source Code Clone Detection Using Unsupervised Similarity Measures"
authors:
  - name: "Jorge Martinez-Gil"
version: "1.0.0"
date-released: 2023-12-06
repository-code: "https://github.com/jorge-martinez-gil/codesim"
GitHub Events
Total
- Watch event: 3
- Fork event: 2
Last Year
- Watch event: 3
- Fork event: 2