ease2024-hf-replicationpackage
Science Score: 52.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: Organization mdegroup has institutional domain (mdegroup.disim.univaq.it)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: MDEGroup
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 22 MB
Statistics
- Stars: 0
- Watchers: 6
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
EASE2024-HF-ReplicationPackage
This repository contains the source code and supporting data for the paper "Automated categorization of pre-trained models in software engineering: A case study with a Hugging Face dataset".
Project structure and set up
- /results: This folder contains the results of the evaluation shown in RQ1.
- /se_kb: This folder stores the papers, the identified macro-tasks, and the corresponding sub-tasks used in the mapping algorithm.
- /stats: This folder stores the distribution of the HF tags and the macro-tasks.
- datasets.zip: It contains the original HF dump, the d1 dataset (step 1 of the filtering process), and d2, the final dataset used in the RQ1 experiments.
- requirements.txt: This file lists the libraries needed to run the project.
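Before running anything, the dependencies listed in requirements.txt need to be installed (typically with pip install -r requirements.txt) and datasets.zip needs to be unpacked. The following is a minimal sketch of the unpacking step. The helper name extract_datasets and the target folder are assumptions, not part of the repository, and the internal layout of the archive is not documented here, so the resulting paths may need adjusting.

```python
import zipfile
from pathlib import Path


def extract_datasets(archive="datasets.zip", target="datasets"):
    """Unpack the archive and return the sorted names of the extracted CSV files."""
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the CSV files can be extracted directly into `target`.
        zf.extractall(target_dir)
    # Search recursively in case the archive contains nested folders.
    return sorted(p.name for p in target_dir.rglob("*.csv"))
```

The returned list makes it easy to check that d2.csv is where the classification step expects it.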
Utils
- classifier.py: It contains all the functions to run the two classifiers employed in RQ1.
- config.py: It contains local paths that are needed to produce the results.
- data_utils.py: A utility script that provides functions for data handling and manipulation.
- dump_utils.py: It contains the helper functions to interact with the HF dump.
- main.py: The main script that runs the core logic of the project.
- mapping_utils.py: It contains the implementation of the mapping algorithm presented in the paper.
Classification pipeline
To replicate the experiment conducted in RQ1, you can run the run_classifier function with the following parameters:
run_classifier(dataset='datasets/d2.csv',
               desc='card_data',
               cat='tags',
               model='CNB',
               results_csv_path='results/cnb_results.csv')
where desc and cat are the model card description and the pipeline tag, respectively. You can change the classifier by setting the model parameter to either 'SVC' or 'CNB'. The cross-validation results are stored in the path specified by the results_csv_path parameter.
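To produce results for both classifiers in one go, the call can be wrapped in a small loop. The sketch below is hypothetical: run_all_models is not part of the repository, its run_fn argument stands in for run_classifier (assumed to be importable from classifier.py), and the svc_results.csv file name is an assumption modelled on the cnb_results.csv path used in the README.

```python
def run_all_models(run_fn, models=("CNB", "SVC"), dataset="datasets/d2.csv"):
    """Call run_fn once per model and return the result path used for each."""
    result_paths = {}
    for model in models:
        # Assumed naming pattern, modelled on results/cnb_results.csv.
        path = f"results/{model.lower()}_results.csv"
        run_fn(dataset=dataset, desc="card_data", cat="tags",
               model=model, results_csv_path=path)
        result_paths[model] = path
    return result_paths
```

With the repository on the import path, this would presumably be invoked as run_all_models(run_classifier) after importing run_classifier from classifier.py.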
Identified papers and macro tasks
The se_kb folder contains the following files:
- papers_all.csv: It contains all the papers retrieved from Scopus.
- papers_selected.csv: It contains the final set of papers relevant to our study. Each paper has a unique ID used in the mapping phase.
- extracted_se_tasks.csv: It contains all the SE sub-tasks and the corresponding paper IDs.
- macro_sub_tasks.csv: It stores the list of macro tasks and the corresponding sub-tasks.
All of these files are used by the mapping algorithm discussed in the next section.
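As an illustration of how the macro-task file could be consumed, the sketch below groups sub-tasks under their macro task. It is not taken from the repository: the function load_macro_tasks and the column names macro_task and sub_task are hypothetical, and the code assumes one row per (macro task, sub-task) pair, so the names should be checked against the actual CSV header.

```python
import csv
from collections import defaultdict


def load_macro_tasks(lines, macro_col="macro_task", sub_col="sub_task"):
    """Group sub-tasks by macro task from an iterable of CSV lines."""
    mapping = defaultdict(list)
    for row in csv.DictReader(lines):
        # Column names are assumptions; adjust them to the real header.
        mapping[row[macro_col].strip()].append(row[sub_col].strip())
    return dict(mapping)


# Possible usage, assuming the file layout described above:
# with open("se_kb/macro_sub_tasks.csv", newline="", encoding="utf-8") as fh:
#     macro_to_subs = load_macro_tasks(fh)
```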
Example mapping
To get the mapping, run the function mapping_pipeline(ptm, dataset), where ptm is the name of the model and dataset is the d2.csv file stored in the datasets.zip archive. An example output for the query ptm='bert' is shown below:
Most Frequent Tag
text-classification, 8 occurrences
Similar Pre-trained Models (PTMs)
- ber2, text-classification
- ber3, text-classification
- bort, text2text-generation
- ber4, text-classification
- bert, sentence-similarity
- bert1, text-classification
- best, token-classification
- sbert, sentence-similarity
- bert, question-answering
- bert, text-classification
Identified Macro tasks
- Classification of SE artifacts
- Miscellaneous
- Testing/Program repair
- Code-related task
- Documentation/Requirements
- Text engineering related to SE artifacts
Identified Sub-tasks
- Generating code patches
- Bug fix/Program repair
- Code generation/completion
- Algorithm classification
- Code clone detection
- Code search
- Bug report
- Requirement classification
- API reviews classification
- StackOverflow title generation
- Sentiment analysis
- Issue report classification
- String generation secondary studies
- Commit classification
- Program merge
- Stack overflow post summarization
- Traceability
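The first two parts of the example output (similar PTMs and the most frequent tag) can be approximated with the standard library. The sketch below is an assumption about how such a lookup might work, not the implementation in mapping_utils.py: it uses difflib string similarity with an arbitrary cutoff of 0.6, and the helper names similar_models and most_frequent_tag are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar_models(query, rows, cutoff=0.6):
    """Return the (name, tag) rows whose name resembles the query string."""
    matches = []
    for name, tag in rows:
        # Similarity ratio in [0, 1]; the cutoff is an illustrative choice.
        score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= cutoff:
            matches.append((name, tag))
    return matches


def most_frequent_tag(matches):
    """Return (tag, count) for the most common tag, or None if there are no matches."""
    counts = Counter(tag for _, tag in matches)
    return counts.most_common(1)[0] if counts else None
```

Here rows would be the (model name, pipeline tag) pairs read from d2.csv. Mapping the resulting tag to macro tasks and sub-tasks relies on the se_kb files and is not covered by this sketch.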
Owner
- Name: MDEGroup
- Login: MDEGroup
- Kind: organization
- Location: Via Vetoio, Coppito, 67100 L'Aquila IT
- Website: http://mdegroup.disim.univaq.it/
- Repositories: 64
- Profile: https://github.com/MDEGroup
The Model-Driven Engineering Group at the University of L'Aquila
Citation (CITATION.cff)
cff-version: 1.2.0
authors:
  - family-names: Anonymous
title: "EASE2024-Vision-Replication-Package"
version: 2.0.4
date-released: 2024-03-07