https://github.com/andstor/verified-smart-contracts

:page_facing_up: Verified Ethereum Smart Contract dataset

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary

Keywords

dataset ethereum etherscan huggingface language-modeling smart-contracts text-generation
Last synced: 6 months ago

Repository

:page_facing_up: Verified Ethereum Smart Contract dataset

Basic Info
Statistics
  • Stars: 29
  • Watchers: 3
  • Forks: 4
  • Open Issues: 1
  • Releases: 0
Topics
dataset ethereum etherscan huggingface language-modeling smart-contracts text-generation
Created almost 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme License

README.md

verified-smart-contracts

:page_facing_up: Verified Ethereum Smart Contract dataset

Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every Ethereum smart contract deployed as of :black_joker: 1st of April 2022 that has been verified on Etherscan and has at least one transaction. The dataset is available at 🤗 Hugging Face.

Metrics

| Component | Size | Num rows | LoC[^1] |
| --------- |:----:| -------:| -------:|
| Raw | 8.80 GiB | 2217692 | 839665295 |
| Flattened | 1.16 GiB | 136969 | 97529473 |
| Inflated | 0.76 GiB | 186397 | 53843305 |
| Parsed | 4.44 GiB | 4434014 | 29965185 |

[^1]: LoC refers to the lines of source_code. The Parsed dataset counts lines of func_code + func_documentation.

Description

Raw

The raw dataset contains the mostly unprocessed data from Etherscan, downloaded with the smart-contract-downloader tool. It normalizes the different contract formats (JSON, multi-file, etc.) into a flattened source code structure.

```script
python script/2parquet.py -s data -o parquet
```

Flattened

The flattened dataset contains smart contracts where each contract includes all required library code. Each "file" is marked in the source code with a comment stating the original file path: `//File: path/to/file.sol`. The contracts are then filtered for uniqueness with a similarity threshold of 0.9. The low uniqueness requirement is due to the often large amount of embedded library code. If a more unique dataset is required, see the inflated dataset instead.

```script
python script/filter_data.py -s parquet -o data/flattened --threshold 0.9
```
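The `//File:` markers make it possible to recover the original file layout from a flattened contract. A minimal sketch of that idea (`split_flattened` is a hypothetical helper, not one of the repository's scripts):

```python
import re


def split_flattened(source: str) -> dict:
    """Split a flattened contract into its original files, keyed by
    the path given in the `//File: path/to/file.sol` marker comments."""
    # re.split with a capturing group yields:
    # [preamble, path1, body1, path2, body2, ...]
    parts = re.split(r"^//\s*File:\s*(\S+)\s*$", source, flags=re.MULTILINE)
    return {
        path: body.strip()
        for path, body in zip(parts[1::2], parts[2::2])
    }


sample = """//File: contracts/A.sol
contract A {}
//File: lib/B.sol
contract B {}"""

print(split_flattened(sample))
# → {'contracts/A.sol': 'contract A {}', 'lib/B.sol': 'contract B {}'}
```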

Inflated

The inflated dataset splits each contract into its representative files. These are then filtered for uniqueness with a similarity threshold of 0.9.

```script
python script/filter_data.py -s parquet -o data/inflated --split-files --threshold 0.9
```

Parsed

The parsed dataset contains a parsed extract of the Solidity code from the inflated dataset. It consists of contract definitions and function definitions, together with their accompanying documentation (code comments). The code is parsed with the solidity-universal-parser.

```script
python script/parse_data.py -s data/inflated -o data/parsed
```

Plain Text

A plain text subset of any of the datasets above can be created with the 2plain_text.py script. This produces a dataset with the columns text (source code) and language.

```script
python script/2plain_text.py -s data/inflated -o data/inflated_plain_text
```

This will produce a plain text version of the inflated dataset and save it to `data/inflated_plain_text`.

Filtering

A large portion of the smart contracts is, or contains, duplicated code. This is mostly due to the frequent use of library code: Etherscan embeds the library code used by a contract directly in its source code. To mitigate this, filtering is applied to produce datasets with mostly unique contract source code. This is done by calculating the string distance between source codes. Due to the large number of contracts (~2 million), the comparison is only performed within groups: by contract_name for the flattened dataset, and by file_name for the inflated dataset.

The string comparison algorithm used is the Jaccard index.
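The repository pins `textdistance` in its requirements for this comparison. As a rough sketch of the idea (Jaccard over whitespace-split token sets is an assumption here, not necessarily the repository's exact tokenization; `filter_unique` is a hypothetical helper):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index over the sets of whitespace-separated tokens:
    |A ∩ B| / |A ∪ B|, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def filter_unique(sources, threshold=0.9):
    """Greedy near-duplicate filtering: keep a source only if its
    similarity to every already-kept source is below the threshold."""
    kept = []
    for src in sources:
        if all(jaccard(src, k) < threshold for k in kept):
            kept.append(src)
    return kept


print(filter_unique(["a b c d", "a b c d", "x y"]))
# → ['a b c d', 'x y']
```

Grouping by `contract_name` (or `file_name`) before filtering keeps this pairwise comparison tractable at ~2 million contracts.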

Data format

The data is stored as Parquet files, most containing 30,000 records each.

License

Copyright © André Storhaug

This repository is licensed under the MIT License.

All contracts in the dataset are publicly available, obtained by using Etherscan APIs, and subject to their own original licenses.

Owner

  • Name: André Storhaug
  • Login: andstor
  • Kind: user
  • Location: Trondheim 🇳🇴
  • Company: NTNU

🎓 CS PhD student @ Norwegian University of Science and Technology (NTNU)

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 1
  • Average time to close issues: 12 months
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 5.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dbhurley (1)
Pull Request Authors
  • dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Dependencies

requirements.txt pypi
  • antlr4-python3-runtime ==4.9.3
  • datasets ==2.0.0
  • fastparquet ==0.8.0
  • pandas ==1.4.1
  • py-etherscan-api ==0.8.0
  • pyarrow ==7.0.0
  • textdistance *