xchaindatagen

A cross-chain data generator. Contains a data extractor and a cross-chain transaction generator for multiple cross-chain protocols/bridges

https://github.com/andreaugusto11/xchaindatagen

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

A cross-chain data generator. Contains a data extractor and a cross-chain transaction generator for multiple cross-chain protocols/bridges

Basic Info
Statistics
  • Stars: 10
  • Watchers: 1
  • Forks: 3
  • Open Issues: 1
  • Releases: 0
Created 11 months ago · Last pushed 6 months ago
Metadata Files
Readme Citation Codeowners

README.md

XChainDataGen: A Cross-Chain Dataset Generation Framework

License: MIT Python 3.7+ Contributions welcome

This repository contains the code for XChainDataGen, a cross-chain dataset extraction and generation framework -- i.e., a tool that extracts cross-chain data from bridge contracts in multiple blockchains and generates datasets of cross-chain transactions (CCTX).

Paper: https://arxiv.org/abs/2503.13637

Dataset (Jun 2024 - Dec 2024): https://zenodo.org/records/15341722

Project structure

```
.
├── .vscode/                        # Configurations to launch the application in VSCode
├── analysis/                       # R scripts for data analysis
│   ├── data/                       # Generated data in CSV format
│   ├── R Scripts/                  # R scripts for the analysis of data
│   ├── generate_csv.ipynb          # Extracts data from the database and converts it into CSV (saved in the `data` folder)
│   └── paper-visualizations.ipynb  # Main analysis of the data
├── cli/
│   └── cli.py                      # Command-line interface
├── config/
│   ├── constants.py                # Constants for the project (supported blockchains and bridges)
│   ├── rpcs_base_config.py         # List of public RPCs used for extracting data from each blockchain
│   └── rpcs_config.py              # List of all available RPCs (i.e., returning 200), generated at runtime
├── extractor/
│   ├── across/                     # Data extraction logic for events emitted by Across's contracts
│   │   ├── ABIs/
│   │   │   ├── arbitrum/           # The ABIs for each Across contract deployed on Arbitrum
│   │   │   ├── avalanche/          # The ABIs for each Across contract deployed on Avalanche
│   │   │   └── ...
│   │   ├── constants.py            # All contract addresses, for all blockchains, and the events of interest for each contract
│   │   ├── decoder.py              # A custom decoder for the events emitted by the contracts
│   │   └── handler.py              # Receives a set of events and stores them in the database according to the defined schema
│   ├── ccip/                       # Data extraction logic for events emitted by CCIP's contracts
│   │   └── ...
│   ├── cctp/                       # Data extraction logic for events emitted by CCTP's contracts
│   │   └── ...
│   ├── ...
│   ├── decoder.py                  # Base decoder logic
│   └── extractor.py                # Base extraction logic
├── generator/
│   ├── common/
│   │   └── price_generator.py      # Fetches token metadata and token prices for each token transacted
│   ├── across/                     # Cross-chain transaction generation logic for Across
│   │   └── generator.py            # Cross-chain transaction generator for Across
│   ├── ccip/                       # Cross-chain transaction generation logic for CCIP
│   │   └── ...
│   ├── cctp/                       # Cross-chain transaction generation logic for CCTP
│   │   └── ...
│   ├── ...
│   └── generator.py                # Base generation logic
├── repository/
│   ├── across/                     # Repository pattern implementation, with the data models for Across
│   │   ├── models.py               # Data models for all relevant events
│   │   └── repository.py           # Repository for Across
│   ├── ccip/                       # Repository pattern implementation, with the data models for CCIP
│   │   └── ...
│   ├── cctp/                       # Repository pattern implementation, with the data models for CCTP
│   │   └── ...
│   ├── ...
│   ├── base.py                     # Base repository, extended by all concrete implementations (CRUD operations)
│   └── database.py                 # Main logic for database creation
├── rpcs/
│   └── generate_rpc_configs.py     # Generates the config file from the public RPCs available for each blockchain
├── utils/                          # Datalog rules and facts
│   ├── rpc_utils.py                # Management of RPC request logic
│   └── utils.py                    # Utils
└── __init__.py                     # Entry point of the application
```

Data Extraction

The Extractor takes as input: (i) the bridge to be analyzed, (ii) a time interval defined by Unix timestamps, and (iii) a set of supported blockchains. The extraction process works as follows. It first loads the bridge configuration file, which specifies all relevant contract events for each blockchain where the bridge is deployed. The Extractor then iterates over the user-specified blockchains, determining the block numbers nearest to the provided timestamps (i.e., the start and end blocks for each blockchain). It divides this block range into intervals of 2,000 blocks and retrieves logs for all specified events in each contract using the `eth_getLogs` RPC method (`./extractor/extractor.py`). For every captured event, the Extractor also fetches the corresponding transaction receipt and block information using `eth_getTransactionReceipt` and `eth_getBlockByNumber`. Each event is decoded using either a base decoder (`./extractor/decoder.py`) or, when necessary, a custom decoder tailored to the specific contract and event type. The extracted data is then stored in a storage system (`./repository/database.py`), with each event written as a separate relation. To ensure flexibility, we implemented the Repository Pattern (`./repository/base.py`), abstracting the data layer and allowing different storage systems to be plugged in. By modifying the database configuration file, users can customize the storage system to their specific dataset requirements. At the end of the extraction phase, the storage system contains all the data associated with the specified bridge events and blockchains.
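
The chunked retrieval step can be sketched as follows. This is a simplified illustration, not the project's actual code: the names `block_ranges`, `extract_logs`, and `fetch_logs` are hypothetical, with `fetch_logs` standing in for the `eth_getLogs` RPC call issued by `./extractor/extractor.py`.

```python
# Sketch of chunked log extraction. `fetch_logs` stands in for an
# eth_getLogs RPC call; it is injected so the chunking logic is testable.

CHUNK_SIZE = 2_000  # blocks per eth_getLogs request

def block_ranges(start_block, end_block, chunk_size=CHUNK_SIZE):
    """Split [start_block, end_block] into inclusive sub-ranges of at most chunk_size blocks."""
    block = start_block
    while block <= end_block:
        yield block, min(block + chunk_size - 1, end_block)
        block += chunk_size

def extract_logs(fetch_logs, contracts, start_block, end_block):
    """Collect logs for every contract over every 2,000-block window."""
    logs = []
    for lo, hi in block_ranges(start_block, end_block):
        for contract in contracts:
            logs.extend(fetch_logs(contract, lo, hi))
    return logs
```

In the real extractor, each returned log would then be enriched with its transaction receipt and block data before decoding and storage.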

CCTX Generation

The Generator builds cross-chain transactions from the previously extracted data. The base generator dynamically loads a custom generator for the bridge to be analyzed (`./generator/generator.py`). These custom components read the data previously written to the storage system and merge the different records to create cross-chain transactions. Records are merged based on cross-chain transaction identifiers (called deposit IDs, withdrawal IDs, or message IDs depending on the protocol), which link actions on both chains, as well as on the sender, recipient, and tokens being transferred, which are always available on both sides and can therefore be used for linkability. The specific fields through which records are merged depend on the logic of each bridge. At the end of this phase, the storage system also contains datasets of cross-chain transactions.
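
As an illustration, the linking step could look like the following minimal sketch. The field names (`deposit_id`, `sender`, `recipient`, `token`) are hypothetical placeholders, not the project's actual schema, and each bridge's real generator applies its own join logic.

```python
# Sketch of linking source-chain deposit records to destination-chain
# records via a shared transfer identifier plus sender/recipient/token.
# Field names are hypothetical, not the project's actual schema.

def link_cctxs(src_events, dst_events):
    """Return cross-chain transactions built from matching event pairs."""
    by_key = {
        (e["deposit_id"], e["sender"], e["recipient"], e["token"]): e
        for e in dst_events
    }
    cctxs = []
    for src in src_events:
        key = (src["deposit_id"], src["sender"], src["recipient"], src["token"])
        dst = by_key.get(key)
        if dst is not None:  # events found on both chains -> one cross-chain transaction
            cctxs.append({"src": src, "dst": dst})
    return cctxs
```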

A Quick Start (Docker)

The quick start leverages the docker-compose.yaml file. It sets up a container running PostgreSQL, plus the configuration for the application.

Requirements

  • Docker
  • Docker Compose (optional for local PostgreSQL setup)

Build & Start Containers

```bash
docker-compose up --build -d
```

The `-d` flag runs the containers in detached mode.

Running XChainDataGen CLI Commands

Extract the data related to a single bridge across multiple blockchains using the following template:

```bash
docker-compose run --rm app extract --bridge <BRIDGE_NAME> --start_ts <START_TIMESTAMP> --end_ts <END_TIMESTAMP> --blockchains <BLOCKCHAIN_1> <BLOCKCHAIN_2> ... <BLOCKCHAIN_N>
```

To override the default data storage, use the `-e` flag to assign a new value to the environment variable `DATABASE_URL` (e.g., `-e DATABASE_URL=postgresql://user:password@db:5432/ccip`).

Take into consideration that, depending on the number of contracts deployed for each bridge, the number of events each emits in the interval of analysis, and the capabilities of your machine, this process can take a long time.



Example for Cross-Chain Interoperability Protocol (CCIP by Chainlink) (~10 minutes)

Extract the data related to CCIP from Dec 01, 2024 00:00:00 GMT+0000 to Dec 02, 2024 00:00:00 GMT+0000.

Unix timestamp: 1733011200 (Sun Dec 01 2024 00:00:00 GMT+0000)

Unix timestamp: 1733097600 (Mon Dec 02 2024 00:00:00 GMT+0000)
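If you need timestamps for other intervals, they can be computed with plain Python (this helper, `to_unix_ts`, is an illustration and not part of the project's CLI):

```python
from datetime import datetime, timezone

def to_unix_ts(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

start_ts = to_unix_ts(2024, 12, 1)  # 1733011200
end_ts = to_unix_ts(2024, 12, 2)    # 1733097600
```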

```bash
docker-compose run --rm app extract --bridge ccip --start_ts 1733011200 --end_ts 1733097600 --blockchains ethereum arbitrum avalanche polygon optimism base bnb gnosis ronin linea scroll
```

Cross-Chain Transaction Generator (~1 minute)

Generate cross-chain transactions, linking events and data across blockchains.

```bash
docker-compose run --rm app generate --bridge ccip
```

Retrieve Generated Data

Access the database container and connect to the database (the default database is `db_app`).

```bash
docker exec -it my_postgres psql -U user -d db_app
```

Run the `\d` command to list all relations in the database.

For CCIP, all cross-chain transactions will be in the `ccip_cross_chain_transactions` table.

```sql
SELECT COUNT(*) FROM ccip_cross_chain_transactions;
```

Output Examples

We provide some examples of what you can expect in a successful run of XChainDataGen:

Data Extraction for CCIP

CCTX Generation for CCIP

Database Relations for CCIP

Run locally

XChainDataGen can also be run locally on your host machine.

Requirements

  • Postgres (v14)
  • Python (v3.11.5)
  • Virtualenv (optional)

Python & Virtualenv -- Installation Linux (Ubuntu)

```bash
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.11
sudo apt install python3.11-venv
```

Python & Virtualenv -- Installation macOS

```bash
brew install python@3.11
pip install virtualenv
```

Setup

First, make sure Postgres is installed and you have a working database running on your machine.

  1. Create a virtual environment: `python3.11 -m venv .xchaindata`
  2. Activate the virtual environment: `source .xchaindata/bin/activate`
  3. Install all dependencies: `pip install -r requirements.txt`
  4. To stop using the environment, run `deactivate`
  5. Create a `.env` file setting the `DATABASE_URL` variable according to your database connection.
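
For a local Postgres instance, the `.env` file might look like this (the credentials and database name below are placeholders; adjust them to your setup):

```
DATABASE_URL=postgresql://user:password@localhost:5432/db_app
```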

Using Terminal

Data Extraction

```shell
python3.11 __init__.py extract --bridge <BRIDGE_NAME> --start_ts <START_TIMESTAMP> --end_ts <END_TIMESTAMP> --blockchains <BLOCKCHAIN_1> <BLOCKCHAIN_2> ... <BLOCKCHAIN_N>
```

CCTX Generation

```shell
python3.11 __init__.py generate --bridge <BRIDGE_NAME>
```

Using VSCode

  1. Open the project in VS Code.
  2. Make sure you have the Python extension installed.
  3. Open the Command Palette (Cmd+Shift+P on macOS or Ctrl+Shift+P on Windows/Linux).
  4. Type "Python: Select Interpreter" and choose the interpreter in your xchaindata virtual environment (python 3.11).
  5. Open the Debug view (Ctrl+Shift+D or Cmd+Shift+D on Mac).
  6. From the dropdown at the top of the Debug view, select one of the options:

  • [Stargate] test
  • [Stargate] generate cross-chain transactions
  • [Across] test
  • [Across] generate cross-chain transactions
  • [Omnibridge] test
  • [Omnibridge] generate cross-chain transactions
  • [Ronin] test
  • [Ronin] generate cross-chain transactions
  • [CCTP] test
  • [CCTP] generate cross-chain transactions
  • [CCIP] test
  • [CCIP] generate cross-chain transactions
  • [Polygon] test
  • [Polygon] generate cross-chain transactions

Click the green play button or press F5 to start debugging.

Contributing

Results and Data Analysis

The analysis of the data extracted between June 1, 2024 and December 31, 2024 can be found in `./analysis/paper-visualizations-and-tables-generation.ipynb` and in `./analysis/R Scripts/paper-visualizations.R`.

Suggested Citation

This work is an extension of our research. If you use this repository, please cite it as:

```bibtex
@misc{augusto2025xchaindatagencrosschaindatasetgeneration,
      title={XChainDataGen: A Cross-Chain Dataset Generation Framework},
      author={André Augusto and André Vasconcelos and Miguel Correia and Luyao Zhang},
      year={2025},
      eprint={2503.13637},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2503.13637},
}
```

Owner

  • Name: André Augusto
  • Login: AndreAugusto11
  • Kind: user
  • Location: Lisbon, Portugal

Ph.D student | Blockchain Interoperability | Mentor @ Hyperledger

Citation (CITATION.cff)

@misc{augusto2025xchaindatagen,
      title={XChainDataGen: A Cross-Chain Dataset Generation Framework}, 
      author={André Augusto and André Vasconcelos and Miguel Correia and Luyao Zhang},
      year={2025},
      eprint={2503.13637},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2503.13637}, 
}

GitHub Events

Total
  • Watch event: 10
  • Delete event: 11
  • Issue comment event: 1
  • Push event: 52
  • Pull request review comment event: 11
  • Pull request review event: 19
  • Pull request event: 27
  • Fork event: 3
  • Create event: 13
Last Year
  • Watch event: 10
  • Delete event: 11
  • Issue comment event: 1
  • Push event: 52
  • Pull request review comment event: 11
  • Pull request review event: 19
  • Pull request event: 27
  • Fork event: 3
  • Create event: 13

Dependencies

Dockerfile docker
  • python 3.10-slim build
pyproject.toml pypi
requirements.txt pypi
  • Jinja2 ==3.1.4
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • Pygments ==2.18.0
  • SQLAlchemy ==2.0.36
  • aiohappyeyeballs ==2.4.4
  • aiohttp ==3.11.9
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • appnope ==0.1.4
  • asttokens ==3.0.0
  • attrs ==24.2.0
  • backcall ==0.2.0
  • base58 ==2.1.1
  • beautifulsoup4 ==4.12.3
  • bitarray ==3.0.0
  • black ==24.10.0
  • bleach ==6.2.0
  • certifi ==2024.8.30
  • charset-normalizer ==3.4.0
  • ckzg ==2.0.1
  • click ==8.1.7
  • cytoolz ==1.0.0
  • decorator ==5.0.5
  • defusedxml ==0.7.1
  • docopt ==0.6.2
  • eth-account ==0.13.4
  • eth-hash ==0.7.0
  • eth-keyfile ==0.8.1
  • eth-keys ==0.6.0
  • eth-rlp ==2.1.0
  • eth-typing ==5.0.1
  • eth-utils ==5.1.0
  • eth_abi ==5.1.0
  • executing ==2.1.0
  • fastjsonschema ==2.21.1
  • frozenlist ==1.5.0
  • hexbytes ==1.2.1
  • idna ==3.10
  • ipython ==7.34.0
  • isort ==5.13.2
  • jedi ==0.19.2
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • jupyter_client ==8.0.0
  • jupyter_core ==5.7.2
  • jupyterlab_pygments ==0.3.0
  • matplotlib ==3.10.0
  • matplotlib-inline ==0.1.7
  • mistune ==3.0.2
  • multidict ==6.1.0
  • mypy-extensions ==1.0.0
  • nbclient ==0.10.1
  • nbconvert ==7.16.4
  • nbformat ==5.10.4
  • packaging ==24.2
  • pandas ==2.2.2
  • pandocfilters ==1.5.1
  • parsimonious ==0.10.0
  • parso ==0.8.4
  • pathspec ==0.12.1
  • pexpect ==4.9.0
  • pickleshare ==0.7.5
  • platformdirs ==4.3.6
  • prompt_toolkit ==3.0.48
  • propcache ==0.2.1
  • psycopg2-binary ==2.9.10
  • ptyprocess ==0.7.0
  • pure_eval ==0.2.3
  • pycryptodome ==3.21.0
  • pydantic ==2.10.3
  • pydantic_core ==2.27.1
  • python-dateutil ==2.9.0.post0
  • pyunormalize ==16.0.0
  • pyzmq ==26.2.0
  • referencing ==0.35.1
  • regex ==2024.11.6
  • requests ==2.32.3
  • rlp ==4.0.1
  • rpds-py ==0.22.3
  • seaborn ==0.13.2
  • six ==1.17.0
  • soupsieve ==2.6
  • sqlalchemy_utils ==0.41.2
  • stack-data ==0.6.3
  • tinycss2 ==1.4.0
  • toolz ==1.0.0
  • tornado ==6.3.3
  • traitlets ==5.14.3
  • types-requests ==2.32.0.20241016
  • typing_extensions ==4.12.2
  • urllib3 ==2.2.3
  • wcwidth ==0.2.13
  • web3 ==7.6.0
  • webencodings ==0.5.1
  • websockets ==13.1
  • yarg ==0.1.9
  • yarl ==1.18.3