xchaindatagen
A cross-chain data generator. Contains a data extractor and a cross-chain transaction generator for multiple cross-chain protocols/bridges
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found CITATION.cff file)
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, zenodo.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity (9.2%) to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: AndreAugusto11
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/abs/2503.13637
- Size: 4.03 MB
Statistics
- Stars: 10
- Watchers: 1
- Forks: 3
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
XChainDataGen: A Cross-Chain Dataset Generation Framework
This repository contains the code for XChainDataGen, a cross-chain dataset extraction and generation framework -- i.e., a tool that extracts cross-chain data from bridge contracts in multiple blockchains and generates datasets of cross-chain transactions (CCTX).
Paper: https://arxiv.org/abs/2503.13637
Dataset (Jun 2024 - Dec 2024): https://zenodo.org/records/15341722
Project structure
```
.
├── .vscode/                        # Configurations to launch the application in VSCode
├── analysis/                       # R scripts for data analysis
│   ├── data/                       # Generated data in CSV format
│   ├── R Scripts/                  # R scripts for the analysis of data
│   ├── generate_csv.ipynb          # Extracts data from the database and converts it into CSV (saved in the `data` folder)
│   └── paper-visualizations.ipynb  # Main analysis of the data
├── cli/
│   └── cli.py                      # Command-line interface
├── config/
│   ├── constants.py                # Constants for the project (supported blockchains and bridges)
│   ├── rpcs_base_config.py         # List of public RPCs used for extracting data from each blockchain
│   └── rpcs_config.py              # List of all available RPCs (i.e., returning 200), generated at runtime
├── extractor/
│   ├── across/                     # Data extraction logic for events emitted by Across's contracts
│   │   ├── ABIs/
│   │   │   ├── arbitrum/           # The ABIs for each Across contract deployed on Arbitrum
│   │   │   ├── avalanche/          # The ABIs for each Across contract deployed on Avalanche
│   │   │   └── ...
│   │   ├── constants.py            # Definition of all contract addresses, for all blockchains, and events of interest for each contract
│   │   ├── decoder.py              # A custom decoder for the events emitted by the contracts
│   │   └── handler.py              # Receives a set of events and stores them in the database according to the defined schema
│   ├── ccip/                       # Data extraction logic for events emitted by CCIP's contracts
│   │   └── ...
│   ├── cctp/                       # Data extraction logic for events emitted by CCTP's contracts
│   │   └── ...
│   ├── ...
│   ├── decoder.py                  # Base decoder logic
│   └── extractor.py                # Base extraction logic
├── generator/
│   ├── common/                     # Cross-chain transaction generation logic shared by all bridges
│   │   └── price_generator.py      # Fetches token metadata and token prices for each token transacted
│   ├── across/                     # Cross-chain transaction generation logic for Across
│   │   └── generator.py            # Cross-chain transaction generator for Across
│   ├── ccip/                       # Cross-chain transaction generation logic for CCIP
│   │   └── ...
│   ├── cctp/                       # Cross-chain transaction generation logic for CCTP
│   │   └── ...
│   ├── ...
│   └── generator.py                # Base generation logic
├── repository/
│   ├── across/                     # Implementation of the repository pattern, with the data models for Across
│   │   ├── models.py               # Definition of data models for all relevant events
│   │   └── repository.py           # Definition of the repository for Across
│   ├── ccip/                       # Implementation of the repository pattern, with the data models for CCIP
│   │   └── ...
│   ├── cctp/                       # Implementation of the repository pattern, with the data models for CCTP
│   │   └── ...
│   ├── ...
│   ├── base.py                     # Base repository (CRUD operations), extended by all concrete implementations
│   └── database.py                 # Main logic for database creation
├── rpcs/
│   └── generate_rpc_configs.py     # Generates the config file based on the public RPCs available for each blockchain
├── utils/                          # Datalog rules and facts
│   ├── rpc_utils.py                # Management of RPC request logic
│   └── utils.py                    # Utils
└── __init__.py                     # Entry point of the application
```
Data Extraction
The Extractor takes as input: (i) the bridge to be analyzed, (ii) a time interval defined using Unix timestamps, and (iii) a set of supported blockchains. The extraction process works as follows. It first loads the bridge configuration file, which specifies all relevant contract events for each blockchain where the bridge is deployed. The Extractor iterates over the user-specified blockchains, determining the nearest block numbers corresponding to the provided timestamps (i.e., the start and end blocks for each blockchain). It then divides this block range into intervals of 2,000 blocks and retrieves the logs for all specified events in each contract using the `eth_getLogs` RPC method (./extractor/extractor.py). For every captured event, the Extractor also fetches the corresponding transaction receipt and block information using `eth_getTransactionReceipt` and `eth_getBlockByNumber`. Each event is decoded using either the base decoder (./extractor/decoder.py) or, when necessary, a custom decoder tailored to the specific contract and event type. The extracted data is then stored in a storage system (./repository/database.py), with each event written as a separate relation. To ensure flexibility, we implemented the Repository pattern (./repository/base.py), abstracting the data layer and allowing different storage systems. By modifying the database configuration file, users can customize the storage system based on their specific dataset requirements. At the end of the extraction phase, the storage system contains all the data associated with the specified bridge events and blockchains.
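The block-range splitting described above can be sketched as follows. This is a minimal illustration, not the project's actual code; `chunk_block_range` is a hypothetical helper name.

```python
# Hypothetical sketch of splitting a block range into intervals of at most
# 2,000 blocks, as done before issuing eth_getLogs requests per interval.
def chunk_block_range(start_block: int, end_block: int, step: int = 2000):
    """Yield inclusive (from_block, to_block) pairs covering the range."""
    block = start_block
    while block <= end_block:
        yield block, min(block + step - 1, end_block)
        block += step

# Each pair would map to the fromBlock/toBlock parameters of one eth_getLogs call.
print(list(chunk_block_range(100, 4300)))
# three chunks: (100, 2099), (2100, 4099), (4100, 4300)
```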
CCTX Generation
The Generator builds cross-chain transactions from the previously extracted data. The base generator dynamically loads a custom generator for the bridge to be analyzed (./generator/generator.py). These custom components read the data previously extracted and written to the storage system, and merge the different records to create cross-chain transactions. Records are merged on cross-chain transaction identifiers (called deposit IDs, withdrawal IDs, or message IDs, depending on the protocol), which link actions on both chains, as well as on the sender, recipient, and tokens being transferred, which are always available on both sides and can therefore be used for linkability. The specific fields on which records are merged depend on the logic of each bridge. At the end of this phase, the storage system also contains datasets of cross-chain transactions.
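The record-linking step can be sketched as follows, assuming hypothetical field names (`deposit_id`, `recipient`, `tx_hash`) rather than the project's actual schema: source-chain deposit records are joined with destination-chain fill records on a shared identifier, with the recipient used as an additional linkability check.

```python
# Hedged sketch of linking source- and destination-chain records into CCTXs.
# Field names are illustrative only, not XChainDataGen's real data models.
def link_cctxs(deposits: list[dict], fills: list[dict]) -> list[dict]:
    """Merge deposit and fill records sharing a deposit_id and recipient."""
    fills_by_id = {f["deposit_id"]: f for f in fills}
    cctxs = []
    for d in deposits:
        f = fills_by_id.get(d["deposit_id"])
        # Both sides carry the recipient, so require it to match as well.
        if f is not None and f["recipient"] == d["recipient"]:
            cctxs.append({**d, "dst_tx": f["tx_hash"]})
    return cctxs

deposits = [{"deposit_id": 1, "recipient": "0xabc", "src_tx": "0x01"}]
fills = [{"deposit_id": 1, "recipient": "0xabc", "tx_hash": "0x02"}]
print(link_cctxs(deposits, fills))
```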
A Quick Start (Docker)
The quick start leverages the docker-compose.yaml file. It sets up a container running PostgreSQL, along with the configuration for the application.
Requirements
- Docker
- Docker Compose (optional for local PostgreSQL setup)
Build & Start Containers
```bash
docker-compose up --build -d
```
The -d flag runs the containers in detached mode.
Running XChainDataGen CLI Commands
Extract the data related to a single bridge in multiple contracts using the following template:
```bash
docker-compose run --rm app extract --bridge <BRIDGE_NAME> --start_ts <START_TIMESTAMP> --end_ts <END_TIMESTAMP> --blockchains <BLOCKCHAIN_1> <BLOCKCHAIN_2> ... <BLOCKCHAIN_N>
```
To override the default data storage, use the `-e` flag to assign a new value to the environment variable `DATABASE_URL` (e.g., `-e DATABASE_URL=postgresql://user:password@db:5432/ccip`).
Take into consideration that, depending on the number of contracts deployed for each bridge, the number of events each emits in the interval of analysis, and the capabilities of your machine, this process can take a long time.
Example for Cross-Chain Interoperability Protocol (CCIP by Chainlink) (~10 minutes)
Extract the data related to CCIP from Dec 01, 2024 00:00:00 GMT+0000 to Dec 02, 2024 00:00:00 GMT+0000.
Unix timestamp: 1733011200 (Sun Dec 01 2024 00:00:00 GMT+0000)
Unix timestamp: 1733097600 (Sun Dec 02 2024 00:00:00 GMT+0000)
```bash
docker-compose run --rm app extract --bridge ccip --start_ts 1733011200 --end_ts 1733097600 --blockchains ethereum arbitrum avalanche polygon optimism base bnb gnosis ronin linea scroll
```
Cross-Chain Transaction Generator (~1 minute)
Generate cross-chain transactions, linking events and data across blockchains.
```bash
docker-compose run --rm app generate --bridge ccip
```
Retrieve Generated Data
Access the database container, and enter the database (the default database is db_app).
```bash
docker exec -it my_postgres psql -U user -d db_app
```
Run the \d command to list all relations in the database.
For CCIP, all cross-chain transactions will be in the `ccip_cross_chain_transactions` table.
```sql
select count(*) from ccip_cross_chain_transactions;
```
Output Examples
We provide some examples of what you can expect from a successful run of XChainDataGen:
Data Extraction for CCIP

CCTX Generation for CCIP

Database Relations for CCIP

Run locally
XChainDataGen can also be run locally on your host machine.
Requirements
- Postgres (v14)
- Python (v3.11.5)
- Virtualenv (optional)
Python & Virtualenv -- Installation Linux (Ubuntu)
```bash
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.11
sudo apt install python3.11-venv
```
Python & Virtualenv -- Installation MacOS
```bash
brew install python@3.11
pip install virtualenv
```
Setup
First, make sure Postgres is installed and you have a working database running on your machine.
- Create virtual environment: `python3.11 -m venv .xchaindata`
- Activate virtual environment: `source .xchaindata/bin/activate`
- Install all dependencies: `pip install -r requirements.txt`
- To stop using the env, run `deactivate`
- Create a `.env` file setting the `DATABASE_URL` variable according to your database connection.
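For illustration, a `.env` file could look like the following; the user, password, and database name are placeholders to be replaced with your own PostgreSQL credentials:

```
DATABASE_URL=postgresql://user:password@localhost:5432/db_app
```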
Using Terminal
Data Extraction
```shell
python3.11 __init__.py extract --bridge <BRIDGE_NAME> --start_ts <START_TIMESTAMP> --end_ts <END_TIMESTAMP> --blockchains <BLOCKCHAIN_1> <BLOCKCHAIN_2> ... <BLOCKCHAIN_N>
```
CCTX Generation
```shell
python3.11 __init__.py generate --bridge <BRIDGE_NAME>
```
Using VSCode
- Open the project in VS Code.
- Make sure you have the Python extension installed.
- Open the Command Palette (Cmd+Shift+P on macOS or Ctrl+Shift+P on Windows/Linux).
- Type "Python: Select Interpreter" and choose the interpreter in your xchaindata virtual environment (python 3.11).
- Open the Debug view (Ctrl+Shift+D or Cmd+Shift+D on Mac).
- From the dropdown at the top of the Debug view, select one of the options:
* [Stargate] test
* [Stargate] generate cross-chain transactions
* [Across] test
* [Across] generate cross-chain transactions
* [Omnibridge] test
* [Omnibridge] generate cross-chain transactions
* [Ronin] test
* [Ronin] generate cross-chain transactions
* [CCTP] test
* [CCTP] generate cross-chain transactions
* [CCIP] test
* [CCIP] generate cross-chain transactions
* [Polygon] test
* [Polygon] generate cross-chain transactions
Click the green play button or press F5 to start debugging.
Contributing
Results and Data Analysis
The analysis of data extracted between Jun 1, 2024 and December 31, 2024 can be found in ./analysis/paper-visualizations-and-tables-generation.ipynb and in ./analysis/R%20Scripts/paper-visualizations.R.
Suggested Citation
This work is an extension of our research. If you use this repository, please cite:
```bibtex
@misc{augusto2025xchaindatagencrosschaindatasetgeneration,
  title={XChainDataGen: A Cross-Chain Dataset Generation Framework},
  author={André Augusto and André Vasconcelos and Miguel Correia and Luyao Zhang},
  year={2025},
  eprint={2503.13637},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2503.13637},
}
```
Owner
- Name: André Augusto
- Login: AndreAugusto11
- Kind: user
- Location: Lisbon, Portugal
- Website: https://andreaugusto11.github.io
- Repositories: 39
- Profile: https://github.com/AndreAugusto11
Ph.D student | Blockchain Interoperability | Mentor @ Hyperledger
Citation (CITATION.cff)
```bibtex
@misc{augusto2025xchaindatagen,
  title={XChainDataGen: A Cross-Chain Dataset Generation Framework},
  author={André Augusto and André Vasconcelos and Miguel Correia and Luyao Zhang},
  year={2025},
  eprint={2503.13637},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2503.13637},
}
```
GitHub Events
Total
- Watch event: 10
- Delete event: 11
- Issue comment event: 1
- Push event: 52
- Pull request review comment event: 11
- Pull request review event: 19
- Pull request event: 27
- Fork event: 3
- Create event: 13
Last Year
- Watch event: 10
- Delete event: 11
- Issue comment event: 1
- Push event: 52
- Pull request review comment event: 11
- Pull request review event: 19
- Pull request event: 27
- Fork event: 3
- Create event: 13
Dependencies
- python 3.10-slim build
- Jinja2 ==3.1.4
- MarkupSafe ==3.0.2
- PyYAML ==6.0.2
- Pygments ==2.18.0
- SQLAlchemy ==2.0.36
- aiohappyeyeballs ==2.4.4
- aiohttp ==3.11.9
- aiosignal ==1.3.1
- annotated-types ==0.7.0
- appnope ==0.1.4
- asttokens ==3.0.0
- attrs ==24.2.0
- backcall ==0.2.0
- base58 ==2.1.1
- beautifulsoup4 ==4.12.3
- bitarray ==3.0.0
- black ==24.10.0
- bleach ==6.2.0
- certifi ==2024.8.30
- charset-normalizer ==3.4.0
- ckzg ==2.0.1
- click ==8.1.7
- cytoolz ==1.0.0
- decorator ==5.0.5
- defusedxml ==0.7.1
- docopt ==0.6.2
- eth-account ==0.13.4
- eth-hash ==0.7.0
- eth-keyfile ==0.8.1
- eth-keys ==0.6.0
- eth-rlp ==2.1.0
- eth-typing ==5.0.1
- eth-utils ==5.1.0
- eth_abi ==5.1.0
- executing ==2.1.0
- fastjsonschema ==2.21.1
- frozenlist ==1.5.0
- hexbytes ==1.2.1
- idna ==3.10
- ipython ==7.34.0
- isort ==5.13.2
- jedi ==0.19.2
- jsonschema ==4.23.0
- jsonschema-specifications ==2024.10.1
- jupyter_client ==8.0.0
- jupyter_core ==5.7.2
- jupyterlab_pygments ==0.3.0
- matplotlib ==3.10.0
- matplotlib-inline ==0.1.7
- mistune ==3.0.2
- multidict ==6.1.0
- mypy-extensions ==1.0.0
- nbclient ==0.10.1
- nbconvert ==7.16.4
- nbformat ==5.10.4
- packaging ==24.2
- pandas ==2.2.2
- pandocfilters ==1.5.1
- parsimonious ==0.10.0
- parso ==0.8.4
- pathspec ==0.12.1
- pexpect ==4.9.0
- pickleshare ==0.7.5
- platformdirs ==4.3.6
- prompt_toolkit ==3.0.48
- propcache ==0.2.1
- psycopg2-binary ==2.9.10
- ptyprocess ==0.7.0
- pure_eval ==0.2.3
- pycryptodome ==3.21.0
- pydantic ==2.10.3
- pydantic_core ==2.27.1
- python-dateutil ==2.9.0.post0
- pyunormalize ==16.0.0
- pyzmq ==26.2.0
- referencing ==0.35.1
- regex ==2024.11.6
- requests ==2.32.3
- rlp ==4.0.1
- rpds-py ==0.22.3
- seaborn ==0.13.2
- six ==1.17.0
- soupsieve ==2.6
- sqlalchemy_utils ==0.41.2
- stack-data ==0.6.3
- tinycss2 ==1.4.0
- toolz ==1.0.0
- tornado ==6.3.3
- traitlets ==5.14.3
- types-requests ==2.32.0.20241016
- typing_extensions ==4.12.2
- urllib3 ==2.2.3
- wcwidth ==0.2.13
- web3 ==7.6.0
- webencodings ==0.5.1
- websockets ==13.1
- yarg ==0.1.9
- yarl ==1.18.3