Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary
Repository
Cross Application Programmable IO
Basic Info
- Host: GitHub
- Owner: High-Performance-IO
- License: other
- Language: C++
- Default Branch: master
- Homepage: https://capio.hpc4ai.it/
- Size: 1.73 MB
Statistics
- Stars: 11
- Watchers: 5
- Forks: 3
- Open Issues: 3
- Releases: 1
Metadata Files
README.md
CAPIO
CAPIO (Cross-Application Programmable I/O) is a middleware that injects streaming capabilities into workflow steps without changing the application codebase. It has been proven to work with C/C++ binaries, Fortran binaries, Java, Python, and Bash.
Build and run tests
Dependencies
CAPIO depends on the following software, which needs to be installed manually:
- cmake >= 3.15
- a C++17 (or newer) compiler
- OpenMPI
- pthreads

The following dependencies are fetched automatically during the CMake configuration phase and compiled when required:
- syscall_intercept to intercept syscalls
- Taywee/args to parse server command line inputs
- simdjson/simdjson to parse json configuration files
Compile CAPIO

```bash
git clone https://github.com/High-Performance-IO/capio.git capio && cd capio
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . -j$(nproc)
sudo cmake --install .
```
It is also possible to enable logging in CAPIO by defining `-DCAPIO_LOG=TRUE` at CMake configuration time.
Use CAPIO in your code
Good news! You don't need to modify your code to benefit from the features of CAPIO. There are only three steps (the first one is optional):
1) Write a configuration file for injecting streaming capabilities to your workflow
2) Launch the CAPIO daemons with MPI, passing the (optional) configuration file as an argument, on the machines on which you want to execute your program (one daemon per node). If you want to specify a custom folder for CAPIO, set `CAPIO_DIR` as an environment variable.
```bash
[CAPIO_DIR=your_capiodir] [mpiexec -N 1 --hostfile your_hostfile] capio_server -c conf.json
```
[!NOTE]
If `CAPIO_DIR` is not specified when launching `capio_server`, it will default to the current working directory of `capio_server`.
3) Launch your programs preloading the CAPIO shared library like this:
```bash
CAPIO_DIR=your_capiodir \
CAPIO_WORKFLOW_NAME=wfname \
CAPIO_APP_NAME=appname \
LD_PRELOAD=libcapio_posix.so \
./your_app <args>
```
[!WARNING]
`CAPIO_DIR` must be specified when launching a program with the CAPIO library. If `CAPIO_DIR` is not specified, CAPIO will not intercept syscalls.
Available environment variables
CAPIO can be controlled through the usage of environment variables. The available variables are listed below:
Global environment variables
- `CAPIO_DIR`: tells both the server and the application the mount point of CAPIO;
- `CAPIO_LOG_LEVEL`: tells both the server and the application the log level to use. This variable works only if `-DCAPIO_LOG=TRUE` was specified during the CMake phase;
- `CAPIO_LOG_PREFIX`: defined only for CAPIO POSIX applications; specifies the prefix of the log file name to which CAPIO will log. The default value is `posixthread`, which means that by default CAPIO logs to a set of files called `posixthread_*.log`. Equivalent behaviour can be set on the CAPIO server using the `-l` option;
- `CAPIO_LOG_DIR`: defined only for CAPIO POSIX applications; specifies the directory in which the log files will be created. If this variable is not defined, CAPIO logs by default to `capiologs`. Equivalent behaviour can be set on the CAPIO server using the `-d` option;
- `CAPIO_CACHE_LINES`: controls how many cache lines are present between POSIX applications and the server. Defaults to 10 lines;
- `CAPIO_CACHE_LINE_SIZE`: controls the size of a single cache line. Defaults to 256 KB;
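As a rough sketch (the values here are hypothetical, not recommended defaults), the cache variables above are plain byte/line counts that can be exported before launching an application, for example via shell arithmetic:

```bash
# Hypothetical tuning: enlarge the cache between the POSIX layer and the server.
export CAPIO_CACHE_LINES=20                    # default is 10 lines
export CAPIO_CACHE_LINE_SIZE=$((512 * 1024))   # 512 KB per line (default is 256 KB)
echo "$CAPIO_CACHE_LINES $CAPIO_CACHE_LINE_SIZE"
```

These exports would then precede the `LD_PRELOAD=libcapio_posix.so ./your_app` launch shown earlier.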
Server-only environment variables
- `CAPIO_FILE_INIT_SIZE`: defines the default size of the pre-allocated memory for a new file handled by CAPIO. Defaults to 4 MB. Bigger sizes reduce the overhead of malloc but fill node memory faster. The value has to be expressed in bytes;
- `CAPIO_PREFETCH_DATA_SIZE`: if this variable is set, data transfers between nodes will always be at least the given value, in bytes;
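For illustration (sizes are hypothetical, both expressed in bytes as the variables require), these could be exported in the environment of `capio_server` before launching it:

```bash
# Hypothetical server tuning, values in bytes.
export CAPIO_FILE_INIT_SIZE=$((8 * 1024 * 1024))   # 8 MB per new file instead of the 4 MB default
export CAPIO_PREFETCH_DATA_SIZE=$((1024 * 1024))   # transfer at least 1 MB between nodes
echo "$CAPIO_FILE_INIT_SIZE $CAPIO_PREFETCH_DATA_SIZE"
```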
POSIX-only environment variables
[!WARNING]
The following variables are mandatory. If they are not provided to a POSIX application, CAPIO will not be able to handle the application correctly according to the specifications given in the JSON configuration file!
- `CAPIO_WORKFLOW_NAME`: defines the scope of a workflow for a given step. It needs to match the `"name"` field inside the JSON configuration file;
- `CAPIO_APP_NAME`: defines the app name within a workflow for a given step.
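To make the matching concrete (the names here are purely illustrative), an application launched with `CAPIO_WORKFLOW_NAME=my_workflow` and `CAPIO_APP_NAME=writer` would be matched against a configuration whose top-level and step names are:

```json
{
  "name": "my_workflow",
  "IO_Graph": [
    { "name": "writer", "output_stream": ["file0.dat"] }
  ]
}
```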
How to inject streaming capabilities into your workflow
With CAPIO it is possible to run the applications of your workflow that communicate through files concurrently: CAPIO transparently synchronizes the concurrent reads and writes on those files. If a file is never modified after it is closed, you can set its streaming semantics to "on_close" in the configuration file. In this way, all reads on the file will hang until the writer closes it, allowing the consumer application to read the file even if the producer is still running. Another supported streaming semantics is "append", in which a read is satisfied as soon as the producer has written the requested data. This is the most aggressive (and most efficient) form of streaming, because the consumer can start reading while the producer is still writing the file. This semantics must be used only if the producer does not modify a piece of data after it has been written. The "on_termination" streaming semantics tells CAPIO not to allow streaming on that file; this is the default if no semantics is specified for a file. The following is an example of a simple configuration:
```json
{
  "name": "my_workflow",
  "IO_Graph": [
    {
      "name": "writer",
      "output_stream": [
        "file0.dat",
        "file1.dat",
        "file2.dat"
      ],
      "streaming": [
        {
          "name": ["file0.dat"],
          "committed": "on_close"
        },
        {
          "name": ["file1.dat"],
          "committed": "on_close",
          "mode": "no_update"
        },
        {
          "name": ["file2.dat"],
          "committed": "on_termination"
        }
      ]
    },
    {
      "name": "reader",
      "input_stream": [
        "file0.dat",
        "file1.dat",
        "file2.dat"
      ]
    }
  ]
}
```
[!NOTE]
We are working on an extension of the possible streaming semantics and on detailed documentation for the configuration file!
Examples
The examples folder contains some examples that show how to use `mpi_io` with CAPIO. There are also examples of how to write JSON configuration files for the semantics implemented by CAPIO:
- on_close: a pipeline composed of a producer and a consumer with "on_close" semantics
- no_update: a pipeline composed of a producer and a consumer with "no_update" semantics
- mix_semantics: a pipeline composed of a producer and a consumer with mixed semantics
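As a rough, hypothetical sketch (file names invented, fields taken from the configuration example above), a mixed-semantics `streaming` section could combine the different `committed`/`mode` settings per file:

```json
{
  "streaming": [
    { "name": ["stage1.dat"], "committed": "on_close" },
    { "name": ["stage2.dat"], "committed": "on_close", "mode": "no_update" },
    { "name": ["stage3.dat"], "committed": "on_termination" }
  ]
}
```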
Report bugs + get help
[!TIP]
A wiki is in development! You might want to check the wiki to get more in-depth information about CAPIO!
CAPIO Team
Made with :heart: by:
- Alberto Riccardo Martinelli <albertoriccardo.martinelli@unito.it> (designer and maintainer)
- Marco Edoardo Santimaria <marcoedoardo.santimaria@unito.it> (designer and maintainer)
- Iacopo Colonnelli <iacopo.colonnelli@unito.it> (workflows expert and maintainer)
- Massimo Torquati <massimo.torquati@unipi.it> (designer)
- Marco Aldinucci <marco.aldinucci@unito.it> (designer)
Papers
Owner
- Name: High-Performance-IO
- Login: High-Performance-IO
- Kind: organization
- Repositories: 1
- Profile: https://github.com/High-Performance-IO
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you want to cite CAPIO, please refer to the article below."
authors:
  - family-names: "Martinelli"
    given-names: "Alberto Riccardo"
    orcid: "https://orcid.org/0000-0002-3707-7015"
  - family-names: "Santimaria"
    given-names: "Marco Edoardo"
    orcid: "https://orcid.org/0009-0003-9886-4500"
  - family-names: "Colonnelli"
    given-names: "Iacopo"
    orcid: "https://orcid.org/0000-0001-9290-2017"
  - family-names: "Torquati"
    given-names: "Massimo"
    orcid: "https://orcid.org/0000-0001-6323-3459"
title: "CAPIO"
url: "https://github.com/High-Performance-IO/capio"
version: 0.1
preferred-citation:
  type: conference-paper
  authors:
    - family-names: "Martinelli"
      given-names: "Alberto Riccardo"
      orcid: "https://orcid.org/0000-0002-3707-7015"
    - family-names: "Torquati"
      given-names: "Massimo"
      orcid: "https://orcid.org/0000-0001-6323-3459"
    - family-names: "Aldinucci"
      given-names: "Marco"
      orcid: "https://orcid.org/0000-0001-8788-0829"
    - family-names: "Colonnelli"
      given-names: "Iacopo"
      orcid: "https://orcid.org/0000-0001-9290-2017"
    - family-names: "Cantalupo"
      given-names: "Barbara"
      orcid: "https://orcid.org/0000-0001-7575-3902"
  doi: 10.1109/HiPC58850.2023.00031
  collection-title: "2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics, HiPC 2023"
  conference:
    name: "International Conference on High Performance Computing"
    city: "Goa"
    country: "IN"
    date-start: "2023-12-18"
    date-end: "2023-12-21"
  publisher:
    name: "IEEE Computer Society"
    city: "Los Alamitos, California"
    country: "US"
  start: 153
  end: 163
  title: "CAPIO: a Middleware for Transparent I/O Streaming in Data-Intensive Workflows"
  year: 2023
GitHub Events
Total
- Watch event: 2
- Delete event: 34
- Issue comment event: 29
- Push event: 454
- Pull request review event: 1
- Pull request event: 61
- Fork event: 4
- Create event: 34
Last Year
- Watch event: 2
- Delete event: 34
- Issue comment event: 29
- Push event: 454
- Pull request review event: 1
- Pull request event: 61
- Fork event: 4
- Create event: 34
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 91
- Average time to close issues: 3 months
- Average time to close pull requests: 18 days
- Total issue authors: 2
- Total pull request authors: 4
- Average comments per issue: 0.67
- Average comments per pull request: 0.23
- Merged pull requests: 61
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 37
- Average time to close issues: N/A
- Average time to close pull requests: 6 days
- Issue authors: 0
- Pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 0.46
- Merged pull requests: 23
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- marcoSanti (1)
- simoneperrotta001 (1)
Pull Request Authors
- marcoSanti (71)
- GlassOfWhiskey (19)
- Archabr1el (2)
- SilenceDesigner (1)