matchbench
Boilerplate code for benchmarking Empirical tag matching implementation
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity: 13.3%)
Repository
Basic Info
- Host: GitHub
- Owner: mmore500
- Language: Shell
- Default Branch: master
- Size: 30.3 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Tag-matching benchmark boilerplate for CSE 491.
- Free software: MIT license
Quickstart
Be sure to clone recursively, and don't be concerned if it takes a minute or two; there are a lot of submodules.
```bash
git clone --recursive https://github.com/mmore500/matchbench.git
cd matchbench/cpp
make
./benchmark_suite.sh
```
To switch compilers:
```bash
make clean
export CXX=g++
make
```
Tag-Matching Recap
In evolutionary programming, it is important to be able to match queries to an available set of objects in a dynamic, flexible way (e.g., an "environmental signal" to a program module defined in a digital genome). A good way to do this is a tag-matching process, where a query tag is compared to a set of available tagged operands. The best-matching operand can be returned to the query. It is especially useful to be able to incorporate "regulation" into this process so that, dynamically at runtime, tagged operands can be "upregulated" or "downregulated" to affect whether they are returned in response to queries. Downregulated operands, for example, would be more likely to be outcompeted by other operands to be the answer to a query.
To enable this type of lookup, Empirical provides an emp::TagDepository data structure.
(Termed so because tagged entities can be added but not removed.)
This data structure facilitates lookups where the best-matching tag or tags are returned in response to query tags.
This TagDepository data structure is templated to allow for compile-time configuration, with three major components:
* tag comparison metric
* How should we test how "well" two tags match?
* example: interpret the tags as integers and take absolute difference
* example: hamming distance between tags
* example (slower, but with promising evolutionary properties): "streak method," which compares the longest-matching and longest-mismatching contiguous sequences of bits
* operand selection operator
* How should we decide which tagged operands to return to a query?
* example: top matching tag better than X threshold, otherwise no tag
* example: top two matching tags
* example: all tags better than X threshold
* example: randomly with probability proportional to match quality
* tag regulation implementation
* How should a tag's regulatory state (cumulative down or up regulation) affect its match score to queries?
* untested idea: only allow complete deactivation of a tag
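The three configurable components above can be modeled with a minimal Python sketch. This is purely illustrative: the function names (`hamming_metric`, `select_best`, `regulate`) and the threshold/scaling choices are hypothetical, not the Empirical API.

```python
def hamming_metric(tag_a: int, tag_b: int, width: int = 32) -> float:
    """Comparison metric: fraction of differing bits between two fixed-width tags."""
    return bin((tag_a ^ tag_b) & ((1 << width) - 1)).count("1") / width

def select_best(scores, threshold=0.5):
    """Selection operator: index of the best-matching operand if it beats a
    threshold, otherwise None (the 'top match or no match' strategy)."""
    best = min(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] <= threshold else None

def regulate(raw_score: float, regulation: float) -> float:
    """Regulation: positive (down)regulation worsens the match score, making the
    operand more likely to be outcompeted; negative regulation improves it."""
    return raw_score * (1.0 + regulation)

# A depository here is just a list of (tag, regulation state, operand) entries.
depository = [(0b1010, 0.0, "module_a"), (0b1100, 0.0, "module_b")]
query = 0b1011
scores = [regulate(hamming_metric(t, query, 4), r) for t, r, _ in depository]
print(depository[select_best(scores)][2])   # -> module_a
```

Swapping any one of the three functions changes matching behavior without touching the others, which mirrors how the template parameters keep the concerns independent.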
Due to the flexibility required to handle different operators and metrics, the TagDepository is basically just a vector of tag-operand pairs.
To perform a lookup, the query tag is compared against all operand tags using the configured comparison metric to generate a vector of match scores.
Then, the operand selection operator operates on those scores to decide which operand(s) should be yielded.
Some caching of responses to queries is possible (and performed), although regulation invalidates these cached responses.
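The lookup-and-cache flow just described can be sketched as a toy Python model. The class name and structure are hypothetical; the point is the shape of the logic: score every operand, select, cache the answer, and throw the cache away whenever regulation changes.

```python
class TinyDepository:
    """Toy model: flat vector of tagged operands, plus a query cache that any
    regulation adjustment invalidates wholesale."""

    def __init__(self, entries, metric, selector):
        self.entries = list(entries)   # (tag, operand) pairs in one flat list
        self.metric = metric           # tag comparison metric
        self.selector = selector       # operand selection operator
        self.cache = {}                # query tag -> selected entry index

    def lookup(self, query):
        if query in self.cache:        # cached responses are reused...
            return self.entries[self.cache[query]][1]
        scores = [self.metric(tag, query) for tag, _ in self.entries]
        idx = self.selector(scores)
        self.cache[query] = idx
        return self.entries[idx][1]

    def adjust_regulation(self, *args):
        self.cache.clear()             # ...until regulation invalidates them

# usage: hamming-distance metric, pick-lowest-score selector
dep = TinyDepository(
    [(0b0101, "module_a"), (0b0011, "module_b")],
    metric=lambda a, b: bin(a ^ b).count("1"),
    selector=lambda scores: min(range(len(scores)), key=scores.__getitem__),
)
print(dep.lookup(0b0011))   # exact match -> module_b
```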
We assume fixed-length bitstring tags.
This tag-matching process turns out to be the primary time-consumer in performance profiles for much of my work, so doing it more efficiently could have big tangible benefits.
Possible Avenues for Optimization
You can look at the result of a flame graph profile I did here, but it's not very informative. Figuring out a better way to profile would be helpful.
- vectorization
A large part of the run time is spent taking a query tag and comparing it against all operand tags to generate a vector of tag-match scores. The operand tags are stored in sequential memory within the depository.
This seems like a good opportunity for vectorization, especially for simpler tag-matching metrics like the hamming metric. (Although more expensive metrics like longest contiguous matching sequence would benefit more, they seem possibly less feasible to vectorize.)
- tag-matching metric
Because the metric is computed so often (between every query and all operands), performance is crucial.
- caching
It is difficult to tune the caching policy, especially when workload conditions change over evolutionary time (expanding genomes, more/less regulation, etc.). Some sort of auto-tuned caching system could be helpful here.
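As one untested illustration of what "auto-tuned" could mean, a cache might periodically resize itself based on its observed hit rate; everything here (the thresholds, the FIFO eviction, the class itself) is a hypothetical sketch, not anything in Empirical:

```python
class AdaptiveCache:
    """Untested sketch: grow or shrink cache capacity based on observed hit rate."""

    def __init__(self, capacity=64, window=1000):
        self.capacity, self.window = capacity, window
        self.store, self.hits, self.accesses = {}, 0, 0

    def get(self, key):
        self.accesses += 1
        if key in self.store:
            self.hits += 1
            return self.store[key]
        return None   # miss

    def put(self, key, value):
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))   # evict oldest insertion (FIFO)
        self.store[key] = value
        if self.accesses >= self.window:
            self._retune()

    def _retune(self):
        hit_rate = self.hits / self.accesses
        # Heuristic thresholds (arbitrary): a cold cache may be too small to
        # hold the working set; a very hot one may be oversized.
        if hit_rate < 0.25:
            self.capacity *= 2
        elif hit_rate > 0.90:
            self.capacity = max(8, self.capacity // 2)
        self.hits = self.accesses = 0
```

A real version would also need to account for the cost of regulation-driven invalidation, since a large cache that is constantly cleared buys nothing.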
Organization
The project is set up to compile benchmarks separately using three different Empirical library copies that can be edited fully independently:
1. cpp/third-party/Empirical_baseline
* current best-performing code; for comparison purposes, don't change
2. cpp/third-party/Empirical_control
* slower than baseline, to make sure benchmark is working correctly
3. cpp/third-party/Empirical_fiddle
* copy of the current best-performing code; make changes here
Additionally, the MatchDepository template instantiation used can be edited directly just for the "fiddle" benchmark at cpp/include/matchbench/typdef/fiddle/MatchDepository.hpp.
(You can comment/uncomment some code here to switch out the "streak" tag distance metric for hamming distance.)
Nanobench is used to benchmark each Empirical library copy.
For the microbenchmark, a configurable-size, randomly generated MatchDepository is prepared; then, configurable numbers of random lookup queries and regulation adjustments are made on that depository.
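The shape of that benchmark loop can be sketched in Python (illustrative only; the real harness uses Nanobench and the C++ MatchDepository, and all parameter names here are made up):

```python
import random
import time

def run_microbench(n_entries=256, n_queries=1000, n_regulations=100,
                   tag_bits=32, seed=1):
    """Prepare a random depository, then time random lookups and regulation tweaks."""
    rng = random.Random(seed)
    tags = [rng.getrandbits(tag_bits) for _ in range(n_entries)]
    regulation = [0.0] * n_entries

    start = time.perf_counter()
    for _ in range(n_queries):
        q = rng.getrandbits(tag_bits)
        # lookup: metric over all operands, then pick the best-scoring one
        scores = [bin(t ^ q).count("1") * (1.0 + r)
                  for t, r in zip(tags, regulation)]
        _ = min(range(n_entries), key=scores.__getitem__)
    for _ in range(n_regulations):
        regulation[rng.randrange(n_entries)] += rng.uniform(-1.0, 1.0)
    return time.perf_counter() - start

print(f"elapsed: {run_microbench():.4f}s")
```

Varying `n_entries`, `n_queries`, and `n_regulations` is what a suite-style run sweeps over to model different runtime conditions.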
A duplicate baseline condition, also_baseline, is included to account for link-order effects on performance.
Benchmarks for each source tree are compiled separately, then linked together when make is run (with special care taken to avoid ODR violations).
You can run benchmarks at default settings via the executable ./matchbench.
To run the benchmarks across a suite of settings designed to model representative runtime conditions, use ./benchmark_suite.sh.
Interesting places to look & fiddle around inside the Empirical source tree are:
* emp::TagDepository
* tag matching data structure
* emp::SmallFifoMap
* data structure used for cache for regulated query lookup (frequently invalidated by regulation)
* emp::ApproxDualStreakMetric
* the evolutionarily promising but expensive tag distance metric
* emp::BitSet
* LongestSegmentOnes() (important for the streak metric) may be of particular interest
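A Python equivalent of the longest-run-of-ones computation, plus a rough streak-style comparison built from the description above, may help orient exploration. The bit-trick version of `longest_segment_ones` mirrors the standard AND-shift algorithm; `streak_distance` is a hypothetical combination of match and mismatch streaks, not Empirical's exact ApproxDualStreakMetric formula.

```python
def longest_segment_ones(bits: int) -> int:
    """Length of the longest contiguous run of 1 bits
    (cf. emp::BitSet's LongestSegmentOnes())."""
    length = 0
    while bits:
        bits &= bits << 1   # each AND-with-shift trims one bit off every run
        length += 1
    return length

def streak_distance(tag_a: int, tag_b: int, width: int = 32) -> float:
    """Rough sketch of a streak-style metric: weigh the longest contiguous
    matching run against the longest contiguous mismatching run.
    (Hypothetical combination; the real metric differs in detail.)"""
    mask = (1 << width) - 1
    mismatches = (tag_a ^ tag_b) & mask
    matches = ~mismatches & mask
    match_streak = longest_segment_ones(matches)
    mismatch_streak = longest_segment_ones(mismatches)
    # longer matching streaks -> smaller distance, in [0, 1]
    return mismatch_streak / (match_streak + mismatch_streak)
```

Because the metric reduces to bit-parallel operations (XOR, shifts, ANDs), the C++ analogue is a natural target for word-at-a-time or SIMD optimization.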
Credits
This package was created with Cookiecutter and the mmore500/cookiecutter-dishtiny-project project template.
This package uses Empirical, a library of tools for scientific software development, with emphasis on also being able to build web interfaces using Emscripten.
Owner
- Name: Matthew Andres Moreno
- Login: mmore500
- Kind: user
- Location: East Lansing, MI
- Company: @devosoft
- Website: mmore500.github.io
- Twitter: MorenoMathewA
- Repositories: 43
- Profile: https://github.com/mmore500
doctoral student, Computer Science and Engineering at Michigan State University
Citation (CITATION.cff)
date-released: 2022
license: MIT license
version: 0.0.0
authors:
- affiliation: Michigan State University
family-names: Moreno
given-names: Matthew Andres
orcid: 'https://orcid.org/0000-0003-4726-4479'
- affiliation: Michigan State University
family-names: Ofria
given-names: Charles
orcid: 'https://orcid.org/0000-0003-2924-1732'
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Matthew Andres Moreno | m****g@g****m | 25 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0