matchbench

Boilerplate code for benchmarking Empirical tag matching implementation

https://github.com/mmore500/matchbench

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Boilerplate code for benchmarking Empirical tag matching implementation

Basic Info
  • Host: GitHub
  • Owner: mmore500
  • Language: Shell
  • Default Branch: master
  • Size: 30.3 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed almost 4 years ago
Metadata Files
  • Readme
  • Contributing
  • Code of conduct
  • Citation

README.md


Tag-matching benchmark boilerplate for CSE 491.

  • Free software: MIT license

Quickstart

Be sure to clone recursively, and don't be concerned if it takes a minute or two; there are a lot of submodules.

```bash
git clone --recursive https://github.com/mmore500/matchbench.git
cd matchbench/cpp
make
./benchmark_suite.sh
```

To switch compilers:

```bash
make clean
export CXX=g++
make
```

Tag-Matching Recap

In evolutionary programming, it is important to be able to match queries to an available set of objects in a dynamic, flexible way (e.g., an "environmental signal" to a program module defined in a digital genome). A good way to do this is a tag-matching process, where a query tag is compared to a set of available tagged operands. The best-matching operand can be returned to the query. It is especially useful to be able to incorporate "regulation" into this process so that, dynamically at runtime, tagged operands can be "upregulated" or "downregulated" to affect whether they are returned in response to queries. Downregulated operands, for example, would be more likely to be outcompeted by other operands to be the answer to a query.

To enable this type of lookup, Empirical provides an emp::TagDepository data structure. (Termed so because tagged entities can be added but not removed.) This data structure facilitates lookups where the best-matching tag or tags are returned in response to query tags.

This TagDepository data structure is templated to allow for compile-time configuration, with three major components:

  • tag comparison metric
    • How should we test how "well" two tags match?
    • example: interpret the tags as integers and take the absolute difference
    • example: hamming distance between tags
    • example (slower but w/ promising evolutionary properties): "streak method" comparing the longest-matching and -mismatching contiguous sequences of bits
  • operand selection operator
    • How should we decide which tagged operands to return to a query?
    • example: top matching tag if better than threshold X, otherwise no tag
    • example: top two matching tags
    • example: all tags better than threshold X
    • example: randomly, with probability proportional to match quality
  • tag regulation implementation
    • How should a tag's regulatory state (cumulative down- or upregulation) affect its match score to queries?
    • untested idea: only allow complete deactivation of a tag

Due to the flexibility required to handle different operators and metrics, the TagDepository is basically just a vector of tag-operand pairs. To perform a lookup, the query tag is compared against all operand tags using the configured comparison metric to generate a vector of match scores. Then, the operand selection operator operates on those scores to decide which operand(s) should be yielded.
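To make the lookup flow concrete, here is a minimal self-contained sketch. It assumes illustrative names (Tag, Entry, lookup) rather than Empirical's actual API, and pairs a hamming comparison metric with a single-best-match selection operator:

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch, not Empirical's actual API: the depository is a flat
// vector of tag/operand pairs, scored with a hamming metric and reduced by a
// single-best-match selection operator.
constexpr std::size_t TagWidth = 32;
using Tag = std::bitset<TagWidth>;

struct Entry { Tag tag; std::string operand; };

// comparison metric: lower score = closer match
inline std::size_t hamming(const Tag& a, const Tag& b) { return (a ^ b).count(); }

// selection operator: yield the single best-scoring operand
inline std::string lookup(const std::vector<Entry>& depository, const Tag& query) {
  std::vector<std::size_t> scores;  // one score per stored operand
  scores.reserve(depository.size());
  for (const auto& e : depository) scores.push_back(hamming(e.tag, query));
  const auto best = std::min_element(scores.begin(), scores.end());
  return depository[best - scores.begin()].operand;
}
```

With this sketch, `lookup({{Tag{0b1010}, "module A"}, {Tag{0b1111}, "module B"}}, Tag{0b1000})` yields "module A", the operand whose tag is closest to the query.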

Some caching of responses to queries is possible (and performed), although regulation invalidates these cached responses.

We assume fixed-length bitstring tags.

This tag-matching process turns out to be the primary time-consumer in performance profiles for much of my work, so doing it more efficiently could have big tangible benefits.

Possible Avenues for Optimization

You can look at the result of a flame graph profile I did here, but it's not very informative. Figuring out a better way to profile would be helpful.

  1. vectorization

A large part of the run time is taking a query tag and comparing it against all operand tags to generate a vector of tag-match scores. The operand tags are stored in sequential memory within the depository.

This seems like a good opportunity for vectorization, especially for simpler tag matching metrics like the hamming metric. (Although more expensive metrics like longest contiguous matching sequence would benefit more, they seem possibly less feasible to vectorize.)

  2. tag-matching metric

Because the metric is computed so often (between every query and all operands), performance is crucial.
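For intuition about why streak-style metrics cost more than hamming, here is a naive sketch of the core loop. Empirical's emp::ApproxDualStreakMetric also weighs mismatch streaks and uses a probabilistic approximation; this illustrative version just scans bit by bit for the longest run where two tags agree:

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>

// Naive sketch of the "streak" idea: the length of the longest run of ones
// in ~(a ^ b), i.e., the longest contiguous stretch where two tags agree.
// Unlike hamming's single popcount, this is an inherently sequential scan.
template <std::size_t W>
std::size_t longest_match_streak(const std::bitset<W>& a, const std::bitset<W>& b) {
  const std::bitset<W> agree = ~(a ^ b);
  std::size_t best = 0, run = 0;
  for (std::size_t i = 0; i < W; ++i) {
    run = agree[i] ? run + 1 : 0;  // extend or reset the current run
    best = std::max(best, run);
  }
  return best;
}
```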

  3. caching

It is difficult to tune the caching policy, especially when work conditions change over evolutionary time (expanding genomes, more/less regulation, etc.). Some sort of auto-tuned caching system could be helpful here.
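As a rough sketch of the caching shape described here (illustrative, not Empirical's emp::SmallFifoMap implementation): a tiny FIFO-evicting map from query to result, flushed wholesale whenever regulation changes:

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>

// Illustrative FIFO cache: bounded capacity, oldest entry evicted first,
// and cleared entirely on any regulation change (since regulation
// invalidates previously cached query responses).
template <typename Key, typename Value, std::size_t Capacity = 8>
class SmallFifoCache {
  std::deque<std::pair<Key, Value>> entries;
public:
  std::optional<Value> get(const Key& k) const {
    for (const auto& [key, val] : entries)
      if (key == k) return val;
    return std::nullopt;  // miss: caller falls back to the full lookup
  }
  void put(const Key& k, Value v) {
    if (entries.size() == Capacity) entries.pop_front();  // evict oldest
    entries.emplace_back(k, std::move(v));
  }
  void on_regulation_change() { entries.clear(); }  // invalidate everything
};
```

An auto-tuned policy might adjust Capacity (or skip caching entirely) based on the observed hit rate under the current regulation churn.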

Organization

The project is set up to compile benchmarks separately using three different Empirical library copies that can be edited fully independently:

  1. cpp/third-party/Empirical_baseline: current best-performing code; for comparison purposes, don't change
  2. cpp/third-party/Empirical_control: slower than baseline, to make sure the benchmark is working correctly
  3. cpp/third-party/Empirical_fiddle: current best-performing code; make changes here

Additionally, the MatchDepository template instantiation used can be edited directly just for the "fiddle" benchmark at cpp/include/matchbench/typdef/fiddle/MatchDepository.hpp. (You can comment/uncomment some code here to switch out the "streak" tag distance metric for hamming distance.)

Nanobench is used to benchmark each Empirical library copy. For the microbenchmark, a configurable-size, randomly generated MatchDepository is prepared; then configurable numbers of random lookup queries and regulation adjustments are made on that depository. A duplicate baseline condition, also_baseline, is included to account for weird issues with link order affecting performance.

Benchmarks for each source tree are compiled separately, then linked together during make (with special care to avoid ODR violations). You can run benchmarks at default settings via the executable ./matchbench. To run the benchmarks across a suite of settings designed to model representative runtime conditions, use ./benchmark_suite.sh.

Interesting places to look & fiddle around inside the Empirical source tree are:

  • emp::TagDepository: the tag matching data structure
  • emp::SmallFifoMap: data structure used as the cache for regulated query lookups (frequently invalidated by regulation)
  • emp::ApproxDualStreakMetric: the evolutionarily promising but expensive tag distance metric
  • emp::BitSet: LongestSegmentOnes() (important for the streak metric) may be of particular interest

Credits

This package was created with Cookiecutter and the mmore500/cookiecutter-dishtiny-project project template.

This package uses Empirical, a library of tools for scientific software development, with emphasis on also being able to build web interfaces using Emscripten.

Owner

  • Name: Matthew Andres Moreno
  • Login: mmore500
  • Kind: user
  • Location: East Lansing, MI
  • Company: @devosoft

doctoral student, Computer Science and Engineering at Michigan State University

Citation (CITATION.cff)

date-released: 2022
license: MIT license
version: 0.0.0
authors:
  - affiliation: Michigan State University
    family-names: Moreno
    given-names: Matthew Andres
    orcid: 'https://orcid.org/0000-0003-4726-4479'
  - affiliation: Michigan State University
    family-names: Ofria
    given-names: Charles
    orcid: 'https://orcid.org/0000-0003-2924-1732'


Committers

Last synced: 7 months ago

All Time
  • Total Commits: 25
  • Total Committers: 1
  • Avg Commits per committer: 25.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Matthew Andres Moreno m****g@g****m 25

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0