matchbench

Boilerplate code for benchmarking Empirical tag matching implementation

https://github.com/mmore500/matchbench

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Boilerplate code for benchmarking Empirical tag matching implementation

Basic Info
  • Host: GitHub
  • Owner: mmore500
  • Language: Shell
  • Default Branch: master
  • Size: 30.3 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed almost 4 years ago
Metadata Files
  • Readme
  • Contributing
  • Code of conduct
  • Citation

README.md


Tag-matching benchmark boilerplate for CSE 491.

  • Free software: MIT license

Quickstart

Be sure to clone recursively, and don't be concerned if it takes a minute or two; there are a lot of submodules.

```bash
git clone --recursive https://github.com/mmore500/matchbench.git
cd matchbench/cpp
make
./benchmark_suite.sh
```

To switch compilers:

```bash
make clean
export CXX=g++
make
```

Tag-Matching Recap

In evolutionary programming, it is important to be able to match queries to an available set of objects in a dynamic, flexible way (e.g., an "environmental signal" to a program module defined in a digital genome). A good way to do this is a tag-matching process, where a query tag is compared to a set of available tagged operands. The best-matching operand can be returned to the query. It is especially useful to be able to incorporate "regulation" into this process so that, dynamically at runtime, tagged operands can be "upregulated" or "downregulated" to affect whether they are returned in response to queries. Downregulated operands, for example, would be more likely to be outcompeted by other operands to be the answer to a query.

To enable this type of lookup, Empirical provides an emp::TagDepository data structure. (Termed so because tagged entities can be added but not removed.) This data structure facilitates lookups where the best-matching tag or tags are returned in response to query tags.

This TagDepository data structure is templated to allow for compile-time configuration, with three major components:

  • tag comparison metric
    • How should we test how "well" two tags match?
    • example: interpret the tags as integers and take the absolute difference
    • example: hamming distance between tags
    • example (slower but w/ promising evolutionary properties): "streak method" comparing the longest-matching and -mismatching contiguous sequences of bits
  • operand selection operator
    • How should we decide which tagged operands to return to a query?
    • example: top matching tag if better than threshold X, otherwise no tag
    • example: top two matching tags
    • example: all tags better than threshold X
    • example: randomly, with probability proportional to match quality
  • tag regulation implementation
    • How should a tag's regulatory state (cumulative down- or upregulation) affect its match score to queries?
    • untested idea: only allow complete deactivation of a tag

Due to the flexibility required to handle different operators and metrics, the TagDepository is basically just a vector of tag-operand pairs. To perform a lookup, the query tag is compared against all operand tags using the configured comparison metric to generate a vector of match scores. Then, the operand selection operator operates on those scores to decide which operand(s) should be yielded.
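To make the lookup flow concrete, here is a minimal self-contained sketch. It assumes illustrative names (Tag, Entry, lookup) rather than Empirical's actual API, and pairs a hamming comparison metric with a single-best-match selection operator:

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch, not Empirical's actual API: the depository is a flat
// vector of tag/operand pairs, scored with a hamming metric and reduced by a
// single-best-match selection operator.
constexpr std::size_t TagWidth = 32;
using Tag = std::bitset<TagWidth>;

struct Entry { Tag tag; std::string operand; };

// comparison metric: lower score = closer match
inline std::size_t hamming(const Tag& a, const Tag& b) { return (a ^ b).count(); }

// selection operator: yield the single best-scoring operand
inline std::string lookup(const std::vector<Entry>& depository, const Tag& query) {
  std::vector<std::size_t> scores;  // one score per stored operand
  scores.reserve(depository.size());
  for (const auto& e : depository) scores.push_back(hamming(e.tag, query));
  const auto best = std::min_element(scores.begin(), scores.end());
  return depository[best - scores.begin()].operand;
}
```

With this sketch, `lookup({{Tag{0b1010}, "module A"}, {Tag{0b1111}, "module B"}}, Tag{0b1000})` yields "module A", the operand whose tag is closest to the query.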

Some caching of responses to queries is possible (and performed), although regulation invalidates these cached responses.

We assume fixed-length bitstring tags.

This tag-matching process turns out to be the primary time-consumer in performance profiles for much of my work, so doing it more efficiently could have big tangible benefits.

Possible Avenues for Optimization

You can look at the result of a flame graph profile I did here, but it's not very informative. Figuring out a better way to profile would be helpful.

  1. vectorization

A large part of the run time is taking a query tag and comparing it against all operand tags to generate a vector of tag-match scores. The operand tags are stored in sequential memory within the depository.

This seems like a good opportunity for vectorization, especially for simpler tag matching metrics like the hamming metric. (Although more expensive metrics like longest contiguous matching sequence would benefit more, they seem possibly less feasible to vectorize.)

  2. tag-matching metric

Because the metric is computed so often (between every query and all operands), performance is crucial.
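For intuition about why streak-style metrics cost more than hamming, here is a naive sketch of the core loop. Empirical's emp::ApproxDualStreakMetric also weighs mismatch streaks and uses a probabilistic approximation; this illustrative version just scans bit by bit for the longest run where two tags agree:

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>

// Naive sketch of the "streak" idea: the length of the longest run of ones
// in ~(a ^ b), i.e., the longest contiguous stretch where two tags agree.
// Unlike hamming's single popcount, this is an inherently sequential scan.
template <std::size_t W>
std::size_t longest_match_streak(const std::bitset<W>& a, const std::bitset<W>& b) {
  const std::bitset<W> agree = ~(a ^ b);
  std::size_t best = 0, run = 0;
  for (std::size_t i = 0; i < W; ++i) {
    run = agree[i] ? run + 1 : 0;  // extend or reset the current run
    best = std::max(best, run);
  }
  return best;
}
```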

  3. caching

It is difficult to tune the caching policy, especially when work conditions change over evolutionary time (expanding genomes, more/less regulation, etc.). Some sort of auto-tuned caching system could be helpful here.
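As a rough sketch of the caching shape described here (illustrative, not Empirical's emp::SmallFifoMap implementation): a tiny FIFO-evicting map from query to result, flushed wholesale whenever regulation changes:

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>

// Illustrative FIFO cache: bounded capacity, oldest entry evicted first,
// and cleared entirely on any regulation change (since regulation
// invalidates previously cached query responses).
template <typename Key, typename Value, std::size_t Capacity = 8>
class SmallFifoCache {
  std::deque<std::pair<Key, Value>> entries;
public:
  std::optional<Value> get(const Key& k) const {
    for (const auto& [key, val] : entries)
      if (key == k) return val;
    return std::nullopt;  // miss: caller falls back to the full lookup
  }
  void put(const Key& k, Value v) {
    if (entries.size() == Capacity) entries.pop_front();  // evict oldest
    entries.emplace_back(k, std::move(v));
  }
  void on_regulation_change() { entries.clear(); }  // invalidate everything
};
```

An auto-tuned policy might adjust Capacity (or skip caching entirely) based on the observed hit rate under the current regulation churn.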

Organization

The project is set up to compile benchmarks separately using three different Empirical library copies that can be edited fully independently:

  1. cpp/third-party/Empirical_baseline: current best-performing code; for comparison purposes, don't change
  2. cpp/third-party/Empirical_control: slower than baseline, to make sure the benchmark is working correctly
  3. cpp/third-party/Empirical_fiddle: current best-performing code; make changes here

Additionally, the MatchDepository template instantiation used can be edited directly just for the "fiddle" benchmark at cpp/include/matchbench/typdef/fiddle/MatchDepository.hpp. (You can comment/uncomment some code here to switch out the "streak" tag distance metric for hamming distance.)

Nanobench is used to benchmark each Empirical library copy. For the microbenchmark, a configurable-size, randomly generated MatchDepository is prepared; then configurable numbers of random lookup queries and regulation adjustments are made on that depository. A duplicate baseline condition, also_baseline, is included to account for weird issues with link order affecting performance.

Benchmarks for each source tree are compiled separately, then linked together during make (with special care to avoid ODR violations). You can run benchmarks at default settings via the executable ./matchbench. To run the benchmarks across a suite of settings designed to model representative runtime conditions, use ./benchmark_suite.sh.

Interesting places to look & fiddle around inside the Empirical source tree are:

  • emp::TagDepository: the tag matching data structure
  • emp::SmallFifoMap: data structure used as the cache for regulated query lookups (frequently invalidated by regulation)
  • emp::ApproxDualStreakMetric: the evolutionarily promising but expensive tag distance metric
  • emp::BitSet: LongestSegmentOnes() (important for the streak metric) may be of particular interest

Credits

This package was created with Cookiecutter and the mmore500/cookiecutter-dishtiny-project project template.

This package uses Empirical, a library of tools for scientific software development, with emphasis on also being able to build web interfaces using Emscripten.

Owner

  • Name: Matthew Andres Moreno
  • Login: mmore500
  • Kind: user
  • Location: East Lansing, MI
  • Company: @devosoft

doctoral student, Computer Science and Engineering at Michigan State University

Citation (CITATION.cff)

date-released: 2022
license: MIT license
version: 0.0.0
authors:
  - affiliation: Michigan State University
    family-names: Moreno
    given-names: Matthew Andres
    orcid: 'https://orcid.org/0000-0003-4726-4479'
  - affiliation: Michigan State University
    family-names: Ofria
    given-names: Charles
    orcid: 'https://orcid.org/0000-0003-2924-1732'


Committers

Last synced: 7 months ago

All Time
  • Total Commits: 25
  • Total Committers: 1
  • Avg Commits per committer: 25.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Matthew Andres Moreno m****g@g****m 25

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0