https://github.com/barrust/count-min-sketch

Count-Min Sketch Implementation in C

https://github.com/barrust/count-min-sketch

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary

Keywords

c count-mean-min-sketch count-min-sketch data-structures probabilistic probabilistic-programming
Last synced: 5 months ago · JSON representation

Repository

Count-Min Sketch Implementation in C

Basic Info
  • Host: GitHub
  • Owner: barrust
  • License: mit
  • Language: C
  • Default Branch: master
  • Homepage:
  • Size: 63.5 KB
Statistics
  • Stars: 47
  • Watchers: 3
  • Forks: 16
  • Open Issues: 2
  • Releases: 10
Topics
c count-mean-min-sketch count-min-sketch data-structures probabilistic probabilistic-programming
Created almost 9 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog License

README.md

count-min-sketch

License: MIT GitHub release C/C++ CI codecov

A Count-Min Sketch implementation in C.

Count-Min Sketch is a probabilistic data-structure that takes sub linear space to store the probable count, or frequency, of occurrences of elements added into the data-structure. Due to the structure and strategy of storing elements, it is possible that elements are over counted but not under counted.

To use the library, copy the src/count_min_sketch.h and src/count_min_sketch.c files into your project and include it where needed.

License:

MIT 2017

Point Query Strategies

To generic method to query the count-min sketch for the number of times an element was inserted is to return the minimum value from each row in the data-structure. This is the maximum number of times that it may have been inserted, but there is a defined bias. This number is always greater than or equal to the actual value but never lower.

To help account for this bias, there are two other methods of querying the data. One is to use the mean of the results. This will result in larger answers, but is useful when elements can be removed from the count-min sketch.

The other option is to use the count-mean-min query strategy. This strategy attempts to remove the bias by taking the median value from the results of the following calculation of each row (where i is the bin result of the hash): bin[i] - ((number-elements - bin[i]) / (width - 1))

For a good description of different uses and methods of the count-min sketch, read this link.

For a python version, please check out pyprobables which has a binary compatible output.

Main Features

  • Ability to add and remove elements from the Count-Min Sketch
    • Increment or add x elements at once
    • Decrement or remove x elements at once
  • Ability to lookup elements in the data-structure
  • Add, remove, or lookup elements based on pre-calculated hashes
  • Ability to set depth & width or have the library calculate them based on error and confidence
  • Multiple lookup types:
    • Minimum: largest possible number of insertions by taking the maximum result
    • Mean: good for when removes and negatives are possible, but increases the false count
    • Mean-Min attempts to take bias into account; results are less skewed upwards compared to the mean lookup
  • Export and Import count-min sketch to file
  • Ability to merge multiple count-min sketches together

Future Enhancements

  • add method to calculate the possible bias (?)
  • add do everything directly on disk (?)
  • add import / export to hex (?)

Usage:

``` c

include

include "countminsketch.h"

CountMinSketch cms; cms_init(&cms, 10000, 7);

int i, res; for (i = 0; i < 10; i++) { res = cms_add(&cms, "this is a test"); }

res = cmscheck(&cms, "this is a test"); if (res != 10) { printf("Error with lookup: %d\n", res); } cmsdestroy(&cms); ```

Required Compile Flags

-lm

Backward Compatible Hash Function

To use the older count-min sketch (v0.1.8 or lower) that utilized the default hashing algorithm, then change use the following code as the hash function:

``` c /* NOTE: The caller will free the results / static uint64_t originaldefaulthash(unsigned int numhashes, const char* str) { uint64t results = (uint64_t)calloc(numhashes, sizeof(uint64t)); char key[17] = {0}; // largest value is 7FFF,FFFF,FFFF,FFFF results[0] = _fnv1a(str); for (unsigned int i = 1; i < numhashes; ++i) { sprintf(key, "%" PRIx64 "", results[i-1]); results[i] = oldfnv_1a(key); } return results; }

static uint64t oldfnv1a(const char* key) { // FNV-1a hash (http://www.isthe.com/chongo/tech/comp/fnv/) int i, len = strlen(key); uint64t h = 14695981039346656073ULL; // FNVOFFSET 64 bit for (i = 0; i < len; ++i){ h = h ^ (unsigned char) key[i]; h = h * 1099511628211ULL; // FNVPRIME 64 bit } return h; } ```

If using only older count-min sketch, then you can update the // FNV_OFFSET 64 bit to use 14695981039346656073ULL

Owner

  • Name: Tyler Barrus
  • Login: barrust
  • Kind: user
  • Location: Richmond Va

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 9
  • Total pull requests: 13
  • Average time to close issues: 25 days
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 2.44
  • Average comments per pull request: 0.77
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • barrust (7)
  • zhoujie757 (1)
  • meixsh (1)
Pull Request Authors
  • barrust (9)
  • jonahharris (1)
  • meixsh (1)
  • TheMadman (1)
Top Labels
Issue Labels
enhancement (3)
Pull Request Labels

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v2 composite
  • codecov/codecov-action v1 composite