sparsepp

A fast, memory efficient hash map for C++

https://github.com/greg7mdp/sparsepp

Keywords

hash hash-map hash-tables hashmap sparse sparse-containers sparsehash unordered

Last synced: 9 months ago · JSON representation

Repository

A fast, memory efficient hash map for C++

Basic Info

Host: GitHub
Owner: greg7mdp
License: other
Language: C++
Default Branch: master
Size: 417 KB

Statistics

Stars: 1,288
Watchers: 65
Forks: 127
Open Issues: 4
Releases: 5

Topics

hash hash-map hash-tables hashmap sparse sparse-containers sparsehash unordered

Created almost 10 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Funding License

README.md

I now recommend using the parallel hashmap instead of sparsepp, unless if you are stuck with a non C++11 compatible compiler, or if using a little bit more memory is not acceptable. I will personally switch to using the parallel hashmap, which I believe offers a significantly better tradeoff (slightly higher memory usage, much faster).

Sparsepp: A fast, memory efficient hash map for C++

Sparsepp is derived from Google's excellent sparsehash implementation. It aims to achieve the following objectives:

A drop-in alternative for unorderedmap and unorderedset.
Extremely low memory usage (typically about one byte overhead per entry), and most importantly very small memory overhead when resizing.
Very efficient, typically faster than your compiler's unordered map/set or Boost's.
C++11 support (if supported by compiler).
~~Single header~~ not anymore
Tested on Windows (vs2010-2015, g++), linux (g++, clang++) and MacOS (clang++).

We believe Sparsepp provides an unparalleled combination of performance and memory usage, and will outperform your compiler's unorderedmap on both counts. Only Google's `densehash_map` is consistently faster, at the cost of much greater memory usage (especially when the final size of the map is not known in advance, and insertions cause a resizing).

This hash map. like Google's dense_hash_map, uses open addressing (meaning that the hash table entries are conceptually stored in a big array, and collisions are not resolved using a linked list, but by probing). A big drawback of open addressing is that when the array needs to be resized larger, the high mark for memory usage is typically 3 times the current map size (as we allocate a new array twice as big as the current one, and copy from the old array to the new one). Sparsepp's implementation resolves this memory usage issue, and the memory usage high mark shows only a small bump when resizing.

For a detailed comparison of various hash implementations, including Sparsepp, please see our write-up.

Example

```c++

include

using spp::sparsehashmap;

int main() { // Create an unorderedmap of three strings (that map to strings) sparsehash_map email = { { "tom", "tom@gmail.com"}, { "jeff", "jk@gmail.com"}, { "jim", "jimg@microsoft.com"} };

// Iterate and print keys and values 
for (const auto& n : email) 
    std::cout << n.first << "'s email is: " << n.second << "\n";

// Add a new entry
email["bill"] = "bg@whatever.com";

// and print it
std::cout << "bill's email is: " << email["bill"] << "\n";

return 0;

} ```

Installation

No compilation is needed, as this is a header-only library. The installation consist in copying the sparsepp directory wherever it will be convenient to include in your project(s).

Warning - iterator and reference invalidation on erase/insert

erasing elements is likely to invalidate iterators (for example when calling erase())
inserting new elements is likely to invalidate iterators (iterator invalidation can also happen with std::unordered_map if rehashing occurs due to the insertion)
references to values stored in a sparsehashmap or set are likely to be invalidated on insert()/erase(). This is not the case for std::unordered_map or set.

Usage

As shown in the example above, you need to include the header file: #include <sparsepp/spp.h>

This provides the implementation for the following classes:

```c++ namespace spp { template ,
class EqualKey = std::equalto, class Alloc = libcallocatorwithrealloc>> class sparsehashmap;

template <class Value,
          class HashFcn  = spp_hash<Value>,
          class EqualKey = std::equal_to<Value>,
          class Alloc    = libc_allocator_with_realloc<Value>>
class sparse_hash_set;

}; ```

These classes provide the same interface as std::unorderedmap and std::unorderedset, with the following differences:

Calls to erase() may invalidate iterators. However, conformant to the C++11 standard, the position and range erase functions return an iterator pointing to the position immediately following the last of the elements erased. This makes it easy to traverse a sparse hash table and delete elements matching a condition. For example to delete odd values:

c++ for (auto it = c.begin(); it != c.end(); ) if (it->first % 2 == 1) it = c.erase(it); else ++it;

As for std::unordered_map, the order of the elements that are not erased is preserved.

Since items are not grouped into buckets, Bucket APIs have been adapted: max_bucket_count is equivalent to max_size, and bucket_count returns the sparsetable size, which is normally at least twice the number of items inserted into the hash_map.
Values inserted into sparsepp have to either be copyable and movable, or just movable. See example movable.cc.

Memory allocator on Windows (when building with Visual Studio)

When building with the Microsoft compiler, we provide a custom allocator because the default one (from the Visual C++ runtime) fragments memory when reallocating.

This is desirable only when creating large sparsepp hash maps. If you create lots of small hashmaps, memory usage may increase instead of decreasing as expected. The reason is that, for each instance of a hashmap, the custom memory allocator creates a new memory space to allocate from, which is typically 4K, so it may be a big waste if just a few items are allocated.

In order to use the custom spp allocator, define the following preprocessor variable before including <spp/spp.h>:

#define SPP_USE_SPP_ALLOC 1

Integer keys, and other hash function considerations.

For basic integer types, sparsepp provides a default hash function which does some mixing of the bits of the keys (see Integer Hashing). This prevents a pathological case where inserted keys are sequential (1, 2, 3, 4, ...), and the lookup on non-present keys becomes very slow.

Of course, the user of sparsepp may provide its own hash function, as shown below:

```c++ #include

struct Hash64 { sizet operator()(uint64t k) const { return (k ^ 14695981039346656037ULL) * 1099511628211ULL; } };

struct Hash32 { sizet operator()(uint32t k) const { return (k ^ 2166136261U) * 16777619UL; } };

int main() { spp::sparsehashmap map; ... }

```

When the user provides its own hash function, for example when inserting custom classes into a hash map, sometimes the resulting hash keys have similar low order bits and cause many collisions, decreasing the efficiency of the hash map. To address this use case, sparsepp provides an optional 'mixing' of the hash key (see Integer Hash Function which can be enabled by defining the proprocessor macro: SPPMIXHASH.

Example 2 - providing a hash function for a user-defined class

In order to use a sparsehashset or sparsehashmap, a hash function should be provided. Even though a the hash function can be provided via the HashFcn template parameter, we recommend injecting a specialization of std::hash for the class into the "std" namespace. For example:

```c++

include

using std::string;

struct Person { bool operator==(const Person &o) const { return first == o.first && last == o.last; }

string _first;
string _last;

};

namespace std { // inject specialization of std::hash for Person into namespace std // ---------------------------------------------------------------- template<> struct hash { std::sizet operator()(Person const &p) const { std::sizet seed = 0; spp::hashcombine(seed, p.first); spp::hashcombine(seed, p.last); return seed; } }; }

int main() { // As we have defined a specialization of std::hash() for Person, // we can now create sparsehashset or sparsehashmap of Persons // ---------------------------------------------------------------- spp::sparsehashset persons = { { "John", "Galt" }, { "Jane", "Doe" } }; for (auto& p: persons) std::cout << p.first << ' ' << p.last << '\n'; } ```

The std::hash specialization for Person combines the hash values for both first and last name using the convenient spp::hash_combine function, and returns the combined hash value.

spp::hashcombine is provided by the header sparsepp/spp.h. However, class definitions often appear in header files, and it is desirable to limit the size of headers included in such header files, so we provide the very small header `sparsepp/spputils.h` for that purpose:

```c++

include

using std::string;

struct Person { bool operator==(const Person &o) const { return first == o.first && last == o.last && age == o.age; }

string _first;
string _last;
int    _age;

};

namespace std { // inject specialization of std::hash for Person into namespace std // ---------------------------------------------------------------- template<> struct hash { std::sizet operator()(Person const &p) const { std::sizet seed = 0; spp::hashcombine(seed, p.first); spp::hashcombine(seed, p.last); spp::hashcombine(seed, p.age); return seed; } }; } ```

Example 3 - serialization

sparsehashset and sparsehashmap can easily be serialized/unserialized to a file or network connection. This support is implemented in the following APIs:

```c++ template bool serialize(Serializer serializer, OUTPUT *stream);

template <typename Serializer, typename INPUT>
bool unserialize(Serializer serializer, INPUT *stream);

```

The following example demonstrates how a simple sparsehashmap can be written to a file, and then read back. The serializer we use read and writes to a file using the stdio APIs, but it would be equally simple to write a serialized using the stream APIS:

```c++

include

using spp::sparsehashmap; using namespace std;

class FileSerializer { public: // serialize basic types to FILE // ----------------------------- template bool operator()(FILE *fp, const T& value) { return fwrite((const void *)&value, sizeof(value), 1, fp) == 1; }

template <class T>
bool operator()(FILE *fp, T* value) 
{
    return fread((void *)value, sizeof(*value), 1, fp) == 1;
}

// serialize std::string to FILE
// -----------------------------
bool operator()(FILE *fp, const string& value) 
{
    const size_t size = value.size();
    return (*this)(fp, size) && fwrite(value.c_str(), size, 1, fp) == 1;
}

bool operator()(FILE *fp, string* value) 
{
    size_t size;
    if (!(*this)(fp, &size)) 
        return false;
    char* buf = new char[size];
    if (fread(buf, size, 1, fp) != 1) 
    {
        delete [] buf;
        return false;
    }
    new (value) string(buf, (size_t)size);
    delete[] buf;
    return true;
}

// serialize std::pair<const A, B> to FILE - needed for maps
// ---------------------------------------------------------
template <class A, class B>
bool operator()(FILE *fp, const std::pair<const A, B>& value)
{
    return (*this)(fp, value.first) && (*this)(fp, value.second);
}

template <class A, class B>
bool operator()(FILE *fp, std::pair<const A, B> *value) 
{
    return (*this)(fp, (A *)&value->first) && (*this)(fp, &value->second);
}

};

int main(int argc, char* argv[]) { sparsehashmap age{ { "John", 12 }, {"Jane", 13 }, { "Fred", 8 } };

// serialize age hash_map to "ages.dmp" file
FILE *out = fopen("ages.dmp", "wb");
age.serialize(FileSerializer(), out);
fclose(out);

sparse_hash_map<string, int> age_read;

// read from "ages.dmp" file into age_read hash_map 
FILE *input = fopen("ages.dmp", "rb");
age_read.unserialize(FileSerializer(), input);
fclose(input);

// print out contents of age_read to verify correct serialization
for (auto& v : age_read)
    printf("age_read: %s -> %d\n", v.first.c_str(), v.second);

} ```

Thread safety

Sparsepp follows the thread safety rules of the Standard C++ library. In Particular:

A single sparsepp hash table is thread safe for reading from multiple threads. For example, given a hash table A, it is safe to read A from thread 1 and from thread 2 simultaneously.
If a single hash table is being written to by one thread, then all reads and writes to that hash table on the same or other threads must be protected. For example, given a hash table A, if thread 1 is writing to A, then thread 2 must be prevented from reading from or writing to A.
It is safe to read and write to one instance of a type even if another thread is reading or writing to a different instance of the same type. For example, given hash tables A and B of the same type, it is safe if A is being written in thread 1 and B is being read in thread 2.

Owner

Name: Gregory Popovitch
Login: greg7mdp
Kind: user
Location: Michigan, USA

Twitter: greg7mdp
Repositories: 45
Profile: https://github.com/greg7mdp

GitHub Events

Total

Issues event: 2
Watch event: 49
Issue comment event: 2
Push event: 1
Pull request event: 1
Fork event: 5

Last Year

Issues event: 2
Watch event: 49
Issue comment event: 2
Push event: 1
Pull request event: 1
Fork event: 5

Committers

Last synced: 12 months ago

All Time

Total Commits: 167
Total Committers: 14
Avg Commits per committer: 11.929
Development Distribution Score (DDS): 0.138

Past Year

Commits: 3
Committers: 1
Avg Commits per committer: 3.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
greg7mdp	g**p@g**m	144
Breno Guimaraes	b**g@g**m	5
Ka Ho Ng	k**0@g**m	4
Federico Fuga	f**a@s**m	3
Csaba	C**a@a**s	2
Todd Lipcon	t**d@c**m	1
Tessil	t**l@g**m	1
Rémi Gillig	r**g@g**m	1
Rob Patro	r****p	1
Kevin Li	l****1	1
Jeffrey Hetherly	j**8@m**m	1
Fangrui Song	i@m****e	1
Alt	p**t@m**u	1
Karol Roslaniec	k**s@l**l	1

Committer Domains (Top 20 + Academic)

linko.pl: 1 mail.ru: 1 maskray.me: 1 me.com: 1 gmx.com: 1 cloudera.com: 1 studiofuga.com: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 79
Total pull requests: 16
Average time to close issues: about 2 months
Average time to close pull requests: about 1 month
Total issue authors: 57
Total pull request authors: 13
Average comments per issue: 5.87
Average comments per pull request: 0.38
Merged pull requests: 16
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 5 days
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

brenoguim (4)
rob-p (4)
RaguRam9967 (3)
sagaceilo (3)
dratchkov (3)
springmeyer (2)
Ragu9967 (2)
sirilvk (2)
sam0x17 (2)
philippecp (2)
dselivanov (2)
indigox4 (2)
vspinu (2)
zelda007 (2)
toddlipcon (1)

Pull Request Authors

brenoguim (4)
studiofuga (2)
jhetherly (1)
PSIAlt (1)
Bu11etmagnet (1)
legendre6891 (1)
MaskRay (1)
khng300 (1)
Roslaniec (1)
rob-p (1)
speps (1)
toddlipcon (1)
Tessil (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 234 last-month
Total docker downloads: 22,273

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 5
Total maintainers: 1

cran.r-project.org: sparsepp

'Rcpp' Interface to 'sparsepp'

Homepage: https://github.com/greg7mdp/sparsepp
Documentation: http://cran.r-project.org/web/packages/sparsepp/sparsepp.pdf
License: BSD_3_clause + file LICENSE
Latest release: 0.2.0
published almost 9 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 234 Last month
Docker Downloads: 22,273

Rankings

Stargazers count: 0.2%

Forks count: 0.5%

Docker downloads count: 12.6%

Average: 20.1%

Dependent repos count: 24.0%

Dependent packages count: 28.8%

Downloads: 54.4%

Maintainers (1)

selivanov.dmitriy@gmail.com

Last synced: 9 months ago

sparsepp

Science Score: 26.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Sparsepp: A fast, memory efficient hash map for C++

Example

include

include

include

Installation

Warning - iterator and reference invalidation on erase/insert

Usage

Memory allocator on Windows (when building with Visual Studio)

Integer keys, and other hash function considerations.

Example 2 - providing a hash function for a user-defined class

include

include

include

include

include

include

Example 3 - serialization

include

include

Thread safety

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

cran.r-project.org: sparsepp

Rankings

Maintainers (1)