https://github.com/talariadb/talaria

TalariaDB is a distributed, highly available, and low latency time-series database for Presto

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary

Keywords

big-data column-store database prestodb real-time stream-processing time-series

Keywords from Contributors

distributed embedded sequences projection interactive vulnerabilities excel chart agents mesh
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: talariadb
  • License: mit
  • Language: Go
  • Default Branch: master
  • Homepage:
  • Size: 12.8 MB
Statistics
  • Stars: 225
  • Watchers: 14
  • Forks: 31
  • Open Issues: 14
  • Releases: 37
Created almost 6 years ago · Last pushed about 2 years ago
Metadata Files
Readme License

README.md

Talaria

This repository contains a fork of TalariaDB, a distributed, highly available, and low latency time-series database for Big Data systems. It was originally designed and implemented at Grab, where millions of transactions and connections take place every day, which requires a platform for scalable, data-driven decision making.

Introduction

TalariaDB helped us to overcome the challenge of retrieving and acting upon the information from large amounts of data. It addressed our need to query at least 2-3 terabytes of data per hour with predictable low query latency and low cost. Most importantly, it plays very nicely with the different tools’ ecosystems and lets us query data using SQL.

From the original design, we have extended Talaria so that it can be set up in two possible ways:

  1. As an event ingestion platform. This allows you to track events using a simple gRPC endpoint from almost anywhere.
  2. As a data store for hot data. This allows you to query hot data (e.g. last 6 hours) as it goes through the data pipeline and ultimately ends up in your data lake when compacted.

Talaria is designed around an event-based data model. An event is essentially a set of key-value pairs; however, to keep events consistent we need to define a set of commonly used keys. Each event will consist of the following:

  • Hash key (e.g. using the "event" key). This represents the type of the event and can be prefixed with the source scope (e.g. "table1"), using the dot as a logical separator. The separation and namespacing are not required, but strongly recommended to make your system more usable.
  • Sort key (e.g. using the "time" key). This represents the time at which the update occurred, as a Unix timestamp (as precise as the source allows) encoded as a 64-bit integer value.
  • Other key-value pairs will represent various values of the columns.

Below is an example of what a payload for an event describing a table update might look like.

| KEY     | VALUE               | DATA TYPE |
|---------|---------------------|-----------|
| event   | table1.update       | string    |
| time    | 1586500157          | int64     |
| column1 | hello               | string    |
| column2 | { "name": "roman" } | json      |

Talaria supports string, int32, int64, bool, float64, timestamp and json data types which are used to construct columns that can be exposed to Presto/SQL.
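As a rough illustration of the event model described above, the following Python sketch builds an event as a flat set of key-value pairs with a namespaced hash key and an integer sort key. The helper names `make_event` and `talaria_type`, and the mapping from Python values to Talaria type names, are assumptions made for this example and are not part of Talaria's API.

```python
import time

# Keys reserved for the hash key and the sort key in this sketch.
RESERVED_KEYS = ("event", "time")


def make_event(scope, name, columns, timestamp=None):
    """Build an event dict with an 'event' hash key and a 'time' sort key.

    The hash key is namespaced as '<scope>.<name>', e.g. 'table1.update'.
    """
    if timestamp is None:
        timestamp = time.time()
    event = {"event": f"{scope}.{name}", "time": int(timestamp)}
    for key, value in columns.items():
        if key in RESERVED_KEYS:
            raise ValueError(f"column name {key!r} is reserved")
        event[key] = value
    return event


def talaria_type(value):
    """Guess a Talaria data type name for a Python value (hypothetical mapping).

    int32 and timestamp are not distinguished here; every int maps to int64.
    """
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (dict, list)):
        return "json"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


event = make_event(
    "table1",
    "update",
    {"column1": "hello", "column2": {"name": "roman"}},
    timestamp=1586500157,
)
for key, value in event.items():
    print(key, talaria_type(value))
```

The sketch only mirrors the shape of the payload in the table; how values are actually encoded on the wire is defined by the protobuf definition in the repository.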

Event Ingestion with Talaria

If your organisation needs a reliable and scalable data ingestion platform, you can set up Talaria as one. The main advantage is that such a platform is cost-efficient, does not require a complex Kafka setup and even offers in-flight querying if you also point Presto at it. The basic setup allows you to track events using a simple gRPC endpoint from almost anywhere.

To set up Talaria as an ingestion platform, you will need to specify a table, in this case "eventlog", and enable compaction in the configuration, something along these lines:

```yaml
mode: staging
env: staging
domain: "talaria-headless.default.svc.cluster.local"
storage:
  dir: "/data"
tables:
  eventlog:
    compact:                               # enable compaction
      interval: 60                         # compact every 60 seconds
      nameFunc: "s3://bucket/namefunc.lua" # file name function
      s3:                                  # sink to Amazon S3
        region: "ap-southeast-1"
        bucket: "bucket"
...
```

Once this is set up, you can point a gRPC client (see protobuf definition) directly to the ingestion endpoint. Note that we also offer some pre-generated or pre-made ingestion clients in this repository.

```protobuf
service Ingress {
  rpc Ingest(IngestRequest) returns (IngestResponse) {}
}
```

Below is a list of currently supported sinks and their example configurations:

For Microsoft Azure Blob Storage and Azure Data Lake Gen 2, we support writing across multiple storage accounts. We support two modes:

  1. Random choice, where each write is directed to a storage account at random, for which you only need to specify a list of storage accounts.
  2. Weighted choice, where a set of weights (positive integers) are assigned and each write is directed to a storage account based on the specified weights.

An example of weighted choice is shown below:

```yaml
- azure:
    container: a_container
    prefix: a_prefix
    blobServiceURL: .storage.microsoft.net
    storageAccounts:
      - a_storage_account
      - b_storage_account
    storageAccountWeights: [1, 2]
```
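The two selection modes can be illustrated with a short Python sketch. This is not Talaria's implementation (which is written in Go); the function name `pick_storage_account` is hypothetical, and the sketch only shows how uniform and weighted selection over a list of accounts might behave, using the standard library's `random.choice` and `random.choices`.

```python
import random
from collections import Counter


def pick_storage_account(accounts, weights=None, rng=random):
    """Pick the storage account for a single write.

    Without weights, every account is equally likely (random choice).
    With weights (positive integers), an account is picked with
    probability proportional to its weight (weighted choice).
    """
    if not accounts:
        raise ValueError("at least one storage account is required")
    if weights is None:
        return rng.choice(accounts)
    if len(weights) != len(accounts):
        raise ValueError("need exactly one weight per storage account")
    if any(not isinstance(w, int) or w <= 0 for w in weights):
        raise ValueError("weights must be positive integers")
    return rng.choices(accounts, weights=weights, k=1)[0]


# Simulate 3000 writes with the weights from the example configuration.
rng = random.Random(42)
accounts = ["a_storage_account", "b_storage_account"]
counts = Counter(pick_storage_account(accounts, [1, 2], rng) for _ in range(3000))
print(counts)
```

With weights `[1, 2]`, roughly two thirds of the simulated writes should land on `b_storage_account`, which is the kind of split you would use when one account can absorb more I/O than the other.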

Random choice and weighted choice are particularly useful in some key scenarios:

  • High-throughput deployments where the I/O generated by Talaria exceeds the limits of the storage accounts.
  • Deployments on internal endpoints with multiple VPN links, where you want to split the network traffic across multiple network links.

Hot Data Query with Talaria

If your organisation requires querying of either hot data (e.g. last n hours) or in-flight data (i.e. as ingested), you can also configure Talaria to serve it to Presto using the built-in Presto Thrift connector.

In the example configuration below we're setting up an S3 + SQS writer to continuously ingest files from an S3 bucket, and an "eventlog" table which will be exposed to Presto.

```yaml
mode: staging
env: staging
domain: "talaria-headless.default.svc.cluster.local"
writers:
  grpc:
    port: 8080
  s3sqs:
    region: "ap-southeast-1"
    queue: "queue-url"
    waitTimeout: 1
    retries: 5
readers:
  presto:
    schema: data
    port: 8042
storage:
  dir: "/data"
tables:
  eventlog:
    ttl: 3600         # data is persisted for 1 hour
    hashBy: event
    sortBy: time
...
```
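To make the `hashBy`, `sortBy` and `ttl` settings more concrete, here is a toy in-memory model in Python. It is purely conceptual and makes no claim about how Talaria stores or expires data internally; the class `HotTable` and its methods are hypothetical names used only for this illustration.

```python
import bisect


class HotTable:
    """Toy model of a table configured with hashBy, sortBy and ttl."""

    def __init__(self, hash_by="event", sort_by="time", ttl=3600):
        self.hash_by = hash_by
        self.sort_by = sort_by
        self.ttl = ttl
        self._buckets = {}  # hash key value -> sorted list of (sort key, seq, row)
        self._seq = 0       # tie-breaker so rows themselves are never compared

    def append(self, row):
        """Store a row under its hash key, ordered by its sort key."""
        bucket = self._buckets.setdefault(row[self.hash_by], [])
        bisect.insort(bucket, (row[self.sort_by], self._seq, row))
        self._seq += 1

    def query(self, key, now):
        """Return rows for `key` whose sort key is not older than now - ttl."""
        bucket = self._buckets.get(key, [])
        cutoff = now - self.ttl
        # (cutoff,) sorts before any (cutoff, seq, row), so rows exactly
        # at the cutoff are still included.
        start = bisect.bisect_left(bucket, (cutoff,))
        return [row for _, _, row in bucket[start:]]


table = HotTable(hash_by="event", sort_by="time", ttl=3600)
table.append({"event": "table1.update", "time": 1586500157, "column1": "hello"})
table.append({"event": "table1.update", "time": 1586496000, "column1": "stale"})
table.append({"event": "table2.insert", "time": 1586500100, "column1": "other"})

# Only rows of 'table1.update' from the last hour are returned.
for row in table.query("table1.update", now=1586500200):
    print(row)
```

The point of the model is the access pattern: rows are grouped by the hash key, kept in sort-key order, and anything older than the TTL is no longer served.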

Once you have set up Talaria, you'll need to configure Presto to talk to it using the Thrift Connector. You will need to make sure that:

  1. The properties file is configured to talk to Talaria through a Kubernetes load balancer.
  2. Presto can access the nodes directly, without the load balancer.

Once this is done, you should be able to query your data via Presto.

```sql
select *
from talaria.data.eventlog
where event = 'table1.update'
limit 1000
```

Ingesting Files Into Talaria

To ingest existing ORC, CSV or Parquet files from a storage URL (imagine S3 or Azure Blob Storage), use the Talaria File Ingestion Client:

https://github.com/atris/TalariaFileIngestionClient

Quick Start

The easiest way to get started would be using the provided helm chart.

Contributing

We are open to contributions; feel free to submit a pull request and we'll review it as quickly as we can. TalariaDB is maintained by:

  • Roman Atachiants
  • Yichao Wang
  • Chun Rong Phang
  • Ankit Kumar Sinha
  • Atri Sharma
  • Qiao Wei
  • Oscar Cassetti
  • Manoj Babu Katragadda
  • Jeffrey Lean

License

TalariaDB is licensed under the MIT License.

Owner

  • Name: Talaria
  • Login: talariadb
  • Kind: organization

GitHub Events

Total
  • Watch event: 6
Last Year
  • Watch event: 6

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 247
  • Total Committers: 21
  • Avg Commits per committer: 11.762
  • Development Distribution Score (DDS): 0.591
Past Year
  • Commits: 2
  • Committers: 2
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Roman Atachiants r****s@g****m 101
Roman Atachiants r****s@g****m 50
Jeffrey lean 5****n 11
Phang Chun Rong c****g@g****m 10
atlas-booker c****y@g****m 10
WangBeyond w****d@g****m 9
Yichao Wang y****g@g****m 9
Atri Sharma a****t@g****m 7
Ankit Sinha a****a@g****m 7
Chunrong Phang c****g@g****m 7
Manoj Babu m****t@g****m 6
TiewKH t****5@h****m 5
Ankit Kumar Sinha 4****a 5
Oscar Cassetti o****g@g****m 2
Wei 4****g 2
Ankit kumar sinha a****n@m****t 1
Ankit kumar sinha a****a@g****m 1
dependabot[bot] 4****] 1
Steve M g****g 1
Eng Zer Jun e****n@g****m 1
stack_underFlow v****2@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 25
  • Total pull requests: 80
  • Average time to close issues: 4 months
  • Average time to close pull requests: 28 days
  • Total issue authors: 7
  • Total pull request authors: 13
  • Average comments per issue: 2.28
  • Average comments per pull request: 0.88
  • Merged pull requests: 57
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • atlas-comstock (17)
  • tardunge (3)
  • crphang (1)
  • panamafrancis (1)
  • VicLin66 (1)
  • gedw99 (1)
  • kumarankit1234 (1)
Pull Request Authors
  • atlas-comstock (12)
  • jeffreylean (11)
  • atris (10)
  • kelindar (10)
  • tardunge (8)
  • ocassetti (6)
  • TiewKH (6)
  • dependabot[bot] (4)
  • kumarankit1234 (4)
  • a9kitkumarsinha (3)
  • qiaowei-g (3)
  • crphang (2)
  • Juneezee (1)
  • gearcog (1)
Top Labels
Issue Labels
question (1)
Pull Request Labels
dependencies (3) go (3) enhancement (3) documentation (1)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 37
proxy.golang.org: github.com/talariadb/talaria
  • Versions: 37
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 7.0%
Average: 8.2%
Dependent repos count: 9.3%
Last synced: 6 months ago

Dependencies

go.mod go
  • cloud.google.com/go v0.108.0
  • cloud.google.com/go/bigquery v1.45.0
  • cloud.google.com/go/compute v1.15.1
  • cloud.google.com/go/compute/metadata v0.2.3
  • cloud.google.com/go/iam v0.10.0
  • cloud.google.com/go/pubsub v1.27.1
  • cloud.google.com/go/storage v1.28.1
  • github.com/Azure/azure-pipeline-go v0.2.3
  • github.com/Azure/azure-sdk-for-go v42.1.0+incompatible
  • github.com/Azure/azure-storage-blob-go v0.13.0
  • github.com/Azure/go-autorest v14.2.0+incompatible
  • github.com/Azure/go-autorest/autorest v0.11.17
  • github.com/Azure/go-autorest/autorest/adal v0.9.11
  • github.com/Azure/go-autorest/autorest/azure/auth v0.5.7
  • github.com/Azure/go-autorest/autorest/azure/cli v0.4.2
  • github.com/Azure/go-autorest/autorest/date v0.3.0
  • github.com/Azure/go-autorest/autorest/to v0.3.0
  • github.com/Azure/go-autorest/logger v0.2.0
  • github.com/Azure/go-autorest/tracing v0.6.0
  • github.com/DataDog/datadog-go v3.7.1+incompatible
  • github.com/Knetic/govaluate v3.0.0+incompatible
  • github.com/apache/thrift v0.13.0
  • github.com/armon/go-metrics v0.3.3
  • github.com/aws/aws-sdk-go v1.33.0
  • github.com/beorn7/perks v1.0.1
  • github.com/bool64/shared v0.1.4
  • github.com/cespare/xxhash v1.1.0
  • github.com/cespare/xxhash/v2 v2.1.1
  • github.com/crphang/orc v0.0.7
  • github.com/davecgh/go-spew v1.1.1
  • github.com/dgraph-io/badger/v3 v3.2103.1
  • github.com/dgraph-io/ristretto v0.1.0
  • github.com/dgryski/go-farm v0.0.0-20200201041132-a6ae2369ad13
  • github.com/dimchansky/utfbom v1.1.1
  • github.com/dnaeon/go-vcr v1.0.1
  • github.com/dustin/go-humanize v1.0.0
  • github.com/emitter-io/address v1.0.0
  • github.com/form3tech-oss/jwt-go v3.2.2+incompatible
  • github.com/fraugster/parquet-go v0.3.0
  • github.com/gogo/protobuf v1.3.2
  • github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
  • github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da
  • github.com/golang/protobuf v1.5.2
  • github.com/golang/snappy v0.0.3
  • github.com/google/btree v1.0.0
  • github.com/google/flatbuffers v1.12.0
  • github.com/google/go-cmp v0.5.9
  • github.com/google/uuid v1.3.0
  • github.com/googleapis/enterprise-certificate-proxy v0.2.1
  • github.com/googleapis/gax-go/v2 v2.7.0
  • github.com/gopherjs/gopherjs v0.0.0-20200209183636-89e6cbcd0b6d
  • github.com/gorilla/mux v1.8.0
  • github.com/grab/async v0.0.5
  • github.com/grpc-ecosystem/go-grpc-middleware v1.3.0
  • github.com/hako/durafmt v0.0.0-20191009132224-3f39dc1ed9f4
  • github.com/hashicorp/errwrap v1.0.0
  • github.com/hashicorp/go-immutable-radix v1.2.0
  • github.com/hashicorp/go-msgpack v0.5.5
  • github.com/hashicorp/go-multierror v1.1.0
  • github.com/hashicorp/go-sockaddr v1.0.2
  • github.com/hashicorp/golang-lru v0.5.4
  • github.com/hashicorp/memberlist v0.2.2
  • github.com/iancoleman/orderedmap v0.2.0
  • github.com/imroc/req v0.3.0
  • github.com/jmespath/go-jmespath v0.3.0
  • github.com/kelindar/binary v1.0.9
  • github.com/kelindar/loader v0.0.11
  • github.com/kelindar/lua v0.0.7
  • github.com/klauspost/compress v1.15.10
  • github.com/mattn/go-ieproxy v0.0.1
  • github.com/matttproud/golang_protobuf_extensions v1.0.1
  • github.com/miekg/dns v1.1.29
  • github.com/minio/highwayhash v1.0.2
  • github.com/mitchellh/go-homedir v1.1.0
  • github.com/mroth/weightedrand v0.4.1
  • github.com/myteksi/hystrix-go v1.1.3
  • github.com/nats-io/jwt/v2 v2.3.0
  • github.com/nats-io/nats-server/v2 v2.9.1
  • github.com/nats-io/nats.go v1.17.0
  • github.com/nats-io/nkeys v0.3.0
  • github.com/nats-io/nuid v1.0.1
  • github.com/pkg/errors v0.9.1
  • github.com/pmezard/go-difflib v1.0.0
  • github.com/prometheus/client_golang v1.7.1
  • github.com/prometheus/client_model v0.2.0
  • github.com/prometheus/common v0.10.0
  • github.com/prometheus/procfs v0.1.3
  • github.com/samuel/go-thrift v0.0.0-20191111193933-5165175b40af
  • github.com/satori/go.uuid v1.2.0
  • github.com/sean-/seed v0.0.0-20170313163322-e2103e2c3529
  • github.com/sercand/kuberesolver/v3 v3.0.0
  • github.com/sergi/go-diff v1.2.0
  • github.com/smartystreets/goconvey v1.6.4
  • github.com/spf13/afero v1.9.2
  • github.com/stretchr/objx v0.5.0
  • github.com/stretchr/testify v1.8.1
  • github.com/swaggest/assertjson v1.7.0
  • github.com/twmb/murmur3 v1.1.3
  • github.com/yudai/gojsondiff v1.0.0
  • github.com/yudai/golcs v0.0.0-20170316035057-ecda9a501e82
  • github.com/yuin/gopher-lua v0.0.0-20191220021717-ab39c6098bdb
  • go.nhat.io/grpcmock v0.20.0
  • go.nhat.io/matcher/v2 v2.0.0
  • go.opencensus.io v0.24.0
  • go.uber.org/atomic v1.9.0
  • golang.org/x/crypto v0.0.0-20220919173607-35f4265a4bc0
  • golang.org/x/net v0.5.0
  • golang.org/x/oauth2 v0.4.0
  • golang.org/x/sync v0.1.0
  • golang.org/x/sys v0.4.0
  • golang.org/x/text v0.6.0
  • golang.org/x/time v0.1.0
  • golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2
  • google.golang.org/api v0.107.0
  • google.golang.org/appengine v1.6.7
  • google.golang.org/genproto v0.0.0-20230113154510-dbe35b8444a5
  • google.golang.org/grpc v1.52.0
  • google.golang.org/protobuf v1.28.1
  • gopkg.in/yaml.v2 v2.4.0
  • gopkg.in/yaml.v3 v3.0.1
  • layeh.com/gopher-luar v1.0.7
go.sum go
  • 730 dependencies
.github/workflows/edge.yml actions
  • actions/checkout v1 composite
  • actions/setup-go v1 composite
  • docker/login-action v1 composite
.github/workflows/latest.yml actions
  • actions/checkout v1 composite
  • actions/setup-go v1 composite
  • docker/login-action v1 composite
.github/workflows/release.yml actions
  • actions/checkout v1 composite
  • actions/setup-go v1 composite
  • docker/login-action v1 composite
.github/workflows/test.yml actions
  • actions/checkout v1 composite
  • actions/setup-go v1 composite
Dockerfile docker
  • debian latest build
  • golang 1.17 build
client/java-client/build.gradle maven
  • org.apache.tomcat:annotations-api 6.0.53 compileOnly
  • com.google.protobuf:protobuf-java-util ${protobufVersion} implementation
  • io.grpc:grpc-protobuf ${grpcVersion} implementation
  • io.grpc:grpc-stub ${grpcVersion} implementation
  • io.grpc:grpc-netty-shaded ${grpcVersion} runtimeOnly
  • io.grpc:grpc-testing ${grpcVersion} testImplementation
  • io.grpc:grpc-testing * testImplementation
  • org.junit.jupiter:junit-jupiter-api 5.8.2 testImplementation
  • org.mockito:mockito-core 4.6.1 testImplementation
  • org.junit.jupiter:junit-jupiter-engine * testRuntimeOnly
client/python/setup.py pypi
  • grpcio >=1.36.0
  • protobuf *