My take on Time Series Database requirements

October 19, 2017

Some time ago, while doing some research related to my current work, I stumbled upon this excellent article by Baron Schwartz on the topic of Time Series Database requirements. In my opinion this article is the best thing on my reading list for that subject, and in this post I will try to present my own take on the matter, building on top of it.

Definition of Data Type

Baron identifies a series using two elements: a source identifier and a metric identifier. I have seen this used in existing products, but from my perspective it is needed for organization purposes and not for series identification. For me the only hard requirement here is that a series can be uniquely identified. Therefore a single series identifier is sufficient. You can always provide organization using some kind of namespacing over identifiers (consider the identifiers “source1.metric1” and “source1.metric2”) or using an independent component to build the relations between series and other elements (groups, assets, sources, etc.). The latter opens the possibility of custom organization structures.
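To make the two options a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the dotted namespace convention, the `series_under` helper and the `SeriesCatalog` class are names I made up, not part of Baron's article or of any existing product.

```python
from collections import defaultdict


def series_under(prefix, series_ids):
    """Option 1: organization by namespacing over the identifiers themselves.

    Returns the identifiers that equal the prefix or sit below it in a
    dotted namespace (a hypothetical convention).
    """
    return sorted(
        s for s in series_ids if s == prefix or s.startswith(prefix + ".")
    )


class SeriesCatalog:
    """Option 2: an independent component that relates series to groups.

    The database only knows opaque identifiers; the catalog holds the
    custom organization structure.
    """

    def __init__(self):
        self._members = defaultdict(set)

    def assign(self, group, series_id):
        # A series may belong to any number of groups.
        self._members[group].add(series_id)

    def members(self, group):
        # .get avoids creating an empty entry for unknown groups.
        return sorted(self._members.get(group, ()))


ids = ["source1.metric1", "source1.metric2", "source10.metric1"]
under_source1 = series_under("source1", ids)

catalog = SeriesCatalog()
catalog.assign("billing", "source1.metric2")
catalog.assign("billing", "source10.metric1")
billing = catalog.members("billing")
```

Note that in the second option the same series can belong to several groups at once, which a single namespace hierarchy cannot express.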

An element of interest for me related to the data type is temporal resolution. Different applications may have different requirements with respect to time. Therefore the Time Series Database needs to be able to set the temporal resolution per series.

With respect to the type of the measurement values (floating point numbers for Baron), I’m not entirely convinced. For me the type (and format) of the measurement values is application (and series) specific. Take for example a series of notification mails sent to a client. This is a time series too. I know that this generalization may prevent the TSDB implementation from fully applying the principle of “take the query to the data”. But in the end the kind of analysis that can be performed on the data is application specific too, and therefore implementing analysis features at the query level may hurt the generality of the TSDB implementation.

With this in mind, for me a time series can be defined as follows:

A series is identified using a unique identifier. This UID can be human readable or not. The only requirement is that it is unique within the set of series stored in the TSDB.
It can be characterized by its specific temporal resolution and data format. Additional characteristics could be a data retention policy and tags.
A series consists of a sequence of {timestamp, value} measurements ordered by timestamp. The timestamp is a numerical value representing an amount of time expressed in the series’ own temporal resolution, and the value is a BLOB.
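
The definition above could be sketched as a small in-memory structure. This is only an illustration under my own assumptions: the `Series` class, its field names, and the use of plain strings for the resolution and data format are hypothetical, and a real TSDB would of course persist the data rather than keep Python lists.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Series:
    uid: str                 # unique within the database, readable or not
    resolution: str          # per-series temporal resolution, e.g. "s" or "ms"
    data_format: str         # how the application should interpret the BLOBs
    tags: dict = field(default_factory=dict)
    _timestamps: list = field(default_factory=list, init=False, repr=False)
    _values: list = field(default_factory=list, init=False, repr=False)

    def append(self, timestamp: int, value: bytes) -> None:
        """Store a measurement, keeping the sequence ordered by timestamp."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def range(self, start: int, end: int) -> list:
        """Return the measurements with start <= timestamp < end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


# The notification mails example: values are opaque bytes for the database.
mails = Series(uid="client42.notifications", resolution="s", data_format="text/plain")
mails.append(20, b"invoice ready")
mails.append(10, b"welcome")
mails.append(30, b"password changed")
window = mails.range(10, 30)
```

The database never looks inside the value. It only needs the timestamp to keep the order and to answer time range requests.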

Workload Characteristics, Performance, Scaling Characteristics and Operational Requirements

I fully agree with Baron with respect to the different characteristics and optimization objectives. I have nothing different to add here.

Language and/or API Design

Here lies the biggest difference between Baron’s approach and mine: the role of the Time Series Database with respect to analytics. I appreciate the “take the query to the data” approach. But as I said before, data analytics is application specific. Therefore, for me, the Time Series Database is not the place for general analytics, but just for time related functions (aggregation, etc.).
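
One possible reading of this split, sketched in Python: the database groups measurements into time buckets, which is a purely time related operation, while the application supplies the function that makes sense of the values. The `bucket` function and its parameters are hypothetical names of mine, and the bucket width is assumed to be expressed in the same temporal resolution as the series.

```python
def bucket(points, width, reduce):
    """Group (timestamp, value) pairs into fixed-width time buckets.

    points: iterable of (timestamp, value) pairs
    width:  bucket size, in the temporal resolution of the series
    reduce: application-supplied function applied to each bucket's values
    """
    buckets = {}
    for timestamp, value in points:
        start = timestamp - timestamp % width  # start of the enclosing bucket
        buckets.setdefault(start, []).append(value)
    return [(start, reduce(values)) for start, values in sorted(buckets.items())]


points = [(0, b"a"), (5, b"b"), (10, b"c")]

# Counting needs no knowledge of the value format.
counts = bucket(points, 10, len)

# Anything value specific is passed in by the application.
joined = bucket(points, 10, b"".join)
```

The time logic lives in one place and works for any value format, which is the generality I want to preserve.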


yeiniel