My take on Time Series Database requirements
Some time ago, while doing some research related to my current work, I stumbled upon this excellent article by Baron Schwartz on the topic of Time Series Database requirements. In my opinion it is the best thing on my reading list for that subject, and in this post I will try to present my own take on the matter, building on top of it.
Definition of Data Type
Baron identifies series using two elements: a source identifier and a metric identifier. I have seen this used in existing products, but from my perspective it is needed for organization purposes and not for series identification. For me the only hard requirement here is that series can be uniquely identified, so a single series identifier is sufficient. You can always provide organization using some kind of namespacing over identifiers (consider the identifiers “source1.metric1” and “source1.metric2”) or using an independent component to build the relations between series and other elements (groups, assets, sources, etc.). The latter opens the possibility for custom organization structures.
An element of interest for me related to the data type is temporal resolution. Different applications may have different requirements with respect to time. Therefore the Time Series Database needs to be able to set the temporal resolution per series.
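Going back to series identification for a moment, the two organization options can be sketched in a few lines of Python. This is only an illustration of the idea, not a proposal for a real API: `in_namespace`, `SeriesCatalog` and the opaque identifiers such as "a1f3" are names I made up for the example.

```python
from collections import defaultdict

# Option 1: organization encoded in the identifier itself (namespacing).
series_ids = ["source1.metric1", "source1.metric2", "source2.metric1"]


def in_namespace(uid: str, prefix: str) -> bool:
    """Return True if uid is the prefix itself or lives under it."""
    return uid == prefix or uid.startswith(prefix + ".")


# Option 2: opaque identifiers plus an independent component that
# keeps the relations between series and other elements.
class SeriesCatalog:
    """Hypothetical relation store: group name -> set of series UIDs."""

    def __init__(self) -> None:
        self._members = defaultdict(set)

    def relate(self, group: str, uid: str) -> None:
        self._members[group].add(uid)

    def series_in(self, group: str) -> set:
        # .get() avoids creating an empty entry for unknown groups.
        return set(self._members.get(group, set()))


catalog = SeriesCatalog()
catalog.relate("source:source1", "a1f3")
catalog.relate("asset:pump-7", "a1f3")   # same series, second structure
catalog.relate("source:source1", "9c2e")
```

With the second option the same series can belong to several structures at once (a source and an asset here), which is what I mean by custom organization structures.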
With respect to the type of the measurement values (floating point numbers for Baron), I’m not entirely convinced. For me the type (and format) of the measurement values is application (and series) specific. Take for example a series of notification mails sent to a client: this is a time series too. I know that this generalization may prevent the TSDB implementation from fully applying the principle of “take the query to the data”. But in the end the kind of analysis that can be performed on the data is application specific too, and therefore implementing analysis features at the query level may hurt the generality of the TSDB implementation.
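To make the point about application-specific values more concrete, here is a minimal sketch in which the store would only ever see bytes and each application brings its own encoding. The format names ("float64", "notification-json") and the `CODECS` table are assumptions of mine for the example, not something taken from Baron's article or from an existing product.

```python
import json
import struct


def encode_float(value: float) -> bytes:
    """Classic metric value: one little-endian IEEE 754 double."""
    return struct.pack("<d", value)


def decode_float(blob: bytes) -> float:
    return struct.unpack("<d", blob)[0]


def encode_notification(mail: dict) -> bytes:
    """Non-numeric value: a notification mail serialized as JSON."""
    return json.dumps(mail, sort_keys=True).encode("utf-8")


def decode_notification(blob: bytes) -> dict:
    return json.loads(blob.decode("utf-8"))


# Hypothetical per-series format registry: the database stores the
# format name with the series, the application owns the codec.
CODECS = {
    "float64": (encode_float, decode_float),
    "notification-json": (encode_notification, decode_notification),
}

encode, decode = CODECS["notification-json"]
blob = encode({"to": "client@example.com", "subject": "Invoice ready"})
```

The price is the one mentioned above: the database cannot compute an average over values it cannot interpret, so that kind of analysis moves to the application side.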
With this in mind, for me a time series can be defined as follows:
- A series is identified using a unique identifier. This UID can be human readable or not. The only requirement is that it is unique within the set of series stored in the TSDB.
- It is characterized by its specific temporal resolution and data format. Additional characteristics could be a data retention policy and tags.
- A series consists of a sequence of {timestamp, value} measurements ordered by timestamp. The timestamp is a numerical value representing an amount of time expressed in the series' own temporal resolution, and the value is a BLOB.
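The definition above maps quite directly to a small data structure. What follows is an in-memory sketch under my own assumptions, not a storage design: the field names (`resolution_ns`, `retention_ticks`, etc.) are invented, the resolution is expressed as the length of one tick in nanoseconds, and measurements with the same timestamp are simply kept in insertion order.

```python
import bisect
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Series:
    uid: str                       # unique within the TSDB
    resolution_ns: int             # length of one timestamp tick, in ns
    data_format: str               # opaque label understood by the app
    retention_ticks: Optional[int] = None   # optional retention policy
    tags: dict = field(default_factory=dict)
    _timestamps: list = field(default_factory=list)
    _values: list = field(default_factory=list)

    def append(self, timestamp: int, value: bytes) -> None:
        """Insert a measurement, keeping the sequence ordered by timestamp."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def range(self, start: int, end: int) -> list:
        """Return measurements with start <= timestamp < end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


# A series with one-second resolution whose values are opaque blobs.
mails = Series(uid="client42.notifications",
               resolution_ns=1_000_000_000,
               data_format="notification-json")
mails.append(1700000060, b'{"subject": "second"}')
mails.append(1700000000, b'{"subject": "first"}')   # arrives out of order
```

Note that the structure never looks inside the value: everything it does (ordering, range selection) depends only on the timestamp.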
Workload Characteristics, Performance, Scaling Characteristics and Operational Requirements
I fully agree with Baron regarding the different characteristics and optimization objectives. Nothing to add here.