I recently had to design a solution for storing information in MongoDB with time as the main axis of analysis. This information should be stored in a way that is easy to query and aggregate at many different time granularities (monthly, weekly, daily, …). It should also be stored in a way that does not consume too much disk space and that MongoDB can maintain with good performance. In short, I needed to turn MongoDB into a time series database.

## Time series

Let’s start from the beginning: what is a time series? Citing Wikipedia:

> A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

The InfluxDB Key Concepts page gives an extremely easy-to-understand example of what a time series is. Imagine two scientists who have to record the number of two types of insects in two different locations. Using a tabular view, we come across something like the following (sample data from the InfluxDB documentation):

| time                 | butterflies | honeybees | location | scientist  |
|----------------------|-------------|-----------|----------|------------|
| 2015-08-18T00:00:00Z | 12          | 23        | 1        | langstroth |
| 2015-08-18T00:00:00Z | 1           | 30        | 1        | perpetua   |
| 2015-08-18T00:06:00Z | 11          | 28        | 1        | langstroth |
| 2015-08-18T00:06:00Z | 3           | 28        | 1        | perpetua   |
| 2015-08-18T05:54:00Z | 2           | 11        | 2        | langstroth |
| 2015-08-18T06:00:00Z | 1           | 10        | 2        | langstroth |
| 2015-08-18T06:06:00Z | 8           | 23        | 2        | perpetua   |
| 2015-08-18T06:12:00Z | 7           | 22        | 2        | perpetua   |

As you can see, the main dimension of this data is time: the events stored in it span over time. In this example, what we can refer to as the basic time granularity is minutes. This means that we cannot produce an analysis of these data at a granularity finer than the basic one (minutes).

Columns like butterflies and honeybees are called fields. Fields are made up of field keys and field values. Field keys (i.e. butterflies and honeybees) are strings and they store metadata; the field key butterflies tells us that the field values 12-7 refer to butterflies and the field key honeybees tells us that the field values 23-22 refer to, well, honeybees.

Field values are the data; a field value is always associated with a timestamp in a time series database. The collection of field-key and field-value pairs makes up a field set.

Last, but not least, columns like location and scientist are called tags. Also in this case, a tag is made up of a tag key and a tag value. We can look at tags as indexes that speed up access to the time series. They are not mandatory, but they help a lot.

We have time series now. But are these databases just a bunch of tables that force the presence of a timestamp? No: the main characteristic of a time series database is that it provides powerful tools to aggregate data (fields) over time. For instance, if we need to know how many butterflies were counted by scientist “perpetua” in “location 1” during the last year, it should be easy to retrieve this information.

## Time series in MongoDB

MongoDB is a general purpose, document oriented database. This means that information is stored in the database as documents. MongoDB uses BSON, a binary variant of JSON. A document in MongoDB looks like the following.
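For example, a document describing one of the scientists from the previous section might look like this (a hypothetical sketch; field names and values are made up for illustration, shown as the equivalent JavaScript object):

```javascript
// A hypothetical MongoDB document, written as the equivalent JavaScript
// object (MongoDB stores it in BSON). Field names are illustrative.
const doc = {
  _id: "55bb19fe0a2d2c9b1e34d9f0",          // normally an ObjectId; a string here
  scientist: "langstroth",                   // string
  location: 1,                               // number
  hiredOn: new Date("2015-01-01T00:00:00Z"), // date/timestamp
  skills: ["entomology", "beekeeping"],      // array
  address: { city: "Oxford", country: "UK" } // nested subdocument
};
```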

Among the types available for field values we find strings, numbers, timestamps, arrays, subdocuments, and so on. How can we design a document schema so that MongoDB can manage time series?

For a number of reasons that will be analyzed later in this document, the best way to treat time series in MongoDB is using a subdocument for each level of aggregation we want to manage. Let’s convert the above example about scientists and insects to MongoDB.

Suppose that you want to keep minutes as the finest granularity over time (the basic granularity). Then imagine that you also want to give access to partially aggregated information by hour, day, and month. The final optimal document schema you should use is the following.
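Here is a sketch of such a per-month document, heavily abridged: a real document carries one entry per day of the month, per hour of the day, and per minute of the hour. The field names, the `_id` layout, the per-month bucketing, and the counter values are my assumptions for illustration:

```javascript
// One document per (tags, month). Counters exist at every level; the
// nested arrays are truncated here for readability.
const monthDoc = {
  _id: "1|langstroth|2015-08",
  tags: { location: 1, scientist: "langstroth" },   // level 0: tags
  date: new Date("2015-08-01T00:00:00Z"),           // level 1: the month
  butterflies: 1075,                                // totals for the whole month
  honeybees: 2301,
  days: [
    {
      date: new Date("2015-08-01T00:00:00Z"),       // level 2: one day
      butterflies: 41,
      honeybees: 76,
      hours: [
        {
          date: new Date("2015-08-01T00:00:00Z"),   // level 3: one hour
          butterflies: 5,
          honeybees: 11,
          minutes: [
            {
              date: new Date("2015-08-01T00:00:00Z"), // level 4: one minute
              butterflies: 0,
              honeybees: 1
            }
            // … 59 more minute entries
          ]
        }
        // … 23 more hour entries
      ]
    }
    // … 30 more day entries
  ]
};
```

Reading the butterfly count for a whole day is then a single subdocument access (`monthDoc.days[0].butterflies`) instead of an aggregation over thousands of per-event documents.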

Quite a big document, isn’t it? As you can see, the trick here is to have a subdocument level for each granularity we need in our analysis. Tags live in the main document, let’s say at level 0. Fields are partially aggregated at each level (1, 2, …). The aggregation over time is determined by the value of the date property at each level. Documents are always complete: we will find a subdocument for each minute, hour, and day, whether the field values are 0 or not.

### Why this? Why that?

So far so good. Now the question is: why do we use such a complex document schema? What are the pros and cons?

First of all, if we modeled our events using a 1:1 approach with respect to documents, we would end up with one document per event.
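With that naive mapping, every single observation becomes its own document, along these lines (field names are illustrative):

```javascript
// The naive 1:1 mapping: one document per observed event.
const event = {
  timestamp: new Date("2015-08-18T00:06:00Z"),
  location: 1,
  scientist: "langstroth",
  butterflies: 11,
  honeybees: 28
};
```

Note that a year of per-minute observations for a single tag set already means more than 500,000 such documents.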

While this approach is valid in MongoDB, it doesn’t take advantage of the expressive nature of the document model. Moreover, to aggregate results that span an interval, MongoDB needs to access a possibly large number of documents.

## Three step initialization

So, have we reached a dead end, a cul-de-sac? At first sight it may seem so: a plain upsert cannot both create the fully pre-allocated document and increment its counters in one shot. Fortunately, my colleagues and I found a workaround. We can call it the three step initialization.

Remember that MongoDB guarantees the atomicity of operations on a single document. With this fact in mind we can operate in the following way:

1. Try to update the document, properly incrementing the counters for the specified time chunk. Do not make it an upsert, just an old-fashioned update operation. Remember that the execution of an update statement returns the number of documents modified. If that number is greater than zero, you’re done.
2. If the update modified zero documents, the document to update is not yet present in the collection. Try to insert the whole document for the specified tags, with all the counters (field values) set to zero. The execution of an insert statement also returns the number of documents written. If it returns zero or throws an exception, never mind: it means that some other process has already inserted the document for the same tags.
3. Execute the same update from step 1 again.

The code should look something like the following snippet. Here we want to add 1 to the butterflies and honeybees field values for the date 2015-08-01T00:06:00.000Z, with tags location 1 and scientist “langstroth”.
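A runnable sketch of the three steps follows. A real implementation would call `updateOne`/`insertOne` on a MongoDB collection; here an in-memory `Map` stands in for the collection, and `update`/`insert` mimic the driver’s return values and its duplicate-key behaviour. The per-month bucketing, the `_id` layout, and `emptyMonthDoc()` are my assumptions, not a definitive schema:

```javascript
const store = new Map(); // _id -> document; stands in for the collection

function monthId(tags, date) {
  // e.g. "1|langstroth|2015-08"
  return `${tags.location}|${tags.scientist}|${date.toISOString().slice(0, 7)}`;
}

// Build the fully pre-allocated month document, every counter set to zero.
function emptyMonthDoc(tags, date) {
  const y = date.getUTCFullYear(), m = date.getUTCMonth();
  const daysInMonth = new Date(Date.UTC(y, m + 1, 0)).getUTCDate();
  const zero = (d) => ({ date: d, butterflies: 0, honeybees: 0 });
  const doc = { _id: monthId(tags, date), tags,
                ...zero(new Date(Date.UTC(y, m, 1))), days: [] };
  for (let d = 0; d < daysInMonth; d++) {
    const day = { ...zero(new Date(Date.UTC(y, m, d + 1))), hours: [] };
    for (let h = 0; h < 24; h++) {
      const hour = { ...zero(new Date(Date.UTC(y, m, d + 1, h))), minutes: [] };
      for (let min = 0; min < 60; min++) {
        hour.minutes.push(zero(new Date(Date.UTC(y, m, d + 1, h, min))));
      }
      day.hours.push(hour);
    }
    doc.days.push(day);
  }
  return doc;
}

// Mimics a plain (non-upsert) update with $inc: returns { modifiedCount }.
// In MongoDB, all four $inc's below run atomically on the one document.
function update(tags, date, incs) {
  const doc = store.get(monthId(tags, date));
  if (!doc) return { modifiedCount: 0 };
  const d = date.getUTCDate() - 1, h = date.getUTCHours(), min = date.getUTCMinutes();
  for (const [field, n] of Object.entries(incs)) {
    doc[field] += n;                               // month level
    doc.days[d][field] += n;                       // day level
    doc.days[d].hours[h][field] += n;              // hour level
    doc.days[d].hours[h].minutes[min][field] += n; // minute level
  }
  return { modifiedCount: 1 };
}

// Mimics an insert with a unique index: throws on a duplicate key.
function insert(doc) {
  if (store.has(doc._id)) throw new Error("E11000 duplicate key");
  store.set(doc._id, doc);
}

function record(tags, date, incs) {
  if (update(tags, date, incs).modifiedCount > 0) return;  // step 1
  try { insert(emptyMonthDoc(tags, date)); } catch (e) {}  // step 2: lost the race? never mind
  update(tags, date, incs);                                // step 3
}

record({ location: 1, scientist: "langstroth" },
       new Date("2015-08-01T00:06:00.000Z"),
       { butterflies: 1, honeybees: 1 });
```

The first `record` call for a month pays the cost of inserting the pre-allocated document; every later call for the same tags and month succeeds at step 1 with a single update.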

Clearly, what makes the above procedure work is the guarantee of atomicity on single-document modifications. I know, the procedure is a little bit hacky, but we have not found anything better at the moment. Do you have a better idea? If so, try to explain it in the comments section. Thanks!