Using Bits for Efficient Storage in Postgres

An example of how you can efficiently store millions of data points in Postgres using binary

Chris Vaughn
Open Digerati

--

The Goal

The goal of this post is to outline an approach for storing daily user check-ins in Postgres in a way that is storage efficient. We used this approach to implement our Streaks feature in the YouVersion Bible App. We believe streaks are a motivator to help people build the habit of reading the Bible every day.

The Problem

We have an API endpoint that will get called when the app launches for the first time in a day. At YouVersion we use Postgres as our primary data store, so it was the natural choice for the data storage backend for this new feature.

In its most basic form, a streaks type of feature is a count of the number of days in a row a user has used the app. We’ll first break down the minimum viable approach and build to our final solution. Our first approach was to use the following Postgres schema and update query.

Let’s break this down.

  1. The WHERE clause uses the date (from the user’s device) and the stored last_checkin_dt to prevent multiple check-ins per day.
  2. If no more than 1 day has passed between this check-in and the last check-in we increment the count by 1 otherwise we reset the count to 1.
  3. We use GREATEST to update the longest_streak.
  4. We are optimistic that UPDATE is going to return data. If it doesn’t, we have additional logic to INSERT a row if necessary. We are still using Postgres 9.4 so we are aren’t able to use the ON CONFLICT syntax for upserts.

This approach works great, it’s simple, and it’s what we launched with for v1 of the feature. Unfortunately, this didn’t completely solve the problem we needed to address. Many of our users use our app in times where they don’t have excellent internet connections. We’ve made significant progress in having our app function without connectivity. The Streaks feature is meant to motivate people to a daily habit. The problem is, by allowing a user to lose their current streak because they used the app offline, we are devaluing the feature. We needed Streaks to support offline use. To support offline use and users with multiple devices, we’d need to store a server-side record of each day each user used the app. This way, as we received a complete picture from previously offline devices, we can recalculate a user’s streak.

A starting point for approaching this type of problem would be to store something like user_id, checkin_dt for each user and each day. Some back-of-the-envelope calculations showed that storing the data this way would give us a table that would grow by 4GB a year per million users. With multiple millions of users, this would be one of our fastest growing tables and would guarantee to grow by more each year as we grow the size of our user base. Another strike against this approach is that it would also be awkward to query. We would need to use either windowing or joins or both, to look for contiguous days to calculate current streak & longest streak.

Our Solution

Since check-ins are binary, we knew we could use bits to track each day. A set bit is a day the user checked-in and an unset bit a day the user didn’t check-in. There are several options for storing bits in Postgres. After considering the options, we landed on saving check-ins by month. Saving by month allowed us to use a 32-bit integer for each month. In Postgres we stored this as an array of 12 int4s. Using the same approach as before for estimating table size showed that this approach would only grow by 78MB a year per million users. Much better! Working with binary data isn’t something we typically do, especially in Postgres, but with a little experimentation, I was able to see that this was going to be manageable using a few Postgres functions to help us along.

We store 1 row per user per year, and new table schema looks like

To check-in for a user, we use the following helper functions and update query.

To break this down a bit, we first find the month index in the array of ints storing check-ins. We then pass that int and the date of the check-in to the set_checkin function. Using bitwise operators we shift a 1 over to the day we want to set, and we use OR to set the bit representing that day. We store the days so that the last day of the month is the least significant bit to make the calculation of the current streak easier.

Next, we need to be able to re-calculate the streak. We’ll first look at calculating that streak for one month.

The current_streak_for_month function takes a check-ins int and the date to calculate the streak on.

  1. We first shift the bits over so the day we are interested in is the least significant bit.
  2. We use bitwise operations to find the first 0 (unchecked day) from the least significant bit.
  3. The resulting bits have a 1 on the day where a streak ended, and the rest of the bits are 0.
  4. We use log base 2 to get the number of days of the current streak.

An Example:

Now that we have the basis for computing a single month’s streak we can use a couple more functions to compute a streak for a year and all years. The approach here is to start with the current month and keep going back until we reach a broken streak.

We’ve been pleased with the efficiency of this solution. Now that we are storing the data in this granular way we’ve also found that adding functionality to the feature is possible just by running different calculations on the binary data. We recently added Perfect Weeks where we can tell the user how many weeks (Sunday — Saturday) they’ve spent in the app this year.

Open Digerati is an open source initiative from Life.Church. We’re passionate about leveraging technology to reach the world for Christ, and connecting with others who share that passion. Join us on Slack and Facebook.

--

--