Reading Plan Recommendations using Python and Apache Spark

Justin Richert · Published in Open Digerati · 14 min read · May 3, 2018


Learn how YouVersion is using machine learning to make intelligent Bible plan recommendations.

The Goal

My goal in this post is simply to share how we at YouVersion are leveraging machine learning tools to generate product recommendations. This article does not aim to teach the fundamentals of machine learning or data science (as if one post even could do that), but it does aim to help others who are considering building a recommendation engine by showing how we approached the problem. The tools and libraries mentioned here are worth considering, but they should not be seen as endorsed or as the only way to build this kind of system. They are simply what we used to accomplish our goal.

The Problem

At YouVersion, our goal is to lead people to engage with Bible scripture. A major feature in that effort is something we call plans: organized collections of Bible verse references and optional devotional content intended to help guide readers through part or all of the Bible, or to understand specific topics from the perspective of scripture. Although this is not the only feature that drives engagement, because we consider it a core “product” in the app, we are always looking for new ways to increase users’ engagement with it. Something that others have been doing for several years and that we wanted to tackle was intelligent, automated recommendations. Essentially, we wanted a group of plans to recommend to a user based on a plan they just finished reading.

The fundamental problems that we needed to solve in order to create this recommendation system were what data to use, how to handle that data, and how to generate our model from that data. I’ll talk about each of these in the following sections.

The Data

Because we knew we were comparing plans to other plans, one approach that came to mind was to use the attributes of each Plan to see how similar it was to others.

Here are some key attributes that our data have:

  • Title
  • Length (in days)
  • Name of publisher/provider of the plan
  • Description
  • Verse references included in the plan
  • All the devotional text

While these are great attributes to work with in some ways, using them actually poses a bit of a mountainous task. Although length is perhaps an easy field to compare, since it is just an integer, it is not terribly meaningful. It may be pretty safe to say that year-long plans are likely very comparable in appeal to other year-long plans, but shorter plans (e.g., 7 days) are much more diverse in content and therefore not nearly as comparable to other shorter plans. Of course, building a model depends on many factors, not just one, so we would include things such as title, publisher, description, and all devotional content included in the plan. This is where it would get a bit tedious, though. We would need to build a meaningful index out of all of the text and then use that index to determine word importance, and so on. The model would likely be pretty complex, perhaps too much so for our purposes.

Because of all the data processing necessary to draw similarities based on attributes, we instead looked to user data to better understand the relationships between the products. Given the data available to us, a much simpler solution was to compare plan completions (very similar to product purchases), specifically which plans are completed alongside other plans by the same users. That is a bit abstract to explain with words alone, so here is a basic diagram mocked up to display how our data is arranged for the process (Figure 1.a) and how it would be aggregated in a matrix (Figure 1.b):

Figure 1.a — An example dataset. (Spacing added for clarity)
Figure 1.b — Sparse matrix formed by data from example_data.txt in Figure 1.a. Data represents completion count at coordinates.
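To make the idea concrete at a small scale, here is a sketch that uses scipy rather than Spark, with made-up user and plan ids (consistent with the example discussed below), showing how rows of (user id, plan id, completion count) become the sparse users-by-plans matrix that Figure 1.b describes:

from scipy.sparse import coo_matrix

# Hypothetical completion records in the spirit of Figure 1.a:
# (user_id, plan_id, completion_count). The values are made up.
records = [
    (1234, 1, 1),
    (1234, 5, 1),
    (1235, 1, 1),
    (1235, 5, 2),
    (1236, 3, 1),
]

# Map the raw ids to contiguous row/column indices.
user_index = {u: i for i, u in enumerate(sorted({r[0] for r in records}))}
plan_index = {p: j for j, p in enumerate(sorted({r[1] for r in records}))}

rows = [user_index[u] for u, _, _ in records]
cols = [plan_index[p] for _, p, _ in records]
counts = [c for _, _, c in records]

# Sparse users-by-plans matrix: coordinates with no record are implicitly
# zero, which is the same idea Spark's CoordinateMatrix relies on at scale.
matrix = coo_matrix((counts, (rows, cols)),
                    shape=(len(user_index), len(plan_index)))
print(matrix.toarray())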

This data tells us which plans are commonly clustered together. For instance, if a user searches for plans about the topic of “joy”, finds a couple of different plans that address that subject, and completes them, those plans become somewhat related. That is a bit of an over-simplification of what is happening here, but it is the core idea. To provide a glimpse into the mathematics behind this approach, look at plans 1 and 5. Both user 1234 and user 1235 have completed both plans, which implies to our system that the plans are related. That could just be coincidence, but with millions of records, coincidences wash out. The translation from Figure 1.a to Figure 1.b is probably pretty obvious, but I wanted to take the time to be sure it was clear how we arrange our data. Further into the post, I'll explain how I went about loading the real data in and how we translated it to a matrix in Python.

We actually use data in Google BigQuery as opposed to a flat file. Loading the data presents a large problem: we have over 10 million unique users who have completed at least one reading plan, and close to 4,000 plans, which makes for a large matrix when working with the real data. Also, as mentioned in Figure 1.b's description, the data is relatively sparse, meaning that many of the coordinate points will be zero, because most of those 10 million users will not have completed even 100 plans, let alone 4,000. I'll return to why this is important in a bit.

The Process

Figure 2: High level data flow chart

I will talk mostly about the third piece of the data flow diagrammed in Figure 2 above, but this provides some insight into exactly how the data is coming in and going out of the Apache Spark cluster, the workhorse of this implementation. Later, I will briefly mention the scope outside of the flow given above, but that will be more specific to our processes and less helpful to you.

Apache Spark

Image taken from https://spark.apache.org/docs/latest/cluster-overview.html

I won't cover everything there is to know about Apache Spark, but I want to take a moment to explain what Spark is about. Here is the overview from their documentation:

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

As stated above, Spark is a cluster computing system, which allows data to be processed in parallel and greatly speeds things up. Cluster computing is often the right solution when it comes to processing large amounts of data, like the data we handle for our recommendation engine. To familiarize yourself with Spark in whatever language you choose to use it in, look through the Quick Start tutorial as well as these examples.

A fundamental component of a cluster computing system like Apache Spark is the data it is processing. The data has to be something you can operate on in parallel to take advantage of the resources you have. Spark uses RDDs, or Resilient Distributed Datasets. These RDDs are just a group of elements that can be split up across the nodes of the system, and operations can be efficiently applied to all of the individual elements of the dataset across all of your nodes. They are called resilient because they can also recover from node failure: each chunk of data is generated systematically, so if a node fails, the identical chunk can be resubmitted to a functioning node. In the first step of our Spark script, seen below in Figure 4.a, a single element is a single record from BigQuery, like the lines in the text file above: a user id, a plan id, and a completion count.
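To make that concrete, here is a tiny, self-contained sketch (not our production script) that builds a toy RDD with the same (user id, plan id, completion count) shape and applies map to it. The values and the app name are made up:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")  # hypothetical app name

# A toy RDD whose elements have the same shape as our BigQuery records:
# (user_id, plan_id, completion_count). The values are made up.
records = sc.parallelize([
    (1234, 1, 1),
    (1234, 5, 1),
    (1235, 5, 2),
])

# map() applies a function to every element, and Spark distributes that
# work across the cluster's worker nodes.
counts = records.map(lambda record: record[2])
print(counts.sum())  # total completions in the toy dataset -> 4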

So now that we’ve walked through what Apache Spark is and why we’re using it, the next question is how we get ahold of a cluster computing system for ourselves.

Using Google Dataproc

We used Google Dataproc to create a cluster with Apache Spark already installed on it. Using Dataproc is extremely simple, though you will need a Google Cloud project. We use command line commands to spin up our Dataproc cluster each time we run the recommendation engine, but the console interface gives you a look at everything that is available to get you started. You can look through all the options and configure your first cluster. Then, once you have configured everything in the console, you can click “Create” to launch your cluster. For the command line interface, you will find a hyperlink at the bottom of the page for the REST or command line equivalent. If you click command line, it should present a modal containing something like the following:

gcloud dataproc clusters create your_cluster_name --zone southamerica-east1-b --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --project your_project_name

This is not our configuration — it is just an example of what you should see when getting the command line cluster create command from the console. At the time of this post, we use three nodes total — one master and two slaves — which meets our needs. We are able to process all of our data in under an hour with this configuration.

This utilizes the gcloud tool, so if this is the method you use, you'll need to install and configure gcloud on whatever machine will be initiating the process. For instance, we have a cron job on one of our servers that executes this command and several others, including job submission, to orchestrate the recommendation engine.

It is completely fine to use Amazon for this as well — they have some great resources for similar implementations. Because our data was coming from Google BigQuery, and because Dataproc allowed us to spin up a cluster with Spark already installed (Amazon did not at the time we implemented this — they may now), this was our best option, but it is by no means the only option.

The BigQuery Connector

Apache Spark has many “connectors” that serve as a way to specify a source location for your data and load it directly in as an RDD. The advantage of this is largely ease of use, but it also makes the process more efficient by cutting out any middle step to load, process, and format the data into an RDD. It also avoids the need to download large files to whatever machine is the master node of the Spark cluster. The setup for using the BigQuery connector is relatively simple:

Figure 3: Setting up the BigQuery connector and using it to retrieve our data

In the gist above, sc is our SparkContext object. For another example from Google themselves, check out this tutorial: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example.

I will note that using this connector requires having your data in a table that already has the appropriate schema. By this I mean that, at least according to the research I did, you cannot run a query as a middle step between your BigQuery table and the way you'd like the data organized in your Spark dataset. It also requires that the data first be loaded into Google Cloud Storage and then into your script (but this is handled under the hood with the above configuration; you just have to supply a path for the intermediate data).
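For reference, a setup along these lines (adapted from Google's tutorial linked above, with placeholder project, dataset, table, and bucket names) loads the table directly into an RDD; the temporary path is the intermediate Cloud Storage location just mentioned:

from pyspark import SparkContext

sc = SparkContext(appName="plan-recommendations")  # "sc", our SparkContext object

# Placeholder values; substitute your own project, bucket, dataset, and table.
conf = {
    "mapred.bq.project.id": "your_project_name",
    "mapred.bq.gcs.bucket": "your_gcs_bucket",
    "mapred.bq.temp.gcs.path": "gs://your_gcs_bucket/tmp/bigquery_input",
    "mapred.bq.input.project.id": "your_project_name",
    "mapred.bq.input.dataset.id": "your_dataset",
    "mapred.bq.input.table.id": "plan_completions",
}

# Each element of table_data is a (key, json_string) pair representing one
# row of the BigQuery table.
table_data = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)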

Selecting a Data Structure

This was perhaps the biggest challenge I faced while working on this project. I was learning Spark at the same time as I implemented this, so it took some hunting to find something that met the need I had. This is where I will refer back to my mention of the challenge of loading in the data and the sparse matrix. When I pull data from BigQuery, which is represented in a table very similar to the example dataset text file in Figure 1.a, I have a lot of specific coordinates to supply. What I don’t want to manage is an exhaustive list of user ids and plan ids to ensure that every possible coordinate pair is populated with a 0 if there is no completion value.

Enter the CoordinateMatrix data structure from pyspark (see the docs here). This is an excerpt from those docs:

“A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.”

This perfectly fit the need I had. If your data is smaller and more dense, there are other matrix data structures provided in the pyspark library that may meet your specific needs. To create a CoordinateMatrix out of the data coming in from BigQuery, I simply did the following:

Figure 4.a: Loading plan completion data into a CoordinateMatrix using MatrixEntry

Each record is loaded as JSON using the map function (a built-in for RDDs that allows you to apply a custom function to each element of that RDD), and then each record is formed into a MatrixEntry, again using map. The table_data variable in the gist above is the direct result of using the BigQuery connector shown in Figure 3.

Note: CoordinateMatrix and MatrixEntry are imported from pyspark.mllib.linalg.distributed
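Concretely, a loading step along these lines does the job. The JSON field names (user_id, plan_id, completions) are assumptions about the table schema, and table_data is the RDD from the connector sketch above:

import json
from pyspark.mllib.linalg.distributed import MatrixEntry

# table_data comes from the BigQuery connector (Figure 3); each element
# is a (key, json_string) pair.
records = table_data.map(lambda pair: json.loads(pair[1]))

# Form a MatrixEntry(row, column, value) for every record. The field
# names below are assumptions about the BigQuery schema.
entries = records.map(lambda record: MatrixEntry(
    int(record["user_id"]),        # row index: the user
    int(record["plan_id"]),        # column index: the plan
    float(record["completions"]),  # value: completion count
))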

Finally, after loading in all of our data, I convert it to an IndexedRowMatrix and use the columnSimilarities function to calculate the similarity between each plan vector:

Figure 4.b: Calculating cosine similarities using a function of IndexedRowMatrix from the pyspark library

I'll slow down a bit here to step through the above lines of code. The first line compiles all the MatrixEntry elements into a CoordinateMatrix, which is then converted to an IndexedRowMatrix. You might be thinking, “why bother with the CoordinateMatrix in the first place then?” This goes back to what data I have initially. The CoordinateMatrix allowed me to describe my data as specific entries in a sparse matrix rather than requiring that I provide a value for every coordinate pair in the matrix. Unfortunately, the CoordinateMatrix object does not have a columnSimilarities function, a key piece of my code. The columnSimilarities function produces an RDD where each vector is simply a plan id and an array of tuples; the first index of each tuple is another plan id and the second index is a similarity score. I then sort this vector using the RDD's built-in map function, passing in a custom sorting function that sorts the array based on the similarity score. Finally, I save this to a text file and process it in an external script that loads it into redis for use within our API. And that's it! Apache Spark and the pyspark library provide a lot of high-level tools that made this implementation efficient and relatively easy.
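To recap the whole pipeline in one place, here is a minimal sketch of the steps just described, assuming the entries RDD from the previous sketch. The output path is a placeholder, and grouping with groupByKey is just one way to arrange the per-plan lists before sorting; our actual code handles the sorting with a custom function inside map:

from pyspark.mllib.linalg.distributed import CoordinateMatrix

# Compile the MatrixEntry elements into a sparse CoordinateMatrix, then
# convert it, since CoordinateMatrix has no columnSimilarities function.
matrix = CoordinateMatrix(entries)
row_matrix = matrix.toIndexedRowMatrix()

# columnSimilarities() returns a CoordinateMatrix whose entries are
# (plan_i, plan_j, cosine similarity), with each pair listed once.
similarities = row_matrix.columnSimilarities()

# Mirror each pair so every plan gets a full list of (other plan, score)
# tuples, then sort each list from most to least similar.
ranked = (
    similarities.entries
    .flatMap(lambda e: [(e.i, (e.j, e.value)), (e.j, (e.i, e.value))])
    .groupByKey()
    .mapValues(lambda scores: sorted(scores, key=lambda s: s[1], reverse=True))
)

# Placeholder output location; ours is post-processed and loaded into redis.
ranked.saveAsTextFile("gs://your_gcs_bucket/plan_similarities")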

A brief note about the columnSimilarities function

The walkthrough above should equip you with enough info to build a recommendation engine of your own given similar data, but it does skim over a key part in how the similarities were calculated: the columnSimilarities function of the IndexedRowMatrix data structure. In short, this produces the cosine of the angle between two vectors — in our case the two vectors are two plan vectors (or elements of an RDD) with users’ completion counts populating the vectors. Essentially, the smaller the angle, the more alike two vectors are. Check out the “Further Reading” section below for a link to a fantastic explanation of the math behind cosine similarity.
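To make the math a little more concrete, here is a tiny worked example with made-up completion counts (plain numpy, independent of Spark):

import numpy as np

# Two hypothetical plan columns: each position is one user's completion
# count for that plan (numbers are made up).
plan_a = np.array([1, 0, 2, 0, 1])
plan_b = np.array([1, 0, 1, 0, 0])

# Cosine similarity: dot product divided by the product of the magnitudes.
similarity = plan_a.dot(plan_b) / (np.linalg.norm(plan_a) * np.linalg.norm(plan_b))
print(similarity)  # ~0.87, a small angle, so these plans look fairly similar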

Orchestration

Although it is not the focus of this post, I wanted to take a moment to talk about how this is actually implemented in full in our production environment. I won’t provide specific examples or code here, but I’ll note each of the steps:

  • An orchestration script is executed by a cron configured on one of our production servers.
  • The orchestration script uses the gcloud command line interface to spin up a Dataproc cluster.
  • The orchestration script again uses the gcloud command line interface to submit a spark job, pushing our python spark script to the master node of the Dataproc cluster.
  • The spark script actually stores data back to Google Cloud Storage, so after completion of the spark script, the orchestration script downloads the resulting data to local, temporary storage.
  • The orchestration script runs a function that ingests the local data and pushes it onto our redis server which powers recommendations for our production APIs.
  • Finally, the orchestration script cleans up all of the residual temporary data from local storage and Google Storage and tears down the Dataproc cluster.

Results and Closing

To demonstrate the outcome of the system this post has covered and to give you a sense of what is possible with a system like this one, I wanted to share some of the results we've seen. Listed below are a few examples, with the plan title first and a list of related plan titles following.

  • You and Me Forever: Marriage in Light of Eternity — “The First Few Years of Marriage”, “Your Home Matters”, “The Vow”, “Lies That Can Ruin a Marriage”, “Rooting Out Relationship Killers”,….
  • Guardians of Ancora Bible Plan: Ancora Kids Find a Roman — “Guardians of Ancora Bible Plan: Ancora Kids on a Boat”, “Guardians of Ancora Bible Plan: Ancora Kids Learn to Pray”,….
  • Pray Like Jesus by Pastor Mark Driscoll — “Live by the Spirit: Devotions with John Piper”, “Prayer is Our Source of Power”, “Worship Changes Everything”, “Armor of God”, “Love Like Jesus”, “Grow in Prayer!”,…

These are just a few examples of the successes of the system. There are other examples of plan recommendations that are perhaps more loosely related but address similar topics or needs, and yet others that do not appear related at all (though these are the minority); it is entirely based on the collective habits of our users. A small downside to using this system is that there is little surprise in most of the recommendations. Because of the way we utilize the system, that is a good thing. However, it can put users in a bit of a rut if it is the only way we are pushing engagement in the feature. In the future, we hope to implement a second recommendation engine that finds similar users across our system and recommends plans based on that similarity. Those types of recommendations aim to bring a little more wonder at how we found a plan that fit what a user was looking for, or their general tastes, so well.

Hopefully this post is a good starting place for anyone looking to build a recommendation engine. If there is anything I haven’t made clear or if there is something you feel like is missing from this post, feel free to comment below! I’d be happy to answer any questions. Thanks for reading!

Further Reading

Open Digerati is an open source initiative from Life.Church. We’re passionate about leveraging technology to reach the world for Christ, and connecting with others who share that passion. Join us on Slack and Facebook.
