Permutive with BigQuery: a fully managed data warehouse

Permutive powers personalization through queries that execute in real time while users interact with your website or app. To do this effectively, however, we've found it important that our customers can also perform arbitrary analytics over their historical data: the resulting insights can feed directly into Permutive's real-time queries. Until now, Permutive has not provided this functionality out of the box. Our new integration with Google BigQuery provides the missing piece.

What is BigQuery?

Google BigQuery is a powerful cloud-based analytics tool, capable of performing complex queries over terabytes of data in seconds. It is extremely scalable: query execution time tends to remain roughly constant regardless of how much data you're querying, while cost scales linearly with the amount of data scanned. All queries are expressed in SQL, so you don't have to be an engineer to use it. From an engineering perspective, BigQuery is a pleasure to work with. There is no need for us to provision instances or define indices over your data - in fact, the only requirement is that we provide a schema alongside any data we import. Tables in BigQuery are effectively append-only logs - a great fit for the event data collected through Permutive.
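
To give a flavour of how little ceremony is involved, here's a complete query against one of Google's public sample datasets (an illustrative example, separate from the Permutive integration) - counting the revisions in the Wikipedia sample takes a single line of SQL:

-- Count the rows in Google's public Wikipedia revision sample
SELECT COUNT(*) revision_count
FROM [publicdata:samples.wikipedia]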

Our Integration

Permutive's BigQuery integration gives you a powerful data warehousing solution which requires minimal effort to set up and maintain.

Once the integration is enabled in your dashboard, we'll create a BigQuery dataset and share it with you, so you can access it via your Google account. We will then periodically prepare, transform and load your data into BigQuery. Currently this process takes place four times a day, meaning there is at most a six-hour lag in your BigQuery data.

By default, all your event data is exported to your BigQuery dataset, with a new table created per event type per day. Since BigQuery doesn't rely on indexed columns, this table-per-day approach enables efficient querying: you can write queries which only analyze recent data or specific time ranges. We also export your entity collections to BigQuery, so that you can join entities to event data in your queries. Since entity collections are usually relatively small and can change over time, we maintain a single table in BigQuery for each of your collections.
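
For example, a daily table can be queried directly, or combined with others over a date range. The sketch below assumes the prefix-plus-YYYYMMDD table naming that BigQuery's TABLE_DATE_RANGE function expects; the exact names will match the tables you see in your dataset:

-- Count one day's article pageviews by querying that day's table directly
SELECT COUNT(*) pageview_count
FROM [Permutive.articlepageview_events20160315]

-- Or scan a week of daily tables in a single query
SELECT COUNT(*) pageview_count
FROM
  TABLE_DATE_RANGE([Permutive.articlepageview_events], TIMESTAMP('2016-03-08'), TIMESTAMP('2016-03-14'))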

Extracting Insights

There are myriad insights to be gained from having your data in BigQuery, and many of them can be used to drive real-time personalization on your website or app. Let's take a look at a couple of examples.

Say we're a publisher looking to set up an email subscription popup on our site, which shows after a user has read a number of articles. With Permutive, it's easy to set up this behavior: our Publisher integration automatically collects article views and engagement data for you, and Permutive's targeted overlays make it very easy to create and trigger a modal. But after how many articles should we show the popup? Using BigQuery, it's easy to look at the distribution of how many articles users view before leaving the site:

-- For each pageview count, how many users viewed exactly that many
-- articles during March 2016?
SELECT pageviews, COUNT(user_id) user_count
FROM (
  -- Count the article pageviews recorded for each user across the
  -- daily event tables in the date range
  SELECT user_id, COUNT(event_id) pageviews
  FROM
    TABLE_DATE_RANGE([Permutive.articlepageview_events], TIMESTAMP('2016-03-01'), TIMESTAMP('2016-04-01'))
  GROUP BY user_id
  HAVING pageviews > 0
)
GROUP BY pageviews
ORDER BY pageviews

Running this query over a gigabyte's worth of data takes just under 5 seconds with BigQuery. Plotting the distribution, we may get something like this:

It is clear that most users leave the site after engaging with between one and three articles. The largest drop comes between three and four article views, so the third article may be the best time to show our email subscription popup. We can use this knowledge to set up our query in the Permutive dashboard:

Another application of BigQuery that has proved enormously useful is estimating segment sizes. Personalization commonly revolves around targeting a specific segment of your user base and changing the experience for users in that segment.

For example, say we're an e-retailer looking to promote a special offer for pet-owners and animal-lovers. We may define this segment with a heuristic, such as users who have viewed more than n pet-related items. But how can we choose a good value for n? If n is too large, our segment may be so small that barely anyone sees the special offer. On the other hand, if n is too small then we may end up frustrating our users by targeting them with irrelevant offers.

BigQuery can help us estimate the segment size for various values of n. The example query below is a little more complex: we first query our product entity collection to find products that are in the pets section or have relevant tags. We then join these products with pageview event data over the past month. Finally, we perform a couple of GROUP BYs to count how many users would be in our segment for varying numbers of events.

-- For each event count, how many users viewed exactly that many
-- pet-related products during March 2016?
SELECT count_events, COUNT(user_id) user_count
FROM (
  -- Count the distinct pet-related pageview events for each user
  SELECT user_id, COUNT(event_id) count_events
  FROM (
    SELECT event_id, user_id
    FROM (
      -- Product pageview events over the past month
      SELECT user_id, event_id, properties._product
      FROM
        TABLE_DATE_RANGE([Permutive.productpageview_events], TIMESTAMP('2016-03-01'), TIMESTAMP('2016-04-01'))
    ) pageviews
    JOIN (
      -- Products, flattened to one row per tag so individual tags can
      -- be matched in the WHERE clause
      SELECT entity_id, properties.section, properties.tags
      FROM
        FLATTEN([Permutive.products], properties.tags)
    ) products
    ON
      pageviews.properties._product = products.entity_id
    WHERE
      products.properties.section = 'pet-supplies'
      OR products.properties.tags = 'pets'
      OR products.properties.tags = 'animals'
      OR products.properties.tags = 'dogs'
      OR products.properties.tags = 'cats'
    -- Deduplicate: the flattened join can produce several rows for the
    -- same event if a product matches more than one tag
    GROUP BY event_id, user_id
  )
  GROUP BY user_id
)
GROUP BY count_events
ORDER BY count_events
LIMIT 10

The result set gives us an estimate of how many users viewed exactly count_events pet-related products over a period of one month, for values between 1 and 10. Since our segment contains users with more than n such views, the segment size for a given n is the sum of the user counts across all rows where count_events exceeds n. We can now make an informed decision when defining our Permutive query.
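
Once we've settled on a candidate threshold - say n = 3 - we can sanity-check the segment size directly. The sketch below simply reuses the tables and fields from the query above and adds a HAVING clause; the threshold value here is illustrative:

-- Count the users who viewed more than 3 pet-related products in March 2016
SELECT COUNT(user_id) segment_size
FROM (
  SELECT user_id, COUNT(event_id) count_events
  FROM (
    SELECT event_id, user_id
    FROM (
      SELECT user_id, event_id, properties._product
      FROM
        TABLE_DATE_RANGE([Permutive.productpageview_events], TIMESTAMP('2016-03-01'), TIMESTAMP('2016-04-01'))
    ) pageviews
    JOIN (
      SELECT entity_id, properties.section, properties.tags
      FROM
        FLATTEN([Permutive.products], properties.tags)
    ) products
    ON
      pageviews.properties._product = products.entity_id
    WHERE
      products.properties.section = 'pet-supplies'
      OR products.properties.tags = 'pets'
      OR products.properties.tags = 'animals'
      OR products.properties.tags = 'dogs'
      OR products.properties.tags = 'cats'
    GROUP BY event_id, user_id
  )
  GROUP BY user_id
  HAVING count_events > 3  -- our chosen value of n
)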


These examples offer a glimpse of the kinds of insight you can gain from using Permutive with BigQuery, and of how those insights can drive personalization.

We're delighted to be able to offer our customers a fully managed data warehousing solution, with complete access to the granular data being collected through Permutive. If you'd like to try it out for yourself, get in touch!