Building a Model-Driven Enterprise

Chang She
Tubi Engineering
Dec 9, 2019 · 9 min read

Part I — An introduction to machine learning, data science, and data engineering at Tubi TV

Tubi is the market leader in free TV. It is our singular mission to make high-quality entertainment available to everyone. So what’s the secret sauce that’s enabled an upstart like Tubi to be successful without the infinitely deep pockets of industry giants like Disney, Apple, or Netflix? We believe it is our model-driven approach to making decisions throughout the company, centered on the three data disciplines of machine learning, data science, and data engineering.

This blog post is the first in a series that will take you through some of our most critical data projects, from the data warehousing foundations to operationalizing and automating advanced machine learning. This post outlines the projects at a high level; subsequent articles will discuss each one in depth.

What is Model-Driven?

Maybe you’ve heard of data-driven companies, but what is model-driven? The core principles boil down to the following:

  1. A data-driven approach is static and historical, while a model-driven approach is dynamic, forward-looking, and predictive.
  2. Model-driven companies make disciplined decisions with clear hypotheses formed ex-ante, rather than creating ex-post narratives crafted from big-data fishing expeditions.
  3. Model-driven companies rely on operationalizing data science and {machine, deep, reinforcement} learning to create a continual feedback loop that automatically learns to improve decisions via experimentation.

OK, enough philosophy and buzz words. Let’s talk shop.

A Good Data Warehouse Starts in the Client App

Data is the lifeblood of the modern enterprise. Tubi is no exception. We invest significant resources in maintaining a central data warehouse that forms the foundation of our business decisions.

At Tubi, we believe in Garbage-In-Garbage-Out, which means we eschewed the “track ’em all, let data science sort ’em all” approach to analytics event tracking. Instead, we designed an event grammar for tracking user interactions with the app: if you can clearly describe the activity in English, it should be easy to track. We designed the schema so that most analytics use cases correspond to explicit fields. This approach requires more upfront investment in thinking clearly about what to track and how; in return, the payoff is much higher quality data and better visibility into the business itself.

Back to school for a grammar lesson
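As an illustration, here is a minimal Python sketch of what such an event grammar might look like. The field names and action vocabulary below are hypothetical, not Tubi's actual schema; the point is that events with a closed verb vocabulary and explicit fields can be validated at the door:

```python
from dataclasses import dataclass, asdict

# Hypothetical closed vocabulary of verbs: an event that can't be
# described in plain English with one of these verbs isn't tracked.
ALLOWED_ACTIONS = {"play", "pause", "bookmark", "search", "navigate"}

@dataclass(frozen=True)
class ClientEvent:
    user_id: str
    device: str
    action: str       # a verb from the closed vocabulary
    content_id: str   # the object the verb acts on

    def validate(self) -> bool:
        # Reject events that don't fit the grammar instead of letting
        # "track 'em all" noise into the warehouse.
        return self.action in ALLOWED_ACTIONS and bool(self.user_id)

event = ClientEvent(user_id="u123", device="roku", action="play", content_id="m456")
print(event.validate())  # True
print(asdict(event))
```

Because every use case maps to an explicit field, downstream queries read named columns rather than digging through free-form property bags.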

Even though our analytics system is designed to be methodical and reduces a lot of superfluous events, we still need to handle billions of events per day. The event processing goes through several major stages. First, the incoming event from the client is translated from JSON to protobuf and undergoes some parsing, cleanup, and validation. Next, the enrichment service will augment the events with metadata about the user, the content they’re viewing, and the row of the home page that contains the content. We decided to create these “wide” events so that downstream analytics wouldn’t have to perform complex joins between multiple tables. Because we needed to track state, maintain a cache, handle retries, and make real-time API calls, we chose to use a custom Akka-stream service rather than build it on top of Spark Streaming. The result is an analytics pipeline that is scalable and does not require much babysitting.

Enriching Analytics Events
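A toy Python sketch of the parse-and-enrich stages described above. The real service is an Akka-stream application working with protobuf; plain dicts and the lookup tables below are illustrative stand-ins:

```python
import json

# Stand-ins for the metadata sources the enrichment service consults.
USER_META = {"u123": {"country": "US", "signup": "2018-04-01"}}
CONTENT_META = {"m456": {"title": "Example Movie", "genre": "drama"}}

def parse(raw: str) -> dict:
    # Stage 1: decode the client payload and do minimal validation.
    event = json.loads(raw)
    assert "user_id" in event and "content_id" in event
    return event

def enrich(event: dict) -> dict:
    # Stage 2: denormalize ("widen") the event so downstream analytics
    # never need to join against user or content tables.
    wide = dict(event)
    wide["user"] = USER_META.get(event["user_id"], {})
    wide["content"] = CONTENT_META.get(event["content_id"], {})
    return wide

raw = '{"user_id": "u123", "content_id": "m456", "action": "play"}'
wide_event = enrich(parse(raw))
print(wide_event["content"]["genre"])  # drama
```

The trade-off is larger events in exchange for join-free queries, which is exactly what the "wide" event design buys.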

The data warehouse itself provides an easy, standardized way to run ETL workflows that build aggregations, sessions, and otherwise support our analytics needs on top of the fine-grained event data. Because Spark requires a lot of specialized knowledge and engineering support, we limit it to workflows that require streaming or iterative computations. For the remainder, we chose DBT, a tool that relies on SQL, so everyone at Tubi can contribute. We used DBT's features to quickly build incremental, smaller, more explicit, and more granular models, making it easy for anyone, not just seasoned data scientists, to find the data they need. One additional benefit of using DBT is that we get the lineage of any model for free.

Automatically track provenance of views using DBT

One Data Language for the Whole Company

Just having the data is not enough. At Tubi, we also built a unified platform for finding, preparing, and analyzing data for all of our decisions. Whether you’re a seasoned veteran with a Ph.D. in machine learning or a junior analyst just getting started in your career, you need an effective way to work with our data, and you need to be able to share that work.

To realize this dream of a one-stop-shop for analysis, we used JupyterHub on Kubernetes as a basis. We built many customizations to make it not just possible, but delightful to analyze data at any scale. We call this environment the Tubi Data Runtime.

One big pain point we identified was the delay in getting data. We found that the vast majority of queries returned 1GB of data or less — something that pandas should handle comfortably. However, the native pandas-SQL connector is too slow and locks up a connection while the response data is transferred back. Instead, we created a custom pandas-redshift connector that takes advantage of the Python multiprocessing module and Redshift's UNLOAD command. With this, query times became even faster than querying with Spark and then calling toPandas.
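A rough sketch of the parallel-chunk idea behind the connector: Redshift's UNLOAD writes query results as multiple files to S3, which the client can fetch concurrently instead of streaming one cursor. This toy version uses threads and fake in-memory chunks to stay self-contained; the real connector uses the multiprocessing module and assembles a pandas DataFrame:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(chunk_id: int) -> list:
    # Stand-in for downloading one UNLOAD-ed file from S3.
    return [f"row-{chunk_id}-{i}" for i in range(3)]

def parallel_fetch(num_chunks: int) -> list:
    # Fetch all chunks concurrently, then concatenate in order
    # (the real code concatenates chunks into one DataFrame).
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = list(pool.map(fetch_chunk, range(num_chunks)))
    return [row for chunk in chunks for row in chunk]

rows = parallel_fetch(4)
print(len(rows))  # 12
```

The win comes from overlapping the per-chunk transfer time, which a single blocking database cursor cannot do.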

For visualizations, we built a JupyterLab extension with a display(df) function that enables basic data visualization with no code required. The extension integrates the Nteract data explorer directly into a JupyterLab cell. Combined with our Redshift connector, this means basic data analysis requires no more than one line of Python: display(df). We also built Tubi-specific visualization functionality to help us visually inspect recommendation results and explore our content universe.

Look ma: no code (Jupyter + Nteract Data Explorer)
Exploring our content space (Jupyter + Plotly Dash)

For collaboration, we use an EFS drive for shared notebooks with regular backups in case of a catastrophe. JupyterLab also does not have a “link sharing” feature by default, so we created a custom extension to generate a non-user-specific URL for a shareable notebook. As usage for Tubi Data Runtime has increased over time, we’re collecting more and more extensions that offer productivity gains for our users.

The Tubi Data Runtime has many other features as well, including an AWS Kinesis stream sniffer to help test and debug event streams, an easy way to connect to Spark clusters for working with terabyte-scale data and above, and access to GPU instances for deep learning.

Rate of Iteration is Everything

At Tubi, we believe our chances of success depend primarily on sustaining a high velocity of iteration. One great example is how we’ve been improving our home-screen personalization. We iterate on different containers, ranking algorithms, models for different user segments, and much more. We also have to take into account global constraints and objectives like recommendation diversity, cold start, promotional levers, etc. Every time you open the Tubi home-screen, it represents the combination of a lot of different algorithms and systems.

Jeff Bezos famously said, “We should be trying to figure out a way for teams to communicate less with each other, not more.” To us, this doesn’t mean we’re not allowed to share pizza between teams. It means finding the right combination of technical design and processes, so we don’t have to spend so much time coordinating across teams. One such example is our ML iteration cycle. In the past, different ML algorithms were served by various services managed by different teams. New experiments required service redeploys that had to be coordinated and scheduled between multiple ML and backend teams. It made for a frustrating experience every time we wanted to conduct an online A/B test in production.

Unnecessary tight coupling is more than just bad technical design

To address this pain-point, we created two Akka-based services using Scala and gRPC. One is our ML serving layer, designed so the ML team can conduct as many tests as they want whenever they want. We call this our Ranking Service, which serves batch pre-computed recommendations. Using ScyllaDB, we’re able to achieve personalization in ~30ms, without the aid of an in-memory cache. We’ve also designed the system so that a new ML experiment simply requires the ML team to re-compute the batch job and send the recommendations via AWS Kinesis; no deployments required.
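A minimal sketch of the serving pattern described above: batch pre-computed recommendations published to a key-value store and read back with a single lookup. A plain dict stands in for ScyllaDB, and all keys, model names, and payloads are hypothetical:

```python
# Key-value store stand-in for ScyllaDB, keyed by (user, model).
store = {}

def publish(user_id: str, model: str, ranked_ids: list) -> None:
    # In production, fresh batch results arrive via an AWS Kinesis
    # stream; launching a new ML experiment is just a new write path.
    store[(user_id, model)] = ranked_ids

def rank(user_id: str, model: str, fallback: str = "popular") -> list:
    # A single key-value read keeps serving latency low
    # (~30ms in Tubi's case, with no in-memory cache).
    return store.get((user_id, model), store.get((user_id, fallback), []))

publish("u123", "popular", ["m1", "m2"])
publish("u123", "experiment-42", ["m9", "m1"])
print(rank("u123", "experiment-42"))   # ['m9', 'm1']
print(rank("u123", "unknown-model"))   # falls back to ['m1', 'm2']
```

Because serving only reads what the batch job wrote, the ML team can swap models by re-running the batch job — no service redeploy, which is the decoupling the paragraph above describes.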

The other service is our new experimentation engine, which allows us to configure, schedule, and organize our experiments better than solutions available today (commercial or open-source). As we scaled up experimentation across the product, we realized we needed better ways to schedule future experiments, coordinate experiment rollouts, and plan/execute many concurrent experiments that may or may not interfere with each other. After looking at work from PlanOut, Tang et al., and open-source projects like Wasabi, we developed our new experiment engine with a domain-model that matched our needs, was flexible enough to support 10x the rate of experimentation, and achieved a latency of under 5ms.

Experimentation service latency
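One common building block of engines in the PlanOut lineage is deterministic hash-based bucketing, which makes assignments stable and lets each experiment split users independently. A sketch under those assumptions (the function and parameter names are illustrative, not Tubi's API):

```python
import hashlib

def assign(user_id: str, experiment: str, treatments: list) -> str:
    # Hashing (experiment, user) gives each experiment an independent,
    # reproducible split: a user's arm never changes mid-experiment,
    # and no per-user assignment state needs to be stored.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(treatments)
    return treatments[bucket]

arm = assign("u123", "home-ranking-v2", ["control", "variant"])
print(arm in {"control", "variant"})  # True
```

Statelessness is also what keeps assignment latency tiny: the hot path is one hash, no database read.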

With the Ranking Service and Popper, our experimentation engine, we’ve improved our rate of experimentation by more than 5x, and we’re just getting started.

Autonomous Learning — Holy Grail or AI Snake Oil?

There is a lot of hype around ML/AI these days, but that doesn’t mean there aren’t real nuggets of gold. Often, the truly valuable innovations are in the “boring” bits of AI, i.e., technologies that help smaller companies successfully use {machine, deep, reinforcement} learning in production.

Of course, a successful production machine learning platform needs to take into account much more than just the online serving components. After all, the actual models and features have to come from somewhere. For a machine learning engineer, bringing an idea to life is often a daunting task. They must: analyze the data, create features from appropriate sources, integrate the new features into a model, perform offline evaluation to tune the model, and, finally, schedule an online A/B test.

Way oversimplified workflow of ML lifecycle

What if it didn’t have to be so complicated? Your feature store should be able to trigger the model registry as soon as new features are added so that new candidate models are automatically created that incorporate the new features. These new candidate models are then automatically evaluated and tuned to create a few viable candidates for online testing. Once these viable candidates are produced, A/B tests can be approved with a push of a button to ship new experiments into production. Finally, a complete end-to-end model tracking system should allow the online model monitoring system to select the best model and version to use continuously. At Tubi, we’re laying the groundwork for this level of “self-driving” ML, which follows the mantra of human-driven, machine-assisted.
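The loop above can be sketched as a chain of triggers: a new feature creates candidate models, offline evaluation filters them, and a human approves the A/B test. Every name and metric below is hypothetical, a sketch of the workflow rather than Tubi's actual platform:

```python
registry = []

def on_feature_added(feature: str, base_features: list) -> dict:
    # Feature store triggers the model registry: a candidate model
    # incorporating the new feature is created automatically.
    candidate = {"features": base_features + [feature], "status": "candidate"}
    registry.append(candidate)
    return candidate

def offline_eval(candidate: dict) -> dict:
    # Stand-in metric; a real system would run backtests or
    # holdout evaluation and tune hyperparameters here.
    candidate["score"] = len(candidate["features"]) * 0.1
    candidate["status"] = "viable" if candidate["score"] > 0.2 else "rejected"
    return candidate

def approve_for_ab(candidate: dict) -> dict:
    # Human-driven, machine-assisted: a person pushes the final button.
    if candidate["status"] == "viable":
        candidate["status"] = "ab_test"
    return candidate

c = approve_for_ab(offline_eval(on_feature_added("watch_time_7d", ["genre", "recency"])))
print(c["status"])  # ab_test
```

The important property is that each stage only reacts to the previous one's output, so the pipeline can run unattended right up to the human approval gate.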

Onward and Upward

Software ate the world over the last decade. In the next decade, models will run the world. This is a tectonic shift that’s coming whether we’re ready for it or not. At Tubi engineering, we’re proving that a small and independent company can be successful in this new world without the near-infinite resources and monopolistic clout of the industry giants. It’s led us to solve immensely challenging problems from creating an enterprise-grade analysis environment to automating machine learning experimentation. What’s even more exciting is that our journey is just beginning. If you enjoy working on hard engineering problems and want your work to make a tremendous impact, come join us!
