Forecasting at Tubi

TJ Jiang
Tubi Engineering
Jul 30, 2020 · 12 min read

Overview

KPI forecasting is critical to Tubi’s business. The Tubi data science team has developed a robust forecasting system that helps set the company budget and both internal and external facing goals, and that powers anomaly detection so we can tell when something is wrong with our service. In this blog post, we will describe domain-specific visualization strategies, background on the Prophet library that powers the forecast model, our custom modeling methodology, the production workflow, and evaluation techniques.

To motivate the development of forecasting models, let’s first look at some of our use-cases.

Application: Planning Dashboards

Tubi’s approach to forecasting combines statistics libraries with business knowledge. It should be noted early on that we do not treat our forecasts as guarantees; it is impossible to build a completely objective model that is always 100% accurate. But the exercise of KPI forecasting enables us to gain a much deeper understanding of our business and our data.

Our forecasts are visualized on an interactive dashboard like so:

Current Month TVT (Total Viewtime) Forecasts

This visualization may not be immediately intuitive, but it is in fact a very informative way to look at forecasts. The y-axis represents the metric value, while the x-axis represents the day on which the forecast was generated. Each orange bar represents the value projected for a calendar month, generated on the date shown on the x-axis. That means all of the orange bars on the chart are forecasts for the same time period, only generated on different days. In other words, this is a time series of a forecast.

In addition, the black dots represent the trailing 14-day moving average of the projected calendar-month value. This is a convenient way to visually smooth out some of the daily fluctuations in the projections, and is especially useful during periods of high uncertainty, such as when the company encounters out-of-sample events that significantly alter user or advertiser behavior on a day-to-day basis.

The gray lines represent the 95% confidence intervals. These bounds generally converge as we move closer to the end of the calendar month, since the projected value consists of ever more actual previous-day values and fewer projected values. However, a monotonic narrowing of the confidence intervals over time is not guaranteed: certain time periods, such as those with more complex seasonality mixes or large swings in metric values (think holiday season), can be more difficult to forecast than others.

Finally, when goals have been set, they are plotted alongside the forecasts as a horizontal line, giving us a reference point to compare against.

Our forecasts are generated as a daily time series that is then aggregated into calendar periods on dashboards. We generate a refreshed forecast from the latest data on a daily cadence; this represents the best forecast we can make given the information available at that time. Forecasts are generally consumed for the current calendar month and quarter. As we move through a calendar period, we incrementally replace forecasted values with actual values from the days that have passed.
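As a rough illustration of that aggregation, here is a minimal pandas sketch; the series and dates below are hypothetical, not our production data:

```python
import pandas as pd

# actuals: daily metric values observed so far in the month
# forecast: daily forecasted values covering the full month
actuals = pd.Series({"2020-07-01": 100.0, "2020-07-02": 104.0})
forecast = pd.Series({"2020-07-01": 98.0, "2020-07-02": 101.0, "2020-07-03": 103.0})
actuals.index = pd.to_datetime(actuals.index)
forecast.index = pd.to_datetime(forecast.index)

# prefer actual values where available, fall back to forecasted values elsewhere
combined = actuals.combine_first(forecast)

# projected calendar-month total: actuals to date plus forecasts for the remaining days
month_projection = combined.sum()
```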

On dashboards, we visualize forecasts for calendar periods starting a month before the period begins. Displaying historical forecasts instead of only the latest forecast has multiple benefits: for one, it helps us get a sense of how our forecasts are changing over time with new data. Of course, we expect the forecasts to change with new data, but useful signals to monitor are whether the confidence intervals are converging or diverging in width, and whether the moving average falls outside the confidence intervals. Either can indicate that we need to revisit the model: is the change due to a special event that moved the metrics? If so, we can add a holiday. Is it due to improperly tuned seasonalities? If so, we can adjust model inputs and parameters accordingly.

In addition, a stable history of forecasted values gives stakeholders confidence in the reliability of the forecasts. We have learned that perhaps the most critical and tricky aspect of developing a forecasting model is earning institutional buy-in, and what better way than to demonstrate that your forecasts have been historically stable and accurate?

Application: Anomaly Detection

Another important use of the forecasts is metric anomaly detection. Below is a daily visualization of forecast vs. actual metric value:

Revenue Forecast vs Actual — April 2020

Tubi’s project managers keep a close eye on KPI values and diligently investigate week-over-week drops. Once data quality issues have been ruled out as the cause, one of the next steps for triage can involve comparing the forecast, as a counterfactual, against the actual metric values. This way, we can rule out expected changes in metric value, such as those due to periodic seasonality or non-periodic special events (such as the launch of a competitor).

In the above case, we are in fact observing the beginning of a decrease in advertiser activity due to the economic impact of the ongoing COVID-19 pandemic. This is a fantastic example of an out-of-sample event resulting in major uncertainty. In economic parlance, sources of prediction error are split between risk and uncertainty. Risk refers to situations where we may not know the outcome but we have a good understanding of the odds, such as a dice roll: the outcome of each roll may be random, but we know it will land between 1 and 6; there is no chance of getting 14 or -2. Uncertainty, on the other hand, refers to situations for which we do not understand the odds to begin with. We may make some assumptions, but we have no objective way of quantifying how the event will affect things in the future. The best we can do is learn from such events when they occur, so that we can plan better if we expect similar situations to surface again.

Methodology — Prophet

Prophet is an open source forecasting library published by Facebook. It is well suited to forecasting business time series data, and it exposes a flexible API that does not require heavy mathematical nomenclature.

Using a backend powered by the statistical package Stan, Prophet decomposes the time series in a fashion similar to a GAM (generalized additive model), with time as the regressor and components representing trend, periodic seasonality (i.e., weekly and yearly change patterns), non-periodic events (referred to as holidays in library nomenclature), and an error term that captures variance not accounted for by the model.
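As a minimal sketch of what this looks like in code (using fbprophet as published at the time; history_df is assumed to be a dataframe with Prophet's standard ds/y columns):

```python
from fbprophet import Prophet  # packaged as "prophet" in newer releases

# history_df has two columns: ds (date) and y (daily metric value)
model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.fit(history_df)

# extend 30 days past the end of the history and predict
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

# the prediction (yhat) is the sum of the decomposed components
components = forecast[["ds", "trend", "weekly", "yearly",
                       "yhat", "yhat_lower", "yhat_upper"]]
```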

The library ships with capable cross validation and diagnostics utilities that we extend for more advanced functionality. This empowers the analyst-in-the-loop development heuristic that the library authors frequently allude to: surface problems using visualizations and diagnostics functions, then apply statistical and domain knowledge to address them. Simulated historical backtesting serves as the cross validation methodology for evaluating out-of-sample accuracy. In addition to the built-in error metrics such as RMSE (root mean squared error) and MAPE (mean absolute percentage error), we added another error metric that aligns more closely with our key use-case: calendar month error percentage. We space the backtests 30 days apart and calculate the 30-day forecast error in steps; the cumulative error as a percentage of the actual value serves as our key accuracy measure during model development.
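A rough sketch of how such a metric can be computed on top of Prophet's cross validation output (the window sizes here are illustrative assumptions, not our production settings; model is the fitted Prophet model from the sketch above):

```python
from fbprophet.diagnostics import cross_validation

# simulated historical backtests: cutoffs spaced 30 days apart,
# each forecasting a 30 day horizon
cv_df = cross_validation(model, initial="730 days",
                         period="30 days", horizon="30 days")

# cumulative 30 day error as a percentage of the actual total, per backtest
by_cutoff = cv_df.groupby("cutoff")[["y", "yhat"]].sum()
by_cutoff["error_pct"] = (by_cutoff["yhat"] - by_cutoff["y"]).abs() / by_cutoff["y"]
print(by_cutoff["error_pct"].mean())
```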

One additional heuristic we developed is a parallelized hyperparameter optimization function. Using the scikit-learn ParameterGrid data structure to define the hyperparameter space, we implement a simple grid-search algorithm to perform an exhaustive search, using simulated historical backtesting and the calendar month error percentage metric to identify the most skilled model. Furthermore, we use the joblib parallelization library to unroll the loop, taking advantage of multicore processing to speed up the search. While search efficiency could be dramatically improved with more intelligent algorithms such as random search or simulated annealing, hyperparameter optimization was neither a bottleneck nor a critical component of our model development strategy.
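A simplified version of that search might look like the following; it is a sketch rather than our production code, and monthly_error_pct is a hypothetical helper that wraps the backtesting metric described above:

```python
from fbprophet import Prophet
from sklearn.model_selection import ParameterGrid
from joblib import Parallel, delayed

# exhaustive grid over a small hyperparameter space
param_grid = ParameterGrid({
    "changepoint_prior_scale": [0.01, 0.05, 0.1, 0.5],
    "seasonality_prior_scale": [0.1, 1.0, 10.0],
})

def evaluate(params):
    # history_df (ds/y columns) is assumed to be in scope
    model = Prophet(**params)
    model.fit(history_df)
    # monthly_error_pct: hypothetical wrapper around simulated historical
    # backtesting + calendar month error percentage
    return params, monthly_error_pct(model)

# unroll the loop across cores; each grid point is an independent model fit
results = Parallel(n_jobs=-1)(delayed(evaluate)(p) for p in param_grid)
best_params, best_error = min(results, key=lambda r: r[1])
```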

Although we took the time to extend Prophet with a hyperparameter optimization function, we do not recommend simply plugging your model into a massive parameter sweep and accepting the most accurate model without additional introspection. Time series cross validation is tricky, and what makes for a good in-sample fit may not generalize to new data. Relying on hyperparameter optimization alone can be a fast path to overfitting.

That said, this method does have its place. At a minimum, it is a great way to gain a feel for how the model reacts to different values in certain parameters that affect the model in non-intuitive ways. This includes parameters such as:

  • prior_scales (for seasonality): how strongly should each seasonal component be regularized? The larger the prior scale, the more flexible the fitted seasonal curve is allowed to be, and the faster it will react to changes in the time series
  • changepoint_prior_scale: similar to the seasonality prior scales, how flexible should the trend be allowed to be at its changepoints?
  • changepoint_end: what is the latest point in the training time series at which to set a change in trend?

For seasonality, however, you should examine the time series yourself and make domain-knowledge-backed judgments. For instance, most human activity follows a seven-day periodicity; in our case, viewership tends to be lower during the weekdays and higher on the weekend. This is readily evident when visually examining the time series (all of the spikes in value correspond to Sundays):

Total View Time Time Series

Ad revenue, on the other hand, is often driven more by digital marketing budgets, which reset on a quarterly basis (observe the periodic changes highlighted by the gray vertical lines):

Advertisement Revenue Time Series

Feature Engineering

Another useful feature of the Prophet library is its support for exogenous time series regressors and special events/holidays. The proper application of domain knowledge in constructing these input features can dramatically improve model performance.

One interesting regressor was the staffing of our sales team as a predictor of revenue. Broadly speaking, Tubi’s advertisement revenue can be broken down into two buckets: programmatic, the automated ad market facilitated by an exchange, and direct, deals signed with brands and facilitated by our sales team. By encoding the historical and planned future headcount of the sales team as a time series, we can quantitatively represent the qualitative effect of our growing business and forecast the direct component of our revenue. Simply put, more sales people generally correspond to more direct deals and revenue. This also allows us to simulate changes in future revenue under various scenarios of future headcount growth.
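In Prophet, this translates into an extra regressor column that must also be supplied for the future dates being predicted. The sketch below assumes a hypothetical column name (sales_headcount) and a hypothetical headcount_df holding both historical and planned values:

```python
from fbprophet import Prophet

model = Prophet()
model.add_regressor("sales_headcount")  # declare the exogenous regressor before fitting

# history_df is assumed to contain ds, y, and historical sales_headcount values
model.fit(history_df)

# headcount_df: hypothetical dataframe with ds and sales_headcount covering
# all historical dates plus the planned future headcount
future = model.make_future_dataframe(periods=90)
future = future.merge(headcount_df, on="ds", how="left")

forecast = model.predict(future)
```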

Holidays are Prophet’s nomenclature for non-periodic events that meaningfully affect the time series, such as media coverage of Tubi or product feature releases. This is a great opportunity to work with our business teams, who keep a close eye on such special events, and with engineering teams, who monitor launch issues and outages. Diligently accounting for such events can remove noise from seasonality and trend curve fitting, so that the model does not attribute a spurious metric change to the wrong component. Holidays can also be set in the future, if that information is available, so that the model knows to make the appropriate adjustments based on metric changes from previous occurrences.
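A minimal sketch of how such events are encoded; the event name, dates, and effect windows below are hypothetical:

```python
import pandas as pd
from fbprophet import Prophet

special_events = pd.DataFrame({
    "holiday": "product_launch",
    "ds": pd.to_datetime(["2019-09-01", "2020-04-15"]),  # hypothetical event dates
    "lower_window": 0,   # effect starts on the event day
    "upper_window": 7,   # and lingers for a week afterwards
})

model = Prophet(holidays=special_events)
```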

Deployment

Our forecasting model is coded as a PySpark job scheduled with Airflow. The forecast output is generated as pandas dataframes, which are saved in Parquet format to our AWS S3 bucket. Redshift Spectrum is set up so that we can query these S3 files directly. To avoid constantly accessing the data lake, which can accrue significant costs over time, we materialize this data into our data warehouse using dbt. At that point, the forecasts are ready to be queried via Redshift and visualized on dashboards.
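Conceptually, the tail end of that job looks something like the sketch below; the bucket path and partitioning scheme are made-up placeholders:

```python
# forecast: pandas dataframe produced by the Prophet model inside the PySpark job
forecast.to_parquet(
    "s3://example-bucket/forecasts/tvt/dt=2020-07-30/forecast.parquet",
    index=False,
)
# the Parquet files are exposed as an external (Spectrum) table,
# then materialized into Redshift by a scheduled dbt model
```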

Evaluation

One final piece of the project is backtesting and scoring the forecasts we have made. This is similar to the cross validation error calculations made during model development, but with modified scoring metrics, and with the results made available for our stakeholders to view.

The monthly total error scoring metric remains the same. However, we wanted an additional criterion that expresses how reliable our projections were before the start of the calendar period being forecast, taking the confidence intervals into account. An obvious approach would be to calculate the fraction of projections whose confidence intervals ended up containing the true value. However, our forecasts do not all use the same confidence level: some metrics need the confidence level to be set more conservatively. We therefore adapted the Brier score for this purpose, with the probability set to the confidence level:

Brier score loss: BS = (1/N) Σₜ (fₜ − oₜ)², for t = 1 … N

where N is the number of predictions, fₜ is the probability assigned to prediction t, and oₜ is the binary-coded actual outcome.
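In code, scoring a batch of historical forecasts this way might look like the following sketch, using scikit-learn's implementation; the outcomes and confidence levels below are made up:

```python
from sklearn.metrics import brier_score_loss

# 1 if the actual calendar-period value landed inside the forecast interval, else 0
outcomes = [1, 1, 0, 1, 1]
# the confidence level each interval was generated at, used as the assigned probability
confidence_levels = [0.95, 0.95, 0.80, 0.80, 0.95]

score = brier_score_loss(outcomes, confidence_levels)  # lower is better
```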

Why make forecast scoring publicly visible to our stakeholders? We are big believers in ownership and accountability. Hiding the forecasts and only taking credit when results are accurate would be taking the easy way out, and it does not encourage iterative improvement. By being transparent about the performance of our model, we demonstrate that we are motivated to keep improving it, which helps develop trust and buy-in for what we forecast. In addition, we set alerts on these scores, reminding us to check in when model performance has fallen significantly.

Driving Model Adoption

Earlier in the post, we described how showing a stable and accurate history of forecasts can be a powerful tool for building stakeholder trust in your model. This subject could easily be its own article, but here are a few major lessons learned:

  • Set expectations: it is very hard to gain confidence in your model, but very easy to lose it. Just one or two forecast failures (either real or perceived) can cause stakeholders to abandon your approach, even though there may be good reasons why the outcome was non-ideal. Make sure to communicate clearly the capabilities of your model, and that forecasts are not guaranteed and can change for many reasons, some of which were previously described in this blogpost.
  • Repeated impressions: even if you give a masterful keynote on your model, stakeholders aren’t necessarily going to accept or even remember your work from a single exposure. You have to continuously advertise your model until you reach a critical mass of institutional buy-in, after which your model is the default choice for solving particular types of problems.
  • Evangelize by education and mutual learning: don’t introduce your model to stakeholders as a technically superior black box; take an educational and collaborative approach instead. Meet with stakeholders and learn what their existing workflow is for forecasting, and assess whether there may be mutual learning opportunities. Explain what your model assumptions and inputs are and ask for feedback from the domain experts. Who knows, these discussions may lead to breakthroughs in your own feature engineering.

The Road Ahead

While the forecast models were originally developed to help our product team with reporting, goal setting, and anomaly detection, they have slowly been gaining wider adoption across the company. The modular nature of the project’s software design has made extending the forecasts simple and efficient.

From backtesting and evaluation, we have seen that our model performance is steadily improving over time, with the exception of the initial period of user and advertiser behavior change caused by COVID-19. Forecasts should never be treated as guarantees; at the end of the day, these are models of our business, and we will never reach a point where 100% of the data and nuance are objectively accounted for. The key is to use the forecast as a tool that helps us understand our business and guide our decision making, so that the collaboration between analyst and model achieves better performance than either could individually.

If you are a data scientist interested in solving these types of challenges, we’d love to hear from you!
