How the Rosetta Stone Tool Unifies Tubi’s Content Management System

Rudra Roy Choudhury
Published in Tubi Engineering
5 min read · Dec 19, 2023

Rudra Roy Choudhury, Yuanbo Chen and John Trenkle

As the most-watched free TV and movie streaming service in the United States, Tubi is committed to offering universal access to stories from all around the globe. Tubi recently surpassed 74 million monthly active users in November 2023.

Quality content is a cornerstone of Tubi’s success. With over 200,000 movies and TV episodes available to captivate diverse audiences, Tubi’s content library is the largest in the world. New content, along with rich metadata from various third-party sources, is added to Tubi daily.

The challenge of having such a vast amount of content is how to effectively manage and understand it. The most effective way we know to accurately attribute each piece of metadata to the appropriate title is to unify titles across the different ID spaces in which they appear.

We built the Rosetta Stone system in order to automate this unification.

Rosetta Stone is a flexible ID mapping system

To keep our momentum in the industry, Tubi’s impressively large content catalog must be properly managed and maintained. The content management approach for Rosetta Stone is to build adaptive mapping capabilities that seamlessly transition from one metadata space to another.

This flexible ID mapping system enables the following applications:

  • Unified tracking and understanding: Establishing a standardized ID space is pivotal for monitoring and connecting all content. A standardized ID space forms the foundation for our platform domain. It also influences various platform domain aspects, including content handling, analysis, recommendations, partner payments, and overall platform operations.
  • Handling unidentified information: We frequently encounter data that is lacking an identifiable content ID. In these cases, an automated matching system is invaluable, especially in early data cleaning where the current, manual lookup of IDs by curators is both time-consuming and inefficient. Our aim is to minimize the amount of time wasted so that experts can focus on more impactful tasks.
  • Systematic mapping for resource utilization: With a wealth of resources, including metadata, text descriptions, images, reviews, popularity ratings, and performance metrics, systematic mapping between ID spaces allows us to enhance the metadata for the original title. This comprehensive approach contributes to a better understanding of content and improves recommendations.

Large Language Models offer a new perspective

Based on our research and analysis, Tubi chose similarity-based ranking in embedding space as the method for ID matching. Tubi had previously experimented with other techniques to tackle these issues, but the results were poor.

Now, with the latest advancements in Large Language Models (LLMs), there are powerful new methods available to reliably match content, while also yielding confidence metrics.

LLM technology maps text to a unified semantic space, enhancing fuzzy matching across diverse ID spaces. It excels in categorizing and identifying similar content styles by recognizing texts from various sources as the same, and compensating for missing metadata to accurately position content.

As a product-focused company, Tubi aims to leverage LLMs to build a superior content metadata embedding space to be utilized by a wide range of teams and use cases.

Rosetta Stone functional workflow

The following three steps describe the Rosetta Stone functional workflow.

  1. We create a fundamental bank of embeddings along with all their associated, already-established IDs. The following image illustrates the combinatorial aspect of the problem, which is why it is important to exercise caution when building this set. Still, accommodating more variants makes it more likely that our best ID match appears within a limited set of candidates ranked by similarity score.
Rosetta Stone functional workflow

  2. To find a match for a current request, we create a structured text entry based on the available data. This representation takes the form of key-value pairs, with an inherent order for the available features. The resulting string is embedded using LLMs.

  3. We perform a K-Nearest Neighbors operation to find and rank matching titles from the pre-calculated embedding set. The likelihood that the most similar match is the correct one is high. However, the service should still provide multiple hypotheses based on confidence levels and potential ambiguity among other highly ranked candidates.
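The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Tubi's implementation: the field order in `FIELD_ORDER`, the catalog entries, the `min_margin` ambiguity threshold, and all function names are hypothetical, and `toy_embed` is a hashed character-trigram stand-in for a real LLM embedding model so that the example runs with the standard library alone.

```python
import hashlib
import math

# Hypothetical feature order; the article only says the order is fixed.
FIELD_ORDER = ("title", "year", "type", "director")


def to_structured_text(record):
    """Serialize available fields as ordered key-value pairs; skip missing ones."""
    parts = [f"{key}: {record[key]}" for key in FIELD_ORDER if record.get(key)]
    return " | ".join(parts)


def toy_embed(text, dim=256):
    """Stand-in for an LLM embedding: hashed character trigrams, L2-normalized."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity; a plain dot product because vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def build_bank(catalog):
    """Step 1: embed each known title and keep it with its established ID."""
    return [(item["id"], toy_embed(to_structured_text(item))) for item in catalog]


def top_k(bank, query, k=3, min_margin=0.05):
    """Steps 2 and 3: embed the request, rank candidates, flag ambiguity."""
    q = toy_embed(to_structured_text(query))
    ranked = sorted(((cosine(q, emb), cid) for cid, emb in bank), reverse=True)[:k]
    # A small gap between the top two scores suggests the match is ambiguous.
    ambiguous = len(ranked) > 1 and ranked[0][0] - ranked[1][0] < min_margin
    return [(cid, round(score, 3)) for score, cid in ranked], ambiguous


# Made-up catalog entries, for illustration only.
catalog = [
    {"id": "ref-001", "title": "The Silent Harbor", "year": 2011, "type": "movie"},
    {"id": "ref-002", "title": "Harbor Lights", "year": 1998, "type": "movie"},
    {"id": "ref-003", "title": "Midnight Kitchen", "year": 2016, "type": "series"},
]
bank = build_bank(catalog)
candidates, ambiguous = top_k(bank, {"title": "Silent Harbor", "year": 2011})
print(candidates, ambiguous)
```

Returning the full ranked list together with an ambiguity flag, rather than only the top hit, mirrors the point in step 3 that the service should offer multiple hypotheses. At catalog scale, the brute-force sort would typically be replaced by an approximate nearest-neighbor index.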

Matching unknown metadata against Rosetta Stone

Using the content recognition and matching system supported by these LLMs, we successfully established a unified ID mapping system, with a reference third-party data source (hereafter “reference data”) that has very high coverage of Tubi’s content library serving as our standard ID space. We also matched the generated IDs with the reference IDs from several widely used and comprehensive third-party content libraries, with highly accurate results.
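One way to picture a unified ID mapping built around a standard ID space is a crosswalk that links every external ID to a reference ID, so that any two spaces can be translated by hopping through the reference. The sketch below is an assumption about how such a structure could look, not a description of Tubi's system; the class name `IdCrosswalk`, the space names, and the sample IDs are all hypothetical.

```python
from collections import defaultdict


class IdCrosswalk:
    """Links IDs from any source space to a single reference ID and back."""

    def __init__(self):
        self._to_ref = {}                   # (space, external_id) -> (reference_id, score)
        self._from_ref = defaultdict(dict)  # reference_id -> {space: external_id}

    def link(self, reference_id, space, external_id, score):
        """Record a match together with the similarity score that produced it."""
        self._to_ref[(space, external_id)] = (reference_id, score)
        self._from_ref[reference_id][space] = external_id

    def to_reference(self, space, external_id):
        """Return the reference ID for an external ID, or None if unlinked."""
        entry = self._to_ref.get((space, external_id))
        return entry[0] if entry else None

    def translate(self, src_space, external_id, dst_space):
        """Hop through the reference ID to reach another space; None if unlinked."""
        ref = self.to_reference(src_space, external_id)
        if ref is None:
            return None
        return self._from_ref[ref].get(dst_space)


# Made-up IDs, for illustration only.
xwalk = IdCrosswalk()
xwalk.link("ref-001", "tubi", "t-42", 1.0)
xwalk.link("ref-001", "vendor_a", "A-9001", 0.93)
print(xwalk.translate("tubi", "t-42", "vendor_a"))
```

Storing the match score alongside each link keeps the confidence information available to downstream consumers, who may want to treat low-scoring links differently.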

Mapping of IDs across multiple content spaces

How Rosetta Stone improved Tubi’s Content Management System

The Rosetta Stone tool has had a significant and immediate impact at Tubi. Most notable is the correction of inaccurate reference data within our content library. For example, certain kinds of content, such as foreign-language titles and films with aliases, have a high rate of data inaccuracy.

There are multiple teams (and a few backend apps) at Tubi that rely on statistics based on the reference ID. Thus, if the reference information is incorrect, the error propagates to the upper-level applications and its negative impact is multiplied. The correction of the reference IDs has made our content library more accurate and robust, giving other teams greater confidence in using downstream statistics based on the reference data.

Matching 3rd party databases with Rosetta Stone

Tubi uses Rosetta Stone to match 3rd party movies

Tubi has incorporated multiple commercial 3rd-party movie and TV show databases to complement missing information and enhance title descriptions. While we can match some 3rd-party results with our existing library through strict matching conditions, there are still many 3rd-party results that cannot be matched due to lack of information. Discarding these results directly would be a waste. With Rosetta Stone, we have retrieved numerous results from these initially unmatched 3rd-party sources, greatly supplementing our own content library information.
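The two-stage idea described above, strict matching first and a similarity-based fallback for whatever is left, could be sketched as follows. This is a simplified illustration under assumptions: the strict key (normalized title plus year), the `0.8` threshold, the record fields, and the sample data are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for the embedding similarity that Rosetta Stone uses.

```python
from difflib import SequenceMatcher


def strict_key(record):
    """Exact-match key: normalized title plus release year."""
    return (record["title"].strip().lower(), record.get("year"))


def fuzzy_score(a, b):
    """Stand-in similarity in [0, 1]; an embedding cosine score would go here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_records(library, incoming, threshold=0.8):
    """Return (strictly matched, recovered by fallback, still unmatched)."""
    index = {strict_key(item): item["id"] for item in library}
    matched, recovered, unmatched = {}, {}, []
    for rec in incoming:
        hit = index.get(strict_key(rec))
        if hit is not None:
            matched[rec["ext_id"]] = hit
            continue
        # Fallback: the strict key failed (for example, the year is missing).
        best = max(library, key=lambda item: fuzzy_score(item["title"], rec["title"]))
        score = fuzzy_score(best["title"], rec["title"])
        if score >= threshold:
            recovered[rec["ext_id"]] = (best["id"], round(score, 2))
        else:
            unmatched.append(rec["ext_id"])
    return matched, recovered, unmatched


# Made-up records, for illustration only.
library = [
    {"id": "ref-001", "title": "The Silent Harbor", "year": 2011},
    {"id": "ref-003", "title": "Midnight Kitchen", "year": 2016},
]
incoming = [
    {"ext_id": "A-1", "title": "Midnight Kitchen", "year": 2016},
    {"ext_id": "A-2", "title": "Silent Harbor"},  # year missing, article dropped
    {"ext_id": "A-3", "title": "Desert Run", "year": 2003},
]
matched, recovered, unmatched = match_records(library, incoming)
print(matched, recovered, unmatched)
```

Keeping the recovered matches separate from the strict ones, with their scores attached, makes it easy to route lower-confidence results to a curator for review instead of discarding them.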

Summary

Tubi’s Rosetta Stone is a powerful system for managing complex content metadata that helps our company scale and enrich its content libraries, while delighting viewers with highly personalized content.

Acknowledgments

Many thanks to the Product Team and Machine Learning Team at Tubi for their collaborative effort in rolling out Rosetta Stone.

The authors would also like to thank Machine Learning Staff Tech Lead Claire Dorman and Vice President of Engineering, Machine Learning Jaya Kawale for reviewing the article.

We’re hiring!

If you’re interested in large-scale, high-impact projects like Rosetta Stone, why not join us? Tubi Engineering is hiring across a range of positions in Machine Learning, Machine Learning Infrastructure, Data, and more. You can check out Tubi careers here.
