Building machine learning products: lessons from Papercup's product team
January 31, 2024
7 min read

There is certainly no shortage of product-building advice out there, but most of it doesn’t hold for products with machine learning (ML) at their core. As the boom in such products hitting the market shows no sign of abating, this article aims to share practical lessons from the Papercup product team’s experience working with a vast array of new capabilities. In turn, we hope to help others build valuable products that harness machine learning.

Papercup is building a cross-lingual, speech-to-speech engine, initially focused on dubbing video for content owners and making the world’s content watchable in any language. To date, videos we've localized have reached over 750m non-English speakers. Our first product consists of a multi-stage pipeline, the output of which is validated and improved by skilled linguists. We use off-the-shelf machine learning APIs and also research and develop our own proprietary speech models, focused on expressivity. We are now a team of around 60 people, and roughly a third of the company are machine learning engineers and researchers.

We are going to take a look at:

  • What do we mean by ML products?
  • What type of markets are now within reach?
  • Key concepts
  • Teams: Who can build ML products?
  • Implications for product & startup economics
  • Reading list & references

💡 Disclosure: I am not an expert on the ethics, politics or in-depth development of machine learning models. I can only speak to this from a product and startup lens.


What do we mean by ML products?

To put it simply: products with machine learning models at their core that would not have been viable without them. ML-enabled features such as risk analysis, text prediction and recommendation systems have been deployed at scale for decades in industries like finance and telecommunications. Machine learning has long been the driver of social media algorithms and autocomplete, but the swift adoption of ChatGPT marked a significant step change in product capabilities.

Former GitHub CEO Nat Friedman spoke to Ben Thompson at Stratechery about the paucity of real-world AI applications beyond Github Copilot:

“I left GitHub thinking, “Well, the AI revolution’s here and there’s now going to be an immediate wave of other people tinkering with these models and developing products”, and then there kind of wasn’t and I thought that was really surprising. So the situation that we’re in now is the researchers have just raced ahead and they’ve delivered this bounty of new capabilities to the world in an accelerated way, they’re doing it every day.
So we now have this capability overhang that’s just hanging out over the world and, bizarrely, entrepreneurs and product people have only just begun to digest these new capabilities and to ask the question, ‘What’s the product you can now build that you couldn’t build before that people really want to use?’ I think we actually have a shortage.” Source ($)

When we talk about ML products right now, we are really talking about a new set of capabilities with the potential to become new products. For these products to become a reality, entrepreneurs and product people have to figure out how best to harness them for real-world use.

What type of markets are now within reach?

One of the most on-point descriptions I have found on the potential impact of ML products on markets was written by Matt Bornstein & Martin Casado (a16z):

“AI has enormous potential to disrupt markets that have traditionally been out of reach for software. These markets – which have relied on humans to navigate natural language, images, and physical space – represent a huge opportunity, potentially worth trillions of dollars globally.”
“Most AI applications look and feel like normal software. They rely on conventional code to perform tasks like interfacing with users, managing data, or integrating with other systems. The heart of the application, though, is a set of trained data models. These models interpret images, transcribe speech, generate natural language, and perform other complex tasks.” Source

While ML products certainly look like traditional software, they do in fact feel different. The key concepts below go some way to explaining why. These are ideas I’ve found helpful for building products and giving users affordances that were previously impossible.

Key ML concepts for product managers

Failure modes

ML models fail frequently. It’s the product manager’s job to make sure they fail gracefully. Nat Friedman described the challenge succinctly when reflecting on building GitHub’s Copilot:

“How do you take a model which is actually pretty frequently wrong and still make it useful?” Source ($)

Good examples of handling model failure gracefully:

Midjourney

In-painting allows users to reconstruct unsatisfactory or incomplete segments of their text-to-image output. Midjourney does not set the expectation of generating a pixel-perfect image; instead, it lets the user pick and choose which areas to regenerate.

Github Copilot

After initially prototyping a question-and-answer chatbot, GitHub built a code-synthesis autocomplete UI, which attempts to provide complete code segments to the developer. This creates a frequent touchpoint between the model and the user, allowing the developer to cultivate an intuition for when they may or may not benefit from the suggestions.

Not so good examples of handling model failure:

Tesla Full Self-Driving

Tesla’s FSD feature allows the car to drive self-directed, with the user instructed to pay attention and be prepared to use the wheel and brakes at a moment’s notice. With full self-driving still beyond the horizon, the implementation of this feature encourages the user to relax and trust the underlying model in most circumstances. Unfortunately, the moment the model fails, the user’s response may be one of the more consequential events of their life.

Additional valid strategies to mitigate failure risk include model ensembles, model monitoring, redundancy, human oversight and explainability (ask ChatGPT for further reading).
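
As a minimal illustration, here is what two of those strategies might look like in code: a majority-vote ensemble and a confidence-threshold gate that routes uncertain outputs to a human. The function shapes and the 0.9 threshold are illustrative assumptions, not a production recipe.

```python
from collections import Counter
from typing import Callable

def ensemble_predict(models: list[Callable[[str], str]], x: str) -> str:
    """Majority vote across several models to mask single-model failures."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def predict_with_oversight(predict: Callable[[str], tuple[str, float]],
                           x: str, threshold: float = 0.9) -> str:
    """Return the model's answer only when it is confident enough;
    otherwise flag the item for a human to check."""
    label, confidence = predict(x)  # assumes the model exposes a confidence score
    if confidence >= threshold:
        return label
    return f"NEEDS_REVIEW: {x!r}"   # in production: push to a review queue
```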

Who can build ML products?

As an ML product manager, the shape of the teams you have exposure to will differ from those of PMs with strictly hardware or software backgrounds. These teams may have different constraints, timelines and hobbies from other engineering teams you have worked with. Here’s an illustration of the responsibilities of an ML product team based on my experience.

Core Research

Core research teams look for new knowledge. They research and develop new modeling solutions, and often read and write academic papers. This team can conduct fundamental or targeted research depending on the organization’s size, structure and timelines for research and development. Their cycles are usually slower than those of other groups within research and development, but their discoveries can set the foundation for long-term strategic defensibility.

Applied Research

Applied research teams look for solutions. They collaborate with core research and engineering teams to ensure the successful deployment and utilization of both proprietary models as well as off-the-shelf and open source projects where required. Their product cycles are closer to traditional software projects.

ML Platform

ML platform teams, akin to DevOps, build tools and services to simplify, optimize and accelerate the ML lifecycle.

Data

Data may be an individual team, or it may be a shared responsibility across the research and development function. Data may be created in a whole host of ways depending on the use case (commissioned, synthetically generated, purchased, annotated, etc.).

Data cleaning and preparation is a primary role for the data function. Depending on the product use case, this may necessitate a full-stack team, e.g. for quality control or human feedback. Data is essential to the functioning of an ML-enabled product team, but plenty has been written about this elsewhere.

Research ≠ Production

ML research teams will look to narrow the scope of the problem they are working on in order to build something that has not been possible before. Papers and sample results can be closely controlled and will not always replicate or translate to real-world impact.

On the other hand, sometimes models are released whose true power is under-appreciated at the time. More than two years passed between the release of the comparatively unheralded GPT-3 and the industry-shifting ChatGPT, which was powered by a GPT-3 variant. OpenAI devoted significant effort to the productization of GPT-3.5, specifically reinforcement learning from human feedback and the chat form factor, to fine-tune the model’s performance, with great success.

Understanding these caveats is key to managing expectations while still moving quickly and iterating on ML products.

Timelines and certainty

ML research cycles are longer than engineering cycles. But paradoxically, industry progress in ML is (broadly) significantly faster than in engineering. Skilled ML engineers can move extremely quickly, although model training time can still create bottlenecks.

Advances in one sub-field of ML are usually a leading indicator of progress in another, given the generalizability of tooling, architectures and modeling techniques. An ongoing example of this is the application of transformer architectures from large language models (LLMs) to large speech models (LSMs). Google’s BERT family of models, first introduced in 2018, is a prime example of a paradigm-shifting technology arising from research, which impacted first its own sub-field (Natural Language Processing, or NLP) before generalizing across machine learning and technology as a whole.

Models: Internal, off-the-shelf or API

ML products require performant ML models. These can be built internally, taken from the open source community, or queried via API. Especially for startups with limited resources (compute, capital, time), teams usually will not build and train their own base models from scratch. However, the right answer for your team will vary depending on what you are building and your product strategy.

Papercup’s ML dubbing system requires models capable of generating expressive speech beyond the levels most text-to-speech (TTS) systems are capable of. It makes sense for us to focus our model-building capabilities here, on what is more defensible intellectual property. Other parts of our video production pipeline and analytics stack require less bespoke models.

Our engineering and ML teams will almost always build on open source projects or call on OpenAI’s API when needed to avoid reinventing the wheel.

Deterministic vs Probabilistic

One of the things we mean when we say machine learning products feel different from software is the deterministic (software) versus probabilistic (machine learning) dichotomy.

When you input ‘5+7’ into a calculator, you will always receive the number ‘12’. There's no uncertainty, no variation in output for the same input, and no learning from past computations. It's entirely deterministic.

Machine learning output, conversely, is probabilistic. ChatGPT doesn’t have any internal record of right and wrong, “but rather a statistical model about what bits of language go together under different contexts” (Ben Thompson $). It doesn't follow a rigid deterministic rule to decide the response; instead, it assesses the input based on what it learned during training and assigns probabilities to candidate answers.
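
To make the dichotomy concrete, here is a toy sketch in Python. The "language model" below is just a hand-written probability table standing in for what a real model learns in training; the numbers are invented for illustration.

```python
import random

def calculator(a: int, b: int) -> int:
    """Deterministic: the same input always produces the same output."""
    return a + b

def toy_language_model(prompt: str) -> str:
    """Probabilistic: sample the next token from a learned distribution,
    so the same prompt can yield different outputs on different runs."""
    next_token_probs = {"12": 0.90, "twelve": 0.08, "13": 0.02}  # invented
    tokens = list(next_token_probs)
    return random.choices(tokens, weights=list(next_token_probs.values()))[0]

assert calculator(5, 7) == 12      # always
print(toy_language_model("5+7="))  # usually "12", occasionally not
```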

This brings us back to the key question for machine learning products: How do you take a model which is actually pretty frequently wrong and still make that useful?

Human-in-the-loop

One key way we can address the challenge of models being frequently wrong is by integrating a human-in-the-loop into products. This allows a person to evaluate and change the outcome of an event or process. These human-in-the-loop capabilities can comprise either internal teams or end users, enabling products to go to market with contingency for failure modes. As an example, ChatGPT retains a team of moderators to monitor the outputs of its system and ensure alignment between the product goals and the model.

Source: “State of GPT”, Andrej Karpathy, Microsoft Build 2023

Furthermore, ChatGPT has gone through extensive reinforcement learning from human feedback (RLHF), which allowed OpenAI to fine-tune GPT-3.5 to work within a chat interface. To my understanding, this was largely complete before ChatGPT’s launch, in contrast to its near real-time moderator monitoring, but it is indicative of the ability of human feedback to shape models for production.
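
Operationally, a human-in-the-loop often looks like a review queue: model outputs wait for a person to approve or correct them before release. A minimal sketch follows; the data shapes are illustrative assumptions, not Papercup's pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewItem:
    input_text: str
    model_output: str
    final_output: str | None = None  # filled in by a human reviewer

@dataclass
class HumanReviewQueue:
    """Every model output passes through a person before reaching the user."""
    pending: list[ReviewItem] = field(default_factory=list)

    def submit(self, input_text: str, model_output: str) -> ReviewItem:
        item = ReviewItem(input_text, model_output)
        self.pending.append(item)
        return item

    def review(self, item: ReviewItem, correction: str | None = None) -> str:
        """A reviewer approves the output as-is or supplies a correction."""
        item.final_output = correction if correction is not None else item.model_output
        self.pending.remove(item)
        return item.final_output
```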

Feedback loops

In machine learning, feedback loops refer to the process of using the output of a model to influence future inputs, thereby (theoretically) continually improving the model's performance.

Feedback loops are a very important topic when building products, especially from a long-term value creation and defensibility perspective. If machine learning models are frequently wrong, you likely incur some marginal cost from maintaining human oversight (either in terms of money or burden on the user). In theory, your models should be learning over time and improving with increased exposure to larger data sets.

In reality, some feedback loops will improve the performance of your model over time, and others will actively degrade the performance of the machine learning system. Over time they can contribute to model drift, where a machine learning model’s performance deteriorates as the data it encounters shifts away from the data it was trained on.

For reference, think about the challenge of retraining future versions of GPT on the Internet’s data. With more and more websites hosting SEO filler produced by GPT models with little quality control, ChatGPT’s performance could degrade over time due to the output of its system influencing future training inputs.
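
One cheap way to catch degradation like this is to monitor a product-level signal over time, for example the rate at which humans accept model outputs unedited. A minimal sketch, with an invented window size and alert threshold:

```python
from collections import deque

class DriftMonitor:
    """Track a rolling human-acceptance rate as a crude proxy for drift."""

    def __init__(self, window: int = 500, alert_below: float = 0.85):
        self.outcomes = deque(maxlen=window)  # True = output accepted unedited
        self.alert_below = alert_below

    def record(self, accepted: bool) -> None:
        self.outcomes.append(accepted)

    @property
    def acceptance_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def drifting(self) -> bool:
        """Alert only once a full window of recent outcomes has been seen."""
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.acceptance_rate < self.alert_below)
```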

Inference

Once a model has gone through an (often lengthy) training process and has been deployed, it can then be queried by users. Inference is the process by which a trained model produces outputs for new, unseen data. In the context of model inference there are three key considerations:

Latency refers to the time taken for a machine learning model to process an input and return an output. Understanding latency is a key factor in informing product UI as well as the model selection and infrastructure for optimal inference time.

Midjourney's (v5.2) current latency for image generation sits around one minute. It is not the fastest text-to-image generator available, but users choose Midjourney’s latest version for its features and high-quality output rather than its speed of generation.

GitHub’s Copilot, on the other hand, must provide near real-time latency in order to keep up with the speed at which a developer writes code. In this instance, a smaller model optimized for speed may provide a better overall user experience, at the expense of more frequent failure modes.

Cost will be incurred each time the model is utilized. After deciding on the optimal model, size, inference speed and UI you must grapple with the unit cost of your machine learning product. Multiplying matrices is more computationally expensive than database look-ups.
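
As a back-of-the-envelope example of unit cost for an API-based feature (the per-token prices below are placeholders, not any provider's actual rates):

```python
# Placeholder prices, assumed for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.0010   # USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.0020  # USD

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Unit cost of one model call under the assumed token prices."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# e.g. 2,000 prompt tokens and 500 completion tokens per user action:
per_action = cost_per_request(2_000, 500)  # $0.003
monthly = per_action * 50 * 30             # 50 actions/day over a month
print(f"${per_action:.4f} per action, ${monthly:.2f} per user per month")
```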

If costs are of particular significance to you, read Matt Bornstein (a16z) in the links below for a more comprehensive overview.

The performance of a model takes into account how well it is executing on its assigned task, and how frequently failure modes are encountered.

You do not need to rely on one model only. Systems can first query fast (or even on-device models) to generate an initial response. If the output is unsatisfactory, the product can then call on a more powerful version of the model from the cloud.
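
This cascade pattern might look something like the following sketch, where the fast model's own confidence score decides whether to escalate; the 0.8 cutoff is an illustrative assumption.

```python
from typing import Callable

def cascade(fast_model: Callable[[str], tuple[str, float]],
            strong_model: Callable[[str], str],
            x: str, good_enough: float = 0.8) -> str:
    """Try the cheap, fast model first; escalate to the expensive one
    only when the fast model's confidence is too low."""
    answer, confidence = fast_model(x)
    if confidence >= good_enough:
        return answer          # cheap path: ideally most requests stop here
    return strong_model(x)     # expensive path: bigger cloud model
```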

Tradeoffs in inference are usually a three-way triangle between cost, latency and performance. Optimizing any two of these factors will result in a reduction in the third.


  • You deploy your model on CPU as is: you get the same performance and cheaper cost, but take a hit on latency
  • You optimize your model for int8 and deploy: good latency, cheap, but a slight hit on performance (see the sketch after this list)
  • You use GPUs all the way through: good performance, good latency, but you take a hit on cost
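
As a concrete instance of the int8 option, PyTorch's dynamic quantization converts a model's linear-layer weights to int8 in one call. The toy model below is illustrative:

```python
import torch

# Toy model standing in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# Store Linear weights as int8: cheaper and faster on CPU,
# at the cost of a small accuracy hit.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, lighter inference
```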

Deflationary AI

Daniel Gross, the former head of ML at Apple, wrote one of the more compelling essays about the real-world implications of machine learning (source). Software, despite changing our world significantly over the last two decades, has made only a marginal impact on economic productivity indicators. Gross argues that with new machine learning applications capable of work that would otherwise have to be done by a human, we are far more likely to see an economic impact as machine learning models are deployed in products. He articulates huge productivity gains as the possible impact of deflationary AI.

Drawing from our reference products, Papercup and Synthesia are both good examples of this possibility.

Papercup dubs videos (using a combination of human effort, software and ML) at a fraction of the cost of the traditional dubbing industry. Papercup’s expressive range does not yet surpass that of voice actors, but benefits in lower cost, scalability and additional services make up for this. The dubbing industry is growing its capacity orders of magnitude slower than content production is growing. To make the world’s content accessible to all, we must rely on a mix of human intelligence and ML applications.

Synthesia enables users to create simple videos, with human avatars, without having to pay an actor and studio. The avatars are less emotive than real actors. However, the ability to quickly iterate through versions and make small changes, without having to pay for a studio, team and actor for every adjustment, far outweighs these limitations.

ML applications look like disruptive innovations in the Christensen sense - “smaller companies with fewer resources… begin by successfully targeting those overlooked segments, gaining a foothold by delivering more suitable functionality - frequently at a lower price”.

Creating structured data from unstructured

The power of AI is its ability to turn unstructured data into structured data, such as taking a video file and generating meta tags and text that relate to the video's topic. Or, as Matt Hartman of Factorial Capital writes, taking information that would otherwise be "at worst sitting in a document" or "at best typed into a database" and structuring it so that it is organized into data that can be queried. Hartman uses this example: "We could compile a list of all of the NDAs which are coming up for expiration in the next three months."

Source: Matthew Hartman substack

At Papercup, we use LLMs to structure data about the action in a video. This queryable data is invaluable for multiple parts of our process.
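
A minimal sketch of the pattern, using the OpenAI Python SDK; the model name, prompt and output schema are illustrative assumptions, and a production version would validate the JSON before trusting it:

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Extract structured metadata from this video transcript.
Respond with JSON only, with keys: topics (list of strings),
num_speakers (int), on_screen_actions (list of strings).

Transcript:
{transcript}
"""

def structure_transcript(transcript: str) -> dict:
    """Turn free text into queryable fields via an LLM call."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user",
                   "content": PROMPT.format(transcript=transcript)}],
    )
    # Note: real systems should validate/repair this JSON before use.
    return json.loads(response.choices[0].message.content)
```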

Automation neglect

User access to high-performing machine learning models will not always deliver the anticipated impact on productivity, absent thoughtful design. This may require iterations on UI and UX to ensure the model can be used in the most appropriate manner.

Researchers at MIT and Harvard uncovered this bias, where users implicitly discount ML predictions in favor of their own judgment, which they termed “automation neglect”. Additionally, they found that user efficiency per task (in their study of radiologists) actually went down when users were shown ML predictions.

They correctly identified this as a major barrier to effective human-ML collaboration, and it will be a key challenge for any team building ML products. There is not yet a standard playbook for how to overcome this bias and teams must take a thoughtful approach and test with users to arrive at the optimal solution.

Metrics

Often in ML companies, accuracy and other model-related metrics are monitored and applied directly as product metrics. In our experience, these tend to be vanity metrics that do not accurately capture the degree of value added for the user.

What matters is the overall usability of the model and whether it crosses the "threshold of human acceptance". It is important to identify key product metrics that quantify the product’s ability to solve jobs-to-be-done for the customer, rather than focus solely on model performance.
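
For example, at a dubbing company a product-level metric might be how much a linguist had to edit the model's draft before delivery, rather than a raw model score. A toy sketch (illustrative, not Papercup's actual metric):

```python
import difflib

def edit_burden(model_output: str, delivered_output: str) -> float:
    """Fraction of the delivered text a human had to change: 0.0 means
    the model's draft shipped untouched, 1.0 means a full rewrite."""
    similarity = difflib.SequenceMatcher(a=model_output, b=delivered_output).ratio()
    return 1.0 - similarity

print(edit_burden("Hello world", "Hello world"))  # 0.0 — no edits needed
print(edit_burden("Hello world", "Hallo Welt!"))  # closer to 1.0
```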

Implications for product & startup economics

Machine learning products have the potential to unlock markets previously untapped by software. Having looked at useful mental models and the potential of machine learning products out in the wild, the question inevitably becomes - how do you make money here (except by being NVDA)?

Daniel Gross mapped the relationships below when he explored where margin may accrue in the AGI supply chain:

Zeiss (Inputs) -> ASML (Suppliers) -> TSMC (Fabrication) -> NVDA (Fabless) -> MSFT (Cloud provider) -> OpenAI (Model creator) -> Datasets (Data) -> ApplicationCo (Applications) Source

Papercup is narrowly focused on expressive speech generation. Achieving this requires us to integrate the final three rungs of the chain: model creation, data and applications. This is by no means the right approach for every product, but it is the one which makes most sense in our strategic context.

There remain many interesting open questions as to where margin will accrue in the value chain:

  • Will OpenAI pursue its ApplicationCo consumer business, or will it refocus on building foundation models for others?
  • Can any ApplicationCo prove defensibility via data network effects?
  • Can powerful-enough models be run on-device for consumers, reducing the inference costs and improving unit economics?
  • What tooling will emerge to help the processing of rich media like images, audio, and video, which are common in machine learning?
  • Will any inference move from the cloud to on-premise to more closely manage unit costs?
  • How will companies profitably ‘tame the tail’ of long-tailed distributions, which are ‘extremely common in machine learning, reflecting the state of the real world and typical data collection practices’? Source
  • After the initial bubble, how will investors' expectations for margins reset within machine learning products?
  • More generally, how defensible are machine learning products?

It looks extremely likely that machine learning products will continue to shape our world, long after the LLM hype has disappeared. Hopefully some of our experiences so far at Papercup will prove helpful to some people out there.

Thanks to my colleagues who contributed (both directly and indirectly), the writers I have linked to, as well as ChatGPT for helping get this written.

If you know any product people or entrepreneurs interested in joining a community of ML product people, please reach out to me directly: kilian@papercup.com

Reading list & References

https://base10.vc/research/generative-ai

https://aigrant.org/

https://a16z.com/2020/08/12/taming-the-tail-adventures-in-improving-ai-economics/

https://a16z.com/2022/11/16/creativity-as-an-app/

https://future.com/new-business-ai-different-traditional-software/

https://stratechery.com/wp-content/uploads/2022/10/DanielGross-AI-native-deflation.pdf

https://www.intercom.com/blog/podcasts/intercom-on-product-chatgpt-ai-ml-revolution/

https://venturebeat.com/ai/how-ai-iteration-can-uplevel-the-customer-experience/

https://blueprintcdn.com/wp-content/uploads/2023/07/Blueprint-Discussion-Paper-2023.10-Agarwal-Moehring-Rajpurkar-Salz_2.pdf

https://madewithml.com/courses/mlops/product-design/
