How to be a good machine learning product manager

There are a lot of interesting meetups at Seattle, and I try to attend one every couple weeks. Ruben Lozano Aguilera was the speaker for this meetup on Oct 17. Ruben is a product manager at Google Cloud, and before that he was a product manager at Amazon.

What is ML?

Programming transforms data + rules into answers.

Machine learning turns data + answers into rules.

When should you use ML?

Use ML if the problem:

  • handles complex logic
  • scales up really fast
  • requires specialized personalization
  • adapts in real-time

For example ML is a good fit for the "search" problem.  Search requires complex logic, for which it is not easy to develop rules.  It scales up really fast in terms of new keywords, combinations and content. It requires personalization depending on the context, and has some real-time adaptation component as well.

Another important point is that the problem should have existing examples of actual answers. When you bootstrap from a good enough dataset, you can scale further, because data -> predictions -> customer experience -> more traffic -> more data.

Some popular ML problems are ranking, recommendation, classification, regression, clustering, and anomaly detection.

Don't use ML when your problem:

  • can be solved by simple rules
  • does not adapt to new data
  • requires 100% accuracy
  • requires full interpretability/why-provenance

The data requires further consideration: Can you use data? Is it available, accessible, and sufficient? Is high quality? relevant, fresh, representative, unbiased? Is it appropriate to use the data: privacy, security concerns?

For the following, can you use ML or not?

  • What apparel items should be protected by copyright? No. This is dangerous financially, you need to get 100% accuracy.
  • Which resumes should we prioritize to interview for our candidate pipeline? No, this may be based on biased data.
  • What products should be exclusively sold to Hispanics in the US? No. This is discriminatory and creepy.
  • Which sellers have the greatest revenue potential? Yes.
  • Where should Amazon build next head quarters? No. This is not a repeatable problem; there is only one label: Seattle.
  • Which search queries should we scope for the Amazon fresh store? Yes.

What is the ML lifecycle?

For productizing ML, you need people, processes, and tools/systems.

The people come from two domains:

  • Math, statistics: ml scientist, applied scientist, resarch scientist, data scientist 
  • Software, programming: business intelligence engineer, data engineer, software engineer, dev manager, technical program manager

The ML lifecycle involves 4 phases: problem, data, features, and model.

To formulate the problem, you need to clarify what to solve, establish measurable goals, and determine what to predict.

For the data phase, you need to

  • select data: available, missing data, discarding data (data cleaning)
  • preprocess data: formatting, cleaning, sampling

For the features phase, you need to consider scaling, decomposition, aggregation, and discard any features that are not relevant.

Finally, for the model phase, you first divide the data set into training data and test data, could be 70+30 or 90+10. Then comes the model training (using whatever algorithm you are using), which produces the ML model. You then test this output ML model with the test data.

To productize your model, you should integrate the ML solution with existing software and keep it running over time. At this point considerations about the deployment environment, data storage, security and privacy, monitoring & maintenance come in to play. Some great ML solutions cannot be productized due to high implementation costs or inability to be tested in practice.

The product manager is very much involved in the first 2 phases: formulating the problem and selecting the data. Product manager is also involved in feature selection but not much involved with the final model phase.

MAD questions

1) Umm, deep learning?
Since the presentation didn't mention any deep learning specific problems/tasks/observations, I asked Ruben about what significance deep learning had on the projects he worked on. Turns out, he didn't use any. He said that simpler ML models were enough for the tasks they undertook so he never needed a deep-learning solution. He also said that deep-learning was very expensive up until a couple years ago, and that was also a factor.

With TensorFlow, Google is supposedly using deep-learning a lot, likely more for image and voice processing. But, is there a study about the prominence of deep-learning use among ML solutions in the industry?

2) How do you troubleshoot issues with productizing ML?
As we covered above there are many things that can go wrong, such as  unanticipated bias in your data, in your method, conclusions. How do you check for these? Ruben answered they brainstorm and think very deeply about what could go wrong, and identify these issues. It seems like this needs more processes and tool support. Having seen how TLA+ specifications and model checking create wonders for checking problems with distributed/concurrent systems, I am wondering if similar design level tool support could be developed for ML solutions.

3) How do we learn/teach empathy?
Ruben was a great speaker. He used beautifully designed slides. After all he is a product manager and sympathizes with the users/audience. In Q&A session he mentioned that empathy is the most important skill for a ML product manager. I believe empathizing with your audience also goes a long way in public speaking. How do we learn/teach empathy? This is so basic that you expect/hope we learn this as kids. But it looks like we keep forgetting about this and fail to empathize. Also, there is always levels to things. How do we get better at this?

4) Is ML/DL too application-coupled?
I have a some understanding of ML/DL domain, since I started learning about it in 2016. I am still amazed at how tightly application-coupled is the ML/DL work. On one hand this is good, this makes ML/DL very practical and very applicable. On the other hand, this makes it harder to study the principles and systematize knowledge.


Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Anna: A Key-Value Store For Any Scale

Your attitude determines your success