The Product Manager

Are We Overpromising and Under-delivering on AI?

Hannah Clark - The Product Manager

AI can solve Olympic-level math problems... and still fumble basic arithmetic. So what gives? According to Dhruv Batra, the answer lies in the “jaggedness” of intelligence—how AI can excel in some areas while completely breaking down in others. Dhruv, co-founder and Chief Scientist at Yutori, joins Hannah Clark to unpack the cognitive dissonance users feel when a model dazzles one moment and disappoints the next.

They explore how user expectations—shaped by decades of intuitive UI patterns and human conversations—often collide with the underlying limits of AI systems. From browser agents and automation to long-term feedback loops and trust-building, this conversation is a candid look at what today’s AI can actually do (and where it’s still bluffing). If you’re building with AI or trying to scope what’s possible, this one will recalibrate your expectations—in a good way.

Hannah Clark:

Innovation is cumulative—and by that, I mean that the ways we solve problems now couldn't be effective if not for the ways we solved them before. And while these days, the words 'training data' are typically used in the context of AI development, it's worth remembering that users are also consumers and retainers of enormous amounts of training data from years of discovering and adopting every piece of software they've ever used. So while we are busy obsessing over use cases and new features for our own AI products, users are following a different script. They're operating with preferences, habits, and most importantly, expectations they've picked up since the very first time they opened a web browser. My guest today is Dhruv Batra, co-founder and Chief Scientist at Yutori. As you're about to hear, Dhruv's experience in AI research, development, training and leadership spans over 20 years. So as you can imagine, he's got more fascinating insights on the tech than we could possibly cover in one episode. With that in mind, when I asked Dhruv what he'd most like to communicate to product leaders, he didn't hesitate. He told me it's that AI capabilities are extremely jagged. And you're about to hear exactly what that means for your users, for your organization, and for the near future of product. Let's jump in. Oh, by the way, we hold conversations like this every week, so if this sounds interesting to you, why not subscribe? Okay, now let's jump in. Welcome back to The Product Manager podcast. I'm here today with Dhruv Batra, who's the  co-founder and Chief Scientist at Yutori. Dhruv, thank you so much for joining me today.

Dhruv Batra:

Of course. Thank you for having me, Hannah.

Hannah Clark:

So let's get started with a little bit of background info. Can you tell us a little bit about your background and how your journey through AI research from deep learning into today's generative revolution has shaped your perspective on where we are now with this tech?

Dhruv Batra:

So, I'm an AI researcher. I've been in the field almost 20 years at this point. AI research in modern discussion seems to start around the 2022 ChatGPT revolution; I entered the field in 2005, before the last epoch of deep learning. I got my PhD at CMU working on core machine learning problems applied to computer vision, like detecting objects in images. Over the years, I've built chatbots, built the first systems that could answer questions about images and hold a dialogue about images. I was a professor at Georgia Tech for many years, where I created the deep learning class. I also spent eight years at Meta as a senior director leading FAIR Embodied AI. FAIR is Meta's fundamental AI research division, and Embodied AI is AI for robotics and AI for smart glasses. One of my teams at Meta built the earliest version of an image question answering model that shipped as a multimodal assistant on the first version of Ray-Ban Meta sunglasses. Other teams of mine built the world's fastest 3D simulator for training virtual robots in simulation before we deployed them on the Boston Dynamics Spot robot. So I've seen the spectrum from computer vision to chatbots to robotics, and I'm just fascinated by intelligence and building intelligent systems, and that's what has taken me on my journey today to Yutori.

Hannah Clark:

Clearly you're a very qualified person to speak on this topic, which is something that I think all of us wanna know as much as possible about. And I'm really excited about today's topic 'cause we're gonna be looking a lot more closely at the expectations versus reality when it comes to the state of AI technology right now, and I feel like you need a certain set of qualifications to really be able to speak to this topic and answer some of the questions that are really on our minds. So we're gonna be looking at it today from three angles: the user side, the business side, and the technology side of AI. Starting with the user side, right now we're obviously caught in a huge hype cycle around AI, but users can often experience some wildly inconsistent results depending on the tools and the use cases that they're pursuing. So what do you think is causing the gap right now between what people on the user side expect AI to do and what it can actually deliver?

Dhruv Batra:

I think that's a great question. It speaks to a problem that lies at the heart of, you know, not just building products, but also AI research, and that gets to what is often referred to as the jagged nature of intelligence. As with many of these topics, there's a famous XKCD comic of a PM-like figure asking an engineer-like figure: hey, can you build me an app? Every time a user takes a picture, I want to know whether that picture was taken in a national park. And the engineer responds: sure, that sounds like a simple GPS-based lookup into a database, give me a few hours, this should be doable. And then the next sentence from the PM is: let me know if the picture is of a bird. And the response from the engineer is: I'll need a research team, $50 million, and five years, and maybe we can answer that question. Now, the specific example in that story is no longer valid. Computer vision has made enough progress that we now consider detecting species of birds or dogs a solved problem. But the point it's trying to illustrate is that there are extremely sharp transitions from trivial problems to impossibly hard problems. And that sharpness is difficult for people to conceptualize, to predict. This is true not just for users of technologies. It's also true for builders of technologies, builders of products, and it's also true for AI researchers. It's not really the credentials that matter, but when you spend time building this technology, different researchers end up building mental models of what machines can and cannot do. And in today's world, for example, we joke about how we have built chatbots that can answer International Math Olympiad questions, yet simultaneously they make mistakes, like saying 9.11 is greater than 9.9, that no human would make. But that is just the kind of mistake that chatbots make. So where does that leave us? First of all, why does that happen? It happens for a few different reasons. We are building intelligent systems that are at a different point in the intelligence landscape, and humans approach intelligent systems with their human understanding from dealing with other humans. When I talk to a person, they tell me that they have gone to high school or college or university, or have a PhD. At different points on that spectrum of expertise, I expect different things from them. When someone says that they have a PhD in chemistry, I don't expect them to make a mistake of the kind that, you know, 9.11 is greater than 9.9. I just expect them to be mathematically numerate, generally well-informed about the world, and so on. Those expectations break down when we are dealing with AI systems, and they break down because we cannot rely on the same shared assumptions. Performance on certain tasks requires training for those tasks. And even though we have built general-purpose systems over the last few years, there is a very specific thing that we mean when we say generality. And that makes it very hard for consumers to build mental models. And so there is this, perhaps, frustrating experience where people arrive at a product, it says it can do a lot of things, you ask it to do the thing that is listed on the website from the manufacturer of that product, and maybe it does it; you ask it a slight variation of that question and it's unable to do it. And so that can be a frustrating experience.
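A tiny Python sketch makes the 9.11 versus 9.9 point concrete. This is an editorial illustration of the ambiguity, not a claim about how a language model computes internally: the same pair of numbers compares differently depending on whether you read them as decimals, as strings, or as software-version numbers, and only the version-number reading makes 9.11 come out "greater."

```python
# Illustration only: three reasonable-looking readings of 9.11 vs 9.9 disagree.

as_decimals = 9.11 > 9.9            # False: 9.11 is less than 9.90 numerically
as_strings = "9.11" > "9.9"         # False: character by character, "1" < "9"
as_versions = (9, 11) > (9, 9)      # True: "version 9.11" comes after "version 9.9"

print(as_decimals, as_strings, as_versions)  # False False True
```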

Hannah Clark:

Absolutely, yeah. And this is such a new consumer behavior as well, whereas, you know, we're kind of trained on features that are very specific, very intuitive, very easy to use. So there's definitely this matter of people calling upon their existing training from interacting with chat or with another human being and applying those expectations to a feature that's largely undefined. We don't really quite grasp the limitations across different, you could say, competencies. So yeah, it's a very complex technology that we're all kind of learning how to use together. So when we think about everyday tasks that AI could automate, things like booking travel, managing schedules, those kinds of things, what makes tasks like that harder to solve than people assume?

Dhruv Batra:

So I'll use Yutori and what we're building as an example. At Yutori, we're building personal assistants that can automate mundane workflows on the web. Our first product is called Scouts. It's a team of agents that monitor anything on the web for you. And it was extremely important to us that we clearly state this expectation: this product monitors a piece of information. You cannot ask it to book anything or buy something for you. It'll not create slides for you, it will not do your homework, it will not write code for you. It's not everything that you can do on a browser. But what it can do for individual consumers is, let me know when my favorite artist comes into town. They might announce it on a few different websites; I might go to those websites at some frequency, and I would just like you to check them at that frequency. Or maybe I'm looking for reservations for something that requires filling out some lightweight form on a browser and clicking buttons; I would like this agent to do that for me, and then tell me what information is available. Maybe I'm a recruiter and I'm tracking role changes by a particular set of people, and whether they announce it on X versus LinkedIn versus a blog post, let me know when that happens. So why is this hard? This seems like a trivial thing: humans sit down, open their browser, go to a particular page, fill out certain things. Fundamentally, these problems are hard because they are what are known as sequential decision-making problems. You are in a particular state, maybe you're on a webpage, and you have to take a few different actions. Websites are laid out for human consumption, and the underlying HTML is wildly inconsistent about how buttons are annotated or labeled across websites. So fundamentally, this is a perception problem. You have to click a button, then something happens; you maybe scroll the page, fill out something, something happens. Any mistake that you make along the way simply cascades, and earlier mistakes lead to later failures. This is the same kind of problem that the robotics and self-driving industries dealt with. We realized that if robots make a mistake, those mistakes cascade on themselves. If you're slightly deviating from a lane, now you are no longer at the center of the lane, and you have to do a course-correction maneuver. Similarly, for the browser automation agents that we are building, if you have ended up in a part of a webpage that is either hung or where you're not supposed to be, you're not gonna find the response there. So there's an error recovery that you have to learn when you're operating out in the wild. There are also read-only tasks versus write tasks. If you fill out a form and you click submit, some websites will not let you go back and fill it out again, which means that's an irrecoverable mistake that you have made. Training for irrecoverable mistakes is difficult, and for that you have to create replicas of the real world. This is what roboticists deal with: creating 3D simulators of the world, almost like virtual gameplay, training virtual robots in simulation, then deploying them in the real world. This is what we do with browser automation agents as well. When we have to train them to fill out a form and click submit, or maybe buy something on the web, those are going to be irrecoverable errors if you make them. So you have to train in simulation.
Those are some of the things that make these problems difficult, and it's often difficult to know what you did along the way, as an AI agent, that contributed to your successes or failures. That's known as the credit assignment problem.
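As a rough, hypothetical sketch of that sequential decision-making framing (not Yutori's actual system): the agent acts one step at a time, mistakes cascade through the state, and irrecoverable write actions are only practiced inside a simulated replica. The `policy` and `env` interfaces below are assumptions for illustration.

```python
# Minimal sketch of a browser-agent episode: sequential actions, cascading
# errors, and a distinction between recoverable reads and irrecoverable writes.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "click", "scroll", "type", "submit"
    target: str      # e.g. a selector or a description of the element
    is_write: bool   # does this action change external state?

def run_episode(policy, env, in_simulation: bool, max_steps: int = 50) -> bool:
    """Roll out one task. `policy(obs)` returns the next Action; `env.step(action)`
    returns (next_obs, done, success). Both are assumed interfaces for this sketch."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        if action.is_write and not in_simulation:
            # In the real world, an irrecoverable write is the point of no return;
            # during training these are only taken inside the simulator.
            return env.confirm_and_commit(action)   # hypothetical final safety check
        obs, done, success = env.step(action)       # an early mistake cascades from here on
        if done:
            return success
    return False   # ran out of steps: likely stuck on a hung or wrong page
```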

Hannah Clark:

All of these things, you know, as humans we're at this point well trained to do some of these procedures. It seems like a simple task, but on a technical level we're looking at a much more complex process. And that doesn't even take into account things like preferences. You know, what time? Where in the restaurant do you wanna sit, do you wanna sit at the bar? There's all these other kinds of considerations that I can imagine are just impossible to handle from a coding perspective.

Dhruv Batra:

Here's a small example that communicates this. Humans are used to certain design patterns. For example, when you go on a webpage and maybe you're trying to book a reservation or an appointment, there's a common design pattern where, if a date or a time slot is grayed out or struck through, you understand that it is not available, even though there is no text above it or around it saying that this slot is not available. You understand it because you've been exposed to that design pattern on websites you have seen, or in various pieces of writing you've come across over the years. How do machines understand this? Well, they may have read a lot of books, but you have to interact with websites to understand that grayed-out text means something. And this is just one example of the kinds of design patterns that are meant for human consumption that machines have to absorb. That clicking on this button will not do anything, and there's no text describing its purpose; you just have to know what it means.
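A toy heuristic in Python illustrates the problem, purely as an editorial sketch rather than a production approach: an agent trying to recover the "grayed-out means unavailable" convention has to guess from markup signals, and the class names and attributes checked below are assumptions that vary from site to site.

```python
# Heuristic guess at whether a time-slot element looks available to a human.
# The hint class names below are assumptions; real sites differ wildly.

from bs4 import BeautifulSoup

UNAVAILABLE_HINTS = {"disabled", "unavailable", "greyed-out", "grayed-out", "sold-out"}

def slot_looks_available(slot_html: str) -> bool:
    slot = BeautifulSoup(slot_html, "html.parser").find()   # first (only) tag in the snippet
    classes = set(slot.get("class") or [])
    style = (slot.get("style") or "").lower()
    if classes & UNAVAILABLE_HINTS:
        return False
    if "line-through" in style or slot.get("aria-disabled") == "true":
        return False
    return True  # no negative signal found; still only a guess, not a guarantee

print(slot_looks_available('<button class="slot disabled">6:30 PM</button>'))  # False
print(slot_looks_available('<button class="slot">7:00 PM</button>'))           # True
```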

Hannah Clark:

This is so interesting. This reminds me of a conversation way back with Nimrod Priell, who's the Founder of Cord, when we talked about the evolution of user behavior, and how these incremental changes in how we understand UX elements and the general layout and design of websites and technology over the years are kind of this compounding asset that all of us really take for granted. It's kind of a shared language at this point, built up over many years of technology progression, and it's very difficult to communicate to a machine. So I just think this is such a fascinating area, and I wanna dig a little bit deeper into the consumer behavior side as well, as an extension of some of these behaviors and patterns that we've internalized over time. Now, this is an ongoing process. So what are some of the shifts in how people interact with technology that product leaders should be preparing for in the near future?

Dhruv Batra:

The emergence of AI products on the consumer market has certainly shifted people's expectations. There are now children growing up who will just expect to be able to talk to machines. There's always this futuristic drama episode or, you know, science fiction trope where children growing up in technologically advanced civilizations, if they're exposed to an older technology, wonder: why can't I talk to my TV? Why is this not understanding me? I think we're seeing that shift in expectations in consumer behavior as well. You just feel like you want to be able to express yourself; you feel like, I should just be able to talk to the machine. It should have certain general-purpose capabilities, it should be able to hold a coherent dialogue, it should understand my usage patterns. And that's kind of what motivated our work and vision at Yutori as well. We see the evolution of the web in the last 30 years as just incremental advancements over a core technology of connecting content and services to humans. The web is primarily laid out for human consumption, and that's because it has just been human eyeballs on the web. Now people just expect to be able to tell the machines what they want the machine to do on their computer and on their browser. Why should I, as a person, sit down, click buttons, fill out my name, my address, my credit card details in order to buy something or to procure some information? This should be something that is automatable, and I think that's what we're seeing as a shift in consumer behavior. This idea that I should have 30 tabs open to search for this one item that I'm looking for, read 20 different reviews; already, people just want to ask a deep research system or a monitoring system, why don't you let me know when this happens? The next step after that is, okay, if you have told me that my favorite artist is coming into town and they're performing on Friday, why don't you go ahead and buy the tickets for me? Why am I sitting there mucking through forms and things? I think the shift in consumer expectations is moving up a level in abstraction: talking to software, expecting software to automate the mundane pieces of their lives, and it almost becomes a task list with superpowers. And I think there will be a notion of proactivity. We don't want to sit down and explain every single time, here is who I am, here are my preferences. You know, there's a notion of memory. Once you have memory and personalization, why don't you do something proactively? Why am I having to sit down and say things? So it's sort of like everybody gets an assistant, a superpowered employee, or a chief of staff.

Hannah Clark:

And you can see how some of the technologies that we interact with every day kind of contribute to that as well. You think about the For You page on TikTok, where the technology learns your preferences, learns the things that you're likely to interact with and engage with. And we kind of apply that same logic to the technology that we're using now, which we know knows a lot about us, a lot about our preferences and habits. So yeah, it's interesting how our technology landscape is sort of training some of these expectations. I think those are interesting relationships to be paying attention to in terms of anticipating what consumers are going to expect, which I think is a great segue into the business side. So right now, no surprise here, we see a lot of companies that are rushing to market with AI products and AI features, and they're, you know, very quick to promise transformative capabilities, to varying degrees of success. What would you say are the biggest mistakes that you see product teams making when it comes to scoping and positioning their AI features?

Dhruv Batra:

I think this, again, goes back to the jagged nature of intelligence. You have to be extremely careful. This affects not just the consumers, but also the builders. You have to be extremely careful not to promise the sky, because you will not be able to deliver it on day one. Yet at the same time, the expectations of generality from consumers are climbing. They expect you not to be an extremely narrow application, because ChatGPT answers any question, so why can't you do anything I ask? And so there is this trap of falling into the design pattern of a text box as the entryway into anything, and you don't tell your user anything, you promise the world: my agent can do anything. That leads to frustration from the users because, first of all, they are dealing with the problem of just a blank canvas. What are the kinds of things that I can ask here? And if I'm not calibrated, I will ask for things that the agent will not be able to do, and your users will be frustrated, for sure. This is one of the reasons why, for our first product, we crafted a fairly narrow scope of capability. Scouts are agents that monitor anything on the web for you. They don't log into services and they don't take write actions there. It's a read-only monitoring product. However, we did not say this is Amazon price monitoring or this is Ticketmaster event monitoring. Any digital information that is available on the web, that you could open up a browser and get access to, these agents will tell you about, and you will get an email whenever that happens. So just tell us in natural language, and at whatever frequency you'd like it to be monitored. The reason why we did that: it was extremely important for us to deliver this capability, and this is a read-only capability, so we're not making irrecoverable mistakes. If we make a purchase decision on your behalf, you will get frustrated if the wrong thing is purchased for you. Yet there is a certain generality here in the kinds of queries you can ask and the surfaces you may be expecting this information to appear on. From there on, you have to climb the staircase of trust. Initially, we delivered value without asking you for any logins or credit card information. However, after you've seen certain value, your users will naturally expect you to be able to do more. If I'm tracking an artist coming into town, the next step is get me a ticket. If I'm tracking the availability of a restaurant reservation, the next thing is to make that reservation. If I'm a recruiter and I'm tracking the movement of a particular candidate, the next step is drafting an outreach email. And so from a builder's perspective, I am a newbie, I'm an AI researcher, I don't feel like I'm in a state to be able to give advice. I can just point to the caution that we adhere to, which is that there is a jagged nature of intelligence. Some tasks you're going to be able to solve, some not. Typically the tasks that you are able to solve are the ones where you have the ability to practice, and therefore it has to be tasks where the mistakes are not too costly, and you build from there incrementally.

Hannah Clark:

Okay, these are very wise words, and I see this often. I completely see what you mean in terms of frustrating the users with limitations that are just opaque to the user's experience. You know, they're walking into a chatbot and they're going to naturally interact with it the way that you would expect a chat with a live agent, and that can lead to a lot of frustration. So there's a cost-benefit analysis to not limiting the scope of what's possible and taking the risk that your consumers are going to turn away, not return, and have a lower degree of faith across the board in the technology.

Dhruv Batra:

And naturally, if they don't find value in what you said you could do, they will have suboptimal experiences and they will churn. They will not come back to you.

Hannah Clark:

Yeah. Okay, so let's dive a little bit deeper into that. I think that trust is sort of a core element of why people are churning at that specific critical moment. How should product leaders think about building trust with consumers without overpromising capabilities that aren't ready yet or that they can't deliver on?

Dhruv Batra:

I think this goes back to the previous question that we were talking about. People have to see value before they hand over credentials or sensitive information like credit cards. There are certain apps out there that have tried playing this where, as the very first step, it's give me your calendar, give me your email inbox access, give me a couple of other logins, before you even see what this product is capable of doing. That is an extremely risky strategy. It might make you go viral on social media, but put yourself in the shoes of that user: do I really want to hand you my work email, which has sensitive documentation on it, or my credit card, without knowing what you can or cannot deliver? And so that's why where we started was no authentication, no writing, no changing the state of the world initially. It's just reading. And what that means is that, you know, AI isn't promising you a hundred percent accuracy, which means you can retry if you have made mistakes. Retrying is possible in a read-only product. It's not possible in a write product with irrecoverable mistakes. So those are the kinds of things I would, and we certainly do, keep in mind when we have to climb that staircase of trust with our users.
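A small, hedged Python sketch of that retry point (an editorial illustration, not Yutori's code): wrapping a read-only check in retries is safe precisely because nothing external changes, which is exactly what you could not do around a purchase or a form submission.

```python
# Retrying is only acceptable for read-only work: `check` is any callable that
# changes no external state, so repeating it after a failure costs nothing.

import time

def retry_read_only(check, attempts: int = 3, delay_seconds: float = 2.0):
    """Retry a read-only check a few times; never wrap a write action this way."""
    last_error = None
    for _ in range(attempts):
        try:
            return check()
        except Exception as error:  # sketch-level handling; real code would catch narrower errors
            last_error = error
            time.sleep(delay_seconds)
    raise last_error
```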

Hannah Clark:

I can see this kind of notion. This is an unrelated example, but I feel like it illustrates a similar point. I had a friend who moved to Canada from Brazil, and she thought Canada was the safest country in the world, everywhere that I go is safe. And then in her first week in Canada there was a robbery on her street, and suddenly she thinks everywhere is unsafe. So it's a similar kind of notion of...

Dhruv Batra:

Time to value,

Hannah Clark:

Time to delight, time to value, but also how fragile and kind of fraught that initial trust-building period can be when something interrupts your expectation of what is going to be delivered and shakes that foundational trust. So I wanna dig in then to the technology side; I think this is a really good time to switch it up. From your perspective as an AI researcher, which problems would you consider largely solved today? And which, and I know this is loaded language 'cause it can always shift, would you say feel very much on the near-term horizon versus, you know, let's say perpetually a few years away?

Dhruv Batra:

The concrete example I'll use here, because it's on my mind, is the problem of answering questions about images. The reason it's on my mind is that last week I was at a conference, ICCV, the International Conference on Computer Vision, where my collaborators and I received what is known as the Mark Everingham Prize for work that we had done a decade ago. The work was called Visual Question Answering. We introduced to the community a dataset, a task, a benchmark, and a set of methods for building the first generation of agents that could answer any open-ended question about any natural image. Over the last 10 years, we've helped the community track progress on this by organizing annual competitions. We stopped organizing that competition in 2021, because when we started in 2015, most methods were, as you can imagine, incredibly poor at this task of answering questions about images, but by 2021, on the dataset that we had created, we had basically matched human accuracy. We had reached inter-human agreement in answering these questions, and so we stopped organizing that competition. And as I'd mentioned, it's an interesting lifespan where, in the course of the last 10 years, I found myself leading a team at FAIR that built modern methods that shipped on Ray-Ban Meta sunglasses, where you can invoke an assistant and say, hey Meta, take a picture, tell me more about this monument. That is a nice loop closure from 10 years ago. And when we started, there was just a host of problems that were completely open. Answering questions that required reading text in the wild was hopeless. If you asked, what does the sign say, most of those methods weren't running OCR, optical character recognition, and so they couldn't read the text on the sign. And so what the methods ended up doing was answering based on priors. What do signs usually say? Signs say stop or go. And so they would just, you know, make common guesses. We found that this was a common problem, that most image question answering models were heavily dominated by linguistic priors. So if you took a picture of bananas and you asked what color are the bananas, the model was most likely going to say yellow, because most bananas in the world are yellow. It knows that from the training dataset. It actually can't see very well; you can think of it as squinting at the image. So it could be a picture of a green banana and it will still say the banana is yellow, because most bananas are yellow. We had extended question answering to dialogue with chatbots, and coreference was an extremely hard problem. If you ask, is there a person in the image? Yes. What are they doing, or what is he doing? Now, that is a coreference to a visual entity that we just talked about in the previous round. That was a hard problem. The model was just confused; it didn't know what he referred to or what they referred to. Those problems today are considered solved; to the degree that we can measure, these are not open problems. However, there are still some problems that are extremely open. Counting objects in images is still an open problem. Take a picture today where there are more than 10 people in a crowd, and I say 10 because with a small number of objects you can make reasonable guesses. But take a shot with more than 10 people, or a crowd shot, upload it to your favorite chatbot, ask it how many people there are in the image, and just look at what response you get. That is still an open research question.
Asking about 3D spatial understanding is another. For example, take a picture where there's a table at the far end and maybe a bookcase closer to the camera, and ask about the height of the table and the height of the bookcase. What most chatbots today will answer is based on pixel heights: whatever seems closer to the camera, they will say that one is taller, because they don't have a spatial understanding of depth, that things that are really far away can actually be taller than things that are closer to the camera, because there's a 3D world behind the scenes. Those are still open questions. And closer to AI and building agents, for example web agents, there is still a notion of drift that happens in agents over time. You know, Yutori builds monitoring agents, right? Our monitoring agents run for months monitoring certain topics. So you may ask for a particular news topic and monitor it over months, and what can happen is that slowly the agents drift into tracking something that deviates from your original request, because nobody has built agents that run for months at a time such that we can do credit assignment and reward them for successes or failures on those long horizons.

Hannah Clark:

This is a really interesting pocket here, because I hadn't thought about what the criteria are for telling an agent that it has done something correctly. Generally speaking, when you get a correct output, you move on with your life. You don't necessarily say, good job, that was great, or give it, you know, details about what it did correctly or not. So this is interesting about feedback loops for the technology. Is there a user behavior that we should be adopting in order to better train the models that we're currently depending on? Or, and I feel like this could be a whole other show, how do we close the feedback loop?

Dhruv Batra:

Internally, we do evaluations and feedback at multiple levels of abstraction. So for browser automation agents, or browser-use agents, we have to manually annotate, for example, every click it does: was it the right thing to do on that webpage or was it the wrong thing to do? Now, as you can imagine, that's too low-level, it's too noisy. Sometimes a task can be done in multiple ways: maybe sometimes you type an item into a search box on a webpage, maybe sometimes you click on a tab directly to go to that source of information. So that's too low-level, too noisy. However, after a task is done, maybe you asked an agent whether a 6:30 PM reservation was available at a restaurant or not, and it clicked through a bunch of buttons, found the restaurant, and said yes or no. At the end of that trajectory is a proof of work, and in many of these tasks there is what is known as a generator-verifier gap. Verifying proof of work is easier than actually solving the task, because solving the task requires clicking lots of buttons, typing the right thing in the right search box, going to the right place on the website. But if an agent comes back to you and says, yes, I have found a 6:30 PM reservation, there's a screenshot of a webpage. We can read that webpage. We can see whether the agent was right or not. And so we build evaluations based on that. Then of course there are consumer-facing feedback mechanisms, where every time our agents send emails, people can do thumbs up and thumbs down. This then naturally ties into personalization. Sometimes there aren't right or wrong answers; sometimes there are just preferences. If you are tracking a certain news item and maybe you're just bored of a certain direction that news is heading in, if you were talking to a person, you would say, no, I wanna see less of this, I wanna see more of that. And ideally, you wanna say that in natural language, because that's convenient. And so you have to build in feedback mechanisms that take in natural language feedback and change the course of agents operating in the future.
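To make the generator-verifier gap concrete, here is a small, purely illustrative Python sketch (the function name and page text are assumptions, not a real evaluation harness): checking the agent's final proof of work is a one-line comparison, even though producing the trajectory that got there is the hard part.

```python
# Verifying the agent's claim against its proof of work (the page it ended on)
# is far cheaper than generating the whole click-and-type trajectory.

def verify_reservation_claim(agent_says_available: bool, final_page_text: str,
                             slot: str = "6:30 PM") -> bool:
    """Check the agent's yes/no claim against a text rendering of its final screenshot."""
    evidence = slot in final_page_text and "unavailable" not in final_page_text.lower()
    return agent_says_available == evidence

# Example: the agent answers "yes" and hands back the text of its final page.
final_page = "Table for 2 on Friday - available times: 6:00 PM, 6:30 PM, 7:15 PM"
print(verify_reservation_claim(True, final_page))   # True: claim matches the proof of work
```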

Hannah Clark:

This is so interesting. I'm thinking about this in terms of how people tend to look at the outputs of something like ChatGPT through the lens of "are we happy with this as a service" rather than "are we happy with this as a technology." So for example, if I'm next to a colleague and the two of us are generating, let's say, meal prep plans for the week: I generate a meal prep plan and they generate a meal prep plan, and it's the same meal prep plan, but I feel like it hasn't really got the right macros for what I was trying to target, and just from a service perspective, it's a thumbs down. But the person next to me says, well, it did exactly what I asked, the output is technically correct for the specifications in the prompt, this is a thumbs up. It sounds very confusing, and there's a real nuance there about how we understand what our role as givers of feedback to these machines is. That's a whole other data set that can be very difficult to work with.

Dhruv Batra:

The traditional notions of A/B testing, you know, anytime you're building a feature, split users into two groups and show one set of responses to one, or, as another strategy, show parallel responses to the same user and ask them to pick one, that breaks down in a lot of cases. For example, anytime you have to execute a workflow and buy something or do something, there's no A/B test, 'cause you're not going to do the same thing two times; your user would be frustrated by that. Second, sometimes thumbs up and thumbs down, or A/B reactions, are just too coarse for a user to really convey what they're trying to say. Sometimes you want to give them the ability, if you're sending them textual responses, to highlight something and say, not this, or to be more editorial in their feedback, as they would be with another human. You know, if you were taking a pass on someone's Google Doc, you wouldn't just give them a thumbs up or a thumbs down.
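As a sketch of what feedback richer than a thumbs up or down might look like, here is a hypothetical Python data structure (the fields and example values are assumptions, not a real product API): it records a coarse rating if one was given, the span the user objected to, and free-text steering for future runs.

```python
# Hypothetical schema for span-level, natural-language feedback on an agent run.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentFeedback:
    run_id: str
    thumbs_up: Optional[bool] = None          # coarse signal, if the user gave one
    highlighted_span: Optional[str] = None    # the exact text the user objected to
    note: Optional[str] = None                # free-text steering for future runs

feedback = AgentFeedback(
    run_id="scout-123",
    highlighted_span="coverage of ticket resale prices",
    note="Less of this angle; I mostly care about new tour date announcements.",
)
print(feedback)
```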

Hannah Clark:

Oh, that would be horrible. A thumbs down? What part of it is thumbs down?

Dhruv Batra:

Exactly. Yeah. This is like, you know, going to your editor and just hearing No, fix it. Fix what?

Hannah Clark:

Yeah, do the whole thing over again. Yeah, oh wow. And of course, you know, that's asking a lot. Even if we did have the capability to, you know, go into an LLM's output or an LLM's response to us and give that kind of critical feedback, that's a lot to ask of a user that just wants it to work. Generally speaking, it's difficult enough to get people to, you know, respond to surveys even if they've got a referral code or some kind of bonus in it for them. Yeah, it makes for a very difficult co-authoring task to ask folks to help you with. Anyway, what I'm getting from this is that we really need to expect a lot less from you.

Dhruv Batra:

And I think they have to feel like that feedback is an investment into personalization of the chatbot towards something that they want it to be able to do. The user cannot be your QA engineer; they cannot tell you all the things that are wrong with the product. But the experience you want them to have is: this is my assistant, and therefore any feedback I give to this assistant is personalizing it, making it better, and aligning it to my preferences. That feels like the right relationship, the engagement mechanism for why they might put in the time.

Hannah Clark:

Yeah, I would agree. I tend to feel that I'm a lot more patient with a model that is more forthcoming about trying to learn my preferences. Often, when I'm doing a query in some kind of LLM and it right off the bat asks questions to further refine, even if they seem a little trivial, to me it kind of reinforces this behavior that whatever I'm telling it has to be a lot more specific than I anticipated in order to get the output that I want. And it kind of prepares me for disappointment if I'm not quite on base. So I think this is something important to keep in mind when we're developing features that require some sort of give and take from a user in order to get the output that we all want.

Dhruv Batra:

This is where we go back to, you know, climbing the staircase of trust. The very first thing you do to a user who's just trying to experience your product can't be to give them a 20-question questionnaire. They're just trying to get to delight; they wanna build a quick model of what you can do, and then they will iterate from there. So it can't be the very first thing you bombard your user with, but as they see some value, you can inject those questions and preferences.

Hannah Clark:

I'm curious at this point, why now? Why have you started Yutori now? Why is this the moment that you've decided to start this project and what's fundamentally different about the landscape right now that makes this possible to exist versus, you know, you've been in the game for a long time. Why not then?

Dhruv Batra:

I am, of course, susceptible to hindsight bias, but my feeling is that this is a unique time for building certain kinds of AI-powered products that we couldn't build in the past. I was doing robotics before; I could have spun out and started a robotics startup, and there's no shortage of those, there's plenty. But I don't think this is the right time to start a consumer-focused or unstructured-environment robotics company, because the kinds of problems to be solved there are still decades ahead of us. I think people forget: 2004 is when this government agency called DARPA organized the DARPA Grand Challenge, asking universities to build autonomous self-driving cars that could drive from point A to point B in a desert. 2004 was the first time it was organized, and no car finished. I was at CMU in 2005. CMU, you know, went the furthest in the first attempt, and 2005 is when multiple university teams finished. The late 2000s is when a bunch of Stanford researchers got absorbed into Google; it became initially Google X, and then, in the mid-2010s, the Waymo project. And then ultimately 2023 or 2024 is when we got consumer-facing apps, where in San Francisco you can call a Waymo to your doorstep or to certain pickup locations. Think about that journey from 2004 or 2005, the first research prototype demos, to a consumer-facing product that is available, at least in certain geographic locations; it still has not been universally rolled out, and universal rollout is, you know, still maybe another decade ahead of us. That is the challenge of hardware plus AI. In software plus AI, development cycles are much quicker. We are finally at a place where AI systems can talk to people, so there's broad knowledge of the world and the ability to hold a dialogue. Perception systems have matured, at least on the web, so we can take a screenshot of the web and know how websites are laid out, which buttons do what; there's a certain broad-based common-sense understanding. And finally, the third thing: open-source models have been released over the last couple of years, which allow smaller players such as ourselves to at least get started. A few years ago, if we had to build a web automation or web agent startup, we would have had to start from pre-training, and pre-training of language models and vision language models is an entirely different enterprise with an entirely different set of capital investments, compute requirements, and data requirements. Today, we post-train models, meaning that we start from open-source vision language models and post-train them for browser automation: clicking buttons, filling out forms, and so on. That would not have been possible a few years ago. And simultaneously, this is not a problem that has long iteration cycles and is still a decade-plus ahead of us. The way I think of it is, there is no possible world in which robots have arrived in our homes but we are still sitting at our laptops typing our names into browser fields. Digital assistants will arrive before physical assistants do. And because digital assistants live in a purely digital realm, the world of bits just moves faster, it has faster iteration cycles, and a lot of the substrate on which we can develop these intelligent systems has been commoditized, so we can focus on the last-mile problems.

Hannah Clark:

Oh, that's a very eloquent answer. Also, love the use of the word substrate. This has been a thoroughly fascinating conversation. I feel like we could have gone so much deeper, and I'm sure that a lot of folks would love to do that. So where can folks follow your work online?

Dhruv Batra:

I'm available personally at dhruvbatra.com. That's my webpage. My work is available at yutori.com and our product is called Scouts. It's available at scouts.yutori.com.

Hannah Clark:

Amazing. Well, thank you so much for joining me Dhruv, I really appreciate this.

Dhruv Batra:

Thank you for having me. This was wonderful.

Hannah Clark:

Next on The Product Manager podcast. Leading product in the age of AI means solving many of the same problems in very different ways. From development processes to distribution tactics, just about every playbook that worked a year ago is already outdated—which Webflow CPO Rachel Wolan considers both a challenge and an incredible opportunity. You'll get the answers and clarity you've been waiting for on topics like answer engine optimization, build-versus-buy, and the right way to enter the AI market. Subscribe now so you don't miss it.