The ThinkND Podcast

RISE AI, Part 5: Where are We in the Journey to a Knowledgeable Assistant?

Think ND



Episode Topic: Where are We in the Journey to a Knowledgeable Assistant?

Discover the future of AI with Meta Chief Scientist Xin “Luna” Dong. As AI assistants transition from chatbots to wearables, the demand for reliability is paramount. Learn how the Dual-Neural Knowledge framework targets hallucinations, ensuring your digital assistant provides the precise, real-time accuracy required to navigate our complex information age.

Featured Speakers:

  • Xin "Luna" Dong, Meta Wearables AI

Read this episode's recap over on the University of Notre Dame's open online learning community platform, ThinkND: https://go.nd.edu/121fe0

This podcast is a part of the ThinkND Series titled RISE AI.

Thanks for listening! The ThinkND Podcast is brought to you by ThinkND, the University of Notre Dame's online learning community. We connect you with videos, podcasts, articles, courses, and other resources to inspire minds and spark conversations on topics that matter to you — everything from faith and politics, to science, technology, and your career.

  • Learn more about ThinkND and register for upcoming live events at think.nd.edu.
  • Join our LinkedIn community for updates, episode clips, and more.

Welcome and Introduction

Speaker 10

Alright, so it's my great pleasure to introduce Luna. Dong has been a great friend and mentor to so many of my students who have done internships with her at Amazon or Meta. We also spent some delightful time together in Jodhpur, India, where she was experimenting with all kinds of Indian food. She hadn't been to Jodhpur before; she was like, I'm going to try it too. She then gave a wonderful talk at a conference I was chairing at the IIT campus in Jodhpur as well. So it's good to bring her to the South Bend campus here today. Xin "Luna" Dong is a chief scientist at Meta Wearables AI, where she leads machine learning efforts in building an intelligent personal assistant for wearable technology. And we all know we need one, right, with all the wearable technology that's spreading across the space here. Her career spans over a decade of groundbreaking work in knowledge organization, where she built foundational systems like the Amazon product graph and also contributed to the Google knowledge graph, and I know how foundational those systems were. Students have interned with her when she was building the Amazon product graph; I even worked for her later on. And these are contributions she has made while being in industry. So for all the PhD graduates thinking of a career in industry: you can still have an amazing scholarly career, because she was also elected a Fellow of the Association for Computing Machinery, ACM, which is quite a distinctive honor, something like the top 0.5% of computer scientists in the world, and also a Fellow of IEEE. Two very distinctive honors. She has published broadly and is very active in conference organization; we were program chairs at the same time at KDD, for the research track and the applied data science track. So she is someone I hold in high regard for her scholarship and her mentorship, and she even spent time today meeting with students from very different disciplines, not necessarily computer scientists. That's the joy of being at Notre Dame as well: she had lunch with neuroscientists, psychologists, social scientists, and others, talking about career discernment. So Luna, thank you again for making the trip from Mountain View to here and walking us through your journey. Thanks.

Hallucination Examples

Dual-Neural Knowledge Framework

RAG Triggering Strategy

Wearable Assistant Vision

Speaker 2

Thank you very much for the nice introduction. Hi everyone, this is Luna from Meta, and it is my great pleasure to come here to tell you about our journey in building a trustworthy AI assistant. Responsible AI means a lot of different things, and it definitely means different things to different people. For me, I contribute to providing truthful, trustworthy information to our users. Early last year, I asked this question: is correct, relevant information getting closer or farther away? This is based on the observation that we have more and more hallucinations in information generated by large language models, so do we have trustworthy information? Now let me give you some refreshed answers.

First, anecdotally, this is my favorite question: what college is the sister college of Trinity College, Oxford? I tried it again very recently on Google. The answer is wrong, and when I use the AI mode it now gives me a correct answer, but if you look at the evidence, it is information about Trinity College, Cambridge, so it is irrelevant. I asked GPT, and the answer is still wrong. The nice thing about not being famous is that you can use the same example for two years. Then I said, oh, search the web and give the answer, and this time it is correct.

Now another fun example I really enjoyed: I have been asking GPT about Luna Dong, and this time the answer is quite amazing. The part I like most is, "she is a regular keynote presenter at VLDB, SIGMOD, KDD," and so on. So if you organize such conferences, take the advice from GPT and invite me for keynotes.

Here is a harder question. I spent July in New York City, and toward the end of my stay there I watched a whole bunch of ballet shows. Then I asked, okay, tell me about some new ballet shows in New York City in the upcoming months. The first answers it gives me are Swan Lake and so on from ABT. Those had already passed; I had already watched them. So this is basically an out-of-date answer, and I know of one show that is upcoming, and it is not in this list. I asked particularly about that one, and it actually knows about it: it knows it is a ballet, and it knows it is coming to Lincoln Center. In other words, the earlier answer is not complete.

Now a different question: tell me about the past locations for SIGMOD. The answer is actually quite amazing, with some caveats. First, it does not note that during COVID this conference was virtual. Second, it stopped at 1980; if you are old enough, you know it actually went back before that.

Now some reasoning questions. I'm working on RAG, I'm working on trustworthiness, so I asked, where can I submit papers related to truthful information? It gives me a list, which seems quite reasonable, but I didn't find my favorite conference there. So I asked, okay, can I submit RAG papers to KDD? It says, yes, you can submit if it is about this, this, and that. Again, it's an incomplete answer; it may not even associate RAG with the truthful-information question.

Okay, those are anecdotal answers. To give a quantitative answer, we constructed a benchmark, which we call CRAG, the Comprehensive RAG Benchmark. We used it to host the KDD Cup last year, thanks to the support from Tim, and you can see that this KDD Cup attracted more than 3,000 participants and more than 6,000 submissions. Let's see what it is about.
So we have 4,400 question-answer pairs, and they are distributed along four dimensions. The benchmark covers five different domains, and the questions have different dynamism: static questions, where the information never changes; slow-changing questions, which change every couple of months; fast-changing questions, which change every couple of days; and real-time questions, which change every second, such as stock prices. As you can see, for finance a lot of the questions are stock-price questions, which are real time; for sports and finance there are a lot of fast-changing questions; and for movies, music, and the open domain there are more static and slow-changing questions. The third dimension is the popularity of the entities: we have head entities, which are super popular; torso entities, which are not so popular; and long-tail entities. Believe it or not, long-tail entities are typically 95 to 99.9% of all entities. And the fourth dimension is question type, starting from simple questions asking about an attribute of an entity, such as the capital of the US, to questions that impose conditions, questions asking for a set of entities, for comparison and aggregation, for reasoning like multi-hop and post-processing, and also false-premise questions, where the question itself contains an error.

To make the comparison fair, we also provide the retrieval content. RAG means retrieval-augmented generation, right? So for every question we provide up to 50 webpages retrieved through the Brave API; in total this is 220K webpages. In addition, we provide a knowledge graph, where the signal-to-noise ratio is one to 30, and we provide APIs to access that knowledge content. Based on the availability of the retrieval content, we designed three tasks. The first task answers the question based on the top five web search results; this is basically an answer-generation task. In the second task, we also provide mock APIs for knowledge graph search, so now you need to be able to answer questions over the structured content and select the correct answers. And finally, we add all 50 web search results, so here you need to do some search ranking, et cetera.

Let's see what the results are. Before that, we note that for each question, the answer from the large language model falls into one of three buckets. It could be correct, which contributes to accuracy. It could be incorrect, which contributes to hallucination. Or it could be missing: I don't know, I don't have the information to answer the question; this contributes to the missing rate. The factuality, which is the final metric, is accuracy minus hallucination. In other words, if you have 100 binary yes/no questions and you flip a coin, you make random guesses: 50 correct, 50 incorrect, and your score is zero. On the other hand, you can try your best to answer 50 of them, get them all correct, and for the other 50 say, sorry, I don't know; then your score will be 50%. So we really penalize hallucinations.
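For readers following along, here is a minimal Python sketch of the factuality scoring just described; it is illustrative only, not the official CRAG evaluator.

    from collections import Counter

    def factuality_score(labels):
        """labels: one judgment per question, each 'correct', 'missing', or 'hallucinated'."""
        counts = Counter(labels)
        n = len(labels)
        accuracy = counts["correct"] / n
        hallucination = counts["hallucinated"] / n
        return {
            "accuracy": accuracy,
            "missing": counts["missing"] / n,
            "hallucination": hallucination,
            "factuality": accuracy - hallucination,  # hallucinations are penalized
        }

    # Coin-flip guessing on 100 yes/no questions: 50 correct, 50 hallucinated -> factuality 0.0
    print(factuality_score(["correct"] * 50 + ["hallucinated"] * 50)["factuality"])
    # Answer 50 correctly and abstain on the other 50 -> factuality 0.5
    print(factuality_score(["correct"] * 50 + ["missing"] * 50)["factuality"])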
As you can see, let's first look at the large language model alone: there is about 30% accuracy, about half of the questions are not answered, and there are some hallucinations. Then we give it some search results. The accuracy increases from 34% to 36%, but the hallucination increases as well: with the web content, if we just do a straightforward RAG solution, it actually hallucinates even more, so the factuality would drop. Now, for task two, we give it the knowledge graph. KG search results are very brief and oftentimes precise, and you can see that the accuracy improves and the hallucination reduces, which is wonderful. Finally, when we add all 50 pages as the search results, again you see accuracy increase, but hallucination can increase even more. So a large language model alone is not enough, and a straightforward RAG solution is not enough either.

These are our KDD Cup winning solutions, and they improved the factuality from about 10% to 30%. You can see that with auto-evaluation, LLM-as-a-judge, and manual evaluation, we get very similar numbers. So this is a big boost. Finally, we also evaluated the industry SOTA solutions. These are several industry SOTA solutions, and here, for correct answers, we further split them into perfect answers, where nothing is wrong, and acceptable answers, which are mostly correct with some minor issues. With this, we see that the perfect rate is below 63%, so there is still a big gap to fill.

And where are we? We are Meta wearables, and the good news is that our end-to-end latency is lower than all the other solutions. This is because we support a voice system: the user needs to speak to the assistant and then get the answer, and we don't want the users to wait too long. But there is a trade-off, and we can see that our quality is kind of second tier. Looking at more dimensions, we see that we need more improvement for dynamic questions, like real-time and fast-changing questions; we need improvement on torso-to-tail knowledge; and we also need improvement for the complex questions requiring aggregation, reasoning, and so on.

So here is the quantitative answer: LLM only, not good. Straightforward RAG, still not good. Our KDD Cup solutions boost it to 36%, and industry SOTA solutions to 50%. Right in the middle. Okay, so we are right in the middle. What is the path to fully trustworthy information, fully trustworthy answers? In our mind, we believe the dual-neural knowledge framework is the solution, and this picture basically depicts what dual-neural knowledge is. The knowledge stays in two forms. One is the neural form, internalized as parameters in the large language models. The other form is the symbolic form, such as knowledge graphs, web content, and databases, as they are now. We want head-to-torso knowledge, which normally is not that much, to exist in both forms, and for torso-to-long-tail knowledge, it's fine for it to stay in the symbolic form. This is just like human beings: we have a lot of knowledge in our minds, and when we answer questions, we can directly use it, do reasoning, and give good answers. But we also have a lot of things that we don't know, or that we know but don't remember, and so we look up external books, dictionaries, and references to answer those questions. Another way to look at it: the head-to-torso internalized knowledge allows us to integrate that information into the large language models offline, whereas the symbolic knowledge in the wild requires us to use RAG to do the runtime integration. Okay.
You might be thinking, yeah, large language model plus RAG, I know about this, this is already an existing solution; is that the full story? But we are not able to do it very well yet. From this figure you can see that even for head knowledge, our accuracy in answering the questions is not good yet, meaning that head-to-torso knowledge is not very well internalized yet. And here is another graph, which is what we just saw from CRAG: it shows that even with the external information, the quality is still not perfect; it is about 50%, as we have already seen.

Okay, so how can we do this better? To really make this dual-neural knowledge framework effective, we need to answer three questions. First, when to use what: when to use internalized knowledge and when to use symbolic knowledge. Second, how to leverage the external data through RAG more effectively. And third, how to internalize more knowledge into the large language models. So we need to answer these three research questions.

Let's start from the first question, choosing between the internalized neural knowledge and the external symbolic knowledge. Our intuition comes from a trial on GPT. Basically we said, here is a list of 10 questions; if you know the answer, say yes, you know the answer, and give it, and if you are not confident, admit that and say you don't know. Here I'm showing four questions: we got three yeses, and those are popular knowledge, and one no, saying, oh, this is not a commonly known fact. Wonderful. So we thought, okay, we just need to know the answer to three sub-questions. First, do large language models know what they know? Second, can we teach them to say "I don't know" rather than hallucinate when they are not confident? And third, based on this, what would be an effective RAG triggering strategy?

To answer the first question, we did some experiments on three different benchmarks of factuality questions, where we prompted the large language models to state their own confidence, between zero and one. The confidence is calibrated if, when the model says the confidence is 50%, the accuracy of those answers is also 50%. In other words, the calibrated line is a diagonal line. What you see here, however, is below the calibrated line. What does that mean? It means large language models are overconfident. Interestingly, this couple of lines are smaller models, and this is just like human beings: the more you don't know, the more confident you are. Another thing is that for SimpleQA, these are factual questions about popular entities, and this basically tricks the large language models into thinking they know, so we see a lot of overconfidence happening here. Okay, so the answer to the first question: large language models know a little bit about what they know, but they tend to be overconfident.
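A rough sketch of the calibration check described above, under the assumption that each record pairs the model's self-reported confidence with an LLM-as-judge correctness label; the data format and function are illustrative, not the team's actual tooling.

    def calibration_curve(records, n_bins=10):
        """records: (stated_confidence, is_correct) pairs, confidence in [0, 1]."""
        bins = [[] for _ in range(n_bins)]
        for conf, correct in records:
            idx = min(int(conf * n_bins), n_bins - 1)   # bucket by stated confidence
            bins[idx].append(1 if correct else 0)
        curve = []
        for i, bucket in enumerate(bins):
            if bucket:                                   # skip empty confidence buckets
                stated = (i + 0.5) / n_bins              # bin-center confidence
                observed = sum(bucket) / len(bucket)     # empirical accuracy in the bin
                curve.append((stated, observed))
        return curve

    # A well-calibrated model traces the diagonal (stated == observed); points where
    # observed accuracy falls below stated confidence indicate overconfidence.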
Then, can we teach it that? The underlying idea is simple. We ask the model a question and look at the answer, using LLM-as-judge to compare it with the ground truth. If the answer is correct, we teach it to give the answer; if the answer is incorrect, we teach it to say, "I'm unsure." This seems simple, but there are two pieces of secret sauce. The first secret sauce is to give the instruction "answer only if you are confident"; we call it the dampener, and with or without it the behavior can be very different. The second secret sauce is that we use the atomic facts from a knowledge graph, DBpedia, because these are the building blocks for factual statements.

Let's look at the results. First, let me tell you what these bars mean. We have three figures representing three benchmarks. Each figure has three groups, representing the baseline model, the model fine-tuned using our method, and the fine-tuned model where we also apply the dampener at inference; for both of the latter two we applied the dampener in training, but at inference one applied it and one didn't. We have three bars in each group, representing head, torso, and tail entities, and three colors per bar, representing correct answers, missing answers, and hallucinated answers. We can see that for our fine-tuned model the accuracy is similar, but the hallucination is already dropping, and when we apply the dampener, the hallucination drops to below 5%, which is magic. What is even better is that the hallucination dropped more for the long-tail facts, which the model does not see often in the pre-training data, so it likely does not hold confident knowledge of them in its parameters. Another amazing thing I really like: recall that we fine-tuned using DBpedia, but it generalized very well to IMDb and to the CRAG benchmark. We also tried generating the training data from MMLU, which consists of non-factual questions in addition to factual questions, and that does not work well. We also tested on long-form answers, where each answer has multiple factual statements, and we tested on MMLU, which contains non-factual questions; we see improvement on long-form answers and no regression on reasoning and math questions. One more thing I want to tell you: have you thought about giving the correct answers in training? That does not work. When we give the question and the correct answer at post-training time for one domain, we actually teach the model to ignore the knowledge it obtained at pre-training time, and then for other domains it will just hallucinate.

Okay, so based on this, we can teach it. Now let's use it for RAG triggering. Basically, we first decide whether or not the question is asking for dynamic information. If so, we search the web to get the most up-to-date information. Otherwise, we call our fine-tuned model as well as the RAG pipeline in parallel, and if our fine-tuned model gives an answer, we return it right away and early-stop the RAG pipeline; otherwise, we wait for the RAG pipeline to give the answer based on the retrieval results. From this figure you can see that, again, if we just use our fine-tuned model, the hallucination is below 5%; but if we also add RAG when the fine-tuned model says "I don't know", we can significantly improve the accuracy. There are still some hallucinations, and these come from the RAG pipeline; recall that straightforward RAG is not fully effective. Okay, so this answers the first question.
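A rough sketch of the triggering strategy just described, where the dynamic-intent classifier, the fine-tuned model, and the RAG pipeline are caller-supplied placeholders rather than the production components.

    from concurrent.futures import ThreadPoolExecutor

    UNSURE = "I'm unsure"

    def route_question(question, is_dynamic, finetuned_answer, rag_answer):
        """is_dynamic, finetuned_answer, and rag_answer are hypothetical callables:
        a dynamic-intent classifier, the fine-tuned model that may abstain with
        "I'm unsure", and the retrieval-augmented pipeline."""
        if is_dynamic(question):                 # e.g. stock prices, live scores
            return rag_answer(question)          # always go to fresh web retrieval

        with ThreadPoolExecutor(max_workers=2) as pool:
            internal = pool.submit(finetuned_answer, question)   # internalized knowledge
            rag = pool.submit(rag_answer, question)              # retrieval path, in parallel
            first = internal.result()
            if first != UNSURE:
                rag.cancel()   # best-effort early stop; a real system would abort retrieval
                return first
            return rag.result()                  # fall back to the retrieval-based answer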
Now let's come to the second question: how to do RAG. Here is the RAG pipeline. Given a question, we first decide what the intent is and rewrite the query into API invocations. Based on that, we do web search; the web search results are fetched, and we do content extraction and some post-processing, filtering, et cetera. Meanwhile, we also do KG search, which is typically much faster. Then we put all of the results into the prompt and let the large language model generate the answer.

From this, we can see that the RAG end-to-end accuracy relies on two things. The first is the retrieval recall: whether or not we retrieved the correct information to answer the question. The second is the answer generation accuracy: given the correct information, do we generate the answer correctly? Unfortunately, between these two we need to make some trade-offs. Here is one example. This is a wiki page, and this is the infobox, which contains much, much less information, and DBpedia is extracted from the infobox, so it contains even less. So from DBpedia to the wiki infobox to the wiki page, the retrieval recall actually increases, but the end-to-end accuracy for question answering didn't increase; instead, it decreased. Here is another example from CRAG. As we add retrieved pages from zero to 50, the retrieval recall increases and caps at 80%. This is basically the upper bound of our QA accuracy: only with the correct information can we give the correct answer in RAG. But this is our QA accuracy, and it flattens out after 10 pages; the model cannot leverage all of the retrieval results. Even worse, look at this line. This is factuality, and recall that that's accuracy minus hallucination. It even drops, meaning the model generates hallucinations because it is distracted by the retrieval noise. So there are two gaps to fill. The first is to push the retrieval recall toward 100%, and the second is to reduce the gap between the retrieval recall and the final factuality. To do this, for retrieval recall we can improve search recall and do better on complex question decomposition; for summarization accuracy we can improve the retrieval precision, and we can also fine-tune the model to improve the summarization robustness against retrieval noise.

Now let's see how we do this on different types of content, starting from text, where we try to improve the retrieval precision. It works: this green line is when we take the retrieved content, chunk it into small chunks, filter out those that are irrelevant to the question, and re-rank them according to each chunk's relevance to the question. As you can see, the accuracy improves, while the factuality stays roughly flat. So reducing the noise helps, but there is still a lot of noise the model cannot handle very well. Can we fine-tune the model to be more robust? What we did is build the fine-tuning SFT training data by adding some distractor passages: passages about entities with similar names, from different timestamps, et cetera. We generate these distractor passages using large language models, then evaluate whether they are of good quality; if not, we regenerate them. Finally, we put the high-quality distractors into the training data.
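A rough sketch of that distractor-augmented SFT data construction, with hypothetical helper functions standing in for the LLM calls that draft and vet the distractor passages.

    def build_noise_robust_example(question, answer, gold_passages,
                                   draft_distractor, is_good_distractor,
                                   n_distractors=3, max_retries=3):
        """draft_distractor and is_good_distractor are hypothetical LLM-backed callables:
        one writes a plausible-but-irrelevant passage (similar entity name, stale date),
        the other checks that it is confusing without actually answering the question."""
        distractors = []
        for _ in range(n_distractors):
            for _ in range(max_retries):
                candidate = draft_distractor(question)
                if is_good_distractor(question, answer, candidate):
                    distractors.append(candidate)
                    break                        # keep this distractor, move to the next slot
        return {
            "question": question,
            "context": gold_passages + distractors,   # the answer must survive the added noise
            "target": answer,
        }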
Another thing we did: RAG answer generation is really a reasoning process over a whole bunch of retrieval results; we need to decide whether they are relevant and which part answers the question, and then synthesize the answer. So we also used strategization. Basically, for each question the model automatically generates several steps as a strategy to answer the question, then executes the strategy based on the retrieval results, and then gives the answer. We generate fine-tuning data where we generate the thoughts and evaluate whether they are of good quality; if they are not good, we regenerate the thoughts, and if the answer is incorrect, we regenerate the thoughts. Finally, we generate the training data.

Through these methods, we tested on a whole bunch of benchmarks, 12 benchmarks in total. Our first observation is that this significantly improves over the state of the art and the baseline solutions. Interestingly, we also found that if we just do naive SFT, giving it the question-answer pairs, we actually even reduce the quality of the answers. As you can see, when we add strategization, we improve the accuracy, and when we add the distractors, we reduce the hallucinations compared with the baselines. This fine-tuned model gives us the red lines: the accuracy improves, and the factuality improves even more, meaning there are fewer hallucinations. And as a bonus from our recent results, we found that if we do reinforcement learning, we can get even better results. Here we basically design the reward so that accurate answers are rewarded more than missing answers, which are rewarded more than hallucinated answers, and then we see hallucination drop and accuracy improve.

Okay, next I will very quickly talk about semi-structured data. These are semi-structured data: instead of big paragraphs and passages, a page could have tables, web tables, which a lot of you are familiar with, and some freeform key-value pairs, and it is hard for large language models to understand them. There are two tasks on semi-structured data. The first task: given a question, we answer the question according to the content. The second task: we extract the knowledge triples, subject, predicate, object, out of the pages. As we can see, this is the quality of using large language models to do RAG on semi-structured data: small models still have a big gap, and large models do well on clean pages but not so well on whole pages. When we fine-tune, both for QA and for knowledge extraction, we can improve the QA results. And if, in addition, we concatenate the knowledge triples from a webpage to the HTML itself, we can further improve it, even after we fine-tuned the model.

So where do the triples come from? Here we have some very fascinating results. We can certainly run large language models to extract knowledge from the semi-structured data, but that would not scale up. I won't show you all of this information, but the key idea is that we teach large language models to write scripts to do extraction for a web domain, because pages within a domain are typically generated using the same template. With this, we can run on the whole web and get reasonable results. As you can see, the extractions have some errors, but after we concatenate them to the HTML files, we already improve the quality, even for GPT, and this is similar to adding the ground-truth knowledge triples. Okay, so with this I'll quickly conclude the RAG part.
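A minimal sketch of that triple-concatenation idea, using a made-up prompt format; the point is only that the prompt carries both the noisy HTML and a clean triple view of the same page.

    def triples_to_text(triples):
        """triples: iterable of (subject, predicate, object) tuples."""
        return "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)

    def build_prompt(question, page_html, triples):
        return (
            "Answer the question using the web page and the extracted facts.\n\n"
            f"Extracted facts:\n{triples_to_text(triples)}\n\n"
            f"Web page HTML:\n{page_html}\n\n"
            f"Question: {question}\nAnswer:"
        )

    # Example with invented content:
    prompt = build_prompt(
        "When was the company founded?",
        "<table><tr><td>Founded</td><td>1998</td></tr></table>",
        [("ExampleCorp", "founded", "1998")],
    )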
So I'll skip some of the results on knowledge graph search, but I want to tell you a little bit more about how to internalize more factual knowledge into the large language models. This, again, is some new work I'm very excited about. Look at how we answer a question like, "I hurt my ankle, what should I do?" By the way, I don't like this example: last time I gave this presentation, I then hurt my ankle. A not-so-knowledgeable large language model will give a very generic answer, but a knowledgeable large language model will give you specific instructions. So how do we internalize that information into the large language models? This is a typical large language model: in each layer we have attention and we have a feedforward block. And there is something called memory layers: basically we have key-value stores for the memory, and we replace the feedforward layer with a memory layer to access external memories. What we propose is that for different domains, we stack different extended memories. Let's see how this works. In this figure, the x-axis is the number of parameters we need to update in one fine-tuning step; the more parameters, the more expensive it is. The y-axis is the accuracy in answering some medical questions. This is full fine-tuning, which is costly but gives good accuracy. You might be familiar with LoRA, which sort of taps into the model; the cost is lower, but the accuracy is also much lower. And these are our extended memory layers: the cost is even lower than LoRA, but the accuracy is much higher and can even compare with full fine-tuning. So this is ongoing research, and I'm very excited about it.

With all of this, as you can see, last year our factuality was 41%; we just evaluated again, and this time we are at 60%. And our end-to-end latency dropped from 3.4 seconds to 1.8 seconds. So I'm happy about the results, with the knowledge that we still have a lot to improve.
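A heavily simplified PyTorch sketch of the memory-layer idea described above; this illustrates a key-value memory standing in for a feedforward block, not Meta's actual architecture, and all sizes are arbitrary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleMemoryLayer(nn.Module):
        """A learned key-value store: each hidden state attends over its top-k
        nearest memory keys and returns a weighted sum of the matching values."""

        def __init__(self, d_model, n_slots=4096, top_k=32):
            super().__init__()
            self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
            self.values = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
            self.top_k = top_k

        def forward(self, x):                        # x: (batch, seq, d_model)
            scores = x @ self.keys.t()               # similarity to every memory key
            top_scores, top_idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(top_scores, dim=-1)  # attention over the selected slots
            top_values = self.values[top_idx]        # (batch, seq, top_k, d_model)
            return (weights.unsqueeze(-1) * top_values).sum(dim=-2)

    # Swapped in where a transformer block's feedforward would sit, only the memory
    # parameters (keys/values) need updating to inject new domain knowledge; stacking
    # a separate table per domain is one way to read the "extended memory" idea.
    layer = SimpleMemoryLayer(d_model=64)
    out = layer(torch.randn(2, 10, 64))              # -> shape (2, 10, 64)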
Okay, now let's put everything in context. What are we building? We're building the next-generation intelligent assistant for wearable devices, basically devices like this. There are three stages of intelligent assistants. The first one is the chatbot: you type your questions, you get the answers. The second stage is the voice assistant, like Alexa: you talk to it, you get the answers. And I would say this is the third stage: it's wearable, so it can take the visual input, the voice input, and the context input, and then provide you the information. How is this different? Because you can see through it, it knows what is in your view, so your input is multimodal. Because you can wear it everywhere, it can know the context, like where you are and what time it is, and give you contextualized answers. You can wear it for a long time, so hopefully it knows you better and gives you personalized answers. And you get the information by listening to it rather than doing a search at the same time, so it had better be trustworthy.

So these are our new models. How many of you saw Meta Connect a few weeks ago? A few of you. I hope you have learned about those new glasses, like this one: it has a little display, and I'm one of the earliest people working on it. I hope you also learned from Connect how to fail a demo. Can you please play the video? Thank you. This is a video from last year, actually. Something fun to watch.

Speaker 3

Let's talk about the promise of these devices. Generative AI and large language models have created a new way for us to interact with computers, so we should get new types of computers. Humane put its computer in a $699 pin. Rabbit put its in a $199 square. Meta added AI to its $299 Ray-Ban camera glasses. With all of them, the integrated AI assistants use cameras to see and microphones to hear, and then they try to assist you, emphasis on try. I ran around with all of these devices and put them through a series of challenges. Luckily, I also took my smartphone. Challenge one: vision.

Speaker 4

I don't wanna break you. Look and tell me what kind of cat this is.

Speaker 3

Yes, I needed kittens and puppies for this one, so I visited Byy, a pet adoption center in New York City. Using the cameras and large language models, these devices can analyze what you're looking at and answer questions. At least, that's the promise. The Humane and the Rabbit did know this was a cat, but

Speaker 6

AI Pin is unable to determine the specific breed of the cat. While I'm not able to identify the specific breed, it appears to be a domestic cat.

Speaker 3

Meta, however... Hey Meta, can you tell me what I'm looking at?

Speaker 7

The cat appears to be a domestic shorthair.

Speaker 3

Meta also had a guess about, oh, you are cute. My new best friend Maggie,

Speaker 7

It appears to be a black and white dog, possibly a Great Dane puppy.

Speaker 3

But here's the biggest problem with the Humane and the Rabbit: getting them to work when you need them. This great timing.

Speaker 5

If you are referring to AI pin, it's a small wearable personal computing device with a touch pad.

Speaker 3

Do you think that's what I'm asking

Speaker 5

you

Speaker 3

right now? The Humane requires you to learn a sequence of taps and commands, and the laser projector you navigate. We can stop here.

Three Pillars RAG

Wearables Benchmarks

Speaker 2

I encourage you to watch the whole thing, it's very lovely. Okay, can you go back to the slides? Perfect. Next slide please. Let me do this. Okay, perfect.

So there are three pillars to building such an AI assistant: knowledge augmentation, as we just heard; multimodal understanding, of the text and the visual information; and also personalization. And for all of this, RAG is kind of fundamental for many different applications. The retrieval content could be the web and a knowledge graph; it can also be agents, that's agentic AI; and it can be memory, that's memory QA and personalization. If we own the information, such as memories from the users, we can do some offline augmentation to improve the retrieval and generation at runtime. We basically do this RAG-based generation: we first understand the intent and then do the retrieval, where the key is to maximize the retrieval recall; after that, we do some filtering and re-ranking, where the key is to improve the retrieval ranking, to improve the precision of the top-k results; and finally we do large language model answer generation, or summarization, where the key is to have noise-robust summarization models.

And CRAG is back this year, and again many thanks to the organizers, including Tim. This year we have CRAG-MM, where MM stands for multi-turn, multimodal, and again we have thousands of participants. These are the questions; you might have already noticed that the egocentric images are very different from what you typically take with your cell phone. This is actually the first benchmark of egocentric images for wearable devices, and it is much, much harder. Recall that for text RAG we are at 50%; here, the KDD Cup winners are at 20% and SOTA solutions at 32%. It's a bigger gap, so there are a lot of research opportunities. I also want to say that we have recently created a suite of wearables benchmarks. This one is for zero QA, which is reasoning heavy, containing 10 different types of questions. This is memory QA, where you do visual question answering over a lot of visual memories. And finally, this one we call WearBox: it contains multi-channel voice, and you can try voice-in, voice-out models or cascaded models.

So with all of this, to make the dual-neural knowledge framework work, to give us factual, truthful information, we need to answer three questions: when to use what kind of knowledge, how to improve RAG, and how to internalize more knowledge into large language models. There is still a lot of research to do, and we hope you will join the force. And finally, we are hiring interns; let us know if you're interested. Thank you very much.
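For readers following along, a compact sketch of the three RAG stages recapped above: retrieve broadly for recall, filter and re-rank for top-k precision, then generate; the retriever, relevance scorer, and generator are hypothetical placeholders, not the production components.

    def rag_answer(question, retrieve, score_relevance, generate, top_k=5, min_score=0.5):
        """retrieve, score_relevance, and generate are caller-supplied components:
        a broad retriever, a chunk-relevance scorer in [0, 1], and a noise-robust generator."""
        candidates = retrieve(question)                       # stage 1: maximize recall
        scored = [(score_relevance(question, c), c) for c in candidates]
        scored = [sc for sc in scored if sc[0] >= min_score]  # stage 2a: filter out noise
        scored.sort(key=lambda sc: sc[0], reverse=True)       # stage 2b: re-rank by relevance
        context = [c for _, c in scored[:top_k]]              # keep only the top-k chunks
        return generate(question, context)                    # stage 3: grounded generation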

Speaker 8

Thank you, Luna. We can take questions via the two microphones, please use them. Thanks. Yes, go ahead.

Speaker 9

Hi, thanks so much for an absolutely fantastic talk. There's another mode of failed responses, which comes from misconstruing the prompt rather than failing to recall the correct information. This is not really related to the talk, I suppose, but is there any progress on that?

Speaker 2

Hmm, very good question. So the question is that the way we construct the prompt might also reduce the QA quality. We actually observed this when we build these products, and also that the prompts that work for an earlier version of a large language model may not work as well for the next version. So we do a lot of evaluation over time and try to adjust. My belief is that we can also use the real traffic to build a data flywheel and use that to fine-tune the model. Related to prompting, another thing is that we sometimes give an answer template, and we can end up making the assistant give you correct but very boring information. So that is another thing we are trying to improve: basically, through the reinforcement learning data flywheel, how to make the answers more engaging. Okay.

Speaker 4

So first, this is a very inspiring presentation, and it covers almost all the topics I'm most interested in. Can I ask one question about the KG part, because you skipped that? Sure.

Speaker 3

Yeah. But,

Speaker 4

But we also build some KG RAG. I think the most difficult part is exactly how to do the KG search. If you do multiple-hop searching, then you get a very high recall, but the content is too long for the model to digest. And if you do a short, I mean a one-hop search, then maybe the answer is not what you want. So I'm really curious about the KG search part, but unfortunately it was skipped.

Speaker 2

Yeah, I feel sorry about that, and we have a solution I'm actually very proud of. For KG search, exactly as you said, either you have low recall, or you have high recall with very noisy information. What we did is basically try to find a subgraph in the knowledge graph: we first go multiple hops to improve the recall, and then we do predicate-based filtering to improve the retrieval precision. Then we give the subgraph to the large language model to leverage its reasoning capability, and we actually even fine-tune the model to answer those reasoning questions. It is better than the state-of-the-art solutions. I can show you offline.
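A rough sketch of the multi-hop-then-filter retrieval described in this answer; the graph interface and embedding function are hypothetical stand-ins, not the actual system.

    def retrieve_subgraph(question, seed_entities, graph, embed, max_hops=2, min_sim=0.5):
        """graph.neighbors(entity) is assumed to yield (subject, predicate, object) triples
        touching the entity; embed(text) returns a vector for cosine comparison."""
        q_vec = embed(question)
        frontier, visited, subgraph = set(seed_entities), set(seed_entities), []
        for _ in range(max_hops):                               # multi-hop expansion for recall
            next_frontier = set()
            for entity in frontier:
                for s, p, o in graph.neighbors(entity):
                    if cosine(embed(p), q_vec) >= min_sim:      # predicate-based filtering
                        subgraph.append((s, p, o))
                        for node in (s, o):
                            if node not in visited:
                                visited.add(node)
                                next_frontier.add(node)
            frontier = next_frontier
        return subgraph                                         # handed to the LLM as context

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5
        return dot / norm if norm else 0.0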

Speaker 4

Okay, thank you. Thank you very much. Actually, I could ask a million questions about this presentation, but I think it's better if we just connect offline. Sure. Thank you.

Speaker 10

I appreciated the talk, thanks. It touches a lot on my day-to-day. I was curious about the stacked extended memory. You only spent a moment there, and I didn't catch the citation, but what I'm curious about is how it differs from a generalized mixture-of-experts approach.

Conflicting Sources Attribution

Speaker 2

Hmm, that's a great question. Currently, for a lot of the large language models, the architecture is MoE: basically, a part of the parameters answers one particular area of questions. This memory layer is doing even more: it basically pulls the data out, while the reasoning stays in the parameters. The data is pulled out into a separate memory layer with key-value pairs. Again, I don't think all of the data should come out, but torso-to-tail, I would say torso knowledge, and also head knowledge for torso fields, can come out to the memory layer. That's a very good question.

Speaker 11

For the open-ended questions that you touched upon: in our world, some of the questions do not have a factual answer. You actually have multiple sources with conflicting information. How do you deal with that? How do you deal with that, and how would you measure it, mathematically, but also how would you resolve that conflict?

Speaker 2

Mm, that's a very good question. I wouldn't say we have the best solution yet. In my early career I worked on knowledge fusion, basically, or data fusion: if different data sources give conflicting information, how do we measure the trustworthiness of the data sources and give the trustworthy sources more weight for the information they provide? We haven't done that much of this for large language models yet. The one thing we tried, and it is still preliminary, is for the large language models to learn from which sources they get correct information; then, based on that, we can do a re-ranking of the knowledge sources and use that to generate the answers.

Speaker 11

Thank you. There's the point of perspective, right? Not to be too political: Fox News, MSNBC. So conflicting, different interpretations of the same event.

Speaker 2

Yeah. So I have a second answer for that

Speaker 11

one. Yeah. Yeah.

Speaker 2

So oftentimes there will be vague information or gray areas. I think the best answer for that kind of information is to give the attribution, which data sources provided this piece of information, and then let the users decide which they trust more, rather than making the decision for them.

Speaker 8

Thank,

Speaker 10

Thank you, Bill. I know there are a few more folks. Dave, we'll take your questions. No, a follow-on question to the question.

Speaker 12

Okay, okay. So my questions could be endless; I just wanted to react to the previous answer. You mentioned that you have this filtering with predicates, right? This gets very interesting, actually. Predicate can mean a lot of things, right? From simple Datalog up to, I don't know, whatever kind of logic, which also trades off efficiency against expressivity. So how far did you go with what you call the predicates? To be expressive, possibly even probabilistic? It's a huge universe on the other side, which would actually be interesting to explore to get this multi-hop functionality at the end. That's the whole point, right?

Speaker 2

That's a great question. So basically, when I talk about predicates, I am talking about the knowledge triple: it has subject, predicate, object. For example, person, date of birth, and then a particular date; date of birth is the predicate. So I was more talking about those verbs, and we do basically embedding-based filtering on them.

Speaker 12

So nothing with variables.

Speaker 2

Yeah, nothing really complex. And then we give those complex tasks to the answer generation step. Okay.

Speaker 13

Okay, hi. So there's a significant portion of informational content online that's produced using AI, and it is consequently referenced by AI when searching through LLMs. In order to improve RAG accuracy, I know you mentioned retrieval recall and summarization accuracy, but do you also look at ensuring that the content retrieved by RAG pipelines is validated? Do you look to see what percentage of it is generated by AI, so that it's not a feedback loop of self-reference?

Wearables Privacy Safeguards

Speaker 2

Yeah, this is a wonderful question. I think this is a question we must study for responsible AI. We haven't done that yet, and the best thing we are trying to do is to provide the links to the original sources and then let the users decide. But this is not perfect, and at some point we should really see whether or not we can distinguish them. Also, I always want to do one study: if we pre-train large language models on that generated content, what would happen? That would be a fantastic research topic.

Speaker 14

Yeah, thank you for the talk. I'm curious about a wearable-specific concern, and that is that there are some privacy implications with people walking around with glasses capturing all the images, and things that maybe people don't want captured. So I'm wondering, what is Meta doing to ensure that people who don't want their images recognized, or their voices recognized, can keep that from happening?

Speaker 2

Yeah, that's a great question, and this is something the company actually spends a lot of effort on. The first thing I can show you is that if I take a picture, did you see the light? That is telling people something is happening, this is a dangerous person. So that's number one. Number two, unless you are taking a picture, if you just ask questions, all of the faces will be blurred before anything is sent to the cloud. And number three, of course, people opt in for a bunch of features. For example, we can answer memory questions: anytime, you can say, "Hey Meta, remember this," and it will basically take a picture and remember this stuff, and you need to allow the app and the glasses to send it to the cloud to help you answer such questions. In the future...

Feedback Watermark Wrap

Speaker 10

then we can follow, yeah, if you could finish in

Speaker 15

Really interesting talk, I really appreciate it. My question is on the user end, where we enter a prompt, we get a response, and then we have some immediate feedback buttons, which I assume reinforce the model in some way. What is the turnaround for that actually happening, and what is the value of that? Because it sounded like it really didn't do that much in terms of improving the accuracy of the model. So I was wondering if you could speak a little bit to that.

Speaker 2

No, that's a great question. Basically, it is asking how fast user feedback influences the product. This is still under development, and I hope it can get faster. It ranges between a few days, if there is some big bug, and a few months of going through a long part of the data flywheel; it really depends. I think ideally we want this to happen sooner. And again, people can opt out of helping improve the quality, but once they agree, we actually can get a lot of value out of such data.

Speaker 10

Samila, Fang and Will, and those will be the last few.

Speaker 16

Yeah. So my question is actually a follow-up question, but I hope you don't mind because it's a different person asking the follow-up, to the question regarding the feedback loops, right? We supply the synthetic data back to the generative model, and that leads to model collapse, that kind of thing. I was just wondering what the industry's take is on watermarking the content we generate. From a responsible AI perspective you do want that, because it gives you a way to tell the synthetic from the real data. But as an industry, where you want to market your product, if everything is watermarked, users are just going to go, oh, this is watermarked, and everybody knows I'm using AI to generate the things I want to share with people, so they will probably be hesitant to adopt your product. I'm just wondering what your perspective is on that.

Speaker 2

That's a very good point. Honestly, we have not used it, and I wonder why we haven't yet. Of course, we are always busy with adding things and busy with applying the existing constraints and restrictions that we could think of. But watermarking is a good approach; we should think about it.

Speaker 16

Good. Alright. Thank you.

Speaker 2

Well, thank you.

Speaker 17

I almost retracted my question because it's really not all that important. Going back to your point, you said it would blur out, like if you took an image, before it sent it, it would blur out my face, as an example.

Speaker 8

Yeah.

Speaker 17

But if you are sitting beside me and we're having a conversation, or I'm talking to someone and I'm saying things, I may not know you could be recording that audio and using it. That's the reason I almost retracted it; I guess the same can be said for my phone or any other recording device. But is there an issue there with the recording?

Speaker 2

Mm, it could be an issue, but we don't have that feature yet.

Speaker 17

Yeah. So that's,

Speaker 2

Yeah, so we are developing features like meeting notes and voice memos, and in those cases people need to agree to that; all of the speakers need to agree. That's why making those features is very complex. You're welcome.

Speaker 8

Let's thank Luna again. Thank you.

Speaker 2

Thank you.

Speaker 8

Thank you again. Thank you.