View Recording: GPT-4V in Action: Redefining Image and Video Analysis

January 11, 2024

Join us as we delve into the groundbreaking capabilities of GPT-4V, a game-changer in the realm of artificial intelligence. Witness firsthand how GPT-4V has redefined the landscape of video and image analysis, drawing parallels to the transformative impact ChatGPT had on text analysis. Prepare to be amazed as we explore the unprecedented feats achieved by GPT-4V, pushing the boundaries of what's possible in the world of AI-driven visual understanding.

Transcription

OK, welcome to a conversation about GPT-4 and multimodal AI.

Nathan Lasnoski 12:52
I am super glad that you are here. This is going to be really, really interesting. If you thought the GPT-3.5 / ChatGPT text-based use cases were interesting, I think you're going to be blown away by the way video and image analysis is being brought into multimodal AI. This is going to be, in a sense, another ChatGPT moment for the entire industry.

Nathan Lasnoski 13:17
So what a great opportunity for us to be at the spearhead of that, talking about what this means to our organizations.

Nathan Lasnoski 13:24
A little bit of introduction. My name is Nathan Lasnoski. I'm Concurrency's Chief Technology Officer. You can find me on LinkedIn and follow the different types of things I'm talking about.

Nathan Lasnoski 13:33
Today we're going to be digging into GPT-4, so we will get going.

Nathan Lasnoski 13:42
Alright, first thing I want to cover with you: I think it's really interesting to understand the impact that AI is driving within the industry. If you look at the way our ability to produce impact has changed over time, you can see that we have implemented various technologies that have marginally increased our ability to produce outcomes: the printing press, the steam engine, mass production on assembly lines.

Nathan Lasnoski 14:13
All of this has had an impact on the way we have produced products, capabilities, intellectual property, and information for the world. But if you measure that by gross domestic product in inflation-adjusted dollars, you can see that things like flight, the advent of the Internet, and the smartphone have been very much tied to productivity growth over time. Artificial intelligence is going to be another one of those major boosts that enable our organizations to achieve more with every individual. Just think about it in the context of what one person can produce in the world. This is how we're able to take an individual and 10x them in terms of their ability to drive impact, and that's what makes me so excited about the opportunities that AI provides.

Nathan Lasnoski 15:02
It's not this idea of individuals being replaced by technology. It's the idea of humans being amplified by technology and amplified by the use of artificial intelligence. One way someone recently framed this to me was: what if every individual knew how to use an assistant who could go out and execute activities for them?

Nathan Lasnoski 15:24
What would that do for the way people are able to achieve outcomes?

Nathan Lasnoski 15:28
That's a lot of what we're talking about here, but you're also going to see use cases that enable us to apply not just text to that assistant, but vision to that assistant.
And that's what we're going to be digging into in this session. I've now seen eight years of artificial intelligence working in this industry, and I've never seen the kind of boost we're seeing in the last year, especially in the sense that every CEO is thinking about this in the context of the mission of their business.

Nathan Lasnoski 15:58
Every single week we're having executive sessions on this topic, and every single week companies are investing in what it means to engage AI in the context of their business. And I think it's interesting when you look at some of these statistics: not only are businesses investing in AI, but the employees within those businesses are yearning for their organizations to invest in AI. We're working with a company in town here in Milwaukee - a couple thousand individuals - and they announced that they're going to go live with an internal instance of GPT and some Copilot capabilities. The energy that's being released as these employees learn about the project is amazing - the extent to which it switches them on in their job and causes them to want to be engaged in driving more within the organization. It's just so exciting. There's never been a better time to be in tech than now.

So what this session is about is multimodality. We have focused a lot of our energy over the last couple of years, and the last year in particular, on how language has taken major steps forward with large language models. We'll talk a little bit about that in a minute, but what's happening now is that speech and vision are coming into the fold, and they're coming into the fold in a way that allows AI in those spaces to function very similarly to how we've seen large language models function in comparison to legacy language model approaches. There are a lot of products out there and a lot of capabilities starting to come into the fold that make this possible, but what we're primarily talking about today is GPT-4 Turbo with Vision. Vision is a capability, in conjunction with GPT-4, that was released around the Microsoft Ignite timeframe. It allows us to not just produce images, as DALL-E does, but also consume images and understand what's happening in the context of any given still image or video frame being analyzed by the large language model.

To compare this, I want to take a step back in time. This is a slide I used to use probably three or four years ago, maybe even two years ago, when we were talking about natural language processing, pre large language models.

Nathan Lasnoski 18:26
We almost forget how complicated this was before large language models. We would do these projects where we would have to ingest emails, analyze the question being asked by the customer, and then turn that into something that gets back to the customer - a customer service response, a quote, or whatever. Just to get to that point, we were having to ingest tens of thousands of emails to even teach a natural language processing agent how to ingest the email itself, let alone understand what's actually being asked, and understand the relationships between things like king and man, queen and woman, and how these things all fit together or don't fit together.

Nathan Lasnoski 19:10
And then how that relates to the bag of words and the embeddings and the relationships between them.
And can it even understand what's happening in a sentence, interpret that, and do something with it? Man, it was difficult - just getting to that point was really, really hard. And then all of a sudden these large language models drop out of the sky, and projects that took us a million or two million dollars to do can now be done for hundreds of thousands of dollars, or maybe even tens of thousands of dollars, versus being in the millions-of-dollars range.

Nathan Lasnoski 19:43
And that's really because a lot of the underlying work had been accomplished for us, so we were able to gain ground. The same kind of transformation that happened with natural language processing is now happening with vision. So we were looking at doing a project with a company where we were ingesting these manuals - manuals that contain both text and images of windows, and the images are necessary to understand the installation instructions. These are installation instructions for windows. On the left-hand side you can see there's text and there's an image, and when you apply traditional OCR to those images, this is what comes out the other side, and you realize very quickly: oh, that's not going to work. For a human to understand these lines and the things attached to them, and what all this stuff is - we get that immediately. It's just how our brain works. But then you apply OCR and (a) it doesn't even find the image, and (b) it doesn't know what the images are, because it doesn't know how to think about them in the context of space. So to solve this problem, we would have to go through traditional machine learning techniques to be able to say: this is a hammer, this is a J-roller, these are safety glasses. Here's a hundred pictures of safety glasses; here's a hundred pictures of things that aren't safety glasses - to help it recognize these things in the context of the document. And that becomes an onerous exercise for getting any forward movement on multimodal work instructions. So the promise of "hey, cool, put your work instructions behind this chat experience" is way more complicated than anyone wants it to be, because it has to ingest all these different parts of the text, and then you see on the right-hand side: how do I even gain ground? This is what caused these projects to become super complicated.

Now we have large language models - I guess you could call them large image models - that are built to understand, in a certain sense, what is happening within an image. So I'm going to show you a lot of examples of what's now possible with these large image models, and also where the same gaps exist. Are there hallucinations in the context of large language models? Absolutely. Do you have to build modules to look for that hallucination? Also absolutely. The same kind of thing is going to happen in image recognition: it's going to get some things right, it might get some things wrong, and we need to build some packaging around that as well.

So here's an example that you've probably seen before - at least I have. It's a scan of the assembly instructions for an IKEA shelf. I have a bunch of these in my den. They're starting to fall apart, unfortunately, but one of the things I did recently is put some new ones together, and you can see here step 7 - what's happening in step 7?
Nathan Lasnoski 23:04
I'm attaching the back of the panel to the rest of the shelf, and I'm putting the shelf together.

Nathan Lasnoski 23:12
Notice that there's really no text in this, right? So a large language model would have no ability to understand what's happening here. It might be able to tell me where the numbers are, but that's about all it could find. So what I decided to do is apply this to GPT-4V and see what it can understand about this image. You can see very quickly that it's able to understand that step 7 is attaching the back panel of the bookcase, and it did that without having to understand any text within that image.

Nathan Lasnoski 23:49
Really interesting. It also describes that the diagram shows how to make sure the back panel is flush with the top and bottom of the bookcase, so it describes other steps happening in that context. It's not just telling me that there is a bookcase; it's indicating qualities about what it's understanding. It's laying out the context of what's happening within the diagram. In a sense, it's starting to reason through this image in a similar way to how I would reason through it as a human.

Nathan Lasnoski 24:22
Now it's not directly analogous, but it's a lot closer than the way we used to do simple image analysis or image understanding in the past. So let's take a more complicated scenario that we've also tried to solve. This is a layout of a boat cockpit. It has a key associated with it - the key with A, B, C, D, E, F down here - and the key points to various elements within the cockpit, with recognition of what's actually happening within that cockpit. So I took that and asked: according to the image, where do I find the trim tab control? Now, this is very interesting, because it gets one thing very right and one thing only sort of right. "The trim tab control is labeled K." Is it? Yes, it is. "And can be found on the right side of the helm." OK, right side of the helm - it got that right, and that was my core question. "It is used to adjust the angle..." - so it gives me some extra information I didn't ask for, but it's still helpful; that's what the trim tab control does. And then the next thing it says is very interesting: "You can see it in the image you sent me, next to the speaker (H) and ignition key (J)." I was like, well, that's odd, because it is kind of next to H, but it's not next to the J ignition key - that's over here. That's weird; how is it next to that? And then I realized: oh, I see what it's saying. It's between those two - it is kind of next to it right here. So it's interesting how it reasoned through that. It got things largely right, but then also kind of messed up that J piece. It's interesting how it reasons through something, but in comparison to what was possible before, it's ridiculous - you couldn't do anything even close to this before. It's amazing. So, very interesting. I could see how, in applying this to a use case where you have a lot of images, you might have to put some protections in to validate certain things or check for certain things. And there are ways to accomplish that.
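To make this concrete, here is a minimal sketch of the kind of call that sits behind these examples, using the Azure OpenAI Python SDK. The endpoint, key, API version, deployment name ("gpt-4-vision"), and file name are placeholders and assumptions, not details from the webinar.

import base64
from openai import AzureOpenAI  # pip install openai

# Placeholder endpoint, key, and API version - substitute your own resource details
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2023-12-01-preview",
)

# Load a local diagram (e.g., the cockpit layout) and embed it as a base64 data URL
with open("cockpit_diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision",  # assumed name of a GPT-4 Turbo with Vision deployment
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "According to the image, where do I find the trim tab control?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)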
Nathan Lasnoski 26:51
So here's another example: a manual fire extinguishing system. "Pull handle to manually activate the engine compartment fire extinguisher. 1: Pull pin. 2: Pull handle." OK. So I asked: what do I do if the boat's on fire? According to the attached image, to execute the manual fire suppression system you need to pull the pin and then pull the handle on the fire extinguisher located in the engine compartment. Pull the handle as shown on the left side of the image - shown right here - and the steps are numbered on the right side. So it got the locations right: follow the instructions and stay safe, right? I thought that was pretty excellent. It did a good job of helping to interpret this. So if I had, for example, a chat experience where a customer could say "my boat is on fire, what do I do?", it could say: you need to pull the pin and then pull the handle, and here's where to find it. Now, there are all sorts of potential problems associated with building that, but in a very basic sense, that is something that was very difficult to do yesterday and is much easier - not easy, but easier - to do today, which is super interesting.

So then I thought, OK, let's take something really wild. Let's give it a picture of a living room that is flooded. What's going on here? I can understand what's going on here; let's ask it to provide some perspective on what's going on in this picture. So I simply asked: what is going on in this image? Is it a problem? And it says: the image shows a room that is flooded with water, which is indeed a problem, as it can cause damage to the property and belongings. The room appears to be a living area with furniture and other items partially submerged in floodwater. It looks like the flood was caused by a leak or a storm, based on related search results. I hope the owner is safe and dry.

So why is that interesting? Let's imagine for a second that you build an image capture solution for your factory, or for homes, or for some other kind of use case. It's armed at one point, or maybe not armed at another, depending upon the privacy requirements. Rather than having to train it on "well, this is water, and this is the location of things, and this is what a fireplace looks like..."

Nathan Lasnoski 29:25
"...and this is a chest of drawers or whatever, and so this is wrong or this isn't wrong" - it can recognize, to a large extent, what's happening here, whether it's problematic, and then respond to that. So: is a person walking inside the yellow lines on a manufacturing floor, or are they not walking inside the yellow lines? Are they hanging above the manufacturing floor because their ladder has fallen down, and we need to go pick them up? We can look for things that a human would immediately recognize as problematic, and then respond to them. Let me give you another kind of example that I think relates to this. There was a story where Tesla cars were running over kangaroos in Australia, and the reason they were being run over is because the car recognized that particular image as a bunny rabbit. It thought the kangaroo was a bunny rabbit, and it was like, well, whatever, OK, I'm just going to run the bunny rabbit over.
That's safer for the driver than trying to brake or swerve in that kind of situation. And that happened because you trained it with a hundred images, and it thinks the kangaroo is just a big bunny rabbit. Well, in this context you can see its reasoning is substantially more capable than what was possible with other kinds of platforms. Now, you're probably not embedding this as the way to interpret real-time driving, but you can see how that translates into a lot of places where we want some sort of generality. Maybe another way to say this is that humans possess a remarkable ability to step back from a situation, reason through it, and understand things in a broader context than just the image itself. AI is not all the way there yet, but it's a lot closer than it was yesterday, and I think that's a really fascinating part of what's possible with this.

So, to get a little closer to a case where we know what good looks like, this is an example of how we would go about doing that. In this example, we're indicating to the AI what good looks like, and then we're asking it to tell us which of these things aren't good. You can see in the green circle we are teaching OpenAI what a good screw head looks like. You can apply this to almost any scenario, right? I've got a manufacturing floor: this is how the sticker needs to be applied to the box when it leaves the facility, and if the stickers aren't applied that way, stop it in QA - we need to reapply it. This meat that's leaving to go to Costco doesn't have the sticker applied that lets it check out - stop it before it leaves and apply a new sticker. You can tell it what good looks like, and it's smarter than what we were previously able to train things around. So this is "good," and then we've got a series of things that may or may not be good. We asked the question, and it says the screw head with the green circle is the standard for a non-defective piece. Great, thanks. Screw head number 1 is defective - yes, clearly; that is not how I'd want my screws delivered to my house in a brand new package. Number 2 is not defective - yep, good, thank you. And then numbers 3 and 4 were also defective, and you can see it's also doing some additional reasoning about why, giving me those extra details. I think that's really interesting, because it can provide additional data back to QC that a human would probably provide, without having to ask the human to type it up. And I think that's another area where this is really going to provide value: a human may still be involved in inspecting these, but they're going to do so after a lot of the initial work has been done, and even some of the initial write-up has been done.
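As a rough illustration of the screw-head scenario, one way to "show it what good looks like" is to pass a reference image alongside the candidate images in a single request. This is a sketch under the same assumptions as the earlier snippet (placeholder endpoint, key, deployment name, and hypothetical file names), not the exact prompt used in the demo.

import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
    api_version="2023-12-01-preview",
)

def data_url(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# One known-good reference image followed by the candidates to inspect (file names are hypothetical)
content = [
    {"type": "text", "text": (
        "The first image is the standard for a non-defective screw head. "
        "For each numbered image after it, say whether it is defective and briefly explain why."
    )},
    {"type": "image_url", "image_url": {"url": data_url("screw_reference_good.png")}},
]
for path in ["screw_1.png", "screw_2.png", "screw_3.png", "screw_4.png"]:
    content.append({"type": "image_url", "image_url": {"url": data_url(path)}})

response = client.chat.completions.create(
    model="gpt-4-vision",  # assumed deployment name
    messages=[{"role": "user", "content": content}],
    max_tokens=800,
)
print(response.choices[0].message.content)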
So this is an example for video - an insurance example. I had a really great video that was struggling to play in this presentation, but essentially what happens in this video - and this was real time - is that the guy walks around his car, takes a video of the car, and then I asked OpenAI to provide me with a summary of the walk-around. So the guy basically took his camera, walked around the car - the car has damage - and said, give me a summary of the walk-around, and it wrote up the summary for me. So rather than having a person write up the summary, the AI actually saw the problems - the chips, the dents, this and that - and wrote them up into a summary that could then be leveraged by an individual who is going to take action on it. Think about that for a second. It says: the rear side of the blue Toyota Camry has sustained significant damage, characterized by deep scratches and scuff marks. These marks are most prominent around the wheel arch and extend to the rear door and the area around the fuel cap. The rear bumper displays noticeable scuff marks with paint chipping - and then blah blah blah, right? So you're seeing it do a really nice job of not only seeing the damage but writing up the report and doing something with that, and then an individual can inspect what actually comes out of it. I was on a call with another individual this week and he said, "that's scary good." I'm like, yeah, it's pretty interesting, right? That I could apply this not only to still images, but actually walk around a vehicle with a video and have it document what's happening in that context - the use cases associated with that are nearly endless. Really, really crazy opportunities as a result of what that can provide in the market.
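GPT-4 Turbo with Vision itself takes images, so one simple way to approximate the walk-around scenario - not necessarily how the demo was built - is to sample frames from the video with OpenCV and send a handful of them in one request. The file name, sampling rate, and deployment name below are assumptions.

import base64
import cv2  # pip install opencv-python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
    api_version="2023-12-01-preview",
)

def sample_frames(path, every_n=30, limit=10):
    """Grab every Nth frame as a base64 JPEG data URL, up to `limit` frames."""
    urls, i = [], 0
    cap = cv2.VideoCapture(path)
    while len(urls) < limit:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                urls.append("data:image/jpeg;base64," + base64.b64encode(buf.tobytes()).decode("utf-8"))
        i += 1
    cap.release()
    return urls

# Build one request containing the instruction plus the sampled frames
content = [{"type": "text", "text": (
    "These frames are from a walk-around video of a damaged car. "
    "Summarize the visible damage as a short inspection report."
)}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in sample_frames("walkaround.mp4")]

response = client.chat.completions.create(
    model="gpt-4-vision",  # assumed deployment name
    messages=[{"role": "user", "content": content}],
    max_tokens=700,
)
print(response.choices[0].message.content)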
This is probably a good time to also note that if you have questions as we're going through this, feel free to put them in the chat. We'll also have plenty of time at the end to dig into questions as you have them.

OK, so here's the audience participation part of this presentation. I'd like to walk through a little bit of what it means to reason through this - how does a human reason through it, and how might the AI start to reason through similar situations where we're seeing partial information? So let's start with this. You'll notice this is something handwritten. I want you all to put in the chat what you think this says. I'll give you fifteen seconds. "Walk." "Mark." Nice. OK, so is it "walk" or is it "mark"? That was a really good call. This is what the AI indicated: it thought the text in the image is not clearly readable due to its handwritten nature, but appears to be a signature that looks like "Mark" or something similar. The other text portions are blocked out and cannot be read. This is what I actually think was the most interesting part of its response: the fact that it knows there are other portions of text that should be here that can't be read. It recognizes that this first part is not the only element of the image - that there are other parts of this document that are missing. How does it even know that? I mean, I know that, but how does it know? That's probably one of the coolest things about all of this - that it's able to understand that.

OK, next thing. What's the second word? Lots of "steak." "Steak" picks up - there's a "freak" in there too. OK, so this is what it's getting: the text extracted from the image is "milk" and "steak." It appears to be part of a shopping list or a list of ingredients needed for a recipe, as it is a list of food items. The other parts of the list are not visible or have been obscured. OK, I think I'm starting to see "milk" now that I've seen "steak" - I'm just getting it too. So then, here's the rest of the list. Clearly it's a shopping list, right? The text extracted from this image is: eggs (dozen), milk, steak, mayo, organic bread, beer (one pack, not full fridge), bunny.

Nathan Lasnoski 39:20
This appears to be a handwritten shopping list detailing specific items and, in some cases, quantities or types that need to be purchased. The note on the beer, "not full fridge," suggests a reminder or an emphasis on moderation or limiting the purchase quantity.

Nathan Lasnoski 39:36
This part of it, for me, is where this truly sets itself apart. There are other OCR systems, but the way it reasoned through pretty quickly - once it saw "milk" and "steak" - what this was, and then the intention behind why it was written (the "not full fridge" part), was really interesting. I think what's also interesting is that you can see the combination of the large language model and the large image model: it's understanding content, but then it's also reasoning through it and conveying that back in text the way a human might. It connects both sides of the coin. We were working with one hand behind our back when we couldn't use images the same way we could use text; now we can bring both of those together to accomplish some really interesting ends. So I think that was pretty fascinating.

Now, this is applied to an analysis scenario. What you can see here is a Power BI report, and in this Power BI report you have a whole bunch of things: total sales, average sales, a graph that indicates the reservation type, couples, and just various stuff - things you would normally see. So we prompted it: you're an AI assistant which helps inexperienced users of Power BI.

Nathan Lasnoski 41:13
"How" - and I didn't even spell it right - "how do you filter this down to just Canada?"

Nathan Lasnoski 41:20
To filter the report down to just Canada, follow these steps: click on the Canada label in one of the visuals, blah blah blah. If there are additional filter settings or options available, look to the right side of the screen where there might be a filter pane. If you need to reset the filter and view the entire data set again, click anywhere outside the selected filter label - which is true - or use the reset filter option if available. So I think that idea is pretty cool, but I think this next thing is really where it gets very interesting. In two sentences, summarize the key takeaway from this graph; in one sentence, summarize a non-obvious takeaway from the chart. Kind of like I would, right? Maybe I'd find the obvious things and not find the non-obvious things - so what are the non-obvious takeaways from the chart? The key takeaway from this graph is that total sales are $61,509,682, with an average of 1,363 sales per reservation - and it continues - and a non-obvious takeaway is that while sports activities lead in sales by type of interest (right here, type of interest), the majority of sales by source type come from email, not social media or a website. So they mostly come from email. I think that's interesting, which is kind of what I'd be thinking, right? Sports activities would probably be more aligned to people browsing websites...
...or actually social media - you'd think email would be less associated with sports activities than some of the other sources. I think that's pretty interesting: not only does it find the key takeaway, which is the obvious stuff - here are our total sales, here are our average sales per reservation - but it can answer questions about the actual content within the graph itself, like these questions right here. This is something I think you'll see lit up in Copilot for Power BI. So when you're in Power BI, you're going to be able to use Copilot to actually create the report experience, but you'll also be able to use it in the context of asking questions about the graph you're inspecting. I've seen some other companies position chat modules in their sales experience or their customer service experience where the customer can ask questions of data in their business system - and now they'll be able to ask questions of the visuals within their business system as well.

Now, this can get dangerous if you're applying it stock, without any kind of grounding. There have been a lot of conversations about how OpenAI is able to pass the bar exam or a medical exam, things like that. I think you're quickly going to find that it could be applied to other things, like reading an X-ray. Doing so without grounding is probably going to misdiagnose a fair amount of what's happening within those X-rays. But with grounding, you'll get to a position where it can understand that data faster than a human could, because its ability to inspect and understand what's happening within the image, combined with a body of knowledge that is larger than what a human can learn in a lifetime, could truly advance that particular space. So there are some really interesting use cases where we're going to be able to apply image analysis to things that were challenging before. A lot of use cases tied to image analysis have struggled because they used legacy machine learning techniques, whereas these large image models allow us to take a step back from that and reason more like the way a human reasons in the same context.

So let's take another use case - and this is something we're actually going to talk more about in an upcoming week - this idea of code generation. In this context we have a graph, representing curves from this sort of pre-training chart, and the question is: generate Python code to draw similar curves. And it did. It was able to interpret what's happening in that image and then produce output that represents how to create those curves in code. Why this is so exciting is that it enables every person who's producing code, or producing pseudocode, to accomplish more. And I think over the next several years we're going to see more and more code development happen via English - via some kind of spoken language - that is not dependent on a person understanding how to actually write the code. It's more dependent on the person being able to express their desire and their needs to a coding platform, in the sense that every individual will be able to hand off activities to an assistant to execute on the work that assistant could be doing. And this would be an example of it being able to do that same thing. It's really exciting that individuals can basically hand something off and say: hey, go produce this, what does this look like, do something with it - and then respond to it and take action. Whether it's a direct coder, or someone who's trying to produce something in Excel, or an accountant who is trying to produce something, any of this can start with "I want something that looks like this."
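For a flavor of what comes back, here is a purely hypothetical matplotlib snippet of the sort the model might generate for a chart like that. The actual generated code isn't shown in the deck, so treat this as illustrative only.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the kind of code the model returns when asked to
# "generate Python code to draw similar curves" from a chart image.
x = np.logspace(0, 4, 200)
for scale, label in [(1.0, "small"), (0.6, "medium"), (0.35, "large")]:
    y = scale * x ** -0.2 + 0.05 * scale  # smooth, gently decaying curves
    plt.plot(x, y, label=label)

plt.xscale("log")
plt.xlabel("Compute")
plt.ylabel("Loss")
plt.legend()
plt.title("Curves in the style of the source chart")
plt.show()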
One other use case I've been seeing: there was an earlier scenario where we were talking about code generation, and even just that kind of generation from an image is an example of how this all fits together. OK, let's talk a little bit more about language. In this example we ask it to extract all the text in this image, and these are a variety of different spoken languages. Very quickly it's able to extract that. That's not that difficult, right? We have other OCR systems that can extract text from images. Then: provide the likely language for each word. Now it gets interesting very quickly. It provides the correct language for each word, understanding that some of these aren't even written the way they would be written in that language - Japanese is a character language, yet it still understood that "arigato" has some relationship to Japanese. I'm not necessarily sure what that means yet, but all of this fits into that. So then, what are the words saying? They're all saying "thank you" in different languages. But here's where I think it gets even more interesting: is there a possible mistranslation in this list? Does anybody know what it is? I happen to know this one, because my kids take Latin, so they learn some things. Yes, there is a mistranslation in this list: "gratias tibi" is Latin for "thanks to you," but "thank you" in Latin is commonly translated as just "gratias." Also, "spasiba" should be spelled "spasibo" for "thank you" in Russian - I didn't know that one; that is very interesting. You already noticed this with text in ChatGPT, right? You would ask a question, and if you misspelled something, it would usually still be able to answer it, which is completely different from the way most things worked in the past. With most search platforms or chat bots, you'd misspell something slightly and they couldn't do anything - they were just so stupid. Now we're getting to a point where you combine the image analysis with text, with the ability to even do translation, and you're getting some really interesting abilities.

So where that comes into play might be extracting text. We've got a whole bunch of projects going on right now where we're extracting text from things and automating a process. So we're taking this balance sheet and asking: extract as markdown, use tables if possible. And it did - it extracted that into markdown tables - and then it turned that into JSON, so I could use that JSON for a variety of purposes. Its ability to do that quickly, with very little or no training, is pretty incredible versus what I would have had to do before on the tech side, because this is an image, right? I'm having to interpret something from an image, translate that into markdown, and then translate that into JSON, in a way that doesn't require a lot of manual work. The old Azure Form Recognizer, for example, in comparison to what I'm able to do directly with the new capabilities, is night and day in its ability to grab things from the image. So this is getting to a point where you can accelerate the ability to take in data written as unstructured text and leverage it to accelerate processes that can use it to accomplish valuable ends.
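Here is a rough sketch of that extraction flow, assuming the same placeholder client setup and a hypothetical "balance_sheet.png" scan: first ask for markdown, then ask for JSON in a follow-up turn and parse it. In practice you would add validation and retries rather than trusting the parse to succeed.

import base64
import json
import re
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
    api_version="2023-12-01-preview",
)

with open("balance_sheet.png", "rb") as f:  # hypothetical scan of the balance sheet
    sheet_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

# Turn 1: extract the document as markdown tables
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Extract this balance sheet as markdown. Use tables if possible."},
        {"type": "image_url", "image_url": {"url": sheet_url}},
    ],
}]
first = client.chat.completions.create(model="gpt-4-vision", messages=messages, max_tokens=1500)
markdown_table = first.choices[0].message.content

# Turn 2: convert the extracted markdown into JSON for downstream automation
messages += [
    {"role": "assistant", "content": markdown_table},
    {"role": "user", "content": "Now convert that table to JSON. Return only the JSON."},
]
second = client.chat.completions.create(model="gpt-4-vision", messages=messages, max_tokens=1500)
raw = second.choices[0].message.content

# Strip any code fences the model wraps around its answer, then parse
data = json.loads(re.sub(r"^```(json)?\s*|\s*```$", "", raw.strip()))
print(data)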
So, as I'm getting toward the end of the examples - man, there are a lot of them; you're probably wondering how many more examples this guy has - here's where I'm going with some of these. This was one that Microsoft did for Microsoft Ignite. This was not mine, but what I thought was really interesting is they created these experiences based upon image understanding. In this Ignite example, they said: alright, cool, create me a website based upon the image understanding I have. And this is something people are now doing a lot with OpenAI: they're saying, here's the item, here's the thing I need - go write up item descriptions for my website. Amazon is doing this, for example - write up all the descriptions of the things that I'm selling - and it can do that pretty eloquently and get them out there. Now, sometimes there are problems - there actually was a more recent case where it wrote something up wrong - but ultimately you're seeing a lot of this action being taken. So this is an example of a chat interface layered on top of that. "Hi, how can I be helpful today?" "How should I take care of my tent?" It's finding that on the website, and it's giving me the references for where on the website that can be found. But I think this next one is even more interesting, because this is something that I might actually do. "Hi, how can I be helpful today?" I'm in a camping location, and I see a tent and think, oh, that tent's really cool, but I don't know the guy and he's not even there. I'm not going to walk around his campsite, because that's kind of weird - that's a little sketchy, right? So I'm just going to take a picture, which is also kind of equally sketchy, and then I'm going to take that picture and say: hey, can you recommend a tent that looks like this for under $200? And I would totally do that with other use cases as well, like a backpack or whatever. And it said: hey there, I'd be happy to help you find a tent that fits your needs and budget. Based on the request for an orange dome tent - notice that it determined it was an orange dome tent - I recommend checking out the Trailblazer X2 tent. It's under $200, blah blah blah, right? So I love this kind of idea of engaging a person where they're at, with the information
they have, to be able to provide data back. I was dealing with a chat interface the other day - I won't go into the details - but it was one of those dumb old chat interfaces where all it could basically do is answer the multiple-choice questions it gave you, and I was just thinking, ugh, give me a good one I can actually interact with. This is where it's really exciting to see where these things are going. There was an earlier question about how much you need to train this specifically. This is one where you would ground it with your company-specific tents, right? You wouldn't want to be Lund or Sea Ray, and have the Sea Ray website recommend a Lund boat - that would be bad news. Don't recommend the competitor's boat; recommend my boat. So you can see how grounding is going to be really important, because it lets you surface the right information for the customer to make a transaction with you. You can govern that based on how you want to ground the experiences.

You're also going to run into use cases where the generic capabilities aren't useful enough without grounding. I did this not-for-profit project - I don't have an example here, but I can dig into the past to find some - where we built a wildlife identification AI, and it used the Microsoft AI for Earth model, which is built to identify wildlife species. Where Microsoft had applied it was looking for these Discovery Channel-style white tigers or panthers or whatever they are - animals that show up only ever so often, where you're just lucky to see them once in your life. So they put these cameras out, and they have to have people inspect the footage, looking for the cat to show up. Microsoft made this model for not only that, but a variety of things. So we took that and attached it to my bird feeder and a bunch of other bird feeders at this ecology center that I'm on the board for, and it would identify every animal that showed up, from the most broad to the most specific, because this model was built to identify very specific images associated with that wildlife. The reason I bring this up is because you're going to have examples where we do need to train these models to be very specific about a particular thing, because they don't have that knowledge yet - it's not apparent to the thing we're training on that that knowledge exists. So, just like large language models: there's a certain amount of capability that exists within the model in and of itself, to understand something you put in front of it, and to the extent that you prompt it. But then behind that comes retrieval augmented generation, where you surface data that sits behind it to provide more insights. The tent is an example of that, but so is that wildlife inspection example.

OK, one more example, and this is design analysis. This has come up several times with customers: hey, I want to arm my engineering teams with a tool that inspects the thing I'm building and determines whether the walls are too close. Have I designed this right? Is it put in the right position?
So we ask: what is this? And it correctly was able to understand what's happening on this schematic - this is a central microprocessor; this is (I don't even know what these things are) an oscillator circuit and it does this; these are decoupling capacitors. That's pretty wild that it could do this. Now, could it get some of those things wrong? Yeah, I think so. It could get some of those things wrong, and this is where you might need to again ground it in things that are specific to your organization. But the fact that it got me to this spot is a huge step forward from what it was able to do before, which is really where we're going now. You could even use this to say: design me a schematic, place this on the board in the most optimized way - and then the human will look at it and take action on it. You can see how the combination of those two will allow us to achieve some really amazing outcomes.

So I'm going to close by doing two things: one, I'm going to show you how people get started, and two, I want to show you the architecture behind this. Then we'll hit as many questions as we can. So how does this all work? At the end of the day, you have a user, and that user is interacting with some platform - or this could just be an autonomous process. So it could be a user or an autonomous process, and it's running a query or doing an action, and it's retrieving answers that allow it to take actions from a platform. How is it doing that? Usually you're surfacing some kind of platform - a website experience, a chat bot, something that surfaces this to your customer - and behind that sits the OpenAI service that is interpreting it. It's interpreting the text, or it's interpreting the image that you sent forward, and from that it decomposes the activity into its natural steps: retrieving documents, doing an external search, summarizing based upon information it has. And based upon that, it's also leveraging the OpenAI service to return information that has been loaded into a cache. In preparation for that, we built a data store that contains OCR'd documents - but in this case we're chunking them and then using the OpenAI service to understand them as embeddings in that database. So when a question comes from the user, it goes all the way through that pipeline and can leverage the information I'm loading into the process to facilitate the answer back. Now, one thing that hasn't been talked about yet is this idea of applying moderation filters to the responses, and I think in the context of images that's going to be even more important than it is in the context of text. It's important for text, for sure, but with images you can understand how this could get really dicey, right? What are people going to do with this? How do I manage the moderation of those images based upon what people are doing with them? These moderation filters can look for types of questions that people shouldn't be asking, or filter based upon just my company, or use just the images that are in my environment. And then here, on the answer side, is where you insert things like hallucination modules to look at the content that's going back to the customer and provide them the best experience possible.
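Here is a minimal sketch of that chunk-embed-retrieve-answer loop using in-memory cosine similarity. A production build would typically use a vector index such as Azure AI Search instead, and the deployment names ("text-embedding-ada-002", "gpt-4-vision"), sample chunks, and question below are assumptions for illustration.

import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
    api_version="2023-12-01-preview",
)

def embed(texts):
    # "text-embedding-ada-002" is an assumed embeddings deployment name
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Chunk the OCR'd / extracted documents ahead of time and embed the chunks (the "cache")
chunks = [
    "Step 7: attach the back panel so it is flush with the top and bottom of the bookcase.",
    "The trim tab control (K) is located on the right side of the helm.",
    "Manual fire suppression: 1) pull pin, 2) pull handle on the engine compartment extinguisher.",
]
chunk_vectors = embed(chunks)

# 2. At query time, embed the question and retrieve the closest chunks by cosine similarity
question = "Where is the trim tab control?"
q = embed([question])[0]
scores = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 3. Ground the answer in the retrieved chunks; moderation and hallucination checks would wrap this call
answer = client.chat.completions.create(
    model="gpt-4-vision",  # assumed deployment name; a text-only GPT-4 deployment also works here
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "Context:\n" + "\n".join(top_chunks) + "\n\nQuestion: " + question},
    ],
    max_tokens=300,
)
print(answer.choices[0].message.content)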
So, lots of really interesting stuff in this, but note that in any given picture, doing this is the sum of a lot of different parts. It's not just OpenAI and the user and that's it, right? It's OpenAI, the user, and then the stuff you load behind it to make it the most successful.

To bring this home, what I want to convey to you is: alright, this is cool - how do I get going? How do I get started with this? There are a couple of activities that we always like to start with. The first is executive alignment. If you have not aligned your organization to the mission of artificial intelligence in relationship to the mission of your business and how it all fits together, that really is the core starting point, because that's what lets you think broadly about how your business is going to go to market over the next several years as the market gets disrupted, as incremental and disruptive change happens, as you apply guardrails, and as you think about how you're going to change your business. All of that has to happen at the executive level, or at least be inclusive of the executive level, to be able to take action. That lets you go into group envisioning sessions to come up with the right ideas. Those ideas ultimately lead into scenario evaluation - quoting, or customer service, or data mining, or whatever you're doing with it. Ultimately, it has to have an ROI, it has to have a sponsor who cares about it, and it needs to have data that backs up the use case - and if you're missing any one of those three things, you're going to struggle to achieve the outcomes you're looking for. If you pass that gate, you then get to move into executing POCs and pilots: the POC proves that this is possible, and the pilot proves that it delivers value, which then lets you move into production iterations of getting those outcomes and managing to validate that you get those outcomes. And at the tail end - which sometimes comes all the way back to the beginning - is this idea of a scale pattern. How do I do this a lot? I'm going to have a hundred AI projects; how do I do a hundred AI projects at scale and not recreate the wheel every time? That's where things like a scale pattern really come into play in doing this right.

So what we would love to suggest to you - and it is absolutely in the form you'll see as you close out of this webinar, so make sure you fill it out - is that we would love to come out and help with executive alignment, group envisioning, and evaluation of scenarios, and we will do that for free, on us. We love to help companies get past those stages so they can take action, but we know we have to invest there, and you have to invest there, to make that possible. So as you leave today and go on to your next thing, fill out that survey. I want to know what you liked from this and what you didn't like, and I'd love to know how we can get engaged in any one of those steps to help you gain ground in AI today.

Alright. Now that we're here, I would love to get some more questions. Let's first go through the chat and make sure people don't have more questions there, but feel free to drop them in the chat or come off mute, and I'm happy to address them. OK, I'll hit the one that's in the chat: is it able to follow visual instructions? I think that's definitely represented in the IKEA example, in the key-with-pointers example, and in the
installation instructions example. I think there's some strong backing behind it being able to understand steps and what has to come first, second, and third in the context of those steps.

Nathan Lasnoski 1:05:31
There are definitely still some gaps there, but a lot of ground has been gained. OK, great. Well, thank you so much. This was really fun. Putting this deck together for OpenAI GPT-4V was a super interesting experience, so thanks for giving me the opportunity to do it.

Nathan Lasnoski 1:07:03
I'm loving the opportunities coming up within companies to actually take advantage of this, and I'm really hoping to spend some time with each of you to have those conversations.

Nathan Lasnoski 1:07:10
So have a really wonderful afternoon. Thank you for spending some time with me, and thank you very much. Yes, the recording will be available - thank you for asking that question.

Nathan Lasnoski 1:07:20
Thanks, everybody.