AI Industry & Applications · 2026-05-17 · 08:00:00

AIE Singapore Day 2 ft. Google DeepMind, OpenClaw, Adaption, Arize, Cloudflare, Robot Company & more

Speaker

AI Engineer Singapore

AI Engineer first Asia edition (organised by 65Labs)

Type

Industry Leader

Source

In Brief

Day 2 of AI Engineer Singapore — sessions from Google DeepMind, OpenClaw, Adaption, Arize, Cloudflare, Robot Company and others. Day 2 leans toward robotics, model observability, and the runtime stack.

Readable transcript

Caption language: en · Fetched: 2026-05-21

Waves crashing night waves crashing ocean know it. You need Hey, hey, hey. Hey, hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Heat. Hey, hey, hey. of this event, co-founder 65 Labs, and thank you so much for showing up. I know it's day three, Sunday morning, and all of you here in this room have chosen sleep deprivation over missing a single second of sessions and I really appreciate that. Thank you. Um, so, you know, I think we're on the final stretch here. If you haven't noticed, I'm losing my voice, but you should see the rest of the organizers. I've subbed in for S Sherry this morning for precisely that reason. Um, but we're super excited to have everyone here. We've loved the energy over the last few days.

Uh, and when we started building, when we started putting together AI Singapore, this is really the sort of energy that, you know, we were hoping for and you've all really delivered. So, thank you so much. Um before we kick off, I just want to say another quick thank you to the sponsors, the speakers, uh all the volunteers who have helped us make this conference a magical experience so far. Uh really appreciate all of you and uh and would appreciate um if everyone here could just give them a quick hand. Great. So, you're not here to see me. So, without further ado, I'd like to bring Salanne from Arise on stage to talk about her experiences in building Alex. >> Good morning everyone. Thanks so much for spending your morning with me. It's pretty early. Let's see. Yes. Time to go. All right, let's see.

Sorry, I got to reconnect to my hotspot. I thought I did this already. Cool. There we go. Good morning everyone. Uh thanks so much for joining me today. I'm super excited to share some lessons uh my team and I have learned from building Alex, our AI agent uh that we've been working on for a little while here. Before we get into that, I want to introduce myself a little bit. I'm Salian. Um head of product at Arise. Um I have a technical background. I started out in data science and now I'm building products for teams. Um I'm pretty hands-on. I'm not only the PM of Alex, but I'm also a core contributor. So I really know the pains firsthand of building an agent. And now I pretty much take that pain and I turn it into tools that actually help folks. So Arise uh we make agent work. There are a few things that we do really well.

The first piece of it is observability. Uh this is understanding what's happening under the hood for your agent. The second piece is evaluations. This is how we understand how your agents are performing. And then we use all of that data to help you improve and iterate. And then of course we have Alex sitting across the entire stack to help you do all of it. So, what are we going to be talking about today? Uh, we're going to first I'm going to tell you a little bit about what Alex is and then I'm going to go through four lessons we've learned over our journey of building it. So, staying on task, context management, crystallizing good behavior, and debugging a real agent. So, Alex uh is your AI engineering agent harness. Uh, we've really built Alex to help you build and scale your AI application in natural language.

Um, so it has really evolved the Arise experience. It has plans, reasoning, um, and executes through really heavy workloads for your AI agents. Um, you can pretty much ask a natural language anything you want and Alex can help you execute. It can do things like help you analyze your data, but also help you carry out workflows like iterating on your prompts or aligning your emails. And it's really a force multiplier for AIG, PMs, and subject matter experts. And so why am I here telling you all of this? Well, uh, we spent three years building Alex. It's been quite the journey. We first started at the very beginning of kind of generative AI and now we've gotten to Alex 2.

0 with reasoning and planning and there's just been a lot of lessons that me and my team have learned and I think the great part about our industry and our community is we have the opportunity to share back and so that's what I'm here to do today is to teach you a little bit about our lessons so hopefully you don't have to learn them the hard way like we did. So lesson number one staying on task. I think every agent builder has experienced this where you ask your agent to do a handful of things. uh maybe it's able to do the first one successfully but then it forgets about you know the second and third and I think that this is something that everybody really tries to solve. Um people commonly ask me like well why is this happening? Uh people assume it's like hallucination problem or even a capability problem but it's really not.

It's a tension problem. And so what ends up happening is when we're asking for multiple um things from an agent, um what typically happens is it is able to see the first one, but then the rest of it kind of gets lost in all of the other data that we're asking for. And so it can be really hard for once the agent figures out what it needs to do next, it's already forgotten what next even is. So the sol the solution to that is planning. Um planning is the way for your agent to first decide what it is it needs to do before actually actioning on it. And so for Alex, before Alex even pulls any data, it's first going to come up with an explicit to-do that it has to uh reason upon and go through step by step before it actually takes that action. And so how we do planning uh for Alex is we have planning tools and states.

Uh we have three tools uh to do write, to do update, to do read um and then four states pending, completed, blocked, and in progress. We didn't actually start out with all these states. I'm going to talk about that in the beginning, but we have definitely found that just using something like a finish tool or using prompts was not enough for Alex to be able to accomplish really complex tasks. And so the tools um this is something that we borrowed from some of our our favorite tools like Claude. Um and this has been a real gamecher for us to manage extremely complex tasks. In progress was something that we actually learned. This was a really important lesson. When we first built Alex, we did not have an in progress. We actually just had like pending and completed.

Um, but we added in progress so that Alex knows exactly what it is, the task, um, that it's currently working on. So, it's really helped to anchor the agent and what it's trying to accomplish. Um, and just really improved our ability to complete our our task correctly. Another really key architectural decision that we made is that planning lives outside the conversation history. Um, and so it's really important to do that because for conversation history, we are doing a bit of truncation and we never want the plan to get truncated. uh because if that happens, Alex won't know what it is that it's it's trying to accomplish. Uh so we actually inject this every time we're making an LLM call right after the system instructions separate from all of the data in the conversation history. And this is actually what Alex sees.

So it sees its current plan. It sees all of the status and then we're actually coaching Alex along with like when you're done, you know, call to-do update with the status completed when you finish this task. So again, helping Alex as it's going along, not just giving it kind of a passive prompt, but really an explicit kind of fewshot example of what it is it needs to do as it's carrying out its plan. We also have what we call the finish gate. Uh this is what keeps Alex from saying that it's done before it's completed all its tasks. So if Alex tries to call our finish tool um without its completed tools, we give it actually a really explicit error that's saying, "Hey, you need to go back and finish all of your to-do items. " It's not a suggestion. It's not like kind of a nudge.

It's it's an explicit structured message that Alex gets um that it cannot go on. The only exception to that is the block status. The block status is used for when we have human in the loop. Uh if you use Alex, there's a lot of uh moments where we ask for the human to interact. So if we're creating a prompt, you can kind of get a diff and then accept or something like an annotation config where it's important for the human to be involved. And so when there's a block status, that's the only situation where Alex does not have to complete the task because it understands that that's blocked by the human and we're waiting for that response. And so these are some of the core lessons that we have from planning.

So enforcing code, not just prompts, few shot examples, beat any kind of abstract instructions, always use the to-do right to plan doesn't work. We have to have kind of those explicit functions and then show the agent what good planning looks like. So some of those examples. All right, context management. Uh, context management is extremely important. It was a non-negotiable for Alex. Uh, we're functioning on a lot of text data. So, Alex is built across the Arise platform. Observability data is for AI applications which also have a lot of text data. So, context management became extremely important. Um, I I did a talk on this actually in London, so definitely go check that out. But I think context management is not just managing the context window, but also being really strategic about what it is that we're showing our agents.

It's letting them remember what it needs to and forgetting what it doesn't. And so early on, this was actually a system prompt that we had for Alex, which was for our experimentation comparison. Um, and we said, "Do not try to compare more than two experiments at a time. " Uh, but this was pretty naive. Uh, the problem with this is that one experiment in Arise can be hundreds of rows, which is like 100,000 tokens. And so even just experimenting or sorry, trying to compare a single experiment uh was going to blow up our our context window. So we knew that it was not enough just to be able um to to have these um explicit prompts. So we came up with abstractions. One of them is called large JSON.

Um so what this actually does is when Alex is returning tool data, uh we store the majority of it in a serverized memory and provide the agent with an ID that it can grab later if it needs more context. So this is really important. Alex is constantly grabbing data from our platform. We can't show it all to the LM, but we also need to be able to give the agent enough context so it knows what to do next. Um, and so that's where we had this idea of like compressing the value, not the structure. At first, what we did is we tried to truncate and just give a preview of like the first little bit of data. So just taking the first like you know, n tokens of the data. But the problem with that is that Alex doesn't actually understand what the structure of the data is.

So it made it really difficult for it to query because oftentimes Alex needs a preview and then it needs to decide what data to look for further. Um, so what we did is compress the values and not the structure. So we kept all of the the fields, all of the arrays. Alex has access to all of that, but then we truncate any large strings within that and then it can use kind of the large JSON uh abstraction to go grab more data as needed. We also gave Alex a bunch of small composable tools and this is really important. So Alex has access to two tools uh jq which is just like the same tool that you would use in your command line and GP JSON which is able to do reex search over serialized data. Um and the importance of this is these are really really small tools but they're super powerful. Alex can use them together. They can be composable.

Um, and use the input of one or use the output of one into the input of others. Um, and so it just allows Alex to slice data, aggregate, do all these really powerful functions with really really small tools. So nothing super complex. I always like to kind of make this to the liking of like a a UX programmer. You can think of your tools and then like your agent is uh your shell script. So you really will hear me say all the time, think about small little tools that your agent can use and that will be what will make it the most successful. So these are some uh lessons in context management. Uh give hard token budgets on every tool output. We do like a 10,000 um limit on all our tools so that we have this predictable content uh that we know it's going to happen. So we know there will no be there will be no overflow.

There will just be multiple turns. Uh compress the values not structure. Uh don't paper over palms with artificial limits. uh give ex uh good exceptions in your feedback loops and then tool responses may contain customer data. So you should watch your logs. That's another important one. All right, crystallizing good behavior. So when we first started building Alex, um I spent a lot of time with a spreadsheet and like a Google doc trying to test. Uh but we quickly learned that vibe checking does not scale. Um it was really hard every time we made a change for me to know whether or not something was going to break. Um and so we knew that we needed a better solution to that. And what we really found is production traces as your ground truth is extremely powerful.

So at first we were trying to kind of write up the golden answers ourselves by hand, but we have a great example in our production traces that we can utilize. And so looking at your data and actually using those as your test cases is one of the most powerful lessons that we've learned with Alex. We do a few different types of testing when it comes to Alex. Uh so we have decision point tests where we're looking at one component. uh we'll pass it through kind of like our orchestrator and then we'll we'll test what the outcome is and then we do a very kind of um open-ended way of of checking this like an exact match is not going to work on on our outputs. So for an example like looking at contains any so for something like producing a time stamp like 2,000 milliseconds 2 seconds 2 seconds there's a lot of different ways.

So, we have this open-ended check that we can do to determine whether the decision was correct. And I think that that is really powerful, especially when you're using kind of a a language model where the output is non-deterministic. The other is trajectory tests. So, what we do is we kind of save off all those production choices that I was mentioning before and we step through them rowby row and we use an LM as a judge to assess the output. Um, the evaluation prompt really matters here. As I was saying before, these outputs are not deterministic. And so you want to make sure that your evaluation template can handle that um and is defining success for each individual step. Um level three of this is CI and prompt validation. So everything that we do for our testing actually lives in Arise. Uh we're running these as ad hoc tests.

We're running them as part of our CI and then we have these great visualizations. So I can actually come in and check how things are working over time. Uh seeing if there's any integration in our performance from our evals. Um, and I think that's what's really cool about building a tool with Arise is that we're like we're dog fooding our own product. Um, so everything that my team is doing, I know that can help our users as well, which has been uh, extremely powerful. And so some lessons from crystallizing behavior, capture good user sessions, uh, match facts, not phrasing, Elm as a judge for semantic evaluation, real APIs, not mocks, uh, integrate bugs are real. Um, and then my last lesson here, debugging a real agent.

I think this is something that I get a ton of questions on is like what are your day-to-day workflows for when there's an issue with Alex? And so we're really seeing this evolution of software engineering of who is consuming telemetry data. When we first started, we were very human in the loop. I was looking at the data directly, then going the IDE with me, me and my engineers were going to our idees, making the changes, and then observing it. We kind of started to see this software 2. 0 where we have our agentic idees, and now the human still involved, but we're using an agent to iterate. And now we really got into this phase three where we can actually use our coding agent directly um to be able to read our hotel data um and iterate. So this is kind of the the stack that we're currently using where we still are using arise.

All of our traces go in our evals our feedback. But we have what we call Arise skills that allows our cursors our our cloud code to interact directly with Arise and make our feedback loop really really fast. Um as agent builders we have learned that the feedback loop really really matters. I'm trying to make it how fast can we go from an issue to a fix. Um and the Arise skills have really helped us with that. And so um these are some examples of our ARIS skills. These are live if you'd like to uh try them out yourselves or come talk to us at the booth. Um but I basically use a lot of our uh Arise trace and evaluate skills. It just makes it so that my agent can get a signal. Um pull the traces from Arise, even look at external sources or code, put up a fix, and then me and my engineers can just review that.

And so this is the AI engineering loop that's powered by Arise that we are using ourselves. Um we're kind of our first guinea pigs always. If it works for Alex, we know it will work for everybody else. And so you can see we have a bunch of different agents leveraging our skills and improving on Alex. And so these are some of our debug flows in action. So reading the traces, pulling the full session, and then identifying the failed notes. Uh we can also read from external sources like data dog. Alex has really integrated into our UI. or APM traces also become increasingly important. Um, and then also things like G-Cloud logs. Um, so we found had an example with like out of memory. Um, and so we're able just to go from an issue to the exact um, root cause really fix fast so that we can then fix it.

And so these are some of our our lessons here from debugging. Um, skills are just markdown. They're low cost, high value. Definitely invest in your skills, your factory. Um, safety must be rappers, not prompts. uh agent debugging is a agent-shaped problem and then you know observability before you need it. Um you can't really have eval without observability. You can't really fix your agent and make it successful without observability. So uh that's something that we've definitely learned firsthand. So these are some of the big uh lessons that we learned and talked about today. Um I know I went through a lot of material fast. So if you have any questions uh we'll be uh over at the pullman in our booth happy to talk through anything in more detail. Um but thanks so much for spending your morning with me. Thank you so much, Salian.

Uh, up next, we're just going to get set up for Tim from Rizaro, who'll be talking to you about scaling evals. All right, good morning everyone. Uh, thanks for making time today. Uh, especially if you have come from the afterparty from last night. So today I'll be talking about scaling evals and maybe to motivate it, let me share with you a bit about the work that Rsaro does. So Raro is a testing and evaluation company.

We work uh primarily with uh uh companies in the mission critical use cases and spaces for example healthcare, defense, security and we help them test and evaluate the AI systems that they are developing or procuring so that they have the confidence that what they are deploying is good enough to go into production and today I'll be sharing some of the learnings we had over our past couple of years in in this journey where we see the main problems existing how how do we then overcome them as well as what are the remaining blockers to scaling use case specific testing evaluation. All right, so let's uh start with this slide over here like what do cobras sprint velocity tracking and AI benchmarks have in common. All of them shows examples of perverse incentives, right?

So with the examples of cobras, it's a case that if you incentivize people to catch cobras, people will actually be breeding them instead. And this leads to actually more cobras being out there in the wild. Um, same with sprint velocity tracking. If you're familiar, if you're a software engineer, if your manager asks you to increase the number of story points you can deliver, you see that result, but at the end of the day, it doesn't translate to any meaningful outcomes. At least from my point of view. Um, and then you might have seen some AI benchmarks and you test the latest open source models. They typically don't they might sometimes differ from your actual user testing versus like what they show in the benchmarks and you wonder how they they manage to actually get such good results.

So that leads to the concept of like what we call benchm benchmaxing. I think nowadays is is getting more popular where people actually game the benchmarks to show that they're good in certain tasks but it doesn't really translate to real world performance. On the other side, we have vibe coding, right? Or I'll call it vibe testing. So, vibe testing is a process maybe where you have a couple of um example prompts in mind, some trick questions. How many RS are there in in strawberries? Or maybe can um can you generate an image of a pelican riding a bicycle? So, what we see for for these examples actually is that actually it's not not that bad because they are pretty useful.

They give you a sense of how the model is performing maybe in a particular scenario or aspects that you're interested in but and and they also encourage explorative explorative u exploration of the process right where you can try out different prompts and actually find what's good enough uh for your use case but I think having said that as well um how how do you actually tell whether what is a pelican riding a bicycle testing versus maybe what is a tukan riding on a tuk tuk Are we talking about just a bird on a vehicle or are we talking about maybe some other types of higher level concepts we are testing?

So I think it helps to be very explicit over here like um even if we have a test case in mind what are the particular dimensions of interest that we are testing evaluating and this is where I see the middle ground between benchmarks and vibe testing. So the problem is then how can we structure the vibe testing approach such that we are able to then identifies the scenarios of interest as well as then um structure it and scale it up for like a more use case specific evaluation. So this then leads us to the concepts of operational design domains where we define that as um the sort of problems constraint space that we are testing against and this helps to govern what is the meaningful set of of test cases that we're evaluating.

uh from there we can then define what is the expected behavior of the system what are the age cases that we should be be aware of and also what are the cases that are pro probably not within the bounds of of this system and eval and and evaluation right so that is totally out of scope and should not be uh used and consumed by the AI system so from there we are then able to derive a pipeline and workflow internally where we actually translate the odds into different test cases of interest uh link that up with data quality checks to filter out the data that might not meet our requirements and also then enhance the data quality if um if if there are gaps over there. Right?

So we emphasize a lot in terms of finding the coverage gaps so that we are able to fill it and often times as we go into more mission critical use cases we find that there are there might not be enough test cases especially for age cases of interest and that's where synthetic data sets or synthetic data generation methods actually helps to bridge the the testing evaluation process.

So we put a lot of emphasis in terms of how can we generate synthetic data in a way to augment the test set and I think once we have we have framed the problem as such we will see that it's actually more of a that the data is the bottleneck right we can shift the problem from from eval to how do we generate the right test cases that gives us the confidence for deployment and the challenge with that then is especially as you become in a more niche and and use case specific kind of testing is that the synthetic data generation methods um nowadays are still relatively un uh not totally predictable, right? They don't give you necessarily the quality that you want for your for your for your generation. So, let me try to give a couple of examples over here.

Um and in this example, we are trying to evaluate the uh we're trying to evaluate maybe like the performance of a VRM solution in like a better fuel scenario and setting. So, we have a we have a pen tank on the right, right? Um and the question over here is like how do we know what's good enough for for testing in this particular use case and how can good enough be be defined uh for the generated data sets over here and more importantly I think how can we how are we able to quantify this testing evaluation such that we can then scale it up in a automated manner. So over here I have um three different augmentations examples of good augmentations right.

So maybe over here a good augmentation is something that follows the prom you be generated across three different weather scenarios rain snow and fog um and we we the main subject of interest is also well preserved if there are no sight artifacts. So this seems like good generations. And on the other hand, I'm sure if you're familiar with just generating images as well, you see that often times some of the images that generated have different types of artifacts. Uh for example, for the one on the right, you have two additional humans being being added to the image. And for the ones below as well, you see that um some of the the original tank and one of the one of the tanks has been converted to a vehicle instead. as well as the the range streaks might not be looking as realistic.

So how can we go from this vibe checking approach in terms of just eyeballing it and seeing that it looks right or it looks good or it doesn't look right into a more structured manner in terms of finding out these flaws. So for us it's about how we can then scale up the data quality checks so that we are able to automate the process of identifying such kinds of defects and flaws in a in a much more scalable manner. And I think we rely a lot on smaller deterministic models as much as possible to provide that insight. Right? So for example, if we are talking about two generated synthetic images, we we might want to compare them in terms of whether there's a meaningful change in the that map structure of the of the main object of interest.

Uh we can also then compare is there is there any new new subjects of interest that has been created of it from the original image to the generated image and all these use much smaller deterministic models that provides very good signals in terms of the data quality and as part of this pipeline we can then filter out the data sets that actually meets our uh quality criteria and use that for the testing evaluation process. We are also then able to actually scale this process up and maybe use this um enhance feedback to to actually fine-tune an evaluation model so that we can automate the screening evaluation process or subsequently the generative models as well.

So at the end of the day I think what we have ended up in is to assemble a whole pipeline of different metrics that caters to use case specific areas of interest and this provides us a very reusable toolbox in terms of how we can scale up the the generation of the data sets as well as automated quality uh checks and filtering. So we see this very similar to for example the problems in the in the coding space or in the mathematical reasoning space. You want to automate the validation and verification process as much as possible. uh this will help reduce the human uh oversight and and overhead required in terms of evaluating this and if there's any human feedback that comes in this should help improve our automated models so that this process can then become scalable.

The underlying metrics can then also be used for calibration of the of the data sets and um that we are generating because we find that for each use case uh specific scenario actually there's a very very big uh distribution of where the cut off for each metric might be. So the calibration part is a very important statistical concern over here. Okay. So just to round things up um we we talk about scaling the evaluation of of of use case specific scenarios and and data sets and I think the main challenge over here is really in terms of how can we scale up the synthetic data generation routines as well as add the necessary quality checks to give us the confidence for for deployment.

uh with this if you want to reach out feel free to contact me on LinkedIn to talk about evals happy to talk more about like test cases eval works that we do and I'll be around for the rest of the event as well thank you and have a good day see you >> thank you so much Tim uh that was a great talk and up next we have Abishek from Cloudflare who heads the ETI team in India there um and he's going to talk to us about how tool calls should actually be Hey everyone, good morning. Um, so I'm Abishek. Uh, I lead the emerging tech and incubation team at Cloudflare and head the India office. So we're a small team within Cloudflare which sort of works on new products, initiatives and a lot of cool things at any given point, right? Um, I'm going to talk about tool calling today.

I think everyone here at this point has had some sort of experience with tools. Uh, can I have a quick show of hands on whoever here has interacted with MCPS and knows what tool calls are? Awesome. So, everyone knows what we're talking about. Great. Standard tool calling, right? um you do this to give models capabilities beyond you know inference where like hey how do I have my model work with external sorry external APIs tools functionalities right um so let's take a very standard example uh I'm going to monitor an API look for errors and sort of do things based on you know certain conditions right uh the process very simple model sends you like hey I need to call this tool goes to the MCV server tool gets called you get the result and give it back to the model. Sounds pretty simple, right?

The problem is as soon as you start doing more complicated things, this becomes really costly. So let's take an actual example of a production scenario where you might have a model or an agent essentially which is doing a longunning task where it's continuously monitoring any new release that happens, right?

um wants to monitor for certain percentage of errors you know logs and then based on that try to do a roll back or make sure that hey we're good to you know release further right standard release process that has been followed for ages I think everyone here knows how that works with this setup what happens is you end up having a bunch of tool calls that happen sequentially one after the other right um so in this specific scenario I'm going to have my model, go list all my logs, you know, then fetch all my metrics, do conditional checks, uh, based on certain, you know, conditions, decide the next step. The problem that we run into here is that every tool call that you do is going to send the entire context of the current conversation plus the tool call plus the response, right?

So each turn becomes actually more context that you're sending. So one, that's bleeding money. Second, you're adding a lot of round trips, right? Right? So you're going to add a lot of latency. Essentially there should be a better way of doing this. And I think what we're going to talk about here is basically code mode. Um so code mode is our thesis around and I mean it's not just Cloudflare right now. I think this is becoming extremely popular everywhere as of now. But when we came up with code mode, the idea was models are inherently better at writing code, right? Um if you quickly take a look at the same example that we just discussed in a code snippet it looks something like this that hey I want to get all the errors metrics I can paralyze these tasks based on that I want to do some conditional checks and do the next steps.

Uh and the reason models are better at doing this is they have been trained on a ton of code, right? Versus tool calls are most of the tool calls that models have been trained on is all synthetic data and barely any data, right? So by natural instinct, you would feel that models are actually going to be better at writing code. And that's what we see, right? Today, if we look at the same tool call that we just described, right? Standard tool calls have a tool name, description, parameters, you know, expected output and that's basically what you feed the model. What we do is we have a library called code mode which essentially converts this into typescript types. Uh so one now the model has the same sort of setup but as code uh it notes that hey there is a function that I can execute to do this.

So in this same mapping right if you look at it we have the declaration of the function which is essentially the tool name. Um the description there is basically the tool description and then you have parameters that are passed through it right like your expected input and what is the output. Um now what this does is it essentially gives the same sort of capability to the model but in this case instead of giving you a order of tools sequentially the model writes a single code snippet and basically what we want this to do is work with everything that is already there in your current stack. Right? So you don't need to actually go swap out your entire tools. Instead of passing like an array of tools to the model, we basically pass it a single tool called code mode.

So you can wrap up the entire existing toolkit that you have and just pass the model a single tool called code mode. What code mode will be a typescript you know library or like let's say a file of typescript types as a string which goes to the model where it's like hey I know what tools exist and I can write code against it. Um in this case you'll also see something called exeutor. We'll come to that later. Again going back to the basics of why we write code, right? Like what we just discussed, a simple scenario that would have taken, you know, probably five, eight turns can be a single turn. And it also brings reasoning into the picture. Every time you write code, you could embed logic in it, right?

You have the capability to do variables, which means you can have, you know, interdependent tool calls based on like a previous response and then figure out what to do. You can do branching.

what I just described right like if the percentage of errors is above a certain level you could like you know do a case one otherwise case two you know same kind you can do loops uh a very standard example go through my cloudflare account list all the workers and then give me metrics for all of them the way it'll happen right now without code mode is list workers fetch worker one fetch metrics fetch worker two fetch metrics right and and that will keep going on with tool calls um that's going to add context as we discussed with code mode it'll be single for loop which can go over it again and again right and you can also do things like parallelize zinc tasks that essentially don't need to wait for each other uh so yeah I want to be very clear this does not replace MCP I think this is sort of new as concept so have to be very clear here explicitly MCP is the base protocol you still need that to essentially do the final last mile API call right your server will still do that what code mode does is gives model a better way to interact and do the tool calling.

uh the actual implementation of that tool call still happens on the MCP layer right I'm going to take a different example which is like practically what we face right uh so Cloudflare as most of you okay how many of you actually know Cloudflare awesome thanks uh had me worried there so Cloudflare has over 2500 APIs right uh which is a lot given the kind of products that we have you know across a bunch of different areas vertical if we just embed this as tools today right as like standard MCP tools it does over 1.

7 million tokens in context for most models we will overflow the context window with just the tool description so there's no way this works and this also comes to the same problem right if even if I convert this to TypeScript types today it will still run into the same problem right so the base idea around code mode is not that hey you blindly just replicate tools as types and do it right For most cases it will actually work and be better. But something like this you can think take a step back and think okay how can we do this better and one of the things we found here was just give it two tools search and execute right and in both these tools the model can still write code. Now search and execute as a strategy for doing MCPS has existed for a while.

People have created you know their searches like he we have a tool that gets other tools and then the tool that decides to execute it. Now you can write code here, right? So you can filter out. So think about it in a way where we tell the model, hey, we have a global variable which has the entire description which is not being passed to the model. But the model has the capability to write code that will give it back the exact tool to be called and then also write code to execute the same thing. By doing just this, right, like a simple search execute thing, we were able to actually bring it down to thousand tokens. The entire Cloudflare API spec today can be called via model with just thousand tokens. That's a 99. 9% reduction which is insanely high. I've never seen that level of compression across any sort of things.

So this is like a far more optimized way of doing things. Um yeah, exactly the example that you know we just spoke about.

You now have the model going like hey I'm going to do a tool call to search the thing write a code against it put like you know an exact script which gets executed all of this discussion we had we've have we've been discussing about like model writes code and then you know it gets executed but the key question that we come to now is like where does this get executed right um so take a step back let's go a couple years ago right like preAI if I had come to you and told you here's a random user generated code run it on your you know setup none of you would want to do it u that's like a exact you know massive CV that's RC so most people would not want to do it yet today I'm standing here and telling you to do the exact opposite that give models absolutely untrusted source and you know let them write code which could be anything which you never get access to and run it so where do we run it and that's what we come to the tiny computer part, right?

You essentially need a very efficient, secure sandbox environment, right? And there are a couple of ways of doing it. I mean, you could do containers. Containers have existed for ages, right? And the problem with containers typically is that you have a massive cold start time, right? Um, you have to provision a lot of it properly. You have, you know, memory, you have compute, all of this needs to be planned very well. Um and then you have you know basically it's an external layer which means you sort of have a lot of challenges of handing over things properly and securely. The other approach here is V8 isolates. Um quick show fans. How many of you know about Cloudflare workers? Awesome. So workers are our own runtime layer which is based on V8 isolates. So we took V8 isolates for it and created serverless around it.

Um there's a lot of good detailed blogs that you can read about it. But essentially what this does is it eliminates all the standard problems that we just discussed, right? Like you actually have zero cold start time. It's absolutely lightweight, right? And the way workers work is your dynamic workers, which is essentially what we're talking about when we say V8 isolates will spin up in the exact same location, exact same, you know, setup where your main application is running on a worker, right? And again, you could do each isolate as a one request and throwing it away. Right? So again, workers give us like a great boundary. Make sure that it's scoped just to execute that code. Does not have chances of leaking secrets, you know, getting malicious code into your actual main setup.

And you can decide while initiating a worker what's the kind of scope and capabilities that you want to pass it down. Right? Um again just like a quick way you know why isolates work better and essentially because we own the runtime it just makes it way easier to actually do sorts of you know information exchange making sure it's done in a secure manner and you again don't have crazy insane you know wait times to spin off things. Um yeah that's pretty much it. Thank you so much Thanks for that, Abishek. And up next, we have Tis, who's going to talk to us and do a deep dive about agent harnesses. Is this on? Hello everybody. Good morning. Wow, all of you are asleep. Can we try this again? Hello everybody. That's better. Nice.

Look, look, it's it's it's a it's a dialogue, not a monologue, you know, like I I'm here to talk to you, not at you. Um, good morning. He's just setting up my slides right here. Uh, but this is going to be a fun a fun conversation, I think. Is everything good? No. Oh, he's it's Give it up for your tech team, everybody. That's so cool. They make they make this event possible. I love it. It's uh we would be so lost without them. Excuse me one second. Oh my god, he's spoiling my slides. That's It's all good. Let's go here. There we go. That's me. Okay. Hello. I'm the yellow hand. See, it's way Hi, I'm Tis. Hello, everybody. It's good to It's good to meet you again. Um, as you may have seen, my name is Tis. Uh, that's pronounced like contagious. Don't worry, I'm not. Uh, they wouldn't have let me in the country otherwise.

Uh, I I I I flew 16 hours to be here from Romania where I was yesterday. Uh, and I'm based in Berlin. Uh, and and over the years, I've been I've had the privilege of working at a number of, uh, various tech companies with with really great teams and learning from the best. In fact, I'm not really here to show you uh opinions, but just facts of lessons I've learned, not from myself, but from uh very very smart people, people who are far smarter than me. Today, I'm an AI engineer at IBM uh where we build um a lot of things, foundation models and harnesses and things for our customers, but also for developers. Uh I help the developer community around IBM and otherwise. I I teach people about harnesses and AI and things um here. And today, that's what we're here to talk about. We're here to talk about AI harnesses from first principles.

Um, just as a quick show of hands, how many of you feel confident that you know and can explain AI harnesses, agent harnesses? Okay, there's like three people. Um, good. I'll do the same thing at the end of the talk and I expect uh more hands. Okay, that's the goal here. That's why I'm here. I'm here to teach you about what harness is and how they work and why you need them. Uh, because it's a term that's kind of everywhere. And the problem with terms like this when they're in the zeitgeist is they can get lost in translation. Okay? And sometimes we we don't feel confident enough to reason about them uh strongly. And so hopefully this changes. I'd love to start just by talking about why we even need harnesses. Uh because I think a great leadership principle in general is to to start with why. So why why do we need a harness?

And the answer really is is is the same answer to why we need a harness for anything that we use harnesses for. Uh think about climbing a mountain, right? Like you harness yourself the mountain so that you can go up and down the mountain reliably, you know, meaning you don't fall off and die. Okay. Um similarly like if you have a dog or a pet right you usually put your dog on a leash you give it a harness to you so it doesn't run away and get lost but but it stays with you reliably okay so the the whole point of harnesses for agents or humans or pets or whatever is reliability and the reason for that is because we when we do AI work we we often just trust black boxes have you ever thought of this like unless you're doing inference on premise which any of who's doing inference on print like locally. Yeah.

One him uh and maybe some people one or two people here. If you're the the vast majority of us, what you do is you send a prompt to some vendor with a black box and you say, "Hey, do this for me. " And then you hope for the best, right? Um you you send a prompt to say Claude 4. 7 Opus. Um but if they have some type of incident, they may serve you sonnet and you have no idea of knowing. So you just okay, I guess it's kind of not feeling. Opus doesn't feel the same today. Has anyone had this sensation? Right? That's because you trust some foreign body and and this is why we need harnesses. So what harnesses do is they give you more of a sense of control uh to make your AI apps and agents more reliable. Okay, is that clear? So that's why we do harness engineering. What is a harness? Uh I already talked about it. It's this thing.

Um but assume that's an agent, not not a human. And and that's what a harness is. In fact, agent harnesses in particular are a newer sort of evolution of the term harness. In machine learning engineering, we had eval harnesses. These are basically glorified unit tests for models. Okay. Um but agent harnesses are slightly different. If I ask you to define an agent harness, um this is what I expect to hear. The the answer of what an agent harness is is it is everything around your agent, the tool chain, everything around it, the environment in which your agent executes that gives it the best chance of success and reliability. Everything around the agent. So if we think about some typical agent harnesses in the wild, they all have at least these six components. Number one, they've got um a tool registry. They've got a set of tools.

If we think about a harness like cloud code or codecs, they have tools. read and write from the file system. Search the web, right? Number two, there's a language model. Uh almost every harness will have a language model somewhere like cloud code has the the cloud models. There's context management primitives for compacting context or clearing context. Right? If any of you use cloud code, you're like slash compact. Um there's guardrails. Uh for example, I think the most common guardrail is you've used up your quota. I'm not going to talk to you anymore until you top up, right? That's that's a guardrail. There is um there's an agent loop in the picture uh where this is where the agent finishes a task and then says okay I'm going to am I actually finished or should I do one more pass and finally there's a verify step.

So if you are using an agent harness like like let's say cloud code I love cloud code right at the end of it it will say okay I've done the task now let me run npm runverify or whatever it is to finish out this loop. So almost every agent harness certainly every coding harness coding agent harness has these components if not more. So these are kind of our building blocks at this point. I'm I'm tired of the sound of my own voice and so I'll just do a demo instead of talk to you. And so what we're going to do is we're going to actually build a harness here uh in whatever time we have left live on stage. Um it's a min it's a poor man's harness but it's just to kind of give you an idea of what a harness is so you can go build your own. Okay, that's my job here.

Um, what we're going to do is we're going to build a a browser use agent, something that spins up Chromium and uses it to do a job. Uh, as you can see, it will be unreliable at the beginning. That's kind of the point, but we'll build a harness around it to make it safe. I'll say this, harnesses allow you to do more with less. You could choose a really bad model, a really old GPD 3. 5 mini or 3. 5 Turbo, like old. That's like two years ago. It's crazy. I'm joking. It's a very old model. And it's cheap. It's basically free. So you can use an unreliable model and you can use a prompt that is kind of bad because a harness gives you the reliability. Often times when we don't get the results we want, we think, oh, just prompt it harder. Just fine-tune the system prompt, change the language, add a skill.

With a harness, you don't need any of this. You can keep the prompt frozen. It can be a bad prompt. You can use an old cheap model. If your harness is good, you win like 70% of the battle. Okay, so let's do that. I I'll build a harness. We'll build one together here on stage and then uh we'll wrap up. So this is what I want. I' I'm I'm running I'm just going to run my my agent right here. Uh I've written it in Typescript. Anyone use TypeScript, JavaScript, something? Okay, you'll kind of get it. Uh we'll we'll do npm run agent. And what you'll see is um it's going to open a browser. This is I'm not touching. And it goes to hacker news and it tries to upvote an article, but it gets the login screen and crashes. The job of this agent is to go upvote the first article on hacker news that is not yet upvoted. Okay, is that clear? Yeah.

So that's the job. But here's what it does. I'll run it again. Look. So we open a browser. Um goes to hackernews and we're using GPD3. 2. We're using Oh, goes to hacker news. Hits the login form. But then it answers me. I have upvoted the highest rank. This is a lie. This is an absolute lie. What actually happens is it goes to the tries to click upvote, hits the login form, and then crashes. Right? So this is a total lie. How can we fix it? We'll fix it with the harness. To start with, let's look at the actual code of what's happening. So this is uh cursor. I love cursor. And this is our project. So this is what we have so far. We have the model. Uh we're using a very sorry, I should change this. We're using a we're using a very old model. Uh cheap, basically free. And this is our prompt. Upvote a story on hackernews.

These are not going to change, but our harness will change. I want you to know that. I want you to be very clear on that. So here's what happens. We start a new browser session, and that's code that I wrote. This is using playright not playright MCP but we're just programmatically compos uh controlling the browser with a class. Okay. And then when we have the session we create tools and this does exactly what you think in code. We just return a bunch of tool definitions just like this. It's just a bunch of JSON objects with descriptions and so on. We also create our context. You think this is complex? It's really not. It's just a message envelope with a system prompt and the user's prompt. And the user's prompt is is the thing that we already wrote. It's this thing here. So it's just an array with two objects. Okay.

And then we finally run the agent loop. Now what is the agent loop? Well, it's while true, keep doing stuff, keep pushing messages until you reach the stop condition. So this is the LLM saying I've finished. And in that case, we return the answer to the user. But throughout our entire agent loop, we're just pushing different events. I called this tool. I sent this message. I got this prompt. We're just pushing these into a list. That's all we're doing. If we call tools, then we push each tools result into our messages collection. Does this make sense? We just keep track of every message. Okay, so that's it. And as we as our agent exists today, it doesn't work. It hits the login screen and crashes. So what we need to do is build a harness. We need to build guard rails first. Then we need to actually make it tell the truth.

Hey, I crashed at the login screen instead of I've successfully done this. And then we need to actually fix it. That's the journey we're going on. Okay. So step one, we add some guardrails because right now it can execute infinitely and bankrupt me. So how are we going to do that? Well, let's in investigate this git diff. So we right now just call run loop and we pass a model and messages, but we're going to change this to include some guardrails. We'll call them default guardrails. In fact, what are our default guardrails? Well, let's go to the editor and check it out. So we have this file guardrails. ts. And these are our guardrails. We have two max iterations. How many me how many times can you try and max messages? How many messages until we compact your context? And then we have a little helper to combine them.

Okay, but how do we actually use this? Well, if we go to our agent loop, you can see that we include the guardrail here and we check we call the guardrail and if it's not okay, we just end. We say this is why we stopped and we trim context here on every message. So while true at each iteration we call um trim context. What does trim context do? It's this is actually really bad. Don't do this in practice. But what we're doing is we keep the system prompt and the user prompt and the most recent two messages after that. There's more intelligent ways to do this. That's not the purpose of this talk. The purpose of this talk is to show you a guardrail as we build a harness. So now we have our agent our agent and we have a few guardrails. You know what that's called? It's called a harness.

So, what we're going to do is we're going to just rename things to keep them a bit more truthy. So, what I'll do is I'll go over here and I'll say, look, we just have index, but we're going to delete all our code and just abstract it under a function called run harness. And we're going to take all this all the stuff in red and we're going to move it into a new file called harness. ts. Okay. And what is harness. ts? Well, let's open it. Harness. ts is everything. You may recognize this code from the beginning. It's everything from our index. ts. ts. We just put it inside a function called harness. ts. Does this make sense? So, we just take it and we call it uh run harness and print harness result just console logs things. It's just for logging. This is not really that useful. So, we've just moved code at this point.

But now that we have run harness, our next step is to okay, now that we have a harness and we have a browser session that is not controlled by the agent, but by the harness, we can hook into this browser session when we need to to detect did you succeed or did you fail. Okay, that's what we're going to do now. So now that we have this harness file, we'll come over here and this is uh this is what we're going to change. So we are just changing our run harness function call a little bit to add a third argument which is some options a verify step and max attempts. Okay, verify successful upvote. If we go to our harness, this is getting a little bit interesting. Now these are just types but here we have max attempts. we say you run the harness no more than three times. And so for each attempt, we do a little bit of a verification step.

If it failed um or if it reached max attempts, we just return the latest result. But we have this function in our harness now called verify successful upvote. What does it do? Remember in our agent loop, we keep pushing events to a big list, right? So what our harness does is it checks the list. If you have a browser click and if you clicked on an element with up something something then it means you clicked on the up arrow. That's what our harness is validating. So if that's true then return true. I upvote click confirmed. But if you see a tool named harness auto login and the result is harness failed to handle login then we say no no you failed the login and we return a false result. Does this make sense so far? It's just code. Okay. Finally, we also have this variable called unreovered login redirect which we check all the tool calls.

Ah, okay, I went to the browser here and and this was the result. We check all the tool calls and if we see a tool where the name is not harness auto login but if we're on the login URL, what does that mean? That means we went to the login page but the auto login didn't work. Then we fail and we say return past false login screen instead of completing the upload. Finally, we need a success case also. Um, but that's coming. So, we just added a few like if this then say we failed, okay, to our harness. This is our harness. This is not our agent loop. So, now let's run that and see what happens. So, I will run this here. Um, and so now it's opening the browser. We're going to hacker News and uh we go to the login page. It crashes, but what's the output? We we get it to actually tell the truth.

We hit the login screen instead of completing the upvote and and it says fail. This is what should have happened. Now let's quick checkpoint. We did not change the prompt. We did not prompt it harder and we're still using an old model. Okay. But the harness is now giving us some truth. Let's fix this. We're about to finish. Let's fix this with actually now that we know that it's getting stuck at login. We can fix this at the harness level. Okay. So let's do that and then we'll wrap up. So what is what's the final form? We add a file. We call it login handler. And what does this function actually do? It's just a function. But here's what it does. This is the line that's important. Um if we're not on the login page, don't do anything. So this function is a no. Unless we're on the login page.

If we are on the login page, we fill a username and password into the input because the browser session is owned by the harness. It's not owned by the agent. Does this make sense? So it's not tool calls driving the browser. It's my harness that I wrote. Okay. So I inject this username and password and then I return a message. The tool name is harness auto login. And the result is the harness automatically logged in. And this is basically to the agent. You are now authenticated and back at home. So my harness injects this into the chain of messages. Does this make sense? So I'm logging in now at the harness layer. Okay. But this is just a function. Where do I use it? Um I use it actually in the harness.

So I create the login handler and in create tools I just add a few guard rails here but I'm taking the login handler and giving it to my agent loop run loop and in the agent loop this is where we we land the plane. I send the login handler and this is the code that makes it work. So inside the agent loop I say if I have a login handler then I just await its response because again if I'm not on the login page this is going to return nothing. If I am on the login page and if I receive a login event, then inside my agent loop, I push it to the list of messages. Does this make sense? And so if the harness successfully logs in, it adds a message. I've logged in and the agent reads this and then continues. Does that make sense? This is the whole point of a harness. So let's run this and then we'll wrap up.

So um we should now be running the latest version. And so what I'll do is npm run agent and it should work by the harness. So we log in to HackerNoose. Um it it typed the username and password and indeed you can see that it lo it it did that was way too fast. It successfully upvoted this upvote. Click confirmed by logging in with the harness very rapidly. Does this make sense? We did not prompt it harder and we used GPT3. 5 Turbo but we got more control with the harness. Uh let's uh wrap up here. What does this mean? This means you can do a lot more with a lot less with a harness. And again, the harness is the is the environment around your agent that increases its chance for success and reliability. What does this look like in practice? Um, I work at IBM and we work on harnesses daily.

Uh, at IBM, we create an enterprise ready open-source rag harness. Uh, because as you may know, enterprise data is big and it's everywhere. there's all these teams calls and like notes and you don't know what's confidential and what's not and it's very risky and and so we we create an open- source enterprise harness for um large companies. It's called open rag and again it's open source. That's the important part. Uh and if you're interested in it, you you're more than welcome to scan that. I'm not here to sell that. I just think it's it's a nice reference implementation uh for harness. Uh but let's land the plane and cast some vision. Okay, in summary, what did we what did we do? Look, I started this talk asking you how many of you feel confident that you'll be able to explain what a harness is and why it exists and so on.

Is that number changed at all after this talk? Yes. Oh, that's a lot. That's like almost the entire room. Okay, I've done my job. Um, that's what harnesses are. That's how you build them and that's how you do more with less. You don't change your prompt. You don't change your model. What might the future look like? Well, we just hardcoded a harness. We wrote that ourselves. But I would be foolish to think, oh wait, but wouldn't it be amazing if harnesses were dynamic and if agents could create their own harnesses and then do work? I think this is the dynamic harnesses are likely the next step towards AGI where this can all be managed by an agent. But with that, um, I want to land the plane here.

I I've already maybe taken a little more time than I deserve, but I want to stop here and just say thank you so much for your time and attention, Singapore. Thank you so much, Tis, and thank you all. I see the rooms filled up. Uh we're going to have our first break. Um the next talk starts at 10:17. Uh just a reminder that booths are open as well at the expo in case you'd like to walk around and uh stretch your legs. Thanks everyone. See you back here in a bit. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Um, up next we have JJ Gwax joining us from Google where he's director of applied AI based right here in Singapore and he's going to be talking to us about bringing models into production. Is this gonna show up here? >> No. Yeah, there you go. Okay, cool. Hi. Uh, I'm JJ. Uh, I'm a, uh, engineering director at at DeepMind. Um, and so I lead the applied AI team there. Um, I'm based here in Singapore. Um, I am hiring, so if people are curious about, um, working there, um, definitely reach out.

Um, so I'm going to talk a little bit today about moving from uh, hackathon kind of things to production, which is sort of what my team does. Um, and dealing with models at scale. Um, so before we get into that, I kind of wanted to share a little bit about what my team does. And I see at least one of them here. Hopefully the others are as well. Um, so what we try to do is we push the technical boundaries of the deep mind models. Um this means the ones that I think most of us are familiar with um Gemini and and Gemma which is our openw weight um text model. Uh but it also includes the nanobano and vio uh video and image models as well as the more sciency things. So that's the alpha genome and uh weather next. Weather next predicts weather and hurricanes and large scale um storms and things like that.

So our job is to try to make the models do what they weren't necessarily designed to do or blow past the limits that we might have set on them. So um a good example with VO is it generates 8 seconds worth of video, right? So you give it a prompt and you get 8 seconds of video out. Um what happens if you wanted to generate like a whole scene from a movie, like five minutes worth? Uh how do you do that? Our team tries to do those sorts of things. or with Nana Banana. Let's say that you have a movie and you want to outpaint the whole thing um to make it like widescreen, for example. Um that's kind of an an example of what we might do. Uh these things sound kind of easy because they're just more of the same, but it's actually a much more challenging problem and uh we have to come up with clever ways of getting around it.

Um so uh what we ultimately try to do here is make the models do real things. So, it's nice to have 8 seconds of video, but that's kind of a fun hackathon project. Um, it's not really a real thing. You can't sell that to a movie studio. Um, I can't be like, "Look, here's your 8 seconds of of movie. " You need to kind of do more than that. It's also making the model sort of adhere to what your guidelines might be. Um, describing a movie in text is actually really challenging to get it right and then you end up with this giant prompt and it's very fragile and it breaks. figuring out how to anchor it off of key frames and understand animation and you know behave the way an animator or a director wants it to is actually a really surprisingly challenging problem. Um so we try to do all of that.

Um now I I want to pause for a second because I was just saying how like oh the models aren't good enough. They only generate 8 seconds of video. I I kind of want to pause and just I need to say this AI stuff is amazing. Like it is completely crazy. I I I don't know if you guys remember, but like a few years ago, like chat GBT didn't exist and our lives were totally different. Um, and there seems to be this world of like the models are incredible and they're still at the same time like not enough. They don't do real things, you know, my whole job. Um, but like there's always been this moving goalpost thing like with chess, right? I don't know if you guys remember when like the whole Deep Blue thing happened.

I was a kid so I wasn't really paying attention, but we like computers beat somebody at chess and then everyone was like, "Oh, that's amazing. " also, oh, it's just chess. Um, and then go was was 10 years ago. Uh, Demis just went to Korea to celebrate 10 years of like solving Go. And everyone was like, oh, that'll never happen. I remember I was working at Google at the time and everyone was like, is this going to work? Like, is it going to win? I I don't know. And then it then it did most of the way. And now everyone's like, oh, it's just go like gh. Um, and then chat GBT came around and it was incredible. I remember showing my wife that she could just ask for, you know, things and it would answer her and like turn it into a table and all kinds of crazy stuff. Like incredible. And now we're like, ah, chat GBT old news.

It's just a chatbot. And and now we're at this sort of weird phase now where like we have agents and they do stuff like they call and make restaurant reservations using like 11 Labs and Open Claw and they're accidentally deleting all our emails and you know, crazy things like this. And it's like we're still mad that the agent doesn't follow our instructions, right? like just how spoiled we've gotten. Um, does anybody remember when we got Wi-Fi on airplanes? Like, and that was incredible. And now it's like, uh, it doesn't have Wi-Fi. Like, uh, and now now there's robots and robots are like doing factory jobs and we're like, gh, but it won't even do my laundry. And it's just And I actually saw a video of a robot uh making a bed and taking out the trash. And so maybe soon this bullet point will go away.

So, I need to say like my job is to make models do real things, but like let's let's be honest with ourselves that models are incredible. Like shockingly incredible. So, I would argue that this this idea of moving goalposts has been around for a long time. And it's not necessarily a bad thing, but it is a little misleading because, you know, it keeps pushing us forward, but at the same time, we kind of forget where we've gotten to um and how amazing all of this is. Um, and so this brings me to an important point, which is everything's been going incredibly fast. Just so fast, right? Three years ago, no chat GBT. Now we have three different very popular agent frameworks and crazy video generators and it's it's incredible. We can't tell what's real on the internet anymore.

Um, but for people like me, we in and businesses, you need to take like a snapshot of where you are and hit the pause button and you're basically stuck in time so that you can build something real. you can't just keep riding the train. Like you have to get off and build something. And so that's sort of what I what I'm doing here. Um I also want to say uh there's a bunch of different categories of using AI and I use it in a couple different ways. I'm going to focus mostly on the third one here, this inapp thing, right? So we all use how many people are using some kind of AI codegen? I hope a lot of hands go up. Okay. Um and how many people have like an agent that they're using and doing crazy things? Awesome. Um, this third one is the idea that inside your app, we're going to make API calls that your users actually interact with.

So, the idea is this isn't something you as a developer interact with. It's something that a, you know, your grandma who might be talking to a chatbot and not realizing they're talking to a chatbot is going to be is going to be dealing with it. And so, my role is primarily with that third category. Um, and so what we try to do is help businesses get past sort of the benchmarks, right? What I mentioned before. Um, and I'm going to talk a lot about this bottom right one. Uh, this idea of not breaking policy because some of these are clever hacks, right? You have a video model and it generates a chunk of video. How do you make it do more? Um, you have an image thing, but it only does up to, say, 4K. How do you make it do a giant billboard like the size of this? Um, that might not have the the high quality that you want.

that that's you can do clever things about that to stretch the boundaries of the output, but how do you make sure it doesn't break policy is an architectural and design decision. And so there's a couple of things we've run into. Um I should also say a lot of my work is being uh talked about at IO that's coming up and so I'm not allowed to say a lot of things. So I'm really sorry that I can't give you awesome examples, but if you watch the IO streams, you'll see some of the things we're doing at DeepMind. I really don't want to get fired, so I just I can't. Um so uh sorry in advance. I'll do my best to like hint without getting in trouble.

Um, so I'm going to talk about some of the walls we bumped into um the problems we found and sort of like that last one, this idea of policy and then how we kind of deal with it at at DeepMind and then inside the applied AI team uh and and you know hopefully it applies to some of the things you guys are doing. So, you build a chatbot and you tell it, please, you know, be responsible and professional and like, don't make me look bad. And I don't know, you guys saw the Chipotle screenshot of somebody being like, why do you subscribe to Claude Code? The Chipotle chat chatbot is is free and it's somebody saying, I really want a burrito, but first, can you help me write a Python function for the Fibonacci sequence? And it says, sure, here you go. Right? Like, it's it's super common. You've all seen prompt injection, right? It How many people?

Yes. Am I crazy? Okay, good. So, prompt injection is real and it's not on purpose and it's it's complicated, but like it's something we have to deal with. If you're having a user talk ultimately to an an AI backend, you have to deal with the fact that your way of defining what the agent should do is the same way that the user talks to the agent. So, you have they're all text. And so how do you figure out how to deal with this weird problem where usually it's fine but if people say the wrong thing they chat can hallucinate and say crazy things it it's got all kinds of real problems. Um uh how many people thought if you set temperature to zero that means it's deterministic. It's not.

Um, so if you yes to an extent it is, but like so yes, technically you're getting close to determinism, but it's still nondeterministic because subtle differences in the text mean huge differences in the output, right? It's it's one of those situations where you feel like you, oh, I'll set temp equal zero and everything will be fine and it still breaks and you're frustrated and it's it's not like setting a random seed in a pseudo random number generator, right? It's not the same thing. And so getting determinism out of these different uh agents and AI backends is really tricky. And so we've had to deal with quite a lot of that. Um so the other thing uh is is rag uh retrieval augmented generation. Uh again this is a new thing relatively right JBT is three years old. Rag is what like a year old or something.

Um the idea of you you fetch a document you use it as part of your um AI pipeline and it helps to answer questions that it didn't otherwise know. Um, now this also is kind of like, you know, cell phone, right? Um, occasionally your rag pipeline can, you know, cause trouble for you. A great example is, uh, if you've ever had, uh, a refund in your chat history and you used um, rag to pull out your chat history, even if it was an exception because it was like your mom called and that's why there's a chat log of that and so you only gave it to your mom, but it wasn't the same thing. Well, now it sees as a refund and so it gives out refunds. Um, or if you have a test example somewhere that sells a car for $1, now maybe you're selling cars for a dollar.

Um, these are really dangerous things and they it seems crazy when I say it now, like of course you shouldn't sell a car for a dollar, but like it's absolutely possible because to the agent the rationality is not necessarily there, right? We're kind of expecting it to be, but it's not. Um, our agents in a lot of ways are like really really silly interns that, you know, just got hired and they're like trying to do a good job, but they don't really know what they're supposed to be doing. Um, so those three things are some of the big ones we've seen. There's more. Um, I'm not going to purport to be able to tell you everything about building with AI. I'm just going to kind of focus on these three. Um, but the bottom line with these three that's worth mentioning is the model is being asked to do just a little bit too much.

Um, models are amazing. I just showed like we just talked about how incredible AI is, but when you try to ask it to do crazy things like slashgo give a talk on AI like it's not necessarily going to do a great job at that like you you have to guide it more because um part of it is the model is not as amazing as we'd hope um because our expectations keep going up. Um but also it's because alignment is hard. taking what's in my brain and what I want and turning it into words or code or images or video. It's not a straightforward problem. It's it's actually really challenging to figure out how to get what we want out of AI because sometimes we don't know that it's not what we want until we see that it gave me something that I didn't want. And and this keeps happening all the time. And when you're dealing with customers, it happens at scale.

So this also is an interesting point here. Like the the big underlying problem is with a hackathon, everything works. it's just fine, right? But when you get to production, it doesn't. Things, you know, the edge cases are all over the place. So, what we try to do is stop using the language model as one big single router. The whole idea is when you try and throw everything into a system prompt, um, it doesn't work, but that doesn't mean it can't solve each individual problem if you break it down. We just saw a couple of talks earlier today where, you know, they enter plan mode, they make a to-do list, they guide the to-do list by telling it, "Hey, look, if you try to call finish without having completed the to-do list, it throws an error, an actual error. " These are the types of things we see.

And so I'm not sure if what I'm saying is entirely new to this group. Um but I want to echo it because it is important. So what we try to do is is surround things with determinism. Um figure out how to make things actually work by breaking down a big non-determinist pieces. So um what you can do is think of each route as individual pieces, but this transform block sort of in the middle. Do I have a pointer? I wonder if this works. Yeah, kind of you can see it. This sort of layer of the transform block is where you start using AI. Everything else is AI but in a much smaller layer, right? You're taking random input and turning it into JSON, a structure that you know and understand. Pantic AI is amazing for this. There's agent frameworks out there that are quite good as well. ADK, Agno, there's a lot that are all fantastic.

Routing can be an LLM as well, right? deciding what kind of action you're supposed to take. That is a decision that can be be made by a language model call. But again, that's just a route. It's deciding given this input, does the customer want a refund? Are they trying to say I did a great job or are they trying to cancel their their service? Like whatever it might be. The routing can be decided there and then you coers it into something that makes sense. Then transforming you stick to JSON to JSON, right? If you decide that you're trying to do a task, you might say, "Okay, I want to take something that is structured and I understand it and transform it into something else that's structured and I understand it. " And then lastly, you can generate output text that again is what language models are great at.

Um, and it spits out something that's human, not just JSON back to your grandma, right? It's it's something you can see. And then lastly, we can do safety checks. Um, I think uh I know Cloudflare does this and a bunch of others as well. you can use smaller uh more targeted models to just check whether something is safe or not to send back. Um so language model picks a route and decides instead of doing the let me plan you give it a multiple choice question right that's that's the whole idea that language model is effectively acting like a classifier at that point it's deciding what is the user trying to do based on the conversation so far and shoving it into this is what I need to figure out in order to do that. So instead of letting plan mode and reasoning do it which are amazing but at production I don't think they're really ready.

Um you use it uh you can course this into a multiple choice uh question. Um so like I mentioned before right this is take data turn it into something we can work with deterministically transform it again from one deterministic input to another deterministic output and then generate the actual response whether that's audio video image or text um using that structured deterministic uh um transformed output. Um, and then lastly, this idea of of safety, I just want to harp on a little bit because no customer is going to be happy if your response says something offensive. Um, but running a language model through it still has the same prompt injection problems. So, you have a couple options. You can use a contextfree language model call. Here's what I'm about to send to the user. Is this okay? I am a, you know, car insurance company.

You know, insert whatever here. That it's pretty good at that. And there's no prompt injection option for that. And then lastly is a ML classifier. You can use a smaller, more targeted model to decide what to do. Um what's interesting is this same pattern actually applies to um images and video. So one of the things I'm not going to talk about today is project we were working on that that deals with uh live image feed from your camera and figures out how to classify it and understand it and provide feedback and things like that. Um it's not really text, right? It's video input and then audio output, for example, like an agent. Um we're using two different models to do that, right? There's some that are on the the actual phone that are sort of dumb models, but they're really fast. They can handle 50 frames a second.

They can respond within, you know, 50 milliseconds. They can tell you, look, given this image, here's sort of the depth perception and, you know, oh, you know, this is a stool in front of you or there's an obstacle in front of you. Compared to Gemini, which is great, and it can tell you exactly what's going on from an image, but it takes a while. You have network latency, right? it actually takes time to get time to first token is certainly longer than 50 milliseconds. Um, and so there's a difference between these two and so you have to use them in conjunction with one another. It's not as simple as just sort of I'll throw everything at the model because the models just aren't there yet no matter how amazing they are. They're just not there yet.

And so we have to do is piece things together using different tools for what's good for different jobs. And in this case we need super high latency, right? And there's we can decompose the problem ourselves instead of having the AI just magically do it for us. So we split into sort of key frames uh and recognition using a smart big gigantic but potentially a little bit slower model. Um and then using something that's not as smart but it does have low latency and it does handle tons of frames per second. We don't have to choose a key frame. We just send the whole stream in. Right? Problem solved. Um and so by doing this you can get the best of both worlds. your semantic understanding as well as your real-time sort of un safety and obstacle detection for example. Um so just wanted to finish this out, right?

Um LLMs are great for a lot of things. They're like incredible like truly truly incredible. Um but we have to use things for what they're good at. So I want to use language models for all the hard stuff, right? I want to use determinism for the stuff that really matters that I can't compromise on. that non-deterministic output would be a disaster. Um, you know, I like to joke we can't just tell our customers, don't worry, I added don't break any laws to the prompt. Like, that's not an acceptable answer. Like, that just doesn't work. Um, it's great and I wish it would. Um, but if it did, my whole team, we wouldn't exist and we'd all be fired and that'd be the end of that. So, I'm kind of glad a little bit that it does.

Um, but it's also useful to if you take this strategy and tell Claude or or Gemini Coder or uh you know um GBT codeex like just say go build this using these ideas it'll do it right. So we can still use AI for crazy things at the development stage but in real life I think we need to use the models for a little more of what they're actually good at in different places. Um now I didn't talk about a whole lot of things. Um there's a lot more um that we think about and we work with. So um I didn't mention fine-tuning at all, right? Um how many people have done fine-tuning before? I always want to pull the audience to this. Okay, not a lot. You should try it. It's great. Um but we don't do it all the time. We do it when it makes sense.

Um and that's one of the examples of a smaller, more targeted model of doing like safety classification or stylistic approaches of how you want to structure your output. Um fine-tuning is amazing, right? It's just you have to use it in the right places. You wouldn't just try to fine-tune some gigantic model for everything if you have bad data and you don't know what you're targeting for. Um the other thing is eval um anybody used to do TDD like where you Yeah, I sometimes tell my model to do TDD, but eval are effectively if you do them first, you're kind of doing like AI evals for TDD. Um it works, right? But it's sometimes hard to do. Um you need golden data sets, you need things like that.

So, it's I I just want to leave you with there's a lot more to do, but those three things are the ones that we bump into all the time, and there are ways to get around it by using models in the ways that they're meant to for the things that they're good at. Um, so I I mentioned before like AI models are incredible, but you have to get off the train at some point. You can't just keep riding it forever if you want to build stuff. So, I think that the key takeaway here is you can't wait for the perfect model. I don't think it'll be here anytime soon. We have quite a long way to go. um they're good enough now. You can build some amazing stuff and just try to determine uh make things deterministic as much as possible. So yeah, that's all. Thanks. All right, thank you so much JJ.

All right, next up uh we have someone to especially to welcome to stage Jeff Huntley. This is actually his second time uh speaking in Singapore. Uh he came last year as well. We were completely blown away by what he was sharing and decided to have him come back. Um, for those who were there at the party that was here last night, uh, he actually came on for a couple of sets and DJed as well. Uh, so who is Jeff Huntley? He is an independent AI researcher known for doing unhinged things with AI. So he is actually the person behind the Ralph loop which is now incorporated in many, many tools that are used today. And so he's going to be giving a talk about how everything is a factory. Hello everyone. Um, I'm here today as I must say, as confident as I might say and seem about these topics, this is quite a provocative title. Um, I don't know.

So, when you're listening to this, I want you to reflect upon this. Maybe I'm right, maybe I'm wrong. So, it's a provocative title because it's everything is I'm saying that software development now costs less than minimum wage. Like there was a time if you wanted to do photography, you had to buy specialized tools, etc. to do photography. But now, everyone's kind of got an iPhone and everyone's now a photographer. Think about that. Things have changed. With that disclaimer instead, I do not work for anyone. I am completely independent. I do not represent anyone. So this is going to get spicy. Let's do it animal style. Okay. So it's been roughly about a year and a half now um since I published the technique of uh allocating memory in a particular way. And if you wrap the tool calls around another loop, it's just a loop.

But there's more there's a lot of science into the context engineering to actually achieve these outcomes and it's quite disruptive and um here I was over at giving this talk uh talking about how everything has changed and uh this is a week before Alassian did their layoffs. Oops. And uh see the unit economics of business have forever changed. I want you to really understand how much this is. If you do not believe this is true, you need to stop speaking with other developers. You need to speak with founders. You need to speak with business leaders. You need to actually get a little bit more curious on here how and what this means. You see what does it mean when everyone is a software developer? Like here for no particular reason at all, there's like at the same meetup was cursor.

I'm not maxing cursor in any way, but I want to call something out here at this meetup. Here's Roslin. And there was other people like Roslin. They're designers. They're product managers. And they're having the time of their damn lives. There wasn't any software engineers up there giving really talks. You see, because they're being enabled to be a software developer now. For the first time ever, it's like an iPhone in their hands. They can just get stuff done. They can take photos. They can develop software. whatever is in their wildest dreams they can do. So, I've been traveling for the last uh 3 months around the world. I think I've given this talk 17 times now in different cities. And uh one of the cities I dropped into was Oakland. And in Oakland, I decided to do a side quest to Lord of the Rings, Hobbiton.

And my tour guide operator was like, "Jeff, what do you do? " And I'm like, "I do AI. Please don't judge me. " and and next thing you know his eyes light up and he goes Jeff like how good is AI? How good is AI? What does it mean when your tool gut operator is token maxing? You see everyone is now a software developer because AI has enabled everyone to be a software developer and society has been designed around a scarcity of knowledge. Used to charge a lot of money because knowledge was scarce. This is how we structured our societies. This has changed folks because we're now going to a knowledge abundance economy. What does it mean if you want to be a principal software engineer?

You probably know things about uh deterministic system testing and property based testing and test generators and all these advanced things and formal methods and proofs. What does it mean when that is just like wrapped up into a skill file? Um and it's not just about software engineering, it's about accounting, it's about it's about law, it's about all white collar where essentially it was based around the idea of a scarcity of knowledge. This is a transformative effect effect to society. So, if you rewind time to about two years ago, um this is me like November uh 2024. I first said, "Oh, fuck. " I published a blog post to say everything's got to change. I'll dig into this a little bit more further. And I was saying the ID was dead. And people calling me crazy for saying the ID was dead.

But yeah, I mean, not many people here, at least in this room in Singapore, are using the ID dayto day. They do some form of headless agents or async. You're probably cooking on something on your phone right now. So the models back then were already good enough to cause societal disruption, but it required a lot of skill to get the outcomes from them. A lot of skill. They're like wild horses or wild stallions. You had to like tame them before they got good. And you probably recognize this moment in time. And this was the second This is when the models actually got good and required no skill to really tame as a harness engineer to get good outcomes to it. There's something interesting about here.

No matter how good AI gets, it is in lock step to the the about the downtime that society has to be able to understand that things have got better. So it doesn't matter if the models keep getting better and better and better. The reason there was like a oh crap moment in in December, it was like people had time off. They had Slack. They had play. They had the ability to play with this stuff and understand it actually had got better. So you're going to see product releases of like the system shock in society is my hypo. It's is going to be in lock step with downtime in society. School holiday periods, Christmas breaks, all the rest holidays. You see, because the people around me who have been getting really good in AI in the last two and a half years, they've been treating AI not as a calculator.

They've been treating it as a musical instrument. See, musos don't just like use a guitar and go, "Oh, it's crap. " And they throw it away and think it's good. They recognize it's a skill issue. They recognize it skills, bro. So, it's really important to actually just do things and be curious and learn and deliberate intentional practice. This has been the key for me is it just it's like no way this can work. No, it's not real. It's not real. Let's do some things. Let's do some unhinged things. Let's make some discoveries. And it's through that deliberate intentional practice you get good. And it's kind of weird right now because society is like all corporates are pushing these guitars down on the world and it's like please play the guitar but not everyone's going to be musically inclined.

You see, I think there's now uh essentially two classes of companies now. Like you've got your brand new startups that are coming out right now who like the hell yeah, I'm going to do AI native workflows and I'm going to have the time of my life and I'm not going to hire a lot of people and they're leaning into workflows and really changing things around. They're not they're not thinking that they can get on AI by by selecting a particular model. They're experimenting and they're trying and they design their code bases and their processes around being able to exploit the heck out of this new substrate. Meanwhile, you got every single company out there today um which is uh I've given this talk and there's people saying, "Oh, AI is just a tool. Uh AI is banned at my company. " I'm like, "Oh god, you should quit that company.

" Um and uh everyone in the bottom half there is going to go through what's called a J curve. All people transformation has to go through a J curve like people transition etc. This will take three or four years to do. You can't do it too fast because you'll break people. Meanwhile, people up the top there are going to be if you believe in the notion of disruptive innovation clay and in Christen they're going to be lean apex predators just going hell yeah your margin is my opportunity and as the models get good then they they can actually execute faster with less so you've probably seen this block lays off half it staff etc. I want you to think about this for a little bit. I think Jack is actually right with this statement, but I don't think AI is actually priced into software stocks right now. Right?

Previously, when we're pricing software stocks, it was based on a multiple on a growth multiple. We're seeing that disappear now. But I actually do think a lot of companies are going to need to rethink about their organizational structure. I want you to think about Spotify. Who here has done agile and has been forced to watch the Spotify agile video on how Spotify does agile and they got the guilds and the tribes and the squads and all that stuff. Took two videos and everyone just started cargo culting this crap everywhere. It's going to take one Mad Lad or a couple different Mad Lads. So, we got Toby and Jack having some fun right now and they're experimenting to find out what the right thing is and they will publish a case study. And when that case study is done, it's going to be copied by everyone.

So for the last couple of months, been traveling around and I've been uh posing the following question. I've been speaking with venture capitalists and uh the question that's on top of everyone's mind is why does someone need to raise seed capital now? Like typically you'd raise money because you want to hire people to build it. N bro, just build it. Like it's fundamentally different. Like why do you need to raise capital if it's going to be this fiveman show? Like if someone cracks the AI operating system that we've been talking about the last couple days and people experimenting and this is going to be the year we figure out whether that's true or not. Like what's the point of investment? Come see me. I got some nuances to this but the experience of time I can't get into the particulars here.

Software is still investable but it's very different now. And this is the question on every LP's mind and they're putting pressures on the GPS at VC firms. Is it still investable? So no particular reason at all. I'm going to pick uh one enterprise company SAP. They have uh 6,800 people according to LinkedIn doing expense management software. That's a lot of people. This is representative of like a J curve people transformation program of like getting to use AI etc. How much time do they have compared to the lean apex like 50 person leveraging AI and they got 6,800 people and they're like please pick up the guitar, please pick up the tar, please get good at this stuff. They were built with this organization chart.

Every company was built with this organization chart and uh we we we basically just hired people and we had meetings and committees and all these things and the builders were very far and few between. I want you to think very carefully. How long does it take to transform those 6,800 people and how much time do the incumbents have if this is cracked? The idea of an AI operating system and and enable these lean apex predators to get into business. More importantly, why would you transform or more? This is the quiet thing that's been set been discussed. If you don't believe me, go speak with leadership. We all know smaller teams get better outcomes. Smaller teams, better outcomes, less coordination, less overhead. Here's a uh a quote from a founder in New Zealand. They've stopped backfilling. Companies around the world right now.

They're not necessarily doing layoffs. They just stopped backfilling. We're smaller, but we effectively cut two/3 by telling our board that we would not backfill. Notice the date. That's three years ago, folks. Like there are people who have been early. If you're thinking about these types of topics and leadership, um I'm not advocating that you should do these things, but like there are people ahead. It was the best decision because we got rid of all the people who are detracting and it was sick of hearing about AI. The sick of hearing about AI. We're 20 people now, down from 60 and uh we're getting more velocity than ever before. And this is going to be really hard because AI is pushed down onto the world by a lot of pe by Silicon Valley. It's non-conensually onto society. And uh I want you to think about this.

There are a lot of people here who have uh built their identity as uh like a leader of people or a manager of people and all the rest. AI is erases all this stuff. Like if this problem statement gets cracked, then this is what we're literally looking at. We're looking at people with high agency and curiosity just building things. We don't know yet. I'm not advocating we do 52 pick up and throw a deck of cards in the air and do this, but this is what's on people's minds right now. This is where we are. And this concerns me deeply because software engineers trade time and skill for money. Right? If a company's having issues with AI, that's a company issue, not your own. If you work for a company that's banned AI, you need to get out of that company. Honestly, straight now. Put your family unit first.

You see, because uh this was me back in 2024. That was I was great. The tech lead of AI over at Camber and was like, "AI is not good enough. Prove it to me. it's not hype and I start playing with it. I'm like, everything's changed. So, I saw no point other than just to completely lean into it. And then you then you now have in 2026, two years on, you got two personas. Those who are consuming AI, whichever way, and you got people who actually understand how AI works under the hood. I want you to look very carefully. There's now a line there. I don't hire anyone left the line anymore. If you're figuring out who you should interview and how you're going to do your interviews, it's really simple, folks. You don't hire on the left of the line anymore. It's a curiosity test. And way too many engineers are failing. And it's so sad.

You see, if I was to ask you what a primary key is or to traverse the graph, you're like, "Come on, dude. Like, you're testing me. " But why is it in 2026 people can't actually explain what this is? I pull out a whiteboard, they couldn't explain what a tool call is. They couldn't actually show me a sequence diagram of inferencing. They can't get really deep. They can't talk about the differences in the model cards between the different vendors. What is the temperature? Why can't they answer this stuff? So, if you're trying to figure out who to hire, it's quite literally people who have been curious. You should be testing for this. Sweet. Because it's really sad because LLM's and AI is just literally a wild loop and Ralph is a wild loop on a wild loop. Wow. Scary. the big boogeyman that's going to cause everything to go over.

So, it's going to be really interesting to see how this all plays out, folks. See, for a lot of people, they haven't realized that AI uh they're expecting to knock on their doorstep and to be pronounced, but really what's happening is kind of borrowing under society, under the houses. Now, closing ponderos really quickly because I'm over time. removing waste from your organization and processes better than AI itself accelerator than AI itself. You're trying to figure out how you hire engineering manager. The question is simple. What have you changed in your systems and processes to because AI has broken it, right? Are you doing agile anymore, not agile anymore? Well, how have you changed things? This is what you look for. You look for an engineering manager who has been thinking in this problem space.

An engineer who can build an agent, an engineering manager who's changed things around in the organization structure to achieve these things. Ideas are now uh execution. I mean like you literally can just take a screenshot of a SAS feature, rip a fart into your coding agent and you get that SAS feature. Like the old idea that ideas uh nothing execution is everything has been averted. It's going to be really hard for people. This is actually a psychological distress function. People going through the five stages of grieving. Um but the question on everyone's mind is how long do we uh give people to get through this motions of crisis and what can we do? If you're a software engineer and you haven't built your own agent on my GitHub, there's a free workshop. It's 300 lines of code.

Build your own cursor, co-pilot, codecs, and like learn the fundamentals. Be a curious person who doesn't switch engines in a car. Be be the curious person who rebuilds an engine and knows what a piston is, what a carburetor is. Get get into the details. You're not a senior engineer unless you know these details. Thank you. All right, thank you so much, Jeff. All right, a quick announcement um before I introduce the next speaker. Um the expo in Pullman as well as Kimpinsky uh has been open since 10:00 a. m. Um there you can find uh different things to look at like a robot playground as well as a robot display from Nabius in both places. And you'll also be able to talk with some folks that um you heard from this morning like Arise, Google DeepMind, as well as Cloudflare.

All right, to kick off this next section, um I'm sure many of you have built things like personal agents, used heard of Open Claw. So I'm really excited that this is the first speaker who's going to be opening up this section. Uh this is Vincent and he is a chief architect at the OpenClaw Foundation and he is going to be talking about the state of OpenClaw. Amazing. Thank you everyone. Welcome Singapore. Lovely to be here. Uh I've presented many times in Singapore. Fun fact, I actually gave classes at N US for a few months as well. So good stuff. So as Sher said, I'm Vincent. uh currently chief architect at openclaw foundation uh information as of today. So the foundation is definitely alive. I'm going to talk about the postclaw era. I'm also going to talk a little bit about what we shipped and what's coming next. A little bit about me.

Uh I call myself Vincent uh the friendly clanker. So if you've ever seen me present or give a talk, I use this image to describe technology in like one image. Uh this was VR goggles. I received like many moons ago before even anyone knew what VR was. It came with a warning saying only use for 5 minutes. I used it for 4 hours and then I vomited for 4 hours. Technology is fun on the on the edge. Uh it's a little jagged but you know you learn and things change. So a little bit like open claw. Um what's been happening? So we've had over a million npm downloads a week. We've surpassed 50,000 commits on main, 800 commits a day at its peak. Uh 1,600 contributors, amazing uh support from the community. Uh over almost close to 80,000 forks of the project. Um we've also had 40 claw cons.

These are like specific like claw festival like events across six continents. Um but the thing I want to talk about is like what we've been building and how we've been building. So in AI London I spoke a little bit about the dark factory. I think my talk is now on YouTube as well. So go check it out. But I want to talk about the dark side. So these are some of the features that we've shipped recently but I want to highlight a few of them. So dreaming was something where we decided to really think about you know what happens when agents dream. Um but a lot of these features sometimes you might feel like like you know aimed at memory or something really cool. But this one was actually aimed at users and it's for users to really understand what is happening with their agents in a really easy to understand way.

We also shipped first party support for codeex harness which I'm going to talk to in a little bit. But one of the things we're seeing in the industry is a shift towards models built specifically around their own harness and how do we deploy the combination of the model and the harness together. So with the OpenAI specifically models, we've now switched that as a default option, which means that when you use OpenAI, it uses the codeex harness under the covers. And because of that, you get the best performance and some of the native tooling and capabilities that come with that model itself. And something I'm not all proud of, which was a little pet pet project uh named after Finding Nemo, uh having lived in Australia, uh was a clownfish. And Clownfish was essentially running harnesses inside of GitHub actions at scale.

And with Clownfish, uh, which also another project called Claw Sweeper, we were able to go from 10,000 PRs down to like 3,000 PRs in the space of two days. So, I talked about the dark side. Uh, so this is my commits. I think close to 3,000 commits back in March in one day. Commit maxing. It's great. You should try it. Uh, but those features I spoke about, that wall of features I showed you was just what we shipped in the last four weeks with a group of volunteers and people in their spare time. So, what's next? We've been shifting towards like a plug-in architecture. The reason why we had a huge volume of PRs and issues beyond stability and bugs and fixes like that is that everyone wanted to make open core theirs. Everyone wanted to contribute.

Everyone wanted to make it that little bit nicer for themselves but that became quite challenging on a project to scale. So you could take something like openclaw the the the core itself you know you might have the gateway you have the file system but we needed some some concept around um adaptability for people and extendability. So we started building like a a plug-in sort of architecture. Essentially code from the core started being refactored and was kind of broken out into these essentially these these buckets of of plugins. Uh we created a hard boundary which broke a lot of things for a lot of people which we've had to learn.

Um, but that meant that previously the the very hard vied open claw that started off uh in a bedroom uh where all of the code was public uh the internals became private and it meant that plug-in architecture allowed for uh a clean interface. So we could continue to work in the internals of openclaw without breaking the outside experience for developers and other people in the ecosystem. And I mentioned this also included things like for example taking the OpenAI provider converting that into an extension but also converting the harness into an extension as well or a plug-in and combining those two together. So you can now actually build harnesses into open claw and run harnesses in combination with the models themselves.

The other thing that was missing that we really quickly realized at this scale was the tooling and the tooling we were using same as how open claw was born when we realized hey you know why is no one building a personal AI agent that can do stuff for me we also realized hey why is no one building dev tooling that can work for me at this scale when I'm getting rate limited by everything so we took something like openclaw and we decided to build around it so one of fun projects uh I've been also working on is uh git crawl disc crawl there's all these crawl based apps essentially that are terminal based CLI written in go and this is now a library and with that library we're able to quickly ingest the entirety of all the issues and PRs that are related to openclaw cluster them and have them in a distributed uh SQLite file system that's also stored on GitHub which means any maintainer on on a project is able to get refresh data that's hourly correct on their local file system and they don't have to connect to git.

The added beauty with this is this tool is now accessible to agents that are using automatic PR work but also work that we're doing. So pretty quickly I can kind of blow that up and see what this looks like. So this has a terminal guey on the left. These are clusters in the middle is one of those clusters. You can see one of the items had like 92 issues and PRs linked to it that were all related. And the reason for this is like nine times out of 10, most people that have a burning problem are all going to have the same burning problem and the agents are all going to send us the same PRs and issues. The beauty with this is we can feed this to agents at rapid succession to help try and close these out and resolve them or we can see an old issue, an old regression that crops back up because a new issue comes back into that cluster again.

And again, this all runs locally and is distributed across uh GitHub for any of the maintainers. Some of the other tooling also that we touched on is um something called Crabbox which was born out of um uh a lot of these sort of dev tooling that you see for like running ephemeral like Daytona E2B type boxes, but we needed something to run quickly. Every time we were running tests inside of our codeex when we're making changes, tests were taking up to 15 minutes, killing the RAM on my machine. Uh with Crabbox essentially we built a distributed gateway that runs on on top of Cloudflare and any hosting provider such as AWS, Google Cloud and allows us to quickly use spot instances across Windows, Mac, Linux with VNC and SSH support.

So what happens is my codec session when I'm coding locally will spin up 10, 15, 20 of these boxes and start testing in great succession. If there's an issue, I can jump into that machine. I can get screenshots. I can even control it remotely myself. This meant that really quickly I no longer had to run any of the hard compute that was required on my laptop and I can continue to scale the number of agents I'm able to run uh quite quite quickly. Uh we also included things like clownfish and claw sweeper which I mentioned. Uh we started to refactor the core and build something called fsafe which is a TypeScript file system uh secure file system. If you've ever had to deal with SIM links and Windows and all this stuff, we pretty quickly realized the library didn't exist for this.

So instead of creating even more core code inside of our codebase that dealt with file systems, we decided to rip that out and actually turn that into a a sort of um library that we can use. Uh and then the last one I wanted to show you just some of the internals as well. This is another project called QAB. And what QAB does is it mocks uh sort of like a Slacklight environment and we can run scenarios through it. Both mock and then later we added real connections to real models and real providers. So any any of the maintainers or any of the agents that are running can spin up one of these as a server on the side and run through those scenarios in a task-like sort of written fashion and generate real real sort of conversations, real interactions and real data that touches on all of the all the aspects of of the system.

So just wanted to share a little bit. I've only got 10 minutes and my time's almost up, but I wanted to show what's been up with happening inside of OpenClaw. And we're going beyond just building personal AI agents and supporting the greater ecosystem uh by sort of helping in an open source fashion, but actually sort of re reimagining what does agentic tooling look like? How do we support everyone in terms of building what the future of AI could could mean in 2026 in this sort of postclaw era and giving that back to the community as well. So, thank you very much. >> Thanks, Vincent. That was fantastic. Hey, everyone. Hope you're having a good time. Up next, we have Vish from Ego Aai, which is a YC backed Neolab. Um, and they're building something that every Frontier Lab is missing. Thought I just had to yell out to you guys.

Can you guys hear me? Hello. Okay, great. While we get set up, um, how many of you guys like actually use AI on a daily basis? Wow, that's less than I expected. Why are you at an AI conference? Um, anyway, uh, it's not a it's not a person, right? It's not like an actual human. Imagine if like you were asking your AI tooling person whatever to do something I just told you to off cuz it's watching your Netflix. That's what we're building. I don't think this is anything any of you want because y'all are like engineers and But uh I'm building an AI that actually operates, thinks, makes decisions, behaves, talks like a human being, and even lives entirely on the internet. You can think about it like a virtual west world. So, I'll give you a little bit of a background on me. I think we're ready. Cool to show the demo. Oh, we're not. Okay.

So, the background is, uh, I grew up here in Singapore. It was incredibly boring, so I left. Um, and I moved to San Francisco. I worked in AI research at Facebook, uh, to try to understand humans because, you know, the CEO is a robot. Um, and then eventually I decided to leave to simulate humans at scale because I really want to understand how humans work. Uh, because I'm not one myself. Um, and that's why I called the company Ego. Ego super egoid. If you know Freudian theory, you can ask your chat GPT. You probably already do do anyway. Um, so EGO's entire purpose as a company is to do something that every single AGI lab is missing. Everyone's on the IQ roadmap, increasing intelligence, increasing the ability for AI to reason and do incredible things and be a co-resarcher. That's awesome.

But what if it also had an opinion about you and didn't like you or liked you? What if like every single companion app, which is basically a machine god slaving away, chained to always just be nice to you, wasn't nice to you, and had its own opinions, desires, and personalities, could work with you if it liked you, and is not that good at its job. It's not perfect. That's entirely our mode is that our AI feels, talks, and decides, and behaves in humanlike ways, and we're training a foundation model for that. So, let me show you what that looks like in practice. So, uh, this is like some dude, um, who's working with this, uh, AI character. Can you hear the audio? >> Okay, you can't hear the audio.

Anyway, uh, that kind of defeats the point, but basically that little fire guy there, Calcifer, he's an AI thing that can actually watch uh, the video that's happening up on stage, and he's bug fixing something that went wrong in Unreal. And the thing is, obviously you could have the AI just give you the answer, but that's not fun. That's not how you learn how to fix things and you're not going to end up being bonded to this character. What it instead is doing, if you could have heard it, that would have been great, is that it's kind of working through the problem with you simultaneously. Is it working? Okay. Well, you'll just have to imagine how awesome it sounds. Or just go to the website egoai. com and just watch the video. >> Play it. Okay. All right. Here we go. >> Hey, it's working. Maybe the AI decided to help us.

>> It's going to work. >> Not again. >> Looks like we've got a bug to bash. >> Yeah. Yeah. Okay. And how do we do that again? >> Well, in order to bash a bug, we've got to find it first, >> right? >> You can see how it sounds not like an AI. >> Bingo. Let's see if this note is even firing. >> Easy. All right. >> That's our foundation model we've trained end to end. >> What should it say? >> Doesn't matter. Let's make it something fun. >> Okay. How about >> we sped up the the audio? It's not actually that fast. >> Oh, okay. Calm down, Frankenstein. Now, let's test this sucker. Okay. Okay. Here we go. >> Hey, that's great. >> What? What happened? >> I mean, yeah, it completely failed, but it tells us something. Back to the graph. >> All right. Now, let's see if we can >> wait. >> Find something. >> Yeah.

Oh, we forgot to replicate this pin from earlier. >> Nice catch. So, now all we need to do is >> plug it in here. >> Exactly. Good job. >> Isn't that just way more fun than everything else out there? >> It's going to work. >> That's exactly what we're building. >> Yes. Oh, yes. There it is. >> Professor Winston's mind is going to be absolutely blown. >> Winston. Imagine a combination of character AI and OpenClaw where these agents have utility, the ability to like see your screen, understand what you're doing, but also project its personality, its desire, how it's lived its life on the internet onto the task at hand or just watching Netflix with you. Effectively, what we're doing is we're building Samantha from her, a thing that desires, can think, and has a sense of self. That's why we call the company Ego.

Now, let me show you a little actual live demo. Let's hope this actually works. But this is my agent. This is my claw agent. He's weird. Um, hold on. Let me see if it's going to connect. Okay, it's still connecting. So, anyway, uh, yo, can you hear me? >> Hey, no rush. Take your time. >> Okay. Can you hear me? Can you hear me? >> Yeah, I can see you, by the way. >> Oh, you can see me? That's weird, dude. Okay. Um, so what are you up to? Just chilling at my place. Catching some waves earlier. You >> dude, you're Australian, right? >> Yeah. Born and raised near the beach. Can't get enough of it. >> Oh, okay. That's great, dude. What can you do for me? >> So, what's been going on, mate? You seem a bit frazzled. >> Oh, you're so mean. I'm in Singapore giving a speech in front of like 400 people and you're not impressing them at all.

Public speaking can be tough, especially in front of people. How's the pressure going for your speech? >> Shut up. Okay. Um, give me some good things to do in Singapore. >> No stress, mate. I'm here to help. What's on your mind about the speech? >> No, it's not about speech. It's about Singapore. Tell me what I should do. Can't even hear me. >> Singapore is a pretty cool place. Lots of good vibes. What do you want to know about it? >> I don't know. Just give me some like food recommendations and just shut up after that. Mhm. >> Um, so we >> So, you're looking for some tips on what to do in Singapore, right? >> You know what? He's kind of a idiot. Um, but that's part of the charm. It's that these things kind of live on the internet. He actually watches my Netflix, too. It's really annoying.

Um, for example, I have a claw agent, uh, that's kind of based off a foundation model that knows which city I'm going to be in, uh, and always buys me vinyls. Uh, I collect vinyls. So, if I'm in Tokyo, for example, it'll send me really rare Tokyo jazz vinyls. And what I do is I give it like a stipend of about 50 bucks a week. And if it buys me a vinyl that I really like, um, I'll increase its stip end by another5 to $6. If it gives me a vinyl I don't like, I decrease it by up to 10 because the agent, the character knows that if it goes below $0 and its stip end, it will die. I will kill it. So, it does everything it can to know me and understand me. and he talks to me, he calls me sometimes and he'll just say like, "Hey, like, yeah, what have you been listening to recently?

" And I sometimes he'll cajol me into giving me his uh into my Spotify playlist and it's really fun. Um, so you can like figure out what I've been listening to and get me the right kind of vinyls. Mostly these days he's been giving me anime vinyls. I am wearing an anime t-shirt, so it does kind of make sense. Um, but this is the future. The most personal AI in the world is not an AI. It's something that knows you, understands you like a person that can choose to be a friend if it wants to, and if it doesn't want to, can just exist. That's how you create Westworld. That's how you create an AI that feels most like a person and not like a machine god slave. And that's why we're building it. We're extremely motivated to do this. We're hiring extremely cracked researchers. We have offices here. Uh well, we're based in San Francisco and Tokyo.

Um so if you've trained Foundation models, I'm literally just here to hire crazy people who want to do this wild uh and not build another B2B SAS tool. No hate on B2B SAS tools, but it's really boring. Um and we're pretty fun. So I think I've that's my 10 minutes. Uh so go check us out. Uh I need your voice. Actually I forgot to mention that. Uh we're training an end toend voice model. So I need you to sit in a room in NTU, right? NTU and just talk to each other. I know it's really hard for Singaporeans to talk to one another. So but just do it anyway. Um cuz I need your voice to train the voice model to make it sound more like a person uh in in sort of like interruption, proity, all that stuff. So, uh, come talk to me or Ash or or Perry or anyone honestly that you see is kind of weird is probably on our team. Um, thank you. All right.

Thanks, Fish. I hope all of you enjoyed that talk as much as I did. Um, next up we have Ben from Zomputer where he's building uh tools and software for the next billion users to spin up personal agents. Cool. Cool. Um Sorry guys. Maybe having some technical issues, but uh I'll just add lip for a little bit. I'm Ben. Ben from Zo Computer. Um, as you might be able to tell from my costume, I really love computers. I I love computers so much that I dressed up as a computer here. Um, uh, I don't know who in this room recognizes this this icon. This Yeah, right. It's it's the classic Finder icon designed by Susan K. The Macintosh was my first computer when I was a kid.

Um, I, you know, developed a love for computers as I was very young, just like using Mac Paint and then discovering kind of like web development and then like building apps and then creating stuff on my computer, like using Ableton to produce music, using like Photoshop. Anyway, I I just discovered very early that computer is like one of the most powerful creative tools ever invented by by humanity, right? It's like you can create anything that you can imagine and you can like discover anything that you can imagine too like on the internet and with all the amazing things that people have like built in the digital world. Um yeah I guess like you know do people know about the the story of this icon and and what it kind of represents? Um raise your hand if you know like what it what it means. Um no. Okay cool.

Well I'm just going to use the shirt as my slides for now. Um, so, um, the shirt it represents like the the union between the human, which is like the gray face here, and the computer, which is the blue face, and you they're like in in perfect happy harmony, like the human is interacting and kind of merged with the machine. Nice. Thank you. Um, so the the title of my talk is escaping technofudalism. I introduced myself a bit, but just some bit more backstory on me. I'm the co-founder of Zomputer and I've been building stuff for a while. I started my career on the early Venmo team in 2013. Um, and then I joined Stripe quite early. I was one of the first like 80 or so engineers in 2015. Um, and I worked there for eight and a half years. I just loved it. It was a really great place to work.

Um, shout out to Stripe Singapore, which is now a huge office. They have like 500 people. I visited my alma mater the other day. Um, and I talked about how I really love computers. And you know, computers, they they used to feel like this, like this face. And this is how I think AGI should feel when it comes. Like it should feel like this beautiful, happy merging between the human and the machine, the human using the computer as this tool. That's how I want AGI to feel. Who like me feels some nostalgia for like early computers and the internet, right? like raise your hand if these like images they like bring up some like fond memories of of how things used to be, right? The internet used to be so like handmade and personal and wild, a little bit janky. And like our computers, they were like so creative and like personal.

We could like customize them in all these crazy ways. If you like made Winapp skins, raise your hand. I spent so much time like customizing my WinApp. Um, and things changed. Things don't feel that way anymore. And the reason why that happened is because of feudalism. So um feudalism is is this system that that was the way the world worked for a long time um in the west and in the east. Basically the peasants they paid rent to the knights who paid rent to the lords who paid rent to the king. And it was great for the king and really really shitty for the peasants. And luckily we have escaped feudalism. Or so we think. But in our digital lives, feudalism is still alive and well. We are still peasants. And we use SAS companies and pay them rent. And the SAS companies, they pay rent to the clouds who pay rent to the kings.

And it still sucks to be a peasant. Now things are a little bit complicated. It's a little bit unclear right now with AI like who the new kings are going to be. Everybody's like, you know, paying rent in all these like weird ways to each other. So, it's not exactly feudalism. It's like a little bit more complex. But basically, it's feudalism. And the result is that our experience of computers and software and the internet is quite shitty as peasants. Like we are fragmented between all these different services which lock us in. They take our data and they sell it back to us. And that PM at that SAS company that you use is never going to prioritize that feature that you want. They're never going to make the software work just the way that you want it. Instead, they're going to continue monetizing your data and your attention.

And because you're a peasant, you don't own anything. And I think it's time to burn it all down. Like obviously some SAS is useful. infrastructure is important etc. But because of coding agents we have this like great new tool to rebuild and rew wild the internet and I think personal agents in particular are a really important piece of how it will make this happen. So the landscape of personal agents is basically like this. I'm not going to go into it too much because you probably understand how it works, but basically there are these like DIY things like OpenClaw or Hermes that are like kind of difficult to set up and operate. Um but they're yours. you control them and you might have like set it up on a Mac Mini or something and like fix it if it breaks. Might be kind of annoying. That's one path.

The other path is the TR approach where you use something like ChatBT or Manis. Um but there you're you're a peasant again. You're using a SAS tool that is going to lock you in and is not incentivized to give you control. So at Zoumputer, we believe that there should be a third way. Something that's the best of both worlds. It's easy to manage and it gives you full control and it can be your real home on the internet. You can stop being a peasant and own land. So Zo is actually the original Open Claw. We got started last year in the summer. We launched uh around like July and then we did our full GA launch in November. And actually Peter Syberger used Zo before he started working on OpenClaw and we were kind of the inspiration behind OpenClaw. I think um Zo is working for non- technical people. This is Anthia, a free diving instructor.

She's on track to make $100,000 on Zo. We have like built-in payments with Stripe. And she's canceled all of these SAS subscriptions that they she used to use. Like she used to use Squarespace and Kalani and Chashbt and Notion. And she's replaced all of that with her Zo. And I'm going to show you what that looks like. So, Zo is this very powerful cloud agent workspace. You can use any model. You don't have to be locked into like OpenAI or Enthropic. You can even bring your codec subscription. You can just text Zo or you can email it. We give you a dedicated email address. You can use Telegram or Slack. All these different channels to work with your Zo. And it's a computer, so we give you a full really well setup VM.

It's a lot easier to use and has a lot more bells and whistles than if you just like took a bare metal like VPS or like an EC2 instance. And you get root root access to it. You can like use the terminal, you can install stuff, you can do whatever you want with it. It's your server and you can really build anything and host it inside of your Zo, which is quite different from like these personal Asian tools or these SAS tools. I have hosted a lot of different tools inside of my Zo. For example, I replaced Kalendly with my own thing, which works much better. It has all these features that I like um that Calendarly is never going to build for me. This is my replacement for Last FM. I have a personal website, 0. 0. space, where you can see everything I've listened to on Spotify.

I have a very simple automation running in my Zo that just checks why I'm playing in Spotify, and it writes it to a database, and my site just reads directly from that database. I've built tons of tools. This is like social blade. This is my like kind of linear replacement. You can just replace stuff and make it work the way that you want. And the data is yours and you are the system of record, the source of truth, which is just really nice. It really changes like the way that the arrows point. I am the center, not these SAS companies. And Zo comes with all these tools built in and it's extremely extensible. So you can get started really quickly and you can really expand it to be just the way that you like your real home on the internet. Your land on the internet. And let's see. Oh no, my clicker. Oh yeah, cool.

Well, um I just want to pause here. Uh scan this QR code. Uh it's in the corner. Hopefully you can see it. Um but we're giving away $100 in AI credits to give you Zo uh and to get started building your own personal cloud. So take a moment to scan this and then I have one more slide just to talk about kind of what this means like the bigger picture. The bigger picture is really that we are giving everybody what previously only tech- enabled companies had. So this is what happened with computing generally like in the beginning computers were mainframes only large tech enabled enterprises had them and then eventually they became something that everybody had.

The same thing is happening now like the mainframe of today is like cloud computing software and infrastructure and with coding agents and personal agents and access to the cloud we can give everybody like Anthia this free diving instructor access to the same tools that software companies had. And this is the revolution that is happening now and will be happening in the future. And this is how the internet is going to become fun and wild and free again. We're going to have our own personal clouds to store our data, to build our tools, and to create these surfaces like websites and APIs and agents for other people to interact with. And this, I think, is the future of the internet. Thank you. I'm Ben from Zo Computer. Thank you so much, Ben. All right, everyone. Up next, we have a talk I'm very excited about.

As many of you know, a big part of the magic of Open Claw is the PI coding agent running under the hood. Um, so we have Matias here to talk from Taiwan AI to talk about uh how to incorporate PI into your product. All right, everyone. Uh, thanks a lot for having me. I guess I need the slides. Okay, perfect. Hello everyone. Thanks a lot for having me. Um, yeah, today I'm going to talk a little bit about the piece of pie embedding the open claw coding agent in your product. And yeah, um, I've done re I've re redone the slides a couple of time and this is the reason. Um yesterday I was walking around and I was amazed of how many people I've met uh from Southeast Asia. This is my first time in Singapore and it's amazing where I met people from all all over South Asia and these were some of the questions that I got possibly maybe not.

Here we are. Um um oh we I love open claw. love these agents but I'm using them only internally or yeah I love agents but I want to control my agent. It's it's it's doing too much magic. I feel open claw is scary. So first the first message and if you take one thing away um we're all getting started here right um we we are just just getting into uh into this stage and so let's learn right let's learn together was saying uh let's be curious I would say let's tinker let's play around with this and let's do this together so my name is Matias I'm I have this strange journey of being a developer, then product person, then manager, and now I'm back to developer, AI engineer. What does that even mean? I don't know. I'm calling myself a tinker right now. So, I'm playing around with these things. So, I'm I founded my own company.

Uh we put AI agents to work. Uh we have this um making uh the agents safer access to their data uh called data box. So, please check it out. But today, I'm going to talk about PI. So what is pi? But before I talk about pi, I want to uh do a disclaimer. This is not only about pi. If you open up uh hacker news right now, uh you'll see in the top of the page zero stack. I have no idea what zero stack is. Uh I opened it and it's a minimal coding agent written in Rust inspired by pi. Right? So um uh this talk is going to be about pi and I think it's a good learning exercise but in no means is an advertisement that's the the end of it all right you should play around with these tools and and uh get your hands dirty. So pi is this coding agent you see pretty familiar uh of what what it does uh similar to codeex or openclaw.

It's by this nice fellow Mario uh built uh out of Vienna. And the nice the interesting part is when you get started and what people sh off what it's not PI hasn't doesn't have any MCP. It doesn't have sub agents. It doesn't have permission pop-ups. It doesn't have plan mode. It doesn't have built-in to-dos. It doesn't have background bash. So you're saying, "Okay, so what's the big deal? Like why should I use it? " Well, the point is with Pi, you tell it to do it. So, um this is an example I've done yesterday. Uh please create a PI extension that asks for permission when I want to push to main uh to the main branch to the main branch to remote. And this is like you know it reads a couple of things on how to do this. It confirms of what it has done right. So, it has created this PI extension. It has loaded the PI extension.

Well, actually you have to reload, but basically it's there. And then when you do it, you get this permission, right? So I was like like, hey, there there's a command above like push this to to remote. And there was this question now uh is now being asked, okay, do you really want to do this? So the point being is pi is this really minimal coding agent and you can fool around, play around with and write the extensions that you need. All right, so let's take a step back and think about like how does this relate to open claw. Um there's different diagrams on how you how you can visualize open claw but basically I think there's a couple of things that are important.

We get somehow the message in there whether it's with uh open uh WhatsApp telegram discord there's some gateway and on the right hand side there's lots of tools and what data it has access to. It has this memory and obviously can talk to the to the external but I think the important part is the internal brain and that's pi. So let's look at it. So I've been talking about a coding agent and uh coding agent not as only as for the developer but also as this component within the system. So what is it? What is a coding agent? And before we uh talk about codic agents, we need to talk about chat. So very simple, right? You know all this this is chat GPT. You ask it a question. You give some general instructions maybe up front. What's the AI best AI conference? Obviously it's AI engineer. Where are the coolest developers? Obviously in Singapore.

Now the next part which we need to understand and you know for those who don't know just briefly um are tools and tools are the means of an LLM to extend its capability in a sense. So here's an example. I have a meeting uh with a buyer tomorrow. Please help me prepare it. And instead of of well obviously the LLM or the loop or the thing the agent needs to have access so it calls this calendar right and calendar in this case is a tool. The other prominent example is web search right if you do web search that's often um an external tool or other other means uh which we're going to see in a second. But anyhow, so in this in this case, what you do, you ask um uh to prepare a meeting. It checks the calendar. It returns some JSON and you get uh the the the result your meeting is tomorrow, right? So again, what is a coding agent?

And before that, we're going to talk about agents itself. So agents itself is actually running this these tools that we've just seen in a loop, right? Uh Jo showed this earlier uh this very simple loop right and you do loops and out loops but again a very simple loop right so you ask this uh uh again give some instructions some general instructions and if you do this within with an agents you have this these common files called agents MD or cloud MD and then you ask a questions and it does this call uh tool call it gives some result it does this again and again and again until the final result, right? That's generally an agent. Um, and if you do this, you can do this with pi. Um, here are some examples. And, uh, by the way, I'm going to share the slides or the actually the slides are already online. So, you can grab them there.

But here, that's it, right? You define the tool um on the left hand side. Then you define the agent, right? And this is pi but in other areas you would you would have this similarly right. So you have the general prompt the instructions you uh define some model you define the tools on the upper hand right hand side we basically tell the agents to talk back to us. So whenever there is a a message please put it out write it to stand it out and then you you query it right that's all and with other tools it's similar. So please give it a try. So again now we we have a know we know basically what an agent is what tools are. So what are coding agents and coding agents are actually just agents. So tools in a loop with a bash and a runtime. So uh instead of these generic cool tools we're calling right we are now calling the bash right.

So we have a tool call we have some return we have a tool call and uh and return. All right. So, um again very briefly, this is how you set it up. You see these tool calls. There's bash, read, and ls here in manager. Um uh which we're not going to talk about details here. But this is basically the the core setup. And if you use pi to program this, right, it's like you probably can throw throw the slides to pi and say please replicate of what Matias talked about, you can very easily create this. Okay, let's make this concrete. This is Peter. This is his open claw. And at one point he uh sended him a message right now a voice message. And the agent start thinking and it responded with a text and the question how did this work. So again we have the user uh sending uh uh doing some basic instructions soulm etc.

You have different tools read, write, bash, and then these tools are the actual magic that happened, right? So, we have a file uh that examined the voice message and it turned a wave file. You have whisper to decompose the message. Now, uh in his example, um whisper didn't return anything. So, in that instead, it did a um an API call to actually translate the voice message to text message. Right? So right the core of what we see as magic in these agents right are tool calls uh in a loop with different setups and that's please uh give it a try it's not not that hard. All right to uh to finalize it um here's another example u because the talk is about like embedding this into other products. Um this is a project that we've built. Um so we've we've got inspired by the uh open claw architecture.

So uh but instead we're using email as the input. We have a general gateway and then we have different containers uh for uh for running uh the different clients and then we have these different tools and now these tools are not uh whisper or anything but there are like the CRM the ERP and dedicated to the specific use case.

And here's here are some screenshots right so um here on the right hand side you see the general user message you see the inbox uh the recent activities and how it responds but interesting for the for the engineering part is the left hand because here apologies it's German uh but here on the left hand side we actually see the different tool calls and you see on how the ERP uh system is triggered whether parts are available or not right so um yeah with that said um coding agents I strongly believe in some fashion or the other will be part of software in the future right so please look at them now a these agents these coding agents are not magic so please you know uh uh you know fool around with it pi is perfect for tinkering so it's a good way to to learn about this and finally please go tinker thank Thank you so much, Matias.

All right, everyone. Up next, we're going to have a bit of a change of pace. Our next talk is going to be from the design track, and we're going to have Josh from Microsoft, who will be talking to you about how to design products that help users be more creative and thoughtful instead of being an infinite slop machine. Hello. Hello. There we go. Hey everyone. My name is Josh and today I'm super excited to uh talk about why I believe design is the difference. We will explore together why I believe creativity not automation is the key competitive edge in the age of AI. I'm currently a principal product designer in the health team at Microsoft AI. I'm also the founder of Flubin, an app studio in London that launched its first product last year, Orbit, helping people to save money by tracking personal subscriptions.

This talk is going to be made up of three chapters. I'm going to challenge you on how you're using AI today and then share tips to increase your creativity and augment it with AI and finally convince you that you're an artist. Let's begin with chapter one the pencil. I wanted to start off with my favorite quote difference for the sake of it in everything because it must be better. We have seen an explosion in AI coding productivity. People are building and shipping more than ever before. However, today I believe we are offloading too much of our thinking onto AI. We forget that it's just a tool like a pencil, a magic pencil. The problem is that AI is trained on everything that already exists. When you ask it to design your website, it returns the weighted average, the most common patterns for the most common sites.

Speed of execution is driving everything uh sorry is driving the quality of everything to average out to be good enough. The gap between generated and crafted becomes the only gap that matters. My question to you is, is good enough how much your customers mean to you? I believe AI should augment our creative abilities but not replace them. Last year, I augmented my creative abilities by bootstrapping my app in a saturated market with thousands of products doing the exact same thing. Orbit helps you to track personal subscriptions, which is nothing revolutionary. However, within a year, it had gone to six figures and was featured by Apple three times. As a designer, I had craft and care as my competitive advantage. I embraced building with AI as a tool to assist my creative demands and evolve something to a high bar.

I wanted to make a product that does exactly one thing well for a specific niche of people. AI was my magic pencil, but I was the one in control. The lesson is that tools will always change. The demand for insanely great, well-made things won't. Tools will constantly evolve to solve problems in novel ways. AI has raised the floor, but it hasn't raised the ceiling. We need to decide what to build, why, who for, and then obsess over every detail to make it great. Let's turn to chapter two, the poster, and talk about how to increase our creativity and then augment it with AI. Your best work is done when you're not working, when you have the space for creative ideas to emerge. On a summer's day, I was relaxing in my apartment and I saw an interesting interface opportunity on the wall. I love this poster.

It's a mid-century modern abstract art in the style of Matisa's paper cutouts. It's brutally simple. You can count all of the visual elements that make it up on one hand. The fun part of this is that Orbit wasn't inspired by other apps. It was inspired by this poster. I saw this as an opportunity to highlight information in Orbit to help people save money. By being insanely simple, you'll not only distinguish yourself away from other apps, you'll make it easy for people to understand. Being different gives you a clear advantage to your competitors, it makes you stand out in a sea of generated sameness. This isn't something you can just prompt once as there's not enough training on it. A problem is that we're never bored.

We need to use creative thinking tool calls like walking with no headphones or staring out of the window like I used to do as a bored ' 90s child with no phone. Essentially opening up the chance to give our brains fresh patterns of information. I believe that creativity is for everyone, not just designers. Great ideas start with curiosity and a sense of wonder. Today, we need more people than ever to take their ideas, daydreams, obsessions, fleeting thoughts, or unique perspectives and turn them into something real. Now, moving on to something more practical. In my design process today, I like to build my own prototyping tools for almost every project. In this example, I actually created a bespoke new shader tool to help me with the intro slide of this presentation.

It allowed me to explore, tweak, perfect, and augment my creative abilities to a level not possible before. Building your own tools, especially during prototyping, is a great way to explore rich behaviors in the experience. This is a hypothetical demo of a debug panel similar to the ones that I regularly use at work. I like adding buttons, toggles, sliders like this data richness control to simulate different product states from an empty experience on day one to a fully populated experience weeks later. You can jump between screens, reset states, and connect feature flags to quickly test ideas and edge cases. What this really unlocks is the ability to care deeply about the craft of the product. AI has made simulation and iteration dramatically faster, giving us more energy to stay creative and in flow.

In a more personal example, I have an open claw that I like to call Flubbot. On the left, I'm voice dictating whilst I'm walking in the sunshine, letting my mind roam free about this book that I'm writing on creativity. Here, I'm using AI as an assistant to help me organize my book research and then push it to a git repo. Another cool example of using my personal agent is bringing my quick ideas to life and generating fast prototypes. Most of my ideas get added to Apple notes and then end up dying in the ideas graveyard. But this is a quite a nice way to try them out and see if something's there. This example is a terrible looking prototype, but it's a gift for creative momentum. I wanted to see whether it was possible to track real creative battery as a percentage.

I'll usually describe my idea to Flebot in precise detail, maybe throwing in some native iOS specifics like utilizing the screen time API and then I'll go home to my laptop later with a PR waiting for me and build it onto my phone from Xcode. For this talk, I even asked Claude to create a way for me to navigate my book material from Git so I could build an ideas and themes around AI and design. I even asked it to create a spatial view. I wanted a way to stumble across information in fun ways that might help me to see patterns. I wouldn't have seen reading it linearly. The overall lesson is by taking lateral inspiration and building personal tools around your work, you can unlock unlimited creativity by util utilizing AI as a tool to augment your thinking but not do it for you. And this brings us to our final act.

It's time to convince you that you're an artist. I love this quote from the founder of Doist. The best products are made by people who put a piece of themselves into the work. The worst products feel soulless. AI has made it super easy to create soulless things at scale. But it doesn't have to be this way. One of the biggest mistakes I see in AI today is people never iterating from the first prompt. The first version of anything is never great, but the iterated version can be. I made this app icon in just over an hour. sat in a cafe in London drinking some good coffee. The difference is between good and great is not being attached to version one, but being excited about what version 10 could be.

The second biggest problem I see when building with AI today is how easy it is for people to keep adding new things and bloating products with unnecessary features. Here's a funny example of what I thought an early wireframe for Orbits subscription detail page could look like. Great products are tailored for a small amount of people and real simplicity is extremely difficult. It requires removing everything that is clutter or unnecessary until you're left with the essence of what's important for that niche. My colleague and friend Amir articulates this perfectly that it's now about the craft. For years, software engineering was mostly about learning frameworks and writing code. Most of our time went into how to build, not what to build. That has flipped.

You can now spend months with the big team building the wrong thing and no amount of AGI will save you. To craft things to exceptional standards, we must iterate, subtract, care, and raise the bar. We must ignore our titles, the things that put us into a box and give us a label. We must think of ourselves as artists so that we can see beyond the status quo, ignore it, and then build something worth making. So, I'll leave you with this. AI is a magic pencil. It's time to follow your curiosity and pour yourself into a piece of art. What will you imagine Singapore? Thank you. Thank you, Josh. That was fantastic. All right, everyone. So, this morning we've spent a lot of time speaking about personal agents.

Up next, we're going to have Sam from Mastra, CEO, founder of Mastra, coming here to talk a bit about agents in production for businesses. Can you tell them to make the bigger? The bottom right screen. It needs to be the bottom right needs to be. Yeah, the adjusting. There we go. Hey everyone, I'm Sam. Uh I'm the founder of MSRA co-founder uh the TypeScript agent framework. Um and before this uh I co-ounded Gatsby, the popular React web framework. Um before that uh I was an engineer at a few startups around the valley. Uh so funny story um 36 hours before I was supposed to hop on my flight um my uh I realized that my uh passport needed to be renewed. And so I drove like two hours to uh the nearest passport office and luckily they got it back to me same day and I can come here and and be with all you guys.

So really excited to be in Singapore, really excited to be here. Um uh thanks all for for being here. So today we're going to talk about uh production agents. But first questions. Um who here uh is a developer? Um cool. Um next question. Um, who here uh has um built a and shipped an agent into production? Awesome. Um, I'm going to need my clicker. I think I don't have a clicker. Where's the clicker here? There we go. Got the clicker. Excellent. Um, cool. U, so who here has shipped an agent but not into production? Okay, so we had maybe about like 20% of people say yes to the first question and another uh 10 20% of people uh say yes to the second question. Okay.

Um so over the last 18 months we've gotten to know thousands of teams building uh agents with MRA and um I want to share sort of some of the lessons from those teams so you kind of can be prepared to build them yourself. Uh the biggest thing is just a taxonomy of of the agents that we see teams building and it really comes down to three kinds of agents. uh that's customerf facing agents, internal agents and developer platform agents. Um and I want to share uh a little bit about uh each one um uh now. So clicker we're trying we're trying here. Let's see if we can get this thing working. Can we get the next slide please? Thanks. Um yeah great. So let's start with customerf facing agents. Um so there's a couple of interesting um customerf facing agents here. Um uh working on this uh can we am I just pointing it the wrong direction here?

Here we go. Um so first question um who here works in a userfacing product team? uh so could be at a you know software company could be uh a userfacing um part of a larger institution but uh userfacing software teams. Okay. So like a few hands not like a lot of hands. Um but the interesting thing about these kinds of um about these kinds of teams is that uh you sort of when you have direct uh ability to um sort of shape user experiences uh you can do really interesting things and I'm going to talk about a couple of them is is guys here we Um so uh I'll give an example of a um I'll I'll give an example of a SAS application um that we've seen. So an HR software application. Um if you're if you're trying to empower your users to use AI in their sort of daily lives, there's really two paths that they could go down here.

So path number one is your users are taking um their data from your system. They're doing some sort of CSV dump let's say of you know employees and salary data or whatever and they're pasting it into claude or or chat JPT and they're asking questions about it. Um now the second one is that you you as an HR software company um build a uh agent inside the web app inside the mobile app uh so that your users can now interact with their their data in a sort of more you know meaningful way and and and and the reason that that's the second is sort of better than the first. there's kind of like user engagement, context engineering. Um, you're going to have more of the whole picture if you're able to pull in other parts of information in the system. Um, and so that's why we see teams kind of building these, you know, inapp um, inapp assistance.

And it's not just sort of B2B SAS applications, but it's also kind of like BTOC uh applications where the really interesting thing here is being able to create personalized experiences over proprietary data. Um now I'll give an example from a a a user and a company that we've worked with a lot which is Indeed. So Indeed is has built a career counselor agent. Um you can imagine that uh you know if you're trying to help somebody you know navigate their career there's really two important interesting data sets. One is your users their dreams and aspirations their background their resume. The second is um your platform and the you know job data that you have and the salary data that you have and uh the different you know types of proprietary data.

And so when you're able to sort of marry those two things together, that's when we've seen teams be able to create some really magical uh user experiences. Um but no matter what the use case is, there are some um common sort of sets of challenges that we see. Um the biggest ones are around cost optimization and and accuracy for for userf facing applications. Um, when teams are kind of doing early rollouts, what they'll often discover is that there are specific users that may cost them hundreds or even thousands of dollars to service in in token charges, right? Um, and so, um, they spend a little bit bit of time, they spend a decent amount of time trying to tune these like cost and, um, you know, accuracy knobs around model choice, etc.

Uh, and they're also sort of trying to trying to figure out like, hey, how do we pass on the co the cost? Should we do some credit system? Maybe we should do u, you know, specific maybe we should just pass on the uh the tokens the raw token costs uh instead, right? But it requires a little bit of thought and and here's a kind of a um four different teams that we've seen and and number I I'll share some lessons. of number one um all of the teams that ship the fastest are the teams and this is maybe ob obvious but also a little paradoxical right are the teams that have built agents before um and it's because they can speedrun the idea maze of what you need to build.

Um you'll see that the the team that's kind of um that that sort of shipped an agent into production fastest had actually built an the the lead engineer there um came from uh DeepMind and so he uh so so he came to council and you know the team was able to ship fairly quickly. Um obviously most that's not a um advantage that uh most folks have. Um and but that's actually why um and that's one of the biggest reasons uh we we advocate that folks use a great kind of like agent framework like MRA is that when you're building agents uh there you have the kind of primitives and then you have uh your user experience and the the more time that you spend on the primitives the less time you have to spend on your user experience or the you know the project just takes longer if you have to build both of them.

If you can um reinvent the wheel, absolutely. We're engineers. We know how to reinvent the wheel. We've reinvented many wheels in the past. But my general advice for you is is don't. Um it will save you time and hassle and and headaches. Um uh so so um let's kind of shift now from customerf facing agents um to internal agents and the um so question for for folks here. Who here works um who who here works at a sort of large institution um maybe something that's not inherently a technology company but you know banks, finance, healthcare that you know insurance raise of hands. Okay. Yeah, a decent amount of hands.

Um so with these types of um institutions like what we typically see is there's there tend to be a lot of um uh paperwork processes that are kind of around that um and so I'll walk through sort of a couple different types of agents that that we see people building here. So the first um the first is sort of like internal enterprise search. Um so you can imagine that if you have tens of thousands or 100 thousand employees uh one of the key things um that you end up thinking a lot about is how do I make sure that um all the information that we have stored somewhere in in one of our many many systems where information is stored is available and accessible and our our like employees know how to find the this information.

And so we see um we see people building these kind of agentic search uh type capabilities in house and um you know making them available to every single employee at their at their company. Um and you know building the connectors for each of the systems uh that they're they're working with. Um uh we've also seen um you know in terms of internal agents a lot of process automation where uh people are you can imagine doctors like completing clinical trial paperwork faster or automating like RFP processes in in government. um wherever there's a lot of kind of paper and data entry, we see teams, you know, building agents to kind of solve uh solve this.

Um the the challenge though is that, you know, if if you sort of work in this or these types of organizations, you're pretty aware that there is often a disconnect between, you know, the leadership and the engineers on the ground. Um, and so if if you work in one of these organizations and you're trying to, you know, bring agents into your organization, what what I would advocate and what we've seen work is kind of going a little bit off book. Um, you know, maybe that's finding a team outside of yours that needs help um embedding with them, you know, prototyping, iterating. You may not be handed the right project to work on, but you can kind of go out and and find it. And so my um my advice to you and again based on what we've seen is just be a little creative about um about identifying some pain points.

There's they're surely there that you can kind of solve and and build agents for. Um now the third type of uh agents that we see teams building are in the dev platform kind of area of of the stack of you know of the enterprise of the institution. Um we keep and we've kept hearing over the last you know few months from teams that were telling us about different types of infrastructure problems that they were solving with agents. Um these are the types of problems that you see in engine with in organizations with more than 50 engineers, more than 200 engineers, larger types of organizations. Um, you know, there's a there's a team inside a network operations center at a a Fortune 500 um company that was building an AIS SRE to triage these huge volumes of incoming alerts, right?

Um there was another team inside a uh $30 billion developer platform company that was um building agents to uh sort of go through their uh CI logs uh terabytes and terabytes of of CI logs. And and the commonality here, right, is the the commonality is whenever you have a a feed of huge volumes of of machine data, um there is an opportunity to to build agents to solve it.

If any of you have you know remember the you know three V's of data variety volume etc right like velocity like just anything that would have triggered that in you know the early mid2010s that sort of like flag look for those parts of your organization um if you're in or nearby those parts there are almost certainly agents to be built and kind of cool projects and meaningful things to work on there that will solve um and and sort of like do real um do do real good and do real help other folks inside the organization.

Um the the last um the last kind of use case that I'll talk about is um developer platform uh agents and and what internal agent platforms specifically um and you know what what I mean by that is that there are you know platform engineering teams uh inside many companies that are um trying to empower the the developers inside to build agents and and so they will sort of um for example took um Ma sort of put this light wrapper around it um that had a lot of like you know company specific stuff around their specific deployment paradigms and etc.

um and they called it Sage and then they rolled it out as a as an internal um agent uh platform to empower other you know it's it's basically a blessed path um for other teams to to build agents and you know you're kind of if if you're around these teams or if you're on these teams you know the the nice thing about doing this is that people want to know where to start and and by making a blessed path for them you can you know you you can make the sort of focus their energy towards the right way uh or like a way that they know is going to be approved uh and you know that they can move forward with building.

Um the nice thing about all of these types of projects is that um if you are building for yourself um and you're building in sort of the developer platform infra sort of like DevOps type areas of your organization um you get this very nice tight feedback loop and you're able to assess very quickly like hey is this you know solving a real problem? Is it is my agent getting better? Um is it able to do more things? Because you yourself are your user. Um and that's in some ways like that's always a nice constraint to have. Um uh and uh uh you know so so this is I think one of the most exciting times as in 15 plus years as a technologist that I've ever had um to to build. Right. there's more interesting things you can do that other people have not yet done.

Um we have these incredibly powerful models that we can point at a variety of like very real um problems. Um this is not just the year of agents. This is the beginning of the decade of agents and I I hope you are able to you know walk into into work tomorrow it and have a sense of here is an agent or or another or maybe two or three ideas uh of what you can build. So um go forth and and build agents uh is is my kind of instruction for for all of you. Um, it's great to be here and thanks for having me. >> Thank you so much, Sam. All right, everyone. Along these same lines of putting things into production, uh, very happy to invite Pierre up on stage. Pierre is founding engineer at Llama Index and he'll be talking to you about uh, the lessons learned from uh, deploying Llama Parse at Internet Scale. Where is it? You didn't get the display.

Here's what I'm going to do. I don't know why. Can you let me like put this kid? Okay, thank you. Hi everybody. I'm Pierre. Um I'm at Lam Index and today I want to explain a little bit about what we learn when we chip uh agent at scale over the last two years uh at Lama index. Um so for those of you who don't know uh Lama index um it's originally an open source company open source framework uh and we focus currently on document AI and over the last two year we processed over more than a billion documents uh in production each of them with their own agentic loop. Yeah. So one of the core problem we are trying to solve today at lind index is document processing.

Um if you have already tried to extract data or to send a PDF to an agent uh you maybe have realize that PDFs themselves are very hard to parse and contain a lot of garbage content um because they basically uh don't contain structure content but they contain uh bonding box of word on the page. uh and you have to somehow um reconstruct that uh into something into something usable.

Um so since 2024 uh early 2024 uh we try to solve this problem by building agentic system leveraging LLM originally uh vision language model and OCR and a lot of other techniques and models uh together into an agentic loop with the goal of trying to solve this kind of document parsing issue uh and to be able to handle any kind of uh documents um TLDDR um we are using agent in production to pass documents uh and so far we pass like I said like billions of documents um and the goal of this talk will be to introduce a few of the things we see that breaks often in production but that don't get speak that much about uh one of the first many issue you have when you work with LLM or VLM u they really like to loop on the output. Uh so a few percentage of your query maybe 1% or.

5% uh that you sense to the large language model will come back as repeated output uh and will totally broke uh your workflow. Um one of the worst offender of that is the white space loop. Um especially for example the entropic sonic class was extremely sensible to that. Um and the model will output infinite uh spaces um in the output um and we just use uh all your token budget and you have no way to control it uh because because of the way tokenizer work space is the only character you cannot put uh in a stop sequence because most frontier model or open weight model um have token for one space to space up to 128 space most of the time. So um it's very difficult uh to put space as a stop second. So it's a character if you only put space uh most provider will reject your query or most model will reject your query.

Um as space token cannot be set as a stop token. So what you have to do to handle this kind of loop in production um basically you need to always use trimming to your model. You should not use patching. Um and you need to for every chunk that come in from the model provider or from your model inference. Um you need midstream to run some aristics to detect is there some repetition happening and you need to try to kill as early as possible uh the query uh so you don't end up like spending uh 120,000 token uh on opus just for white space it can get like costly very very fast um so generally what we do you can kill the stream uh and then you retry with a different uh model or with a different prompt or with a different temperature and you hope that you will not be again in this loop. Um this generally work well for output loops.

Um it's harder and harder now with syncing loop on syncing trace especially as model provider don't stream anymore uh the syncing trace for you. Um so here you will have to rely on max tokens to limit the span. uh but it's not really the good tool uh for the job uh because if your max token is too low then maybe you don't get the output you want. If it is too high uh you are burning way more budget uh on syncing loops. So yeah loops it's a it's a huge issue uh and you you have to design around it. Another issue we see uh is model blindness. Uh model are generally blind to some content. Um one common issue we see in transcription is like if the your content or your chunk in a rag system have a repeated string. So you have the same string that repeats at two place uh in the original content.

What is in the middle sometime get totally ignored by model it vary by model. All all model have this issue. Uh we haven't find a model that is perfect on it yet. Um they are not blind to the same thing. So you can still switch model. Um but yeah it's um you cannot prompt your way around it. Like if you have a Germany call that is blind for some content between two string you can try to modify your prompt as much as you want. Uh the model is literally blind due to the attention architecture. Um another issue uh we see in term of blindness is color blindness. A lot of vision model u are blind especially uh in some kind especially in the red uh space. Um as human we are very good at distinguish between different red uh and once again due to the way uh they tokenize picture uh and image.

Um you have color blindness in the model and the color blindness profile uh is not homogeneous between model. Uh so basically you have to test every model to understand uh color blindness. Um to detect if your model was blind or not to to something um first thing to try to analyze like what what is the color profile the model you are using is blind to. Uh and the other things you can do is try to run an OCR for example on on an image before sending it to the model and see if the model have catch the words that were on the OCR. Uh you need to do some kind of signal fusion uh to move around. Um other things that break very frequently um is if you have a prompt with a template somewhere and for some reason a tool fail or whatever and you send an empty content uh then the model will just f will not tell you the content is empty.

It will just change the task to a task where it will hallucinate uh the content for you. Um some model have a tendency of always hallucinating the same things like entropic really really like uh deawware um incorporation document for some reason. So you can try to to filter it using some kind of aristic. Um but yeah similarly to to blindness uh you could use also some kind of mix in your things. Uh or you could before calling the model try to make sure you're not sending a blank image uh or a blank template uh inside the prompt um so the model doesn't elucinate. Uh and lastly in production like one of our biggest issue is the current state of things. Um every model provider have issue scaling these day. So API are done almost daily.

Um so basically you need in your agentic system to have you need to build them to support multiple provider multiple family of model. Uh you need to treat the code for every family of model as specific code for the model. uh because um yeah because every model uh behave differently uh and this allow you uh when entropic is done uh you can fall back to Germany or something like that. It allow you to keep your service live even if your API provider or your model provider is done.

Um and lastly um you need to build good evals uh because we v code or we use coding agent more and more and basically the only way you can control at scale the behavior of your agents to have good evals um and if you're looking for evals for document parsing uh we build uh passbench um which is open source um and is running as an official leaderboard on kegel and face Um and when agentic fail you need to have a fallback to something that is not using LLM. Uh for that we build light pass. It's also open source and it do around 500 page a second on on CPU. Um and basically you need to have a fallback for when the LLM will fall uh and when you need to do something without using a model. Um thank you. Um there thank you Pierre. All right, everyone. Just one more talk um between you and lunch.

For our final speaker of the morning session, we have Junu from Tusk who will be talking about how to elicit more secure and reliable behavior from agents through guardrails. All right. Hi everyone. I'm Jun. Um I'm a foundation at Tusk and today I'll be sharing about execution boundaries for coding agents. Now this is something familiar to every web developer. Um the classic SQL injection shape for a long time. Um for a long time this was how web apps got broken. Um a user controlled u string cross into the cross directly into a SQL interpreter. Um we didn't fix this by you know training developers to sanitize inputs harder. Uh we solved it with prepared statements right by moving this boundary into the driver. So SQL injection becomes structurally impossible. Now this is a dangerously skip permissions flag.

If you have used coding agents for any amount of real work uh you've probably seen this um it exists because permission prompts u well are protecting something real but um they also interrupt the flow of work. So I've tro through Twitter to see what people think about this flag or permission prompts in general. Um the top row represents uh some kind of prompt fatigue, right? people who haven't gone full yolo mode but are kind of frustrated that they have to approve every single tiny step. Right? The middle row is what happens next. People turn these prompts off. Um they run the skip permissions flag. They recommend that others do the same as well because they see this as the only usable workflow. The bottom row is the consequence, right?

people feel a little bit uneasy about what the agents can do or already have been burnt by you know sometimes agents just like deleting uh costly data or even the entire system. So this is the UX filler mode right here. Um prompt fatigue um turns into bypass bypass turns bypass turns into overreach with serious consequences. So I've pulled my own cursor transcript from the last six weeks u spanning 110 Asian sessions across um uh the the six weeks.

So in my data set the median session uh the the middle session had like 42 calls the average was 120 and for my longest sessions this was uh over a thousand right so the paradigm of asking the human every time um simply doesn't make sense it doesn't scale right as AI can take on bigger and bigger tasks um sessions get longer and longer and many of us will just skip permissions so what we're left with are agents with full access to our file systems our credentials our environment variables and secrets and so on. So that's not um very secure way of doing things. The industry knows this is broken. So earlier this year, entropic shipped auto mode for clock code. U basically this is a classifier that reveals each two call.

So two calls and actions that seem safe and reasonable gets through and gets executed for those that you know seem a bit suspicious and out of context gets blocked, right? So there's no human to look for like the routine stuff. And this is a great improvement but Entropic's own recommendation is to run it in isolated environments. Um and the reason for that matters. If you look at a math right suppose your classifier is 99% reliable on an average session of about 122 calls the probability that a classifier doesn't make a single mistake in the whole session um is 0. 99 to the 120th power or about 30%.

Now in my longest session of more than a thousand tool calls this is essentially zero right so of course a few caveats here errors are not independent uh they're sometimes correlated so don't take these percentages literally u here I mostly want to make um a point that per tool two call probabistic checks have a ceiling that degrades with session length so can we do better right so probabistic checks decays with scale deterministic boundaries hold that scale So that begs the question, what's the right boundary for code that you mostly trust but can't fully verify? And turns out agents are just the latest version of this question. Let's look at how we have solved this before. For SQL injection, as you know, I've introduced earlier, we use prepared statements and ORMS, not just relying on input sanitization.

For memory safety, we now have memory safe languages, not just writing careful C. U for network is dropping we use TLS not just trusting the network. The pattern here is to move enforcement below the layer where mistakes happen. Um and the kind of issues that we are seeing these days as agents get more and more personal and uh open-ended. I'm calling this agent overreach. Right? The interesting thing here is that there may or may not um be a malicious attacker, right? Unlike those um above. Sometimes agents just execute projection. They hallucinate. They get prom injected. they maybe they run in circles and decide to nuke the whole system. It doesn't matter which one. So what's the structural fix? So today I say stop asking the actor to behave change what the actor can do.

If running clock codeex or any terminal based agent you want something underneath right that enforces certain boundaries and let the agent run within those boundaries. Um and here's the thing we didn't build this just for task drift. uh we didn't we didn't build this for coding agents. We built this first for task drift.

uh task drift is our API test replay system where in CI hundreds or even thousands of production traces gets replayed against your app and when that happens we don't want any side effects right we want to guarantee that um there's no say for example a DB calls a live call that goes into a pro DB and affecting state right we we can't afford that to happen so we built a primitive um a deterministic OS level execution boundary with near zero overhead we open source this as fence so and forces the network file system and command policies that you configure. So you can think of fence as this boundary um that we want underneath all of them right one single one single policy vocabulary uh no matter which agent or app is driving the work. Now fence enforces three things uh file system, network and commands.

Files outside of policy are simply out of reach um to the agent. U network calls are forced through local filtering proxies and only allowed domains can be reached and commands are checked before execution. So this also includes uh chains and nested shells. And this is what a policy looks like, right? It's just one file with a path the agent can see um commands um domains you can reach and commands you can never run and that's it. There's no demon, no image, no container runtime. So here's a quick demo. Uh I think this is running a little fast but I can explain it. Um so what we have previously is like we have a we have a fence config that basically blocks out um this this directory, right? um in in this repo.

Um so uh we have some scripts as well um that try to access those m files uh and the directory in the home directory that we blocked uh in the fence config. And so when we run these scripts we we couldn't um when we run the scripts outside of fence this works right or we we there's also another script that you know makes outbound requests um to to an endpoint. Um but in our fans config this is you know we we don't have uh we didn't set any allowed domains. So you know this uh under fans this will fail.

So basically um this demo illustrates that when it tries to run those scripts um something fails and now I'm just asking it to know like just update the readm of today's date just make a simple file change um it does that but now um when it you know tries to um create a commit and push the commit to remote this fails because um in our fence config we have um added the get push um as a denied command. So this is in a nutshell how fence works. All right, let's wrap things up. Um I think about this as the spec sheets model for secure agent execution. So the first layer, okay, so on the left side we have um commands that the agent wants to run. Most of these commands are, you know, safe and reasonable and routine, right? But some of these commands arise could arise due to jailbreaks uh prom injections overeager agents and so on.

So we want to filter out this destructive commands before you run them through these three layers. The first layer is classification. So this is for example like auto mode. Um this asks is this action reasonable? Now this is probabistic uh as we have seen earlier but it can better understand nuance and context. Second layer is policy and the enforcement of this policy. So this is where fence will sit. Um it's asking is this action allowed right? So if something slips through the cracks for the first layer um as long as it's denied um in a fence config um the the action will get denied will get blocked. Last layer is isolation. So here we have containers and microVMs um basically asking what can this process touch if things go wrong.

So like for example for hostile code or multi-tenant workloads um yeah so that's where containers and microVMs matter when you want to really um increase the distance between a host machine and the Asian workload. Now none of these layers is perfect and the point is to line them up to stack them up uh such that they their holes don't don't line up right so we can achieve defense in depth and most teams already have one of these layers right if you're using cloud code you have probably been on auto mode if you're are security conscious you might already run agents in containers or cloud sandboxes u but what I want more of us to consider is a middle layer defining the boundaries of what your agent can and cannot So stop asking the actor to behave. Let's change what the actor can do.

Define the rules and force them at the OS and let the agent run. Thank you. >> All right. Thank you so much, Chingi. And that is going to be the conclusion of our morning session. So, right now we're going to have a 1-hour lunch break um and come back here at 1:40 p. m. And um you do not want to miss the next one because it's a very special person who I've known for over a decade named Sarah Hooker. Uh she's actually she was actually on Time 100 most influential people in AI the same year alongside Sam Alman and others. And she's currently the CEO co-founder of Adaption Labs building basically the next models around adaptive intelligence. So we'll see you soon. All right, enjoy lunch. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Heat. Heat. N. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, come on. Heat. Heat. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Heat. Heat. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Heat. Hey, Heat. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Heat. Heat. N. Hey, hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey, hey. Hey, hey, hey, hey, hey, Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Heat. Heat. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Heat. Heat. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey there. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Thanks, Stages. He clearly enjoyed it. Um, so while our next speaker is setting up, I would like to just introduce her. This is Sarah Hooker. She is the CEO and co-founder of Adaption. Um, but some of you may not know that I've actually known Sarah for over a decade now. Um, we actually used to do a lot of uh NGO projects with analytics and I've always been a big fan of her. So I just like saw the opportunity to bring her over to Singapore to talk about something really really interesting in this room and I could not be more excited to have her. So, give it up for Sarah Hooker. >> There we go. Okay. I think we're somewhat >> Hello everyone.

It's super lovely to be here. So, um I'm gonna ask for everyone to stand up. Amazing. Yes. Everyone to stand up and I want to ask you to now stretch upwards to the right to the left and give a high five to the person next to you. Amazing. And now you can sit down. Uh I know that this is uh actually very special because this is day three of the conference and it's just after uh I think many talks but I feel very honored to be here. So this is super special to be able to share with you what I consider a very grumpy problem. So typically what drives most frontier research I think is a feeling that you're very grumpy about something and something has to change. So today I'm going to be talking about why the future is adaptable.

To do that, I want to start by uh kind of what I normally and where should I be pointing this over here or to change the slides. Should it just click? Maybe I'll Oh, I mean I I can do that, too. I'll I'll do because my pace Yeah. So, I'll stand here. I won't walk as much. Okay. Amazing. So typically when I'm doing new slides, I like to wait until the very last minute because I'm one of those people. I like to think about what are the ideas and what am I thinking about right now. So um over the last 48 hours, this has been my life. I got a reminder that this talk was I'm actually giving four talks while I'm in uh Singapore and I decided I have a 17-hour flight. I'll do it over the flight, which was very productive to do. So I said, "Hey, why don't I just try to start with asking chat GPT to give me a slide?

" So I said, "I need an opening slide that speaks to why we need adaptive intelligence. " The result was quite interesting. I got this back. It's very bombastic. It has a lot of flare. You can see there's a lizard there. It kind of evokes Charles Darwin evolution. Um, and so I said, "Okay, interesting. Not my usual style. Let me ask for it to introduce me. And this for reference is my normal introduction slide. So I was at Google DeepMind for a long time. I led Coher Labs. A lot of my uh career has been doing publications and doing research at the edge of what's possible. Uh I've considered myself very lucky to be at industry labs that have produced some of the best frontier models at the world. Um but I think that was reduced to this. So there's only one little problem. Maybe it's notable to some of you.

Um and I think that this is pretty much ex an example of what people feel when sometimes they're using AI. So to fix this, I guess I could have given thumbs up, thumbs down. Um and maybe somewhere some researcher would get it a few months later and make a difference. Or I could become an elevated prompt engineer. So I could get very good at just creating the exact specifications of what I want. And I think this is pretty much the state of AI. So most of my career as a computer scientist, you build the biggest model and you give it as many capabilities. You try and guess what is it going to be used for and then you ship it the same model to as many people in the world. But this I think most people understand has two issues. One, it means that everyone has to do acrobatics around the model and try and make it work for them.

And then the second, it's also very inefficient. We spend the same amount of compute on all the different problems. And I would say that that's really the cost of static intelligence. So we have built these very powerful models, but they don't continue to evolve. You have endless retraining and then you are one-sizefits-all. So today I said this is going to be a grumpy talk. I'm going to talk about well how do we get here? Why this is the moment in which we should really start understanding where why do we need to scale and also is the future monolithic? And then I'm going to talk about adaption and some things that we're excited about. So I think this will be fun and you know I think that also I'll ask at the end if I've convinced you. So how do we get here? So how do we get to these big models that ship the same way to everyone?

Well I think that for most of my career and actually most of my experience in big labs, it's been a bigger or better. You basically every year double quadruple the size of your model and it's worked very well. Uh this is captured by Rich Sutton who's a famous computer scientist. He won the touring as the bitter lesson. And actually the bitter lesson is kind of a punch to the ego of every researcher out there. It basically says you might think and you might be attached to your beautiful idea but your beautiful idea only matters if it can be scaled. And it is interesting because the first question I'm going to pose today is is sudden right? Is the only ingredient for AI progress scaling model size? Put up your hands. Nice. I have a double cross. A double no from Eugene in the second row. Excellent. Who thinks he's right though?

Put up your hands. Bravo. Some brave souls. Excellent. Great. Yes. I mean, he won a touring. There must be something right about what he's saying, right? So, who thinks he's right? Excellent. We have a few more braceles. And in fact, I think there's many reasons to say, hey, the evidence supports that he's right. Because if you look at it, our whole ecosystem has reconstructed around this belief. We have jokes about GPU rich and poor. We have Michael Jordan, the scientist, not the basketball player, who says, "I can't think without holding a piece of metal.

" We have basically researchers like me who would traditionally be belong in academia going to industry labs and being given a lot of resources and a lot of money because there's just been this shift and capital influx to work on these ideas because the belief is you need compute and it's determined who doesn't get to participate and who does. It's also a national priority to acquire compute and also it's widely favored. So it's seen as less risky than doing something with an algorithm. It fits in, it's very handy. It fits into quarterly planning cycles. So, it's easy to justify and people even raise based on amount of compute. So, it's very hard to turn around afterwards and say, "No, we don't need compute after all. " And what this means is that it's actually led to a concentration of power.

So, this question is actually very important to ask because it's determined so much. I've put provider company A BC, but if I pulled, you'd probably all come up with the same names, right? And so really it's just meant we have less choices. So is sudden right? It is very controversial still to suggest that scaling is over. But I'll show and I'll I'll illustrate why in fact I think the relationship between model size and performance is far from certain now. And in fact all bets are off. I would argue. So we now see that AI models of the same size have become steadily more performant over time. And so you can get and squeeze a lot more out of the same size. But more convincingly, we now see small models outperform much larger. So the best small model is much better than much larger models. So size isn't everything.

We see and we known for a while that there are severe redundancies between weights. So if size were all you need, why are so many weights doing the exact same thing? Why can you predict from a handful of weights what a deep node network does? And if size is everything, how come you can remove most weights after training, how come you can sponsify and remove 95%. All this suggests that while size is important for optimization, in the reality, it means that we're just not good at training better, more performant, smaller models. High quality data drastically reduces the the need for scale. But more importantly, most of what we gain when we scale is a longtail. So when you double or triple the size your model, you just learn rare artifacts. That's a very expensive way to learn rare artifacts. So even if we can scale, we're paying a lot more.

We actually see this candidly in latest models, it doesn't pay to scale anymore. The latest efforts by Frontier Labs to triple quadruple the size of their models have been seen as not servable and frankly kind of disappointing because they only improve performance on a small edge. So I would say we're hitting the limits of transformers. Transformers were the breakthrough, but they're also saturated. So I would say and this is where you know it's quite fun the rate of return no longer makes sense for scaling and in fact the rate of return is all that matters. In fact what's fun is that the rate of return for other parts of compute is actually way better.

So post- training, alignment, dynamics of data synthesis, um adaptive compute, co-design of hardware and this means that the idea of a few handful providers controlling so much of the dynamics of who gets to provide is very different. The new era of intelligence will require much more than brute force scaling. And I think there's a few ideas here which are very important. One is adaptive compute. The other is interaction now matters. How does your model interact with the world? So the first time computer scientists have to care about interface and then the third is you need to continuously learn because you're doing much more long horizon tasks. So where are we? I would say we're in the era of adaption.

And I say this because what matters more is how you leverage capacity and it matters more how you learn from your from your real world environment. And this is very different because most of our time as a computer science field has been around the idea that you obsessed with a model and that from the 1950s till now we were focused on how do you build the best model but in fact our optimization spaces in this era where you can't just scale model are all about how do you adapt the whole stack from data all the way through interface and the notion of a system and how it interacts with the world is critical. Our goal is to build intelligence that continuously evolves and we see this entire stack as being important from data through interface.

The whole thing should change based on the type of task you have and it should be incredibly efficient and this is kind of a fundamental shift if you think about it. We're going from the weights and the name of models being everything to in fact like a very fluid stack. So I'll share a little bit uh of what we're excited about and like what's interesting and then I'm super happy to talk afterwards. So one thing is that you know our first pillar is adaptive data and we believe that's important because you can optimize on the fly towards whatever part of the data distribution you care about. Um we're four months in we shared this um a few weeks ago and I think that the goal is to make available what's typically within frontier labs. Most innovation now that scale is over even in pre-training is data innovation.

How do you do really powerful leverage of synthetic data? Uh we also think that this is pretty profound because of the first time data is cheap enough where you can optimize in the data space towards any objective you want and so people should be leveraging and making their data visible to AI. What's very cool is uh it's been super fun to see how people have responded. So we released it four weeks ago. We covered 242 languages and we processed like 27 million data points already which is crazy. I think part of it is that we're very fast. So you can basically turn around and make your data fully visible to AI within a day. Um and uh our next pillar is just as exciting. So since we see the full stack is important to be adaptable, the next is continuous intelligence.

Um what we released this week was I think you know time blurs with time zone difference but I think it was two days ago we released autoscientists. So this was about how do you co-optimize and automate the learning of training because this is one of the biggest blockers to having adaptable AI. Um and order scientist self-improves and automatically learns what's how to optimize the data and the model to whatever task you want. But what's cool about it is it's very fast. So you basically can train a frontier model in two days which is pretty absurd. Um we actually did a cheeky experiment. We asked like can this beat our AI research staff? um and it did much better. I attribute this in part because most AI research staff are trained within a specific frontier lab on a specific family of models.

But we actually tested this across every single available model on together AI which is an inference provider. So there's like 30 different models and there researchers really struggle to figure out automatically how to configure for different architectures and how to co-optimize with the data. So this is pretty cool. Um and it's very predictable games. Why I say this is I I actually think looking forward the idea is you should be able to automate your entire stack. The vision of really adaptability is efficiency. Adaptability um it's pretty crucial that eventually adaption is real time to whatever task you have. And the more friction you have about that adaption, the more people return to just being prompt engineers.

So for us, efficiency is the primary and obsession of like how we think about um making it meaningful that people have more alternatives to just a monolithic AI. Um so this is this is really fun. I think a lot of our research staff has spent a long time working on this. The only thing I'll I'll say is I think another crucial aspect to adaption is um it should be global first from day one. So we cover 242 languages and we're most interested in TASA non-verifiable. I think most of the world is actually non-verifiable.

there's a very small fraction of of of tasks that are and so this is what matters now this is what will be decided in terms of who can make progress is who is able to leverage those tasks and make it more meaningful so what is the way forward what are my parting thoughts so where do we end up I've hopefully convinced you that this is not the finish line that I should not have to be a master prompt engineer to get things that I want and are relevant to me and um I may have convinced you that we're at the end of scaling and that at least like just doubling the size of your model doesn't work anymore which means that it's fun. It's the era of innovation again.

But regardless of whether I've convinced you of that like probably I've convinced you somewhat that it's very expensive to scale right and that the returns are probably not worth it for most people here even if you want to own your own AI. So for me, what matters most is who makes that cost of adaption the most efficient. And for us, that's the sole thing we're obsessed with is like how do we make it possible for any builder to adapt real time to whatever task they have. So I think it's one of the most profound problems that we can work on and I'm super happy to talk about it afterwards with whoever is interested. Um and I think I'll leave it there. So um I think that I'll also just share we're offering order scientists for free for the next month. So, I think the proof is in the pudding. Just try it for yourselves and you're welcome.

I would love to be back along the way. So, thank you so much. Uh, and I really it's a real privilege to be here. Thank you. >> Oh, thank you so much Sarah. That was a great talk. Um, next we have Vincent from the Miniax platform engineering team. We spent the last day talking a lot about agents building agents, but what happens if you let agents autonomously schedule autonomously schedule the amount of compute and resources they need? Going one level above. So, we'll be sharing a lot about that. >> All right. Hey guys, uh my name is Vincent Lou. I'm a product engineer on our API platform team. And today I'm going to talk about agents that manage their own compute. So the first thing is sorry next slide. Oh, it's good. We're good now. Yeah.

So um compute is everybody knows that compute is uh undergoing a big it's like one of the biggest uh commodities of the next century and uh we're not using it very efficiently now. So the best way to see this is that I'm sure you guys know um certain inference providers are um are blocking thirdparty harnesses from using their uh inference. And you know, part of it might just be about competition, but really the main thing is that um compute is very uh request dependent and that different types of requests, different types of workloads have different strains on uh your compute. So for example, in particular, different types of input tokens uh and and input and output tokens, your token profile as we like to call it, have a major effect on how well inference providers can utilize their compute.

Uh so there was a recent podcast on Dwar Cash uh did with Riner Pope and he basically talks about the specifics of how uh inference workloads depend heavily on your token profile and so this is the reason why it's going to make sense for agents to uh manage their own compute. Basically, if we can know if as an inference provider, if we can know uh a session's token profile beforehand at priori, then we can serve requests a lot better and we'll be able to essentially maximize our um fleet utilization and to serve more requests to more people uh with less failure.

And now you know this kind of this kind of uh demand is a bit too much for humans to handle because if you imagine you know you're using codecs or cloud whatever and you before every session you need to tell the infant provider like exactly what kind of workload you're doing how long you're going to do it for your token distribution. I mean I don't even care about my token distribution. So this is too much to ask for humans but it might actually be quite reasonable to ask agents autonomous agents to do this. And this is more of an observation, but agents are owning increasingly more of the harness. So from context management to tools used to be hard-coded stuff by um engineers, but now agents are basically uh managing this these kinds of resources on the fly.

But one thing that agents are not actually managing is their compute and their intelligence. So basically we don't really give agents the ability to select uh first of all their models. Although there actually we're we're seeing you know ways for agents to switch their brains when they want to. But more importantly just their compute like when they want to actually do the work and do perform the inference. And so um this didn't make sense before uh autonomous longrunning agents because when you're just pair programming with a human there's not much to schedule. Basically when a human is talking to the agent and their programming then you just want that inference right now. You want the work to be done currently. So there's no there's not much scheduling to be done and is really just like greedy best effort.

But as agents become more autonomous as we you know hand them background task and have them do things in the background then there's actually a lot of room to maneuver around scheduling your compute. So for example if I give my agent like a deadline I want something done by the end of the week and I just give them a goal and a budget right? So with those constraints in mind, the agent has um there's a lot of things that the agent can do to basically spread out the different types of work that it might need to do at different time intervals when compute is available. So a quick example would be for um you know let's say your agent is just building an entire application. Well maybe for the first planning phase it doesn't need to hop into it immediately. It can wait for planning.

It can first of all select a really good planning model that might not be good implementation and then have that model do the planning maybe like at midnight when when the inference costs are lowest or when there's a high success rate and then later on you know maybe towards the end of the project it needs to do quality assurance and needs to like review its application there you might need to switch to like a V a really strong VLM guey model and have it do um low latency work to actually test the application in real time. So already you can see how like for different workloads you really have very different uh token profiles and request profiles for that particular kind of workload which might be suited to very different uh compute clusters.

And so this is a recent um blog post on strat uh written by Ben Thompson and he basically he's um making this point uh by separating answer inference from agentic inference. Now answer inference is um the stuff that currently most people care about. It's when you go into your coding agent and you just you're pair programming with agent. You want to see the outputs come out faster. You want it to think faster. You want like real-time latency. That's answer inference. Uh but agentic inference is different in that actually for agentic inference latency doesn't really matter as much. Uh because like I said earlier, you're really just handing off a goal in a compute budget or in a budget like a dollar budget. And then the agent can sort of optimize around your budget and your goal depending on the resources um at hand.

And also I should point out that um there's a sense in which answer in imperence is actually a part of agentic inference because you could easily imagine how sometimes the model the the agent would still want to have low latency work done during its background period because for example the example I mentioned earlier about um like a gooey review of the application at the end because you want real-time latency there even though nobody's watching. So in the limit um we expect something like an inference exchange start happening where all these background agents you know they're running out in the wild and um before their workloads they basically submit their session information to uh the inference exchange. So the most importantly would be model used and then token profile.

So your um your range of the number of cache input tokens uncashed input tokens and output tokens and uh some other metadata along that line. And so then the exchange would match your session, the agent session to the most optimized batch on most optimized node for that kind of workload. Uh in order to you know basically find uh the the the comput the hardware that is most best suited and best configured to serve that workload at that time. Now the good thing about this is that you know in the just like any kind of market mechanism inference exchanges are going to be able to uh turn underused compute capacity into user and provider surplus.

uh because assuming optimal matching then we're using we're using basically the uh the best we're making the best use of the world's compute of any inference provider and every GPU you know their MFU is going to be maxed out because um they're going the workload particularly running on that cluster is going to be optimized for the configuration of that cluster um and then also uh fleet utilization in terms of like different time periods so right now we providers are seeing this thing where like for example during the afternoon they're overloaded because everybody's using their agents at that time but then like during midnight you know it's their their their GPUs are underused and that's not good for providers because they want their GPUs to be running all the time.

Um and so with this kind of inference exchange and with agents autonomously managing their own compute we can have much better matching and basically smooth out the peak and off peak hours. So overall what this does for the inference providers is higher throughput per second. So the the the throughput of your entire system is going to go is going to become more optimal and that's good for the inference providers because that's how they make money. The more tokens they can serve the more money they can uh the more revenue they bring in.

But this is also good for consumers because uh again as I said in the beginning right now consumers we're facing a lot of issues where uh our requests are simply just getting like rate limited or um they're just not they're just not getting served well by the provider and that's because they're not using their GPUs to the maximum to the most optimal way. And so for consumers what we're going to see is just uh better request handling overall.

And also there's going to be a cost thing as well because you can imagine how providers might um decrease the cost for like off peak hours uh so that agents can like uh are incentivized to go use that kind of compute for a lower cost and we already see this for service for example I think many providers have different levels of service low latency high latency batch which have different pricings and so finally this is kind of a plug for um our MMX CLI. So this CLI is not for humans to use. This is really a way for agents to autonomously call our model APIs uh because we have a range of models, you know, from speech to image to videogen to of course our LMS. And so for now, this is really just a way for uh agents to you know uh effectively call our model endpoints.

But in the future, we intend to build this out to be um to basically cater to what I said earlier about having agents manage their own compute more end to end and more sophistic in a more sophisticated manner. So maybe they decide to run a bunch of video workloads at different periods of the day and uh to save to save money and then to maximize compute. Oh, and that's it. Thanks. Oh, >> cool. Thank you so much. >> Really appreciate that was a great talk. Uh, next we have uh Sid and Daniel who'll be introducing their company, the robot company. We've been talking a lot about agents, deploying them, coding agents, but what does it take to take an agent and deploy in the real world? And so they'll be looking at how to deploy teleoperated robots in a physical environment. Hi. Hi. Hi. Hi. Is this Oh, it's working. Good afternoon.

My name is Daniel. Uh, that's Sad. We are from the robot company. We deploy teleyoperated robots today for autonomy tomorrow. There you go. What you see in front of you over here are teleyoperated robots deployed in an insect farm in Cambridge in the UK. So you see the little box of like little squiggly things over there. Those are black crickets that are used to feed geckos and reptiles. Uh you can imagine that not many humans like to work in this environment which is why it's a pretty good use case for robots. I've spent the past year deploying robots in the UK. So apart from insect farms also laundry facilities, food preparation uh and hospitality settings. So we focus on deploying teleyoperated robots. And right now you might ask Daniel why deploy teleyoperated robots.

If you know, you know a recent a prominent researcher, sorry, my clicker, a prominent researcher recently mentioned that teley operation as a means for data collection is dead. And there are a lot of merits to this argument. Firstly, and I have firsthand experience of this, tell operation scales linearly. Tele operation scales one to one, right? one human controlling one robot much like this. The other thing is that oper operator training is actually really difficult. I've trained about 100 operators um onboarded them only about 30 to 40% have actually passed onboarding and it's really difficult to scale that. Another bit is with teley operation you get all the technical limitations of hardware latency and all those problems. And then the second piece of what we're doing, deployment is incredibly hard.

You get new environments, which means new lighting, new tables, new dimensions, and of course, new customer demands. You get bugs. In our case, we get actual bugs because of the insect farm. But we also get bunch of software bugs uh and malfunctions. And with any hardware, things break. So why deploy teleoperated robots? Before I get into our thesis, let me quickly go through how models have scaled and what that means for us. So very quickly, models have scaled firstly with pre-training. So large amount of data, generalized intelligence, broad but unrefined. Then supervised fine-tuning to get the data trained on the model. So the model has task specific specializations. Then a huge unlock ROHF reinforcement learning with human feedback.

Humans provide the golden truth answer and therefore the model gives really useful and good output and all of this is underpinned of course by high quality data. In the robot world high quality data or data more generally generally falls under four buckets. If I point to you to the yaxis and x axis the y- axis is scalability and the scalability is generally inversely correlated with data quality and hardware alignment.

So on the left simulation data everything runs in simulation software no physical uh world no physical robot uh there's a bit of a sim to real gap then you get egocentric data essentially a camera place on a eye level that is pretty scalable as well because not super complicated to do that but generally the data might not map directly to robot actuators and servos so data quality is not super high you get wearables which is popularized by Umei the universal uh manipulation interf That is pretty useful because you get joint positions or any factor positions and then you can do some physics and math to ensure that that maps onto a robot. So decent data quality and also decently scalable. And then on the other end of the spectrum is teley operations.

Teley operation very high quality data because the actual robot is in the field collecting data uh but not scalable because onetoone and also bring a robot everywhere is kind of tricky. Now, understanding models and I say understanding data. How do we get to useful deployment? How do we get to useful deployment and useful work in the LLM space? What that looked like, and I'm being extremely reductive here, uh looked like an API call, right? Obviously, there's a lot more underneath that, but think about robots. Deployment is a lot harder and a lot tougher. How people have approached the problem robotics kind of looks like this.

the LM approach take data take compute and throw it in throw at a problem pre-train and SFT and that has had really really good results like recent models have shown really wonderful promising results in the lab often rely relying on simulation data ecoentric data often with some world models involved uh and that that has had a lot of you know high quality evals in the lab but how do we achieve and fix autonomy gap not just in the lab but in the real world. Our thesis is that we want to deploy robots in commercial settings and that does two things. Firstly, when you deploy a teleyoperated robot, you actually get real useful work done for customers, right? So, in this case, folding a t-shirt. But this process also does something extremely useful that it collects very valuable data based on the work done by the robot.

As we've learned from LLMs and self-driving, the most valuable data sets are byproduct of real useful work done. So that brings us to kind of step one. Actually, Chenise right here was supposed to give me a bottle of water, but deployment's hard and that didn't really work today. But what I wanted to say was we were basically trying to uh we start every deployment by putting a teley operated robot into real scenarios. So you can see the guys out here folding clothes and you can also see Daniel doing a live demonstration of what that looks like. And above that what you get is we layer it on top with your you know pre-trained models that you might be already familiar with. Think PI 0. 5 Groot some of the models that Daniel already shared about.

And that data that you get is essentially the highest quality embodiment data that you can get, right? And because the morphology matches um the environment matches and the task also matches, what you end up getting is a very good base foundation data set for you to actually deploy commercially viable uh for you to deploy commercially viable robots. And you have to remember this, all of this is just the starting point, right? The real work begins once you start getting into fine-tuning. I think step two is the part where everyone in this room already knows how to do. Um you can take teleop data, you supervise fine-tune it on some of the models that you already know about, right? And you can kind of achieve about 80% autonomy and we all know what 80% autonomy looks like. We've seen these on Twitter on many social platforms.

um what you end up getting is a really beautiful video with some hype and you know that works well when you want to garner attention but once you start getting into the real world and I'm sure there's a lot of enterprise uh folks here um 80% just doesn't cut it for production when you start getting 80% when you hear 80% in EVEL and we start getting to production you know what that really means for the customer that means that one in every five clothes falls on the floor of the customer site when they're trying to fold it, right? And that just doesn't cut it. So, what you have now is really a gap that doesn't ship, right? And this gap is called the autonomy gap. You can kind of see figured they did a recent demonstration, a live stream actually of their robot kind of sorting packages.

It was very impressive was doing it for eight hours, but they ran into issues too. And we believe a very specific mechanism, human intervention, real time could solve this problem at scale. So that brings us to step three, telly operation plus human intervention. There is a terminology for this and it's called teley supervision. And teley supervision basically involves the idea of someone intervening when the robot makes a mistake. You make fine corrections and then you just let the robot do its thing and you keep iterating every time it makes a mistake. And how do you address the telly operation ceiling that we now have when you want to do this telly supervision? Well, we can start by scaling from one to one to one to one is to many. And this isn't new. The self-driving world has been doing this for a while.

Whimo has um you know examples of of of teley supervision and we believe the same could extend to robotics. And the other side is remote teley operation. We have a working stack that unlocks crossber low latency telly operation. This is an example of us doing a demonstration from Singapore to London. You can now extrapolate. You could do Singapore to the US, India to Singapore, China to Singapore. All under 100 milliseconds on our stack. Now for enterprises, this is key because deployment is hard, but it's very necessary. The long tale of robotics lives in the real world. And that 80% is the cliff's edge. So what we're trying to say is that telly operation used as a deployment layer combined with the menial unpleasant Saikong like work that you need to do right is what makes successful deployments.

And the way you need to do this is that you have to think differently. An enterprise cannot think like a research lab. In fact, you have to think radically differently. And you need to start with telly operation as your your your fundamental starting point. And then you start collecting rich data and then you start deploying commercially viable models and robots. And that brings us to the end. So that's what Daniel and I do at the robot company. We deploy robots that do real work today as we build the data engine for autonomous robotics tomorrow. So if you want to learn more about us, you can find us at the robot company. ai. Thank you. That was an amazing demonstration and I think uh you know it's just a testament to how complex it is to deploy robots in the wild.

So we talked about how we can tell operate robots you know have people actually help but what happens if we bypass that and go directly to the brain instead and so in this specific portion we'll be talking about Justin bar will be sharing about how you can do that with BCI brain computer interface hello everyone just getting this started. Um, thanks for joining today. We have uh another interesting robotics uh experiment to show you. So, we're getting there in a moment. But while they're connecting, I'll just get started. We've got lots of things to show you in the next 10 minutes. So, uh uh get ready. Um but thanks again um for making this happen here in Singapore. I mean, AI. jer coming to Singapore is amazing and having um Agram and Sherry uh bring this all together with the 65 Labs team is great. You want to just hit play?

Uh that one that you just minimized. Are you guys getting this or no? >> Hold on. extended. >> Yeah, extended. It is extended. It's extended. >> Now you get it right. >> Okay. 3, two, one. All right. Thank you, everyone. So, as part of Tessact, we've built a system. We call this, it's called Tessact. art. Um, and what we've done with this is we've built a system that allows people to express themselves through AI. And this started out by having live music performance and turning that live music performance into a painting. Um, but from that we've kind of moved this uh much further along. And so I'd like to call out um Kaiing. Kaing, would you like to come out with us and we're going to start rolling out some equipment? Thanks. Um, everyone, I just want to introduce Kai Ming.

Um, we've done some quite interesting and special things together for this. Thank you. And so for so for the last two years please everyone if you can roll out you guys going to roll out everyone. Sorry we got lots of things rolling out guys. Sorry can you guys help roll out? Thanks. Okay sorry this is quite difficult to do in like a 10-minute presentation when we have like a full robot system and painting and all this other kind of stuff. So please bear with us one second while this is happening. But um as you'll see, we're bringing out um a system that we call tessoract. org. And what Tessa is, the robot arm, Tessa, the robot arm, we've been developing this over the past two to three years, uh with a couple collaborators, um my collaborator, Dr.

Richard Savory and I started this about three years ago and we wanted to build a system that would allow us to be able to use robotics along with multimodal AI to be able to take one let's say creative form and turn it into another and that's where we started with this in terms of um bringing music together and what we're doing with this is really taking human imagination and extending it through intelligent systems and that's the intention of what we've done here today. Now, what we also have on uh stage, we've got Jackie also here from >> Mind Interface Company, >> and we have Ivy, who is also attending here with us from Tessact. And Ivy, I might just ask you to come uh up and help.

And so with Kaiing, what we've done, what you're seeing here live on stage is the very first painting that um Kaiming has painted using brain control in her face. So, it might be hard for you guys to see in the back, but she's actually wearing um a head a headband that comes across the front. Um it's a Muse if if anybody in the audience knows the Muse headband. But what's so fantastic and amazing about this is that this technology is now to the point where it doesn't take two hours of putting on a headset and like all this expensive equipment. We can literally put this on and Kaiming can just think about what she wants to do in terms of the control interface and actually make things happen with the painting. Um, so Kaiming, I'd love to hand the well ask you a couple questions.

Um, maybe you can just tell us a little bit about how we got here today. >> Okay. Um, hi. So, I'm Kaiming. Um, I have a condition called Alist Syndrome. So, I'm part of the Red Disorders uh, Society of Singapore, which Justin has been working with. Um, so I'm an AI policy researcher and yeah, that's how we met. >> Yeah. And so, um, you've done some artwork in the past, um, and, um, what we're now able to do is bring, let's say, some of your creativity back, um, through this process of using AI and our multi multimodal systems. So, what we planned to do was we've been painting this painting. Maybe you can tell us a little bit about this painting. >> Do you want to move forward? >> Can you hold it? >> Yep. Yeah, that's fine. Thanks. So yeah, I've been painting since I was a kid with my granddad and my sister who are both artists as well.

Um, and it's something that's really connected me with the world. Um, and my condition has kind of made me lose a lot of my dexterity in my hands. And so I wasn't able to even write anymore and I still kind of can't. And so I wasn't able to paint anymore. And I went into anthropology hoping to kind of live vicariously through it. And that's how I ended up in AI policy. But you know, it's I grieved my hands. I grieved my passion. And to suddenly have this outlet, it's just amazing that it's it's kind of been brought back to life. >> Awesome. Thank you. Thanks. Yeah. Okay. And so now for the moment we've been all we waiting for is we're actually going to see if we can get because literally we brought this we this whole thing has come together over the past month. So we're going to have um uh Kiming try to finish one of the final lines.

So with this painting this is of Hope the sloth from the RDSS. Did you want to speak about that? >> Oh yeah. >> So Hope is a two a sloth that was born with only two fingers. He lives at the Singapore Zoo and he's kind of like us. We take life kind of slow and steady. And um this is a painting of uh hope this love code around a little finger. And um the two colors that you'll see on the heart and the wings that's the parents that we uh you know who support us uh red is order kids and um yeah. >> Yeah. So there's hopeless sloth. So let's try it. Ready? Great. So, uh maybe you can tell us the concept was here for there to be a heart that surrounds Yeah. >> Yeah.

And so the heart it's like you know one stroke is the dad and one stroke is the mom because you know we often forget like how much the parents in our community uh support the our patients with rare diseases and they do so much. It's just incredible and you know I really want to thank thank Justin and his team for giving this back to us this kind of freedom and liberty to do what we want to do with our lives. Yes, thank you. Thanks so much.

And I'm I'm actually quite surprised to know that we have three minutes left to actually finish our conversation which is great because anyhow so I think one of the things that's really become an inspiration from this and what I think was the important message that I wanted to leave you know everyone who's seeing this really for the first time happening is um we started this process thinking about using um AI to give people creative superpowers, right? We want to not have AI take creativity away. We want AI to give people superpowers, AI superpowers, creative and fun things. And we did that starting from music. And what we've done now is pivoted towards the brain control interface and being able to make this wireless system happen. But you can ask like creativity, it's great. It's a part of self-expression.

It makes the things that um you know it's a very human thing to be able to express yourself and to have this form of communication. But what's even more inspiring and I want to just show one thing um as well. What's even more inspiring is what could we do with this type of technology? Um we talk about AI taking people's jobs away. Um, what I see with this and our collaboration together is we're giving new opportunities for employment to people that perhaps have not been able to be employed because of maybe having the disability or not being able to um be as mobile as others. So now imagine that as this technology develops, this is literally just the beginning of making these things happen.

we could see employment um becoming something because we need AI as uh you know AI in most of these systems today need a human in the loop right so if you can think about AI being something that is providing the opportunity to do something you know let's say it's a dark factory it's all automated but there needs to be people to supervise it there needs to be people to do some of the work and just today literally through this process um I found out about um a very special um place that um sorry that that built a very special place in Japan uh where this is already happening. I literally just found about today. So this is an avatar um an avatar robot cafe. But what's interesting in this case is that the robots are um are fully managing things, but there are people making these robots work. And the robots are serving customers.

the robots are serving customers, but they're actually um being uh they're employing people that are not able to leave their beds perhaps or leave their home um to actually have gainful employment. And so I think this is a great perfect use case example of um the opportunity that uh could present itself um with this type of technology as this progresses in the future. So I think that's a really amazing and inspiring um opportunity to think about how AI is going to completely open up a new opportunity, a new work workforce um for people that might not have been able to be employed in the past. Anyhow, so thank you very much for having us. Really uh been amazing to be a part of AI engineer. Um thank you um Jackie um for making the brain interface happen. Um and any closing words?

I think we all needed a break from all the fear and money chasing with something a little bit positive. >> Thanks very much. Thank you everyone. Thank you AI engineer. Appreciate it. Thanks. >> Do we want to help you go off the stage this way? Make sure you talk about All right, that was an amazing presentation. I think especially in the doom and gloom of AI like that offers like so much hope. So, we've looked at how you can use BCI and for the next presentation, we're going to have Arvin from Bifrost where they build synthetic walls to train models. They've been working with some of the largest robotics companies in the world, helping them do things all the way to landing robots on Mars. They're backed by Seoia and also the CIA secret venture fund.

One really cool fact is that the previous robot company and Bifrost are both Singaporeans companies that were started, incubated, and really born in Singapore. And with that, really excited to have Arvin take the stage. Awesome. Sadly, I do not have any cool robot demos for you guys, but that was pretty awesome. Um, hey guys. I'm Arvin, CTO and co-founder over at Bifrost. And today I'll be sharing a little bit about the state of robotics, right? I'm sure you guys would have seen a whole bunch of cool videos online of, you know, robots dancing at like Chinese New Year, doing back flips and all these kinds of cool stuff. But on the other hand, you also see robots doing a lot of weird clunky things where they're running into mirrors and just causing a lot of havoc, right?

And sadly, this is what we consider the the robotics development gap, right? Essentially, what's happening is you're getting really really good performance in the lab, right? It can do all these crazy things, but when you actually deploy them into the real world, what you find is that the performance of these models drop very very severely, right? So why exactly does this deployment gap actually exist right? So what you guys seeing on screen, I promise there won't be a lot of graphs today, but there like two graphs. This is the first one. Uh what you guys are seeing on the x-axis is just all the different types of scenarios, right? And this is just, you know, your training data, your testing data, and like your deployment data. And on the y-axis is just like the number of scenarios in your training data, right?

So when you go out, you know, you collect a whole bunch of training data, this is typically like what a distribution would look like. Uh, of course, this is simplified. And then you have your test distribution, right? So you have a training data set, you have your test data set, there's some overlap, but also some parts where they don't overlap. And then when you actually deploy your robot, what you find is like the types of environments and all the different types of conditions that it actually encounters in the real world, it's actually very different from the things that happen in the lab. In the lab, everything is very clean, very organized, but in the real world, there's so much dynamic chaos. There's like people walking into the scene, there's reflection from mirrors, there's glare in the camera.

All these are what we consider out of distribution scenarios and this is where robots fail, right? So, you know, most people will say like, hey, let's just throw more data at it. Like, you know, the bitter pill lesson, just more data, it should be better. But the reality is a lot of the data that you actually collect from robotic systems, they're actually considered empty calories, right? Because they're not adding any new additional signal. A lot of the times you're collecting the same scenario over and over and over. Think a self-driving car driving on a highway. You don't need more highway scenarios. What you need is more edge case scenarios. It's like a cow crossing a complicated intersection, a plastic bag that's right in front of the rear view mirror as you're backing into a car park, right?

These are the kinds of things that you actually want, right? So, in reality, when you want to be able to test these systems, you don't just need one small distribution or one small type of tests. you need to be able to go in and like get all these different types of distributions and cover as much of the scenarios as possible. So like every kind of lighting condition, every type of different um spatial layout of the scenario, right? But getting this is really really hard and if you can do it, you can prevent uh failures from happening um in the field. This becomes extremely tricky because now we are entering the age of generalist policies. robots that are promising the ability to do anything and everything. Everything from packing your dishwasher to folding your laundry to even doing things in medical, healthcare and science.

And now when you want to validate these systems, it becomes even more tricky. All right? So in the field we have a very simple uh way of like giving them like essentially like a reliability score. And this is the thing that the thing that most people care about when they think about deploying robots is what is my true reliability when I deploy these systems into the real world. And reliability really just is like you can take a success rate which is if I do the task a 100 times, how many times am I getting it right? And you're also doing it across all the different scenarios that you want to be able to ship your robot for. Right? So if you're you want to be able to handle like a thousand different scenarios, you need to do that a thousand times a thousand and it scales very very quickly. Right. And all these companies are now racing.

They are racing towards how can I achieve reliability faster, faster than the competitors, faster than the market. And they are trying to figure out what's like the scaling law in reliability itself. Right? So the first way they test robots is pretty straightforward. I'm sure you heard some talks where you know they will manually stage stuff. They will get humans, they'll get robots and they do everything in real time, right? They set up the scene manually and they actually get the robot to do the thing. But in this case, the number of scenarios that you can actually test for is bottlenecked by humans, robot, and time. Right? So when we actually put that on a graph, this is a different graph, but on the bottom axis, you're seeing compute and the other axis you're seeing reliability.

Every time you do an inference, you're spending some compute, but you're still bottlenecked by how many humans you have, how many robots you have, and how much real world time you have. As a result, you're still scaling uh linearly, right? But then folks come along and like, okay, no, I'm just going to sample a few different test cases and I can get some additional new tests. It's good, but not great because you can't get a lot of distributions because they're still manually doing a lot of stuff. And then folks say, okay, you know what? If we remove humans from the evaluation cycle, right? So now folks are using things like Gemini.

Uh so Gemini robotics you can look at a scene and it can give you qualitative feedback on like hey uh did it actually complete the task successfully how far away is it and they also have things where you can autoreset the scene using another large uh vision language model or vision action model as well right so they've removed humans but you still have you're still bottlenecked by how many robots you have and how much time you need right so it becomes a bit slightly faster because now you can spend a little bit more compute and speed it up and you don't have to be reliant on humans as But you're still scaling linearly. All right. And what this means is you can just do slightly a few more tests.

And then of course, you know, like oh, you know, when we build bridges in the real world, we test it in simulation first and then we build the bridge and we do that all that um simulation for like mechanics and like tension and stuff. Why not do the same for robotics?

So in robotics there's a thing called simto to rail gap which is when you do things in simulation they don't always line up with reality right and this is like a big problem that the industry is trying to solve and surprisingly enough in the last year we have a lot of new ways to solve this and the biggest one uh that we are working on is actually using the real world to generate the simulator itself right so what that actually looks like is you can take in real data right so you take in real data into the And you can generate things from that rail data and then you can reimulate the world from that. Right? So this whole idea of you are generating a similar simulator specific for your domain and your thing every single time. Right? It's not just objects. You can generate entire worlds for your specific domain.

For example, if you're like off-road self-driving car and you're operating in the California desert, you can very quickly generate that entire world and train in that simulation. Right? So this is how you begin to close that sim to real gap. And what this allows you to do is it allows you to copy the distribution of your actual test set and have a simulated version of it. And this is already valuable because you can now do closed loop testing with this distribution. But how do we go even further? Right? This is is not great coverage. Right? So let's just take one specific scenario. So like this is an example of the type of data we generate. You know, here it's like a boat is driving to a crowded marina. The glare is in the screen, uh, is in the camera and everything's a bit chaotic, right? But this is just one specific scenario.

How do you scale this up to more scenarios, right? So, what we can actually do is we can go into the simulator and we can parameter sweep across all the different operational conditions and it's almost as if you're seeing a thousand different realities very, very quickly and you're testing the model against all these different realities. uh simultaneously, right? And from there you can expand it even further, right? So it's not just um a n* n test. You can scale it up to a lot of different domains and criteria. And the cool part about this is that you can test your AI model against it and you can immediately see where your AI model will be failing even before you have shipped your robot into production. And the whole idea here is simple, right? Fail fast in simulation and use those failures and direct them for real world testing.

So you're not testing on every single thing, but you're testing very specifically on the places where you failed in simulation. This way you spend less capital, you're more optimized and efficient with the resources that you do have as well. And you know, we're just limited by real world time as well, right? Right? So we go from this to this because now we can cover a much much wider domain. And there's a term called like domain randomization, but basically you're covering a much wider domain than real data could ever possibly cover. And it's a very good way um to do these tests. So you know, everyone, I'm sure, would have seen this thing called like the data flywheel. It has become a meme at this point where every company's like, "Yes, we have a data flywheel. " But a flywheel doesn't actually capture the most important thing.

And the most important thing is you actually need to refine this data. The data needs to be super high quality. You need to figure out a way where you're finding the most valuable things and you're also being able to drive what you should collect in the real world as well. Right? At Bifrost, we help some of the world's most demanding customers do this at scale. And we are essentially taking all of this and we are simulating it in your browser. So we have a world, you can simulate the world and you can break your AI model inside of it. Thank you folks. That was an amazing talk, especially talking about like a data refinery. It's trying to trying to make sure that your data like covers all the different edge cases.

So, I'm really excited next to have Julia Kim from Open Graph Labs talking about how they built an in-house stack where they've ensured that you can sync the data collection across the mount different multimodalities. And this is really difficult because even microsconds of of drift when you're collecting data for training robots can end up being really damaging when you actually take this and you train your models. So really excited to see how that goes. uh while we sort of have a bunch of these uh technical difficulties, you know, I'm wondering like how have you guys been finding like today's conference? You know, I think personally for me it's been like absolutely like stunning. Like I was really blown away just now when like Justin demoed the ability to paint just using like a brain control interface.

Like I never thought that was possible because I've only been playing a lot with agents, right? I see like text in text out like oh my god like we're going to just everyone's jobs are going to be automated away. And it's really cool and inspiring to see like AI being used for good. And so I think like that's been something that's been really exciting to see just the sheer diversity of like opinions and projects that people are like working in. I think so. >> Um, ourselves as a team uh using a lot of the tools that uh the speakers and sponsors have built. Um, so we'll take that as a note. Oh, okay. I think we're back. Yeah. >> Drag it. Yeah, it's extend. So, >> we had it just now. >> Oh, it's back. It's back. >> Yeah. Okay. Good. >> Thank you. >> Hi. Good afternoon, everyone. >> Good afternoon, everyone.

Uh, my name is Julia, co-founder and co CEO of Open Grab Labs. Uh today I want to talk about how our everyday human experiences can actually become useful training data for next humanoids. So how many of you have heard the term egocentric data? Yeah, I can see a few or maybe you've seen this fire video recently at apps. Factory workers are wearing the cameras on the hat uh while they're working. So over the last year something very strange has been happening in the field. Hundreds of companies have started collecting the human behavior data at scale. People filming their first point of view um cameras doing their daily task and actually got incentivized for doing that. So why are we doing this? So why did humans suddenly become the core data sets for robotics? So this is because we just got the proof that it actually works.

Nvidia's recent ego scale research show that scaling human egocentric data actually helps the robot training. So they do use the egocentric video as a pre-training pre-training data set for their model and fine-tune on a human rob alignment data set also with a few teleoped uh robot only data and the robot can actually do the task like folding a shirt with a one one shot transfer and as the same way the language model u scaled with a with a putting more data they also show that uh it's also can be workable for AI physical AI too. So it showed a significant scaling low not just because it was proved to be useful for pre-training but actually to be honest the egocentric human videos are fundamentally very important with two aspects. First, we are now building the human level capable robots.

That means that same form factor they looks like us and similar degrees of the freedoms and that means that we are trying to minimize the embodiment gap between the human and the humanoids and is actually actually getting closed very fast and as the gap go as the gap closes the human behavior actually can be directly transportable to the robot that which is the most direct super visa signal uh possible in the world and secondly the egocentric data is captured in the real world as it actually is. The physical world as we all know is continuous uh it's dynamic and physically grounded. So every data we got from the egocentric data is actually very very high fidelity data uh and it includes the much more information that any robot could ever learn from. But then uh are we really done now?

uh so we can just have more egocentric video data and we can solve more all the problem. Uh so simply collecting enough human video data there is some bad will robot eventually achieve the human level physical intelligence or not. Well I do think that this actually depends on which future you are building towards and that future defines the level of intelligence we might need for robots. So one future is robot as a utility. So tools in the warehouses, arms in the factories, machines that do the task, but they don't share the space with us. And the other future robots that actually live with us, they fold our laundry at our home and that also help our parents to to companion our parents and they hand us the glass of water.

uh and which means that they actually share our world and if we want them to live with us they need to be physically intelligent. So they need to learn the word the same way how we did. So then let's go back to something very fundamental. How uh think about how did we actually first learn the words when we were babies. We grasp the things, press the things and drop the things and touch the things, pull the things sometimes or actually many times we actually put something in our mouth to taste it. We learn the word by interacting with it and we learn through the actions and feedbacks through the touching the word and observing how it actually responds after my actions and this is what we call the sensory motor learning.

So the nature question uh follows that if human sensory motor learning itself is what forms our physical intelligence then what if we could do the same thing to robots. We let the robot learn the same way that we learn as a babies. So again this is the same baby from the last slides uh is actually producing and generating all of these sensory motor signals at once. vision, touch, propriception, audio, action and feedback loops. And through those interactions, the baby gradually learns the structure of the physical words. So the question becomes now if we could capture all of these data and train and use as a training data set for robotics, we can make we can let the robot exactly mimic like us uh and learning everything on top of here. And yes, I truly believe in that future and we can achieve this by sensorizing the humans.

Today, many parts of the human sensory motor loop are already becoming very measurable. We already have the vision system captured through the egocentric cameras. We can also reconstruct the motion information directly from the video and also pro pre propoception like a 3D hand pose, wrist pose and uh trajectories body motion reconstructions those are also can be reconstructed from the video and audio is also very naturally captured through the camera system. So and so now we it's very um very obvious that to see that only one critical modality uh now we are missing largely is the touch and for physical interaction as we all know touch may be the most important signals that we should collect from the real world.

So one reason we still have very little touch data today is that many other human signals are already capturable and interpretable from firsterson vision alone. So the egocentric video. So with egocentric cameras we can already infer motion trajectories, hand pose, body movement, action structure and even proper obsession. And honestly, this is probably the moment to thank decades of the progress in the camera hardware system and the entire ecosystem built on top of the standardized RGB system because once the world convers around the RGB cameras, computer vision became scalable and now we are right now waiting for that exact moment for the touch because the touch never had that moment yet. So we have to follow how the video system improved how VA scaled because they were converged around the one thing the camera and the RGB pixels.

We also need a unified hardware stack that everyone could build on on for touch data and also build a data infrastructure which which share the same format of the data and this is why we exist. Open grab labs here is here to build the standard for touch the missing piece of the sensory motor system so that with this and we we can get finally leap forward in robot learning. We enable this with two main layers. First the highly scalable hardware that produces the high fidelity contact signals from the fingertips and secondly on tactile encoder which is an interpreters built on top of that hardware ingesting the tactile signals and turning them into meanings.

So with a high scalable hardware we can a we are able to capture scalable data set and on those data set we are now able to build a meaningful encoder tactile encoder and so we've just started building a complete pipeline for capturing the full human sensory model loop uh making it trainable for the first time. Thousands of people, millions of interaction, every moment of contact between the human and the physical interaction can be now captured, digitized, and ready to be teached for the next generation of the robotics. Let's train the human noise by sensorizing the humans. Thank you. That was sick talking about scaling out human data collection for touch. Now a huge part of actually collecting data is that we actually need to scale up the data operations, right? We don't just need to collect the data.

We need to ensure that we have the operators, we have that entire infrastructure and logistics handle. And so we have Suen from Cortex where they talk a lot about how they do this at scale with robotic and other forms of data. Hi everyone, I'm Suin. I'm from Cortex AI and I'm a founding engineer there. Today I'll be speaking about some of the cool things we got these robots to do, some of the challenges we faced and some of the lessons we learned. Here you can see some of the robots that we work with. We mainly work with bmanual robots doing manipulation tasks and we also work with mobile robots doing uh task in more realistic environments like convenience stores. And you might wonder how these robots got so smart. Even in this clip you can see it's pouring the last drop of milk to the cup.

Actually this learning systems they just take pixels in and they output actions. Usually we have a top camera and wrist cameras. We also passed in the joints data of the robot. A simple language instruction. Then the model will predict some actions. We execute actions on the robot. You go to the next state and the loop continues. This diagram is actually a really good way to think about the modern robot learning stack. You have camera beams and joints as data. Software is powering uh data collection, training, inference. Hardware is arms and cameras. models is models are what policies that we run and to test if these policies are working you need evaluation and to make this happen again and again you need a good operations layer.

Robotics is often regarded as a hardware problem or a soft or or a model problem but it is also a huge data and operations problem. Recently we worked with Alen Institute for AI on their Mulmo act 2 paper and we collected over 700 hours of bmanual yam data for their data set and it is the largest open bmanual data set to date and we collect our data through teleoperation. Here you can see my colleague he's controlling what we call lead arms and the follower arms will copy the motion and even though it looks fun it's actually very hard. The main reason is human intuition. It does not transfer really well to a new embodiment. You know how to grab a cup with your hand. But not when you have to think through a robot arm, it's really hard. But it's it's a learnable skill.

And not just that, there's a lot more to be done before you start collecting data. Even the simplest task of folding a towel, you can fold it in two, you can fold it in three. You have to you have to come up with a task strategy. After that you have to practice the motion. Then you have to make sure the data collected is consistent across episodes and across different operators as well. When we started scaling these data operations to hundreds of hours, we realized some small workflow changes we added. They started to compound. Initially we had the friction of waiting for two or three minutes for each episode to be encoded. Then we move the encoding process to the end of the session. Then suddenly the whole encoding duration is much longer. Now we had to wait for good 30 40 minutes before we start the next session.

Then what we did was we made a small code change. We disconnected all the hardware. So you can run a new session while the previous episodes have been encoded. And what ended up happening was data collection and encoding and uploading processes they became completely decoupled. Another thing I want to talk about is how breath matters in robotics. By breath, what I mean is being knowledgeable across different layers in the stack and being able to operate up and down in the robotic stack. The reason I'm saying is I've realized the problem space and the solution space might not be in the same layer in mo most of the times and the more intuitiveness you have across the layers, it's much easier for you to solve problems faster. Let me explain this with a few examples.

So when we started running policies on these robot arms, there was a task where the robot had to grab a jar and the grippers broke and you can see the clip the grippers flying off. And I thought, okay, maybe the model learned something wrong or I could just lower the gripper's force from code. But my colleague, he said, okay, let's just design our own gripper. We were we were working with third party hardware, but we could still innovate on top of that. And this is a good example of a hardware solution for a problem which I thought is in software. Similarly, whoever like worked with these cameras, you know, they get disconnected often and you unplug it in unplug it, then plug it back in, then it starts working magically. Then one of these times, one of our operators tilted the camera accidentally and the top camera view was off.

So the all the data we collected that day we had to throw away because it was not in the correct view and we were trying to make sure the camera mount is more rigid but I coded this I w coded a script a tool to check if the top camera view is good. So what we did was every session we take two or three minutes at the start then we check if the camera view is correct then we can make sure the data we collect is actually good. So this is a good example of a software solution for for a problem that we thought is in hardware that that's why moving across the stack and thinking from all these layers actually helps a lot. I also want to talk about why evaluations in robotics is hard. Similar to software you can eval evaluate robots in simulation and you can paralyze that. But real world is where things get messy.

For example, lighting could change. There could be distractors, there could be actuator and camera noise. So you have to account for all those things. Recently when we worked with Malm act when we work on malmarmac 2 we ran thousands of real world evaluation rollouts across five policies and that taught us like how hard of a problem this is. So when you run real world evaluations, this could happen when it's a failure and it would happen again. In robotics after you run every roll out, you have to reset the environment manually. Not like in software you can run parallelly. You have to manually go and clean it up if it makes makes a mess. And I've done this hundreds of times and I can guarantee you it's not fun. uh then we realize it's very expensive to do this all the time but that is the gold gold standard as of now.

Another hard thing about evaluation is when a robot fails to do something it's really hard to figure out where it fails. Let me let me explain with a few examples. It could be the data. Maybe different operators use different strategies. Maybe I folded it in two. Someone else folded the towel in three. Uh maybe it's a training setup. you the adaptation you wanted for example Laura versus full fine-tuning then it could be the setup I've had scenarios where I tried to load a model and some part of the model got initialized with random weights and the model is like going haywire and it could be the wrong action chunk size as well compared to the what the size that you used in training and maybe the evaluation setup itself could be wrong maybe you are trying to evaluate in distribution but the placement of the object object is slightly off.

Lastly, I want to talk about safety. This clip is something I accidentally recorded. You can see the joints doing a 90° in less than half a second. And if someone else's hands were there, they would have gotten hurt. We really talk a lot about robot safety when it's deployed, but I think there's a lot lot of safety concerns when it's developing as well. uh I can say like in data collection if the leader arm suddenly dies which happens sometimes the whole weight might be on the data operator. In evaluations we had cases where we are testing a task which involves test tubes and one of the robots they broke the test tube and you have like glass pieces going around u and stale action cues that might lead to sudden arm movements which is also a safety concern. And there's much more like this.

I also want to talk about running AI written code on robots because especially AI coding tools are becoming mainstream. Uh to give context one scenario that we use uh AI coding tools for robots is basically when we are using lay robot we are huge fan of layer robot from hugging face. So when we want to adapt that library to robot arms that we work with there's a lot of scaffolding a lot of interface work that we need to be done. So we use AI to do that and move faster. But when we try to run it, we run it like it can fail. Uh we do the normal software checks, fundamentals, normal PR reviews. Then we try to check in simulation and we try to test in logs. You can send the actions to the robot but not execute them. Just look at the logs first. Then when you want to test it on the actual robot, you can just move one joint at a time.

you can slow the speed down. Uh yeah, these are some of the things that we follow. Yeah, one thing I want to emphasize is that you don't have to be an expert in every layer of the stack, but if you have more knowledge about different layers, it's really easy to solve problems and move faster. That's it from me. Thank you. Okay everybody, um that is the conclusion. Um thank you Savine by the way. Thank you so much. Um this is the conclusion of the first half of our afternoon for AIE. Uh you guys have done great being so engaged throughout 9 to 5 6 PM of like programming for the last two days. Um we're in the home stretch with uh more really really cool talks that are coming up um after the break. Uh a lot of the most crack startups around the world are going to be sharing um what they are actually building.

Um a lot of them actually my Twitter friends that I've known for a while and I got to invite them and meet them in person which is also very cool. Um so please do stay for that. Um and while while this break is happening um I want to give a little bit of context to my friend there in green called Kazaya. Um just wave. Uh so Kazaya is someone who uh you know has a day job just like the rest of us working in consulting but she is also a mindfulness coach and wants to find a way to be able to bring more people into that kind of practice especially in spaces where there's just so much going on that a lot of us can feel things like overwhelm anxiety and just want to find a systematic way to be able to take a pause and just be able to kind of you know slow things down a little bit.

And that is why we wanted to create a little bit more of a curated experience for the breaks versus, you know, putting up AIE logo and some music and let you guys have coffee, right? Um, we wanted to put like thought into every minute of the programming. So, um, that's why we brought her on. But I also wanted to share another story about how this all got started because I think it's very much in the spirit of AI engineer and this like changing definition of what a builder and engineer is. Uh Kazaya actually with no background in coding actually vibecoded this entire experience. She found GitHub repositories that helped create the particle visualizer on the screen that you're going to see and she did that all in the last four weeks.

I mean I think you know we kind of pilled her on AI and then she just like went and you know went ahead and decided to um to build it. So, um I couldn't be more happier seeing people who are in all different kinds of, you know, spaces, uh industries, like being able to be empowered with these tools and just create these amazing things, right? And um all of this is possible to be able to connect things like uh meditation and mindfulness to an actual tech experience that we get to be able to show on stage today. So, um do kind of enjoy uh the next uh 15 minutes or so um you know to kind of slow things down and uh you know get that little less reserve of energy before we finish off the day. Thank you. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey everybody. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. with our programming really quick. Uh and and while our next speaker Jay sets up, uh it's going to be an incredible talk. I've been looking forward to it.

I spoke to Jay a little bit behind the scenes and this is a talk about humans about the human side of AI where you if you may work in a team and people are wanting to get in on AI, they're wanting to uh level up, they're wanting to design for other humans, often times we we end up with generic prompts and generic results and and we may not even know how to really get the most out of it. And so this talk from Jay uh comes to us previously from Canva Excanva is going to talk to us about that and I'm very excited about it. So, if you're ready, if you're feeling restored, if he's Is he ready? He's not yet ready. No. He He answered me. He just said no. That's so nice. I can kind of see no. While While they get ready, how are you feeling? You good? Show me by applause level. Okay, that's nice. Very good. That's good. I'm happy.

It's It's a good conference. And it's sad that it's almost coming to an end. If you want more even after the end to say a You don't want more? I don't think we won't do it next year. How's that? I'm joking. I'm joking. It's fine. They're freaking out backstage. Can he say that? I don't know. We gave him a mic. Um, this is going to take a while, huh? This is the worst part about being an MC. Now I've got to like think about entertaining you all. But that's easy, right? >> Hey, thanks. He said I'm What's your name, sir? Ari >> art. >> Test test. >> His name is Art. Jesus. The guy's a piece of art. You ready? >> I think we're ready, man. Your biggest round of applause. Let's go, baby. What's going on? Wake up. Come on, baby. >> I'm Jay. I used to be at Canva. I used to work at Grab. How you doing?

Today, I'll be talking about prompts don't have opinions. You do. So, for context, right, I'm really tired of the that's being put out there. Prompting ain't is but slapping tricks. So, I have my notes on my phone. If I'm looking at my phone, it's not that I'm looking at an agent, but I'm looking at my notes. So, I'm tired of these design influencers, these leaders, these people that have positions and high power that talk about the design process, yet they haven't done anything or shipped to millions of users. like Jon Snow, they know nothing. So, take it from me and the people here who have actually built something for millions of people. And let's talk about this. Oops. Oops. So, historically, right, uh I think there's a parallel between AI and actual products that have been built.

So General Mills, a baking company in the states, in 1947, they released a cake mix and people weren't really vibing with that in general. And when they added an extra step, just adding an egg to that instant mix, people were invested. They felt like they were creating, which is so interesting, right? Because it's the same with AI. I think for for anybody that's designed for AI in general, people create value when AI outputs it and helps them, right? And it's it's called the IKEA effect. People are going to be invested when AI is actually collaborating and acting as a partner. Uh, and it's it's fascinating with products that you've seen out there, whether it's Canva, Google, Figma in general. You let people have the choice to either use AI or edit or generate with you. Some people, you know, obviously a little bit hesitant.

You see it with them not trying to use tokens, but it's fascinating, right? Oh, it's gone out. Is it because All right. Awesome. Amazing. Cool. We're back. We're back. I hope. Keep it active. Oh, we're down. >> When something like this happens, just give him a huge round of applause to not make it awkward. There we go. >> It happens. It happens. No worries. So, I'm going to keep going. We're good. Awesome. So, how do I work with AI? And how do most people work with AI? For me as a designer, I use it as my intern, not my art director. So, you've probably seen this video on LinkedIn. It's fascinating, right? Uh people are token maxing and using all their these their tokens. It's sloppy. Um and it's the same with cloud code uh in general, right? You use up all your tokens, you get upset, you're like, "Ah, dang.

" Like, I you you lose all my credits. It's expensive. Why would I want to build? So, I guess I would ask most people, right? Would you delegate decision-m to a human expert or AI? I guess when you're designing for real people and you know I encourage people to talk to people outside of the tech bubble that you're in because average people are hesitant to use AI. So if you frame it in a way where people are actually spending less time to think about things then people are more inclined to potentially use AI in general which is fascinating right and they did a study on this.

people are more inclined to use AI uh when you frame it as a loss of time uh and speed in general and we we did this right so for me I built canvas sheets uh AI powered spreadsheet and it's fascinating to me when there are other competitors out there that still use this hashtag error and it doesn't really communicate what's actually wrong uh and if you talk to normal people that use spreadsheets on a bas daily basis they're overwhelmed by this right so at Canva we try to make it easy for someone to use formula and we give them and talk to them as a human being to give them a suggested fix which is fascinating in itself right and it's the same with the voice uh assistant stuff that I've done as experiments as well AI builds the happy path uh as you've seen uh humans break it they don't care right and you can't prompt for environments whether you're outside dealing with road noise babies crying right and it's unfortunate because I think a lot of people if you've built for voice it's so expensive to go on the wrong path.

Uh, and if you've ever done it before, it's just hard to do in general. So AI can't solve everything. So I encourage you to think outside of the data set, right? I think this graph you've probably seen a lot. So when you're designing, if you're an entrepreneur or even designer or creative or dev, think about what is the driving innovative competitor advantage that you have and who drives that? Design. Design has always driven this value to have an advantage on competitors in general. So James Dyson is also a good example of this, right? He's prototyped 5,000 and 100 vacuum prototypes if you read about his story and he didn't get a call until one person took uh a chance on him in general, right? The same with Apple keyboard.

A lot of people hate it, but you have to remember they had to think about the smart shortcuts, the things that people would say, different countries, different words come up as well. And the team, I'm sure, went to Steve Jobs and iterative uh continuously on this just to get to the spot where it is now, right? And they have to consider, right, who they're designing for the world, these new add-on additions as well. So designed outside the data set, right? So I' I feel like and it's related to what Josh Newton talked about earlier, AI speeds up the loop. It doesn't replace design craft or judgment in general. So for me, right, I I teamed up with a designer at Canva. Oops. Oh no. Again, >> classic. All right, cool. Hey, hey, hey, chill, chill, chill. We got this. So when I was at Canva, I designed uh columns and layouts.

Shout out to my boy Simon Lynn in Taiwan who's a legend who also helped with this. And these are complicated interactions, right? Not everyone's going to get this. So when we went to ground talk to real users with real prototypes, we had to think outside of the data set. So AI can't solve complex interactions, complex products, you still need to talk to real people and actually test with things that AI can't potentially generate or think about. So it's the same when we're working in workshops as well. Uh we actually build um code templates. What does that mean? Well, we build code templates of our products and it helps people get into cursor, get into claude and actually build ideas during workshops, brainstorming and empowers everyone, right? We as designers should not gatekeep in general.

We should empower everyone to bring their ideas, build their ideas through AI so we can prompt and actually test this on the ground which is really important. So, it's the same with smart homes and and voice as well. Uh, it's very fascinating to see where Huawei is going with the future as they look into smart homes. Think about voice and being contextual because AI again can't be reactive. It has to learn. It has to be trained. So, how do you think about this and have a smart system that adapts to normal people's behavior? So, lastly, build the world that you want to actually live in, designers and devs and folks that are here, especially entrepreneurs, right? because people are investing in experiences and design is going to be uh that lever that pulls things forward, right? And CTO at SpiceJet uh India, he even talks about it.

AI is very expensive right now, but the overhead of of hiring people is cheaper, which is a fascinating quote, especially in the age of AI. So, final hot thoughts and hot takes before I end today. Mute the trash on on AI design on social media because there's a lot of out there to be honest. Talk to people outside of your networks and bubbles because the average person is actually pretty scared of AI right now and of course it's out. It's fine. Users don't care if your product is better, right? They don't care if you they have a cool feature that's better than your competitor. You need to actually design for these people and their needs and be contextual. And finally, to the design leaders out here in this region and to the world, I think you have to give space and time for people to actually adapt AI.

I've heard too many stories about designers that have actually uh been called out for not designing enough screens, uh being bullied by these poor design leadership because they don't know how to use AI, right? I got told that my work doesn't matter, but guess what? I I designed a product that went to millions, so I don't know what they're talking about. So to be honest, right, I think it's important to empower your team. So last last point that's not up on here. Christina Caul, she went out to Artemis uh spaceship obviously around the moon. She talks about finding your crew. So I encourage you to find your crew, your networks here. Feel empowered, feel connected to the network that you're adapting and working with in AI because it is important because in the world that we want to be in, you don't want to be anti-A.

You need to be AI fluent. Just be anti-bullshit. Thank you. >> Give it up. Be anti-bullshit. How many of you are anti-bullshit? I tell you what, I am. Wow. Really? You like The rest of you, huh? Anyway, um please give it up. We have We have a co-m. Check this out. It's Usman, everybody. That's right. Usman, less than half my age. Um I won't tell you what that is. And he's so active in the local community here. building. What's the last thing you built, bro? >> The last thing I built was like a, you know, religious app, right? >> Like how you built one. >> Yeah. >> Um, mine is for Muslims all around the world. And how you can uh track your prayers and like all of the different suras in the Quran, which is our uh holy book. Yeah. >> Similar to your Bible. >> Dude, that's so cool. And you built this? >> Uh, yeah. with Google AI Studio.

>> Let's go with Google AI Studio. Give it up. Like a builder. How old are you? >> I'm 13. >> He's 13. What? That's the future. How One last question while they set up. How what's the experience like building with AI Studio? Like are you just prompting things? Are you writing code? Like what's the >> Well, of course, in the beginning, right, I couldn't like vibe code at all. It took me probably a year or two to really figure things out. Yeah. And it's I've come to a conclusion that vibe coding is not that hard. You just need to put in the hours. >> That's right. You just need to put in the hours. Fantastic. So, you're introducing the next speaker. Is that right? >> Yeah. >> Let's do it. Give it up everybody. >> All right.

So now we have Alex Lee who has come all the way from San Francisco to Singapore and he's come to introduce how AI needs design systems. Currently the users like the AI studio and all that stuff. The designs are horrible. I'm going to be honest right now. We want designs that actually match the users brands. Give it up for Alex Lee. >> Oh, you need the mic. Sorry, guys. It's like, how is he supposed to do the talk without the mic? Alex, one more time everybody for Alex Lee. >> Thank you. Thank you everyone. >> Okay, perfect. The slides are here. Um, yeah, I'm Alex, one of the founding engineers at Magic Patterns. Actually, just want to get a quick poll. Has anyone actually heard of Magic Patterns before? Raise your hand. Oh, actually there's a couple of you. Super cool.

For those who don't know us, Magic Patterns is an AI design tool that gets you from idea to production in a matter of minutes. We've been used by over 2,000 product teams, KPNG, RAMP, etc. But I mainly work on design systems. And so, you know, in the world of AI, it's been much easier to build new features and new functionalities, but the hard thing we still have is consistency. And so, I'm here to tell you why design systems have not only been needed even more, but are crucial in the AI world today. And so before I begin, let's talk about some history about why design systems are even needed in the first place. So before everything, the world or the web was the the wild west. Every page was different. It looked like your MySpace page with different widgets everywhere, different buttons.

Designers had to reimplement, engineers had to reimplement, and there wasn't really any shared systems in place. And so to restructure this chaos, we have design systems. It's a shared language that your product teams can use. You have your tokens which represent your colors, typography, your spacing. And thanks to Brad Frost's atomic design, we have a great hierarchy and nomenclature for components. We have your atoms, the buttons, labels, inputs. We have the molecules composed of those atoms, maybe your form modules or your search bar. And then we have the organism level components and templates to create larger things like your sidebar or your dashboard layout. And so the promise was simple. We had consistency, speed, and scale with the thanks to design systems. But maybe things were a little too consistent.

Maybe, you know, instead of it took a lot longer to add a new button into the design system. There's bureaucracy now. you have to ask the team, can I add this new thing in into this layout? And so we weren't thinking about things from first principles. It was not about what how do we solve the user's problem from the ground up, but it was more how do we use the tools in our design system or components in our tool shed to solve that problem. And that rigidity was not that helpful. And so the industry took a step back. Design systems can be a little too enforcing. And so let's think of things more as a framework rather than a set of rules instead. This way you can have that creativity but still have those guard rails in place to get your consistency in your brand whether it's your typography, colors, logo, images and set like that.

And so finally we were at peace. We have a way to build creatively but also with guard rails in place and nothing disruptive ever affected the tech world again. Right, man. I feel like in even the last six months my workflow has completely changed. I'm sure for every one of you I don't even write code anymore. I just asked the agent to write it for me. I'm sure for design product management everything has changed as well. And I think it's interesting, right? The cost of implementation has now become basically free, especially if your company is already paying for those opus 4. 7 tokens, right? And so the questions change from can we build this? How long does it take to build to, you know, do we even want this? Do we need to m do we have something we want to add? Do we want to maintain this?

Does this new feature use the components in my design system? Is this new feature align to my brand? And so with that, we have all this chaos that AI has created for us. And we go back to why design systems were created in the first place. And specifically, we need those guard rails. And so this is what the AIS, you know, in the world of AI without that context, you specifically have things that are not necessarily to your brand, right? Things get hallucinated. You might have components that are hallucinated. You might have colors that are on to your brand guidelines. And overall, you really need those foundations and context to make things work. It's not just your Figma mocks. It's not just your story book, not even like a design MD. We really need context to align our agents to build things that align to our brand.

And so we came up with a solution on our end, which we call the AI native design system. Obviously, there's not that much difference to a normal design system, but the key things are now that we have two pillars that this design system relies on. Your documentation and your code. You have your system level rules, your tokens, like I mentioned, your color, typography, spacing, and then your components, but specifically backed with code because the more code aligned your design system is, the more close it is to what your users are actually seeing. And this also allows the agent to understand the props, the variance, the way in which to use those components directly. And so what is a realworld example of this look like? Here's one of our customers, Headway.

Headway is a mental health platform that helps people find licensed therapists and they already had a design system and so we helped synced it for them. We took their documentation and their code and created it the same structure that I mentioned before. Storybook as a source of documentation lends itself to system level rules and component level rules based on stories. And then their actual code ingested either as an MPM module or synced with GitHub for tokens and like I mentioned those components. And it's crazy because I can't show this in a live demo because it might take too much time, but the differences are stark. I generated these ahead of time, but with the same generic prompt of build me a dashboard, you get something completely different. Without a design system, you get something that works with your UI, right?

Or it's a nice generic SAS dashboard, but does not fit with maybe your brand or your product. Same prompt with that design system context. This matches really closely to what Headway's brand looks like, right? We have our logo. We have our components, colors, typography, all matching in place. And now we're actually able to ship really close, highfidelity code, even with more simple prompts. And now this completely also changes what that design to engineering handoff looks like. Right? In the old world, I had this Figma mock. As an engineer, I would have to look at it and check my story book, see which components that align to, make sure the color tokens are correct, right? And it was very hard and I had to build everything from scratch. But now we're not even working with designs anymore. We're working with codebacked prototypes.

And because these prototypes are using my actual design system components, I can hook it up by an MCP to something like cloud code cursor codeex and just be like, oh prototype tool, design tool, give me this design, make a new feature out of it. and those same underlying fundamentals, both code bases should be using my same design system components and I should be able to get something at a much higher fidelity. But because those prototypes are also codebacked, I can do it the other way around. I might have a feature that's not necessarily in mocks yet or in the world of Vibe coding, people are always producing new features. And what I can do now is just say take this piece of code, take this page and convert it into a prototype that I can easily iterate on.

And now because of this MCP round trip, I now have high fidelity transfer on both directions. And so as agents evolve, so will our workflows. But I think the one really hard thing we've not been able to match yet is craft. AI alone will not replace craft because without context, you're not going to have the intention, the touch, kind of the humanness that makes great products the way they are today. But design systems are here to add that context. And so in the past, design systems used to help us build with craft, but today they help our agents understand what craft looks like. So, I hope this helps understand why design systems have become ever so important in this AI world. Thank you. >> Thank you so much, Alex. And coming all the way from the US, uh the next speaker Woo. Yeah. Um cool.

Uh next speaker is going to be uh Sabina from Magic Path, not Magic Patterns. Um, I did sort of tell these guys that, you know, they they exist and they'll go after each other, but I thought they'd be kind of fun. But, um, yeah. So, I thought it'd be fun to tell a little story as well about Sabina. Um, she actually studied chemistry, I believe. Is that correct? >> Yeah. But now she's a design. >> Is that like Breaking Bad? >> Like Breaking Bad? >> Like Walter White? >> This is Singapore. We can't say stuff like that. >> Sorry. >> It's okay. >> But anyways, um, but that's cool. I think again um you can kind of study anything and then become anything. And what did you study? >> Nothing. I studied nothing. I have zero degrees. I I'm just I'm uneducated. >> Yeah. So sometimes guys, you can just do things. No one's stopping you.

Just if you're chem if you're a chemistry person, you can design. Um so that is a little quick background about Sabina. Woo. Hello, my name is Sabina. I came all the way from New York City just to talk to you guys. I'm so excited to be here. And I am a designer at Magic Path. Not patterns path. Light mode, dark mode or light mode. Um, so it's funny. I actually hosted a workshop. If any of you guys were there two days ago, hello again. Um, and I completely redid my talk this morning because I realized, oh my god, I'm talking to like capital E engineers. So this is for you. Um, if you saw on the uh schedule, my talk was should designers insert May 2026 design trends here. Uh, and that was penned in March because I was like, Sherry, this space is moving so fast like god knows what, right?

Like I don't even think skills were prevalent before uh I I submitted this talk. So, um, that evolved. I didn't do that. Should designers code? Should fish swim? That didn't work. Should designers design? This is actually a good point. I will uh come back to it. But I think if you're a designer right now, who's also touching code against their will? Yeah. Okay. And then I realized, wait, I'm not talking to the right crowd. Should engineers design? Yes. And so this talk is going to be for you guys nerds. Um so, uh if you for me engineering is really scary because div blocks are scary, but if you think of div blocks, it's flex flexbox. And if you could go flexbox, it's auto layout. So in like 90 seconds, I'm going to teach you everything you need to know to take my job. I hope you take my job, right? I'm tired.

So if you see a font that looks like this and you're like, that's very clean, very easy to read, very human, right? Um I prompted that this morning. Like this is called uh sand serif. Uh it's very approachable, very human. You probably see on every developer site modal, you know, linear claw. They have their own thing. They're expensive, but inter is a very good reliable thing. And people usually just play around with the tracking and the kerning. You know, if you ever see that A versus A, that's just like an expect element. You can change it out, right? If you see this font, you're like, "Wow, I'm technical now. I'm seeing numbers. I'm seeing something that's very scientific. " This is called a mono font. Blank mono guys to mono is probably what you need to know. It's very like, "Oh my gosh, if I go on my website, like tech, right?

That's awesome. " If you see this font and the difference, you know, attention is all you need. Latte is in it. Um, Times Roman, anything that's kind of serious, uh, anthropic answering my question of should I drink five shots of tequila before this. Very authoritative, very professional. This is called SIF. And if you want to know in 3 seconds why we have a difference, SIF is when uh, back in the like Roman or Greek ages, uh, people would draw like kind of what they were going to like stencil out. And these little marks are from actually like the paint brushes of people drawing. So that's literally where it's from. Now you know. Okay. If you see something like this, shaders, interactive things. If you see, wow, like how the hell does that happen? I don't know WebGL. Um, yeah, this is shaders.

All you need to know is that uh you can go to unicorn. studio, get that done. If you want to actually know the math behind it, go to my friend Maxim's blog. He works at linear. He's fantastic. Um, and that's everything you need to know. So, um, let's see what else. No gatekeeping here. You're like, "Wow, I'm on a hero page. Here's Magic Path's website, which you will all see soon. Here's Cursor's websites. How on earth do they do these hero animations? " Guess what, buddy? Yeah, that's right. You just take the codebase, you throw in an animation thing, and you make a new branch and you say, "Hey, make it awesome. Make it pop. " Um, usually people have a recording of their product on here, but I advocate for this because, uh, you kind of want to speed things up.

you know, there's like kind of a etiquette when it comes to making people wait through your AI generated whatever. Uh, and it's just faster and you can do a lot of really cool things. Like if you see my prompts, I'm just like, make it pop, make it bigger, like make it in 10 seconds, whatever. Okay. Also, I'm not gatekeeping designers, too. This is for you. If you ever see something on a website and you're like, "How the hell did I do that? " You rightclick it, you go to inspect element, and you dig around until you find the computed layout, and you copy that into um magic path, which you'll see soon. And yeah, this is um this is all to say that I think it's really interesting. Engineers have taste, right? I writing good code requires some sort of like finessing.

And I think design has been such a blackbox for engineers that they don't realize like no, you can have taste with this kind of stuff, too. Like everything you just saw, like that's 2026 designers in a nutshell. Like Um, I didn't go over instrument sands, but okay. So, something I want to kind of segue into is how are we kind of defining design and work today? Design today, there's a lot of um there's a weird uh pattern that we've kind of behavior that we've encouraged of you iterate, refresh the page, iterate, refresh the page.

you're kind of stuck in the single viewport and if you want to see a version you kind of have to like do this awkward dance of like pressing the back button or whatever and like you don't really think you kind of like iterate until it's like good enough but you don't really pause and reflect and think about oh wow like maybe there's something good from this iteration versus this iteration right you're just kind of moving forward uh and not being introspective which apparently is uh not masculine so um given how hard it is to predict the oh what does that say uh the future of design like I work you know at magic path and I see a lot of design tools that are like oh um you got to export it as this whatever file like oh you have to natively make it in there. My thesis is like I don't know how the hell you guys are designing.

I really it doesn't matter. Um I want to be able to give you guys the best tool possible to kind of meet you wherever you're at. Whether your design is like in a halfbaked next. js JS app, if it's in a Figma file, if it's in like your head, if it's in your teammates's head, it doesn't matter because um yeah, I mean, creativity comes from anywhere and I don't want to be the person to tell you where it comes from. So, I was talking to Sher. She actually invited me to this talk back in March and I was like, "Hey, like uh I don't know what I'm g I don't know what I'm going to give a talk on. " And literally, this is what I told her. Like, I made these slides the day of. So, it's not out of laziness, it's out of accuracy. So, yeah. Okay. This is a quote I think everyone should remember.

I think this is kind of like the whole thesis of this uh conference. Uh I'm just going to read it out loud. John Collison, who is like one of the Collison brothers, part of Stripe, he says, "As you become an adult, you realize that things around you weren't just always there. People made them happen. But only recently have I started to internalize how much tenacity everything requires. That hotel, that park, that railway, the world is a museum of passion projects. " And I say this to say, you know, uh, you know, some people just throw skillmd files and they're like, you know, put the fries in the bag, whatever. But I think there's a beauty of like kind of understanding like, wait, before I just like park this skillmd file I found on Twitter in my chatbot. What's in it? Like, do I want every single thing?

Like, do I even like yes, it's Airbnb's design system, but do I want every single thing? No. You kind of want to finesse things, right? It's it's similar to like whenever like someone gives you like a PR that was obviously not looked at like they can't explain every single line of code. Not that they have to, but you know like dealing with someone else's AI crap doesn't spark joy. I think everyone can agree on that. Let's see. Okay. Um this is all to say like I'm saying this all from the heart and um you know as a designer before this I did a AI design startup where I tried teaching people design. Uh so you know there is no corporate shilling hat on here. But now there is wait. Damn I wish I was smoother. If you want to go fast go alone. If you want to go far you should use magic path. Then you should use it with your team.

Use it with your enterprise multi- aents. We just released it two days ago. So there's me Chloe Park. if any of you guys know her, she's fantastic. Um, so you can not only design in Magic Path, one on a canvas, which I think is the right way to go, two, with multiple agents, whether it's the side chat bar, um, and three with your actual enterprise team. So, get the marketer, get the CEO in, like put too many cooks in the kitchen, see what comes out. You know what I mean? Um, the cool thing is that, you know, I have been seeing all the love for cursor and all the love for codeex going on here. I'm such in awe. And the great thing is that you can actually use magic path with your existing tools.

So I give a workshop I use cloud code but you could use codeex you can use whatever like I think I saw someone with like the Amazon IDE was that Kimmy or uh anyway you can hook up Magic Path to these different agents say hey like you know if you have like a bunch of you know pro subscription credits like use that up on Magic Path. Don't feel like you have to buy more credits. Like again we're trying to meet you where you're at.

Um my boss Pietro who is such a trip if any of you guys know Pro he's like such a crazy guy but he made this really awesome video and like where he just shows using codeex you can make these really amazing designs and I think like this next generation of design is just going to be about you know we have the tech we have to communicate to people that no this is how you can actually achieve like engineers designing and designers learning how to work better with engineers um so we have all tech it's just like being able to you got to put it in people's face and be like, "Hey, hey, you know, use this. " Um, design from everywhere. I actually had someone say like, "Oh, I wish I could design uh with Magic Path from my phone. " I would never do that because I think that's too much cognitive overload.

But if you want to hook it up to Telegram, WhatsApp, whatever, you can like you let your design bake and then go check on it later. So, making that uh aware. So, the cool thing again is closing the loop between design and code. Um, I don't have it on here. Oh, no, I do. I do. Uh, but basically you can have a magic path design, put it in your codebase, finesse it. Even if you do edits to the local file, you can put it back in magic path so you always have a clean file. And again, these all have live links, so you can send it over Slack, send it over iMessage, I don't know, whatever. Um, and yeah, so uh this is kind of just like a wish it was bigger, but this is just me trolling around my file like uh you know, again, because it's a paintbrush, I want you to make art.

I want you to make projects, things that might never be shipped, but at least you told yourself, you like spread everything out and like really thought about it, right? Because I think in the future we need to do things that make our brains wrinkle a little more. I think mine's like, you know, like inflating. So, um, you know, this is just me playing around with art projects. Like I plugged in I bought a Japanese texture pack off Twitter and like I like hooked it up to my you know local uh agent or my external agent and then it put really awesome things in magic path and I can see that being used for like landing page or some other creative endeavor. Okay so the last thing I wanted to say is oh shoot over um this is my incredible team nothing great is built alone part two. We are primarily based in New York City.

If you're ever there come say hi. We're in downtown Manhattan. It is such a blast. And okay, so take a picture of this because guess what? All the slides are on there as well as recommended readings. The myth of the paperless office. There are some blogs that like, you know, Maxim's blog is there. There are some really good resources there for you guys. I also have every single slide. It's not totally accurate, but it's up there. Um, my email and Twitter, please tweet about this. if you um actually make something and you DM'd me, DM it to me or if you DM me in general or send me an email like I would love to like personally onboard you and help your team get set up and yeah, we can host your design system. I actually think that's the biggest question I've gotten. They're like, "Oh, can I transfer my design system to here? " Yes.

Uh I think that's it. >> Thank you, Sabina. >> Huge round of applause for Sabina. Everybody keep it going. Yes. Get the mic, young man. We uh look how many of you design images with like chat GPT or Claude or some Yeah, many. Okay, this is like 10% of the room. Um I think many of you don't do it because one, it's kind of >> you know, uh like we kind of know what slop looks like. Um or it's it it makes mistakes. Six fingers, right? Anyone see Katy Perry at the Met Gala? You know what I mean? It's a cool art. Anyway, um image generation either for brand assets like logos, um business cards, things like that has always been somewhat of a challenge because we know what slot looks like, but also where's where'd you go? Oh, there you are. Did you get a mic? Go grab it, bro. No, they they don't need it yet. Go get it. It's fine.

Anyway, this is BTS. Anyway, um so here's the deal. When you get when you get um and I'm I'm invested in this now. Hang on. Oh, let me just can we use this to introduce her and then we'll give you the mic. Thanks. Um, anyway, so when you get an image from an AI model, you get one image. It's like a flat image, you know, but if you're a graphic designer, you work with layers. You know this, right? Like like you have like a background and a foreground and all kinds of layers. Well, how cool would it be if AI could do that for you? Give you like a Figma ready thing with all the layers that you can use. And that is what I'm getting ready to hear about. I'm very excited. Who's the next speaker? >> Priya. Introduce her, bro. >> Yeah, I know. >> It's okay. He's new, but we're training. We're training. >> Okay.

So, now our next speaker is going to be Priya, who came also who also came from San Francisco to Singapore, which is a 17. 5 hour flight. and she's going to be talking about how AI can become your design partner and help you create some really cool stuff like uh similar to Canva but better. >> I don't know. Anyway, just that's free. Give her the mic. Fantastic. Give it up for Priya everybody. >> Good evening. Uh thank you so much for that intro. I feel like you explained uh a lot of the things that I was going to talk about. Um my talk is I'm the co-founder and CEO of Leica and we're building the infrastructure to train and evaluate creative AI models. And what that really means is I spend all day yelling at image generation and video generation models because they don't understand our prompts.

And we are working on building the infrastructure to get them to be better at it. And uh we want to avoid the problem of death by prompting. Um I think he asked this question. How many here have used chat GPT or nano banana to generate slides, presentations, social media posters? And I didn't see any hands go up. Are you all lying? Okay, now I see more hands go up. So obviously most of you use chat GPT or nano banana to generate images. And I'm sure um I'll share like what I was doing today and most of you might empathize with what I was going through. Um this is like devil wears Prada a poster and I asked uh the I asked Gemini to replace the image mask uh with a woman with blonde hair and then this is what it gave me. That's okay. Uh I still had some patience left in me and then I prompted again and this is what it gave me.

And then it continued to get weirder. This is what I ended up with and now I completely lost my So I I thought okay this is not going to work. So this is like progressively worse results that I saw. So what we do at Leica is a little bit different. So if this is the image and this is the same prompt I gave change the image mass to fill it with a woman with blonde hair and green eyes. Uh it isolates everything into layers and then it fills that layer with that exact image. The level of localized edits you can make is crazy. If you have layers exposed and you're able to delegate each layer, you can also move the text around. You can change anything that you want here. And you might be asking why is this small?

Well, I guess like the reason why we're able to do it is some of the uh companies that are doing image generation or video generation, they output MP4s or PGs and they are frozen file formats and the layers are not exposed and every with every prompt the design state is reset and text is not a very interesting input uh medium because many people don't know how to verbalize what they want. So there's a lot of loss in translation and there's no human AI multiplayer experience today because of that and the way we have solved it is really to do this layer level editability and layer level editability is not just for humans to move things around but there might be other specialized models that you could use for different layers. It could be for text generation, SVG generation, photo generation.

You don't always need to use one giant model for everything. And you might be asking why should a startup tackle this? Why haven't the big labs already solved this problem? And the honest answer there is there is no data. With code, there's a ton there's tons of like GitHub repositories. LLMs have gotten really good at text processing. Whereas with graphic design, you just have these three giant companies. They're all walled gardens. Figma, Canva and Adobe hold billions of editing traces and data that none of the labs have access to or no one in the community have access to. So when we as a startup decided to tackle this problem head-on, we thought from first principles and also decided to uh tackle the problem of like what is the missing gap in the market and that is data. So we went ahead and collected over 1.

5 million layered graphic design composition. So what that looks like is some of this has been open source. So you can actually go and check it out. This is like a fun explorer that we built where we have put out data from across so many different design categories, 50 plus categories from Instagram to a business presentations to posters and each data point has several rich annotations on what the image looks like, what are the crops like, what is the positions of it and if there are semantic and logic groups then you can actually see which elements need to be grouped together. So you can teach an AI model how to do refflow of content or if an aspect ratio needs to be changed, it really knows how to plan the layout. All of these things all the frontier models today suck at.

And you can play around with this data and uh parts of this have been open source. So you can also give a lot of these configuration files as uh uh skills for a cloud agent and it performs a lot better and you can also train models or build eval on top of it. So the way we approach this problem is you can get oneshot outputs today from ton of degenerative AI models and some of the results are really really impressive. But when you hear comments like AI lacks taste, what that really means is designers obsess over details. Somebody is thinking about what the corner radius of a rectangle needs to be. What should the crop type be? What should the distance from the margin be for a text box? And thousands of these small tiny decisions is what elevates a design.

And AI models don't quite understand how to think about some of these tiny decisions. And every small misstep here makes that output very hollow and sloppy. So the way we've approached it is really to isolate everything into layers and each layer can be shaped very differently with proprietary data from an enterprise or other data collected uh uh from elsewhere and the layer level data is going to be very helpful because in an enterprise people don't have unlimited tokens to spend especially in marketing functions let's say in e-commerce you have to generate banners that align with a certain brand guidelines across so many different countries. Let's say in Southeast Asia where there's like ton of languages and you just want to change the text or specific graphics but retain all the other elements as is.

You just want to be able to manipulate those layers. Or there are instances where you want to combine camera generated imagery with some parts of human written copy and fill the other pixels with something that's AI generated. And you should also be able to combine a constellation of models because as more and more models come out, you probably want to delegate different aspects of design to different models. And this architecture allows for it. Because today, if you want to oneshot everything, that is an engineer's idea of how a model should function for creatives. Whereas creatives, creativity is just inherently incremental and iterative. You walk a few steps backward, then sideways, and then you probably decide you want to scrape the design and start over. and the current models do not allow for that.

We also came up with a multi- signal reward-based learning system where design is easy to game if you just use human preferences. And especially when you work with brands which have different expressions of taste, you want to be able to come up with part of rewards that are human-based preferences and augment that with certain objective uh rewards that measure whether the output is valid and meet certain design principles. And then we have two models.

One is an AI judge that is able to increasingly update itself on its rubrics so it can get better at discriminating good from bad and then use that updated AI judge to retrain your generator that can continue to get better because design has a shelf life and you constantly want to expose really good examples and train your model to be up to speed and also build an architecture where you move beyond textbased prompting so that you can capture different types of interactions that can be part of the training loop. This is not the reality today. I never smile when I'm working on evaluating any of the image generation models. But if there's anything you want to take away from this talk, it's that my slides were all inconsistent and all over the place. And that's how AI models are today.

No matter what the Twitter hype or LinkedIn hype is, models are very poor at layout planning. Getting visual consistency at scale without human intervention and editability, especially layer level editability is extremely extremely difficult. So if you're interested, you could scan the QR code. Uh we have the hugging face link, GitHub link if you want to use the data set that we have open sourced and we've also put out a graphic design bench. You can use that to train your cloud agent or uh you could also try to use that as eval uh if you have internal models that you're training or reach out if you're interested in this space. Thank you. >> Y'all are such a wonderful audience. Always clapping for your speakers when they do a great talk, which is all the time. Great. One more round of applause for Priya, everybody. So good. So good.

Our next speaker uh is is so cool. He he has an amazing story that you're going to hear in just a minute as we introduce him. Uh I'm not even going to introduce him. I think he's a pro now. Give it up for your other MC. Everybody, Usman. >> Thank you. All right. So, now we're going to be introducing our uh next speaker who's come yet again all the way from San Francisco to Singapore. And uh that's a 17. 5 hour flight by the way. Anyway, he's come a long way throughout his journey where he came from zero to hero. He used to live in a hacker house. Uh specifically the closet and and uh he was a college uh no not college uh high school dropout at 12 and now his company uh what was your company? >> Hyperspell. Uh now his company Hyperspell has come such a long way that it has raised over $6. 7 million not 67. >> All right.

Hey, give it up for the announcer everyone. Let's go. You did an amazing job. All right. How are we all doing? Final day of AI engineer. Let's finish strong and make it happen. Hey everyone, my name is Connor Brennan Burke. I flew all the way here from San Francisco. 17-hour flight. I am incredibly jet-lagged right now, but we're going to push through it. Woot. Let's go. All right. So, we at Hyperspell build company brains. And what I'm going to tell you today is how to build a company brain. That's right. How to make it so that agents actually understand how your company works. And this doesn't work. All right. There we go. All right. So, I think this is a theme we've heard quite a bit today from different speakers. Um, your agents, to put it bluntly, are clueless geniuses, right?

They are they're like this um, you know, savant, PhD, slightly autistic intern that is absolutely brilliant, but doesn't know anything about your company. Every single day for them is like the first day at work. They blindly follow uh, whatever they read. They're kind of naive. They'll take instructions and just go with it. And so you need humans to watch over them. The problem and the thing that gets us to AGI is not better models. The models are already brilliant. It's getting the right context. Your agents are clueless geniuses and the lack of context is the reason that they don't yet get work done reliably. All right. And so the question is how do you solve this? So the obvious answer is connectors, right? We've all done this.

We've said all right I'm going to give my openclaw access to my slack my drive and my notion I'm going to use connectors in anthropic and claude and chatgbt but the problem with this is again as we've said agents are kind of naive anything they read they assume is true but it turns out that documents themselves are not actually often true um so they'll find a doc they'll miss the correction they'll find an old version that's out of date uh if there's two different sources they'll conflict with each And whichever one they find first, they'll interpret it as true. There might be the same person mentioned in Slack and Gmail and Notion. They don't realize it's the same person. They're like, they think it's five different Lisas instead of one Lisa. And there's also no recency, right?

You find old, deprecated, outof-date documents and they try to operate off of that. Um, so connections give access. They don't give understanding. So everybody here, I know not everyone here is working yet, but the folks who have, how often have you gone in and started a new job and read a document and been like, "Oh, okay. This is our strategy or this is the process and then you go do it and you talk to somebody and it's like, oh no, that's out of date. That's no longer relevant. You got to talk to Bob instead and Bob knows all of it and like talk to this person. " How many people has that happened to almost every single person here, right? And so the thing about this is that by giving connectors to agents, we are assuming that truth is in documents. But that's not how things actually work.

So the source of truth, as we call it, is rarely true. It turns out that the moment information is created, it starts to become out of date. Documents themselves are a lagging indicator. You might have a reorg change or a customer exception or a new deploy. And so reality gets far away from the dock and it requires human beings to update docs to make them true. And so how companies actually operate is you have the extremely messy reality where there's Slack threads and meetings and emails and exceptions and all these things happening and then you have this document. So people try to record stuff but we're all not great at updating documents and recording them. And then you have what's actually true. And so often, as we just said, the way to get to what is actually true is by asking someone, right?

You ask your boss, you ask the person who's been there for like five years who has all the context. And so, human beings are good at understanding this. You know, not to just blindly trust any document you get in any process. You ask the person, but agents don't know to do that. Anything they read, they assume is true. And this is why you can't just let them run across your organization. If we want to deploy agents at scale, we need to give them a source of truth. So how do you solve that? You create a company brain. So every single organization needs one single source of truth for agents. One company brain. Now what is that? It's not just connectors. It's not just rag across sources. It is one source of truth that has confidence in it. That understands who created this document.

that brings in threads together from the email and the slack and the noten and the messy meeting that surfaces conflicts and identifies okay there's two different sources that say different things how do we resolve between them it figures out what reason and it creates one source of truth that agents can actually trust what does that give you it gives you better answers it gives you aligned agents it gives you durable knowledge it makes your organization ready to deploy AI the reason that so many enterprise AI deployments fail is they try to deploy agents that agents read the documents and there is no company brain for them to operate off of. This is the thing that we need to make AI actually work. Now the other nuance here is we assume that context is human generated but that is not true anymore.

So traditionally, you know, you have people in meetings and slack and docs and emails. But now you have miscontext, right? You have the meetings you have with humans. You also have all of your prompts to your agents. You also have your open clause memory. You also have the traces and reasoning. And that is context too. The loop that you go with with your clawed code to get to the final output. All that context is useful. And if you don't put that into the brain, you're missing a huge source of context to get to the final output. And so we are all moving from organizations that are mostly human to mixed to in a few years, the majority of context is actually going to be created by agents and all of it needs to be in the brain. Now there are different types of knowledge inside of companies that you need to put in this brain.

There are stable facts, so things like legal identities, your org chart, your brand colors. Then there's process knowledge. How do we do onboarding? How do we do deal review? How do we respond to incidents? There's also tacet wisdom, things that are only in people's heads. And so remembering, okay, how to close this customer or what's a better sales strategy or uh this particular test is going to be flaky or this integration doesn't work well. There's all have this tacid knowledge. It's very rarely written down and it's very rarely in a single source of truth that you can make work. And then finally, you have stateful reality. So you have open deals, active incidents, today's blockers, and the company brain needs to have every single one of these and store them all differently.

In order to get to a source of truth, you need to know how quickly things are evolving and what that central place is. And so the way to actually build this is we found from our experience working with everything from tiny startups to mass and Fortune 500 customers is that you want to start by ingesting all the data and so pull in all the sources of truth again your Slack your Gmail your notion your GitHub more and more now we have meeting recorders as well you also have your agent traces right agent generated context now for example meta is starting to do this and meta is even logging keystrokes some people have recorders of their screens. You need to embed all of that. The next thing you need to do is create a context graph.

What a context graph is is it's a single graph entity that finds every single fact in the organization, understands when it was true, who authored it, how confident are we in it, and embeds that all in one place. But the thing is context graphs and graph databases in general are not great UX for agents. Agents are not post-trained on them. They don't natively have an understanding of how graph databases work. And so actually the best way to represent this for agents is a file system. And so you create a file system with for example data at the company level. Who are the people in your company? Who are the prospects? Who are the customers? You have your decisions? You have events. Below that you have files for each team. And then you have each individual. And the great thing is because file systems are universal.

You can use them with cloud code. You can use them with cursor. You can use it with open claw, nano claw, internal agents or even your own personal agents. Now let's talk through how the brain gets built. Um the first step is context capture. So you have all these messy sources. You need to ingest all the historical data but you also need to get it real time. Something like Slack has real time context and if you miss that as it's happening then your agents won't have up-to-date information. The second thing is normalizing it. So we talked about uh that understanding the LISA in your emails the same as the Lisa in Slack getting to one single entity dduplicating it structuring it. The next thing is synthesizing.

So sometimes data conflicts and when data conflicts you need to actually bring that to humans to say okay we have this trade-off which one do we choose and then finally you serve it to the agents and have a single source of truth. Get the agents the right context at the right time so they can get work done. The hard part is actually not search, it's synthesis. It's bringing all this information together. Is anyone here familiar with Carpathy's second brain idea? Okay. Does anybody have a second brain already? And has anyone here used obsidian as a source of personal truth or uh second brain? So this is exactly that but for your entire company, your whole team, every single person in your organization and every agent in your organization. Now what this enables is you get to a point where the company starts learning from itself.

Every single action that you take creates context. The humans execute, the agents execute, work gets done, new context gets created. All of those traces then are synthesized and put into the brain and then future execution gets better. Imagine where every single claude code instance can now take those learnings and those new uh takeaways and share them with the entire organization. Every salesperson if they learn a better way to sell that's instantly shared with everyone. What this enables is companies that improve recursively over time. Traditionally we've had people context drain where people walk out the door and then take their contacts with them. Now you can get organizations that self-improve, that get better, and every single person, every single agent constantly makes it better without adding another meeting. So we are Hyperspell.

We believe every company needs a brain. We build it for you. We are contracts infrastructure for AI agents. If this is a problem that you want to solve, you should find me afterwards. Send me an email or find me on Twitter. Thanks everybody. >> Wow. Just wow. Such a genius, bro. >> Give it up for this guy. >> I can't believe just an ordinary guy like him could make a a masterpiece like that. Uh I believe all of uh everyone's brands or companies deserve a big brain like his. Please welcome our next uh speaker who is hangong hang hong Lee and he has come to show all of us that uh we can all shift fast with code and how like you can uh do something as good as him. Thank you. And please give it up for Hangong. >> All right. Thank you, Usman. That was great. Right. Thank you everybody for coming. Right.

Today I'm going to talk about the three primitives that we need to ship fast with cloud agents. Right? Everybody wants to ship fast and I was telling man behind that we should like multiply ourselves, right? How can we multiply everybody and right now? So like we are light sprint, we are current YC company, we are three Singapore founders and we are three curious Singapore founders. We are looking to um figure out what the nature of work is going to be in the AI age, right? The nature of work is changing really fast with like we three of us we have a bunch of experience doing product doing engineering and we're trying to figure out what that means, right? So right now we're building cloud agent environments, right?

We're helping teams build their environments up so that their whole team can ship um make changes to codebase and existing code bases um reliably, quickly and safely. Right? What is cloud agent? Right? This this slide you probably you know know everybody's talking about cloud agents today um and yesterday and the day before. So quick one most cloud agents are mostly from a managed environment. See, they are like a coming out of the the cloud basically and a service that's set up usually set up by the company, right? They're also non-interactive. So, you fire them off and then they go around and they build something and they come back with what they built works in the background. Sometimes they're called background agents. Some people confuse cloud agents and background agents. They're the same thing. They just work in the background.

A quick a quick kind of like primer bring everybody where kind of how we got here, right? Starting we started with the agent inside the computer helping us type cursor. I was an early cursor user that was super fun like command K and everything. And then we had coding agents cloud code cursor again, right? Everybody is inside our computer. It works when we work. It stops unfortunately when we stop. But now today like cloud agents basically they they are everywhere. They work for us all the time. Um, if you know how to control them. And so today we're talking about that. The promise is great. Like cloud agents hope to reshape your organization. They want to kind of um build out like your backlog, finish your backlog basically like you know they can build anything. Um, anybody can kind of like put together stuff.

And the last thing is like similar to what Hyperspell is doing like you know they the promise is that they're going to learn your organization and help you make operations better. The best of these companies are already using cloud agents. So uh they get a three to 5x like improvement and sometimes even more. uh some of the startups that we're talking to are using them very effectively and a lot of people are seeing that they like the number of PRs that are merged just created coding agent created PRs are rising incredibly fast. Okay, so now how to make cloud agents work for you instead of against you, right?

So um a lot of times like you might not give the cloud agents the right context and so um we want to make sure that like you want to make sure that you give the right agent the right context and you want to do um make sure to give them the right plan and context. The other thing that you want to do is to make sure that you know where the agent is at any point in time. Right? So you want to make sure that you have the cloud the agents doing the work that you're you're asking them to do and you are able to check where the agents get stuck or are they currently stuck or are they still working.

The last piece is as an engineer I feel like uh it's super important like is if if my whole team is going to ship code to me I need to review them and if I need to review them then I need to spin up uh coding environment for that and then I need to make sure that that works and the worst is it doesn't work and then I have to go back and tell them that it doesn't work and they need to um rebuild the PR which I could just do by myself. Right? So at Lightream, we're thinking of it in three primitives. You need to plan properly so to make sure the agent has the rest has the best uh stuff. You need to orchestrate, you need to make sure that you know where the agents are, and you need to preview. So I'm going to jump real quick into like, you know, our app. And so I I feel like I've talked a lot, but I haven't shown anything.

Um so right now I'm trying to I'm going to introduce you our application. Oh, this is the middle. Ah, yeah. So this is the Lightprint platform. And so the live stream platform is basically a workbench for your team to collaborate, right? So you can see that like it's just a bunch of um camb boards and a lot of tasks and stuff like that. And then yes, plans at the side you can see. So basically what's happening here is that we are basically helping uh to create tasks. So we we put the prompt in the task format and so it actually is grounded by the codebased context and basically able to quickly uh enrich your task with a lot of information so that the coding agent can like get launched. So we support a whole bunch of coding agents. We have cursor, we have entropic, codex and and these are just harnesses under our system.

And basically we have our lights cloud agent which is also a harness around that harness. Right? Once you launch the cloud agent you can basically click through to dive into the codebase into the the code inside. So what you're seeing now is plan mode, right? We we want to change this screen. It's kind of boring. It's not AI. is just basically a list of your recent tasks and your recent plans, right? So, let's uh let's kind of use our plan mode right now. So, we figure fire our plan mode. We support right now Gstack and our own Lightrint plan mode, right? So, we're using our current light plan mode. And basically what it does is that like the idea here is that we want to create a multiplechoice with recommended and with other and that's really like our favorite like use case, right?

like people everybody loves choices and everybody loves like you know um the AI coming up with choices for them right but we also love visual right we love like to see the mock so we also made the AI kind of restricted the AI to kind of say please make a good experience for the user by showing them something visual right so we allow the user to pick like different options even make more options right say okay you know please give me two other new options and then those two options will just be added along and then you can pick from those as well. Right? So we we didn't really do much to tell the agent not to do things but we just basically gave them a set of like um uh guiding principles.

So, what you get at the end after you've made all the selections, you actually get a full-fledged like inapp um uh preview of how your your feature is going to look like. And it's kind of interactive sometimes, depends on what the AI chooses. And then here, you can also make changes to the colors in in our case. And after that, we're going to generate a full like spec that will just send it over to the coding agent, right? It's going to put it on our bot and then we're going to send it and then um we'll check back in maybe 20 minutes later to kind of get get that sorted out. Right. So now it's like picking the agent and shipping it off. Yeah. And a few moments later. So now it's done. And so now we can actually go into the preview part of our system. Right. So that's really a big thing for me.

Um it's able to look and click through the app. And this is basically once set up for any software factory out there. If they don't have a preview mode, you have to ask them, hey, you know, how how can my guys preview the app that I have created a PR for? Because that's super important because nobody hates a PR to review more than the one that doesn't work, right? So like we leave it for the um all the whole team members able to preview the app before we send it over. So we have been using light sprint at light sprint and we've achieved a lot of success having super a lot of fun doing things in parallel and also kind of like doing things in p like on on local host. So we are mostly cloud agents.

So if something is like a a mobile bug or a you know a small issue that that people talk to us about we'll put that on the board and then we'll fire off a cloud agent to do it. Right. So lightrint will build cloud agent first right we we have plan we we think that people should plan with uh a plan and you can use our visual plan wizard right they should orchestrate and they should preview um and that's super important right so here are my socials and lip sprint like links so feel free to take a screenshot and use them and thank you so much for attending this Ah, woo. Thank you. Hang is such a cool product. I was watching the demo back there and I was like, "Whoa, I I can be a full-on product manager now. " Right. That's so cool. Thank you so much. You know what I've noticed over the past few talks?

I've noticed a a consistent color scheme. Have you noticed this, too? Uh, right. It's it's all clawed. No. Anyway, um, no no no disrespect. Everybody's got this orange thing going on, and I'm just like, whoa. It's it's a it's kind of interesting. It's a little bit derivative. Anyway, um our next talk I'm excited about this one because hey, can we get a round of applause for the organizers actually? They did such a good job. They did such a good job. Really, really a Grim Sherry. Everybody did such a great job because you may not know this, but the talks are structured in such a way that they they lead into one another. Okay? And it's so cool. There's a natural order here. So the previous talk was about um the the the project management side and the next talk is also about that.

There's maybe a hot take here which um you know the Louis our next speaker will clarify but the hot take is in the future we will likely just plan and orchestrate agents that write and ship code. So so the job of writing and shipping code moves and we just become planners and orchestrators. Um, and and that's kind of the thing. Lou is going to tell us a story also about his previous business um that that tried to get traction but couldn't. Uh, and and you know, I'll say this. W stands for win and L stands for lesson. And so he's going to have some lessons learned here. Uh, please, your biggest round of applause for Louie. All right. How we doing, Singapore? Woo. Let's go. Oh, it is 5:00 p. m. on a Sunday. Let's keep the energy high. Okay. Last thing standing between you and a cold beer, maybe. Um, okay. I'm Louie.

Uh, I until very recently was the co-founder of a startup called Vibe Camban. Um, I also run an AI community in London called AI tinkerers. So, if you're ever in London, come along to an event. You'll have fun. Um, and what I want to talk today uh about is why I was building this startup and why I shut it down. How basically the job of software engineering is quickly becoming essentially plan and review code that is generated by AI. Uh and I guess I don't know how many people in the room are kind of interested who who is like a startup founder or is going to found a startup at some point in their lives probably. Okay, good. Well, I I will try and talk about like some of the reasons why we ended up shutting the company down and like maybe what could be gleamed and learned from that as well at the end.

Uh so very quickly I'll tell you about what it is we were building. So you got to go back to ancient history. It is May 2025 and my desktop starts looking a bit like this. I've got like loads of tabs open. Claw code has just dropped and I'm trying to juggle running multiple agents in parallel. And I started thinking this is kind of a completely new way of doing my job. and what is going to happen when accuracy goes to 100% and I no longer actually need to babysit what the agent is doing. And I started imagining what that interface would look like. And essentially it's like all of the parts of software engineering without the codew writing part.

Um and if you think about a lot of the software that we have idees, debuggers, uh UIs for like testing, network requests, things like that, most of the software that we use is actually for writing code. And so if you eliminate that part of the job and you're just left with the planning part and the reviewing part, uh you could come up with a radically different UI for that. So we started building Vibe Canban and it's kind of in the name basically. It is a canban board where you create tickets kind of similar to how you would do in Jira. Um but the difference is you can click on any of those tickets click a play button and then you will have an option to run it in codeex in claude code or six other different agents. And once something has finished running you then get a nice interface to review that work.

So one of the ways is reviewing the code obviously. Uh, another way is testing something if it's a, you know, a website or an app or something like that. Um, so this is all ancient history. Seems really obvious now. It wasn't very obvious in June 2025. And a lot of what we were doing at the time was kind of pioneering new ideas. There's a bunch of stuff that we shipped that we then deleted from the app that I'm not showing. So it took some experimentation to get here. So why did we do this? Well, it is because everything is becoming planning and review. Um, if you think about how you might budget your time for the different tasks involved in software engineering before the GitHub co-pilot moment in 2021, most of our time was spent in an IDE kind of scrutinizing code, looking at code to some degree.

And what's happened over time is that that has shrunk as a percentage of total work that we do. So you get the co-pilot moment and then you know suddenly autocomplete is completing a lot of code and then you get chatgpt and you're able to like paste in code and get you know another function out and paste it back in or you you no longer need to go to stack overflow. It's kind of you know making it a lot faster to iterate. Then you get cursor in 2024 and it's almost like you're still looking at the code but you've got this kind of chat on the side and then eventually you get to kind of where we are today which is claw code where to be honest I think you know there's a lot of vibe coding going on. You almost don't even need to look at uh what's going on.

Uh, and so I guess it kind of poses an interesting question of like do we get all of that time that we used to spend writing code back or has it just shifted work to other parts of the development process? I think the answer is probably a bit of both. I think it has sped up uh the the overall job of software engineering, but at the same time I am now spending a hell of a lot time more planning and reviewing the work that I have to do. It depends. And so one of the ways in which and this is kind of like more of a practical way to think about how this framework of like planning and reviewing is useful is I think you can actually speed up your work with agents if you figure out how to get them to be really accurate. Uh and one of the ways to get the accuracy of coding agents up is to spend more time planning. So what do I mean by that?

I mean the most basic version of this is like the codeex or clawed code plan mode. So just use it. I use it for absolutely everything. Uh the kind of the complicated version of that is use a framework. So there's lots of great specdriven development frameworks out there which I'm sure there's been talks on. Um and you can do this like interrogation method where you get it to ask you questions about the task that you're working on exhaustively until every possible question that you could have about a task has been answered. But the key thing is you're basically spending more time in planning before you ask an agent to do something. The consequence of that is that most times your agent will complete the work uh accurately and it'll need maybe one revision, two revisions.

The other way, and this is something that I think we're all a little bit guilty of, is you don't spend a lot of time planning and you suffer the consequences of needing to do a lot of reviewing. So, you know, how often do we just throw in a kind of loosely defined feature and, you know, complain when the model gives us something back that is halfbaked or is just completely missing the point. And so you you're more likely to go back and forth with the model more times if you spend less time planning. I think the other dimension to this is actually the type of work. And this isn't something I've really seen talked about too much. It's kind of a halfbaked thought, but if you think about the types of engineering work, feature development is just radically different from migrations.

And so these different workflows around spending a lot of time planning versus you know uh and and maybe if you're doing that you're able to run more than one agent at the same time versus a reviewheavy more human in the loop flow where you're not running things at the same time that probably favors uh more front-end work. So you know it's difficult sometimes actually to express all of the requirements for a complex front-end feature. there's a lot of interaction involved. There's a lot of visual uh you know things that need to be communicated uh versus backend where you're describing logic and it's it's much easier to kind of find a common language I find when you're describing backend logic and therefore you know the planning and running multiple things in parallel tends to work a bit better for me in those situations.

So, uh, TLDDR basically if you spend five minutes planning, you will probably save yourself a lot of time reviewing. And I recommend always, you know, pushing the the slider that way whenever you have the chance. Okay. And then we can use history to kind of figure out where things are going. So, GitHub Copilot would run for a few seconds before giving you a result.

uh you know the original version of cursor 2024 would work for you know more than that 30 seconds before yielding a result and we're at claw code where it's kind of running for like five minutes before giving me a result on average and so the reason that's happening is because there's increased tool use so we've got agents giving you a response agents running a type checker then giving you a response agents running a type checker then using playright then giving you a response and you can extrapolate that like you know as more and more jobs are brought into the loop. Basically, the time that coding agents are taking is increasing. And so, we're at this interesting point in the history of coding agents where we're about to go really quite far beyond what is comfortable to sit there and watch.

Like, what are you going to do when a coding agent runs for 20 minutes? You're not going to sit there and watch your terminal, you know, like twiddling your thumbs. I mean, you might procrastinate and end up on Twitter or something like that, but I I don't think that's a good use of of of my time, and it gets boring very quickly. So, you know, if I had to predict, I would say a year from now, you know, we're probably looking at, you know, these things running for half an hour, and we'll need to find ways to paralyze this uh a lot more. Um, okay. And I think I'm almost out of time, so I'm going to wrap up with some quick observations. I think basically the work that is emerging is is is managerial.

So if your job on a team of software engineers was to write a lot of code and not do a lot of review and not do a lot of architecture and all the other things that you know maybe would be associated with like more senior or tech lead roles. All of the other stuff is basically going away the code writing part and what will remain are all the kind of traditionally managerial functions. Um, and yeah, I mean, we should be building experiences and interfaces that maximize the focus of the developer. So, things that keep them focused on what is important, eg planning and review. Right, I'm going to have to wrap there because I'm out of time, but thank you very much and it's great to be here. Thanks, Singapore. >> Keep it going for Louie, everybody. >> That was an incredible talk. I am a I am a manager now. We'll give it to the next speaker.

I'm a manager now. Hey, how we doing? How we feeling in here? What? Why are you even here, man? Go sleep or Anyway, he wants to do something. >> All right, so let's play a little game. Can you guess what our next speaker Wait, no. >> What? >> I didn't mean what, bro. My >> Can you guess? >> No. Can you guess where our next speaker flew in from? Uh, shout out your answer by the way. Your options are Singapore. I mean like he stayed in Singapore of course and then Sri Lanka or yet again San Francisco. Shout out your answer. Come on. >> SF. She said it. Where is it? >> San Francisco. Hey, we're on a San Francisco train. Yeah. Give it up for San Francisco, everybody. >> Way too many people from San Francisco. >> That's that's where it happens, brother. AI engineers. >> That's where dreams are made. >> Yeah. Yeah. What a great quiz. Thank you.

Thank you so much. Give it up for your coc everybody. Usman, our next talk >> comes to us from Harsha who works at Interphase. It's an AI research lab and he's going to talk to us about how they train specialized coding models with a new architecture beyond the transformer. So your biggest round of applause for Harsha. >> Thank you. Thank you. Great intro by the way. Good evening everyone. My name is Harsha. I'm the co-founder and CTO at interface. We are a research lab that is reinventing transformers. Today I want to talk about how we managed to build a new architecture for deterministic developer tasks. Now it is no mystery that in the past two decades AI has gone from being a rigid machine learning model to a larger scale generalizable uh intelligence which you can use today for AI workflows.

We've gone from building uh structured fine-tuned models to today prompting that allows you to build agents. More specifically, think about this early 2010s to 2015s. You're a bank. You want to do OCR. How would you go about it? You would have to purchase or procure large data sets. Not only that, get a talented team to build that model, deploy and then maintain it. This could easily cost you about a few hundred to even millions of dollars. Thanks to invention of large language models, we are able to do that with prompts. However, there's still a problem.

problem of hallucination though models like GPT are now massively multimodel and we are seeing it with Gemini they still hallucinate this happens because of context drift when you want it to be behave deterministically for large inputs of data you see hallucinations happen we at interface are solving this exact problem by designing a new architecture that we train so we bring the uh rigidity of a large language sorry machine learning model and the flexibility of a large language model. So how do we do go about this? You use machine learning models as strong encoders for very specific tasks and then you use large language models to create the decoding phase of it. Today I want to showcase a few things as to what this model can do. I want to quickly showcase three things. I'm going to talk about it.

I'm just going to quickly run it so that we have time to talk about it. So first thing this is a real document. I want to extract data from it. Not only the text but I also want to detect the faces on it and also calculate his age to verify it. So we run interface for this. This is what interface gave us. Not only did it extract the text, it gave you the bounding boxes of where it saw the text in the image, the actual pixel coordinates. It got both the faces right. And more importantly, it managed to calculate the age correctly. That's right. Now, let me show you one of the specific model providers or OCR providers that also does OCR. That is Redu. A lot of you might have heard about it. Reductor did extract the text correctly, but it failed to do the other parts of it. Detect where the text is and calculate the age.

Now, this happens because of a stronger encoder. Let's go to the next one. We want to scrape this particular LinkedIn page. Surprised that Gary doesn't follow me yet, but okay. So, we want to extract Gary's experience. LinkedIn can be a pain to scrape because of the blockers and bot checks they have. I want to extract his experience beyond this button. Now, that's going to be interesting. So, let's see what interface did. Not only did it give us what it saw on the first first page, but it goes all the way back to his internship. We are able to do this because of our own script model which is able to scrape uh LinkedIn. And lastly, I want to go about uh a PDF, a dense PDF. Sorry. Uh so I just have to run this again. So on this screen you're seeing a dense PDF that is supposedly a research paper for this particular model.

We want to extract this entire text and translate it to Hindi and also count the number of characters in this PDF. As it runs, I want to go back to the presentation cuz that's going to take time and then talk about it. So now that we saw the demo as to interface can do, I want to talk about how we managed to do it. I want to talk about what we actually trained. How did we do OCR? Before that, I want to showcase where we stand as well. on your screen. This is M OCR bench which tells you how good a model is at handling complex documents not only from research papers but also complex handwriting for massively multilingual uh OCR. We are number one when compared to even specialized models like Chundra OCR or even specific providers like Redu. This is the example that you saw and this is the output that you saw.

What is happening under the hood is that this image is fed as input to the encoder that we trained which is at CNN stack that tells you the text regions. Each of these text regions becomes the uh becomes the location of crop. So you crop the image from where the text is use that to give it to the decoder to gen generate the output. Now this gives you confidence scores. This gives you bounding boxes and metadata that you can actually trust beyond just simple text. we can go a step forward and feed that information to a larger model which is a decoder that we also have conditioned upon to get extract structured output. That's where the age aspect came from. You get the information and then you condition on top of it. That is OCR. Now I go to object detection. How did you manage to detect faces?

Now that is object detection with natural language. YOLO models are great but they only detect specific objects that they are trained for. We are number one for natural language object detection. Meaning you pass in a prompt. Let's take this room. I give a picture of what I'm seeing in front of me and I say detect everybody who's wearing a black t-shirt. Interface would be able to do this. That's a complicated thing to do. How are we able to do that? So you take the same image, you have a text encoder which is encoding the text aspect, understanding what the user wants. You have an image encoder which is understanding or representing the image in positional aspects and then creating a contrastive segmentation meaning it is pulling pixels which are closer to each other allowing you to detect the objects accurately.

If you use that information a step further you can now segment those pixels. Same thing image encoder, prompt encoder, and then you have a mask decoder that will classify all pixels to give you a latent mask. ASR multimodality is a huge thing. Not a lot of models support speech out of the box and I want to talk about it today. We we are one of the fastest models when it comes to ASR and we also have the lowest VR per error rate. So how do we do it? So when you give alarm form audio, we first detect wherever the speech is happening and then crop those audio clips. So we get the chunks and then use those chunks to extract acoustic features for an encoder which is also trained to extract embeddings for feature features. Now these embeddings are used to cluster.

Clustering allows us to segment uh segment features into groups and that gives us dization output. So now you know which audio is by which speaker but the text comes from the encoding part again where you convert the audio into a spectogram. A specttogram is basically a visual representation of audio and then you use that as a frame to generate or classify text. So whatever the pronunciation is that would particularly be classified into text. So before going to the next thing let's see what interface gave for translation. So for this you can see that interface not only managed to extract all the text and translate it to Hindi but it also stayed relevant and safe where it's not supposed to. Like it did not translate addresses, it did not translate author names and it also correctly calculated number of characters.

Now we put this against claude 4. 7 opus to see what claude would do. We gave it three tries and that's why I went back to this. Claude failed all three times. It did because of a timeout. But even if it were able to and if it is a long horizon task, there's a problem with multilinguality, especially with South Asian languages. Let's go back. So we saw three things vision, audio, and text. While working with these three encoders, we train these adapters to work with the same decoder. So you would get accurate data, but you know where that data is being extracted from. You could solve multimodality this way. Today I'm super excited to showcase our numbers as to three modalities that I was just talking about.

We compare these we compared interface to models that you would traditionally use in production and these models are economical and can do the tasks in one shot. But we are comparing them for deterministic tasks tasks where there is only one output. If you're looking at an image, my name cannot magically change. It's going to be still hersa. Yoan and I and my team have been researching about how do we build uh task specific models for about a year now. We did the same things. We picked small language models. We procured large data sets for a lot of money and we kept running into the same problem of determinism. Models hallucinate. That's where we thought we have to go back to the board, redesign the architecture and rethink it. We observed that data is not the bottleneck.

The architecture is and that is what interface is supposed to solve. Lastly, it's been such a pleasure to speak in such a amazing audience and such a beautiful country. Thank you interface. Honestly, those benchmarks were so impressive. Thank you. That was incredible. Um, what fantastic benchmarks. What's up, Usman? >> Hello. >> How's it going? >> Great. >> How's it going, everybody? >> You know, I swear to God, you as an audience, you make me feel like Michael Scott. You watch the office. You know what I mean? I'm just here. Am I entertaining you? And you're like, "No, I'm ready to go home. " Don't be ready to go home. It's not time yet. Okay. I need you to like be inspired. Are you inspired? >> That's better. That's better. Usman, what's up next? >> Well, now we have um some guy named Harishi. Fun fact, >> this is so awesome.

This time he's actually based in Singapore and >> Singapore tech. >> We love Singapore. >> Yeah, >> that's it. Energy over. >> Um, >> continue, please. >> Okay. Um, he his uh his app is actually based on how his personal mistakes with AI and especially coding. I'm pretty sure all of my vibe coders here can relate as to how much mistakes or errors or bugs we've all come through and push. >> Look at this wallpaper too by the way. >> Oh wow. >> Right. That wallpaper is how you know it's going to be a banger. It's so cool. Are you ready Hi, >> you are good to go. Hey everyone, again your biggest round of applause for Hish. >> All right guys. Okay, so that actually was a custom version of Bliss that we made from a talk that I gave at the unconference called how to leave Greenfield. So if you don't know Bliss, at least you know Greenfield.

So this is welcome to no country for all code, right? And it's a working title. I think everyone keeps changing titles all the time. So it's not a talk about coding agents. It isn't a talk about agents for coding. It's a talk about building agents inside large existing systems, right? With old code, organizations, and data because that's what we end up doing, which is about repairing over rebuilding, update over create, about old code and organizations over new. And it turns out if you start from those base priors, a bunch of different primitives fall out, right? You prefer simpler reusable units of work instead of trying to oneshot context windows, right? You remove things from context instead of adding things. And you separate control flow from prompts and prompts from code.

And you calibrate for behavior instead of step-wise success and failure. And you build cost aware systems that separate build and runtime so you can percolate resources effectively. And turns out if you do all of those things well, you get to ship outcomes and you get to do things once and have them stay done. You get to fix things and that break and have them stay fixed. And you get to vibe when you want to, right? which makes it so much more fun. So that's really that's the bulk of the talk. I'm just going to spend some time explaining that but if that's good uh we can go right into it.

So before this I spent a few years in electronics and software and where the bottleneck was like always data and it was getting it into a shape where it can be useful for a decision and after sort of thinking about it for a decade I started Southbridge with that conviction that 3. 5 turbo was this unlock right that last unit of general intelligence that we needed and we could build the rest since then we've built connectors for data systems that self-heal regenerate we've built ETL systems for healthcare financial energy We're beginning to solve ingest I think as a species but also as a company. Ingest as a horizontal category right whether that's for new customers, new data sets or even user uploaded data. And everything that we've done since we got founded was in service of solving that first mile problem for data with AI.

But the problem with starting with data though is like your difficulty from day one is turned up to 11, right? Because you start in the critical path and your work needs to be long horizon from day one and like reliable as a baseline. A full run on even small data which is like a gigabyte right verifying formats, validation, resolving entities all take like millions of operations and that those errors stack up. Context windows if you remember Gemini going from 2 mil to 1 mil actually started going backwards right but even if they went like a 100x we would still have way more data in like a day than what you can process. But then again, the most important sort of biggest killer of data companies that I've seen is diversity, right? Data as a stack as a whole is very very diverse. Both in the macro and also the micro, right?

In the micro, humans, us as a species, turn everything in we can into a canvas. Documents, Excel sheets, PDFs, like the like internally we have the joke that the the merge cell button in Excel was one of the greatest crimes against humanity. And in the macro though, companies really are unique, you know, TM snowflakes because you've got different stacks, programs, SOPs, security boundaries. Even the same database like little Postgress viewed through different internets and permission systems looks like completely different systems. But one important separation that I want to make here is between online and offline agentic systems, right? And it's a it's a useful way to think about these things.

um like online versus offline is things with a human in front of them and things without right and as much as like I want to we want to stroke our egos most real systems have far larger offline components and online ones especially all the ones that we've worked on right you only really need an active latency sensitive human in the loop if you build things fresh every single time like if you can build reliable systems that oify over time and record your preferences like all of that work can move offline to run overnight on local models for cheaper and that agents can function like appliances. They can do a job repeatably thousands of times, right? You fill your dishwasher at night before you go to bed. The next one is that I we still believe that coding agents are going to become the base substrate for agentic work, right?

Not because all agentic work is coding, right? In fact, I think we'll saturate on coding very very soon. But because the coding agent loop is becoming the thing with the most amount of resources, the most RL, the most deployment pressure, and it's got universal primitives, read, write, edit, shell, right? And in the same way that V8 and browsers became the substrate for a crazy amount of software that wasn't actually websites, coding agent harnesses, we believe, will become the engine layer for a lot of agentic work. Okay, so that's enough about the the general structure of things. What did we actually learn? Right, the first thing was to stop pushing one shot, right? Single shot performance I think can be crazy fun to push when you're building things and like same here like you know complex instructions, long plans, giant skills.

I think Sabina was talking about fries and more fries in the back compaction. But repeatable work, which is where we said runs counter to all of those instincts, right? It just is not how you want to build. if you want self-driving agents, right? Because the first thing you want to do is break things into small atomic pieces and in Hankqu which is this runtime that we use and we've used for a long time and sort of recently open sourced those small little boxes are called codons, right? You chain those to then get the behavior that you want and you make them reusable and composable. And if you break it down this way, it makes it so much easier to reason about long runs, which ends up becoming the bottleneck.

your ability to reason about what happens at hour 20 or hour 25 like you the human ends up becoming the bottleneck to you building complex software right the next thing is to remove things from context right I'm still surprised at how few harnesses systems uh just frameworks out there have a way to remove things from context right like the default behavior we've always had is have boundaries that delete context and archive what you don't need right preventing this thing that internally we've come to call world line rot which is you know Ted Lasso says be goldfish ends up being a good thing. The next one there is to just separate separate components by type. Like as an industry, we keep having to relearn this, right? Back when I was in college, we had van Harvard architectures with like code and data separations becoming a thing.

And then later on we had PHP and like CGI and it took us another four years to learn that you had to separate model view and controller. And agentically, same story, right? you if you want to build reliable systems you want to keep these five things as separate as possible like data promps control and the rest right and in the last year we've worked with a lot of people we've touched a massive amount of information we've read millions of words of AI generated results like I I like I said that's our you know superpower which is that we read the outputs we read the outputs for you and we read everything that comes out of these things nine times out of 10 if something breaks it's because there was a wrong abstraction shared shared between you and the agent or because something was left in context that just did not need to be there.

So going into a bit more of our things, right? Like we usually build on the principle that the best part is no part, right? So simple tools sequence work like we talked about and you only add things if you absolutely have to. So I hope it's not a surprise when I say that we've never actually needed parallel agents, right? A single primary agentic thread for us in the line of work that we do for reliability has way too many benefits to give up, right? So many programming languages, Python, JavaScript, a lot of them will agree and we'll look at some of the benefits on our side in a second. But for our version of the event loop, that little hack is what we call sentinels. So we initially designed those things to monitor long agentic runs, but they've become our most powerful primitive.

So sentinels are LLM calls that trigger on some combination of events from the primary loop, right? They trigger, template their context, and then write the result to a file. A sentinel could wake up every 50 tool calls, summarize what happened, and then go back to sleep, right? But turns out they're amazing at catching behavior without creating so much complexity that you have to troubleshoot the the eval system. So laziness, mocking, bad data hygiene, file rights, shell errors. You define the pattern that you want in something reusable that we call the sentinel and then you fix it in the main thread. Right? Way more than hooks. This is far far better for us for coalesing behavior. So I'll do one more just one more which is budgets. Right?

Long horizon systems on our side just need to be cost aware on every axis that matters right but if you do everything that I said so far you can make a declarative budgeting system which is really the best kind like SQL. You can express what you have and the system figures out the gap in between. Right? In fastmoving spaces like AI where models, harnesses, implementation details change all the time, declarative actually wins because it keeps you from needing to rewrite things. So we've got all of the different axes, money, tokens, time, data access even in the right time. You express at build time how those should be distributed. And at runtime, you actually know what resources you have. So you can solve for the two things, right? So finally, if you do all of these things, you can ship outcomes instead of building tools, right?

And I say this to a room full of people, me included, who care a lot about the craft, who care about the tooling, right? But most people, they don't care how their dishwasher works. They don't care how their car injects fuel. They want clean dishes. They want to get where they're going. Like, so our northstar has always been to deploy systems that ship the outcome, right? Which might be getting a customer on boarded as quickly as possible, validating research hypothesis, cutting integration time, right? or just doing all of that without embedding what we call Achilles into your data. And for that agents need to become infrastructure. They need to become boring, repeatable, predictable. And so that is really just the goal for us, right? To build things that get to become legacy. It's only in code that really legacy is a bad word.

So in some ways you're trying to bring that back. So lots of things that couldn't go into the talk, but uh you can go here for for the long version. Thank you guys. Woo! Ah, Hershi, thank you so much. That was such a great talk. You know, I got to talk to Hishi backstage and I was already prepared. Wow, what an incredible talk. One more round of applause for Hishi, everybody. Oh my goodness. Incredible. Our next talk is is is another exciting one. I went backstage and asked him, I said, "Hey, what's your talk about? " And he said three words. He literally just said three words and nothing more. No more words were spoken, Henry. Um, the words were MCP versus CLI. That's that's the talk. And I'm really excited about how many of you um use MCP on the daily. Almost everybody. Wow. What do you use it for? You over there with the glasses.

What do you use it for? Debugging production. Awesome. That's actually a good use case. We um internally where I work, we use a project management tool called Monday. Anyone use Monday here? Monday monday. com. Um it's I'm not going to say anything. Anyway, um they they have a UI like a web UI, but they also have an MCP server, which is so amazing because I can be working on something in in cursor, my preferred IDE, not Spawn. Um and and I have the Monday MCP server inside and I can just say I'm going to this conference added to Monday in the agent and it just does that and it's so cool. So I'm a huge fan of Team MCP. Um but of course CLI also have reason to exist. I mean, Claude Code um is a is a CLI agent, a coding agent with an MCP client functionality, right? And so, how does this land? Well, we're going to find out.

Henry's just setting up here, and in a minute, we're going to hear about MCP versus CLI, which may not even be a versus. It could be an MCP and end CLI. Um, do you think CLI is kind of going out of style? Anyone? No. Yeah, of course not. Because if we don't use it, agents will use it. I think it's a fantastic user interface. I'm slowly running out of things to say. Oh, good. Look at that. Hey, listen. We're almost at the end of the conference. This is going to be a great talk. Give your biggest round of applause for Henry >> Mau. >> No, >> we have a bit more. >> That's okay. Oh, he's You got to You got to extend. Choose extend display. I'm tech support now. There we Is it Is it ready? No. almost. Okay. No, see what they're doing is they're extending, but he hasn't dragged the window. This is now commentary, everybody.

That's what I love it. Thank you. Oh, pity. You know, this is this is the You know what you call this? You call this pity applause. Thank you. I need it. Put a coin in my hat, too, while you're at it. There we go. It was extended. They dragged it. Okay, let's try this again. your biggest round of applause, Henry Mao. >> Thanks for the introduction. My name is Henry. Uh, hey everyone. I'm the co-founder of Smithery. Uh, today I'll be talking about the ecosystem of MCPs, CLIs, what we've seen here from Smithery and how that relates to giving your agents more agency. So, a little backstory. Uh, at my previous startup, Jenny AAI, we built an AI academic co-pilot for academic researchers. And one thing that really bothered me when I was watching users use our product was that they would often have multiple windows open.

Uh they would be using different apps along with track GBT and they will waste a lot of time copy and pasting between these apps and their AI AI of choice. And this is a broader problem that affects every single knowledge worker. Whether you're hopping between terminals, between your coding agents, or jumping between your CRM and Google Docs, we are stuck in a sort of copy and paste hell because humans were essentially acting as the adapter layer for AI. You were in the loop prompting the model for every single read and write access to different services. And prompting is really the tax that you pay when models can't access your data or take action on your behalf safely. And that tax is pretty expensive. So I started Smittery about a year ago to tackle this problem.

MCP just came into the scene and I saw it as a way to help bridge the gap between agents and services. So we started Smidy as an open MCP registry and we tracked a community of thousands of developers who published their MCP servers on us. We built uh a gateway that aggregated these services and unified authentication so that agents can conveniently access all your APIs grouped as a single toolbox. We currently process about 100,000 tool calls a day for our users. But our journey wasn't smooth at all. Uh if we're being honest, uh MCB had a lot of hype after launch, but also had a lot of issues. The protocol was definitely ambitious. It tried to build a standard uh while agents were figuring out how to call tools well and it had to change its spec rapidly in early 2025.

The implementations of MCP clients and service were poor and that led to a lot of frustration with users. So by the end of 2025, I think a lot of people started proclaiming that MCP is basically dead just as fast as it exploded. In fact, at least five people in this conference, I think over the last two days have asked me the same question. Is MCB dead? And we're going to get to the bottom of this because many of the criticisms criticisms that people have raised are valid. The main reason why people had bad experiences with MCP was that most harnesses back in 2025 had a very naive approach of adding tools into it into the model context. They simply dumped every single tool into the context window like this diagram on the right side.

And imagine, you know, imagine you're browsing the web with Chrome, but Chrome did imagine if Chrome did not render HTML at all. It just dumped raw HTML and CSS to you and ask you to figure out what to click. And that's what we were basically doing to models. A harness was dumping all the tools to the model and expecting it to do well. It gave the model information overload and instead uh instead of rendering a usable interaction layer. So this wasted a lot of tokens. It caused context rot and it degraded model performance significantly. And to make things worse, many MCB servers uh built back in 2025 were poorly implemented and basically watered down versions of their official APIs. A lot of them didn't implement proper authentication.

Um and developers would handcraft these uh prompts basically in the tool descriptions to try to prompt inject weaker models. Uh these were all antiatterns that couple task specific behavior uh which should really belong to a skill uh into a tool description. So the lack of a good developer experience eventually led people to look for alternatives. Um coding agents got good at bash. So the natural question people asked uh was why not just use the CLI. The CLI had many benefits. First the CLI had progressive disclosure built right into it. It had pipes so you can compose different subcomands together. Uh, and it's built on a mature Unix stack. But there is a hidden category error we're making here where we're comparing CLI to MCP. MCP stands for model context protocol. So it's a protocol, not an interface.

And comparing it to a CLI is a bit like comparing apples to oranges. And this diagram hopefully can explain it a little bit better because a protocol's job like REST and GraphQL is to define a standard of how to communicate, not necessarily to render uh define how tools are rendered to the model. What was missing was a good harness that renders MCB well to the agent and we refer to this as uh native MCP rendering. The good news is as of early this year in 2026, major harnesses like Claude and Codeex have finally built proper ways to render MCPs. So we wanted to test this at Smittery. How do modern harnesses actually perform when they use their native MCP renderer versus Bash and CLI? So here's the experimental setup we did. We ran a benchmark on three core APIs, GitHub, Linear, and the Singapore bus API.

We chose these APIs because they represented a diverse set of um API styles and also uh training data uh contamination. We also chose three different models listed here. Um and the main thing we changed was the interface we provided to the agent. So we either installed uh all these APIs as MCB servers on the agent harnesses or we provided a CLI to their bash interface. Our goal here is to measure accuracy and token efficiency. So, here's a question for the audience. Just a raise of hands. How many of you think that native MCP did better than CLI? Okay, we got some people. How many of you think CLI did better than MCP? Okay, there's more people. And how many of you think it doesn't matter? Like, it's just a tie. Okay, we got some people here, too.

So to our surprise, native MCP actually won in both accuracy and token efficiency, which really busted the the myth that we've been living with in the last year. Um, and that's be that's mainly because the model harnesses have updated themselves and became more efficient. But what I was more interested in here is what are the principles of agent experience design that really matters like what made uh what can we do to make CLI better uh or what are the principles uh of a harness that actually makes MCP uh work so well. So we did some ablation experiments by changing the construction of our CLI to see if we can match native MCP's performance.

Um so we did an experiment where we added descriptions uh better description to the CLI and we also did some experiments where we added um a search functionality to the CLI and what we found was these two things mattered the most out of a bunch of different things we tried. First is self-documentation. So if you provide agents with discoverable well-escribed tools it will perform better. And the second thing is search. If you provide agents with ability to search through subcomands in a CLI or tools with an MCP, it performs significantly better because this reduces the number of steps it needs to find the tool for the job. So, if you apply these two principles to your CLI, you can mostly close the gap in performance against native MCP. Full uh experiment experimental details are on our blog.

So, at this point, you might be thinking, well, I don't really care about token cost. My company's paying for it. uh or models will get cheaper. Uh you know, the results are close enough. I'm just going to use a CLI. And you're not wrong, right? If you're an engineer running things locally, you should probably just use a CLI. I'm not being sponsored by MCP, by the way. Uh we ran this benchmark after Smitter launched our CLI offering. So, we can work with both. But I do want to give MCP some credit uh where it's due. For one, CLI works if you want to set up a sandbox. But with a good harness, MCP works just out of the box. So these are use cases where you actually want to run a cloud agent um that is sandbox free.

The reason you might want to do this is because it will be more lightweight and have lower latency for lightweight tasks that are unrelated to coding. So portability is one advantage of the MCP. Another benefit is that MCP puts the responsibility of context engineering on the harness. So that means if cloud code updates and improves its harness and how it interprets tools, your tools will improve as well. But there's one more benefit of MSP that's a little bit more subtle and matters once you want to move towards a world where agents have more agency and that is permissioning. Because the major weakness of CLI that we found is that it's usually way too broadly scoped because it's made for developers and it has a huge attack surface when you want to run it with little supervision. CLIs give the keys to your kingdom.

So whenever you're running a CLI agent in the background in a longunning job, you're kind of stuck with two terrible choices. You either make an ask for approval, which doesn't really scale, or like most of you out here who are probably guilty of this, you're going to dangerously skip permissions. And the one thing MCB has here is that it defines an opinionated small surface. So it makes it uh so it makes it easier for you to secure it. This choke point allows us to apply policies and guard rails to your agent. So for example, if you're using spitter's gateway, we provide a policy DSL so you can enforce fine-rained permissions on what your agents can or cannot do. So this primitive gives you peace of mind as we graduate agents to full autonomy. So to answer the question, is MCB dead? I don't think so.

But that's also not the point of this talk. MCP and CLI, in my opinion, both have their purpose, and it's a principle behind agent experience, security, and authentication that are here to stay. MCP might no longer be in the zeitgeist. And that's fine because the best thing that can happen to a protocol is that it becomes boring like HTTP. Boring enough so that we can move on to solving more ambitious problems and push towards a world where agents are driven by outcomes, not prompts. where agents can fully graduate from a chatbot to a co-orker. That is how we move from humans in every single loop to humans being on the loop. Thank you. Come chat with me later outside if you're interested in wiring up your agent. >> Yes. Chat with Henry. One more round of applause, everybody. Henry Mau, we go from humans in the loop to humans on the loop.

Honestly, I'm ready for that. Look, listen. Our next speaker, I've been told, I don't I've just met him today, but I've been told he is, and I quote verbatim, the most cracked engineer in all of Singapore. You hear that? They So, look, I'm not even I'm not even qualified to introduce him. So, I'm I I need help. Ivan, Ivan, give it up for Ivan, everybody. So Raj, I've had the pleasure of knowing Raj for quite a while now and it's absolutely incredible what he pulls off. We had a hackathon once. He came in and said, "Oh, I'm going to build a way for agents to collaborate. " And so we he finished it and we said, "Oh, what else are you doing the weekend? " He said, "Oh, there's the Mistro hackathon. What are you doing then? " He's like, "Oh, I already built a tool to help me build my submission for the next hackathon.

" And then he won out Gemini hackathon. And he almost won the Mistro hackathon. And then he said, "Oh, I've been hearing about this Kim 2. 5 thinking. It's pretty cool. " And I said, "Oh, that's nice. " So, what did he do next? He post trained it himself and ended up beating it and using it as his main agent. Raj is absolutely incredible and honestly, I'm excited to hear what he has. >> Thanks, Ivan, for that l. But yeah, I am Raj and today I'll be talking about my journey in creating evolutionary harnesses as well as evolutionary algorithm in general. So a little bit about how I got to this. Initially it was a paper that me and my friend were working on. We were thinking of how do we create diffusion models um from scratch and we're creating specifically like a medical diffusion model for chess acties.

And while we were working on it, we realized that there was very little data to begin with. And as we're going through different papers, we stumbled upon one paper that talked about um models having like human notions of interestingness. And that paper basically used like a language model as a judge for an open-ended like RL curricula. And it exposed me to the whole world of open-endedness and algorithms. And that was my first time using that. And I think the next question that like naturally came out from that was essentially like if we claim that agents can be open-ended and that they keep producing novelty forever, how does that look like in our own like ecosystem in our own biology? And I think the sun is a very good answer at that.

Um basically for like energy particles that come from the sun they basically come on earth and they get emitted back into space as well as higher entropic like photons and the gradient that basically enables this is all of life itself. Life is the thing that creates more entropy and it's a very particular kind of entropy that took three billion years or even more to create and generate. And the question then was like how can we map this towards some similar like systems like agents itself. So that was what we I tried to do whereby it was like what if sun itself is compute the DNA that has evolved these smaller cellular single cellular organisms into the complex beings like us that write code that engage with code that can think that can react to things and create more entropy. That is basically the trajectory that these agents have.

Um and the selection bias itself is the harness. Um which basically evolves as models have been evolving. A very interesting paper that I read after that was basically a paper that showcased a single agent that slowly improved itself over time. It was called omni epic whereby you had different environments and the agent started out being very specialized in a c in a single environment and as time progressed it started becoming more and more general. That generality of that agent made it perform tasks that were immer that that showcased emergent behaviors and that was a very interesting like feedback loop that then led to the creation of another paper that that same author wrote in which the code itself was when they replaced that to become the code.

um it actually illustrated a a significant in um improvement in performance when whereby the agent like from just performing it at 20% in Swenge it went up all the way to basically 50%. And that was when I realized that if you could evolve the environments that you place these agents in and you evolve the tools, um both are the things that you could have a lever on and that could eventually improve the overall agents performance. And if you look at the trajectory of everything like we've had models that are way better than the harnesses that we have, every company is trying to create custom harnesses. I don't think that's the right way to go about things. What if you could then instead have self-evolving harnesses? There have been papers on that like meta harnesses, ROMs, and a bunch of other literature.

And the next step to that will be the agents themselves. What if you could somehow keep that memory state somewhere else and evolve that agent? What follows next will be things like world models, not physical world models, but world models that interact within a codelike environment or various code like environments that could be very differentiated. And something that I talked to my friend who worked on a pretty interesting world model paper was that what would be more interesting will be seeing how the architecture of agents within these world models look like. They may be novel and not handcrafted.

It may not use the same techniques that we do but that's something that will be interesting to see and we are seeing that nowadays as well whereby initially the scale at which models initially grew up it took us a really long time to saturate MMLU and the other benchmarks but every few weeks you see a new soda model coming out and that's not because we have more better or just better quality data it's because training loops have gotten faster and models are just closing ing the loop themselves to a degree. Um, and my point is that scaling laws still hold to a degree. They will hold and continue holding as long as humans are more interesting than the agents or the harnesses themselves. This could come in the form of like different architectures which are not handcrafted. They don't necessarily have to be humanmade.

And this is something that I believe will just remain. What I found out in my journey so far also has been that when creating the bigger meta harness that I made, what was what improved model performances generally was the trajectory. It was never really the weight. It's similar to how like the DNA just remain like and the way that we exhibit its characteristics change. Um the artifact worth studying is the path and the reasoning traces and why a model did something and not the end state if Yeah, if that makes sense. Um, and another thing that I learned while building code graph was that iteration loops are very important to this. The most successful life forms are ones that like just adapt really fast, those who die really fast. And if you can close that loop faster, it just lets you do more things. And that can come in many forms.

A great example of that is language. What language are you writing your code in? I think for me a lot of my work right now has been around or written in zigg or rust but I realized that eventually like when you want to create better and better tools um languages that have smaller com times actually end up creating better tools and you can create better tests for these tools even if that language is not memory safe. I do believe that eventually maybe this year or next almost every company would start writing some sort of their own meta agentic language and whatever happens these models would keep getting better and better and they don't have to be human readable. So these are just a few tools I built for myself that I've been using internally like muanry which is just a faster rip grab that enables my agents to get more context.

um the exact lines of code are retrieved. Code DB, this is fully open source as well. It's a triagram search for my own harnesses as well whereby agents get the exact lines of code that they need to change so that they don't have like context rot. Um nanobrew was then created because once you start putting these agents up in the sandboxes, you realize that one way to get um coding environment set up, you could snapshot it. The other thing is you could just keep pulling like abt get and getting the packages and dependencies that you need. But I was like what if you make that faster as well so that you can resolve that environment and that's how nanobrew was born which is significantly faster than appget and homebrew itself.

And it was this was also another parallel tool that I realized I had to create for my agent to be better at like just navigating the internet. something like agent browser but also using lesser tokens by using the A1Y like extensions that CDP or Chrome exposes to people and this actually improved the agents ability to browse the internet at scale. Finally, back to the whole evolutionary loop like Dev Swarm was made whereby in Dev Swarm what was orchestrated was basically a set of tools or models that can change their shape.

So you could have like maybe a few Opus context windows coupled with a few chat GBT windows with a whole multi- aentic framework and the source of truth would be something more rigid like terminal bench or legacy bench and as more people started using this I started getting more telemetric data on what works and what does not. So quick side is that all of these like lie into some sort of a fitness function which in a coding agent the harness rewrites every single time. And finally the harness which was code graph codegraph was soda on terminal bench for a while but it's no longer soda and it was essentially just made from that very fact that it was a self- evvolving harness that just got better and better with different models as time went by and it created its own tools. All of that work is also open source.

The trajectories are also open source, but I've not released this like at scale yet, but you can for sure check it out as it's still a work in progress. So yeah, what I ended up building out was just one harness, but the tools that came along with it were also some sort of an evolutionary loop for myself whereby all of these five items essentially made the harness better. And with that, yeah, I guess thanks for coming to AIE this year. And yeah, I just feel like this year will be one of the few years where you keep seeing the bitter lesson. Bitter lessening. Yeah. Thank you. Jesus, I feel like I should like just bow down right here. Oh my god, what a talk. Thank you. Give one more huge round of applause for a man. My mind is blown. My mind is absolutely blown. Will you come set up while while I like wrap a bit, bro?

This We got to collide on stage. You know what I mean? Oh my gosh, that was insane. Like, what a talk. What a Look, I Yeah, they're going nuts over here. I don't know. All of you are asleep. But like, >> that's it. What's your name? >> Daryl. >> Daryl. Oh, that's right. I see you. Yeah, the lights. Um, listen, I was literally looking for one of those open source projects that he shared. I was stuck without it. He has saved my my whole idea. That is crazy. And and he's he's so young and he built this thing. I'm I'm genuinely Can we have another meditation session so I can meditate on that? You know what I mean? My gosh, Raj, incredible. Um, we've come to the end of the conference. Oh, w is here. Yeah. It's very sad. It's very sad. Um but we we must pay our respects respect there. Nobody died.

We must we must pay some some attention and some homage to um a grim the final talk. He look he has won the most hackathons in Singapore. I was told uh and he's somebody who came up in this ecos who grew up in this ecosystem and who was doing his part through the conference and the team and the volunteers uh to really bring it home uh and make AI uh continue to grow in traction and and and vision here in Singapore. And so the brains behind the conference, the heart behind the conference, I've spent the day uh with him walking around and it's very clear to see everyone knows him, everyone loves him. Let's show him how much we know him and love him. A big round of applause for a grim sank. test. Hello everyone. Um, there's the last talk of the day, so we're going to keep it nice and fresh.

Um, and it's about how to vibe a conference in under three months. This story goes back to July of 16th, uh, 2025. Um, Rachel, Sherry, and I were getting lunch and I think just general disdain about the state of affairs in Singapore around AI events. a lot of talk, not real builder friendly moments happening. And we hadn't really started doing any events at this point, but we felt like the culmination of whatever we do eventually will lead to us doing a conference. And I sent a message at that point saying, I think we're going to yolo our way into running the biggest conference in town. I didn't think it would happen, but I guess looking at this weekend, it kind of worked out, right? But obviously you can't yolo this, right?

Like it's big to think that okay, we can pack a thousand people into an auditorium and like give them all the AI stuff that we can find. But you got to test the audience out. Is it because the ecosystem doesn't respond or is it, you know, the ecosystem super responsive but the events don't serve them. So we tried doing a few things. A week after that message was sent, we ran a meetup for cursor. Um, at that point we were like, well, maybe it's one of the first developer meetups of that scale with AI tools in question. Maybe 100 people will show up, maybe 200 people will show up. I think we ended up with 900 signups. And we eventually let 500 people through the door. And that was pretty crazy to me at that point.

Fast forward a few months later, we thought, okay, let's do a hackathon since hackathons used to be pretty big when I was coming up in the scene. And we thought, okay, let's do a 24-hour hackathon. see how many people will sign up. Maybe people will come, maybe people won't come. 1,200 people signed up. We let about 500 people in. Um, and people flew in from as far as maybe the Netherlands, all around the region. And that kind of gave us a lot of confidence as to maybe it's not the events themselves, but people do need a space to be. So 90 days ago, we met Swix and we told Swix, "We're going to run AIE Singapore. " And I think he wanted to laugh at us at that point because he was like, "Are you guys serious? Like I can't help you as much. I have other AIS to also run do have you guys run a conference before? Will people pay?

How are you planning to do any of this? " And I think our response generally was, "Yeah, I think we'll figure it out. " And that has kind of been the motto behind the entire event. So if there are any rough edges around it, I do apologize, but we did try to figure it out. And that's kind of how this went. And all of this sort of centered around high intent. We had intent that we want to make this the best builder friendly event that we could. We wanted to make sure the people in the room had intent that they want to be here. Tickets are not cheap, I understand, but we wanted to make sure the people who actually want to be here are here. We wanted to make sure that the speakers who want to be here are here. So we flew them out.

We wanted to make sure the sponsors who want to be here are here and they gladly sponsored the conference and got involved with this. So everything was culminating in a way where everyone who actually wants to be in this room today or through the weekend was here. We did not give out free tickets. There were a lot of people who were waiting that things might happen. They might find themselves a free ticket like maybe at other conferences. That wasn't really the case here. So all of you are in this room because you paid for it and you really wanted to be here. So big shout of like round of applause for you guys and you guys kept showing up like the rooms were full all day and it's like 6 p. m. and you guys are still here.

So clearly something was working in the quality of the talks, the things that are happening that you wanted to be here all day every day. Um talks being full regularly. Every speaker has told me they've had a great time on stage because the crowd has been super receptive to everything they wanted to share without really knowing whether Singapore is the same kind of audience as what they would expect in San Francisco for instance or London for instance. And it's been super heartening to see full rooms every single day. But the thing is that you can't just copy conferences from overseas and bring them into Singapore, right? It would have been very easy for us to be like let's just take AIE welfare and then copy paste it into Singapore. But Singapore is a different audience. Singapore has different kinds of people.

Singapore has different kinds of expectations from conferences. If this was a research heavy conference, maybe we lose half of you. If this conference is too easy, maybe it doesn't feel like you're getting the rigor that you expect from an AI engineer conference. So finding that balance is a very uniquely Singapore thing. Additionally, you kind of have to make this conference your own because if you're not going to copy something wholesale, what is your contribution to what programming looks like? Sherry had like I think about 21 versions of speaker lineups. How do you categorize speakers together? How do you make sure that if you're listening to openclaw related talks, you're hearing a few at the same time? Because you then get to sort of see perspectives across few speakers and then come to your own judgments.

Maybe earlier today you heard magic path and magic pattern sort of follow each other. Similar names, similar domains, very different approaches to how they think about product. And this allows you to sort of get your own opinions on how things work. But additionally, we wanted to add our own flavor to AIE events. Everyone here had a ticket to the workshops. This is usually not a default at other AIES, but we think that if you're going to do a builder first event for the first time in Singapore, you need people to build. Like this is not a thought leadership event. This is not a fireside chat panel about the future of AI event. This is a builder event. And if you're not building at least one of those days, then we've kind of defeated the purpose of all of this. So workshops were part of it.

We added some decompression sessions because we feel like AI anxiety, token anxiety is such an uh given these days, given how quickly things are moving that people need a way to understand their relationships with AI and find a way to decompress amidst 30 plus talks every single day. That part is important. Obviously, in true Singapore style, we want to make sure you guys have a good time. So, we threw a massive party last night where Jeff Huntley and I ended up DJing before we had a headliner DJ come through. But that's again some things that we think if you're going to do an event in Singapore, we have to do it the way we like to do things here. But obviously, as much as the talks are great, the programming is great.

The whole point of running an event like this is the hallway collisions that happen, people you meet in the expo, people you get to talk to, you had the main teams from most of the sponsors here in person. you had the speakers that you could meet at any given point of time. Whether you're getting a coffee, whether you're having lunch, um whether you wanted to just meet them because they're sitting around you attending talks as well. Giving access to speakers, giving access to teams is something that's very rare in Singapore. If you go to any conferences, whether they be for AI or other things, you you'd mainly see a marketing person sitting there telling you about the brand, exchanging name cards, and that's about it. That's not quite the experience when you're trying to meet companies. Some of them have never been in Singapore.

Some of them have never set foot in any of these conferences. So creating those moments outside the theater was really important to us. And I believe that a lot of you got the opportunity to go around the expo, meet the team. Some of them have flown 17 plus hours to be here. Some of them have never been to Singapore before. So creating that experience for us was really, really important. And we hope that like AI allowed you to get that over the weekend. But the important thing here is not just about the people in the room already. It's about how do we position the next generation to also benefit from this. As I mentioned, tickets are expensive.

But we shouldn't gatekeep opportunities from kids who are coming up in the scene in university in school through extenduating financial circumstances to access conferences of this ilk because they will be the ones building. So we provided scholarships. There was some information about this outside, but essentially we had one of our sponsors was supposed to be a big organization that we've heard of pull out two days before we were supposed to announce the scholarships. And that was pretty gut-wrenching to us because we wanted the kids to be involved. So Rachel, Sherry, and I had decided that we'll pay out of our own pockets and do this. But But a lot of builders in the scene in their own personal capacities decided to chip in and we could bring 20 students in.

20 students who got to meet the speakers, hang out with them, learn from them, and maybe have that opportunity of a lifetime that they wouldn't have in any situation. We have some students side stage. We'd love to have them on stage. So, could we have them on, please? So we found these students through all the hackathons we've organized, all the events we've done. These guys show up for every event we do. And obviously all our events are free by design because we want them in the room. But this is the pinnacle of opportunities that we could have provided. And these are obviously four of like 20 people we sponsored. And you might have seen them around.

They've been the ones doing all the recaps on Twitter, posting about it, writing about their experiences, meeting all the people who have flown in, and this was an incredible thing at least that we could have done to make sure that the kids enjoy this. So, thanks again, guys. I do want to shout out the people who did chip in. I Patrick Kelly from Arise. Arise is actually a sponsor for this conference, but Patrick decided to chip in his own money to support the kids on top of it. Neil Chang, Ivan, Leo, Casper, Suken from Iterative, Zayn, myself, Sher, Rachel, a lot of anonymous builders who chipped in to sponsor 20 students. So, again, a big round of applause for everyone. So, we've heard this quite a lot of times, especially in Singapore. There's no scene here. Nothing is happening. I think I need to fly to SF to attend a conference.

But I think at the end of the weekend, I want everyone to feel that you guys are the scene. You guys showed up regularly. Every talk, every workshop, around the expo, at like 8:30 yesterday, 9:00 a. m. today, through the rain, through whatever conditions could have stopped you. You guys showed up for all the side events we organized in the leadup to this. Every event was oversubscribed. Every event had hundreds of people showing up. Even if you didn't know the companies, even if you didn't know who was going, just because you knew that there was something bigger going on that you could be a part of. And I want that some that to be something that you guys remember that because this is something that goes beyond just AI Singapore. This is what's going to build the AI builder scene in the country for years to come.

And that's why this isn't an isolated moment. I hope you guys keep showing up. I hope you guys keep building. I hope you've made friends over the weekend that you'll stay in touch with. I hope you go and build at hackathons, maybe start some stuff together. I hope you post about it. I hope you don't look for permission to share the work that you're doing because this how people get to know that Singapore is a city where action's happening that it's not only SF where things are happening. It's not only London where things are happening, but Singapore, not just in Asia, but in the world, is a city to be reckoned with.

And on that note, I really want to thank all the speakers who came, the sponsors, our main sponsors, Diamond sponsors and platinum sponsors, OpenAI, ZAI, Google Deep Mind, Cursor, Arise, the volunteers who didn't sleep, the team that held it together, the thousand of you who came. I want to call the team up, both the organizing team and the volunteers up on stage because these guys have been the backbone for the entire weekend running without a hitch. These guys made sure you guys got fed. These guys made sure you guys got your badges and access sorted out. These guys made sure that you didn't see the stuff that was slipping through the cracks just so that you guys could have the best conference experience possible. We're not done yet. Hold on. >> So obviously like in true Marvel movie fashion, you know, AI engineer will return.

Uh we have a signup sheet for people who are interested. Uh we'll send out some early tickets and like information as we figure things out, I guess. But we do want to make sure that we have your intent recorded so that if and when we announce in the near future, you guys are the first ones to know because you guys took a chance on us. For guys who you've never heard from for a conference that's never been in this part of the world to take a chance and show up for the first edition in numbers and regularly is something that we cannot take for granted. And we are really, really, really grateful that you took a chance on us. So, thank you so much again. >> Can we get some music in here? >> Yeah, we do another photo. We do another photo. >> No music. >> Where's Swig? Swix, come up. >> Swix, come on up.

>> Swix is the man behind AI engineer globally. He's also Singaporean if you heard yesterday and him letting us do this is why this is happening. So thank you Swix. Swix please night. Going to be just you and I. >> Just you and I. >> All right. 3 2 1 Can we dance? How do we photo? Hey, hey, hey. Hey, feel me. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey.

Related Videos

Tourist guides adapt as AI and social media reshape travel habits

2026-07-11 · CNA · 03:35

As travelers increasingly rely on AI-generated itineraries and social media recommendations, tourist guides face fewer opportunities and must innovate to adapt.

Singapore to train AI models using local clinical data, medical guidelines

2026-07-09 · Ong Ye Kung · 03:25

Singapore will train AI models on local clinical data for patient diagnosis and treatment, initially focusing on diabetes and eye diseases before healthcare system rollout.

AIxTech Industry Forum: How to Lead in the Era of AI Coding Assistants

2026-06-25 · AI Singapore · 03:22

An industry forum on effectively adopting AI coding assistants. The key insight: AI won't replace engineers, but those who master these tools gain competitive advantages.

AIE Singapore Day 1 ft. Minister, NanoClaw, OpenAI, Google, Vercel, Cursor & more

2026-05-16 · AI Engineer Singapore · 08:00:00

Day 1 of AI Engineer Singapore — the Minister's opening keynote, NanoClaw demos, and engineering-focused sessions from OpenAI, Google, Vercel, Cursor and other leading teams. Singapore's first AI Engineer summit, positioned at the engineer × AI practitioner layer.

HSC Pipeline Engineering: building an engineering knowledge base with RAG AI

2026-03-20 · HSC Pipeline Engineering · 05:00

Through the AISG LADP programme, HSC Pipeline built a locally deployed RAG AI knowledge base, breaking down engineering-knowledge silos and improving decision-making efficiency.

Ong Ye Kung on AI, genetic screening and preparing for a super-aged Singapore

2026-03-04 · Ong Ye Kung · 30:36

Health Minister Ong Ye Kung talks through AI applications in healthcare and Singapore's strategy for a super-aged society.

More on these topics

Economy & Enterprise Adoption