EU and US could reach trade deal this weekend - Reuters
On Tuesday, 10 June 2025, Datadog (NASDAQ:DDOG) unveiled a suite of AI-driven enhancements at the DASH Conference 2025, showcasing its commitment to innovation in observability, security, and developer productivity. While the company highlighted significant advancements, it also acknowledged the complexities and risks associated with AI technologies. The focus on AI integration positions Datadog as a pivotal player in the evolving tech landscape.
Key Takeaways
- Datadog introduced several AI-powered tools, including Bits AI for security and development, aimed at automating tasks and improving efficiency.
- Partnerships with OpenAI and Cursor were highlighted to enhance AI development tools.
- New features for securing AI stacks and data observability were announced to bolster security and data quality.
- No specific financial results were disclosed, but the effectiveness of Datadog’s platform was emphasized through cost savings and productivity improvements.
- Future plans include expanded AI capabilities and partnerships, along with enhanced security measures for AI agents.
Operational Updates
- Datadog OnCall is now widely used, with over a thousand companies adopting it.
- Cloud SIEM has processed more than 230 trillion log events in the past year, doubling from the previous year.
- The Bits AI Dev Agent is generating over 1,000 pull requests monthly, saving thousands of engineering hours weekly.
- Toyota Connected maintains four nines uptime across 12.5 million vehicles, showcasing Datadog’s reliability.
- FlexLogs is a rapidly growing product, managing over 100 petabytes of data monthly.
Future Outlook
- Upcoming features include sensitive data detection in API responses and support for new attack vectors.
- Datadog plans to enhance AI capabilities and maintain partnerships with AWS and OpenAI.
- A new feature aims to facilitate seamless migration from Splunk to Datadog.
- The company anticipates increased integration of third-party AI agents within enterprise environments.
Key Product Announcements
- Bits AI SRE: Autonomous AI agent for troubleshooting production issues.
- Datadog OnCall Voice Interface: Voice-driven incident response tool.
- Bits AI Security Analyst: Automates SIM signal triage and accelerates remediation.
- APM Investigator: Identifies root causes of latency issues.
- Datadog IDP: Managed internal developer portal.
- GPU Monitoring: Enhances visibility into GPU performance and cost.
- AI Agent Monitoring: Tracks AI agent behavior and interactions.
Datadog’s strategic focus on AI-driven automation and enhanced visibility underscores its role in navigating the complexities of modern AI-powered environments. Readers are encouraged to refer to the full transcript for more detailed insights.
Full transcript - DASH Conference 2025:
Olivier Pommel, Co-founder and CEO, Datadog: Good morning. I’m Olivier Pommel, co founder and CEO at Datadog, and I’m really excited to welcome all of you to Dash this morning. Now, I won’t be very long. If you’ve been with us before, you know that we prefer to do more showing and less talking. But things I’d like to thank our sponsors and our partners, and you can meet them on the expo floor.
I also want to tip my hat to our Datadog ambassadors for the great work they are doing with our community. And most importantly, I want to thank all of you, our users and our customers. I want to thank you for your trust and for building with us. And many of you are here today from some of the largest companies in the world as well as the top teams that are building the future of AI. So it is a truly inspirational peer group and a great opportunity for all of us to exchange and to learn from each other.
And many of these stories will be shared on stage today and tomorrow, by the way. Now, as the CEO of a publicly traded software company, job number one for me personally is to make sure that we keep investing enough in R and D. The world is being reinvented every single day, and I think we can all agree that change is happening much faster today with AI than ever before. Of course, this creates incredible opportunities for all of us, But these come hand in hand with an explosion of complexity and with a whole new category of risks. So our job at Datadog is to make sure that you can tame that complexity, that you can get those risks out of the way, so that you can happily and productively ride those technology waves all the way to success.
And that is why we are so focused on building with you. We have a lot to show you today to help you observe and understand your applications, to help you build and run them securely, and of course, to help you take action, Or even better, to do it ourselves so you don’t have to. And to start us on that path, I’d like to invite on stage my co founder Alexey.
Alexey, Co-founder, Datadog: Thanks, Olivier, and thank you all for joining us today at Dash. I’m really excited to show you what we’ve been working on. It’s been almost three years since AI entered the world stage. It may feel like an eternity to you. That’s because we’re all on the cutting edge of adoption, whether it’s using coding agents, whether it’s weaving inference into applications, or building infrastructure with lots of GPU.
Another reason why it feels like we’ve been at this for a long time is that the state of the art is moving so very fast. Right now, there’s a lot of focus on building better reasoning and general purpose intelligence. But as good as the general purpose models get, I think there’s still a lot of room for industry specific, specialized ones. Coding models are a great example. They power the coding agents you probably use every day.
Observatory models are another great example, or even security models. And we have been contributing to the field. Our AI lab recently published a state of the art time series foundational models. It’s called TOTO. And it comes with an associated benchmark called BOOM.
Now, what makes them special is that both are designed specifically for observability. And in the spirit of open science, we’re making all this work available for free, open weight, on a hugging face, so that it can benefit you and others. I personally find a lot of promise in this work. I’m really excited. But with any breakthrough in the field, I think the bar to clear to make all this AI truly useful keeps rising.
That’s at least how we think about it. So we ask ourselves, how can we apply these new techniques to make a difference in your daily work? What does it mean for AI and agents to help you observe and understand, optimize and troubleshoot, secure and remediate, not just in theory but also in practice? To find out, let me hand it over to Tristan.
Tristan Ratchford, Engineering Manager, Datadog: Thanks, Alexey. Hi there. My name is Tristan Ratchford, and I’m an engineering manager here at Datadog. Last year at Dash, we showed you that Bits is capable of operating like an SRE by helping you troubleshoot and resolve your production issues. So when your monitor triggers, Bits will proactively launch an investigation, look across your entire Datadog environment for signal, and find the root cause in minutes.
And for the past year, we’ve been hard at work making Bits even better. So let’s take a look at some of the big changes that we’ve introduced. Firstly, bits is now looking at even more of your data, things like dashboards and deployment changes. It’s able to correlate issues across various levels of your stack using our in house data science models. Next, BITS is now able to perform deeper root cause analysis by continually refining its investigation.
Just like the five whys framework, BITS is continually able to ask why to reason about the root cause. As a result, BITS can now handle more complex tasks that span multiple services. Tasks that could take several hours or several engineers to resolve. Finally, we’ve given bits memory. You can now teach bits to remember steps that were useful and correct ones that weren’t.
We’ve also built a data set with a massive number of real world production alerts that we’ve been using to evaluate business performance against and to hill climb on accuracy. But today, I’m excited to announce that you can enable Bits AI SRE in your account. Using bits is like instantly adding an engineer to your team who is already familiar with your system and is on call 20. But enough talk. Let’s see bits in action.
Let me show you how bits resolves an issue from start to finish. So the moment you’re paged, Bits jumps right into action. In this case, we were paged because an endpoint on our Flight Query API is experiencing high latency. Bits will start its investigation by gathering context about the alert from your Datadog environment, your runbooks, and from lessons learned from previous investigations, all in under a minute. And like you or me, BITS is pulling related telemetry from your logs, metrics, traces, and more.
All right. Check this out. This is the really cool part. Now, based on its initial findings, BITS will then generate a variety of hypotheses to what it thinks the problem could be and then go verify each one of them concurrently. So with our latency issue, BITS is considering the problem is due to database query timeouts, a faulty deployment in the endpoint code, slowness in a downstream service, or a spike in query traffic.
Bits will then go evaluate each hypotheses using your telemetry to determine if it found the root cause, if it needs to move on, or if it needs to dig deeper. For example, let’s take a look at this branch. Here, this is hypothesizing that the latency is due to database query timeouts. Why? High DB load.
Why? Increased API traffic. So as you can see, unlike other agents, BITS is not a black box. You can follow its reasoning every step of the way. Bits will then continue to drill down until it finds the root cause.
And with our latency issue, Bits has determined that the root cause was due to the database query timeouts from that branch we looked at earlier. So right there, you can see the power of the hypothesis tree. Bits is able to simultaneously investigate multiple chains of reasoning in minutes. You get a thorough investigation every time. And every step of the way, you can dive in and look at the evidence and reasoning that went behind it.
You can also make better by teaching it steps that were useful and correcting ones that weren’t. So it’s continuously learning. And just like a teammate, you can ask Bits questions about its findings or get help taking next steps. So for example, here I’m asking, who owns the flight database so I can page that team and get help? Finally, Bits will pipe all of its high confidence findings back to Slack or a ticketing tool of your choice.
So there you have it. BitsAI SRE, a fully autonomous AI agent that’s able to help you troubleshoot and resolve your production issues 20 fourseven, so your engineers can focus on what’s important. And some of you have been using Bits in the past year and have seen great results. For example, Thomson Reuters is using Bits to accelerate how they triage issues across their global operations team. And Fanatics is using Bits to stay on top of their alerts when it matters the most, like during the Super Bowl or tomorrow’s NBA final game.
But now you too can add bits to your team. Check out this link to find out more. And now I’m gonna kick it over to Shah, who’s gonna tell us about incident response. Thank you.
Shah, Product Manager, Datadog: Thanks, Tristan. Hey, everyone. I’m Shah, a product manager here at Datadog. Last year, we announced Datadog OnCall, and your response has been incredible as it quickly became our most in demand preview to date. OnCall was pushed to general availability a few months ago, and there are over a thousand companies using it to improve on call processes.
And we’re not stopping there. We want to go beyond sending simple pages and making robotic one way phone calls. So let’s take a look at how on call and incident response is evolving.
Unidentified speaker, Voice AI: Hey, Shah. Just calling to inform you about a critical latency alert on checkouts. I can acknowledge it and share more details with you. Want me to do that?
Shah, Product Manager, Datadog: Let’s go ahead and act that.
Unidentified speaker, Voice AI: Sure thing. The p 99 latency on checkouts has increased from nine hundred milliseconds to five point five seconds starting about five minutes ago. I’ve acknowledged the alert, so it won’t escalate any further. I also have the linked runbook handy. Would you like me to share the suggested next step?
Shah, Product Manager, Datadog: Yeah. Let’s get the step, and could we also post to Slack in the checkout channel that’s going on?
Unidentified speaker, Voice AI: Got it. I posted Slack in hashtag checkout ops. The suggested next step is to look at the upstream payments API for any latency or errors. Want me to pull that information for you?
Shah, Product Manager, Datadog: Yeah. Let’s grab that.
Unidentified speaker, Voice AI: The error rate for the payments API has climbed to 8%. It seems there was a deploy to the payments API shortly before our monitor went off. I can pull the next few steps from the runbook.
Shah, Product Manager, Datadog: Think think I know what we have to move forward.
Unidentified speaker, Voice AI: No worries. Feel free to reach out if you need another pair of eyes. Let me know if there’s anything else I can assist with.
Shah, Product Manager, Datadog: Cool. Thank you. Bye. That’s a preview of our voice interface for incident response. In real time, you can get details of the underlying monitor, get next steps, and take action even before jumping onto your computer.
Okay. So now that I’ve gotten all of that context from my Voice AI, I’m ready to jump onto my computer and take action. So I already have a tab open to Datadog, and I see this handoff notification on the bottom left. This is for the page that Voice AI just told me about. This is new for on call and incidents.
This handoff notification lets me jump in right where I left off on the call. No more digging around for the page, the alerting monitor. It’s right there when I need it. So let’s fast forward a little bit, and I’ve gone ahead and declared a Sev two incident and kicked off a coordinated response with my team members. I’ve docked my incident, and I can see all the messages and graphs my teammates are posting.
And what you’re looking at here is not a Datadog chat feature. This is a real time sync with Slack and soon Microsoft Teams and Google Chat of what my teammates are already posting. Shared links and screenshots of graphs are rendered as live graphs that can be compared with anything else in Datadog. And while I’m doing this, the doc sticks with me no matter what page I’m on. It’s like turning Datadog into incident mode.
So with handoff notifications and our doc experience, you can collaborate in the same space that you investigate incidents. And in the chat, a teammate highlighted that there is customer impact, so I’m going to go ahead and update my company’s status page. And to help do that, I’m happy to announce today we’re launching Datadog status pages. So I don’t have to sign into another tool, and we already have a lot of the context to pull from your incident response. You can we basically can pre fill almost everything for you.
So, basically, you will never forget to update your company status page. So we support templates, custom domains, have several customization options to help keep all of your customers in the loop. With Datadog Instant Response, you can now co locate everything you need to dive into the issue, work through it with your teammates, and update your customers. You can run your end to end process in Datadog. The voice interface is in preview, and you can try it out today on the expo floor after the keynote.
Hand off notifications and the docked experience are available now. And to sign up for the status page preview, you can do that today. So to learn more and sign up for previews, check out the link here, and I’ll hand it back to Alexey. Thanks.
Alexey, Co-founder, Datadog: Thanks, Xia. Next time I get a page with that much energy at 2AM, I’m going to wake up really fast. So you’ve just seen our new on call, and it’s a real step up from the old static messages that we’ve all been receiving for the past fifteen years. But you know what? What else can we improve with the judicious application of AI to help you cut the daily toil?
Security. To talk about making life easier, if you’re involved in security, here’s Ron.
Ron, Product Manager, Datadog: Thanks, Alexi. Hi. I’m Ron, a product manager here at Datadog. Today, I’m excited to tell you how we’re going to bring AI to Datadog’s Cloud SIEM. Datadog Cloud SIEM helps you triage all of your security threat indicators.
It’s unique because it brings together security and observability, allowing for more thorough threat investigations. Now, Cloud SIEM is growing rapidly. This past year alone, Cloud SIEM has processed more than two thirty trillion of your log events. That’s more than 2x the year before. Now, as these event volumes continue to grow, how do we help overburdened SOC teams manage alert fatigue and high false positive rates?
Well, our newest feature is Bits AI Security Analyst launching today in preview. Bit Security Analyst vastly reduces the time that SOC teams need to spend triaging SIM signals. Bit’s autonomously investigates SIM signals, recommends a triage resolution, showing its investigative steps with accompanying data queries, and allows for immediate remediation right in Datadog. Now let’s take a look at the workflow of a security engineer. I start my day and I open Slack.
I see dozens of new SIM signal notifications, but today I noticed that some have threaded comments. Let’s look at one. I see that Bits has investigated for me overnight. While I see a conclusion, let’s click through to see the full investigation. This is the Bits security analyst investigation for an AWS CloudTrail signal.
Bits presents a clear and reasoned conclusion. The signal is benign because it’s legitimate administrative activity by a verified employee in a sandbox environment. The insights derived from the detailed investigative steps are summarized clearly and succinctly. It’s investigated all the key IOCs and analyzed all of the log results. I can scroll down and expand each step to see BIT’s agentic reasoning.
This specific step shows that while the suspicious activity was irregular and low frequency, it’s suggesting administrative tasks. Bits suggests further investigation. Bits then proceeds to investigate historical signals, IP addresses, user agents, and user behavior. Using the MITRE ATT CK framework, Bits decides which steps to include and which entities to investigate, pivoting intentionally along the way just like an expert security analyst would. Now, reviewing that just took me a few seconds, way shorter than the thirty minutes it would take me to do that investigation manually.
Let’s take a look at a suspicious signal investigation. This is another CloudTrail signal. It could indicate enumeration of AWS services. We’ll definitely want to investigate further, but speed matters. This could be an attacker probing our system.
So let’s look at how Bits uses actions. I click on take action. I could use a preconfigured SOAR workflow, but I’m going to use Bits AI. Now, Bits AI action interface allows me to type in any prompt, but it uses the context of the current investigation to recommend three different prompts: quarantining the user, completely disabling the user, or creating a case. I choose the quarantine prompt and press enter.
Now, Bits is searching for the right action to take and suggests that I use the attached user policy. I click in. It pre fills all the fields it can. I simply select the right connection, and I hit run. Bits has now confirmed that the user has been quarantined and also tells me that it automatically created a case in Datadog’s case management system.
I click into the case, and I see that Bits has prefilled all the relevant information, including the security agent conclusion and the quarantine action that Bits and I took together along with the original SIEM signal. Now taking easy action with Bits wasn’t just fast and easy, it was safe because it used only my team’s integrations, ensured I had the right permissions, and it even asked for manual approval given the sensitive nature of the action itself. Next, let’s navigate back to my SIEM signal list. Once I trust Bits AI’s investigative capabilities, I can simply filter to the benign signals, click, and archive them in bulk. Now I can get through to the rest of the items on my giant to do list, like writing SOC reports and threat hunting.
Bits AI Security Analyst truly augments your SOC team, automating sim signal investigations and conclusions, reducing triage time from thirty minutes to thirty seconds, and accelerating remediation right in Datadog. And you can try Bits AI Security Analyst today in preview by going to this link. Now I’m gonna hand it over to Mike, who’s gonna tell you even more about Bits AI.
Mike Leach, Product Manager, Datadog: Thanks, Ron. Everyone, my name is Mike Leach. I’m a product manager here at Datadog. Let’s continue this thread around autonomous agents that can proactively address problems within your applications. You just saw how BIDS can help you triage SIM signals and automate on call or investigation.
Now, to extend that idea into your daily development workflow, I’m excited to announce the BIDS AI Dev Agent. Much like many of you here, we’ve been trying out all the coding agents on the market, and we saw a huge opportunity to create a unique AI agent. Our new dev agent is deeply integrated within the Datadog platform, so it has complete knowledge of your observability data and uses live production context to autonomously detect high impact issues, diagnose their root cause, and create context aware pull requests. No other agent combines full stack observability insights with true end to end remediation. So the dev agent can deliver faster, more reliable fixes, dramatically accelerating your dev process and issue resolution time.
Actually, it looks like I’m getting a ping from the dev agent now. Let’s see what’s going on. So it looks like the agent found a high impact error. It’s a slice bounds out of range panic in my CodeGen API service. It’s been causing crashes for the last ten minutes.
The dev agent has already generated a fix and linked its PR here. It’s even cc’d me since I’m on call. Let’s take a closer look at the fix. So here on GitHub, the dev agent has automatically written a PR description summarizing what went wrong and the fix that it’s proposing. It’s clear, it’s concise, and it follows my team’s PR template.
It links to the error that triggered the agent. Now let’s take a quick look at the code changes. So in this bug fix, we see a common go problem of accessing out of bound slice indexes. The dev agent proposes a fix that sanitizes these inputs. Additionally, it adds some tests to validate the correctness of this logic.
While this is a valid approach that will definitely prevent crashes, I’d also like the UI to reflect when it’s sending invalid parameters. So let’s ask the agent to update the commit. I’ll just add a comment here asking for the change. And look at that. The dev agent has already responded and updated the PR.
This is great. I’m gonna go ahead and merge merge this PR. So in just a few clicks, I’ve accepted a fix that’s been proposed, tested, and documented by the Bits AI dev agent. And remember, I didn’t even have to go looking for this error. The dev agent proactively found it, fixed it, and sent me a Slack message completely autonomously.
That’s the unique power of this agent, and it honestly feels like having another full fledged developer here on my team. Now you might be wondering, how do I keep track of everything the dev agent is working on? Well, we’ve built a dedicated page for that. Here, I have complete visibility into every PR generated by my AI powered teammate. Whether it’s tackling runtime errors, fixing security vulnerabilities in your code, or resolving issues serviced by the BitsasRE agent, I can easily track the status of each PR, knowing whether it’s been merged, is awaiting human review, or is in the process of iterating based on feedback or failed tests, which helps keep my team informed and in control.
Today, the dev agent is autonomously sending over 1,000 PRs per month across many teams at Datadog, even more if you count the PRs that are manually created from agent generated code. We calculated that the Dev Agent is saving us thousands of engineering hours per week, and that’s time that we can reinvest in shipping features and not sifting through noise. We’re embedding the Dev Agent everywhere. Error tracking, traces, profiling, code security, real user monitoring, database monitoring, test optimization, and more, so you can diagnose and fix problems across all of Datadog. We’re excited for you to try out the Dev Agent for yourself.
Go to this link to sign up and learn more. Now I’d like to pass it over to George so he can show you how the Dev Agent is helping our users in APM. Thank you.
George, Staff Engineer, Datadog: Hi, everyone. I’m George, staff engineer in APM, and I’m excited to share with you how we’re embedding the BITS dev agent to help you solve some of your toughest problems, starting with latency. As an engineer debugging latency degradation, I’m looking at tens of services, hundreds of dependencies, all while coordinating with many teams. On a good day, this can take me an hour. Debugging latency is hard, and we’ve heard this from you too.
And that’s why I’m excited to announce APM investigator. Now in preview. Let’s take a look. I’m debugging a latency issue on my checkout endpoint. I see the p 90 latency is elevated, but the p 75 and p 50 seem normal.
Just above the graph, I see something new. Let’s investigate. This is a latency investigation. Usually, this would have been a headache. I’d start searching traces, metrics, logs, and pulling in different folks to help.
But here, I have all the details of what happened and what I can do to resolve it. Up top, I see the slowdown is limited to a subset of requests. I can see the method causing the slowdown and a PR for the fix by the by the dev agent. To give me confidence in the findings, I look at the supporting section. Here, I see a comparison between a normal and a slow trace showing me that this process premium users method is the problem.
Below that, I see a correlation between the abnormal behavior and request attributes. Requests tagged with premium appear more often in high latency cases in comparison to those tagged with basic or standard. Alright. It’s clear which requests are affected and where in the code I should look. Let’s solve this issue.
Scrolling up, I can go to the PR the dev agent generated for me. Here in GitHub, the agent tells me the cause of the latency issue is an inefficient method. It shows me the proposed fix along with the test cases validating the new behavior. In minutes, I’m able to root cause and fix a latency degradation, which could have taken me hours. And that’s not all.
The investigator can help you root cause many other issues, like app inefficiencies, faulty deployments, traffic changes, and more. Now, let’s take this one step further. What if I could fix issues before it alerted me? I’m stoked to announce proactive app recommendations. Now in preview.
Let’s take a look. This is the recommendations page, where Datadog gives me performance and reliability improvements for the services, applications, and databases my team owns and operates. Each recommendation is prioritized by Impact. Sticking with the latency theme, let’s look at this opportunity to reduce the latency on a service I own. This side panel replaces hours of investigation that I would have done.
I get a clear explanation of a problem, a suggested change, and the impact. In this case, the get card items method is calling a downstream API sequentially. If I can paralyze or batch these calls, I can cut down my execution time over 75%. Wow. I see what the flow of execution will look like if I make that change, and I can see the current latency.
Right below that is around six seconds. To help implement the fix, I just scroll down and the bits dev agent gives me a suggested change, and I can work with it to refine and apply these changes right here. But Datadog doesn’t stop at the service layer. I get recommendations across my stack. For example, here’s an opportunity to improve my product page experience.
Users are having trouble adding to cart. I can see when the issue started and the page that it’s happening on. To get a better sense of what’s going on, I dive into the example session replay. Here, a user is repeatedly trying to add their cart, and nothing is happening. Scrolling down, I can see the impact.
Over 45% of views on the page and 400 of our users are affected. I can see the source of the issue by scrolling down where the dev agent tells me that there’s a component on the page trying to use internal state that hasn’t been properly exposed. I get a suggested fix ready for me to merge in. And just like that, I’ve addressed two issues that could have paged my teams in the future. By analyzing the data that you’re already sending through APM, DBM, RUM, and profiling, Datadog delivers recommendations to improve your application and services.
Ones you’ve told us matter, like resolving n plus one queries and excessive lock contention. So let’s recap. You’ve just seen APM investigator and proactive recommendations. They represent a shift in how you operate through observability. With the investigator, you resolve your issues in record time, and with recommendations, you can address issues before they impact your business.
Join us in the expo hall and get access to these features by signing signing up on the link behind me. Now, I’ll hand it back over to Alexei. Thank you.
Alexey, Co-founder, Datadog: Well, thank you, Ron, Mike, and George. We just saw how the BitCi security analyst cuts a toil for security teams. And then you also saw how the BitCi dev agent is starting to pop up wherever it can use observability data from production, like actual errors, to save you time and write pull requests for your review. And last, in APM, you saw how an agent can help you troubleshoot and optimize application performance with a lot less effort. This is great help with running software.
But how about helping you build better software? We have something new here as well. And I’d like to hand it over to Mohan to share more.
Muhan, Engineering Manager, Datadog: Thanks, Alexei. Hi. I’m Muhan, an engineering manager at Datadog. As software engineers, it feels like we’re slowed down constantly. Whether I’m responding to incidents, creating new infrastructure, or even just deploying, I hit friction every step of the way.
That’s why today, I’m thrilled to introduce a fully managed internal developer portal. The IDP to help engineers ship quickly and confidently using what you already have in Datadog. Working with unfamiliar services is a part of daily life. I’ll never forget when I responded to an incident caused by a dependency going down, how hard it was to fill in the blanks in the middle of the night. Let’s see how Datadog helps with this.
This is IDP’s software catalog. Here, I can see my services as part of a greater whole. To set this up, I can start fresh or import my existing topology from Backstage. A clean start includes all the individual pieces of my system architecture using what’s already in Datadog. Using AI, these pieces are composed into context rich systems with titles and descriptions telling me how they relate.
Right at the top, I quickly find out where code lives and what documentation I can read. I see detailed information about the services my team is already monitoring with Datadog. When coming from Backstage, Datadog completes the picture, filling in the gaps and overlaying real time telemetry onto each component. Understanding my system in relation to best practices also slows me down. As an engineer, I really only hear about this from the occasional migration email.
I really only care when my own builds start failing. The feedback loop is slow, and information is scattered across tons of spreadsheets. With scorecards, I can see a list of best practices measured against my services. Scorecards keep me in the loop about any ongoing platform work. I can quickly see where we’re at with deployment, security, and alerting best practices and know that any required checks are passing before I start a build.
Speaking of migrations, they honestly tend to usually be pretty simple. But even if it’s as easy as changing a couple lines of YAML, I end up getting slowed down by all the back and between my infra and platform teams. Using self-service actions, I can find templates that let me manage infrastructure quickly and safely. I can perform actions on components like data stores and queues or spin up new ones. This create s three bucket with Terraform action was made by my Inflow team for me.
I can fill in any required information like the bucket, the region, and a justification, and then just hit create PR. A new pull request is automatically created and assigned to the infrastructure team for approval, and I can view it right in GitHub. Now that the bucket’s been created, I see it automatically reflected as a dependency in the system overview page. It complies with everything we need it to, covering regulation, internal processes and best practices, and any permissions. As a platform engineer, I love the idea of building templates like this, but I don’t wanna have to learn yet another product specific system.
The s three creation flow we just looked at was powered by App Builder, the way to build low code apps within Datadog. With AI, I can make these templates for developers fast. Because App Builder runs in a low code, controlled environment, I can get the joy of vibe coding with the safety of predefined components. Let’s say I wanna make a template for creating new RDS instances. I can start a new app from scratch and then start from AI.
I’ll tell Bits what I need, and it generates the template for me. Bits explains what it did and confirms any sensitive details around environment or policies. Then I’ll just follow-up with any tweaks. I’ll make sure it looks good. When I’m satisfied, I’ll publish this app for others to use.
Because Bits uses policies that I’ve already configured and vetted elsewhere, I can share this template confidently and know that it’ll run safely every time. Alright. We just saw a lot of stuff. Let’s recap. Datadog IDP is the only developer portal that knows your system and stays up to date automatically.
You can understand your services without overhead, track best practices with scorecards, and manage your infrastructure with AI. Sign up today for IDP and see how you can level up your engineering culture. Now I’ll kick it back to Alexei.
Alexey, Co-founder, Datadog: Thank you, Mohan. What’s created by the IDPs is that it can work directly alongside Backstage, and it’s always getting live data from the rest of the Datadog platform. Now, on the same theme of building software, let’s hear from one of our customers who is also working on helping you build software faster. Here’s Kursor.
Swale, Co-founder, Cursor: Hi. I’m Swale. I’m one of the cofounders of the company that makes Cursor, and Cursor is probably most people’s favorite way to code with AI. We really, really want to help automate some of the most tedious parts of programming so that you could just, you know, go from your idea to code in the page as fast as possible. Possible.
And we want it to be fast and fun and also extremely powerful. Perchure, at least the company very quickly in the last, like, six months has scaled by over a factor of a 100 in terms of infrastructure. Our tab model inference does, you know, almost a billion calls per day and over the lifetime of it have handled, you know, many hundreds of billions of files. Datadog has really helped us, you know, scale observability in a way that we never have to worry about it going down. Probably, Cursor would have grown much slower and would have had many more crashes if Datadog wasn’t as good as as it is today.
I think probably the most exciting part is that the models right now can’t see all the real errors that are happening, all the weird edge cases the code might be hitting, and there’s no better place to see that than Datadog right now. I think it would be really helpful to be able to aggregate this all in another tool like Datadog and then pass it to Pursor and ask it to fix the things using, you know, the ability to both understand the code base and to be able to understand what you have been working on. And I think that will be a big boost to productivity for many, many people. I have a lot of respect for Datadogging. Probably one of the most important tools to help us scale in the last, like, six to twelve months.
Alexey, Co-founder, Datadog0: Hi, everyone.
Alexey, Co-founder, Datadog1: Hello. Hello. My name is Alat Shiban. I’m a product manager here at Datadog. I’m also a Cursor user.
Like many of us at Datadog, we both use and love Cursor. Like Swale mentioned, we’re really excited about the possibilities of having agents access both Datadog tools, capabilities, and data. And that’s why we’re introducing the Datadog MCP Server. The Datadog MCP Server allows agents to both access Datadog data, it allows us to add live instrumentation, and use the breadth of Datadog capabilities to both find and fix issues for you. Let me show you what that looks like in an example.
I’m a developer and my users are complaining that the checkout flow is broken. They add items They click checkout. And nothing happens. Let’s try and ask Cursor for help.
In Cursor, I open a new chat. And I type in, I’m seeing an issue on coupon Django where clicking the checkout button doesn’t do anything. Can you help me debug this? Now the agent tries to figure out what the problem is, looks at the code, but it needs more context. Because of the MCP integration, it can now choose which Datadog capability to use to help debug this issue.
In this case, it chooses to use Datadog’s live lockpoints. Now let me explain what that is. Lockpoints are like breakpoints, only they don’t actually break the execution or pause the execution. And they work on live services. Once you add the lockpoints to your code, they start streaming back debug data from those live services.
And so you can see things like variable values and method arguments and execution paths without redeploying the app. They’re pretty cool. So the agent is asking us to now reproduce the issue because it’s using the log points. And we go back to the website, we click on that checkout button, and the agent starts collecting that data back from the log points. And it notices something interesting.
The accents on the names of the cities are being stripped out. They’re being removed. In this example, it’s Sao Paulo. But they’re being expected in the code And that’s a good lead for us. We click through.
We can see the code, the log points, and the live data coming from those live services. I can see the same information on the Datadog portal in the UI. And both share this information with my team. I can make updates and changes to the log points. I can also now generate unit tests.
Only this time they’re grounded in the production data coming from those log points so they’re more accurate. The agent now writes the test. We run it. It fails. It’s supposed to because we haven’t fixed the issue yet.
Now the agent proposes a fix. We rerun it and the test passes. We fix the bug. Let me recap quickly. With the MCP server, you can use Datadog in any AI agent that supports the MCP standard.
you can now use the kind of reproduce the issues in production in the local environment using the breadth of Datadog capabilities, even ones that you might not be as familiar with. And you can generate fixes and tests that are grounded in real production data, which makes them more accurate. The MCP server for IDEs is now in preview. It’s really important for us that you can use Datadog in any AI agent. And we’re happy to share that we’ve partnered with OpenAI to bring that operational context to their new Codec CLI.
Let’s take a look.
Alexey, Co-founder, Datadog2: Hi. I’m Michael Bolan, lead engineer for Codec CLI and OpenAI. Together with Datadog, we’re imagining a future where on call engineers work hand in hand with AI agents right in the terminal. Here’s a sneak preview of some of the work and ideas we’ve been exploring together. Codecs is a lightweight agent that can run directly from your command line, the primary environment, especially for SREs.
It can follow your instructions while troubleshooting issues, read and edit files, generate code, and run commands securely, all supported through multimodal reasoning. Now let’s see it in action. As an SRE, I might be wondering if something is wrong with So I ask, are there any errors? The Codec CLI interacts with the Datadog MCP server to select the right tool, execute it, and provide concise findings.
Next, to check if someone else already looking at the problem, I can ask, has anyone declared an incident yet? Again, the relevant information is retrieved on the fly from Datadog into Codex and shown in my terminal without having to navigate between apps. Because it retrieved all relevant incident details, I know immediately who is in charge and who I can follow-up with. Next, I want to know if the issue is still happening and want to confirm it using real time metrics. So I ask, are the latency spikes still happening?
Because the MCP server gives the Codec CLI real time access to Datadog tools and context, it can retrieve the relevant latency metrics on the fly and generate an interactive graph in my terminal. It also keeps all the results and context I’ve had so far, so it builds up more and more knowledge throughout the conversation. Finally, I say, update the Redis latency monitor so we can catch this sooner next time, and Codex edits the Terraform for me. This is how we imagine on call engineers will work in the future. You’ll no longer need to switch between apps when troubleshooting issues manually using different sources.
You can collaborate with agents using natural language in your preferred work environment and use powerful tools through seamless MCP integrations. We can’t wait to see Codex help you ship software faster and resolve issues even quicker.
Alexey, Co-founder, Datadog1: Thanks, Michael, the team for putting that together. Both the standalone MCP server and the MCP server for IDEs are now in preview. You can learn more on our website, sign up. And now I will hand things over to our CMO, Sarah Varney. Thank you.
Alexey, Co-founder, Datadog3: Thanks, Allah. I’m Sarah Varney, Datadog’s chief marketing officer. And it’s been so exciting to partner with leaders like OpenAI and Cursor to reimagine what we’re doing with SREs and developers and to meet our customers where they are. And honestly, that’s one of the best parts of my job, hearing how all of you are using the Datadog platform in entirely new ways to power new experiences. And we’re super fortunate today to have one of those customers here with us.
I’d like to welcome Dave Tsai, the CTO of Toyota Connected to the Dash Stage. Please help me in welcoming Dave.
Alexey, Co-founder, Datadog4: Thanks, Sarah. It’s great to be here. I’m Dave Tsai, CTO at Toyota Connected, and we’re building the future of connected mobility. Akio Toyoda started Toyota Connected to pursue the ultimate customer satisfaction. In 2018, he announced that Toyota would transform into a mobility company, and since then, the possibilities have been endless.
Toyota Connected is driving delivering a key part of that mobility mission. To support this vision, Toyota Connected North America was established in 2016. Our goal was clear, bring the connected vehicle foundation in house and drive the innovation from within. Now let me talk a little bit about our company strategy. At Toyota Connected, we focus on delivering connected vehicle services, both in vehicle experiences and out of vehicle services.
We have built foundational products to power this vision. Let me walk them let me walk you through them. Our core products include DriveLink, delivering safety and convenience to our customers, mobility, providing connected data services, the virtual agent, Hey Toyota, our in vehicle AI virtual assistant, and finally, multimedia, our in vehicle infotainment system that power our cockpit experience. And these products operate at real scale. So far, we have over 12 and a half million vehicles connected through these systems.
Let’s take a closer look at DriveLink. DriveLink provides a human assisted service through an SOS button built into the vehicle. For example, if you’re in a collision, pressing the SOS button connects you to immediate human support. We also offer enhanced roadside, automatic collision notification, and stolen vehicle locator, all designed to keep our drivers safe and supported. To show the real impact that we’re making with 12 and a half million vehicles on the platform and over 5,500,000 calls handled, of these of these, 600,000 were critical safety calls, and we’ve helped track over track over 35,000 stolen vehicles.
Now when we talk about vehicle tracking, it’s not just about recovery. It supports civil service responses too. And beyond individual vehicles, we operate a fully connected fleet. Our systems run at four nines uptime, and that reliability is possible because of observability and tooling provided by Datadog. We achieve our four nines uptime by driving our mean time to identification from minutes to seconds.
We built workflows and a software catalog that quickly connect the right people to the right incidents when they happen. Please come meet our amazing DriveLink team at the Expo Hall to learn more about how we achieve our operational excellence. And our partnership with Datadog goes beyond DriveLink. Mobility and the virtual agent also rely on Datadog’s full observability suite to help us build better and more reliable vehicles. And here’s a glimpse of logs and stats we monitor through Datadog.
We currently oversee roughly a thousand hosts, tracking about 800,000,000 8,000,000 container hours, and we’re excited to continue to grow our partnership with Datadog as we scale even further. With a suite of with a suite of tools Datadog provides, we have the opportunity to build even better cars, in the words of Akio Toyota. Thank you, and now back to you, Sarah.
Alexey, Co-founder, Datadog3: Thank you, Dave. We’re so excited to see how Toyota is using infrastructure monitoring, APM, and our Synthetics products across the entire Datadog platform to power this new connected driver experience for over 12,500,000 vehicles worldwide. As Dave mentioned, they’re also going to be on the expo floor demoing their connected car experience live on the expo Floor. I got to get a sneak peek of this this morning. It’s incredible.
I highly encourage you to check it out. So as you heard from Dave, observability has been key to helping Toyota build this new connected car experience. And now we want to go deeper on one of the core pillars of observability, and that’s logs. Last year, we launched FlexLogs with the idea to help you manage your storage costs more effectively. And today, we want to build on that vision.
And to share what’s new with logs, I want to welcome Kelly Kong to the Dash stage.
Alexey, Co-founder, Datadog5: Hi. I’m Kelly, product manager here at Datadog. Last year, we launched FlexLogs, decoupling storage from compute so that you could bring in more logs to solve new use cases, all while staying within budget. Just the takeaway, an online food ordering company uses FlexLogs to achieve full visibility across their stack, cutting MTTR and reducing revenue loss on missed orders. They’re just one great example among many.
In less than a year since launch, teams are now storing over a 100 petabytes of data per month, making Flex Datadog’s fastest growing product in history. We’re just getting started. You told us you need logs for years to comply with audits, investigate zero day security breaches, and perform compliance reviews. When you’re being pinged by three different teams for hourly updates, efficiency matters, and context switching only slows you down. That’s why I’m thrilled to introduce Flex Frozen, a new long term storage tier designed for historical reporting and regulatory requests.
Keep your logs fully managed in Datadog for up to seven years where you have one platform for DevOps, security, and compliance use cases. That’s not all. We’re also simplifying how you discover and analyze these logs. I’m excited to announce Datadog Archive Search, a powerful new way for you to find log insights regardless of where that data lives. Let’s play it out.
My compliance team just asked me to pull a user activity report spanning back three years. Whether I’m leveraging Datadog storage, such as our new Frozen tier, or my own s three bucket where I already have years of archive data, I now have the same consistent search experience where I can easily find relevant logs over any historical time frame. Within seconds, I’m getting data back from my external archives without having to write the perfect query upfront or wait for a lengthy rehydration job. Once I’m happy with the results, I can set up a full CSV report to land right in my auditor’s inbox. Archive Search makes it easy to produce reliable reports when you’re under time pressure or scrutiny.
But for those inevitable follow-up questions, you now have Datadog Sheets. Eliminate the endless emails and exports with a native spreadsheet solution built right inside Datadog. Opening my results in Sheets, I don’t have to worry about syncing my data or managing multiple CSV files. Pivot tables allow analysts and auditors to quickly summarize or drill down into data. For example, I can break down my earlier audit logs by different dimensions, such as team, user, or country.
Sheets is great for this kind of slicing and dicing or building real time reports. But for deeper analysis and multistep investigations, I need a different kind of tool, one that supports storytelling. Last year, we introduced log workspaces to transform log data and build multistep analyses on the fly. We’re extending these same capabilities to notebooks, your home for interactive graphing and collaborative analysis. The key is that I can now bring together all my different telemetry and context, logs, APM spans, metrics, and more into one unified canvas.
Transforming these different datasets is easy with intuitive one click operations that allow me to parse, aggregate, or filter. Tasks that used to mean exporting data outside of Datadog or reinstrumenting upstream apps are now as simple as applying a formula. Best of all, notebooks help you collaborate better with your team. Whether you’re reviewing, leaving a comment, or starting a discussion, you can do it all just like you would with your favorite real time editor. Working together, your team can get to insights faster.
But actually, I have one more teammate in here, one who knows this data inside and out. Bits AI is now integrated right into notebooks. When I ask for help with reviewing user access patterns, watch as Bits jumps into data analyst mode, adding in relevant metadata, writing SQL queries, and visualizing the final result in a logical, easy to follow chain. I can take this final output and save it to my favorite dashboard or continue working hand in hand with Bits and my team. Notebooks offers a new paradigm for advanced analytics with full context, powerful computational abilities, and real time collaboration.
But one more thing. For those of you with existing queries in tools like Splunk, where you rely on Piped Query syntax, check this out. If I copy and paste an SPL query into a notebook, Datadog automatically understands and translates it for me, recreating the same time series graph in seconds. Welcome to the future. Everything we’ve covered today stems from a simple belief that more data should never mean more complexity or work.
We’re reimagining the way you interact with logs from attention all the way to resolution. Visit the link on screen to learn more or to sign up for early access. Thank you, and I’ll pass it back to Sarah.
Alexey, Co-founder, Datadog3: Thank you, Kelly. You just saw a ton of new features for log management. Let’s do a quick recap of what you saw. with FlexFrozen, we’re delivering a new storage tier, extending your log retention to over seven years. With Archive Search, we allow you to query your logs from cold storage without requiring re indexing.
With Sheets and Notebooks, now also together with Bits AI, we help you analyze your log data in entirely new ways. And of course, last but not least, bring your own query, which makes your migration seamless. No matter what your Datadog log management use case is, we want to make sure we have you covered. But we also don’t want you to just hear about it from us. And now I’d like you to hear from one of our customers around how they’re using Flex logs to deliver superior uptime and performance all at scale.
Let’s hear the story of Okta.
Alexey, Co-founder, Datadog6: Okta is the leading independent neutral identity company. Auth0 is Okta’s developer friendly platform for customer identity.
Alexey, Co-founder, Datadog7: So if Auth0 goes down, this could potentially disrupt businesses by having them lose revenue, having customers upset because they can’t log in to see the information and data that they rely on on a daily basis. With such a complex tech stack and having an uptime SLA of 99.99%, you can imagine that every counts. FlexLogs has been a game changer for Okta Zero. Now that we have FlexLogs, we have our logs in one single view. This has enabled us to have faster root cause analysis and incident resolution.
Alexey, Co-founder, Datadog6: We have significant cost savings by consolidating metrics, logging, and tracing, and we reduce median time to mitigation with a great observability tool. With Chen AI technology, the security landscape will become even more sophisticated. Together with Datadog, we can address the evolving challenges and keep our customers safe.
Alexey, Co-founder, Datadog3: Auth0 and Okta together are a great example of a software company evolving in the age of AI. And we’re super lucky today to have the CTO of Okta, Bhavna Singh, here on stage to tell us more about how they’re rethinking the identity landscape for GenAI applications. Please help me in welcoming Bhavna.
Alexey, Co-founder, Datadog8: Wow. It is exciting to be at this high energy Dash conference. Right? I’m Bhavna Singh, CTO at Okta, the leading identity company with a vision to free everyone to safely use any technology. And as we see the tech industry evolving with AI, we are also working to make agent development and use of user identity by agents safe.
And the reason we need to talk about securing AI agents becomes more important as we look at these stats. 82% of organizations are experimenting to deploy these agents in production environments in the next one to three years. And if you look at the stats on customer expectation, more than 60% of customers have stressed the importance of trust in AI agents. And personally, I believe that as more people understand the power of agentic technology, this number will only grow. So as these AI agents begin to act on behalf of users, answering questions, automating tasks, and making decisions on our behalf, establishing trust in these agents will be essential for their adoption and effectiveness.
So who should build this trust? Well, you. If you are building AI agent applications, you are accountable. AI security starts with identity. As developers are focused on getting the agents to work, connecting them to data sources, and integrating with APIs, a strong, secure identity platform can ensure that they are running in secure environments.
Agents must be built securely right from the start and need to run securely from the deployment. At Okta, we have identified four critical requirements where securing AI agent development is crucial to building GenAI applications. Number one, starting with authentication. For AI agents to operate securely, they must be able to authenticate users just like any other application. It needs to confirm who the user is before providing access or making decisions.
Just as verifying a customer’s identity before making a purchase or a patient’s credentials before giving them access to medical records. Number two is API to API calls. AI agents will interact with different applications on behalf of users and will need API access to call these applications. Without strong identity controls, AI agents could access APIs they should not or leak sensitive data to unauthorized agents or be completely unable to perform tasks on behalf of users. This means access tokens should not be hardcoded.
They need to be stored in secure vault. Number three, another common use case we see is asynchronous workflows. Many agents use cases need them to work asynchronously. For example, actions such as data processing or transaction approvals can take minutes, hours, or sometimes even days. Security systems today are not built for an AI built for long running asynchronous workflows.
So an AI agent might need to perform a task long after a session has already ended. So there is a need to authenticate just in time when agents have to act without leaving the door open for attackers. And lastly, authorization. The need to fine tune data access is more understood use case in AI agent development space today. AI agents should only get the permission that they need and nothing more.
We identified these requirements after partnering and speaking with companies of all sizes and growth levels and built these capabilities out of the box in our OTH for GenAI platform, which is Okta’s platform that makes it easy for developers to solve these requirements with built in identity security for AI agents. As these agents are running, how will we ensure the agents are doing what you built them to do? Monitoring and tracking their behavior is the full circle we need to build this trust Because if they access the wrong data, take unauthorized actions, or if someone hijacks your agent and their and changes their behavior, the impact can be immediate and irreversible. Secure identity and observability have always been important in our software stack, but it’s even more so in today’s AI agent landscape. That’s why in the age of AI agents, we need to treat identity and observability not as optional layers, but as foundational technologies and practices.
Datadog and Okta are well positioned to enable customers to tackle these challenges that AI agents pose. And to highlight the innovative work Datadog is doing in this space, I’m excited to invite my dear friend Yan Bing, chief product officer of Datadog, to stage. Thank you all.
Alexey, Co-founder, Datadog9: Thank you, Wamha. It’s been a real pleasure working closely with the Okta team as their observability partner. Hi. I’m Yanbin Li, chief product officer at Datadog. To just share just how excited I am to be here, I actually accepted the offer to join Datadog after watching the Dash keynote on video a year ago.
And I can’t think of a better way to attend my Dash in person by showing you how Datadog is driving innovation in security and observability for your AI applications and agents. As Bhavna just said, security is even more critical in the age of AI agents. With all these new attack surfaces, that’s possible. And our security team has been busy at work. Since last dash, we launched more than 400 new features and detection.
And today, 7,500 customers, including one in every five Fortune 500 companies, use Datadog Security to protect their infrastructure and applications. Now, as you build and deploy your AI agents, we’re evolving Datadog security to meet the unique challenges of AI at every single layer. At the data layer, where training begins, at the model layer, where reasoning happens, and at the application layer, where you integrate AI into real world application. So to dive deeper into how we’re helping you secure every layer of your AI stack, let’s welcome Vijay.
Tristan Ratchford, Engineering Manager, Datadog0: Thanks, Yanbing. Hey, everyone. I’m Vijay George. I’m a product manager here on the security team at Datadog. Now let’s dive right in to see how we can secure our AI stack from these new attack vectors.
We’ll take a look at a few examples at each layer, starting with data. At the data layer, we need to prevent sensitive data leakage in training datasets and prompt response pairs. Let’s take a look at how this works while I’m training my new AI app with sensitive data scanning enabled. In Datadog, I can see a three d map of my entire cloud infrastructure, which gives me context into how everything’s organized and connected within my cloud environment. Here, this s three bucket has some training data to fine tune my custom model.
And with sensitive data scanning enabled, every file in this bucket is automatically scanned for sensitive PII, which I can now investigate further and jump straight into the AWS console to eliminate that PII. At runtime, I can also quickly switch to identifying PII data leaks in every LLM interaction. Here, I can see my attacker is trying to get a social security number from my model. Datadog automatically flags the input and sets alerts to catch sensitive data leaks when it’s been detected. And that’s a quick look at how Datadog helps detect and prevent sensitive data leaks.
And to help you go further, we’re expanding support to detect sensitive data in API response payloads and other data poisoning attacks coming later in 2025. Next up is the model layer where we need to make sure our AI model is safe and isn’t being manipulated. We’ll start by looking at a supply chain attack where an attacker is targeting the supply chain of an open source model. Now I’ve been testing a lot of different models from Hugging Face, and I’ve accidentally downloaded a malicious deepseq model that can run code and give a threat actor remote access to my app. Luckily, with Datadog, I can see that my DeepSeek model was loaded with PyTorch and triggered an unknown process running shell commands.
Datadog automatically detected the malicious model, killed the process, and stopped the supply chain attack directly at the source. Now let’s look at a example of a model hijacking attempt. Here, I’m using a tool we’ve open sourced called Stratus Red Team that’s gonna help me simulate a real world attack in my own environment. The attack you’re seeing here is an LLM jacking attempt where the attacker is using a stolen access key to hijack my model and use my LLM compute for themselves. This could mean I’m left with a huge bill costing me millions of dollars if I don’t catch it quickly.
Now when I get to Datadog, I can quickly respond to this threat in real time. Here, I can see Lucia Silva is my attacker, trying to access my custom model deployed on Bedrock. And from here, I can jump straight into the related signal to triage and investigate more. We’re continuing to add more support for attack vectors at the model layer, including model drift, model extraction, and jailbreaking coming soon in the near future. And finally, at the application layer, we need to protect our environment from code to cloud.
Let’s take a look at a prompt injection attack in my production app. Now here, I’ve built my app and added some bad code. Now when I open a PR in GitHub, I can see that Datadog prevented a prompt injection attack and blocked the merge automatically. Now if I override the block and the code makes it into production anyway, Datadog can also detect when an attacker exploits that vulnerability. Here, I can see the line of code that an attacker could exploit to trick my LLM and run commands to gain access to my entire system, which I can remediate now directly in Datadog.
And pivoting to my cloud environment, let’s look at a data poisoning attack at runtime. Datadog shows my app is training from a public s three bucket, meaning an attacker could poison the data and maliciously change the model’s behavior. I can now remediate the vulnerability directly in Datadog and meet AI security standards with our out of the box AI compliance frameworks. We’re continuing to build more detections, including agentic tool misuse, novel identity attacks, and denial of service coming later in 2025. And these were just a few examples of how Datadog security can help you protect your AI stack from these new attack vectors.
We’ve partnered directly with AWS to build out our bedrock detection library, and we’re continuing to invest heavily in novel security research, building a comprehensive set of AI detections across cloud providers to make Datadog security the product to secure your AI apps. AI is changing how software gets built today, and we’re evolving Datadog security to help you build and ship these apps securely end to end. We’re so excited to see what you build next. If you wanna learn more about securing your apps in the age of AI, come see us on the demo floor today. And now, I’ll pass it back to Yanbing.
Alexey, Co-founder, Datadog9: Great job. Thank you, Vijay. You just saw how Datadog Security offers security for each layer of your AI, from data that powers training to the models that drive inference to the agents delivering real world impact, all through an integrated security platform. Now that we have secured your AI stack, let’s talk about observing aid. As you integrate AI into your product and workflows, how would you know their behaviors and the interaction between them, and also whether they are delivering the user and business outcome that you’ve intended.
To explore how we deliver end to end AI observability, please welcome Angelie.
Tristan Ratchford, Engineering Manager, Datadog1: Hey, everyone. My name is Anjali, and I’m a product manager here at Datadog. As AI workloads move from r and d to production, GPUs become more and more critical. We’ve heard from you. 30% of model training failures are because of GPUs, and these clusters are often running idle.
Yet even as GPU sales skyrocket, SREs and ML engineers are left without end to end visibility in how GPUs impact their AI workloads. That’s why I’m excited to introduce Datadog’s GPU monitoring. Let’s see it in action. GPU monitoring provides full visibility into your GPU fleet across all major cloud providers, on prem setups, and GPU as a service platforms. You can view your fleet at the cluster level, then drill down to hosts, GPU devices, and even MIG slices.
And it doesn’t stop there. Datadog GPU monitoring solves for various issues. Let’s start with contention. Here, my ML team says that their RAY services are failing recently. In GPU monitoring, within the resource contention section, I see the spike in unmet requests, specifically in my cluster named Yanmega.
I filter down to this cluster. Immediately, I see that there are no a 100 GPU devices available. Not only are we maxing out our current capacity, Datadog has forecasted that demand will continue to max out capacity in the next four hours. Datadog GPU monitoring just helped me identify the type and number of GPUs to solve this contention issue with confidence. GPU monitoring also helps me solve congestion between my GPU nodes.
Let’s say my ML team says that their training times are taking twelve hours longer than usual. With Datadog, I can inspect RDMA and EFA network congestion between GPU nodes and NVLink congestion between GPU devices. This issue sounds like a data starvation issue. Let’s investigate our node. Clicking in, I see that switch one port one experienced a failure that caused a throughput drop in data transfer across my GPUs, impacting overall model training times.
I can reroute RDMA traffic to a working port to improve my ML team’s workload speed and resolve this congestion issue. Lastly, GPUs are a precious commodity, and idle capacity can be the biggest drain on our budget. GPU monitoring helps you stay on top of your total GPU spend. Let’s see this in action. Here, I see that within GPU monitoring, we’ve highlighted and identified the key cost optimization opportunities.
Looks like our cluster named Nidovino is our most expensive cluster with over $157,000 in total spent. Clicking into this cluster, Datadog GPU monitoring shows me my total devices allocated, active, and effectively used GPUs. I see that only 40% of my GPU devices are using their cores effectively, leading to over $28,000 in inefficient spend. I can also see this cost in the context of my entire cluster within CCM. Now, GPU monitoring breaks down GPU consumption by pods, processes, and jobs so I can identify non critical and inefficient workloads.
I see here that there’s a pod hogging eight GPUs with less than 50% core utilization. I’ll ask my ML team to consolidate this pod onto a fewer number of GPUs so that we can reduce our total spend. With GPU monitoring, I’ve connected wasted cost in my cluster to inefficient workloads so I can optimize my cluster’s GPU usage. To recap, Datadog’s GPU monitoring helps us solve for resource contention, data transfer congestion, and wasted cost across our GPU fleets. I’m so excited for you to try this new product.
You can sign up at the link for the preview. And now, I’ll hand it over to Victor to talk about LLM observability.
Tristan Ratchford, Engineering Manager, Datadog2: Thanks, Anjali. Hey, everyone. My name is Victor Vong, and I’m an engineering manager here at Datadog. And today, I want to tell you about the latest innovations in LLM observability. For the past couple years, we’ve seen our customers begin to explore using AI in their workloads.
In 2023, we saw mostly experiments. And at that time, we launched LLM observability to help our customers better observe their AI workflows. But as customers started building on top of these LMs, they needed to go beyond simple monitoring to ensure the outputs from their AI applications were reliable. That’s why last year, we added new capabilities like hallucination detection to help our customers trust their LMs. But now, in 2025, as our customers have gone even deeper into using LMs, we’ve seen them begin to deploy their own custom AI agents.
And while these agents have been very powerful, they also present a new set of unique challenges. For example, agent based applications are a lot more complex than regular workflows. It’s hard to see how these agents make decisions or pick tools so they’re not always reliable. And most tools out there aren’t ready to handle these fast changes. To help you build better custom agents and observe their performance, we’re excited to introduce AI agent monitoring.
Let’s see how it works. Let’s say I’m building a personal finance app called BudgetGuru. BudgetGuru tracks my spending, manages my personal budget, and gives financial advice all using AI. Now let’s take a look at how we could observe the agents powering BudgetGuru in LLM observability. Here, I can see the user’s input and the LM’s response.
What my agent did here was it used multiple LLM calls and different tool integrations, which means to figure out the final answer I’d normally have to scroll through a bunch of complex traces. But now with the new agent execution flow graph, with one click, I see a clear view of how my agents work together to create the final response. There’s a lot my agents are doing here. I can see the triage agent calling the investment and education agents, and the investment agent is calling the budget agent for more information. And all of that is being summarized and sent back to the user.
But thanks to the new agent execution flow graph, all that noise is being filtered out and I can just focus on what matters most. And I can also see how each agent was configured using the new agent manifest. When I click on the triage agent, I can quickly see its instructions, tools, guardrails, agent framework, and model information making it easy to understand the agent’s behavior at a glance. But when my agent breaks, I need to do more than just understand a high level view. I need to drill down and see what’s going on inside each agent.
To help do this, we’re excited to introduce AI agent troubleshooting and LLM observability. Let’s see it in action. Here the user asked about their latest dining spend but got a vague answer that missed all the important details. So I’m going to open the agent execution flow graph to see what’s going on. After taking a quick look, I notice an error flag on my triage agent.
So clicking into it, I see there is a tool selection error being highlighted and notice it’s an irrelevant tool call. It looks like the web search tool was prematurely picked because of a vague prompt. So to fix this problem, I wanna try testing with a few different prompts to see which one gives me the best results. So I can do this using experiments. It’s a new way to quickly test and validate changes you make to your LLM applications.
Let’s use the same example. I’m going to pick a dataset and add this trace to it. Now that I have a dataset with that problematic trace, I’ve decided to test out three different models with three different prompts. I would normally run these experiments, dump everything into a CSV, and analyze all that data by hand to figure out the best prompt and model setup. But doing this was always so messy and a lot of work.
Thanks to Datadog’s new experiments SDK, I can run all these experiments in parallel and very easily analyze and pick the best prompt and model. Let’s see this in action. Here on the experiments page, each line is an experiment and it’s set up. And I can compare things like duration and tool selection accuracy. With one click, I can filter for the highest tool selection accuracy using the cards on the left, and I’ll also filter for low duration using our brush filtering.
And in two clicks, I went from nine experiments to just two, and it looks like it’s coming down to prompts v one and v two using two different models. Let’s compare them using the new experiment comparison page. Here, I can easily compare the experiments, see all the details side by side, and a quick summary at the top. And after taking a quick look, I can see that the GPT 4.1 model with prompt v two has the highest tool selection accuracy and roughly the same duration. So I’ll choose that combo to deploy into production.
I’ve now gone all the way from troubleshooting my custom AI agent to improving it through experiments. So to recap, we’ve just seen how Datadog’s LLM observability can help us monitor how our agents interact, run experiments to test our changes, and debug and troubleshoot errors all in one single platform. We support all popular agentic frameworks such as OpenAI Agents, CrewAI, Landgraf, Pydantic dot ai, Mistral Agents, Google’s ADK, Amazon Bedrock, and more. And we’re excited to get this into your hands. Sign up today at -con.i0/agents.
We look forward to working with you towards an agentic future. Thank you.
Alexey, Co-founder, Datadog3: Now
Tristan Ratchford, Engineering Manager, Datadog2: now I’d like to introduce Kathy who will talk about how to monitor your evolving enterprise stack that will soon include agents built by others.
Tristan Ratchford, Engineering Manager, Datadog3: Hi, everyone. My name is Kathy Lin, and I’m a senior product manager here at Datadog. So Victor just walked us through how Datadog helps teams evaluate the performance of the custom AI agents those teams are building. But there’s now an ever growing number of external agents integral to the business that these teams don’t build in house. So understanding these party agents’ behavior is equally as important to achieve new efficiencies and accelerate innovation.
And we’ve heard from you that keeping track of what each agent is doing and how they’re interacting with each other is extremely challenging, especially when you’re worried about security breaches or wasted investments. The good thing is Datadog is all about providing visibility to help teams scale safely. To help solve the challenges that come with integrating AI agents, I’m excited to introduce Datadog’s AI agents console. With AI agents console, you can now monitor the behavior and interactions of any AI agent that’s a part of your enterprise stack, whether that’s a computer use agent like OpenAI’s operator, IDE agent like Cursor, DevOps agent like GitHub Copilot, or enterprise business agent like AgentForce, all in addition to your internally built agents. And with this visibility into both custom and external agents, Datadog helps you understand which agents are supporting your business and what actions are they executing.
Are they doing so securely and with the proper permissions? Do they deliver measurable business value? And lastly, how are your end users engaging with your agent powered business? So let’s jump into Datadog, observe these agents, and get some answers. With a few simple set of clicks, I can instantly see a comprehensive summary of every agent that’s powering my business.
And for each of these agents, I get key insights out of the box. For example, I can see the total monthly costs of using these agents as well as the error rate across each of my agents to easily detect the most ineffective ones for further investigation. Now let’s deep dive into one of these agents, Anthropix computer use agent powered by CloudSonnet 3.7. Here, CloudSonnet powers my Slack based AI agent, which creates personalized spreadsheets for each of my customer success managers of their churn risk customers and the respective product features that have blocked implementation, requiring Sonnet to access multiple systems like Salesforce, Jira, and Google Drive. I can also see more granular insights about this agent, such as the task completion status, which actions Sana took and which ones failed, and when I wanna dive deeper into this agent’s performance, security, or business value, I can do so by using the tabs on the left.
And if there’s an increase in the number of task failures, I’m alerted instantly. So let me click into this tile to see why this spike is happening. This brings me to the activity insights tab where I can see user engagement insights like daily active users, who my power users are, and quickly filter to those failed sessions without needing to write a single query. And it looks like we have quite a few failed interactions here, so I’m gonna click into one to see what’s going on. By doing so, I get a replay of every action this agent has taken, which is amazing because I’ve just gone from not knowing what this agent is doing at all to seeing exactly where it’s clicking and what it’s entering into the browser, browser, like signing into Salesforce or navigating into the analytics tab to pull that list of turn risk customers.
I can also click anywhere on the corresponding events timeline on the right to jump to that exact moment in the replay. Now let’s see if we can figure out why this interaction failed with this error. I can click into this detailed side panel, which quickly reveals why the agent has failed to generate that requested spreadsheet. It looks like it lacked proper permissions in Salesforce to view customers’ churn risk states. This led to the agent repeatedly trying to query on an unavailable field.
So by simply granting Sonnet proper permissions, I can restore user engagement and boost the business value that Sonnet provides. So to summarize, Datadog’s AI agents console allows you to innovate safely and with confidence. You’ll get full visibility into every agent’s actions, insights into the security and performance of every agent, quantifiable business value for all of your agents, and ultimately proof that your AgenTic AI investments are paying off with your end users. And we can’t wait to get this to all of you. Sign up to become one of our design partners by following the link above.
Thank you. And now back to you, Youngbin.
Alexey, Co-founder, Datadog9: Thank you, Angelie, Victor, and Kathy. You just saw how Datadog provides end to end observability across your AI stack. GPU monitoring monitors and troubleshoes your GPU’s congestion and contention and cost so you get the best out of your GPU investment. LLM observability help you build and operate your LLM applications, including agents. And with AI agent monitoring, with agent troubleshooting, and experiments.
Last but not least, AI agents console gives you full visibility and control across your entire sphere of agents running your business, whether they are developed in house or by party. So that was AI observability end to end. Now switching gear. We also know AI is only as good as the data powering it. So how can you gain deep insight into the quality and lineage of the data that’s powering your AI?
I’d like to invite Kevin to tell you more.
Alexey, Co-founder, Datadog0: Hi, everyone. I’m Kevin Hu, a staff PM at Datadog and formerly the CEO of MetaPlan. Together with my friend Ian, who leads data at Ramp, I’ll be talking to you about a new topic for Datadog, data observability. As we just heard, companies are increasingly using AI to provide better experiences to their customers and build more efficient businesses. Underneath these AI systems is proprietary data, which is data only your company has and serves as your durable, differentiated advantage.
In other words, your AI is only as good as your data. And now I’ll hand it over to Ian, who’ll talk about how Ramp uses data as a competitive advantage.
Tristan Ratchford, Engineering Manager, Datadog4: Thanks, Kevin. Hey, everyone. I’m Ian, the head of data at Ramp. Ramp helps over 35,000 companies control spend, automate accounting, and manage vendors all in one place. We help the average customer save 5% per year on expenses, and we’re headquartered right here in New York City.
Across the company, we collect unstructured data like receipts, invoices, and bank statements, as well as structured data from systems you’re familiar with. Our data team transforms this data would you mind going back a slide? Thank you. As well as structured data from systems you’re familiar with. Our data team transforms this data to power critical use cases across the company.
I’ll start with one example. Capital markets operations. Ramp’s business is extremely cash intensive. We work closely with banking partners like Goldman, Citi, and Barclays to maintain big lines of credit that we borrow against our receivables. That means we need to know which businesses owe us money at any point in time down to the down to the cent.
And that’s hard. Credit card transactions can be reversed. Authorizations can be held and removed, and Uberride may be multiple transactions once you include a tip. We also depend on many parties who may send us data with duplicate rows, missing entries, and incorrect numbers. When that happens, it breaks the trust that we have with lenders.
By flagging when data doesn’t pass the smell test, data observability helps our capital markets team sleep better at night and in turn helps us extend customers the credit they need to run their businesses. Moving from operations to product, one of the most exciting products we’ve built combines data and AI. It’s called price intelligence. Over time, we’ve collected millions of PDFs, receipts, and statements across customers and vendors. Traditional OCR and rules based systems didn’t scale.
But with large language models, we convert this massive and messy set of documents into structured data. Then we surface pricing trends, outliers, and benchmarks across billions of anonymized transactions. So when you’re looking at a contract, you can see what might be overpriced, how it compares to peers, and whether you can negotiate it down. But invoices change. Pricing models shift.
LLMs aren’t perfect. By catching these issues, data observability helps customers trust what they’re seeing. We know foundational models will keep improving, but we believe there are really only two modes: customer context and data. Thanks to Ramp’s product, engineering, and design teams, we’re in a position to be that system of record. Now it’s the data team’s job to capitalize on that opportunity, and we can’t do it without trust.
And data observability helps us get there. And with that, I’ll pass it back to Kevin.
Alexey, Co-founder, Datadog0: Thanks, Ian. Ramp shows us what’s possible when data goes right. But what about when data goes wrong? This diagram probably looks familiar to you. Data flows from sources through a warehouse to downstream AI and BI tools.
Everything looks fine until a customer flags the data issue. You start troubleshooting, but the context is either fragmented, messy, or missing entirely, and meanwhile, the problem compounds, and you start to lose the things that are easy to lose but hard to regain. Time and trust. We don’t think working with data should be this way. And to help you shift from reactive firefighting to proactive action, we’re introducing Datadog data observability, now available in preview.
So let’s say I’m a data engineer at a financial operations company like Ramp, call it Poly. And there’s an issue where the quoted prices are incorrect. Instead of the issue going unnoticed and then eventually impacting customers, I instead get a Slack alert saying that the quoted prices are lower than expected based not on manual checks, but on machine learning models trained on historical data that takes trends and seasonality into account. And to learn more, I entered Datadog. And within Datadog, I ask myself three questions.
Number one, is this real? Number two, does it matter? And number three, what can I do about it? And to answer that question of, is this real? I look at the most recent data points that failed.
And it looks like, yes, there are several occurrences of data below the expectation, so there’s probably a real issue occurring. But number two, does it matter? Instead of trawling through query logs to try and find the downstream dependencies, Datadog automatically parses them for me. That’s how I know an executive reporting BI dashboard and a table in a vector database storing embeddings are affected. So clearly, this issue does have a real impact.
And finally, I ask myself the question. What can I do about it? And usually, that’s when I’m out of luck. I don’t know where data comes from because my visibility is limited to the data warehouse. Datadog helps me map all the way upstream, integrating lineage and context across products.
So I see this Snowflake metric is materialized by a Spark transformation, which is erroring out. When I look into the details of the most recent job run, the caught exception indicates to me that the job expects to see a file in s three that’s no longer present. To figure out what’s wrong with the processes generating that file, I zoom out to the full end to end lineage view. It looks like the s three bucket is the destination of a Kafka pipeline fed by a microservice. I inspect the microservice that’s producing those Kafka messages and see that there is a recent schema change which corresponds to recent feature push.
And to resolve this, I ping the on call engineer to roll back the relevant PR. What you just saw is a combination of deep data quality checks and machine learning models that are tailored to the enterprise data quality domain overlaid on end to end data lineage. Now, what do I mean by end to end? Well, existing data observability products typically start from the warehouse, then shift one step to the left or one step to the right. But by starting towards the end and with a limited view, the damage is often already done.
Datadog data observability is the only product that spans the entire data lifecycle, starting all the way from the services and applications that produce data, to the streams and ingestions that move data, to the jobs that transform data through the warehouse, to the BI and AI systems that consume data. So now you finally have the visibility across the full data lifecycle to detect issues sooner, resolve them faster, and ideally, prevent them from happening in the place. Datadog data observability helps companies like Ramp, Justworks, and Glassdoor trust the data that powers their businesses. And if you want that same level of confidence in your data, you can sign up for the preview today or visit us at our booth to learn more. Thank you.
And now I’ll hand it back to Yanbing.
Alexey, Co-founder, Datadog9: Thank you, Kevin and Ian. It’s really exciting to see how Datadog, Data Observability, can help you not only deeply understand your data sets, but also the entire lineage so that you can understand your data lifecycle. So today, we’ve covered a lot. We started with introducing a fleet of fully autonomous biz AI agents, including SRE, security analysts, and dev agent, to help you boost your team’s productivity and reduce time to resolution. We talk about the on call RealVoice interface that lets you jump start an incident response.
We then introduce Datadog IDP that can help your development team build better software faster with more confidence. And the Datadog MCP server allow you to build your own agent with rich observability context from Datadog. We’re also reimagining observability from APM to logs and more. And Datadog Security help you protect your data stack at every layer. Last but not least, end to end AI and observability so that you have full visibility across your entire AI application and data.
So what do you think? Is it a lot? But wait. We actually launched so much more. Yes.
What you’re looking at are all the features we’re launching today at this dash. And this may be the most visually non exciting slide you’ve ever seen. As a product person, it really makes my heart sing because this truly represents the hard work from thousands of engineers so that we can help you observe, secure, and act better on your data and applications. So now, I don’t expect this Dash keynote to have the same effect on you as it did to me personally last year. But seriously, please visit data.hub so you get to see more live demos together with our product and engineering experts.
And also attend the breakout session where we not only talk about product and technology, but most importantly, real customer stories from many of you in the audience. And that’s a wrap. Thank you, and have a fantastic Dash.
This article was generated with the support of AI and reviewed by an editor. For more information see our T&C.