The history of software follows a consistent pattern: each new paradigm shift demands new infrastructure for trust and reliability. The shift from monolithic to distributed systems gave rise to application performance monitoring. The move to cloud-native architectures created modern observability. In many cases, the companies that built the measurement and monitoring layer became deeply embedded in engineering workflows and the infrastructure stack.
AI is now undergoing its own version of this transition, and we believe the stakes are higher than ever. Over the past year, AI applications have evolved rapidly from isolated experiments to production systems that reason, take action and operate through complex, multi-step agentic workflows. But while model capabilities have advanced at a remarkable pace, the infrastructure required to trust these systems has lagged behind. This gap – between what AI can do and what teams can verify it is doing in production – has become one of the most consequential challenges in modern software development.
Today, we are excited to announce our partnership with Braintrust as we lead their Series B, supporting their mission to measure, evaluate, and improve AI in production.
Why Evaluation and Observability Are Different for AI
Traditional software testing rests on a foundational assumption: determinism. Run the same input, get the same output. AI systems break this contract entirely. Non-deterministic outputs, emergent behaviors in multi-step agents and sensitivity to subtle prompt changes mean that existing observability tools, no matter how sophisticated, were not designed for this problem. Many teams today still rely on vibe checks, spreadsheets or manual prompt tuning to assess quality. These approaches may suffice in early prototyping, but they do not scale and they do not provide the assurance required to ship AI agents into real-world workflows where the cost of failure is high. As AI systems take on greater autonomy and responsibility, the absence of rigorous evaluation and observability could become a compounding liability.
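To make the contrast concrete, here is a minimal sketch in Python (with hypothetical stand-ins for the model call and scorer – not Braintrust's SDK) of how AI testing replaces a single exact-match assertion with scored evaluations aggregated over repeated runs:

```python
# Illustrative sketch only -- `call_model` and `score_output` are hypothetical
# stand-ins, not Braintrust's API. The point is the shift from exact-match
# assertions to scored evaluations over repeated, non-deterministic runs.
import random
from statistics import mean


def call_model(prompt: str) -> str:
    """Stand-in for a non-deterministic LLM call."""
    return random.choice([
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "France's capital city is Paris.",
    ])


def score_output(output: str, expected_fact: str) -> float:
    """Stand-in scorer: 1.0 if the expected fact appears, else 0.0.
    In practice this might be an LLM judge or a task-specific rubric."""
    return 1.0 if expected_fact.lower() in output.lower() else 0.0


# Traditional software test: one run, one exact assertion.
assert sorted([3, 1, 2]) == [1, 2, 3]

# AI evaluation: sample repeatedly, score each output, gate on the aggregate,
# because no single run is guaranteed to match a fixed string.
scores = [
    score_output(call_model("What is the capital of France?"), "paris")
    for _ in range(5)
]
assert mean(scores) >= 0.8
```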
Why Braintrust
When we first met the Braintrust team, it was clear to us that they understood this problem with a depth and precision that set them apart. What began as an effort to build the observability layer they saw missing from the AI ecosystem has evolved into an end-to-end platform spanning the entire AI product lifecycle, from prompt iteration to automated evaluations and production monitoring. It gives teams a systematic way to assess, understand, log, refine and improve AI-enabled products over time. In doing so, Braintrust serves as a central nervous system for how AI applications are built, measured and improved.
Customers echo this consistently, citing Braintrust's developer experience and UI as what enables adoption beyond engineers to product managers and other non-technical stakeholders. Under the hood, the team's commitment to performance is equally evident. When they encountered scaling limitations with their third-party database, they built Brainstore, a custom database for LLM logs designed for faster queries, lower latency and lower infrastructure costs at the volumes AI workloads demand. Hybrid deployment lets enterprises move quickly without compromising data residency or security, while token-level traceability provides a depth of visibility that traditional observability tools were never designed to offer.
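As a rough illustration of what token-level traceability implies – a generic sketch whose field names are assumptions, not Brainstore's schema or Braintrust's API – a logged record for a single model call might carry per-token detail alongside the prompt and output:

```python
# Generic illustration of a token-level trace record -- field names are
# assumptions, not Brainstore's schema or Braintrust's API.
import json
import time

trace_record = {
    "trace_id": "trace-123",          # hypothetical identifiers
    "span": "answer_user_question",
    "prompt": "Summarize the ticket.",
    "output": "The customer reports a billing error.",
    "usage": {"prompt_tokens": 42, "completion_tokens": 11},
    "token_log_probs": [-0.02, -0.15, -0.08],  # per-token detail for debugging
    "latency_ms": 830,
    "timestamp": time.time(),
}

# Records like this accumulate at high volume in production, which is why
# purpose-built storage and fast queries matter.
print(json.dumps(trace_record, indent=2))
```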
In an AI landscape often shaped by noise rather than substance, we were struck by the quality of teams that have already standardized on Braintrust. Leading applied AI companies including Ramp, Notion, Replit, Stripe, Zapier, Airtable, Instacart and others are building with Braintrust to scale their AI products in production. We believe this early adoption reflects both the urgency of the problem and the rigor of the solution. We are already seeing strong signals of recurring, embedded usage, with customers running thousands of evaluations per day.
A Builder Who Builds for Builders
We are fortunate to be partnering with Ankur Goyal, Braintrust’s Founder and CEO, who has spent his career building products for developers. Ankur previously led engineering at SingleStore, then founded Impira, a machine learning platform for unstructured data that was acquired by Figma, where he went on to lead their AI platform. Across each chapter, Ankur has focused on solving hard infrastructure problems with a deep respect for developer workflows and a maniacal focus on customer experience – responding to customer issues himself, even today. That perspective is evident in Braintrust's product philosophy, its pace of execution and the trust it has earned from some of the most discerning engineering teams in the AI ecosystem.
Looking Ahead
We believe Braintrust is meeting a pivotal moment. As AI systems shift toward agentic workflows with increasing autonomy, evaluation and observability are transitioning from best practices to essential infrastructure – embedded directly into the daily workflows of AI builders. The companies that get evaluation right will be better equipped to ship faster, fail less often and earn the trust required to deploy AI into the workflows that matter most. We are excited to partner with Ankur and the Braintrust team as they continue building the infrastructure that helps make trustworthy AI possible.
Published: February 17, 2026