Press Release

Quesma Releases OTelBench: Independent Benchmark Reveals Frontier LLMs Struggle with Real-World SRE Tasks

New benchmark shows the top LLM achieves only a 29% pass rate on OpenTelemetry instrumentation, exposing the gap between coding ability and real-world SRE work.

Tuesday, January 20th 2026, 2:46 PM EST by Advertising Content

WARSAW, Poland--(BUSINESS WIRE)--

Quesma, Inc. announced the release of OTelBench, the first comprehensive benchmark for evaluating LLMs on OpenTelemetry instrumentation tasks, revealing significant gaps in AI's ability to handle production-grade Site Reliability Engineering (SRE) work.

While frontier LLMs have demonstrated impressive coding capabilities, the best-performing model, Claude Opus 4.5, achieved only a 29% pass rate on OTelBench, compared with an 80.9% pass rate on SWE-Bench, highlighting a critical gap in production engineering skills.

Enterprise outages cost an average of $1.4 million per hour, making production visibility mission-critical. Yet 39% of organizations cite complexity as their top observability obstacle. The benchmark exposed context propagation as an insurmountable barrier for most models, a particularly concerning finding given that context propagation is fundamental to distributed tracing.

"The backbone of the software industry consists of complex, high-scale production systems with mission-critical reliability," said Jacek Migdal, founder of Quesma. "OTelBench shows that while LLMs are impressive at generating code, they're not yet capable of fundamental instrumentation tasks, even at a small scale, or of the end-to-end problem-solving required for production engineering. Many vendors are marketing AI SRE solutions with bold claims but no independent verification."

Models had some moderate success with Go and, quite surprisingly, C++. A few tasks were completed for JavaScript, PHP, .NET, and Python. Just a single model solved a single task in Rust. None of the models solved a single task in Swift, Ruby, or Java.

"AI SRE in 2026 is what DevOps Anomaly Detection was in 2016; lots of marketing but lacking independent benchmarks," Migdal added. "That's why we're releasing OTelBench as open-source: to create a North Star for navigating the AI hype and enable the community to track real progress."

OTelBench is available today at https://quesma.com/benchmarks/otel/.

ABOUT QUESMA:

Quesma serves frontier LLM labs and AI agent makers through independent evaluation and advanced simulation environments. The company provides benchmarks across critical domains, including DevOps, security, and database migrations. Quesma is backed by Heartcore Capital, Inovo, Firestreak Ventures, and several angels, including Christian Beedgen, co-founder of Sumo Logic. For more information, visit www.quesma.com or follow Quesma on LinkedIn.

View source version on businesswire.com: https://www.businesswire.com/news/home/20260120541179/en/

Lucie Šimecková
Marketing
[email protected]
