Tidemark’s Annual Benchmarking Report is upon us. Want to get up to speed on how your vertical AI company is performing relative to the market? Use this link to get started, access data across your specific peer set, and learn what it takes to become a standout Vertical SaaS business.
Can the Models Do Useful Stuff?
Every passing week brings a new model or system benchmarked against evaluation suites that are meant to be “impossible.” The latest example? Deep Research scoring a record 30+% on “Humanity’s Last Exam”. Cool name. It unfortunately doesn’t tell you anything about humans in an AI world. It does, however, tell you that models are improving across a wide range of domains and general capabilities. But rising general capability is not the same thing as doing useful work inside a specific business.
Why? Because an impressive score on a broad, academic-style test doesn’t automatically translate to reliability in a specific industry setting, where tasks are governed by intricate compliance rules, unique workflows, and intangibles around brand and culture. A model can excel on general knowledge tests and still produce low-value outputs in real operational contexts. Models often do.
This, however, does not nix the value of evals; it dramatically heightens it. Evaluation sets and their resulting confidence scores, risk categorizations, and more are exactly the metrics that businesses need to operationalize thousands of agents.
To extend the concept from last week, the key to AI penetration is eval-driven token utilization. This evaluation data should be leveraged not as a standalone developer tool, but as proof of performance inside the AI products themselves.
Generic Benchmarks Fall Short
The Growing Problem
As AI adoption matures, companies will increasingly rely on chains of autonomous agents (often structured as DAGs) to handle sensitive tasks. A shipping firm might chain agents to optimize routes, negotiate freight quotes, or communicate shipping updates. Each agent might tap into external APIs, fetch data from internal systems, or even talk to other agents. When one step fails—due to misaligned context, incorrect prompts, or inaccurate data—the entire process potentially breaks down.
Depending on how often the process breaks down and how easy it is to correct, this may be a worthwhile tradeoff. After all, every process has errors.1 But how would the business user know? This is complicated today by how vertical AI companies like to obfuscate their evals.2 Because evals stay hidden, human operators are left to audit every process or agent step essentially one by one. That may be more streamlined than in the past, but it doesn’t scale.
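To make the failure mode concrete, here is a minimal sketch of the kind of chained workflow described above, where each step carries its own eval check so a breakdown surfaces at the step level. The agent functions, tools, and thresholds are placeholders, not a real vendor API.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    name: str
    output: dict
    passed: bool
    reason: str = ""

def run_chain(steps, context: dict) -> list:
    """Run agent steps in order; halt at the first step whose eval check fails."""
    results = []
    for name, agent_fn, eval_fn in steps:
        output = agent_fn(context)
        passed, reason = eval_fn(output)
        results.append(StepResult(name, output, passed, reason))
        if not passed:
            break  # surface the failing step instead of silently continuing
        context = {**context, **output}
    return results

# Hypothetical shipping workflow: each agent function is a stub standing in for an LLM call.
def optimize_route(ctx):  return {"route": ["SFO", "ORD", "JFK"], "eta_days": 4}
def negotiate_quote(ctx): return {"quote_usd": 1850.0}
def draft_update(ctx):    return {"message": f"Shipment via {'-'.join(ctx['route'])}, ETA {ctx['eta_days']} days."}

steps = [
    ("route",  optimize_route,  lambda o: (o["eta_days"] <= 7,    "ETA within SLA")),
    ("quote",  negotiate_quote, lambda o: (o["quote_usd"] < 2500, "quote under budget cap")),
    ("update", draft_update,    lambda o: ("ETA" in o["message"], "update mentions ETA")),
]

for r in run_chain(steps, {"origin": "SFO", "destination": "JFK"}):
    print(f"{r.name}: {'PASS' if r.passed else 'FAIL'} ({r.reason})")
```

Even in this toy form, the open question remains: who reviews the failing step, and at what granularity?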
We are talking about potentially tens of billions of tokens being spent by a business annually. Humans cannot sit at the center of reviewing every single one. We can draw some conclusions:
Vendors and businesses might disagree on the eval metrics, and the feedback mechanisms for turning that disagreement into useful, personalized models for businesses are currently lossy.
Evaluations will necessarily become business-level metrics if we don’t want humans reviewing every micro-step.
Agents will potentially be evaluated across 50+ metrics that need to be distilled into an actionable business insight (see the sketch after this list).
The feedback loops for determining success are going to differ qualitatively across businesses. For instance, it may be impossible to render an immediate evaluation on a voice agent. You can render certain tonal evals (the customer was happy!), but what you really want to evaluate is the actual close of the sale, which may come long after the initial contact.
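How might that distillation work? A minimal sketch, with illustrative metric names and weights (not drawn from any particular vendor or product), that collapses raw eval metrics into one score plus the handful of metrics worth a reviewer’s attention:

```python
# Illustrative only: metric names, values, and weights are assumptions, not a vendor schema.
raw_metrics = {
    "factual_consistency": 0.94,
    "citation_coverage": 0.88,
    "brand_tone_similarity": 0.81,
    "tool_call_validity": 0.97,
    # ... in practice, 50+ metrics per agent run
}

weights = {"factual_consistency": 0.4, "citation_coverage": 0.25,
           "brand_tone_similarity": 0.15, "tool_call_validity": 0.2}

def distill(metrics: dict, weights: dict, floor: float = 0.85) -> dict:
    """Collapse many eval metrics into one score plus the few metrics worth a human's attention."""
    confidence = sum(metrics[k] * w for k, w in weights.items())
    weak_spots = sorted((k for k, v in metrics.items() if v < floor), key=metrics.get)
    return {"confidence": round(confidence, 2), "needs_attention": weak_spots}

print(distill(raw_metrics, weights))
# e.g. {'confidence': 0.91, 'needs_attention': ['brand_tone_similarity']}
```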
Evals Are Currently a Developer Product
This is compounded by a problem in the AI dev landscape: Most evaluation tools and metrics cater to data scientists or AI engineers. They delve into things like perplexity, BLEU scores, or token-wise accuracy. These are critical and inform the future confidence scores that businesses need. However, business users—from compliance officers to C-suite executives—often need a simpler representation: “Is this output reliable for the task at hand, yes or no?” or “How confident are we in this result?”
When these metrics aren’t clear, business stakeholders struggle to trust the AI and the vertical AI vendor, and churn is the result. Today, this is a product gap, not a model capability gap. Business users need:
Simple Confidence Scores: High-level indicators of how likely an output meets compliance rules, brand style, or domain standards.
Quick Pass/Fail Checks: Binary flags for major deal-breakers like regulatory violations or brand-inconsistent language.
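A hedged sketch of what those quick pass/fail checks could look like as simple rule-based gates; the phrases and rules below are placeholders that a compliance or brand team would supply in practice:

```python
import re

# Placeholder rules: a real deployment would source these from compliance and brand teams.
PROHIBITED_CLAIMS = [r"\bguaranteed returns\b", r"\brisk[- ]free\b"]
BANNED_BRAND_TERMS = [r"\bcheap\b", r"\bbest in the world\b"]

def deal_breaker_checks(text: str) -> dict:
    """Binary flags a business user can read at a glance: True means the check passed."""
    return {
        "no_prohibited_claims": not any(re.search(p, text, re.I) for p in PROHIBITED_CLAIMS),
        "on_brand_language": not any(re.search(p, text, re.I) for p in BANNED_BRAND_TERMS),
        "has_required_disclosure": "past performance" in text.lower(),
    }

draft = "Our fund offers guaranteed returns. Past performance is not indicative of future results."
print(deal_breaker_checks(draft))
# {'no_prohibited_claims': False, 'on_brand_language': True, 'has_required_disclosure': True}
```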
Confidence Scores, Extended Automation, and the Human in the Loop
A reliable LLM is not a 100% accurate one—no model (and no human) is. The real question is whether the AI-assisted process can handle errors gracefully. Exposed evals ultimately lead to enhanced throughput inside the process while enabling iterative improvement.
Evals trigger reviewing what matters instead of reviewing every potential miscalibration.
A mixture of spot checks and confidence scores lets companies retain control over quality without having to review everything.
This in turn informs risk analysis. Good evals will move past raw confidence scores and incorporate business users’ perceptions of risk, ultimately enabling speedier review and intervention on the most critical tasks (see the sketch below).
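To make the interplay of confidence scores, spot checks, and risk weighting concrete, here is a sketch of a review-routing policy. The thresholds and sampling rates are assumptions, not recommendations:

```python
import random
from typing import Optional

def review_decision(confidence: float, risk_level: str,
                    spot_check_rate: float = 0.05,
                    rng: Optional[random.Random] = None) -> str:
    """Route an agent output: auto-approve, sample it for a spot check, or escalate to a human."""
    rng = rng or random.Random()
    # Higher-risk tasks get a stricter confidence bar and a heavier spot-check rate.
    threshold = {"low": 0.80, "medium": 0.90, "high": 0.97}[risk_level]
    if confidence < threshold:
        return "human-review"
    if rng.random() < spot_check_rate * (3 if risk_level == "high" else 1):
        return "spot-check"
    return "auto-approve"

rng = random.Random(7)  # seeded so the example is reproducible
for conf, risk in [(0.95, "low"), (0.99, "high"), (0.85, "medium")]:
    print(conf, risk, "->", review_decision(conf, risk, rng=rng))
```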
Iterative Improvement
As the system learns from failures—where humans override or correct outputs—those corrections feed into fine-tuning.
Over time, fewer outputs fall below the confidence threshold, enabling more efficient workflows.
This is perhaps the most exciting aspect. Teams can choose to blend industry-wide evals with company-specific ones. Ultimately, as fine-tuning and distillation improve, the golden data and evaluation feedback captured inside the process become one of the most compelling pseudo-systems of record that vertical AI providers have.
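As a rough illustration of that capture, here is a hedged sketch of how human overrides might be logged as golden data for later fine-tuning and eval refreshes; the record schema and file location are hypothetical:

```python
import json, time
from pathlib import Path

GOLDEN_PATH = Path("golden_corrections.jsonl")  # hypothetical location

def record_correction(task_id: str, model_output: str, human_output: str,
                      reason: str, metrics: dict) -> None:
    """Append a human override to the golden dataset used for fine-tuning and eval refreshes."""
    record = {
        "task_id": task_id,
        "timestamp": time.time(),
        "model_output": model_output,
        "human_output": human_output,
        "override_reason": reason,  # e.g. "non-compliant", "off-brand", "factually wrong"
        "eval_metrics_at_time": metrics,
    }
    with GOLDEN_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

record_correction(
    task_id="quote-1042",
    model_output="Quote: $1,850, payment due in 15 days.",
    human_output="Quote: $1,850, payment due in 30 days per our standard terms.",
    reason="violates company payment-terms policy",
    metrics={"confidence": 0.88},
)
```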
This approach empowers use-case productivity even with imperfect models. Instead of chasing 100% accuracy, the focus shifts to designing processes where evals provide quick, actionable insights on whether an output meets the bar for production use. This is simply far more actionable for businesses than trusting a general 90% measure.
Controlling the Evals, Controlling the Industry
When your company is the one designing and maintaining the eval framework, the result becomes a de facto standard for industries to adopt. This isn’t just technical oversight—it’s market power. A widely adopted eval suite becomes the stamp of approval for industry organizations. Ultimately this could yield SOC-2 like market outcomes.3
Trusted Arbiter: If clients, regulators, and even competitors use your eval suite to gauge AI performance, your company becomes the gatekeeper of “acceptable outputs.”
Defensible Moat: As you refine the eval to capture deeper domain nuances, it becomes harder for copycats to match your level of precision.
Evolving Standard: You can adjust the eval as regulations change or new best practices emerge—keeping your solution at the center of the vertical’s AI conversation.
It’s similar to how credit rating agencies set the rules for who’s “investment grade.” By defining the criteria, they also define market behavior. In vertical AI, controlling evals can confer a similar influence.
Industry-Specific Benchmarks vs. Company-Specific Evals
Industry-Level Evals: The First Step
For vertical AI companies, industry-focused evals are invaluable in proving the model meets domain-specific challenges (e.g., HIPAA compliance in healthcare, SEC rules in finance, or federal regulations in logistics). These standardized tests reassure a prospective client that a solution has basic alignment with sector norms.
The Inevitable Need for Company-Specific Adaptations
Yet, each organization has its own processes, brand identity, and “secret sauce.” Relying solely on industry-level evals can lead to:
Misalignment: A major hospital might have stricter data-sharing protocols than the general healthcare guidelines.
Loss of Differentiation: Companies often worry that adopting a purely industry-standard AI could commoditize their unique brand voice or proprietary workflows.
By offering company-specific eval extensions, a vertical AI vendor can alleviate these concerns, allowing businesses to preserve their distinct identity—whether it’s a particular tone in customer communications or a specialized claims-approval process that sets them apart.
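One way to picture this layering, with hypothetical class names and checks: a shared industry suite that individual clients extend, and tighten, rather than replace.

```python
from typing import Callable

Check = Callable[[str], bool]

# Hypothetical industry-level suite: the shared floor every client inherits.
INDUSTRY_HEALTHCARE_CHECKS = {
    "no_patient_identifiers": lambda text: "SSN:" not in text,
    "includes_required_disclaimer": lambda text: "not medical advice" in text.lower(),
}

def build_eval_suite(company_checks: dict) -> dict:
    """Company-specific checks extend (and may tighten) the industry baseline."""
    return {**INDUSTRY_HEALTHCARE_CHECKS, **company_checks}

# A hypothetical hospital with stricter data-sharing rules than the industry baseline.
hospital_suite = build_eval_suite({
    "no_external_data_sharing": lambda text: "forward to third party" not in text.lower(),
    "uses_hospital_letterhead_tone": lambda text: text.startswith("St. Example Medical Center"),
})

draft = "St. Example Medical Center update: this summary is not medical advice."
print({name: check(draft) for name, check in hospital_suite.items()})
```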
Game Theory: Maintaining Distinctives
Companies are rightly anxious about commoditizing their offerings in an AI-driven landscape. The more core processes an AI handles, the more critical it becomes to preserve brand differentiation and proprietary logic. Serving evals that capture these differences ensures your clients keep their edge. It also signals that you, as a vendor, won’t flatten their uniqueness into a generic workflow.
Surface Area of Evals: Agents, Tools, and Processes
As agent-based systems grow in complexity, the “surface area” of what needs evaluating expands:
Background Context Checks: Is the agent calling the right tool for the job? Is it referencing outdated data?
Chain-of-Thought Audits: In “power user mode,” you might want to display how the AI arrived at its conclusion, highlighting any flawed reasoning steps.
Multi-Step DAGs: Each node in a workflow might require a separate set of checks to validate partial outputs (e.g., pricing, risk scoring, final compliance).
The more tasks your AI undertakes, the more crucial it becomes to define crisp evaluation points throughout the pipeline. This is particularly true if some agent operations reside outside your direct control, like partner APIs or third-party databases.
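A sketch of how that growing surface area might be organized: a registry mapping each node type in the workflow to its own checks. Every name and rule below is illustrative.

```python
from datetime import datetime, timedelta, timezone

def tool_call_is_allowed(node: dict) -> bool:
    """Background-context check: did the agent call a tool it is actually allowed to use?"""
    return node["tool"] in node["allowed_tools"]

def data_is_fresh(node: dict, max_age_hours: int = 24) -> bool:
    """Background-context check: is the referenced data recent enough?"""
    age = datetime.now(timezone.utc) - node["data_timestamp"]
    return age <= timedelta(hours=max_age_hours)

def reasoning_cites_inputs(node: dict) -> bool:
    """Crude chain-of-thought audit: does the stated reasoning reference the inputs it was given?"""
    return any(inp in node["reasoning"] for inp in node["inputs"])

CHECKS_BY_NODE_TYPE = {
    "tool_call": [tool_call_is_allowed, data_is_fresh],
    "reasoning": [reasoning_cites_inputs],
}

node = {
    "type": "tool_call",
    "tool": "rate_lookup",
    "allowed_tools": ["rate_lookup", "tracking_api"],
    "data_timestamp": datetime.now(timezone.utc) - timedelta(hours=2),
}
print([check(node) for check in CHECKS_BY_NODE_TYPE[node["type"]]])  # [True, True]
```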
Evals as a Core Product Experience
Today, evals are mostly hidden behind engineering dashboards or developer tools. This status quo won’t last. As AI moves closer to the critical path in business operations, evaluations will become part of the user-facing product experience:
Real-Time Confidence Displays
A purchasing manager sees a “confidence meter” for each AI-generated vendor contract. If it’s high, she can sign off quickly; if it’s low, further review is triggered.
Transparent Processes
Some power users want to see the chain-of-thought or at least an abstracted version of it. The idea: pinpoint errors or confirm the model is referencing correct sources.
This transparency fosters trust and engagement, turning business stakeholders into active participants in refining AI outputs.
User Feedback Integration
If a manager marks an output as “bad” or “non-compliant,” that label is fed back into the model’s improvement loop (potentially for fine-tuning or distillation).
Putting It All Together: Control the Standard, Control the Vertical
For vertical AI companies, the value creation may simply be in your evaluations and the frameworks that enforce specialized rules and processes for the industry. By shaping how performance is measured—and connecting that measurement to token usage, compliance, and company-specific tasks—you become the de facto standard-setter in your vertical. I also want to point to this excellent piece by Hamel.4
Build Industry-Level Evals
Set benchmarks for the core tasks, compliance rules, and best practices in your sector.
Encourage adoption by making these evals publicly recognized as the “gold standard.”
Enable Company-Specific Extensions
Offer custom evaluation metrics so clients can encode their distinct workflows.
This premium customization layer ensures clients don’t feel forced into a “one-size-fits-all” approach—thus preventing commoditization of their brand and processes.
Leverage Eval-Driven Token Utilization
Align your pricing or performance metrics with the tasks that pass eval checks. If a shipping quote or legal draft passes all relevant validations, it’s “billable” token usage—reinforcing a clear link between tokens and actual value.5
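A minimal metering sketch under that model (the field names and the pass criterion are assumptions): tokens accrue to the billable count only when the associated output clears its eval checks.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    task: str
    tokens_used: int
    passed_evals: bool

def billable_tokens(runs: list) -> dict:
    """Eval-driven token utilization: only runs that pass validation count toward billing."""
    billable = sum(r.tokens_used for r in runs if r.passed_evals)
    total = sum(r.tokens_used for r in runs)
    return {"billable_tokens": billable, "total_tokens": total,
            "yield": round(billable / total, 2) if total else 0.0}

runs = [
    TaskRun("shipping_quote", 12_400, True),
    TaskRun("legal_draft",    31_000, False),  # failed a compliance check: not billable
    TaskRun("status_update",   2_600, True),
]
print(billable_tokens(runs))
# {'billable_tokens': 15000, 'total_tokens': 46000, 'yield': 0.33}
```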
Iterate Fast with Layered Evaluations
Start with simple unit tests and binary pass/fail checks; ramp up to more granular or A/B tests as the system matures (a sketch of the simplest layer follows this list).
Collect real-world feedback to refine both the model and the eval suite continuously.
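A sketch of that simplest layer, written as plain unit-test-style checks runnable with pytest or as bare asserts; the generated quote and its policy assertions are placeholders for whatever the vertical actually requires. More granular scoring and A/B comparisons can be layered on once these pass reliably.

```python
# Minimal pass/fail layer: runnable with pytest, or directly as plain asserts.
def generate_quote(origin: str, destination: str) -> dict:
    """Stand-in for the real agent call."""
    return {"origin": origin, "destination": destination,
            "price_usd": 1850.0, "terms": "Net 30", "disclaimer": "Rates valid for 7 days."}

def test_quote_has_required_fields():
    quote = generate_quote("SFO", "JFK")
    assert {"price_usd", "terms", "disclaimer"} <= quote.keys()

def test_quote_terms_match_policy():
    quote = generate_quote("SFO", "JFK")
    assert quote["terms"] == "Net 30"           # company payment-terms policy

def test_quote_price_within_bounds():
    quote = generate_quote("SFO", "JFK")
    assert 100 <= quote["price_usd"] <= 25_000  # sanity bounds, not a pricing model

if __name__ == "__main__":
    for test in (test_quote_has_required_fields, test_quote_terms_match_policy, test_quote_price_within_bounds):
        test()
    print("all binary checks passed")
```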
Maintain Strategic Control
By defining how “correctness” or “compliance” is measured, you have a defensible moat. Competitors must either adopt your eval or build their own from scratch, which is a heavy lift—especially if your eval is already an industry standard.
Evals are more than just developer tools or engineering curiosities. In a world where AI is increasingly central to how businesses operate, evaluations become the nexus between trust, adoption, and strategic advantage. By offering robust industry-level evals while enabling company-specific customization, you empower clients to preserve their distinctive edge—turning your eval framework into an indispensable pillar of the entire vertical.
Ultimately, whoever controls the evals controls the industry. By setting (and continually evolving) the rules of success, you guide how AI interacts with real-world workflows—ensuring not only that your customers can trust their models, but that you become the go-to authority for what “quality” really means in your domain.
1. I’m unconvinced that AI will ever perfectly pass the evals that matter.
2. I think this is a really key problem. Vertical AI companies often still assume they are selling a feature-complete product. They’re not. They’re selling an iterative system that lets a business situated in a specific industry enable AI at its core. Why pretend otherwise?
3. I don’t necessarily love this becoming a reality, but I think it’s plausible that evals do indeed become the way in which industry compliance for agents is borne out.
4. Some of these ideas I first encountered there; others are extensions.
5. This is, I think, the true way to build consumption-based pricing that doesn’t risk commoditization.