“We envision a future where human procurement professionals work seamlessly alongside AI agents for enhanced efficiency.”
— Aatish Dedhia
(Founder and CEO – Zycus)
Introduction: From Analytics Fatigue to Data Fluency
In the modern enterprise, access to data is not the problem—it’s the navigation, interpretation, and actionability that pose the real challenge. This is especially true in procurement, where insights are often buried across disparate systems: purchase orders reside in one module, invoices in another, and supplier performance data lives in fragile spreadsheets or disconnected dashboards. The very act of asking a strategic question—“Why did tail spend spike in APAC?” or “Are my negotiated savings materializing?”—often triggers a multi-step odyssey through reports, exports, and tribal knowledge.
Zycus has long offered a mature and comprehensive Source-to-Pay (S2P) suite. Yet despite the platform’s depth, I repeatedly observed the same friction: high-value procurement questions required not just data access, but data interpretation across silos—demanding analytical acumen, manual effort, and, at times, an almost heroic persistence from users.
As the product leader for Zycus’ Data Platform, I felt this gap wasn’t due to missing features—it was due to cognitive friction. And solving it meant rethinking the way users interact with data entirely.
That realization led to Project SparQ, which ultimately became Merlin Insights: a multi-agent GenAI-powered conversational analytics platform, designed to turn natural language into deeply contextual, diagnostic, predictive, and prescriptive procurement intelligence. More than a querying layer, Merlin Insights was built to function as an intelligent collaborator—grounded in enterprise semantics, capable of understanding nuance, and extensible across every corner of the S2P stack.
This is not a marketing story. It’s a data product story—about the semantics, orchestration, architecture, governance, and intelligent systems thinking required to turn fragmented data into real-time decision intelligence.
It’s about building a system that doesn’t just answer questions—but understands why you’re asking them.
Mapping the Terrain: The Fragmented Data Reality of S2P
Before we could build any intelligent layer atop Zycus’ Source-to-Pay suite, we had to confront a hard reality: procurement data is inherently fragmented. The S2P journey spans multiple modular systems—Spend Analytics, Supplier Management, Contract Lifecycle Management (CLM), eProcurement, Invoice Management—each architected as independently configurable products. While this modularity provides flexibility to clients, it also introduces severe challenges when attempting to extract a unified analytical perspective.
At Zycus, every client’s implementation involved a unique configuration of custom fields, lookup codes, approval flows, and localized business rules. Over time, these instances diverged further as organizations added layers of integration, created new derived fields, and overrode standard metric definitions. As a result, a metric like “contract compliance” could be defined differently between two regions in the same company—depending on how contract metadata and invoice line-items were linked in the back end. There was no universal schema, no guaranteed field fidelity, and no standard naming convention. Even the meaning of basic dimensions like “supplier,” “category,” or “business unit” varied in representation and granularity.
The implication for Merlin Insights was profound: any system attempting to support natural language queries would immediately be constrained by this lack of semantic standardization. If we couldn’t resolve what a user meant by “tail spend” or “supplier cycle time” across environments, we couldn’t answer even the simplest questions accurately. We needed to build a layer that abstracted away the messiness of physical schema and provided a canonical interface for metric resolution and entity relationships. That meant grounding every part of the system in semantically aligned metadata that was decoupled from underlying implementation artifacts. Only then could we start building something intelligent on top of it.
This realization shaped nearly every architectural decision that followed. Merlin Insights had to be aware not only of what data was, but what it meant—in every environment it served.
Building the Foundations: Unified Semantic and Metric Catalogs
The architecture of Merlin Insights demanded that we decouple user language from physical data. To interpret natural language accurately across different clients, geographies, and modules, we needed to establish a canonical model—a common “source of truth” for how procurement data entities and metrics were defined, related, and computed. This required the creation of two tightly integrated systems: the Semantic Data Catalog and the Metric Catalog. These weren’t merely metadata registries—they were active, governed, and operational components of the Merlin platform.
The Semantic Data Catalog: Normalizing Entities Across Modules
The Semantic Data Catalog served as the backbone of entity resolution in Merlin Insights. Its job was to abstract over physical schemas and define logical entities in a way that was consistent, composable, and implementation-agnostic. For instance, while the Spend module might represent a supplier with fields like SUPP_ID, SUPP_NAME, and TAX_ID, the Supplier Management module could use an entirely different set of fields like VENDOR_ID, NAME, and REGISTRATION_NO. The semantic catalog resolved these into a unified “Supplier” entity, with a canonical definition and normalized attribute set.
Each entity in the catalog included metadata such as field types, permissible values, data validation rules, join keys, and reference lineage. Relationships between entities—such as how a Supplier connects to a Purchase Order, or how a Contract relates to Invoices—were modeled as a directed graph using technologies like Neo4j. This graph structure allowed the platform to traverse connections dynamically and resolve user queries like “Which suppliers had delayed deliveries for off-contract items in Q2?”
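To make the traversal concrete, here is a minimal sketch of join-path resolution over such a graph, using networkx in place of Neo4j; the entity names and join keys are illustrative rather than the actual catalog schema.

```python
# Sketch of join-path resolution over the semantic entity graph.
# Illustrative only: entity names and join keys are assumptions, and the
# production catalog used Neo4j rather than an in-memory networkx graph.
import networkx as nx

semantic_graph = nx.DiGraph()
semantic_graph.add_edge("Supplier", "PurchaseOrder", join_key="supplier_id")
semantic_graph.add_edge("PurchaseOrder", "Invoice", join_key="po_id")
semantic_graph.add_edge("Contract", "PurchaseOrder", join_key="contract_id")

def resolve_join_path(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return the (from_entity, to_entity, join_key) hops linking two entities."""
    path = nx.shortest_path(semantic_graph, source, target)
    return [
        (a, b, semantic_graph.edges[a, b]["join_key"])
        for a, b in zip(path, path[1:])
    ]

# e.g., answering "delayed deliveries for off-contract items" requires
# joining Supplier data through PurchaseOrder down to Invoice lines:
print(resolve_join_path("Supplier", "Invoice"))
# [('Supplier', 'PurchaseOrder', 'supplier_id'), ('PurchaseOrder', 'Invoice', 'po_id')]
```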
We also implemented a layered access model within the catalog. While core entities like “Invoice” or “Contract” were shared across tenants, client-specific customizations could extend the base schema using override layers. This ensured that each client’s implementation fidelity was preserved while retaining semantic consistency in the language model’s view of the data.
The Metric Catalog: Governing Analytical Meaning at Scale
If the Semantic Catalog grounded Merlin Insights in entity structure, the Metric Catalog defined analytical intent. Every meaningful insight—“invoice cycle time,” “tail spend leakage,” “supplier scorecard delta”—had to be tied to a precisely defined metric. We discovered early that in procurement, ambiguity in metric meaning was one of the biggest blockers to automation. The same term, “savings,” could mean negotiated savings, realized savings, opportunity savings, or something entirely bespoke to a client’s policy.
To address this, we built the Metric Catalog as a governed, versioned repository that explicitly defined every KPI supported by the platform. Each metric included a unique identifier, canonical name, business description, SQL and Spark implementation templates, allowed dimensions, and applicable filters. In addition, we stored synonyms, regional phrasing, and contextual cues to power NLP mappings—so that phrases like “cost avoidance” or “negotiated benefit” could correctly route to savings_negotiated_v3.
A critical feature of the catalog was scope tagging. This defined where a metric was valid (e.g., invoice-level vs. PO-line-level), what entity context it applied to, and what roles could access it. This allowed us to prevent illogical queries—such as requesting “contract utilization rate” in a context where no contracts were referenced.
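As a rough sketch, a single governed catalog entry, with its scope tags, synonyms, and role bindings, might look like the following; the field names and the shape of savings_negotiated_v3 are illustrative assumptions, not the actual Zycus schema.

```python
# Illustrative sketch of a governed Metric Catalog entry plus a scope check.
# Field names and values are assumptions, not the actual Zycus schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    metric_id: str                      # unique, versioned identifier
    canonical_name: str
    description: str
    sql_template: str                   # parameterized implementation template
    allowed_dimensions: frozenset[str]
    synonyms: frozenset[str]            # phrases the NLP layer can map here
    valid_scopes: frozenset[str]        # entity contexts where the metric is legal
    allowed_roles: frozenset[str]

negotiated_savings = MetricDefinition(
    metric_id="savings_negotiated_v3",
    canonical_name="Negotiated Savings",
    description="Delta between baseline and negotiated unit price, extended by volume.",
    sql_template="SELECT SUM((baseline_price - negotiated_price) * qty) FROM {fact_table} WHERE {filters}",
    allowed_dimensions=frozenset({"supplier", "category", "region", "quarter"}),
    synonyms=frozenset({"cost avoidance", "negotiated benefit"}),
    valid_scopes=frozenset({"po_line", "contract"}),
    allowed_roles=frozenset({"analyst", "category_manager", "cpo"}),
)

def is_query_valid(metric: MetricDefinition, scope: str, role: str) -> bool:
    """Reject illogical queries, e.g. contract metrics in a context with no contracts."""
    return scope in metric.valid_scopes and role in metric.allowed_roles
```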
To ensure data integrity, all metric definitions were executed through a testing harness that compared computed results against reference tables across clients. Metrics were versioned, with backward compatibility maintained for legacy queries. Implementation teams could extend or override base metrics through a GUI, but all changes required approval via a governed workflow that preserved auditability and data lineage.
This catalog became the operational backbone of Merlin’s intelligence layer. Without it, there would be no explainability, no traceability—and no trust.
Leveraging Generative AI to Accelerate Semantic & Metric Catalog Construction
Building a semantic and metric catalog for a domain as complex as procurement is a monumental effort. It requires aligning business intent, data structures, and analytical semantics across diverse stakeholders—product managers, implementation teams, data engineers, and procurement SMEs. To accelerate this process and reduce manual burden, we introduced Generative AI into the loop—not just as a user-facing assistant, but as an internal tool that augmented catalog construction.
The first use case was automated metric synthesis. We trained a domain-specific large language model (LLM), fine-tuned on procurement-specific documentation, client onboarding playbooks, and analytical formulae, to assist product analysts in drafting first-pass metric definitions. Given a prompt like “Create a metric for supplier fill rate based on GRN and PO delivery dates,” the model could output a semantic description, formula draft, associated joins, and even recommend applicable dimensions. Analysts could then validate, tweak, and promote these definitions into the Metric Catalog via a semi-automated workflow.
We also used LLMs for schema understanding and synonym extraction. When onboarding a new client, we parsed raw database schemas and field mappings using a transformer-based parser that suggested canonical entity mappings. For example, it could infer that VEND_CD likely referred to a Supplier Code, and ORD_DEL_DATE to a Goods Receipt timestamp. These suggestions were then cross-validated against historical mappings in similar client environments, allowing us to semi-automate the semantic layer configuration.
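The transformer-based parser itself is beyond a short example, but the cross-validation step against historical mappings can be approximated with simple string similarity. A toy sketch, with hypothetical field names:

```python
# Toy approximation of cross-validating suggested field mappings against
# historical mappings from similar client environments. The real pipeline
# used a transformer-based parser; names below are hypothetical.
import difflib

# Field-name -> canonical attribute pairs seen in previously onboarded clients.
HISTORICAL_MAPPINGS = {
    "VENDOR_ID": "supplier.code",
    "SUPP_ID": "supplier.code",
    "VEND_CD": "supplier.code",
    "ORD_DEL_DATE": "goods_receipt.timestamp",
    "GRN_DT": "goods_receipt.timestamp",
    "INV_AMT": "invoice.amount",
}

def suggest_mapping(raw_field: str, cutoff: float = 0.6) -> str | None:
    """Suggest a canonical attribute for a raw schema field, or None if unsure."""
    matches = difflib.get_close_matches(raw_field, HISTORICAL_MAPPINGS, n=1, cutoff=cutoff)
    return HISTORICAL_MAPPINGS[matches[0]] if matches else None

print(suggest_mapping("VEND_CODE"))   # -> 'supplier.code' (closest to VEND_CD)
print(suggest_mapping("MYSTERY_FLD")) # -> None: route to a human analyst
```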
Another AI-powered feature was metric alias discovery. By running retrieval-augmented generation (RAG) on historic queries, documentation, and support tickets, we built a rich vocabulary-to-metric mapping graph that significantly improved NLP parsing accuracy. This technique enabled Merlin Insights to interpret diverse user phrases without relying on rigid keyword matching.
The result was not just speed, but consistency and scalability—two of the hardest challenges in building foundational metadata systems across multi-tenant enterprise environments.
Integrating the Catalogs with Orchestration and NLP Architecture
With the Semantic Data Catalog and Metric Catalog established as foundational layers, the next step was to operationalize them within Merlin’s multi-agent orchestration and NLP pipelines. The goal was simple in theory but complex in practice: every user query—regardless of phrasing, level of detail, or domain specificity—needed to resolve unambiguously to a set of structured intents, metrics, filters, and entities, all backed by governed definitions. Achieving this required tightly coupling the catalog systems with both the natural language understanding (NLU) stack and the execution orchestration layer.
At the heart of this integration was a semantic grounding layer, which intercepted parsed user input and enriched it using metadata from the catalogs. Once a user submitted a query like “Show supplier-wise tail spend in Q1 for indirect categories,” the system tokenized and parsed the input through our custom-trained NLU pipeline. This produced an initial structured form—comprising intent type, extracted entities, candidate metrics, and modifiers. However, the crucial step was catalog resolution.
Each extracted token was passed through a resolver that queried both the Semantic and Metric Catalogs. For instance, the term “tail spend” was resolved to spend_tail_leakage_v2, scoped to the Supplier entity, with time slicing logic based on client-specific fiscal calendars. Ambiguities were flagged based on catalog confidence thresholds, prompting clarifying questions when needed.
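A simplified sketch of that resolution step follows; the candidate scoring and the threshold value are illustrative assumptions.

```python
# Simplified sketch of catalog resolution with a confidence threshold.
# Candidate scoring and the 0.75 threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Resolution:
    token: str
    metric_id: str | None
    confidence: float
    clarification: str | None = None

def resolve_token(token: str, candidates: dict[str, float],
                  threshold: float = 0.75) -> Resolution:
    """Map a parsed token to a catalog metric, or flag it for clarification."""
    if not candidates:
        return Resolution(token, None, 0.0, f"I don't recognize '{token}'.")
    best_id, best_score = max(candidates.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        options = ", ".join(sorted(candidates))
        return Resolution(token, None, best_score,
                          f"By '{token}', do you mean one of: {options}?")
    return Resolution(token, best_id, best_score)

# Candidate scores would come from synonym tables and usage statistics.
print(resolve_token("tail spend", {"spend_tail_leakage_v2": 0.92}))
print(resolve_token("savings", {"savings_negotiated_v3": 0.48,
                                "savings_realized_v1": 0.45}))
```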
The orchestration engine, built as a chain-of-agents model, then consumed the resolved payload to construct execution plans. This included determining whether the query could be answered via cached aggregates, required federated joins across modules (e.g., linking CLM and eProc), or needed predictive modeling.
By deeply embedding catalog awareness into both NLP and orchestration logic, we created a system that was not only intelligent but also contextually precise, explainable, and reusable. Without this integration, conversational interfaces in enterprise systems would remain gimmicks—unable to deliver reliable, decision-grade outputs.
Designing the Agentic Orchestration Layer for Query Execution
At the heart of Merlin Insights’ intelligence lay its agentic architecture—a design choice that allowed us to modularize complexity, isolate concerns, and parallelize execution paths for scalability and explainability. This section unpacks how we engineered that layer to turn user questions into structured, executable, and context-aware data pipelines.
Merlin Insights didn’t rely on a single monolithic reasoning engine. Instead, it orchestrated over 50 specialized agents, grouped into eight major agent families, each designed to handle a particular slice of the query lifecycle—from retrieval and reasoning to narrative generation and workflow linking. This modularity allowed us to fail gracefully, scale horizontally, and adapt agent behavior independently based on query type, user persona, or dataset topology.
From Natural Language to Canonical Query Graph
Every user utterance entered Merlin Insights as freeform natural language—ambiguous, often unstructured, and context-dependent. The first task of the orchestration layer was to convert this into a deterministic, canonical query graph.
Using the Semantic Catalog, the utterance was parsed into structured representations: intent_type, entity_scope, metric_id, temporal_boundary, and any conditional constraints. For example, a prompt like “Show me suppliers with delayed deliveries last quarter” would be resolved as:
- intent_type: diagnostic
- entity_scope: supplier, region
- metric: on_time_delivery_rate
- time_period: Q2
- condition: threshold breach detection
This canonical form became the intermediate representation passed between agents. It ensured consistency, traceability, and allowed individual agents to focus on their micro-task without needing to re-interpret the user’s intent.
Importantly, the graph could evolve mid-session. If the user followed up with “What about for only IT suppliers in EMEA?”, the context update triggered a new orchestration path, using session memory and delta re-resolution logic. We found this dynamic query graph crucial for supporting multi-turn, exploratory analysis—something traditional BI tools simply can’t handle.
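One way to picture delta re-resolution: the follow-up overrides only the fields it mentions and inherits everything else from session memory. A sketch, with illustrative field names:

```python
# Sketch of delta re-resolution: a follow-up turn overrides only the parts
# of the canonical query graph it mentions. Field names are illustrative.
base_query = {
    "intent_type": "diagnostic",
    "entity_scope": ["supplier", "region"],
    "metric": "on_time_delivery_rate",
    "time_period": "Q2",
    "condition": "threshold_breach",
}

# Parsed from the follow-up "What about for only IT suppliers in EMEA?"
follow_up_delta = {
    "filters": {"supplier_category": "IT", "region": "EMEA"},
}

def apply_delta(base: dict, delta: dict) -> dict:
    """Merge a follow-up delta into the prior canonical form (session memory)."""
    merged = {**base, **{k: v for k, v in delta.items() if k != "filters"}}
    merged["filters"] = {**base.get("filters", {}), **delta.get("filters", {})}
    return merged

print(apply_delta(base_query, follow_up_delta))
```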
The Retrieval Agent Group: Intelligent Data Access
The Retrieval Agents were responsible for accessing and shaping data across Zycus’ federated modules: Spend, Supplier Management, Contracts, Sourcing, and eProc. This was not trivial. Each module had distinct schemas, query interfaces, and authorization boundaries.
When invoked, the Retrieval Orchestrator identified:
- The relevant module(s) based on entity-metric mapping
- The required join paths, as encoded in the semantic graph
- Any existing query patterns cached from session history
Using this information, sub-agents generated parameterized SQL or Spark queries—optimized for engine type and latency targets. When the query spanned modules (e.g., comparing negotiated savings from Contract data with realized spend from Invoicing), the Retrieval Agents constructed cross-domain joins dynamically by traversing semantic edges (e.g., contract_id → PO → invoice → supplier).
We also employed result caching, fingerprinting each canonical query graph and storing computed results in a Redis layer. This reduced latency for repeated or overlapping queries by over 70% and was especially useful for multi-user shared environments.
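A minimal sketch of that fingerprint-and-cache pattern, assuming a Redis layer and JSON-serializable results; the TTL and key format are illustrative.

```python
# Sketch of result caching keyed by a fingerprint of the canonical query
# graph. TTL, key prefix, and serialization choices are assumptions.
import hashlib
import json

import redis  # pip install redis

cache = redis.Redis(host="localhost", port=6379)

def fingerprint(query_graph: dict) -> str:
    """Stable hash: identical canonical graphs hit the same cache entry."""
    canonical = json.dumps(query_graph, sort_keys=True)
    return "merlin:result:" + hashlib.sha256(canonical.encode()).hexdigest()

def fetch_with_cache(query_graph: dict, compute, ttl_seconds: int = 900):
    key = fingerprint(query_graph)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: skip recomputation
    result = compute(query_graph)          # cache miss: run retrieval agents
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result
```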
The Reasoning Agent Group: Diagnosing, Predicting, Prescribing
Merlin Insights was never built to just return charts. At its core was a reasoning engine that could derive conclusions, infer causality, and simulate outcomes.
The Reasoning Agent Group operated as an analytical brain—triggered only after data had been retrieved, cleaned, and scoped. It comprised multiple micro-agents:
- Diagnostic Agents: These computed deltas, variances, and anomalies over time or entity segments. For example, they could break down increased cycle time by supplier region, approval step, or contract renewal status.
- Causal Inference Agents: These leveraged conditional logic and pre-trained heuristics to correlate patterns—e.g., a drop in supplier fill rate might be linked to contract expiry, or an uptick in spot buys could correlate with catalog misclassification.
- Predictive Agents: These wrapped LSTM models, regression pipelines, or heuristic rulesets for time series extrapolation. They were used for forecasting tail spend leakage or predicting compliance risk scores.
- Prescriptive Agents: These were perhaps the most complex—offering next-best-action recommendations based on rulebooks, historical resolution patterns, or reinforcement-trained policy models.
Each reasoning task was explainable. Agents attached metadata: confidence intervals, rule paths used, model types triggered. This allowed the narrative layer to later explain “how” the system arrived at a given insight—a critical requirement in enterprise environments where trust is paramount.
The Narrative Agent Group: Humanizing Insight Delivery
Once structured insights were generated, Merlin Insights needed to communicate them in a way that was clear, contextual, and confidence-inspiring—whether the user was a data-savvy analyst or a time-starved CPO.
The Narrative Agents played a dual role:
- Content Shaping: Translating raw numbers and distributions into coherent language—e.g., “Cycle time for IT suppliers increased by 27% over the last quarter, primarily due to delayed approvals in Germany and Poland.”
- Adaptive Formatting: Depending on user persona and channel (desktop, mobile, email digest), the narrative could shift tone, verbosity, and format. Executives got synthesized cards with CTA buttons, while analysts received tabular exports with drill-down capability.
In addition to narrative text, these agents invoked visualization APIs to embed the right graph types (trend lines, tree maps, waterfall charts), ensuring the visual form matched the data story being told.
They also played a role in transparency: when users hovered over a stat, the narrative agent could explain its lineage—e.g., “This is based on the metric contract_utilization_rate_v2, scoped to Business Unit X, excluding expired contracts.”
Agent Groups in Action: A Unified Orchestration Flow
All 50+ agents in Merlin Insights were orchestrated through an execution controller that handled the choreography based on query topology. The eight major agent groups included:
- Intent Resolution Agents
- Entity & Metric Parsers
- Retrieval Agents
- Reasoning Agents
- Narrative Generators
- Query Optimizers & Caching Agents
- Workflow & CTA Binding Agents
- Telemetry & Feedback Agents
Each group had its own retry policies, performance monitors, and fallback logic. This made Merlin Insights resilient to partial failures (e.g., when a predictive model timed out, we still returned the diagnostic result with a note).
This agentic backbone gave us modularity, traceability, and extensibility. When a new type of agent was needed (say, for ESG scoring or sustainability index analysis), we could insert it into the pipeline without disrupting existing workflows.
Grounding Natural Language in Procurement Semantics
Building an intelligent system for procurement analytics meant accepting a simple truth: off-the-shelf NLP models were insufficient. Procurement users do not speak in textbook grammar or general-purpose vocabulary. They say things like “tail spike last quarter in IT”, “non-compliant PO leakage,” or “cycle time gap across vendors.” Parsing such queries required us to go beyond generic language understanding and create an NLP stack grounded in the semantics of procurement—from domain-specific entities to metric-aware disambiguation.
Our approach began with intent recognition, a supervised model trained on thousands of annotated procurement questions. We categorized intents not just into basic types like descriptive or diagnostic, but into fine-grained classes such as comparative contract analysis, supplier scorecard lookup, or off-catalog trend detection. Unlike traditional classifiers that rely on static keyword patterns, our model fused syntactic parsing with domain embeddings—embedding procurement-specific terms like “maverick spend,” “MOQ breach,” or “SLAs missed” into a vector space trained using transformer-based encoders and fine-tuned on Zycus-specific query logs.
Once the system recognized user intent, the next step was entity and metric disambiguation. For this, we embedded a dual-layer NER (Named Entity Recognition) pipeline. The first layer handled syntactic entities (e.g., dates, regions, supplier names), while the second, domain-specific layer used the Semantic and Metric Catalogs as grounding sources. For example, if a user mentioned “delivery rate,” the system resolved this against known catalog metrics—evaluating both likelihood (based on usage frequency) and scope compatibility (whether it was applicable to the Supplier or PO entity in context).
Critically, we implemented context retention across sessions. If a user followed up with “Compare that to marketing vendors,” the system preserved previous context—tail spend, IT, Q1—while shifting the segment. This was achieved through a session state memory graph that evolved with each interaction.
The result was an NLP stack that not only understood language—but spoke procurement. That’s what made Merlin Insights feel intuitive, yet trustworthy, to its users.
Engineering for Context – Memory, Ambiguity, and Multi-turn Dialogue
Enabling true conversational analytics means going far beyond single-shot queries. Real users don’t ask perfect questions in isolation—they think aloud, change direction, build on previous responses, and explore by iteration. That’s why memory, context-awareness, and disambiguation became central architectural concerns in Merlin Insights. We designed our dialogue system not just to interpret queries, but to participate in multi-turn analytical conversations, preserving the logic and flow of human reasoning.
Dialogue Memory Graph: Modelling Analytical Conversations Over Time
To persist context across a conversation, we introduced a Dialogue Memory Graph, a dynamic structure that tracked user intents, resolved entities, applied filters, and system responses over time. Rather than storing raw text or isolated sessions, this graph encoded semantic continuity—such as which metric the user last explored, what filters were active, and which scope (supplier, category, time period) was in effect.
Each node in the graph represented a query turn, enriched with structured metadata like query type, target entities, resolved metrics, and output modality (table, chart, narrative). Edges between nodes captured dependencies—such as clarifications, comparisons, or follow-ups. This structure allowed the system to answer queries like “Drill down into that by region” or “Now show this trend over the past 3 years” with full awareness of prior references.
The memory graph was session-aware but persisted beyond a single session if the user opted to save their conversation. This enabled long-term use cases such as tracking supplier performance across quarters or conducting iterative contract analysis. From a technical perspective, we used an in-memory graph model (via Neo4j) with snapshots serialized to a user-specific document store, enabling both real-time traversal and asynchronous resumption.
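A minimal sketch of the structure, with turns as nodes and dependencies as edges, using networkx in place of the production Neo4j store; the attributes shown are illustrative.

```python
# Minimal sketch of the Dialogue Memory Graph: nodes are query turns with
# structured metadata, edges capture follow-up/clarification dependencies.
# The production system persisted this in Neo4j; this is illustrative.
import networkx as nx

memory = nx.DiGraph()

memory.add_node(1, intent="diagnostic", metric="spend_tail_leakage_v2",
                scope={"category": "IT"}, time_period="Q1", output="chart")
memory.add_node(2, intent="comparative", output="table")
memory.add_edge(1, 2, relation="follow_up")  # "Compare that to marketing vendors"

def effective_context(turn: int) -> dict:
    """Walk back through predecessors, inheriting any context not overridden."""
    context: dict = {}
    chain = list(nx.ancestors(memory, turn)) + [turn]
    for node in sorted(chain):  # oldest turn first, newest overrides
        context.update(memory.nodes[node])
    return context

# Turn 2 inherits metric, scope, and time period from turn 1:
print(effective_context(2))
```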
This memory framework made Merlin Insights conversationally robust—capable of handling incomplete, referential, or evolving queries in a way that mirrored how real analysts work through data problems.
Disambiguation Engine: Clarifying Intent Without Losing Flow
Natural language is ambiguous by nature—especially in procurement, where terms are overloaded, incomplete, or entirely context-dependent. When a user types “Show spend increase last month,” does “spend” refer to approved POs, paid invoices, or goods received? Should “last month” use calendar or fiscal periods? What if the user has multiple business units in scope? To address these questions, Merlin Insights embedded an active Disambiguation Engine as part of its conversational loop.
Rather than assuming or ignoring ambiguity, the system generated clarification prompts dynamically—driven by metadata retrieved from the semantic and metric catalogs. For instance, if a term like “savings” mapped to multiple candidate metrics with no default fallback, the system would return a follow-up question: “Do you mean negotiated savings, realized savings, or opportunity savings?” This logic was informed by a confidence scoring algorithm based on historical query usage, user profile roles, and catalog annotations.
The disambiguation engine operated in two modes: implicit and explicit. In implicit mode, the system used defaults based on user history or system settings—ideal for power users who preferred speed over verbosity. In explicit mode, typically for new users or high-risk queries, Merlin Insights surfaced clarification options in an interactive card format. This approach ensured interpretability while minimizing friction.
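A rough sketch of how the two modes might branch; the user-history defaults, role checks, and card payload are illustrative assumptions.

```python
# Sketch of the two disambiguation modes. User-history defaults, power-user
# checks, and the card payload shape are illustrative assumptions.
def disambiguate(term: str, candidates: list[str], user_history: dict[str, str],
                 is_power_user: bool, high_risk: bool) -> dict:
    """Pick a default silently (implicit) or surface an interactive card (explicit)."""
    if len(candidates) == 1:
        return {"mode": "resolved", "metric": candidates[0]}
    default = user_history.get(term)  # what this user usually means by the term
    if default in candidates and is_power_user and not high_risk:
        # Implicit mode: proceed with the habitual choice, note it in the output.
        return {"mode": "implicit", "metric": default,
                "note": f"Interpreted '{term}' as {default}; tap to change."}
    # Explicit mode: pause and ask, then stitch the answer back into the query.
    return {"mode": "explicit",
            "clarification_card": {"question": f"Which '{term}' do you mean?",
                                   "options": candidates}}

print(disambiguate("savings",
                   ["savings_negotiated_v3", "savings_realized_v1"],
                   user_history={"savings": "savings_realized_v1"},
                   is_power_user=True, high_risk=False))
```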
Crucially, disambiguation didn’t break the conversational flow. Once the user resolved ambiguity, the engine stitched the clarification into the original query context and continued execution without forcing a restart. We also tracked disambiguation metrics—such as resolution rates, fallback frequency, and re-query rates—to continuously improve system prompts and synonym coverage.
This module helped us navigate one of the hardest challenges in language interfaces: preserving trust in the face of uncertainty, without slowing users down.
User Testing and Feedback Loops from Early Adopters
No architecture, however well designed, can predict how real users will behave in production. That’s why from the earliest MVP stages, we embedded a user-centric, iterative testing strategy. Merlin Insights wasn’t built for procurement users in isolation—it was co-created with them. We partnered with early adopters across industry verticals—global manufacturing, pharmaceuticals, retail, and financial services—to pilot, stress-test, and shape the product in real-world conditions. These pilots served as crucibles for validating everything from system comprehension to output quality, query satisfaction, and decision readiness.
Observation-based Testing with Real Procurement Personas
Our user research approach combined structured interviews, guided tasks, and passive observation. We recruited procurement roles across levels—category managers, sourcing analysts, procurement operations leads, and CPO staff—to simulate real-world analytical workflows using Merlin Insights. Rather than abstract usability tests, we anchored each session around intent-driven tasks—“Identify tail spend by region,” “Diagnose invoice delays in a specific category,” or “Compare contract compliance year-over-year.”
We captured not just task completion time or click paths, but also mental models—how users phrased queries, what assumptions they made about system behavior, and how they interpreted answers. One key insight was that users tended to chain queries naturally, often expecting previous context to persist. For instance, after asking “Which suppliers had late deliveries last quarter?”, they might follow up with “Which of those were also off-contract?” This behavior reinforced our need for memory graphs and contextual continuity.
We also learned that power users preferred terse phrasing with layered semantics, while occasional users relied on longer, more verbose prompts. By clustering these behaviors, we tuned the NLP engine to identify user segments and adjust prompting strategies dynamically.
These sessions revealed friction points we hadn’t anticipated—like users expecting real-time drilldowns on charts or wanting to edit query parameters inline. We captured these as feedback tickets in a dedicated queue, mapped each to backlog themes, and included users in sprint demos to close the loop on their inputs.
Feedback Capture, Annotation, and Model Retraining
Capturing feedback isn’t difficult. Making it useful, structured, and impactful is where most systems fail. With Merlin Insights, we knew from day one that feedback couldn’t be treated as a passive inbox—it needed to become a closed loop driving model improvement, UX refinement, and feature prioritization. To achieve this, we embedded a tightly integrated feedback capture and annotation workflow directly into the conversational interface and backend telemetry systems.
Every response generated by Merlin Insights carried a lightweight but powerful “feedback ribbon” consisting of a thumbs-up/thumbs-down mechanism, a structured comment box, and optional tags (e.g., “irrelevant,” “confusing filter,” “missing context”). We made sure this ribbon was non-intrusive but always available. When a user flagged a result, Merlin Insights paused future prompts in that session and asked a follow-up: “Was the metric wrong, or was the interpretation unclear?” These signals were mapped against the system’s understanding of the query, helping us identify where failures occurred—NLU parsing, entity resolution, metric grounding, or orchestration.
On the backend, every feedback item was mapped to a unique query fingerprint. These fingerprints were stored alongside query metadata (user profile, modules touched, latency, metric version, catalog mappings) in a feedback warehouse. A dedicated team of analysts—trained in procurement semantics—reviewed flagged items weekly. They annotated failure types and, when needed, created training pairs for future model updates.
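As a sketch, a single warehouse record might look like the following; the schema and failure-stage labels are illustrative assumptions.

```python
# Sketch of a feedback-warehouse record keyed by query fingerprint.
# Schema and failure-stage labels are illustrative assumptions.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class FeedbackRecord:
    query_fingerprint: str      # same hash used for result caching
    rating: str                 # "up" | "down"
    tags: list[str]             # e.g. ["confusing filter", "missing context"]
    failure_stage: str | None   # annotated later: "nlu" | "entity" | "metric" | "orchestration"
    metric_version: str
    modules_touched: list[str]
    latency_ms: int

def make_record(query_graph: dict, rating: str, tags: list[str],
                metric_version: str, modules: list[str], latency_ms: int) -> FeedbackRecord:
    fp = hashlib.sha256(json.dumps(query_graph, sort_keys=True).encode()).hexdigest()
    return FeedbackRecord(fp, rating, tags, None, metric_version, modules, latency_ms)

record = make_record({"metric": "spend_tail_leakage_v2", "time_period": "Q1"},
                     rating="down", tags=["confusing filter"],
                     metric_version="v2", modules=["spend"], latency_ms=1840)
print(json.dumps(asdict(record), indent=2))  # lands in the feedback warehouse
```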
This data flowed into two core retraining pipelines. The NLU pipeline used annotated pairs to fine-tune entity extraction and intent detection models, particularly for edge cases involving ambiguous phrasing or compound clauses. The execution optimization pipeline used structured telemetry to identify low-confidence metric mappings or inefficient query plans, feeding improvements into our metric catalog and orchestration templates.
By institutionalizing feedback into the system’s learning cycle, Merlin Insights became not just reactive but self-improving. Every mistake was an opportunity to make the platform more robust, nuanced, and aligned with real user expectations.
Decision Intelligence — From Insight Discovery to Action
While the market is crowded with analytics platforms capable of descriptive and diagnostic outputs, very few cross the threshold into decision intelligence—where insights don’t just explain what happened, but recommend what to do next, under what conditions, and with what downstream impact. For Merlin Insights, this wasn’t a stretch goal—it was a core product principle. We envisioned a system that didn’t stop at showing trends, but reasoned through them, contextualized options, and accelerated decisions directly within the Source-to-Pay workflow.
From Metrics to Judgment: Building the Decision Graph Engine
The foundation of Merlin’s decision intelligence layer was the Decision Graph Engine—a modular framework that encoded procurement logic as a series of conditional, causal, and counterfactual relationships between metrics, entities, and states. Think of it as a knowledge graph, augmented with embedded heuristics and AI-generated reasoning paths.
For example, consider a query like: “Why did tail spend increase in Q2 in the Americas?” The system first diagnoses whether the increase is statistically significant, then explores contributing dimensions: new vendors added, off-catalog purchases, policy non-compliance. It then connects these observations to possible causal relationships—such as expired contracts, increased decentralized purchases, or approvals bypassed.
Each node in this graph represents a decision factor—a threshold condition, metric delta, or policy check. Edges represent logic flows: “if contract coverage drops below 70%, tail spend typically increases.” We encoded these relationships through a hybrid rule system: manually authored logic from procurement SMEs, and machine-inferred rules using association mining across historic client data.
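A minimal sketch of how one such edge might be encoded; the structure is an assumption, though the 70% contract-coverage rule comes from the example above.

```python
# Sketch of hybrid rule encoding in the Decision Graph. The edge structure
# is an assumption; the 70% rule is the example from the text above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DecisionEdge:
    source_factor: str
    target_factor: str
    condition: Callable[[dict], bool]   # evaluated against current metric state
    explanation: str                    # surfaced verbatim by the narrative layer
    origin: str                         # "sme_authored" | "mined"

contract_coverage_rule = DecisionEdge(
    source_factor="contract_coverage",
    target_factor="tail_spend_increase",
    condition=lambda state: state["contract_coverage"] < 0.70,
    explanation="If contract coverage drops below 70%, tail spend typically increases.",
    origin="sme_authored",
)

state = {"contract_coverage": 0.64, "tail_spend_delta": 0.27}
if contract_coverage_rule.condition(state):
    print(f"Causal candidate: {contract_coverage_rule.explanation}")
```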
The graph is dynamic. As user input changes—such as filtering by supplier or excluding one-time buys—the graph recalculates reasoning paths, reorders causal candidates, and updates prescriptive suggestions in real-time. This made Merlin Insights not just a diagnostic tool, but an active reasoning companion—able to justify why something occurred and what decision levers were relevant.
Prescriptive Intelligence: Recommendations Aligned to Procurement Policy
Once Merlin Insights reached a conclusion—say, identifying that 30% of off-catalog tail spend originated from two regions with expired vendor frameworks—the next challenge was prescriptive modeling: surfacing actionable steps aligned to the organization’s procurement policy, risk appetite, and process constraints.
We built a Recommendation Engine that sat atop the decision graph. This engine matched diagnosed insights to a playbook of prescriptive actions—ranging from contract renegotiation triggers, supplier review scheduling, catalog reconfiguration, to sourcing event initiation. Each recommendation wasn’t hardcoded; it was context-aware, version-controlled, and bound to policy metadata defined by the organization.
To enable this, we designed a Policy Configuration Layer where customers could define their own “procurement guardrails”—rules like “Never suggest spot-buy consolidation for critical categories,” or “Cap supplier offboarding recommendations below Tier-2 risk threshold.” These guardrails became runtime filters that constrained or modified recommendations based on the org’s governance framework.
Each recommendation was also scored using a decision utility model. This model estimated the potential value (e.g., cost savings), feasibility (data availability, time to action), and organizational impact (risk, cycle time, compliance). These scores were exposed transparently to the user, making it easier for analysts and decision-makers to evaluate not just what was suggested—but why it was ranked that way.
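A toy sketch of such a utility model; the weights, normalization cap, and scales are assumptions.

```python
# Sketch of the decision utility score: a weighted blend of estimated value,
# feasibility, and organizational impact. Weights and scales are assumptions.
def utility_score(value_usd: float, feasibility: float, impact_risk: float,
                  weights: tuple[float, float, float] = (0.5, 0.3, 0.2),
                  value_cap_usd: float = 1_000_000) -> dict:
    """Score a recommendation on a 0-1 scale with an exposed breakdown."""
    value_component = min(value_usd / value_cap_usd, 1.0)   # normalize savings
    risk_component = 1.0 - impact_risk                      # lower risk scores higher
    w_value, w_feas, w_risk = weights
    score = w_value * value_component + w_feas * feasibility + w_risk * risk_component
    # The breakdown is exposed to the user so rankings stay explainable.
    return {"score": round(score, 3),
            "breakdown": {"value": value_component,
                          "feasibility": feasibility,
                          "risk_adjusted": risk_component}}

print(utility_score(value_usd=850_000, feasibility=0.8, impact_risk=0.25))
```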
By coupling causal analytics with prescriptive logic, we turned Merlin Insights into a forward-looking decision engine. It didn’t just tell you what happened—it helped you decide what to do next, with reasoning that was explainable, auditable, and operationally embedded.
Interactive Scenario Simulation and Sensitivity Analysis [launching soon]
One of the major pitfalls in traditional analytics is the assumption that insight leads directly to action. In reality, decision-making—especially in enterprise procurement—is nonlinear, hypothesis-driven, and heavily dependent on contextual trade-offs. A category manager may ask, “What if I consolidate my supplier base?”, but before acting, they need to understand the trade-offs: Will it improve discounts? What’s the lead time risk? Are any suppliers at risk of non-compliance? This is where interactive scenario simulation became a cornerstone of Merlin’s design.
We built a Simulation Engine into the orchestration layer, capable of supporting “what-if” modeling directly within the conversation flow. Users could change metric inputs, modify supplier parameters, exclude geographies, or simulate events (like a price escalation or supplier exit). These adjustments triggered recomputation pipelines that refreshed not just the output charts, but also the underlying causal reasoning paths and prescriptive recommendations.
Technically, each scenario was modeled as a temporary query plan variant within the Decision Graph, layered on top of the base insight. The system captured all deltas—input overrides, simulated variables, modified assumptions—and routed them through a recomputation context. We introduced versioning logic to allow users to compare “base vs. scenario” outcomes across KPIs like spend delta, cycle time, risk exposure, or contract coverage. This differential analysis was visualized through mirrored charts, waterfall views, and prescriptive outcome tables.
To support sensitivity analysis, we integrated statistical stress testing models that allowed users to explore ranges instead of fixed values. For instance, increasing supplier lead time by ±10% would show probability-adjusted impacts on delivery cycle KPIs. This was built using a lightweight Monte Carlo simulation library optimized for browser-based performance, allowing real-time visualization without backend latency.
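A stripped-down sketch of the ±10% lead-time case as a small Monte Carlo run (the production library ran in-browser; this Python sketch only illustrates the computation); the distribution, KPI formula, and sample size are assumptions.

```python
# Sketch of the ±10% lead-time sensitivity analysis as a small Monte Carlo
# run. The distribution, KPI formula, and sample size are assumptions.
import random
import statistics

def simulate_cycle_time_kpi(base_lead_time_days: float, swing: float = 0.10,
                            runs: int = 10_000, sla_days: float = 12.0) -> dict:
    """Perturb supplier lead time uniformly within ±swing and track SLA breaches."""
    samples, breaches = [], 0
    for _ in range(runs):
        lead_time = base_lead_time_days * random.uniform(1 - swing, 1 + swing)
        cycle_time = lead_time + 2.5      # assume fixed internal processing days
        samples.append(cycle_time)
        breaches += cycle_time > sla_days
    return {
        "mean_cycle_time": round(statistics.mean(samples), 2),
        "p95_cycle_time": round(statistics.quantiles(samples, n=20)[-1], 2),
        "sla_breach_probability": breaches / runs,
    }

print(simulate_cycle_time_kpi(base_lead_time_days=10.0))
```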
In effect, Merlin Insights transformed into a live planning interface—one where data wasn’t just interpreted, but actively explored, manipulated, and projected forward in time.
Embedded Workflows – Turning Decisions into Action within S2P Systems [launching soon]
One of the most common frustrations with analytics tools in enterprise environments is the actionability gap. Even when users gain insight, they often must leave the analytics system, switch to an operational module, search for the relevant transaction or supplier, and manually initiate a process. This context-switching introduces latency, human error, and decision fatigue. With Merlin Insights, our north star was different: embed intelligence directly into the S2P execution fabric, enabling users to act at the point of insight.
To achieve this, we designed a bi-directional orchestration interface between Merlin’s analytical core and Zycus’ transactional modules—Procurement, Contracts, Supplier Management, Invoicing, and Sourcing. When Merlin Insights returned a diagnostic or prescriptive insight, it also embedded contextual CTAs (Call-To-Actions) alongside the output. For instance, after identifying that off-contract spend was spiking in the IT category, users could launch a contract review flow directly from the insight card, pre-filled with the relevant supplier, contract ID, and justification.
This required deep architectural integration. We built a workflow trigger service that exposed APIs for initiating processes in downstream Zycus modules. Each insight card included metadata tokens—such as entity IDs, document references, user roles, and recommended action types—that were passed securely to the workflow layer. Role-based access controls ensured users only saw actions they were authorized to perform.
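A sketch of what launching a workflow from an insight card might look like; the endpoint, payload fields, and token handling are hypothetical.

```python
# Sketch of launching a workflow from an insight card via the trigger
# service. The endpoint, token shape, and payload fields are hypothetical.
import requests

def trigger_workflow(insight_id: str, action_type: str, entity_refs: dict,
                     user_token: str) -> str:
    payload = {
        "originating_insight": insight_id,    # preserves the decision trace
        "action_type": action_type,           # e.g. "contract_review"
        "context": entity_refs,               # pre-fills the downstream form
    }
    # Hypothetical internal endpoint; RBAC is enforced server-side.
    resp = requests.post(
        "https://s2p.example.internal/api/workflows/trigger",
        json=payload,
        headers={"Authorization": f"Bearer {user_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["workflow_id"]

# e.g. trigger_workflow("insight-8842", "contract_review",
#                       {"supplier_id": "SUP-104", "contract_id": "CTR-2291"},
#                       user_token="...")
```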
We also tracked CTA telemetry. Every time a user launched a sourcing event, initiated a supplier risk evaluation, or submitted a catalog correction through Merlin Insights, the action was logged with its originating insight. This created an end-to-end decision trace—from data → insight → action → outcome—which not only improved governance but also helped train future recommendation models.
By integrating deeply with operational workflows, Merlin Insights didn’t just advise—it executed. That’s how we moved from being an analytics assistant to becoming an intelligent procurement co-pilot.
Real-World Use Cases and Business Impact
The proof of any intelligent analytics system lies not in its architecture, but in its adoption and impact. As Merlin Insights moved beyond beta into enterprise rollouts, we began to see clear signs that it wasn’t just being used—it was transforming how procurement decisions were made. What used to take days of Excel analysis, stakeholder meetings, and tribal knowledge was now accessible in minutes, with traceability, explainability, and next-best-action guidance.
Below are four representative use cases from early enterprise adopters across different industries, illustrating how Merlin Insights integrated intelligence, usability, and business context to create tangible results.
Example 1 – Tail Spend Analysis and Contract Leakage Detection at a Global Manufacturer
Prompt from User:
“Why has tail spend increased for the EMEA region in the last quarter?”
What Merlin Insights Did:
Upon receiving this prompt, Merlin Insights parsed the user’s query into a comparative diagnostic structure. It first retrieved historical tail spend KPIs using catalog-defined metrics (spend_tail_leakage_v3) and scoped it to the EMEA region over the trailing two quarters. It conducted a delta analysis and flagged a 27% increase. Next, the system used its root cause engine, scanning across dimensions like supplier segments, catalog compliance, payment methods, and cost centers.
It identified two specific issues:
- 40% of tail spend originated from non-catalog purchases in the indirect IT category.
- A large portion of spend was routed through ad-hoc suppliers added outside the approved vendor list, particularly in Germany and Poland.
System Output:
- Time-series chart of tail spend trajectory by country and category
- Contribution breakdown by supplier type (Tier-1, one-time vendors, off-catalog)
- Prescriptive recommendation: Launch contract coverage expansion for IT subcategory; activate supplier onboarding review.
- CTA: Trigger sourcing event pre-populated with non-contracted IT spend vendors.
Impact:
The insights helped the client initiate a supplier consolidation RFP, tighten catalog configurations in two key markets, and save an estimated $850K in leakage in the next quarter alone. Time to insight dropped from 4 days to under 8 minutes.
Example 2 – Supplier Risk and Performance Analysis at a Large Retailer
Prompt from User:
“Which of my top 10 indirect suppliers have shown delivery time deterioration and SLA breaches in the last 2 months?”
What Merlin Insights Did:
This was a multi-layer diagnostic query. First, it invoked the Supplier Scorecard framework and pulled performance trends for the top 10 indirect vendors (by spend) over a rolling 60-day period. It analyzed metrics like on_time_delivery_rate, cycle_time_variance, and SLA_compliance_score, using the supplier master and performance module.
Then, it applied time-sliced trend deviation logic. Using historical performance baselines from the past 6 months, it identified statistically significant deterioration in two suppliers, whose SLA compliance dropped below 80% and whose delivery times increased by more than 2.5 standard deviations.
System Output:
- Heatmap comparing SLA compliance and delivery times across the top 10 suppliers
- Causal correlation: Identified increase in defect rates and delayed GRNs post-approval cycle
- Suggested actions: Initiate quality audit; check for recent PO renegotiations or approval bottlenecks
- CTA: Launch a supplier risk evaluation form in Supplier Management with context-filled data
Impact:
The retailer was able to proactively address quality issues, reduce penalty costs due to SLA failures, and renegotiate SLAs with two suppliers. This prevented downstream disruption during their peak retail season and improved supplier trust and accountability.
Example 3 – Opportunity Sourcing for Underutilized Contracts in Pharma Procurement
Prompt from User:
“Show contracts where committed volumes are underutilized by more than 25% over the last 2 quarters.”
What Merlin Insights Did:
This was a contract-to-spend linkage analysis. It performed a real-time join between the Contract module and Spend data, using the contract_utilization_rate metric from the metric catalog. It compared committed volumes or financial thresholds to actual invoice-linked spend on contract line items, scoped across the past two quarters.
The system flagged several enterprise agreements in raw materials and lab equipment categories that had utilization rates below 70%. To explain why, it used its reasoning layer to cross-analyze PO behavior, supplier payment terms, and sourcing timelines. It discovered that:
- A subset of buyers were unaware of negotiated contracts.
- Spot buys were happening via maverick sourcing due to expedited project needs.
- In some cases, contracts had auto-renewal options that had expired without ever being activated.
System Output:
- Utilization ratio trend charts for flagged contracts
- List of buyers and cost centers with highest maverick spend
- Recommendation: Activate contract compliance alerts; launch guided sourcing to redirect future purchases
- CTA: Send contract utilization summary to category manager; initiate supplier re-engagement flow.
Impact:
Over $1.2M in underutilized contract value was surfaced, leading to corrective actions and reactivation of underused agreements. Additionally, guided sourcing was introduced to auto-recommend contract-backed SKUs during requisition, improving compliance and reducing tail risk.
Example 4 – Strategic Category Planning at a Global CPG Company
Prompt from User:
“Compare sourcing efficiency across my top five categories and identify where cost savings potential is highest for the next quarter.”
What Merlin Insights Did:
This was a comparative strategic analysis spanning multiple category domains. First, Merlin Insights pulled the user’s sourcing history and mapped it to the company’s category taxonomy, normalizing across cost centers and geographies. It calculated category-specific KPIs like avg_rfx_cycle_time, contract_conversion_rate, negotiated_vs_realized_savings, and supplier_concentration_index.
To forecast cost savings potential, Merlin Insights fused this data with market indexes (where integrated), commodity trends, and past negotiation deltas. It applied a category-level opportunity scoring model, which used parameters like spend fragmentation, supplier leverage, and prior price elasticity to rank each category’s sourcing potential.
System Output:
- Comparative dashboard across the top five categories showing:
  - Cycle time trends and savings conversion ratios
  - Supplier leverage scores (e.g., % of spend on top 3 suppliers)
  - Fragmentation indices and category-level risk profiles
  - Forecasted savings range using predictive analytics
- Prescriptive guidance: “For the MRO category, consolidation and multi-year contracting could yield 7–9% in additional savings.”
- CTA: Generate a sourcing calendar for Q3 with high-priority initiatives.
Impact:
Category leads used this analysis to reprioritize sourcing events, renegotiate framework agreements, and design long-term supplier rationalization plans. One category owner achieved a $1.4M savings uplift within two quarters by executing its suggestions.
Lessons Learned and What’s Next
The journey of building Merlin Insights was more than a product build—it was an organizational transformation in how we thought about data, intelligence, and decision-making. Designing a multi-agentic conversational analytics platform taught us that true intelligence lies not in model complexity, but in grounding, usability, and adaptability. Below are the key lessons that shaped the product, team, and vision moving forward.
Lessons in Product Thinking and Architecture
- Trust > Novelty: Procurement professionals don’t need another flashy dashboard—they need systems they can trust. Transparency in how metrics are defined, how data is sourced, and how recommendations are made was non-negotiable. Explainability had to be built in, not bolted on.
- Context is King: Most enterprise queries are not one-shot questions. Users think iteratively, change their minds mid-query, or follow up with “what about for marketing suppliers?”. Our dialogue graph and memory models were game changers in retaining context and making Merlin Insights feel intuitive.
- Catalogs are Strategic Assets: The Semantic and Metric Catalogs weren’t just enablers—they were the product’s knowledge foundation. Their governance, versioning, and crowd-sourced enrichment became critical to scaling the platform across regions and business units.
- Agent Modularity Drives Resilience: The decision to architect Merlin Insights as an agentic orchestration layer—where each agent could fail gracefully or be updated independently—proved essential for reliability and iteration speed, especially in multi-tenant deployments.
Lessons in User Behavior and Organizational Change
- Language is Local: Even within the same enterprise, teams use different terms for the same KPI. We had to design a language model that could be taught, corrected, and continuously retrained—treating language as a living system.
- Not All Users Want the Same Output: Some users wanted raw drilldowns; others needed narratives; others preferred summarized diagnostics with prescriptive actions. Merlin Insights’ ability to modulate output format based on persona increased adoption significantly.
- Feedback Loops Create Ownership: Early adopters didn’t just test the product—they shaped it. By embedding user feedback into model tuning, metric calibration, and roadmap decisions, we created a sense of shared ownership that drove long-term engagement.
What’s Next – The Future of Merlin Insights
As we look ahead, our roadmap is driven by one principle: data should not just inform decisions—it should adapt to them. Future releases of Merlin Insights will push further into proactive, autonomous decisioning. Here’s what’s on the horizon:
- Proactive Anomaly Detection: Instead of waiting for questions, Merlin Insights will continuously scan data for risks, trends, or opportunities—and proactively surface them to the right stakeholders, with explanations and suggested actions.
- Cross-Module Intelligence: We’re deepening integrations across the Zycus S2P suite to support end-to-end intelligence. For example, linking supplier risk changes in performance analytics to contract re-negotiation triggers or sourcing delays.
- Adaptive Agent Behavior: Based on user role, region, and prior query patterns, Merlin Insights will dynamically reconfigure its conversational flow, output type, and recommendation style—tailoring the product experience without requiring configuration.
- Closed-loop Action Learning: Every user action taken (or ignored) based on its recommendations will feed back into the model—allowing it to learn not just from queries, but from decisions and their outcomes.
- Language Personalization: Imagine a system where a new procurement analyst in Brazil and a global category lead in Germany both ask the same question—but receive responses localized to their contracts, suppliers, and compliance policies. That’s where we’re headed.
The Role of Generative AI: From Interface to Intelligence
At the foundation of Merlin Insights lies a critical technological enabler: generative AI. While most associate GenAI with chat interfaces and creative outputs, in the enterprise context, its power lies in understanding intent, bridging abstraction layers, and enabling synthesis across heterogeneous data landscapes.
Merlin Insights used GenAI at multiple layers. At the front end, it interpreted free-form user questions—disambiguating synonyms, resolving referential follow-ups, and gracefully handling under-specified queries. This wasn’t just language translation. It was semantic grounding into the procurement domain, informed by ontologies, metric catalogs, and session memory.
Behind the scenes, GenAI models—fine-tuned on procurement-specific queries—enabled intelligent auto-completions, dynamic prompt chaining, and language-driven execution planning. These models helped decide whether a question was diagnostic or prescriptive, which agents should be invoked, and even what fallback strategy to pursue when data was missing.
Most crucially, GenAI allowed us to democratize access to insights. In the past, only power users or analysts could navigate schema-heavy tools or construct SQL. With Merlin Insights, a sourcing manager or finance lead could simply ask, “What’s my maverick spend trend for logistics in Q2?”, and receive a layered, reasoned, and visual answer.
Generative AI turned the system from a query processor into a thinking collaborator. And when combined with structured agentic orchestration, caching, observability, and governance layers, it scaled from toy demos to enterprise-grade reliability.
The takeaway? GenAI isn’t magic dust. But in the right architecture—with the right abstractions, catalogs, and workflows—it becomes the connective tissue between human curiosity and machine precision.
What This Means for AI Product Managers
Merlin Insights was not just a product—it was an AI-native platform built to reimagine how enterprise decisions are made. For product managers working in AI, the project offers key lessons on both strategy and execution.
First, it reminds us that AI is not the product—the user outcome is. It’s tempting to focus on LLM prompts, vector stores, or diffusion pipelines. But none of that matters if the user still leaves with unresolved questions or wasted time. Our job as AI PMs is to keep sight of the decision moment, and ensure every part of the stack drives toward that goal.
Second, AI UX is not just about interfaces—it’s about interaction models. In Merlin Insights, every follow-up question, hover-over explanation, or embedded CTA had to be thought through with intent and traceability. AI systems don’t have static screens. They require conversation design, memory handling, and adaptive flows—all areas where PMs must lead with empathy and rigor.
Third, taxonomy is product infrastructure. The Semantic and Metric Catalogs weren’t just tools—they were user-facing interfaces in disguise. As PMs, we had to own their governance, structure, and evolution. Managing feedback loops, naming conflicts, and localization became a core part of the AI product lifecycle.
Finally, Merlin Insights reaffirmed that AI product management is system design. Unlike traditional features, AI capabilities touch orchestration, ops, data quality, telemetry, compliance, and ethics. PMs must speak fluently across domains, make architectural tradeoffs, and define product boundaries in ambiguous, fast-evolving terrain.
In short: the next generation of AI PMs won’t just ship models or prompts—they’ll build adaptive, resilient socio-technical systems. Merlin Insights was our canvas. The blueprint is ready.
Final Thoughts: From Conversational Interfaces to Cognitive Procurement
The story of Merlin Insights is not merely one of innovation in AI, nor just a milestone in enterprise analytics. It’s a story of rethinking the fabric of decision-making in distributed systems like procurement—where data is fragmented, contexts vary wildly, and decision time is a differentiator.
What began as a vision—to help procurement professionals “find answers at the speed of thought”—became a multi-year endeavor that pushed the boundaries of what conversational AI could do when deeply grounded in domain semantics, enterprise data models, and real business workflows. We didn’t just build a chatbot that sits on top of a data lake. We engineered a decision operating system—one that learns, reasons, contextualizes, and most importantly, acts.
As someone who’s led data platforms and AI product management, I’ve often been asked, “What’s the difference between an insight and intelligence?” The answer, I’ve come to believe, lies in consequence. An insight is something you observe. Intelligence changes what you do next.
That’s what Merlin Insights set out to deliver—not static reports, not clever graphs, but a new cognitive layer that could make enterprise systems feel more alive, more intuitive, more useful. We achieved that by treating data not as an artifact, but as an evolving conversation—between systems, users, and decisions.
And we’re just getting started. The future of procurement is not transactional—it’s anticipatory. It’s not just automated—it’s adaptive. As AI becomes a native layer in every enterprise workflow, the platforms that will define the next decade are not the ones with the most features, but the ones with the clearest understanding of context, consequence, and control.
Merlin Insights is a step toward that future. And the journey ahead is even more exciting than the one behind us.