Most developers meet GitHub Copilot as a “smart autocomplete” that occasionally guesses the next line of code. Used that way, it’s nice — but you’re leaving a lot of value on the table.

Inside VS Code, Copilot offers multiple modes of interaction designed for different stages of development:

Chat Panel:
- Ask – use this for questions and explanations
- Edit – use this for deliberate code changes
- Agent – use this for autonomy and multi-step work

In-Editor Support:
- Ghost Text (Tab Completions) – fast, inline suggestions
- Inline Chat – targeted, context-rich refactoring

If you understand when to use each, you can build a practical workflow: Build, Refine, Verify. This article walks through these modes, how they differ, and how to combine them into a repeatable development pattern you can trust.

The Three Chat Panel Modes: Ask, Edit, Agent

The Chat Panel is your main hub for high-level conversations with Copilot. It has three distinct modes that serve different purposes.

1. Ask Mode: Questions and Explanations

Use Ask when you’re thinking, not editing. Ask mode is for understanding, exploring, and clarifying. It’s a safe space: Copilot won’t touch your files; it only answers in text and code snippets.

Typical prompts:
- “How does this function work?”
- “What is the syntax for flexbox?”
- “Explain this TypeScript error.”
- “What’s a good way to structure feature flags in React?”

Result: You get answers, explanations, and code blocks you can copy manually.

This is ideal when:
- You’re learning an unfamiliar API or library.
- You want a quick conceptual explanation (e.g., async/await, RxJS observables).
- You’re exploring options before committing to any code changes.

Think of Ask mode as your embedded Stack Overflow + tutor. No risk, no edits, just information.

2. Edit Mode: Deliberate Code Changes

Use Edit when you know what to change and want Copilot to implement it.
In Edit mode, you’re giving Copilot a specific instruction about your codebase, and it will propose concrete file edits — still under your control.

Example prompts:
- “Rename this variable across these two files.”
- “Refactor this class into smaller functions.”
- “Convert this callback-based API to async/await.”
- “Add null checks for user input in this file.”

Result: Copilot updates your code in place, but the intent is surgical: you already understand the change; you just want help executing it consistently and quickly.

Use Edit mode when:
- You have a clear, well-defined change.
- You need to apply that change across multiple files.
- You’re doing repetitive or mechanical refactors (renames, pattern changes, adding logs, etc.).

It’s the “do the thing I already decided on” mode.

3. Agent Mode: Autonomy and Multi-Step Tasks

Use Agent when you want Copilot to figure out the how and where. Agent mode is where Copilot becomes more autonomous. You describe an outcome, and Copilot breaks it into steps: editing files, creating new ones, and even running terminal commands (when allowed).

Example prompts:
- “Create a task manager app.”
- “Add a user registration flow with email verification.”
- “Set up a basic Express server with JWT-based authentication.”
- “Generate a CI pipeline for this project using GitHub Actions.”

Result: The Agent:
- Proposes a plan: “I will create A, modify B, run C…”
- Suggests file edits and new files.
- Can run commands in the terminal (e.g., install dependencies, run tests) if you confirm.

Use Agent mode for:
- Greenfield scaffolding (new apps, services, components).
- Large, multi-step features.
- Initial project setup and boilerplate-heavy tasks.

You’re still the tech lead: you approve steps and review diffs, but the Agent does the heavy lifting. Huge, all-in-one prompts perform worse than small, focused tasks. A far better approach is to talk to Agent like you would to a junior developer.
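To make the Edit-mode prompt “Convert this callback-based API to async/await” concrete, here is a minimal before/after sketch. The function names and data are invented for illustration; they are not from the article:

```python
import asyncio

# Before: a hypothetical callback-based API.
def fetch_user(user_id, callback):
    # Simulate a lookup, then hand the result to the caller's callback.
    callback({"id": user_id, "name": "Ada"})

# After: the async/await equivalent an Edit-mode request might produce.
async def fetch_user_async(user_id):
    await asyncio.sleep(0)  # stands in for real asynchronous I/O
    return {"id": user_id, "name": "Ada"}

user = asyncio.run(fetch_user_async(42))
```

The change is mechanical once decided, which is exactly why it suits Edit mode: you specify the pattern, and Copilot applies it consistently across every call site.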
In-Editor Interactions: Speed and Context

Once you leave the Chat Panel and are deep in your code, the interaction style changes. Now it’s about momentum and precision inside your files.

Ghost Text (Tab Completions): Momentum While Typing

Ghost Text is the gray, inline suggestion that appears as you type. This is Copilot in its original, “autocomplete on steroids” form.

Use it for:
- Boilerplate (loop structures, handlers, simple CRUD endpoints).
- Repetitive patterns (similar test cases, validation rules).
- Documentation and comments (docstrings, JSDoc, README snippets).

If completions don’t seem to appear, ensure they’re enabled:
1. Press Cmd + Shift + P (macOS) or Ctrl + Shift + P (Windows/Linux).
2. Type: GitHub Copilot: Toggle Completions
3. Make sure completions are enabled.

Inline Chat (Cmd+I/Ctrl+I): Targeted Refactoring

Inline Chat brings Copilot right to your cursor with local context.

How it works:
1. Highlight the code you want to work on.
2. Press Cmd+I (macOS) or Ctrl+I (Windows/Linux).
3. Describe your intent:
   - “Add priority levels to this list.”
   - “Optimize this loop for large input sizes.”
   - “Convert this to use a switch statement.”
   - “Add better error handling here.”

Inline Chat is ideal for:
- Local logic improvements.
- Iterating on algorithms.
- Enhancing error handling or logging.
- Adding small features in a specific function or block.

Compared with Edit mode, Inline Chat feels more “in the flow”: you’re looking at the exact code, selecting it, and asking Copilot to transform it.

The Build, Refine, Verify Workflow

To get the most out of all these modes, tie them together into a simple three-step workflow: Build, Refine, Verify.

1. Build: Start Broad With Agent

Begin with Agent mode when you’re facing a blank screen or a large new feature.
Example prompts:
- “Create a task manager app.”
- “Add a ‘Projects’ feature to this dashboard, with CRUD endpoints and a basic UI.”
- “Set up database migrations for this service.”

Let the Agent:
- Scaffold the project or feature.
- Create new directories, initial models, basic routes, or components.
- Wire up minimal working paths (e.g., one end-to-end flow).

The goal is to defeat the blank page problem and get a working baseline quickly.

2. Refine: Get Specific With Inline Chat and Edit

Once the structure exists, it’s time to refine and improve.

Use Inline Chat for local improvements:
- “Add filtering by status and due date to this query.”
- “Add priority levels (low, medium, high) to this list and sort accordingly.”
- “Improve the error messages returned by this API.”

Use Edit mode for broader, planned changes:
- “Rename TaskItem to TodoItem across the project.”
- “Extract this monolithic function into smaller utilities in a utils folder.”
- “Switch this module from CommonJS to ES modules.”

In this stage, you’re iterating on correctness, readability, performance, and maintainability.

3. Speed Up: Use Ghost Text to Fill the Gaps

While refining, lean on Ghost Text to:
- Fill in obvious code patterns (e.g., additional test cases once it sees the first one).
- Write simple handlers, DTOs, or interfaces.
- Generate comments or docstrings from function names and parameters.

This keeps you in flow. You decide the structure; Copilot fast-follows your intent.

4. Always Verify: Diff View as Your Safety Net

Regardless of mode, there’s a non-negotiable final step: Verify. Before accepting changes — especially from Agent or Inline Chat — inspect the Diff view:
- Red = lines removed.
- Green = lines added.

Check for:
- Unintended logic changes.
- Hidden side effects (e.g., changed function signatures, altered validations).
- Security or performance pitfalls (e.g., missing input validation, inefficient loops).
Treat Diff view as your review gate:
- If it’s not clear within a few seconds what changed and why, step back.
- Ask Copilot (in Ask mode) to explain the diff: “Explain this change in plain English.” “Does this modification affect existing consumers of this function?”

Copilot accelerates coding, but you remain the responsible engineer. Verification is where your judgement comes in.

Putting It All Together

Here’s how a realistic Copilot-powered session can look:

1. Ask: “What’s a simple architecture for a task manager app with Node.js and React?”
2. Agent: “Create a basic task manager app with backend in Express and frontend in React, including CRUD operations.”
3. Refine with Inline Chat/Edit:
   - Inline: “Add priority levels and due dates to tasks in this component.”
   - Edit: “Rename Task to Todo across backend and frontend.”
4. Speed with Ghost Text: Let Copilot autocomplete repetitive tests and API wrappers.
5. Verify with Diff view: Review every proposed change. Run tests (manually or via Agent) and confirm behavior.

Used this way, Copilot doesn’t replace your skills — it amplifies them.
My recent journey into agentic developer systems has been driven by a desire to understand how AI moves from passive assistance to active participation in software workflows. In an earlier article, AI Co-creation in Developer Debugging Workflows, I explored how developers and AI systems collaboratively reason about code. As I went deeper into this space, I came across the Model Context Protocol (MCP) and became keen to understand what this component is and why it is important. I noticed that MCP was frequently referenced in discussions about agentic systems, yet rarely explained in a concrete, developer-centric way. This article is a direct outcome of that learning process, using a practical Git workflow example to clarify the role and value of MCP in intent-driven developer tooling.

What Is an MCP Server?

At a conceptual level, an MCP server acts as a control plane between an AI assistant and external systems. Rather than allowing an LLM to issue arbitrary API calls, the MCP server implements the Model Context Protocol and exposes a constrained, well-defined set of capabilities that the model can invoke.

As illustrated in the diagram, the AI assistant functions as an MCP client, issuing structured MCP requests that represent user intent. The MCP server receives these requests, validates them against exposed capabilities and permissions, and translates them into concrete API calls or queries against external systems such as databases, version control platforms, or document stores. The results are then returned to the model as structured context, enabling subsequent reasoning or follow-up actions.

This intermediary role is critical. The MCP server is not merely a proxy; it enforces permission boundaries, operation granularity, and deterministic execution. By separating intent expression from execution logic, MCP reduces the risk of unsafe or unintended actions while enabling AI systems to operate on real developer tools in a controlled manner.
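To make the control-plane idea concrete, here is a minimal, stdlib-only sketch of the gating behavior described above. This is not the real Model Context Protocol or its SDK; the tool names and schema shape are invented for illustration:

```python
# Illustrative only: an MCP-style server exposes an explicit capability
# surface and rejects anything the model requests outside it.
ALLOWED_TOOLS = {
    "list_repositories": {"params": {"owner"}},
    "create_pull_request": {"params": {"repo", "title", "base"}},
}

def dispatch(tool_name, params):
    """Validate a requested tool call against the declared capabilities."""
    spec = ALLOWED_TOOLS.get(tool_name)
    if spec is None:
        return {"error": f"unknown tool: {tool_name}"}
    unknown = set(params) - spec["params"]
    if unknown:
        return {"error": f"unexpected parameters: {sorted(unknown)}"}
    # A real server would now call the external API; here we just echo intent.
    return {"ok": True, "tool": tool_name, "params": params}

print(dispatch("delete_repository", {}))                    # blocked: not exposed
print(dispatch("list_repositories", {"owner": "octocat"}))  # allowed
```

Even this toy version shows why the server is more than a proxy: the model can express any intent it likes, but only declared operations with declared parameters ever reach the downstream system.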
In effect, the MCP server bridges conversational AI and operational systems, making intent-driven workflows both practical and governable.

Case Study: Intent-Driven Git Workflows Using GitHub MCP in VS Code

To ground the discussion, this section presents a concrete case study using the open-source github-mcp-server, integrated into Visual Studio Code via GitHub Copilot Chat. The goal of this case study is not to demonstrate feature completeness, but to illustrate how MCP enables intent-first interaction for common GitHub workflows.

MCP Server Registration in VS Code

MCP servers are configured at the workspace or user level using a dedicated configuration file. In this setup, the GitHub MCP server is registered by adding an MCP configuration file under the VS Code workspace:

.vscode/mcp.json

{
  "servers": {
    "github": {
      "url": "https://api.githubcopilot.com/mcp/"
    }
  }
}

This configuration declares GitHub as an MCP server and points the IDE’s MCP client to a remote endpoint. Once registered, the IDE can discover the capabilities exposed by the GitHub MCP server and make them available to the chat interface as structured tools.

Authentication via OAuth Approval

When the MCP server is first invoked, VS Code initiates an OAuth flow with GitHub. In this case, authentication was completed by approving access through a browser-based login using GitHub credentials (username and password, followed by any configured multi-factor authentication).

This OAuth-based flow has several important properties:
- Credentials are not stored directly in the MCP configuration.
- Permissions are scoped to the approved application.
- Token issuance and rotation are handled by the GitHub authorization system.

Once authorization is complete, the MCP server can securely execute GitHub operations on behalf of the user, subject to the granted scopes (these are listed as tools when configuring the MCP server).
Alternative Authentication: Personal Access Tokens

In addition to browser-based OAuth authorization, the GitHub MCP server can also be configured using a GitHub Personal Access Token (PAT). This approach is useful when explicit credential control is required or when OAuth approval is not feasible in a given environment. In this setup, the MCP configuration declares an Authorization header and prompts the user to supply the token securely at runtime, rather than hardcoding it in the file.

.vscode/mcp.json (PAT-based authentication)

{
  "servers": {
    "github": {
      "type": "http",
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": {
        "Authorization": "Bearer ${input:github_mcp_pat}"
      }
    }
  },
  "inputs": [
    {
      "type": "promptString",
      "id": "github_mcp_pat",
      "description": "GitHub Personal Access Token",
      "password": true
    }
  ]
}

This configuration has two practical advantages. First, the token is not committed to source control because it is collected via an interactive prompt. Second, it makes the authentication mechanism explicit and portable across environments while keeping the MCP server endpoint unchanged. After the token is provided, the IDE can invoke GitHub MCP capabilities through the same intent-driven prompts used in the OAuth-based setup.

Verifying MCP Server Initialization in VS Code

After adding the MCP configuration, it is important to verify that the GitHub MCP server is correctly initialized and running. Visual Studio Code exposes MCP server lifecycle events directly in the Output panel, which serves both as a validation mechanism and a primary debugging surface. Once the .vscode/mcp.json file is detected, VS Code attempts to start the configured MCP server automatically. In the Output tab, selecting the “MCP: github” channel shows detailed startup logs, including server initialization, connection state, authentication discovery, and tool registration.
The logs confirm several important stages:
- The GitHub MCP server transitions from Starting to Running.
- OAuth-protected resource metadata is discovered.
- The GitHub authorization server endpoint is identified.
- The server responds successfully to the initialization handshake.
- A total of 40 tools are discovered and registered.

These log entries provide concrete evidence that the MCP server is active and that its capabilities are available to the IDE. They also offer visibility into the OAuth flow, making it clear when authentication is required and when it has been successfully completed.

From a practical standpoint, the Output panel becomes essential when troubleshooting MCP integrations. Configuration errors, authentication failures, or capability discovery issues surface immediately in these logs, allowing developers to debug MCP setup issues without leaving the IDE or guessing at silent failures.

Executing GitHub Operations Through Intent

Once the GitHub MCP server is configured and running, GitHub operations become available inside the IDE as structured capabilities. Using Visual Studio Code with GitHub Copilot Chat, prompts expressed in natural language are translated into constrained GitHub operations via the github-mcp-server.

Repository Discovery

Prompt: “List all repos in my GitHub account.”

The assistant invokes the repository-listing capability and returns the results directly in the IDE, validating authentication and MCP capability discovery.

Pull Request Creation

Prompt: “Create a PR.”

Because the request is underspecified, the assistant asks for required parameters, including repository, change source, title, description, and base branch. After responding with:

“react-storybook-starter, staged changes, PR title – Add a dummy commit, PR description none, merge to master”

the assistant creates a branch, commits the staged changes, and opens a pull request. The PR is confirmed with its repository identifier.
Repository Creation

Prompt: “Create a new repo in mvmaishwarya. Repo name: problems-and-prep. Repo is public.”

The MCP server executes the repository creation operation and returns confirmation that the public repository has been successfully provisioned.

Observations from Intent-Driven Execution

Across these examples, several consistent behaviors emerge. First, the assistant requests clarification only when required by the operation’s schema, avoiding unnecessary dialogue. Second, all actions are executed through explicitly exposed MCP capabilities rather than inferred or free-form API calls. Finally, the IDE remains the primary workspace, reducing context switching between terminals, browsers, and documentation.

Together, these interactions demonstrate how MCP enables GitHub workflows to shift from command-driven procedures to intent-driven execution while maintaining safety, transparency, and developer control.
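The “ask only when the operation’s schema requires it” behavior seen in the pull request example can be sketched in a few lines. This is an illustrative model, not the actual github-mcp-server logic; the schema and field names are invented:

```python
# Illustrative sketch of schema-driven clarification: the assistant asks
# for input only when a required parameter is still missing.
TOOL_SCHEMAS = {
    "create_pull_request": {"required": ["repo", "title", "base"]},
}

def missing_parameters(tool, provided):
    """Return the required parameters the user has not yet supplied."""
    required = TOOL_SCHEMAS[tool]["required"]
    return [name for name in required if name not in provided]

# "Create a PR." with no details: everything required must be asked for.
print(missing_parameters("create_pull_request", {}))

# After the user supplies repo, title, and base: nothing left to ask.
print(missing_parameters("create_pull_request",
                         {"repo": "react-storybook-starter",
                          "title": "Add a dummy commit",
                          "base": "master"}))
```

Driving clarification from the schema, rather than from free-form model judgment, is what keeps the dialogue minimal and the execution deterministic.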
As Indian SaaS companies, e-commerce platforms, and service providers increasingly target global markets, the need for robust international payment integration has become paramount. While numerous payment gateways offer cross-border capabilities, the developer experience and the specific API features required to handle these transactions efficiently — especially given India’s unique compliance landscape — vary significantly.

Simply processing a charge isn’t enough. Developers need APIs that elegantly handle multiple currencies, diverse global payment methods, stringent security protocols such as 3D Secure 2.0, and, crucially, provide programmatic access to the data required for Indian regulatory needs like the Foreign Inward Remittance Certificate (FIRC). Manual processes for compliance or reconciliation simply don’t scale.

This article provides a technical deep dive into the APIs of five major payment gateways active in India, evaluating their suitability for developers building applications that require international payment acceptance. We focus on API design, core international payment features, developer experience (DX), and the critical aspect of handling compliance programmatically.

The API Litmus Test: Key Criteria for Evaluation

When assessing an international payment gateway API from an Indian developer’s perspective, the following factors are critical.

API Design & Developer Experience (DX)
- Architecture: Is the API truly RESTful, with predictable, resource-oriented URLs and standard HTTP methods?
- Documentation: Is the API reference comprehensive, accurate, and easy to navigate? Are there clear code examples, tutorials, and quickstart guides relevant to international payments?
- SDKs: Are well-maintained SDKs available for major backend languages (Node.js, Python, Java, PHP, Ruby)?
Do they provide convenient abstractions over raw API calls?
- Sandbox environment: How closely does the sandbox mimic the production environment, especially for testing international card flows, 3DS challenges, and currency conversions? Is it reliable and easy to provision test credentials?
- Developer support: How responsive and technically adept is the support team when developers face integration issues?

Multi-Currency & FX Handling via API
- Currency support: Does the API allow creating charges directly in major international currencies (USD, EUR, GBP, etc.)?
- FX rate transparency: Can applicable foreign exchange rates be fetched or previewed via the API?
- Settlement data: How clearly does the API, or related webhooks, expose the final settlement amount in INR, including any applied FX rates or fees?

Payment Method Integration (API Level)
- International cards: How straightforward is the API flow for accepting major international card networks (Visa, Mastercard, Amex)?
- Other global methods: Does the API support integrating other relevant methods, such as PayPal, easily if required?

Security & 3DS2 Integration APIs
- PCI compliance: Does the provider offer solutions (such as hosted fields or dedicated SDKs) that minimize the developer’s PCI compliance burden?
- 3D Secure 2.0: How does the API manage mandatory 3DS2 flows for relevant international transactions? Does it provide clear status updates via webhooks or callbacks for authentication success, failure, or challenge flows?
- Fraud prevention APIs: Are there endpoints for retrieving fraud risk scores, passing custom transaction metadata for risk analysis, or configuring fraud rules programmatically?

Compliance & Settlement Data via API (Critical for India)
- FIRC data retrieval: Can the essential data points required for FIRC generation — such as UTR number, purpose code, transaction ID, settlement amount, and FX rate — be accessed programmatically via API endpoints or reliably delivered through webhooks?
Or does this require manual report downloads?
- Reconciliation: Do the settlement APIs or reports provide sufficient detail (for example, linking settlements back to original transaction IDs) to enable automated reconciliation of international payments credited to an Indian bank account?

The API Deep Dive: Comparing Five International Payment Gateways

Let’s examine how five popular gateways stack up based on these API-centric criteria.

1. Razorpay International Payments

Positioning: Optimized for Indian businesses — SaaS, e-commerce, and services — going global.

API analysis: Razorpay offers a largely RESTful API. Creating international charges involves specifying the currency parameter, with support for 130+ currencies. The documentation is generally clear, with dedicated sections for international payments and code examples in multiple languages. SDKs are available for major platforms.

Strengths (API focus):
- Compliance automation: Razorpay’s key differentiator. While direct API endpoints for all FIRC data points are still evolving, the platform provides crucial identifiers — such as razorpay_payment_id and settlement details (settlement_id, utr) — via webhooks and dedicated Settlement APIs. This facilitates programmatic reconciliation and compliance data collection. Features like the MoneySaver Export Account aim to improve FX transparency, often reflected in settlement details accessible via API. Additionally, the international payment gateway handles international card payments reliably, with minimal downtime.
- Unified domestic/international payments: Indian payment methods (UPI, Netbanking) and international cards are handled through a relatively consistent API structure, reducing integration complexity.

Potential weaknesses (API focus):
- The sandbox environment, while functional, may not always replicate all edge cases for international 3DS flows across card issuers.
- Advanced FX rate querying may not be fully exposed via API.
Verdict: A strong choice for Indian developers prioritizing integrated compliance and a unified API for domestic and international payments. The programmatic access to settlement data is a significant advantage, and the MoneySaver Export Account is a cost-effective alternative to traditional bank transfers.

2. Stripe (Global)

Positioning: The feature-rich global standard.

API analysis: Stripe’s API — especially PaymentIntents — is widely regarded as a gold standard for design, consistency, and documentation. It is highly flexible, supporting complex international scenarios, multiple currencies, and a broad range of global payment methods. SDKs and developer tooling are excellent.

Strengths (API focus):
- Flexibility and power: Granular control over the payment lifecycle, including 3DS handling, and support for many international payment methods beyond cards.
- Developer experience: Best-in-class documentation, client libraries, CLI tooling, and sandbox environment. Extensive webhook support enables real-time updates.

Potential weaknesses (API focus):
- Indian compliance via API: Programmatically extracting FIRC-related data — such as the exact UTR number from Indian settlement batches — can be challenging. It often requires parsing settlement reports obtained manually or via indirect APIs (for example, the Reporting API), adding complexity compared to India-focused providers. Purpose code management might also be less integrated at the API level.

Verdict: An excellent API for complex global payment flows and experienced teams. However, developers must plan for additional work to automate India-specific compliance requirements.

3. PayPal

Positioning: Widely trusted globally, with varying API depth.

API analysis: PayPal provides modern REST APIs for checkouts and card processing (where available). Integration typically involves redirects or JavaScript SDKs. Multi-currency handling is a core capability.
Strengths (API focus):
- Global recognition: Integrating the PayPal wallet via API or SDK is straightforward and benefits from strong global user trust.
- Broad currency support: Native multi-currency support across APIs.

Potential weaknesses (API focus):
- API complexity: Direct international card processing (beyond PayPal wallet payments) can be more complex or have limited availability compared to Stripe or Razorpay.
- Indian compliance via API: Similar to Stripe, retrieving FIRC-related settlement data (like UTR) programmatically often requires specific reporting endpoints or manual report downloads. Auto-withdrawal can further complicate reconciliation.

Verdict: Essential if PayPal wallet support is a priority. For direct card processing, carefully evaluate API capabilities and the feasibility of automating Indian compliance workflows.

4. 2Checkout (Verifone)

Positioning: Focused on global e-commerce and digital goods.

API analysis: 2Checkout provides APIs for global e-commerce use cases, supporting multiple currencies and international payment methods. Documentation covers order creation, payments, and subscriptions.

Strengths (API focus):
- Global payment methods: Strong support for region-specific international payment methods.
- E-commerce features: APIs often include features relevant to e-commerce, such as tax handling and localized checkout features.

Potential weaknesses (API focus):
- DX and modernity: API design and developer experience may feel less modern or intuitive compared to Stripe or Razorpay.
- Indian compliance via API: Accessing Indian settlement details (such as UTRs for FIRC) programmatically may be less straightforward and insufficiently documented for Indian compliance needs.

Verdict: A viable option for global e-commerce businesses, but requires careful evaluation of API endpoints and processes for automating Indian compliance and reconciliation.

5. CCAvenue

Positioning: Established Indian player with international capabilities.
API analysis: CCAvenue supports international payments and multi-currency processing. Historically, integrations relied on form posts or proprietary protocols, though newer APIs may be available.

Strengths (API focus):
- Local market expertise: Deep understanding of the Indian banking ecosystem.
- Multi-currency processing: Supports international currencies with INR settlement.

Potential weaknesses (API focus):
- API design and DX: Older integrations may feel less developer-friendly. Documentation can be less comprehensive or harder to navigate.
- Compliance data via API: Programmatic access to granular settlement data (such as UTRs for FIRC) may be limited or require manual report handling.

Verdict: Reliable, especially for businesses already using CCAvenue domestically, but developers should carefully assess the latest APIs with a focus on DX and automated access to compliance data.

API Feature Matrix: Quick Comparison for Developers

| Gateway | API Design | Multi-Currency API Ease | FIRC Data via API? | SDK Quality | Docs Clarity | Sandbox Quality |
|---|---|---|---|---|---|---|
| Razorpay Int'l | Mostly RESTful | Excellent | Yes (partial, via Settlements API/webhooks) | Excellent | Excellent | Good |
| Stripe (Global) | Excellent (REST) | Good | Indirect (via Reporting API/manual) | Excellent | Excellent | Excellent |
| PayPal REST | Good (REST) | Good | Indirect (via reporting/manual) | Good | Good | Good |
| 2Checkout (Verifone) | Fair-Good | Good | Likely indirect | Fair | Fair | Fair-Good |
| CCAvenue | Varies (legacy/new) | Fair | Likely indirect/manual | Fair | Fair | Fair |

Note: “FIRC Data via API?” refers to the ease of programmatically obtaining identifiers such as UTRs for automated compliance, not merely the existence of the data in reports.

Conclusion: Selecting the Best API for Your International Stack

Choosing an international payment gateway API requires balancing global feature richness with local operational realities.

- Global powerhouses (Stripe, PayPal): Offer flexible, feature-rich APIs ideal for complex international scenarios.
However, automating India-specific compliance — especially FIRC data retrieval — often requires additional engineering effort.
- India-optimized solutions (Razorpay): Aim to bridge this gap by combining international payment capabilities with built-in or well-exposed compliance pathways via APIs and webhooks, reducing development and operational overhead.
- Specialized players (2Checkout, CCAvenue): Provide essential functionality but may lag in API modernity, DX, or programmatic access to India-specific compliance data.

Ultimately, the best API depends on your team’s expertise, payment flow complexity, and how critical automated compliance is to your operations. Before committing, thoroughly test sandbox environments — focusing on international card flows with 3DS2, currency handling, and, most importantly, your ability to programmatically retrieve transaction and settlement data required for FIRC and reconciliation. The API that makes this lifecycle easiest to manage in code is likely your best long-term choice.
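As a sketch of what “programmatic FIRC data retrieval” can look like in practice, here is a hypothetical settlement-webhook handler that pulls out the identifiers discussed above. The payload shape and field names are invented for illustration; real gateways document their own webhook schemas:

```python
import json

# Hypothetical settlement webhook payload (field names are illustrative).
RAW_EVENT = json.dumps({
    "event": "settlement.processed",
    "payload": {
        "settlement_id": "setl_123",
        "utr": "UTIB0000123456",
        "amount": 412350,          # minor units (paise)
        "currency": "INR",
        "fx_rate": 83.12,
        "transaction_ids": ["pay_abc", "pay_def"],
    },
})

def extract_firc_fields(raw_event):
    """Collect the identifiers needed for FIRC records and reconciliation."""
    payload = json.loads(raw_event)["payload"]
    return {
        "utr": payload["utr"],
        "settlement_id": payload["settlement_id"],
        "inr_amount": payload["amount"] / 100,  # paise -> rupees
        "fx_rate": payload["fx_rate"],
        "linked_transactions": payload["transaction_ids"],
    }

record = extract_firc_fields(RAW_EVENT)
```

The point of the sketch is the shape of the workflow: if the gateway delivers UTR, settlement ID, FX rate, and linked transaction IDs in a machine-readable event, compliance records and reconciliation can be built in code rather than from downloaded reports.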
During my eight years working in agile product development, I have watched sprints move quickly while real understanding of user problems lagged. Backlogs fill with paraphrased feedback. Interview notes sit in shared folders collecting dust. Teams make decisions based on partial memories of what users actually said. Even when the code is clean, those habits slow delivery and make it harder to build software that genuinely helps people.

AI is becoming part of the everyday toolkit for developers and UX researchers alike. As stated in an analysis by McKinsey, UX research with AI can improve both speed (by 57%) and quality (by 79%) when teams redesign their product development lifecycles around it, unlocking more user value. In this article, I describe how you can turn user studies into clearer user stories, better agile AI product development cycles, and more trustworthy agentic AI workflows.

Why UX Research Matters for AI Products and Experiences

For AI products, especially LLM-powered agents, a single-sentence user story is rarely enough. Software developers and product managers need insight into intent, context, edge cases, and what "good" looks like in real conversations. When UX research is integrated into agile rhythms rather than treated as a separate track, it gives engineering teams richer input without freezing the sprint.

In most projects, I find three useful touchpoints:
- Discovery is where I observe how people work today.
- Translation is where those observations become scenario-based stories with clear acceptance criteria.
- Refinement is where telemetry from live agents flows back into research and shapes the next set of experiments.

A Practical UX Research Framework for Agile AI Teams

To keep this integration lightweight, I rely on a framework that fits within normal sprint cadences. I begin by framing one concrete workflow rather than a broad feature; for example, "appointment reminder calls nurses make at the start of each shift."
I then run focused research that can be completed in one or two sprints, combining contextual interviews, sample call listening, and a review of existing scripts. The goal is to understand decisions, pain points, and workarounds. Next, I synthesize findings into design constraints that developers can implement directly. Examples include "Never leave sensitive information in voicemail" or "Escalate to a human when callers sound confused." Working with software developers, product managers, and UX designers, I map each constraint to tests and telemetry so the team can see when the AI agent behaves as intended and when it drifts.
UX Research Framework for Agile AI Product Development
Technical Implementation: From Research to Rapid Prototyping One advantage of modern AI development is how quickly engineering can move from research findings to working prototypes. The gap between understanding the problem and having something testable has shrunk dramatically. Gartner projects that by 2028, 33% of enterprise software will embed agentic AI capabilities, driving automation and productivity gains. When building AI agents, I have worked with teams using LLMs or LLM SDKs to stand up functional prototypes within a single sprint. The pattern typically looks like this: UX research identifies a workflow and its constraints, then developers configure the agent using the SDK's conversation flow tools, prompt templates, and webhook integrations. Within days, I have a working prototype that real users can evaluate. This is where UX research adds the most value to rapid prototyping. SDKs handle the technical heavy lifting, such as speech recognition, text-to-speech, and turn-taking logic. But without solid research, developers and PMs end up guessing business rules and conversation flows.
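As an illustration of what "mapping a constraint to a test" can look like in practice, here is a sketch of the voicemail constraint as an executable guardrail check. The pattern list and function names are hypothetical stand-ins invented for this example, not part of any real project or SDK:

```python
import re

# Hypothetical guardrail derived from the research constraint
# "never leave sensitive information in voicemail". The term list is
# illustrative; a real deployment would use the project's own
# data-classification rules, reviewed with compliance stakeholders.
SENSITIVE_PATTERNS = [
    r"\bdiagnosis\b",
    r"\btest result",
    r"\baccount number\b",
    r"\bbalance\b",
    r"\d{3}-\d{2}-\d{4}",  # SSN-like digit patterns
]

def violates_voicemail_policy(message: str) -> bool:
    """Return True if a drafted voicemail message breaks the constraint."""
    text = message.lower()
    return any(re.search(p, text) for p in SENSITIVE_PATTERNS)
```

A check like this can run in CI against a corpus of generated voicemail drafts, turning a research finding into a regression test rather than a line in a slide deck.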
When I bring real user language, observed pain points, and documented edge cases into sprint planning, the engineering team can focus on what matters: building an agent that fits how people work. The same holds true for text-based agents. LLM SDKs let developers wire up conversational agents quickly, but prompt engineering goes faster when you have actual user phrases to work from. Guardrails become obvious when you have already seen where conversations go sideways. How UX Research Changes Agile AI Development Incorporating UX research into agile AI work changes how teams plan and ship software. Deloitte's 2025 State of Generative AI in the Enterprise series notes that organizations moving from proofs of concept into integrated agentic systems are already seeing promising ROI. In my experience, the shift happens in two key areas. The first change is in how I discuss the backlog with engineering and product teams. Instead of starting from a list of features, I start from observed workflows and pain points. Software developers and PMs begin to ask better questions: How often does this workflow occur? What happens when it fails? Where would automation genuinely help rather than just look impressive in a demo? The second change is in how I judge success. Rather than looking only at LLM performance metrics or deployment counts, I pay attention to human-centric signals. Did the AI agent reduce manual calls for nurses that week? Did fewer financial operations staff report errors in their end-of-day checks? Those questions anchor agile AI decisions in users' lived experience. Use Case: Voice AI Agent for Routine Calls I built a voice AI agent to support routine inbound and outbound calls in healthcare and financial services. In my user research, I found that clinical staff and operations analysts spent large parts of their shifts making scripted reminder and confirmation calls.
Staff jumped between systems, copied standard phrases, and often skipped documentation when queues spiked. I ran contextual interviews with nurses and operations staff over two sprints. I sat with them during actual call sessions, noted where they hesitated, and asked why certain calls took longer than others. One nurse told me she dreaded callbacks for no-shows because patients often got defensive. That single comment shaped how we designed the escalation logic. Based on these observations, I scoped an AI agent with clear boundaries. It would dial numbers, read approved scripts, capture simple responses like "confirm" or "reschedule," log outcomes in the primary system, and escalate to a human when callers sounded confused or emotional. Each constraint came directly from something I observed or heard in research. The "escalate when confused" rule, for example, came from watching a staff member spend four minutes trying to calm a patient who misunderstood an automated message. We treated the research findings as acceptance criteria in the backlog. Developers could point to a specific user quote or observed behavior behind every rule. When questions came up during sprint reviews, I could pull up the interview notes rather than guess. The AI agent cut manual call time, reduced documentation errors by more than 50%, and made collaboration between teams and end users more consistent. Because I started from real workflow observations and built in human escalation paths, adoption was smoother than previous automation attempts and increased by 35% in one quarter. Voice AI Agent Case Study Why This Approach Works UX research gives agile AI development a focused user perspective that directly supports developer cycles. When teams work from real workflows and constraints, they write less speculative code, reduce rework, and catch potential failures earlier. 
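The "escalate when confused or emotional" rule from the case study can be sketched as a small, testable decision function. The marker phrases, sentiment threshold, and parameter names below are illustrative assumptions; a real system would feed these signals from its speech recognition and sentiment models and tune the thresholds against recorded calls:

```python
# Illustrative escalation rule: hand off to a human when the caller
# sounds confused or emotional, or the agent keeps repeating itself.
CONFUSION_MARKERS = (
    "i don't understand",
    "what do you mean",
    "who is this",
    "stop calling",
)

def should_escalate(transcript_turn: str, sentiment: float,
                    repeat_count: int) -> bool:
    """Decide whether to route the call to a human.

    transcript_turn: latest caller utterance (from ASR)
    sentiment: model score in [-1, 1], negative = distressed
    repeat_count: times the agent has had to repeat its last prompt
    """
    turn = transcript_turn.lower()
    if any(marker in turn for marker in CONFUSION_MARKERS):
        return True
    if sentiment < -0.5:   # strongly negative tone
        return True
    return repeat_count >= 2
```

Encoding the rule this way keeps the research-derived constraint visible in the codebase, and each branch can be traced back to an observed interview moment.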
McKinsey’s work on AI-enabled product development points out that teams that redesign their agile AI product development around UX research expertise tend to see more user-centric decision-making, leading to better product experiences. In my experience, you do not have to trade one for the other: agile AI teams that work this way stay closer to their users without slowing down. Key Takeaways If you are beginning to build or refine LLM-powered agents, here is a realistic next step. Pick one narrow workflow. Study how work happens today. Run a small research-driven experiment. Use telemetry and follow-up conversations to refine each iteration. AI delivers lasting value only when it is integrated thoughtfully into how people and teams already operate. By treating UX research as a first-class part of agile AI development, you bring the user's perspective into every sprint and make your development lifecycle more responsive to real needs.
UX research helps agile AI teams start from real workflows instead of abstract features, leading to more focused and effective agentic workflows
Integrating research into each agile AI product development sprint gives teams clearer constraints, reduces rework, and supports higher-quality releases
Modern LLMs accelerate prototyping, but the quality of your agentic AI workflows depends on how well you understand the underlying workflows before you define requirements and write code
Bigger isn’t always better, especially when it comes to AI models. Each new generation of models is larger, more capable, and more resource-intensive, with bigger models delivering enhanced reasoning, summarization, and even code generation capabilities. But the size and scalability of gen AI models have their limits. Larger models are designed to work best with open-ended problems, which are, by nature, often encountered in chats. However, when an AI-powered product, such as a CRM system, uses AI models, the problem the product is solving is actually very much fixed and highly structured. It has deviated substantially from the original chat format, which required AI models to define the problem and come up with the steps to a solution themselves. As we look forward to 2026, we can expect to see more nimble system designs. AI is transitioning from research to production, particularly in enterprise ecosystems, and the limitations of LLMs are beginning to show. Latency, cost, and lack of control are making it more difficult to harness LLMs for fixed business workflows. Using LLMs to address routine business issues is like using a sledgehammer to crack a nut – you don’t need that much AI processing power. When Smaller Is Better Let's take AI-powered customer support for e-commerce, for example, which is one of the most popular business use cases of GenAI. When implementing an AI customer support agent, the first instinct would be to deploy a large thinking model like GPT-5 Thinking or Sonnet 4.5 to handle the full customer inquiry, since these thinking models are supposedly powerful enough to do everything, including understanding customer tone, interpreting requests, generating empathetic responses, checking inventory, processing returns, and escalating complex issues. However, when this is actually implemented, there are some key issues:
The response is slow. Larger thinking models are often slower than smaller models.
This may be a smaller problem for email support, but a very big issue for chat support.
It's expensive. Larger models may cost 10 times as much as smaller models to process the exact same input.
It's inconsistent. Larger models may correctly answer customer inquiries 90% of the time, but it's very difficult to improve on the last 10% since we have so little control over "how" the model thinks.
The next wave of AI systems will prioritize architecture over scale. It’s time to adopt smaller, faster, more specialized AI models engineered to work together as modular components to address specific business problems. The Bigger Brain Fallacy For the past five years, developers have been focused on optimizing “thinking” AI models that can handle open-ended reasoning using conversational language. LLMs that support such thinking models are great for free-form tasks, such as ideation, creative writing, and complex logic. They are less well-suited for structured, rules-based applications, such as CRM, ERP, and e-commerce, yet organizations keep adapting LLMs for rules-based workflows. The problem space for many business issues is well-defined within a specific workflow. LLMs are ideal for freeform reasoning, but in these applications the task is usually clearly defined; little free reasoning is needed to find a path to the solution. The job is to execute that path efficiently and predictably, with consideration for constraints like cost and latency. For interactive systems handling routine customer issues, businesses need predictability and consistency, not opaque AI geniuses. Modular Means More Efficiency Rather than adopting behemoth AI models, it makes better sense to break the problem into a sequence of narrower AI tasks, each handled by a specific, lightweight AI model. Each of these smaller models performs a discrete, well-defined function. Together, they can be assembled into a composable workflow that outperforms LLMs for well-defined functions.
Assembling a swarm of task-specific models optimizes speed, cost, and reliability. For example, we already have a clear set of rules on how customer inquiries should be processed. Let's do a high-level overview of how we can use small models to divide and conquer:
Intent classification – Use an intent classifier at the beginning with a tiny model. Its only job is to read the customer message and identify what the customer wants, whether it is a refund, order tracking, product info, etc.
Policy enforcement – Based on the classified intent, run the predefined SOP for that category. Let's say the customer is asking for a refund; we can first run a small model to check store return policies. It can accept or reject the request, ask for more information, or escalate and route to human support.
Data interaction – If the refund is accepted, run a model to generate an action to check and update customer order data in the database.
Response generation – Based on the result of the updated order, the AI drafts a response using a small model, or even sends a simple reply to the customer from a template without using AI at all.
While there are multiple model calls, each one is smaller, faster, and cheaper than using a single LLM. This approach could reduce processing time by 70% and cut costs by over 50%. The simpler the query, the shorter the time and the lower the cost. It’s also easier to debug. Since each function has a specific responsibility, developers can observe and test outcomes. Each component can be individually benchmarked to identify the weak points. The accuracy of this swarm-of-smaller-models approach is much better than the single larger thinking model approach in most cases, because each small model is asked to do one much simpler, specific job, and so has a much smaller chance of hallucinating. It also has many fewer output degrees of freedom and a clearer success criterion, which reduces the number of ways that things can go wrong.
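The four-stage flow described above can be sketched as plain orchestration code. Each function below stands in for a small, task-specific model; here they are stubbed with simple rules so the control flow is visible, and all names and policies (such as the 30-day refund window) are illustrative assumptions:

```python
def classify_intent(message: str) -> str:
    # Stage 1: a tiny intent classifier (stubbed with keyword rules).
    msg = message.lower()
    if "refund" in msg:
        return "refund"
    if "tracking" in msg or "where is my order" in msg:
        return "order_tracking"
    return "general"

def check_policy(intent: str, order: dict) -> str:
    # Stage 2: policy enforcement - a small model or plain rules
    # applies the SOP for this intent category.
    if intent == "refund":
        within_window = order.get("days_since_delivery", 999) <= 30
        return "accept" if within_window else "reject"
    return "n/a"

def update_order(order: dict, decision: str) -> dict:
    # Stage 3: data interaction - apply the decision to the record.
    if decision == "accept":
        return {**order, "status": "refund_initiated"}
    return order

def respond(intent: str, decision: str) -> str:
    # Stage 4: response generation - a template is often enough;
    # a small model could rewrite it for tone.
    templates = {
        ("refund", "accept"): "Your refund has been initiated.",
        ("refund", "reject"): "This order is outside the 30-day refund window.",
    }
    return templates.get((intent, decision), "An agent will follow up shortly.")

def handle(message: str, order: dict) -> str:
    # Orchestrator: chain the four stages into one workflow.
    intent = classify_intent(message)
    decision = check_policy(intent, order)
    order = update_order(order, decision)
    return respond(intent, decision)
```

Because each stage is a separate function with a narrow contract, any one of them can be swapped for a fine-tuned small model and benchmarked in isolation, which is precisely the debuggability argument made above.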
A Return to Classic Software Principles Using a modular approach may seem familiar. Rather than treating AI systems as black boxes, this marks a return to classic software engineering, where developers can create transparent and measurable elements. In this architecture, each model behaves like a microservice. Observable metrics such as latency, cost per token, and accuracy are tracked at every stage. Classifiers or text generators can be swapped out without having to retrain the entire system. Workflows can be reconfigured based on user context or business logic. This modular approach aligns AI with modern DevOps practices. Deployment pipelines can be extended to include model components. Monitoring tools can log model-level performance, error rates, and drift. The result is AI development as an iterative engineering approach rather than the building of a black box. The resulting systems are not only faster and more predictable but also easier to maintain at scale. The use cases of the largest AI adopters are mostly very well suited to this swarm-of-smaller-models approach. The top 30 OpenAI customers have already used more than 1 trillion AI tokens. For most of these companies, AI usage is well-defined, so they would likely benefit from using a swarm of small models. Duolingo is one of the companies in the top 30 list. The company is utilizing AI for language learning, which doesn’t require much critical thinking. What it does need is consistent ways to generate responses in multiple languages. A swarm of models performing structured, repeatable tasks is all that’s needed. Generative AI was designed to address the bigger challenge of natural language processing (NLP). Most AI applications are taking advantage of that capability, but in 2026, we can expect to see a shift in focus from AI model size to system design. The most advanced products will be defined by their architecture rather than the number of parameters.
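The stage-level observability described above can be sketched with a small decorator that tracks latency and call counts per component, the way you would instrument a microservice (cost per token could be recorded the same way). All names here are illustrative:

```python
import time
from collections import defaultdict

# Per-stage metrics store: each model component gets its own counters,
# mirroring per-service dashboards in a microservice architecture.
metrics = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})

def observed(stage: str):
    """Wrap a model-call function so latency and call counts are
    recorded under the given stage name."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                m = metrics[stage]
                m["calls"] += 1
                m["total_ms"] += (time.perf_counter() - start) * 1000
        return inner
    return wrap

@observed("intent_classifier")
def classify(message: str) -> str:
    # Stand-in for a small intent model.
    return "refund" if "refund" in message.lower() else "general"
```

With each component instrumented this way, drift or latency regressions show up per stage rather than as an undifferentiated slowdown of one giant model call.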
The key to success is intelligently and efficiently orchestrating specialized models to address specific business outcomes. AI is entering the DevOps era. The future won’t be built using a single giant brain, but a network of distributed micro-intelligences working together at machine speed.
TL;DR: The Pre-Mortem Leadership resistance to your pre-mortem reveals whether your organization’s operating model prioritizes comfortable narratives over preventing failure. This article shows you how to diagnose cultural dysfunction and decide which battles to fight. The Magic of Risk Mitigation Without Passing Blame There’s a risk technique that takes 60 minutes, costs nothing, and surfaces problems other planning methods miss. It’s been field-tested for nearly two decades. Teams that use it catch catastrophic issues while there’s still time to act. Most organizations never run one. When you try to introduce it, the people who complain loudest about projects failing are likely the same ones who will kill it in the meeting. The technique is the pre-mortem, and the resistance you hit tells you more about your organization than any risk register. The Basics (In Case You Haven’t Run a Pre-Mortem) Traditional risk planning asks “what might go wrong?” A pre-mortem flips it: Assume your initiative already failed. It’s six months from now, the project is a smoking crater, and you’re gathering the team to explain what happened. That shift from “might fail” to “did fail” breaks something open. People stop hedging. The risks they’ve been too politically careful to mention in a typical planning session suddenly make it into the room: the technical debt everyone knows about but nobody wants to raise, the stakeholder who will torpedo this in month four, the assumption the whole plan depends on that nobody has actually validated. The pre-mortem technique is simple: In a 60-minute session, everyone first writes down their own reasons for failure. You cluster them, vote on the critical ones, then dig in:
What does that failure actually look like?
What early warnings would we see?
What can we do this week to prevent it?
What’s the backup plan?
You walk out with a shared understanding of what could kill this initiative and concrete actions you can take immediately.
Not a document to file. Actual insight. My tip: Liberating Structures work very well in this context; think of TRIZ, for example. Objections from the Leadership Level Against Pre-Mortems Interestingly, the pre-mortem technique is not as popular as one might expect. On the contrary, any facilitator who suggests a pre-mortem may face serious opposition from leadership. The top three objections are: 1. “We Don’t Have Time for Another Workshop” When you hear this, you’re not hearing a scheduling problem. You’re hearing a confession. What they’re saying: Calendars are packed, we’re under pressure to deliver, and an hour spent imagining failure is an hour not spent building. What they’re confessing: Planning in this organization is theater. We can’t tell the difference between looking busy and being effective. We have time for roadmap sessions and strategy off-sites that produce nothing but slide decks, but not for 60 minutes that might actually prevent failure. Ask yourself: if you don’t have 60 minutes to pressure-test a significant initiative before you commit resources to it, what are you doing in all those other meetings? If you can’t spare an hour for thinking, you’re not planning; you’re performing planning for an audience. You always have time for what you actually value. This objection shows that the organization values the appearance of progress over its substance. 2. “This Is Too Negative, It Will Demotivate People” This one is my favorite because it’s pure magical thinking dressed up as leadership wisdom. What they are saying: We need to project confidence. Dwelling on failure becomes self-fulfilling. Teams need positive energy. What they are actually revealing: We have confused optimism with competence. We believe reality is negotiable, that if we just maintain the right attitude, the laws of physics, market dynamics, and technical constraints will bend to our efforts and make us successful. The problem, of course, is that reality doesn’t care about your team’s morale.
Your competitors aren’t checking your confidence level before they move. Technical debt doesn’t vanish because you chose not to discuss it. I have watched this play out repeatedly. Teams that can only stay motivated by avoiding hard truths aren’t resilient; they’re brittle. The first time they hit a problem they didn’t prepare for, the whole structure collapses. Motivation built on denial shatters the moment you encounter reality. The most motivated teams I have seen are those that know precisely what they are up against and have a plan to deal with it. And if that is not working, they can pivot rapidly to another plan. Confidence that survives contact with reality requires facing reality first. 3. “We Already Manage Risk” This objection is the most revealing because it exposes a category error in the organization’s thinking. What they are saying: The PMO maintains risk registers. We have governance processes. Project reviews happen. Therefore, a pre-mortem looks like duplication. What they are missing: They have mistaken the artifact for the activity. Having a risk register is not the same as having risk awareness. It is the difference between owning a fire extinguisher and understanding how fires start. Look at the risk registers in your organization. You will often see the same five entries on every project: “scope creep,” “resource constraints,” “stakeholder alignment,” “technical complexity,” and “timeline pressure.” Not wrong. Just useless. Too generic to act on, too obvious to provide insight, too abstract to prevent anything. A pre-mortem asks different questions. It focuses on what will kill this particular initiative in this context. It uses collective intelligence from everyone who knows something critical about what could go wrong, not one person filling out a template alone, thereby creating alignment and a shared understanding of the risk situation. You are not duplicating risk management. You’re doing it for the first time. 
Conclusion: What You Learn from a Pre-Mortem’s Rejection When leadership blocks a pre-mortem with one of these objections, pay attention. You are learning more about the system you are operating in than about the technique. The pattern is consistent: The organization prefers comfortable narratives to uncomfortable truths. It would rather maintain the fiction of control than develop the capability to handle what is coming. No facilitation method fixes that. If leadership can’t spare 60 minutes for critical thinking, or believes acknowledging problems creates them, or thinks documentation equals understanding, you face a cultural dysfunction that runs deeper than your initiative’s risk profile. You can still use that information. You can make better decisions about where to invest your energy, which battles are worth fighting, and whether this organization is serious about the outcomes it claims to want. Sometimes, the most valuable thing a pre-mortem shows you is that nobody in charge actually wants to know why a project might fail.
For decades, software development has been a story of evolving methodologies. We moved from the rigid assembly line of Waterfall to the collaborative, iterative cycles of Agile and Scrum. Each shift was driven by a need to better manage complexity. Today, we stand at a similar inflection point. A new, powerful collaborator has joined the team: Artificial Intelligence. The initial rush to use AI has led to a chaotic, improvisational style of work many call “vibe coding.” It’s fast, it’s exciting, but as many teams are discovering, it’s not sustainable. Just as Agile brought structure to team collaboration, a new generation of AI-native frameworks is emerging to bring structure, predictability, and professionalism to human-AI collaboration. Hidden Costs of Unstructured AI Use The hype around AI productivity is real. Studies show developers can code up to 55% faster with AI assistants. But these headline numbers mask a darker, more expensive reality for teams that lack a formal process.
The 70% rejection rate: Industry data shows that while AI tools suggest code constantly, developers reject or discard approximately 70% of these suggestions (Source: GitClear, Netcorp 2025). Every rejected suggestion represents wasted compute cycles, direct token costs, and a developer’s time spent sifting through noise instead of building.
The quality nosedive: A 2024 analysis found that unstructured AI-assisted coding was linked to a four-fold increase in code duplication and a rise in “code churn”: brittle, non-reusable code that inflates technical debt and creates future maintenance nightmares (Source: GitClear).
Without a guiding framework, the developer’s mental load doesn’t disappear. It shifts from writing code to constantly vetting, debugging, and refactoring a stream of unpredictable AI output. The Rise of AI-Native Frameworks To counter this chaos, a new category of tools and methodologies is taking shape.
These AI-native frameworks provide the guardrails and structured workflows needed to turn a powerful but erratic AI tool into a reliable engineering partner. The core idea is to move from a conversational, “vibe-driven” approach to an intent-driven one, where your plan becomes a version-controlled artifact that guides the AI. We are seeing this trend emerge in various forms:
Spec-Driven Workflows like GitHub’s Spec-kit.
Agile-Inspired Methodologies like the BMad Method.
Test-Driven Development (TDD) Partners like Aider.
Autonomous Agentic Systems like MetaGPT and SWE-agent.
While all these frameworks share common goals, their approaches can be quite different. To illustrate this, let’s zoom in on two prominent examples, Spec-kit and the BMad Method. Their philosophies are distinct: the first is tactical and developer-centric, whereas the second is strategic and team-oriented. A Tale of Two Philosophies
Spec-kit focuses on feature-level “spec-to-code” generation, whereas the BMad Method focuses on full project lifecycle management from idea to QA.
Spec-kit is primarily for individual developers, whereas the BMad Method is preferable for the entire agile team (PMs, architects, devs).
Spec-kit excels at rapidly and reliably scaffolding code from a clear, version-controlled specification, whereas the BMad Method is great at integrating AI agents into existing Agile/Scrum processes at a strategic, cross-functional level.
This comparison shows there isn’t a single “best” framework, only the one that best fits the task at hand. You wouldn’t use a full project plan to fix a typo, nor would you build a new microservice based on a one-line prompt. Adopting a Framework-Based Approach Before picking a specific tool, the first step is to adopt the mindset. Before your team starts its next AI-assisted project, ask these questions:
How do we define our intent?
Is there a formal process for creating a specification or plan before we prompt the AI to write code?
What is the human’s role? Is the developer positioned as a clear-eyed reviewer and approver at critical checkpoints?
Is the process repeatable? Are our prompts and plans version-controlled?
How do we enforce quality? Do we have a mechanism to ensure the AI adheres to our architectural patterns and coding standards?
Best-of-Both-Worlds Solution The choice isn’t always ‘either/or.’ The real power of these structured approaches lies in their modularity, allowing teams to combine them to create a workflow that fits their unique needs. A hybrid approach can leverage BMad’s strategic planning with Spec-kit’s tactical execution prowess. Here’s how it could work:
Phase 1: Strategic and sprint planning (BMad)
Use the BMad Business Analyst and Architect agents to define the project’s vision, create a detailed PRD, and establish the high-level system design.
The BMad Scrum Master then breaks down the PRD into user stories for the upcoming sprint.
Phase 2: Feature implementation (Spec-kit)
A developer picks up a user story from the sprint backlog.
They use this user story as the initial prompt for Spec-kit’s /specify command to create a detailed, executable specification.
They then run through the /plan, /tasks, and /implement phases to generate high-quality, compliant code that matches the spec.
Phase 3: Quality assurance and integration (BMad)
The code generated by Spec-kit is submitted for review.
The BMad QA Agent is then invoked to perform an initial review, checking the implementation against the original user story and acceptance criteria, completing the loop.
This hybrid model creates a seamless workflow where high-level project management flows directly into low-level, spec-driven code generation, giving you end-to-end control, consistency, and quality. Moving from vibe coding to a structured framework is the next logical step in the evolution of our industry.
It’s how we transform AI from a clever shortcut into a strategic asset that delivers predictable, high-quality, and cost-effective results. It’s how we build the future, responsibly.
The role of a Scrum Master is to establish Scrum, and the Scrum Master is accountable for the Scrum Team’s effectiveness. Thus, it is quite tempting to ask how a Scrum Master can help improve the productivity of the development team. But in a complex working environment like software development, productivity is often not the right measure to capture all the complexities of software developers’ knowledge work. In simple working environments, productivity means a ratio of output to input. The traditional idea is to know how much is achieved (output) with a given amount of resources (input), largely in numbers, and the focus is on maximizing the output. That’s why, in traditional project management of software development projects, stakeholders evaluate the development team’s productivity based on lines of code. Even today in Agile project management, stakeholders with a traditional mindset ask for the number of story points per iteration, known as Sprint Velocity. But productivity in a complex working environment like Agile software development is not linear. Factors like customer satisfaction, business value, and project success matter more than working at the highest efficiency. If software cannot deliver the intended business value or solve the exact customer problems, there is no use in building it fast, in the least possible time. It will be a waste of time and money. Having said that, it does not mean there are no opportunities to improve productivity. There are operational inefficiencies that can hinder productivity. And it is the responsibility of a Scrum Master to address these operational inefficiencies to improve the productivity of the development team, because the actions of a Scrum Master have a direct impact on the team’s efficient functioning. In this post, we will look at the four primary ways a Scrum Master can help improve the productivity of the Scrum Team.
I would rather call it ways to ‘improve effectiveness’ because we also have to focus on ensuring the development team delivers the software of the highest business value and customer satisfaction most effectively. Four Ways a Scrum Master Improves Development Team Productivity Here are four ways a Scrum Master can contribute: 1. Facilitating Scrum Each Scrum event (Sprint, Sprint Planning, Daily Scrum, Sprint Review, and Sprint Retrospective) has a purpose. The Scrum Guide says, “Each event in Scrum is a formal opportunity to inspect and adapt. Events are used in Scrum to minimize the need for meetings not defined in Scrum.” And it is true. Modern-day complexities in software development, such as customer-centric product development, changing market trends, and competitors' developments, require continuous collaboration among developers, stakeholders, and product owners to inspect and adapt. Too many meetings can hinder the productivity of the developers. By facilitating each Scrum event at the right time and in the right order, the Scrum Master eliminates unnecessary meetings, ensuring the team communicates, inspects, and adapts at the right time to produce the most valuable work. These Scrum events also provide an opportunity to address a team’s operational inefficiencies, resulting in improved productivity. Let’s look at an example. The purpose of Sprint Planning is to bring clarity and consensus on what needs to be done for the development team. It must happen at the beginning of the sprint to ensure everyone has a mutually agreed and shared understanding of the Definition of Done (DoD), the Product Goal, the Sprint Goal, the increments to be delivered, and external dependencies. The Scrum Master ensures that all key participants (Product Owner, Developers, and Scrum Master) are present at the Sprint Planning meeting and that their concerns are addressed.
Similarly, for each Scrum ceremony — Daily Scrum, Sprint Review, and Sprint Retrospective — the Scrum Master ensures it serves its intended purpose. By facilitating these events, the Scrum Master ensures the team works most effectively, resulting in improved productivity while delivering the most valuable work.

2. Removing Impediments

Scrum focuses on getting feedback early and often from customers. This is the reason why sprints are of short duration. If there are any blockers, obstacles, or other impediments to obtaining early and frequent customer feedback, it is the responsibility of the Scrum Master to remove those impediments. As an example of an impediment, consider that deployment of the increment is delayed due to external dependencies, such as a bureaucratic deployment process or complex dependency chains with other teams. This delays the customer feedback that could drive improvements in the next sprint. It is the responsibility of the Scrum Master to streamline deployment processes, remove blockers, and gather feedback from customers early. This is one example of a hindrance; hindrances could be anything from an unclear Definition of Done to poorly estimated story points, a lack of required technical resources, or constant context switching.

3. Empowering the Team to Be Self-Organizing

A Scrum Team is a self-organizing team, meaning the developers are the ones who decide:

- What work to do?
- When to do the work?
- How to do the work?
- How do engineers, designers, and testing experts work together?
- Who does the work?
- What technologies to use?
- What architecture and UX to use?

Even the Scrum Master does not dictate how the development team organizes, plans, and manages its work.
The 11th principle of the Agile Manifesto says, “The best architectures, requirements, and designs emerge from self-organizing teams.” However, it is definitely the responsibility of the Scrum Master to coach the development team in self-organization and cross-functionality. The Scrum Master has to ensure the team collaborates effectively and stays accountable. To achieve this, the Scrum Master can create an environment that fosters open collaboration, where the Scrum Team solves problems independently and feels psychologically safe and encouraged to contribute. This autonomy and accountability remove operational inefficiencies and promote faster decision-making. If a team needs resources, guidance, or any other support, the Scrum Master is there as a facilitator and servant leader to provide what the team needs to function optimally and effectively.

The involvement of the Scrum Master varies with the experience of the Scrum Team. That said, the best Scrum Teams are expected to be capable of self-organizing: planning, identifying, adapting, and resolving their own impediments. It is this fine balance of authority and autonomy that a Scrum Master needs to master.

4. Removing Barriers Between Stakeholders and the Scrum Team

Software development does not go as smoothly as it appears on paper. It is challenging to bring all the stakeholders onto the same page. That’s exactly why the Scrum Team has a Scrum Master: they bridge the gap between the Scrum Team, the Product Owner, and the organization. The Scrum Master facilitates collaboration among stakeholders as requested or needed and helps them understand the complexities of each other’s work. This improves the flow of work by addressing complex issues, securing necessary resources, and bringing clarity to priorities, needs, and expectations.

Conclusion

Productivity is not the goal of the Scrum Team; effectiveness is.
Ultimately, nothing is more wasteful than building software that no one wants. And undoubtedly, the actions of a Scrum Master have a direct impact on the team’s productivity, efficiency, and effectiveness. By leading the team in Scrum, addressing operational inefficiencies, and facilitating collaboration among stakeholders, a Scrum Master can help improve the productivity of the development team.
Measuring and improving developer productivity has long been a complex and contentious topic in software engineering. With the rapid rise of AI across nearly every domain, it's only natural that the impact of AI tooling on developer productivity has become a focal point of renewed debate. A widely held belief suggests that AI could either render developers obsolete or dramatically boost their productivity, depending on whom you ask. Numerous claims from organizations linking layoffs directly to AI adoption have further intensified this perception, casting AI as both a disruptor and a catalyst. In this article, we'll examine the current landscape and delve into recent studies and surveys that investigate how AI is truly influencing developer productivity.

Studies

Let's explore the findings from the studies below, which assess the impact of AI tooling on developer productivity.

Study #1: Experienced Open-Source Developer Productivity

To evaluate the impact of AI coding assistant tools on the productivity of experienced open-source developers, a randomized controlled trial (RCT) was conducted from February to June 2025. A total of 16 developers with an average of 5 years of experience completed 246 tasks in mature projects. The tasks were randomly assigned among developers, with AI tools either allowed or disallowed. Before starting the tasks, developers forecast that AI would decrease task completion time by 24%. After completing the tasks, they estimated that AI had reduced completion time by 20%. In fact, the study found that allowing AI actually increased task completion time by 19%. Moreover, these results stand in stark contrast to experts' predictions of a completion time reduction of up to ~39%. Below is a summary of the mismatch between predictions and findings: experts and study participants alike misjudged the speedup from AI tooling.
Image courtesy of respective research. Although the study concludes that AI tooling slowed developers down, this could be due to a variety of factors; five key factors for the observed slowdown are listed below:

- Over-optimism about AI usefulness (direct productivity loss). Developers are free to use AI tools as they see fit, but their belief that AI boosts productivity is often overly optimistic. They estimate a 20–24% time reduction from AI, even when the actual impact may be neutral or negative, potentially leading to overuse.
- High developer familiarity with repositories (raises developer performance). AI assistance tends to be less helpful, and may even slow developers down, on tasks where they have high prior experience and need fewer external resources. Developers report AI as more beneficial for unfamiliar tasks, suggesting its value lies in bridging knowledge gaps rather than enhancing expert workflows.
- Large and complex repositories (limits AI performance). Developers report that LLM tools struggle in complex environments, often introducing errors during large-scale edits. This aligns with findings that AI performs worse in mature, large codebases compared to simpler, greenfield projects.
- Low AI reliability (limits AI performance). Developers accept less than ~44% of AI-generated code, often spending significant time reviewing, editing, or discarding it. Even accepted outputs require cleanup, with ~75% reading every line and ~56% making major changes, leading to notable productivity loss.
- Implicit repository context (limits AI performance, raises developer performance). AI tools often struggle to assist effectively in mature codebases because they lack developers' tacit, undocumented knowledge. This gap leads to less relevant suggestions, especially in nuanced cases like backward compatibility or context-specific edits.
Due to these factors, the gains of automatic code generation are offset considerably, which explains the significant contrast between perceived/forecasted and actual developer productivity. With AI tooling, the developer must also spend additional time prompting, reviewing AI-generated suggestions, and integrating code outputs with complex codebases, adding to the overall completion time. See below for the average time spent per activity, with and without AI tooling.

Average time spent per activity. Image courtesy of respective research.

Takeaway: The study reveals a perception gap where AI usage subtly hampers productivity, despite users believing otherwise. While the findings show a slowdown in large, complex codebases, the researchers caution against broad conclusions and emphasize the need for rigorous evaluation as AI tools and techniques continue to evolve. The study should therefore be treated as a data point in the evaluation, not a verdict.

Study #2: GitClear

The GitClear study analyzed ~211 million structured code changes from 2020 to 2024 to assess how AI-assisted coding impacts developer productivity. It categorized changes — like added, moved, copied/pasted, and churned lines — using GitClear's Diff Delta model to track short-term velocity versus long-term maintainability. Duplicate block detection was introduced to measure how often AI-generated code repeats existing logic. The methodology links rising output metrics to declining code reuse, revealing hidden costs in perceived productivity gains. Below is the trend of code operations and code churn by year, as cited in the report.

GitClear AI Code Quality Research — Code operations and code churn by year. Image courtesy of respective research.

The following points can be inferred from the study:

Increased code output: AI-assisted development led to a significant rise in the number of lines added, up 9.2% YoY in 2024.
This could be perceived as an increase in developer productivity due to faster code generation and higher task (ticket) completion throughput. However, the key question remains: were the added lines of code required in the first place?

Decline in refactoring (“moved” code): “Moved” lines — an indicator of refactoring — dropped nearly 40% YoY in 2024, falling below 10% for the first time. This can be attributed to developers accepting AI-generated code as-is and skipping the refactoring effort to save time. Moreover, AI tools rarely suggest refactoring due to limited context windows, fueling the overall drop.

Surge in copy-pasted and duplicated code: Copy/pasted lines exceeded moved lines in 2024, with a 17.1% YoY increase. Commits with duplicated blocks (≥5 lines) rose 8x in 2024 compared to 2022; 6.66% of commits now contain such blocks. This, too, can be attributed to developers accepting AI-generated code as-is, without much effort to keep the code DRY.

Increased churn in newly added code: Churn — code revised within 2–4 weeks — increased 20–25% in 2024, i.e., developers are revisiting new code more frequently. Although code output surged with AI tooling, its lower quality means code is being revised sooner than it used to be (when little or no AI tooling was used).

Takeaway: The rise in AI-generated code has led to a parallel increase in copy-pasted fragments, duplication, and churn — while refactoring efforts have notably declined. This trend signals a deterioration in overall code quality. Many organizations still gauge developer productivity by metrics like lines of code added or tasks completed. However, these indicators can be easily inflated by AI, often at the expense of long-term maintainability. The result is bloated codebases with higher duplication, reduced clarity, and an expanded surface area for bugs.
While AI may boost short-term development velocity, the trade-off is accumulating technical debt and diminished code quality — costs that will surface over time in the form of increased maintenance burden and reduced agility.

Surveys

While studies often rely on data-driven methodologies, these approaches can sometimes be questioned for their assumptions or limitations. Surveys, on the other hand, offer direct insight into developer sentiment and can help bridge gaps that traditional studies might overlook. In the sections below, we explore findings from independent surveys that assess the impact of AI tools on developer productivity.

Survey #1: Stack Overflow

In its 2025 annual developer survey, Stack Overflow received over 49k responses covering various aspects, including AI tooling and its impact. (Do note that I, too, was one of the respondents.) Among respondents, overall AI tool usage surged to ~84% from ~76% the previous year. Positive sentiment toward AI tools, however, dropped by ~10 percentage points, signaling a trust deficit among developers — more on this later.

AI tools usage and sentiment. Image courtesy of respective survey results.

Among the respondents, ~46% actively distrust the accuracy of AI tools. Moreover, ~66% said AI tool solutions are not up to the mark, and ~45% said these solutions require additional debugging time. This clearly means developers need additional effort to understand, debug, and potentially refine AI-generated code, effectively increasing overall task completion time. Trust in AI tools' ability to handle complex tasks did rise by ~6 percentage points; this could reflect genuine tool enhancements, or it could reflect developers avoiding AI tools for complex tasks altogether, given the quality and other risks involved, and thus encountering less friction to report.
Given the significant trust deficit in the accuracy of AI tools, the decline in positive sentiment seen in the previous section could well be related.

Trust in AI tools' accuracy and ability to handle complex tasks. Image courtesy of respective survey results.

Frustrations with AI tools, and humans as the ultimate arbiters of quality and correctness. Image courtesy of respective survey results.

Even though AI agent adoption isn't yet mainstream, more than half of the respondents (~52%) cited productivity gains. AI agents could be a space worth watching: they are relatively new, so substantial enhancements are likely in the coming years. Moreover, given the contextual information they use to generate code, they look more promising than simpler AI tools.

AI agents and impact on work productivity. Image courtesy of respective survey results.

Takeaway: The survey revealed a sharp rise in AI tool adoption accompanied by a notable drop in positive sentiment, highlighting a growing trust deficit. A majority of respondents expressed active distrust in AI tool accuracy, citing subpar solutions, which suggests that AI-generated code often demands extra effort to refine and validate. This offsets the productivity gains from faster code generation. Interestingly, trust in AI tools' ability to handle complex tasks rose, reflecting cautious optimism rather than full confidence. Developers still see themselves as the ultimate judges of code quality, reinforcing the need for human oversight. Meanwhile, AI agents — though not yet widely adopted — show early promise. Their use of contextual information positions them as a potentially more reliable and efficient evolution of current AI tooling.

Survey #2: Harness

Harness surveyed 500 engineering leaders and practitioners to assess various parameters, including the impact of AI on developer productivity.
Although the surveyed participants showed overall positive sentiment toward AI tooling and its adoption, 92% also highlighted the associated risks. An independent, related observation corroborates these risks.

AI missteps and impact radius. Image courtesy: https://martinfowler.com/articles/exploring-gen-ai/13-role-of-developer-skills.html

Almost two-thirds of respondents said they spend more time debugging AI-generated code and/or resolving security vulnerabilities. AI tooling may also generate code that includes outdated dependencies or insecure coding patterns, requiring developers to spend time updating and patching these vulnerabilities. This significantly increases developer overhead and potentially offsets a considerable part of the productivity gains from AI tooling.

Two-thirds of respondents require more time debugging AI-generated code and/or resolving security vulnerabilities. Image courtesy of respective survey results.

About 59% of developers experience deployment problems when AI tooling is involved, and nearly half offset the gains through rework or additional effort.

59% of developers experience deployment problems with AI tooling involved. Image courtesy of respective survey results.

Since 60% of the respondents don't evaluate the effectiveness of the tools, it is quite challenging to relate tooling to developer productivity at all.

60% of respondents don't evaluate the effectiveness of AI tooling. Image courtesy of respective survey results.

Takeaway: The survey reveals a nuanced picture of AI's impact on developer productivity. Most respondents expressed optimism about AI tooling but also flagged significant risks. Notably, the majority reported spending more time debugging AI-generated code and addressing security vulnerabilities — contradicting the assumption that AI always boosts efficiency. Deployment issues further compound the overhead, with many encountering frequent rework.
The fact that many respondents don't evaluate tool effectiveness underscores the challenge of accurately measuring productivity gains. Overall, the findings highlight that AI adoption demands careful oversight to avoid offsetting its intended benefits.

Conclusion

The studies and surveys analyzed paint a complex picture of AI's role in software development, revealing that perceived productivity gains often mask deeper issues. While AI tools may accelerate coding tasks, they also introduce duplication, churn, and technical debt — especially in large codebases — undermining long-term maintainability. Trust in AI-generated code remains fragile, with developers frequently needing to debug and refine outputs. This erodes efficiency, offsets gains from faster code generation, and highlights the importance of human oversight. Crucially, coding represents only a fraction of the overall software delivery cycle; improvements in cycle time don't necessarily translate to gains in lead time. Sustainable productivity demands more than speed — it requires thoughtful architecture, strategic reuse, and vigilant monitoring of maintainability metrics. In essence, AI can be a powerful accelerator, but without deliberate human intervention, its benefits risk being short-lived.

References and Further Reads

- Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
- GitClear Code Quality Study — 2024 | 2025
- Harness — State of Software Delivery
- SO Developer Survey 2025
- Role of Developer Skills in Agentic Coding
Editor’s Note: The following is an article written for and published in DZone’s 2025 Trend Report, Intelligent Observability: Building a Foundation for Reliability at Scale.

Imagine a world where the 3:00 AM PagerDuty alert doesn’t lead to a frantic scramble, but rather to a concise summary of the problem, a vetted solution, and a one-click button to approve the fix. This transformative capability represents the next frontier of AIOps (artificial intelligence for IT operations), powered by agentic AI systems designed to perceive, reason, act, and learn. This shift promises a significant reduction in mean time to resolution (MTTR) but critically relies on human-in-the-loop (HITL) safeguards to ensure accountability and prevent issues like AI hallucinations.

This tutorial is a practical, step-by-step guide for engineers, tech ops teams, and leaders. We will sketch out how to construct a scalable, secure, resilient workflow using large language model (LLM) agents for smart alert triage, context summarization, and, most importantly, gated runbook execution. Such an agentic mechanism can serve as a pioneering framework for upcoming self-healing systems.

Prerequisites of Agentic AIOps

Before writing the first line of code, we need to define the theoretical base, factors, and conditions under which agentic AIOps is an option.

Defining Agentic AIOps

Agentic AIOps fundamentally overhauls the relationship between AI and digital operations. It goes beyond models that merely classify or predict.
The cornerstone of this evolution is a software entity that possesses four key attributes, allowing it to move beyond passive observation:

- Perception – the ability to ingest and understand data from the environment (e.g., observability data)
- Reasoning – the use of an LLM and structured data (tools) to formulate a goal and plan
- Action – the capacity to execute the plan through external tools (e.g., calling an API, running a script)
- Learning – the ability to refine its performance based on feedback from its actions and human input

In this model, rather than simply reacting to an alert, the agent takes proactive ownership of a defined task (e.g., diagnosing a microservice failure and suggesting a well-tested, workable fix).

Understanding the Human-in-the-Loop Approach

HITL is the critical safety lock on AI autonomy. For AIOps, it creates a clear division of duties:

- Agents handle routine work – High-volume, low-risk, repetitive tasks, like classifying alerts, fetching diagnostic context, and correlating related events, are fully automated.
- Humans authorize risk – Actions that alter the operating environment of a production system, like restarting workloads, rolling back a deployment, or creating or changing configurations, must pass through an entirely human-controlled HITL gate.

This architecture ensures accountability and allows the agent’s action to be corrected or paused if its reasoning is wrong (a hallucination) or if the proposed action violates a critical business constraint. It transforms the SRE from “firefighter” to “trusted approver” and “AI model trainer.”

The Observability Trifecta

LLM agents are only as good as the context they are given. To engage in sophisticated reasoning, the system should draw on an observability trifecta: logs, traces, and metrics.
This suite of data makes up the context trifecta, which, when synthesized by an LLM, can mirror a full incident bridge meeting — in brief paragraphs — for a human team or an advanced agent.

Technical Stack Overview

Implementing agentic AIOps involves integrating multiple components that must work together. A typical high-level stack contains the components found in Table 1:

| Component Category | Example Technologies/Concepts | Purpose in Agentic AIOps |
|---|---|---|
| Foundation LLM | Enterprise-governed models (e.g., any cloud hosted) | The “brain” for reasoning, summarization, and action planning |
| Agent framework | LangGraph, LangChain, etc. | Provides the state machine and abstraction layer for defining agent personas and orchestrating their collaboration |
| Observability/data | OpenTelemetry, Prometheus, vector database | Ingestion and storage of the logs, traces, and metrics needed for LLM context retrieval |
| Security and gating | Role-based access control (RBAC), Open Policy Agent (OPA) | Enforcing security policies, defining human approval rights, and implementing Policy as Code for automated action checks |
| Execution/automation | Ansible, Terraform, Kubernetes APIs, CI/CD tools | The interface for agents to execute approved, low-risk, or gated actions |
| HITL interface | Slack/Teams APIs, PagerDuty/Jira integration | The human-facing communication and approval channel for all high-risk actions |

Table 1. AgenticOps technical stack

System Architecture and Innovations

The architecture of an agentic AIOps system must be engineered for both speed and safety. It is a structured pipeline designed to manage the flow of data, intelligence, and execution, with multiple funnel points for human and policy oversight.

High-Level Design

Figure 1 describes this apparatus, which consists of five key modules: the ingestion layer, agentic core, HITL gatekeeper, execution layer, and feedback loop.

Figure 1.
Agentic AIOps system architecture

Ingestion Layer

This is the front door for all operational data. We stream real-time data from existing observability tools, whether they conform to the OpenTelemetry specification or use proprietary agents. The most important innovation here is preparing the data for the LLM: logs, traces, and metadata are indexed and translated into embeddings, which are stored in a vector database (the operational “knowledge base”) for rapid semantic retrieval.

Agentic Core

This is the primary intelligence engine: a multi-agent unit run by a Supervisor Agent.

- Triage Agent is the first responder. It classifies the incoming alert (e.g., severity, service owner) and correlates it with recent deployment events or related past incidents.
- Summarizer Agent uses retrieval-augmented generation (RAG) to query the vector database, pull relevant logs, traces, and metrics, and synthesize a coherent, plain-language incident summary.
- Runbook Proposer Agent either maps the context summary to a set of preapproved, executable runbooks or generates a new runbook or action script from previous ones (e.g., based on historical resolution scenarios).

HITL Gatekeeper

This is the critical safety barrier. The proposed action (e.g., “Restart the Payment Service”) is translated into a crisp approval card and sent through a standalone Slack/Teams bot to the on-call SRE. The system should also interface with an escalation system like PagerDuty so that the correct person is notified within the SLO. This gate strictly limits the agent’s execution privileges.

Execution Layer

The action runs only after human approval.
This layer is built around defensive deployment techniques:

- Gated executor is the component that runs the approved action (e.g., calling the Kubernetes API).
- Canary/rollback, used for critical changes, wraps the action in defensive mechanisms (e.g., Istio traffic splitting or another rolling rollout strategy) to test the change on a small slice of users, the canary. It is preconfigured to roll back in real time when health checks fail.

Feedback Loop

The system is intended to learn from every interaction. The human’s approval or rejection, the success or failure of execution, and the final MTTR are all fed back into a training module. This uses a version of Reinforcement Learning from Human Feedback (RLHF) to periodically refine the agents’ reasoning and runbook proposal strategies.

Multi-Agent Collaboration

The architecture uses a collaborative “team” rather than a single monolithic agent. While the Supervisor Agent is responsible for the flow, the specialized agents are designed to “debate” or cross-validate their conclusions. For example, the Triage Agent classifies the alert; the Summarizer Agent might challenge that classification if the underlying logs suggest another primary service is failing. This internal, automatic peer review greatly increases the accuracy of recommendations and eliminates any single point of failure in the reasoning chain.

Hybrid RAG and GraphRAG

Conventional RAG only fetches chunks of text based on semantic similarity, which is inadequate for AIOps: a service failure is about dependencies as much as logs. GraphRAG addresses this by layering a knowledge graph over the data. The graph captures all service dependencies, for example, “The Checkout Service relies on the Inventory Service and Payment Gateway.” During an alert, the agent first queries the graph to understand the impact and upstream/downstream dependencies.
Then the RAG query retrieves only the logs and traces for the affected services identified by the graph. Because the agent reasons over structure and text in parallel, this combination yields far faster and more accurate root cause analysis.

Zero-Trust Gating

A zero-trust principle must guide the execution layer: no action, even one authorized by a human, is inherently safe unless its intent and context are validated against policy. This is made possible via Policy as Code, using tools such as Open Policy Agent (OPA). The proposed action payload is checked dynamically:

- Scope check – Is the action limited to the service specified in the alert? (e.g., is the agent attempting to restart the entire cluster for one pod failure?)
- RBAC check – Does the approving human actually hold the security role required by the HITL Gatekeeper to authorize this?
- Context check – Does the current time window (e.g., peak sales hour) prohibit this high-risk action?

The execution environment is not released until these automated, dynamic policy checks pass.

Step-by-Step Implementation Guide

This section provides a conceptual walk-through of building the agentic workflow using common open-source principles and tools.

Step 1: Set Up the Dev Environment

Begin by establishing the foundation:

- Dependencies – Install the necessary LLM libraries, the agent framework (e.g., LangGraph), and client libraries for observability data access.
- Observability config – Configure a basic microservice application to emit logs, traces, and metrics via OpenTelemetry. This standard abstraction layer future-proofs the system against vendor lock-in.
- Sample alert generation – Create a simple trigger mechanism (e.g., a script that injects an alert into a Kafka queue or an alert manager) to simulate an incident flow.
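As a concrete illustration of the last bullet, a simulated alert can be as simple as a small script that emits a synthetic JSON payload. This is a minimal sketch: the service names, metric, and field layout below are made up for illustration, and a real setup would publish the payload to Kafka or an alert manager endpoint instead of printing it.

```python
# Sketch of Step 1's sample alert generation: emit a synthetic alert as JSON.
# Service and metric names are hypothetical, chosen only for illustration.
import json
import random
import time

SERVICES = ["payment-service", "checkout-service", "inventory-service"]

def make_alert(seed=None):
    """Build one synthetic alert; pass a seed for reproducible tests."""
    rng = random.Random(seed)
    return {
        "service": rng.choice(SERVICES),
        "metric": "http_5xx_rate",
        "value": round(rng.uniform(0.05, 0.40), 2),  # simulated error rate
        "timestamp": int(time.time()),
    }

# In a real pipeline this payload would be sent to Kafka or an alert manager.
print(json.dumps(make_alert(seed=7), indent=2))
```

Running this on a loop (or on demand) gives the later steps a steady stream of incidents to triage without touching any production system.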
Step 2: Build the Data Ingestion Pipeline

The goal is to prepare the operational data for semantic search:

- Stream data – Use an OpenTelemetry Collector or similar tool to funnel the raw logs, traces, and metrics.
- Vectorization – For unstructured data (logs), use an embedding model to convert the text into high-dimensional vectors.
- Storage – Store these vectors in a vector database. This allows the Summarizer Agent to pose sophisticated, natural-language requests such as “Show me all logs related to user authentication errors in the last 15 minutes that correlate with an increase in P99 latency.”

Step 3: Implement the Triage Agent

This agent must be fast and accurate, as it sets the stage for the entire response:

- Define persona – The agent’s LLM prompt should define a clear persona, such as “Alert Triage Specialist,” instructing it to be concise, factual, and strictly adherent to the company’s severity classification policy.
- Classification and correlation – The agent’s first action is a classification call (using the LLM) to assign severity and service ownership. Its second action is to query the vector database to identify related events (e.g., a recent deployment, a network change, another concurrent alert).
- Output – A structured JSON object containing the service_name, severity_level, and a list of correlated_event_IDs.

Step 4: Develop the Context Summarizer

The Summarizer Agent turns raw data into actionable intelligence by implementing the RAG pipeline:

- Retrieval – Using the Triage Agent’s output (service name, correlated events), the Summarizer Agent executes a vector search and a graph search (hybrid RAG) to pull all relevant logs, traces, and metrics.
- Generation – The LLM is prompted with the retrieved data and instructed: “Based on the following raw operational data, generate a single, non-speculative, three-paragraph summary covering: 1) What is failing? 2) When did it start?
3) What is the likely root cause (with evidence)?”
- Output – The final, concise incident summary, ready for human consumption.

Step 5: Create the Runbook Proposer

- Runbook mapping – The agent is given access to a runbook repository (preferably a structured JSON list or a private GitHub repository) where each entry is tagged with its service, failure_type, and execution_payload.
- Reasoning – The agent uses the Summarizer’s context to select the most appropriate runbook. For an unprecedented event, it might generate a proposed new runbook, clearly labeled “Experimental.”
- Output – A structured proposal containing the runbook_ID, a justification for its selection, and the execution_payload (e.g., a shell script or a Kubernetes YAML patch).

Step 6: Integrate HITL Gates

This is the most critical step for building trust:

- Approval workflow – The proposed solution is sent to a Slack/Teams bot, which acts as a secure intermediary. It shows the triage summary, context summary, and runbook proposal with action buttons: “Approve & Execute” and “Reject & Escalate” (or whatever suits your way of working).
- Policy checks – Before the “Approve” button triggers any action, the zero-trust gatekeeper (powered by OPA) checks the action’s details against current policies, verifying the human’s role and the context of the action.
- Defensive execution – Once both policy and human approval are given, the execution layer automatically adds predefined safeguards to the runbook’s execution:
  - Canary deployment – If the action involves a deployment, it is initially rolled out to only 5% of traffic.
  - Auto-rollback – Preconfigured health checks start immediately. If any fail, an automatic rollback is triggered, stopping the execution and alerting the human.
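The gating and defensive-execution logic above can be sketched in a few lines of Python. This is a minimal illustration, not OPA itself: the role names, action fields, traffic steps, and callbacks are hypothetical stand-ins. In production, the checks would be Rego policies evaluated by OPA, and the rollout would go through a service mesh or deployment controller.

```python
# Minimal sketch of Step 6: policy-gated, canary-wrapped execution.
# All names (roles, services, traffic steps) are illustrative, not real OPA/Rego.

def policy_allows(action, approver_role, peak_hours=False):
    """Simplified zero-trust checks: scope, RBAC, and context."""
    if action["target"] != action["alert_service"]:          # scope check
        return False
    if approver_role not in {"sre", "incident-commander"}:   # RBAC check
        return False
    if peak_hours and action["risk"] == "high":              # context check
        return False
    return True

def gated_execute(action, approver_role, rollout, health_ok, rollback):
    """Run an approved action behind policy checks and a canary rollout."""
    if not policy_allows(action, approver_role):
        return "blocked_by_policy"
    for pct in (5, 25, 100):         # canary steps: only 5% of traffic first
        rollout(pct)
        if not health_ok():          # preconfigured health checks
            rollback()               # auto-rollback on failure
            return f"rolled_back_at_{pct}pct"
    return "fully_deployed"

# Simulated run: health checks keep passing, so the change fully rolls out.
action = {"target": "payment-svc", "alert_service": "payment-svc", "risk": "high"}
events = []
status = gated_execute(action, "sre",
                       rollout=events.append,
                       health_ok=lambda: True,
                       rollback=lambda: events.append("rollback"))
print(status, events)  # fully_deployed [5, 25, 100]
```

Note the default-deny shape: any action that fails a check never reaches the rollout loop, mirroring how the execution environment stays locked until the policy checks pass.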
Step 7: Put the Whole Workflow Together

The Supervisor Agent connects everything using a state machine (easily managed by a framework like LangGraph):

Flow – The supervisor defines how things move from one stage to the next, as shown in Figure 2.
End-to-end testing – Use the simulated alerts from Step 1 to test the entire process. The goal is to ensure that a simple alert leads to a complete context summary and a proposed action that requires human approval before it is carried out. This confirms both the speed of the intelligence layer and the reliability of the safety features.

Figure 2. Agentic AIOps full workflow

Conclusion: The Future Is Agentic Observability

This step-by-step guide illustrates that the transition to agentic AIOps is more than an upgrade: it fundamentally redefines digital engineering and operations, shifting how we gather, handle, and act on intelligence within our digital operations.

The future of IT operations hinges on fostering a collaborative partnership between human experts and intelligent AI agents. By embedding HITL principles, we create systems that are not only powerful and fast but, more importantly, safe, trustworthy, and accountable. The agent handles the cognitive load of sifting through petabytes of data, and the human retains ultimate authority over the live environment.

The journey toward this self-healing, agentic environment doesn't require a "big bang" overhaul. Start small by identifying one persistent friction point in your incident response lifecycle, such as the tedious, repetitive task of context summarization; apply the agentic principles; and build a secure, gated workflow from there. This model is the foundation for proactive resilience, an evolution that elevates digital engineering from reactive fixes to intelligent self-management. The age of the autonomous yet overseen operations agent has arrived.
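As a closing illustration, the Step 7 supervisor loop can be sketched as a plain-Python state machine. This is a framework-free sketch of the flow only; a production version would use something like LangGraph, and the node handlers, their hard-coded outputs, and the state keys below are all illustrative assumptions, not code from the article.

```python
# Minimal stand-in for the Supervisor Agent: each node mutates the shared
# state and returns the name of the next node. Handler bodies are fakes.

def triage(state):          # Step 3: classify and correlate
    state["severity_level"] = "sev2"
    return "summarize"

def summarize(state):       # Step 4: RAG-based context summary
    state["summary"] = "auth service failing since 14:02; likely bad deploy"
    return "propose"

def propose(state):         # Step 5: select a runbook
    state["runbook_id"] = "rb-rollback-deploy"
    return "await_approval"

def await_approval(state):  # Step 6: the HITL gate halts the machine
    return "done" if state.get("approved") else "halted"

TRANSITIONS = {"triage": triage, "summarize": summarize,
               "propose": propose, "await_approval": await_approval}

def run_supervisor(alert, approved=False):
    """Drive the state machine until it reaches a terminal state."""
    state = {"alert": alert, "approved": approved}
    node = "triage"
    while node in TRANSITIONS:   # terminal states: "done", "halted"
        node = TRANSITIONS[node](state)
    state["final"] = node
    return state
```

Without human approval the machine halts at the HITL gate with a full context summary and proposal already attached to the state, which is exactly the end-to-end behavior the Step 7 test is meant to confirm.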
This is an excerpt from DZone's 2025 Trend Report, Intelligent Observability: Building a Foundation for Reliability at Scale.
Read the Free Report
Otavio Santana
Award-winning Software Engineer and Architect,
OS Expert