Software Engineering Practices in the AI Era

April 19, 2026

There are three consequences of AI-assisted software engineering. First, the old fundamentals still hold, and the cost of ignoring them went up. Strong documentation, crisp specs, small PRs, tests, and rigorous reviews were always important. Skipping them now bites quickly. Second, AI changes parts of the work, creating new failure modes that one must contain. Third, AI extends the reach of some practices that were never practical by hand, turning a few aspirations into things we can actually do.

This article covers agentic software engineering practices through these three lenses. A companion post, Engineering Ceremonies for the AI Era, covers the team-oriented operational side: the ceremonies, the new bottlenecks, and the roles. Here the focus is on the engineering hygiene those rituals depend on.

1. Documentation is onboarding for humans and machines

What persists is the principle: documentation is how a newcomer gets up to speed. What changes is that the newcomer is now also a machine. Teams that already have strong onboarding docs (README, ADRs, architectural diagrams, coding standards) adopt AI tools more easily, because the context an agent needs already exists. Teams whose onboarding lives in tribal knowledge have a harder time. That knowledge was never written down for a human newcomer, let alone an agent.

On the other hand, the same agents can draft and maintain these docs, generating a first README, proposing ADRs from commit history, or flagging pages that have gone stale. That lowers the cost of a practice teams always found tedious. But an agent can only write down what the code already encodes. The tribal knowledge in people's heads, the why behind a decision that left no trace in the repo, is exactly what it can't recover. That is often the valuable part. So the reach is real but bounded, and what it generates still needs validation.

A new failure mode is consumability. In most companies, the docs that define what you're building live in many places: product vision in Microsoft 365 slides; PRDs and designs in Confluence and Jira; READMEs, tech-debt tickets, and ADRs in GitHub; bug reports and incident postmortems somewhere else. Humans tolerate that scatter; agents stall on it, because they can't navigate efficiently unless someone has planned for structure and access (e.g., through Model Context Protocol (MCP) servers). Organize these docs so a coding agent can move across vision, design, spec, tests, implementation, validation, and feedback, with teams filling in the right detail at the right point. That structure turns the scatter back into leverage.

2. Design before you code

"Spec-first prompting" resembles a decades-old idea: the System Design Document or Design Specification. Skipping design has always been a costly mistake on a complex project, and prompting an agent forces you to specify intent up front. What changes is how the time gets spent. A first draft of a spec now takes minutes, not days, so the hours saved move to reviewing the design, walking through edge cases, and getting alignment from the people who'll have to live with it. The bottleneck moves from writing the spec to making sure it's right.

What AI introduces here is a new failure mode: drift. The design, specs, code, and validation tests need to stay in sync, and fast agent-driven generation pulls them apart faster than a team can track by hand. For example, a spec gets updated, an agent rewrites the implementation, the tests still encode the old behavior, and the design doc hasn't been touched since kickoff. AI-driven workflows have to keep all four in sync: the design that explains why, the spec that defines what, the code that implements it, and the tests that prove it. Treat them as one artifact in four forms, and lean on tooling to keep them consistent (e.g., agentic workflows that update specs when code changes, or vice versa).
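
A drift check can start as a very small heuristic. The sketch below assumes a hypothetical repo layout (docs/design/, specs/, src/, tests/) and only looks at which kinds of artifact a change touched; it cannot tell whether the contents actually agree, so treat it as a prompt for a reviewer or an agent, not as proof of consistency.

```python
from typing import List, Optional

# Hypothetical repo layout, assumed purely for illustration.
ARTIFACT_PREFIXES = {
    "design": "docs/design/",
    "spec": "specs/",
    "code": "src/",
    "tests": "tests/",
}


def classify(path: str) -> Optional[str]:
    """Return the artifact kind a changed path belongs to, or None."""
    for kind, prefix in ARTIFACT_PREFIXES.items():
        if path.startswith(prefix):
            return kind
    return None


def drift_warnings(changed_paths: List[str]) -> List[str]:
    """Flag changes that touch one artifact without its usual companions."""
    touched = {classify(p) for p in changed_paths} - {None}
    warnings = []
    if "code" in touched and "tests" not in touched:
        warnings.append("code changed but tests did not")
    if "spec" in touched and "tests" not in touched:
        warnings.append("spec changed but tests did not")
    if "code" in touched and "spec" not in touched:
        warnings.append("code changed but spec did not")
    return warnings
```

Run against the list of files in a change, an empty result means the artifacts moved together; anything else is a question to answer before merging.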

3. Small PRs and rigorous review

The principle is old: small, focused, atomic commits. Large engineering orgs have long pushed small changelists for a reason; Google's public code-review guidance makes "small CLs" an explicit norm. That reason is unchanged by AI: humans cannot meaningfully review large diffs, and rubber-stamped reviews let bugs through. A 40-line PR you can read in three minutes is worth ten 400-line PRs that get waved through. If you can't describe a change in a two-sentence commit message, it's too big.

What changes is the pressure on this principle. AI generates code fast enough to bury a reviewer, so keeping AI-generated changes reviewable is now a discipline you have to enforce. Small PRs help on two distinct fronts: a human reviewer can actually hold a small diff in their head, and an AI reviewer, working within a limited context window, has a better chance of catching subtle bugs when the change is scoped tightly.
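
One way to enforce that discipline is a size gate in CI. The sketch below assumes the input is the text produced by `git diff --numstat` (one line per file: added, deleted, path, separated by tabs, with "-" for binary files). The 200-line limit is an arbitrary placeholder, not a recommendation; pick a number your reviewers can actually hold in their heads.

```python
def changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, _path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            # Binary files report "-" instead of line counts; skip them.
            continue
        total += int(added) + int(deleted)
    return total


def review_verdict(numstat: str, max_lines: int = 200) -> str:
    """Return "ok" or a message explaining why the change is too large."""
    n = changed_lines(numstat)
    if n <= max_lines:
        return "ok"
    return f"too large: {n} changed lines, limit {max_lines}"
```

The same gate applies whether a person or an agent authored the change; an agent that hits the limit can be asked to split the work into smaller PRs.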

That review has to happen at every layer the LLM touched. When agents generate PRDs, architecture sketches, specs, test plans, and the code that ties them together, each of those artifacts deserves the same skepticism as a code change. A confidently worded spec that quietly contradicts an invariant already in the code produces just as much downstream damage as confidently wrong code. The earlier a bad assumption gets caught, the cheaper it is to fix.

That breadth is itself the reach: an AI reviewer can sustain a first-pass thoroughness across every layer, on every change, that no human review rota could staff. It won't replace the human judgment call, but it widens what gets looked at before a person ever opens the diff.

4. Tests prove behavior, not implementation

What persists is the role of the test suite: it's the executable record of what the system is supposed to do. Agents write tests as fast as they write code, and that creates a subtle failure mode: tests generated after the code tend to encode what the code does rather than what it should do. The fix is to derive the tests from intent and write them first, before the code.

Test-driven development was always a way to state intended behavior before getting attached to an implementation. With agents the order matters more, because grinding an implementation until the tests go green is exactly what the agentic loop is built for. Derive the tests from the spec first, before the implementation exists, and the same iteration drives the code toward what you actually meant. A pass then means the implementation matches the spec, not that it agrees with itself.
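
A minimal sketch of that ordering, using a made-up spec sentence ("orders of 10000 cents or more ship free; all other orders pay 500 cents"). The spec, the function names, and the numbers are all hypothetical. The cases are written from the sentence first; the two implementations stand in for what an agent might produce afterwards.

```python
# Cases derived from the (hypothetical) spec sentence, before any code exists.
SPEC_CASES = [
    (0, 500),
    (9999, 500),   # just below the threshold
    (10000, 0),    # boundary: "or more" includes the threshold itself
    (25000, 0),
]


def failures(impl) -> list:
    """Return (input, expected, actual) for each spec case the implementation misses."""
    return [
        (total, want, impl(total))
        for total, want in SPEC_CASES
        if impl(total) != want
    ]


# Written after the cases: an implementation that matches the spec.
def shipping_fee(order_total_cents: int) -> int:
    return 0 if order_total_cents >= 10000 else 500


# A plausible off-by-one: ">" where the spec says "or more".
def shipping_fee_off_by_one(order_total_cents: int) -> int:
    return 0 if order_total_cents > 10000 else 500
```

Here `failures(shipping_fee)` is empty, while `failures(shipping_fee_off_by_one)` reports the boundary case. Had the tests been generated from the off-by-one code instead, they would have passed and locked in the bug.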

Where the behavior gets proven is also important. A change that passes its tests can still fail in production when external dependencies, data, or configuration differ from the mocks used in development. Automation that safely verifies the change against the live environment lets both people and agents confirm that it behaves correctly, not just in the test suite but in production.

The reach is real here too. AI can generate the breadth of cases a person rarely has the patience for: edge conditions, error paths, property-based inputs, fuzzing. Used well, that widens coverage cheaply. The catch is the same as everywhere else: breadth from the machine, judgment about what should be true from a human.
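
That division of labor can be sketched with a hand-rolled property check. This uses only the standard library's `random` module rather than a dedicated property-testing library, and the run-length encoder is a toy chosen for illustration. The human states the property (decoding an encoding returns the original string); the machine supplies hundreds of inputs.

```python
import random


def rle_encode(s: str) -> list:
    """Run-length encode a string into (character, count) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def rle_decode(pairs: list) -> str:
    """Invert rle_encode."""
    return "".join(ch * n for ch, n in pairs)


def check_round_trip(trials: int = 500, seed: int = 0) -> list:
    """Return every generated string for which decode(encode(s)) != s."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    counterexamples = []
    for _ in range(trials):
        length = rng.randint(0, 12)
        # A tiny alphabet makes repeated runs, the interesting case, likely.
        s = "".join(rng.choice("ab ") for _ in range(length))
        if rle_decode(rle_encode(s)) != s:
            counterexamples.append(s)
    return counterexamples
```

An empty list from `check_round_trip()` means no counterexample was found in the sampled inputs, which is evidence rather than proof. Choosing a property worth checking remains the human's job.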

5. Security: same depth, new risks and new reach

Defense in depth is unchanged as a principle. AI-assisted coding raises the stakes from both directions. It introduces new risks and vulnerabilities through AI-generated code and dependencies, and it makes it easier for cybercriminals to build agentic tools and exploit vulnerabilities. At the same time, engineering teams and platform owners can use agentic tools to extend the reach of software analysis to protect their software and infrastructure. Security analysis applies to every layer of the product; the examples below are some of the common ones.

Some kinds of analysis are too big for humans on any real codebase. The gain comes from using AI alongside human review to widen coverage: hunting variants of a known bug, tracing tainted data across dozens of functions and services, finding race conditions, checking authorization across a sprawling API, or catching security-relevant changes buried in large refactors. These are grindy, attention-heavy tasks.

6. Operationalization: the path from merged to running

Getting code safely into production runs through automation: CI/CD, staged promotion, progressive rollout, the ability to roll back, and the instrumentation to know it worked. That machinery doesn't change. AI changes two things about it. First, the volume of deployments flowing through goes up. Second, and more dangerous, agents can now act on the pipeline, not just propose changes to it. The new failure mode is autonomy creep: an agent acquiring direct access to production because it was convenient.

Hence, use AI agents to build and maintain automation, but never give them direct access to production. Capture operations as code wherever practical, more than ever: Infrastructure as Code (IaC), Configuration as Code, Observability as Code (Monitoring and Alerting), and Dashboards as Code. That gives AI agents context beyond application code and enables them to reason about changes.

To know whether these processes and automation are working well, you have to measure them. That is the subject of the next section.

7. Measure the system

The rule is old too: measure outcomes, not output. Lines of code and commit counts were always poor proxies for value. What changes is that AI inflates exactly those output measures, so a team can look more productive while quality quietly erodes. The intuitive promise is higher throughput; the question is whether it rises without dragging change failure rate and time to restore along with it.

The DORA metrics are a good starting point, and the discipline that matters is measuring them before and after a tooling change rather than in the abstract. Establish a baseline, roll out the tool, and watch the same four numbers. A shift you can't see against a baseline is a shift you can't attribute.

If an AI rollout pushes deploy frequency up and lead time down, but change failure rate and time to restore rise with them, the team is shipping faster and breaking more things. The fix isn't to roll back AI tools but to invest in the processes and practices (review, testing, observability) that need to scale with the deployment velocity.
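
The before-and-after comparison can be sketched as follows. The `Deploy` record and its field names are hypothetical, and using medians is one reasonable choice among several; the point is only that the same four numbers are computed the same way on both sides of the rollout.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Deploy:
    lead_time_hours: float      # commit to running in production
    failed: bool                # did this change cause a production failure?
    restore_hours: float = 0.0  # time to restore; only meaningful when failed


def dora_summary(deploys: list, period_days: int) -> dict:
    """Compute the four numbers for one observation period."""
    if not deploys:
        raise ValueError("need at least one deploy in the period")
    failures = [d for d in deploys if d.failed]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(d.lead_time_hours for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_restore_hours": (
            median(d.restore_hours for d in failures) if failures else 0.0
        ),
    }


def faster_but_breaking_more(before: dict, after: dict) -> bool:
    """True when throughput improved while stability got worse."""
    faster = (
        after["deploys_per_day"] > before["deploys_per_day"]
        and after["median_lead_time_hours"] < before["median_lead_time_hours"]
    )
    less_stable = (
        after["change_failure_rate"] > before["change_failure_rate"]
        or after["median_restore_hours"] > before["median_restore_hours"]
    )
    return faster and less_stable
```

Feeding it a baseline period and a post-rollout period of equal length gives a single flag that says when the speed gain is being paid for in stability.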

The reach here is narrower, but it exists: an agent can help collect these metrics and tie a regression in change failure rate back to the changes that likely caused it, faster than a person combing through dashboards. That turns a lagging number into something you can act on.

Key Takeaway

The through-line: the machine supplies scale and speed; the human supplies judgment about what is correct.
