Developer Workflows Tested March 2026

Code Factory: Fully Automated Code Review and Merge Pipeline With AI Agents

AI reviews every PR in 3 minutes. Risk-tier contracts route critical code to humans and low-risk docs to auto-merge. A coding agent reads the review, patches the code, re-validates, and pushes — all without you. Here's the complete factory, from preflight gates to browser evidence.

The Problem

Solo dev or two-person team: code review is either a bottleneck or nonexistent. Skip it and a silent auth bug costs you a weekend. Wait for a teammate and PRs sit 6-18 hours. Self-review catches typos, not logic errors.

The fix: A fully automated pipeline where AI reviews, auto-fixes the safe stuff, blocks the dangerous stuff, and only escalates to you when human judgment actually matters.

Risk-Tier Contracts

Not every PR carries the same risk. A README fix and a payment rewrite shouldn't run through the same pipeline. Define tiers as a machine-readable contract:

// .github/risk-tiers.json
{
  "tiers": {
    "critical": {
      "paths": ["src/auth/**", "src/payments/**", "*.env*"],
      "required_checks": ["tests", "type_check", "secrets_scan",
                          "ai_review", "human_approval"],
      "auto_merge": false
    },
    "standard": {
      "paths": ["src/**"],
      "required_checks": ["tests", "lint", "ai_review"],
      "auto_fix_allowed": ["suggestion"]
    },
    "low_risk": {
      "paths": ["docs/**", "README*", "*.md"],
      "required_checks": ["lint"],
      "auto_merge": true,
      "auto_fix_allowed": ["suggestion", "should_fix"]
    }
  }
}

A docs-only PR runs lint and auto-merges. A payment flow change runs everything including mandatory human sign-off. The contract is versioned in the repo — no guessing what "low-risk" means.

Preflight Gating: Don't Waste CI on Broken Code

Full test suites cost time and compute. Running 15 minutes of tests on code with a syntax error is waste. Preflight catches the obvious stuff in seconds:

# .github/workflows/preflight.yml — runs in <15 seconds
- name: Syntax check
  run: find . -name "*.py" -exec python3 -m py_compile {} +

- name: Secrets scan
  run: |
    if git diff origin/main...HEAD | grep -E "^\+" | \
       grep -iE "(api_key|secret|password)\s*=\s*['\"][^'\"]+"; then
      echo "🔴 Secret detected"; exit 1
    fi

- name: Resolve risk tier
  run: source .github/scripts/resolve-tier.sh

Only if preflight passes does the full CI pipeline spin up. A syntax error caught in 5 seconds saves a 15-minute CI run.
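The resolve-tier step is referenced but not shown. Here's a minimal sketch of what `.github/scripts/resolve-tier.sh` might look like, with the contract's glob patterns inlined for brevity; a real version would parse `risk-tiers.json` instead of hardcoding them:

```shell
#!/usr/bin/env bash
# Sketch of .github/scripts/resolve-tier.sh: map changed files to the
# highest-risk tier. Patterns are inlined here for brevity; parse
# .github/risk-tiers.json in a real setup.
resolve_tier() {
  local tier="low_risk"
  for f in "$@"; do
    case "$f" in
      src/auth/*|src/payments/*|*.env*) echo "critical"; return ;;  # any hit wins
      src/*) tier="standard" ;;
    esac
  done
  echo "$tier"
}

# Usage inside preflight:
# TIER=$(resolve_tier $(git diff --name-only origin/main...HEAD))
```

One critical-path file anywhere in the diff escalates the whole PR; anything else under `src/` lands in standard; docs-only changes fall through to low-risk.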

Current-Head SHA Discipline

Subtle bug: the review runs on commit abc123. Developer pushes a fix (def456). The merge gate still reads the review from abc123 — stale approval on code that no longer exists.

# .github/scripts/merge-check.sh
PR_HEAD=$(gh pr view $PR --json headRefOid -q .headRefOid)
REVIEW_SHA=$(jq -r '.sha' .github/reviews/latest.json)

if [ "$PR_HEAD" != "$REVIEW_SHA" ]; then
  echo "🔴 Review is stale. Re-run on current HEAD."
  exit 1
fi

The rule: If HEAD doesn't match review SHA, the review is void. Re-run it. The synchronize trigger handles this for new pushes, but the merge gate should verify independently.
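For reference, the trigger block that makes this work is small. A sketch (the file name and concurrency group are this guide's assumptions, not a required convention):

```yaml
# .github/workflows/ai-review.yml (assumed name)
on:
  pull_request:
    types: [opened, synchronize, reopened]

concurrency:
  group: ai-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true  # a new push cancels any in-flight review of the old SHA
```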

The Full Remediation Loop

The coding agent reads the review, patches the code, runs tests, and pushes — without human input for safe changes:

Pipeline flow

PR opened → Preflight (5 sec) → fail? Instant feedback, no CI wasted

Preflight pass → AI Review → 🔴 MUST FIX? Block + comment, human decides

🟢/🟡 issues → Coding agent reads JSON → patches code → runs tests

Tests fail → Revert patch, comment, escalate to human

Tests pass → Push fix → Re-review on new SHA (current-head discipline)

Re-review pass → Merge gate checks risk tier → auto-merge if allowed, else wait for human

# Auto-remediation step
- name: Auto-remediate
  run: |
    ISSUES=$(jq '[.issues[] | select(.severity != "must_fix")]' \
            .github/reviews/latest.json)
    [ "$(echo "$ISSUES" | jq 'length')" -eq 0 ] && exit 0

    # Agent generates a unified diff patch. Build the payload with jq so
    # the issues JSON is escaped correctly inside the request body.
    PATCH=$(jq -n --arg issues "$ISSUES" \
      '{model: "claude-sonnet-4-20250514", max_tokens: 4096,
        messages: [{role: "user",
          content: ("Fix these issues. Output ONLY unified diff.\n" + $issues)}]}' \
      | curl -s https://api.anthropic.com/v1/messages \
        -H "x-api-key: $ANTHROPIC_API_KEY" \
        -H "anthropic-version: 2023-06-01" \
        -H "content-type: application/json" \
        -d @- \
      | jq -r '.content[0].text')

    # Validate → apply → test → push
    echo "$PATCH" | git apply --check && \
    echo "$PATCH" | git apply && \
    npm test && \
    git add -A && \
    git commit -m "fix: auto-remediate review [bot]" && \
    git push

- name: Revert on failure
  if: failure()
  run: |
    # Drop an applied-but-uncommitted patch
    git checkout -- .
    # Rewind only if the bot's fix commit actually landed
    if git log -1 --pretty=%s | grep -q "auto-remediate review \[bot\]"; then
      git reset --hard HEAD~1
      git push --force-with-lease
    fi

Safety rails: git apply --check validates before applying. Tests run after patch, before push. --force-with-lease prevents overwriting others' work on revert. 🔴 issues are never auto-fixed — period.
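The merge gate's tier check (the last step in the flow above) can be a single jq query against the contract. A sketch; the function name is this guide's, and the `gh pr merge` call in the usage comment assumes GitHub auto-merge is enabled on the repo:

```shell
# Does the resolved tier permit auto-merge? Reads the risk-tier contract.
auto_merge_allowed() {
  # $1: tier name, $2: contract path (defaults to the repo contract)
  jq -e --arg t "$1" '.tiers[$t].auto_merge == true' \
     "${2:-.github/risk-tiers.json}" > /dev/null
}

# In the merge gate, after the SHA check passes:
# auto_merge_allowed "$TIER" && gh pr merge "$PR" --auto --squash
```

`jq -e` exits non-zero when the expression is false or null, so a missing `auto_merge` key fails safe: no explicit `true` in the contract, no auto-merge.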

Browser Evidence for UI Changes

Code review catches logic errors. It doesn't catch "the button moved off-screen." For frontend PRs, browser screenshots are first-class proof:

# Capture key pages with Playwright
npx playwright screenshot http://localhost:3000 evidence/home.png
npx playwright screenshot http://localhost:3000/pricing evidence/pricing.png

# Visual diff against main branch baseline
git show main:evidence/home.png > evidence/before.png
npx pixelmatch evidence/before.png evidence/home.png evidence/diff.png 0.1

Screenshots attach to the PR as artifacts. Visual diffs show exactly what changed. No "it works on my machine" debates.
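Attaching them is one workflow step. A sketch using `actions/upload-artifact` (the step and artifact names are this guide's assumptions):

```yaml
- name: Upload browser evidence
  if: always()   # keep evidence even when later steps fail
  uses: actions/upload-artifact@v4
  with:
    name: browser-evidence-${{ github.sha }}
    path: evidence/
```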

Required when: Any PR touching HTML, CSS, or frontend JS. Any checkout/payment flow change (critical tier). Skipped for backend-only, docs, or config (resolved by risk tier contract).

The Audit Trail

# One record per line: jq -c emits compact single-line JSON, and --arg/--argjson
# handle quoting, so the log stays valid JSONL
jq -cn --argjson pr "$PR" --arg sha "$SHA" --arg tier "$TIER" \
       --arg verdict "$V" --argjson issues "$N" --argjson auto_fixed "$AF" \
       --argjson browser_evidence "$SCREENSHOTS" --argjson cost "$C" \
       '{pr: $pr, sha: $sha, tier: $tier, verdict: $verdict, issues: $issues,
         auto_fixed: $auto_fixed, browser_evidence: $browser_evidence, cost: $cost}' \
  >> .github/review-log.jsonl

Monthly: which paths produce the most 🔴s? Where should human review concentrate? If auto-fix failure rate exceeds 10%, tighten criteria.
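Those monthly questions can be answered straight from the log. A sketch, assuming the field names written above; the `must_fix` verdict string is an assumption, so match whatever your review step actually writes:

```shell
# Monthly rollup: reviews and 🔴 verdicts per tier, from the JSONL log
review_rollup() {
  # $1: path to review-log.jsonl
  jq -s '
    group_by(.tier)
    | map({tier: .[0].tier,
           reviews: length,
           must_fix: [.[] | select(.verdict == "must_fix")] | length})
  ' "$1"
}

# review_rollup .github/review-log.jsonl
```

A tier with a high `must_fix` count is where human review time should concentrate.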

What This Doesn't Replace

Architecture decisions: AI catches bugs, not bad design.

Domain knowledge: it won't catch "discount allows negative prices" without rules.

The hard conversations: "this approach needs rethinking" is human territory.

Getting Started

60+ Guides Like This One

Every Library item is tested on real systems. New guides weekly. $9/month, cancel anytime.

Get The Library — $9/mo

30-day money-back guarantee