AI reviews every PR in 3 minutes. Risk-tier contracts route critical code to humans and low-risk docs to auto-merge. A coding agent reads the review, patches the code, re-validates, and pushes — all without you. Here's the complete factory, from preflight gates to browser evidence.
Solo dev or two-person team: code review is either a bottleneck or nonexistent. Skip it and a silent auth bug costs you a weekend. Wait for a teammate and PRs sit 6-18 hours. Self-review catches typos, not logic errors.
The fix: A fully automated pipeline where AI reviews, auto-fixes the safe stuff, blocks the dangerous stuff, and only escalates to you when human judgment actually matters.
Not every PR carries the same risk. A README fix and a payment rewrite shouldn't run through the same pipeline. Define tiers as a machine-readable contract:
// .github/risk-tiers.json
{
  "tiers": {
    "critical": {
      "paths": ["src/auth/**", "src/payments/**", "*.env*"],
      "required_checks": ["tests", "type_check", "secrets_scan",
                          "ai_review", "human_approval"],
      "auto_merge": false
    },
    "standard": {
      "paths": ["src/**"],
      "required_checks": ["tests", "lint", "ai_review"],
      "auto_fix_allowed": ["suggestion"]
    },
    "low_risk": {
      "paths": ["docs/**", "README*", "*.md"],
      "required_checks": ["lint"],
      "auto_merge": true,
      "auto_fix_allowed": ["suggestion", "should_fix"]
    }
  }
}
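A docs-only PR runs lint and auto-merges. A payment flow change runs everything, including mandatory human sign-off. Because src/auth/** also matches src/**, tiers have to be evaluated from highest risk down, and a PR that touches several tiers takes the strictest one. The contract is versioned in the repo, so there is no guessing what "low-risk" means.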
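One way a resolver might apply the contract is sketched below. It is an illustration, not the article's actual resolver: it assumes Python's fnmatch-style globs (where `*` also crosses `/`), checks tiers from highest risk down, and fails closed by treating unmatched files as critical. The names `TIERS`, `TIER_ORDER`, `tier_for_file`, and `resolve_tier` are hypothetical.

```python
from fnmatch import fnmatchcase

# Path patterns copied from .github/risk-tiers.json above.
TIERS = {
    "critical": {"paths": ["src/auth/**", "src/payments/**", "*.env*"]},
    "standard": {"paths": ["src/**"]},
    "low_risk": {"paths": ["docs/**", "README*", "*.md"]},
}

# Highest risk first: the first tier with a matching pattern wins.
TIER_ORDER = ["critical", "standard", "low_risk"]


def tier_for_file(path, tiers=TIERS):
    """Return the strictest tier whose patterns match a single file."""
    for name in TIER_ORDER:
        if any(fnmatchcase(path, pattern) for pattern in tiers[name]["paths"]):
            return name
    # Assumption: a file no pattern covers is treated as critical (fail closed).
    return "critical"


def resolve_tier(changed_files, tiers=TIERS):
    """Return the strictest tier across every file a PR touches."""
    ranks = [TIER_ORDER.index(tier_for_file(f, tiers)) for f in changed_files]
    # An empty file list also fails closed to the first (strictest) tier.
    return TIER_ORDER[min(ranks, default=0)]
```

With this ordering, a PR that edits both `docs/setup.md` and `src/auth/login.py` resolves to critical, so one risky file is enough to pull the whole PR into human review.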
Full test suites cost time and compute. Running 15 minutes of tests on code with a syntax error is waste. Preflight catches the obvious stuff in seconds:
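# .github/workflows/preflight.yml (runs in <15 seconds)
- name: Syntax check
  run: find . -name "*.py" -exec python3 -m py_compile {} +
- name: Secrets scan
  run: |
    # Scan only added lines, so deleting a hard-coded secret does not fail the gate.
    # origin/main must be fetched first (for example, checkout with fetch-depth: 0).
    # grep -q keeps the matched value out of the CI log.
    if git diff origin/main...HEAD | grep -E '^\+' | grep -vE '^\+\+\+ ' | \
       grep -qiE "(api_key|secret|password)\s*=\s*['\"][^'\"]+"; then
      echo "🔴 Secret detected"; exit 1
    fi
- name: Resolve risk tier
  # Each step runs in its own shell, so the script must write what it resolves
  # (e.g. TIER) to $GITHUB_ENV for later steps to see it.
  run: source .github/scripts/resolve-tier.sh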
Only if preflight passes does the full CI pipeline spin up. A syntax error caught in 5 seconds saves a 15-minute CI run.
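The secrets check can also be written as a small function, which makes it easier to test than a grep pipeline. The sketch below reuses the article's pattern and looks only at lines a diff adds, so removing a hard-coded credential does not trip the gate. `find_secrets` and `SECRET_RE` are hypothetical names, and the pattern is a coarse heuristic rather than a complete secrets scanner.

```python
import re

# Same heuristic as the grep pattern: key = "value" or key = 'value'.
SECRET_RE = re.compile(
    r"(api_key|secret|password)\s*=\s*['\"][^'\"]+", re.IGNORECASE
)


def find_secrets(diff_text):
    """Return added lines in a unified diff that look like hard-coded secrets."""
    hits = []
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            continue  # file header, not an added line
        if line.startswith("+") and SECRET_RE.search(line):
            hits.append(line[1:].strip())
    return hits


sample_diff = "\n".join([
    "+++ b/app.py",
    "-password = 'old-value'",   # removed line: ignored
    '+API_KEY = "sk-123"',       # added line: flagged
    "+name = 'x'",               # added line without a secret-like key
])
flagged = find_secrets(sample_diff)
```

A CI step could pipe `git diff origin/main...HEAD` into a script built on this function and exit non-zero when the returned list is not empty.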
Subtle bug: the review runs on commit abc123. Developer pushes a fix (def456). The merge gate still reads the review from abc123 — stale approval on code that no longer exists.
# .github/scripts/merge-check.sh
PR_HEAD=$(gh pr view "$PR" --json headRefOid -q .headRefOid)
REVIEW_SHA=$(jq -r '.sha' .github/reviews/latest.json)
if [ "$PR_HEAD" != "$REVIEW_SHA" ]; then
  echo "🔴 Review is stale. Re-run on current HEAD."
  exit 1
fi
The rule: if HEAD doesn't match the review SHA, the review is void. Re-run it. The pull_request synchronize event re-triggers the review on every new push, but the merge gate should still verify the SHA independently.
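The current-head rule combines naturally with the tier contract in a single gate function. The following is a minimal sketch under stated assumptions: `merge_gate` is a hypothetical name, `review` is the parsed review JSON with a `sha` field, `passed_checks` is a set of check names that succeeded, and the staleness test is applied only when the tier lists `ai_review` as a required check.

```python
def merge_gate(pr_head_sha, review, tier, passed_checks):
    """Return (allowed, reason) for a merge attempt.

    tier is one entry from risk-tiers.json; review may be None when no
    AI review was produced for this PR.
    """
    required = tier["required_checks"]
    if "ai_review" in required:
        # Current-head discipline: a review of any other commit is void.
        if review is None or review.get("sha") != pr_head_sha:
            return False, "review is stale: re-run on current HEAD"
    missing = [check for check in required if check not in passed_checks]
    if missing:
        return False, "missing checks: " + ", ".join(missing)
    if not tier.get("auto_merge", False):
        return False, "checks passed: waiting for human merge"
    return True, "auto-merge allowed"


standard = {"required_checks": ["tests", "lint", "ai_review"]}
low_risk = {"required_checks": ["lint"], "auto_merge": True}
all_green = {"tests", "lint", "ai_review"}

stale = merge_gate("def456", {"sha": "abc123"}, standard, all_green)
fresh = merge_gate("def456", {"sha": "def456"}, standard, all_green)
docs = merge_gate("def456", None, low_risk, {"lint"})
```

Returning a reason alongside the decision lets the same function drive both the gate's exit code and the PR comment that explains why a merge was held.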
The coding agent reads the review, patches the code, runs tests, and pushes — without human input for safe changes:
PR opened → Preflight (5 sec) → fail? Instant feedback, no CI wasted
Preflight pass → AI Review → 🔴 MUST FIX? Block + comment, human decides
🟢/🟡 issues → Coding agent reads JSON → patches code → runs tests
Tests fail → Revert patch, comment, escalate to human
Tests pass → Push fix → Re-review on new SHA (current-head discipline)
Re-review pass → Merge gate checks risk tier → auto-merge if allowed, else wait for human
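# Auto-remediation step
- name: Auto-remediate
  run: |
    # Remember the pre-patch commit so the revert step can restore it.
    echo "PRE_PATCH_SHA=$(git rev-parse HEAD)" >> "$GITHUB_ENV"
    ISSUES=$(jq '[.issues[] | select(.severity != "must_fix")]' \
      .github/reviews/latest.json)
    [ "$(echo "$ISSUES" | jq 'length')" -eq 0 ] && exit 0
    # Build the request body with jq so quotes and newlines in $ISSUES are escaped.
    BODY=$(jq -n --arg issues "$ISSUES" '{
      model: "claude-sonnet-4-20250514", max_tokens: 4096,
      messages: [{role: "user",
        content: ("Fix these issues. Output ONLY unified diff.\n" + $issues)}]}')
    # Agent generates a unified diff patch
    PATCH=$(curl -s https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "content-type: application/json" \
      -d "$BODY" | jq -r '.content[0].text')
    # git commit needs an identity on a fresh CI runner.
    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    # Validate → apply → test → push
    echo "$PATCH" | git apply --check && \
    echo "$PATCH" | git apply && \
    npm test && \
    git add -A && \
    git commit -m "fix: auto-remediate review [bot]" && \
    git push
- name: Revert on failure
  if: failure()
  # If the chain above failed, nothing was pushed: restore the local checkout only.
  run: git reset --hard "${PRE_PATCH_SHA:-HEAD}" && git clean -fd
Safety rails: git apply --check validates the patch before it is applied. Tests run after the patch and before the push, so a failing fix never leaves the runner. On failure the checkout is reset to the pre-patch commit; resetting to HEAD~1 and force-pushing would instead drop the developer's own last commit, because a failed run never created a bot commit. 🔴 issues are never auto-fixed, period.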
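The jq filter in the step above treats every issue that is not must_fix as fixable, while the tier contract is stricter: standard allows only suggestion, and critical allows nothing. A tier-aware selection might look like the sketch below. `split_issues` is a hypothetical name, only the `severity` field is taken from the article, and the `file` values in the sample are made up for illustration.

```python
def split_issues(issues, tier):
    """Split review issues into (auto_fixable, escalate) for one tier.

    An issue is auto-fixable only if its severity appears in the tier's
    auto_fix_allowed list; a tier without that key allows no auto-fixes.
    """
    allowed = set(tier.get("auto_fix_allowed", []))
    fixable = [issue for issue in issues if issue["severity"] in allowed]
    escalate = [issue for issue in issues if issue["severity"] not in allowed]
    return fixable, escalate


# Tier entries as they appear in risk-tiers.json (trimmed to the relevant key).
critical = {"auto_merge": False}
standard = {"auto_fix_allowed": ["suggestion"]}
low_risk = {"auto_fix_allowed": ["suggestion", "should_fix"]}

# Hypothetical review output.
issues = [
    {"severity": "suggestion", "file": "src/utils/date.py"},
    {"severity": "should_fix", "file": "src/utils/date.py"},
    {"severity": "must_fix", "file": "src/utils/date.py"},
]

fixable_standard, escalate_standard = split_issues(issues, standard)
fixable_low, escalate_low = split_issues(issues, low_risk)
fixable_critical, escalate_critical = split_issues(issues, critical)
```

Because must_fix never appears in any auto_fix_allowed list, it always lands in the escalation bucket without needing a special case.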
Code review catches logic errors. It doesn't catch "the button moved off-screen." For frontend PRs, browser screenshots are first-class proof:
# Capture key pages with Playwright
npx playwright screenshot http://localhost:3000 evidence/home.png
npx playwright screenshot http://localhost:3000/pricing evidence/pricing.png
# Visual diff against the baseline committed on main
git show main:evidence/home.png > evidence/before.png
npx pixelmatch evidence/before.png evidence/home.png evidence/diff.png 0.1
Screenshots attach to the PR as artifacts. Visual diffs show exactly what changed. No "it works on my machine" debates.
Required when: Any PR touching HTML, CSS, or frontend JS. Any checkout/payment flow change (critical tier). Skipped for backend-only, docs, or config (resolved by risk tier contract).
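The "required when" rule can be expressed as a check over the changed files. This is a rough sketch with loudly stated assumptions: the suffix list stands in for "HTML, CSS, or frontend JS" and cannot tell frontend JavaScript from backend JavaScript, and `src/payments/` is borrowed from the critical tier's paths as a proxy for checkout/payment flows. `needs_browser_evidence`, `FRONTEND_SUFFIXES`, and `PAYMENT_PREFIXES` are hypothetical names.

```python
from pathlib import PurePosixPath

# Assumption: these suffixes approximate "HTML, CSS, or frontend JS".
FRONTEND_SUFFIXES = {".html", ".css", ".js", ".jsx", ".ts", ".tsx"}

# Assumption: payment flows live under the critical tier's payments path.
PAYMENT_PREFIXES = ("src/payments/",)


def needs_browser_evidence(changed_files):
    """Return True when any changed file calls for screenshots."""
    for path in changed_files:
        if PurePosixPath(path).suffix.lower() in FRONTEND_SUFFIXES:
            return True
        if path.startswith(PAYMENT_PREFIXES):
            return True
    return False
```

In a repository that keeps server-side JavaScript next to the frontend, a directory prefix would be a better signal than the file suffix.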
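# Append one record per review. JSONL needs each object on a single line,
# so build it with jq -c instead of a multi-line echo.
jq -nc --argjson pr "$PR" --arg sha "$SHA" --arg tier "$TIER" \
  --arg verdict "$V" --argjson issues "$N" --argjson auto_fixed "$AF" \
  --argjson browser_evidence "$SCREENSHOTS" --argjson cost "$C" \
  '{$pr, $sha, $tier, $verdict, $issues, $auto_fixed, $browser_evidence, $cost}' \
  >> .github/review-log.jsonl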
Monthly: which paths produce the most 🔴s? Where should human review concentrate? If auto-fix failure rate exceeds 10%, tighten criteria.
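A monthly pass over the log can start from the fields it already records. The sketch below assumes each record occupies a single line of .github/review-log.jsonl and aggregates per tier; `summarize` and `auto_fix_share` are hypothetical names, and the verdict values in the sample are invented. The log as shown carries no file paths and no auto-fix failure flag, so per-path 🔴 counts and a failure rate would need extra fields.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Aggregate review-log records per tier."""
    totals = defaultdict(
        lambda: {"reviews": 0, "issues": 0, "auto_fixed": 0, "cost": 0.0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        stats = totals[record["tier"]]
        stats["reviews"] += 1
        stats["issues"] += record["issues"]
        stats["auto_fixed"] += record["auto_fixed"]
        stats["cost"] += record["cost"]
    return dict(totals)


def auto_fix_share(stats):
    """Fraction of reported issues that were auto-fixed (0.0 if none reported)."""
    return stats["auto_fixed"] / stats["issues"] if stats["issues"] else 0.0


# Hypothetical log lines using the article's field names.
sample_log = [
    '{"pr":1,"sha":"a1","tier":"standard","verdict":"pass","issues":4,"auto_fixed":3,"browser_evidence":0,"cost":0.25}',
    '{"pr":2,"sha":"b2","tier":"standard","verdict":"block","issues":2,"auto_fixed":0,"browser_evidence":2,"cost":0.5}',
    '{"pr":3,"sha":"c3","tier":"low_risk","verdict":"pass","issues":0,"auto_fixed":0,"browser_evidence":0,"cost":0.25}',
]
summary = summarize(sample_log)
```

In practice the input would be `open(".github/review-log.jsonl")`, which yields one line per record.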
What still needs a human:
Architecture decisions: AI catches bugs, not bad design.
Domain knowledge: it won't catch "discount allows negative prices" without explicit rules.
The hard conversations: "This approach needs rethinking" is human territory.
Start with risk-tiers.json. Defining tiers forces you to think about what's critical.