skills/ecc-tools-cost-audit/SKILL.md
name: ecc-tools-cost-audit
description: Evidence-first ECC Tools burn and billing audit workflow. Use when investigating runaway PR creation, quota bypass, premium-model leakage, duplicate jobs, or GitHub App cost spikes in the ECC Tools repo.
origin: ECC
ECC Tools Cost Audit
Use this skill when the user suspects the ECC Tools GitHub App is burning cost, over-creating PRs, bypassing usage limits, or routing free users into premium analysis paths.
This is a focused operator workflow for the sibling [ECC-Tools](../../ECC-Tools) repo. It is not a generic billing skill and it is not a repo-wide code review pass.
Skill Stack
Pull these ECC-native skills into the workflow when relevant:
- `autonomous-loops` for bounded multi-step audits that cross webhooks, queues, billing, and retries
- `agentic-engineering` for tracing the request path into discrete, provable units
- `customer-billing-ops` when repo behavior and customer-impact math must be separated cleanly
- `search-first` before inventing helpers or re-implementing repo-local utilities
- `security-review` when auth, usage gates, entitlements, or secrets are touched
- `verification-loop` for proving rerun safety and exact post-fix state
- `tdd-workflow` when the fix needs regression coverage in the worker, router, or billing paths
When To Use
- the user mentions ECC Tools burn rate, PR recursion, over-created PRs, usage-limit bypass, or premium-model leakage
- the task is in the sibling `ECC-Tools` repo and depends on webhook handlers, queue workers, usage reservation, PR creation logic, or paid-gate enforcement
- a customer report says the app created too many PRs, billed incorrectly, or analyzed code without producing a usable result
Scope Guardrails
- work in the sibling `ECC-Tools` repo, not in `everything-claude-code`
- start read-only unless the user clearly asked for a fix
- do not mutate unrelated billing, checkout, or UI flows while tracing analysis burn
- treat app-generated branches and app-generated PRs as red-flag recursion paths until proved otherwise
- separate three things explicitly:
  - repo-side burn root cause
  - customer-facing billing impact
  - product or entitlement gaps that need backlog follow-up
Workflow
1. Freeze repo scope
- switch into the sibling `ECC-Tools` repo
- check branch and local diff first
- identify the exact surface under audit:
  - webhook router
  - queue producer
  - queue consumer
  - PR creation path
  - usage reservation / billing path
  - model routing path
2. Trace ingress before theorizing
- inspect `src/index.*` or the main entrypoint first
- map every enqueue path before suggesting a fix
- confirm which GitHub events share a queue type
- confirm whether push, pull_request, synchronize, comment, or manual re-run events can converge on the same expensive path
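One way to record the ingress map is a small table of trigger-to-job-type rows, filled in while reading the router. The sketch below is illustrative only: the rows and the job type names (`analysis`, `manual-audit`) are assumptions, not values taken from the ECC-Tools repo.

```python
from collections import defaultdict

# Hypothetical rows: (GitHub event, action, queue job type).
# In a real audit each row should cite the router code that enqueues it.
ENQUEUE_PATHS = [
    ("push", None, "analysis"),
    ("pull_request", "opened", "analysis"),
    ("pull_request", "synchronize", "analysis"),
    ("issue_comment", "created", "analysis"),
    ("workflow_dispatch", None, "manual-audit"),
]


def converging_triggers(paths):
    """Group triggers by job type; keep job types reachable from more than one trigger."""
    by_job = defaultdict(list)
    for event, action, job_type in paths:
        trigger = event if action is None else f"{event}.{action}"
        by_job[job_type].append(trigger)
    return {job: triggers for job, triggers in by_job.items() if len(triggers) > 1}


shared = converging_triggers(ENQUEUE_PATHS)
for job_type, triggers in shared.items():
    print(f"{job_type}: {', '.join(triggers)}")
```

Any job type that shows up in `shared` is a place where several events converge on the same expensive path and deserves a closer look in the worker.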
3. Trace the worker and side effects
- inspect the queue consumer or scheduled worker that handles analysis
- confirm whether a queued analysis always ends in:
  - PR creation
  - branch creation
  - file updates
  - premium model calls
  - usage increments
- if analysis can spend tokens and then fail before output is persisted, classify it as burn-with-broken-output
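The classification in the last bullet can be made mechanical once each job's spend and persistence outcome are known. This is a minimal sketch; the `JobTrace` fields and the labels other than burn-with-broken-output are assumed names for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobTrace:
    """Hypothetical per-job record assembled from logs or worker code reading."""
    tokens_spent: int
    output_persisted: bool


def classify(trace: JobTrace) -> str:
    """Label a job by whether it spent tokens and whether its output survived."""
    if trace.tokens_spent == 0:
        return "no-burn"
    if trace.output_persisted:
        return "burn-with-output"
    return "burn-with-broken-output"
```

For example, `classify(JobTrace(tokens_spent=1500, output_persisted=False))` returns `"burn-with-broken-output"`.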
4. Audit the high-signal burn paths
PR multiplication
- inspect PR helpers and branch naming
- check dedupe, synchronize-event handling, and existing-PR reuse
- if app-generated branches can re-enter analysis, treat that as a priority-0 recursion risk
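The two checks above, a recursion guard and existing-PR reuse, might look like the following in isolation. The branch prefix `ecc-tools/` and both function names are assumptions for illustration; the real naming convention has to be read from the PR helpers.

```python
APP_BRANCH_PREFIX = "ecc-tools/"  # hypothetical convention, confirm in the repo


def should_analyze(branch: str, sender_is_app: bool) -> bool:
    """Skip events raised by the app itself or on branches the app created."""
    return not (sender_is_app or branch.startswith(APP_BRANCH_PREFIX))


def pick_pr_action(open_pr_heads: set[str], head_branch: str) -> str:
    """Reuse an open PR with the same head branch instead of opening another."""
    return "update-existing" if head_branch in open_pr_heads else "create"
```

If the router has no equivalent of `should_analyze` in front of the enqueue call, the app can analyze its own output.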
Quota bypass
- inspect where quota is checked versus where usage is reserved or incremented
- if quota is checked before enqueue but usage is charged only inside the worker, treat concurrent front-door passes as a real race
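A sketch of the alternative: reserve usage atomically at the front door, so the check and the increment cannot be separated by a concurrent request. `UsageLedger` is a hypothetical in-memory stand-in; a real store would need an equivalent atomic operation such as a conditional update or a transaction.

```python
import threading


class UsageLedger:
    """Hypothetical in-memory usage store with check-and-reserve in one step."""

    def __init__(self, limit: int):
        self.limit = limit
        self.reserved = 0
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        # Check and increment under one lock: two callers cannot both take the last slot.
        with self._lock:
            if self.reserved >= self.limit:
                return False
            self.reserved += 1
            return True


def enqueue_if_allowed(ledger: UsageLedger, queue: list, job: str) -> bool:
    """Reserve before enqueue; reject early when the quota is exhausted."""
    if not ledger.try_reserve():
        return False
    queue.append(job)
    return True


ledger = UsageLedger(limit=3)
queue: list = []
threads = [
    threading.Thread(target=enqueue_if_allowed, args=(ledger, queue, f"job-{i}"))
    for i in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With ten concurrent requests against a limit of three, only three jobs reach the queue. A full fix would also release the reservation when a job is dropped before it spends anything.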
Premium-model leakage
- inspect model selection, tier branching, and provider routing
- verify whether free or capped users can still hit premium analyzers when premium keys are present
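The property to verify is that entitlement decides the route and key presence never does. A minimal sketch, with assumed tier names and an assumed `select_model` signature:

```python
PREMIUM_TIERS = {"pro", "enterprise"}  # hypothetical tier names


def select_model(tier: str, quota_remaining: int, premium_key_present: bool) -> str:
    """Entitlement is decided first; a configured key only matters for entitled users."""
    entitled = tier in PREMIUM_TIERS and quota_remaining > 0
    if entitled and premium_key_present:
        return "premium"
    return "basic"
```

A leaking router typically branches on `premium_key_present` alone, so `select_model("free", 10, True)` returning `"basic"` is the case worth covering with a regression test.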
Retry burn
- inspect retry loops, duplicate queue jobs, and deterministic failure reruns
- if the same non-transient error can spend analysis repeatedly, fix that before quality improvements
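The retry rule can be sketched as: retry only failures that might succeed next time, and stop immediately on failures that will repeat. The two exception classes and `run_with_retries` are hypothetical names, not the repo's API.

```python
class TransientError(Exception):
    """Hypothetical: timeouts, rate limits, and other failures worth retrying."""


class DeterministicError(Exception):
    """Hypothetical: failures that repeat on every attempt, such as a branch collision."""


def run_with_retries(job, max_attempts=3):
    """Return (result, attempts); retry transient failures only."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            return job(), attempts
        except TransientError:
            continue  # worth another attempt, up to max_attempts
        except DeterministicError:
            break  # rerunning would only spend again for the same failure
    return None, attempts


def always_collides():
    raise DeterministicError("branch already exists")


def make_flaky(failures):
    """Build a job that raises TransientError `failures` times, then succeeds."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("timeout")
        return "ok"

    return job


collision_result = run_with_retries(always_collides)
flaky_result = run_with_retries(make_flaky(2))
```

Here the colliding job is attempted once and abandoned, while the flaky job succeeds on its third attempt.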
5. Fix in burn order
If the user asked for code changes, prioritize fixes in this order:
- stop automatic PR multiplication
- stop quota bypass
- stop premium leakage
- stop duplicate-job fanout and pointless retries
- close rerun/update safety gaps
Keep the pass bounded to one to three direct fixes unless the same root cause clearly spans multiple files.
6. Verify with the smallest proving steps
- rerun only the targeted tests or integration slices that cover the changed path
- verify whether the burn path is now:
  - blocked
  - deduped
  - downgraded to cheaper analysis
  - or rejected early
- state the final status exactly:
  - changed locally
  - verified locally
  - pushed
  - deployed
  - still blocked
High-Signal Failure Patterns
1. One queue type for all triggers
If pushes, PR syncs, and manual audits all enqueue the same job and the worker always creates a PR, every analysis run produces a PR and routine repo activity turns into PR spam.
2. Post-enqueue usage reservation
If usage is checked at the front door but only incremented in the worker, concurrent requests can all pass the gate and exceed quota.
3. Free tier on premium path
If free queued jobs can still route into Anthropic or another premium provider when keys exist, that is real spend leakage even if the user never sees the premium result.
4. App-generated branches re-enter the webhook
If pull_request.synchronize, branch pushes, or comment-triggered runs fire on app-owned branches, the app can recursively analyze its own output.
5. Expensive work before persistence safety
If the system can spend tokens and then fail on PR creation, file update, or branch collision, it is burning cost without shipping value.
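One containment shape for this pattern is to run cheap persistence checks before any paid call. The sketch below uses stub functions throughout; `run_analysis`, the check names, and the token count are assumptions for illustration.

```python
def run_analysis(job, preflight_checks, analyze, persist):
    """Run cheap persistence checks first; spend on analysis only if all pass."""
    for name, check in preflight_checks:
        if not check(job):
            return {"status": "rejected-early", "failed_check": name, "tokens_spent": 0}
    result = analyze(job)
    persist(job, result)
    return {"status": "persisted", "tokens_spent": result["tokens"]}


# Stubs standing in for the real branch lookup, model call, and PR writer.
existing_branches = {"ecc-tools/analysis-1"}
calls = {"analyze": 0}
stored = []


def branch_is_free(job):
    return job["branch"] not in existing_branches


def fake_analyze(job):
    calls["analyze"] += 1
    return {"tokens": 1200, "summary": "stub"}


def fake_persist(job, result):
    stored.append((job["branch"], result["summary"]))


checks = [("branch-free", branch_is_free)]
blocked = run_analysis({"branch": "ecc-tools/analysis-1"}, checks, fake_analyze, fake_persist)
shipped = run_analysis({"branch": "ecc-tools/analysis-2"}, checks, fake_analyze, fake_persist)
```

The job that would collide on its branch is rejected with zero tokens spent, and the analyzer stub is called only for the job that can actually be persisted.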
Pitfalls
- do not begin with broad repo wandering; settle webhook -> queue -> worker first
- do not mix customer billing inference with code-backed product truth
- do not fix lower-value quality issues before the highest-burn path is contained
- do not claim burn is fixed until the narrow proving step was rerun
- do not push or deploy unless the user asked
- do not touch unrelated repo-local changes if they are already in progress
Verification
- root causes cite exact file paths and code areas
- fixes are ordered by burn impact, not code neatness
- proving commands are named
- final status distinguishes local change, verification, push, and deployment