From 61b5572e2b2352c4da31d0d75cb2448e09b535fb Mon Sep 17 00:00:00 2001
From: Ed_
Date: Sun, 7 Jun 2026 14:26:22 -0400
Subject: [PATCH] chore(audit): spec license_cve_audit track (compliance + CVE + pinning)

Builds scripts/audit_license_cve.py: single audit script that checks third-party deps (pyproject.toml + uv.lock transitive tree) for: (1) license compliance against the project's policy, (2) known CVEs (via pip-audit subprocess), (3) version-pinning, and (4) source-file SPDX license headers in src/ and scripts/.

LICENSE POLICY (encoded in the script)

Allowlist (permissive or weak copyleft or public domain):
- Permissive: MIT, BSD, Apache 2.0, ISC, Unlicense, Zlib, Python-2.0, 0BSD, PSF-2.0
- Weak copyleft (Python import-safe): LGPL 2.1/3.0, MPL-2.0
- Public domain: CC0, WTFPL

Blocklist (strong copyleft or non-OSI / restricted-source):
- GPL (any version), AGPL (any version)
- SSPL (MongoDB 2018) - broad service-provider trigger
- BSL / BUSL - delayed open source; competitive-use restriction
- Commons Clause - 'cannot sell the software' addendum
- Elastic License v2 - 'cannot offer as managed service'
- Unknown / unparseable / missing metadata (catches packaging bugs and custom licenses)

The two lists are explicit. Default rule: unknown = violation (never auto-pass). The script's --help references the policy table for transparency. Specific per-license additions go in scripts/audit_license_cve.py directly; no spec change needed.

TRACK SCOPE

In scope: third-party deps (direct + transitive), source-file SPDX headers, vendored libraries (defensive), version pinning.
Out of scope: the project's own LICENSE file, project's own SPDX/Copyright headers, recommendations on project license.

The user reserves all rights to the repo; no LICENSE file is created by the track. The audit reports third-party state only.

OUTPUT FORMAT (sanitized: no JSON in user-facing output)

- Stdout: line-per-violation, parseable by eye and by grep
- Markdown report in docs/reports/license_cve_audit/2026-06-07/
- Baseline file: JSON (matches existing audit_weak_types convention; internal state for --strict mode only)

CI GATE

--strict mode + scripts/audit_license_cve.baseline.json. Fails CI on any new violation OR any new CVE. Mirrors the 3 existing audit scripts (audit_main_thread_imports, audit_weak_types, check_test_toml_paths).

COMMITS PLANNED

1. chore(audit): add license_cve audit script + initial report
2. chore(deps): tilde-pin all deps; delete requirements.txt
3. chore(audit): add --strict mode + baseline file (CI gate)
4. conductor(tracks): mark License CVE Audit track complete

NO NEW PIP DEPENDENCIES IN PROJECT

Pure stdlib (importlib.metadata, tomllib, pathlib, re) + subprocess to pip-audit (an optional dev tool, installed via 'uv tool install pip-audit' if user wants CVE checks).
---
 .../tracks/license_cve_audit_20260607/spec.md | 286 ++++++++++++++++++
 1 file changed, 286 insertions(+)
 create mode 100644 conductor/tracks/license_cve_audit_20260607/spec.md

diff --git a/conductor/tracks/license_cve_audit_20260607/spec.md b/conductor/tracks/license_cve_audit_20260607/spec.md
new file mode 100644
index 00000000..f4b11500
--- /dev/null
+++ b/conductor/tracks/license_cve_audit_20260607/spec.md
@@ -0,0 +1,286 @@
+# Track: License & CVE Audit (Dependency Compliance)
+
+**Status:** Spec approved 2026-06-07
+**Initialized:** 2026-06-07
+**Owner:** Tier 2 Tech Lead
+**Priority:** High (compliance + security; CI gate)
+
+---
+
+## Overview
+
+Build `scripts/audit_license_cve.py` — a single audit script that checks third-party dependencies (in `pyproject.toml` + `uv.lock` transitive tree) for: (1) license compliance against the project's policy, (2) known CVEs (via `pip-audit` subprocess), and (3) version-pinning (every direct dep must carry a lower bound; the track's target is a `~X.Y.Z` pin, which PEP 440 spells `~=X.Y.Z`). The script also scans source-file license headers (`SPDX-License-Identifier`) in `src/**/*.py` and `scripts/**/*.py`. The track then applies the fixes: tilde-pin all direct deps, delete `requirements.txt` (redundant with `uv.lock`), regenerate `uv.lock`, add `--strict` mode + baseline file (CI gate). One script, one CI gate, one report.
+
+The track is **scope-limited to third-party dependencies**. The project's own LICENSE file and SPDX/Copyright headers are explicitly OUT OF SCOPE — the user reserves all rights to the repo and has not picked a project license yet. The audit reports third-party state only; it does not assert or imply a project license, and it does not create a `LICENSE` file.
+
+## Current State Audit (as of `9796fe27`)
+
+- `pyproject.toml` has 14 direct deps with **mixed pinning**:
+  - 7 unconstrained: `"imgui-bundle"`, `"anthropic"`, `"google-genai"`, `"openai"`, `"fastapi"`, `"mcp"`, `"uvicorn"`
+  - 7 with `>=X.Y.Z`: `"pyopengl>=3.1.10"`, `"tree-sitter>=0.25.2"`, `"tree-sitter-python>=0.25.0"`, `"tree-sitter-c>=0.23.2"`, `"tree-sitter-cpp>=0.23.2"`, `"psutil>=7.2.2"`, `"chromadb>=1.5.8"`
+  - `"tomli-w"`, `"pytest-timeout>=2.4.0"`
+- `uv.lock` exists; `requirements.txt` exists (duplicates lock — will be removed)
+- No `LICENSE` file in repo root (user's chosen posture: all rights reserved; the audit reports this as informational, not a violation)
+- No source-file `SPDX-License-Identifier` headers in `src/**/*.py` or `scripts/**/*.py` (informational note; not a violation — the user hasn't picked a project license yet)
+- No `vendor/`, `third_party/`, or vendored C/C++ in the repo tree (the scan is defensive for the future)
+- 0 existing license/CVE audit tools in `scripts/`
+- The 3 existing audit scripts (`audit_main_thread_imports.py`, `audit_weak_types.py`, `check_test_toml_paths.py`) follow the project pattern of `scripts/audit_.py` + `scripts/audit_.baseline.json` + `--strict` mode for CI gates (per `conductor/workflow.md` "Audit Script Policy"). The new track follows the same pattern.
+
+### Already Implemented (DO NOT re-implement; KEEP / build on)
+
+1. **The 3 existing audit scripts** in `scripts/`. They define the project pattern for audit + CI gate. The new `scripts/audit_license_cve.py` follows the same shape.
+2. **`uv.lock`** — the canonical lock file for the project. The audit reads it for transitive resolution.
+3. **`importlib.metadata`** (Python 3.11+ stdlib) — gives `License` and `License-Expression` per installed distribution. No new pip dep needed for the license check.
+4. **`tomllib`** (Python 3.11+ stdlib) — parses `pyproject.toml`. No new pip dep needed for the pin check.
+5. **`pip-audit`** (PyPA tool) — invoked as a subprocess for the CVE check. `pip-audit` itself is NOT a project dep; it's installed via `uv tool install pip-audit` or `uvx pip-audit` if the user wants the CVE check. The script detects missing `pip-audit` and logs a warning; license + pin checks still run.
+
+### Gaps to Fill (this track's scope)
+
+- `scripts/audit_license_cve.py` (~300 lines, 3 internal checks + 1 source-header scan + `--strict` + `--dump-baseline`)
+- `scripts/audit_license_cve.baseline.json` (zero-violation post-cleanup state for `--strict` mode)
+- `docs/reports/license_cve_audit/2026-06-07/initial.md` and `final.md` (the human-readable reports)
+- Updates to `pyproject.toml` (tilde-pin every direct dep)
+- Updated `uv.lock` (regenerated)
+- Deletion of `requirements.txt`
+- `tests/test_audit_license_cve.py` (TDD unit tests)
+
+## Goals
+
+1. **Single audit script** that runs all four checks (license + CVE + pin + source-header) and emits a unified report.
+2. **CI gate** via `--strict` mode + baseline file. Mirrors the 3 existing audit scripts. Fails on any new violation OR any new CVE.
+3. **Tilde-pin every direct dep** in `pyproject.toml` (`~X.Y.Z` = `>=X.Y.Z,//initial.md` or `final.md`.
+- **`--strict` mode:** exits non-zero if violations > baseline. For CI.
+- **`--dump-baseline`:** writes the current violation set as the new baseline. For intentional changes (e.g., a new dep is added; the user accepts its license).
+
+### Internal structure (3 checks + 1 scan)
+
+```python
+from __future__ import annotations  # lets the annotations reference Violation lazily
+
+import argparse
+import sys
+
+# Violation, write_markdown_report, load_baseline and dump_baseline are
+# defined elsewhere in the script.
+
+def check_licenses() -> list[Violation]: ...        # iterates dist.metadata; classifies
+def check_cves() -> list[Violation]: ...            # subprocess pip-audit; parses JSON
+def check_pins() -> list[Violation]: ...            # tomllib parse; flag missing/loose pins
+def check_source_headers() -> list[Violation]: ...  # pathlib rglob; SPDX regex
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--strict", action="store_true")
+    parser.add_argument("--dump-baseline", action="store_true")
+    args = parser.parse_args()
+
+    violations = []
+    for check in (check_licenses, check_cves, check_pins, check_source_headers):
+        violations.extend(check())
+    for v in violations:
+        print(v.format_stdout())  # parseable line-per-violation
+    write_markdown_report(violations)
+    if args.strict and len(violations) > len(load_baseline()):
+        sys.exit(1)
+    if args.dump_baseline:
+        dump_baseline(violations)
+```
+
+### Cost model (the 4 checks)
+
+| Check | Mechanism | New deps? |
+|-------|-----------|-----------|
+| **License** | `importlib.metadata.distribution(name).metadata.get("License")` + `License-Expression` (Python 3.11+ stdlib). For each direct + transitive dep, classify the license string against the policy table. Unknown / unparseable / missing → violation. | None (stdlib) |
+| **CVE** | Subprocess call to `pip-audit --format=json --strict` (a `uv tool install pip-audit` dev tool; the project itself doesn't depend on it). If `pip-audit` isn't installed, log a warning + skip the CVE check; license + pin still run. Air-gapped CI: CVE check returns no results (not a failure). | None in `pyproject.toml`; `pip-audit` is an optional dev tool. |
+| **Version pin** | `tomllib.load(pyproject.toml)` (stdlib). For each entry in `[project].dependencies`, check the version specifier. Flags: (a) no specifier at all, (b) no lower bound. Accepts any lower bound as a soft check (the user's choice is tilde, but the script doesn't enforce tilde specifically — it enforces "has a lower bound"). | None (stdlib) |
+| **Source header** | `pathlib.Path(src_dir).rglob("*.py")`, read first 20 lines of each, regex-look for `SPDX-License-Identifier:` (case-insensitive). If present and in the blocklist → violation. If no SPDX → no violation (informational note). | None (stdlib) |
+
+## License Policy (encoded in the script)
+
+### Allowlist (permissive or weak copyleft, import-safe in Python)
+
+- **Permissive:** MIT, BSD (2-clause + 3-clause), Apache 2.0, ISC, Unlicense, Zlib, Python-2.0, 0BSD, PSF-2.0
+- **Weak copyleft (import-safe in Python):** LGPL (2.1, 3.0), MPL-2.0
+- **Public domain:** CC0, WTFPL
+
+(The script's allowlist is the canonical source of truth for the per-license table; see `scripts/audit_license_cve.py` for the current list. New licenses can be added by editing that table; no spec change needed.)
+
+### Blocklist (non-permissive / restricted-source)
+
+The blocklist covers licenses that are either **strong or network copyleft** (GPL, AGPL) or **non-OSI, source-available** licenses that add use restrictions (SSPL, BSL / BUSL, Commons Clause, Elastic License v2), plus anything the script cannot classify. The unifying property: each imposes obligations or restrictions on downstream users that the allowlisted licenses do not.
+
+| License | Specific restriction |
+|---------|---------------------|
+| **GPL** (any version) | Strong copyleft; viral licensing; downstream users must release derivative works under GPL |
+| **AGPL** (any version) | Network copyleft; downstream SaaS users must release source under AGPL |
+| **SSPL** (MongoDB, 2018) | "If you offer the software as a service, you must release the entire stack under SSPL" — broad service-provider trigger |
+| **BSL / BUSL** (Business Source License) | Source-available with a delayed open-source conversion; competitive-use restriction during the delay |
+| **Commons Clause** | Addendum to an open-source license; adds "you may not sell the software" — targets SaaS reselling |
+| **Elastic License v2** (Elastic NV, 2021) | "You may not offer the software as a managed service that competes with Elastic" |
+| **Unknown / unparseable** (e.g., `UNKNOWN`, `Custom`, `see AUTHORS`) | Not classifiable; flagged for manual review; never auto-pass |
+| **Missing license metadata** | Catches packaging bugs |
+
+### Decision rule (in the script)
+
+```
+if license in BLOCKLIST: violation
+elif license in ALLOWLIST: pass
+else: # unknown / unparseable / unclassified
+    violation (flag for manual review; never auto-pass)
+```
+
+The two lists are explicit, not heuristic. Adding a new license to either list is a one-line code change. The script's `--help` references the policy table for transparency.
+
+## Output Format
+
+### Stdout (line-per-violation, parseable)
+
+```
+LICENSE_VIOLATION pkg=foo license="GPL-3.0" via=bar==2.0
+CVE_FOUND pkg=baz cve_id=CVE-2024-12345 severity=high fix_versions=">=1.2.3"
+PIN_MISSING pkg=qux (no version specifier in pyproject.toml)
+SPDX_VIOLATION file=src/some_module.py license="GPL-3.0"
+```
+
+Each line follows a stable, parseable format; CI can grep for `VIOLATION|FOUND|MISSING` and `exit 1` on any match.
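
The decision rule and the stdout line format above can be sketched together in a few lines of Python. This is a minimal illustration, not the script's actual code: the two sets are abbreviated, hypothetical stand-ins for the real policy tables, the helper names (`classify`, `format_license_violation`, `should_fail`) are invented for the example, and it assumes license strings have already been normalized to short identifiers, which real `License` metadata often is not.

```python
import re
from typing import Optional

# Hypothetical, abbreviated policy tables; the canonical lists live in
# scripts/audit_license_cve.py. Entries are upper-cased identifiers.
ALLOWLIST = {"MIT", "BSD-3-CLAUSE", "APACHE-2.0", "ISC", "MPL-2.0", "CC0-1.0"}
BLOCKLIST = {"GPL-2.0", "GPL-3.0", "AGPL-3.0", "SSPL-1.0", "BUSL-1.1"}

# The same alternation the spec suggests CI can grep for.
FAILURE_MARKERS = re.compile(r"VIOLATION|FOUND|MISSING")


def classify(license_id: Optional[str]) -> str:
    """Apply the decision rule: blocklist, then allowlist, else violation."""
    key = (license_id or "").strip().upper()
    if key in BLOCKLIST:
        return "violation"
    if key in ALLOWLIST:
        return "pass"
    return "violation"  # unknown / unparseable / missing: never auto-pass


def format_license_violation(pkg: str, license_id: Optional[str], via: str) -> str:
    """Render one LICENSE_VIOLATION line in the stdout format shown above."""
    shown = license_id if license_id else "UNKNOWN"
    return f'LICENSE_VIOLATION pkg={pkg} license="{shown}" via={via}'


def should_fail(stdout: str) -> bool:
    """Return True if any output line carries a failure marker."""
    return any(FAILURE_MARKERS.search(line) for line in stdout.splitlines())
```

Exact-match lookup keeps the "explicit, not heuristic" property of the two lists; anything that does not match a known identifier lands in the final branch and is reported for manual review.
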
+
+### Markdown report (in `docs/reports/license_cve_audit//`)
+
+- `initial.md` — the discovered violations (committed in Phase 1)
+- `final.md` — the post-cleanup state (committed in Phase 2, after tilde-pinning + lock regen)
+
+Structure:
+
+```markdown
+# License & CVE Audit — 2026-06-07
+
+## Top-level summary
+
+- License violations: 0
+- CVEs found: 0
+- Pinning issues: 0
+- SPDX violations in src/ or scripts/: 0
+
+## Notes
+
+- No `LICENSE` file in repo root — informational, not a violation. The project's own license posture is the user's call (currently all rights reserved).
+- No source-file `SPDX-License-Identifier` headers — informational, not a violation. The project's own copyright headers are the user's call.
+- pip-audit not installed → CVE check skipped. Install via `uv tool install pip-audit` to enable.
+
+## Per-violation table
+
+| Type | Package | License / CVE / Pin | Via |
+|------|---------|---------------------|-----|
+| ... | ... | ... | ... |
+```
+
+### Baseline file (`scripts/audit_license_cve.baseline.json`)
+
+Internal state for `--strict` mode. JSON because it matches the existing convention (`scripts/audit_weak_types.baseline.json`). Not the user-facing report; not in the output surface. Format:
+
+```json
+{
+  "schema_version": 1,
+  "baseline_violations": [],
+  "baseline_date": "2026-06-07",
+  "notes": "Zero-violation state after the tilde-pinning + lock regen in this track."
+}
+```
+
+`--strict` mode loads this file and fails CI if `len(current_violations) > len(baseline_violations)`. Because this compares counts only, a run in which one violation is fixed and a different one appears still passes. The user's intentional changes (e.g., adding a new dep with an acceptable license) are recorded by re-running with `--dump-baseline`.
+
+## Commit Structure (4 atomic commits, in order)
+
+```
+1. chore(audit): add license_cve audit script + initial report
+   - scripts/audit_license_cve.py (initial version, informational mode)
+   - docs/reports/license_cve_audit/2026-06-07/initial.md (the discovered violations)
+2. chore(deps): tilde-pin all deps; delete requirements.txt
+   - pyproject.toml (every direct dep gets ~X.Y.Z or stays as >=X.Y.Z)
+   - uv.lock (regenerated)
+   - requirements.txt (deleted; was redundant with lock)
+3. chore(audit): add --strict mode + baseline file (CI gate)
+   - scripts/audit_license_cve.py (extends with --strict + baseline diff)
+   - scripts/audit_license_cve.baseline.json (zero-violation post-cleanup state)
+4. conductor(tracks): mark License CVE Audit track complete
+   - tracks.md update
+```
+
+Each commit gets a summary note attached with `git notes add -m "..."`, per `conductor/workflow.md`.
+
+## Verification (TDD per `conductor/workflow.md`)
+
+Unit tests in `tests/test_audit_license_cve.py`:
+
+- License classifier: a known fixture package list with various licenses → correct classification (blocklist + allowlist + unknown).
+- Blocklist enforcement: each entry (GPL, AGPL, SSPL, BSL, BUSL, Commons Clause, Elastic v2, unknown, missing) → correctly flagged.
+- Allowlist enforcement: each entry (MIT, BSD, Apache 2.0, ISC, Unlicense, Zlib, Python-2.0, 0BSD, PSF-2.0, LGPL, MPL-2.0, CC0, WTFPL) → correctly passes.
+- Pin check: synthetic `pyproject.toml` with mixed pinning (no bound, `>=X.Y`, `~X.Y.Z`, exact) → correct flags.
+- Source header check: synthetic `.py` with `SPDX-License-Identifier: GPL-3.0` → flagged; with no SPDX → no violation.
+- `--strict` mode: violations > baseline → exit 1; violations == baseline → exit 0; new violation (delta > 0) → exit 1.
+- `--dump-baseline`: writes a baseline file matching the current violation set.
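
The `--strict` cases in the list above can be sketched with a set difference rather than a count comparison. This is an alternative to the `len(...) > len(...)` check the spec describes, not the spec's own method; it is shown because a set difference also catches a new violation that replaces a resolved one. The function names are hypothetical, and the sketch assumes `baseline_violations` stores the stdout lines as strings, which the spec does not state (its example baseline is an empty list).

```python
import json
from pathlib import Path


def load_baseline(path: Path) -> set:
    """Read baseline_violations from the JSON baseline file.

    Assumption: entries are the stdout lines, stored as strings.
    A missing file is treated as an empty baseline.
    """
    if not path.exists():
        return set()
    data = json.loads(path.read_text(encoding="utf-8"))
    return set(data.get("baseline_violations", []))


def new_violations(current, baseline):
    """Return violations present now but absent from the baseline, sorted."""
    return sorted(set(current) - baseline)


def strict_exit_code(current, baseline) -> int:
    """Exit code for --strict: 1 if anything new appeared, else 0."""
    return 1 if new_violations(current, baseline) else 0
```

With this shape, a run that swaps one known violation for a different one exits 1 even though the counts are equal, while a run that only removes violations exits 0.
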
+
+## Risks
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| Some packages' license metadata is missing or unparseable in `importlib.metadata` | High | Medium (false positives on unknown) | The policy treats `UNKNOWN` as violation → manual review catches the right answer; the report's notes section lists the unknowns explicitly |
+| `pip-audit` not installed in CI | Medium | Low (CVE check is a no-op) | Script detects missing `pip-audit` and logs a warning; license + pin checks still run |
+| Air-gapped CI can't reach OSV / PyPI advisory DBs | Medium | Low (CVE check returns no results) | Document; a follow-up could add offline CVE support, not in this track |
+| Pinning decisions are subjective (some deps deserve looser bounds than others) | Medium | Low (initial pass is conservative) | The pin check accepts any lower bound as a soft check; the user can loosen specific deps via the baseline file |
+| The baseline file becomes a "shadow ledger" — needs maintenance when intentional changes are made | Medium | Low (intentional) | Document the update workflow in the script's `--help`; `--dump-baseline` regenerates the baseline after an intentional change |
+| The project's own LICENSE absence might confuse a future contributor who doesn't know the user's posture | Low | Low | The report's notes section explicitly calls this out: "no LICENSE in repo root — informational, not a violation; project's own license is the user's call (currently all rights reserved)" |
+| A dep is added with a license that doesn't match the script's allowlist/blocklist (e.g., a new "BSL 2.0" variant) | Low | Low | The script's default rule (unknown = violation) catches it; the report's notes section surfaces it for review; one-line add to the appropriate list |
+
+## Follow-up
+
+- `air_gapped_cve_check_20260607` (NOT in this track): add offline CVE support for air-gapped CI environments that can't reach OSV / PyPI. The CVE check would ship a snapshot of the advisory DBs (or use a local mirror).
+- `cve_auto_remediation_20260607` (NOT in this track): when a CVE is found, auto-bump the dep to the fix version (within the pin range) and re-run the audit. Out of scope here; this track REPORTS, the user DECIDES.
+
+## Coordination with Pending Tracks
+
+This track has **no blockers** and **no conflicts** with the 5 active planned tracks. It modifies:
+
+- `pyproject.toml` (version pins; could affect resolution for any future track that depends on something)
+- `uv.lock` (regenerated; the lock file changes)
+- `requirements.txt` (deleted; was redundant with lock)
+- New: `scripts/audit_license_cve.py`, `scripts/audit_license_cve.baseline.json`, `tests/test_audit_license_cve.py`, `docs/reports/license_cve_audit/2026-06-07/`
+
+It does NOT modify any existing file in `src/` or `tests/` (it only adds `tests/test_audit_license_cve.py`), or any of the 5 planned tracks' files. The deleted `requirements.txt` is outside the 5 planned tracks' scope. The track can ship independently and in parallel with the 5 planned tracks.
+
+The tilde-pinning in this track is a STRENGTHENING of the dep contract, not a loosening — it doesn't break any existing test or any other track's plan.
+
+## Out of Scope
+
+- The project's own `LICENSE` file (user's decision; the track will not create one).
+- The project's own `SPDX-License-Identifier` / `Copyright` headers in `src/` (user's decision; the track will not add or modify).
+- Any recommendation on what license the user should pick for the project.
+- Patching CVEs in transitive deps (the track REPORTS; the user decides whether to wait for upstream or replace).
+- Auto-bumping versions to address CVEs (manual decision; the track reports, the user acts).
+- Modifying any third-party code already in the repo (none currently; the scan is defensive for the future).
+- License/header updates to vendored C/C++ (none currently vendored; the scan is defensive).
+- Separate handling for the local-rag optional dependency group (`sentence-transformers`): it is covered by the same audit, and its pinning happens in the same `pyproject.toml` edit.
+
+## See Also
+
+- `conductor/workflow.md` "Audit Script Policy" — the convention this track follows.
+- `scripts/audit_main_thread_imports.py`, `scripts/audit_weak_types.py`, `scripts/check_test_toml_paths.py` — the 3 existing audit scripts; the new track follows the same shape.
+- `scripts/audit_weak_types.baseline.json` — the baseline file pattern (the new `scripts/audit_license_cve.baseline.json` mirrors this).
+- [OSI Approved Licenses](https://opensource.org/licenses/) — the de facto list of "open source" licenses; the script's policy draws on this list but is not identical to it (GPL and AGPL are OSI-approved yet blocked here, while LGPL and MPL-2.0 are allowed as import-safe in Python).
+- `pip-audit` (PyPA) — the CVE-checking tool invoked as a subprocess. Optional; the script handles its absence gracefully.
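
The graceful handling of a missing `pip-audit` mentioned in the last entry could look roughly like the sketch below. The function names and the warning text are hypothetical; the only external behavior relied on is `shutil.which` for discovery and the `--format=json` and `--strict` flags that the spec itself passes to `pip-audit`. Parsing the JSON is left out because the spec does not describe the output schema.

```python
import shutil
import subprocess
import sys
from typing import Optional


def find_tool(name: str = "pip-audit") -> Optional[str]:
    """Return the full path of the tool if it is on PATH, else None."""
    return shutil.which(name)


def run_cve_scan(executable: Optional[str]) -> Optional[str]:
    """Run pip-audit and return its raw JSON stdout, or None if unavailable.

    A missing tool is a warning, not an error, so the license and pin
    checks can still run.
    """
    if executable is None:
        print("WARNING pip-audit not installed; CVE check skipped", file=sys.stderr)
        return None
    # check=False: a non-zero exit is not raised as an exception here;
    # the caller inspects the returned output instead.
    proc = subprocess.run(
        [executable, "--format=json", "--strict"],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout
```

A caller would pass `find_tool()` straight into `run_cve_scan()` and treat a `None` result as "CVE check skipped" in the report's notes section.
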