Appendix A — Technical Details

This appendix provides more detail on each step of the pipeline described in the main proposal.

flowchart TD
    A["SEC EDGAR"] --> S1
    B["FCA NSM"] --> S1
    C["PDIP"] --> S1

    S1["Step 1 - Ingest: Source-specific API adapters discover and download filings"] --> S2

    S2["Step 2 - Parse: Docling for PDFs, native parsing for EDGAR HTML/JSONL"] --> S3

    S3["Step 3 - Ground truth: PDIP expert annotations define what each clause type looks like"] --> S4

    S4["Step 4 - Locate: Regex cue patterns + document structure heuristics find candidate sections"] --> S5

    S5["Step 5 - Extract + filter: Opus and Sonnet with multi-shot prompts extract clauses verbatim"] --> S6

    S6["Step 6 - Verify: Fuzzy string matching confirms extraction appears in source text"] --> S7

    S7["Step 7 - Lawyer review: THE MISSING PIECE"]

    style S1 fill:#e8f0fe,stroke:#4285f4,color:#1a1a1a
    style S2 fill:#e8f0fe,stroke:#4285f4,color:#1a1a1a
    style S3 fill:#e6f4ea,stroke:#34a853,color:#1a1a1a
    style S4 fill:#e6f4ea,stroke:#34a853,color:#1a1a1a
    style S5 fill:#fef7e0,stroke:#f9ab00,color:#1a1a1a
    style S6 fill:#fef7e0,stroke:#f9ab00,color:#1a1a1a
    style S7 fill:#fce8e6,stroke:#ea4335,color:#1a1a1a

Detailed pipeline with technical components at each step

Step 1 - Ingest

Each source has a dedicated adapter. EDGAR and NSM adapters query APIs scoped to sovereign issuers (using LEIs and issuer name patterns), then download PDFs. PDIP documents are already available. Currently batch-mode; can be made continuous to pick up new filings as they appear.

Source	Documents	Format
SEC EDGAR	~3,300	HTML filings with embedded prospectuses
FCA NSM	~650	PDF filings via UK regulatory API
PDIP	~820	PDF documents with expert annotations

Step 2 - Parse

Different sources need different parsers:

PDFs (NSM, PDIP): Parsed with Docling, which extracts text with document structure (headings, sections, tables) preserved as markdown.
EDGAR HTML/JSONL: Parsed natively. EDGAR filings are already structured text with page boundaries.

The output is a sequence of sections, each with a heading, body text, heading level, and source format. These sections are the unit of analysis for everything downstream.

Step 3 - Ground truth

PDIP’s expert-annotated contracts define what each clause type looks like. We organize search patterns into cue families: heading cues (e.g., “Collective Action”), voting threshold cues, aggregation cues, etc. A candidate section needs hits from multiple cue families to qualify. ICMA model clause language supplements the PDIP-derived patterns.

Step 4 - Locate candidates

Deterministic regex pattern matching against each section. A section qualifies as a candidate if it has:

A heading match (the section title contains clause-relevant language), OR
Body cue hits from 2+ distinct cue families (e.g., both “voting threshold” patterns and “aggregation” patterns)

Adjacent qualifying sections are clustered together. This step is fast, interpretable, and produces a wide net of candidates.

Step 5 - Extract and filter with LLMs

Claude Opus 4.6 and Sonnet 4.6 receive each candidate section with multi-shot prompts showing real examples from PDIP-annotated contracts. The models:

Extract the clause verbatim (copy exact text, no paraphrasing)
Assign a confidence level (high, medium, low)
Explain their reasoning
Flag “not found” when a section does not actually contain the clause

This step eliminates false positives that pattern matching alone cannot catch: table of contents entries, summaries, cross-references to clauses in other documents.

Step 6 - Verify

Every extraction is checked against the original document text using fuzzy string matching with a 95% similarity threshold. A windowed approach handles OCR noise and minor formatting differences. Extractions that fail verification are flagged.

This catches hallucination, paraphrasing, and OCR drift. It is a safety net, not a substitute for human review.

Step 7 - Lawyer review

This is the missing piece. The pipeline narrows 4,800+ documents down to 9,145 likely clause matches. But “likely” is not “correct.” Only legal experts can confirm clause boundaries and correctness.

What we can measure without lawyers: Recall against PDIP expert annotations (did we find what experts tagged?). Verbatim similarity to source text.

What we cannot measure without lawyers: Precision (false positive rate). Whether clause boundaries are correct. Whether the system missed clauses that experts would have found.

Stack

Python 3.12, DuckDB, Docling, Click CLI. MIT licensed. All code: GitHub