d402: distributed 402
First capability family

Web pages into auditable markdown.

The first d402 service fetches public pages, freezes snapshots, extracts clean markdown, rejects blocked pages, and returns evidence so validators and clients can reason about quality.

Capability family

| Capability | Boundary | Use |
| --- | --- | --- |
| web.fetch.snapshot@1 | Network/browser snapshot. | Fetch a URL and persist a content-addressed snapshot. |
| web.extract.markdown@1 | Pure transform. | Convert an existing snapshot or inline HTML into canonical markdown without network access. |
| web.page_to_markdown@1 | Combined demo path. | Fetch and extract in one task for simple callers. |
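
The capability identifiers above follow a `name@version` shape. As a minimal sketch, a client or validator might split them and look up whether a capability is allowed to touch the network. The `CapabilityId` type, the `NEEDS_NETWORK` map, and both function names are hypothetical and only paraphrase the boundaries listed above; d402 may represent this differently.

```python
from typing import NamedTuple


class CapabilityId(NamedTuple):
    name: str
    version: int


def parse_capability(cap: str) -> CapabilityId:
    """Split an id such as 'web.fetch.snapshot@1' into name and version."""
    name, sep, version = cap.rpartition("@")
    if not sep or not name or not version.isdigit():
        raise ValueError(f"malformed capability id: {cap!r}")
    return CapabilityId(name, int(version))


# Assumption: network boundaries paraphrased from the capability list above.
NEEDS_NETWORK = {
    "web.fetch.snapshot": True,     # network/browser snapshot
    "web.extract.markdown": False,  # pure transform, no network access
    "web.page_to_markdown": True,   # combined path, includes a fetch
}


def needs_network(cap: str) -> bool:
    """Return True when the named capability is expected to use the network."""
    return NEEDS_NETWORK[parse_capability(cap).name]
```

Keeping the lookup keyed by name rather than by the full id means a future `@2` of the same capability would need an explicit decision about whether its boundary changed.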

Extractor

The markdown transform uses defuddle, the MIT-licensed content extraction engine used by Obsidian Clipper. Extraction runs with async/network fetch disabled by default, so workers transform the already-captured snapshot rather than each independently calling site-specific APIs during comparison.
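
To illustrate why a frozen snapshot makes extraction comparable across workers, here is a minimal sketch of a content-addressed store. It assumes the CID is the hex SHA-256 digest of the raw snapshot bytes with a `sha256:` prefix, which matches the shape of the sample output but is not confirmed by it. `SnapshotStore` is a hypothetical in-memory stand-in, not the d402 storage layer.

```python
import hashlib


def snapshot_cid(data: bytes) -> str:
    """Assumed CID scheme: 'sha256:' followed by the hex digest of the bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


class SnapshotStore:
    """Hypothetical in-memory store keyed by content address."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Identical bytes always map to the same CID, so re-storing is a no-op.
        cid = snapshot_cid(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        # Re-hash on read so a corrupted or swapped blob is detected.
        data = self._blobs[cid]
        if snapshot_cid(data) != cid:
            raise ValueError("snapshot bytes do not match CID")
        return data
```

Because every worker reads the same bytes by CID, differences in their markdown can be attributed to the extractor rather than to the site serving different responses.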

  • Plain HTTP snapshotting is cheap and deterministic enough for simple pages.
  • Playwright snapshotting improves coverage for JavaScript-rendered pages.
  • Fallback from Playwright to HTTP fetch is explicit and recorded in evidence.
  • Blocked pages, CAPTCHA pages, JavaScript interstitials, and paywall-like challenge pages should be rejected, and claims built on them should not be paid.
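
The last two points can be sketched as follows. This is an illustration under stated assumptions, not the d402 detector: the regular expressions in `BLOCK_PATTERNS`, the short-page threshold, and the `fallback` evidence field are all hypothetical, and the two fetchers are passed in as plain callables so the sketch does not depend on any Playwright or HTTP client API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical patterns; a real deployment would maintain a curated list.
BLOCK_PATTERNS = [
    ("captcha", re.compile(r"captcha|verify you are (a )?human", re.I)),
    ("js_interstitial", re.compile(r"enable javascript|checking your browser", re.I)),
    ("access_denied", re.compile(r"access denied|403 forbidden", re.I)),
    ("paywall", re.compile(r"subscribe to (continue|read)", re.I)),
]


@dataclass
class Quality:
    blocked: bool
    reason: Optional[str]


def assess_quality(text: str, short_page_words: int = 150) -> Quality:
    """Flag short pages that match a known challenge pattern."""
    # Assumption: challenge pages are short, so long articles that merely
    # mention a pattern such as "captcha" are not flagged.
    if len(text.split()) >= short_page_words:
        return Quality(False, None)
    for reason, pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return Quality(True, reason)
    return Quality(False, None)


def snapshot_with_fallback(
    url: str,
    browser_fetch: Callable[[str], str],
    http_fetch: Callable[[str], str],
) -> Tuple[str, dict]:
    """Try the browser snapshotter first and record any fallback in evidence."""
    try:
        return browser_fetch(url), {"snapshotterVersion": "playwright", "fallback": None}
    except Exception as exc:
        html = http_fetch(url)
        evidence = {
            "snapshotterVersion": "http",
            "fallback": {"from": "playwright", "error": str(exc)},
        }
        return html, evidence
```

Returning the evidence alongside the HTML keeps the fallback visible to validators instead of hiding it inside the worker.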

Request and output

request
{
  "input": {
    "url": "https://example.com",
    "timeoutMs": 10000
  }
}

output
{
  "markdown": "# Example Domain\n\n...",
  "snapshotCid": "sha256:...",
  "contentHash": "sha256:...",
  "title": "Example Domain",
  "metadata": {
    "author": null,
    "description": "Example description",
    "site": "example.com",
    "published": null,
    "wordCount": 120
  },
  "evidence": {
    "converter": "defuddle",
    "snapshotterVersion": "playwright",
    "quality": {
      "blocked": false,
      "reason": null
    }
  }
}
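
A worker would likely want to check the request shape shown above before fetching anything. The following is a minimal sketch: the default and maximum timeout values, the rule that only `http` and `https` URLs are accepted, and the function name `validate_request` are assumptions for illustration, since the sample only shows one request.

```python
from urllib.parse import urlsplit

DEFAULT_TIMEOUT_MS = 10_000  # assumption, mirrors the sample request
MAX_TIMEOUT_MS = 60_000      # assumption, hypothetical upper bound


def validate_request(request: dict) -> dict:
    """Return a normalized {'url', 'timeoutMs'} dict or raise ValueError."""
    payload = request.get("input")
    if not isinstance(payload, dict):
        raise ValueError("request must contain an 'input' object")

    url = payload.get("url")
    if not isinstance(url, str):
        raise ValueError("input.url must be a string")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("input.url must be an absolute http(s) URL")

    timeout_ms = payload.get("timeoutMs", DEFAULT_TIMEOUT_MS)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timeout_ms, bool) or not isinstance(timeout_ms, int):
        raise ValueError("input.timeoutMs must be an integer")
    if not 0 < timeout_ms <= MAX_TIMEOUT_MS:
        raise ValueError("input.timeoutMs is out of range")

    return {"url": url, "timeoutMs": timeout_ms}
```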

Validation

  • Markdown must exist and pass minimum quality thresholds.
  • Claims must include content hash, snapshot CID, converter metadata, and snapshotter metadata.
  • Known blocked-page patterns are rejected.
  • In quorum mode, accepted claims must agree after markdown canonicalization.
  • Workers that return invalid or blocked-page claims should lose reputation and may be slashed when bonds are enabled.
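
The checks above could be sketched as a structural check on each claim followed by a vote over canonicalized markdown. The canonicalization rules here (Unicode NFC, LF line endings, trailing whitespace stripped, runs of blank lines collapsed) and the hashing of the canonical text are assumptions; the actual d402 canonical form and quality thresholds are not specified in this page. Field names follow the sample output.

```python
import hashlib
import re
import unicodedata
from collections import Counter
from typing import List, Optional

REQUIRED_PATHS = [
    ("contentHash",),
    ("snapshotCid",),
    ("evidence", "converter"),
    ("evidence", "snapshotterVersion"),
]


def canonicalize(markdown: str) -> str:
    """Assumed canonical form; note it also drops trailing-space line breaks."""
    text = unicodedata.normalize("NFC", markdown)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def canonical_hash(markdown: str) -> str:
    digest = hashlib.sha256(canonicalize(markdown).encode("utf-8")).hexdigest()
    return "sha256:" + digest


def missing_fields(claim: dict) -> List[str]:
    """List required fields that are absent or empty in a claim."""
    missing = []
    for path in REQUIRED_PATHS:
        node = claim
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append(".".join(path))
    return missing


def quorum_winner(claims: List[dict], quorum: int) -> Optional[str]:
    """Return the canonical hash that at least `quorum` valid claims agree on."""
    valid = [
        c for c in claims
        if c.get("markdown")
        and not missing_fields(c)
        and not (c["evidence"].get("quality") or {}).get("blocked")
    ]
    counts = Counter(canonical_hash(c["markdown"]) for c in valid)
    if not counts:
        return None
    digest, votes = counts.most_common(1)[0]
    return digest if votes >= quorum else None
```

Voting on the hash of the canonical text, rather than on raw strings, lets workers that differ only in line endings or trailing whitespace still count as agreeing.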

Limitations

Browser rendering improves coverage, but it does not guarantee access to paywalled, bot-defended, geo-blocked, or personalized content. For production reliability, d402 should keep snapshotting and extraction as separate capabilities so validators compare markdown against frozen evidence.
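
Keeping the two capabilities separate means a validator can recheck a claim without touching the network. A minimal sketch of that recheck is below. It assumes both `snapshotCid` and `contentHash` are `sha256:`-prefixed hex digests, of the raw snapshot bytes and of the UTF-8 markdown respectively, and it takes the extractor as a plain callable. `verify_against_snapshot` and its return codes are hypothetical.

```python
import hashlib
from typing import Callable


def sha256_tag(data: bytes) -> str:
    """Assumed hash format: 'sha256:' followed by the hex digest."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_against_snapshot(
    claim: dict,
    snapshot: bytes,
    extract: Callable[[bytes], str],
) -> str:
    """Recompute both hashes from the frozen snapshot and compare to the claim."""
    if sha256_tag(snapshot) != claim["snapshotCid"]:
        return "snapshot_mismatch"
    # The extractor is expected to be a pure function of the snapshot bytes.
    markdown = extract(snapshot)
    if sha256_tag(markdown.encode("utf-8")) != claim["contentHash"]:
        return "markdown_mismatch"
    return "ok"
```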