Web pages into auditable markdown.
The first d402 service fetches public pages, freezes snapshots, extracts clean markdown, rejects blocked pages, and returns evidence so validators and clients can reason about quality.
Capability family
| Capability | Boundary | Use |
|---|---|---|
| web.fetch.snapshot@1 | Network/browser snapshot. | Fetch a URL and persist a content-addressed snapshot. |
| web.extract.markdown@1 | Pure transform. | Convert an existing snapshot or inline HTML into canonical markdown without network access. |
| web.page_to_markdown@1 | Combined demo path. | Fetch and extract in one task for simple callers. |
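To make "content-addressed snapshot" concrete, here is a minimal sketch. It assumes the identifier is the SHA-256 digest of the raw snapshot bytes, rendered in the `sha256:<hex>` form seen in the sample output; the actual d402 CID scheme may differ. `snapshot_cid` and `store_snapshot` are hypothetical helper names, and the in-memory dict stands in for whatever storage the service really uses.

```python
import hashlib


def snapshot_cid(snapshot_bytes: bytes) -> str:
    """Derive a content address of the form 'sha256:<hex digest>'.

    Assumption: the address is the SHA-256 of the raw snapshot bytes.
    """
    return "sha256:" + hashlib.sha256(snapshot_bytes).hexdigest()


def store_snapshot(store: dict, snapshot_bytes: bytes) -> str:
    """Persist a snapshot under its content address and return that address.

    Storing the same bytes twice is a no-op, because identical content
    always maps to the same key.
    """
    cid = snapshot_cid(snapshot_bytes)
    store.setdefault(cid, snapshot_bytes)
    return cid


store = {}
cid = store_snapshot(store, b"<html><body>Example Domain</body></html>")
# A pure transform can later look the frozen bytes up by address,
# without touching the network.
frozen = store[cid]
```

Because the address depends only on the bytes, any worker or validator holding the snapshot can recompute it and confirm that the extraction ran against the same frozen input.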
Extractor
The markdown transform uses defuddle, the MIT-licensed content extraction engine used by Obsidian Clipper. Extraction runs with async/network fetch disabled by default, so every worker transforms the already-captured snapshot rather than calling site-specific APIs independently during comparison.
- Plain HTTP snapshotting is cheap and deterministic enough for simple pages.
- Playwright snapshotting improves coverage for JavaScript-rendered pages.
- Fallback from Playwright to HTTP fetch is explicit and recorded in evidence.
- Blocked pages, CAPTCHA pages, JavaScript interstitials, and paywall-like challenge pages should be rejected, not paid.
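The fallback and rejection rules above could be sketched as follows. Everything here is illustrative: the regex patterns, the status-code set, the `fallback` evidence field, and the function names are assumptions, not the service's actual detection rules. The two fetchers are passed in as plain callables so the sketch runs without a browser or network access.

```python
import re

# Hypothetical blocked-page heuristics; a real deployment would tune these.
BLOCK_PATTERNS = [
    ("captcha", re.compile(r"captcha|verify you are (a )?human", re.I)),
    ("js_interstitial", re.compile(r"enable javascript|checking your browser", re.I)),
    ("paywall", re.compile(r"subscribe to continue|sign in to read", re.I)),
]
BLOCK_STATUS = {401, 403, 429}  # assumed "access denied" style statuses


def assess_quality(status: int, text: str) -> dict:
    """Return a quality record shaped like evidence.quality in the output."""
    if status in BLOCK_STATUS:
        return {"blocked": True, "reason": f"http_{status}"}
    for reason, pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return {"blocked": True, "reason": reason}
    return {"blocked": False, "reason": None}


def snapshot_with_fallback(url, browser_fetch, http_fetch):
    """Try the browser snapshotter first; on failure, use plain HTTP
    and record the fallback explicitly instead of hiding it."""
    try:
        html = browser_fetch(url)
        return html, {"snapshotterVersion": "playwright", "fallback": None}
    except Exception as exc:
        html = http_fetch(url)
        evidence = {
            "snapshotterVersion": "http",
            "fallback": {"from": "playwright", "error": str(exc)},
        }
        return html, evidence
```

Pattern matching of this kind is only a heuristic: an article that discusses CAPTCHAs would trip the first rule, so a production check would likely combine several signals, such as word count and page structure, before rejecting a claim.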
Request and output
{
  "input": {
    "url": "https://example.com",
    "timeoutMs": 10000
  }
}

{
  "markdown": "# Example Domain\n\n...",
  "snapshotCid": "sha256:...",
  "contentHash": "sha256:...",
  "title": "Example Domain",
  "metadata": {
    "author": null,
    "description": "Example description",
    "site": "example.com",
    "published": null,
    "wordCount": 120
  },
  "evidence": {
    "converter": "defuddle",
    "snapshotterVersion": "playwright",
    "quality": {
      "blocked": false,
      "reason": null
    }
  }
}

Validation
- Markdown must exist and pass minimum quality thresholds.
- Claims must include content hash, snapshot CID, converter metadata, and snapshotter metadata.
- Known blocked-page patterns are rejected.
- In quorum mode, accepted claims must agree after markdown canonicalization.
- Workers that return invalid or blocked-page claims should lose reputation and may be slashed when bonds are enabled.
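A rough sketch of how these checks might fit together is shown below. The minimum word count, the canonicalization rules, and the "at least `quorum` matching claims" reading of quorum agreement are all assumptions made for illustration; the field names follow the sample output, and the function names are hypothetical.

```python
import re
from collections import Counter

REQUIRED_FIELDS = ("markdown", "snapshotCid", "contentHash")
REQUIRED_EVIDENCE = ("converter", "snapshotterVersion")
MIN_WORDS = 20  # hypothetical quality threshold


def canonicalize_markdown(md: str) -> str:
    """Normalize line endings, trailing whitespace, and blank-line runs
    so cosmetically different outputs compare equal."""
    text = md.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def validate_claim(claim: dict) -> list[str]:
    """Return a list of problems; an empty list means the claim passes."""
    errors = [f"missing {k}" for k in REQUIRED_FIELDS if not claim.get(k)]
    evidence = claim.get("evidence") or {}
    errors += [f"missing evidence.{k}" for k in REQUIRED_EVIDENCE if not evidence.get(k)]
    if (evidence.get("quality") or {}).get("blocked"):
        errors.append("blocked page")
    if len((claim.get("markdown") or "").split()) < MIN_WORDS:
        errors.append("markdown below minimum word count")
    return errors


def quorum_markdown(claims: list[dict], quorum: int):
    """Return the canonical markdown that at least `quorum` valid claims
    agree on, or None when no such agreement exists."""
    valid = [c for c in claims if not validate_claim(c)]
    counts = Counter(canonicalize_markdown(c["markdown"]) for c in valid)
    if not counts:
        return None
    text, votes = counts.most_common(1)[0]
    return text if votes >= quorum else None
```

Comparing canonical text rather than raw strings means two honest workers are not penalized for differences such as CRLF line endings or trailing blank lines, while a claim flagged as blocked is excluded before any votes are counted.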
Limitations
Browser rendering improves coverage, but it does not guarantee access to paywalled, bot-defended, geo-blocked, or personalized content. For production reliability, d402 should keep snapshotting and extraction as separate capabilities so validators compare markdown against frozen evidence.