Web → Clean PDF/DOCX

Website URL

Start with depth=0 to test the start page quickly.

Max Depth (crawl)

0 = just the start page. 1 = also the pages it links to. 2 = also the pages those link to, and so on.

Max Pages

Upper limit for how many pages to include in the bundle.
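
How Max Depth and Max Pages interact can be sketched as a breadth-first crawl. This is only an illustration under stated assumptions, not the tool's actual crawler: `crawl_order`, `get_links`, and the in-memory `SITE` graph are hypothetical names standing in for real page fetching.

```python
from collections import deque

def crawl_order(start, get_links, max_depth=0, max_pages=50):
    """Return URLs in breadth-first visit order, bounded by depth and page count."""
    seen = {start}
    queue = deque([(start, 0)])
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue  # at the depth limit: keep the page, do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# Hypothetical link graph standing in for a real site.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/"],
    "/a1": [],
}

def get_links(url):
    return SITE.get(url, [])

print(crawl_order("/", get_links, max_depth=0))
print(crawl_order("/", get_links, max_depth=2, max_pages=3))
```

In this sketch the page limit stops the crawl even when the depth limit would allow more pages, which is why a small Max Pages is a safe way to test a deep crawl.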

Output Format

PDF uses WeasyPrint. DOCX uses Pandoc if installed (falls back to HTML).
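
The DOCX fallback could look roughly like the check below. This is a sketch that assumes Pandoc is detected by looking for a `pandoc` executable on the PATH; `resolve_format` is a hypothetical helper, and the real tool may detect Pandoc differently.

```python
import shutil

def resolve_format(requested, which=shutil.which):
    """Return the format that will actually be written.

    Assumption: DOCX needs a `pandoc` executable on PATH; without it,
    the output falls back to HTML. PDF and HTML are returned unchanged.
    """
    requested = requested.lower()
    if requested == "docx" and which("pandoc") is None:
        return "html"
    return requested

print(resolve_format("docx"))  # "docx" if pandoc is installed, otherwise "html"
```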

Same-domain only

Keep the crawl on the same registered domain. Usually keep this on.

Respect robots.txt

Best-effort. For personal use you may disable this; for a public service, leave it on.
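
A robots.txt check can be sketched with Python's standard `urllib.robotparser`. The helper names, the sample rules, and the `respect_robots` flag are illustrative assumptions; the tool's own check may differ.

```python
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def is_allowed(parser, url, user_agent="*", respect_robots=True):
    """Return True if the URL may be fetched; always True when the option is off."""
    if not respect_robots:
        return True
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: everything is allowed except /search.
SAMPLE = "User-agent: *\nDisallow: /search\n"
robots = build_robots(SAMPLE)

print(is_allowed(robots, "https://example.com/docs/intro"))
print(is_allowed(robots, "https://example.com/search"))
```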

Content CSS selector (optional)

If set, we extract only from this element (useful for the main column of a docs site). Otherwise we auto-detect.

Link scope selector (optional)

If set, we only follow links found inside this element (e.g., a left docs menu). Great for staying on-topic.

Allow URL patterns (regex, one per line)

Only URLs matching these patterns are crawled. Leave blank to let the preset pick.

Exclude URL patterns (regex, one per line)

URLs matching any pattern here are skipped (e.g., search pages, images).
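
One way the Allow and Exclude lists might combine is sketched below. Several points are assumptions rather than documented behavior: that exclude patterns win over allow patterns, that patterns match anywhere in the URL (`re.search`), and that an empty allow list permits everything (the real tool lets the preset pick instead). `compile_patterns` and `should_crawl` are hypothetical names.

```python
import re

def compile_patterns(text):
    """Compile one regex per non-blank line."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]

def should_crawl(url, allow, exclude):
    """Apply exclude patterns first, then require an allow match if any are given."""
    if any(p.search(url) for p in exclude):
        return False
    if not allow:
        return True  # assumption: a blank allow list does not restrict the crawl
    return any(p.search(url) for p in allow)

ALLOW = compile_patterns("/docs/\n/guide/")
EXCLUDE = compile_patterns("/search\n" + r"\.(png|jpe?g|gif)$")

print(should_crawl("https://example.com/docs/intro", ALLOW, EXCLUDE))
print(should_crawl("https://example.com/docs/logo.png", ALLOW, EXCLUDE))
```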

Title (used for filename)

We'll save as your-title.pdf / .docx / .html.
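
Turning a title into a filename such as your-title.pdf could be done along these lines. The exact rules (lowercasing, ASCII folding, hyphen separators, the `bundle` default) are assumptions for illustration, and `title_to_filename` is a hypothetical helper.

```python
import re
import unicodedata

def title_to_filename(title, ext, default="bundle"):
    """Build a filesystem-friendly name like 'your-title.pdf' from a free-form title."""
    # Fold accented characters to ASCII where possible and drop the rest.
    text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()
    return f"{slug or default}.{ext}"

print(title_to_filename("Your Title", "pdf"))
```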

💡 Pro Tips

For docs sites, try Preset: Docs and set Link Scope to the sidebar/toc.
Use Allow patterns to include only relevant sections.
Exclude search pages and images to avoid noise.
DOCX export needs Pandoc; otherwise we save clean HTML.
For large sites, keep Max Pages reasonable to avoid long processing times.
Navigation timeouts? Expand Navigation (Advanced) below and set Wait Strategy to "load" or set a Nav Timeout.

🧭 Navigation (Advanced)

Wait Strategy

Change only if pages hang at networkidle.

Nav Timeout (ms)

0 = Playwright default (30s). Increase for slow sites.

Max Retries

Only used if a Wait Strategy is set. Falls back to looser waits.
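
The retry behavior can be pictured as stepping down through progressively looser wait conditions. This sketch assumes the fallback order shown in `FALLBACK_ORDER` and that each retry moves one step down; both are guesses, not the tool's documented behavior. `goto` is expected to behave like Playwright's `page.goto`, which accepts `wait_until` and `timeout` keyword arguments.

```python
# Assumed order, strictest first. These are the values Playwright accepts for wait_until.
FALLBACK_ORDER = ["networkidle", "load", "domcontentloaded", "commit"]

def navigate_with_fallback(goto, url, wait="networkidle", timeout_ms=0, max_retries=2):
    """Try `wait`, then up to `max_retries` looser strategies; return the one that worked."""
    start = FALLBACK_ORDER.index(wait)
    strategies = FALLBACK_ORDER[start:start + 1 + max_retries]
    last_error = None
    for strategy in strategies:
        kwargs = {"wait_until": strategy}
        if timeout_ms > 0:
            # Omit the argument when 0 so the Playwright default applies;
            # passing timeout=0 to Playwright would disable the timeout.
            kwargs["timeout"] = timeout_ms
        try:
            goto(url, **kwargs)
            return strategy
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
    raise last_error
```

With Playwright this might be called as `navigate_with_fallback(page.goto, url, wait="networkidle", timeout_ms=60000)`, where `page` is an existing Playwright page object.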