Wayback Restore

A CLI tool that recovers archived websites from the Wayback Machine and transforms them into deployable static sites for Cloudflare Pages.

The Problem

When a domain expires or a website goes offline, the content doesn't have to disappear forever: the Internet Archive's Wayback Machine often has snapshots going back years. This matters if you're preserving content from a discontinued project, recovering a client's old site, or acquiring a domain that still has backlink value you want to maintain. But recovering that content manually is painful. You'd need to identify which snapshots exist, download hundreds of pages and assets individually, strip out the Wayback Machine's injected scripts, fix broken asset paths, and somehow prepare it all for modern hosting. It's the kind of task that sounds straightforward until you're three hours in, hand-editing HTML files.

The Solution

Wayback Restore automates the entire restoration process through a three-stage pipeline: setup, fetch, and build. An interactive setup wizard walks through the configuration (which date ranges to prioritize, what content to strip out, which URL patterns to exclude) and generates a project config file. The fetch stage queries the Wayback Machine's CDX API to discover all archived snapshots, then downloads the HTML pages and static assets with automatic retry logic and resumable progress tracking, so a flaky connection doesn't mean starting over. Finally, the build stage transforms the raw archives into a clean static site: it rewrites asset paths, strips tracking scripts like Google Analytics and Facebook Pixel, optionally disables or removes forms, and generates the configuration files Cloudflare Pages needs (cache headers, redirects, sitemaps, the works).
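As a rough illustration of the discovery half of the fetch stage, the sketch below builds a CDX query, keeps the newest capture per URL path, and derives the "raw" download URL for each pick. The function names (build_cdx_url, latest_per_path, raw_snapshot_url) and the choice to ignore query strings when grouping are assumptions made for this example, not the tool's actual code. The CDX parameters (url, output, filter, fl, from, to) and the id_ suffix for unmodified captures are standard Wayback Machine features.

```python
import re
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, from_date=None, to_date=None):
    """Build a CDX query for every successful capture under a domain."""
    params = [
        ("url", f"{domain}/*"),           # wildcard: everything under the domain
        ("output", "json"),               # rows come back as a JSON array of arrays
        ("filter", "statuscode:200"),     # skip redirects and error captures
        ("fl", "timestamp,original,statuscode"),
    ]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def latest_per_path(rows, exclude_patterns=()):
    """Pick the newest snapshot per URL path from parsed CDX JSON rows.

    The first row of CDX JSON output is the header naming each column.
    Query strings are ignored here, which is a simplification.
    """
    if not rows:
        return {}
    header, *records = rows
    excludes = [re.compile(p) for p in exclude_patterns]
    latest = {}
    for record in records:
        snap = dict(zip(header, record))
        path = urlsplit(snap["original"]).path or "/"
        if any(rx.search(path) for rx in excludes):
            continue
        # 14-digit timestamps (YYYYMMDDhhmmss) sort chronologically as strings.
        if path not in latest or snap["timestamp"] > latest[path]["timestamp"]:
            latest[path] = snap
    return latest


def raw_snapshot_url(snap):
    """URL of the capture as originally served (the id_ flag skips rewriting)."""
    return f"https://web.archive.org/web/{snap['timestamp']}id_/{snap['original']}"
```

In practice the URL from build_cdx_url would be fetched with an HTTP client and the JSON body passed to latest_per_path. The exclusion patterns here are matched against the path only.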

How It Works

  1. Discovery: The CDX API returns a list of all archived snapshots for the domain. The tool filters by HTTP 200 responses, applies exclusion patterns, and picks the latest snapshot for each unique URL path.
  2. Download with Resume: Each page and asset is fetched from the Wayback Machine using the "raw" snapshot format (avoiding their injected rewriting). A manifest file tracks progress, so interrupted downloads can resume where they left off.
  3. HTML Transformation: A multi-pass pipeline cleans each HTML file: rewriting relative asset paths, converting absolute domain links to relative ones, removing tracking scripts, and applying any custom regex patterns for site-specific cleanup.
  4. Static Site Output: The build stage copies everything to an output directory, generates Cloudflare-specific config files (_headers, _redirects, robots.txt, sitemap.xml), and tracks file hashes for incremental rebuilds.
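To make the HTML transformation step more concrete, here is a minimal sketch of two of its passes: dropping tracking scripts and turning absolute same-domain links into root-relative ones. It is regex-based for brevity, and all names in it are hypothetical. The three patterns shown are illustrative stand-ins, not the tool's actual built-in list, and a production pipeline might well use an HTML parser instead.

```python
import re

# Illustrative patterns only; a real list would be longer and configurable.
TRACKING_PATTERNS = [
    r"google-analytics\.com",
    r"googletagmanager\.com",
    r"connect\.facebook\.net",
]

# Matches a whole <script>...</script> element, inline or with a src attribute.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_tracking_scripts(html, patterns=TRACKING_PATTERNS):
    """Remove script elements whose tag or body matches any tracking pattern."""
    combined = re.compile("|".join(patterns), re.IGNORECASE)

    def drop_if_tracking(match):
        return "" if combined.search(match.group(0)) else match.group(0)

    return SCRIPT_RE.sub(drop_if_tracking, html)


def make_links_relative(html, domain):
    """Rewrite href/src values pointing at the site's own domain to root-relative."""
    pattern = re.compile(
        r"""((?:href|src)\s*=\s*["'])(?:https?:)?//(?:www\.)?"""
        + re.escape(domain)
        + r"""(?=[/"'?#])/?""",   # lookahead avoids matching example.com.other.org
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: m.group(1) + "/", html)


def clean_html(html, domain, custom_patterns=()):
    """Run the passes in order; custom regexes are simply deleted from the output."""
    html = strip_tracking_scripts(html)
    html = make_links_relative(html, domain)
    for custom in custom_patterns:
        html = re.sub(custom, "", html)
    return html
```

Running each pass as a separate function keeps them independently testable, which matters when one site needs a pass disabled or reordered.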

Tech Stack

  • Language: Python 3.9+
  • CLI Framework: Click with Rich for terminal output
  • HTTP Client: httpx with tenacity retry decorator
  • Configuration: YAML with dataclass schema validation
  • Interactive Prompts: questionary
  • Target Platform: Cloudflare Pages
  • External API: Wayback Machine CDX API
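As one possible reading of "YAML with dataclass schema validation", the sketch below validates a plain dict of the kind a YAML loader would return. The field names (domain, from_date, exclude_patterns, form_handling, and so on) are hypothetical and chosen to mirror the wizard options described earlier. The real config schema is not shown in this write-up, and YAML parsing itself is left out to keep the example dependency-free.

```python
import re
from dataclasses import dataclass, field, fields


@dataclass
class ProjectConfig:
    """Hypothetical project config; field names are assumptions for illustration."""
    domain: str
    from_date: str = ""
    to_date: str = ""
    exclude_patterns: list = field(default_factory=list)
    strip_tracking: bool = True
    form_handling: str = "keep"  # one of: keep, disable, remove

    def __post_init__(self):
        # Validation runs automatically right after the generated __init__.
        if not self.domain or "/" in self.domain:
            raise ValueError(f"domain must be a bare hostname, got {self.domain!r}")
        if self.form_handling not in ("keep", "disable", "remove"):
            raise ValueError(f"unknown form_handling: {self.form_handling!r}")
        for pattern in self.exclude_patterns:
            try:
                re.compile(pattern)
            except re.error as exc:
                raise ValueError(f"invalid exclude pattern {pattern!r}: {exc}") from exc


def config_from_dict(raw):
    """Build a ProjectConfig from a loaded mapping, rejecting unknown keys."""
    known = {f.name for f in fields(ProjectConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ProjectConfig(**raw)
```

Rejecting unknown keys up front catches typos in a hand-edited config file before a long fetch starts.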

Results

  • Restored multiple archived sites ranging from simple brochures to 500+ page content archives
  • Full restoration from discovery to deployment in hours, not days
  • Handles sites with thousands of archived pages through manifest-based resume
  • Incremental builds via MD5 change detection, so only modified files are reprocessed
  • 9 built-in tracking script patterns (GA, GTM, Facebook Pixel, Hotjar, Mixpanel, etc.) with support for custom regex
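The MD5-based incremental build mentioned above can be approximated as follows. This is a sketch under assumptions: the function names and the JSON hash file are made up for the example, and the hash file is assumed to live outside the source directory. MD5 is used here purely for change detection, not for security.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path):
    """Hash a file in chunks so large assets are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(source_dir, hash_file):
    """Return relative paths that are new or modified since the last recorded run.

    Side effect: rewrites hash_file with the current hashes.
    """
    source_dir, hash_file = Path(source_dir), Path(hash_file)
    previous = json.loads(hash_file.read_text()) if hash_file.exists() else {}
    current, changed = {}, []
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        key = path.relative_to(source_dir).as_posix()
        current[key] = md5_of(path)
        if previous.get(key) != current[key]:
            changed.append(key)
    hash_file.write_text(json.dumps(current, indent=2))
    return changed
```

A build step would then run its transformation passes only on the paths this returns. Deleted files are not reported by this version and would need separate handling.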