Crawl and save a website as PDF files

The web is constantly changing, and sometimes sites are deleted as the business or people behind them move on. Recently we removed a few sites while doing maintenance and updates on the many sites we run at work. Some of them had content that was interesting for personal or professional reasons, and we wanted to make a static copy of the sites before deleting them completely.

I have not found any simple, well-working software which can produce an all-inclusive downloaded copy of a website (including all resources sourced from CDNs and 3rd-party sites) that is actually browsable offline. As I needed to make the copy reasonably fast, I chose to capture the contents of the site (a text/article-heavy site) as PDFs.

My solution was to (try to) crawl all links on the site (to pages on the site) and feed all the URLs to a browser for rendering and generating a PDF. This is a rough overview of what it took.

Go seems an interesting language, and the Colly package was well suited for the job - it actually does most of the work. The script needed (which found 500+ pages on the site I crawled) looks something like this (Go 1.8):

package main

import (
    "fmt"
    "os"
    "github.com/gocolly/colly"
)

func main() {
    // Take the domain to crawl as the first and only command line argument
    website := os.Args[1]

    // Instantiate default collector, restricted to the given domain
    c := colly.NewCollector(
        colly.AllowedDomains(website),
    )

    // On every a element which has href attribute call callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println(r.URL.String())
    })
    c.Visit("https://" + website)
}
It assumes the site is running HTTPS, takes the domain name (an FQDN) as the first and only parameter, and the output should be piped into a file, which will then contain the complete list of URLs (one URL per line). Run the script without piping to a file to see the output on STDOUT and validate that it works as expected.

Printing a PDF from each URL on the site

The next step is to generate a PDF from a URL. There are a few different options to do this. My main criterion was to find something which could work as part of a batch job, as I had hundreds of URLs to visit and “PDF’ify”. Google Chrome supports doing the job - like this (from the shell):

google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://google.com/

This line should generate a PDF file called output.pdf of the Google.com front page.
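
If you would rather drive this step from Go as well, a minimal sketch along these lines should work. It simply shells out to the same google-chrome command with the same flags as above; the renderPDF helper and its names are just for illustration, and it assumes the google-chrome binary is on your PATH:

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// renderPDF shells out to headless Chrome and prints a single URL to a PDF file.
// It assumes a "google-chrome" binary is available on the PATH.
func renderPDF(url, outfile string) error {
    cmd := exec.Command("google-chrome",
        "--headless",
        "--disable-gpu",
        "--print-to-pdf="+outfile,
        url,
    )
    // Pass Chrome's output through so any errors are visible.
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    return cmd.Run()
}

func main() {
    if err := renderPDF("https://google.com/", "output.pdf"); err != nil {
        fmt.Fprintln(os.Stderr, "rendering failed:", err)
        os.Exit(1)
    }
}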

Putting it all together

So with the above two pieces in place, the rest is just about automating the job, which a small batch script was put together to do:

#!/bin/bash
# Crawl the site and collect the full list of URLs
go1.8 run crawler.go example.com > example.com.txt

# Print each URL to its own PDF, deriving the filename from the URL
for url in $(cat example.com.txt); do
	filename=${url//\//_}
	filename=${filename//\?/_}
	filename=${filename//:/_}
	google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"
done

This is a rough job. The filenames of the generated PDF files are based on the original URLs, but they are not pretty and could probably be made much nicer with a little tinkering. Still, with a few hours of playing around, I had a passable copy of the hundreds of pages on the website as individual PDFs.
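
As an example of that tinkering, here is a small, hypothetical sketch (in Go) of a friendlier naming scheme: it keeps the host and path readable and appends a short hash of the full URL to avoid collisions. The sanitizeFilename helper and the sample URL are my own illustration, not part of the original batch job:

package main

import (
    "crypto/sha1"
    "fmt"
    "net/url"
    "strings"
)

// sanitizeFilename turns a URL into a readable, file-system friendly name.
// The host and path are kept (with awkward characters turned into underscores)
// and a short hash of the full URL is appended so different URLs never collide.
func sanitizeFilename(rawurl string) string {
    u, err := url.Parse(rawurl)
    if err != nil {
        // Fall back to a pure hash if the URL cannot be parsed.
        sum := sha1.Sum([]byte(rawurl))
        return fmt.Sprintf("page_%x", sum[:8])
    }
    path := strings.Trim(u.Path, "/")
    if path == "" {
        path = "index"
    }
    path = strings.NewReplacer("/", "_", "?", "_", ":", "_").Replace(path)
    sum := sha1.Sum([]byte(rawurl))
    return fmt.Sprintf("%s_%s_%x", u.Host, path, sum[:4])
}

func main() {
    // Prints something like "example.com_blog_some-article_1a2b3c4d"
    fmt.Println(sanitizeFilename("https://example.com/blog/some-article?page=2"))
}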