
Crawl and save a website as PDF files

The web is constantly changing, and sometimes sites are deleted as the business or people behind them move on. Recently we removed a few sites while doing maintenance and updates on the many sites we run at work. Some of them had content that was interesting for personal or professional reasons, and we wanted to make a static copy before deleting them completely.

I have not found any easy, simple and well-working software which can produce an all-inclusive downloaded copy of a website (including all resources sourced from CDNs and 3rd party sites) to actually make it browsable offline. As I needed to make the copy reasonably fast, I chose to try to capture the contents of the site (a text/article-heavy site) as PDFs.

My solution was to (try to) crawl all links on the site (to pages on the site) and feed all the URLs to a browser for rendering and generating a PDF.
This is a rough overview of what it took.

Crawling the site, finding links

Go seems an interesting language, and the Colly package seemed well suited to help do the job – it actually does most of the work. The script, which found 500+ pages on the site I crawled, looks something like this (in Go 1.8):

package main
 
import (
    "fmt"
    "os"
    "github.com/gocolly/colly"
)
 
func main() {
    // The site to crawl is given as the first argument (a domain name)
    website := os.Args[1]
 
    // Instantiate a collector restricted to that domain
    c := colly.NewCollector(
        colly.AllowedDomains(website),
    )
 
    // On every a element which has an href attribute, follow the link
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })
 
    // Print every URL as it is requested
    c.OnRequest(func(r *colly.Request) {
        fmt.Println(r.URL.String())
    })
    c.Visit("https://" + website)
}

It assumes the site is running HTTPS and takes the domain name (an FQDN) as the first and only parameter. The output should be piped into a file, which will then contain the complete list of all URLs (one URL on every line). Run the script without piping to a file to see the output on STDOUT and validate that it works as expected.

Printing a PDF from each URL on the site

The next step is to generate a PDF from each URL. There are a few different options for doing this. My main criterion was to find something which could work as part of a batch job, as I had hundreds of URLs to visit and “PDF’ify”. Google Chrome supports doing the job – like this (from the shell):

	google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://google.com/

This line should generate a PDF file called output.pdf of the Google.com front page.

Putting it all together

So with the above two pieces in place, the rest is just about automating the job, which a small batch script was put together to do:

#!/bin/bash
go1.8 run crawler.go example.com > example.com.txt
 
for url in $(cat example.com.txt); do
	# Build a filesystem-friendly filename from the URL
	filename=${url//\//_}
	filename=${filename//\?/_}
	filename=${filename//:/_}
	google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"
done

This is a rough job. The filenames of the generated PDF files are based on the original URLs; they are not pretty and could probably be made much nicer with a little tinkering, but with a few hours of playing around I had a passable copy of the hundreds of pages on the website as individual PDFs.
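
If you want to do that tinkering, one option could be a small Go helper which derives a name from the URL path plus a short hash of the full URL, so query strings and similar pages don't collide. This is just a rough sketch of the idea, not part of the setup above: it reads the URL list on STDIN and prints filename/URL pairs a shell loop could consume, and the filenameFor helper and naming scheme are only an illustration.

package main
 
import (
    "bufio"
    "crypto/sha1"
    "fmt"
    "net/url"
    "os"
    "strings"
)
 
// filenameFor derives a readable, reasonably unique PDF filename from a URL:
// the path with slashes replaced by underscores, plus a short hash of the
// full URL so near-duplicate pages don't overwrite each other.
func filenameFor(raw string) string {
    sum := sha1.Sum([]byte(raw))
    u, err := url.Parse(raw)
    if err != nil {
        // Fall back to a pure hash name if the URL cannot be parsed
        return fmt.Sprintf("%x.pdf", sum[:8])
    }
    base := strings.Trim(u.Path, "/")
    base = strings.Replace(base, "/", "_", -1)
    if base == "" {
        base = "index"
    }
    return fmt.Sprintf("%s-%x.pdf", base, sum[:4])
}
 
func main() {
    // Read the URL list (one URL per line) from STDIN and print
    // "filename URL" pairs for a shell loop to consume.
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        line := scanner.Text()
        fmt.Printf("%s %s\n", filenameFor(line), line)
    }
}

Piping example.com.txt through a helper like this before the Chrome loop would keep the names short while still keeping them unique.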