Crawl and save a website as PDF files
The web is constantly changing, and sometimes sites are deleted as the business or people behind them move on. Recently we removed a few sites while doing maintenance and updates on the many sites we run at work. Some of them had interesting content - for personal or professional reasons - and we wanted to make a static copy of the sites before deleting them completely.
I have not found any easy, simple and well-working software which can produce an all-inclusive downloaded copy of a website (including all resources sourced from CDNs and third-party sites) to actually make it browsable offline. As I needed to make the copy reasonably fast, I chose to try to capture the contents of the site (a text/article-heavy site) as PDFs.
My solution was to (try to) crawl all links on the site pointing to pages on the same site, and feed all the URLs to a browser for rendering and generating a PDF. This is a rough overview of what it took.
Crawling the site, finding links
Go seems like an interesting language, and the Colly package appeared well suited for the job - it actually does most of the work. The script needed (which found 500+ pages on the site I crawled) looks something like this (in Go 1.8):
package main

import (
    "fmt"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    // The domain to crawl is given as the first command-line argument
    website := os.Args[1]

    // Instantiate a collector restricted to that domain
    c := colly.NewCollector(
        colly.AllowedDomains(website),
    )

    // On every a element which has an href attribute, follow the link
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })

    // Print every URL as it is requested
    c.OnRequest(func(r *colly.Request) {
        fmt.Println(r.URL.String())
    })

    c.Visit("https://" + website)
}
Printing a PDF from each URL on the site
The next step is to generate a PDF from a URL. There are a few different options for doing this. My main criterion was to find something which could work as part of a batch job, as I had hundreds of URLs to visit and “PDF’ify”. Google Chrome can do the job - like this (from the shell):
google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://google.com/
This line should generate a PDF file called output.pdf of the Google.com front page.
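Since the crawler is already written in Go, the same Chrome invocation could also be driven from Go with os/exec instead of a shell loop. This is just a minimal sketch for illustration - it assumes the google-chrome binary is on the PATH, and the printToPDF helper is my own naming, not part of any library:

package main

import (
    "log"
    "os/exec"
)

// printToPDF asks headless Chrome to render the given URL and save it as a PDF.
// It assumes the google-chrome binary is available on the PATH.
func printToPDF(url, outfile string) error {
    cmd := exec.Command("google-chrome",
        "--headless",
        "--disable-gpu",
        "--print-to-pdf="+outfile,
        url,
    )
    return cmd.Run()
}

func main() {
    if err := printToPDF("https://google.com/", "output.pdf"); err != nil {
        log.Fatal(err)
    }
}

In the end I kept the Go program as a pure crawler and did the Chrome part from the shell instead, as shown below.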
Putting it all together
So with the above two pieces in place, the rest is just about automating the job, which a small batch script was put together to do:
#!/bin/bash
# Crawl the site and save the list of discovered URLs
go1.8 run crawler.go example.com > example.com.txt

# Turn each URL into a filename and print the page to a PDF
for url in $(cat example.com.txt); do
    filename=${url//\//_}
    filename=${filename//\?/_}
    filename=${filename//:/_}
    google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"
done
This is a rough job. The filenames of the generated PDF files are based on the original URLs; they are not pretty and could probably be made much nicer with a little tinkering (one possible approach is sketched below). Still, after a few hours of playing around, I had a passable copy of the hundreds of pages on the website as individual PDFs.
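As an example of such tinkering, here is a small, hypothetical Go helper (not part of the batch job above, and the pdfName function and example URL are mine) that derives a somewhat more readable filename from each URL by keeping the host and path and dropping the rest - a sketch only:

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// pdfName turns a URL into a reasonably readable, filesystem-safe PDF filename.
// A sketch only: it ignores query strings and may still collide for odd URLs.
func pdfName(rawurl string) string {
    u, err := url.Parse(rawurl)
    if err != nil {
        // Fall back to crude character replacement if the URL does not parse
        r := strings.NewReplacer("/", "_", ":", "_", "?", "_")
        return r.Replace(rawurl) + ".pdf"
    }
    path := strings.Trim(u.Path, "/")
    if path == "" {
        path = "index"
    }
    name := u.Host + "_" + strings.Replace(path, "/", "_", -1)
    return name + ".pdf"
}

func main() {
    fmt.Println(pdfName("https://example.com/articles/2018/some-story/"))
    // Prints: example.com_articles_2018_some-story.pdf
}

Something like this could be bolted onto the crawler so it emits the filename alongside each URL, but for a one-off rescue job the crude shell substitutions above were good enough.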