Crawl and save a website as PDF files

The web is constantly changing, and sometimes sites are deleted as the business or people behind them move on. Recently we removed a few sites while doing maintenance and updates on the many sites we run at work. Some of them had interesting content – for personal or professional reasons – and we wanted to make a static copy of the sites before deleting them completely.

I have not found any easy, simple and well-working software which can produce an all-inclusive downloaded copy of a website (including all resources sourced from CDNs and 3rd-party sites) to actually make it browsable offline. As I needed to make the copy reasonably fast, I chose to try to capture the contents of the site (a text/article-heavy site) as PDFs.

My solution was to (try to) crawl all links on the site (to pages on the site) and feed all the URLs to a browser for rendering and generating a PDF.
This is a rough overview of what it took.

Crawling the site, finding links

Go seems an interesting language, and the Colly package seemed well suited to the job – it actually does most of the work. The script needed (which found 500+ pages on the site I crawled) looks something like this (in Go 1.8):

package main
 
import (
    "fmt"
    "os"
    "github.com/gocolly/colly"
)
 
func main() {
    // The domain to crawl is passed as the first command-line argument
    website := os.Args[1]

    // Instantiate a collector restricted to that domain
    c := colly.NewCollector(
        colly.AllowedDomains(website),
    )
 
    // On every a element which has href attribute call callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })
 
    // Print every URL as it is requested
    c.OnRequest(func(r *colly.Request) {
        fmt.Println(r.URL.String())
    })
    c.Visit("https://" + website)
}

It assumes the site is running HTTPS and takes the domain name (an FQDN) as the first and only parameter. The output should be piped into a file, which will then contain the complete list of URLs (one URL per line). Run the script without piping to a file to see the output on STDOUT and validate that it works as expected.
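For example – assuming the script above is saved as crawler.go (the name used in the batch script later) and the site is example.com – the two run modes look like this:

# Preview the crawl on STDOUT first
go run crawler.go example.com

# Then capture the full URL list to a file
go run crawler.go example.com > example.com.txt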

Printing a PDF from each URL on the site

The next step is to generate a PDF from a URL. There are a few different options for doing this. My main criterion was to find something that could work as part of a batch job, as I had hundreds of URLs to visit and “PDF’ify”. Google Chrome supports doing the job – like this (from the shell):

	google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://google.com/

This line should generate a PDF file called output.pdf of the Google.com front page.

Putting it all together

So with the above two pieces in place, the rest is just about automating the job, for which a small batch script was put together:

#!/bin/bash
go1.8 run crawler.go example.com > example.com.txt

for url in $(cat example.com.txt); do
	# Turn the URL into a filesystem-safe filename
	filename=${url//\//_}
	filename=${filename//\?/_}
	filename=${filename//:/_}
	google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"
done

This is a rough job. The filenames of the generated PDF files are based on the original URLs – not pretty, and they could probably be made much nicer with a little tinkering – but after a few hours of playing around, I had a passable copy of the hundreds of pages on the website as individual PDFs.
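If prettier filenames are wanted, one approach – a small sketch, assuming tr is available and that squeezing anything unsafe into underscores is acceptable – is to strip the scheme and translate the rest in one go instead of chaining substitutions:

# Sketch: turn a URL into a readable, filesystem-safe name
url="https://example.com/some/page?x=1"
filename=$(printf '%s' "${url#https://}" | tr -cs 'A-Za-z0-9._-' '_')
google-chrome --headless --disable-gpu --print-to-pdf="${filename}.pdf" "$url"

For the URL above this would produce example.com_some_page_x_1.pdf.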

Linux – No space left on device, yet plenty of free space

My little server ran into an issue, and started reporting the error:

No space left on device

No worries, let’s figure out which disk is full and clean up…

Using the df command with -h (for human-readable output), it should be easy to find the issue:

root@server:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            483M     0  483M   0% /dev
tmpfs           100M  3.1M   97M   4% /run
/dev/vda         20G  9.3G  9.4G  50% /
tmpfs           500M     0  500M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           500M     0  500M   0% /sys/fs/cgroup
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           100M     0  100M   0% /run/user/1000

Strange. Notice how /dev/vda is only 50% filled, and all the other disk devices seem to be fine too. Well, after a little digging, thinking and googling, it turns out that device space consists of two things – space (for data) on the device and inodes (the structures used to manage the space – where the data goes – simplified).

So the next move was to look at the inodes:

root@server:~# df -i
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
udev            123562     375  123187    1% /dev
tmpfs           127991     460  127531    1% /run
/dev/vda       1310720 1308258    2462  100% /
tmpfs           127991       1  127990    1% /dev/shm
tmpfs           127991       3  127988    1% /run/lock
tmpfs           127991      18  127973    1% /sys/fs/cgroup
cgmfs           127991      14  127977    1% /run/cgmanager/fs
tmpfs           127991       4  127987    1% /run/user/1000

Bingo – no inodes left on /dev/vda. “Too many files in the file system” is the cause, and that’s why it can’t save any more data.

The cleanup

Now, I did not expect the server in this case to have a huge number of files, so something must be off.

Finding where the many files were took a little digging too, starting with this command:

du --inodes -d 1 / | sort -n

It lists how many inodes are consumed by each directory in the root.

The highest number was in /var, and the next step was:

du --inodes -d 1 /var | sort -n

I repeated this until I found the folder where an extreme number of files had piled up, and solved the issue(*).

*) It turned out to be PHP session files eating the inodes – something there is an easy solution for.
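One way to clean them up – a minimal sketch, assuming a Debian/Ubuntu-style setup where sessions live in /var/lib/php/sessions; the path and retention period should be checked against your own php.ini – is to delete the stale session files with find:

# Remove PHP session files untouched for more than two days
# (path and age are assumptions – adjust to your setup)
find /var/lib/php/sessions -type f -name 'sess_*' -mtime +2 -delete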

DNSSEC and switching nameservers

I switched nameservers for all my domains yesterday. For many years I’ve been free-riding on GratisDNS, enjoying their free DNS service (and luckily never needing support in their forums).

Yesterday I switched to Cloudflare, and I’m now using them for DNS for this (and other) domains. I don’t have any particular requirements, and the switch was mostly easy and automated to the extent possible. Two domains went smoothly, but the last one, my mahler.io domain, went astray for a few hours during the switch.

The issue was completely on me and required help from a friend to resolve. Most of my DNS records are completely basic, but I’ve tried to keep a current baseline and support CAA records and DNSSEC.

CAA does not matter when switching DNS servers, but DNSSEC does. As the name implies, DNSSEC is a DNS SECurity standard, and in this particular case the DNSSEC records did not only exist at GratisDNS, but also at NIC.io, my DNS registrar for my .io domain.

Only once DNSSEC was removed at GratisDNS – and at NIC.io – did the transfer go through, and everything was running smoothly on the Cloudflare DNS service.
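A quick way to check whether DS records are still published at the parent zone – a small sketch using dig, with mahler.io as the example domain – looks like this:

# Ask for the DS (Delegation Signer) records held by the parent .io zone
dig +short DS mahler.io

# An empty answer means DNSSEC has been unhooked at the registrar,
# so the nameservers can be switched without breaking validation.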