Crawl and save a website as PDF files

The web is constantly changing, and sometimes sites are deleted as the business or people behind them move on. Recently we removed a few sites while doing maintenance and updates on the many sites we run at work. Some of them had interesting content – for personal or professional reasons – and we wanted to make a static copy before deleting the sites completely.

I have not found any easy, simple and well-working software which can produce an all-inclusive downloaded copy of a website (including all resources sourced from CDNs and 3rd party sites) to actually make it browsable offline. As I needed to make the copy reasonably fast, I chose to try to capture the contents of the site (a text/article-heavy site) as PDFs.

My solution was to (try to) crawl all links on the site (to pages on the site) and feed all the URLs to a browser for rendering and generating a PDF.
This is a rough overview of what it took.

Crawling the site, finding links

Go seems like an interesting language, and the Colly package seemed well suited to the job – it actually does most of the work. The script (which found 500+ pages on the site I crawled) looks something like this (Go 1.8):

package main
 
import (
    "fmt"
    "os"
    "github.com/gocolly/colly"
)
 
func main() {
    // The domain to crawl (a FQDN) is passed as the first argument
    website := os.Args[1]

    // Instantiate a collector restricted to the given domain
    c := colly.NewCollector(
        colly.AllowedDomains(website),
    )
 
    // On every a element which has href attribute call callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })
 
    // Print each URL as it is requested
    c.OnRequest(func(r *colly.Request) {
        fmt.Println(r.URL.String())
    })

    // Start the crawl at the front page (the site is assumed to run HTTPS)
    c.Visit("https://" + website)
}

It assumes the site is running HTTPS and takes the domain name (a FQDN) as the first and only parameter. The output should be piped into a file, which will then contain the complete list of URLs (one URL per line). Run the script without piping to a file to see the output on STDOUT and validate that it works as expected.
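
For example – assuming the script is saved as crawler.go, as it is in the batch job further down:

# Crawl the site and save the discovered URLs, one per line
go run crawler.go example.com > example.com.txt

# Quick sanity check of the result
wc -l example.com.txt
head example.com.txt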

Printing a PDF from each URL on the site

The next step is to generate a PDF from a URL. There are a few different options for doing this. My main criterion was to find something which could work as part of a batch job, as I had hundreds of URLs to visit and “PDF’ify”. Google Chrome can do the job – like this (from the shell):

	google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://google.com/

This line should generate a PDF file called output.pdf of the Google.com front page.

Putting it all together

With the above two pieces in place, the rest is just about automating the job, which a small batch script was put together to do:

#!/bin/bash
go1.8 run crawler.go example.com > example.com.txt

for url in $(cat example.com.txt); do
	# Build a filename from the URL: replace /, ? and : with _
	filename=${url//\//_}
	filename=${filename//\?/_}
	filename=${filename//:/_}
	google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"
done

This is a rough job. The filenames of the generated PDF files are based on the original URLs and are not pretty – they could probably be made much nicer with a little tinkering – but after a few hours of playing around, I had a passable copy of the hundreds of pages on the website as individual PDFs.
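
As an example of such tinkering – just a sketch, using a made-up URL – the filename could be reduced to the last path segment plus a short hash to keep the names unique:

# Derive a friendlier filename from the URL path plus a short hash
url="https://example.com/blog/some-article?page=2"
path=$(basename "${url%%\?*}")                   # strip the query string, keep the last path segment
hash=$(printf '%s' "$url" | md5sum | cut -c1-8)  # short hash to avoid name collisions
filename="${path:-index}-${hash}.pdf"
echo "$filename"                                 # prints something like some-article-<8 hex chars>.pdf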

Linux – No space left on device, yet plenty of free space

My little server ran into an issue, and started reporting the error:

No space left on device

No worries, let’s figure out which disk is full and clean up…

Using the df command with -h (for human-readable output), it should be easy to find the issue:

root@server:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            483M     0  483M   0% /dev
tmpfs           100M  3.1M   97M   4% /run
/dev/vda         20G  9.3G  9.4G  50% /
tmpfs           500M     0  500M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           500M     0  500M   0% /sys/fs/cgroup
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           100M     0  100M   0% /run/user/1000

Strange. Notice how /dev/vda is only 50% filled, and all the other devices seem to be fine too. After a little digging, thinking and googling, it turns out that device space consists of two things – space (for data) on the device and inodes (the bookkeeping structures that track where the data goes – simplified).

So the next move was to look at the inodes:

root@server:~# df -i
Filesystem     Inodes   IUsed  IFree IUse% Mounted on
udev           123562     375 123187    1% /dev
tmpfs          127991     460 127531    1% /run
/dev/vda      1310720 1308258   2462  100% /
tmpfs          127991       1 127990    1% /dev/shm
tmpfs          127991       3 127988    1% /run/lock
tmpfs          127991      18 127973    1% /sys/fs/cgroup
cgmfs          127991      14 127977    1% /run/cgmanager/fs
tmpfs          127991       4 127987    1% /run/user/1000

Bingo – no inodes left on /dev/vda. “Too many files in the file system” is the cause, and that’s why it can’t save any more data.

The cleanup

Now, I did not expect the server in this case to have a huge number of files, so something must be off.

Finding where the many files were took a little digging too, starting with this command:

du --inodes -d 1 / | sort -n

It lists how many inodes are consumed by each directory directly under the root.

The highest number was in /var, and the next step was:

du --inodes -d 1 /var | sort -n

I repeated this, drilling down, until I found the folder holding an extreme number of files and solved the issue (*).

*) It turned out to be PHP session files eating the inodes – which there is an easy solution for.
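
The easy solution, in my case, was simply deleting the stale session files. A sketch of that cleanup – the session directory is an assumption and differs between PHP versions and distributions, so check session.save_path in php.ini first:

SESSION_DIR=/var/lib/php/sessions   # assumed location – verify against session.save_path

# Count the session files first, then delete those untouched for more than a day
find "$SESSION_DIR" -type f -name 'sess_*' | wc -l
find "$SESSION_DIR" -type f -name 'sess_*' -mtime +1 -delete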

DNSSEC and switching nameservers

I switched nameservers for all my domains yesterday. For many years I’ve been free-riding on GratisDNS, enjoying their free DNS service (and luckily never needing support in their forums).

Yesterday I switched to Cloudflare, and I’m now using them for DNS for this (and other) domains. I don’t have any particular requirements, and the switch was mostly easy and automated to the extent possible. Two domains went smoothly, but the last one – my mahler.io domain – went astray for a few hours during the switch.

The issue was completely on me and required help from a friend to resolve. Most of my DNS records are completely basic, but I’ve tried to keep a current baseline and support CAA records and DNSSEC.

CAA does not matter when switching DNS servers, but DNSSEC does. As the name implies, DNSSEC is a DNS SECurity standard, and in this particular case the DNSSEC records existed not only at GratisDNS, but also at NIC.io – my registrar for the .io domain.

Only once the DNSSEC records were removed at GratisDNS – and at NIC.io – did the transfer go through, and everything ran smoothly on the Cloudflare DNS service.
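
To see where DNSSEC is (still) in play during such a switch, dig can show whether a DS record is published at the registry and whether the nameservers serve DNSKEY records – using my domain as the example:

# Is a DS record still published for the domain at the parent/registry?
dig DS mahler.io +short

# Do the current nameservers serve DNSKEY records?
dig DNSKEY mahler.io +short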

Updates…

It’s been quiet here for a while, but things have been happening behind the scenes. In case you’re wondering, the site (and its surroundings) has seen a number of updates, which may eventually make it into separate posts.

  • I’m running on a Digital Ocean droplet. It was provisioned as an Ubuntu 12.04 LTS, which is dead by now (as in no more updates, including security updates). The server has now been upgraded in place to Ubuntu 16.04 LTS.
  • As I was messing around with the server, I’ve added IPv6 support.
  • The DNS has been updated to have full support for DNSSEC.
  • My Let’s Encrypt certificates now have automated renewal, and I’ve added CAA support.
  • The Webserver has been switched from Apache to NGINX.
  • PHP has been switched from the 5.6 series to a modern 7.0.
  • I’m adopting full Git-backed backup of all server setup and configuration using Bitbucket.org. It’s not complete, but most config files have been added and are managed in Git.

That was the majority of the changes to the site and server over the past few months. With these updates in place, I might get back to producing content for the site.

DevOps: You build it; you run it… sort of

DevOps seems to be sweeping through IT departments these years, and for most developers it seems to be seen as a way of getting those pesky gatekeepers from Operations out of the way and shipping code whenever any developer feels like it.

The problem, however, is that in the eagerness to be a modern DevOps operation, the focus is often solely on the (short-term) benefit of faster releases that “DevOps” provides over “Dev to Ops”, and many developers seem to forget the virtues Operations (should) bring to the party.

From my observations, here are the top three failures when adopting DevOps:

  1. Too much focus on features, too little on the foundation. Often Operations makes sure that the operating system, libraries and other components used by the system are updated for security and end of life. As these tasks do not seem to provide “obvious value” for the users of the system, prioritizing them is a challenge (unless the developers find cool new features in the new version of a framework they want to use, naturally).
  2. Lack of monitoring tools. Making sure you don’t run out of system resources – be that disk space, memory or CPU – is boring. The same goes for customer support tools, diagnostic tools and other tools which forecast operational issues. As those tools belong to Operations, surely they can’t be important in DevOps, and they are often skipped or haphazard at best.
  3. No plan for handling incidents. Developers tend to move forwards and rarely lack confidence, so the plan for handling incidents and operational issues is usually made ad hoc when the issues occur. During daytime, when everyone is available, this may not be a significant issue, but during nights, weekends and holidays, finding the right developer who can help often causes the incident to last longer – and in some cases it gets even worse, if eager developers make changes in a part of the code they aren’t familiar with.

I firmly believe that DevOps is the right way to build and manage IT systems, but I also find that too many teams forget the Ops part and don’t incorporate the skills operations-minded people bring to IT, so the potential to build better systems through a DevOps setup is often not fully realized.

(This post originally appeared on LinkedIn)

Have your IT systems joined Social Media?

No, your servers should (probably) not have a Facebook profile, nor should your service bus have a Twitter profile, but as work tools change and evolve, you should probably consider moving the stream of status mails to the more modern “social media” used at work.

When you’re in DevOps you probably get a steady stream of emails from various systems checking in. It may be alert emails, health checks or “backup completed” emails. Getting these mails has become even more “fun” with the rise of unlimited mail storage and powerful email search tools, should you ever need to find something in the endless stream of server-generated mail.

As we’ve adopted new tools, the automatic messaging from the servers has more or less stayed the same – until recently. We’ve been drinking the Kool-Aid with Slack, Jira and other fancy tools, and recently we considered moving the IT systems along… and so we did.

Slack has a very nice API, and even on the free tier you can enable a “robot” (a robot.sh shell script, that is) to emit messages to the various Slack channels. We’ve integrated the slackbot into the various (automated) workflows in our Continuous Integration / Continuous Deployment pipeline, so that a release moving from one environment to the next – and finally into production – emits a message to a #devops channel. We’ve also made an #operations channel, and when our monitoring registers “alert events”, it emits messages onto that channel. Using this setup, anyone on the team can effectively and easily subscribe to push messages.
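
A minimal sketch of such a robot.sh – the webhook URL is a placeholder, and a Slack incoming webhook is just one way of doing it:

#!/bin/bash
# Post a message to a Slack channel via an incoming webhook (placeholder URL)
WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
MESSAGE="${1:-Release deployed to production}"

curl -s -X POST \
     -H 'Content-type: application/json' \
     --data "{\"text\": \"${MESSAGE}\"}" \
     "$WEBHOOK_URL"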

As a sanity measure and not to have “yet another mailbox”, we’ve tried to preach that everything on Slack should be considered ephemeral (that is short lived), and if anything worth remembering is said, it must be captured elsewhere.

Like many other companies, we use Jira to manage our software development. A task is tracked from idea to delivery and then closed and archived. As a Jira task is long-lived, we’ve also integrated the same CI/CD pipeline into Jira. Once the release train starts rolling, a ticket is created in Jira (by the scripting tools through the Jira API), updated as the release passes through the environments, and closed automatically when the solution has been deployed to production.
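
Creating such a ticket through the Jira REST API is essentially one authenticated POST – the URL, credentials, project key and summary below are all placeholders:

# Sketch: create a release ticket via the Jira REST API (v2)
curl -s -u "ci-bot:API_TOKEN" \
     -X POST \
     -H "Content-Type: application/json" \
     --data '{
       "fields": {
         "project":   { "key": "OPS" },
         "issuetype": { "name": "Task" },
         "summary":   "Release train rolling towards production"
       }
     }' \
     "https://jira.example.com/rest/api/2/issue"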

The ticket created in Jira contains a changelog generated from the Git commits included in the pull request, and if possible (assuming the commit messages are formatted correctly) it links to the Jira issues contained in the release (built from the pull request).
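
Generating that changelog is essentially a git log over the commits going into the release – the branch names here are made up:

# Commit subjects included in the release (assumed branch names)
git log --no-merges --pretty=format:'%h %s' master..release/next

# Jira issue keys mentioned in those commit messages (e.g. OPS-123)
git log --no-merges --pretty=format:'%s' master..release/next | grep -oE '[A-Z]+-[0-9]+' | sort -u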

The Jira tickets created by the release train are collected on a kanban board, where each environment/stage has a separate column, giving a quick overview of the complete state of releases (what is where right now).

A future move we’ve considered is whether we should have the servers blog. Assuming agile developers are able to write reasonable commit messages, readable (and comprehensible) by non-developers, it might be interesting to use a blogging tool such as WordPress to provide a historical log of releases.

As you adopt new tools for communication, remember to also think of automated communication, which may have a useful place in the new tools. New platforms often have readily available APIs which allow your IT platforms to provide – or receive – information much more efficiently than pagers, email or whatever was available when you set things up in times gone by.

(This post originally appeared on LinkedIn)

No more pets in IT

Remember the good old days, when IT got a new server? It was a special event – and naturally there was the naming game: finding that special name for the server which neatly fitted into the adopted naming scheme (be it Indian cities, Nordic mythology or cartoon characters).

That ought to be a thing of the past, but the ceremony still happens regularly in many IT departments, where servers are treated with the same affection as pets – with all the bad side effects that may bring along.

Remember the “superman” server – it must not die, superman will live on forever – no matter how much patching, maintenance and expensive parts replacement it needs, we will care for that special pet server… and we would be wrong to do so.

Modern IT should not be that cozy farm from the 1950s, but should find its reflection in modern farming.

From pets to cattle

On a modern farm the cattle aren’t named individually, and – harsh as it may seem – when one of the cows doesn’t perform, it is replaced, as the performance of the farm as a whole matters much more than the care and nurture of the individual animals… and rarely are the animals named, perhaps in recognition that they will be replaced and that the number of animals will be adjusted to align with the requirements of the farm.

Modern IT has all the technology to adopt the modern farm metaphor, and should do so as we move to virtual servers, containers, microservices and cloud-based infrastructure.

All these technologies (along with others) have enabled us to care much less about a specific server or service, and instead focus on what “server templates” are needed to support the services provided by IT – and manage the number of instances needed to meet the requirements posed to IT.

Hardware as Software – From care to control

As servers move from being a special gem to a commodity, and we may need 10, 50 or 100 small servers in the cloud instead of a single huge “enterprise” spaceship in the server room, a key challenge is our ability to manage and control them – and the tools to do that are also readily available.

Using tools like Chef, Puppet or Docker(files) – as the enabler for the server templates mentioned above – developers are able to describe a specific server configuration and use this template to produce as many identical copies as needed. Furthermore, as we move to managing a herd of servers, the server templates can easily be managed using the same version control software already used to manage your source code.
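
As an illustration of the template idea – a sketch using Docker with an assumed, pre-written Dockerfile – the template is built once and identical instances are stamped out as needed:

# Build the server template once from its Dockerfile...
docker build -t webserver-template:1.0 .

# ...and stamp out as many identical instances as required
docker run -d --name web1 webserver-template:1.0
docker run -d --name web2 webserver-template:1.0
docker run -d --name web3 webserver-template:1.0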

Using this template model, the developers take control of (and responsibility for) making sure the complete stack needed to run the service is coherent, and operations can make sure to size (and resize) the available resources as needed.

Finally, as we move to a “cattle” perception of servers, no one should ever need to log in to a specific server and make changes – everything needs to go through the configuration management tools, tracking all changes to the production environment. If a server starts acting up, kill that server and spin up a new one in your infrastructure.

(This post originally appeared on LinkedIn)

Ephemeral Feature Toggles

At its simplest, a feature toggle is a small if-statement which may be toggled without deploying code, used to enable features (new or changed functionality) in your software.
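
A minimal illustration – here as a shell sketch, since the idea is language-agnostic; the toggle name and the functions behind it are made up:

# The toggle value lives outside the code (environment, config store, admin UI)
# and can be flipped without deploying anything.
if [ "${FEATURE_NEW_CHECKOUT:-off}" = "on" ]; then
    render_new_checkout    # hypothetical new behaviour behind the toggle
else
    render_old_checkout    # existing behaviour, kept until the toggle is removed
fi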

In an agile setup most developers love having feature toggles available, as they often allow for a continuous delivery setup with very little friction and few obstacles. While this is true, it often seems developers forget to think of feature toggles as ephemeral and don’t realize what a terrible mess this can cause if they don’t remove the toggles once the feature is launched and part of the product.

Feature toggles are often an extremely lean way to introduce public features in software – short-circuiting the normal release ceremony, as the release has already happened before the launch of the feature, which often literally is a button press in an admin/back-end interface.

Feature toggles must be ephemeral when used to support Continuous Delivery.

Introducing a feature toggle in source code is often just adding an if-statement and toggling between two states. While a toggle ought to occur in a single place in the source code, it often shows up in views, models, libraries and/or templates – and thus (potentially) leads to a lot of clutter in the source code.

If feature toggles are allowed to live on, the next issue is often that they become nested or unintentionally crossed, as various functions and components in the code may depend on other features.

Feature toggles are a wonderful thing, but no toggle should have a lifespan beyond a couple of sprints – allowing for the wins of the feature toggles, yet avoiding long-term clutter and complications in the source code.

Other uses for feature toggles…

Feature toggles may also exist as hooks for operations or to define various classes of service for your users.

When used for operations, they may serve as hooks to disable especially heavy functionality during times of heavy load – Black Friday, Christmas or whenever it may be applicable.

When used for class of service, feature toggles can determine which features a basic and a premium user – or any other classes you may choose to have – have access to.

If you’re applying feature toggles as Operations hooks or class-of-service, don’t make them ephemeral, but have your developers mark them clearly in source code.

Baking Audiobooks with m4baker

Building audiobooks on (Debian) Linux in the m4b format is actually possible and doesn’t have to be a pain. I’ve found numerous recipes with shell instructions, but having a nice simple app to handle the building of the books seems much easier.

Most of the apps available for Linux seemed to be in a pre-alpha state, but after a few experiments I’ve settled on m4baker, which – while a bit rough – actually seems to do the job just fine.

Getting m4baker running on my Debian Testing took a few steps:

sudo apt-get install python-qt4
sudo apt-get install libcanberra-gtk-module
sudo apt-get install faac
sudo apt-get install libmp4v2-2
sudo apt-get install mp4v2-utils
sudo apt-get install sox
sudo apt-get install libsox-fmt-mp3

Once these steps have completed successfully the final step is getting m4baker installed and running:

  • Download the source from https://github.com/crabmanX/m4baker/releases
  • Unpack the file and from the unpacked directory run the install script:
    python setup.py install --optimize=1
    

This should have successfully installed M4Baker and all the required files and libraries to build m4b-audiobooks (suitable for iTunes and other m4b-supporting audio players).

You launch m4baker either through the (start) menu or simply with the m4Baker command from the shell.

m4Baker is an open source project available on GitHub.

 

Why DevOps works…

I’m digging through a backlog of podcasts, and the gem of the day goes to the SE-Radio podcast. In episode #247 they talk about DevOps, and while I’ve preached and practiced DevOps for years – mainly as common sense – the podcast presented a more reasoned argument for why it works.

Developers are praised and appreciated for a short time to market, the number of new features they introduce and the changes they make to the running system.

Operations is praised for stability and uptime, and thus change is bad; many changes are really bad.

DevOps fuses the two roles and responsibilities: the developers need to balance the business value new development may create against the risks it introduces, and to balance what is released into production and when.

If you’re into software engineering and DevOps-curious, give it a listen: SE-Radio #247.

(This post originally appeared on LinkedIn)

Pioneering the Internet….