Category Archives: Code

On Code – best practice, tips and other useful tricks

Crawl and save a website as PDF files

The web is constantly changing, and sometimes sites are deleted as the business or people behind them move on. Recently we removed a few sites while doing maintenance and updates on the many sites we run at work. Some of them had interesting content – for personal or professional reasons – and we wanted to make a static copy of the sites before deleting them completely.

I have not found any easy, simple and well-working software which can produce an all-inclusive downloaded copy of a website (including all resources sourced from CDNs and 3rd party sites) to actually make it browsable offline. As I needed to make the copy reasonably fast, I chose to try to capture the contents of the site (a text/article-heavy site) as PDFs.

My solution was to (try to) crawl all links on the site (to pages on the site) and feed all the URLs to a browser for rendering and generating a PDF.
This is a rough overview of what it took.

Crawling the site, finding links

Go seems like an interesting language, and the Colly package seemed well suited to help do the job – it actually does most of the work. The script I needed (which found 500+ pages on the site I crawled) looks something like this (in Go 1.8):

package main
 
import (
    "fmt"
    "os"
    "github.com/gocolly/colly"
)
 
func main() {
    // The domain to crawl is given as the first command line argument
    website := os.Args[1]

    // Instantiate default collector, restricted to the given domain
    c := colly.NewCollector(
        colly.AllowedDomains(website),
    )
 
    // On every a element which has href attribute call callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })
 
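    // Print every URL as it is requested - this output is what we capture in a file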
    c.OnRequest(func(r *colly.Request) {
        fmt.Println(r.URL.String())
    })
    c.Visit("https://" + website)
}

It assumes the site is running HTTPS and takes the domain name (an FQDN) as the first and only parameter. The output should be piped into a file, which will then contain the complete list of all URLs (one URL on every line). Run the script without piping to a file to see the output on STDOUT and validate that it works as expected.

Printing a PDF from each URL on the site

The next step is to generate a PDF from a URL. There are a few different options to do this. My main criterion was to find something which could work as part of a batch job, as I had hundreds of URLs to visit and “PDF’ify”. Google Chrome supports doing the job – like this (from the shell):

	google-chrome --headless --disable-gpu --print-to-pdf=output.pdf https://google.com/

This line should generate a PDF file called output.pdf of the Google.com front page.

Putting it all together

So with the above two pieces in place, the rest is just about automating the job, which a small batch script was put together to do:

#!/bin/bash
go1.8 run crawler.go example.com > example.com.txt
 
for url in $(cat example.com.txt); do
	# Build a filename from the URL by replacing /, ? and : with underscores
	filename=${url//\//_}
	filename=${filename//\?/_}
	filename=${filename//:/_}
	google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"
done

This is a rough job. The filenames of the generated PDF files are based on the original URLs; they are not pretty and could probably be made much nicer with a little tinkering, but after a few hours of playing around I had a passable copy of the hundreds of pages on the website as individual PDFs.
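
One possible tweak (a sketch, not something I used on the site) would be to strip the protocol before sanitizing, so a URL like https://example.com/about ends up as example.com_about.pdf:

	# Hypothetical variant of the renaming above: drop the protocol and collapse
	# anything that is not filename-friendly into underscores
	filename=$(echo "$url" | sed -e 's|^https\?://||' -e 's|[^A-Za-z0-9._-]|_|g')
	google-chrome --headless --disable-gpu --print-to-pdf="$filename.pdf" "$url"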

Linux – No space left on device, yet plenty of free space

My little server ran into an issue, and started reporting the error:

No space left on device

No worries, let’s figure out which disk is full and clean up…

Using the df command with the -h (for human-readable output) it should be easy to find the issue:

root@server:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 483M 0 483M 0% /dev
tmpfs 100M 3.1M 97M 4% /run
/dev/vda 20G 9.3G 9.4G 50% /
tmpfs 500M 0 500M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 500M 0 500M 0% /sys/fs/cgroup
cgmfs 100K 0 100K 0% /run/cgmanager/fs
tmpfs 100M 0 100M 0% /run/user/1000

Strange. Notice how /dev/vda is only 50% filled and all the other disk devices seem to be fine too. Well, after a little digging, thinking and googling, it turns out device space consists of two things – space (for data) on the device and inodes (the stuff used to manage the space – where the data goes – simplified).

So next move was to look at the inodes:

root@server:~# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
udev 123562 375 123187 1% /dev
tmpfs 127991 460 127531 1% /run
/dev/vda 1310720 1308258 2462 100% /
tmpfs 127991 1 127990 1% /dev/shm
tmpfs 127991 3 127988 1% /run/lock
tmpfs 127991 18 127973 1% /sys/fs/cgroup
cgmfs 127991 14 127977 1% /run/cgmanager/fs
tmpfs 127991 4 127987 1% /run/user/1000

Bingo – no inodes left on /dev/vda – “too many files in the file system” is the cause, and that’s why it can’t save any more data.

The cleanup

Now, I did not expect the server in this case to have a huge number of files, so something must be off.

Finding where the many files were took a little digging too. Starting with this command:

du --inodes -d 1 / | sort -n

It lists how many inodes are consumed by each directory in the root.

The highest number was in /var, and the next step was doing:

du --inodes -d 1 /var | sort -n

I continued down the directory tree this way until I found the folder where an extreme number of files had piled up, and solved the issue(*).

*) It turned out to be PHP session files eating the inodes – which there is an easy solution for.
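
A sketch of that kind of cleanup – assuming the sessions live in /var/lib/php/sessions (the path and the retention period vary between setups, so adjust before running):

# Delete PHP session files not modified for 24 hours (path and age are assumptions)
find /var/lib/php/sessions -type f -name "sess_*" -mmin +1440 -delete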

Ephemeral Feature Toggles

At its simplest, a feature toggle is a small if-statement which may be toggled without deploying code, used to enable features (new or changed functionality) in your software.
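
At the code level it can be as small as this sketch (shown as shell for brevity; the feature name, the environment-variable toggle source and the two functions are made up for illustration):

# Hypothetical toggle: the state lives outside the code (env var, config store, admin UI)
if [ "${FEATURE_NEW_CHECKOUT:-off}" = "on" ]; then
	show_new_checkout   # the new functionality behind the toggle
else
	show_old_checkout   # the existing behaviour
fi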

In an agile setup most developers love having feature toggles available, as they often allow for a continuous delivery setup with very little friction and few obstacles. While this is true, developers often seem to forget to think of feature toggles as ephemeral, and don’t realize what a terrible mess this can cause when the toggles aren’t removed once the feature is launched and part of the product.

Feature toggles are often an extremely lean way to introduce public features in software – short-circuiting the normal release ceremony, as it has already happened before the launch of the feature, which often literally is a button press in an admin/back-end interface.

Feature toggles must be ephemeral when used to support Continuous Delivery.

Introducing a feature toggle in source code is often just adding an if-statement and toggling between two states. While a toggle ought to occur in a single place in the source code, it may often appear in views, models, libraries and/or templates – and thus (potentially) leads to a lot of clutter in the source code.

If feature toggles are allowed to live on, the next issue is often that they become nested or unintentionally crossed, as various functions and components in the code may depend on other features.

Feature toggles are a wonderful thing, but no toggle should have a lifespan beyond two sprints – allowing for the wins of feature toggles, yet avoiding long-term clutter and complications in the source code.

Other uses for feature toggles…

Feature toggles may exist as hooks for operations or to define various classes of service for your users.

When used for operations, they may be used as hooks to disable especially heavy functionality during times of heavy load – Black Friday, Christmas or whenever it may be applicable.

When used for class-of-service, feature toggles can be used to determine which features a basic and a premium user – or any other classes you may choose to have – have access to.

If you’re applying feature toggles as operations hooks or class-of-service, don’t make them ephemeral, but have your developers mark them clearly in the source code.

How not to become the maintenance developer

As a developer, you always seem to strive towards producing ever more complicated code. Utilizing new frameworks, adopting an ever-evolving “convention over configuration”, pushing object-oriented programming – maybe Domain-Driven Design – are practices introduced, refined and explored in the quest to prove yourself as a steadily better developer with rising skills.


Yet to what point?

While the intricate complications may impress fellow developers, doing so often digs a hole which may be pretty hard to get out of. Every complication – no matter if it is in the design, the architecture or the structure of the code – often provides the opposite effect of the desired outcome. The whim of your current self to impress developers with the latest fashionable technique is a short-term reward for long-term pain.

I accept that some fashions and whims do change and solidify into better practice, but the urge to overuse the latest and greatest frameworks, techniques and paradigms often only leads to painful maintenance in years to come – and as you’ve complicated things beyond comprehension, you’ll probably be trapped in maintenance for the lifetime of the code.

Keep It Simple

As new development challenges often are much more fun than maintaining legacy code, there are a few basic things you can do to keep yourself from becoming the eventual eternal maintenance developer, and they are straightforward:

Keep it simple

When solving a problem, write as simple and as little code as needed. Don’t wrap everything in object hierarchies or split configuration, views, models and controllers into individual files – do it only when it provides clear and apparent value.

Names matter

Choose short but descriptive names for variables, functions, classes and objects in your code. An object containing content for an invoice should be named $invoice, not just $x, $y or $z. Naming artifacts in the code so they provide clues to content and functionality makes it much easier for anyone – including your future self – to comprehend and understand the code when doing maintenance.

Don’t fear comments

Comments do not slow down your code. All code is compiled at the latest at run-time, and a few well-placed comments may often be very valuable for the eventual maintainer. Don’t comment obvious code constructs (“don’t state the obvious”), but do reference business rules or explain why something tricky is going on.

Be consistent

Find a consistent way to solve the same task/pattern every time, as it will help you focus on what the code is doing instead of the syntax expressed. If, during development, you find a better way to do something, remember to go back and fix the existing instances where applicable.

Move on…

Every developer will eventually be stuck with some maintenance, but by making your code accessible and easy to understand, the odds of being stuck forever on maintenance duty are much lower. Great code moves on from its parent and has a life of its own in the care of other developers, if you’ve done your job right.

Bulk conversion of webp files to png format

Google has come up with a nice new image format called webp. Currently support for this format is fairly limited, so if you need to use webp images elsewhere it might be nice to convert them to a more widely supported format. To do the conversion, Google has made a small tool available called dwebp. The tool, however, only seems to support conversion of a single image at a time, not a batch of images.

Using regular command line magic it’s easy though. Download the tool, pair it with the find and xargs commands, and you should quickly be on your way. If all the webp files needing conversion to png are in a single directory, simply do this:

find . -name "*.webp" | xargs -I {} dwebp {} -o {}.png

It finds all webp files and converts them one by one. If the initial file name was image.webp, the resulting file will be called image.webp.png (as the command above doesn’t remove the .webp but only appends .png at the end).
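
If you would rather end up with image.png, a small loop can strip the old extension before appending the new one – a sketch along these lines:

find . -name "*.webp" | while read -r file; do
	# ${file%.webp} removes the trailing .webp before adding .png
	dwebp "$file" -o "${file%.webp}.png"
done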

The command assumes the dwebp program is available in your PATH. If this isn’t the case, you need to specify the complete path to the program.

Server setup: A user account

So, I’ve been moving the site to a VPS – a Virtual Private Server. A VPS is basically the same as a physical server, except you don’t have physical access to it. When you get your virtual server, it will most likely be set up with a basic disk image with an operating system and a root account. In my case at DigitalOcean I chose to set up an Ubuntu Linux image, and here are the first moves you should take after creating the VPS to get the basic security in place.

Setting up a user account

At DigitalOcean the server image is deployed, and once it’s ready you get a mail with the root password. Letting root log in over the internet is pretty bad practice, so the first thing you should do is log in (over SSH) and set up a new user. Create the new user with the adduser command and follow the instructions, then start visudo to grant your new user some special powers:

adduser newuser
visudo

In the visudo file you want to add a copy of an existing line. Find this line:

root    ALL=(ALL:ALL) ALL

… and make a copy of the line. Change “root” to your newly created login name to grant your new user the right to become root. Save and exit the file.

Check that you can become root from your new account: first switch to the new user with the command “su – newuser” (change newuser to your new username), then try to switch back to root by writing “sudo su -” and enter the password for your new user account (not the root password – and surely you didn’t use the same, right?). If this succeeds, enter “exit” twice to get back to the initial root shell. The new account is set up and has the rights to become root.
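
The check looks something like this, run from the initial root shell (with newuser replaced by your actual username):

su - newuser     # switch from root to the new account
sudo su -        # become root again, entering the new user's password
exit             # leave the root shell obtained through sudo
exit             # back in the original root shell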

Setting up SSH

The next step is preventing root from logging in from remote locations (we only want the newly created account from above to be able to log in remotely and then change to root if needed).

Setup the .ssh directory

Assuming you have an existing SSH key set, start by creating a “.ssh” directory in your new user’s home directory.
Add your public key to the directory (it’s probably called “id_rsa.pub”) and name the file “authorized_keys” – a sketch of the commands follows after the checklist below.

Make sure…

  • the .ssh directory and the file in it are owned by your newuser account (not root).
  • the directory is set to 0700 and the file to 0600 (using the chmod command).
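
A sketch of those steps, assuming the new account is called newuser and your public key has been copied to /tmp/id_rsa.pub (both are assumptions – adjust to your setup):

mkdir /home/newuser/.ssh
cp /tmp/id_rsa.pub /home/newuser/.ssh/authorized_keys
chown -R newuser:newuser /home/newuser/.ssh
chmod 0700 /home/newuser/.ssh
chmod 0600 /home/newuser/.ssh/authorized_keys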

You should now be able to login to the “newuser” account remotely using SSH.

Reconfiguring the SSH daemon

Assuming your new account is set up and able to log in remotely with SSH, the next step should be reconfiguring the SSH daemon to a more secure setup. Open the sshd configuration file with this command (as root):

vi /etc/ssh/sshd_config

The changes you should make are these two:

PasswordAuthentication no
PermitRootLogin no

The first ensures we only allow logins using public-key authentication – no password-only logins. The second denies root logins from remote. If we need root access, we must log in with the regular account and then change to root.

Once the changes are made, make sure they take effect by reloading the SSH daemon with this command (as root):

reload ssh

Once this is completed, please move on and set up a firewall.

The emergency hatch

Should you get into trouble and not be able to get back into your server using SSH, DigitalOcean offers an emergency hatch. If you log into the backend (where you created the VPS) there’s an option to get “console” access to your server. Using this console is as close as you can get to actually sitting with a console next to the machine, and could be the access you need to fix any misconfiguration or problem preventing you from getting in through regular SSH.

Moving the site

This site (and my other site in Danish) have been hosted on a cheap shared hosting platform for a few years. As shared hosting platforms go, the service and features at GigaHost were quite reasonable, but their servers seemed continuously overloaded and the site had a few issues from time to time. I’ve been moving everything from the shared hosting platform to the smallest available VPS server at DigitalOcean.

Why the move?

  • Performance on shared hosting platforms never seems to amaze.
  • Limited set of features – no shell access, dummy selfcare interface, reasonable features – but limited.
  • Was dirt cheap when I moved in, but not so much anymore – the VPS is actually priced lower.

How did I move the site?

The various parts of the move will probably be described in details in further posts on the site in the foreseeable future, but basically the steps included:

  • setting up an account on Digital Ocean and creating a droplet.
  • setting up a user account, getting a firewall up and running, securing a few items.
  • installing a webserver and mysql.
  • moving the data from the shared hosting platform (databases and code) to the new webserver.
  • testing everything works by hacking the local hosts-file.
  • redirecting DNS to point to the new site.
  • deleting all stuff from the shared hosting platform once everything has been verified to work as expected.

What comes next…

Running my own server opens a lot of interesting new possibilities. I’m no longer running Apache (which was mandatory previously). Now I’m running nginx, which seems much more lightweight. I’m also running NewRelic, which seems to provide amazing insights into how the server resources are utilized.

My first experiments on this server have been focused on getting the old stuff up and running. You might notice that the site is running somewhat faster (and I’m still tweaking things).

I expect to be able to use this server to experiment with node.js, ruby and other interesting stuff… and the Community help pages at Digital Ocean seem quite amazing.

 

Caution: Here be dragons!

Running your own server (virtual or real) is slightly more complicated than being just another guest on a shared hosting platform. While I do feel reasonably fit on a Linux platform (and run it as my daily desktop), I’ve been blessed with hints and help from a friend throughout the process, which made the move considerably faster (and the settings far more secure from the outset).

I’m sure I’ll run into some trouble along the way – I even managed to – almost – shut myself out of my virtual server once, as I only allowed for SSH key access, but seemed to have deleted all the public keys needed on the server to allow myself back in.

Fetching the most recent entry from a log-table

Sometimes there’s a need to keep a simple log in a database. A common format could be a table with a layout like this:

log
- area (CHAR)
- lognotice (CHAR or TEXT)
- logtime (TIMESTAMP, when the event was logged)
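
In concrete terms the table could be created along these lines – a sketch assuming MySQL, where the database name, credentials and column sizes are placeholders:

mysql -u someuser -p somedatabase <<'SQL'
CREATE TABLE log (
    area      VARCHAR(64) NOT NULL,
    lognotice TEXT,
    logtime   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
SQL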

Fetching

Fetching all log entries from a certain area is a simple matter of fetching by the area field, but building a dashboard with the most recent entry from each area is slightly more complicated – the query to fetch the data could typically look like this:

SELECT * FROM log log1
WHERE logtime = (
	SELECT MAX(logtime)
	FROM log log2
	WHERE log2.area = log1.area)

Cleaning up

To keep things clean and tidy, I only store data from the past month, week or day (depending on the “log intensity”). To achieve this I usually do something like this:

DELETE FROM log WHERE logtime < ###time###

Linux Mint: OpenSSH Daemon

I’m in the process of reinstalling my work desktop. One of the mandatory packages I install once the core system is up and running is an SSH daemon.
Setting it up (on Linux Mint, which I’m running) is pretty easy. To install the OpenSSH daemon, go to the shell and write:

sudo apt-get install openssh-server

It’s a fairly small install, so in a few seconds it ought to be up and running. The next step is editing the default config file and changing a few things.
Editing the config file is done by entering:

sudo vi /etc/ssh/sshd_config

The configuration parameters I usually edit are these:

PermitRootLogin no
#Banner /etc/issue.net
 
AllowUsers <username>

  • PermitRootLogin – The default option is yes, but frankly root should never be allowed to log in remotely unless absolutely needed.
  • Banner – Allows a custom message to be displayed at login (if needed).
  • AllowUsers – A space-separated list of users allowed to log in remotely.

Once the edits are done and saved, the OpenSSH daemon needs to be restarted, which is done by:

sudo service ssh restart

Trying and failing (twice)

PHP, like many other programming languages, has facilities to handle exceptions. Using them is pretty easy, but sometimes lazy programmers seem to misuse them to suppress error messages. A try/catch in PHP is usually constructed something like this:

try {
	// Something can go (horribly) wrong...
} catch (Exception $e) {
	// Handle the failure - log it, notify the user or rethrow
}

The lazy programmer may leave the catch empty, but frankly you should never do it. When you’re doing something – try’ing – it’s for a reason, and if it fails, someone quite possibly needs to know – the end user, a log file for the sysadmin or someone else. Never leave your catch empty, and if you really have a case where it’s applicable, at least leave a comment in the catch block explaining why it’s okay to do nothing.