Category Archives: Data

On data mangement – in databases, through APIs and other data management issues.

Moving the site

This site (and my other site in Danish) have been hosted on a cheap shared hosting site a few years. As shared hosting platforms go, the service and features at GigaHost was quite reasonable, but their servers seemed continuously overloaded and the site had a few issues from time to time. I’ve been moving everything from the shared hosting platform to the smallest available VPS server at DigitalOcean.

Why the move?

  • Performance on shared hosting platforms never seems to amaze.
  • Limited set of features – no shell access, dummy selfcare interface, reasonable features – but limited.
  • Was dirt cheap when I moved in, but not as much – the VPS is actually priced lower.

How did I move the site?

The various parts of the move will probably be described in details in further posts on the site in the foreseeable future, but basically the steps included:

  • setting up an account on Digital Ocean and creating a droplet.
  • setting up a user acount, getting a firewall up and running, securing a few items.
  • installing a webserver and mysql.
  • moving the data from the shared hosting platform (databases and code) to the new webserver.
  • testing everything works by hacking the local hosts-file.
  • redirecting DNS to point to the new site.
  • deleting all stuff from the shared hosting platform once everything has been verified to work as expected.

What comes next…

Running my own server opens a lot of interesting new possibilities. I’m no longer running Apache (which was mandatory previously). Now I’m running nginx which seems much more light-weight.  I’m also running NewRelic which seems to provide amazing insights into how the server resources are utilized.

My first experiments on this server, has been focused on getting the old stuff up and running. You might notice, that the site is running somewhat faster (and I’m still tweaking things).

I expect to be able to use this server to experiment with node.js, ruby and other interesting stuff… and the Comunity help pages at Digital Ocean seems quite amazing.

 

Caution: Here be dragons!

Running your on server (virtual or real) is slightly more complicated than being just another guest on a shared hosting platform. While I do feel reasonable fit on a Linux platform (and run it as my daily desktop), I’ve been blessed with a hints and help from a friend throughout the process which made the move considerably faster (and the settings far more secure from the outset.

I’m sure I’ll run into some trouble along the way – I even managed to -amost – shut myself out of my virtual server once, as I only allowed for SSH access,  but seemed to have deleted all public keys needed on the server to allow my self to get back in.

Fetching the most recent entry from a log-table

Sometimes there’s a need to keep a simple log in a database. A common format could be a table with a layout like this:

log
- area (CHAR)
- lognotice (CHAR OR text)
- logtime (TIMESTAMP WHEN the event was logged).

Fetching

Fetching all log entries from a certain area is a simple matter of fetching by the area field, but when building a dashboard with the most recent entry from each area is slightly more complicated – the Query to fetch the data could typically look like this:

SELECT * FROM log log1
WHERE logtime = (
	SELECT MAX(logtime) 
	FROM log log2 
	WHERE log2.area = log.area)

Cleaning up

To keep things clean and tidy, I only sorte data from the past month, week or day (depending on the “log intensity”). To achieve this I usually do something like this:

DELETE FROM log WHERE logtime < ###time###

Mysql: display row count for all tables in a database

When playing the role of the DBA, it’s often useful to get a quick listing of how many rows each table in a database contains. The syntax for this is pretty simple in Mysql:

SELECT TABLE_NAME, table_rows 
FROM INFORMATION_SCHEMA.TABLES 
WHERE TABLE_SCHEMA = '***database name***';

Replace ***database name*** with the actual database name in the SQL above.

Notice that when using innodb tables, it’s only a rough estimate.

Database optimization thoughts

If you’re working on heavy duty websites, knowing your database and how to use it best can make a world of difference in terms of performance, and thus you should always optimize the database. That’s pretty much obvious.

The tricky part is how you do the optimization? Often it requires a lot of reading up on how the database works – strengths, weaknesses and other details — and loads of experience. Having a DBA available to help you optimize would be ideal in some cases, but often you need to do it yourself.

So, is there a “free lunch” recipe with guidelines to help you do the correct optimization? Well, no. All database optimizations are usually case-specific, and the optimizations which worked last time may not be applicable in the current case.

There are however some generic rules, which may help you go in the right direction.

  • Database optimization should start early. Think about performance when designing you database scheme – table layout and column types.
  • Consider the transaction types during the data life cycles. Are you primarily doing reads or writes? How many columns is expected?
  • Learn to use indexes – wisely. Too many indexes is just as bad as no indexes.
  • Try to benchmark various table and column layouts and see how they perform – sometimes you might be surprised and other times just confirm theories.

Substring magic with mysql

Mysql is a wonderful database, and while many use it, most people only scratch the surface of what the database can do. One of the practical functions available is the substring_index function, and an imaginary mailing list example is a nice way to show how to use it.

Let imagine we have a mailinglist in a table named “mailinglist” and it has a (char) column with the email addresses subscribed to the list. We now want to figure out how many users, that are subscribed from the various domains in the list.

Finding the domain name from an email address is quite simple – just find the @ sign – anything past that, will be the domain name and substring_index will do just that. To create our list of domains with the number of subscribers, we simply issue this query:

SELECT SUBSTRING_INDEX(email, ‘@’, -1) AS domain, count(*) as subscribed
FROM mailinglist
GROUP BY domain
ORDER BY subscribed;

Some email providers may use 3rd level domains (sales.example.com, tech.example.com, etc). What it we want to summarize the subscribers on the second level (example.com). No worries – substring_index will help us with that too. The query to do that looks like this:

SELECT  SUBSTRING_INDEX(SUBSTRING_INDEX(email, ‘.’, -2),’@’,-1) AS domain,
count(SUBSTRING_INDEX(SUBSTRING_INDEX(email, ‘.’, -2),’@’,-1)) AS subscribed
FROM  mailinglist
GROUP BY domain
ORDER BY subscribed;

While most developers may master simple queries in SQL, most databases have a library of functions – like substring_index – available and must too often they are ignored and hardly used at all.

If you want to be a better developer, learn to use the entire toolbox available – not just what you know already in Perl, in PHP or what ever you use to do your programming.

Mysql metadata

If you’re a developer and use mysql, I’m sure you’re aware that it’s a database and it quite good at storing data, but one of the neat things about Mysql (and most other databases) is also their ability to provide meta-data on the contents of the database.

Most people know how to use the meta-data queries in the commandline, but if you want you can also use them in your (php/perl/some-other- ) language. Here is a quick guide to some of them.

show databases

The show databases provide a list of all databases available in the datbase-server you’re accessing. It doesn’t tell you which of the databases, you’re allowed to access.

Once a database is selected, you can see a list of tables with the command:

show tables

And with either the ”desc tablename” or with the command

Show columns from tablename

(replace ”tablename” with an actual tablename from the database).

You can exclore which columns and column definition is available.

It’s probably rarely you need to use these functions unless you’re writing a phpmysqladmin replacement – often a script makes assumptions on which tables and columns exist.

If you’re developing an upgrade to an existing application/webbsite/script and the update requires database changes, you can use these functions to check if the database layout version is the one matching you application version needs. By doing this, you can provide much better feedback to the user on what’s wrong with the script, instead of just breaking horribly with database errors.

Should you use sql specific statements?

It seems there are two camps when it comes to SQL and how to do database optimizations – the “generic camp” and “the specialist camp”. While I don’t consider myself an extremist, I am absolutely in the specialist camp and this little post is an explanation of why.

SQL is a generic database langauge . There are a few different standards in use (the language has progressed over time), but the core of the SQL language is pretty much the standard in most databases. It’s probably also standard – in any database – that the SQL standard has been extended with database-specific extensions which provides optimizations, functions or other options not available in the SQL standard.

Using these database-specific extensions while developing your application ties your application/website to the specific database, and if you need to switch database at some point, your need to rewrite your applications SQL statements, so they aren’t tied to that specific database.

While this may be true I haven’t once during my ten years of web development, once had to switch database either during development nor during operations. I’m sure it happens in some cases, but I’m also sure that those cases are pretty rare, and if you need to go over your application and change the SQL statements, spending time on that is probably one of the easiest parts of a “technology switch” (ie. switching from Mysql to an Oracle cluster).

In most cases, using and utilizing database specific extensions can provide you with some easy optimizations and boost the performance significantly. While you probably can avoid using them, you’ll probably need to move the functions into you application or make more complex database queries. Optimization is usually an evolution, not a revolution. If your performance isn’t as expected, the first step is usually where are the bottlenecks, where can we optimize the current state of things – not switching database, not switching programming language.

Before you become a SQL purist, do make a calculated guess on what the “database switch probability” is. In most cases it’ll probably be less than 1%, and if this is the case, all common sense should tell you to use the tool available to the best of your ability, right?

Basics on tables in Mysql

Navigating around in mysql is quite easy. You can see which databases by using the command “show databases“. There will always exist a database called “mysql” which is used by mysql itself.

If your user account has access to a database, you can access it by using the “use DATABASENAME” command. You can se which tables exist within a database by using the “show tables” command – and see details for each table by issuing the “desc TABLENAME” command.

changing Tables

All alterations of database tables is done with the alter table command. The main challenge is use the command beyond just making it work. Here’s a few clues to get the most of it.

Let’s make a simple table as an example:

CREATE TABLE demotable (
id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY);

Adding a column

ALTER TABLE demotable ADD COLUMN label CHAR(20);

More…

Find more details in the Mysql manual.

Mysql: Random dice

Getting a random roll of the dice:

          CREATE TABLE dice (
            d_id int(11) NOT NULL auto_increment,
            roll int,
            PRIMARY KEY  (d_id)
          );

          insert into dice (roll) values (1);
          insert into dice (roll) values (2);
          insert into dice (roll) values (3);
          insert into dice (roll) values (4);
          insert into dice (roll) values (5);
          insert into dice (roll) values (6);

          select roll from dice order by rand() limit 1;