Database optimization thoughts

If you’re working on heavy-duty websites, knowing your database and how to use it well can make a world of difference to performance, so optimizing the database should always be on your list. That’s pretty much obvious.

The tricky part is how to do the optimization. Often it requires a lot of reading up on how the database works – strengths, weaknesses and other details – and loads of experience. Having a DBA available to help you optimize would be ideal, but often you need to do it yourself.

So, is there a “free lunch” recipe with guidelines to help you do the correct optimization? Well, no. Database optimizations are usually case-specific, and the optimizations which worked last time may not be applicable in the current case.

There are, however, some generic rules which may help you go in the right direction.

  • Database optimization should start early. Think about performance when designing your database schema – table layout and column types.
  • Consider the transaction types during the data life cycle. Are you primarily doing reads or writes? How many columns are expected?
  • Learn to use indexes – wisely. Too many indexes can be just as bad as no indexes, since every index has to be updated on each write.
  • Try to benchmark various table and column layouts and see how they perform – sometimes you might be surprised and other times just confirm theories.
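
As a rough sketch of the index advice, EXPLAIN shows whether a query can use an index before and after you add one. The orders table, its columns and the index name below are made up for illustration, and the exact plan you see will depend on your data and MySQL version:

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE orders (
  o_id        INT NOT NULL AUTO_INCREMENT,
  customer_id INT NOT NULL,
  created     DATETIME NOT NULL,
  PRIMARY KEY (o_id)
);

-- Without an index on customer_id, the plan typically shows a full
-- table scan (type: ALL, key: NULL) once the table holds real data.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Add an index on the column used in the WHERE clause ...
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- ... and look at the plan again; the "key" column should now
-- name idx_orders_customer.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```

Running the same EXPLAIN against a copy of the table filled with realistic data is a cheap way to do the benchmarking mentioned in the last rule.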

Substring magic with MySQL

MySQL is a wonderful database, and while many use it, most people only scratch the surface of what it can do. One of the practical functions available is substring_index, and an imaginary mailing list is a nice way to show how to use it.

Let’s imagine we have a mailing list in a table named “mailinglist”, with a (char) column “email” holding the addresses subscribed to the list. We now want to figure out how many users are subscribed from each of the domains in the list.

Finding the domain name in an email address is quite simple – just find the @ sign; anything past it is the domain name, and substring_index will do just that. To create our list of domains with the number of subscribers, we simply issue this query:

SELECT SUBSTRING_INDEX(email, '@', -1) AS domain, COUNT(*) AS subscribed
FROM mailinglist
GROUP BY domain
ORDER BY subscribed;

Some email providers use third-level domains. What if we want to summarize the subscribers by second-level domain instead? No worries – substring_index will help us with that too. The query to do that looks like this:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(email, '.', -2), '@', -1) AS domain,
       COUNT(SUBSTRING_INDEX(SUBSTRING_INDEX(email, '.', -2), '@', -1)) AS subscribed
FROM  mailinglist
GROUP BY domain
ORDER BY subscribed;
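
To see what the two calls do, it can help to run substring_index on string literals first. The addresses below are made-up examples, not data from the mailing list table:

```sql
-- A positive count keeps everything before the Nth delimiter (counted
-- from the left); a negative count keeps everything after the Nth
-- delimiter counted from the right.
SELECT SUBSTRING_INDEX('user@mail.example.com', '@', 1);   -- 'user'
SELECT SUBSTRING_INDEX('user@mail.example.com', '@', -1);  -- 'mail.example.com'
SELECT SUBSTRING_INDEX('user@mail.example.com', '.', -2);  -- 'example.com'

-- With fewer delimiters than requested the whole string comes back,
-- which is why the outer call on '@' is still needed:
SELECT SUBSTRING_INDEX('user@example.com', '.', -2);       -- 'user@example.com'
SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX('user@example.com', '.', -2),
         '@', -1);                                         -- 'example.com'
```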

While most developers master simple SQL queries, most databases also have a library of functions – like substring_index – and much too often they are ignored and hardly used at all.

If you want to be a better developer, learn to use the entire toolbox available – not just what you already know in Perl, PHP or whatever you use to do your programming.

MySQL metadata

If you’re a developer and use MySQL, I’m sure you’re aware that it’s a database and that it’s quite good at storing data, but one of the neat things about MySQL (and most other databases) is its ability to provide meta-data on the contents of the database.

Most people know how to use the meta-data queries on the command line, but you can also use them from your programming language (PHP, Perl or something else). Here is a quick guide to some of them.

show databases

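The show databases command lists the databases on the database server you’re accessing. It doesn’t tell you what you’re allowed to do in each of them; on current MySQL versions the list itself is limited to databases you hold some privilege on, unless you have the global SHOW DATABASES privilege.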

Once a database is selected, you can see a list of tables with the command:

show tables

The columns and column definitions of a table can be explored with either “desc tablename” or the command

show columns from tablename

(replace “tablename” with an actual table name from the database).

You probably rarely need these commands unless you’re writing a phpMyAdmin replacement – most scripts simply assume which tables and columns exist.

If you’re developing an upgrade to an existing application/website/script and the update requires database changes, you can use these commands to check whether the database layout matches what your application version needs. By doing this, you can give the user much better feedback on what’s wrong, instead of just breaking horribly with database errors.
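
A minimal sketch of such a check, assuming the upgrade added a column named “confirmed” to the mailinglist table (the column name is hypothetical; substitute whatever your upgrade introduces):

```sql
-- Returns one row if the column exists, an empty result otherwise.
SHOW COLUMNS FROM mailinglist LIKE 'confirmed';

-- The same check through information_schema (MySQL 5.0 and later),
-- which is easier to consume from application code: 1 = present, 0 = missing.
SELECT COUNT(*) AS has_column
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'mailinglist'
  AND COLUMN_NAME  = 'confirmed';
```

If the check fails, the script can stop with a message such as “database layout is out of date, please run the upgrade” rather than failing on the first query that touches the missing column.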

Should you use database-specific SQL statements?

It seems there are two camps when it comes to SQL and how to do database optimizations – the “generic camp” and “the specialist camp”. While I don’t consider myself an extremist, I am absolutely in the specialist camp and this little post is an explanation of why.

SQL is a generic database language. There are a few different standards in use (the language has progressed over time), but the core of the SQL language is pretty much the standard in most databases. It’s probably also standard – in any database – that the SQL standard has been extended with database-specific extensions which provide optimizations, functions or other options not available in the SQL standard.

Using these database-specific extensions while developing your application ties your application/website to the specific database, and if you need to switch database at some point, you need to rewrite your application’s SQL statements so they are no longer tied to that specific database.

While this may be true, I haven’t once during my ten years of web development had to switch database, either during development or in operations. I’m sure it happens in some cases, but I’m also sure that those cases are pretty rare, and if you need to go over your application and change the SQL statements, spending time on that is probably one of the easiest parts of a “technology switch” (e.g. switching from MySQL to an Oracle cluster).

In most cases, using database-specific extensions can provide you with some easy optimizations and boost performance significantly. While you can probably avoid using them, you’ll then need to move the functionality into your application or write more complex database queries. Optimization is usually an evolution, not a revolution. If your performance isn’t as expected, the first step is usually to ask where the bottlenecks are and where the current setup can be optimized – not to switch database or programming language.
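
One example of such an extension is MySQL’s INSERT ... ON DUPLICATE KEY UPDATE. The sketch below uses a made-up page_hits counter table to show the idea; the table and column names are assumptions for illustration only:

```sql
-- Hypothetical hit counter table.
CREATE TABLE page_hits (
  url  VARCHAR(100) NOT NULL,
  hits INT NOT NULL DEFAULT 0,
  PRIMARY KEY (url)
);

-- MySQL-specific: insert the row, or bump the counter if a row
-- with the same primary key already exists. One statement, one round trip.
INSERT INTO page_hits (url, hits) VALUES ('/index.html', 1)
  ON DUPLICATE KEY UPDATE hits = hits + 1;
```

Without the extension, the application would typically have to try an UPDATE first, check how many rows it affected, and fall back to an INSERT, which means more code and more round trips to the database.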

Before you become a SQL purist, do make a calculated guess on what the “database switch probability” is. In most cases it’ll probably be less than 1%, and if this is the case, all common sense should tell you to use the tools available to the best of your ability, right?

MySQL: Random dice

Getting a random roll of the dice:

          CREATE TABLE dice (
            d_id int(11) NOT NULL auto_increment,
            roll int,
            PRIMARY KEY  (d_id)
          );

          insert into dice (roll) values (1);
          insert into dice (roll) values (2);
          insert into dice (roll) values (3);
          insert into dice (roll) values (4);
          insert into dice (roll) values (5);
          insert into dice (roll) values (6);

          select roll from dice order by rand() limit 1;
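
The table approach works for any set of values, but note that order by rand() has to sort every row, which is fine for six rows and slow for large tables. For a plain six-sided die, a table-free sketch could look like this:

```sql
-- RAND() returns a value in the range 0 <= v < 1, so
-- FLOOR(1 + RAND() * 6) yields an integer from 1 to 6.
SELECT FLOOR(1 + RAND() * 6) AS roll;
```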

MySQL: Delete orphan records

Finding records that do not match between two tables.

          CREATE TABLE bookreport (
            b_id int(11) NOT NULL auto_increment,
            s_id int(11) NOT NULL,
            report varchar(50),
            PRIMARY KEY  (b_id)
          );


          CREATE TABLE student (
            s_id int(11) NOT NULL auto_increment,
            name varchar(15),
            PRIMARY KEY  (s_id)
          );

          insert into student (name) values ('bob');
          insert into bookreport (s_id,report)
            values ( last_insert_id(),'A Death in the Family');

          insert into student (name) values ('sue');
          insert into bookreport (s_id,report)
            values ( last_insert_id(),'Go Tell It On the Mountain');

          insert into student (name) values ('doug');
          insert into bookreport (s_id,report)
            values ( last_insert_id(),'The Red Badge of Courage');

          insert into student (name) values ('tom');

     To find the students who are missing reports:

          select s.name from student s
            left outer join bookreport b on s.s_id = b.s_id
          where b.s_id is null;

              | name |
              | tom  |
              1 row in set (0.00 sec)

     Ok, next suppose there is an orphan record in
     bookreport. First delete a matching record
     in student:

       delete from student where s_id in (select max(s_id) from bookreport);

     Now, how to find which one is orphaned:

       select * from bookreport b left outer join
       student s on b.s_id=s.s_id where s.s_id is null;

     | b_id | s_id | report                   | s_id | name |
     |    3 |    3 | The Red Badge of Courage | NULL | NULL |
     1 row in set (0.00 sec)

     To clean things up (note that in 4.1 you can't use a subquery on
     the same table in a delete, so it has to be done in 2 steps):

       select @t_sid:=b.s_id from bookreport b left outer join
         student s on b.s_id=s.s_id where s.s_id is null;

       delete from bookreport where s_id=@t_sid;

     Functions do work in a delete, though. For instance, the
     following is possible (aggregates such as max() are not
     allowed in a where clause, so a scalar function is used):

        delete from student where s_id=last_insert_id();

     It's just a problem when joining the table where the
     delete will occur with another table. Another
     option is to create a second temp table and
     lock the first one.
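
As an alternative to the two-step clean-up, MySQL's multi-table delete syntax can remove the orphans in one statement. This is a sketch assuming a MySQL version that supports multi-table deletes with table aliases; it only uses the bookreport and student tables defined above:

```sql
-- Delete every bookreport row that has no matching student.
-- Only rows from the table named after DELETE (alias b) are removed.
DELETE b
FROM bookreport b
LEFT JOIN student s ON b.s_id = s.s_id
WHERE s.s_id IS NULL;
```

Running the matching select (the left outer join shown earlier) first is a cheap way to confirm which rows the delete will hit.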

MySQL: Loading data from file

Loading Data into Tables from Text Files.

Assume you have the following table.

            CREATE TABLE loadtest (
                pkey int(11) NOT NULL auto_increment,
                name varchar(20),
                exam int,
                score int,
                timeEnter timestamp,
                PRIMARY KEY  (pkey)
            );

And you have a comma-separated text file holding name, exam and score on each line, shown below with the unix "tail" command:

          $ tail /tmp/out.txt

    NOTE: loadtest contains the "pkey" and "timeEnter" fields which are not
    present in the "/tmp/out.txt" file. Therefore, to successfully load
    the specific fields issue the following:

         mysql> load data infile '/tmp/out.txt' into table loadtest
                  fields terminated by ',' (name,exam,score);
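
The statement can be extended with a few more clauses when the file is not quite as tidy. The sketch below is an assumption-laden variant, not a description of the file above: it assumes the file lives on the client machine, may quote its values, and starts with a header row.

```sql
-- LOCAL reads the file from the client instead of the server
-- (it must be enabled on both client and server).
-- IGNORE 1 LINES skips a header row; drop it if the file has none.
LOAD DATA LOCAL INFILE '/tmp/out.txt'
INTO TABLE loadtest
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(name, exam, score);

-- Spot-check the result; pkey and timeEnter are filled in by MySQL.
SELECT pkey, name, exam, score, timeEnter FROM loadtest LIMIT 5;
```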