Frames Modems Help Home Page Chipsets Search No Frames
Diary Entries See also Site Info & Diary.
03 November 2003 Optimising SQL statements - a practical walkthrough...
 

My last 4 months have been spent re-writing the site at modem-help.com - closeted away in my attic, like some demented monk, chained to the desk and only released for short periods for necessary ablutions, or a crust of bread, or a few hours sleep. The first stage is almost finished and the results will be seen shortly on the site. There have been 3 criteria:

  1. Prepare the pages for new features (activation of these in the next stage)
  2. Improve the page display, with easily-adjusted colour-themed pages
  3. Bench-mark each page

Implicit in the last one is to radically shorten the preparation time of each page at the server. This latter has become a crucial issue for me since the sites moved to a co-located server in July 2003.

 

The sites were formerly hosted by UK Linux on a twin-Pentium III server. My sites shared the server with many other UKLinux paying customers, and after a time it became clear that the folks at UKLinux considered my sites to be crippling their server with poorly-written php/mysql scripts. Naturally, my attitude was that this was bunkum. After the experience of the last 4 months, I’m not so sure.

Olivia is the name of the server that has delivered this page up to you. She is quite some beast - dual matched-intel Xeon 2.4 GHz cpus on a Gigabyte SR222E-R MB, 1 Gig memory, Adaptec zero channel RAID with 3 x 73.4 GB 10,000 rpm SCSI SCA HDDs running RedHat Linux v8.0, Apache/2.0.40, MySQL 3.23.56-log & PHP. Ridiculously over-specified for a 25,000 hits/day server. And yet, and yet...

Olivia is running phpMyAdmin 2.5.1, and this version offers many features vital for bench-marking sql statements. The hosted-server is running 2.2.5 (from memory) and benchmarks are not available with this version - if you do not already have the later version, upgrade quick. Running my own server also means being able to access as the root MySQL user, and this gives access to runtime info & system variables that were unavailable in the hosted state. All of this became vital on the 16 Oct just gone.

So, what’s offered below are live examples of benchmarking one of the final sql statements from the chipsets.php script. The example page is chipsets.php?mid=122&ncd=4485, the list of drivers for all modems with this chipset. The sql tables are listed at 14 Jan. The original php-script is:


"SELECT m.MID,m.Name,n.NID,n.Name,n.dead,k.LTD,lt.Name,k.OTD,ot.Name,DATE_FORMAT(k.Updated,'%d %b %Y'),Version
"._QF_MODEM."
"._QJ_MFCS."
"._QJ_LINKL."
"._QJ_LINKT."
LEFT JOIN ostypes ot ON k.OTD=ot.OTD
WHERE ".(($nbd)?"b.NID=$nbd":"c.NID=$ncd")." && k.LTD="._QTL_DRIVR."
GROUP BY m.Name,n.Name,ot.Name
ORDER BY k.LTD DESC,ot.Name,k.Version DESC,m.Name,n.Name"
		

_QF_MODEM and the other constants are pre-defined slivers of sql script:


define("_QF_MODEM","
FROM names n
"._QJ_MODEM ."
LEFT JOIN names nm ON dn.NMD=nm.NID
LEFT JOIN names no ON dn.NCD<=>no.NID
LEFT JOIN names a ON dn.CAD<=>a.NID
LEFT JOIN names b ON dn.CBD<=>b.NID
LEFT JOIN names c ON dn.CCD<=>c.NID
");
		

...which makes it easier to build the sql statements. Use of these constants does introduce some redundancy into the statements, but this affects neither the results nor the speed - MySQL just ignores the redundant portions - whilst the use of a global define allows every sql statement that uses it to be changed in one go if improvements are found, and also reduces the size of the php pages.

For the page in question, this script turns out like this (the examples below are drawn from the phpMyAdmin pages):

 

Showing rows 0 - 24 (25 total, Query took 0.0787 sec)
SQL-query : [Edit] [Explain SQL] [Create PHP Code]
SELECT m.MID, m.Name, n.NID, n.Name, n.dead, k.LTD, lt.Name, k.OTD, ot.Name, DATE_FORMAT( k.Updated, '%d %b %Y' ) , Version
FROM names n
LEFT JOIN modems dn ON n.NID = dn.NID
LEFT JOIN names nm ON dn.NMD = nm.NID
LEFT JOIN names no ON dn.NCD <=> no.NID
LEFT JOIN names a ON dn.CAD <=> a.NID
LEFT JOIN names b ON dn.CBD <=> b.NID
LEFT JOIN names c ON dn.CCD <=> c.NID
LEFT JOIN mfcs m ON n.MID = m.MID
LEFT JOIN linklu k ON n.NID = k.NID
LEFT JOIN linktypes lt ON k.LTD = lt.LTD
LEFT JOIN ostypes ot ON k.OTD = ot.OTD
WHERE c.NID = 4485 && k.LTD = 2
GROUP BY m.Name, n.Name, ot.Name
ORDER BY k.LTD DESC , ot.Name, k.Version DESC , m.Name, n.Name

Less than 0.1 sec may not sound slow but, apart from the fact that there are scores of queries on each page, clicking the Explain SQL link (obviously not a link here) makes the point:

table type possible_keys key key_len ref rows Extra
n ALL NULL NULL NULL NULL 11019 Using temporary; Using filesort
dn eq_ref PRIMARY PRIMARY 4 n.NID 1  
nm eq_ref PRIMARY PRIMARY 4 dn.NMD 1 Using index
no eq_ref PRIMARY PRIMARY 4 dn.NCD 1 Using index
a eq_ref PRIMARY PRIMARY 4 dn.CAD 1 Using index
b eq_ref PRIMARY PRIMARY 4 dn.CBD 1 Using index
c eq_ref PRIMARY PRIMARY 4 dn.CCD 1 where used; Using index
m eq_ref PRIMARY PRIMARY 4 n.MID 1  
k ref NID NID 4 n.NID 5 where used
lt eq_ref PRIMARY PRIMARY 1 k.LTD 1  
ot eq_ref PRIMARY PRIMARY 1 k.OTD 1  

The first row (names n) is the problem: no keys, all rows in the table have to be examined, and a temporary table with filesort is being used - bad news all round. Clearly, this is a downside in the original decision in the php-script to use DEFINEs. These were all based on ‘names n’ as the first table (to maintain consistency). This was a good choice for most sql statements, but a poor choice for this one. Total rows examined is the product of all the ‘rows’ column: 11019 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 5 x 1 x 1 = 55,095 rows.

‘modems’ is the obvious target for the first table, and a very small change to the sql statement makes a dramatic change to the script duration:

 

Showing rows 0 - 24 (25 total, Query took 0.0318 sec)
SQL-query : [Edit] [Explain SQL] [Create PHP Code]
SELECT m.MID, m.Name, n.NID, n.Name, n.dead, k.LTD, lt.Name, k.OTD, ot.Name, DATE_FORMAT( k.Updated, '%d %b %Y' ) , Version
FROM modems dn
LEFT JOIN names n ON dn.NID = n.NID
LEFT JOIN names nm ON dn.NMD = nm.NID
LEFT JOIN names no ON dn.NCD <=> no.NID
LEFT JOIN names a ON dn.CAD <=> a.NID
LEFT JOIN names b ON dn.CBD <=> b.NID
LEFT JOIN names c ON dn.CCD <=> c.NID
LEFT JOIN mfcs m ON n.MID = m.MID
LEFT JOIN linklu k ON n.NID = k.NID
LEFT JOIN linktypes lt ON k.LTD = lt.LTD
LEFT JOIN ostypes ot ON k.OTD = ot.OTD
WHERE c.NID = 4485 && k.LTD = 2
GROUP BY m.Name, n.Name, ot.Name
ORDER BY k.LTD DESC , ot.Name, k.Version DESC , m.Name, n.Name

More than twice as quick as the previous query. Explain SQL gives:

table type possible_keys key key_len ref rows Extra
dn ALL NULL NULL NULL NULL 2596 Using temporary; Using filesort
n eq_ref PRIMARY PRIMARY 4 n.NID 1  
nm eq_ref PRIMARY PRIMARY 4 dn.NMD 1 Using index
no eq_ref PRIMARY PRIMARY 4 dn.NCD 1 Using index
a eq_ref PRIMARY PRIMARY 4 dn.CAD 1 Using index
b eq_ref PRIMARY PRIMARY 4 dn.CBD 1 Using index
c eq_ref PRIMARY PRIMARY 4 dn.CCD 1 where used; Using index
m eq_ref PRIMARY PRIMARY 4 n.MID 1  
k ref NID NID 4 n.NID 5 where used
lt eq_ref PRIMARY PRIMARY 1 k.LTD 1  
ot eq_ref PRIMARY PRIMARY 1 k.OTD 1  

The time savings are coming purely because there are less modems than names; still no keys, all rows have to be examined and a temporary table with filesort. The number of rows that need to be examined is now 2596 x 5 = 12,980 (23.5% of the previous figure). The choice of key in the WHERE is the issue, as the original choice was wrong. c.NID (used to find the chipset name) is exactly equivalent to dn.NID, the modem chipset. Making this one change makes another dramatic difference:

 

Showing rows 0 - 24 (25 total, Query took 0.0073 sec)
SQL-query : [Edit] [Explain SQL] [Create PHP Code]
SELECT m.MID, m.Name, n.NID, n.Name, n.dead, k.LTD, lt.Name, k.OTD, ot.Name, DATE_FORMAT( k.Updated, '%d %b %Y' ) , Version
FROM modems dn
LEFT JOIN names n ON dn.NID = n.NID
LEFT JOIN names nm ON dn.NMD = nm.NID
LEFT JOIN names no ON dn.NCD <=> no.NID
LEFT JOIN names a ON dn.CAD <=> a.NID
LEFT JOIN names b ON dn.CBD <=> b.NID
LEFT JOIN names c ON dn.CCD <=> c.NID
LEFT JOIN mfcs m ON n.MID = m.MID
LEFT JOIN linklu k ON n.NID = k.NID
LEFT JOIN linktypes lt ON k.LTD = lt.LTD
LEFT JOIN ostypes ot ON k.OTD = ot.OTD
WHERE dn.CCD = 4485 && k.LTD = 2
GROUP BY m.Name, n.Name, ot.Name
ORDER BY k.LTD DESC , ot.Name, k.Version DESC , m.Name, n.Name

Now, this is more like it. More than ten times as quick as the original query, yet giving exactly the same results. Explain SQL gives:

table type possible_keys key key_len ref rows Extra
dn ALL NULL NULL NULL NULL 2596 where used; Using temporary; Using filesort
n eq_ref PRIMARY PRIMARY 4 n.NID 1  
nm eq_ref PRIMARY PRIMARY 4 dn.NMD 1 Using index
no eq_ref PRIMARY PRIMARY 4 dn.NCD 1 Using index
a eq_ref PRIMARY PRIMARY 4 dn.CAD 1 Using index
b eq_ref PRIMARY PRIMARY 4 dn.CBD 1 Using index
c eq_ref PRIMARY PRIMARY 4 dn.CCD 1 where used; Using index
m eq_ref PRIMARY PRIMARY 4 n.MID 1  
k ref NID NID 4 n.NID 5 where used
lt eq_ref PRIMARY PRIMARY 1 k.LTD 1  
ot eq_ref PRIMARY PRIMARY 1 k.OTD 1  

Notice that the number of rows that need to be examined is still 12,980. Only the addition of where used; in the primary table has made the difference. Looking at the table indexes, CCD forms part of a combo-index NCD (chipset component names), designed originally to save disk space as the sites were on a space-limited hosted arrangement. That is most certainly not an issue now. Adding an index for CCD (plus all other components) allows another time boost:

 

Showing rows 0 - 24 (25 total, Query took 0.0039 sec)
SQL-query : [Edit] [Explain SQL] [Create PHP Code]
SELECT m.MID, m.Name, n.NID, n.Name, n.dead, k.LTD, lt.Name, k.OTD, ot.Name, DATE_FORMAT( k.Updated, '%d %b %Y' ) , Version
FROM modems dn
LEFT JOIN names n ON dn.NID = n.NID
LEFT JOIN names nm ON dn.NMD = nm.NID
LEFT JOIN names no ON dn.NCD <=> no.NID
LEFT JOIN names a ON dn.CAD <=> a.NID
LEFT JOIN names b ON dn.CBD <=> b.NID
LEFT JOIN names c ON dn.CCD <=> c.NID
LEFT JOIN mfcs m ON n.MID = m.MID
LEFT JOIN linklu k ON n.NID = k.NID
LEFT JOIN linktypes lt ON k.LTD = lt.LTD
LEFT JOIN ostypes ot ON k.OTD = ot.OTD
WHERE dn.CCD = 4485 && k.LTD = 2
GROUP BY m.Name, n.Name, ot.Name
ORDER BY k.LTD DESC , ot.Name, k.Version DESC , m.Name, n.Name

Another 50% time reduction. Explain SQL gives:

table type possible_keys key key_len ref rows Extra
dn ref CCD CCD 5 const 19 where used; Using temporary; Using filesort
n eq_ref PRIMARY PRIMARY 4 n.NID 1  
nm eq_ref PRIMARY PRIMARY 4 dn.NMD 1 Using index
no eq_ref PRIMARY PRIMARY 4 dn.NCD 1 Using index
a eq_ref PRIMARY PRIMARY 4 dn.CAD 1 Using index
b eq_ref PRIMARY PRIMARY 4 dn.CBD 1 Using index
c eq_ref PRIMARY PRIMARY 4 dn.CCD 1 where used; Using index
m eq_ref PRIMARY PRIMARY 4 n.MID 1  
k ref NID NID 4 n.NID 5 where used
lt eq_ref PRIMARY PRIMARY 1 k.LTD 1  
ot eq_ref PRIMARY PRIMARY 1 k.OTD 1  

Notice that the number of rows that need to be examined is now 19 x 5 = 95, compared to the original 55,095, which explains the value of the index together with all other improvements.

There is one further refinement that I can see to be made, and this involves k.LTD. This is the Type of the link (in this case it is Drivers). The scripts on this page show up to 4 different types of listings: drivers, firmware, manuals & support. Originally, 4 different connections were made, one for each type. Far more sensible is to make one connection and gather all the links, then use php to split them according to type. Removing the k.LTD conditional gives the following result:

 

Showing rows 0 - 29 (38 total, Query took 0.0049 sec)
SQL-query : [Edit] [Explain SQL] [Create PHP Code]
SELECT m.MID, m.Name, n.NID, n.Name, n.dead, k.LTD, lt.Name, k.OTD, ot.Name, DATE_FORMAT( k.Updated, '%d %b %Y' ) , Version
FROM modems dn
LEFT JOIN names n ON dn.NID = n.NID
LEFT JOIN names nm ON dn.NMD = nm.NID
LEFT JOIN names no ON dn.NCD <=> no.NID
LEFT JOIN names a ON dn.CAD <=> a.NID
LEFT JOIN names b ON dn.CBD <=> b.NID
LEFT JOIN names c ON dn.CCD <=> c.NID
LEFT JOIN mfcs m ON n.MID = m.MID
LEFT JOIN linklu k ON n.NID = k.NID
LEFT JOIN linktypes lt ON k.LTD = lt.LTD
LEFT JOIN ostypes ot ON k.OTD = ot.OTD
WHERE dn.CCD = 4485
GROUP BY m.Name, n.Name, ot.Name, k.LTD
ORDER BY k.LTD DESC , ot.Name, k.Version DESC , m.Name, n.Name

A little longer than the single query, but much less time than 4 seperate queries. Explain SQL gives:

table type possible_keys key key_len ref rows Extra
dn ref CCD CCD 5 const 19 where used; Using temporary; Using filesort
n eq_ref PRIMARY PRIMARY 4 n.NID 1  
nm eq_ref PRIMARY PRIMARY 4 dn.NMD 1 Using index
no eq_ref PRIMARY PRIMARY 4 dn.NCD 1 Using index
a eq_ref PRIMARY PRIMARY 4 dn.CAD 1 Using index
b eq_ref PRIMARY PRIMARY 4 dn.CBD 1 Using index
c eq_ref PRIMARY PRIMARY 4 dn.CCD 1 Using index
m eq_ref PRIMARY PRIMARY 4 n.MID 1  
k ref NID NID 4 n.NID 5  
lt eq_ref PRIMARY PRIMARY 1 k.LTD 1  
ot eq_ref PRIMARY PRIMARY 1 k.OTD 1  

So, what’s the bottom line after all this work?

The non-optimised queries took 4 x 0.0787 = 0.3148 secs.
The final version query took 1 x 0.0049 = 0.0049 secs.

This is more than 64 times quicker than the original statements - astonishing! - and the difference between a speedy page and a laggard one, particularly at times of high load.