Frames Modems Help Home Page Chipsets Search No Frames
Diary Entries See also Site Info & Diary.
16 October 2003 Panic - the .com site becomes unavailable...
 

(This is actually written up on 3 Nov)

At 2 in the morning of 16 October 2003 the .com site became unavailable. The Apache status page showed 133 current requests, which was quite something for 2 am. Every IP-address was different, also, so it wasn`t a search-robot.

Looking at TOP on the server, mysqld was occupying one cpu (there are 4) at 99.9% for up to 20 minutes at a time. 45 minutes later all 4 cpus were close to 100% usage & mysqld had been running for 50 mins (1 secs is long). I began to tear out what little hair I had left across the next 6 hours.

Across the following 24 hours I did discover 3 tables to report “...clients not closed the table properly” using the maintenance scripts and this clearly was impacting the problem, but also may well have been caused by the problem rather than initiating it. Lance--the UKLinux server guru--had set up a slow-query log (unsubtle hint) and this log was ***huge***! Another clue was that whenever MySQL was triggered on the server, TOP would report 99.9% usage for mysqld, even if this was only for a very short time. I eventually hit paydirt at a blog where the server admin had had identical problems with the cpus max-ing out. The issue was 2 default settings of MySQL:

From the Variables page of phpmyadmin as root:

  • wait timeout: 28800
  • thread cache size: 0

The last one was causing the cpus to max-out, and the first was causing sleeping-mysqld processes to choke the system (that is 28800 seconds, BTW). With no thread cache, every single init of mysqld requires a new thread to be spawned, and this is more time-consuming & processor-intensive than re-using a cached thread. Once found, the fix was easy:


		(into `my.cnf`)
		# added by Alex 17 Oct 2003
		set-variable = thread_cache_size=10
		set-variable = wait_timeout=30

		(then from shell:)
		/etc/init.d/mysqld stop
		/etc/init.d/mysqld start
		

CPU usage under top for mysqld dropped from a typical 99.9% to 10% max & more typical 6% on the 5-sec transitory showing. I believe `thread_cache_size=0` is the default, though I`m not certain; 28800 is *certainly* the default for `wait_timeout`, which is extraordinary. I also later altered the `long query time` (viewed on the same Variables page of phpmyadmin) from 3 secs to 15 - more on this later. Changing the thread_cache_size altered the number of ‘Threads created’ (Runtime Information, phpmyadmin, near the bottom of the page) dramatically. It was previously the same as the number of queries - about 660,000 on the first day. At this moment the stats are as follows:

“This MySQL server has been running for 12 days, 7 hours, 45 minutes and 54 seconds.
Since its startup, 4,532,838 queries have been sent to the server.”

  • Threads cached: 8
  • Threads created: 115
  • Threads connected: 2

So, `threads created` has dropped from 4,532,838 to 115 - quite a change.

 

The final issue was to look for slow queries; after all, the slow-query log was getting into the gigabyte size.

One of the first interesting features was that `long query time` was set at 3 secs. At first sight this seems reasonable, but the problem is that this is not a measure of how long the query takes, but rather of how long the script calling MySQL takes to close it. The following is from the MySQL documentation:

long_query_time
If a query takes longer than this (in seconds), the Slow_queries counter will be incremented. If you are using --log-slow-queries, the query will be logged to the slow query logfile. This value is measured in real time, not CPU time, so a query that may be under the threshold on a lightly loaded system may be above the threshold on a heavily loaded one. See section 4.9.5 The Slow Query Log.

Later, using


		/usr/bin/mysqldumpslow slow.log
		
I did find a query that was taking 5.27 secs (average) to complete. This was used to present a graphical total-stats page for the .co.uk pages. It was unacceptable, and that’s why the `view total stats` no longer appears. After disabling this page the *very* slow queries vanished. However, even after this 1 in every 6 pages takes longer than 15 secs to deliver. This is also unacceptable, but for the moment I do not know what is causing it.