VERY strange MySQL problem, definitely worth a look

planet-admin

I have a mysql 4.1 server running on a quad opteron machine running redhat E/S 4u4.

This mysql runs at about 30-40% cpu usage. It has tables that are no larger than 2 gb. They are all MyISAM tables.

The cpus in this box are 2.2ghz opterons, and there are 20gb of RAM.

This box also serves as a web server, so we wanted my migrate the mysql off of the box, onto a 16core opteron 32gb RAM box, (running gentoo, with a vanilla 2.6.19, and it runs very well) with fiber channel attached raid array. Alreay on this machine we have a MySQL 4.1 with several databases, and a MySQL 5.0.26 with tables upwards of 6gb (a vbulletin site). (those databases have NO problem at ALL)

With both of these running (and that vbulletin sees a LOT of traffic), the load avg is 1-2, and the cpu usage is no more than 10%. (It is 16 cores after all).

I created a new mysql instance/config file/init script, and moved our binary data files over (created using mysqlhotcopy).

I started the server (used even the same config file, mysql variable-wise), and the cpu usage slams the server, upwards of 1100%. The load avg gets as high as 90.

I have tried:
clearing the linux disk cache, and cat'ing the myi/myd files to /dev/null, just to get them back in the cache.
MySQL 5
switching the tables to innodb
fine-tuning the mysql variables
several other things

Now, quite obviously, the 16 core machine is a much beefier server, and the nic wasn't being overutilized, yet the mysql process was taking at the very least 400-500% cpu, with actual numbers around 900%. It was causing the load average to spike, making the other mysql instances....slow.

I moved the EXACT SAME data files back (so, that rules out something strictly relating to them) to the other quad opteron box, and it runs perfectly.

I don't even know where to BEGIN to try to figure out why it would perform so differently. Anyone have any ideas?

Thanks,
Michael
_________________
Michael S. Moody
Sr. Systems Engineer
Global Systems Consulting
Web: http://www.GlobalSystemsConsulting.com

Dan · Veteran Joined: 25 Oct 2005 Posts: 1302

nothing fishy in /var/log/mysql ?

planet-admin · Posted: Sun Dec 31, 2006 9:04 pm Post subject:

Nada, zip. I can't figure it out for the life of me. It's completely incomprehensible.

Thanks,
Michael
_________________
Michael S. Moody
Sr. Systems Engineer
Global Systems Consulting
Web: http://www.GlobalSystemsConsulting.com

BitJam

m.wales · n00b Joined: 06 Sep 2006 Posts: 18

So the only thing that stood out to me was that you were moving from MySQL 4 to 5. Perhaps the problem lies somewhere in that switch? Maybe the configuration you used for 4 doesn't work too well with 5? Also are all the tables in the new 5 instance indexed and analyzed properly as I've certainly had migration problems due to that in the past.
_________________
- Mark W -

planet-admin · Posted: Mon Jan 01, 2007 3:43 am Post subject:

I'm already using multiple instances, and they run without problem. In regards to the above message, that's not the case either, as I tried it as MySQL 4.1, and then moved it to 5 to see if that would solve the problem. I currently have 2 MySQL instances running on that machine, with cpu usage of 10%, and a load average of about 1. If I add that other database, it freaks out, whether I make it 4.1, ol 5.0.

Thanks,
Michael
_________________
Michael S. Moody
Sr. Systems Engineer
Global Systems Consulting
Web: http://www.GlobalSystemsConsulting.com

planet-admin

BitJam · Posted: Mon Jan 01, 2007 5:12 am Post subject:

I can think of three possible reasons for the slowdown:

The new database is running unoptimized
Misconfiguration
Resource conflict

1. To test the indices, I would suggest actually timing select statements under a no-load condition. Compare timing on the new machine and on the old machine. I wouldn't trust explain. It would also be interesting to see how the new database works on the new machine when the vB database is off but I realize this might not be possible.

2. Make sure each instance has it's own:

Socket
Port
PID file
Log file
Data directory

Consider using mysql_multi.

3. Since the other instances slow down after you start the new one, (if there is no misconfiguration) it is clear that there is resource contention. What is not clear is if this is a symptom of some other problem or if it is the actual cause of the problem. I realize that you don't think this could be the cause because your current two instances play nice together, but resource contention among multiple instances is only an issue if the databases have a high load. Your new database could be the straw that breaks the camel's back.

Ideally, it would be nice to be able to see if the problem exists without the vB database running (or with no load). Alternatively you could add the new database but don't give it any load and see what happens. This might help isolate a misconfiguration. See if simple queries affect resource use. Then write a Perl (or some other language) script to start hitting the new database with multiple queries. See if this starts to cause a problem.

If such tests would be difficult or painful to run then as an alternative, you might want to closely monitor resource use when the contention starts. Perhaps you could set up GKrellm or some other monitoring tool to monitor memory use, processes, CPU, network traffic, and disk IO. Be prepared to take some screen shots.

If the new database works well by itself (tested either with timing selects or actually running it with the other dbs off or throttled back) then my next guess would be a misconfiguration causing two instances to vie for the same resource. If there is no configuration problem then you might have to start tracking down resource use. For example, depending on how your raid is configured it could be extremely unoptimized for multiple instances and you are getting creamed by seeks back and forth between two different data directories. Raids can often slow things down due to the seek times. In the past, I've tended to try to design my disk layouts to minimize seeks by playing tricks such as putting log file directories on separate drives. Generally, raids are often not well optimized for database performance. They are good for reliability but the can really suck performance-wise when compared with having multiple spindles available to cut down on seek times.

Finally, you might want to take a day off away from this and come back to it with a fresh set of eyes. Happy New Year!

planet-admin · Posted: Mon Jan 01, 2007 6:14 am Post subject:

I used mysqlhotcopy to get the datafiles initially, so they're the same ones. Also, when I moved them back, I moved back the exact same files, where they are now running fine. This probably rules out some "not using indexes" problem.
I used the same config file (all the various buffer_size, etc variables from the "lesser" machine), and that didn't work. I tuned it specifically for my hardware, that didn't work. I used one of the config files from the existing mysql instances, that didn't work.

Each instance is bound to a different ip address/nic. Each has it's own pid, socket, log file, and data directory, and port.

I have turned on "log queries not using indexes", and it IS using the indexes.

I have a VERY high amount of context switching, upwards of 250,000. (using vmstat)

The database does NOT do this when there are no open connections to it, but once I start the web server (hosted on another machine), it gets SLAMMED. Again, the vbulletin, and the other database instances are far far far more utilized, but they run at a nice load average of.....1.

My RAID is actually a 16-disk 12TB raid 6 array, connected via 4gb fiberchannel, with a 1gb RAM buffer on the raid controller. Plus, I flushed the linux disk buffer cache, and then cat'ed the files to dev/null to get them loaded back into the linux disk buffer cache. This raid array is using sata2 ncq drives. It's so blazing fast it's not even funny.

The key_read vs key_read_request ratio is .002.

The tmp_tables ratio is also very low.

The query cache isn't full.

NOTHING that I can think of is causing this.

Thanks,
Michael

jonnevers · Posted: Mon Jan 01, 2007 4:37 pm Post subject:

planet-admin · Posted: Tue Jan 02, 2007 12:53 am Post subject:

Well, I first tried it on MySQL 4.1. I'm aware of the migration path, I just wanted to "quickly" try a solution, when 4.1 didn't work, I started the 5 binary I had installed. Sadly, it didn't work either.

And, yes, this hardware is a serious piece of hardware. It boots off of the fiberchannel. It'll compile the COMPLETE 2.6.19 kernel, with a make -j in about 19 seconds. 19 SECONDS!

(Sadly, I also have the "joy" of working on an sgi altix, itanium 2, supercomputer pain, but I digress)

Thanks,
Michael
_________________
Michael S. Moody
Sr. Systems Engineer
Global Systems Consulting
Web: http://www.GlobalSystemsConsulting.com