MySQL 5.5 InnoDB prefetch patch

From Zarafa wiki

Revision as of 16:02, 7 February 2011 by Steve (Talk | contribs)

InnoDB prefetch patch

Goal

The goal of this patch is to minimize I/O load on your server during random access on traditional magnetic hard disks.


Theory

MySQL 5.5 brought us native asynchronous I/O for Linux. This means that MySQL can post multiple I/O requests at once to the underlying disk I/O subsystem. MySQL 5.1 has a similar mechanism, employing multiple read threads to the same effect.

However, the InnoDB row engine basically works sequentially: when requesting multiple pages in a table, each page request is posted to the I/O subsystem separately, then InnoDB waits for the data to come back from the disk, and only then does processing of the page continue.

In many cases, this is more efficient than it sounds; filesystem readahead, kernel caches, your RAID controller and disks all work together to make sure that these separate requests do not each result in their own physical I/O operation (IOP).

However, there are also a lot of cases in which these readahead and caching mechanisms fail miserably. This results in rock-bottom read performance, equal to the performance of a single disk doing random I/O. You notice this most when you have a very fragmented table (a table you have done lots of random writes in) that is not in the cache, and you try to mysqldump it: irrespective of the number of disks you have, you see only something like 100-250 IOPS on your RAID subsystem, which is about 1.5-4.0 MB/s. Yes, that's only 4 megabytes per second, even on your multi-$1000 RAID array that can do a gigabyte per second.

The patch described here removes this problem by adding a prefetch stage before the actual query processing: when possible, the patch finds out which pages are about to be requested sequentially, and pre-requests the reads for up to 1024 pages. After this, the 'normal' processing kicks in. Due to this prefetch, all your disks can be working on getting the data in parallel, boosting performance by a factor more or less equal to the number of disks in your RAID.

Even without a RAID array, your NCQ-enabled disk will benefit, since even a single disk can reorder a batch of requests better than it can serve single-page requests arriving one at a time.

FOSDEM 2011 presentation

Here is the presentation I gave on 5 Feb 2011 at FOSDEM. It had a live demo part, so the slides don't completely speak for themselves, but they give the general idea.

[Presentation on SlideShare]

Status

Current version is 0.3: [download 0.3]

 It currently patches against MySQL 5.5 only, but a port to 5.1 should not be that hard, and theoretically should yield good results when you have lots of read threads (more threads than disks, at least)
 Patch against 5.5.8
 It passes all MySQL/innodb tests
 It is not production-ready, since it needs more testing
 It 'leaks' a little bit of memory
 Needs some refactoring because ha_innodb.* are a mess now

Benchmarks

Benchmarks done on our dual quad-core Xeon, 3 GHz, MySQL 5.5.6-rc:

 mysqldump > /dev/null
 1.5GB fragmented table
 4x 10k RPM SAS disks, RAID-0, 128k stripe
 no O_DIRECT, unpatched:       5m36s, ~170 IOPS
 no O_DIRECT, 4 page prefetch: 5m39s, ~170 IOPS
 O_DIRECT, 4 page prefetch:    3m18s, ~300 IOPS
 O_DIRECT, 8 page prefetch:    2m25s, ~400 IOPS
 O_DIRECT, 16 page prefetch:   1m58s, ~500 IOPS
 O_DIRECT, 32 page prefetch:   1m41s, ~600 IOPS
 O_DIRECT, 64 page prefetch:   1m30s, ~650 IOPS
 O_DIRECT, 128 page prefetch:  1m24s, ~700 IOPS
 O_DIRECT, 256 page prefetch:  1m14s, ~750 IOPS
 O_DIRECT, 512 page prefetch:  1m9s, ~800 IOPS
 O_DIRECT, 1024 page prefetch: 1m3s

Secondary changes in the patch

There are also a few other changes in the patch, which do not directly affect prefetch:

Statistics logging

We now also log the time it takes to do statistics in the slow log. This is useful because for some queries it takes longer to do the statistics than to run the actual query. The slow log shows:

 # Query_time: %s  Lock_time: %s Statistics_time: %s

This information can be parsed with the standard Maatkit mk-query-digest program with no changes.

NOTE: There seems to be a problem in which it shows an insanely high statistics time; I still have to look into that.

Don't do statistics when doing FORCE INDEX or when there is only a single primary key

Another patch here makes sure that we're not doing unnecessary estimation. There is a problem in this fix which apparently isn't caught by the test set: the estimation can be important even if there is no other index (e.g. it quickly breaks when doing joins). This is a very bad thing and should be fixed. It doesn't affect the benchmarks though.

Optimize records_in_range

The old behaviour in InnoDB for estimating a range is (in short; some details are omitted):

  • Seek down the btree to the leaf nodes for the first and last record
  • In a loop, start at the top-level (root) node and descend a level each time, then for each level:
    • If the start and end paths are on the same page, do nothing
    • If the start and end paths are on logically adjacent pages, count the number of records between the pages
    • Else, read up to 10 pages between start and end, if we find the end page then use the record count from there, else use the average to guess the number of records
    • Multiply the number of records up to the previous level with the records found on this level
    • If last level (leaf page), break
  • This number gives an estimate

The new behaviour is:

  • Seek down the btree for the first and last record, but only to the last NODE level (so one level LESS than before)
  • In a loop, start at the top-level (root) node and descend a level each time, then for each level:
    • If the start and end paths are on the same page, do nothing
    • If the start and end paths are on logically adjacent pages, count the number of records between the pages
    • Else, if we're on a node page, guess the number of records by using a fixed guess of the number of node entries per page
    • Multiply the number of records up to the previous level with the records found on this level
    • If last level (leaf page), break
  • If at the last level we're still on the same page or adjacent pages, read the last leaf pages (on the next level), and use the record counts there
  • Else multiply by the average number of records per page using the average records per page in the table statistics
  • This number gives an estimate

The great thing about this is that we almost never need to read leaf pages (except for very small ranges). It has the following characteristics compared to the old way:

  • Exact estimation for small ranges (like before)
  • Less precision for ranges that cover more than 2 pages in the leaf nodes
  • No reads of leaf pages at all for large ranges

The idea here is that your node pages are much more likely to be in the buffer pool, and are therefore mostly 'free' I/O-wise. Also, if you have lots of indexes, we don't have to read lots of leaf pages to get a good query plan. A different option would be to use prefetching for reading the 10 sequential pages in the original algorithm, but that's still less efficient for large (>2 pages) estimations.

Another change we could do is to still do the 10-page read, but only on node pages, never on leaf pages.

Current problems in the patch

general
  • The no-statistics patch can break certain query plans, this is a bad thing
  • There is a 'memleak' in the core code - a heap is allocated on the handler which is used to store some data, but it isn't freed until the handler is cleaned up
  • We keep track of up to which record we have done a prefetch in 'ha_innodb::lastread'. This works well for unique indexes, but non-unique indexes may have many records with the same key. This means that prefetch breaks after the first batch of equal key values, since we cannot detect that we have read past that value. The only way I can think of to detect this is to combine the key with a page number, and then wait until we have read that page number. That has the drawback that the page may not be read at all due to a rearrangement (we don't lock the pages), and it sounds hackish to me anyway.
  • There are some commented-out fprintf's here and there; they just need to be removed
  • Fix the statistics-time bug with huge statistics values
ha_innodb.cc

The old way ha_innodb.cc worked was that it basically implemented the simplest interfaces (index_read, index_next, etc.) and called those from the more 'advanced' functions. With prefetch, we want to do the opposite: implement the most complicated pattern (multi_range_read/multi_range_next), and call that from the simpler functions (index_read, etc.). Since I don't have in-depth knowledge of the requirements of these functions, the patch currently rearranges this with various renames to FUNCNAME_low(), which makes everything very messy. This leaves me uncertain that I have not broken core requirements of the handler.
