Taobao internal sharing MySQL and MariaDB performance optimization

By Frank Dunn,2015-10-04 04:52
10 views 0
Taobao internal sharing MySQL and MariaDB performance optimization

    Taobao internal sharing: MySQL and MariaDB

    performance optimization

    MySQL is currently the most used open source database, but the default Settings of the MySQL database performance is very poor, must be constantly optimized, and the optimization is a complex task, this article describes taobao team for MySQL database Metadata database Lock subsystem optimization, hash_scan algorithm implementation of analytical performance optimization, TokuDB version optimization, performance optimization, and MariaDB.This article from taobao experience sharing within the team.

    MySQL, 5.7 optimization, Metadata Lock subsystem optimization


    The purpose of the introduction of MDL lock, originally is 989, in order to solve the famous bug# in MySQL 5.1 and the previous version, the execution of a transaction is not maintenance involves all the tables Metatdata locks, appear easily copy disruptions, such as the execution sequence is as follows:

    Session 1: BEGIN;

    Session 1: INSERT INTO t1 VALUES (1);

    Session 2: Drop table t1;-- -- -- -- -- -- -- -- write BINLOG SQL

    Session 1: COMMIT;-- -- -- -- -- write BINLOG transaction

    In case library replay binlog, will perform DROP TABLE first, and then INSERT data, leading to replicate the interrupt.

    In MySQL 5.5 version, the introduction of MDL, in transactions involving all the tables in the process of MDL lock, release until end of the transaction.This means that the sequence of the DROP TABLE operation will be the Session 1 block until its submission.

    But used 5.5 knows, MDL is a really annoying thing, believe a lot of people must have met when using mysqldump do logical backup, because of the need to perform FLUSH TABLES WITH READ LOCK (the following abbreviation FTWRL instead of) to obtain the GLOBAL MDL of GLOBAL LOCK, so often can see "wait for GLOBAL READ LOCK".If inventory in large queries, or duplicate threads are executing relatively long DDL, live and FTWRL block, and then the QUERY will be block, lead to business failure is not available.

    In order to solve this problem, Facebook to add new interfaces to MySQL replace FTWRL create a read only view, and return the binlog read view is consistent with the site;Percona Server also achieve a similar way to bypass the FTWRL, clickDocument connectionAs well asPercona blog, not in this.

    MDL is solvedbug#989, but introduces a new hot spot, all the MDL lock object is maintained in a hash object;For hot spots, the most normal idea, of course, is to partition to disperse hot spots, but that is a great god Facebook Mark Callaghan in the reportbug#66473Before joining, Mark was observed MDL_map: : mutex locking competition is very high, and the official change.So in later versions of MySQL 5.6.8 and introduced a new parameter metadata_locks_hash_instances to control of MDL hash partition number (Rev:4350)

    But the story doesn't end at the back of the test and found that there is something wrong with the hash function, somedb. Someprefix1.... Somedb. Someprefix8 hash key values are the same, are under the hash to the same bucket, equivalent to a hash partitioning didn't take effect.That belongs to the problem of hash algorithm, like archaeology students can read under bug# behind 66473 Dmitry Lenev analysis.

    Mark further testing found that Innodb hash algorithm is more efficient than my_hash_sort_bin calculation, Oracle developer reopenedbug#68487To track the issue, and in the MySQL5.6.15 optimize the hash key computing functions, including fix it says the hash calculation problem (Rev:5459), using MurmurHash3 algorithm to compute the hash value of MDL key.

    MySQL 5.7 the optimization of MDL lock

    In MySQL 5.7 do to MDL subsystem optimization more thoroughly.Mainly from the following points:

    First, although the MDL HASH partitioning, but because of the way is the table name + library name as a key value partition, if the query or DML is concentrated on the same table, will HASH to the same partition, cause obvious MDL lock contention on the HASH.

    Aiming at this point, the introduction of the LOCK - FREE to store MDL_lock HASH, LF_HASH lock-free algorithm based on the paper "the Split - Ordered Lists: the LOCK - FREE Extensible HASH Tables", there is more complex.Note: in fact LF_HASH have long been used in the Performance Schema, is relatively mature code modules.Due to the introduction of the LF_HASH, MDL HASH partitioning features natural directly be abolished.The correspondingWL#7305 PATCH(Rev:7249)

    Widely used in the second place, from the actual scene, DML compared/SELECT DDL contour level MDL lock type, are more common, it can reduce the DML and SELECT MDL overhead.

    In order to realize the DML/SELECT FAST LOCK, use a similar LOCK - WORD LOCK mode, called the FAST PATH, if FAST - PATH LOCK fails, walked missile - the PATH to LOCK.

    Every MDL lock objects (MDL_lock) lasted for a long long type of value to mark the lock of the current state and variable called MDL_lock: : m_fast_path_state a simple example: (initial corresponding MDL_lock on sbtest1 table: : m_fast_path_state value 0)

    ; Session 1: BEGIN;

    ; Session 1: SELECT * FROM sbtest1 WHERE id = 1;/ / m_fast_path_state =

    1048576, MDL ticket without MDL_lock: : m_granted queue

    ; Session 2: BEGIN;

    ; Session 2: SELECT * FROM sbtest1 WHERE id = 2;/ / m_fast_path_state =

    1048576 + 1048576 = 2097152, same as above, walk FAST PATH

    ; Session 3: the ALTER TABLE sbtest1 ENGINE = INNODB;/ / DDL request

    add MDL_SHARED_UPGRADABLE type lock is as unobtrusive lock, can

    think this is better than the above SQL MDL locked level higher, and

    incompatible, therefore be forced to go missile path.The missile path is need to

    add MDL_lock: : m_rwlock write locks.M_fast_path_state =

    m_fast_path_state | MDL_lock: : HAS_SLOW_PATH | MDL_lock: :


    ; Note: DDL will get library level of intent exclusive MDL or table level lock

    Shared lock can upgrade, but easy to express, directly ignored here, only

    consider involving the same MDL_lock lock object.

    ; Session 4: SELECT * FROM sbtest1 WHERE id = 3;/ / check

    m_fast_path_state & HAS_OBTRUSIVE, if DDL was no more than the

    missile will go path.

    Can be seen from the above description, explicit MDL subsystem to distinguish the lock type (OBTRUSIVE or UNOBTRUSIVE), matrix m_unobtrusive_lock_increment stored in the array.So for incompatible types of MDL lock types, such as the DML/SELECT, almost no read-write lock lock operation or a MUTEX overhead.The

    correspondingWL#7304, WL#7306 PATCH;Rev:7067,Rev:7129(Rev:7586)

    Third, due to the introduction of the MDL lock, in fact, early version is used to control the Server and the engine layer table level concurrent THR_LOCK for Innodb has some redundancy, so the Innodb tables can completely ignore this part of the overhead.

    Innodb in existing logic, however, still rely on THR_LOCK to realize the LOCK TABLE tbname READ, thus added new MDL LOCK type instead of this

    implementation.Most of the changes are actually code to handle new types of MDL, Innodb changes are only a few lines of code.The correspondingWL#6671PATCH(Rev:8232)

    Fourth, the user lock the Server layer (get) through GET_LOCK function use MDL to implementation.

    Users can GET_LOCK () to obtain multiple users at the same time lock, at the same time due to the use of MDL, can use MDL subsystem implementation of deadlock

    detection.Note due to the changes, cause the user lock name must be less than 64 bytes, this is restricted by MDL subsystem.The correspondingWL#1159, PATCH(Rev:8356)

    MySQL, performance optimization, hash_scan the realization of the algorithm

    Problem description

    First of all, we perform the TestCase below:

    --source include/

     --source include/

     connection slave;

     set global slave_rows_search_algorithms='TABLE_SCAN';

     connection master;

     create table t1(id int, name varchar(20);

     insert into t1 values(1,'a');

     insert into t2 values(2, 'b');


     insert into t3 values(1000, 'xxx');

     delete from t1;

     ---source include/

    , with the increase of t1 data rpl_hash_scan. Test execution time will increase as the amount of t1 data and rapid growth, because the execution 'delete from t1;'for each line of t1 deletion, libraries are prepared to scan t1, namely a full table scan, if select count (*) from t1 = N, then you need to scan N times t1 table, then read the record number is: O (N + (N - 1) + (N - 2) +... + 1) = O (N ^ 2), in the replication without introducing hash_scan, binlog_format = row, for no index table, are implemented through table_scan, if a update_rows_log_event/delete_rows_log_event contains a row change more, every change to a full table scan, the stack is as follows: #0 Rows_log_event::do_table_scan_and_update

    #1 0x0000000000a3d7f7 in Rows_log_event::do_apply_event

    #2 0x0000000000a28e3a in Log_event::apply_event

#3 0x0000000000a8365f in apply_event_and_update_pos

    #4 0x0000000000a84764 in exec_relay_log_event

    #5 0x0000000000a89e97 in handle_slave_sql (arg=0x1b3e030)

    #6 0x0000000000e341c3 in pfs_spawn_thread (arg=0x2b7f48004b20)

    #7 0x0000003a00a07851 in start_thread () from /lib64/

    #8 0x0000003a006e767d in clone () from /lib64/

    This kind of circumstance, often cause delay for library, which is no replication latency problems brought by the index table.

    How to solve the problem:

    1. RDS to understand this problem, will check the table at the time of each table

    creation contains the main building or only build, if not included, then create

    an implicit main building, the main building transparent to users, users

    non-inductive, corresponding show the create, select * operations such as

    block implicit main building, in order to reduce the effects of no index table;

    2. Official in order to solve this problem, the 5.6.6

    slave_rows_search_algorithms and later introduced parameters, used to

    indicate a library in apply_binlog_event used in the algorithm, there are three

    kinds of algorithm TABLE_SCAN, INDEX_SCAN, HASH_SCAN, including

    TABLE_SCAN and INDEX_SCAN is already exist, this paper mainly studies

    the realization of the HASH_SCAN way, slave_rows_search_algorithms about

    parameter Settings.

    The realization of the hash_scan method:

    Simple speaking, in the apply rows_log_event, on line will log_event update cached in two structure, respectively is: m_hash, m_distinct_key_list.M_hash: mainly used for cache update rows starting position, is a hash table.M_distinct_key_list: if there is an index, the index value of push to m_distinct_key_list, if there is no index table, do not use this List structure;The preliminary scan the whole process is as follows: call Log_event: : apply_event Rows_log_event::do_apply_event


     Rows_log_event::do_hash_row (add entry info of changed records)

     if (m_key_index < MAX_KEY) (index used instead of table scan)

     Rows_log_event::add_key_to_distinct_keyset ()

When an event include multiple lines of change, will first scan all of the changes, will

    cache the result to the m_hash, if the table there is an index, the index value caching

    to m_distinct_key_list List, if not, do not use this cache structure, and a full table scan


    Execution stack is as follows:

    #0 handler::ha_delete_row

    #1 0x0000000000a4192b in Delete_rows_log_event::do_exec_row #2 0x0000000000a3a9c8 in Rows_log_event::do_apply_row #3 0x0000000000a3c1f4 in Rows_log_event::do_scan_and_update #4 0x0000000000a3c5ef in Rows_log_event::do_hash_scan_and_update #5 0x0000000000a3d7f7 in Rows_log_event::do_apply_event #6 0x0000000000a28e3a in Log_event::apply_event

    #7 0x0000000000a8365f in apply_event_and_update_pos #8 0x0000000000a84764 in exec_relay_log_event

    #9 0x0000000000a89e97 in handle_slave_sql

    #10 0x0000000000e341c3 in pfs_spawn_thread

    #11 0x0000003a00a07851 in start_thread ()

    #12 0x0000003a006e767d in clone ()

    Perform process description:





     if (m_key_index > MAX_KEY)



     ha_index_read_map(m_key from m_distinct_key_list)

     entry= m_hash->get()



     while (m_hash->size > 0);

    It can be seen from the implementation process, when using hash_scan, will be a full table scan only once, although will traverse m_hash the hash table for many times, but the scan is O (1), so the price is very small, so can reduce the scanning time, improve the execution efficiency.

    Hash_scan a bug

    Bug details:

    Bug: m_distinct_key_list index of the key is not the only, so there is a repeated to have removed the records deleted.

    Bug fix:

    Problem extension:

    ; In the case of no index, don't you open the hash_scan can improve efficiency,

    reduce the delay?Not necessarily, if every time update a record, only at this

    time still needs a full table scan, and because the cost of entry, there should be


    ; An event can contain many records update?The table structure and size of the

    recorded data, the size of an event not more than 9000 bytes, no parameters

    can control the size;

    ; Hash_scan any restrictions?Hash_scan will only be effective for update, and

    delete operations, for binlog_format = statement produces Query_log_event or

    binlog_format = row when Write_rows_log_event doesn't work;

    TokuDB version, optimize 7.5.0

    TokuDB 7.5.0 version has been published, is a milestone of version, here to talk about some optimization, fodder for storage engines enthusiasts.

    A) shutdown speed

    A user feedback TokuDB at the time of shutdown, haven't finished half an hour, very unacceptable.At the time of shutdown, TokuDB doing?In do the checkpoint, the memory of a node data serialization and compression to disk.

    Why is so time consuming?If tokudb_cache_size open is larger, the memory of nodes will be very much, at the time of shutdown, everyone queued to be compressed to disk (serial).

    In 7.5.0 version, TokuDB official has been optimized for this problem, make parallel compression multiple nodes to shorten the time.

    BTW: TokuDB in early design has kept parallel interface, but has not been open.

    B) node within the reading speed

    In memory, TokuDB node (internal node) of each message buffer are two important data structure:

    1) FIFO structure, save {key, value}

    2) the OMT structure, save - offset} {key, FIFO

    Because of FIFO does not have quickly find properties, using OMT to do quickly find (up according to the key value).So, when the node cache miss, index layer needs to be done: 1) read from disk node content into memory

    2) FIFO structure construction

    (3) based on FIFO structure OMT structure do sort)

    Because there are many TokuDB internal performance (ji) needle (shu), they found that the step 3) is a big performance consumption point, because every time the message buffer do sorting under OMT, so on 7.5.0 version, the OMT FIFO - offset (sorted) are persisted to disk, so they don't have a sort of loss.

    C) in order to write faster

    When writing happens, according to the current key lookup in the pivots (binary) current writing which mesage to fall into the buffer, if write is order (or partial order, data to the right path), can be avoided by "finding" extra overhead.

    How to judge whether a written order?TokuDB used a simple heuristic method (heurstic) : seqinsert_score integral type.If:

    1) current written to fall into the right node, the seqinsert_score add a point (atomic) 2) current written into the the right nodes, the seqinsert_score reset (atomic) when seqinsert_score is greater than 100, can be considered a written order, the next time write operation occurs, the first compared with the right node pivot judgment, if that is the writing, for the order will be written the node, save a lot of the compare overhead.The method is simple and effective.

    MariaDB, performance optimization, filesort with small LIMIT optimization

    From MySQL 5.6.2 / MariaDB 10.0.0 version, the MySQL/MariaDB for "ORDER BY... LIMIT n" statement implements a new optimization strategy.When n is small enough, the optimizer will use a priority queue to volume for n sort, rather than sorting all the data and then take out the first n.Is this new algorithm can describe: (assuming that ASC)

    1. Establish a only n elements of the priority queue (heap), the root node for the

    heap element

    2. According to the other conditions, in order to take out a line from the table


    3. If the current row sorting keywords is less than the pile head, the replacement

    of pile head, the elements of the current Shift again keep the characteristics of

    the pile

    4. Then take a data repeat step 2, if there is no a data is performed under 5

    5. In turn out in the heap element (from big to small), reverse output (smallest),

    then the ranking results of ASC

    Time complexity, this algorithm for m * log (n), m for the index filtration after the number of rows, n to LIMIT the number of rows.While the original full sorting algorithms, time complexity for m * log (m).As long as n is far less than m, this algorithm is effective.

    But in MySQL 5.6, in addition to optimizer_trace, no good way to see how much this new plan.MariaDB 10.013, providing a system state, you can look at the number of new execution plan calls:


    Description: through the priority queue sort.Sort (total = Sort_range + Sort_scan) Scope: Global, the Session

    Data type: numeric

    Introducing version: MariaDB 10.0.13 moreover, MariaDB will also this information into the missile in the Log.As long as you specify log_slow_verbosity = query_plan, can in missile see such records in the Log:

    # Time: 140714 18:30:39

     # User@Host: root[root] @ localhost []

     # Thread_id: 3 Schema: test QC_hit: No

     # Query_time: 0.053857 Lock_time: 0.000188 Rows_sent: 11 Rows_examined: 100011

     # Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No

     # Filesort: Yes Filesort_on_disk: No Merge_passes: 0 Priority_queue: Yes

     SET timestamp=1405348239;SET timestamp=1405348239;

     select * from t1 where col1 between 10 and 20 order by col2 limit 100;

    Priority_queue: "Yes," said this Query takes advantage of the priority queue execution plan (pt - Query - digest is now can parse Priority_queue this column).

Report this document

For any questions or suggestions please email