Shard-Query blog

The only open source MPP database engine for MySQL

Tag Archives: mysql

SQL injection in the MySQL server (of the proxy kind!)

As work on WarpSQL (Shard-Query 3) progresses, it has outgrown MySQL proxy.  MySQL proxy is a very useful tool, but it requires LUA scripting, and it is an external daemon that needs to be maintained.  The MySQL proxy module for Shard-Query works well, but to make WarpSQL into a real distributed transaction coordinator, moving the proxy logic inside of the server makes more sense.

The main benefit of MySQL proxy is that it allows a script to “inject” queries between the client and server, intercepting the results and possibly sending back new results to the client.  I would like similar functionality, but inside of the server.

For example, I would like to implement new SHOW commands, and these commands do not need to be implemented as actual MySQL SHOW commands under the covers.

For example, for this blog post I made a new example command called “SHOW PASSWORD

Example “injection” which adds SHOW PASSWORD functionality to the server
mysql> select user();
+----------------+
| user()         |
+----------------+
| root@localhost |
+----------------+
1 row in set (0.00 sec)

-- THIS COMMAND DOES NOT EXIST
mysql> show password;
+-------------------------------------------+
| password_hash                             |
+-------------------------------------------+
| *00A51F3F48415C7D4E8908980D443C29C69B60C9 |
+-------------------------------------------+
1 row in set (0.00 sec)

Important – This isn’t a MySQL proxy plugin.  There is C++ code in the SERVER to answer that query, but it isn’t the normal SHOW command code.  This “plugin” (I put it in quotes because my plan is for a pluggable interface but it isn’t added to the server yet) doesn’t access the mysql.user table using normal internal access methods.  It runs actual SQL inside of the server, on the same THD as the client connection, in the same transaction as the client connection, to get the answer!

Problem #1 – Running SQL in the server

The MySQL C client API doesn’t have any methods for connecting to the server from inside of the server, except to connect to the normally available socket interfaces, authenticate, and then issue queries like a normal client.  While it is perfectly possible to connect to the server as a client in this manner, it is sub-optimal for a number of reasons.  First, it requires a second connection to the server, second, it requires that you authenticate again (which requires you have the user’s password), and lastly, any work done in the second connection is not party to transactional changes in the first, and vice-versa.

The problem is communication between the client and server, which uses a mechanism called VIO.  There was work done a long time ago for external stored procedures, which never made it into the main server that would have alleviated this problem by implementing a in-server VIO layer, and making the parser re-entrant.  That work was done on MySQL 5.1 though.

It is possible to run queries without using VIO though.  You simply can’t get results back, except to know if the query succeeded or not.  This means it is perfectly acceptable for any command that doesn’t need a resultset, basically anything other than SELECT.  There is a loophole however, in that any changes made to the THD stay made to that THD.  Thus, if the SQL executed sets any user variables, then those variables are of course visible after query execution.

Solution  – encapsulate arbitrary SQL resultsets through a user variable

Since user variables are visible after query execution, the goal is to get the complete results of a query into a user variable, so that the resultset can be accessed from the server.  To accomplish this, first a method to get the results into the variable must be established, and then some data format for communication that is amenable to that method has to be decided upon so that the resultset can be accessed conveniently..

With a little elbow grease MySQL can convert any SELECT statement into CSV resultset.  To do so, the following are used:

  1. SELECT … INTO @user_variable
  2. A subquery in the FROM clause (for the original query)
  3. CONCAT, REPLACE, IFNULL, GROUP_CONCAT (to encode the resultset data)
Here is the SQL that the SHOW PASSWORD command uses to get the correct password:
select authentication_string as pw,
       user 
  from mysql.user 
 where concat(user,'@',host) = USER() 
    or user = USER() 
LIMIT 1
Here is the “injected” SQL that the database generates to encapsulate the SQL resultset as CSV:
select 
  group_concat( 
    concat('"',
           IFNULL(REPLACE(REPLACE(`pw`,'"','\\"'),"\n","\\n"),"\N"),
           '"|"',
           IFNULL(REPLACE(REPLACE(`user`,'"','\\"'),"\n","\\n"),"\N"),
           '"'
    ) 
  separator "\n"
  ) 
from 
  ( select authentication_string as pw,
           user 
      from mysql.user 
      where concat(user,'@',host) = USER() 
        OR user = USER() 
    LIMIT 1
  ) the_query 
into @sql_resultset ;
Query OK, 1 row affected (0.00 sec)
Here is the actual encapsulated resultset.  If there were more than one row, they would be newline separated.
mysql> select @sql_resultset;
+----------------+
| @sql_resultset |
+----------------+
| ""|"root"      |
+----------------+
1 row in set (0.00 sec)

Injecting SQL in the server

With the ability to encapsulate resultsets into CSV in user variables, it is possible to create a cursor over the resultset data and access it in the server.  The MySQL 5.7 pre-parse rewrite plugins, however,  still run inside the parser.  The THD is not “clean” with respect to being able to run a second query.  The parser is not re-entrant.  Because I desire to run (perhaps many) queries between the time a user enters a query and the server actually answers the query (perhaps with a different query than the user entered!) the MySQL 5.7 pre-parse rewrite plugin infrastructure doesn’t work for me.

I modified the server, instead, so that there is a hook in do_command() for query injections.  I called it conveniently query_injection_point() and the goal is to make it a new plugin type, but I haven’t written that code yet.  Here is the current signature for query_injection_point():

bool query_injection_point(
  THD* thd, COM_DATA *com_data, enum enum_server_command command,
  COM_DATA* new_com_data, enum enum_server_command* new_command );

It has essentially the same signature as dispatch_command(), but it provides the ability to replace the command, or keep it as is.  It returns true when the command has been replaced.

Because it is not yet pluggable, here is the code that I placed in the injection point:

/* TODO: make this pluggable */
bool query_injection_point(THD* thd, COM_DATA *com_data, enum enum_server_command command,
 COM_DATA* new_com_data, enum enum_server_command* new_command)
{
 /* example rewrite rule for SHOW PASSWORD*/
 if(command != COM_QUERY)
 { return false; }
 
 /* convert query to upper case */
 std::locale loc;
 std::string old_query(com_data->com_query.query,com_data->com_query.length);
 for(unsigned int i=0;i<com_data->com_query.length;++i) {
   old_query[i] = std::toupper(old_query[i], loc);
 } 
   
 if(old_query == "SHOW PASSWORD")
 {
   std::string new_query;
   SQLClient conn(thd);
   SQLCursor* stmt;
   SQLRow* row;

   if(conn.query("pw,user",
    "select authentication_string as pw,user from mysql.user " \
    "where concat(user,'@',host) = USER() or user = USER() LIMIT 1", &stmt))
   {
     if(stmt != NULL)
     {
       if((row = stmt->next()))
       {
          new_query = "SELECT '" + row->at(0) + "' as password_hash";
       }
       } else
       {
         return false;
       }
     } else {
       return false;
     }

     /* replace the command sent to the server */
     if(new_query != "")
     {
       Protocol_classic *protocol= thd->get_protocol_classic();
       protocol->create_command(
         new_com_data, COM_QUERY, 
         (uchar *) strdup(new_query.c_str()), 
         new_query.length()
       );
       *new_command = COM_QUERY;
     } else {
       if(stmt) delete stmt;
       return false;
     }
     if(stmt) delete stmt;
     return true;
   }
 }

 /* don't replace command */
 return false;
}

SQLClient

You will notice that the code access the mysql.user table using SQL, using the SQLClient, SQLCursor, and SQLRow objects.  These are the objects that wrap around encapsulating the SQL into a CSV resultset, and actually accessing the result set.  The interface is very simple, as you can see from the example.  You create a SQLClient for a THD (one that is NOT running a query already!) and then you simply run queries and access the results.

The SQLClient uses a stored procedure to methodically encapsulate the SQL into CSV and then provides objects to access and iterate over the data that is buffered in the user variable.  Because MySQL 5.7 comes with the sys schema, I placed the stored procedure into it, as there is no other available default database that allows the creation of stored procedures.  I called it sys.sql_client().

Because the resultset is stored as text data, the SQLRow object returns all column values as std::string.

What’s next?

I need to add a proper plugin type for “SQL injection plugins”.  Then I need to work on a plugin for parallel queries.  Most of the work for that is already done, actually, at least to get it into an alpha quality state.  There is still quite a bit of work to be done though.

You can find the code in the internal_client branch of my fork of MySQL 5.7:

http://github.com/greenlion/warpsql-server

 

New Shard-Query features checked into SVN

I checked some updates to Shard-Query into SVN.

Partitioning support has been extended for MySQL 5.6+ to ALL partitioning types.

This includes all previously unsupported types including RANGE LIST/COLUMNS partitioned tables that are partitioned over more than one column, and HASH/KEY/LINEAR variants as well. Shard-Query now exclusively uses the PARTITION hint for partition elimination instead of WHERE clauses in MySQL 5.6. For 5.5 and previous versions, support remains limited to LIST,RANGE, and LIST/RANGE COLUMNS over a single column.

The old mysql interface DAL has been replaced completely by the PDO DAL.

There is no major difference for end users except that you have to check that the return of the query() method is an object with the is_object() function instead of checking that it is a resource with the is_resource() function. I updated bin/run_query, which is the example application.

I made a few bug fixes to the PDO DAL as well.  Thanks again to Alex Hurd for contributing the PDO DAL.

Fix for INFORMATION_SCHEMA.PARTITIONS losing table stats

Here is a fix for the MySQL/TokuDB/MariaDB bug I reported earlier today.  I think this fix is correct (it is only one line) but I don’t delve into the storage engine very often (and particularly not into ha_partition.cc) so I think it would be good to wait for Oracle (or Percona, MariaDB, or Tokutek) to validate that it is correct before using it.

diff -u ha_partition.cc /tmp/ha_partition.cc 
--- ha_partition.cc 2013-04-05 05:27:18.000000000 -0700
+++ /tmp/ha_partition.cc 2013-05-27 02:45:01.680676228 -0700
@@ -6455,9 +6455,11 @@
void ha_partition::get_dynamic_partition_info(PARTITION_STATS *stat_info,
uint part_id)
{
+
handler *file= m_file[part_id];
DBUG_ASSERT(bitmap_is_set(&(m_part_info->read_partitions), part_id));
- file->info(HA_STATUS_CONST | HA_STATUS_TIME | HA_STATUS_VARIABLE |
+ 
+ info(HA_STATUS_CONST | HA_STATUS_TIME | HA_STATUS_VARIABLE |
HA_STATUS_VARIABLE_EXTRA | HA_STATUS_NO_LOCK);
stat_info->records= file->stats.records;

Shard-Query 2.0 performance on the SSB with InnoDB on Tokutek’s MariaDB distribution

Scaling up a workload to many cores on a single host

Here are results for Shard-Query 2.0 Beta 1* on the Star Schema Benchmark at scale factor 10.  In the comparison below the “single threaded” response times for InnoDB are the response times reported in my previous test which did not use Shard-Query.

Shard-Query configuration

Shard-Query has been configured to use a single host.  The Shard-Query configuration repository is stored on the host.  Gearman is also running on the host, as are the Gearman workers.  In short, only one host is involved in the testing.

The Shard-Query response times are for 6 gearman workers.  There are six physical cores in my test machine.  In my testing I’ve found that Shard-Query works best when the number of  workers is equal to the number of physical cores.

Why partitions?

As in the previous test the lineorder table is partitioned.  This allows Shard-Query to automatically take advantage of multiple cores without changing any of the queries.

How?  Well Shard-Query transforms a single expensive query into smaller “tasks”.  Each task is a query which examines a small amount of data.  Shard-Query takes a “divide and conquer” approach, where the data is divided into small chunks, and the chunks are operated on in parallel.  Shard-Query treats each partition as a chunk.  Future versions of Shard-Query will support subpartitions and hash partitions in MySQL 5.6.

Results

In general, Shard-Query is faster than MySQL both cold and hot.  There are a few cases where the speed is about the same, or where Shard-Query is slower.  I believe this is due to MySQL bug #68079: queries may not scale linearly on MySQL.  Star schema optimization is not turned on for these queries, so each of the sub-tasks is joining many rows.

1_cold

1_hot

2_cold

2_hot

3_cold

3_hot

4_cold

4_hot

Software:
Shard-Query 2.0 Beta 1 (patched for I_S.partitions)
Tokutek MariaDB 5.5.30-7.0.1

MySQL bug 69179 – INFORMATION_SCHEMA.PARTITIONS causes query plan changes

Shard-Query examines INFORMATION_SCHEMA.PARTITIONS to determine if a table is partitioned.  When a table is partitioned, Shard-Query creates multiple background queries, each limited to a single partition.  Unfortunately, it seems that examining INFORMATION_SCHEMA.PARTITIONS causes query plans to change after the view is accessed.

I have reported bug 69179 to MySQL AB  Oracle Corporation(old habits die hard).

Be careful: If you have automated tools (like schema management GUI tools) then make sure they don’t examine INFORMATION_SCHEMA.PARTITIONS or you may get bad plans until you analyze your tables or restart the database, even if using persistent stats.

I can only get the bug to happen when a WHERE clause is issued that limits access to a single partition.  It may be that the per partition statistics get wiped out when the I_S table is accessed, but the per table stats remain unchanged.

In order to get stable and good query plans I implemented a workaround in my local Shard-Query branch.  I replaced access to INFORMATION_SCHEMA.PARTITIONS with a copy of the table stored in shard_query.partitions.  This ensures that the query plans don’t change between the cold run and the hot run.  This bug affects MySQL 5.5.31, 5.6.11, TokuDB 5.5.30, mariadb-5.5.30-tokudb-7.0.1 and possibly others.

Shard-Query 2.0 Beta 1 released

It is finally here.  After three years of development, the new version of Shard-Query is finally available for broad testing.

This new version of Shard-Query is vastly improved over previous versions in many ways.  This is in large part due to the fact that the previous version of Shard-Query (version 1.1) entered into production at a large company.  Their feedback during implementation was invaluable in building the new Shard-Query features.   The great thing is that this means that many of the new 2.0 features have already been tested in at least one production environment.

This post is intended to highlight the new features in Shard-Query 2.0.  I will be making posts about individual features as well as posting benchmark results.

Configuration Repository
Shard-Query 1.x stored the configuration in a .ini file.  In an installation with many nodes, keeping the .ini file in sync was difficult.  There was no way to guarantee that all of Shard-Query saw the state of the cluster (in terms of node composition) as an entire unit.  Shard-Query 2.0 solves this problem and makes configuration and bootstrapping of new nodes simpler.  Shard-Query 2.0 stores the configuration inside of a database schema, rather than in a flat file.  Each node is made aware of the configuration repository through the node setup process.  The cluster configuration is then read from the repository each time Shard-Query initializes. Read more of this post