Shard-Query blog

The only open source MPP database engine for MySQL

Monthly Archives: February 2014

Shard-Query supports background jobs, query parallelism, and all SELECT syntax

SkySQL just blogged about a tool that schedules long-running MySQL jobs, prevents too many queries from running simultaneously, and stores the results in tables.  It even uses Gearman.  You will note that the article says it uses PAQU, which in turn uses Shard-Query.

I think PAQU was created for two reasons: A) Shard-Query lacked support for fast aggregation of STDDEV and VARIANCE (this has been fixed), and B) their data set requires “cross-shard queries”.  From what I can see, though, their type of cross-shard query can be solved with subqueries in the FROM clause using Shard-Query, instead of a customized (forked) version of Shard-Query.  This is unfortunate, because my recent improvements to Shard-Query now have to be ported into PAQU by the PAQU authors.

I’d like to encourage you to look at Shard-Query if you need to run complex jobs in the background and get the results later.  As a bonus, you get support for parallel query using partitioning, and you get shared-nothing scale-out for sharding too.  You get the latest and greatest improvements to Shard-Query, and if you have a query that you aren’t sure can be executed by Shard-Query, run the problem by me first so that I can suggest solutions before you fork the product.

Also:
Since scientific data sets rarely (if ever) need to be updated, a column store like Infobright is ideal for such data.  Compression can reduce raw multi-TB data sets into multi-hundred-GB data sets that can be processed much faster.  Combine this with the ability to save on I/O by reading only the necessary columns, and extremely fast data processing is possible.

To run background jobs in Shard-Query:

  $SQ = new ShardQuery();
  $SQ->async = true;
  $job_id = $SQ->query("select count(*) from 1TB_table");
  echo "Check the shard_query.jobs table for the completion status of $job_id\n";

To poll for completed jobs:

  function wait_for_jobid($SQ, $job_id) {
    // Guard against non-numeric input, and parenthesize the OR so it
    // does not match every job that is at 100%
    $job_id = !is_numeric($job_id) ? 'NULL' : $job_id;
    $sql = "select * from jobs where (completion_percent = 100.00 or completion = 'error') and job_id = " . $job_id;
    while(1) {
      $SQ->state->mapper->sweep_jobs();
      $stmt = $SQ->state->mapper->conn->my_query($sql);
      // a matching row means the job finished (or errored out)
      if($SQ->DAL->my_fetch_assoc($stmt)) break;
      sleep(1);
    }
    return 1;
  }

To get the result after it is finished:

  $stmt = $SQ->get_async_result($job_id);
  while($row = $SQ->DAL->my_fetch_assoc($stmt)) {
    print_r($row);
  }

Note: SimpleDAL – Shard-Query data access layer
You will notice the code above uses the ->DAL member.  This is the SimpleDAL interface, which ships with Shard-Query.  It allows Shard-Query to talk to different database providers (possibly even NoSQL providers) through a unified interface.  The class operates much like the mysqli class, except the methods are prefixed with my_ instead of mysqli_.  Internally it uses PDO to talk to MySQL.
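To illustrate the pattern (this is a hypothetical sketch, not the shipped SimpleDAL code): a thin adapter class exposes mysqli-style methods under my_ names while delegating to PDO internally.  The class name MyPdoDal is made up for this example, and an in-memory SQLite database stands in for MySQL just to keep the sketch self-contained.

```php
<?php
// Hypothetical sketch of the SimpleDAL idea: mysqli-style method names
// (my_ prefix) backed internally by PDO. Not the actual SimpleDAL code.
class MyPdoDal {
    private $pdo;
    public function __construct($dsn) {
        $this->pdo = new PDO($dsn);
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }
    // cf. mysqli::query()
    public function my_query($sql) {
        return $this->pdo->query($sql);
    }
    // cf. mysqli_result::fetch_assoc(); returns false when rows are exhausted
    public function my_fetch_assoc($stmt) {
        return $stmt->fetch(PDO::FETCH_ASSOC);
    }
}

// In-memory SQLite stands in for MySQL to keep this runnable anywhere
$dal  = new MyPdoDal('sqlite::memory:');
$stmt = $dal->my_query('select 1 as one');
while ($row = $dal->my_fetch_assoc($stmt)) {
    print_r($row); // associative row, keyed by column alias
}
```

The payoff of this layout is that swapping the backend only means swapping what the my_ methods delegate to; calling code never touches PDO or mysqli directly.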

Shard-Query is now much faster for some aggregate functions

I checked in some improvements to Shard-Query.

Now STD, STDDEV_SAMP, VARIANCE, and VAR_SAMP can be orders of magnitude faster for large data sets, because they are now distributed like COUNT, AVG, and the other fully distributable MySQL aggregate functions.  Prior to this change, my test query would create a 22GB (530M row) temporary table.  It now creates a one-row temporary table.  This reduces network traffic and temporary storage space, and it increases performance.
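The reason variance distributes like COUNT and AVG is that each shard only needs to ship three partial aggregates: COUNT(x), SUM(x), and SUM(x*x).  The coordinator can then combine them, since population variance is E[x²] − (E[x])².  Here is a minimal sketch of that combining step; combine_var_pop is an illustrative helper, not part of Shard-Query’s API.

```php
<?php
// Illustrative sketch (not Shard-Query code): combining per-shard
// partial aggregates into a global population variance.
// Each shard runs: SELECT COUNT(x), SUM(x), SUM(x*x) FROM t
function combine_var_pop(array $partials) {
    $n = 0; $sum = 0.0; $sumsq = 0.0;
    foreach ($partials as $p) {
        // $p = [count, sum, sum of squares] from one shard
        $n     += $p[0];
        $sum   += $p[1];
        $sumsq += $p[2];
    }
    // population variance: E[x^2] - (E[x])^2
    return $sumsq / $n - ($sum / $n) ** 2;
}

// Two shards holding [1,2,3] and [4,5] respectively:
$partials = [
    [3, 6.0, 14.0],   // shard 1: count=3, sum=6,  sum of squares=14
    [2, 9.0, 41.0],   // shard 2: count=2, sum=9,  sum of squares=41
];
echo combine_var_pop($partials), "\n"; // matches VAR_POP over [1,2,3,4,5], i.e. 2
```

Because only these three scalars cross the network per shard (per group), the 530M-row intermediate result collapses to a handful of numbers, which is where the speedup comes from.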

Shard-Query now reports initialization errors more clearly. In particular, if you don’t have gearmand running you will no longer get cryptic PHP output, but a nice error message.

You can change the storage engine for the repo by changing only one line in shard_query.sql. This makes it easier to install on Infobright, which requires MyISAM rather than InnoDB tables for the repo.