Shard-Query blog

The only open source MPP database engine for MySQL

Tag Archives: shard-query

Shard-Query supports background jobs, query parallelism, and all SELECT syntax

SkySQL just blogged about a tool to schedule long running MySQL jobs, prevent too many queries from running simultaneously, and stores the results in tables.  It even uses Gearman.  You will note that the article says that it uses PAQU, which uses Shard-Query.

I think PAQU was created for two reasons.  A) Shard-Query lacked support for fast aggregation of STDDEV and VARIANCE (this has been fixed), and B) their data set requires “cross-shard queries”.  From what I can see though, their type of cross-shard queries can be solved using subqueries in the FROM clause using Shard-Query, instead of using a customized (forked) version of Shard-Query.  It is unfortunate, because my recent improvements to Shard-Query have to be ported into PAQU by the PAQU authors.

I’d like to encourage you to look at Shard-Query if you need to run complex jobs in the background and get the results later.  As a bonus, you get support for parallel query using partitioning, and you get shared-nothing scale-out for sharding too.  You get the latest and greatest improvements to Shard-Query, and if you have a query that you aren’t sure can be executed by Shard-Query, run the problem by me first so that I can suggest solutions before you fork the product.

Since scientific data sets rarely need to be updated (if ever), a column store like Infobright is ideal for such data.  This can reduce raw multi-TB size data sets into multi-hundred GB sized data sets that can be processed much faster.  Combine this with the ability to save on IO by reading on the necessary columns, and extremely fast data processing is possible.

To run background jobs in Shard-Query:

  1. $SQ = new ShardQuery();
  2. $SQ->async = true;
  3. $job_id=$SQ->query(“select count(*) from 1TB_table”);
  4. echo “Check the table for the completion status of $job_id\n”;

To poll for completed jobs:

  1. function wait_for_jobid($job_id) {
  2.   $sql = “select * from jobs where completion_percent = 100.00 or completion = ‘error’ and job_id = ” . (!is_numeric($job_id) ? ‘NULL’ : $job_id);
  3.   while(1) {
  4.     $SQ->state->mapper->sweep_jobs();
  5.     $stmt = $SQ->state->mapper->conn->my_query($sql);
  6.     $cnt = 0;
  7.     while($row=$SQ->DAL->my_fetch_assoc($stmt)) {
  8.       $cnt++;
  9.       break;
  10.     }
  11.     if($cnt > 0) break;
  12.     sleep(1);
  13.   }
  14.   return 1;
  15. }

To get the result after it is finished:

  1. $stmt = $SQ->get_async_result($job_id);
  2. while($row = $SQ->DAL->my_fetch_assoc($stmt)) {
  3. print_r($row);
  4. }

Note: SimpleDAL – Shard-Query data access layer
You will notice the above query uses the ->DAL member.  This is the SimpleDAL interface which ships with Shard-Query.  It allows Shard-Query to talk to different database providers (even possibly NoSQL providers) through a unified interface.  The class operates similarly to the mysqli class, except the functions are prefixed with my_ instead of mysqli.  Internally it uses PDO to talk to MySQL.

Shard-Query 2.0 Beta 1 released

It is finally here.  After three years of development, the new version of Shard-Query is finally available for broad testing.

This new version of Shard-Query is vastly improved over previous versions in many ways.  This is in large part due to the fact that the previous version of Shard-Query (version 1.1) entered into production at a large company.  Their feedback during implementation was invaluable in building the new Shard-Query features.   The great thing is that this means that many of the new 2.0 features have already been tested in at least one production environment.

This post is intended to highlight the new features in Shard-Query 2.0.  I will be making posts about individual features as well as posting benchmark results.

Configuration Repository
Shard-Query 1.x stored the configuration in a .ini file.  In an installation with many nodes, keeping the .ini file in sync was difficult.  There was no way to guarantee that all of Shard-Query saw the state of the cluster (in terms of node composition) as an entire unit.  Shard-Query 2.0 solves this problem and makes configuration and bootstrapping of new nodes simpler.  Shard-Query 2.0 stores the configuration inside of a database schema, rather than in a flat file.  Each node is made aware of the configuration repository through the node setup process.  The cluster configuration is then read from the repository each time Shard-Query initializes. Read more of this post