Shard-Query blog

The only open source MPP database engine for MySQL

Comparing Amazon Redshift and Shard-Query features and performance

What is Amazon Redshift and how does it compare to Shard-Query?

Amazon Redshift is the petabyte scale data warehousing system built by Amazon.  At the time of this writing it is a beta-quality data warehouse as a service platform hosted in the AWS cloud.  It has been built from ParAccel technology; Amazon is an investor in ParAccel.

Amazon Redshift works similarly to Shard-Query.  Both systems allow you to spread data over many machines and treat them as one logical machine.  This "shared nothing" architecture is commonly known by the shorter name "sharding".

Both systems essentially provide a "virtual database" composed of smaller databases.  It is like a RAID array of databases; in fact, you could think of it as a "redundant array of independent databases".

Sharding

Both Redshift and Shard-Query shard your data internally, that is, they transparently split the database up into smaller physical databases that represent one larger logical database.

Amazon Redshift is built on top of PostgreSQL 8 but has significant underlying changes in operation.

Similarly, Shard-Query is built to access MySQL 5.1+ with some limitations.  The limitations vary because Shard-Query can use different versions and variants of MySQL, each with its own incompatibilities with the core version.  Thus the available Shard-Query (and MySQL) features always depend somewhat on the storage engine being used.

In short, Redshift is the petabyte scale version of PostgreSQL, as built by Amazon and ParAccel.  Shard-Query is the petabyte scale version of MySQL,  as built by me.

Data distribution in Redshift

When a table is created in Redshift, the user can optionally choose a “distkey” for that table.   If no distkey is chosen, then the table is evenly distributed over all nodes.  In Redshift, joins between tables that do not share a distkey are possible but an expensive query plan may be generated which may require the transfer of a lot of data between nodes.  This is a problem common in most shared-nothing database systems.

With Redshift it is generally necessary to pick good sortkeys and distkeys to get good performance.  You must be careful to avoid joining large tables together if they are not both distributed on the same distkey.  The per-table "sortkey" indicates the physical order in which Redshift should store tuples in a table.

In many cases sort/merge joins can be made significantly less expensive by the careful selection of sortkey and distkey, because Redshift can then eliminate the sort phase of the sort/merge join when performing the join.  However, the schema and the expected queries must be constructed carefully to ensure that the distkey/sortkey combinations result in well-performing queries.
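As a minimal sketch (the full benchmark schema appears later in this post), a fact table and a dimension table that are distributed and sorted on the join column let Redshift join them locally and skip the sort phase:

CREATE TABLE dim_date
(
 D_DateKey int primary key,
 D_Year smallint
)
 distkey(D_DateKey)
 sortkey(D_Year, D_DateKey)
;
CREATE TABLE lineorder
(
 LO_OrderKey bigint not null,
 LO_OrderDateKey int not null references dim_date(D_DateKey)
)
 distkey(LO_OrderDateKey)
 sortkey(LO_OrderDateKey)
;
-- Rows joining on D_DateKey = LO_OrderDateKey are co-located on the
-- same node and already in join-key order on both sides.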

Data distribution in Shard-Query

Shard-Query operates a bit differently.  In Shard-Query the user must specify a “shard key”.  This is essentially a database-wide “distkey”.

When a table contains the shard key, the table is distributed over the nodes and is considered a sharded table.  Tables that do not contain the shard key are considered unsharded, and unsharded tables are completely duplicated (i.e., replicated) on each node.  For example, in this benchmark the shard key is LO_OrderDateKey, a column contained only in the fact table and guaranteed to result in even data distribution of the largest table.  The dimension tables (lookup tables) are duplicated on each shard.

This means that joins between sharded and unsharded tables are fast, because no data movement between nodes is necessary to complete the query.  The downside is that Shard-Query has no native understanding of how to join tables that have different shard keys.  In fact, the system is constructed so that you can't do that even if you want to: you may only have one shard key per virtual database.  If you really need to join very large tables which must be sharded on different columns, I suggest you investigate Hadoop, or of course you can always try Redshift and see if it performs well enough for you if Shard-Query doesn't fit your use case.
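As a hedged sketch of what this looks like in practice (the exact rewrites Shard-Query generates may differ), an aggregate over the sharded fact table joined to a duplicated dimension runs unchanged and in parallel on every shard, with only a trivial merge step at the end:

-- Sent to each shard as-is; dim_date is duplicated locally, so the
-- join needs no data movement between nodes.
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder join dim_date on lo_orderdatekey = d_datekey
where d_year = 1993;
-- The coordinator then simply sums the per-shard partial results,
-- since SUM is distributive.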

Shard-Query is geared mostly toward high performance querying of star schemas or monolithic tables.  These are common in data marts, data warehouses and in big data processing systems for call detail records, web logs and other data with a long tail.  It also works well when a "master/detail" relationship can be established, like blog posts and blog comments (both sharded by the shard key "blog_id").  The second case of course works well for Redshift too, as the joins between the tables will be local if the tables share the same distkey (blog_id in this case).

Database storage

Amazon Redshift is a columnar database.  It does not support indexes and it enforces no constraints, but it can use constraint definitions to create better informed plans.  In Redshift you must be careful: improperly defined constraints can lead to wrong results during query execution, especially during subquery evaluation.  Redshift has an automatic compression mechanism that can be used when COPY is used to load data from S3 or DynamoDB sources.

Shard-Query is MySQL storage engine agnostic.  It works with InnoDB, MyISAM, TokuDB, Infobright and probably more.  Thus you can have a compressing row store (TokuDB or XtraDB), a compressing column store (ICE or IEE), or some hybrid of both: for example, you could keep recent data in the row store for frequent modification, then move data into ICE over time.  Compression is automatic as long as the underlying storage engine compresses data automatically.
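A minimal sketch of that hybrid layout (the table definitions are abbreviated here, and I am assuming ICE registers its engine under the name BRIGHTHOUSE):

-- Recent, frequently modified data in a row store:
CREATE TABLE lineorder_recent
(
 LO_OrderKey bigint not null,
 LO_OrderDateKey int not null
) ENGINE=InnoDB;
-- Historical data moved into the compressing column store:
CREATE TABLE lineorder_archive
(
 LO_OrderKey bigint not null,
 LO_OrderDateKey int not null
) ENGINE=BRIGHTHOUSE;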

Query execution

Each Amazon Redshift cluster is assigned a “leader node”.  Some operations can be single threaded on the leader node, such as inserting into tables or doing certain types of loads.

A Shard-Query grid has no leader node.  Any node can operate as a leader for any operation, and all nodes can lead independently at the same time.  Coordination and a FIFO of work are handled through a message queue called Gearman.  As long as the storage engine supports concurrency, there are no limits on concurrent loading and querying, provided there are resources to support the concurrency.  If the underlying storage engine can not support concurrent insert (MyISAM, ICE) then loading performance is very limited right now.  This will be resolved in the next beta of Shard-Query.

Both systems support INSERT/UPDATE/DELETE/TRUNCATE/etc.  Shard-Query can only perform operations on a table that are allowed by the underlying storage engine.  For example, Shard-Query does not allow DML on ICE based tables, because ICE based tables can not be modified via DML.  If you try to modify ICE data with DML you will simply get an error from the storage engine that the operation is unsupported.
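To illustrate (a sketch; the exact error text comes from the engine and may vary):

-- Fine on an InnoDB-backed table; fails on an ICE table because the
-- engine does not support DML:
UPDATE lineorder SET LO_Discount = 0 WHERE LO_OrderKey = 1;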

MPP (massively parallel processing)

Both Shard-Query and Redshift access shards in parallel.  Redshift defines “slices” on each shard so that the CPU and memory resources on each node are divided evenly for query processing.   Shard-Query supports the same by placing multiple “shards” on the same physical machine, or by using table partitioning for engines which support it, or both.
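For engines that support it, a hedged sketch of the partitioning approach, which gives each node intra-shard parallelism:

CREATE TABLE lineorder
(
 LO_OrderKey bigint not null,
 LO_OrderDateKey int not null
) ENGINE=InnoDB
PARTITION BY HASH(LO_OrderDateKey)
PARTITIONS 4;
-- Each partition can be scanned by a separate worker, dividing CPU
-- and memory across the query much like Redshift's slices.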

Because both systems are based on a shared-nothing architecture, doubling the nodes roughly doubles the speed.

Analytics and custom functions

Redshift has SQL:2003 window functions (OVER (PARTITION BY …)), common table expressions and other SQL:2003 features that are part of PostgreSQL but not part of MySQL.  Thus Redshift is capable of executing advanced SQL:2003 queries that Shard-Query is not (yet) capable of.  Window functions are on the Shard-Query roadmap for the next major release after the current beta.

Shard-Query on the other hand can be extended with custom functions, but Redshift can not.  Shard-Query ships with a PERCENTILE example function which is similar to the NTILE SQL:2003 window function (I named it differently to avoid ambiguity.)  The extensible PHP nature of Shard-Query makes adding custom functions easy.
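For reference, a hedged example of the kind of SQL:2003 windowing query Redshift can run today and Shard-Query cannot yet:

-- Rank each brand's yearly revenue into percentiles within its year:
select d_year, p_brand,
 ntile(100) over (partition by d_year order by sum(lo_revenue)) as revenue_pctile
from lineorder
join dim_date on lo_orderdatekey = d_datekey
join part on lo_partkey = p_partkey
group by d_year, p_brand;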

In the next major release of Shard-Query this pluggable function interface will be used to implement all of the SQL:2003 windowing functions, but additional parser support is needed to make that completely possible.  This will bring Redshift and Shard-Query into SQL level parity, with Shard-Query still able to support custom functions if you want to add them.

Comparing performance of Redshift versus Shard-Query on SSB scale factor 200 (120GB raw data)

Not everything is equal in the world

Instance type differences

Amazon doesn’t appear to provide a way to spin up the same AMI type they use for Redshift as a normal EC2 instance.  That is, I could find no AWS EC2 instance which is a direct 1:1 mapping to a Redshift XL or 8XL node.  For the purposes of testing Shard-Query I decided to use the older m1.large instance type.  This instance has less CPU, half the RAM and half the storage of the Redshift XL nodes, so please take that into consideration when comparing the results.

Execution strategy differences

Redshift can perform joins very effectively if the sortkey and distkey are carefully selected.  I have configured the Redshift schema (attached later) with optimal sortkeys and distkeys for this benchmark.  This allows Redshift to perform the most common and largest join more effectively than Shard-Query/MySQL can.  Of course this means the scales are already tipped in Redshift's favor for some queries.  That being said, I feel it is only fair to compare both systems when each is configured in the most optimal way available.

On the other hand, by duplicating unsharded tables on all nodes, Shard-Query trades space for time: data is distributed once instead of transmitted for each query.

Context is very important

This benchmark is somewhat of an apples-to-oranges comparison.  Shard-Query has different capabilities than Redshift and is given fewer overall resources.  I'm not complaining about this; it is just important to understand all of the benchmark numbers in this context.

You should already understand that Shard-Query will likely not perform exactly the same as Redshift with the same data set and same queries, for the reasons above.  It may be worse, or it may be better.  Understanding why it is different is what is key.

Tested Configuration

    • Both tests used 16 nodes.  Shard-Query used 16 m1.large nodes (no leader node as one is not required).
    • Redshift was provisioned with 16 XL nodes and one leader node (not configurable).

Shard-Query was used with the newest version of Infobright Community Edition.  Each Shard-Query node ran two copies of the database software, each on a different port for isolation.  Thus there are two “slices” per node.  Redshift also places two slices per XL node.

Aggregated grid resources

    • At 16 nodes Shard-Query has 120GB of aggregate memory available.
      • Redshift has 240GB of aggregate memory.
    • At 16 nodes Shard-Query has 64 ECU.
      • Redshift has 70.4 ECU (plus the leader node).
    • At 16 nodes Shard-Query has 13,340GB of locally available database storage.
      • Redshift has 32,000GB.

Price/performance ratio

    • In the tested configuration Shard-Query costs $3.84/hour to run 16 nodes.  Redshift costs $13.60/hour.
    • Redshift exceeds Shard-Query performance by at most 3x.  Most queries are close in performance, at significantly less cost.
    • The price/performance argument for Shard-Query is very compelling.
    • With Shard-Query you can choose any instance size, from micro (not a good idea) all the way to high-I/O instances, the latter of which have 16 cores, 60GB of RAM and 2TB of SSD storage.
    • Only two node sizes are available in Redshift which limits flexibility.
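These hourly figures are consistent with 2013 on-demand pricing (approximate rates assumed): 16 m1.large instances at about $0.24/hour each gives 16 × 0.24 = $3.84/hour, while 16 Redshift XL nodes at $0.85/hour each gives 16 × 0.85 = $13.60/hour, roughly a 3.5:1 cost ratio.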

Response times (each measure is an average over three query executions)

[Chart: response times, query flight 1]

In this first set of tests you can see that the distkey/sortkey combination between dim_date and the fact table (lineorder) is very effective in increasing join performance in Redshift.

The date dimension has a sortkey of (D_Year, D_DateKey) and a distkey of (D_DateKey).

The fact table (lineorder) is being joined to dim_date using D_DateKey = LO_OrderDateKey, and lineorder has a distkey of LO_OrderDateKey and ALSO a sortkey of LO_OrderDateKey.

There is a predicate on D_Year, which allows the dimension filter to access both the dimension and fact tables in sorted order; thus the join happens on the same node and the sort phase of the sort/merge join can be bypassed.

Shard-Query does a hash join because that is what ICE does and that is just going to be slower than the fancy pre-sorted sort/merge optimization employed by Redshift.

I must say that the way Redshift does it is quite neat :)
[Chart: response times, query flight 2]

Here Shard-Query bests Redshift when the amount of data to be examined decreases, and performs very similarly when the data to be examined is large.  It looks like execution in the ICE engine may be slower when lots of data has to be uncompressed, but this needs more investigation.

[Chart: response times, query flight 3]

[Chart: response times, query flight 4]

These two flights show that Shard-Query performs fairly closely to Redshift, but Redshift still performs better in a lot of cases.  That being said, it is important to keep the price/performance measurement in mind.

Conclusion

In many cases Redshift does better in this benchmark, sometimes quite a bit better (around 3x better in some cases).

Keep in mind that Shard-Query is running with fewer resources at less than 1/3 the cost!

And remember that we could triple the size of the Shard-Query cluster, likely matching or beating Redshift performance, for the same amount of money.

Overall I am very happy with Shard-Query performance.  Let me know how it works for you.

Redshift Schema

CREATE TABLE customer
(
 C_CustomerKey int primary key,
 C_Name varchar(25),
 C_Address varchar(25),
 C_City varchar(10),
 C_Nation varchar(15),
 C_Region varchar(12),
 C_Phone varchar(15),
 C_MktSegment varchar(10)
)
 distkey(C_CustomerKey)
 sortkey(C_Region, C_CustomerKey)
;
CREATE TABLE part
(
 P_PartKey int primary key,
 P_Name varchar(25),
 P_MFGR varchar(10),
 P_Category varchar(10),
 P_Brand varchar(15),
 P_Colour varchar(15),
 P_Type varchar(25),
 P_Size smallint,
 P_Container char(10)
) 
 distkey(P_PartKey)
 sortkey(P_MFGR, P_PartKey)
;
CREATE TABLE supplier
(
 S_SuppKey int primary key,
 S_Name char(25),
 S_Address varchar(25),
 S_City char(10),
 S_Nation char(15),
 S_Region char(12),
 S_Phone char(15)
) 
 distkey ( S_SuppKey )
 sortkey ( S_Region, S_SuppKey )
;
CREATE TABLE dim_date
(
 D_DateKey int primary key,
 D_Date char(18),
 D_DayOfWeek char(9),
 D_Month char(9),
 D_Year smallint,
 D_YearMonthNum int,
 D_YearMonth char(7),
 D_DayNumInWeek smallint,
 D_DayNumInMonth smallint,
 D_DayNumInYear smallint,
 D_MonthNumInYear smallint,
 D_WeekNumInYear smallint,
 D_SellingSeason char(12),
 D_LastDayInWeekFl smallint,
 D_LastDayInMonthFl smallint,
 D_HolidayFl smallint,
 D_WeekDayFl smallint
) 
 distkey ( D_DateKey )
 sortkey ( D_Year,D_DateKey )
;
CREATE TABLE lineorder
(
 LO_OrderKey bigint not null,
 LO_LineNumber smallint not null,
 LO_CustKey int not null references customer(C_CustomerKey),
 LO_PartKey int not null references part(P_PartKey),
 LO_SuppKey int not null references supplier(S_SuppKey),
 LO_OrderDateKey int not null references dim_date(D_DateKey),
 LO_OrderPriority varchar(15),
 LO_ShipPriority char(1),
 LO_Quantity smallint,
 LO_ExtendedPrice decimal,
 LO_OrdTotalPrice decimal,
 LO_Discount decimal,
 LO_Revenue decimal,
 LO_SupplyCost decimal,
 LO_Tax smallint,
 LO_CommitDateKey int not null references dim_date(D_DateKey),
 LO_ShipMode varchar(10)
)
 distkey( LO_OrderDateKey )
 sortkey( LO_OrderDateKey )
;

ICE Schema

CREATE TABLE IF NOT EXISTS customer
(
 C_CustomerKey int ,
 C_Name varchar(25),
 C_Address varchar(25),
 C_City varchar(10),
 C_Nation varchar(15),
 C_Region varchar(12),
 C_Phone varchar(15),
 C_MktSegment varchar(10)
);

CREATE TABLE IF NOT EXISTS part
(
 P_PartKey int ,
 P_Name varchar(25),
 P_MFGR varchar(10),
 P_Category varchar(10),
 P_Brand varchar(15),
 P_Colour varchar(15),
 P_Type varchar(25),
 P_Size tinyint,
 P_Container char(10)
);

CREATE TABLE IF NOT EXISTS supplier
(
 S_SuppKey int ,
 S_Name char(25),
 S_Address varchar(25),
 S_City char(10),
 S_Nation char(15),
 S_Region char(12),
 S_Phone char(15)
);

CREATE TABLE IF NOT EXISTS dim_date
(
 D_DateKey int ,
 D_Date char(18),
 D_DayOfWeek char(9),
 D_Month char(9),
 D_Year smallint,
 D_YearMonthNum int,
 D_YearMonth char(7),
 D_DayNumInWeek tinyint,
 D_DayNumInMonth tinyint,
 D_DayNumInYear smallint,
 D_MonthNumInYear tinyint,
 D_WeekNumInYear tinyint,
 D_SellingSeason char(12),
 D_LastDayInWeekFl tinyint,
 D_LastDayInMonthFl tinyint,
 D_HolidayFl tinyint,
 D_WeekDayFl tinyint
);

CREATE TABLE IF NOT EXISTS lineorder
(
 LO_OrderKey bigint not null,
 LO_LineNumber tinyint not null,
 LO_CustKey int not null,
 LO_PartKey int not null,
 LO_SuppKey int not null,
 LO_OrderDateKey int not null,
 LO_OrderPriority varchar(15),
 LO_ShipPriority char(1),
 LO_Quantity tinyint,
 LO_ExtendedPrice decimal,
 LO_OrdTotalPrice decimal,
 LO_Discount decimal,
 LO_Revenue decimal,
 LO_SupplyCost decimal,
 LO_Tax tinyint,
 LO_CommitDateKey int not null,
 LO_ShipMode varchar(10)
);

Queries:

-- Q1.1
 select sum(lo_extendedprice*lo_discount) as
 revenue
 from lineorder join dim_date on lo_orderdatekey = d_datekey
 where
 d_year = 1993
 and lo_discount between 1 and 3
 and lo_quantity < 25;
-- Q1.2
 select sum(lo_extendedprice*lo_discount) as revenue
 from lineorder
 join dim_date on lo_orderdatekey = d_datekey
 where d_yearmonth = 'Jan1994' and lo_discount
 between 4 and 6 and lo_quantity between 26 and 35;
-- Q1.3
 select sum(lo_extendedprice*lo_discount) as revenue
 from lineorder
 join dim_date on lo_orderdatekey = d_datekey
 where d_weeknuminyear = 6
 and d_year = 1994
 and lo_discount between 5 and 7 and lo_quantity between 26 and 35;
-- Q2.1
 select sum(lo_revenue), d_year, p_brand
 from lineorder
 join dim_date
 on lo_orderdatekey = d_datekey
 join part
 on lo_partkey = p_partkey join supplier
 on lo_suppkey = s_suppkey
 where p_category = 'MFGR#12'
 and s_region = 'AMERICA'
 group by d_year, p_brand
 order by d_year, p_brand;
-- Q2.2
 select sum(lo_revenue), d_year, p_brand
 from lineorder
 join dim_date
 on lo_orderdatekey = d_datekey
 join part
 on lo_partkey = p_partkey
 join supplier
 on lo_suppkey = s_suppkey
 where p_brand between 'MFGR#2221' and 'MFGR#2228'
 and s_region = 'ASIA'
 group by d_year, p_brand
 order by d_year, p_brand;
-- Q2.3
 select sum(lo_revenue), d_year, p_brand
 from lineorder
 join dim_date
 on lo_orderdatekey = d_datekey
 join part
 on lo_partkey = p_partkey
 join supplier
 on lo_suppkey = s_suppkey
 where p_brand= 'MFGR#2239'
 and s_region = 'EUROPE'
 group by d_year, p_brand
 order by d_year, p_brand;
-- Q3.1
 select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
 from customer
 join lineorder
 on lo_custkey = c_customerkey
 join supplier
 on lo_suppkey = s_suppkey
 join dim_date on lo_orderdatekey = d_datekey
 where c_region = 'ASIA'
 and s_region = 'ASIA'
 and d_year >= 1992 and d_year <= 1997
 group by c_nation, s_nation, d_year
 order by d_year asc, revenue desc;
-- Q3.2
 select c_city, s_city, d_year, sum(lo_revenue) as revenue
 from customer
 join lineorder
 on lo_custkey = c_customerkey
 join supplier
 on lo_suppkey = s_suppkey
 join dim_date
 on lo_orderdatekey = d_datekey
 where c_nation = 'UNITED STATES'
 and s_nation = 'UNITED STATES'
 and d_year >= 1992
 and d_year <= 1997
 group by c_city, s_city, d_year
 order by d_year asc, revenue desc;
-- Q3.3
 select c_city, s_city, d_year, sum(lo_revenue) as revenue
 from customer
 join lineorder
 on lo_custkey = c_customerkey
 join supplier on lo_suppkey = s_suppkey
 join dim_date on lo_orderdatekey = d_datekey
 where (c_city='UNITED KI1' or c_city='UNITED KI5')
 and (s_city='UNITED KI1' or s_city='UNITED KI5')
 and d_year >= 1992
 and d_year <= 1997
 group by c_city, s_city, d_year
 order by d_year asc, revenue desc;
-- Q3.4
 select c_city, s_city, d_year, sum(lo_revenue)
 as revenue
 from customer
 join lineorder
 on lo_custkey = c_customerkey
 join supplier
 on lo_suppkey = s_suppkey
 join dim_date
 on lo_orderdatekey = d_datekey
 where
 (c_city='UNITED KI1' or c_city='UNITED KI5')
 and (s_city='UNITED KI1' or s_city='UNITED KI5')
 and d_yearmonth = 'Dec1997'
 group by c_city, s_city, d_year
 order by d_year asc, revenue desc;
-- Q4.1
 select d_year, c_nation,
 sum(lo_revenue - lo_supplycost) as profit
 from lineorder
 join dim_date
 on lo_orderdatekey = d_datekey
 join customer
 on lo_custkey = c_customerkey
 join supplier
 on lo_suppkey = s_suppkey
 join part
 on lo_partkey = p_partkey
 where
 c_region = 'AMERICA'
 and s_region = 'AMERICA'
 and (p_mfgr = 'MFGR#1'
 or p_mfgr = 'MFGR#2')
 group by d_year, c_nation
 order by d_year, c_nation;
-- Q4.2
 select d_year, s_nation, p_category,
 sum(lo_revenue - lo_supplycost) as profit
 from lineorder
 join dim_date
 on lo_orderdatekey = d_datekey
 join customer
 on lo_custkey = c_customerkey
 join supplier
 on lo_suppkey = s_suppkey
 join part
 on lo_partkey = p_partkey
 where
 c_region = 'AMERICA'
 and s_region = 'AMERICA'
 and (d_year = 1997 or d_year = 1998)
 and (p_mfgr = 'MFGR#1'
 or p_mfgr = 'MFGR#2')
 group by d_year, s_nation, p_category
 order by d_year, s_nation, p_category;
-- Q4.3
 select d_year, s_city, p_brand,
 sum(lo_revenue - lo_supplycost) as profit
 from lineorder
 join dim_date
 on lo_orderdatekey = d_datekey
 join customer
 on lo_custkey = c_customerkey
 join supplier
 on lo_suppkey = s_suppkey
 join part
 on lo_partkey = p_partkey
 where
 s_nation = 'UNITED STATES'
 and (d_year = 1997 or d_year = 1998)
 and p_category = 'MFGR#14'
 group by d_year, s_city, p_brand
 order by d_year, s_city, p_brand;

4 responses to “Comparing Amazon Redshift and Shard-Query features and performance”

  1. Jeff Kibler July 17, 2013 at 11:15 AM

Very nice writeup, Justin. Given that you used ICE, you were lacking multi-core query execution. Assuming that multi-core can be used, you can expect a significant improvement. Given Redshift has a hard limit of 15 concurrent queries, I’d be curious to see how Shard-Query performs against Redshift in high concurrency environments. Specifically, I’d like to see Infobright’s Enterprise Edition against Redshift in high concurrency environments using Shard-Query.

  2. Justin July 17, 2013 at 11:24 AM

    Jeff,
    I ran two copies of ICE on each node, so I got parallel query on each node using ICE. If I can get 16 evaluation licences for IEE, then I’ll set up with only a single copy of the software on each node. I’m unsure exactly how the performance will differ, which is why I would like to test it.

I think that Amazon Redshift and Shard-Query should both degrade linearly with concurrency. They should both be getting 100% CPU utilization for these queries as the data set fits in RAM, thus the queries are CPU bound. Regardless, in both systems, the more concurrency there is, the slower each query will become, but predictably so.

  3. santhoshsd (@santhoshsd) October 10, 2013 at 1:56 AM

Instance store is not a good option in production environments. AWS says don’t use instance storage unless you have a synchronous copy of your data on another node, as there is a high chance of losing data with instance store.

    If we have to use EBS:

    Storage
    To have 2 TB of compressed data on EBS will cost around $200/month,
    and we would need a minimum of 20,000 IOPS (1:10 ratio), which will cost around $2,100/month.
    2 TB of backup – $200/month

    Yearly this comes to about $30,000 for storage alone, excluding compute costs. There is no option to reserve storage and IOPS.
    This would cost about $2,000 in Redshift.

    Management
    Management is a lot easier in Redshift. Snapshot and restore are easy, though restore time increases with data size.
    Backup and restore will take a long time with Shard-Query.
    Ensuring high availability / fault tolerance on a Shard-Query setup would be a major task.

    Looking forward to your opinion on this.

    • Justin February 13, 2014 at 10:18 AM

You normally use an instance store and keep LVM snapshots in S3, or you can replicate for an HA solution. For example, each shard can be a three-node, region-distributed PXC cluster on instance store. If using other storage engines, normal MySQL replication or DRBD can be used for data duplication.

      If you want complete redundancy in three regions, you can do it for the same cost as the Redshift deployment and use all of the nodes for computation for improved performance.
