You are here

Query performance comparison between MariaDB Column Store and other Storage Engines

Storage Engines like InnoDB, Aria and MyISAM are Row Stores. They store rows one after the other in blocks or even directly in a single file (MyISAM). On the other hand a Column Store like MariaDB Column Store stores all the same attributes (columns) of the rows together in chunks.

This is how the table sales_fact looks like:

CREATE TABLE `sales_fact` (
  `product_id` int(11) NOT NULL,
  `time_id` int(11) NOT NULL,
  `customer_id` int(11) NOT NULL,
  `promotion_id` int(11) NOT NULL,
  `store_id` int(11) NOT NULL,
  `store_sales` decimal(10,2) NOT NULL,
  `store_cost` decimal(10,4) NOT NULL,
  `unit_sales` int(11) NOT NULL
) ENGINE=ColumnStore DEFAULT CHARSET=utf8;

CREATE TABLE `sales_fact` (
  `product_id` int(11) NOT NULL,
  `time_id` int(11) NOT NULL,
  `customer_id` int(11) NOT NULL,
  `promotion_id` int(11) NOT NULL,
  `store_id` int(11) NOT NULL,
  `store_sales` decimal(10,2) NOT NULL,
  `store_cost` decimal(10,4) NOT NULL,
  `unit_sales` int(11) NOT NULL,
  KEY `i_sales_customer_id` (`customer_id`),
  KEY `i_sales_product_id` (`product_id`),
  KEY `i_sales_promotion_id` (`promotion_id`),
  KEY `i_sales_store_id` (`store_id`),
  KEY `i_sales_time_id` (`time_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

As you may have noted the Column Store table does NOT contain indexes!

sales_fact table

And this shows how it is stored in memory and on disk (left Row Store, right Column Store; in reality it is a bit more complicated):

Row Store     Column Store

Considering this and depending on your queries a Column Store can have some significant response time advantages compared to a Row Store. And the other way around as well!

Comparison with MyISAM

So let us compare some queries with the same set of data. We are still working with one single-node MariaDB Column Store as described here.

We have chosen MyISAM for comparison with MariaDB Column Store because it is the fastest storage engine we have so far for low concurrency SELECT statements. As data set we have used the good old foodmart schema prepared for Column Store. We concentrated on the sales_fact table because this is the table with the biggest amount of rows. The MyISAM key_buffer_size was set to 128 Mibyte which is big enough for all the indices.

Simple SELECT queries

SELECT COUNT(*) FROM table

As experienced MyISAM users know already for this query MyISAM is unbeatable fast:

SQL> SELECT COUNT(*) FROM foodmart_cs.sales_fact;
1 row in set (0.040 sec)

SQL> SELECT COUNT(*) FROM foodmart_myisam.sales_fact;
1 row in set (0.000 sec)

This query is mainly to make sure that both tables have the same amount of data...

SELECT * FROM table or CHECKSUM TABLE table

In these queries we do a full table scans. This is probably the worst pattern you can do to Column Store. And MyISAM is significantly faster in this:

SQL> SELECT * FROM foodmart_cs.sales_fact;
1078880 rows in set (1.833 sec)

SQL> SELECT * FROM foodmart_myisam.sales_fact;
1078880 rows in set (0.607 sec)

But this pattern is NOT what a Column Store is made for...

The Column Store architecture consists of 2 different types of modules: The User Module (UM) consisting of the MariaDB Server (mysqld), the Execution Manager (ExeMgr) and the Distribution Managers (DMLProc, DDLProc and cpimport). These processes are responsible for parsing SQL and distributing and executing the SQL statements.

The other type of module is the Performance Module (PM) consisting of the Managing and Monitoring Process (ProcMgr and ProcMon), the Primary Process (PrimProc) which handles the query execution and the Performance Module process which handles loads and writes (WriteEngineServer and cpimport). The Performance Module basically performs the work and does the I/O operations.

So we have the communication from the MariaDB server to the User Module to the Performance Module and this in normal situation over a network. So it is obvious that this costs a lot of time for huge data sets.

This can be shown when we execute the CHECKSUM TABLE command which does similar things like SELECT * FROM table but does NOT return the full result set:

SQL> CHECKSUM TABLE foodmart_cs.sales_fact;
+------------------------+------------+
| Table                  | Checksum   |
+------------------------+------------+
| foodmart_cs.sales_fact | 2218293488 |
+------------------------+------------+
1 row in set (1.370 sec)

SQL> CHECKSUM TABLE foodmart_myisam.sales_fact;
+----------------------------+------------+
| Table                      | Checksum   |
+----------------------------+------------+
| foodmart_myisam.sales_fact | 2218293488 |
+----------------------------+------------+
1 row in set (0.853 sec)

It is interesting that Column Store is less slower here compared to the full table scan. It looks like it can already parallelize some of the work in this step already?

SELECT min(column), max(column) FROM table

Now let us come to more data warehouse (DWH) like queries:

SQL> SELECT MIN(time_id), MAX(time_id) FROM foodmart_cs.sales_fact;
1 row in set (0.104 sec)

SQL> SELECT MIN(time_id), MAX(time_id) FROM foodmart_myisam.sales_fact;
1 row in set (0.001 sec)

Also here MyISAM is horribly fast because it can short cut:

SQL> EXPLAIN SELECT MIN(time_id), MAX(time_id) FROM foodmart_myisam.sales_fact;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id   | SELECT_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|    1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+

It would be interesting to compare those queries with InnoDB which cannot do these short cuts.

SELECT column, COUNT(*) FROM table GROUP BY column

This is a query where MyISAM cannot shortcut and we see already an advantage for MariaDB Columns Store.

SQL> SELECT time_id, COUNT(*) FROM foodmart_cs.sales_fact GROUP BY time_id;
673 rows in set (0.099 sec)

SQL> SELECT time_id, COUNT(*) FROM foodmart_myisam.sales_fact GROUP BY time_id;
673 rows in set (0.138 sec)

It would be interesting to see those number with really huge amount of rows (109 to 1010 rows) which do not fit into caches any more...

SELECT column, COUNT(*) FROM table GROUP BY column ORDER BY column

The last result was pretty much chaotic. So let us test an ordered result:

SQL> SELECT time_id, COUNT(*) FROM foodmart_cs.sales_fact GROUP BY time_id ORDER BY time_id;
673 rows in set (0.100 sec)

SQL> SELECT time_id, COUNT(*) FROM foodmart_myisam.sales_fact GROUP BY time_id ORDER BY time_id;
673 rows in set (0.136 sec)

The ORDER BY is probably executed in the MariaDB Server. So for this small data set the network communication can be ignored. No difference observed.

SELECT COUNT(*) FROM table where column >= value

This SELECT queries data from a very big range (99%). We know MyISAM uses the index by doing an index-only-scan. MariaDB Column Store does NOT have indexes:

SQL> SELECT COUNT(*) FROM foodmart_cs.sales_fact WHERE time_id >= 400;
1 row in set (0.054 sec)

SQL> SELECT COUNT(*) FROM foodmart_myisam.sales_fact WHERE time_id >= 400;
1 row in set (0.159 sec)

SQL> EXPLAIN SELECT COUNT(*) FROM foodmart_myisam.sales_fact WHERE time_id >= 400;
+------+-------------+------------+-------+-----------------+-----------------+---------+------+---------+--------------------------+
| id   | select_type | table      | type  | possible_keys   | key             | key_len | ref  | rows    | Extra                    |
+------+-------------+------------+-------+-----------------+-----------------+---------+------+---------+--------------------------+
|    1 | SIMPLE      | sales_fact | range | i_sales_time_id | i_sales_time_id | 4       | NULL | 1055594 | Using where; Using index |
+------+-------------+------------+-------+-----------------+-----------------+---------+------+---------+--------------------------+

Column Store outperforms MyISAM by factor of 3 already with a small data set.

The same query with a very small range of data (1%):

SQL> SELECT COUNT(*) FROM foodmart_cs.sales_fact WHERE time_id >= 1090;
1 row in set (0.042 sec)

SQL> SELECT COUNT(*) FROM foodmart_myisam.sales_fact WHERE time_id >= 1090;
1 row in set (0.005 sec)

Column Store becomes only slightly faster with the smaller result set but here MyISAM key usage has a dramatic impact. Would be interesting to see the difference if the MyISAM data/key cannot be kept in memory any more.

SELECT SUM(column3) FROM table WHERE column1 = value AND column2 BETWEEN value AND value

SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_cs.sales_fact WHERE customer_id = 42 AND time_id BETWEEN 300 AND 1000;
1 row in set (0.072 sec)

SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_myisam.sales_fact WHERE customer_id = 42 AND time_id BETWEEN 300 AND 1000;
1 row in set (0.002 sec)

As soon a MyISAM has an index with high cardinality and a strong filter it outperforms Column Store.

Let us go a step back to the query with the big range (99%). But forcing MyISAM to do a table access instead of an index-only-scan:

SQL> SELECT SUM(store_sales) FROM foodmart_cs.sales_fact WHERE time_id >= 400;
1 row in set (0.117 sec)

SQL> SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact WHERE time_id >= 400;
1 row in set (0.133 sec)

Ohh! Here the MariaDB optimizer was clever enough to see that a full table scan is cheaper than accessing the index:

SQL> EXPLAIN SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact WHERE time_id >= 400;
+------+-------------+------------+------+-----------------+------+---------+------+---------+-------------+
| id   | select_type | table      | type | possible_keys   | key  | key_len | ref  | rows    | Extra       |
+------+-------------+------------+------+-----------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | sales_fact | ALL  | i_sales_time_id | NULL | NULL    | NULL | 1078880 | Using where |
+------+-------------+------------+------+-----------------+------+---------+------+---------+-------------+

So MyISAM is only slightly slower than Column Store. But if we force MyISAM to use the index it becomes dramatically slow (about 9 times):

SQL> SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact FORCE INDEX (i_sales_time_id) WHERE time_id >= 400;
1 row in set (1.040 sec)

So here again: MariaDB Column store starts making fun if a huge amount of data is used...

SELECT SUM(column3) FROM table WHERE column1 = value OR column2 = value

We know that WHERE clauses with OR are always bad for the optimizer. So let us try this:

SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_cs.sales_fact WHERE promotion_id = 0 OR store_id = 13;
1 row in set (0.209 sec)

MariaDB [(none)]> SELECT calGetTrace()\G
*************************** 1. row ***************************
calGetTrace(): 
Desc Mode Table      TableOID ReferencedColumns                              PIO LIO  PBE Elapsed Rows 
BPS  PM   sales_fact 4995     (promotion_id,store_id,store_sales,unit_sales) 0   3163 0   0.172   132  
TAS  UM   -          -        -                                              -   -    -   0.153   1    
TNS  UM   -          -        -                                              -   -    -   0.000   1    

Column Store has to touch 4 out of 8 column (50%) to get the result. How the Elapsed time is calculated I have to investigate some more...

SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_myisam.sales_fact WHERE promotion_id = 0 OR store_id = 13;
1 row in set (0.873 sec)

SQL> EXPLAIN SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_myisam.sales_fact WHERE promotion_id = 0 OR store_id = 13;
+------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+-----------------------------------------------------------------+
| id   | select_type | table      | type        | possible_keys                         | key                                   | key_len | ref  | rows   | Extra                                                           |
+------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+-----------------------------------------------------------------+
|    1 | SIMPLE      | sales_fact | index_merge | i_sales_promotion_id,i_sales_store_id | i_sales_promotion_id,i_sales_store_id | 4,4     | NULL | 706357 | Using union(i_sales_promotion_id,i_sales_store_id); Using where |
+------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+-----------------------------------------------------------------+

MyISAM tries to short cut by an INDEX MERGE operation which we know is not super fast but still better than a full table scan. But still MariaDB Column Store is about 4 times faster than MyISAM.

More complex SELECT queries from DWH benchmarks.

I borrowed these queries from Øystein's blog:

SELECT SUM(sf.store_sales) AS tot_yearly 
  FROM foodmart_cs.sales_fact AS sf
  JOIN foodmart_cs.store AS s ON s.store_id = sf.store_id
WHERE s.store_name = 'Store 13' 
  AND sf.unit_sales < 
      (SELECT 0.2 * AVG(isf.unit_sales)
         FROM foodmart_cs.sales_fact AS isf
        WHERE isf.store_id = s.store_id
      )
;
ERROR 1815 (HY000): Internal error: IDB-3012: Scalar filter and semi join are not from the same pair of tables.

SELECT SUM(sf.store_sales) AS tot_yearly 
  FROM foodmart_myisam.sales_fact AS sf
  JOIN foodmart_myisam.store AS s ON s.store_id = sf.store_id
WHERE s.store_name = 'Store 13' 
  AND sf.unit_sales < 
      (SELECT 0.2 * AVG(isf.unit_sales)
         FROM foodmart_myisam.sales_fact AS isf
        WHERE isf.store_id = s.store_id
      )
;
1 row in set (0.184 sec)

So some more complex queries seems not to work with Column Store yet. I have to figure out yet what to do in this case...

So let us try some more complex queries from Sergei:

SELECT SUM(store_sales)
  FROM foodmart_cs.sales_fact AS sf
  JOIN foodmart_cs.customer AS c ON c.customer_id = sf.customer_id
 WHERE c.total_children BETWEEN 1 AND 2
   AND sf.unit_sales BETWEEN 2 AND 4
;
1 row in set, 1 warning (0.210 sec)

We get a warning when using JOINs:

SQL> show warnings;
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                         |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 9999 | Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0; CacheI/O-2651; BlocksTouched-2651; PartitionBlocksEliminated-0; MsgBytesIn-41KB; MsgBytesOut-74KB; Mode-Distributed |
+-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Seems to be some statistics about how the query/join was executed...

SELECT SUM(store_sales)
  FROM foodmart_myisam.sales_fact AS sf
  JOIN foodmart_myisam.customer AS c ON c.customer_id = sf.customer_id
 WHERE c.total_children BETWEEN 1 AND 2
   AND sf.unit_sales BETWEEN 2 AND 4
;
1 row in set (0.412 sec)

With JOINs I was a bit more sceptic because of my former experience with NDB but it looks like also with JOINs Column Store outperforms MyISAM with already a small amount of data.

The table customer is a typical dimension table so this would probably be a candidate for hybrid approach (HTAP)?

Converting the customer table to MyISAM did NOT make it significantly slower:

1 row in set, 1 warning (0.215 sec)

+------+---------------+-------+------+---------------+------+---------+------+------+-------+
| id   | select_type   | table | type | possible_keys | key  | key_len | ref  | rows | Extra |
+------+---------------+-------+------+---------------+------+---------+------+------+-------+
|    1 | PUSHED SELECT | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | NULL  |
+------+---------------+-------+------+---------------+------+---------+------+------+-------+

Desc Mode Table TableOID ReferencedColumns                    PIO LIO  PBE Elapsed Rows 
CES  UM   -     -        -                                    -   -    -   0.013   4093 
BPS  PM   sf    4995     (customer_id,store_sales,unit_sales) 0   2636 0   0.182   132  
HJS  PM   sf-c  4995     -                                    -   -    -   -----   -    
TAS  UM   -     -        -                                    -   -    -   0.153   1    
TNS  UM   -     -        -                                    -   -    -   0.000   1    

Converting the foodmart DWH schema to MariaDB Columns Store

Converting the existing foodmart schema to MariaDB Column Store was a bit cumbersome...

We used the normal mysqldump and replace the storage engine:

$ zcat foodmart_dump.sql.gz | sed 's/ENGINE=InnoDB/ENGINE=ColumnStore/' | mysql --user=root foodmart_cs
ERROR 1178 (42000) at line 25: The storage engine for the table doesn't support The syntax or the data type(s) is not supported by Columnstore. Please check the Columnstore syntax guide for supported syntax or data types.

The command ALTER TABLE ... ENGINE = ColumnStore seems to be a bit more flexible but it is horribly slow...

The foodmart dump optimized for MariaDB Column Store can be found here.

Problems between MariaDB server and Column Store dictionary

SQL> RENAME TABLE customer TO customer_a;
Query OK, 0 rows affected (0.137 sec)

SQL> CREATE TABLE customer LIKE customer_a;
Query OK, 0 rows affected (0.018 sec)

SQL> ALTER TABLE customer ENGINE = ColumnStore;
ERROR 1815 (HY000): Internal error: CAL0009: Internal create table error for foodmart_cs.customer : table already exists (your schema is probably out-of-sync)

SQL> DROP TABLE customer;
Query OK, 0 rows affected (0.014 sec)

SQL> ALTER TABLE customer ENGINE = ColumnStore;
ERROR 1146 (42S02): Table 'foodmart_cs.customer' doesn't exist

SQL> CREATE TABLE customer LIKE customer_a;
Query OK, 0 rows affected (0.017 sec)

SQL> ALTER TABLE customer ENGINE = ColumnStore;
ERROR 1815 (HY000): Internal error: CAL0009: Internal create table error for foodmart_cs.customer : table already exists (your schema is probably out-of-sync)

To fix this the following DDL command will help:

SQL> CREATE TABLE customer (id INT) ENGINE = ColumnStore COMMENT='SCHEMA SYNC ONLY';
Query OK, 0 rows affected (0.006 sec)

See also: Recovering from error "Cannot create table: table already exists (your schema is probably out-of-sync)"

Conclusion

MariaDB Column Store can outperform MyISAM already with a single-node set-up and a small dataset if you are choosing the right queries. It would be interesting to see the performance gains with a multi-node set-up and a huge data set. I was told that MariaDB Column Store makes sense from 100 Gibyte upwards... Our data set was about 100 Mibyte!