You are here

MariaDB indexing of NULL values

In the recent MariaDB DBA advanced training class the question came up if MariaDB can make use of an index when searching for NULL values... And to be honest I was not sure any more. So instead of reading boring documentation I did some little tests:

Search for NULL

First I started with a little test data set. Some of you might already know it:

CREATE TABLE null_test (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, data VARCHAR(32) DEFAULT NULL
, ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP() 
);

INSERT INTO null_test VALUES (NULL, 'Some data to show if null works', NULL);
INSERT INTO null_test SELECT NULL, 'Some data to show if null works', NULL FROM null_test;
... up to 1 Mio rows

Then I modified the data according to my needs to see if the MariaDB Optimizer can make use of the index:

-- Set 0.1% of the rows to NULL
UPDATE null_test SET data = NULL WHERE ID % 1000 = 0;

ALTER TABLE null_test ADD INDEX (data);

ANALYZE TABLE null_test;

and finally I run the test (MariaDB 10.3.11):

EXPLAIN EXTENDED
SELECT * FROM null_test WHERE data IS NULL;
+------+-------------+-----------+------+---------------+------+---------+-------+------+----------+-----------------------+
| id   | select_type | table     | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra                 |
+------+-------------+-----------+------+---------------+------+---------+-------+------+----------+-----------------------+
|    1 | SIMPLE      | null_test | ref  | data          | data | 35      | const | 1047 |   100.00 | Using index condition |
+------+-------------+-----------+------+---------------+------+---------+-------+------+----------+-----------------------+

We can clearly see that the MariaDB Optimizer considers and uses the index and its estimation of about 1047 rows is quite appropriate.

Unfortunately the optimizer chooses the completely wrong strategy (3 times slower) for the opposite query:

EXPLAIN EXTENDED
SELECT * FROM null_test WHERE data = 'Some data to show if null works';
+------+-------------+-----------+------+---------------+------+---------+-------+--------+----------+-----------------------+
| id   | select_type | table     | type | possible_keys | key  | key_len | ref   | rows   | filtered | Extra                 |
+------+-------------+-----------+------+---------------+------+---------+-------+--------+----------+-----------------------+
|    1 | SIMPLE      | null_test | ref  | data          | data | 35      | const | 522351 |   100.00 | Using index condition |
+------+-------------+-----------+------+---------------+------+---------+-------+--------+----------+-----------------------+

Search for NOT NULL

Now let us try to test the opposite problem:

CREATE TABLE anti_null_test (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, data VARCHAR(32) DEFAULT NULL
, ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

INSERT INTO anti_null_test VALUES (NULL, 'Some data to show if null works', NULL);
INSERT INTO anti_null_test SELECT NULL, 'Some data to show if null works', NULL FROM anti_null_test;
... up to 1 Mio rows

Then I modified the data as well but this time in the opposite direction:

-- Set 99.9% of the rows to NULL
UPDATE anti_null_test SET data = NULL WHERE ID % 1000 != 0;

ALTER TABLE anti_null_test ADD INDEX (data);

ANALYZE TABLE anti_null_test;

and then we have to test again the query:

EXPLAIN EXTENDED
SELECT * FROM anti_null_test WHERE data IS NOT NULL;
+------+-------------+----------------+-------+---------------+------+---------+------+------+----------+-----------------------+
| id   | select_type | table          | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                 |
+------+-------------+----------------+-------+---------------+------+---------+------+------+----------+-----------------------+
|    1 | SIMPLE      | anti_null_test | range | data          | data | 35      | NULL | 1047 |   100.00 | Using index condition |
+------+-------------+----------------+-------+---------------+------+---------+------+------+----------+-----------------------+

Also in this case the MariaDB Optimizer considers and uses the index and produces a quite fast Query Execution Plan.

Also in this case the optimizer behaves wrong for the opposite query:

EXPLAIN EXTENDED
SELECT * FROM anti_null_test WHERE data IS NULL;
+------+-------------+----------------+------+---------------+------+---------+-------+--------+----------+-----------------------+
| id   | select_type | table          | type | possible_keys | key  | key_len | ref   | rows   | filtered | Extra                 |
+------+-------------+----------------+------+---------------+------+---------+-------+--------+----------+-----------------------+
|    1 | SIMPLE      | anti_null_test | ref  | data          | data | 35      | const | 523506 |   100.00 | Using index condition |
+------+-------------+----------------+------+---------------+------+---------+-------+--------+----------+-----------------------+
Taxonomy upgrade extras: 

Comments

Hello,

Just wanted to say that the post misses the point. The problem is not NULL values as such. the problem is:

  • A) InnoDB returns very optimistic estimates for low-cardinality values and
  • B) cost model is not fully adequate.

Here is an example demonstrating the same effect with a non-NULL value: https://gist.github.com/spetrunia/2c6ea05ed6b0c1ffa14fa8eac6b72029 . We still get 400K rows as an estimate (instead if 990K) . The estimate is provided by InnoDB.

Let's try it with MyRocks storage engine: https://gist.github.com/spetrunia/5445ef362aa26207ccbf83c093d7135b . The optimizer here picks a full table scan for the query with "data = 'Some data to show if null works'", which is faster.

BR,
-- Sergei Petrunia

Shinguzcomment

Yes, I am aware that this is not a NULL issue. I hope my blog did not tell that this is be cause of NULL.

> Well, its title is "Indexing of NULL values", so if one takes that at face value, I'm afraid they will think it's about special treatment of NULL values :-).

The problem is, that MariaDB (and MySQL) chooses the wrong query execution plan for columns with low cardinality and chooses a 3 times slower plan (Index Lookup instead of FTS).

> Yes, agree.
(For me it is 2.2x slower. I would say that in general, if the optimizer is within 2x difference of the best query plan, things are ok. There are many factors affecting query execution speed (e.g. OS/engine cache population, correlation of conditions, etc) - it is not realistic to expect the optimizer to always pick the best plan. One should aim for query plans that are not more than N times slower than the best one, where the value of N is about 2 for simple queries and not more than 10 for complex queries).

If this is InnoDB related or not is not interesting for the customer. You are the Optimizer cracks and you deliver the Storage Engine and you have some InnoDB cracks on board! So fix it, please...

Or do you want to tell me, that InnoDB becomes obsolete in MariaDB and you will switch to MyRocks soon (devil)? .oO(ouh, ouh, what about all the read workload patterns???)

> No. It's just a proof that the issue was in the storage engine, not in the optimizer. We might need to talk to our InnoDB guys again about this. I will file an MDEV.

I still have to do other testing. According to theory even 400k rows should be cheaper with FTS instead of Index-Lookup but this I want to test before ranting around...

Shinguzcomment