You are here

Using NULL as default values

Taxonomy upgrade extras: 

Abstract:

It is common practice in MySQL table design that fields are declared as NOT NULL but some non-sense DEFAULT values are specified for unknown field contents. In this article we show why this behavior is non optimal an why you should better declare a field to allow NULL values and use NULL values instead of some dummy values.

What we can see often out in the field

Recently we had a discussion with a customer if it makes more sense to store a default value or NULL in InnoDB tables when we do not know the value of the field yet. About this 8 byte DATETIME field we were discussing:

CREATE TABLE test (
  ...
, dt DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00'
  ...

The customer mentioned, that he is mostly storing nothing in this field which in fact results in '0000-00-00 00:00:00' occupying 8 bytes.

This is a situation which happens very often in applications but is ignored or not properly designed. But the MySQL documentation clearly states what are the facts [1].

Showing the situation by example

To make the situation a bit more clear we created this little example:

CREATE TABLE t1 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, dt DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00'
) ENGINE = InnoDB;

Then we filled about 1 mio rows into this table t1:

INSERT INTO t1 VALUES (NULL, NOW());
INSERT INTO t1 SELECT NULL, NOW() FROM t1;
...
INSERT INTO t1 SELECT NULL, NOW() FROM t1;
Query OK, 524288 rows affected (9.25 sec)
Records: 524288  Duplicates: 0  Warnings: 0

Then we create a second table but this time with a "correct" declaration of the default value:

CREATE TABLE t2 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, dt DATETIME NULL DEFAULT NULL
) ENGINE = InnoDB;

Then we filled some rows into this table as well:

INSERT INTO t2 SELECT NULL, NULL FROM t1;

Verifying what comes out

Then we run:

SHOW TABLE STATUS;
+------+--------+---------+----------------+-------------+--------------+
| Name | Engine | Rows    | Avg_row_length | Data_length | Index_length |
+------+--------+---------+----------------+-------------+--------------+
| t1   | InnoDB | 1048576 |             33 |    35192832 |            0 |
| t2   | InnoDB | 1048576 |             25 |    26787840 |            0 |
+------+--------+---------+----------------+-------------+--------------+

The tables on disk occupy:

-rw-rw---- 1 mysql mysql 46137344 2011-07-19 20:27 t1.ibd
-rw-rw---- 1 mysql mysql 37748736 2011-07-19 20:28 t2.ibd

Now we can clearly see that we safe 8 bytes per row or 8 Mbyte in total just by using NULL in the correct way instead of some dummy default value.

What says the MySQL documentation about it

The MySQL documentation clearly states in Chapter Physical Row Structure:

The record header of InnoDB tables contains a bit vector for indicating NULL columns. Columns that are NULL do not occupy space other than the bit in this vector. SQL NULL value reserves one or two bytes in the record directory. Besides that, an SQL NULL value reserves zero bytes in the data part of the record if stored in a variable length column. [1]

Conclusion:

When you are defining tables you should clearly declare when a column can contain undefined values with NULL and you should use NULL instead of some dummy default values. This occupies less storage for the InnoDB table and thus makes the whole thing faster.

Comments

Hi, Using date values like '0000-00-00 00:00:00' is really terrible, but what about magic non-null values for indexing purposes? I've heard/read that Oracle indexes ignore null values. Example: "where bla is null" would not use index defined on bla. I wonder about MySQL. If MySQL indexing indeed ignores null values, and you need to track missing information in a big table, you may be better off using some magic non-null value (not elegant, I agree).
Anonymouscomment

Using NULL values is dangerous because it can lead to incorrect results (or at least unexpected results). Take the following table: CREATE TABLE t ( id int(11) NOT NULL DEFAULT 0, city varchar(10) DEFAULT NULL ); The following query is expected to return all the rows in the table: SELECT * FROM t WHERE city=city; That's true if the city is not NULL but not if the city is NULL: mysql> INSERT INTO t (id,city) VALUES (1,'XXX'); Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM t WHERE city=city; +----+------+ | id | city | +----+------+ | 1 | XXX | +----+------+ mysql> UPDATE t SET city=NULL WHERE id=1; mysql> SELECT * FROM t WHERE city=city; Empty set (0.00 sec) What's the problem? MySQL returns in both cases a result that is correct in mathematical logic but the introduction of NULLs changes the kind of logic that applies (3-valued logic instead of 2-valued logic). Unfortunately in the real world, you always think with the 2-valued logic. More information on this topic in C.J. Date's SQL and Relational Theory. So I try to avoid NULL values at all costs, even if it means wasting some space.
Anonymouscomment

Some people did not agree with my statements or had some questions... A search using col_name IS NULL employs indexes if col_name is indexed. [1]

Some experiments on a table with a few (= 4) NULL values:

EXPLAIN SELECT * FROM t2 WHERE dt IS NULL;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
|  1 | SIMPLE      | t2    | ref  | dt            | dt   | 9       | const |    4 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
EXPLAIN SELECT * FROM t2 WHERE dt IS NULL OR dt 

With 50% NULL values

EXPLAIN SELECT * FROM t2 WHERE dt IS NULL;
+----+-------------+-------+------+---------------+------+---------+-------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------+--------+-------------+
|  1 | SIMPLE      | t2    | ref  | dt            | dt   | 9       | const | 381416 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------+--------+-------------+
--> Here the Optimizer decides wrong. A full table scan would be faster.
EXPLAIN SELECT * FROM t2 WHERE dt BETWEEN '2011-07-22 15:55:00' AND '2011-07-22 15:56:00';
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t2    | range | dt            | dt   | 9       | NULL |   56 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
EXPLAIN SELECT * FROM t2 WHERE dt IS NULL OR dt 

--> Here the Optimizer decided correctly!

With 3.6% non NULL values (= 96.4% NULL values)

EXPLAIN SELECT * FROM t2 WHERE dt IS NULL;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
|  1 | SIMPLE      | t2    | ref  | dt            | dt   | 9       | const |    4 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
--> Optimizer is still wrong and should use a Full Table Scan.
EXPLAIN SELECT * FROM t2 WHERE dt BETWEEN '2011-07-22 15:55:00' AND '2011-07-22 15:56:00';
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t2    | range | dt            | dt   | 9       | NULL |    3 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
I admit, that indexed columns or even columns you use for joining are tricky and should be handled carefully. But all the other 80% of columns can easily be used with DEFAULT NULL if it fits into you business logic.
Shinguzcomment

Shinguz, thanks a lot for your detailed reply. I'd decided to avoid relying on indexing nulls even if MySQL supported it, as insurance against future database switch. But you took the time to test and even publish your results. Highly appreciated!
Anonymouscomment

I do generally agree with the "Conclusion" mentioned at the end of this entry. It will be a topic of different post to discuss and convince people to normalize data if you see too many NULLable columns in the database. @Stephane This is how nulls are treated by design. For e.g. you may claim that the count of * and count of city should be equal which is not in case of NULL. mysql> select count(*), count(city) from t; +----------+-------------+ | count(*) | count(city) | +----------+-------------+ | 1 | 0 | +----------+-------------+ 1 row in set (0.00 sec)
Anonymouscomment

If you're using NULL to represent "anytime in the past/future" for DATETIME, that means that you need "special handling" for what would otherwise be simple inequalities. For examples, I often store validity ranges with DATETIME; but that means I can no longer query if "valid_from > '2011-07-22 00:00:00'", since that will through out any ( valid_from = NULL ) records. That's not what I mean here -- if the valid_from is NULL, that means they are _always_ valid. So in these particular cases, I prefer to use '0' and '9999-12-31' as special values so that index usage is correct and WHERE clauses are simplified. But otherwise, yes, if NULL means "unknown", as opposed to "special", I entirely agree.
Anonymouscomment