Could someone tell me why the MySQL SUM() function gives strange results when performed on FLOAT columns?
Example:
CREATE TABLE payments (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
amount FLOAT DEFAULT NULL,
PRIMARY KEY(id)
);
INSERT INTO payments (amount) VALUES (1.3),(1.43),(1.65),(1.71);
When performing SUM() I expect 6.09, but MySQL returns this floating-point number:
mysql> SELECT SUM(amount) FROM payments WHERE 1;
+--------------------+
| SUM(amount)        |
+--------------------+
| 6.0899999141693115 |
+--------------------+
1 row in set (0.00 sec)
This is pretty scary for a guy who may develop, let's say... accounting software! :/
Version: MySQL 5.5.60
That's why you don't use floating-point values in anything where rounding is important. Even the manual says:
The DECIMAL and NUMERIC types store exact numeric data values. These types are used when it is important to preserve exact precision, for example with monetary data.
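For money, a DECIMAL column with an explicit scale avoids the problem entirely. A minimal sketch of the same example using DECIMAL (the precision and scale here are assumptions; pick what your data needs):
CREATE TABLE payments_decimal (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    amount DECIMAL(10,2) DEFAULT NULL, -- exact fixed-point storage
    PRIMARY KEY(id)
);
INSERT INTO payments_decimal (amount) VALUES (1.3),(1.43),(1.65),(1.71);
SELECT SUM(amount) FROM payments_decimal; -- returns exactly 6.09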
Related
This is more of an academic question. In a MySQL database, I know tinyint(M) and its differences from tinyint(1) have already been discussed on this site, but I'm wondering something else.
Could a column REVIEWED TINYINT(5) be used to store the values of 5 different boolean checkboxes in the frontend form? I'm thinking along the lines of treating it as an array. If so, how would they be referenced? Would something like REVIEWED(3) or REVIEWED[3] work for referencing their elements? Would there be a syntax for checking whether all the elements were 1 or 0 (or NULL)?
TINYINT(1) and TINYINT(5) and TINYINT(12) or any other length are actually stored exactly the same. They are all an 8-bit signed integer. They support integer values from -128 to 127. Or values from 0 to 255 if the column is defined as an unsigned integer.
What's with the "length" argument then? Nothing. It doesn't affect the size of the integer or the number of bits or the range of values. The argument is a display hint only. It's useless unless you use the ZEROFILL option.
mysql> create table mytable (i1 tinyint(1) zerofill, i2 tinyint(5) zerofill, i3 tinyint(12) zerofill);
Query OK, 0 rows affected (0.04 sec)
mysql> insert into mytable values (255,255,255);
Query OK, 1 row affected (0.02 sec)
mysql> select * from mytable;
+------+-------+--------------+
| i1   | i2    | i3           |
+------+-------+--------------+
|  255 | 00255 | 000000000255 |
+------+-------+--------------+
The ZEROFILL option forces the column to be unsigned, and when you query the column, it pads the result with zeroes up to the length you defined for the column. The zeroes are not stored in the database; they are added only when you fetch query results.
The "length" argument of integers is misleading, and it causes a lot of confusion for MySQL users. In hindsight, it would have been better to make the syntax like TINYINT ZEROFILL(12) but it's too late to change it now.
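As for the array idea in the question: there is no REVIEWED[3] syntax, because TINYINT(5) is still just one small integer. If you really wanted to pack five flags into a single TINYINT, you would have to do the bit packing yourself with MySQL's bitwise operators. A sketch (the reviewed column and the bit layout are hypothetical):
-- bit 0 = checkbox 1, bit 1 = checkbox 2, ..., bit 4 = checkbox 5
UPDATE t SET reviewed = reviewed | (1 << 2);  -- set checkbox 3
UPDATE t SET reviewed = reviewed & ~(1 << 2); -- clear checkbox 3
SELECT (reviewed >> 2) & 1 FROM t;            -- read checkbox 3
SELECT reviewed = 31 FROM t;                  -- all five set? (binary 11111)
In practice, five separate TINYINT(1) columns (or a SET column) are usually easier to query and index.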
I have read in the PostgreSQL docs that without an ORDER BY clause, SELECT will return records in an unspecified order.
Recently, in an interview, I was asked how to SELECT records in the order they were inserted, without a PK or a created_at or any other field that could be used for ordering. The senior dev who interviewed me was insistent that without an ORDER BY clause the records will be returned in the order they were inserted.
Is this true for PostgreSQL? Is it true for MySQL? Or any other RDBMS?
I can answer for MySQL. I don't know for PostgreSQL.
The default order is not the order of insertion, generally.
In the case of InnoDB, the default order depends on the order of the index read for the query. You can get this information from the EXPLAIN plan.
For MyISAM, rows are returned in the order they are read from the table file. This might be the order of insertion, but MyISAM will reuse gaps left by deleted records, so newer rows may be stored earlier.
None of this is guaranteed; it's just a side effect of the current implementation. MySQL could change the implementation in the next version, making the default order of result sets different, without violating any documented behavior.
So if you need the results in a specific order, you should use ORDER BY on your queries.
Following BK's answer, and by way of example...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table(id INT NOT NULL) ENGINE = MYISAM;
INSERT INTO my_table VALUES (1),(9),(5),(8),(7),(3),(2),(6);
DELETE FROM my_table WHERE id = 8;
INSERT INTO my_table VALUES (4),(8);
SELECT * FROM my_table;
+----+
| id |
+----+
|  1 |
|  9 |
|  5 |
|  4 | -- is this what
|  7 |
|  3 |
|  2 |
|  6 |
|  8 | -- we expect?
+----+
In the case of PostgreSQL, that is quite wrong.
If there are no deletes or updates, rows will be stored in the table in the order you insert them. And even though a sequential scan will usually return the rows in that order, that is not guaranteed: the synchronized sequential scan feature of PostgreSQL can have a sequential scan "piggy back" on an already executing one, so that rows are read starting somewhere in the middle of the table.
However, this ordering of the rows breaks down completely if you update or delete even a single row: the old version of the row will become obsolete, and (in the case of an UPDATE) the new version can end up somewhere entirely different in the table. The space for the old row version is eventually reclaimed by autovacuum and can be reused for a newly inserted row.
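Incidentally, the synchronized-scan behavior can be toggled with the synchronize_seqscans configuration parameter, which is handy if you want more reproducible (but still not guaranteed!) scan order in a test environment:
SHOW synchronize_seqscans;      -- 'on' by default
SET synchronize_seqscans = off; -- disable scan piggy-backing for this session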
Without an ORDER BY clause, the database is free to return rows in any order. There is no guarantee that rows will be returned in the order they were inserted.
With MySQL (InnoDB), we observe that rows are typically returned in the order of an index used in the execution plan, or of the table's cluster key.
It is not difficult to craft an example...
CREATE TABLE foo
( id INT NOT NULL
, val VARCHAR(10) NOT NULL DEFAULT ''
, UNIQUE KEY (id,val)
) ENGINE=InnoDB;
INSERT INTO foo (id, val) VALUES (7,'seven') ;
INSERT INTO foo (id, val) VALUES (4,'four') ;
SELECT id, val FROM foo ;
MySQL is free to return rows in any order, but in this case, we would typically observe that MySQL will access rows through the InnoDB cluster key.
id val
---- -----
4 four
7 seven
It's not at all clear what point the interviewer was trying to make. If the interviewer was trying to sell the idea that, given a requirement to return rows from a table in the order they were inserted, a query without an ORDER BY clause is ever the right solution, I'm not buying it.
We can craft examples where rows are returned in the order they were inserted, but that is a byproduct of the implementation, not guaranteed behavior, and we should never rely on that behavior to satisfy a specification.
I am interested in this issue. Every time I design a table, I have this doubt. Take the table posts as an example; it contains a column named post_type which could hold one of the following values:
post(varchar) or 1(tinyint)
page(varchar) or 2(tinyint)
revision(varchar) or 3(tinyint)
The problem is what type I should use for that column. varchar makes query results more intuitive; I don't need to figure out what 1/2/3 mean.
As for tinyint, does it perform better than varchar?
PS: I am using MySQL.
Data types don't have performance. They are a storage format.
Queries do have performance. So to evaluate performance, you should be specific about which query you are trying to measure.
In a query that merely fetches the row by its primary key, there's no practical difference. InnoDB keeps columns for a given row together on a page, so once it has fetched the page from disk into RAM, all the columns are available. The difference between reading 4 bytes for an integer vs. reading 8 bytes for a string like 'revision' is insignificant.
SELECT post_type FROM posts WHERE post_id = 8675309;
If you're looking up rows by their post_type value, then it becomes a little more important, because it needs to do some comparison to evaluate each row to see if it should be included in the result. Depending on the number of rows, and whether you have an index, the difference between string comparisons and integer comparisons could be important.
SELECT ... FROM posts WHERE post_type = 'revision';
I created a table and filled it with > 1 million rows:
create table posts (
post_id serial primary key,
post_type_utf varchar(10),
post_type_bin varbinary(10),
post_type_int int
);
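(The post doesn't show how the table was filled; one plausible way to reach 2^20 = 1,048,576 rows is repeated self-insertion, a sketch:)
INSERT INTO posts (post_type_utf, post_type_bin, post_type_int)
VALUES ('revision', 'revision', 1);
-- run the following 20 times to double the row count up to 2^20:
INSERT INTO posts (post_type_utf, post_type_bin, post_type_int)
SELECT post_type_utf, post_type_bin, post_type_int FROM posts;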
Then I timed how long it takes to search the whole table:
mysql> select count(*) from posts where post_type_utf = 'revision';
+----------+
| count(*) |
+----------+
|  1048576 |
+----------+
1 row in set (0.24 sec)
mysql> select count(*) from posts where post_type_bin = binary 'revision';
+----------+
| count(*) |
+----------+
|  1048576 |
+----------+
1 row in set (0.15 sec)
mysql> select count(*) from posts where post_type_int = 1;
+----------+
| count(*) |
+----------+
|  1048576 |
+----------+
1 row in set (0.15 sec)
The timings suggest that searching for an integer is about the same as searching for a binary string.
Why is a utf8 string slower? Because every string comparison has to evaluate character by character, against the collation defined for the column. A binary string comparison can just use memcmp() to compare the whole string in one operation.
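A quick way to see the collation at work: a utf8 comparison with a case-insensitive collation matches across case, while a binary comparison just compares bytes:
SELECT 'revision' = 'REVISION';        -- 1 under a _ci collation
SELECT BINARY 'revision' = 'REVISION'; -- 0: the bytes differ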
It's also important to consider that indexes are usually a greater factor for performance than which data type you choose. Indexes help because your query for a specific post_type value will only examine those rows that match.
But in this case, you only have a few distinct values for the post_type, so a search in an index is likely to match many rows regardless.
If you're going to use them as numbers, TINYINT(1) is definitely better, as MySQL won't need to do unnecessary conversions. For 1-character strings you could use CHAR(1) or ENUM.
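For completeness, an ENUM column stores each value internally as an integer while still reading back as a string. A sketch (the table definition is illustrative):
CREATE TABLE posts_enum (
    post_id SERIAL PRIMARY KEY,
    post_type ENUM('post','page','revision') NOT NULL -- stored internally as 1/2/3
);
INSERT INTO posts_enum (post_type) VALUES ('revision');
SELECT post_type, post_type+0 FROM posts_enum; -- 'revision', 3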
In MySQL 5.7, a table is defined as shown below:
CREATE TABLE `person` (
`person_id` bigint(20) NOT NULL AUTO_INCREMENT,
`name` varchar(64) DEFAULT NULL,
PRIMARY KEY (`person_id`),
KEY `ix_name` (`name`)
) ENGINE=InnoDB CHARSET=utf8;
And then we prepared two records for testing; the values of the name field (of varchar type) are
123456789123456789
1
respectively.
Case 1
select * from person where name = 123456789123456789-1;
Note that we are using a number instead of a string inside the WHERE clause. The record with name 123456789123456789 was returned, and it seemed that the -1 at the end was ignored!
Furthermore, we added another record with name = 123456789123456788, and this time the above SELECT returned two records, including both 123456789123456789 and 123456789123456788.
The output looks so strange!
Case 2
select * from person where name = 123456789123456789-123456789123456788;
We get the record with name 1, and in this case it seems that the - acts as a minus operator.
Why is the behavior of - so different in the two cases?
I can't immediately tell you what the type of 123456789123456789-1 is, but for the comparison operation, we're almost certainly falling through most of the more "normal" data type conversion rules for MySQL and ending up at:
In all other cases, the arguments are compared as floating-point (real) numbers.
Because one of the arguments to the comparison (name) is a string type and the other is numeric, nothing else matches. So both get converted to floats, and float types don't have many digits of precision; certainly fewer than the 18 required to represent 123456789123456789 and 123456789123456788 as two different numbers.
Look here:
SELECT person_id, name, name + 0.0, 123456789123456789-1 + 0.0, name = 123456789123456789-1
FROM person
ORDER BY person_id;
Perhaps, before evaluating name = 123456789123456789-1, MySQL converts both name and 123456789123456789-1 to DOUBLE, as shown in the SELECT above, so some digits are lost.
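You can see the precision loss without the table at all; in arithmetic, MySQL converts strings to DOUBLE, and the two 18-digit values collapse to the same double (a quick demonstration):
SELECT '123456789123456789' + 0.0, '123456789123456788' + 0.0; -- same value twice
SELECT '123456789123456789' = 123456789123456789-1;            -- 1: equal as doubles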
First, I will describe a simplified version of the problem domain.
There is a table strings:
CREATE TABLE strings (
value CHAR(3) COLLATE utf8_unicode_ci NOT NULL,
INDEX(value)
) ENGINE=InnoDB;
As you can see, it has a non-unique index on a CHAR(3) column.
The table is populated using the following script:
CREATE TABLE a_variants (
letter CHAR(1) COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MEMORY;
INSERT INTO a_variants VALUES -- 60 variants of letter 'A'
('A'),('a'),('À'),('Á'),('Â'),('Ã'),('Ä'),('Å'),('à'),('á'),('â'),('ã'),
('ä'),('å'),('Ā'),('ā'),('Ă'),('ă'),('Ą'),('ą'),('Ǎ'),('ǎ'),('Ǟ'),('ǟ'),
('Ǡ'),('ǡ'),('Ǻ'),('ǻ'),('Ȁ'),('ȁ'),('Ȃ'),('ȃ'),('Ȧ'),('ȧ'),('Ḁ'),('ḁ'),
('Ạ'),('ạ'),('Ả'),('ả'),('Ấ'),('ấ'),('Ầ'),('ầ'),('Ẩ'),('ẩ'),('Ẫ'),('ẫ'),
('Ậ'),('ậ'),('Ắ'),('ắ'),('Ằ'),('ằ'),('Ẳ'),('ẳ'),('Ẵ'),('ẵ'),('Ặ'),('ặ');
INSERT INTO strings
SELECT CONCAT(a.letter, b.letter, c.letter) -- 60^3 variants of string 'AAA'
FROM a_variants a, a_variants b, a_variants c
UNION ALL SELECT 'BBB'; -- one variant of string 'BBB'
So, it contains 216000 indistinguishable (in terms of the utf8_unicode_ci collation) variants of string "AAA" and one variant of string "BBB":
SELECT value, COUNT(*) FROM strings GROUP BY value;
+-------+----------+
| value | COUNT(*) |
+-------+----------+
| AAA   |   216000 |
| BBB   |        1 |
+-------+----------+
As value is indexed, I expect the following two queries to have similar performance:
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';
But in practice the first one is more than 300 times slower than the second! See:
+----------+------------+---------------------------------------------------------------+
| Query_ID | Duration   | Query                                                         |
+----------+------------+---------------------------------------------------------------+
|        1 | 0.11749275 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
|        2 | 0.00033325 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB' |
|        3 | 0.11718050 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
+----------+------------+---------------------------------------------------------------+
-- I ran the 'AAA' query twice here just to be sure.
If I change the size of the indexed column or change its type to VARCHAR, the performance problem still manifests itself. Meanwhile, in analogous situations where the non-unique index is not on a CHAR/VARCHAR column (e.g. INT), queries are as fast as expected.
So, the question is: why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?
I have a strong feeling that MySQL performs a full linear scan of all the values matched by the index key. But why does it do so when it could just return the count of the matched rows? Am I missing something, and is that scan really needed? Or is it a sad shortcoming of the MySQL optimizer?
Clearly, the issue is that the query is doing an index scan. The alternative approach would be to do two index lookups, for the first and last values that are the same, and then use meta information in the index for the calculation. Based on your observations, MySQL does both.
The rest of this answer is speculation.
The reason the performance is "only" 300 times slower, rather than 200,000 times slower, is because of overhead in reading the index. Actually scanning the entries is quite fast compared to other operations that are needed.
There is a fundamental difference between numbers and strings when it comes to comparisons. The engine can just look at the bit representations of two numbers and recognize whether they are the same or different. Unfortunately, for strings, you need to take encoding/collation into account. I think that is why it needs to look at the values.
It is possible that if you had 216,000 copies of exactly the same string, then MySQL would be able to do the count using metadata in the index. In other words, the indexer is smart enough to use metadata for exact equality comparisons. But, it is not smart enough to take encoding into account.
One of the things you may want to check is the logical I/O of each query. I'm sure you'll see quite a difference. To count the number of 'BBB's in the table, probably only 3 or 4 LIOs are needed (depending on things like bucket size). To count the number of 'AAA's, essentially the entire table must be scanned, index or not. With 216k rows, that can add up to significantly more LIOs, not to mention physical I/Os. Logical I/Os are faster than physical I/Os, but any I/O is a performance killer.
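One rough way to observe this from the client is MySQL's session handler counters, which show how many index entries each query touched (a sketch):
FLUSH STATUS;
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SHOW SESSION STATUS LIKE 'Handler_read%'; -- expect a very large Handler_read_next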
As for text vs numbers, it is always easier and faster for software (any software, not just database engines) to compare numbers than text.
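You can get a crude feel for that with MySQL's BENCHMARK() function, though loop overhead dominates for trivial expressions, so treat the numbers as rough (the repeat count is arbitrary):
SELECT BENCHMARK(50000000, 1234567 = 1234568);        -- integer comparison
SELECT BENCHMARK(50000000, 'revision' = 'REVISIONx'); -- collation-aware string comparison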