SQL calculation on number of tuples involving a composite primary key - mysql

Consider the table: myTable(a,b,c,d) Where a and b make up the primary key.
Would the result of the following query:
SELECT distinct(b) FROM myTable;
be the same as:
SELECT * FROM myTable;
In other words, will the result set of the first query have the same number of tuples as myTable? I think not, because b can have non-unique values, whereas only the primary key (a, b) is unique.

No, since b is not a primary key for myTable. Consider the case
| a | b |
+---+---+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 1 | 2 |
In the first case you'll get 2 tuples (and only the column b), while in the second case you'll get 5 tuples and all the columns of the table.
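The counts in this answer can be checked with a short script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example is self-contained), and it fills columns c and d with placeholder zeros because the answer does not give values for them.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (a INT, b INT, c INT, d INT, PRIMARY KEY (a, b))"
)

# The (a, b) pairs from the answer; c and d are placeholder zeros.
rows = [(1, 1), (2, 1), (3, 1), (4, 1), (1, 2)]
conn.executemany("INSERT INTO myTable (a, b, c, d) VALUES (?, ?, 0, 0)", rows)

distinct_b = conn.execute("SELECT DISTINCT b FROM myTable").fetchall()
all_rows = conn.execute("SELECT * FROM myTable").fetchall()

# Number of tuples returned by each query.
print(len(distinct_b), len(all_rows))
```

Each (a, b) pair is unique, so the inserts succeed, yet b alone takes only two different values.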

Related

Why the primary key is not the clustered index if another non clustered index is added in MariaDB

Hello I have a table created by the following query MariaDB version 10.5.9
CREATE TABLE `test` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`status` varchar(60) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `test_status_IDX` (`status`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4
I always thought that the primary key is by default the clustered index which also defines the order of the rows in the table but here it seems that the index on the status is picked as the clustered. Why is this happening and how can I change it?
MariaDB [test]> select * from test;
+----+--------+
| id | status |
+----+--------+
| 2 | cfrc |
| 5 | hjr |
| 1 | or |
| 3 | test |
| 6 | verve |
| 4 | yes |
+----+--------+
6 rows in set (0.001 sec)
It is not safe to assume that the results of a SELECT will be ordered by any column, whatever the database engine. You should always use ORDER BY col [ASC|DESC] if you expect sorting to happen. I see records being displayed in the order they were added, but that can change after deletions, insertions etc., and should not be relied on.
(I am going to cite MySQL docs in my answer but in the context of this question, the information applies to MariaDB as well.)
First of all, let's talk about index extensions. The InnoDB engine automatically creates an additional (composite) index behind the scenes whenever you define a secondary index (i.e. any index that is not the clustered index). That is called an index extension.
This extra index contains the columns you defined in your original secondary index (in the same order) with the columns of the primary key added after them. So, in your example, InnoDB creates an index extension for test_status_IDX (let's call it X), with columns (status, id).
Now let's look at the query select * from test;. There is no WHERE clause here, so all the optimizer needs to do to satisfy this query is fetch all columns for all rows of the table. This boils down to fetching status & id since there are no other columns in the table. These exact fields happen to be stored within the extended index X. This makes index X a covering index for this query. A covering index is an index that, given a query, can fully produce the results of the query without having to read any actual data rows.
Therefore, the optimizer reads & returns the values needed for the result of the query from index X, in the order that they appear there, which is by status, hence the order you observed.
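To make the ordering concrete, here is a toy model in Python. It is not how InnoDB is implemented; it only assumes, as described above, that the clustered index is ordered by id and that the extended secondary index holds (status, id) pairs ordered by status and then id. Plain Python string ordering stands in for the column's collation.

```python
# Rows of the question's table as (id, status) pairs.
rows = [(1, "or"), (2, "cfrc"), (3, "test"), (4, "yes"), (5, "hjr"), (6, "verve")]

# Toy clustered index: whole rows ordered by the primary key.
clustered = sorted(rows)

# Toy extended secondary index: (status, id) entries ordered by status, then id.
secondary = sorted((status, id_) for id_, status in rows)

# Reading each structure front to back gives a different row order.
ids_via_clustered = [id_ for id_, _ in clustered]
ids_via_secondary = [id_ for _, id_ in secondary]

print(ids_via_clustered)
print(ids_via_secondary)
```

Reading the toy secondary index front to back yields the ids in the same order as the output shown in the question.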
To further demonstrate and extend (pun intended) this point, let's reproduce the example (tested with MariaDB 10.4):
1. First create the table & add the rows
CREATE TABLE foo (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
status varchar(60) DEFAULT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
INSERT INTO foo VALUES
(1, 'or'),
(2, 'cfrc'),
(3, 'test'),
(4, 'yes'),
(5, 'hjr'),
(6, 'verve');
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 1 | or |
| 2 | cfrc |
| 3 | test |
| 4 | yes |
| 5 | hjr |
| 6 | verve |
+----+--------+
2. Now let's add the secondary index and check the order again
CREATE INDEX secondary_idx ON foo (status);
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 2 | cfrc |
| 5 | hjr |
| 1 | or |
| 3 | test |
| 6 | verve |
| 4 | yes |
+----+--------+
As described above, the rows are returned in the order they appear in the (extended) secondary_idx
3. Now let's drop the index and re-add it with a prefix length of 2. For a varchar column the prefix length counts characters, so the index will store only the first two characters of each value rather than the full value. The extended index is therefore no longer a covering index, because it cannot fully produce the results of the query, and the clustered index will be used instead.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status(2));
SELECT * FROM foo;
+----+--------+
| id | status |
+----+--------+
| 1 | or |
| 2 | cfrc |
| 3 | test |
| 4 | yes |
| 5 | hjr |
| 6 | verve |
+----+--------+
4. Let's showcase this behaviour in another way. Here we will retain the original secondary index (without a prefix length) but we will add a 3rd column to the table. This will once again render the secondary index a non covering index (because it does not contain the 3rd column), therefore, the clustered index will be used here as well.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status);
ALTER TABLE foo ADD bar integer NOT NULL;
SELECT * FROM foo;
+----+--------+-----+
| id | status | bar |
+----+--------+-----+
| 1 | or | 0 |
| 2 | cfrc | 0 |
| 3 | test | 0 |
| 4 | yes | 0 |
| 5 | hjr | 0 |
| 6 | verve | 0 |
+----+--------+-----+
Adding bar to the index (or dropping it from the table) will again make the query use the secondary index.
ALTER TABLE foo DROP INDEX secondary_idx;
CREATE INDEX secondary_idx ON foo (status, bar);
SELECT * FROM foo;
+----+--------+-----+
| id | status | bar |
+----+--------+-----+
| 2 | cfrc | 0 |
| 5 | hjr | 0 |
| 1 | or | 0 |
| 3 | test | 0 |
| 6 | verve | 0 |
| 4 | yes | 0 |
+----+--------+-----+
You can also use EXPLAIN on all of the SELECT statements above to see which index is used at each stage.
@aprsa is right: I falsely assumed that the results would be in the same order as the clustered index, but in this case (using InnoDB) the status index is used to evaluate the query, which is why the output appears to be 'sorted' by status. If I select only the id, then the primary index is used and the results appear to be 'sorted' by id. In another engine this might not be true.
That particular table is composed of 2 BTrees:
The data, sorted by the PRIMARY KEY. Yes, it is clustered and is ordered 1,2,3,...
The secondary index, sorted by status. Each secondary index contains a copy of the PK so that it can reach into the other BTree to get the rest of the columns (not that there are any more!). That is, this BTree is equivalent to a 2-column table with PRIMARY KEY(status) plus an id.
Note how the output is in status order. I have to assume it decided to simply read the secondary index in its order to provide the results.
Yes, you must specify an ORDER BY if you want a particular ordering. You must not assume the details I just discussed. Who knows, tomorrow there may be something else going on, such as an in-memory "hash" that has the information scrambled in some other way!
(This Answer applies to both MySQL and MariaDB. However, MariaDB is already playing a game with hashing that MySQL has not yet picked up. Be warned! Or simply add an ORDER BY.)

Mysql select in not using index

I have a big table (10M rows) with 3 columns: x, y, status.
I have a primary key on (x, y).
A query like
SELECT * FROM table where (x,y) in (select 1234,5678)
takes approximately 5 seconds,
whereas the query SELECT * FROM table where (x,y) in (1234,5678) gives the same result in less than 0.01s.
I assume it's an issue with indexes. I've tried adding FORCE INDEX, but without success.
When I run EXPLAIN on both queries, the first one is not using any index:
EXPLAIN SELECT * FROM table where (x,y) in (select 1234,5678)
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+----------------+
| 1 | PRIMARY | table | NULL | ALL | NULL | NULL | NULL | NULL | 10794773 | 100.00 | Using where |
| 2 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
EXPLAIN SELECT * FROM table where (x,y) in (1234,5678)
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | table | NULL | const | PRIMARY | PRIMARY | 8 | const,const | 1 | 100.00 | NULL |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------+
Of course I'd like to use the first syntax because the real query is like UPDATE table set status=123 where (x,y) IN (SELECT x,y from table2 where ... );
I really don't understand this behaviour.
You do not need the select 1234,5678 subquery; use ... in ((1234,5678)) instead (please note the double parentheses around the values):
SELECT * FROM table where (x,y) in ((1234,5678))
If you check multiple fields with the in() operator against a list of constant values, then you need to enclose each set of values in its own parentheses:
SELECT * FROM table where (x,y) in ((1,1),(2,3),...(n,m))
The above syntax would enable MySQL to match the x,y fields against constant values, thus the query can utilise the multi-column index on x,y fields.
However, this may not be effective for the update query with a subquery. In this case, I would rewrite the update with a join instead of a subquery:
UPDATE table
INNER JOIN table2 on table.x=table2.x and table.y=table2.y
SET table.status=123
WHERE table2.fieldname=...
If x,y are indexed in both tables, then the joins should be fast. Moreover, if the table2 indexes are extended to cover the where criteria, then such a query can be really fast.

My mysql statement to query by primary key sometimes returns more than one row, so what happened?

My schema is this:
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_name` varchar(10) NOT NULL,
`account_type` varchar(10) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=latin1
INSERT INTO user VALUES (1, "zhangsan", "premiumv"), (2, "lisi", "premiumv"), (3, "wangwu", "p"), (4, "maliu", "p"), (5, "hengqi", "p"), (6, "shuba", "p");
I have the following 6 rows in the table:
+----+-----------+--------------+
| id | user_name | account_type |
+----+-----------+--------------+
| 1 | zhangsan | premiumv |
| 2 | lisi | premiumv |
| 3 | wangwu | p |
| 4 | maliu | p |
| 5 | hengqi | p |
| 6 | shuba | p |
+----+-----------+--------------+
Here is the query I use to select from the table by id:
SELECT * FROM user WHERE id = floor(rand()*6) + 1;
I expect it to return one row, but the actual result is unpredictable. It returns 0 rows, 1 row, or sometimes more than one row. Can somebody help clarify this? Thanks!
You're testing each row against a different random number, so sometimes multiple rows will match. To fix this, calculate the random number once in a subquery.
SELECT u.*
FROM user AS u
JOIN (SELECT floor(rand()*6) + 1 AS r) AS r
ON u.id = r.r
This method of selecting a random row from a table seems like a poor design. If there are any gaps in the id sequence (which can happen easily -- MySQL doesn't guarantee that they'll always be sequential, and deleting rows will leave gaps) then it could return an empty result. The usual way to select a random row from a table is with:
SELECT *
FROM user
ORDER BY RAND()
LIMIT 1
The WHERE part must be evaluated for each row to see if there is a match. Because of this, the rand() function is evaluated for every row. Getting an inconsistent number of rows seems reasonable.
If you add LIMIT 1 to your query, the probability of returning rows from the end diminishes.
It's because the WHERE expression floor(rand()*6) + 1 is evaluated against every row in the table to see whether that row matches. The value can be different each time it is compared against a row.
You can test with a table that has same values in the column used in WHERE clause, and you can see the result:
select * from test;
+------+------+
| id | name |
+------+------+
| 1 | a |
| 2 | b |
| 1 | c |
| 2 | d |
| 1 | e |
| 2 | f |
+------+------+
select * from test where id = floor(rand()*2) + 1;
+------+------+
| id | name |
+------+------+
| 1 | a |
| 2 | d |
| 1 | e |
+------+------+
In the above example, the expression floor(rand()*2) + 1 returns 1 when matched against the first row (with name = 'a'), so that row is included in the result set. But it returns 2 when matched against the fourth row (with name = 'd'), so that row is also included, even though its id differs from the id of the first row in the result set.
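The per-row evaluation described in these answers can be imitated outside the database. The following Python sketch is only a simulation under stated assumptions: rng.randrange(6) + 1 plays the role of floor(rand()*6) + 1, the list ids stands for the six rows of the user table, and the function names are made up for this illustration.

```python
import random

ids = [1, 2, 3, 4, 5, 6]  # the id values of the six rows


def per_row_match(ids, rng):
    # Mimics WHERE id = floor(rand()*6) + 1: a fresh number is drawn for every row.
    return [i for i in ids if i == rng.randrange(6) + 1]


def single_draw_match(ids, rng):
    # Mimics the subquery/join fix: the number is drawn once, then compared to all rows.
    r = rng.randrange(6) + 1
    return [i for i in ids if i == r]


rng = random.Random(0)  # seeded only to make runs repeatable
per_row_counts = [len(per_row_match(ids, rng)) for _ in range(1000)]
single_counts = [len(single_draw_match(ids, rng)) for _ in range(1000)]

# Distinct result-set sizes seen with each approach.
print(sorted(set(per_row_counts)), sorted(set(single_counts)))
```

With a single draw, the result always has exactly one row as long as the ids 1 to 6 all exist. With a draw per row, the size of the result varies from run to run.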

How MySQL implements the loose index scan

Recently I ran into a question: how does MySQL implement the loose index scan?
For example:
the test table structure is:
CREATE TABLE test (
id int(11) NOT NULL default '0',
v1 int(10) unsigned NOT NULL default '0',
v2 int(10) unsigned NOT NULL default '0',
v3 int(10) unsigned NOT NULL default '0',
PRIMARY KEY (id),
KEY v1_v2_v3 (v1,v2,v3)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
select * from test;
+----+----+-----+----+
| id | v1 | v2 | v3 |
+----+----+-----+----+
| 1 | 1 | 0 | 1 |
| 2 | 3 | 1 | 2 |
| 10 | 4 | 10 | 10 |
| 0 | 4 | 100 | 0 |
| 3 | 4 | 100 | 3 |
| 5 | 5 | 9 | 5 |
| 8 | 7 | 3 | 8 |
| 7 | 7 | 4 | 7 |
| 30 | 8 | 15 | 30 |
+----+----+-----+----+
Now let's see two sql:
first one:
mysql> explain select v1,v2 from test group by v1,v2;
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | NULL | v1_v2_v3 | 8 | NULL | 3 | Using index for group-by |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
I know that "Using index for group-by" means MySQL uses the loose index scan to execute the query. But why is the rows column in the EXPLAIN output 3? I wonder how MySQL can scan only three rows and still get the query result.
second one:
mysql> explain select max(v3) from test where v1>3 group by v1,v2;
+----+-------------+-------+-------+---------------+----------+---------+------+------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+---------------------------------------+
| 1 | SIMPLE | test | range | v1_v2_v3 | v1_v2_v3 | 8 | NULL | 1 | Using where; Using index for group-by |
+----+-------------+-------+-------+---------------+----------+---------+------+------+---------------------------------------+
1 row in set (0.00 sec)
mysql> explain select max(v2) from test where v1>3 group by v1,v2;
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | v1_v2_v3 | v1_v2_v3 | 4 | NULL | 4 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
The only difference between the two queries above is in the select list: one has max(v3), the other max(v2). But why does max(v3) use the loose index scan while max(v2) does not? I don't understand what the GROUP BY Optimization documentation says:
The only aggregate functions used in the select list (if any) are MIN() and MAX(), and all of them refer to the same column. The column must be in the index and must immediately follow the columns in the GROUP BY.
why the column must immediately follow the columns in the GROUP BY?
I have been searching the net for a long time, but to no avail. Please help or give me some ideas about how this works.
Thanks!
This is too long for a comment.
Essentially, when asking "why does the optimizer behave a certain way", the answer is because the designers implemented it that way. If you want to know "why", you would have to ask them . . . that is not an appropriate question for a general-purpose forum.
I want to point out a few things, though. If you think that the max(v2) behaviour is a bug, then you can report it at bugs.mysql.com. I don't think it is a bug, for two reasons:
The documentation explicitly states how the optimization works, and this query is not documented to use the index ("v2" does not follow the keys in the group by).
Even if it were documented differently, the use of an aggregation function on a group by key is, shall I say, non-sensical. It is valid SQL, but it is simply verbose and unnecessary. Such constructs are way down on the list of priorities for database implementors.
Finally, MySQL does not really use statistics (very well?) when creating the query plan. However, in most databases, validating a query plan on 9 rows (which fit on a single data page) often results in a query plan that does a full table scan and "inefficient" algorithms. As an example, an algorithm such as bubble sort is quite inefficient on large numbers of rows, but it can be the most efficient sorting algorithm on a (very) small number of rows.
Is there any reason to use max (v2) in the query? The result is the same even if you do not use the max () function. If you change the query to "select v2 from test where v1> 3 group by v1, v2 ", it will be done by loose index scan method.
And here are the reasons why the column must immediately follow the columns in the GROUP BY.
v1 v2 v3
1 1 1
1 1 2
1 1 10
1 2 1
1 2 2
1 2 8
In this case, select max(v3) from t1 group by v1, v2 can be performed with a loose index scan. Within each (v1, v2) group the index entries are ordered by v3, so only the last entry of each group has to be read, as shown below.
v1 v2 v3
1 1 1
1 1 2
1 1 10 ------------------> 10 return
1 2 1
1 2 2
1 2 8 ------------------> 8 return
However, if you perform select max (v3) from t1 group by v1, loose index scan is not possible. Because you have to access all the keys to find the maximum value(=10).
v1 v2 v3
1 1 1 ------------------> (x)
1 1 2 ------------------> (x)
1 1 10 ------------------> 10 return
1 2 1 ------------------> (x)
1 2 2 ------------------> (x)
1 2 8 ------------------> (x)
Note that you can use the following command to see how many records are accessed using loose index scan (or tight index scan).
flush status;
select max(v3) from t1 group by v1,v2; -- perform loose index scan
show session status like 'Handler_read_key%';
flush status;
select max(v3) from t1 group by v1; -- perform tight index scan
show session status like 'Handler_read_key%';
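The difference between the two access patterns can also be sketched in Python. This is a toy model, not MySQL's actual implementation: the index is assumed to be a sorted list of (v1, v2, v3) tuples, bisect stands in for an index seek, and the function names are invented for this illustration.

```python
from bisect import bisect_right

# Toy index on (v1, v2, v3), kept in key order like a BTree.
index = sorted([(1, 1, 1), (1, 1, 2), (1, 1, 10), (1, 2, 1), (1, 2, 2), (1, 2, 8)])


def loose_max_v3(index):
    """max(v3) GROUP BY v1, v2: one seek per group, reading only its last entry."""
    results, seeks, pos = [], 0, 0
    while pos < len(index):
        v1, v2, _ = index[pos]
        # Seek to just past the last key that starts with (v1, v2).
        end = bisect_right(index, (v1, v2, float("inf")))
        seeks += 1
        results.append(((v1, v2), index[end - 1][2]))
        pos = end  # jump straight to the next group
    return results, seeks


def tight_max_v3_by_v1(index):
    """max(v3) GROUP BY v1: v3 is not ordered within a v1 group, so read every entry."""
    best, reads = {}, 0
    for v1, _v2, v3 in index:
        reads += 1
        best[v1] = max(best.get(v1, v3), v3)
    return sorted(best.items()), reads


loose_result, loose_seeks = loose_max_v3(index)
tight_result, tight_reads = tight_max_v3_by_v1(index)
print(loose_result, loose_seeks)
print(tight_result, tight_reads)
```

In the toy model the loose scan touches one entry per (v1, v2) group, while grouping by v1 alone has to visit all six entries, because the maximum v3 can sit anywhere inside the v1 group.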

MySQL database relationship without an ID

Hi StackOverflow community,
I have these two tables:
tbl_users
ID_user (PRIMARY KEY)
Username (UNIQUE)
Password
...
tbl_posts
ID_post (PRIMARY KEY)
Owner (UNIQUE)
Description
...
Why does everybody always build database relationships with foreign keys on IDs? What if I want to relate Username with Owner instead of relating ID_user with ID_user in both tables?
Username is UNIQUE and the Owner is the username of the creator of the post.
Can it be done like that? There is something to correct or make better? Maybe I have a misconception.
I would appreciate detailed and understandable answers.
Thank you in advance.
The reason is primarily for data integrity. The argument concerning performance is a little misleading. While neither exhaustive, nor definitive, I hope this little example will shed some light on that fact:
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(i INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,s CHAR(12) NOT NULL UNIQUE
);
STEP1:
INSERT IGNORE INTO my_table (s)
SELECT CONCAT(CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
,CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
);
STEP2:
INSERT IGNORE INTO my_table (s)
SELECT CONCAT(CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
,CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
)
FROM my_table;
[REPEAT STEP 2 SEVERAL TIMES]
SELECT COUNT(*) FROM my_table;
+----------+
| COUNT(*) |
+----------+
| 16384 |
+----------+
1 row in set (0.01 sec)
SELECT * FROM my_table ORDER BY i LIMIT 12;
+----+------------+
| i | s |
+----+------------+
| 1 | kkxeehxsvy |
| 2 | iuyhrk{vaq |
| 3 | ngpedelooc |
| 4 | irkbyqgkhc |
| 6 | yqkcifcxdz |
| 7 | sgezlgvjjq |
| 8 | blavbvxbnl |
| 9 | wdbtqvgvgt |
| 13 | pakzpbnhxr |
| 14 | vpoy{gdwyd |
| 15 | ezlhz{drwg |
| 16 | ncwcwbpudh |
+----+------------+
SELECT * FROM my_table x JOIN my_table y ON y.i < x.i ORDER BY x.i,y.i LIMIT 1;
+---+------------+---+------------+
| i | s | i | s |
+---+------------+---+------------+
| 2 | iuyhrk{vaq | 1 | kkxeehxsvy |
+---+------------+---+------------+
1 row in set (1 min 22.60 sec)
SELECT * FROM my_table x JOIN my_table y ON y.s < x.s ORDER BY x.s,y.s LIMIT 1;
+-------+------------+------+------------+
| i | s | i | s |
+-------+------------+------+------------+
| 21452 | aabetdlvum | 6072 | aabdnegtav |
+-------+------------+------+------------+
1 row in set (1 min 13.59 sec)
So, we have two queries doing essentially the same thing (a comparison of 270 million values). The first joins the table to itself on an integer value. The second joins the table to itself on a string value. Both columns are indexed. As you can see, in this example, the string join actually performs better than the integer join - even though the hit on the CPU may actually be greater!
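As for the data integrity point, a foreign key does not have to reference the ID column; it can reference any column that is declared unique, such as Username. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the sample values 'alice' and 'bob' are hypothetical, and Owner is deliberately not declared UNIQUE so that one user could own several posts.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute(
    "CREATE TABLE tbl_users ("
    " ID_user INTEGER PRIMARY KEY,"
    " Username TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE tbl_posts ("
    " ID_post INTEGER PRIMARY KEY,"
    " Owner TEXT NOT NULL REFERENCES tbl_users (Username),"
    " Description TEXT)"
)

conn.execute("INSERT INTO tbl_users (Username) VALUES ('alice')")
conn.execute("INSERT INTO tbl_posts (Owner, Description) VALUES ('alice', 'first post')")

# A post whose Owner matches no Username is rejected by the constraint.
try:
    conn.execute("INSERT INTO tbl_posts (Owner, Description) VALUES ('bob', 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

post_count = conn.execute("SELECT COUNT(*) FROM tbl_posts").fetchone()[0]
print(orphan_rejected, post_count)
```

Without the REFERENCES clause, the second insert would succeed and leave a post pointing at a user that does not exist, which is the integrity problem the answer refers to.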