Is column order important in mysql?

I read somewhere that column order in mysql is important. I believe they were referring to the indexed columns.
QUESTION: If column order is important, when and why is it important?
The reason I ask is because I have a table in mysql similar to the one below.
The primary index is on the left and I have an index on the far right. Is this bad?
It is a MyISAM table and will be used predominantly for selects (no inserts, deletes or updates).
------------------------------------------------
| Primary index | data1 | data2   | d3 | Index |
------------------------------------------------
| 1             | A     | cat     | 1  | A     |
| 2             | B     | toads   | 3  | A     |
| 3             | A     | yabby   | 7  | B     |
| 4             | B     | rabbits | 1  | B     |
------------------------------------------------

Column order is only important when defining indexes, as this affects whether an index is suitable for executing a query. (This is true of all RDBMSs, not just MySQL.)
For example, consider an index defined on columns MyIndex(a, b, c), in that order.
A query such as
select a from mytable
where c = somevalue
probably won't use that index to execute the query (this depends on several factors such as row count, column selectivity, etc.).
Whereas, it will most likely choose to use an index defined as MyIndex2(c, a, b).
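As a minimal sketch of the two definitions being compared (the table and column names are just the placeholders from the example above):
CREATE TABLE mytable (
    a INT,
    b INT,
    c INT
);
-- Leading column is a, so a lookup on c alone cannot seek into this index:
CREATE INDEX MyIndex ON mytable (a, b, c);
-- Leading column is c, so WHERE c = somevalue can seek straight into this one:
CREATE INDEX MyIndex2 ON mytable (c, a, b);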
Update: see use-the-index-luke.com (thanks Greg).


Mysql optimized query and index with exclusion
In the case of a SELECT on a high-volume table with select criteria that exclude results, what are the possible alternatives?
For example, with the following table:
+----+---+---+----+----+
| id | A | B | C  | D  |
+----+---+---+----+----+
| 1  | a | b | c  | d  |
| 2  | a | b | c  | d  |
| 3  | a | b | c1 | d1 |
| 4  | a | b | c2 | d  |
| 5  | a | b | c  | d2 |
| 6  | a | b | c  | d2 |
+----+---+---+----+----+
I would like to select all the tuples (C,D) where A=a and B=b and (C!=c or D!=d)
SELECT C,D FROM my_table WHERE A=a AND B=b AND (C!=c OR D!=d) GROUP BY C,D;
expected result:
(c1,d1)
(c2,d)
(c,d2)
I tried to add an index like this: CREATE INDEX idx_my_index ON my_table(A, B, C, D); but response times are still very long.
NB: I'm using MariaDB 10.3
The explain:
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
| 1 | SIMPLE | my_table | ref | idx_my_index | idx_my_index | 6 | const,const | 12055772 | Using where; Using index |
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
Is there some improvement I can make to my index or to the MariaDB config, or another SELECT that would do this?
Specific solution: If we use this query as a subquery, we can use the FirstMatch strategy to avoid the full scan of the table. This is described at https://mariadb.com/kb/en/firstmatch-strategy/
SELECT * FROM my__second_table tbis
WHERE (tbis.C, tbis.D)
IN (SELECT C,D FROM my_table WHERE A=a AND B=b AND (C!=c OR D!=d));
Your index is optimal. Discussion:
INDEX(A, B,   -- see Note 1
      C, D)   -- see Note 2
Note 1: A,B can be in either order. These will be used for filtering on = to find possible rows. Confirmed by "const,const".
Note 2: C,D can be in either order. != does not work well for filtering, hence these come after A and B. They are included here to make the index "covering". Confirmed by "Using index".
"response times are still very long" -- 12M rows in the table? How many rows before the GROUP BY? How many rows in the result? (These might give us clues into where to go next.)
"Alternative". Probably SELECT DISTINCT ... instead of SELECT ... GROUP BY ... would run at the same speed. (But you could try it. Also, the EXPLAIN might be the same; the result should be the same.)
Please provide SHOW CREATE TABLE; it might give some more clues, such as NULL/NOT NULL and Engine type. (I don't hold out much hope here.)
Please provide EXPLAIN FORMAT=JSON SELECT ... -- This might give some more insight. Also: Turn on the Optimizer Trace.
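For reference, the DISTINCT form mentioned above would be something like this (literal values quoted here as an assumption about the real query):
SELECT DISTINCT C, D
FROM my_table
WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd');
And for the optimizer trace, something along these lines should work (a sketch; the trace was added in MySQL 5.6 and MariaDB 10.4, so it may not be available on 10.3):
SET optimizer_trace = 'enabled=on';
SELECT C, D FROM my_table
WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd')
GROUP BY C, D;
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';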

Count rows referring to a particular row, multiple referencing tables in MySql?

My question is the following:
As asked in the question "How to count amount of rows referring to a particular row foreign key in MySql?", I want to count table references involving multiple tables referring to the table I'm interested in. However, here we want the specific number of references per row of the referenced table.
In addition, what about the variant where the tables do reference each other, but the foreign key constraint does not exist?
Let's set up some minimal examples:
We have three tables, here called A, B, and C. B and C refer to rows in A. I want to count the total number of references for each row in A.
Contents of the first table (A), with the expected query results in the 'Count' column:
+----+------------+-------+
| ID | Name       | Count |
+----+------------+-------+
| 1  | First row  | 0     |
| 2  | Second row | 5     |
| 3  | Third row  | 2     |
| 4  | Fourth row | 1     |
+----+------------+-------+
Contents of the second table (B):
+----+------+
| ID | A_ID |
+----+------+
| 1  | 2    |
| 2  | 2    |
| 3  | 2    |
+----+------+
Contents of the third table (C):
+----+------+
| ID | A_ID |
+----+------+
| 1  | 2    |
| 2  | 2    |
| 3  | 3    |
| 4  | 3    |
| 5  | 4    |
+----+------+
Important restrictions for a solution
The solution should work with n tables, for reasonable values of n. The example has n=2.
The solution should not involve a subset of the product set of all the tables. As some rows from A may be referenced many times in all the other tables, the size of the product set may well be stupidly large (e.g. 10*10*10*... becomes big quickly). That is, it may not be O(q^n), where n is the number of tables and q is the number of occurrences.
This is a partial solution, which I believe still suffers from performance problems related to condition [2]
I'm adding this as an answer as it may be useful for those working towards a better solution
Apply the following query. Extend as necessary with additional tables, adding additional lines to both the sum and the set of JOINs. This particular solution will work as long as you have less than about 90 tables. With more than that, you will have to run multiple queries like it and cache the results (for example by creating a column in the 'A' table), then sum all these later on.
SELECT
A.ID,
COUNT(DISTINCT B.ID) +
COUNT(DISTINCT C.ID) -- + .....
AS `Count`
FROM A
LEFT JOIN B ON A.ID = B.A_ID
LEFT JOIN C ON A.ID = C.A_ID
GROUP BY A.ID
Unfortunately, if you have often referenced rows, the query will have a massive intermediate result, run out of memory, and thus never complete.
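One way to sidestep that fan-out, sketched here under the question's schema (tables A, B, C with A_ID columns, ideally with an index on each A_ID), is to count each referencing table in its own correlated subquery so no product set is ever built:
SELECT
    A.ID,
    (SELECT COUNT(*) FROM B WHERE B.A_ID = A.ID)
  + (SELECT COUNT(*) FROM C WHERE C.A_ID = A.ID)  -- + ..... one subquery per referencing table
        AS `Count`
FROM A;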

Select two (or more) consecutive rows in a MySQL table

The question in short
What is an efficient, scalable way of selecting two (or more) rows from a table with consecutive IDs, especially if this table is joined with another table?
Related questions have been asked before on Stack Overflow, e.g.:
SQL check adjacent rows for sequence
How select where there are 2 consecutives rows with a specific value using MySQL?
The answers to these questions suggest a self-join. My working example described below uses that suggestion, but it performs very, very poorly on larger data sets. I've run out of ideas on how to improve it, and I'd really appreciate your input.
The issue in detail
Let's assume I were developing a database that keeps track of ball possession during a football/soccer match (please understand that I can't disclose the purpose of my real application). I require an efficient, scalable way that allows me to query changes of ball possession from one player to another (i.e. passes). For example, I might be interested in a list of all passes from any defender to any forward.
Mock database structure
My mock database consists of two tables. The first table, Players, stores the players' names in the Name column and their position (GOA, DEF, MID, FOR for goalie, defender, midfield, forward) in the POS column.
The second table, Possession, keeps track of ball possession. Whenever ball possession changes, i.e. the ball is passed to a new player, a row is added to this table. The primary key ID also indicates the temporal order of possession changes: consecutive IDs indicate an immediate sequence of ball possessions.
CREATE TABLE Players(
ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
POS VARCHAR(3) NOT NULL,
Name VARCHAR(7) NOT NULL);
CREATE TABLE Possession(
ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
PlayerID INT NOT NULL);
Next, we create some indices:
CREATE INDEX POS ON Players(POS);
CREATE INDEX Name ON Players(Name);
CREATE INDEX PlayerID ON Possession(PlayerID);
Now, we populate the Players table with a few players, and also add test entries to the Possession table:
INSERT INTO Players (POS, Name) VALUES
('DEF', 'James'), ('DEF', 'John'), ('DEF', 'Michael'),
('DEF', 'David'), ('MID', 'Charles'), ('MID', 'Thomas'),
('MID', 'Paul'), ('FOR', 'Bob'), ('GOA', 'Kenneth');
INSERT INTO Possession (PlayerID) VALUES
(1), (8), (2), (5), (3), (8), (3), (9), (6), (4), (7), (9);
Let's quickly check our database by joining the Possession and the Players table:
SELECT Possession.ID, PlayerID, POS, Name
FROM
Possession
INNER JOIN Players ON Possession.PlayerID = Players.ID
ORDER BY Possession.ID;
This looks good:
+----+----------+-----+---------+
| ID | PlayerID | POS | Name    |
+----+----------+-----+---------+
| 1  | 1        | DEF | James   |
| 2  | 8        | FOR | Bob     |
| 3  | 2        | DEF | John    |
| 4  | 5        | MID | Charles |
| 5  | 3        | DEF | Michael |
| 6  | 8        | FOR | Bob     |
| 7  | 3        | DEF | Michael |
| 8  | 9        | GOA | Kenneth |
| 9  | 6        | MID | Thomas  |
| 10 | 4        | DEF | David   |
| 11 | 7        | MID | Paul    |
| 12 | 9        | GOA | Kenneth |
+----+----------+-----+---------+
The table can be read like this: First, the DEFender James passed to the FORward Bob. Then, Bob passed to the DEFender John, who in turn passed to the MIDfield Charles. After some more passes, the ball ends with the GOAlkeeper Kenneth.
Working solution
I need a query that lists all passes from any defender to any forward. As we can see in the previous table, there are two instances of that: right at the start, James sends the ball to Bob (row ID 1 to ID 2), and later on, Michael sends the ball to Bob (row ID 5 to ID 6).
In order to do this in SQL, I create a self-join for the Possession table, with the second instance being offset by one row. In order to be able to access the players' names and positions, I also join the two Possession table instances to the Players table. The following query does that:
SELECT
M1.ID AS "From",
M2.ID AS "To",
P1.Name AS "Sender",
P2.Name AS "Receiver"
FROM
Possession AS M1
INNER JOIN Possession as M2 ON M2.ID = M1.ID + 1
INNER JOIN Players as P1 ON M1.PlayerId = P1.ID AND P1.POS = "DEF" -- see execution plan
INNER JOIN Players as P2 ON M2.PlayerId = P2.ID AND P2.POS = "FOR"
We get the expected output:
+------+----+---------+----------+
| From | To | Sender  | Receiver |
+------+----+---------+----------+
| 1    | 2  | James   | Bob      |
| 5    | 6  | Michael | Bob      |
+------+----+---------+----------+
The problem
While this query is executed virtually instantly in the mock football database, there appears to be a problem in the execution plan with this query. Here is the output of EXPLAIN for it:
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
| 1 | SIMPLE | P2 | ref | PRIMARY,POS | POS | 5 | const | 1 | Using index condition |
| 1 | SIMPLE | M2 | ref | PRIMARY,PlayerID | PlayerID | 4 | MOCK.P2.ID | 1 | Using index |
| 1 | SIMPLE | P1 | ALL | PRIMARY,POS | NULL | NULL | NULL | 9 | Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | M1 | ref | PlayerID | PlayerID | 4 | MOCK.P1.ID | 1 | Using where; Using index |
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
I have to admit that I'm not very good at interpreting query execution plans. But it seems to me that the third row indicates a bottleneck for the join marked in the query above: apparently, a full scan is done for the P1 alias table, no key seems to be used even though POS and the primary key are available, and the join buffer (flat, BNL join) part is also very suspicious. I don't know what any of that means, but I usually don't find this with normal joins.
Perhaps due to this bottleneck, the query does not finish within any acceptable time span for my real database. My real equivalent to the mock Players table has ~60,000 rows, and the Possession equivalent has ~1,160,000 rows. I monitored the execution of the query via SHOW PROCESSLIST. After more than 600 seconds, the process was still tagged as Sending data, at which point I killed the process.
The query plan on this larger data set is rather similar to the one for the small mock data set. The third join appears to be problematic with no key used, a full table scan being performed, and the join buffer part that I don't really understand:
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
| 1 | SIMPLE | P2 | ref | POS | POS | 1 | const | 1748 | Using index condition |
| 1 | SIMPLE | M2 | ref | PlayerId | PlayerId | 2 | REAL.P2.PlayerId | 7 | |
| 1 | SIMPLE | P1 | ALL | POS | NULL | NULL | NULL | 61917 | Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | M1 | ref | PlayerId | PlayerId | 2 | REAL.P1.PlayerId | 7 | Using where |
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
I tried forcing an index for the aliased table P1 by using Players AS P1 FORCE INDEX (POS) instead of Players AS P1 in the query shown above. This change does affect the execution plan. If I force POS to be used as the key, the third line in the output of EXPLAIN is very similar to the first line. The only difference is the number of rows, which is still very high (30912). Even this modified query did not complete after 600 seconds.
I don't think that this is a configuration issue. I have made up to 18 GB of RAM available to the MySQL server, and the server uses this memory for other queries. For the present query, memory consumption does not exceed 2 GB of RAM.
Back to the question
Thanks for sticking with this somewhat long-winded explanation up to this point!
Let's return to the initial question: What is an efficient, scalable way of selecting two (or more) rows from a table with consecutive IDs, especially if this table is joined with another table?
My current query certainly is doing something wrong, as it didn't finish even after ten minutes. Is there something that I can change in my current query to make it useful for my larger real data set? If not: is there an alternative, better solution that I could use?
I believe the issue is that you only have single-field indexes on the Players table. MySQL can use only a single index per joined table.
In the case of the Players table, two fields are key from a performance point of view:
playerid, since it is used in the join;
pos, since you filter on it.
You seem to have standalone indexes on both fields, but this forces MySQL to choose between using an index to join the two tables and using one to filter on the WHERE criteria.
I would create a multi-column index on the playerid, pos fields (in this order). This way MySQL can use a single index to satisfy both the join and the WHERE.
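Reading playerid as the Players table's ID column, the suggested composite index would look something like this against the question's schema (the index name is illustrative):
CREATE INDEX idx_player_id_pos ON Players (ID, POS);
With that in place, the P1/P2 lookups can seek on ID and check POS from the same index.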
I would also use an explicit JOIN instead of a comma-separated list of tables with the join condition in the WHERE clause, for better readability.
Here's a general plan:
SELECT
@n := @n + 1 AS N, -- Now the rows will be numbered 1,2,3,...
...
FROM ( SELECT @n := 0 ) AS init
JOIN tbl
ORDER BY ... -- based on your definition of 'consecutive'
Then you can use that query as a subquery somewhere else.
SELECT ...
FROM ( the above query ) AS x
GROUP BY ceiling(N/2) -- 1&2 will be grouped together; 3&4; etc
You can use IF((N % 2) = 1, ..., ...) to do different things with the first versus the second item in each pair.
You mentioned JOINing to another table. If possible, avoid doing the JOIN until this last SELECT.
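A hedged sketch of that plan applied to the question's Possession and Players tables (names other than those in the question are illustrative; note that CEILING(N/2) pairs rows 1&2, 3&4, etc., so overlapping pairs such as 2&3 would need a second pass with an offset):
SELECT
    CEILING(x.N / 2)                   AS PairNo,
    MAX(IF(x.N % 2 = 1, P.Name, NULL)) AS Sender,
    MAX(IF(x.N % 2 = 0, P.Name, NULL)) AS Receiver
FROM (
    SELECT @n := @n + 1 AS N,          -- rows numbered 1,2,3,... in possession order
           Possession.PlayerID
    FROM ( SELECT @n := 0 ) AS init
    JOIN Possession
    ORDER BY Possession.ID
) AS x
JOIN Players AS P ON P.ID = x.PlayerID -- the Players JOIN is deferred to this last SELECT
GROUP BY CEILING(x.N / 2);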

joining table in mysql not using index properly?

I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indices. For table a, (dte, permno) identifies a unique record; for table b, dte identifies a unique record; for table c, (yr, permno) identifies a unique record; and for table d, (dte, permno) identifies a unique record. The EXPLAIN from the SELECT part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| 1 | SIMPLE | d | ALL | idx1 | NULL | NULL | NULL | 264129 | |
| 1 | SIMPLE | c | ref | idx2 | idx2 | 4 | achernya.d.permno | 16 | |
| 1 | SIMPLE | b | ALL | PRIMARY,idx2 | NULL | NULL | NULL | 12336 | Using join buffer |
| 1 | SIMPLE | a | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7 | achernya.b.dte,achernya.d.permno | 1 | Using where |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does MySQL have to read so many rows to process this? And if I am reading this correctly, it has to read (264129 * 16 * 12336) rows, which should take a good month.
Could someone please explain what's going on here?
MySQL has to read the rows because you're using functions as your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, then put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
As for the other columns in your index that you don't apply functions to, they may not be used if the index won't provide much benefit, or they aren't the leftmost column in the index and you don't use the leftmost prefix of that index in your join condition.
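A hedged sketch of the "year in its own column" suggestion, using table d from the question as an example (the column and index names are illustrative):
-- Store the year once, index it, and join on the plain column so the index
-- can actually be used (no function wrapped around the indexed column):
ALTER TABLE d ADD COLUMN dte_year SMALLINT;
UPDATE d SET dte_year = YEAR(dte);
CREATE INDEX idx_d_permno_year ON d (permno, dte_year);
-- The join condition  year(a.dte) - 1 = year(d.dte)  can then be written as:
--   INNER JOIN d ON a.permno = d.permno AND year(a.dte) - 1 = d.dte_year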
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

Optimizing ENORMOUS MySQL View [duplicate]

Possible Duplicate:
Does MySQL view always do full table scan?
Running SELECT * FROM vAtom LIMIT 10 never returns (I aborted it after 48 hours).
explain select * from vAtom limit 10:
+----+-------------+---------------+--------+-------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+-------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------+-----------+---------------------------------+
| 1 | SIMPLE | A | ALL | primary_index,atom_site_i_3,atom_site_i_4 | NULL | NULL | NULL | 571294166 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ref | primary_index | primary_index | 12 | PDB.A.Structure_ID | 1 | Using index |
| 1 | SIMPLE | C | eq_ref | PRIMARY,chain_i_1,sid_type,detailed_type | PRIMARY | 24 | PDB.A.Structure_ID,PDB.A.auth_asym_id | 1 | Using where |
| 1 | SIMPLE | AT | eq_ref | primary_index | primary_index | 24 | PDB.A.Structure_ID,PDB.A.type_symbol | 1 | Using index |
| 1 | SIMPLE | entityResidue | ref | PRIMARY | PRIMARY | 52 | PDB.S.Structure_ID,PDB.A.label_entity_id,PDB.A.label_seq_id,PDB.A.label_comp_id,PDB.C.Chain_ID | 1 | Using where; Using index |
| 1 | SIMPLE | E | ref | primary_index | primary_index | 12 | PDB.AT.Structure_ID | 1 | Using where |
+----+-------------+---------------+--------+-------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------+-----------+---------------------------------+
6 rows in set (0.00 sec)
You don't have to tell me that 600M rows is a lot. What I want to know is why it's slow when I only want 10 rows, and what I can do from here.
I'll be glad to post SHOW CREATE for anything on request (I don't want to make this post 7 pages long).
Tables can have a built-in sort order, and this default kicks in on any query where you don't specify your own sorting. So your query is still trying to sort those 570+ million rows so it can find the first 10.
I'm not really surprised. Consider the case where you are simply joining 2 tables A and B and are limiting the result set; it may be that only the last N rows from table A have matches, in which case the database would have to go through all the rows in 'A' to get the N matching rows.
This would unavoidably be the case if there are lots of rows in 'B'.
You'd like to think that it would work the other way around when there are only a few rows in B - but obviously that's not the case here. Indeed, IIRC LIMIT has no influence on the generation of a query plan; even if it did, MySQL does not seem to cope with pushing predicates down into views.
If you provided details of the underlying tables, the number of rows in each and the view it should be possible to write a query referencing the tables directly which runs a lot more efficiently. Alternatively depending on how the view is used, you may be able to get the desired behaviour using hints.
It claims to be using a filesort. The view must have an ORDER BY or DISTINCT on an unindexed value, or the index is not specific enough to help.
To fix it, either change the view so that it does not need to sort, or change the underlying tables so that they have an index that will make the sort fast.
I think SHOW CREATE would be useful. It looks like you have a full table scan on vAtom. Maybe if you put an ORDER BY clause on an indexed field, it would perform better.