What's the order among NULL columns in MySQL?

In my rails 3 application, there is a model called Book,
Book(id: integer, link_index: integer, publish_status: integer, link_page: integer, created_at: datetime, updated_at: datetime)
link_index is allowed to be NULL and the other columns are not. When I query like this:
Book.where(link_page: 1).published.order('link_index DESC').limit(5).pluck(:id)
it returns [518, 331, 486, 488, 493].
but when I use map instead of pluck,
Book.where(link_page: 1).published.order('link_index DESC').limit(5).map(&:id)
it returns [518, 512, 516, 534, 566].
All we know is this: only the row where id = 518 has link_index = 4; all other rows' link_index IS NULL. So the result is right in that 518 is returned as the first element.
But between the two queries above, why is the order among the NULL rows different?
UPDATED:
Maybe it's not about map and pluck, because when I run the SQL directly in the mysql shell, I see the same issue:
SELECT id FROM `books` WHERE `books`.`link_page` = 1 AND `books`.`publish_status` = 4 ORDER BY link_index DESC LIMIT 5;
returns:
+-----+
| id  |
+-----+
| 518 |
| 331 |
| 486 |
| 488 |
| 493 |
+-----+
But
SELECT * FROM `books` WHERE `books`.`link_page` = 1 AND `books`.`publish_status` = 4 ORDER BY link_index DESC LIMIT 5;
returns:
+-----+------------+----------------+-----------+
| id  | link_index | publish_status | link_page |
+-----+------------+----------------+-----------+
| 518 |          4 |              4 |         1 |
| 512 |       NULL |              4 |         1 |
| 516 |       NULL |              4 |         1 |
| 534 |       NULL |              4 |         1 |
| 566 |       NULL |              4 |         1 |
+-----+------------+----------------+-----------+
WHY?

map and pluck are completely different: map runs at the collection (Ruby) level, whereas pluck runs at the database level.
http://guides.rubyonrails.org/active_record_querying.html#pluck

I suggest you check the MySQL EXPLAIN output of those two queries. The difference will be in the index used to retrieve the data, or in the use of a temporary table. The first query returns only the ID, so an index scan or index merge on suitable indexes can be used to fetch those IDs, and their order then follows the BTREE ordering of those indexes. For the second query the plan will be different, perhaps using a different set of indexes or using them in a different order, so it picks up rows in another order. And when many values are NULL there is no "right" order anyway: you did not define a second column to break ties on duplicate link_index, so MySQL is free to pick whatever it finds best (the least costly plan, and other factors hidden in there).
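One way to make the order deterministic in both cases is to name a tiebreaker column yourself, for example the primary key. A minimal sketch:

-- Hedged sketch: id acts as the tiebreaker, so the rows with NULL link_index
-- come back in a stable, defined order regardless of the plan MySQL picks.
SELECT id
FROM `books`
WHERE `books`.`link_page` = 1
  AND `books`.`publish_status` = 4
ORDER BY link_index DESC, id DESC
LIMIT 5;

In the Rails query this corresponds to .order('link_index DESC, id DESC').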

Related

SQL - get records from many-to-many relations by the user itself -OR- his group

I have two database tables, one as the main table and the other as the relation table.
The first table holds the content and the second table relates that content to users or groups.
Some data may also be modified in this second table.
I'm not sure about the structure and performance.
For example, we have user ID 160, which belongs to group ID 7.
So first, we have a post table:
id | title   | content   | cover    | status
------------------------------------------------
 1 | first   | content 1 | /img/... | 1
 2 | second  | content 2 | /img/... | 1
 3 | another | content 3 | /img/... | 1
 4 | four    | content 4 | /img/... | 1
 5 | five    | content 5 | /img/... | 1
and for the second we have a post_rel Table:
id | group_id | user_id | post_id | title   | cover   | sort | status
---------------------------------------------------------------------------
 1 | 7        | NULL    | 1       | g title | img/... | 1    | 1      *** selected for user_id
 2 | NULL     | 160     | 1       | u title | NULL    | 2    | 1      *** selected for user_id
 3 | 7        | NULL    | 2       | NULL    | img/... | 6    | 0
 4 | NULL     | 160     | 2       | NULL    | img/... | 4    | 1      *** selected for user_id
 5 | NULL     | 160     | 3       | some    | img/... | 3    | 1      *** selected for user_id
 6 | 7        | NULL    | 4       | NULL    | img/... | 9    | 1      *** selected for group_id
 7 | NULL     | 165     | 5       | NULL    | img/... | 5    | 0
This is the basic query we have.
select
    `post_rel`.`title` as `custom_title`,
    `post_rel`.`cover` as `custom_cover`,
    `post_rel`.`group_id`,
    `post_rel`.`user_id`,
    `post`.*
from
    `post`
    inner join `post_rel` on `post`.`id` = `post_rel`.`post_id`
where
    `post`.`status` = 1
    and `post_rel`.`status` = 1
    and (
        `post_rel`.`user_id` = 160
        or (
            `post_rel`.`group_id` = 7
            and `post_rel`.`post_id` not in (
                select
                    `post_rel`.`post_id`
                from
                    `post_rel`
                where
                    `post_rel`.`user_id` = 160
            )
        )
    )
order by
    `post_rel`.`sort` asc
So, what do you think about the basic query? Especially with the subquery, won't performance drop on a large table? Is it possible to write a better and simpler query, or to change the structure?
Edit: this is sqlfiddle example of my code and structure http://sqlfiddle.com/#!9/ed9d4b/1
I would change it to use "not exists" instead of "not in" and would use aliases so I could pull it off like so:
select
    b.`title` as `custom_title`,
    b.`cover` as `custom_cover`,
    b.`group_id`,
    b.`user_id`,
    a.*
from
    `post` a
    inner join `post_rel` b on a.`id` = b.`post_id`
where
    a.`status` = 1
    and b.`status` = 1
    and (
        b.`user_id` = 160
        or (
            b.`group_id` = 7
            and not exists (
                select
                    'x'
                from
                    `post_rel` c
                where
                    c.`user_id` = 160 and c.`post_id` = b.`post_id`
            )
        )
    )
order by
    b.`sort` asc
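One thing worth checking alongside the rewrite: the correlated not exists runs once per candidate row, so an index covering that lookup helps on a large table. A minimal sketch, assuming post_rel has no such index yet (the index name here is made up):

-- Hypothetical supporting index for the correlated NOT EXISTS lookup.
alter table `post_rel`
    add index `idx_post_rel_user_post` (`user_id`, `post_id`);

With that in place the subquery becomes an index lookup instead of a scan, which addresses the "large table" worry in the question.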
Typically, when managing users and groups, there's a notion of an exception user who can be assigned assets directly, just like a whole group can. This seems to be an example of that.
From a modeling-only perspective, there are 2 ways to deal with that:
Ensure that every user exists in a group and that you only assign assets to groups. For the exception user, create a group. You could even enforce that every user belongs to only one group. This way your post_rel table deals with only groups. Unfortunately, the relationship between group and user is not understood well enough to weigh in appropriately.
The other option, driven by the need to eliminate NULL values (which also makes for a cleaner model and reduces overhead), is to use name-value pairs: let the user and group IDs share the same field, with another field beside it denoting whether the ID is a group or a user.
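A minimal sketch of what that could look like (owner_type and owner_id are hypothetical names, not from the post above):

-- A redesigned post_rel: one row per post/owner pair; owner_type says whether
-- owner_id refers to a user or a group, so neither column ever needs to be NULL.
create table `post_rel` (
    `id`         int not null auto_increment,
    `owner_type` enum('user', 'group') not null,
    `owner_id`   int not null,
    `post_id`    int not null,
    `title`      varchar(255) null,
    `cover`      varchar(255) null,
    `sort`       int not null,
    `status`     tinyint not null,
    primary key (`id`),
    key `idx_owner` (`owner_type`, `owner_id`, `post_id`)
) engine=InnoDB;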
These are the SQL Fiddles:
NOT EXISTS version: http://sqlfiddle.com/#!9/1af8cf/2
NOT IN version: http://sqlfiddle.com/#!9/1af8cf/1
Some reading on NULLs: https://dev.mysql.com/doc/refman/5.6/en/data-size.html
Specifically:
Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column.

how to store a *sorted* SELECT result in another table?

In my projects I often need to store the result of a SELECT in another table (we call this a "resultset"). The reason is to dynamically display a large number of rows in a web application while loading only small chunks as necessary.
Typically, this is done by queries such as this one:
SET @counter := 0;
INSERT INTO resultsetdata
SELECT "12345", @counter:=@counter+1, a.ID
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar)
ORDER BY a.whatever DESC;
The fixed "12345" value is just a value to identify the "resultset" as a whole and changes for each query. The second column is a incrementing index counter that is meant to allow direct access to a specific row in the result and the ID column references the specific row in the source data table.
When the application needs a certain range of the result I just join resultsetdata with the source table to get the detailed data - which is quick as opposed to the resultsetdata query above which may take 2-3 seconds to complete (which explains why I need this intermediary table).
The SELECT query itself is not relevant for this question.
resultsetdata has the following structure:
CREATE TABLE `resultsetdata` (
`ID` int(11) NOT NULL,
`ContIdx` int(11) NOT NULL,
`Value` int(11) NOT NULL,
PRIMARY KEY (`ID`,`ContIdx`)
) ENGINE=InnoDB;
This usually works like a charm but lately we noticed that in some cases the ORDER of the result is not correct. This depends on the query itself (for example, adding DISTINCT is a typical cause), the server version and the data contained in the source tables, so I guess one can say that the row order is unpredictable with this method. Probably it depends on internal optimizations.
However, the problem is now that I can't think of any alternative solution that gives me the expected result.
Since the resultset can contain several thousand rows, loading all the data into memory and then manually INSERTing it is not feasible.
Any suggestions?
EDIT: For further clarification, have a look at these queries:
DROP TABLE IF EXISTS test;
CREATE TABLE test (ID INT NOT NULL, PRIMARY KEY(ID)) ENGINE=InnoDB;
INSERT INTO test (ID) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
SET @counter:=0;
SELECT "12345", @counter:=@counter+1, ID
FROM test
ORDER BY ID DESC;
This produces the following result as "expected":
+-------+----------------------+----+
| 12345 | @counter:=@counter+1 | ID |
+-------+----------------------+----+
| 12345 |                    1 | 10 |
| 12345 |                    2 |  9 |
| 12345 |                    3 |  8 |
| 12345 |                    4 |  7 |
| 12345 |                    5 |  6 |
| 12345 |                    6 |  5 |
| 12345 |                    7 |  4 |
| 12345 |                    8 |  3 |
| 12345 |                    9 |  2 |
| 12345 |                   10 |  1 |
+-------+----------------------+----+
10 rows in set (0.00 sec)
As said, in some cases (I can't provide a testcase here, sorry), this may lead to a result similar to this:
+-------+----------------------+----+
| 12345 | @counter:=@counter+1 | ID |
+-------+----------------------+----+
| 12345 |                   10 | 10 |
| 12345 |                    9 |  9 |
| 12345 |                    8 |  8 |
| 12345 |                    7 |  7 |
| 12345 |                    6 |  6 |
| 12345 |                    5 |  5 |
| 12345 |                    4 |  4 |
| 12345 |                    3 |  3 |
| 12345 |                    2 |  2 |
| 12345 |                    1 |  1 |
+-------+----------------------+----+
I'm not saying this is a MySQL bug and I fully understand that my method currently provides unpredictable results. Still, I don't know how to tweak this to get predictable results.
This is because the order in which records are sorted when they are inserted is unrelated to the order in which they come back when you retrieve them.
When you retrieve them a query plan will be created. If no ORDER BY is specified in your SELECT statement then the order will depend on the query plan produced. This is why it is unpredictable and adding DISTINCT can change the order.
The solution is to store enough data that you can retrieve them in the correct order using an ORDER BY clause. In your case you have ordered your data by a.whatever. Can a.whatever be stored in resultsetdata? If so then you can read the records out in the correct order.
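A minimal sketch of that idea, reusing the names from the question (SortKey and its VARCHAR type are placeholders and should match whatever a.whatever actually is):

-- Store the sort key next to each row reference...
ALTER TABLE resultsetdata
  ADD COLUMN SortKey VARCHAR(255) NOT NULL DEFAULT '';

SET @counter := 0;
INSERT INTO resultsetdata (ID, ContIdx, Value, SortKey)
SELECT 12345, @counter := @counter + 1, a.ID, a.whatever
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar);

-- ...and impose the order explicitly when reading a chunk back, instead of
-- trusting the order in which ContIdx happened to be assigned.
-- (The LIMIT/OFFSET values are just an example chunk.)
SELECT Value
FROM resultsetdata
WHERE ID = 12345
ORDER BY SortKey DESC, ContIdx
LIMIT 50 OFFSET 100;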
Maybe you could wrap the select into another select:
SET @counter := 0;
INSERT INTO resultsetdata
SELECT "12345", @counter := @counter + 1, tmp.ID
FROM (
    SELECT a.ID
    FROM sometable a
    JOIN bigtable b
    WHERE a.foo = b.bar
    ORDER BY a.whatever DESC
) AS tmp
... but you are still at the mercy of the dumbness of MySQL's optimizer.
That's all I found about this topic, but I couldn't find a hard guarantee:
Pure-SQL Technique for Auto-Numbering Rows in Result Set
http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/
http://www.xaprb.com/blog/2005/09/27/simulating-the-sql-row_number-function/
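For what it's worth, newer MySQL does offer the hard guarantee those articles lack: from 8.0 on, window functions tie the numbering to an explicit ORDER BY inside the query itself. A minimal sketch, not applicable to older servers:

-- MySQL 8.0+ only: ROW_NUMBER() is computed against the ORDER BY in its OVER
-- clause, so the numbering no longer depends on the plan or on user variables.
INSERT INTO resultsetdata (ID, ContIdx, Value)
SELECT 12345,
       ROW_NUMBER() OVER (ORDER BY a.whatever DESC),
       a.ID
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar);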

Three Queries Faster than One -- What's Wrong with my Joins?

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field  | Type     | Null | Key | Default | Extra          |
+--------+----------+------+-----+---------+----------------+
| ID     | int(11)  | NO   | PRI | NULL    | auto_increment |
| Status | longtext | YES  |     | NULL    |                |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO   | PRI | NULL    |       |  Foreign Key - Inventory
| TicketID    | int(11) | NO   | PRI | NULL    |       |  Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+-------------+------+-----+---------+----------------+
| Field        | Type        | Null | Key | Default | Extra          |
+--------------+-------------+------+-----+---------+----------------+
| ID           | int(11)     | NO   | PRI | NULL    | auto_increment |
| TStampString | varchar(32) | NO   | MUL | NULL    |                |
+--------------+-------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+------+------+---------------+-------+----------------------------+
|id| type |tbl| type   | psbl_keys   | key  | len  | ref           | rows  | Extra                      |
+--+------+---+--------+-------------+------+------+---------------+-------+----------------------------+
|1 | SMPL | t | ALL    | PRI         | NULL | NULL | NULL          | 35569 | where +temporary +filesort |
|1 | SMPL | j | ref    | PRI,FK1,FK2 | FK2  | 4    | t.ID          | 378   | index                      |
|1 | SMPL | i | eq_ref | PRI         | PRI  | 4    | j.InventoryID | 1     |                            |
+--+------+---+--------+-------------+------+------+---------------+-------+----------------------------+
What this implies to me is that for each row in Ticket, MySQL first does the joins and only later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1's file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2's file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row now has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join performs so suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.
What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
ALTER TABLE Ticket
    ADD INDEX status_idx (Status(16)) ;
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
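A minimal sketch of that second index (the name is arbitrary):

-- Lets lookups that start from a TicketID use an index instead of the (InventoryID, TicketID) PK order.
ALTER TABLE JoinTable
    ADD INDEX ticket_inventory_idx (TicketID, InventoryID) ;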
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
Try giving the first SUBSTRING clause an alias and using it in the GROUP BY:
SELECT SUBSTRING(i.TStampString, 1, 16) as blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
Also, you could avoid the join altogether since you don't need it:
SELECT distinct(SUBSTRING(i.TStampString, 1, 16)) from Inventory i where i.ID in
( select InventoryID from JoinTable j where j.TicketID in
(select ID from Ticket t where t.Status = 'Regroup'));
Would that work?
By the way, you do have an index on the Status field, don't you?

How to sample rows in MySQL using RAND(seed)?

I need to fetch a repeatable random set of rows from a table using MySQL. I implemented this using the MySQL RAND function with the bigint primary key of the row as the seed. Interestingly, this produces numbers that don't look random at all. Can anyone tell me what's going on here and how to get it to work properly?
select id from foo where rand(id) < 0.05 order by id desc limit 100
In one example, out of 600 rows not a single one was returned. When I changed the select to include "id, rand(id)" and got rid of the rand clause in the where, this is what I got:
| 163345 | 0.315191733944408 |
| 163343 | 0.814825518815616 |
| 163337 | 0.313726862253367 |
| 163334 | 0.563177533972242 |
| 163333 | 0.312994424545201 |
| 163329 | 0.312261986837035 |
| 163327 | 0.811895771708242 |
| 163322 | 0.560980224573035 |
| 163321 | 0.310797115145994 |
| 163319 | 0.810430896291911 |
| 163318 | 0.560247786864869 |
| 163317 | 0.310064677437828 |
Look how many 0.31xxx lines there are. Not at all random.
PS: I know this is slow, but in my app the where clause limits the number of rows to a few thousand.
To do that, use the same seed for all the rows, like:
select id from foo where rand(42) < 0.05 order by id desc limit 100
See the rand() docs for why it works that way. Change the seed if you want another set of values.
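If you also want the sample to be stable per row, so that a given id is always in or out of the sample regardless of how many rows the WHERE clause matches, a hedged alternative is to hash the id instead of using RAND at all:

-- CRC32(id) is a pure function of id, so the same ~5% of rows is selected on
-- every run; adjust the modulus and threshold for a different sample size.
select id from foo where MOD(CRC32(id), 100) < 5 order by id desc limit 100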
Multiply the decimal number returned by id:
select id from foo where rand() * id < 5 order by id desc limit 100

Big SQL SELECT performance difference when using <= against using < on a DATETIME column

Given the following table:
desc exchange_rates;
+------------------+----------------+------+-----+---------+----------------+
| Field            | Type           | Null | Key | Default | Extra          |
+------------------+----------------+------+-----+---------+----------------+
| id               | int(11)        | NO   | PRI | NULL    | auto_increment |
| time             | datetime       | NO   | MUL | NULL    |                |
| base_currency    | varchar(3)     | NO   | MUL | NULL    |                |
| counter_currency | varchar(3)     | NO   | MUL | NULL    |                |
| rate             | decimal(32,16) | NO   |     | NULL    |                |
+------------------+----------------+------+-----+---------+----------------+
I have added indexes on time, base_currency and counter_currency, as well as a composite index on (time, base_currency, counter_currency), but I'm seeing a big performance difference when I perform a SELECT using <= against using <.
The first SELECT is:
ExchangeRate Load (95.5ms)
SELECT * FROM `exchange_rates` WHERE (time <= '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
As you can see this is taking 95ms.
If I change the query such that I compare time using < rather than <= I see this:
ExchangeRate Load (0.8ms)
SELECT * FROM `exchange_rates` WHERE (time < '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
Now it takes less than 1 millisecond, which sounds right to me. Is there a rational explanation for this behaviour?
The output from EXPLAIN provides further details, but I'm not 100% sure how to interpret this:
-- Output from the first, slow, select
id: 1 | select_type: SIMPLE | table: exchange_rates | type: index_merge | possible_keys: index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency | key: index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency | key_len: 5,5 | ref: NULL | rows: 813 | Extra: Using intersect(index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency); Using where
-- Output from the second, fast, select
id: 1 | select_type: SIMPLE | table: exchange_rates | type: ref | possible_keys: index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency | key: index_exchange_rates_on_counter_currency | key_len: 5 | ref: const | rows: 4988 | Extra: Using where
(Note: I'm producing these queries through ActiveRecord (in a Rails app) but these are ultimately the queries which are being executed)
In the first case, MySQL tries to combine results from all indexes. It fetches all records from both indexes and joins them on the value of the row pointer (table offset in MyISAM, PRIMARY KEY in InnoDB).
In the second case, it just uses a single index, which, considering LIMIT 1, is the best decision.
You need to create a composite index on (base_currency, counter_currency, time) (in this order) for this query to work as fast as possible.
The engine will use the index for filtering on the leading columns (base_currency, counter_currency) and for ordering on the trailing column (time).
It also seems you want to add something like ORDER BY time DESC to your query to get the last exchange rate.
In general, any LIMIT without an ORDER BY should ring alarm bells.
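Putting both suggestions together, a minimal sketch (the index name is arbitrary; with this composite index in place the single-column indexes become largely redundant for this query):

-- Equality columns first, the range/order column last.
ALTER TABLE exchange_rates
  ADD INDEX currencies_and_time (base_currency, counter_currency, time);

-- "Latest rate at or before a point in time", with the LIMIT 1 now well-defined:
SELECT *
FROM exchange_rates
WHERE base_currency = 'GBP'
  AND counter_currency = 'USD'
  AND time <= '2009-12-30 14:42:02'
ORDER BY time DESC
LIMIT 1;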