Let's suppose that I have these tables:
[ properties ]
id (INT, PK)
name (VARCHAR)
[ properties_prices ]
id (INT, PK)
property_id (INT, FK)
date_begin (DATE)
date_end (DATE)
price_per_day (DECIMAL)
price_per_week (DECIMAL)
price_per_month (DECIMAL)
And my visitor runs a search like: list the first 10 properties (pagination) where the price per day (the price_per_day field) is between 10 and 100 for the period from 1 May until 31 December.
I know that's a huge query, and I need to paginate the results, so I must do all the calculation and logic in only one query... that's why I'm here! :)
Questions about the problem
If there are gaps, would that be an acceptable property?
There are no gaps. All the possible dates are in the database.
If the price is between 10 and 100 in some sub-periods, but not in others, do you want to get that property?
In a perfect world, no... We'll need to calculate the "sum" of that type of price over that period, considering all the variations/periods.
Also, what are the "first 10"? How are they ordered? Lowest price first? But there could be more than one price.
This is just an example of pagination with 10 results per page... It can be ordered by the FULLTEXT search that I'll add with keywords and the like... As I said, it's a pretty big query.
This is similar to the answer given by #mdma, but I use a condition in the join clause for the price range, instead of the HAVING trick.
SELECT p.id, MAX(p.name) AS name,
MIN(v.price_per_day) AS price_low,
MAX(v.price_per_day) AS price_high
FROM properties p
JOIN properties_prices v ON p.id = v.property_id
AND v.price_per_day BETWEEN 10 AND 100
AND v.date_begin <= '2010-12-31' AND v.date_end >= '2010-05-01'
GROUP BY p.id
ORDER BY ...
LIMIT 10;
I would also recommend creating a covering index:
CREATE INDEX prices_covering ON properties_prices
(property_id, price_per_day, date_begin, date_end);
This allows your query to run as efficiently as possible, because it can read the values directly from the index; it won't have to read the rows of data from the table at all.
+----+-------------+-------+-------+-----------------+-----------------+---------+-----------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------+-----------------+---------+-----------+------+--------------------------+
| 1 | SIMPLE | p | index | PRIMARY | PRIMARY | 4 | NULL | 1 | |
| 1 | SIMPLE | v | ref | prices_covering | prices_covering | 4 | test.p.id | 6 | Using where; Using index |
+----+-------------+-------+-------+-----------------+-----------------+---------+-----------+------+--------------------------+
What you tell us is not precise enough. From your data structure and your question I assume:
the price of a property can change in that period, and there would be a properties_prices entry for each sub-period
there should be no overlaps in the sub-periods, but the data structure does not guarantee that (a quick check is sketched after this list)
there can be gaps in the sub-periods
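Since the schema cannot enforce non-overlap, a quick sanity check might look like this sketch (any rows returned are pairs of overlapping sub-periods for the same property):
SELECT a.id, b.id
FROM properties_prices AS a
JOIN properties_prices AS b
  ON a.property_id = b.property_id
 AND a.id < b.id
 AND a.date_begin <= b.date_end
 AND a.date_end >= b.date_begin;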
But there are still questions:
If there are gaps, would that be an acceptable property?
If the price is between 10 and 100 in some sub-periods, but not in others, do you want to get that property?
Also, what are the "first 10"? How are they ordered? Lowest price first? But there could be more than one price.
Depending on the answers, there might be no single query doing the trick. But if you accept the gaps, the following could return what you want:
SELECT *
FROM properties AS p
WHERE EXISTS -- property is available in the price range
(SELECT * FROM properties_prices AS pp1
WHERE p.id = pp1.property_id AND
pp1.price_per_day between 10 and 100 AND
(pp1.date_begin <= "2010-12-31" AND pp1.date_end >= "2010-05-01")) AND
NOT EXISTS -- property is in the price range in all sub-periods, but there might be gaps
(SELECT * FROM properties_prices AS pp2
WHERE p.id = pp2.property_id AND
pp2.price_per_day not between 10 and 100 AND
(pp2.date_begin <= "2010-12-31" AND pp2.date_end >= "2010-05-01"))
ORDER BY name -- ???
LIMIT 10
That query doesn't give you the prices or other details. That would need to go in an extra query. But perhaps my assumptions are all wrong anyway.
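For completeness, that extra lookup could be as simple as this sketch (the ids in the IN list are placeholders for the ids returned by the main query):
SELECT property_id, date_begin, date_end, price_per_day
FROM properties_prices
WHERE property_id IN (1, 2, 3)
  AND date_begin <= "2010-12-31" AND date_end >= "2010-05-01"
ORDER BY property_id, date_begin;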
This can also be done as a GROUP BY, which I think will be quite efficient, and we get some aggregates as part of the package:
SELECT
property_id, MIN(price_per_day), MAX(price_per_day)
FROM
properties_prices
WHERE
date_begin <= "2010-12-31" AND date_end >= "2010-05-01"
GROUP BY
property_id
HAVING MIN(IF( (price_per_day BETWEEN 10 AND 100), 1, 0))=1
ORDER BY ...
LIMIT 10
(I don't have MySQL at hand so I haven't tested. I was unsure about the MIN(IF ...), but a mock-up using CASE worked on SQL Server.)
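For reference, the CASE-based variant mentioned above might look like this sketch (untested on MySQL, like the original):
SELECT property_id, MIN(price_per_day), MAX(price_per_day)
FROM properties_prices
WHERE date_begin <= "2010-12-31" AND date_end >= "2010-05-01"
GROUP BY property_id
HAVING MIN(CASE WHEN price_per_day BETWEEN 10 AND 100 THEN 1 ELSE 0 END) = 1
ORDER BY ...
LIMIT 10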
Related
I have designed an event where you register multiple fish, and I wanted a query to extract the top 3 heaviest fish registered by different people. In case of a tie, it should be decided by a third parameter: who registered it first. I've tested several approaches I found here on Stack Overflow, but none of them worked the way I needed.
My schema is the following:
id | playerid | playername | itemid | weight | date | received | isCurrent
Where:
id = PK, AUTO_INCREMENT - it's basically an index
playerid = the unique code of the person who registered the fish
playername = name of the person who registered the fish
itemid = the code of the fish
weight = the weight of the fish
date = pre-defined as CURRENT_TIMESTAMP, the exact time the fish was registered
received = pre-defined as 0; it really doesn't matter for this analysis
isCurrent = pre-defined as 1; basically, every time this event runs it updates this field to 0, meaning those registrations no longer belong to the current version of the event.
Here you can see the data I'm testing with
My problem is: how do I avoid counting the same playerid more than once in this rank?
Query 1:
SELECT `playerid`, `playername`, `itemid`, `weight`
FROM `event_fishing`
WHERE `isCurrent` = 1 AND `weight` IN (
SELECT * FROM
(SELECT MAX(`weight`) as `fishWeight`
FROM `event_fishing`
WHERE `isCurrent` = 1
GROUP BY `playerid`
LIMIT 3) as t)
ORDER BY `weight` DESC, `date` ASC
LIMIT 3
Query 2:
SELECT * FROM `event_fishing`
INNER JOIN
(SELECT playerid, MAX(`weight`) as `fishWeight`
FROM `event_fishing`
WHERE `isCurrent` = 1
GROUP BY `playerid`
LIMIT 3) as t
ON t.playerid = `event_fishing`.playerid AND t.fishWeight = `event_fishing`.weight
WHERE `isCurrent` = 1
ORDER BY weight DESC, date ASC
LIMIT 3
Keep in mind that I must return at least the fields playerid, playername, itemid, and weight; the version of the event must be the current one (isCurrent = 1); and there must be one playerid per line, with the heaviest weight that player registered for this version of the event and the date it was registered.
Expected output for the data I've sent:
id |playerid|playername|itemid|weight| date |received| isCurrent
7 | 3734 |Mago Xxx | 7963 | 1850 | 2018-07-26 00:17:41 | 0 | 1
14 | 228 |Night Wolf| 7963 | 1750 | 2018-07-26 19:45:49 | 0 | 1
8 | 3646 |Test Spell| 7159 | 1690 | 2018-07-26 01:16:51 | 0 | 1
Output I'm getting (with both queries):
playerid|playername|itemid|weight
3734 |Mago Xxx | 7963 | 1850
228 |Night Wolf| 7963 | 1750
228 |Night Wolf| 7963 | 1750
Thank you for the attention.
EDIT: I've followed How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? since my query is very similar to the accepted answer. In the comments I found something that at first glance seemed to solve my problem, but I've found a case where the accepted answer fails. Check http://sqlfiddle.com/#!9/72aeef/1
If you take a look at the data you'll notice that id 14 was the first input of 1750 and therefore should be second place, but MAX(id) returns the last input of the same playerid and therefore gives us a wrong result.
Although the problems seem alike, mine has greater complexity, and therefore the suggested queries don't work.
EDIT 2:
I've managed to solve my problem with the following query:
http://sqlfiddle.com/#!9/d711c7/6
But I'll leave this question open because of two things:
1- I don't know if there's a case where this query might fail
2- Although the first query is already heavily limited, I still think this can be optimized further, so I'll leave the question open to anyone who might know a better way to solve the issue (one alternative shape is sketched below).
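For reference, one shape that avoids the MAX(id) pitfall is to break weight ties by the earliest date explicitly. A sketch (untested against the fiddle data; it assumes a player never registers the same weight twice with an identical timestamp):
SELECT f.*
FROM event_fishing f
JOIN (SELECT playerid, MAX(weight) AS max_weight
      FROM event_fishing
      WHERE isCurrent = 1
      GROUP BY playerid) best
  ON best.playerid = f.playerid AND best.max_weight = f.weight
WHERE f.isCurrent = 1
  AND f.date = (SELECT MIN(f2.date) -- earliest registration of that weight
                FROM event_fishing f2
                WHERE f2.playerid = f.playerid
                  AND f2.weight = f.weight
                  AND f2.isCurrent = 1)
ORDER BY f.weight DESC, f.date ASC
LIMIT 3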
I have the following query. I picked it from the MySQL slow queries log:
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step ON item_step.item_id = item.id
WHERE
item_step.number = '2' AND
(IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2")) AND
item.time >= '2015-03-01 07:00:00' AND
item.time < '2015-05-01 07:00:00';
As usual, I tried to inspect it using EXPLAIN:
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| 1 | SIMPLE | item | ALL | PRIMARY,time | NULL | NULL | NULL | 790464 | 38.74 | Using where |
| 1 | SIMPLE | item_step | ref | number,item_id,result2_idx | item_id | 4 | debug_db.item.id | 1 | 100.00 | Using where |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
Adding an index on (id, time) to the item table changed nothing.
The time column actually has an index; the tables are connected using foreign keys and have indexes.
I have no idea what to do here. Is it really impossible to optimize this query to avoid join type ALL?
Since you already seem to have a FK from item_step.item_id to item.id, the only option you have for improvement is focusing on the parts that filter out records.
Slightly reformatting your query, we have:
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step
ON item_step.item_id = item.id
AND item_step.number = '2'
AND (IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2"))
WHERE item.time >= '2015-03-01 07:00:00'
AND item.time < '2015-05-01 07:00:00';
First thing to notice is IS_OK(item_step.result). I have no clue what's behind this function, but I'm pretty sure it blocks the optimizer from using any index on this field efficiently. If the formula is something that can be written directly in the query, I would suggest doing so (e.g. IN (1, 4, 9), or IN (SELECT OK FROM result_values), etc.).
Going by the field names, I assume we first want to reduce the list of item ids to a minimum and then use that reduced list to work on the item_step table. To do so you'll need an index on the time field. I'm assuming the id field is automatically included in that index since it's the PK, but I'm no MySQL specialist and it might also depend on your storage engine. Anyway, in MSSQL that's how it would work; YMMV.
The second thing to do is to take this list of item ids to the item_step table and reduce the number of records there. For this you'll want a compound index on (item_id, number, result2, result). If you manage to write the IS_OK() function 'inline' into the query, you might want to try swapping the last two fields around; that is something you'll need to test.
From what I read here and there, MySQL does not support something like INCLUDE on indexes in the way MSSQL does. A way around that would be to create a 'covering' index on (time, duration) on item. That way, everything can be done from the index directly, at the cost of more disk space and CPU when adding data to the item table.
In short (DDL for the indexes is sketched after this list):
add index on item on time, duration
add index on item_step on item_id, number, result2, result
see if you can inline the IS_OK() function.
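A sketch of the corresponding DDL (the index names are illustrative):
CREATE INDEX idx_item_time_duration ON item (time, duration);
CREATE INDEX idx_item_step_lookup ON item_step (item_id, number, result2, result);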
I have a top list page on my website, which fetches the 25 rows with the highest values in a particular column. I have no issues fetching the top list when it is based on one column (score, for instance), but when more columns are involved, I ran into some performance issues.
In the problematic case I want to select 25 rows, ordered by a sum of two columns in a descending order.
SELECT username, rank1 + rank2 AS rank FROM users ORDER BY rank DESC LIMIT 25
The query works, but takes approximately 0.25 seconds to finish, in contrast to queries on a single column, which take about 0.0003 seconds. Below is the result of EXPLAIN for this query:
+----+-------------+----------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows   | Extra          |
+----+-------------+----------+------+---------------+------+---------+------+--------+----------------+
|  1 | SIMPLE      | accounts | ALL  | NULL          | NULL | NULL    | NULL | 517874 | Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+----------------+
Both rank1 and rank2 are indexed, but clearly the indexes are not used for this query. Is there a way to improve the performance by somehow editing the query or the indexes?
MySQL does not handle this situation very well. Other databases (Oracle, Postgres, SQL Server, for example) offer some form of function-based indexes which can directly solve this problem. To do this in MySQL requires adding a new column to the table, then adding a trigger to keep it up-to-date. And finally an index on the new column. Perhaps a lot of work.
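For illustration, the column-plus-trigger route described above might look like this sketch (the column, trigger, and index names are made up; it assumes rank1 and rank2 are never NULL):
ALTER TABLE users ADD COLUMN rank_total INT;
UPDATE users SET rank_total = rank1 + rank2;

CREATE TRIGGER users_rank_ins BEFORE INSERT ON users
FOR EACH ROW SET NEW.rank_total = NEW.rank1 + NEW.rank2;

CREATE TRIGGER users_rank_upd BEFORE UPDATE ON users
FOR EACH ROW SET NEW.rank_total = NEW.rank1 + NEW.rank2;

CREATE INDEX idx_users_rank_total ON users (rank_total);
The top-25 query then becomes a plain index scan: SELECT username, rank_total AS rank FROM users ORDER BY rank_total DESC LIMIT 25.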
In some situations, you might be able to assume that the top XXX by the sum is going to be in the top YYY for each ranking. If this is true, then a query such as this will improve performance:
select ur1.*
from (select u.*
from users u
order by rank1 desc
limit 1000
) ur1 join
(select u.*
from users u
order by rank2 desc
limit 1000
) ur2
on ur1.username = ur2.username
order by ur1.rank1 + ur1.rank2 desc
limit 25;
This extracts the top 1000 (or whatever value) by each ranking and then identifies users common to the two lists. Hopefully there are 25 such users (for your application). At the very least, this should perform better than the overall query. You can try this first: if it returns 25 rows, great; otherwise, fall back to your original query.
I'm trying to figure out why one of my queries is slow and how I can fix it, but I'm a bit puzzled by the results.
I have an orders table with around 80 columns and 775,179 rows, and I'm running the following request:
SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC LIMIT 200
which returns 38 rows in 4.5s
When removing the ORDER BY I'm getting a nice improvement :
SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL LIMIT 200
38 rows in 0.30s
But when removing the LIMIT without touching the ORDER BY I'm getting an even better result :
SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC
38 rows in 0.10s (??)
Why is my LIMIT so hungry?
GOING FURTHER
I was trying a few things before sending my answer, and after noticing that I had an index on creation_date (which is a datetime), I removed it; the first query now runs in 0.10s. Why is that?
EDIT
Good guess; I have indexes on the other columns that are part of the WHERE.
mysql> explain SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC LIMIT 200;
+----+-------------+--------+-------+------------------------+---------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+------------------------+---------------+---------+------+------+-------------+
| 1 | SIMPLE | orders | index | id_state_idx,id_mp_idx | creation_date | 5 | NULL | 1719 | Using where |
+----+-------------+--------+-------+------------------------+---------------+---------+------+------+-------------+
1 row in set (0.00 sec)
mysql> explain SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC;
+----+-------------+--------+-------+------------------------+-----------+---------+------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+------------------------+-----------+---------+------+-------+----------------------------------------------------+
| 1 | SIMPLE | orders | range | id_state_idx,id_mp_idx | id_mp_idx | 3 | NULL | 87502 | Using index condition; Using where; Using filesort |
+----+-------------+--------+-------+------------------------+-----------+---------+------+-------+----------------------------------------------------+
Indexes do not necessarily improve performance. To better understand what is happening, it would help if you included the EXPLAIN output for the different queries.
My best guess would be that you have an index on id_state, or even on (id_state, id_mp), that can be used to satisfy the WHERE clause. If so, the query without the ORDER BY would use this index. It should be pretty fast. Even without an index, this requires a sequential scan of the pages in the orders table, which can still be pretty fast.
Then when you add the index on creation_date, MySQL decides to use that index for the ORDER BY instead. This requires reading each row in the index, then fetching the corresponding data page to check the WHERE conditions and return the columns (if there is a match). This reading is highly inefficient, because it is not in "page" order but rather in the order specified by the index, and random reads can be quite slow.
Worse, even though you have a LIMIT, you still have to read almost the entire table, because only 38 rows match: the scan cannot stop until it has found the 200 rows the LIMIT asks for. Although you have saved a sort on 38 records, you have created a massively inefficient query.
By the way, this situation gets significantly worse if the orders table does not fit in available memory. Then you have a condition called "thrashing", where each new record tends to generate a new I/O read. So, if a page has 100 records on it, the page might have to be read 100 times.
You can make all these queries run faster by having an index on orders(id_state, id_mp, creation_date). The WHERE clause will use the first two columns, and the ORDER BY will use the last.
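A sketch of that index (the name is illustrative):
CREATE INDEX idx_orders_state_mp_created ON orders (id_state, id_mp, creation_date);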
The same problem happened in my project. I did some tests and found out that LIMIT is slow because of row lookups.
See:
MySQL ORDER BY / LIMIT performance: late row lookups
So, the solution is:
(A) when using LIMIT, select not all columns, but only the PK column
(B) select all the columns you need, and join them with the result set of (A)
The SQL should look like this:
SELECT *
FROM orders O1                          -- this is what you want
JOIN
(
    SELECT ID                           -- fetch the PK column only, this should be fast
    FROM orders
    WHERE [your query condition]        -- filter records by condition
    ORDER BY [your order by condition]  -- control the record order
    LIMIT 2000, 50                      -- filter records by paging condition
) AS O2
ON O1.ID = O2.ID
ORDER BY [your order by condition]      -- control the record order
In my DB, the old SQL, which selects all columns using "LIMIT 21560, 20", costs about 4.484s.
The new SQL costs only 0.063s. The new one is about 71 times faster.
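Applied to the orders query from this question, the pattern might look like the following sketch (assuming id is the primary key of orders):
SELECT O1.*
FROM orders O1
JOIN (SELECT id
      FROM orders
      WHERE id_state = 2 AND id_mp IS NOT NULL
      ORDER BY creation_date DESC
      LIMIT 200) AS O2
  ON O1.id = O2.id
ORDER BY O1.creation_date DESC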
I had a similar issue on a table of 2.5 million records. Removing the LIMIT part, the query took a few seconds; with the LIMIT part, it was stuck forever.
I solved it with a subquery. In your case it would become:
SELECT *
FROM
(SELECT *
FROM orders
WHERE id_state = 2
AND id_mp IS NOT NULL
ORDER BY creation_date DESC) tmp
LIMIT 200
I noticed that the original query was fast when the number of selected rows was greater than the LIMIT parameter, and that the query became extremely slow when the LIMIT parameter was effectively useless.
Another solution is to try forcing an index. In your case you can try:
SELECT *
FROM orders force index (id_mp_idx)
WHERE id_state = 2
AND id_mp IS NOT NULL
ORDER BY creation_date DESC
LIMIT 200
The problem is that MySQL is forced to sort the data on the fly. My deep-offset query, like:
ORDER BY somecol LIMIT 99990, 10
Took 2.5s.
I fixed it by creating a new table that contains only ids, presorted by the somecol column; there the deep offset (with no need for ORDER BY) takes 0.09s.
Still, 0.1s is not fast enough; 0.01s would be better.
I will end up creating a table that holds the page number as a special indexed column, so instead of doing LIMIT x, y I will query WHERE page = Z.
I just tried it and it is as fast as 0.0013s. The only problem is that the offsetting is based on static numbers (presorted into pages of 10 items, for example). That's not a big problem though; you can still get any data from any page.
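A sketch of that precomputed-page idea (all names are illustrative; the user-variable numbering works on older MySQL, while on 8.0+ a ROW_NUMBER() window function is the cleaner choice):
CREATE TABLE presorted_pages (
    page    INT NOT NULL,
    item_id INT NOT NULL,
    KEY idx_page (page)
);

-- rebuild after each resort; 10 items per page as in the example
SET @row := -1;
INSERT INTO presorted_pages (page, item_id)
SELECT FLOOR((@row := @row + 1) / 10), id
FROM items
ORDER BY somecol;

-- fetching "page 9999" is now an indexed lookup instead of a deep offset
SELECT i.*
FROM presorted_pages p
JOIN items i ON i.id = p.item_id
WHERE p.page = 9999;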
I'm trying to optimize a very old query that I can't wrap my head around. The result I want to achieve is to recommend to a visitor of a web shop what other customers have shown interest in, i.e. what else they have bought together with the product the visitor is looking at.
I have a subquery, but it's very slow: it takes ~15s on ~8,000,000 rows.
The layout is that all products put in a user's basket are kept in a table wsBasket, separated by a basketid (which another table associates with a member).
In this example I want to list the most popular products that users have bought together with productid 427, but not list productid 427 itself.
SELECT productid, SUM(quantity) AS qty
FROM wsBasket
WHERE basketid IN
    (SELECT basketid
     FROM wsBasket
     WHERE productid = 427)
  AND productid != 427
GROUP BY productid
ORDER BY qty DESC
LIMIT 0, 4;
Any help is much appreciated! Hope this makes sense to at least someone :)
UPDATE 1:
Thanks for your comments, guys; here are my answers, which didn't fit in the comment field.
Using EXPLAIN on the above query I got the following. Please note that I do not have any indexes on the table (except for the primary key on the id field); I want to modify the query to benefit from indexes and place indexes on the right keys.
+----+--------------------+----------+------+---------------+------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+------+---------------+------+---------+------+------+----------------------------------------------+
| 1 | PRIMARY | wsBasket | ALL | NULL | NULL | NULL | NULL | 2821 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | wsBasket | ALL | NULL | NULL | NULL | NULL | 2821 | Using where |
+----+--------------------+----------+------+---------------+------+---------+------+------+----------------------------------------------+
Two obvious indexes to add: one on basketid and a second on productid. Then retry the query and run a new EXPLAIN to check that the indexes are being used.
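A sketch of those two indexes (the names are illustrative):
CREATE INDEX idx_wsBasket_basketid ON wsBasket (basketid);
CREATE INDEX idx_wsBasket_productid ON wsBasket (productid);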
As well as ensuring that suitable indexes exist on productid and basketid, you will often benefit from structuring your query as a simple join rather than a subquery, especially in MySQL.
SELECT b1.productid, SUM(b1.quantity) AS qty
FROM wsBasket AS b0
JOIN wsBasket AS b1 ON b1.basketid=b0.basketid
WHERE b0.productid=427 AND b1.productid<>427
GROUP BY b1.productid
ORDER BY qty DESC
LIMIT 4
For me, on a possibly-similar dataset, the join resulted in two select_type: SIMPLE rows in the EXPLAIN output, whereas the subquery method spat out a horrible-for-performance DEPENDENT SUBQUERY. Consequently the join was well over an order of magnitude faster.
The two fields you mainly use for searching in this query are productid and basketid.
When you search for records having productid equal to 427, the database has no clue where to find them. It doesn't even know that, if it finds one matching record, there won't be another match, so it has to look through the entire table, potentially thousands of records.
An index is a separate file that is sorted and contains only the field(s) you're interested in searching on, so creating an index saves an immense amount of time!