I have two tables:
module_339 (id,name,description,etc)
module_339_schedule(id,itemid,datestart,dateend,timestart,timeend,days,recurrent)
module_339_schedule.itemid points to module_339
first table holds the conferences
second one keeps the schedules of the conferences
module_339 has 3 items
module_339_schedule has 4000+ items - almost evenly divided between the 3 conferences
I have a stored function - "getNextDate_module_339" - which computes the "next date" for a specified conference, so that I can display it and also sort by it, if the user wants to. This stored function simply takes all the schedule entries of the specified conference and loops through them, comparing dates and times. So it does one simple read from module_339_schedule, then loops through the items and compares dates and times.
The problem: this query is very slow:
SELECT
    DISTINCT(module_339.id),
    MIN(getNextDate_module_339(module_339.id, 1, false)) AS ND
FROM
    module_339
    LEFT JOIN module_339_schedule ON module_339.id = module_339_schedule.itemid /* standard schedule adding */
WHERE 1=1 AND module_339.is_system_preview <= 0
GROUP BY
    module_339.id
ORDER BY
    module_339.id ASC
If I remove either the function call OR the LEFT JOIN, it is fast again.
What am I doing wrong here? Seems to be some kind of "collision" between the function call and the left join.
I think the GROUP BY part can be removed from this query, which lets you remove the MIN function as well. Also, there is not much point in WHERE 1=1 AND ..., so I've changed that as well.
Try this:
SELECT DISTINCT module_339.id
,getNextDate_module_339(module_339.id,1,false) AS ND
FROM module_339
LEFT JOIN module_339_schedule ON module_339.id=module_339_schedule.itemid /* standard schedule adding */
WHERE module_339.is_system_preview<=0
ORDER BY module_339.id
Note that this might not have a lot of impact on performance.
I think that the worst part performance-wise is probably the getNextDate_module_339 function.
If you can find a way to get its functionality without using a function as a sub-query, your SQL statement will probably run a lot faster than it does now, with or without the LEFT JOIN.
If you need help doing this, please edit your question to include the function and hopefully I (or someone else) might be able to help you with that.
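In the meantime, here is a rough set-based sketch of what such a rewrite could look like. Since the function body isn't shown, this assumes the "next date" is simply the earliest future datestart/timestart pair and ignores the days/recurrent columns:
SELECT m.id,
       MIN(TIMESTAMP(s.datestart, s.timestart)) AS ND
FROM module_339 m
LEFT JOIN module_339_schedule s
       ON s.itemid = m.id
      AND TIMESTAMP(s.datestart, s.timestart) >= NOW() /* future entries only; recurrence handling omitted */
WHERE m.is_system_preview <= 0
GROUP BY m.id
ORDER BY m.id;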
From the MySQL reference manual:
The best way to improve the performance of SELECT operations is to create indexes on one or more of the columns that are tested in the query. The index entries act like pointers to the table rows, allowing the query to quickly determine which rows match a condition in the WHERE clause, and retrieve the other column values for those rows. All MySQL data types can be indexed.
Although it can be tempting to create an index for every possible column used in a query, unnecessary indexes waste space and waste time for MySQL to determine which indexes to use. Indexes also add to the cost of inserts, updates, and deletes because each index must be updated. You must find the right balance to achieve fast queries using the optimal set of indexes.
As a first step I suggest checking that the joined columns are both indexed. Since primary keys are always indexed by default, we can assume that module_339 is already indexed on the id column, so first verify that module_339_schedule is indexed on the itemid column. You can check the indexes on that table in MySQL using:
SHOW INDEX FROM module_339_schedule;
If the table does not have an index on that column, you can add one using:
CREATE INDEX itemid_index ON module_339_schedule (itemid);
That should speed up the join component of the query.
Since your query also references module_339.is_system_preview you might also consider adding an index to that column using:
CREATE INDEX is_system_preview_index ON module_339 (is_system_preview);
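Once the indexes are in place, you can check that the optimizer actually picks them up by looking at the key column of the EXPLAIN output:
EXPLAIN SELECT module_339.id
FROM module_339
LEFT JOIN module_339_schedule ON module_339.id = module_339_schedule.itemid
WHERE module_339.is_system_preview <= 0;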
You might also be able to optimize the stored procedure, but you haven't included it in your question.
Please assume the following query:
select dic.*, d.syllable
from dictionary dic
join details d on d.word = dic.word
As you know (and I heard), MySQL uses only one index per query. Now I want to know, in query above, which index would be better?
dictionary(word)
details(word)
In other words, when there is a join (two tables are involved), which table's index would be used? Should I create both of them (on the columns in the ON clause) and let MySQL itself decide which one is better?
As you know (and I heard), MySQL uses only one index per query.
In general, most databases will only use one index per table, per query. This isn't always the case, but is at least a decent rule of thumb. For your particular example, you can rely on this.
Now I want to know, in query above, which index would be better?
The query you wrote is actually an inner join. This means that either of the two tables could appear on the left side of the join, and the result sets would be logically equivalent. As a result of this, MySQL is therefore free to write the join in any order it chooses. The plan that gets chosen will likely place the larger table on the left hand side, and the smaller table on the right hand side. If you knew the actual execution order of the tables, then you would just index the right table. Given that you may not know this, then both of your suggested indices are reasonable:
CREATE INDEX dict_idx ON dictionary (word);
CREATE INDEX details_idx ON details (word);
We could even try to improve on the above indices by covering the columns which appear in the select clause. For example, the index on details could be expanded to:
CREATE INDEX details_idx ON details (word, syllable);
This would let MySQL use the above index exclusively to satisfy the query plan, without requiring a seek back to the original table. You select dictionary.*, so covering this with a single index might not be possible or desirable, but at least this gets the point across.
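As a quick check: when a covering index fully satisfies a table's access, EXPLAIN shows "Using index" in the Extra column for that table. For example:
EXPLAIN SELECT dic.*, d.syllable
FROM dictionary dic
JOIN details d ON d.word = dic.word;
/* expect Extra: Using index on the details row once details_idx covers (word, syllable) */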
MySQL would use the most selective index (the one giving the fewest rows). This means it depends on the data, and also optimizations like this could change between versions of the database.
I know there are several questions similar to this one, but those I've found do not relate directly to my problem.
Some initial context: I have a facts table, called ft_booking, with around 10MM records. I have a dimension called dm_date, with around 11k records, which are dates. These tables are related through foreign keys, as usual. There are 3 date foreign keys in the table ft_booking: one for boarding, one for booking, and another for cancellation. All three columns have the very same definition, and the number of distinct values in each is similar (ranging from 2.5k to 3k distinct values per column).
There I go:
EXPLAIN SELECT *
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_booking
WHERE DATE(db.date) = '2018-05-05'
The EXPLAIN output shows the index being used on the booking table, and the query runs really fast, even though I'm using the date() function in my filter. For brevity, I'll just state that the same happens with the column fk_date_boarding. But, check this out:
EXPLAIN SELECT *
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_cancellation
WHERE DATE(db.date) = '2018-05-05';
For some mysterious reason, the planner chooses not to use the index. Now, I understand that applying a function to a column generally forces the database to perform a full table scan, in order to be able to apply that function to every value in the column, thus bypassing the index. But in this case, the function is not applied to the actual foreign key column, which is where the lookup in the booking table should be occurring.
If I remove the date() function, the index is used with any of those columns, as expected. One might say, then, "well, why don't you just get rid of the date() function?" - I use Metabase, a tool that lets users build queries through a graphical interface without knowing MySQL, and one of its current limitations is that it always applies the date() function when building queries that aren't written directly in MySQL - hence, I have no way to remove the function from the queries I'm running.
Actual question: why does MySQL use the index in the first two cases, but not in the latter, considering that the number of distinct values is pretty much the same for all three columns and they have the exact same definition, apart from the name? Am I missing something here?
EDIT: Here is the CREATE statement for each table involved. There are some more, but here we only need the tables ft_booking and dm_date (the first two tables in the file).
You are "hiding" the date column inside a function call. If db.date is declared a DATE, then
date (db.date) = '2018-05-05'
can be simply
db.date = '2018-05-05'
If db.date is declared a DATETIME, then change to
db.date >= '2018-05-05'
AND db.date < '2018-05-05' + INTERVAL 1 DAY
In either case, be sure there is an index on db.date.
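For example, assuming the dm_date table from the question and that no such index exists yet:
CREATE INDEX dm_date_date_idx ON dm_date (`date`);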
If by "I have a dimension called dm_date", you mean you built a dimension table to hold just dates, and then you are JOINing to the main table with some id, ... To put it bluntly, don't do that! Do not normalize "continuous" things such as DATE, DATETIME, FLOAT, or other numeric values.
If you need to discuss this further, please provide SHOW CREATE TABLE for the relevant table(s). (And please use text, not screen shots.)
Why??
The simple answer is that the Optimizer does not know how to unravel any function. Perhaps it could; perhaps it should. But it does not. Perhaps the answer involves not wanting to see how the function result will be used... comparing against a DATE? against a DATETIME? being used as a string? other?
Still, I suggest the real performance killer is the existence of dm_date rather than indexing and using the date in the main table.
Furthermore, the main table is bigger than it needs to be! fk_date_booking is a 4-byte INT SIGNED instead of a 3-byte DATE.
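If you do decide to denormalize, a hypothetical migration sketch (the date_booking column name is made up for illustration) could look like this:
/* store the booking date directly on the facts table and index it */
ALTER TABLE dw.ft_booking ADD COLUMN date_booking DATE;

UPDATE dw.ft_booking b
JOIN dw.dm_date d ON d.sk_date = b.fk_date_booking
SET b.date_booking = d.`date`;

CREATE INDEX date_booking_idx ON dw.ft_booking (date_booking);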
I am trying to join two tables on the condition that the value of a row in the first table is greater than the value of a row in the second table. Or in code:
select * from Computers join Locations
ON Computers.ip_number > Locations.ipLower;
All of the referenced columns (ip_number and ipLower) are indexed in their respective tables with very high cardinality. However, calling EXPLAIN on the statement shows that no indexes are used in the call. How can I force MySQL to use indexes on the join?
Additional info:
I am using MySQL version 5.6.17. The query correctly uses indexes if the join condition is equality instead of greater than. The indexes are B-tree type.
Edit: The ip_number variable referenced is an integer which is derived from an IP address, not the IP address itself.
MySQL uses indexes when its query planner judges that there is performance to be gained by doing so.
It is not surprising that it doesn't make that judgement in this case; each row of your first table is, by your ON condition, joined to a great many rows of your second table.
Don't worry about what indexes get used by the query planner until you have a query that makes sense in your application. It's not clear the one you've shown makes sense. Its result set will be quite large.
This query might make more sense. It might also use range scans on the index on Computers.ip_number.
select *
from Computers
join Locations ON Computers.ip_number BETWEEN Locations.ipLower AND Locations.ipUpper
But, you probably should enumerate the columns you want in the result set and avoid SELECT * if you want decent performance.
Also, don't forget that IP addresses in dotted quad form 192.168.167.66 don't collate reasonably. That is, using inequalities like < or BETWEEN to compare them with each other doesn't really work.
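If dotted-quad strings do turn up somewhere, MySQL's built-in INET_ATON() converts them to 32-bit integers that compare correctly. A sketch (the ip_address column is hypothetical):
SELECT INET_ATON('192.168.167.66'); /* 3232278338 */

/* persist the integer form so it can be indexed and range-compared */
UPDATE Computers SET ip_number = INET_ATON(ip_address);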
I am building an analytics platform where users can create reports and such against a MySQL database. Some of the tables in this database are pretty huge (billions of rows), so for all of the features so far I have indexes built to speed up each query.
However, the next feature is to add the ability for a user to define their own query so that they can analyze data in ways that we haven't pre-defined. They have full read permission to the relevant database, so basically any SELECT query is a valid query for them to enter. This creates problems, however, if a query is defined that filters or joins on a column we haven't currently indexed - sometimes to the point of taking over a minute for a simple query to execute - something as basic as:
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM
tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE
tbl1.d > 0
GROUP BY
tbl1.a, tbl1.b, tbl3.c, tbl1.d
Now, assume that we've only created indexes on columns not appearing in this query so far. Also, we don't want too many indexes slowing down inserts, updates, and deletes (otherwise the simple solution would be to build an index on every column accessible by the users).
My question is, what is the best way to handle this? Currently, I'm thinking that we should scan the query, build indexes on anything appearing in a WHERE or JOIN that isn't already indexed, execute the query, and then drop the indexes that were built afterwards. However, the main things I'm unsure about are a) is there already some best practice for this sort of use case that I don't know about? and b) would the overhead of building these indexes be enough that it would negate any performance gains the indexes provide?
If this strategy doesn't work, the next option I can see working is to collect statistics on what types of queries the users run, and have some regular job periodically check what commonly used columns are missing indexes and create them.
If you are using MyISAM, then performing an ALTER statement on large tables (billions of rows) in order to add an index will take a considerable amount of time, probably far longer than the one minute you've quoted for the statement above (and you'll need another ALTER to drop the index afterwards). During that time, the table will be locked, meaning other users can't execute their own queries.
If your tables use the InnoDB engine and you're running MySQL 5.1+, then CREATE / DROP index statements shouldn't lock the table, but it still may take some time to execute.
There's a good rundown of the history of ALTER TABLE [here][1].
I'd also suggest that automated query analysis to identify and build indexes would be quite difficult to get right. For example, what about cases such as selecting by foo.a but ordering by foo.b? This kind of query often needs a covering index over multiple columns, otherwise you may find your server attempting a filesort on a huge result set, which can cause big problems.
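For instance, with a hypothetical table foo filtered on a and ordered by b, a composite index lets the server read the rows already in order and skip the filesort:
/* foo and its columns are made up for illustration */
CREATE INDEX foo_a_b_idx ON foo (a, b);

SELECT * FROM foo WHERE a = 42 ORDER BY b;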
Giving your users an "explain query" option would be a good first step. If they know enough SQL to perform custom queries then they should be able to analyse EXPLAIN in order to best execute their query (or at least realise that a given query will take ages).
So, going further with my idea, I propose you segment your data into well-identified views. You used abstract names, so I can't reuse your business model, but I'll take a virtual example.
Say you have 3 tables:
customer (gender, social category, date of birth, ...)
invoice (date, amount, ...)
product (price, date of creation, ...)
you would create a sort of materialized view for each specific segment. It's like adding a business layer on top of the very bottom data-representation layer.
For example, we could identify the following segments:
seniors having at least 2 invoices
invoices of 2013 with more than 1 product
How to do that? And how to do it efficiently? Regular views won't help your problem, because they will produce poor explain plans for arbitrary queries. What we need is a real physical representation of these segments. We could do something like this:
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT ... /* select from the existing tables */
;
/* add indexes: */
ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD CONSTRAINT...
... etc.
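Fleshed out against the virtual tables above, that might look like this (all column names are assumptions, since the real model isn't shown):
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT c.id, c.birth_date,
       COUNT(i.id)   AS invoice_count,
       SUM(i.amount) AS total_amount
FROM customer c
JOIN invoice i ON i.customer_id = c.id
WHERE c.birth_date <= CURDATE() - INTERVAL 65 YEAR /* "senior" cutoff is an assumption */
GROUP BY c.id, c.birth_date
HAVING COUNT(i.id) >= 2;

CREATE INDEX mv_seniors_amount_idx ON MV_SENIORS_WITH_2_INVOICES (total_amount);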
So now, your users just have to query MV_SENIORS_WITH_2_INVOICES instead of the original tables. Since there are fewer records, and probably more indexes, the performance will be better.
We're done! Oh wait, no :-)
We need to refresh this data, a bit like a FAST REFRESH in Oracle. MySQL does not have a similar system (not that I know of... someone correct me?), so we have to create some triggers for that.
CREATE TRIGGER ... AFTER INSERT ON `invoice`
... /* insert the data into MV_SENIORS_WITH_2_INVOICES if it matches the segment */
END;
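A concrete version of such a trigger, again with hypothetical names, might be:
DELIMITER //
CREATE TRIGGER trg_invoice_after_insert AFTER INSERT ON invoice
FOR EACH ROW
BEGIN
    /* re-evaluate segment membership for the affected customer */
    DELETE FROM MV_SENIORS_WITH_2_INVOICES WHERE id = NEW.customer_id;
    INSERT INTO MV_SENIORS_WITH_2_INVOICES (id, birth_date, invoice_count, total_amount)
    SELECT c.id, c.birth_date, COUNT(i.id), SUM(i.amount)
    FROM customer c
    JOIN invoice i ON i.customer_id = c.id
    WHERE c.id = NEW.customer_id
      AND c.birth_date <= CURDATE() - INTERVAL 65 YEAR
    GROUP BY c.id, c.birth_date
    HAVING COUNT(i.id) >= 2;
END//
DELIMITER ;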
Now we're done!
One of my stored procedures was taking too long to execute. Taking a look at the query execution plan, I was able to locate the operation that was taking too long: a nested loops physical operator with an outer table (65,991 rows) and an inner table (19,223 rows). The nested loops operator showed estimated rows = 1,268,544,993 (65,991 multiplied by 19,223).
I read a few articles on the physical operators used for joins and got a bit confused about whether a nested loop or a hash match would have been better in this case. From what I could gather:
Hash Match - is used by the optimizer when no useful indexes are available, when one table is substantially smaller than the other, or when the tables are not sorted on the join columns. Also, a hash match might indicate that a more efficient join method (nested loops or merge join) could be used.
Question: Would hash match be better than nested loops in this scenario?
ABSOLUTELY. A hash match would be a huge improvement. Creating the hash on the smaller 19,223 row table then probing into it with the larger 65,991 row table is a much smaller operation than the nested loop requiring 1,268,544,993 row comparisons.
The only reason the server would choose the nested loops is that it badly underestimated the number of rows involved. Do your tables have statistics on them, and if so, are they being updated regularly? Statistics are what enable the server to choose good execution plans.
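In SQL Server you can check how stale the statistics are and refresh them like this (the table name is a placeholder):
/* when were statistics on the table last updated? */
SELECT s.name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.TableA');

/* rebuild them from a full scan of the data */
UPDATE STATISTICS dbo.TableA WITH FULLSCAN;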
If you've properly addressed statistics and are still having a problem you could force it to use a HASH join like so:
SELECT *
FROM
    TableA A -- the smaller table
    LEFT HASH JOIN TableB B -- the larger table
        ON A.id = B.id -- join condition assumed for illustration; not shown in the question
Please note that the moment you do this it will also force the join order. This means you have to arrange all your tables correctly so that their join order makes sense. Generally you would examine the execution plan the server already has and alter the order of the tables in your query to match. If you're not familiar with how to do this, the basics are that each "left" input comes first, and in graphical execution plans the left input is the upper one. A complex join involving many tables may have to group joins together inside parentheses, or use RIGHT JOIN, in order to get the execution plan to be optimal (swap left and right inputs, but introduce the table at the correct point in the join order).
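If you want the hash strategy without dictating join order, the query-level form of the hint applies to all joins in the statement but leaves the ordering decision to the optimizer:
SELECT *
FROM TableA A
LEFT JOIN TableB B
    ON A.id = B.id /* join condition assumed for illustration */
OPTION (HASH JOIN);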
It is generally best to avoid using join hints and forcing join order, so do whatever else you can first! You could look into the indexes on the tables, fragmentation, reducing column sizes (such as using varchar instead of nvarchar where Unicode is not required), or splitting the query into parts (insert to a temp table first, then join to that).
I would not recommend trying to "fix" the plan by forcing the hints in one direction or another. Instead, you need to look to your indexes, statistics and the T-SQL code to understand why you have a table spool loading up 1.2 billion rows from 19,000.