MySQL index usage on join

I know there are several questions similar to this one, but those I've found do not relate directly to my problem.
Some initial context: I have a facts table, called ft_booking, with around 10MM records, and a dimension table called dm_date, with around 11k records, which are dates. These tables are related through foreign keys, as usual. There are 3 date foreign keys in ft_booking: one for boarding, one for booking, and one for cancellation. All three columns have the very same definition, and the number of distinct values in each is similar (ranging from 2.5k to 3k distinct values per column).
Here goes:
EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_booking
WHERE DATE(db.date) = '2018-05-05'
As you can see, the index is being used for the booking table, and the query runs really fast, even though, in my filter, I'm using the DATE() function. For brevity, I'll just state that the same happens with the column fk_date_boarding. But, check this out:
EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_cancellation
WHERE DATE(db.date) = '2018-05-05';
For some mysterious reason, the planner chooses not to use the index. Now, I understand that applying a function to a column generally forces the database to perform a full table scan, in order to be able to apply that function to each value, thus bypassing the index. But, in this case, the function is not applied to the actual foreign key column, which is where the lookup in the booking table should be occurring.
If I remove the DATE() function, the index is used on any of those columns, as expected. One might say, then, "well, why don't you just get rid of the DATE() function?" - I use Metabase, a tool that allows users to build queries through a graphical interface without knowing MySQL, and one of its current limitations is that it always applies the DATE() function when building queries not directly written in MySQL - hence, I have no way to remove the function from the queries I'm running.
Actual question: why does MySQL use the index in the first two cases, but not in the last, considering the number of distinct values is pretty much the same for all columns and they have the exact same definition, apart from the name? Am I missing something here?
EDIT: Here is the CREATE statement of each table involved. There are some more in the file, but we only need the tables ft_booking and dm_date (the first two tables of the file).

You are "hiding date in a function call". If db.date is declared a DATE, then
DATE(db.date) = '2018-05-05'
can be simply
db.date = '2018-05-05'
If db.date is declared a DATETIME, then change to
db.date >= '2018-05-05'
AND db.date < '2018-05-05' + INTERVAL 1 DAY
In either case, be sure there is an index on db.date.
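A minimal sketch of that index (the index name is an assumption):
CREATE INDEX ix_dm_date_date ON dm_date (`date`);
With that index in place and the function removed, the predicate db.date = '2018-05-05' becomes sargable and can be satisfied by an index lookup.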
If by "I have a dimension called dm_date", you mean you built a dimension table to hold just dates, and then you are JOINing to the main table with some id, ... To put it bluntly, don't do that! Do not normalize "continuous" things such as DATE, DATETIME, FLOAT, or other numeric values.
If you need to discuss this further, please provide SHOW CREATE TABLE for the relevant table(s). (And please use text, not screen shots.)
Why??
The simple answer is that the Optimizer does not know how to unravel any function. Perhaps it could; perhaps it should. But it does not. Perhaps the answer involves not wanting to see how the function result will be used... comparing against a DATE? against a DATETIME? being used as a string? other?
Still, I suggest the real performance killer is the existence of dm_date rather than indexing and using the date in the main table.
Furthermore, the main table is bigger than it needs to be! fk_date_booking is a 4-byte INT SIGNED instead of a 3-byte DATE.
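If you do denormalize as suggested, a hedged sketch of what that could look like (the new column and index names are assumptions):
ALTER TABLE ft_booking
    ADD COLUMN date_booking DATE,
    ADD COLUMN date_cancellation DATE;
CREATE INDEX ix_ft_booking_date_booking ON ft_booking (date_booking);
-- the filter then hits the fact table's own index directly:
SELECT * FROM ft_booking WHERE date_booking = '2018-05-05';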

Related

MySql date inside varchar (select correct date)?

I have "varchar" field in my database where I have stored records like:
(11.1.2015) Log info 1
(17.4.2015) Log info 2
(22.5.2015) Log info 3
(25.5.2015) Log info 3
...
Now I would like to make a SELECT WHERE the date inside the parentheses is the same as or later than today, and select the first one (so with this sample and today's date I should get 22.5.2015). I just can't figure out how to do that, so I need some help.
In principle I agree with Pekka웃 on this one.
You should always strive to use proper data types for your data.
This also means never use one column to store 2 different data segments.
However, from the comments to Pekka웃's answer I understand that changing the table is not possible, so here's my attempt to do it.
Assuming your dates are always at the start of the varchar, and always surrounded by parenthesis, you can probably do something like this:
SELECT *
FROM (
    SELECT
        STR_TO_DATE(SUBSTR(Log_data, 2, LOCATE(')', Log_data)-2), '%d.%m.%Y') LogDate,
        SUBSTR(Log_data, LOCATE(')', Log_data)+1, CHAR_LENGTH(Log_data)) LogData
    FROM logs_table
) NormalizedLogTable
WHERE LogDate >= CURDATE()
ORDER BY LogDate
LIMIT 1
Note #1: This is a workaround for your specific situation.
If you ever get the chance, you should normalize your table.
Note #2: I'm not a MySQL guy. Most of my SQL experience is with SQL Server.
STR_TO_DATE with an explicit '%d.%m.%Y' format is used above rather than a plain CAST, to overcome the ambiguity of values like 1.3.2015.
This is likely to be impossible to do with a varchar field, or hellishly complex. While you can theoretically use regex functions in MySQL to match patterns, you are looking for a date range. Even if it were possible to build a query somehow, it would be a gigantic pile of work and there is no way MySQL could optimize for any aspect of it.
The normal, fast, and straightforward way to go about this is normalizing your table.
In your case, you would probably create a new table named "logs" and connect it to your existing table through an ID field that shows which parent record each log entry belongs to.
Querying for a certain date range for log entries belonging to a specific parent then becomes as easy as
SELECT log_date FROM logs WHERE parent = 155 AND log_date >= CURDATE()
It's painful to do at first (as you have to rebuild parts of your structure and likely your app) and makes some everyday queries more complex, but cases like this become much easier and faster.
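As a rough illustration, the normalized table might look something like this (all names are assumptions, since the original schema isn't shown):
CREATE TABLE logs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    parent INT NOT NULL,           -- id of the owning record in the existing table
    log_date DATE NOT NULL,
    log_info VARCHAR(255),
    INDEX (parent, log_date)       -- supports the parent + date-range query above
);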

mysql: stored function call + left join = very very slow

I have two tables:
module_339 (id,name,description,etc)
module_339_schedule(id,itemid,datestart,dateend,timestart,timeend,days,recurrent)
module_339_schedule.itemid points to module_339
first table holds conferences
second one keeps the schedules of the conferences
module_339 has 3 items
module_339_schedule has 4000+ items - almost evenly divided between the 3 conferences
I have a stored function - "getNextDate_module_339" - which will compute the "next date" for a specified conference, in order to be able to display it, and also sort by it - if the user wants to. This stored function will just take all the schedule entries of the specified conference and loop through them, comparing dates and times. So it will do one simple read from module_339_schedule, then loop through the items and compare dates and times.
The problem: this query is very slow:
SELECT
distinct(module_339.id)
,min( getNextDate_module_339(module_339.id,1,false)) AS ND
FROM
module_339
LEFT JOIN module_339_schedule on module_339.id=module_339_schedule.itemid /* standard schedule adding */
WHERE 1=1 AND module_339.is_system_preview<=0
group by
module_339.id
order by
module_339.id asc
If I remove either the function call OR the LEFT JOIN, it is fast again.
What am I doing wrong here? Seems to be some kind of "collision" between the function call and the left join.
I think the GROUP BY part can be removed from this query, thus enabling you to remove the MIN function as well. Also, there is not much point in WHERE 1=1 AND ..., so I've changed that as well.
Try this:
SELECT DISTINCT module_339.id
,getNextDate_module_339(module_339.id,1,false) AS ND
FROM module_339
LEFT JOIN module_339_schedule ON module_339.id=module_339_schedule.itemid /* standard schedule adding */
WHERE module_339.is_system_preview<=0
ORDER BY module_339.id
Note that this might not have a lot of impact on performance.
I think that the worst part performance-wise is probably the getNextDate_module_339 function.
If you can find a way to get its functionality without using a function as a subquery, your SQL statement will probably run a lot faster than it does now, with or without the left join.
If you need help doing this, please edit your question to include the function and hopefully I (or someone else) might be able to help you with that.
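For example, if "next date" simply means the earliest datestart on or after today, the function could be replaced by a plain aggregate - a sketch only, since the real function also compares times and handles recurrence:
SELECT m.id,
       MIN(s.datestart) AS ND
FROM module_339 m
LEFT JOIN module_339_schedule s
       ON s.itemid = m.id
      AND s.datestart >= CURDATE()   -- only future (or today's) entries
WHERE m.is_system_preview <= 0
GROUP BY m.id
ORDER BY m.id;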
From the MySQL reference manual:
The best way to improve the performance of SELECT operations is to create indexes on one or more of the columns that are tested in the query. The index entries act like pointers to the table rows, allowing the query to quickly determine which rows match a condition in the WHERE clause, and retrieve the other column values for those rows. All MySQL data types can be indexed.
Although it can be tempting to create indexes for every possible column used in a query, unnecessary indexes waste space and waste time for MySQL to determine which indexes to use. Indexes also add to the cost of inserts, updates, and deletes because each index must be updated. You must find the right balance to achieve fast queries using the optimal set of indexes.
As a first step I suggest checking that the joined columns are both indexed. Since primary keys are always indexed by default, we can assume that module_339 is already indexed on the id column, so first verify that module_339_schedule is indexed on the itemid column. You can check the indexes on that table in MySQL using:
SHOW INDEX FROM module_339_schedule;
If the table does not have an index on that column, you can add one using:
CREATE INDEX itemid_index ON module_339_schedule (itemid);
That should speed up the join component of the query.
Since your query also references module_339.is_system_preview you might also consider adding an index to that column using:
CREATE INDEX is_system_preview_index ON module_339 (is_system_preview);
You might also be able to optimize the stored procedure, but you haven't included it in your question.

Can Postgres use a function in a partial index where clause?

I have a large Postgres table where I want a partial index on one of the two indexed columns. Can I use a Postgres function in the WHERE clause of a partial index, and how, and then have the SELECT query utilize that partial index?
Example Scenario
The first column is "magazine", the second column is "volume", and the third column is "issue". All the magazines can have the same "volume" and "issue" numbers, but I want the index to only contain the two most recent volumes for each magazine. This is because a magazine could be older than others and have higher volume numbers than younger magazines.
Two immutable strict functions were created to determine the current and previous year's volumes for a magazine: f_current_volume('gq') and f_previous_volume('gq'). Note: the current/past volume # only changes once per year.
I tried creating a partial index with the functions, however EXPLAIN on a query shows only a seq scan for a current-volume magazine.
CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE ( magazine, volume )
WHERE volume IN (f_current_volume(magazine), f_previous_volume(magazine));
-- Both these do seq scans.
select * from issue where magazine = 'gq' and volume = 100;
select * from issue where magazine = 'gq' and volume = f_current_volume('gq');
What am I doing wrong to get this to work? And if it is possible, why does it need to be done that way for Postgres to use the index?
-- UPDATE: 2013-06-17, the following surprisingly used the index.
-- Why would using a field name rather than a value allow the index to be used?
select * from issue where magazine = 'gq' and volume = f_current_volume(magazine);
Immutability and 'current'
If your f_current_volume function ever changes its behaviour - as is implied by its name, and the presence of an f_previous_volume function, then the database is free to return completely bogus results.
Had the functions not been marked IMMUTABLE, PostgreSQL would've refused to let you create the index, complaining that you can only use IMMUTABLE functions. The thing is, marking a function IMMUTABLE means that you are telling PostgreSQL something about the function's behaviour, as per the documentation. You're saying "I promise this function's results won't change, feel free to make assumptions on that basis."
One of the biggest assumptions made is when building an index. If the function returns different outputs for different inputs on multiple invocations, things go splat. Or possibly boom if you're unlucky. In theory you can kind-of get away with changing an immutable function by REINDEXing everything, but the only really safe way is to DROP every index that uses it, DROP the function, re-create the function with its new definition and re-create the indexes.
That can actually be really useful to do if you have something that changes only infrequently, but you really have two different immutable functions at different points in time that just happen to have the same name.
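In concrete terms, the safe sequence looks something like this (a sketch with a toy function body; the real new definition goes in its place):
BEGIN;
DROP INDEX ix_issue_magazine_volume;
DROP FUNCTION f_current_volume(text);
CREATE FUNCTION f_current_volume(magazine text) RETURNS integer AS $$
    SELECT 100;  -- placeholder for the new definition
$$ LANGUAGE sql IMMUTABLE STRICT;
CREATE INDEX ix_issue_magazine_volume ON issue (magazine, volume)
    WHERE volume IN (f_current_volume(magazine), f_previous_volume(magazine));
COMMIT;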
Partial index matching
PostgreSQL's partial index matching is pretty dumb - but, as I found when writing test cases for this, a lot smarter than it used to be. It ignores a dummy OR true. It uses an index on WHERE (a%100=0 OR a%1000=0) for a WHERE a = 100 query. It even got it with a non-inline-able identity function:
regress=> CREATE TABLE partial AS SELECT x AS a, x AS b FROM generate_series(1,10000) x;
regress=> CREATE OR REPLACE FUNCTION identity(integer)
RETURNS integer AS $$
SELECT $1;
$$ LANGUAGE sql IMMUTABLE STRICT;
regress=> CREATE INDEX partial_b_fn_idx
ON partial(b) WHERE (identity(b) % 1000 = 0);
regress=> EXPLAIN SELECT b FROM partial WHERE b % 1000 = 0;
QUERY PLAN
---------------------------------------------------------------------------------------
Index Only Scan using partial_b_fn_idx on partial (cost=0.00..13.05 rows=50 width=4)
(1 row)
However, it was unable to prove the IN clause match, e.g.:
regress=> DROP INDEX partial_b_fn_idx;
regress=> CREATE INDEX partial_b_fn_in_idx ON partial(b)
WHERE (b IN (identity(b), 1));
regress=> EXPLAIN SELECT b FROM partial WHERE b % 1000 = 0;
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on partial (cost=10000000000.00..10000000195.00 rows=50 width=4)
So my advice? Rewrite IN as an OR list:
CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE ( magazine, volume )
WHERE (volume = f_current_volume(magazine) OR volume = f_previous_volume(magazine));
... and on a current version it might just work, so long as you keep the immutability rules outlined above in mind. Well, the second version:
select * from issue where magazine = 'gq' and volume = f_current_volume('gq');
might. Update: No, it won't; for it to be used, Pg would have to recognise that magazine='gq' and realise that f_current_volume('gq') was therefore equivalent to f_current_volume(magazine). It doesn't attempt to prove equivalences on that level with partial index matching, so as you've noted in your update you have to write f_current_volume(magazine) directly. I should've spotted that. In theory PostgreSQL could use the index with the second query if the planner was smart enough, but I'm not sure how you'd go about efficiently looking for places where a substitution like this would be worthwhile.
The first example, volume = 100, will never use the index, since at query planning time PostgreSQL has no idea that f_current_volume('gq') will evaluate to 100. You could add an OR clause, OR volume = 100, to your partial index WHERE clause and PostgreSQL would figure it out then, though.
First off, I'd like to volunteer a wild guess, because you're making it sound like your f_current_volume() function calculates something based on a separate table.
If so, be wary, because this means your function is volatile: it needs to be recalculated on every call (a concurrent transaction might be inserting, updating or deleting rows). Postgres won't allow you to index those, and I presume you worked around this by declaring the function immutable. Not only is this incorrect, but you also run into the issue of the index containing garbage, because the function gets evaluated as you edit the row, rather than at query time. What you'd probably want instead -- again, if my guess is correct -- is to store and maintain the totals in the table itself using triggers.
Regarding your specific question, partial indexes need to have their where condition be met in the query to prompt Postgres to use them. I'm quite sure that Postgres is smart enough to identify that e.g. 10 is between 5 and 15 and use a partial index with that clause. I'm very suspicious that it would know that f_current_volume('gq') is 100 in your case, however, considering the above-mentioned caveat.
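If that guess is right, the trigger-maintained alternative might look roughly like this - purely hypothetical, since the underlying tables aren't shown and every name here is assumed:
-- keep the latest volume number on an assumed magazine table
ALTER TABLE magazine ADD COLUMN current_volume integer;

CREATE OR REPLACE FUNCTION trg_track_current_volume() RETURNS trigger AS $$
BEGIN
    UPDATE magazine
       SET current_volume = NEW.volume
     WHERE name = NEW.magazine
       AND (current_volume IS NULL OR current_volume < NEW.volume);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER issue_track_current_volume
AFTER INSERT ON issue
FOR EACH ROW EXECUTE PROCEDURE trg_track_current_volume();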
You could try this query and see if the index gets used:
select *
from issue
where magazine = 'gq'
and volume in (f_current_volume('gq'), f_previous_volume('gq'));
(Though again, if your function is in fact volatile, you'll get a seq scan as well.)

MySQL performance issue with code and table design

I need some options.
I have a table laid out as follows with about 78,000,000 rows...
id INT (Primary Key)
loc VARCHAR (Indexed)
date VARCHAR (Indexed)
time VARCHAR
ip VARCHAR
lookup VARCHAR
Here is an example of a query I have.
SELECT lookup, date, time, count(lookup) as count FROM dnstable
WHERE STR_TO_DATE(`date`, '%d-%b-%Y') >= '$date1' AND STR_TO_DATE(`date`, '%d-%b-%Y') <= '$date2' AND
time >= '$hour1%' AND time <= '$hour2%' AND
`loc` LIKE '%$prov%' AND
lookup REGEXP 'ca|com|org|net' AND
lookup NOT LIKE '%.arpa' AND
lookup NOT LIKE '%domain.ca' AND
ip NOT LIKE '192.168.2.1' AND
ip NOT LIKE '192.168.2.2' AND
ip NOT LIKE '192.168.2.3'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100
I have my MySQL server configured like a few high-usage examples I found. The hardware is good: 4 cores, 8 GB of RAM.
This query takes about 180 seconds... Does anyone have some tips on making this more efficient?
There are a lot of things wrong here. A LOT of things. I would look to the other answers for query options (you use a lot of LIKEs, NOT LIKEs, and functions... and you're doing them on unkeyed columns...). If I were in your case, I'd redesign my entire database. It looks as though you're using this to store DNS entries - host names to IP addresses.
You may not have the option to redesign your database - maybe it's a customer database or something that you don't have control over. Maybe they have a lot of applications which depend on the current database design. However, if you can refactor your database, I would strongly suggest it.
Here's a basic rundown of what I'd do:
Store the TLDs (top-level-domains) in a separate column as an ENUM. Make it an index, so it's easily searchable, instead of trying to regex .com, .arpa, etc. TLDs are limited anyway, and they don't change often, so this is a great candidate for an ENUM.
Store the domain without the TLD in a regular column and a reversed column. You could index both columns, but depending on your searches, you may only need to index the reverse column. Basically, having a reverse column allows you to search for all hosts in one domain (ex. google) without having to do a fulltext search each time. MySQL can do a key search on the string "elgoog" in the reverse column. Because DNS is a hierarchy, this fits perfectly.
Change the date and time columns from VARCHAR to DATE and TIME, respectively. This one's an obvious change. No more STR_TO_DATE and friends. Absolutely no point in doing that conversion on every query.
Store the IP addresses differently. There's no reason to use a VARCHAR here - it's inefficient and doesn't make sense. Instead, use four separate columns for each octet (this is safe because all IPv4 addresses have four octets, no more, no less) as unsigned TINYINT values. This will give you 0-255, the range you need. (Each IP octet is actually 8 bits, anyway.) This should make searches much faster, especially if you key the columns.
ex: select * from table where octet1 != 10; (this would filter out all 10.0.0.0/8 private IP space)
The basic problem here is that your database design is flawed - and your query is using columns that aren't indexed, and your queries are inefficient.
If you're stuck with the current design....I'm not sure if I can really help you. I'm sorry.
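Pulling those points together, the redesigned table might look roughly like this (a sketch; every name, size, and the ENUM value list are assumptions):
CREATE TABLE dns_log (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    loc VARCHAR(10),
    logged_date DATE,
    logged_time TIME,
    octet1 TINYINT UNSIGNED,              -- first octet of the client IP
    octet2 TINYINT UNSIGNED,
    octet3 TINYINT UNSIGNED,
    octet4 TINYINT UNSIGNED,
    tld ENUM('ca','com','org','net','arpa') NOT NULL,
    domain VARCHAR(255),                  -- host name without the TLD
    domain_rev VARCHAR(255),              -- REVERSE(domain), for right-anchored searches
    INDEX (logged_date, logged_time),
    INDEX (tld),
    INDEX (domain_rev),
    INDEX (octet1)
);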
I bet the really big problem here is the STR_TO_DATE function calls.
If possible, change the date column to a real date datatype (DATE, DATETIME, or TIMESTAMP).
Having this new or altered column indexed would speed up selections on that column significantly. You have to avoid the date parsing that the wrong datatype of the 'date' column currently forces. This parsing/converting prevents MySQL from using an index on the 'date' column.
Conclusion: make the 'date' column a date datatype, have this column indexed, and do not use STR_TO_DATE in your statement.
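A sketch of that conversion, keeping the data while switching the type (the temporary column and index names are assumptions):
ALTER TABLE dnstable ADD COLUMN date_parsed DATE;
UPDATE dnstable SET date_parsed = STR_TO_DATE(`date`, '%d-%b-%Y');
ALTER TABLE dnstable DROP COLUMN `date`;
ALTER TABLE dnstable CHANGE COLUMN date_parsed `date` DATE;
CREATE INDEX ix_dnstable_date ON dnstable (`date`);
After that, WHERE `date` >= '$date1' AND `date` <= '$date2' can use the index, with no STR_TO_DATE in sight.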
I suspect that these local IP addresses are not very selective when used with negation, right? (This depends on the typical data in the table.)
Since the ip column is not indexed, selections on that column always result in a full table scan. If an inequality (<>) selection on ip were very selective, you could consider putting an index on it and changing the statement to use <> instead of LIKE. But I do not think that an inequality selection on ip is very selective here.
Conclusion: I do not think you can win anything significant here.
The problem is that a LIKE means a full table traversal! Which is why you are seeing this.
The first thing I would suggest is to get rid of LIKE '192.168.2.1', since really that is the same as = '192.168.2.1'.
Also, the fact that you set LIMIT 100 at the end means that the query will run against all the records and only then select the first 100. How about instead you do a SELECT which involves all the other operations but not the LIKEs, limit that one, and then have a second SELECT which applies the LIKEs?
Some Tips
Use != instead of NOT LIKE
Avoid REGEXP in mysql query
Avoid STR_TO_DATE(`date`, '%d-%b-%Y') >= '$date1'; try passing a MySQL-formatted date to the query rather than converting with STR_TO_DATE
lookup should be indexed if you have to use group by on it.
Try caching the query results(if possible).
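Applied to the query in the question, and assuming the `date` column has already been converted to a native DATE as discussed above, the tips might combine into something like this (the LIKE list only approximates the original REGEXP, which matched those strings anywhere in the value):
SELECT lookup, `date`, time, COUNT(lookup) AS count
FROM dnstable
WHERE `date` >= '$date1' AND `date` <= '$date2'
  AND time >= '$hour1' AND time <= '$hour2'
  AND loc LIKE '%$prov%'
  AND (lookup LIKE '%.ca' OR lookup LIKE '%.com'
       OR lookup LIKE '%.org' OR lookup LIKE '%.net')
  AND lookup NOT LIKE '%.arpa'
  AND lookup NOT LIKE '%domain.ca'
  AND ip != '192.168.2.1'
  AND ip != '192.168.2.2'
  AND ip != '192.168.2.3'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100;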

Best way to handle MySQL date for performance with thousands of users

I am currently part of a team designing a site that will potentially have thousands of users who will be doing a number of date related searches. During the design phase we have been trying to determine which makes more sense for performance optimization.
Should we store the datetime field as a MySQL DATETIME, or should we break it up into a number of fields (year, month, day, hour, minute, ...)?
The question is: with a large data set and a potentially large set of users, would we gain performance-wise by breaking the datetime into multiple fields and avoiding reliance on MySQL date functions? Or is MySQL already optimized for this?
Have a look at the MySQL Date & Time Functions documentation, because you can pull specific information from a date using existing functions like YEAR, MONTH, etc. But while these exist, if you have an index on the date column(s), using these functions means those indexes cannot be used...
The problem with storing a date as separate components is the work needed to reconstruct them into a date when you want to do range comparisons or date operations.
Ultimately, choose what works best with your application. If there's seldom need for the date to be split out, consider using a VIEW to expose the date components without writing possibly redundant information into your tables.
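A minimal sketch of that VIEW approach (table and column names assumed):
CREATE TABLE events (
    id INT AUTO_INCREMENT PRIMARY KEY,
    happened_at DATETIME NOT NULL,
    INDEX (happened_at)
);

CREATE VIEW events_split AS
SELECT id,
       happened_at,
       YEAR(happened_at)  AS y,
       MONTH(happened_at) AS m,
       DAY(happened_at)   AS d,
       HOUR(happened_at)  AS h
FROM events;
Range queries keep using the indexed happened_at column, while the view exposes the split components for the occasional report that needs them.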
Use a regular datetime field. You can always switch over to the separated components down the line if performance becomes an issue. Try to avoid premature optimization - in many cases, YAGNI. You may wind up employing both the datetime field and the separated component methodology, since they both have their strengths.
If you know ahead of time some key criteria that all searches will have, MySQL (>= v5.1) table partitioning might help.
For example, if you have a table like this:
create table Books(pubDate dateTime, title varchar(50));
And you know all searches must at least include a year, you could partition it on the date field, along these lines:
create table Books(pubDate dateTime, title varchar(50))
partition by hash(year(pubDate)) partitions 10;
Then, when you run a select against the table, if your where clause includes criteria that limit the partition the results can exist on, the search will only scan that partition, rather than a full table scan. You can see this in action with:
-- scans entire table
explain partitions select * from Books where title like '%title%';
versus something like:
-- scans just one partition
explain partitions select * from Books
where year(pubDate)=2010
and title like '%title%';
The MySQL documentation on this is quite good, and you can choose from multiple partitioning algorithms.
Even if you opt to break up the date, a table partition on, say, year (int) (assuming searches will always specify a year) could help.
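For instance, a sketch with a separate year column as the partition key (the column name and partition count are assumptions):
create table Books(
    pubYear smallint not null,
    pubDate dateTime,
    title varchar(50)
)
partition by hash(pubYear) partitions 10;

-- a search that always supplies the year then scans a single partition:
explain partitions select * from Books
where pubYear=2010
and title like '%title%';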