MySQL slow duration or fetch time depending on "distinct" command - mysql

I have a pretty small, simple MySQL table for holding precalculated financial data. The table looks like:
refDate | instrument | rate | startDate | maturityDate | carry1 | carry2 | carry3
with 3 indices defined as:
unique unique_ID(refDate,instrument)
refDate (refDate)
instrument (instrument)
There are about 10 million rows right now, though for each refDate there are only about 5,000 distinct instruments.
I have a query that self joins on this table to generate an output like:
refDate | rate instrument=X | rate instrument=Y | rate instrument=Z | ...
basically returning time-series data that I can then run my own analytics on.
Here is the problem: my original query looked like:
Select distinct AUDSpot1yFq.refDate,AUDSpot1yFq.rate as 'AUDSpot1yFq',
AUD1y1yFq.rate as AUD1y1yFq
from audratedb AUDSpot1yFq inner join audratedb AUD1y1yFq on
AUDSpot1yFq.refDate=AUD1y1yFq.refDate
where AUDSpot1yFq.instrument = 'AUDSpot1yFq' and
AUD1y1yFq.instrument = 'AUD1y1yFq'
order by AUDSpot1yFq.refDate
Note: in the particular query timed below, I was actually pulling 10 different instruments, so the query was much longer but followed this same pattern of naming, inner joins, and WHERE conditions.
This was slow: in Workbench I timed it at a 7-8 second duration (but near-zero fetch time, as Workbench runs on the same machine as the server). When I stripped the DISTINCT, the duration dropped to 0.25-0.5 seconds (far more manageable), and when I also stripped the ORDER BY it got even faster (<0.1 seconds, at which point I don't care). But my fetch time exploded to ~7 seconds. So in total I gain nothing; it has all become a fetch-time issue. When I run this query from the Python scripts that will be doing the actual lifting and work, I get roughly the same timing whether I include DISTINCT or not.
When I run an EXPLAIN on my cut-down query (the one with the horrid fetch time) I get:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE AUDSpot1yFq ref unique_ID,refDate,instrument instrument 39 const 1432 100.00 Using where
1 SIMPLE AUD1y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD2y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD3y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD4y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD5y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD6y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD7y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD8y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
1 SIMPLE AUD9y1yFq ref unique_ID,refDate,instrument unique_ID 42 historicalratesdb.AUDSpot1yFq.refDate,const 1 100.00 Using where
I now realize DISTINCT is not required, and ORDER BY is something I can throw out and handle in pandas once the output is in a DataFrame. That is great. But I don't know how to get the fetch time down. I'm not going to win any competency competitions on this website, but I have searched as much as I can and can't find a solution for this issue. Any help is greatly appreciated.
~cocoa

(I had to simplify the table aliases in order to read it:)
Select distinct
s.refDate,
s.rate as AUDSpot1yFq,
y.rate as AUD1y1yFq
from audratedb AS s
join audratedb AS y on s.refDate = y.refDate
where s.instrument = 'AUDSpot1yFq'
and y.instrument = 'AUD1y1yFq'
order by s.refDate
Index needed:
INDEX(instrument, refDate) -- To filter and sort, or
INDEX(instrument, refDate, rate) -- to also "cover" the query.
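A minimal sketch of adding the covering version (the index name is just illustrative):
ALTER TABLE audratedb
  ADD INDEX idx_instr_date_rate (instrument, refDate, rate);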
That assumes the query is not more complex than you said. I see that the EXPLAIN already has many more tables. Please provide SHOW CREATE TABLE audratedb and the entire SELECT.
Back to your questions...
DISTINCT is done one of two ways: (1) sort the table, then dedup, or (2) dedup in a hash in memory. Keep in mind that you are dedupping all 3 columns (refDate, s.rate, y.rate).
ORDER BY is a sort after gathering all the data. However, with the suggested index (not the indexes you had), the sort is not needed, since the index will get the rows in the desired order.
But... Having both DISTINCT and ORDER BY may confuse the Optimizer to the point where it does something 'dumb'.
You say that (refDate,instrument) is UNIQUE, but you do not mention a PRIMARY KEY, nor have you mentioned which Engine you are using. If you are using InnoDB, then PRIMARY KEY(instrument, refDate), in that order, would further speed things up, and avoid the need for any new index.
Furthermore, it is redundant to have an index on (a,b) and also one on (a) alone. That is, your current schema does not need INDEX(refDate); with the suggested PK change, it is INDEX(instrument) that becomes unnecessary instead.
Bottom line: Only
PRIMARY KEY(instrument, refDate),
INDEX(refDate)
and no other indexes (unless you can show some query that needs it).
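A sketch of that change, assuming the table is InnoDB and currently has no explicit PRIMARY KEY (if it already has one, that would need to be dropped or kept deliberately; test on a copy first):
ALTER TABLE audratedb
  DROP INDEX unique_ID,     -- PK(instrument, refDate) keeps the same uniqueness guarantee
  DROP INDEX instrument,    -- redundant once the PK starts with instrument
  ADD PRIMARY KEY (instrument, refDate);
The existing INDEX(refDate) stays, matching the "bottom line" above.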
More on the EXPLAIN. Notice how the Rows column says 1432, 1, 1, ... That means that it scanned an estimated 1432 rows of the first table. This is probably far more than necessary because of lack of a proper index. Then it needed to look at only 1 row in each of the other tables. (Can't get better than that.)
How many rows in the SELECT without the DISTINCT or ORDER BY? That tells you how much work was needed after doing the fetching and JOINing. I suspect it is only a few. A "few" is really cheap for DISTINCT and ORDER BY; hence I think you were barking up the wrong tree. Even 1432 rows would be very fast to process.
As for the buffer_pool... How big is the table? Do SHOW TABLE STATUS. I suspect the table is more than 1GB, hence it cannot fit in the buffer_pool. Hence raising that cache size would let the query run in RAM, not hitting the disk (at least after it gets cached). Keep in mind that running a query on a cold cache will have lots of I/O. As the cache warms up, queries will run faster. But if the cache is too small, you will continue to need I/O. I/O is the slowest part of the processing.
I hope you have at least 6GB of RAM; otherwise, 2G could be dangerously large. Swapping is really bad for performance.
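A quick way to check both the table size and the current cache setting (the 2G figure below is just the example value from this discussion; size it to your RAM):
SHOW TABLE STATUS LIKE 'audratedb';   -- Data_length + Index_length approximates the footprint
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
-- then, in my.cnf, for example:
-- innodb_buffer_pool_size = 2G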

The question doesn't mention existing indexes, or show the output from an EXPLAIN for any of the queries.
The quick answer to improve performance is to add an index:
... ON audratedb (instrument,refdate,rate)
To answer why we'd want to add that index, we'd need to understand how MySQL processes SQL statements, what operations are possible, and which are required. To see how MySQL is actually processing your statement, you need to use EXPLAIN to see the query plan.
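As a sketch, that index plus a re-check of the plan might look like this (the index name is arbitrary; the SELECT is the simplified two-instrument version from above):
CREATE INDEX idx_instrument_refdate_rate ON audratedb (instrument, refDate, rate);
EXPLAIN
SELECT s.refDate, s.rate AS AUDSpot1yFq, y.rate AS AUD1y1yFq
FROM audratedb AS s
JOIN audratedb AS y ON s.refDate = y.refDate
WHERE s.instrument = 'AUDSpot1yFq'
  AND y.instrument = 'AUD1y1yFq';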

Related

Optimize MySQL query with many joins and filters

I have a MySQL query that counts the number of emails passing certain filters (a time filter and a free-text search).
The query currently takes at least 30 seconds on my server (for a time interval of only 12 days), so I want to make it more efficient.
I don't have a lot of experience with MySQL, so please be gentle with me.
The current query is:
SELECT count(distinct emls.EML_ID) as count
FROM origins
JOIN emls ON emls.EML_ID = origins.source_id
JOIN email2addresses ON emls.EML_ID = email2addresses.EML_ID
JOIN email_addresses ON email_addresses.Email_ID = email2addresses.Email_ID
JOIN files ON files.Origin_ID = origins.Origin_ID
JOIN unique_files ON unique_files.Unique_File_ID = files.Unique_File_ID
WHERE origins.insert_date BETWEEN FROM_UNIXTIME(1533323333) and FROM_UNIXTIME(1534323333)
and (origins.source_id LIKE "%%" or emls.Subject LIKE "%%"
or email_addresses.Email_Address LIKE "%%" or files.File_Name LIKE "%%"
or files.File_ID LIKE "%%" or unique_files.File_Hash LIKE "%%");
When I run EXPLAIN on the query I get:
1 SIMPLE origins index PRIMARY,Source_ID_index Source_ID_index 5 10699008 11.11 Using where; Using index
1 SIMPLE emls eq_ref PRIMARY PRIMARY 4 origins.Source_ID 1 100.00
1 SIMPLE files ref Unique_File_ID_index,Origin_ID_index Origin_ID_index 5 origins.Origin_ID 1 100.00 Using where
1 SIMPLE unique_files ref PRIMARY PRIMARY 4 files.Unique_File_ID 1 100.00
1 SIMPLE email2addresses ref EML_ID_index,Email_ID_index EML_ID_index 5 origins.Source_ID 4 100.00 Using where
1 SIMPLE email_addresses eq_ref PRIMARY PRIMARY 4 email2addresses.Email_ID 1 100.00 Using where
What I'm doing in the query is basically building a huge table (many joins) and then applying the filters to that huge table; I believe that's really bad practice.
To be more specific, the questions are:
How can I rewrite this query so that
the time filter is applied to the origins table first, and only afterwards do the joins operate on that reduced set of origins rows (only the records matching the time filter)? (See the sketch after this list.)
On the first row of the EXPLAIN output, the rows column shows 10699008; that is the estimated number of records MySQL will need to go through, right? If I understand correctly that I should try to lower it in order to gain speed, is there a best practice for how to do this?
Are there any other improvements I should apply to this query to make it faster?
Thanks.
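A sketch of the kind of rewrite asked about in the first question, pushing the time filter into a derived table before the joins (note that recent MySQL versions may merge the derived table back into the outer query, so an index on origins(insert_date) is likely the more important fix):
SELECT count(distinct emls.EML_ID) as count
FROM (
    SELECT Origin_ID, source_id
    FROM origins
    WHERE insert_date BETWEEN FROM_UNIXTIME(1533323333) AND FROM_UNIXTIME(1534323333)
) o
JOIN emls ON emls.EML_ID = o.source_id
JOIN email2addresses ON emls.EML_ID = email2addresses.EML_ID
JOIN email_addresses ON email_addresses.Email_ID = email2addresses.Email_ID
JOIN files ON files.Origin_ID = o.Origin_ID
JOIN unique_files ON unique_files.Unique_File_ID = files.Unique_File_ID
WHERE (o.source_id LIKE "%%" or emls.Subject LIKE "%%"
    or email_addresses.Email_Address LIKE "%%" or files.File_Name LIKE "%%"
    or files.File_ID LIKE "%%" or unique_files.File_Hash LIKE "%%");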

Calculating frequency of password hashes efficiently in MySQL

For my bachelor thesis I have to analyze a password leak, and I have a table with 2 columns: MEMBER_EMAIL and MEMBER_HASH.
I want to calculate the frequency of each hash efficiently, so that the output looks like:
Hash | Amount
----------------
2e3f.. | 345
2f2e.. | 288
b2be.. | 189
My query until now was straightforward:
SELECT MEMBER_HASH AS hashed, count(*) AS amount
FROM thesis.fulllist
GROUP BY hashed
ORDER BY amount DESC
While it works fine for smaller tables, I have problems computing the query on the whole list (112 million entries), where it takes over 2 days and ends in a weird connection timeout error, even though my settings for that are fine.
So I wonder if there is a better way to do this calculation (as I can't really think of one); I would appreciate any help!
The query itself can't be optimized much, as it's quite simple. The only way I can think of to improve how it executes is to index MEMBER_HASH.
This is how you can do it:
ALTER TABLE `table` ADD INDEX `hashed` (`MEMBER_HASH`);
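If this is a one-off analysis for the thesis, it may also be worth materializing the result once instead of re-running the aggregation (a sketch; the hash_counts table name is made up):
CREATE TABLE thesis.hash_counts AS
SELECT MEMBER_HASH AS hashed, count(*) AS amount
FROM thesis.fulllist
GROUP BY hashed;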

How to improve MySQL query with JOINED tables and ORDER BY and OFFSET

games_releases is a table that combines game information. Information like a game's title, publisher, or developer is the same for many different games, so it is stored in separate tables that are later joined together.
The example below only joins the games_titles table for easier understanding (but in reality there are a few more tables joined following the same principle).
The games_releases table:
id int(11) <- unique
title_id int(11) <- index
developer_id int(11)
... more game relevant data
Some typical rows of games_releases would look like:
id title_id developer_id ... ...
--------------------------------------------
1 17 265
2 23 41
3 31 3
4 42 15
5 17 123
The games_titles table:
id int(11) <- unique
title varchar(128)
created int(11)
Some typical rows of games_titles would look like:
id title created
----------------------------------------
17 Pac-Man [some unix timestamp]
23 Defender [some unix timestamp]
31 Scramble [some unix timestamp]
42 Q*bert [some unix timestamp]
99 Phoenix [some unix timestamp]
NOW: Let's assume a user wants to see all games in alphabetical order (24 at a time), then this query would be executed...
SELECT `games_releases`.`id` AS release_id, t.`title` AS title
FROM games_releases
LEFT JOIN games_titles t ON t.`id` = `games_releases`.`title_id`
ORDER BY title
LIMIT 24
This would be returned
release_id title
-----------------------------
2 Defender
1 Pac-Man
5 Pac-Man
4 Q*Bert
3 Scramble
So basically the resulting table features the title strings rather than the IDs.
The challenge: this query takes 0.2 seconds to run, which is way too slow (games_releases has around 80,000 rows, but imagine the database growing to 1,000,000 rows).
Here is what explain tells me (games_releases has an index title_id):
id select_type table partitions type possible_keys key key_len ref rows Extra
1 SIMPLE games_releases NULL index NULL title_id 4 NULL 76669 Using index; Using temporary; Using filesort
1 SIMPLE t NULL eq_ref PRIMARY PRIMARY 4 phoenix.games_releases.title_id 1
Any chance to optimize this?
EDIT: The question has been answered; using a LEFT JOIN where a plain JOIN was needed was the problem.
But: what would I do about execution times that grow with increasing OFFSET?
Although I have read loads about it, I struggle to understand how to set up indexes efficiently when doing multiple JOINs.
Having a "title" index on games_titles does not seem to have any effect.
For future reference: questions about query performance generally must present the output of SHOW CREATE TABLE tablename of each table involved in the query. Table structure makes a difference to performance.
It looks, from your query, like you want to show the first 24 titles, alphabetically, from the games_titles table where there's any match at all in the games_releases table. I don't understand the logic of your LEFT JOIN though. Do you want the titles repeated if there's more than one row for the title in games_releases? What do you want done with rows in games_releases that are unmatched by rows in games_titles?
I think you can get the result you want as follows:
SELECT DISTINCT t.id, t.title
FROM games_titles t
JOIN games_releases r ON t.id = r.title_id
ORDER BY t.title
LIMIT 24
This gives distinct rows from your titles table matching anything in the releases table. This will probably be optimal in its performance. I wonder, though, what is significant about the first 24 titles, alphabetically, in your application, and why that is important to put into a view.
SELECT lots, of, stuff .... ORDER BY something LIMIT number is a notorious performance antipattern. Why? MySQL has to sort lots of data, only to discard a small amount of it. The limitations of view definition make it hard for you to do something more efficient in a view.
You didn't tell us whether games_titles.id is indexed. It needs to be indexed. If it's the primary key it is indexed.
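For the growing-OFFSET concern, one common alternative is keyset pagination: remember the last title and release id already shown and seek past them instead of using OFFSET. A sketch (the 'Pac-Man' / 5 values stand for wherever the previous page ended):
SELECT r.id AS release_id, t.title
FROM games_releases r
JOIN games_titles t ON t.id = r.title_id
WHERE t.title > 'Pac-Man'
   OR (t.title = 'Pac-Man' AND r.id > 5)
ORDER BY t.title, r.id
LIMIT 24
This avoids scanning and discarding all the skipped rows, though it changes the ORDER BY to include a tiebreaker, and how well it works still depends on the join order and indexes the optimizer picks.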

SQL where clause performance

I have to create a table that assigns a user id and a product id to some data (it models two one-to-many relationships). I will run a lot of queries like:
select * from table where userid = x;
The first thing I am interested in is how big the table can get before the query becomes noticeably slow (let's say it takes more than 1 second).
Also, how can this be optimised?
I know that this might depend on the implementation. I will use MySQL for this specific project, but I am interested in more general answers as well.
It all depends on the horsepower of your machine. To make that query more efficient, create an index on userid.
how big should the table get before the query starts to be observable (let's say it takes more than 1 second)
There are too many factors to deterministically measure run time. CPU speed, memory, I/O speed, etc. are just some of the external factors.
how this can be optimized?
That's more straightforward. If there is an index on userid then the query will likely do an index seek, which is about as fast as you can get as far as finding the record. If userid is the clustered index it will be faster still, because it won't have to use the position from the index to find the record in the data pages - the data is physically organized as part of the index.
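A minimal sketch of that index (the table name is a placeholder, since the question doesn't give one):
ALTER TABLE user_products ADD INDEX idx_userid (userid);
-- or, since the table also stores a product id, a composite index still serves lookups by userid alone:
ALTER TABLE user_products ADD INDEX idx_user_product (userid, productid);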
let's say it takes more than 1 second
With an index on userid, MySQL will find the correct row in O(log n) in the worst case. How many seconds that takes depends on the performance of your machine.
It is impossible to give you an exact number without knowing how long one operation takes.
As an example: assume you have a table with 4 records. Finding one requires 2 operations in the worst case. Every time you double your data, one more operation is required.
for example:
# records | # operations to find entry (worst case)
        2 |  1
        4 |  2
        8 |  3
       16 |  4
      ... | ...
    4,096 | 12
      ... | ...
     ~1 B | 30
     ~2 B | 31
So, even with a huge number of records, the time remains almost constant: for 1 billion records you would need to perform only ~30 operations.
And it continues like that: 2 billion records, 31 operations.
So, let's say your query executes in 0.001 seconds for 4,096 entries (12 operations);
it would then take around (0.001 / 12 * 30 =) 0.0025 seconds for 1 billion records.
Important side note: this considers only the runtime complexity of the binary search, but it shows how the run time would scale.
In a nutshell: your database would be unimpressed by a single query on an indexed value. However, if you run a heavy number of those queries at the same time, the total time increases, of course.

mysql optimize data content: multi column or simple column hash data

I actually have a table with 30 columns. In one day this table can get around 3000 new records!
The column data looks like:
IMG Name Phone etc..
http://www.site.com/images/image.jpg John Smith 123456789 etc..
http://www.site.com/images/image.jpg Smith John 987654321 etc..
I'm looking for a way to optimize the size of the table but also the response time of the SQL queries. I was thinking of doing something like:
Column1
http://www.site.com/images/image.jpg|John Smith|123456789|etc..
And then via PHP I would split each value back into an array.
Would it be faster ?
Edit
So to take an example of the structure, let's say I have two tables:
package
package_content
Here is the structure of the table package :
id | user_id | package_name | date
Here is the structure of the table package_content :
id | package_id | content_name | content_description | content_price | content_color | etc.. > 30columns
The thing is, for each package I can get up to 16 rows of content. For example:
id | user_id | package_name | date
260 11 Package 260 2013-7-30 10:05:00
id | package_id | content_name | content_description | content_price | content_color | etc.. > 30columns
1 260 Content 1 Content 1 desc 58 white etc..
2 260 Content 2 Content 2 desc 75 black etc..
3 260 Content 3 Content 3 desc 32 blue etc..
etc...
Then with PHP I do something like this:
select * from package
while not EOF {
    show package name, date, etc.
    select * from package_content where package_content.package_id = package.id
    while not EOF {
        show package_content name, desc, price, color, etc.
    }
}
Would it be faster? Definitely not. If you needed to search by Name or Phone etc., you'd have to pull those values out of Column1 every time. You'd never be able to optimize those queries, ever.
If you want to make the table smaller it's best to look at splitting some columns off into another table. If you'd like to pursue that option, post the entire structure. But note that the number of columns doesn't affect speed that much. I mean it can, but it's way down on the list of things that will slow you down.
Finally, 3,000 rows per day is about 1 million rows per year. If the database is tolerably well designed, MySQL can handle this easily.
Addendum: partial table structures plus sample query and pseudocode added to question.
The pseudocode shows the package table being queried all at once, then matching package_content rows being queried one at a time. This is a very slow way to go about things; better to use a JOIN:
SELECT
    package.id,
    user_id,
    package_name,
    date,
    package_content.*
FROM package
INNER JOIN package_content ON package.id = package_content.package_id
WHERE whatever
ORDER BY whatever
That will speed things up right away.
If you're displaying on a web page, be sure to limit results with a WHERE clause - nobody will want to see 1,000 or 3,000 or 1,000,000 packages on a single web page :)
Finally, as I mentioned before, the number of columns isn't a huge worry for query optimization, but...
Having a really wide result row means more data has to go across the wire from MySQL to PHP, and
It isn't likely you'll be able to display 30+ columns of information on a web page without it looking terrible, especially if you're reading lots of rows.
With that in mind, you'll be better off picking specific package_content columns in your query instead of picking them all with a SELECT *.
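For example (a sketch; the column list is just illustrative):
SELECT
    package.id,
    package_name,
    package_content.content_name,
    package_content.content_price
FROM package
INNER JOIN package_content ON package.id = package_content.package_id
WHERE whatever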
Don't combine any columns; it's no use and might even be slower in the end.
You should use indexes on the columns you query on. I have a website with a table of about 30 columns that currently holds around 600,000 rows. If you run EXPLAIN before a query, you can see whether it uses any indexes. If you have a JOIN on two columns and a WHERE on the same table, you should make a combined index on those three columns, in the order JOIN -> WHERE. If you join the same table again, treat that as a separate index.
For example:
SELECT p.name, p.id, c.name, c2.name
FROM product p
JOIN category c ON p.cat_id = c.id
JOIN category c2 ON c.parent_id = c2.id AND c2.name = 'Niels'
WHERE p.filterX = 'blaat'
You should have a combined index on category:
parent_id, name
AND
id (probably the auto-increment primary key)
An index on product:
cat_id
filterX
With this easy solution you can optimize queries from NOT DOABLE to 0.10 seconds, or even faster.
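A sketch of how those indexes could be created (the index names are arbitrary, and this reads cat_id and filterX as one combined index; create them separately if that is what is meant):
ALTER TABLE category ADD INDEX idx_parent_name (parent_id, name);
ALTER TABLE product ADD INDEX idx_cat_filter (cat_id, filterX);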
If you use MySQL 5.6 you should switch to InnoDB, because it is better at optimizing JOINs and subqueries. MySQL will also try to keep the working data in memory, which makes things a lot faster as well. Please keep in mind that backing up InnoDB tables might need some extra attention.
You might also think about using MEMORY tables for super-fast querying (you do still need indexes).
You can also optimize by using appropriately sized integer types (an INT is 4 bytes; the 11 in INT(11) is only a display width) and by not always using VARCHAR(255).