I have a very large table with 17,044,833 rows, 6.4 GB in size. I am running the simple query below and it takes about 5 seconds. Any ideas what optimizations I can do to improve the speed of this query?
SELECT
`stat_date`,
SUM(`adserver_impr`),
SUM(`adserver_clicks`)
FROM `dfp_stats` WHERE
`stat_date` >= '2014-02-01'
AND
`stat_date` <= '2014-02-28'
MySQL Config:
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
innodb_buffer_pool_size = 10G
Server:
Memory: 48GB
Disk: 480GB
UPDATE
ORIGINAL QUERY:
EXPLAIN
SELECT
DS.`stat_date` 'DATE',
DC.`name` COUNTRY,
DA.`name` ADVERTISER,
DOX.`id` ORDID,
DOX.`name` ORDNAME,
DLI.`id` LIID,
DLI.`name` LINAME,
DLI.`is_ron` ISRON,
DOX.`is_direct` ISDIRECT,
DSZ.`size` LISIZE,
PUBSITE.`id` SITEID,
SUM(DS.`adserver_impr`) 'DFPIMPR',
SUM(DS.`adserver_clicks`) 'DFPCLCKS',
SUM(DS.`adserver_rev`) 'DFPREV'
FROM `dfp_stats` DS
LEFT JOIN `dfp_adunit1` AD1 ON AD1.`id` = DS.`dfp_adunit1_id`
LEFT JOIN `dfp_adunit2` AD2 ON AD2.`id` = DS.`dfp_adunit2_id`
LEFT JOIN `dfp_adunit3` AD3 ON AD3.`id` = DS.`dfp_adunit3_id`
LEFT JOIN `dfp_orders` DOX ON DOX.`id` = DS.`dfp_order_id`
LEFT JOIN `dfp_advertisers` DA ON DA.`id` = DOX.`dfp_advertiser_id`
LEFT JOIN `dfp_lineitems` DLI ON DLI.`id` = DS.`dfp_lineitem_id`
LEFT JOIN `dfp_countries` DC ON DC.`id` = DS.`dfp_country_id`
LEFT JOIN `dfp_creativesize` DSZ ON DSZ.`id` = DS.`dfp_creativesize_id`
LEFT JOIN `pubsites` PUBSITE
ON AD1.`pubsite_id` = PUBSITE.`id`
OR AD2.`pubsite_id` = PUBSITE.`id`
WHERE
DS.`stat_date` >= '2014-02-01'
AND DS.`stat_date` <= '2014-02-28'
AND PUBSITE.`id` = 6
GROUP BY DLI.`id`,DS.`stat_date`;
RESULTS OF EXPLAIN: (This is after adding the COVERING INDEX)
http://i.stack.imgur.com/vhVeB.png
If you haven't, you might want to index the stat_date field for faster lookups. Here's the syntax:
ALTER TABLE TABLE_NAME ADD INDEX (COLUMN_NAME);
Read more about indexing and optimizations here: https://dev.mysql.com/doc/refman/5.5/en/optimization-indexes.html
For best performance of this query, create a covering index:
... ON `dfp_stats` (`stat_date`,`adserver_impr`,`adserver_clicks`)
The output from EXPLAIN should show "Using index". This means that the query can be satisfied entirely from the index, without needing to visit any pages in the underlying table. (The term "covering index" refers to an index that includes all of the columns referenced by a query.)
At a minimum, you'll want an index with a leading column of stat_date so that the query can use an index range scan operation. An index range scan can essentially skip over boatloads of rows, and more quickly locate the rows that actually need to be checked.
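A sketch of that covering index as DDL, together with the verification step (the index name here is just a placeholder):
ALTER TABLE `dfp_stats`
  ADD INDEX `ix_stat_date_cover` (`stat_date`, `adserver_impr`, `adserver_clicks`);
-- Re-check the plan; "Using index" in the Extra column confirms the covering index is being used.
EXPLAIN
SELECT `stat_date`, SUM(`adserver_impr`), SUM(`adserver_clicks`)
FROM `dfp_stats`
WHERE `stat_date` >= '2014-02-01'
  AND `stat_date` <= '2014-02-28';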
As far as changes to the configuration of the MySQL instance, that really depends on whether the table is InnoDB or MyISAM.
FOLLOWUP
For InnoDB, memory is still king. If there's memory available on the server, then you can increase innodb_buffer_pool_size.
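As a rough, hedged sketch only (the right value depends on what else runs on the box), a mostly dedicated 48 GB server could give the pool well over half of RAM, e.g. in my.cnf:
innodb_buffer_pool_size = 32G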
Also consider enabling the MySQL query cache. (We have the query cache enabled only for queries that are specifically flagged to use the cache with the SQL_CACHE keyword, i.e. SELECT SQL_CACHE t.foo, ..., so we don't clutter up the cache with queries that don't give us a benefit. For other queries, we avoid running the extra code (that would otherwise be required) to search the cache and maintain the cache contents.)
The place we get a benefit from the query cache is from "expensive" queries (which look at a lot of rows and do a lot of joins) against tables that are relatively static, and that return small result sets. (I'd consider a query that gets a single row of SUMs from a whole boatload of rows to be a good candidate for the query cache, if the table is infrequently updated, or if the same query is going to be run several times before a DML operation on the table invalidates the cache.)
It's a bit odd that your query is returning a non-aggregate that isn't in a GROUP BY clause.
If your query is using an index on stat_date, it's likely the query is returning the lowest value of stat_date within the range specified by the predicate; so it's likely that you would get an equivalent result using SELECT MIN(stat_date) AS stat_date.
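A sketch of that equivalent form, using the columns from the original query:
SELECT MIN(`stat_date`) AS stat_date,
       SUM(`adserver_impr`),
       SUM(`adserver_clicks`)
FROM `dfp_stats`
WHERE `stat_date` >= '2014-02-01'
  AND `stat_date` <= '2014-02-28';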
A more complicated approach would be to setup a "summary" table, and refresh that periodically with the results from a query, and then have the application query the summary table. (A data warehouse type approach.) This doesn't work if you need "up-to-the-minute" accuracy. To get that, you'd likely need to introduce triggers on the target table, to maintain the summary table on INSERT, UPDATE and DELETE operations.
If I went that route, I'd probably opt for storing a summary row for each stat_date, so it could accommodate queries on any range or set of dates...
CREATE TABLE dfp_stats_summary
( stat_date DATE NOT NULL PRIMARY KEY
, adserver_impr BIGINT
, adserver_clicks BIGINT
) ENGINE=InnoDB ;
-- refresh
INSERT INTO dfp_stats_summary (stat_date, adserver_impr, adserver_clicks)
SELECT t.stat_date
, SUM(t.adserver_impr) AS adserver_impr
, SUM(t.adserver_clicks) AS adserver_clicks
FROM dfp_stats t
GROUP BY t.stat_date
ON DUPLICATE KEY
UPDATE adserver_impr = VALUES(adserver_impr)
, adserver_clicks = VALUES(adserver_clicks)
;
The refresh query will take a while to chew through all those rows; you might want to specify a date range in a WHERE clause to do a month or two at a time, and loop through all the possible months.
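For example, a month-at-a-time refresh might look something like this (the date range is just an illustration):
INSERT INTO dfp_stats_summary (stat_date, adserver_impr, adserver_clicks)
SELECT t.stat_date
     , SUM(t.adserver_impr) AS adserver_impr
     , SUM(t.adserver_clicks) AS adserver_clicks
  FROM dfp_stats t
 WHERE t.stat_date >= '2014-02-01'
   AND t.stat_date <  '2014-03-01'
 GROUP BY t.stat_date
ON DUPLICATE KEY
UPDATE adserver_impr = VALUES(adserver_impr)
     , adserver_clicks = VALUES(adserver_clicks)
;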
With the summary table populated, just change the original query to reference the new summary table, rather than the detail table. It would be a lot faster to add up 28 summary rows than several hundred thousand detail rows.
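A sketch of the rewritten query, assuming you want the same single-row monthly total as the original:
SELECT SUM(adserver_impr)
     , SUM(adserver_clicks)
  FROM dfp_stats_summary
 WHERE stat_date >= '2014-02-01'
   AND stat_date <= '2014-02-28';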
Related
I am facing a performance issue when querying a table with approx 700,000 records.
The query takes more than 10 seconds to execute the first time for a specific item_id; if I change the item_id value in the query, it takes nearly the same amount of time. However, subsequent queries for the same item_id are fast unless the server is restarted.
The query I am trying to execute is -
select SQL_NO_CACHE item_id, item_rate_id, invoice_type, sum(qty_computed) as qty
from transaction_item
left join transaction_customer
on transaction_item.invoice_id = transaction_customer.invoice_id
where item_id = 17179
group by item_rate_id, invoice_type
My table (InnoDB) structure is -
Table: transaction_item (No primary Key, INDEX: item_id, Contains approx 700,000 rows)
Table transaction_customer (Primary Key: invoice_id, contains approx 100,000 rows)
Running explain on the above query gives the following output:
my.ini config
[mysqld]
query_cache_size=0
query_cache_type=0
innodb_buffer_pool_size = 1G
Any help on fine tuning MySQL config/db schema will be highly appreciated.
Your indexing isn't too bad for the query you have described. What is harming your performance is that both of these tables have a significant amount of data in each row. The query needs columns from each table that aren't in the secondary index, and therefore large chunks of the table relevant to the specified item need to be in the InnoDB buffer pool. I haven't looked at the exact numbers, however 1G doesn't seem to be enough, and your description of the query becoming quicker the second time seems to support this (especially with SQL_NO_CACHE and the query cache disabled; good that it's disabled).
Recommendation 1: Increase the innodb_buffer_pool_size. Look at SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%' and look at the number of items purged from the buffer between queries.
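A minimal sketch of that check (these are the standard InnoDB status counters):
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Innodb_buffer_pool_reads         : reads that had to go to disk
-- Innodb_buffer_pool_read_requests : logical reads satisfied from the buffer pool
-- If the "reads" counter climbs sharply between runs of the query, the 1G pool is likely too small.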
If you are really stuck with the RAM available, then following the theme of @Drapp's recommendations on indexes will allow the InnoDB buffer pool to hold only indexes rather than the complete table. The innodb_buffer_pool is competed for by other queries, so the following has a limited global impact.
Recommendation 2: (if #1 cannot be done)
ALTER TABLE transaction_item
DROP INDEX item_id,
ADD INDEX item_id (item_id, item_rate_id, qty_computed );
ALTER TABLE transaction_customer
ADD INDEX id_type (invoice_id, invoice_type);
Note: Removed sorting, was necessary for GROUP BY. Thanks Rick
Formatted query for readability, but also added aliasing so someone in the future does not have to guess which columns come from which table.
Anyhow, to help optimize the query, you need a composite index to help the where, join and order by.
I would create an index on your Transaction_Item table on (item_id, item_rate_id, invoice_id )
Also, on your Transaction_Customer table, have an index on (Invoice_id, Invoice_Type )
select SQL_NO_CACHE
ti.item_id,
ti.item_rate_id,
tc.invoice_type,
sum(ti.qty_computed) as qty
from
transaction_item ti
left join transaction_customer tc
on ti.invoice_id = tc.invoice_id
where
ti.item_id = 17179
group by
ti.item_rate_id,
tc.invoice_type
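A sketch of those two indexes as DDL (the index names are my own placeholders):
ALTER TABLE transaction_item
  ADD INDEX idx_item_rate_invoice (item_id, item_rate_id, invoice_id);
ALTER TABLE transaction_customer
  ADD INDEX idx_invoice_type (invoice_id, invoice_type);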
Both SQL statements return the same results. In the first, the join is against a subquery; in the second, the final query joins against a temporary table that I create and populate beforehand.
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN (
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa
) hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
DROP TEMPORARY TABLE IF EXISTS hasRemittance;
CREATE TEMPORARY TABLE hasRemittance
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa;
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
Which will have better performance for a few thousand records?
The two formulations are identical except that your explicit temp table version is 3 sql statements instead of just 1. That is, the overhead of the back and forth to the server makes it slower. But...
Since the implicit temp table is in a LEFT JOIN, that subquery may be evaluated in one of two ways...
Older versions of MySQL were 'dumb' and re-evaluated it. Hence slow.
Newer versions automatically create an index. Hence fast.
Meanwhile, you could speed up the explicit temp table version by adding a suitable index. It would be PRIMARY KEY(collegiate_id). If there is a chance of that INNER JOIN producing dups, then say SELECT DISTINCT.
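A sketch of the explicit temp-table version with that index, keeping remesa as in the original (presumably a procedure variable) and using DISTINCT to guard against duplicates:
DROP TEMPORARY TABLE IF EXISTS hasRemittance;
CREATE TEMPORARY TABLE hasRemittance
( PRIMARY KEY (collegiate_id) )
SELECT DISTINCT collegiate_id
  FROM collegiateRemittances r
  INNER JOIN remittances r1 USING(remittance_id)
 WHERE r1.type_id = 1 AND r1.name = remesa;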
For "a few thousand" rows, you usually don't need to worry about performance.
Oracle has a zillion options for everything. MySQL has very few, with the default being (usually) the best. So ignore the answer that discussed various options that you could use in MySQL.
There are issues with
AND IF(notCollegiate,
c.collegiate_id NOT IN (notCollegiate),
'1=1')
I can't tell which table notCollegiate is in. notCollegiate cannot be a list, so why use IN? Instead simply use !=. Finally, '1=1' is a 3-character string; did you really want that?
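A hedged sketch of what that predicate might become, assuming notCollegiate is a single scalar value:
AND IF(notCollegiate, c.collegiate_id != notCollegiate, TRUE)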
For performance (of either version)
remittances needs INDEX(type_id, name, remittance_id) with remittance_id specifically last.
collegiateRemittances needs INDEX(remittance_id) (unless it is the PK).
collegiates needs INDEX(typePayment, active, exentFee , approvedBoard) in any order.
Bottom line: Worry more about indexes than how you formulate the query.
Ouch. Another wrinkle. What is getFee()? If it is a Stored Function, maybe we need to worry about optimizing it?? And what is dateS?
It depends, actually. You'll have to test the performance of every option. On my website I had 2 tables with articles and comments to them. It turned out it's faster to query comment counts 20 times, once for each article, than to use a single union query. MySQL (like other DBs) caches queries, so small simple queries can run amazingly fast.
I did not see that you had tagged the question as mysql, so I initially answered for Oracle. Here is what I think about MySQL.
MySQL
There are two options when it comes to temporary tables: Memory or Disk. And for Disk you can have MyISAM (non-transactional) or InnoDB (transactional). Of course you can expect better performance from the non-transactional type of storage.
Additionally, you need to figure out how big a result set you are dealing with. For a small result set the memory option would be faster; for a large result set the disk option would be faster.
Again at the end as in my original answer you need to figure out what performance is good enough and go for the most descriptive and easy to read option.
Oracle
It depends on what kind of temporary table you are dealing with.
You can have session based temporary tables - data is held until logout, or transaction based - data is held until commit . On top of this they can support transaction logging or not support it. Depending on configuration you can get better performance from a temporary table.
As with everything in the world, performance is a relative term. Most probably, for a few thousand records there will not be a significant difference between the two queries. In that case I would go not for the most performant one, but for the one that is easier to read and understand.
I have this query which basically goes through a bunch of tables to get me some formatted results, but I can't seem to find the bottleneck. The easiest bottleneck to fix was the ORDER BY RAND(), but performance is still bad.
The query takes from 10 sec to 20 secs without ORDER BY RAND();
SELECT
c.prix AS prix,
ST_X(a.point) AS X,
ST_Y(a.point) AS Y,
s.sizeFormat AS size,
es.name AS estateSize,
c.title AS title,
DATE_FORMAT(c.datePub, '%m-%d-%y') AS datePub,
dbr.name AS dateBuiltRange,
m.myId AS meuble,
c.rawData_id AS rawData_id,
GROUP_CONCAT(img.captionWebPath) AS paths
FROM
immobilier_ad_blank AS c
LEFT JOIN PropertyFeature AS pf ON (c.propertyFeature_id = pf.id)
LEFT JOIN Adresse AS a ON (c.adresse_id = a.id)
LEFT JOIN Size AS s ON (pf.size_id = s.id)
LEFT JOIN EstateSize AS es ON (pf.estateSize_id = es.id)
LEFT JOIN Meuble AS m ON (pf.meuble_id = m.id)
LEFT JOIN DateBuiltRange AS dbr ON (pf.dateBuiltRange_id = dbr.id)
LEFT JOIN ImageAd AS img ON (img.commonAd_id = c.rawData_id)
WHERE
c.prix != 0
AND pf.subCatMyId = 1
AND (
(
c.datePub > STR_TO_DATE('01-04-2016', '%d-%m-%Y')
AND c.datePub < STR_TO_DATE('30-04-2016', '%d-%m-%Y')
)
OR date_format(c.datePub, '%d-%m-%Y') = '30-04-2016'
)
AND a.validPoint = 1
GROUP BY
c.id
#ORDER BY
# RAND()
LIMIT
5000
Here is the explain query:
Visual Portion:
And here is a screenshot of mysqltuner
EDIT 1
I have many indexes. Here they are:
EDIT 2:
So you guys did it. Down to 0.5 to 2.5 seconds.
I mostly followed all of your advice, changed some settings in my.cnf, and ran OPTIMIZE on my tables.
You're searching for dates in a very suboptimal way. Try this.
... c.datePub >= STR_TO_DATE('01-04-2016', '%d-%m-%Y')
AND c.datePub < STR_TO_DATE('30-04-2016', '%d-%m-%Y') + INTERVAL 1 DAY
That allows a range scan on an index on the datePub column. You should create a compound index for that table on (datePub, prix, adresse_id, rawData_id) and see if it helps.
Also try an index on a (validPoint). Notice that your use of a geometry data type in that table is probably not helping anything.
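A sketch of those indexes as DDL, using the table and column names from the query (the index names are placeholders):
ALTER TABLE immobilier_ad_blank
  ADD INDEX idx_datepub_prix (datePub, prix, adresse_id, rawData_id);
ALTER TABLE Adresse
  ADD INDEX idx_validpoint (validPoint);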
To begin with, you have quite a lot of indexes, but many of them are not useful. Remember: more indexes mean slower inserts and updates. Also, MySQL is not good at using more than one index per table in complex queries. The following indexes have a cardinality < 10 and probably should be dropped.
IDX_...E88B
IDX....62AF
IDX....7DEE
idx2
UNIQ...F210
UNIQ...F210..
IDX....0C00
IDX....A2F1
At this point I got tired of the exercise; there are many more.
Then you have some duplicated data.
point
lat
lng
The point field has the lat and lng in it. So the latter two are not needed. That means you can lose two more indexes idxlat and idxlng. I am not quite sure how idxlng appears twice in the index list for the same table.
These optimizations will lead to an overall increase in performance for INSERTS and UPDATES and possibly for all SELECTs as well because the query planner needs to spend less time deciding which index to use.
Then we notice from your EXPLAIN that the query does not use any index on table Adresse (a). But your WHERE clause has a.validPoint = 1, so clearly you need an index on it, as suggested by @Ollie-Jones.
However I suspect that this index may have low cardinality. In that case I recommend that you create a composite index on this column + another.
The problem is your join with (a). The table has an index, but the index can't be used, more than likely due to the sort (/group by), or possibly incompatible types. The EXPLAIN shows three quarters of a million rows examined; this means that an index lookup was not possible.
When designing a query, look for the smallest possible result set - search by that index, and then join from there. Perhaps "c" isn't the best table for the primary query.
(You could try using FORCE INDEX (id) on table a; if it doesn't work, the error may give you more information.)
As others have pointed out, you need an index on a.validPoint, but what about c.datePub, which is also used in the WHERE clause? Why not a multiple-column index on (datePub, adresse_id)? The index on adresse_id is already used, so a multiple-column index will be better here.
SELECT t1.*
FROM
( SELECT key_a,key_b,MAX(date) as date
FROM large_table
WHERE date <= **20150126**
group by key_a,key_b
) AS t2
JOIN large_table AS t1 USING(key_a,key_b ,date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous missing dates between rows for a given key_a & key_b; I want the most recent row on or before the "Date" entered.
Is it the MySQL join settings causing it to be slow?
I can copy the entire set of key_a & key_b data with an INSERT into a temp table just by selecting all the rows, and then run the same query on the temp table, but why run multiple queries (insert selected rows, then select from them) when only one is needed?
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization and not keys. Is it max sort length or join buffer size? I have 128 GB of RAM on a 32-core server running this; there is no reason for it to be slow. I've just never bulk-inserted a single table this large to run JOIN queries on before. If anyone else has dealt with tables this size, any info is greatly appreciated.
Edited the query; yes, it's late after a long day. It had DISTINCT when it wasn't needed and isn't in the actual query.
WHERE date <= **20150126**
group by key_a,key_b
needs an index starting with date. It's about doing what you can with the WHERE clause, not sparse or dense.
Then... Since the inner query references only 3 columns, building a 'covering' index may be useful. (Probably useful in your case.) So, tack on the other two fields, in either order. Such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the index.
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (#16KB each) of index scanning. That's 300MB.
Next the MAX and GROUP BY are performed. At 300MB, this will involve a disk tmp table. (See max_heap_table_size and tmp_table_size.)
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
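For example:
SHOW SESSION STATUS LIKE 'Handler%';
SHOW SESSION STATUS LIKE '%sort%';
SHOW SESSION STATUS LIKE 'Innodb%';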
What version are you running? Recent versions have a leapfrog technique that works better for max+groupby than what I described. I think it is called MRR. If so, your PK is actually optimal. (Hmmm... I should play around with that.)
PARTITIONing -- I don't see any benefit (for this query).
I have this MySQL query which seems to be very, very slow; it takes 3 seconds to run. This is trying to get all the posts from either the people they're following or any interests they have. It's also trying to make sure it doesn't show any duplicate shares that match any post_id. What do you guys think I should do?
SELECT p.*,
IFNULL(post_data_share, UUID()) AS unq_share,
UNIX_TIMESTAMP(p.post_time) AS a
FROM posts p
LEFT JOIN users_interests i ON (i.user_id=1
AND p.post_interest = i.interest)
LEFT JOIN following f ON (f.user_id=1
AND p.post_user_id = f.follower_id)
WHERE (post_user_id=1
OR f.follower_id IS NOT NULL
OR i.interest IS NOT NULL)
AND (POST_DATA_SHARE NOT IN
(SELECT POST_ID
FROM posts p
LEFT JOIN following f ON f.user_id=1
AND p.post_user_id = f.follower_id
LEFT JOIN users_interests i ON (i.user_id=1
AND p.post_interest = i.interest)
WHERE (post_user_id=1
OR f.follower_id IS NOT NULL
OR i.interest IS NOT NULL))
OR POST_DATA_SHARE IS NULL)
GROUP BY unq_share
ORDER BY `post_id` DESC LIMIT 10;
Below are performance tips that will definitely make a difference.
Try the EXPLAIN statement.
Normalize your tables by adding primary keys and foreign keys.
Add indexes for repeated values.
Avoid SELECT * FROM table; specify the column names instead.
Convert IS NULL to (='')
Convert IS NOT NULL to (!='')
Avoid too many OR conditions.
MySQL Configurations to explore
key_buffer_size
innodb_buffer_pool_size
query_cache_size
thread_cache_size
For much more, refer to this SO answer: Best my.cnf configuration for a 8GB MySQL server with MyISAM use only
I would start by looking at the execution plan for the query. Here is a link to MySQL documentation on the EXPLAIN keyword to show you how the optimizer is structuring your query: http://dev.mysql.com/doc/refman/5.5/en/using-explain.html
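A minimal sketch, reusing the query from the question:
EXPLAIN
SELECT p.*,
       IFNULL(post_data_share, UUID()) AS unq_share,
       UNIX_TIMESTAMP(p.post_time) AS a
FROM posts p
-- ... the rest of the original query, unchanged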
If CPU usage is low, likely the bottleneck is disk access for large table scans.
The way the query is executed is often different from how it was written. Once you see how the execution plan is structured, you are probably going to create indexes on the largest joins. Every table should have one clustered index (often it is created by default), but other fields can often benefit from unclustered indexes.
If the problem is extremely bad and this is vital to your application, you may want to consider reorganizing the database.