SELECT t1.*
FROM
( SELECT key_a,key_b,MAX(date) as date
FROM large_table
WHERE date <= 20150126
group by key_a,key_b
) AS t2
JOIN large_table AS t1 USING(key_a,key_b ,date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous missing dates between rows for a given key_a & key_b; I want the most recent row on or before the date entered.
Are the MySQL join settings causing it to be slow?
I can copy the entire set of key_a & key_b data into a temp table with an INSERT ... SELECT and then run the same query against the temp table, but why run multiple queries (insert the selection, then select from it) when only one should be needed?
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization and it's not missing keys. Is it max_sort_length? join_buffer_size? I have 128GB of RAM on a 32-core server running this, so there is no reason for it to be slow. I have just never bulk-inserted a single table this large and then run JOIN queries on it before, so if anyone else has dealt with tables this size, any info is greatly appreciated.
Edit: yes, it's late after a long day; the query had a DISTINCT that wasn't needed and isn't in the actual query.
WHERE date <= 20150126
group by key_a,key_b
needs an index starting with `date`. It's about doing what you can with the WHERE clause; whether the dates are sparse or dense doesn't matter.
Then... Since the inner query references only 3 columns, building a 'covering' index may be useful. (Probably useful in your case.) So, tack on the other two fields, in either order. Such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the index.
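For concreteness, a sketch of adding such an index (the name ix_date_a_b is just an illustration):

ALTER TABLE large_table ADD INDEX ix_date_a_b (`date`, key_a, key_b);
-- On InnoDB, INDEX(`date`) alone would already be covering here, since a
-- secondary index implicitly carries the PK columns (key_a, key_b, date).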
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (~16KB each) of index scanning. That's 300MB.
Next, the MAX and GROUP BY are performed. At 300MB, this will involve an on-disk tmp table. (See max_heap_table_size and tmp_table_size.)
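If the on-disk tmp table turns out to be the bottleneck, the per-session limits can be raised for this one query; the values below are only illustrative, and both variables need to be raised together (the in-memory limit is the smaller of the two):

SET SESSION tmp_table_size      = 1024 * 1024 * 1024;   -- 1GB, example value
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;   -- raise together with tmp_table_size
SELECT ...;   -- then run the query above in the same session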
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
What version are you running? Recent versions have a leapfrog technique (the "loose index scan") that works better for MAX + GROUP BY than what I described. If so, your PK is actually optimal. (Hmmm... I should play around with that.)
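One way to check whether that optimization is actually used (a sketch; look at the Extra column of the EXPLAIN output):

EXPLAIN
SELECT key_a, key_b, MAX(`date`) AS `date`
FROM large_table
WHERE `date` <= 20150126
GROUP BY key_a, key_b;
-- "Using index for group-by" in Extra would indicate the leapfrog
-- (loose index scan) path; plain "Using index" means an ordinary
-- covering-index scan.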
PARTITIONing -- I don't see any benefit (for this query).
Related
It takes around 5 seconds to get the result of a query from a table containing 1.5 million rows. The query is "select * from table where code=x".
Is there a setting to increase speed? Or should I move to another database instead of MySQL?
You could index the code column. Note that the trade off is that inserting new rows or updating the code column on existing rows will be slowed down a bit since the index also needs to be updated. In any event, you should benchmark the improvement to make sure it's worth it.
WHERE code=x -- needs INDEX(code)
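A minimal sketch of that, with your_table standing in for the real table name and the index name chosen arbitrarily:

ALTER TABLE your_table ADD INDEX ix_code (code);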
SELECT * when many of the columns are bulky: Large columns are stored "off-record". Hence they take longer to fetch. So, explicitly list the columns you really need, hoping to leave out some of the bulky columns.
When a GROUP BY or LIMIT is involved, it is sometimes best to do
SELECT lots of columns
FROM ( SELECT id FROM t WHERE ... group-by or limit ) AS x
JOIN t AS y USING(id)
etc.
That is, start by finding just the ids as simply as possible, then JOIN back to the original table and other table(s). (This is not the case you presented, but I worry that you over-simplified it.)
I am reading High performance MySQL and I am a little confused about deferred join.
The book says that the following operation cannot be optimized by index(sex, rating) because the high offset requires them to spend most of their time scanning a lot of data that they will then throw away.
mysql> SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;
While a deferred join helps minimize the amount of work MySQL must do gathering data that it will only throw away.
SELECT <cols> FROM profiles INNER JOIN (
SELECT <primary key cols> FROM profiles
WHERE sex='M' ORDER BY rating LIMIT 100000, 10
) AS x USING(<primary key cols>);
Why does a deferred join minimize the amount of gathered data?
The example you presented assumes that InnoDB is used. Let's say that the PRIMARY KEY is just id.
INDEX(sex, rating)
is a "secondary key". Every secondary key (in InnoDB) includes the PK implicitly, so it is really an ordered list of (sex, rating, id) values. To get to the "data" (<cols>), it uses id to drill down the PK BTree (which contains the data, too) to find the record.
Fast Case: Hence,
SELECT id FROM profiles
WHERE sex='M' ORDER BY rating LIMIT 100000, 10
will do a "range scan" of 100010 'rows' in the index. This will be quite efficient for I/O, since all the information is consecutive, and nothing is wasted. (No, it is not smart enough to jump over 100000 rows; that would be quite messy, especially when you factor in the transaction isolation level.) Those 100010 rows probably fit in about 1000 blocks of the index. Then it gets the 10 values of id.
With those 10 ids, it can do 10 joins ("NLJ" = "Nested Loop Join"). It is rather likely that the 10 rows are scattered around the table, possibly requiring 10 hits to the disk.
Let's "count the disk hits" (ignoring non-leaf nodes in the BTrees, which are likely to be cached anyway): 1000 + 10 = 1010. On ordinary disks, this might take 10 seconds.
Slow Case: Now let's look at the original query (SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;). Let's continue to assume INDEX(sex, rating) plus the implicit id on the end.
As before, it will index scan through the 100010 rows (est. 1000 disk hits). But as it goes, it is too dumb to do what was done above. It will reach over into the data to get the <cols>. This often (depending on caching) requires a random disk hit. This could be upwards of 100010 disk hits (if the table is huge and caching is not very useful).
Again, 100000 are tossed and 10 are delivered. Total 'cost': 100010 disk hits (worst case), which might take 17 minutes.
Keep in mind that there are 3 editions of High performance MySQL; they were written over the past 13 or so years. You are probably using a much newer version of MySQL than they covered. I do not happen to know if the optimizer has gotten any smarter in this area. These, if available to you, may give clues:
EXPLAIN FORMAT=JSON SELECT ...;
OPTIMIZER TRACE...
My favorite "Handler" trick for studying how things work may be helpful:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
You are likely to see numbers like 100000 and 10, or small multiples of such. But, keep in mind that a fast range scan of the index counts as 1 per row, and so does a slow random disk hit for a big set of <cols>.
Overview: To make this technique work, the subquery needs a "covering" index, with the columns correctly ordered (a concrete example follows the list below).
"Covering" means that (sex, rating, id) contains all the columns touched. (We are assuming that <cols> contains other columns, perhaps bulky ones that won't work in an INDEX.)
"Correct" ordering of the columns: The columns are in just the right order to get all the way through the query. (See also my cookbook.)
First come any WHERE columns compared with = to constants. (sex)
Then comes the entire ORDER BY, in order. (rating)
Finally it is 'covering'. (id)
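For the profiles example, that recipe gives something like the following (the index name is illustrative; on InnoDB, id does not need to be listed because the PK is appended implicitly):

ALTER TABLE profiles ADD INDEX ix_sex_rating (sex, rating);
-- effectively stored as (sex, rating, id) in InnoDB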
From the description in the official documentation (https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html):
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause are selected, and most or all of them are sorted, before the first row_count are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.
From this, it would seem that the two forms should show no difference.
But Percona suggests this approach and provides test data, without giving a reason. I think there may be some "bug" in MySQL when dealing with this kind of case, so we can simply regard this as useful practical experience.
I have a very large table with 17,044,833 rows, 6.4 GB in size. I am running the simple query below and it takes about 5 seconds. Any ideas what optimizations I can do to improve the speed of this query?
SELECT
`stat_date`,
SUM(`adserver_impr`),
SUM(`adserver_clicks`)
FROM `dfp_stats` WHERE
`stat_date` >= '2014-02-01'
AND
`stat_date` <= '2014-02-28'
MySQL Config:
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
innodb_buffer_pool_size = 10G
Server:
Memory: 48GB
Disk: 480GB
UPDATE
ORIGINAL QUERY:
EXPLAIN
SELECT
DS.`stat_date` 'DATE',
DC.`name` COUNTRY,
DA.`name` ADVERTISER,
DOX.`id` ORDID,
DOX.`name` ORDNAME,
DLI.`id` LIID,
DLI.`name` LINAME,
DLI.`is_ron` ISRON,
DOX.`is_direct` ISDIRECT,
DSZ.`size` LISIZE,
PUBSITE.`id` SITEID,
SUM(DS.`adserver_impr`) 'DFPIMPR',
SUM(DS.`adserver_clicks`) 'DFPCLCKS',
SUM(DS.`adserver_rev`) 'DFPREV'
FROM `dfp_stats` DS
LEFT JOIN `dfp_adunit1` AD1 ON AD1.`id` = DS.`dfp_adunit1_id`
LEFT JOIN `dfp_adunit2` AD2 ON AD2.`id` = DS.`dfp_adunit2_id`
LEFT JOIN `dfp_adunit3` AD3 ON AD3.`id` = DS.`dfp_adunit3_id`
LEFT JOIN `dfp_orders` DOX ON DOX.`id` = DS.`dfp_order_id`
LEFT JOIN `dfp_advertisers` DA ON DA.`id` = DOX.`dfp_advertiser_id`
LEFT JOIN `dfp_lineitems` DLI ON DLI.`id` = DS.`dfp_lineitem_id`
LEFT JOIN `dfp_countries` DC ON DC.`id` = DS.`dfp_country_id`
LEFT JOIN `dfp_creativesize` DSZ ON DSZ.`id` = DS.`dfp_creativesize_id`
LEFT JOIN `pubsites` PUBSITE
ON AD1.`pubsite_id` = PUBSITE.`id`
OR AD2.`pubsite_id` = PUBSITE.`id`
WHERE
DS.`stat_date` >= '2014-02-01'
AND DS.`stat_date` <= '2014-02-28'
AND PUBSITE.`id` = 6
GROUP BY DLI.`id`,DS.`stat_date`;
RESULTS OF EXPLAIN: (This is after adding the COVERING INDEX)
http://i.stack.imgur.com/vhVeB.png
If you haven't, you might want to index the stat_date field for faster lookups. Here's the syntax:
ALTER TABLE TABLE_NAME ADD INDEX (COLUMN_NAME);
Read more about indexing and optimizations here: https://dev.mysql.com/doc/refman/5.5/en/optimization-indexes.html
For best performance of this query, create a covering index:
... ON `dfp_stats` (`stat_date`,`adserver_impr`,`adserver_clicks`)
The output from EXPLAIN should show "Using index". This means that the query can be satisfied entirely from the index, without needing to visit any pages in the underlying table. (The term "covering index" refers to an index that includes all of the columns referenced by a query.)
At a minimum, you'll want an index with a leading column of stat_date so that the query can use an index range scan operation. An index range scan can essentially skip over boatloads of rows, and more quickly locate the rows that actually need to be checked.
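A concrete form of the covering index suggested above (the index name is just an example):

CREATE INDEX ix_dfp_stats_date_impr_clicks
  ON dfp_stats (stat_date, adserver_impr, adserver_clicks);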
As far as changes to the configuration of the MySQL instance, that really depends on whether the table is InnoDB or MyISAM.
FOLLOWUP
For InnoDB, memory is still king. If there's memory available on the server, then you can increase innodb_buffer_pool.
Also consider enabling the MySQL query cache. (We have the query cache enabled only for queries that specifically request caching with the SQL_CACHE keyword, i.e. SELECT SQL_CACHE t.foo, ..., so we don't clutter up the cache with queries that don't give us a benefit. For other queries, we avoid running the extra code that would otherwise be required to search the cache and maintain the cache contents.)
The place we get a benefit from the query cache is from "expensive" queries (which look at a lot of rows and do a lot of joins) against tables that are relatively static, and that return small resultsets. (I'd consider a query that gets a single row with a SUMs from a whole boatload of rows would be a good candidate for the query cache, if the table is infrequently updated, or if the same query is going to be run several times before a DML operation on the table invalidates the cache.)
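As a sketch: with query_cache_type = 2 (DEMAND) set in my.cnf (MySQL 5.x only; the query cache was removed in MySQL 8.0), a query opts in to the cache like this:

SELECT SQL_CACHE SUM(adserver_impr), SUM(adserver_clicks)
  FROM dfp_stats
 WHERE stat_date >= '2014-02-01'
   AND stat_date <= '2014-02-28';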
It's a bit odd that your query is returning a non-aggregate that isn't in a GROUP BY clause.
If your query is using an index on stat_date, it's likely the query is returning the lowest value of stat_date within the range specified by the predicate; so it's likely that you would get an equivalent result using SELECT MIN(stat_date) AS stat_date.
A more complicated approach would be to setup a "summary" table, and refresh that periodically with the results from a query, and then have the application query the summary table. (A data warehouse type approach.) This doesn't work if you need "up-to-the-minute" accuracy. To get that, you'd likely need to introduce triggers on the target table, to maintain the summary table on INSERT, UPDATE and DELETE operations.
If I went that route, I'd probably opt for storing a summary row for each stat_date, so it could accommodate queries on any range or set of dates...
CREATE TABLE dfp_stats_summary
( stat_date DATE NOT NULL PRIMARY KEY
, adserver_impr BIGINT
, adserver_clicks BIGINT
) ENGINE=InnoDB ;
-- refresh
INSERT INTO dfp_stats_summary (stat_date, adserver_impr, adserver_clicks)
SELECT t.stat_date
, SUM(t.adserver_impr) AS adserver_impr
, SUM(t.adserver_clicks) AS adserver_clicks
FROM dfp_stats t
GROUP BY t.stat_date
ON DUPLICATE KEY
UPDATE adserver_impr = VALUES(adserver_impr)
, adserver_clicks = VALUES(adserver_clicks)
;
The refresh query will have to chew through the whole detail table; you might want to specify a date range in a WHERE clause to do a month or two at a time, and loop through all the possible months.
With the summary table populated, just change the original query to reference the new summary table rather than the detail table. It would be a lot faster to add up 28 summary rows than several hundred thousand detail rows.
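With the summary table in place, the original aggregate becomes a scan of at most a few dozen rows, along these lines:

SELECT SUM(adserver_impr)   AS adserver_impr
     , SUM(adserver_clicks) AS adserver_clicks
  FROM dfp_stats_summary
 WHERE stat_date >= '2014-02-01'
   AND stat_date <= '2014-02-28';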
We're doing an update query between two database tables and it is ridiculously slow. As in: it would take 30 days to perform the query.
One table, lab.list, contains about 940,000 records, the other, mind.list about 3,700,000 (3.7 million)
The update sets a field when two BETWEEN conditions are met. This is the query:
UPDATE lab.list L , mind.list M SET L.locId = M.locId WHERE L.longip BETWEEN M.startIpNum AND M.endIpNum AND L.date BETWEEN "20100301" AND "20100401" AND L.locId = 0
As it is now, the query is performing with about 1 update every 8 seconds.
We also tried it with the mind.list table in the same database, but that doesn't matter for the query time.
UPDATE lab.list L, lab.mind M SET L.locId = M.locId WHERE longip BETWEEN M.startIpNum AND M.endIpNum AND date BETWEEN "20100301" AND "20100401" AND L.locId = 0;
Is there a way to speed up this query? Basically IMHO it should make two subsets of the databases:
mind.list.longip BETWEEN M.startIpNum AND M.endIpNum
lab.list.date BETWEEN "20100301" AND "20100401"
and then update the values for these subsets. Somewhere along the line I think I made a mistake, but where? Maybe there is a faster query possible?
We tried the slow query log (log_slow_queries), and it shows that the query is indeed examining hundreds of millions of rows, probably going all the way up to 3331 gigarows.
Tech info:
Server version: 5.5.22-0ubuntu1-log (Ubuntu)
lab.list has indexes on locId, longip, date
lab.mind has indexes on locId, startIpNum and endIpNum
hardware: 2x xeon 3.4 GHz, 4GB RAM, 128 GB SSD (so that should not be a problem!)
I would first of all try to index mind on (startIpNum, endIpNum, locId), in this order. locId is not used when SELECTing from mind, even though it is used for the update.
For the same reason I'd index lab on (locId, date, longip), in this order (longip isn't used in the first chunking, which should run on date).
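A sketch of those two indexes (index names are arbitrary):

ALTER TABLE mind.list ADD INDEX ix_iprange_loc (startIpNum, endIpNum, locId);
ALTER TABLE lab.list  ADD INDEX ix_loc_date_ip (locId, `date`, longip);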
Then, what datatype is assigned to startIpNum and endIpNum? For IPv4, it's best to convert to INTEGER and use INET_ATON and INET_NTOA for user I/O. I assume you already did this.
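For reference, the conversion functions look like this (a minimal illustration):

SELECT INET_ATON('192.168.0.1');   -- 3232235521
SELECT INET_NTOA(3232235521);      -- '192.168.0.1'
-- i.e. store startIpNum/endIpNum (and longip) as INT UNSIGNED and compare
-- numerically: L.longip BETWEEN M.startIpNum AND M.endIpNum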
To run the update, you might try to segment the M database using temporary tables. That is:
* select all records of lab in the given range of dates with locId = 0 into a temporary table TABLE1.
* run an analysis on TABLE1 grouping IP addresses by their first N bits (using AND with a suitable mask: 0x80000000, 0xC0000000, ... 0xF8000000, and so on), until you find that you have divided them into a "suitable" number of IP "families". These will, by and large, match up with startIpNum ranges (but that's not strictly necessary).
* say that you have divided in 1000 families of IP.
* For each family:
* select those IPs from TABLE1 to TABLE3.
* select the IPs matching that family from mind to TABLE2.
* run the update of the matching records between TABLE3 and TABLE2. This should take place in about one hundred thousandth of the time of the big query.
* copy-update TABLE3 into lab, discard TABLE3 and TABLE2.
* Repeat with next "family".
It is not really ideal, but if the slightly improved indexing does not help, I really don't see all that many options.
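A rough, untested sketch of one "family" iteration of the procedure above. It assumes lab.list has a unique id column (an assumption), that longip/startIpNum are unsigned integers, and it uses an arbitrary /8 mask with family 62.x.x.x as the example:

-- Step 1: candidate rows (the date window, locId still 0)
CREATE TEMPORARY TABLE TABLE1 AS
SELECT id, longip
  FROM lab.list
 WHERE `date` BETWEEN '20100301' AND '20100401'
   AND locId = 0;

-- Step 2: one IP "family" (here: first octet = 62, mask 0xFF000000)
CREATE TEMPORARY TABLE TABLE3 AS
SELECT id, longip
  FROM TABLE1
 WHERE longip & 0xFF000000 = 62 << 24;

CREATE TEMPORARY TABLE TABLE2 AS
SELECT startIpNum, endIpNum, locId
  FROM mind.list
 WHERE startIpNum & 0xFF000000 = 62 << 24;

-- Step 3: the small join; the copy-update back into lab.list is folded
-- into the same statement here
UPDATE lab.list L
  JOIN TABLE3 T3 ON T3.id = L.id
  JOIN TABLE2 T2 ON T3.longip BETWEEN T2.startIpNum AND T2.endIpNum
   SET L.locId = T2.locId;

DROP TEMPORARY TABLE TABLE1, TABLE2, TABLE3;
-- repeat Steps 2-3 for each remaining family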
In the end, the query was too big or cumbersome for MySQL to handle, even after indexing. Testing the same query with the same data on a high-end Sybase server also took 3 hours.
So we abandoned the idea of doing it all on the database server and went back to scripting languages.
We did the following in python:
load a chunk of 100000 records of the 3.7 million records, and loop over the rows
for each row, set the locId and fill in the rest of the columns
All these updates together take about 5 minutes, so a huge improvement!
Conclusion:
think outside of the database box!
I have about 1 million rows in a users table, with columns A, AA, B, BB, C, CC, D, DD, E, EE, F, FF (for example) holding int values 0 and 1 to be summed.
SELECT
CityCode,SUM(A),SUM(B),SUM(C),SUM(D),SUM(E),SUM(F),SUM(AA),SUM(BB),SUM(CC),SUM(DD),SUM(EE),SUM(FF)
FROM users
GROUP BY CityCode
Result 8 rows in set (24.49 sec).
How can I make my statement faster?
Use EXPLAIN to see the execution plan of your query.
Create at least one index. If possible, make CityCode the primary key.
Try this one
SELECT CityCode,SUM(A),SUM(B),SUM(C),SUM(D), SUM(E),SUM(F),SUM(AA),SUM(BB),SUM(CC),SUM(DD),SUM(EE),SUM(FF)
FROM users
GROUP BY CityCode,A,B,C,D,E,F,AA,BB,CC,DD,EE,FF
Create an index on the CityCode column.
I believe it is not because of SUM(); try running SELECT CityCode FROM users GROUP BY CityCode; it should take nearly the same time...
Use better hardware
increase caching size - if you use the InnoDB engine, then increase the innodb_buffer_pool_size value
refactor your query to limit the number of users (if business logic permits that, of course)
You have no WHERE clause, which means the query has to scan the whole table. This will make it slow on a large table.
You should consider how often you need to do this and what the impact of it being slow is. Some suggestions are:
Don't change anything - if it doesn't really matter
Have a table which contains the same data as "users", but without any of the other columns that you aren't interested in querying. It will still be slow, but not as slow, especially if the columns you leave out are large
(InnoDB) use CityCode as the first part of the primary key for table "users"; that way it can do a PK scan and avoid any sorting (may still be too slow; see the sketch after this list)
Create and maintain some kind of summary table, but you'll need to update it each time a user changes (or tolerate stale data)
But be sure that this optimisation is absolutely necessary.
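As a sketch of the PK reordering suggested above (this assumes the table currently has an integer id primary key, which is an assumption, and note that rebuilding the PK rewrites the whole table):

ALTER TABLE users
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (CityCode, id),
  ADD INDEX ix_id (id);   -- keep id indexed (required if it is AUTO_INCREMENT)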