We're doing an update query between two database tables and it is ridiculously slow. As in: it would take 30 days to perform the query.
One table, lab.list, contains about 940,000 records; the other, mind.list, about 3,700,000 (3.7 million).
The update sets a field when two BETWEEN conditions are met. This is the query:
UPDATE lab.list L , mind.list M SET L.locId = M.locId WHERE L.longip BETWEEN M.startIpNum AND M.endIpNum AND L.date BETWEEN "20100301" AND "20100401" AND L.locId = 0
As it is now, the query performs about 1 update every 8 seconds.
We also tried it with the mind.list table in the same database, but that doesn't matter for the query time.
UPDATE lab.list L, lab.mind M SET L.locId = M.locId WHERE longip BETWEEN M.startIpNum AND M.endIpNum AND date BETWEEN "20100301" AND "20100401" AND L.locId = 0;
Is there a way to speed up this query? Basically IMHO it should make two subsets of the databases:
mind.list.longip BETWEEN M.startIpNum AND M.endIpNum
lab.list.date BETWEEN "20100301" AND "20100401"
and then update the values for these subsets. Somewhere along the line I think I made a mistake, but where? Maybe there is a faster query possible?
We tried log_slow_queries, but that shows that it is indeed examining 100s of millions of rows, probably going up all the way to 3331 gigarows.
Tech info:
Server version: 5.5.22-0ubuntu1-log (Ubuntu)
lab.list has indexes on locId, longip, date
lab.mind has indexes on locId, startIpNum and endIpNum
hardware: 2x xeon 3.4 GHz, 4GB RAM, 128 GB SSD (so that should not be a problem!)
I would first of all try to index mind on startIpNum, endIpNum, locId in this order. locId is not used in SELECTing from mind, even if it is used for the update.
For the same reason I'd index lab on locId, date and longip (which isn't used in the first chunking, which should run on date), in this order.
Then what kind of datatype is assigned to startIpNum and endIpNum? For IPv4, it's best to convert to INTEGER and use INET_ATON and INET_NTOA for user I/O. I assume you already did this.
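For checking values outside the database, the same conversion can be sketched in Python. This is only an illustration: the helper names ip_to_int and int_to_ip are made up here, and the standard ipaddress module stands in for MySQL's INET_ATON and INET_NTOA.

```python
import ipaddress

def ip_to_int(dotted: str) -> int:
    """Dotted-quad IPv4 text to its 32-bit integer form (like INET_ATON)."""
    return int(ipaddress.IPv4Address(dotted))

def int_to_ip(value: int) -> str:
    """32-bit integer back to dotted-quad text (like INET_NTOA)."""
    return str(ipaddress.IPv4Address(value))

print(ip_to_int("192.168.1.1"))  # 3232235777
print(int_to_ip(3232235777))     # 192.168.1.1
```

With integer columns, the BETWEEN comparison in the query is a plain numeric range test.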
To run the update, you might try to segment the M table using temporary tables. That is:
* select all records of lab in the given range of dates with locId = 0 into a temporary table TABLE1.
* run an analysis on TABLE1 grouping IP addresses by their first N bits (using AND with a suitable mask: 0x80000000, 0xC0000000, ... 0xF8000000 and so on) until you find that you have divided them into a "suitable" number of IP "families". These will, by and large, match with startIpNum (but that's not strictly necessary).
* say that you have divided in 1000 families of IP.
* For each family:
  * select those IPs from TABLE1 to TABLE3.
  * select the IPs matching that family from mind to TABLE2.
  * run the update of the matching records between TABLE3 and TABLE2. This should take place in about one hundred thousandth of the time of the big query.
  * copy-update TABLE3 into lab, discard TABLE3 and TABLE2.
* Repeat with next "family".
It is not really ideal, but if the slightly improved indexing does not help, I really don't see all that many options.
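The family split can be tried out on a sample of addresses before building any temporary tables. The sketch below is an assumption-laden illustration in Python: family_mask and group_by_family are hypothetical helper names, and the addresses are taken to be 32-bit integers already.

```python
from collections import defaultdict

def family_mask(prefix_bits: int) -> int:
    """32-bit mask keeping the top prefix_bits bits, e.g. 5 -> 0xF8000000."""
    return (0xFFFFFFFF << (32 - prefix_bits)) & 0xFFFFFFFF

def group_by_family(ips, prefix_bits):
    """Bucket integer IPs by their leading prefix_bits bits."""
    mask = family_mask(prefix_bits)
    families = defaultdict(list)
    for ip in ips:
        families[ip & mask].append(ip)
    return dict(families)

sample = [0x0A000001, 0x0A000002, 0xC0A80101]
for family, members in group_by_family(sample, 8).items():
    print(hex(family), len(members))
```

One thing to watch: a range in mind whose startIpNum and endIpNum fall in different families has to be copied into every family it overlaps, otherwise some matches are lost.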
In the end, the query was too big or cumbersome for MySQL to complete, even after indexing. Testing the same query with the same data on a high-end Sybase server also took 3 hours.
So we abandoned the idea of doing it all on the database server, and went back to scripting languages.
We did the following in python:
* load a chunk of 100000 records of the 3.7 million records, and loop over the rows
* for each row, set the locId and fill in the rest of the columns
All these updates together take about 5 minutes, so a huge improvement!
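The post does not say how the script matched an IP to its range, so the following is only one possible approach, not the author's code. It assumes the ranges never overlap (an assumption about the data) and uses a binary search over the sorted start values; build_lookup and find_loc_id are hypothetical names.

```python
import bisect

def build_lookup(ranges):
    """ranges: iterable of (start_ip, end_ip, loc_id), assumed non-overlapping."""
    ordered = sorted(ranges)
    starts = [r[0] for r in ordered]
    return starts, ordered

def find_loc_id(lookup, ip):
    """Return the loc_id of the range containing ip, or None if none matches."""
    starts, ordered = lookup
    i = bisect.bisect_right(starts, ip) - 1  # last range starting at or before ip
    if i >= 0:
        start, end, loc_id = ordered[i]
        if ip <= end:
            return loc_id
    return None

lookup = build_lookup([(100, 199, 7), (10, 19, 1), (20, 29, 2)])
print(find_loc_id(lookup, 15))  # 1
print(find_loc_id(lookup, 30))  # None
```

Each lookup costs a binary search over the range list rather than a scan of it, which is the kind of saving the big join never gets.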
Conclusion:
think outside of the database box!
Related
I have a view that has a duration time of ~0.2 seconds when I do a simple SELECT * from it, but has a duration time of ~25 seconds when I do simply SELECT COUNT(*) from it. What would cause this? It seems like if it takes 0.2 seconds to compute the output data then it could run a simple length calculation on that dataset in a trivial amount of time. MySQL 5.7. Details below.
mysql> select count(*) from Lots;
+----------+
| count(*) |
+----------+
| 4136666 |
+----------+
1 row in set (25.29 sec)
In MySQL Workbench, the following query produces durations like: 0.217 sec
select * from Lots;
The fetch time is significant given the amount of data, but my understanding is the "Duration" is how long it takes to compute the output dataset of the view.
Definition of Lots view:
select
lot.*,
coalesce(overrides.streetNumber, address.streetNumber, lot.rawStreetNumber) as streetNumber,
coalesce(overrides.street, address.street, lot.rawStreet) as street,
coalesce(overrides.postalCode, address.postalCode, lot.rawPostalCode) as postalCode,
coalesce(overrides.city, address.city, lot.rawCity) as city
from LotsData lot
left join Address address on address.lotNumber = lot.lotNumber
left join Override overrides on overrides.lotId = lot.lotNumber
The data in VIEW objects isn't materialized. That is, it doesn't exist in any sort of tabular form in your database server. Rather, the server pulls it together from its tables when a query (like your COUNT query) references the VIEW. So, there's no simple metadata hanging around in the server that can satisfy your COUNT query instantaneously. The server has to pull together all your joined tables to generate a row count. It takes a while. Remember, your database server may have other clients concurrently INSERTing or DELETEing rows to one or more of the tables in your view.
It's worse than that. In the InnoDB storage engine, even COUNTing the rows of a table is slow. To achieve high concurrency InnoDB doesn't attempt to store any kind of precise row count. So the database server has to count those rows one-by-one as well. (The older MyISAM storage engine does maintain precise row count metadata for tables, but it offers less concurrency.)
Wise data programmers avoid using COUNT(*) on whole tables or views composed from them in production for those reasons.
The real question is why your SELECT * FROM view is so fast. It's unlikely that your database server can compose and deliver a 4-megarow view from its JOINs in less than a second, nor is it likely that Workbench can absorb that many rows in that time. Like @ysth said, many GUI-based SQL client programs, like Workbench and HeidiSQL, sometimes silently append something like LIMIT 1000 to interactive operations calling for the display of whole tables or views. You might look for evidence of that.
SELECT t1.*
FROM
  ( SELECT key_a, key_b, MAX(date) AS date
    FROM large_table
    WHERE date <= 20150126
    GROUP BY key_a, key_b
  ) AS t2
JOIN large_table AS t1 USING (key_a, key_b, date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous gaps in the dates for each a & b pair; for each pair I want the most recent row on or before the date entered.
Are the MySQL join settings causing it to be slow?
I can copy the entire set of a & b data with an INSERT to a temp table just by selecting all the rows and then run the same query on the temp table, but why run multiple queries (insert selected, then select from) when only 1 is needed?
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization, and it's not keys. Is it max sort length or join buffer size? I have 128 GB of RAM on a 32-core server running this, so there is no reason for it to be slow. I have just never bulk-inserted this large a single table to run join queries on before. If anyone else has dealt with tables this size, any info is greatly appreciated.
Edit: fixed the query. Yes, it's late after a long day; I had a DISTINCT in there that wasn't needed and isn't in the actual query.
WHERE date <= 20150126
group by key_a,key_b
needs an index starting with date. It's about doing what you can with the WHERE clause, not sparse or dense.
Then... Since the inner query references only 3 columns, building a 'covering' index may be useful. (Probably useful in your case.) So, tack on the other two fields, in either order. Such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the data.
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (16KB each) of index scanning. That's 300MB.
Next the MAX and GROUP BY are performed. At 300MB, this will involve a disk tmp table. (See max_heap_table_size and tmp_table_size.)
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
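A small-scale illustration of that pattern, using Python's built-in sqlite3 rather than MySQL, so it shows the shape of the query and not MySQL's execution plan. The payload column, the sample rows, and the ORDER BY (added only to make the output deterministic) are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE large_table ("
    " key_a INTEGER, key_b INTEGER, date INTEGER, payload TEXT,"
    " PRIMARY KEY (key_a, key_b, date))"
)
conn.executemany(
    "INSERT INTO large_table VALUES (?, ?, ?, ?)",
    [
        (1, 1, 20150101, "old"),
        (1, 1, 20150120, "latest before cutoff"),
        (1, 1, 20150201, "after cutoff"),
        (2, 1, 20150110, "only row"),
    ],
)

# Subquery finds the PK of the newest row per (key_a, key_b) at or before
# the cutoff; the outer JOIN on the full PK then fetches the whole row.
query = """
SELECT t1.*
FROM (
    SELECT key_a, key_b, MAX(date) AS date
    FROM large_table
    WHERE date <= ?
    GROUP BY key_a, key_b
) AS t2
JOIN large_table AS t1 USING (key_a, key_b, date)
ORDER BY t1.key_a, t1.key_b
"""
rows = conn.execute(query, (20150126,)).fetchall()
print(rows)
```

Each (key_a, key_b) pair yields exactly one row: the newest one dated on or before the cutoff.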
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
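The estimates above can be reproduced with a few lines of arithmetic. The 10 ms per disk hit is an assumed figure for a commodity spinning disk, not something measured here, and counting every index block as a separate seek is deliberately pessimistic.

```python
BLOCK_SIZE_KB = 16       # InnoDB default page size
SEEK_MS = 10             # assumption: average cost of one random disk hit

index_blocks = 20_000    # blocks scanned through the proposed index
row_lookups = 180_000    # PK lookups for the outer JOIN

total_hits = index_blocks + row_lookups
cold_minutes = total_hits * SEEK_MS / 1000 / 60
cache_gb = total_hits * BLOCK_SIZE_KB / 1_000_000

print(total_hits)              # 200000
print(round(cold_minutes, 1))  # 33.3
print(round(cache_gb, 1))      # 3.2
```

That is where "about 30 minutes" cold and "3GB" of cached blocks come from.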
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
What version are you running? Recent versions have a leapfrog technique that works better for max+groupby than what I described. I think it is called MRR. If so, your PK is actually optimal. (Hmmm... I should play around with that.)
PARTITIONing -- I don't see any benefit (for this query).
By simple logic I'd think yes, it is faster because the DBMS brings back less data and needs less memory. However, I don't have a solid argument for why it would be faster.
For example, say I want to select from 2 related tables, with indexes and everything.
But I want to know why select tableA.field, tableA.field2, tableA.field3, tableB.field1, tableB.field2 from tableA, tableB
is actually faster than
select * from tableA,tableB
Both tables have about 3 million records; tableA has about 14 fields and tableB has 18.
Any idea?
Thanks.
Reducing the number of fields selected means that less data has to be transmitted from the server to the client. It also reduces the amount of memory that the server and client have to use to hold the data selected. So these should improve performance once the server determines which rows should be in the result set.
It's not likely to have any significant impact on the speed of processing the query itself within the database server. That's dominated by the cost of joining the tables, filtering the rows based on the WHERE clause, and performing any calculations specified in the SELECT clause. These are all independent of the columns being selected. If you use EXPLAIN on the two queries, you won't see any difference.
You are joining two tables with 3 million rows each with no filter. That will make 9x10^12 rows. Generating and transmitting to the client a result set of a few fields, rather than all 32 fields, will make a difference.
If you select all fields in the first query it's the same thing because you request the same amount of data. Check this http://sqlfiddle.com/#!9/27987/2
Maybe the difference in performance has another cause, such as other SELECTs running at the same time.
Essentially, select * from tableA,tableB is the Cartesian product of the two tables, for a total of 3 million x 3 million rows.
Therefore:
select * from tableA,tableB
With the wildcard * you retrieve a table of 9x10^12 rows x 32 columns, while
select tableA.field, tableA.field2, tableA.field3, tableB.field1, tableB.field2 from tableA, tableB
with the explicit form you have a table of 9x10^12 rows x 5 columns... so less data!
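The row count of an unfiltered comma join is easy to confirm on toy data. This sketch uses Python's sqlite3 with two tiny stand-in tables (3 and 4 rows, single made-up columns), then applies the same multiplication to the sizes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (field INTEGER)")
conn.execute("CREATE TABLE tableB (field1 INTEGER)")
conn.executemany("INSERT INTO tableA VALUES (?)", [(i,) for i in range(3)])
conn.executemany("INSERT INTO tableB VALUES (?)", [(i,) for i in range(4)])

# A comma join with no WHERE clause pairs every row of A with every row of B.
(joined_rows,) = conn.execute("SELECT COUNT(*) FROM tableA, tableB").fetchone()
print(joined_rows)            # 12

# The same rule applied to the table sizes in the question.
print(3_000_000 * 3_000_000)  # 9000000000000
```

The number of rows is the same for both queries; only the width of each row changes with the column list.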
I am using MySQL 5.1.34, and all tables use the InnoDB engine.
I have 3 tables as below:
TableA (1M rows)
-ID (Auto Increment PK)
-TableB_ID
-Date varchar(indexed)
-Other Fields
TableB (60M rows)
-ID (Auto Increment PK)
-TableC_ID
-Other Fields
TableC (10M rows)
-ID (Auto Increment PK)
-Other Fields
My objective is to join the 3 tables for rows matching a given "date" in TableA. The "date" column is indexed, and a simple WHERE clause completes within a second. E.g.
SELECT * FROM TableA where date = '2015-03-13';
10000 rows in set (0.1 sec)
However, when I try to join TableB and TableC with the SQL below, the process becomes extremely slow.
SELECT A.*, C.Something FROM TableA A JOIN TableB B on A.TableB_ID = B.ID JOIN TableC C on B.TableC_ID = C.ID WHERE A.date = '2015-03-13';
10000 rows in set (20 sec)
I've tried to troubleshoot the slowness using the EXPLAIN command; the output is as follows.
What could be the reason? Please help!
As I said, it's probably a disk seeking issue. As you measured, once the records are in memory, the query is fast, which, to me, confirms the issue.
Fetching a random location from the disk takes about 10ms, as the disk head has to move. The relevant records are probably scattered across the disk, and the server had to do approx. 20s/10ms = 2,000 seeks.
There are a couple of obvious approaches:
* Use an SSD. No seeking.
* Add enough RAM to the server to avoid disk access, or use a dedicated server for these queries. 16GB looks like more than enough, though I don't know the size of the records, so I guess there are other frequently used massive tables.
* Cache the result per date and store it in a database (memcached/redis/..). This could work really well if the records are static for the old dates, as you don't have to worry about cache invalidation.
Anyway, it's probably a good idea to do some back-of-the-envelope calculations and figure out the memory requirements.
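The per-date caching idea can be sketched with an in-process dictionary standing in for memcached or redis. Everything named here (rows_for_date, slow_query, the sample return value) is hypothetical; a real setup would swap the dict for the external cache and the stub for the actual three-table join.

```python
_cache = {}

def rows_for_date(date, run_query):
    """Return cached rows for date, calling run_query(date) only on a miss."""
    if date not in _cache:
        _cache[date] = run_query(date)
    return _cache[date]

calls = []

def slow_query(date):
    """Stand-in for the expensive join; records each real execution."""
    calls.append(date)
    return [("row for", date)]

first = rows_for_date("2015-03-13", slow_query)   # runs the query
second = rows_for_date("2015-03-13", slow_query)  # served from the cache
print(len(calls))  # 1
```

This only stays correct as long as rows for a cached date never change, which is the "static old dates" condition mentioned above.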
You have left out a lot of important information by not providing SHOW CREATE TABLE, but I will make some guesses.
Are you really fetching 10K rows? What are you going to do with all of them? It takes time to get that many rows.
If you are using MyISAM, then B would benefit greatly from INDEX(ID, TableC_ID). InnoDB should not benefit from it, unless the table is quite 'wide'.
SHOW VARIABLES LIKE '%buffer%'; How much RAM do you have?
Is the Query cache in use? If it is, that would explain why it was so fast the second time. Most production systems find it better to have the QC turned off.
I have indices of 800M and told MySQL to use 1500M of RAM. After starting MySQL it uses 1000M on Windows 7 x64.
I want to execute this query:
SELECT oo.* FROM table o
LEFT JOIN table oo ON (oo.order = o.order AND oo.type="SHIPPED")
WHERE o.type="ORDERED" and oo.type IS NULL
This finds all items not yet shipped. The execution plan tells me this:
My indices are:
type_order: composite index on type, then order
order_type: composite index on order, then type
So MySQL should use the index type_order from RAM and then pick out the few entries with the order_type index. I'm expecting only about 1000 non shipped items, so this query should be really fast, but it isn't. Disks are going crazy....
What am I doing wrong?
The query says SELECT sometable.*, so for 1000 matching rows, there will be 1000 fetches of all the fields from the table. Whether the WHERE part indexes are fully loaded into ram or not would only help some. The data fields still have to be retrieved. Odds are, they are scattered all over the disk. So, of course the disk(s) will be doing a thousand small reads.
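For reference, here is the LEFT JOIN ... IS NULL pattern from the question on toy data, run through Python's sqlite3. The table name events and the columns order_id and item are made up for the demo (the original names are reserved words). Note that it selects o.*: on the rows that survive the WHERE clause, every oo column is NULL, so the useful data lives on the o side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (order_id, type) event.
conn.execute("CREATE TABLE events (order_id INTEGER, type TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "ORDERED", "widget"),
        (1, "SHIPPED", "widget"),
        (2, "ORDERED", "gadget"),  # never shipped
    ],
)

# Keep ORDERED rows for which no matching SHIPPED row exists.
unshipped = conn.execute(
    """
    SELECT o.*
    FROM events AS o
    LEFT JOIN events AS oo
           ON oo.order_id = o.order_id AND oo.type = 'SHIPPED'
    WHERE o.type = 'ORDERED' AND oo.type IS NULL
    """
).fetchall()
print(unshipped)  # [(2, 'ORDERED', 'gadget')]
```

Selecting only the columns that are actually needed, ideally ones covered by an index, avoids the per-row fetches from the table described above.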