I have a dataset of about 32 million rows that I'm trying to export to provide some data for an analytics project.
Since my final query will be large, I'm trying to limit the number of rows I have to work with initially. I'm doing this by running a CREATE TABLE ... SELECT on the main table (32 million records) with a join on another table that's about 5k records. I made indexes on the columns involved in the JOIN, but not on the other WHERE conditions. This query has been running for over 4 hours now.
What could I have done to speed this up and if there is something, would it be worth it to stop this query, do it, and start over? The data set is static and I'm not worried about preserving anything or proper database design long-term. I just need to get the data out and will discard the schema.
A simplified version of the query is below
CREATE TABLE RELEVANT_ALERTS
SELECT a.time, s.name,s.class, ...
FROM alerts a, sig s
WHERE a.IP <> 0
AND a.IP not between x and y
AND s.class in ('c1','c2','c3')
Try EXPLAIN SELECT first of all to see what is going on. Are your indexes properly set up?
Also, you are not joining the two tables on their primary keys; is that on purpose? Where are your primary key and foreign key?
Can you also provide us with a table schema?
Also, could your hardware be the problem? How much RAM and processing power does it have? I hope you are not running this on a single-core processor, as that is bound to take a long time.
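For reference, a minimal sketch of the EXPLAIN check suggested above, applied to the simplified query from the question (x and y stand in for the real bounds, as in the question):

EXPLAIN
SELECT a.time, s.name, s.class
FROM alerts a, sig s
WHERE a.IP <> 0
  AND a.IP NOT BETWEEN x AND y
  AND s.class IN ('c1','c2','c3');
-- Look at the type, key and rows columns: type=ALL with a large rows estimate
-- means a full scan, and the missing join condition will show up as a huge
-- product of the two tables' row counts.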
I have a table with 2,000,000,000 rows (2 billion, 219 GB) and it doesn't take more than 0.3 seconds to execute a query similar to yours with properly set up indexes. This is on an 8-core (2 GHz) processor with 64 GB of RAM. So not the beefiest setup for the size of the database, but the indexes are held in memory, so the queries can be fast.
It should not take that long. Can you please make sure you have indexes on a.IP and s.class?
Also, can't you put the a.IP <> 0 comparison after a.IP NOT BETWEEN x AND y, so that you already have a filtered set for the 0 comparison (as that one will compare every single record, I believe)?
You can also move s.class up as the first comparison, depending on how many rows the s table has, to really speed up the comparison.
Your join seems to be a full cross join. That will take a really long time in any case. Is there no common field in the two tables? Why do you need this join? If you really want to do this, you should first create two tables from alerts and sig that satisfy your WHERE conditions, and then join the resulting tables if you must.
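A minimal sketch of that two-step approach, reusing the names from the question; sig_id is a hypothetical common column (the original query has no join condition at all), and x/y stand in for the real IP bounds:

CREATE TABLE alerts_filtered
SELECT a.time, a.IP, a.sig_id            -- sig_id is assumed; keep only the columns you need
FROM alerts a
WHERE a.IP <> 0
  AND a.IP NOT BETWEEN x AND y;

CREATE TABLE sig_filtered
SELECT s.sig_id, s.name, s.class
FROM sig s
WHERE s.class IN ('c1','c2','c3');

-- Now join the two much smaller results on the shared key
CREATE TABLE RELEVANT_ALERTS
SELECT f.time, g.name, g.class
FROM alerts_filtered f
JOIN sig_filtered g ON g.sig_id = f.sig_id;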
Agree with Vish.
In addition, depending on your query workload, you could change the storage engine to MyISAM if it is currently InnoDB, since MyISAM is more optimized for read-only queries.
ALTER TABLE my_table ENGINE = MyISAM;
Also, you could change the isolation level of your database. For example, to set the isolation level to READ UNCOMMITTED:
SET tx_isolation = 'READ-UNCOMMITTED';
First try EXPLAIN SELECT to see what is slowing it down, then try to add some indexes if you don't have any.
Trust me, 4 hours is very normal: you have a table of 32 million rows, and with the cross join you multiply 32 million by 5,000, so your query has to consider on the order of 32,000,000 * 5,000 row combinations...
To avoid that, I suggest you use an ETL workflow, like Microsoft SSIS...
With SSIS you can reduce the query time a lot...
Related
I have two tables:
1. user table with around 10 million rows
columns: token_type, cust_id(Primary)
2. pm_tmp table with around 200k rows
columns: id(Primary | AutoIncrement), user_id
user_id is a foreign key referencing cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: Here we would run the query below individually for each cust_id, for all 60,000 records:
update user set token_type='PRIME' where cust_id='1111110';
Theoretically, time will be less for the first query, as it involves fewer commits and, in turn, fewer index rebuilds. But I would recommend going with the second option, since it is more controlled, will appear to take less time, and you can even think about executing two separate sets in parallel.
Note: The first query will need sufficient memory provisioned for MySQL buffers to execute quickly. The second query, being a set of independent single-transaction queries, will need comparatively less memory and hence will appear faster when executed in a limited-memory environment.
Well, you may rewrite the first query this way too.
update user u, pm_tmp p set u.token_type = 'PRIME' where u.cust_id = p.user_id and p.id between 1 and 60000;
Some versions of MySQL have trouble optimizing in. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME' ;
(Note: This assumes that cust_id is not repeated in pm_temp. If that is possible, you will want a select distinct subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update. Perhaps the logging and locking get more complicated as the number of updates increases. I don't actually know enough about MySQL internals to know if this would have a significant impact on performance.
IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big of a chunk. I recommend only 1000. Aside from that, Gordon's Answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing. COMMIT after each chunk. Else you build up a huge undo log; this adds to the cost. (And is a reason why 1K is possibly faster than 60K.)
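A minimal sketch of that chunked pattern, reusing the tables from the question; the 1,000-row ranges and explicit transactions are illustrative, and in practice the ranges would be driven by a loop in client code or a stored procedure:

-- Walk pm_tmp.id in chunks of 1,000 and commit each chunk separately
START TRANSACTION;
UPDATE user u
JOIN pm_tmp pt ON u.cust_id = pt.user_id
SET u.token_type = 'PRIME'
WHERE pt.id BETWEEN 1 AND 1000;
COMMIT;

START TRANSACTION;
UPDATE user u
JOIN pm_tmp pt ON u.cust_id = pt.user_id
SET u.token_type = 'PRIME'
WHERE pt.id BETWEEN 1001 AND 2000;
COMMIT;

-- ... and so on up to 60000.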
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.
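For illustration, a hedged sketch of that read-time approach, using a hypothetical prime_list table that holds the computed user_ids:

SELECT u.cust_id,
       (p.user_id IS NOT NULL) AS is_prime   -- 1 if the id appears in the list, else 0
FROM user u
LEFT JOIN prime_list p ON p.user_id = u.cust_id;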
Overview:
I have a system that builds query statements, some of which must join certain tables to others based on parameters passed into the system. When running performance tests on the generated queries, I noticed that some of them were doing FULL TABLE SCANS, which, from what I've read, is often bad for large tables.
What I'm trying to do:
1 - Remove the full table scans
2 - Speed up the Query
3 - Find out if there is a more efficient query I can have the system build instead
The Query:
SELECT a.p_id_one, b.p_id_two, b.fk_id_one, c.fk_id_two, d.fk_id_two,
d.id_three, d.fk_id_one
FROM ATable a
LEFT JOIN BTable b ON a.p_id_one = b.fk_id_one
LEFT JOIN CTable c ON b.p_id_two = c.fk_id_two
LEFT JOIN DTable d ON b.p_id_two = d.fk_id_two
WHERE a.p_id_one = 1234567890
The Explain
Query Time
Showing rows 0 - 10 (11 total, Query took 0.0016 seconds.)
Current issues:
1 - Query time for my system/DBMS (phpmyadmin) takes between 0.0013 seconds and 0.0017 seconds.
What have I done to fix?
The full table scans ('ALL' access type) are being run on two of the tables (BTable and DTable), so I've tried to use FORCE INDEX on the appropriate ids.
Using FORCE INDEX removes the full table scans, but it doesn't improve performance.
I double checked my fk_constraints and index relationships to ensure I'm not missing anything. So far everything checks out.
2 - Advisor shows multiple warnings a few relate back to the full table scans and the indexes.
Question(s):
Assume all indexes are available and created
1 - Is there a better way to perform this query?
2 - How many joins are too many joins?
3 - Could the joins be the problem?
4 - Does the issue rest within the WHERE clause?
5 - What optimize technique/tool could I have missed?
6 - How can I get this query to run in between 0.0001 and 0.0008 seconds?
If images and visuals are needed to help clarify my situation please do ask in a comment below. I appreciate any and all assistance.
Thank you =)
"p_id_one" does not tell us much. Is this an auto_increment? Real column names sometimes gives important clues of cardinality and intent. As Willem said, "there must be more to this issue" and "what is the overall problem".
LEFT -- do you need it? It prevents certain forms of optimizations; remove it if the 'right' table row is not optional.
WHERE a.p_id_one = 1234567890 needs INDEX(p_id_one). Is that the PRIMARY KEY already? In that case, an extra INDEX is not needed. (Please provide SHOW CREATE TABLE.)
Are those really the columns/expressions you are SELECTing? It can make a difference -- especially when suggesting a "covering index" as an optimization.
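If a covering index turns out to help here, a hedged sketch (BTable's real columns are unknown; this assumes the query only needs fk_id_one and p_id_two from it):

-- Covers the join on fk_id_one and both selected columns, so BTable's rows
-- never need to be read for this query.
ALTER TABLE BTable ADD INDEX idx_cover (fk_id_one, p_id_two);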
Please provide the output from EXPLAIN SELECT ... (That is not the discussion you did provide.) That output would help with clues of 1:many, cardinality, etc.
If these are FOREIGN KEYs, you already have indexes on b.fk_id_one, c.fk_id_two, d.fk_id_two; so there is nothing more to do there.
1.6ms is an excellent time for a query involving 4 tables. Don't plan on speeding it up significantly. You probably handle hundreds of connections doing thousands of similar queries per second. Do you need more than that?
Are you using InnoDB? That is better at concurrent access.
Your example does not seem to have any full table scans; please provide an example that does.
ALL on a 10-row table is nothing to worry about. On a million-row table it is a big deal. Will your tables grow significantly? You should note this when worrying about ALL: a full table scan is sometimes faster than using the 'perfect' index. The optimizer decides on the scan when the estimated number of rows is more than about 20% of the table. A table scan is efficient because it reads straight through the table, even if it skips 80% of the rows. Using an index is more complex: the index is scanned, but for each row found in the index, a lookup is needed into the data to find the row. If you see ALL when you don't think you should, the index is probably not very selective. Don't worry.
Don't use FORCE INDEX -- although it may help the query with today's values, it may hurt tomorrow's query.
SELECT COUNT(*) AS count_all, products.id AS products_id
FROM `products`
INNER JOIN `product_device_facilities`
ON `product_device_facilities`.`product_id` = `products`.`id`
INNER JOIN `product_vendors`
ON `product_vendors`.`ProductId` = `products`.`id`
INNER JOIN `screen_shots`
ON `screen_shots`.`ProductVendorId` = `product_vendors`.`id`
WHERE ( (DownloadCount >= 10 or DownloadCount is NULL)
and (Minsdk <= 10 or Minsdk is null))
GROUP BY products.id
HAVING GROUP_CONCAT(device_facility_id ORDER BY device_facility_id ASC ) IN (0)
This is taking 10 seconds for 100k records.
How to improve the performance?
There are a few things that you can try.
Use persistent connections to the database to avoid connection overhead
Check that all of your tables have primary keys on the key columns, e.g. (product_id)
Use less RAM per row by declaring columns only as large as they need to be to hold the values stored in them. Also, as @manurajhada said, don't use COUNT(*); use COUNT(primary key)
Using simpler permissions when you issue GRANT statements enables MySQL to reduce permission-checking overhead.
Use indexes on the columns that reference other tables (see the sketch after this list). Just remember not to index too many columns; a simple rule of thumb: if you never refer to a column in comparisons, there's no need to index it.
Try using ANALYZE TABLE to help MySQL better optimize the query.
You can speed up a query a tiny bit by making sure all columns which are not null are declared NOT NULL; this speeds up table traversal a bit.
Tune MySQL caching: allocate enough memory for the buffer (e.g. SET GLOBAL query_cache_size = 1000000) and define query_cache_min_res_unit depending on average query result set size.
I know it sounds counter intuitive but sometimes it is worth de-normalising tables i.e. duplicate some data in several tables to avoid JOINs which are expensive. You can support data integrity with foreign keys or triggers.
and if all else fails
upgrade your hardware if you can: more RAM and a faster HDD can make a significant difference to the speed of the database, and once you have done that, allocate more memory to MySQL.
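As an illustration of the indexing point above, a hedged sketch against the query in the question (index names are made up, and DownloadCount/Minsdk are assumed to live on product_vendors and products respectively; adjust to your real schema):

-- Join columns
ALTER TABLE product_device_facilities ADD INDEX idx_pdf_product (product_id);
ALTER TABLE product_vendors           ADD INDEX idx_pv_product  (ProductId);
ALTER TABLE screen_shots              ADD INDEX idx_ss_vendor   (ProductVendorId);

-- Filter columns used in the WHERE clause
ALTER TABLE product_vendors ADD INDEX idx_pv_downloads (DownloadCount);
ALTER TABLE products        ADD INDEX idx_p_minsdk     (Minsdk);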
EDIT
Another option, if you do not require the results live: as @ask-bjorn-hansen suggested, you could use a background task (cron job) once a day and store the result of the query in a separate table. Then, in your application, all you have to do is check that table for the result. That way you avoid querying 100k rows on every request and could even run queries that take hours without overly impacting your users.
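A minimal sketch of that cache-table idea; the table name, columns, and the simplified aggregate below are all invented for illustration:

-- Run once: schema for the cached results
CREATE TABLE product_counts_cache (
  products_id INT PRIMARY KEY,
  count_all   INT NOT NULL,
  computed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Run nightly from cron: refresh the cache with the expensive query's output
TRUNCATE TABLE product_counts_cache;
INSERT INTO product_counts_cache (products_id, count_all)
SELECT products.id, COUNT(*)
FROM products
INNER JOIN product_vendors ON product_vendors.ProductId = products.id
GROUP BY products.id;          -- simplified stand-in for the full query

-- The application then reads from the small cache table instead
SELECT products_id, count_all FROM product_counts_cache WHERE products_id = 42;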
Add indexes on the join columns of the tables, and instead of COUNT(*) use COUNT(some indexed primary key column).
Are Minsdk and DownloadCount in the same table? If so, adding an index on those two might help.
It could be that it's just a hard/impossible query to do quickly. Without seeing your full schema and the data it's hard to be specific, but it's possible that splitting it up into several easier to execute queries would be faster. Or as Amadeus suggested maybe denormalize the data a bit.
Another variation would be to just live with it taking 10 seconds, but make sure it's always done periodically in the background (with cron or similar) and never while a user is waiting. Then take the time to fix it if/when it takes minutes instead of seconds or otherwise puts an unacceptable burden on your user experience or servers.
I have a table now containing over 43 million records. When I run a SELECT, I usually select records with the same value of a particular field, say A. Would it be more efficient to split the table into several tables by the different values of A and store those in the database? How much can I gain?
I have one table named entry: {entryid (PK), B}, containing 6 thousand records, and several other tables with a similar structure, T1: {id (PK), entryid, C, ...}, each containing millions of records. Do the following two approaches have the same efficiency?
SELECT id FROM T1, entry WHERE T1.entryid = entry.entryid AND entry.B = XXX
and
SELECT entryid FROM entry WHERE B = XXX
//format a string S as (entryid1, entryid2, ... )
//then run
SELECT id FROM T1 WHERE entryid IN S
You will get a performance improvement. You don't have to do that manually; use built-in MySQL partitioning. How much you gain really depends on your configuration, and it would be best for you to test it. For example, if you have a monster server, 43M records is nothing and you will not get that much from partitioning (but you should still get some improvement).
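For illustration, a hedged sketch of built-in partitioning on the filter column; the table, columns, and partition count are placeholders, and HASH partitioning assumes A is an integer column included in every unique key:

CREATE TABLE big_table (
  id BIGINT NOT NULL,
  A  INT NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id, A)            -- the partition column must be part of every unique key
)
PARTITION BY HASH (A)
PARTITIONS 16;

-- Queries filtering on A then only touch the matching partition(s):
SELECT * FROM big_table WHERE A = 7;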
As for this question, I would say that the first query will be a lot faster.
But it would be best to measure your results, because it may depend on your hardware configuration, indexes (use EXPLAIN to check that you have the correct indexes), your MySQL settings such as query cache size, and the engine you are using (MyISAM, InnoDB)...
Use the EXPLAIN Command to check your queries.
dev.mysql.com/doc/refman/5.0/en/explain.html
Here is an explanation:
http://www.slideshare.net/phpcodemonkey/mysql-explain-explained
You need to make sure first and foremost you have the right indexes for a table that size especially for queries that join with other tables.
I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is: is there any way to optimize the table or query so that these queries can be performed in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this internally from version 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
3 suggestions:
index
index
index
P.S. For timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (seconds since 1970, or whatever).
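A hedged sketch of that idea, with an invented table: store the sample time as an unsigned integer and convert at the edges with UNIX_TIMESTAMP()/FROM_UNIXTIME():

CREATE TABLE jvm_stats (
  sampled_at INT UNSIGNED NOT NULL,    -- seconds since 1970
  heap_used  BIGINT NOT NULL,
  INDEX (sampled_at)
);

INSERT INTO jvm_stats (sampled_at, heap_used)
VALUES (UNIX_TIMESTAMP('2024-01-01 00:00:00'), 123456789);

SELECT FROM_UNIXTIME(sampled_at) AS sampled_at, heap_used
FROM jvm_stats
WHERE sampled_at >= UNIX_TIMESTAMP('2024-01-01')
  AND sampled_at <  UNIX_TIMESTAMP('2024-01-08');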
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features,
but be warned that they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try to split the tables using partitioning.
If this also won't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate it: for instance, pre-compute totals by hour, by user, or by week, and store them in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, good for you!
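For example, a minimal sketch of an hourly rollup; the raw jmx_samples table and the jvm_stats_hourly cache table (and their columns) are assumptions, not names from the question:

CREATE TABLE jvm_stats_hourly (
  server_id   INT NOT NULL,
  hour_start  DATETIME NOT NULL,
  avg_heap    BIGINT NOT NULL,
  max_threads INT NOT NULL,
  PRIMARY KEY (server_id, hour_start)
);

-- Re-run periodically (e.g. from cron) for the most recent window
REPLACE INTO jvm_stats_hourly (server_id, hour_start, avg_heap, max_threads)
SELECT server_id,
       DATE_FORMAT(sampled_at, '%Y-%m-%d %H:00:00'),
       AVG(heap_used),
       MAX(thread_count)
FROM jmx_samples
WHERE sampled_at >= NOW() - INTERVAL 1 DAY
GROUP BY server_id, DATE_FORMAT(sampled_at, '%Y-%m-%d %H:00:00');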
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days ?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
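A hedged illustration of that pattern, assuming a table range-partitioned by month (all names and dates here are made up):

CREATE TABLE jmx_history (
  sampled_at DATETIME NOT NULL,
  heap_used  BIGINT NOT NULL
)
PARTITION BY RANGE (TO_DAYS(sampled_at)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Expiring a whole month is a metadata operation, not millions of row deletes:
ALTER TABLE jmx_history DROP PARTITION p2024_01;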
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash-aggregates, or anything else useful really, basically the only thing it can do is nested-loop indexscan which is good on a cached table, and absolutely atrocious on other cases if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL    Postgres
58 s     100 s      INSERT
75 s     51 s       CREATE INDEX on (category, id)  (useless)
9.3 s    5 s        SELECT category, sum(counter) FROM t GROUP BY category;
1.7 s    0.5 s      SELECT category, sum(counter) FROM t WHERE id > 15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama'   -- wrong
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted it takes additional code, but you will prevent a bottleneck that gets exponentially worse as your data grows. The problem is, MySQL will have to perform RAND() operation (which takes processing power) for every single row in the table before sorting it and giving you just 1 row.
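One hedged alternative among several: pick a random id in the table's range first, then fetch the first row at or above it. This assumes user has a numeric auto-increment id without large gaps (gaps skew the distribution):

-- Instead of: SELECT * FROM user ORDER BY RAND() LIMIT 1;
SELECT u.*
FROM user u
JOIN (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM user)) AS rand_id) r
  ON u.id >= r.rand_id
ORDER BY u.id
LIMIT 1;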
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
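A small hedged example of that tip; the table and column names are illustrative:

CREATE TABLE login_log (
  user_id INT NOT NULL,
  ip      INT UNSIGNED NOT NULL      -- 4 bytes instead of a VARCHAR(15)
);

INSERT INTO login_log (user_id, ip) VALUES (1, INET_ATON('192.168.0.10'));

SELECT user_id, INET_NTOA(ip) AS ip
FROM login_log
WHERE ip BETWEEN INET_ATON('192.168.0.0') AND INET_ATON('192.168.0.255');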