SQL big query optimization - MySQL

I have the following query:
SELECT `Product_has_ProductFeature`.`productId`, `Product_has_ProductFeature`.`productFeatureId`, `Product_has_ProductFeature`.`productFeatureValueId`
FROM `Product_has_ProductFeature`
WHERE `productId` IN (...'18815', '18816', '18817', '18818', '18819', '18820', '18821', '18822', '18823', '18824', '18825', '18826', '18827', '18828', '18829', '18830', '18831', '18832', '18833', '18834'..)
I have around 50,000 productIds, and the query takes about 20 seconds to execute. How can I make it faster?

This is more of a comment.
Returning 50,000 rows can take time, depending on your application, the row size, the network, and how busy the server is.
When doing comparisons, you should be sure the values are of the same type. So, if productId is numeric, then drop the single quotes.
If the values are all consecutive, then eschew the list and just do:
where productid >= 18815 and productid <= 18834
Finally, an index on productid is usually recommended. However, in some cases, an index can make performance worse. This depends on the size of your data and the size of memory available for the page and data cache.
MySQL implements IN efficiently (it sorts the list of values and uses a binary search). It is possible that much of the overhead is in compiling the query rather than executing it. If you have the values in a table, a join is probably going to be more efficient.
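Building on that last point, one way to test the join approach is to load the ids into a temporary table and join against it. This is a minimal sketch, assuming productId is numeric and using the table and column names from the question; tmp_product_ids is a made-up name, and an index on Product_has_ProductFeature.productId still matters here.

CREATE TEMPORARY TABLE tmp_product_ids (
    productId INT UNSIGNED NOT NULL PRIMARY KEY
);

-- Load the 50,000 ids from the application, in batches of multi-row inserts:
INSERT INTO tmp_product_ids (productId) VALUES (18815), (18816), (18817);

-- Then join instead of passing a huge IN () list:
SELECT p.productId, p.productFeatureId, p.productFeatureValueId
FROM Product_has_ProductFeature AS p
JOIN tmp_product_ids AS t ON t.productId = p.productId;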

MySQL 'IN' operator on large number of values

I am observing weird behaviour which I am trying to understand.
MySQL version: 5.7.33
I have the below query:
select * from a_table where time>='2022-05-10' and guid in (102,512,11,35,623,6,21,673);
a_table has a primary key on (time, guid) and an index on guid
The query I wrote above has very good performance, and according to the EXPLAIN plan it is Using index condition; Using where; Using MRR.
As I increase the number of values in my IN clause, performance degrades significantly.
After some dry runs, I was able to get a rough number: for fewer than ~14,500 values the explain plan is the same as above. For more values than that, the explain plan only shows Using where and the query takes forever to run.
In other words, if I put 14,000 values in my IN clause, the explain plan shows 14,000 rows as expected. However, if I put 15,000 values in my IN clause, the explain plan shows 221,200,324 rows. I don't even have that many rows in my whole table.
I am trying to understand this behaviour and to know if there is any way to fix this.
Thank you
Read about Limiting Memory Use for Range Optimization.
When you have a large list of values in an IN() predicate, it uses more memory during the query optimization step. This was considered a problem in some cases, so recent versions of MySQL set a max memory limit (it's 8MB by default).
If the optimizer finds that it would need more memory than the limit, and there is no other condition in your query it can use instead, it gives up trying to optimize the range access and resorts to a table scan. I infer that your table statistics actually show the table has ~221 million rows (though table statistics are inexact estimates).
I can't say I know the exact formula for how much memory a given list of values needs, but given your observed behavior, we could guess it's about 600 bytes per item on average, since ~14k items work and more than that does not.
You can set range_optimizer_max_mem_size = 0 to disable the memory limit. This creates a risk of excessive memory use, but it avoids the optimizer "giving up." We set this value on all MySQL instances at my last job, because we couldn't educate the developers to avoid creating huge lists of values in their queries.
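For reference, a rough sketch of checking and changing the limit; the variable has both global and session scope (a global change only affects new connections):

-- Check the current limit (8388608 bytes = 8MB by default on recent versions):
SHOW VARIABLES LIKE 'range_optimizer_max_mem_size';

-- Raise or disable it; 0 means no limit:
SET SESSION range_optimizer_max_mem_size = 0;  -- current connection only
SET GLOBAL  range_optimizer_max_mem_size = 0;  -- new connections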

No real performance gain after index rebuild on SQL Server 2008

I have a table whose compound clustered index (int, DateTime) was 99% fragmented.
After defragmenting and making sure that statistics were updated, I still get the same response time when I run this query:
SELECT *
FROM myTable
WHERE myIntField = 1000
AND myDateTimeField >= '2012-01-01'
and myDateTimeField <= '2012-12-31 23:59:59.999'
Well, I see a small response-time improvement (around 5-10%), but I really expected my queries to fly after that index rebuild and stats update.
The estimated execution plan is:
SELECT Cost: 0%
Clustered Index Seek (Clustered)[MyTable].[IX_MyCompoundIndex] Cost: 100%
Is this because the index is a clustered index? Am I missing something?
You should avoid SELECT * - probably even if you do need all of the columns in the table (which is rare).
Also, you are doing something very dangerous here. Did you know that your end range rounds up, so you may be including data from 2013-01-01 at midnight? Try:
AND myDateTimeColumn >= '20120101'
AND myDateTimeColumn < '20130101'
(This won't change performance, but it is easier to generate and is guaranteed to be accurate no matter what the underlying data type is.)
To eliminate network delays from your analysis of query time, you could consider SQL Sentry Plan Explorer - which allows you to generate an actual plan by running the query against the server, but discards the results, so that isn't an interfering factor.
Disclaimer: I work for SQL Sentry.
The execution time of the query is going to be spent reading enough pages of the index's btree to generate the result. Defragmenting the index will put adjacent rows together, reducing the number of pages that need to be read. It can also benefit from turning a largely random io pattern into a sequential one.
If your rows are wide and you don't get many rows per page, you won't see much reduction in the number of pages read.
If your index fill factor is low, you won't get as many rows per page.
If your pages are already in cache, you won't see any sequential-versus-random IO benefit.
If you have spare CPU capacity on the machine, you may benefit from using page compression. This essentially trades more CPU for less IO.
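As a hedged illustration of that last point, page compression can be enabled as part of an index rebuild; the index and table names below are taken from the execution plan shown above:

-- Rebuild the clustered index with page compression (trades CPU for fewer page reads):
ALTER INDEX IX_MyCompoundIndex ON MyTable
REBUILD WITH (DATA_COMPRESSION = PAGE);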

MySQL takes 10 seconds to count with conditions on 100k records

SELECT COUNT(*) AS count_all, products.id AS products_id
FROM `products`
INNER JOIN `product_device_facilities`
ON `product_device_facilities`.`product_id` = `products`.`id`
INNER JOIN `product_vendors`
ON `product_vendors`.`ProductId` = `products`.`id`
INNER JOIN `screen_shots`
ON `screen_shots`.`ProductVendorId` = `product_vendors`.`id`
WHERE ( (DownloadCount >= 10 or DownloadCount is NULL)
and (Minsdk <= 10 or Minsdk is null))
GROUP BY products.id
HAVING GROUP_CONCAT(device_facility_id ORDER BY device_facility_id ASC ) IN (0)
This is taking 10 seconds for 100k records.
How to improve the performance?
There are a few things that you can try.
Use persistent connections to the database to avoid connection overhead
Check that all of your key tables have primary keys, e.g. (product_id)
Use less RAM per row by declaring columns only as large as they need to be to hold the values stored in them. Also, as #manurajhada said, don't use COUNT(*), use COUNT(primary key)
Using simpler permissions when you issue GRANT statements enables MySQL to reduce permission-checking overhead.
Use indexes on the columns that reference other tables. Just remember not to index too many columns; a simple rule of thumb is that if you never refer to a column in comparisons, there's no need to index it.
Try using ANALYZE TABLE to help MySQL better optimize the query.
You can speed up a query a tiny bit by making sure all columns that never contain NULL are declared NOT NULL; this speeds up table traversal a bit.
Tune MySQL caching: allocate enough memory for the buffer (e.g. SET GLOBAL query_cache_size = 1000000) and define query_cache_min_res_unit depending on average query result set size.
I know it sounds counter intuitive but sometimes it is worth de-normalising tables i.e. duplicate some data in several tables to avoid JOINs which are expensive. You can support data integrity with foreign keys or triggers.
And if all else fails:
Upgrade your hardware if you can. More RAM and a faster HDD can make a significant difference to the speed of the database, and once you have done that, allocate more memory to MySQL.
EDIT
Another option, if you do not require the results live: as #ask-bjorn-hansen suggested, you could use a background task (cron job) once a day and store the result of the query in a separate table. In your application, all you then have to do is check that table for the result. That way you avoid querying 100k rows on demand, and you could even run queries that take hours without overly impacting your users.
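A rough sketch of that approach, using the query from the question; the cache table name and its refreshed_at column are made up for illustration:

CREATE TABLE product_count_cache (
    products_id  INT NOT NULL PRIMARY KEY,
    count_all    INT NOT NULL,
    refreshed_at DATETIME NOT NULL
);

-- Run from cron once a day: rebuild the cache from the slow query.
TRUNCATE TABLE product_count_cache;
INSERT INTO product_count_cache (products_id, count_all, refreshed_at)
SELECT products.id, COUNT(*), NOW()
FROM products
INNER JOIN product_device_facilities ON product_device_facilities.product_id = products.id
INNER JOIN product_vendors ON product_vendors.ProductId = products.id
INNER JOIN screen_shots ON screen_shots.ProductVendorId = product_vendors.id
WHERE (DownloadCount >= 10 OR DownloadCount IS NULL)
  AND (Minsdk <= 10 OR Minsdk IS NULL)
GROUP BY products.id
HAVING GROUP_CONCAT(device_facility_id ORDER BY device_facility_id ASC) IN (0);

-- The application then reads the cached result:
SELECT products_id, count_all FROM product_count_cache;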
Add indexes on the join columns of your tables, and instead of COUNT(*) use COUNT(some indexed primary key column).
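A minimal sketch of those indexes, using the join columns from the query above (the index names are made up):

ALTER TABLE product_device_facilities ADD INDEX idx_pdf_product_id (product_id);
ALTER TABLE product_vendors ADD INDEX idx_pv_product_id (ProductId);
ALTER TABLE screen_shots ADD INDEX idx_ss_product_vendor_id (ProductVendorId);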
Are Minsdk and DownloadCount in the same table? If so, adding an index on those two might help.
It could be that it's just a hard/impossible query to do quickly. Without seeing your full schema and the data it's hard to be specific, but it's possible that splitting it up into several easier to execute queries would be faster. Or as Amadeus suggested maybe denormalize the data a bit.
Another variation would be to just live with it taking 10 seconds, but make sure it's always done periodically in the background (with cron or similar) and never while a user is waiting. Then take the time to fix it if/when it takes minutes instead of seconds or otherwise puts an unacceptable burden on your user experience or servers.

What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is, Is there anyway to optimize the table or query so you can perform these queries in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this natively from version 5.1 (see the sketch after this list).
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
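As a rough illustration of the partitioning suggestion, a range-partitioned stats table might look like the sketch below; the table, columns, and partition boundaries are made up and would need to match your actual schema and retention window:

CREATE TABLE jvm_stats (
    server_id    INT NOT NULL,
    collected_at DATETIME NOT NULL,
    heap_used    BIGINT NOT NULL,
    thread_count INT NOT NULL,
    PRIMARY KEY (server_id, collected_at)
)
PARTITION BY RANGE (TO_DAYS(collected_at)) (
    PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);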
3 suggestions:
index
index
index
P.S. For timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (seconds since 1970 or whatever).
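A hedged sketch of that idea, with a made-up table; UNIX_TIMESTAMP() and FROM_UNIXTIME() do the conversion at the boundaries:

CREATE TABLE jvm_samples (
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    collected_at INT UNSIGNED NOT NULL,  -- seconds since 1970
    heap_used    BIGINT NOT NULL,
    KEY idx_collected_at (collected_at)
);

INSERT INTO jvm_samples (collected_at, heap_used)
VALUES (UNIX_TIMESTAMP('2012-05-10 00:00:00'), 123456789);

SELECT FROM_UNIXTIME(collected_at) AS collected_at, heap_used
FROM jvm_samples
WHERE collected_at >= UNIX_TIMESTAMP('2012-05-01');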
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features, but be warned that they contain a lot of bugs.
First you should use indexes.
If that is not enough, you can try splitting the tables using partitioning.
If that also doesn't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data: for instance, pre-compute totals by hour, by user, or by week, and store them in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, good for you!
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
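For example, with a range-partitioned table like the sketch earlier (made-up names), purging a whole month becomes a quick metadata operation rather than a huge DELETE:

-- Drops a month of old data almost instantly, instead of DELETE ... WHERE collected_at < ...
ALTER TABLE jvm_stats DROP PARTITION p2012_01;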
Now, if you use MySQL, I strongly suggest MyISAM tables. You don't get crash-proofness or transactions, and locking is dumb, but the tables are much smaller than with InnoDB, which means they can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, hash aggregates, or much else that is useful here; basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases if random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL   Postgres
58 s    100 s     INSERT
75 s    51 s      CREATE INDEX on (category, id) (useless)
9.3 s   5 s       SELECT category, SUM(counter) FROM t GROUP BY category;
1.7 s   0.5 s     SELECT category, SUM(counter) FROM t WHERE id > 15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama' -- wrong
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
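For example, for the state lookup shown earlier (assuming the same user table):

-- Index the column used in the WHERE clause:
ALTER TABLE user ADD INDEX idx_state (state);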
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets rapidly worse as your data grows. The problem is that MySQL has to perform the RAND() operation (which takes processing power) for every single row in the table before sorting them and giving you just one row.
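One common alternative, sketched below under the assumption of a numeric auto-increment id without large gaps: pick a random point in the id range first, then fetch a single row at or above it.

-- Instead of: SELECT * FROM user ORDER BY RAND() LIMIT 1;
SELECT u.*
FROM user AS u
JOIN (SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS rand_id FROM user) AS r
  ON u.id >= r.rand_id
ORDER BY u.id
LIMIT 1;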
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
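A small sketch of that conversion; the table and column names are made up, and this covers IPv4 only:

CREATE TABLE visits (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip INT UNSIGNED NOT NULL  -- 4 bytes instead of a VARCHAR(15)
);

INSERT INTO visits (ip) VALUES (INET_ATON('192.168.0.1'));

SELECT INET_NTOA(ip) AS ip FROM visits;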