How to optimize MySQL query when field values are not distinct

Suppose I have a MySQL table with an indexed field called balance. However, 95% of the rows have balance = 0. So if I were to run:
Select * from mytable where balance > 0.02
the query would take quite a while if the table had 1 million+ rows, as the B-tree index does not have a distinct set of values for balance.
In this situation, without changing the data, how would one optimize the SQL query?

First, your query is likely to be returning a lot of rows. That is going to take time.
If you only need a few, you can add limit:
Select *
from mytable
where balance > 0.02
limit 100;
Second, if you have any particularly large columns, then those could dominate the time for returning rows. If this is an issue, then select only the columns you really need.
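For example, if only an id and the balance itself were needed (hypothetical column names, since the question doesn't list the table's columns):
Select id, balance
from mytable
where balance > 0.02;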
Third, an index might help. If very few rows satisfy the where clause then an index on balance should speed the query. However, if a lot of rows match the filter condition, then you are returning a lot of data -- and that can take time.
Also, this assumes that something called mytable is really a table. If it is a view, then all bets are off. You need to optimize the view and not the query.

This is a radical approach, but if this query is very critical you could partition the table on the balance field:
EDIT: For some reason MySQL partitioning is restricted to INT values; maybe this workaround will work:
ALTER TABLE mytable
PARTITION BY RANGE( CEILING(balance) ) (
PARTITION p0 VALUES LESS THAN (1),
PARTITION p1 VALUES LESS THAN MAXVALUE
);
NOTE: This approach will only work if balance is declared as a Decimal type, not a Float type.
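To check whether the optimizer can actually prune partitions for the original filter, inspect the plan (EXPLAIN PARTITIONS is MySQL 5.6 syntax; from 5.7 on, a plain EXPLAIN includes the partitions column by default). Note that whether pruning applies to a range condition like this depends on the optimizer treating CEILING() as monotonic here:
EXPLAIN PARTITIONS
Select * from mytable where balance > 0.02;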

Related

MySQL: how to speed up an SQL query for getting data

I am using a MySQL database.
I have a table daily_price_history of stock values stored with the following fields. It has 11 million+ rows:
id
symbolName
symbolId
volume
high
low
open
datetime
close
So for each stock symbolName there are various daily stock values, and the data is now more than 11 million rows.
The following SQL tries to get the last 100 days of daily data for a set of 1500 symbols:
SELECT `daily_price_history`.`id`,
`daily_price_history`.`symbolId_id`,
`daily_price_history`.`volume`,
`daily_price_history`.`close`
FROM `daily_price_history`
WHERE (`daily_price_history`.`id` IN
(SELECT U0.`id`
FROM `daily_price_history` U0
WHERE (U0.`symbolName` = `daily_price_history`.`symbolName`
AND U0.`datetime` >= 1598471533546))
AND `daily_price_history`.`symbolName` IN ('A', 'AA', ...... 1500 symbol names))
I have the table indexed on symbolName and also datetime
For getting 130K rows of data (1500 symbols × 100 days ≈ 150K), it takes 20 seconds.
I also have weekly_price_history and monthly_price_history tables, and when I run similar SQL against them, they take less time for the same number (130K) of rows, because they hold less data than the daily table.
weekly_price_history getting 150K rows takes 3s. The total number of rows in it are 2.5million
monthly_price_history getting 150K rows takes 1s. The total number of rows in it are 800K
So how can I speed this up when the table is large?
As a starter: I don't see the point of the subquery at all. Presumably, your query could filter directly in the where clause:
select id, symbolid_id, volume, close
from daily_price_history
where datetime >= 1598471533546 and symbolname in ('A', 'AA', ...)
Then, you want an index on (datetime, symbolname):
create index idx_daily_price_history
on daily_price_history(datetime, symbolname)
;
The first column of the index matches the predicate on datetime. It is not very likely, however, that the database will be able to use the index to filter symbolname against a large list of values.
An alternative would be to put the list of values in a table, say symbolnames.
create table symbolnames (
symbolname varchar(50) primary key
);
insert into symbolnames values ('A'), ('AA'), ...;
Then you can do:
select p.id, p.symbolid_id, p.volume, p.close
from daily_price_history p
inner join symbolnames s on s.symbolname = p.symbolname
where p.datetime >= 1598471533546
That should allow the database to use the above index. We can go one step further and add the four columns of the select clause to the index, making it a covering index:
create index idx_daily_price_history_2
on daily_price_history(datetime, symbolname, id, symbolid_id, volume, close)
;
When you add INDEX(a,b), remove INDEX(a) as being no longer necessary.
Your dataset and query may be a case for using PARTITIONing.
PRIMARY KEY(symbolname, datetime)
PARTITION BY RANGE(datetime) ...
This will do "partition pruning": datetime >= 1598471533546. Then the PRIMARY KEY will do most of the rest of the work for symbolname in ('A', 'AA', ...).
Aim for about 50 partitions; the exact number does not matter. Too many partitions may hurt performance; too few won't provide effective pruning.
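A minimal sketch of how those two pieces could fit together (a hypothetical rebuild of the table; the column types and the epoch-millisecond boundary values are guesses, and only three partitions are shown instead of the ~50 suggested above):
CREATE TABLE daily_price_history (
symbolName VARCHAR(50) NOT NULL,
`datetime` BIGINT NOT NULL,
symbolId_id INT,
volume BIGINT,
`close` DECIMAL(12,4),
PRIMARY KEY (symbolName, `datetime`)
)
PARTITION BY RANGE (`datetime`) (
PARTITION p2020h1 VALUES LESS THAN (1593561600000), -- 2020-07-01
PARTITION p2020h2 VALUES LESS THAN (1609459200000), -- 2021-01-01
PARTITION pmax VALUES LESS THAN MAXVALUE
);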
Yes, get rid of the subquery as GMB suggests.
Meanwhile, it sounds like Django is getting in the way.
Some discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint

Fastest result when checking date range

A user will select a date, e.g. 06-MAR-2017, and I need to retrieve hundreds of thousands of records with dates earlier than 06-MAR-2017 (but it could vary depending on user selection).
From the above case, I am using this query:
SELECT col from table_a where DATE_FORMAT(mydate,'%Y%m%d') < '20170306'
I feel that the query is kind of slow. Is there any faster way to get date results like this?
With 100,000 records to read, the DBMS may decide to read the table record by record (full table scan), and there wouldn't be much you could do.
If on the other hand the table contains billions of records, so 100,000 would just be a small part, then the DBMS may decide to use an index instead.
In any case you should at least give the DBMS the opportunity to select via an index. This means: create an index first (if one doesn't exist yet).
You can create an index on the date column alone:
create index idx on table_a (mydate);
or even provide a covering index that contains the other columns used in the query, too:
create index idx on table_a (mydate, col);
Then write your query such that the date column is accessed directly. You have no index on DATE_FORMAT(mydate,'%Y%m%d'), so the above indexes don't help with your original query. You'd need a query that looks up the date itself:
select col from table_a where mydate < date '2017-03-06';
Whether the DBMS then uses the index or not is still up to the DBMS. It will try to use the fastest approach, which very well can still be the full table scan.
If you apply a function to the column on the left side of the comparison, MySQL cannot use an index on that column and will typically fall back to a full table scan.
The fastest method would be to have an index created on mydate, and to make the right side ('20170306') the same datatype as the column (and the index).
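A quick way to see the difference is to compare the two plans (a sketch; the exact EXPLAIN output depends on MySQL version and data distribution):
EXPLAIN select col from table_a where DATE_FORMAT(mydate,'%Y%m%d') < '20170306'; -- type: ALL, the index on mydate cannot be used
EXPLAIN select col from table_a where mydate < date '2017-03-06'; -- type: range, the index on mydate can be used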

MySQL query slow performance

I have a table with 500k rows, and every query against this specific table takes a really long time.
One of the queries is:
SELECT *
FROM player_data
WHERE `user_id` = '61120'
AND `opzak` = 'ja'
ORDER BY opzak_nummer ASC
The opzak_nummer column is a TINYINT.
Is there any way to improve the performance of this query, and of the table in general?
The table name is player_data and it includes about 25 columns, most of them integers holding stats values.
The only index is the AUTO_INCREMENT id.
Run the following statement; it will alter the table and add an index (using the question's table name, player_data). You can read more details here: http://dev.mysql.com/doc/refman/5.7/en/drop-index.html
ALTER TABLE player_data ADD INDEX index_name (user_id, opzak);
The optimal index for that query is either of these:
INDEX(user_id, opzak, opzak_nummer)
INDEX(opzak, user_id, opzak_nummer)
The first two columns do the filtering; the last avoids a tmp table and sort by consuming the ORDER BY.
Is any combination of columns 'unique' (other than id)? If so, we might be able to make it run even faster.
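Spelled out as a full statement for the first option (the index name idx_user_opzak_nr is made up):
ALTER TABLE player_data ADD INDEX idx_user_opzak_nr (user_id, opzak, opzak_nummer);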

how does MySQL know which partition to look up?

Let's analyse the simplest possible example of MySQL partitioning by hash (slightly modified version of http://dev.mysql.com/doc/refman/5.5/en/alter-table-partition-operations.html):
CREATE TABLE t1 (
id INT,
year_col INT
);
ALTER TABLE t1
PARTITION BY HASH(year_col)
PARTITIONS 8;
Let's say we put millions of records in there. The question is: if a specific query comes in (e.g. SELECT * FROM t1 WHERE year_col = 5), how does MySQL know which partition to look up? There are 8 partitions. I guess that the hash function is calculated and MySQL recognizes that it matches the partitioning key, and then MySQL knows which one that is. But what if the query is SELECT * FROM t1 WHERE year_col IN (5, 45, 5435)? How about other non-trivial queries? Is there any general algorithm for that?
This is called Partition pruning:
The optimizer can perform pruning whenever a WHERE condition can be reduced to either one of the following two cases:
partition_column = constant
partition_column IN (constant1, constant2, ..., constantN)
In the first case, the optimizer simply evaluates the partitioning expression for the value given, determines which partition contains that value, and scans only this partition. (...)
In the second case, the optimizer evaluates the partitioning expression for each value in the list, creates a list of matching partitions, and then scans only the partitions in this partition list. (...)
MySQL can apply partition pruning to SELECT, DELETE, and UPDATE statements. INSERT statements currently cannot be pruned.
Pruning can also be applied to short ranges, which the optimizer can convert into equivalent lists of values. (...)
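For PARTITION BY HASH specifically, the partition number is simply MOD(partitioning_expression, number_of_partitions), so the pruning for the queries above can be worked out by hand (shown for the 8-partition table t1; EXPLAIN PARTITIONS lets you verify):
-- year_col = 5: MOD(5, 8) = 5 -> partition p5
-- year_col = 45: MOD(45, 8) = 5 -> partition p5
-- year_col = 5435: MOD(5435, 8) = 3 -> partition p3
EXPLAIN PARTITIONS SELECT * FROM t1 WHERE year_col IN (5, 45, 5435);
-- the partitions column should list only p5 and p3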

MySQL group by query complexity analysis

Which is the complexity of the "group by" statement in MySQL?
I am managing very big tables, and I would also like to know if there is any method to estimate how much time a query is going to take.
This question is impossible to answer without knowledge of what the entire query looks like. Some group bys can be prohibitively expensive while others are very cheap; it all depends on how the indexes in the database are set up, whether the value you group by can be cached, etc.
For example, this is a very cheap group by:
CREATE TABLE t (a INT, KEY(a));
SELECT * FROM t WHERE 1 GROUP BY a;
Since a is indexed.
But for something like this, it's very expensive since it would require a table scan.
CREATE TABLE t (a INT);
SELECT * FROM t WHERE 1 GROUP BY a;
Generally, if a key is not available, the database will create a temporary table in memory for the group by clause, go through all the values, insert each value into the temporary table with an index to the corresponding row in the result set, then select from the temporary table, picking the first row from each group, and send that back as the result. Depending on whether you use aggregates per group (i.e. MAX(), GROUP_CONCAT() or similar), it may need to fetch all rows again.
You can use EXPLAIN to figure out which strategy MySQL will use. The 'Extra' column will contain (in ascending order of execution cost) 'Using index' if an index can be used, 'Using filesort' if reading all rows from disk and sorting will be necessary, and 'Using temporary' if a temporary table will be required.
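A small side-by-side of both cases from above (a sketch; exact EXPLAIN output varies between MySQL versions):
CREATE TABLE t_indexed (a INT, KEY(a));
CREATE TABLE t_plain (a INT);
EXPLAIN SELECT a FROM t_indexed GROUP BY a; -- Extra: Using index
EXPLAIN SELECT a FROM t_plain GROUP BY a; -- Extra: Using temporary (plus Using filesort on older versions)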