I am using MySQL 5.1 on a Windows Server 2008 (with 4GB RAM) and have the following configuration:
I have 2 MyISAM tables. One is in 1 database (DB1) and has 14 columns, which are mostly varchar. This table has about 5,000,000 rows and is the DB1.games table below. It has a primary key on GameNumber (int(10)).
The other table is DB2.gameposition and consists of the columns GameNumber (links to DB1.games) and PositionCode (int(10)). This table has about 400,000,000 rows and there is an index IX_PositionCode on PositionCode.
These 2 databases are on the same server.
I want to run a query on DB2.gameposition to find a particular PositionCode, and have the results sorted by the linking DB1.games.Yr field (smallint(6) - this represents a Year). This sorting of results I only introduced recently. There is an index on this Yr field in DB1.games.
In my stored procedure, I perform the following:
CREATE TEMPORARY TABLE tblGameNumbers(GameNumber INT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO tblGameNumbers(GameNumber)
SELECT DISTINCT gp.GameNumber
FROM DB2.gameposition gp
WHERE PositionCode = var_PositionCode LIMIT 1000;
I limit this to 1000 rows to make it quicker, and then join the temporary table to the DB1.games table.
In order to generate an EXPLAIN from that, I took out the temporary table (I use in the stored procedure) and refactored it as seen in the inner subquery below:
EXPLAIN
SELECT *
FROM DB1.games g
INNER JOIN (SELECT DISTINCT gp.GameNumber
FROM DB2.gameposition gp
WHERE PositionCode = 669312116 LIMIT 1000
) B ON g.GameNumber = B.GameNumber
ORDER BY g.Yr DESC
LIMIT 0,28
Running the EXPLAIN above, I see the following:
1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 1000, 'Using temporary; Using filesort'
1, 'PRIMARY', 'g', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'B.GameNumber', 1, ''
2, 'DERIVED', 'gp', 'ref', 'IX_PositionCode', 'IX_PositionCode', '4', '', 1889846, 'Using temporary'
The query was virtually instantaneous before I introduced the ORDER BY clause. Now it is sometimes quick (depending on the PositionCode), but at other times it can take up to 10 seconds to return the rows. Unfortunately, I am not too proficient at interpreting the EXPLAIN output, or at working out how to make the query faster.
Any help would be greatly appreciated!
Thanks in advance,
Tim
Without the order by, your limit means the first 28 results are returned and then the query stops. With order by, all results need to be retrieved so they can be sorted and the first 28 returned.
The EXPLAIN shows what MySQL is doing:
sort 5000000 games records by yr
for each games record from sorted list
get the games record by primary key (to get all the columns)
read gamepositions by position code
if it does not match gamenumber, discard it
when 1000 matches found, stop reading
end read
end for
Try this instead:
select distinct ... from gameposition gp
inner join games g on g.gamenumber = gp.gamenumber
where gp.positioncode = ...
order by g.yr limit ...
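Filled in with the names and values from the question, the suggestion above might look like the sketch below. It assumes that dropping the inner LIMIT 1000 is acceptable, which changes the result: the 28 rows are then the most recent games among all matches rather than among an arbitrary 1000.

```sql
-- Sketch only: join first, then sort and limit once.
-- DISTINCT removes duplicates when the same position occurs
-- more than once in a single game.
SELECT DISTINCT g.*
FROM DB2.gameposition AS gp
INNER JOIN DB1.games AS g
        ON g.GameNumber = gp.GameNumber
WHERE gp.PositionCode = 669312116
ORDER BY g.Yr DESC
LIMIT 0, 28;
```

Running EXPLAIN on this form and comparing it with the output in the question should show whether the derived table and its temporary table are gone.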
I have a MySQL table of around 150,000 rows, and a good half of them have a blob (image) stored in a longblob field. I'm trying to create a query to select rows and include a field that simply indicates whether the longblob (image) exists. Basically
select ID, address, IF(house_image != '', 1, 0) AS has_image from homes where userid='1234';
That query times out after 300 seconds. If I remove the 'IF(house_image != '', 1, 0)' it completes in less than a second. I've also tried the following, but they all time out.
IF(ISNULL(house_image),0,1) as has_image
LEFT (house_image,1) AS has_image
SUBSTRING(house_image,0,1) AS has_image
I am not a DBA (obviously), but I'm suspecting that the query is selecting the entire longblob to know if it's empty or null.
Is there an efficient way to know if a field is empty?
Thanks for any assistance.
I had a similar problem a long time ago, and the workaround I ended up with was to move all blob/text columns into a separate table (bonus: this design allows multiple images per home). Once you've changed the design and moved the data around, you could do this:
select id, address, (
select 1
from home_images
where home_images.home_id = homes.id
limit 1
) as has_image -- will be 1 or null
from homes
where userid = 1234
PS: I make no guarantees. Depending on storage engine and row format, the blobs could get stored inline. If that is the case then reading the data will take much more disk IO than needed even if you're not "select"ing the blob column.
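For illustration, the separate-table design described above could be set up roughly as follows. The table and column names match the query in this answer, but the column types, the surrogate key, and the index name idx_home_id are assumptions, not something the question specifies.

```sql
-- Hypothetical schema: one row per stored image.
CREATE TABLE home_images (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    home_id INT NOT NULL,           -- assumed to match the type of homes.ID
    image   LONGBLOB NOT NULL,
    KEY idx_home_id (home_id)       -- lets the existence check use the index only
);

-- One-off copy of the existing images out of homes.
INSERT INTO home_images (home_id, image)
SELECT ID, house_image
FROM homes
WHERE house_image IS NOT NULL
  AND house_image <> '';
```

The copy itself still has to read every blob once, so it is best run during a quiet period. After verifying the data, the house_image column can be dropped from homes.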
It looks to me like you are treating the house_image column as a string when really you should be checking it for NULL.
select ID, address, IF(house_image IS NOT NULL, 1, 0) AS has_image
from homes where userid='1234';
LONGBLOBs can be indexed in MariaDB / MySQL, but the indexes are imperfect: they are so-called prefix indexes, and only consider the first bytes of the BLOB.
Try creating this compound index with a 20-byte prefix on your BLOB.
ALTER TABLE homes ADD INDEX user_image (userid, house_image(20));
Then this subquery will, efficiently, give you the IDs of rows with empty house_image columns.
SELECT ID
FROM homes
WHERE userid = '1234'
AND (house_image IS NULL OR house_image = '')
The prefix index can satisfy (house_image IS NULL OR house_image = '') directly without inspecting the BLOBs. That saves a whole mess of IO and CPU on your database server.
You can then incorporate your subquery into a main query to get your result.
SELECT h.ID, h.address,
CASE WHEN no_image.ID IS NULL THEN 1 ELSE 0 END AS has_image
FROM homes h
LEFT JOIN (
SELECT ID
FROM homes
WHERE userid = '1234'
AND (house_image IS NULL OR house_image = '')
) no_image ON h.ID = no_image.ID -- alias renamed: EMPTY is a reserved word in MySQL 8.0
WHERE h.userid = '1234'
The IS NULL ... LEFT JOIN trick means "any rows that do NOT show up in the subquery have images."
I have this query in mysql with very poor performance.
select `notifiables`.`notification_id`
from `notifiables`
where `notifiables`.`notification_type` in (2, 3, 4)
and ( ( `notifiables`.`notifiable_type` = 16
and `notifiables`.`notifiable_id` = 53642)
or ( `notifiables`.`notifiable_type` = 17
and `notifiables`.`notifiable_id` = 26358)
or ( `notifiables`.`notifiable_type` = 18
and `notifiables`.`notifiable_id` = 2654))
order by `notifiables`.`id` desc limit 20
Can this query be optimized in any way? Please help.
This table has 2M rows, and the query takes 1-4 seconds.
Updated indexes and Explain select
Possible solutions:
Turning OR into UNION (see #hongnhat)
Row constructors (see #Akina)
Adding
AND notifiable_type IN (16, 17, 18)
Index hint. I dislike this because it often does more harm than good. However, the Optimizer is erroneously picking the PRIMARY KEY(id) (because of the ORDER BY) instead of some filter which, according to the Cardinality, should be very good.
INDEX(notification_type, notifiable_type, notifiable_id, id, notification_id) -- This is "covering", which can help because the index is probably 'smaller' than the dataset. When adding this index, DROP your current INDEX(notification_type) since it distracts the Optimizer.
VIEW is very unlikely to help.
More
Give this a try: Add this to the beginning of the WHERE
WHERE notifiable_id IN ( 53642, 26358, 2654 )
AND ... (all of what you have now)
And be sure to have an INDEX starting with notifiable_id. (I don't see one currently.)
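The "OR into UNION" option from the list above could be sketched like this. Each branch fixes a different notifiable_type, so the branches cannot return the same row and UNION ALL is safe. Each branch keeps only its own newest 20 rows, and the outer query picks the overall newest 20.

```sql
-- Sketch only: one index-friendly branch per (notifiable_type, notifiable_id) pair.
SELECT u.notification_id
FROM (
    (SELECT id, notification_id
       FROM notifiables
      WHERE notification_type IN (2, 3, 4)
        AND notifiable_type = 16 AND notifiable_id = 53642
      ORDER BY id DESC LIMIT 20)
    UNION ALL
    (SELECT id, notification_id
       FROM notifiables
      WHERE notification_type IN (2, 3, 4)
        AND notifiable_type = 17 AND notifiable_id = 26358
      ORDER BY id DESC LIMIT 20)
    UNION ALL
    (SELECT id, notification_id
       FROM notifiables
      WHERE notification_type IN (2, 3, 4)
        AND notifiable_type = 18 AND notifiable_id = 2654
      ORDER BY id DESC LIMIT 20)
) AS u
ORDER BY u.id DESC
LIMIT 20;
```

This only pays off if there is an index that starts with notifiable_type and notifiable_id (or notifiable_id alone), so that each branch reads a small range.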
Use the next syntax:
SELECT notification_id
FROM notifiables
WHERE notification_type IN (2, 3, 4)
AND (notifiable_type, notifiable_id) IN ( (16, 53642), (17, 26358), (18, 2654) )
ORDER BY id DESC LIMIT 20
Create an index on (notification_type, notifiable_type, notifiable_id) or on (notifiable_type, notifiable_id, notification_type), depending on the selectivity of each condition.
Or create a covering index: (notification_type, notifiable_type, notifiable_id, notification_id) or (notifiable_type, notifiable_id, notification_type, notification_id).
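As a sketch, the second covering variant could be added as shown below. The index name idx_notifiable_cover is made up for illustration, and which column order works best depends on the actual data, so check with EXPLAIN afterwards.

```sql
-- Hypothetical index name; column order follows the second covering variant.
ALTER TABLE notifiables
    ADD INDEX idx_notifiable_cover
        (notifiable_type, notifiable_id, notification_type, notification_id);

-- Confirm which index the optimizer now picks.
EXPLAIN
SELECT notification_id
FROM notifiables
WHERE notification_type IN (2, 3, 4)
  AND (notifiable_type, notifiable_id) IN ((16, 53642), (17, 26358), (18, 2654))
ORDER BY id DESC
LIMIT 20;
```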
You can make different kinds of "VIEW" from the data you want and then join them.
The following SQL throws a timeout exception in the PRD environment.
SELECT
COUNT(*) totalCount,
SUM(IF(t.RESULT_FLAG = 'success', 1, 0)) successCount,
SUM(IF(b.ERROR_CODE = 'Y140', 1, 0)) unrecognizedCount,
SUM(IF(b.ERROR_CODE LIKE 'Y%' OR b.ERROR_CODE = 'E008', 1, 0)) connectCall,
SUM(IF(b.ERROR_CODE = 'N004', 1, 0)) hangupUnconnect,
SUM(IF(b.ERROR_CODE = 'Y001', 1, 0)) hangupConnect
FROM
lbl_his b LEFT JOIN lbl_error_code t ON b.TASK_ID = t.TASK_ID AND t.CODE = b.ERROR_CODE
WHERE
b.TASK_ID = "5f460e4ffa99f51697ad4ae3"
AND b.CREATE_TIME BETWEEN "2020-07-01 00:00:00" AND "2020-10-28 00:00:00"
The table lbl_his is very large: about 20,000,000 rows, occupying 20GB of disk.
The size of table lbl_error_code is small. Only 305 rows.
The indexes of table lbl_his:
TASK_ID
UPDATE_TIME
CREATE_TIME
RECORD_ID
The composite indexes of table lbl_his:
TASK_ID, ERROR_CODE, UPDATE_TIME
TASK_ID, CREATE_TIME
There are no indexes created for table lbl_error_code.
I ran EXPLAIN SELECT and found that the SQL uses the index on lbl_his.TASK_ID and the primary key of lbl_error_code.
How can I avoid the timeout?
For an index solution on lbl_his, try putting a non-clustered index on
firstly the things you filter on by exact match
then the things you filter on as ranges (or inexact matches)
e.g., the initial part of the index should be TASK_ID then CREATE_TIME. Putting these first is very important as it means the engine can do one seek to get the data.
Then include any other fields in use (either as part of index, or includes - doesn't matter) - in this case, ERROR_CODE. This makes your index a covering index.
Therefore your final new non-clustered index on lbl_his should be (TASK_ID, CREATE_TIME, ERROR_CODE)
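In MySQL terms, the index recommended above could be created as in this sketch. The index name idx_task_create_error is an assumption. MySQL has no INCLUDE clause, so ERROR_CODE simply goes in as the last key column.

```sql
-- Equality column first, range column second, remaining referenced column last.
ALTER TABLE lbl_his
    ADD INDEX idx_task_create_error (TASK_ID, CREATE_TIME, ERROR_CODE);
```

On a table of this size the ALTER may take a long time, so it is worth trying on a copy first and then re-running EXPLAIN on the original query to see whether the new index is chosen.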
I have a report that pulls information from a summary table and ideally will pull from two periods at once, the current period and the previous period. My table is structured thusly:
report_table
item_id INT(11)
amount Decimal(8,2)
day DATE
The primary key is item_id, day. This table currently holds 37k records with 92 different items and 1200 different days. I am using Mysql 5.1.
Here is my select statement:
SELECT r.day, sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`,
sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl`
INNER JOIN `report_item_day` AS `r`
INNER JOIN `report_item_day` AS `r2`
WHERE (r.item_id = cla.item_id)
AND (cla.location_id = cl.location_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
AND (cl.location_code = 'LOCATION')
group by month(r.day);
At present this query takes 2.2 seconds in my environment. The explain plan is:
'1', 'SIMPLE', 'cl', 'ALL', 'PRIMARY', NULL, NULL, NULL, '33', 'Using where; Using temporary; Using filesort'
'1', 'SIMPLE', 'cla', 'ref', 'PRIMARY,location_id,location_id_idxfk', 'location_id', '4', 'cl.location_id', '1', 'Using index'
'1', 'SIMPLE', 'r', 'ref', 'PRIMARY', 'PRIMARY', '4', 'cla.asset_id', '211', 'Using where'
'1', 'SIMPLE', 'r2', 'ALL', NULL, NULL, NULL, NULL, '37602', 'Using where; Using join buffer'
If I add an index to the "day" column, instead of my query running faster, it runs in 2.4 seconds. The explain plan for the query at that time is:
'1', 'SIMPLE', 'r2', 'range', 'report_day_day_idx', 'report_day_day_idx', '3', NULL, '1092', 'Using where; Using temporary; Using filesort'
'1', 'SIMPLE', 'r', 'range', 'PRIMARY,report_day_day_idx', 'report_day_day_idx', '3', NULL, '1180', 'Using where; Using join buffer'
'1', 'SIMPLE', 'cla', 'eq_ref', 'PRIMARY,location_id,location_id_idxfk', 'PRIMARY', '4', 'r.asset_id', '1', 'Using where'
'1', 'SIMPLE', 'cl', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'cla.location_id', '1', 'Using where'
According to the MySQL documentation, the most efficient GROUP BY execution is when there is an index to retrieve the grouping columns. But it also states that the only functions that can really make use of the indexes are min() and max(). Does anyone have any ideas what I can do to further optimize my query? Or why my 'indexed' version runs more slowly, despite examining fewer rows overall than the non-indexed version?
Create table:
CREATE TABLE `report_item_day` (
`item_id` int(11) NOT NULL,
`amount` decimal(8,2) DEFAULT NULL,
`day` date NOT NULL,
PRIMARY KEY (`item_id`,`day`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Of course the other option I have is to make 2 db calls, one for each time period. If I do that, straight away the query for each drops to 0.031s. Still I feel like there should be a way to optimize this query to achieve comparable results.
Three things:
1) I don't see in the WHERE clause something for r2.item_id. Without it, r2 is factored in via a Cartesian Product and will sum up other item_ids as well.
Change your original query to look like this:
SELECT r.day
,sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`
,sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl`
INNER JOIN `report_item_day` AS `r`
INNER JOIN `report_item_day` AS `r2`
WHERE (r.item_id = cla.item_id) AND (r2.item_id = cla.item_id) AND (cla.location_id = cl.location_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
AND (cl.location_code = 'LOCATION')
group by month(r.day);
See if the EXPLAIN PLAN changes after this.
2) Do this: ALTER TABLE report_item_day ADD INDEX (day, item_id);
This will index scan the day instead of the item id.
See if the EXPLAIN PLAN changes after this.
3) Last resort : Refactor the query
SELECT r.day,
sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`,
sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM
(SELECT CLA.item_id FROM client_location CL, client_location_item CLA WHERE CL.location_code = 'LOCATION' AND CLA.location_id = CL.location_id) A,
report_item_day r,
report_item_day r2
WHERE (r.item_id = A.item_id)
AND (r2.item_id = A.item_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
group by month(r.day);
This can definitely be refactored further. I just refactored it a little.
Give it a Try !!!
Why are you selecting day when you are grouping on month? I don't entirely understand what you would like the output of your query to look like.
I hate MySQL for allowing that!
I will show you two approaches to query for 2 periods in one go. The first one is a union all query. It should do what your 2-query approach already does. It will return 2 rows, one for each period.
select sum(r.amount) / (count(distinct r.item_id) * count(r.day) ) as curr_avg
from report_item_day r
join client_location_item cla using(item_id)
join client_location cl using(location_id)
where cl.location_code = 'LOCATION'
and r.day between from_unixtime(1293840000) and from_unixtime(1296518399)
union all
select sum(r.amount) / (count(distinct r.item_id) * count(r.day) ) as prev_avg
from report_item_day r
join client_location_item cla using(item_id)
join client_location cl using(location_id)
where cl.location_code = 'LOCATION'
and r.day between from_unixtime(1291161600) and from_unixtime(1293839999)
The following approach is potentially faster than the above, but it is much uglier and harder to read.
select period
,sum(amount) / (count(distinct item_id) * count(day) ) as avg_day
from (select case when r.day between from_unixtime(1293840000) and from_unixtime(1296518399) then 'Current'
when r.day between from_unixtime(1291161600) and from_unixtime(1293839999) then 'Previous'
end as period
,r.amount
,r.item_id
,r.day
from report_item_day r
join client_location_item cla using(item_id)
join client_location cl using(location_id)
where cl.location_code = 'LOCATION'
and ( r.day between from_unixtime(1293840000) and from_unixtime(1296518399)
or r.day between from_unixtime(1291161600) and from_unixtime(1293839999)
)
) v
group
by period;
Note 1: You didn't give us DDL, so I can't test if the syntax is correct
Note 2: Consider creating a calendar table, keyed by DATE. Add appropriate columns such as MONTH, WEEK, FINANCIAL_YEAR etcetera, to be able to support the reporting you are doing. The queries will be much much easier to write and understand.
First of all (and this might be just aesthetics), why aren't you using ON / USING clauses in your INNER JOIN ? Why make the JOIN on the WHERE clause instead of the actual part, in the FROM?
Second, my guess with the indexed vs non-indexed issue is that now it has to check against an index first for the records that matches said range, whereas in the non-indexed version memory goes faster than disk. But I can't be too sure.
Now, for the query. Here's part of the doc. on JOINs:
The `conditional_expr` used with ON is any conditional expression of the form
that can be used in a WHERE clause. Generally, you should use the ON clause for
conditions that specify how to join tables, and the WHERE clause to restrict
which rows you want in the result set.
So yeah, move the join conditions to the FROM clause. Also, you might be interested in the Index hint syntax: http://dev.mysql.com/doc/refman/5.0/en/index-hints.html
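As a rough sketch of what that looks like, here is the current-period half of the question's query with the join conditions moved into ON clauses. It covers only one period for brevity, and it selects MONTH(r.day) rather than r.day so that the select list matches the GROUP BY. The alias month_no is made up for illustration.

```sql
-- Join conditions in ON, row filters in WHERE.
SELECT MONTH(r.day) AS month_no,
       SUM(r.amount) / (COUNT(DISTINCT r.item_id) * COUNT(r.day)) AS current_avg_day
FROM client_location AS cl
INNER JOIN client_location_item AS cla
        ON cla.location_id = cl.location_id
INNER JOIN report_item_day AS r
        ON r.item_id = cla.item_id
WHERE cl.location_code = 'LOCATION'
  AND r.day BETWEEN FROM_UNIXTIME(1293840000) AND FROM_UNIXTIME(1296518399)
GROUP BY MONTH(r.day);
```

Written this way, a table joined without any ON condition (like r2 in the original) stands out immediately.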
And lastly, you could try using a view, but be wary of performance issues: http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance-troublemaker/
Good luck.
I'm joining two tables.
Table unique_nucleosome_re has about 600,000 records.
Another table has 20,000 records.
The strange thing is the performance and the answer from EXPLAIN is different depending
on the condition in the WHERE clause.
When it was
WHERE n.chromosome = 'X'
it took about 3 minutes.
When it was
WHERE n.chromosome = '2L'
it took more than 30 minutes and the connection was lost.
SELECT n.name , t.transcript_start , n.start
FROM unique_nucleosome_re AS n
INNER JOIN tss_tata_range AS t
ON t.chromosome = n.chromosome
WHERE (t.transcript_start > n.end AND t.ts_minus_250 < n.start )
AND n.chromosome = 'X'
ORDER BY t.transcript_start
;
This is the answer from EXPLAIN.
when the WHERE is n.chromosome = 'X'
'1', 'SIMPLE', 'n', 'ALL', 'start_idx,end_idx,chromo_start', NULL, NULL, NULL, '606096', '48.42', 'Using where; Using join buffer'
when the WHERE is n.chromosome = '2L'
'1', 'SIMPLE', 'n', 'ref', 'start_idx,end_idx,chromo_start', 'chromo_start', '17', 'const', '68109', '100.00', 'Using where'
The number of records for X or 2L are almost the same.
I have spent the last couple of days on this but couldn't figure it out. It may be a simple mistake I can't see, or it might be a bug.
Could you help me?
First, without seeing any index information, I would have an index on your tss_tata_range table on the chromosome key and transcript_start (but at a minimum on the chromosome key). I would also assume there is an index on chromosome on your unique_nucleosome_re table. Then, it appears the TSS table is your SHORT table, so I would move THAT into the FIRST position of the query and invoke use of the "STRAIGHT_JOIN" clause...
SELECT STRAIGHT_JOIN
n.name,
t.transcript_start,
n.start
FROM
tss_tata_range t,
unique_nucleosome_re n
where
t.chromosome = 'X'
and t.chromosome = n.chromosome
and t.transcript_start > n.end
and t.ts_minus_250 < n.start
order by
t.transcript_start
I'd be interested in the performance too if it works well for you...
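The index on the TSS table mentioned at the start of this answer could be added as in the sketch below. The index name idx_chromo_tstart is hypothetical, and since the question does not list the existing indexes on tss_tata_range, it is worth checking first that an equivalent one is not already there.

```sql
-- See which indexes already exist on the smaller table.
SHOW INDEX FROM tss_tata_range;

-- Hypothetical index name: chromosome for the equality match,
-- transcript_start for the range comparison and the ORDER BY.
ALTER TABLE tss_tata_range
    ADD INDEX idx_chromo_tstart (chromosome, transcript_start);
```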