I can't seem to get this query to perform any faster than 8 hours! 0_0
I have read up on indexing and I am still not sure I am doing this right.
I am expecting my query to calculate a value for BROK_1_RATING based on dates and other row values - 500,000 records.
Using record #1 as an example - my query should:
get all other records that have the same ESTIMID
ignore records where ANALYST =""
ignore records where ID is the same as record being compared i.e.
ID != 1
the records must fall within a time frame
i.e. BB.ANNDATS_CONVERTED <= working.ANNDATS_CONVERTED,
BB.REVDATS_CONVERTED > working.ANNDATS_CONVERTED
BB.IRECCD must = 1
Then count the result
Then write the count value to the BROK_1_RATING column for record #1
now do same for record#2, and #3 and so on for the entire table
In human terms - "Examine the date of record #1 - Now, within time frame from record #1 - count the number of times the number 1 exists with the same brokerage ESTIMID, do not count record #1, do not count blank ANALYST rows. Move on to record #2 and do the same"
UPDATE `working` SET `BROK_1_RATING` =
(SELECT COUNT(`ID`) FROM (SELECT `ID`, `IRECCD`, `ANALYST`, `ESTIMID`, `ANNDATS_CONVERTED`, `REVDATS_CONVERTED` FROM `working`) AS BB
WHERE
BB.`ANNDATS_CONVERTED` <= `working`.`ANNDATS_CONVERTED`
AND
BB.`REVDATS_CONVERTED` > `working`.`ANNDATS_CONVERTED`
AND
BB.`ID` != `working`.`ID`
AND
BB.`ESTIMID` = `working`.`ESTIMID`
AND
BB.`ANALYST` != ''
AND
BB.`IRECCD` = 1
)
WHERE `working`.`ANALYST` != '';
| ID | ANALYST | ESTIMID | IRECCD | ANNDATS_CONVERTED | REVDATS_CONVERTED | BROK_1_RATING | NO_TOP_RATING |
------------------------------------------------------------------------------------------------------------------
| 1 | DAVE | Brokerage000 | 4 | 1998-07-01 | 1998-07-04 | | 3 |
| 2 | DAVE | Brokerage000 | 1 | 1998-06-28 | 1998-07-10 | | 4 |
| 3 | DAVE | Brokerage000 | 5 | 1998-07-02 | 1998-07-08 | | 2 |
| 4 | DAVE | Brokerage000 | 1 | 1998-07-04 | 1998-12-04 | | 3 |
| 5 | SAM | Brokerage000 | 1 | 1998-06-14 | 1998-06-30 | | 4 |
| 6 | SAM | Brokerage000 | 1 | 1998-06-28 | 1999-08-08 | | 4 |
| 7 | | Brokerage000 | 1 | 1998-06-28 | 1999-08-08 | | 5 |
| 8 | DAVE | Brokerage111 | 2 | 1998-06-28 | 1999-08-08 | | 3 |
'EXPLAIN' results:
id| select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
----------------------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | working | index | ANALYST | PRIMARY | 4 | NULL | 467847 | Using where
2 | DEPENDENT SUBQUERY | <derived3> | ALL | NULL | NULL | NULL | NULL | 467847 | Using where
3 | DERIVED | working | index | NULL | test_combined_indexes | 226 | NULL | 467847 | Using index
I have indexes on the single columns - and as well - have tried multiple column index like this:
ALTER TABLE `working` ADD INDEX `test_combined_indexes` (`IRECCD`, `ID`, `ANALYST`, `ESTIMID`, `ANNDATS_CONVERTED`, `REVDATS_CONVERTED`) COMMENT '';
Well you can shorten the query a lot by just removing the extra stuff:
UPDATE `working` as AA SET `BROK_1_RATING` =
(SELECT COUNT(`ID`) FROM `working` AS BB
WHERE BB.`ANNDATS_CONVERTED` <= AA.`ANNDATS_CONVERTED`
AND BB.`REVDATS_CONVERTED` > AA.`ANNDATS_CONVERTED`
AND BB.`ID` != AA.`ID`
AND BB.`ESTIMID` = AA.`ESTIMID`
AND BB.`ANALYST` != ''
AND BB.`IRECCD` = 1 )
WHERE `ANALYST` != '';
Related
Update: I want to use dynamic sql to select question as column and put answer in row, like cursor or loop, is that possible?
I want the select result like this
+--------+---------------+--------------------------------------------------------------------------+
| userid | Living Status | This is another question get from row and it's longer than 64 characters |
+--------+---------------+--------------------------------------------------------------------------+
| 19 | married | q2_opt3 |
+--------+---------------+--------------------------------------------------------------------------+
And here is my query
select
userid,
min(if(question.ordering=1,o.name,NULL )) as 'Living Status',
min(if(question.ordering=2,o.name,NULL )) as 'This is another question get from row and it's longer than 64 characters'
from answer
inner join question on question.key_value = answer.key_value
inner join q_option o on question.id = o.question_id and o.value = answer.answer
where userid in (19)
GROUP BY id
The question table is like
+----+----------+---------------------------------------------------------------------------+--------------+
| id | ordering | question | key_value |
+----+----------+---------------------------------------------------------------------------+--------------+
| 1 | 1 | Living Status | livingStatus |
| 2 | 2 | This is another question get from row and it's longer than 64 characters | question_2 |
+----+----------+---------------------------------------------------------------------------+--------------+
The answer table is like
+----+--------+--------------+--------+
| id | answer | key_value | userid |
+----+--------+--------------+--------+
| 1 | 2 | livingStatus | 19 |
| 2 | 3 | question_2 | 19 |
+----+--------+--------------+--------+
The q_option table is like
+----+----------+-------------+-------+
| id | name | question_id | value |
+----+----------+-------------+-------+
| 1 | single | 1 | 1 |
| 2 | married | 1 | 2 |
| 3 | divorced | 1 | 3 |
| 4 | q2_opt1 | 2 | 1 |
| 5 | q2_opt2 | 2 | 2 |
| 6 | q2_opt3 | 2 | 3 |
+----+----------+-------------+-------+
I have a query that is doing what I want on a truncated dataset but when I run it on the full dataset (millions of rows) it takes forever to run.
I have two tables - microsat_table and coverage_table.
microsat_table:
+----+----------+-----------+---------+-------------------------------------------------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence |
+----+----------+-----------+---------+-------------------------------------------------+
| 2 | chr2L | 11050 | 11067 | TTTAATTTAATTTAATTT |
| 3 | chr2L | 44173 | 44187 | TATGTATGTATGTAT |
| 5 | chr2L | 54431 | 54477 | ATAATAATATAATATAATATAATATAATATATAATAATATAATAATA |
| 6 | chr2L | 57571 | 57594 | ATATATATATATATATATATATAT |
| 7 | chr2L | 72439 | 72453 | CATACATACATACAT |
| 8 | chr2L | 74028 | 74042 | ATACATACATACATA |
| 9 | chr2L | 85573 | 85587 | ATTTTATTTTATTTT |
| 10 | chr2L | 92429 | 92443 | ACATACATACATACA |
| 11 | chr2L | 138132 | 138166 | TATATAGATATATAAATATATATATATATATATAT |
| 13 | chr2L | 162245 | 162259 | ATACATACATACATA |
+----+----------+-----------+---------+-------------------------------------------------+
coverage_table:
| Seq_Name | Start | Stop | Coverage |
+----------+-------+-------+----------+
| chr2L | 5716 | 5771 | 1 |
| chr2L | 8730 | 8824 | 1 |
| chr2L | 9894 | 9948 | 1 |
| chr2L | 19391 | 19491 | 1 |
| chr2L | 19575 | 19675 | 1 |
| chr2L | 19773 | 19776 | 1 |
| chr2L | 19776 | 19872 | 2 |
| chr2L | 21920 | 21959 | 1 |
| chr2L | 21959 | 22020 | 2 |
| chr2L | 22020 | 22059 | 1 |
+----------+-------+-------+----------+
I want to add a column to the microsat_table which calculates the average coverage (from the coverage_table) over all rows where the Start and Stop values in the coverage table fall within the SSR_Start and SSR_End values in the microsat_table.
Example result:
+-----+----------+-----------+---------+--------------------------------+---------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence | avg |
+-----+----------+-----------+---------+--------------------------------+---------+
| 53 | chr2L | 402489 | 402503 | AAAACAAAACAAAAC | 3.0000 |
| 64 | chr2L | 447214 | 447233 | CAGCAGCAGCAGCAGCAGCA | 8.0000 |
| 66 | chr2L | 457839 | 457868 | CAGCAGCAGCAACAGCAGCAGCAGGCAGCA | 2.0000 |
| 105 | chr2L | 579589 | 579603 | TCGAATCGAATCGAA | 11.0000 |
| 123 | chr2L | 628484 | 628501 | TAATGTTAATGTTAATGT | 6.0000 |
+-----+----------+-----------+---------+--------------------------------+---------+
My query is:
UPDATE microsat_table
JOIN
(SELECT m.id, SUM(p.Coverage)/count(p.Start)
AS avg FROM microsat_table m
LEFT OUTER JOIN coverage_table p
ON m.Seq_Name LIKE p.Seq_Name
WHERE m.Seq_Name LIKE p.Seq_Name GROUP BY m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
Explain results for the truncated table:
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
| 1 | UPDATE | microsat_table_short | NULL | ALL | PRIMARY | NULL | NULL | NULL | 40356 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | testdb.microsat_table_short.id | 1236 | 100.00 | NULL |
| 2 | DERIVED | m | NULL | index | PRIMARY,Sequence,Seq_Name,Motif,SSR_Start,SSR_End | Seq_Name | 53 | NULL | 40356 | 100.00 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | p | NULL | ALL | NULL | NULL | NULL | NULL | 100163 | 1.23 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------------------+------------+-------+---------------------------------------------------+-------------+---------+--------------------------------+--------+----------+----------------------------------------------------+
I added indexes (including trying HASH and BTREE indexes) which sped it up considerably, but I've let it run for 1.5 days on the larger dataset and it still didn't finish.
Does anyone have any suggestions on how to make it run faster?
Thanks!!
There are a few relatively minor infelicities in your code. However the big problem is that while you say you want to calculate "the average coverage (from the coverage_table) over all rows where the Start and Stop values in the coverage table fall within the SSR_Start and SSR_End values in the microsat_table" you don't actually seem to limit the query to doing that. Instead you only coded a match on Seq_Name.
The code below attempts to fix that (I used >= and <= which may not be what you need) and the other more minor bits:
UPDATE microsat_table
JOIN
(
SELECT
m.id,
AVG(p.Coverage) AS avg -- MySQL has it's own average function
FROM
microsat_table m
INNER JOIN coverage_table p ON -- Change to INNER JOIN, your old WHERE clause had this effect anyway
m.Seq_Name = p.Seq_Name -- Use '=' not 'Like' when looking for an exact match
WHERE
p.Start >= m.SSR_Start -- This WHERE clause is the most important change
AND p.End <= m.SSR_End -- You omitted it in your version
GROUP BY
m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
Maybe updating the table in 1 big transaction is simply too much for the system? (what is the size of the table you're updating?) You could try doing it in blocks. I'd also go for a simple sub-select here, seems easier to read IMHO.
Also take note of Steve Lovell's remark that your query doesn't seem to care about the start/stop columns. Since you probably forgot it by accident I've added it here too, removing it shouldn't be too difficult =)
DECLARE #min_id int,
#max_id int,
#blocksize int
SELECT #min_id = MIN(id),
#max_id = MAX(id),
#blocksize = 100000 -- adapt as needed
FROM microsat_table
WHILE #min_id <= #max_id
BEGIN
UPDATE microsat_table
SET microsat_table.avg = ((SELECT SUM(p.Coverage)/count(p.Start) AS avg
FROM microsat_table m
LEFT OUTER JOIN coverage_table p
ON m.Seq_Name LIKE p.Seq_Name -- if possble use '=' here instead of LIKE
AND p.Start >= m.SSR_Start -- flagrantly "stolen" from Steve Lovell's answer
AND p.End <= m.SSR_End
WHERE m.id = microsat_table.id)
-- limit update to this block:
WHERE microsat_table.id BETWEEN #min_id AND (#min_id + #blocksize - 1)
-- prepare for next block
SELECT #min_id = #min_id + #blocksize
END
You probably want the primary key on the id field of microsat_table and on the Seq_name + Start column of the coverage_table.
translations
+---------+----------------+----------+---------+
| id_user | id_translation | referrer | id_word |
+---------+----------------+----------+---------+
| 1 | 3 | NULL | 4 |
| 1 | 17 | NULL | 3 |
| 2 | 17 | NULL | 5 |
| 2 | 17 | NULL | 1 |
| 2 | 17 | NULL | 7 |
words
+----+------+
| id | word |
+----+------+
| 4 | out |
+----+------+
users_translations
+---------+----------------+----------+---------+
| id_user | id_translation | referrer | id_word |
+---------+----------------+----------+---------+
| 1 | 17 | 1 | 4 |
| 2 | 17 | 2 | 4 |
| 3 | 18 | NULL | 4 |
I need to select all translations for current word and id_translation, but if in the row referrer = 1 (current user), then I don't need another results (translations from another users for current word), if there is no referrer = 1, show all.
SELECT DISTINCT `t`.*, `ut`.`id_user` AS tuser
FROM translations AS t
LEFT JOIN users_translations AS ut ON `t`.`id` = `ut`.`id_translation`
INNER JOIN words ON `words`.`id` = `ut`.`id_word` OR `words`.`id` = `t`.`id_word`
WHERE (`word` = 'help')
ORDER BY `t`.`translation` ASC
+----+-------------+---------+---------+-------+
| id | translation | id_word | id_user | tuser |
+----+-------------+---------+---------+-------+
| 17 | допомагати | 4 | 1 | 2 |
| 17 | допомагати | 4 | 1 | 1 |
First row doesn't need, because we have tuser = 1. When there is no tuser = 1, all results should be returned.
I don't understand how to build select statement and I will be very appreciative that somebody shows me how to make it work.
First thing that comes to mind
--add this to your where clause
id_user <=
CASE WHEN EXISTS(SELECT * FROM translations WHERE id_user = 1 AND id_word = words.id_word)
THEN 1
ELSE (SELECT MAX(Id) FROM translations)
END
I have a query to get the top 'n' users who commented on a specific keyword,
SELECT `user` , COUNT( * ) AS magnitude
FROM `results`
WHERE `keyword` = "economy"
GROUP BY `user`
ORDER BY magnitude DESC
LIMIT 5
I have approx 6000 keywords, and would like to run this query to get me the top 'n' users for each and every keyword we have data for. Assistance appreciated.
Since you haven't given the schema for results, I'll assume it's this or very similar (maybe extra columns):
create table results (
id int primary key,
user int,
foreign key (user) references <some_other_table>(id),
keyword varchar(<30>)
);
Step 1: aggregate by keyword/user as in your example query, but for all keywords:
create view user_keyword as (
select
keyword,
user,
count(*) as magnitude
from results
group by keyword, user
);
Step 2: rank each user within each keyword group (note the use of the subquery to rank the rows):
create view keyword_user_ranked as (
select
keyword,
user,
magnitude,
(select count(*)
from user_keyword
where l.keyword = keyword and magnitude >= l.magnitude
) as rank
from
user_keyword l
);
Step 3: select only the rows where the rank is less than some number:
select *
from keyword_user_ranked
where rank <= 3;
Example:
Base data used:
mysql> select * from results;
+----+------+---------+
| id | user | keyword |
+----+------+---------+
| 1 | 1 | mysql |
| 2 | 1 | mysql |
| 3 | 2 | mysql |
| 4 | 1 | query |
| 5 | 2 | query |
| 6 | 2 | query |
| 7 | 2 | query |
| 8 | 1 | table |
| 9 | 2 | table |
| 10 | 1 | table |
| 11 | 3 | table |
| 12 | 3 | mysql |
| 13 | 3 | query |
| 14 | 2 | mysql |
| 15 | 1 | mysql |
| 16 | 1 | mysql |
| 17 | 3 | query |
| 18 | 4 | mysql |
| 19 | 4 | mysql |
| 20 | 5 | mysql |
+----+------+---------+
Grouped by keyword and user:
mysql> select * from user_keyword order by keyword, magnitude desc;
+---------+------+-----------+
| keyword | user | magnitude |
+---------+------+-----------+
| mysql | 1 | 4 |
| mysql | 2 | 2 |
| mysql | 4 | 2 |
| mysql | 3 | 1 |
| mysql | 5 | 1 |
| query | 2 | 3 |
| query | 3 | 2 |
| query | 1 | 1 |
| table | 1 | 2 |
| table | 2 | 1 |
| table | 3 | 1 |
+---------+------+-----------+
Users ranked within keywords:
mysql> select * from keyword_user_ranked order by keyword, rank asc;
+---------+------+-----------+------+
| keyword | user | magnitude | rank |
+---------+------+-----------+------+
| mysql | 1 | 4 | 1 |
| mysql | 2 | 2 | 3 |
| mysql | 4 | 2 | 3 |
| mysql | 3 | 1 | 5 |
| mysql | 5 | 1 | 5 |
| query | 2 | 3 | 1 |
| query | 3 | 2 | 2 |
| query | 1 | 1 | 3 |
| table | 1 | 2 | 1 |
| table | 3 | 1 | 3 |
| table | 2 | 1 | 3 |
+---------+------+-----------+------+
Only top 2 from each keyword:
mysql> select * from keyword_user_ranked where rank <= 2 order by keyword, rank asc;
+---------+------+-----------+------+
| keyword | user | magnitude | rank |
+---------+------+-----------+------+
| mysql | 1 | 4 | 1 |
| query | 2 | 3 | 1 |
| query | 3 | 2 | 2 |
| table | 1 | 2 | 1 |
+---------+------+-----------+------+
Note that when there are ties -- see users 2 and 4 for keyword "mysql" in the examples -- all parties in the tie get the "last" rank, i.e. if the 2nd and 3rd are tied, both are assigned rank 3.
Performance: adding an index to the keyword and user columns will help. I have a table being queried in a similar way with 4000 and 1300 distinct values for the two columns (in a 600000-row table). You can add the index like this:
alter table results add index keyword_user (keyword, user);
In my case, query time dropped from about 6 seconds to about 2 seconds.
You can use a pattern like this (from Within-group quotas (Top N per group)):
SELECT tmp.ID, tmp.entrydate
FROM (
SELECT
ID, entrydate,
IF( #prev <> ID, #rownum := 1, #rownum := #rownum+1 ) AS rank,
#prev := ID
FROM test t
JOIN (SELECT #rownum := NULL, #prev := 0) AS r
ORDER BY t.ID
) AS tmp
WHERE tmp.rank <= 2
ORDER BY ID, entrydate;
+------+------------+
| ID | entrydate |
+------+------------+
| 1 | 2007-05-01 |
| 1 | 2007-05-02 |
| 2 | 2007-06-03 |
| 2 | 2007-06-04 |
| 3 | 2007-07-01 |
| 3 | 2007-07-02 |
+------+------------+
+--------------------+---------------+------+-----+---------+-------+
| ID | GKEY |GOODS | PRI | COUNTRY | Extra |
+--------------------+---------------+------+-----+---------+-------+
| 1 | BOOK-1 | 1 | 10 | | |
| 2 | PHONE-1 | 2 | 12 | | |
| 3 | BOOK-2 | 1 | 13 | | |
| 4 | BOOK-3 | 1 | 10 | | |
| 5 | PHONE-2 | 2 | 10 | | |
| 6 | PHONE-3 | 2 | 20 | | |
| 7 | BOOK-10 | 2 | 20 | | |
| 8 | BOOK-11 | 2 | 20 | | |
| 9 | BOOK-20 | 2 | 20 | | |
| 10 | BOOK-21 | 2 | 20 | | |
| 11 | PHONE-30 | 2 | 20 | | |
+--------------------+---------------+------+-----+---------+-------+
Above is my table. I want to get all records which GKEY > BOOK-2, Who can tell me the expression with mysql?
Using " WHERE GKEY>'BOOK-2' " Cannot get the correct results.
How about (something like):
(this is MSSQL - I guess it will be similar in MySQL)
select
*
from
(
select
*,
index = convert(int,replace(GKEY,'BOOK-',''))
from table
where
GKEY like 'BOOK%'
) sub
where
sub.index > 2
By way of explanation: The inner query basically recreates your table, but only for BOOK rows, and with an extra column containing the index in the right data type to make a greater than comparison work numerically.
Alternatively something like this:
select
*
from table
where
(
case
when GKEY like 'BOOK%' then
case when convert(int,replace(GKEY,'BOOK-','')) > 2 then 1
else 0
end
else 0
end
) = 1
Essentially the problem is that you need to check for BOOK before you turn the index into a numberic, as the other values of GKEY would create an error (without doing some clunky string handling).
SELECT * FROM `table` AS `t1` WHERE `t1`.`id` > (SELECT `id` FROM `table` AS `t2` WHERE `t2`.`GKEY`='BOOK-2' LIMIT 1)