MySQL: Optimizing a simple query

Needless to say, I am not proficient at SQL. I have to run a query on a table that looks like this:
id, tp_id, value_1, value_2, value_3, date
This table has 2 entries for each distinct tp_id, with different values. tp_id is an indexed foreign key referencing the following table:
id, external_id
I'm trying to retrieve data as follows :
Get each distinct tp_id where value_2 = 2, value_1 is 1 or 2, value_3 = 1, and date is more than one year old. These conditions must hold true for BOTH entries sharing that tp_id.
I have tried the following query, but as I understand it the SUM function paired with the JOIN makes the query too slow:
SELECT t1.tp_id, t2.external_id
FROM table_1 t1
JOIN table_2 t2 ON t1.tp_id = t2.id
GROUP BY t1.tp_id
HAVING SUM(
    t1.value_2 = 2
    AND t1.value_1 IN (1, 2)
    AND t1.value_3 = 1
    AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
) = 2;
Both tables have roughly 2.5M rows.
I'd like to optimize this query or learn a better way to do this, so any help would be welcome.
Thanks in advance
EDIT: It appears running this query will be altogether unnecessary, so I will close the question. Thanks for the answers.

If I got your requirement correctly, something like this might help.
SELECT tp_id
FROM (
    SELECT t1.tp_id, COUNT(*) AS count
    FROM table_1 t1
    WHERE t1.value_2 = 2
      AND (t1.value_1 = 1 OR t1.value_1 = 2)
      AND t1.value_3 = 1
      AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
    GROUP BY tp_id
) AS res
WHERE res.count = 2
Essentially, I made three performance changes:
the WHERE condition is applied before the GROUP BY, which is far cheaper than filtering everything in HAVING
I used a nested query, but depending on your MySQL version you can also use HAVING COUNT(tp_id) = 2 directly (sketched below)
two explicit boolean checks may be marginally faster than an IN clause
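For completeness, a minimal sketch of that single-level HAVING variant (same filters as above; untested against your data):
SELECT t1.tp_id
FROM table_1 t1
WHERE t1.value_2 = 2
  AND (t1.value_1 = 1 OR t1.value_1 = 2)
  AND t1.value_3 = 1
  AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY t1.tp_id
HAVING COUNT(tp_id) = 2;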

Related

MySQL SUM not working correctly after JOIN

I have 2 tables that look like the following:
TABLE 1 (table1): user_id | date
TABLE 2 (hours):  accountID | date | hours
And I'm trying to add up the hours by the week. If I use the following statement I get the correct results:
SELECT SUM(hours) AS totalHours
FROM hours
WHERE accountID = 244
  AND date >= '2014-02-02' AND date < '2014-02-09'
GROUP BY accountID
But when I join the two tables I get a number like 336640 when it should be 12
SELECT SUM(hours) AS totalHours
FROM hours
JOIN table1 ON user_id = accountID
WHERE accountID = 244
  AND date >= '2014-02-02' AND date < '2014-02-09'
GROUP BY accountID
Does anyone know why this is?
EDIT: Turns out I just needed to add DISTINCT, thanks!
JOIN operations usually generate more rows in the result table: the join produces one row for every pair of rows from the two tables that meets the ON condition. If multiple rows in table1 match each row in hours, the join repeats hours.accountID and hours.hours many times, so adding up the hours yields an inflated result.
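To see that fan-out directly, one rough sketch is to count the joined rows alongside the sum (column names taken from the question; joined_rows will be well above the number of rows in hours for that week):
SELECT h.accountID,
       COUNT(*)     AS joined_rows,   -- total pairs produced by the join
       SUM(h.hours) AS inflated_sum
FROM hours h
JOIN table1 t ON t.user_id = h.accountID
WHERE h.accountID = 244
  AND h.date >= '2014-02-02' AND h.date < '2014-02-09'
GROUP BY h.accountID;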
The reason is that the table you are joining to matches multiple rows in the first table. These all get added together.
The solution is to do the aggregation in a subquery before doing the join:
select totalhours
from (SELECT SUM(hours) as totalHours
FROM hours
WHERE accountID = 244 AND
date >= '2014-02-02' and date < '2014-02-09'
GROUP BY accountID
) h join
table1 t1
on t1.user_id = h.accountID;
I suspect your actual query is more complicated. For instance, table1 is not referenced in this query so the join is only doing filtering/duplication of rows. And the aggregation on hours is irrelevant when you are choosing only one account.
You should probably be specifying LEFT JOIN to be sure that it won't eliminate rows that don't match.
Also, date BETWEEN ? AND ? can be used instead of date >= ? AND date < ?, but remember that BETWEEN is inclusive on both ends, so the upper bound needs adjusting (a half-open range is safer if the column ever carries a time part).

MySQL join date columns with 1-month lag and performance issues

Note: I found this similar question but it does not address my issue, so I do not believe this is a duplicate.
I have two simple MySQL tables (created with the MyISAM engine), Table1 and Table2.
Both of the tables have 3 columns, a date-type column, an integer ID column, and a float value column. Both tables have about 3 million records and are very straightforward.
The contents of the tables looks like this (with Date and Id as primary keys):
Date Id Var1
2012-1-27 1 0.1
2012-1-27 2 0.5
2012-2-28 1 0.6
2012-2-28 2 0.7
(assume Var1 becomes Var2 for the second table).
Note that for each (year, month, ID) triplet, there will only be a single entry. But the actual day of the month that appears is not necessarily the final day, nor is it the final weekday, nor is it the final business day, etc... It's just some day of the month. This day is important as an observation day in other tables, but the day-of-month itself doesn't matter between Table1 and Table2.
Because of this, I cannot rely on Date + INTERVAL 1 MONTH to produce the matching day-of-month for the date it should match to that is one month ahead.
I'm looking to join the two tables on Date and Id, but where the values from the second table (Var2) come from one month ahead of Var1.
This sort of code will accomplish it, but I am noticing a significant performance degradation with this, explained below.
-- This is exceptionally slow for me
SELECT b.Date,
b.Id,
a.Var1,
b.Var2
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date)
-- This returns quickly, but if I use it as a sub-query
-- then the parent query is very slow.
SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1
-- That is, the above is fast, but this is super slow:
select b.Date,
b.Id,
a.Var1,
b.Var2
FROM (SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1) a
JOIN Table2 b
ON YEAR(a.FutureDate) = YEAR(b.Date)
AND MONTH(a.FutureDate) = MONTH(b.Date)
AND a.Id = b.Id
I've tried re-ordering the JOIN criteria, thinking maybe that matching on Id first in the code would change the query execution plan, but it seems to make no difference.
When I say "super slow", I mean that option #1 from the code above doesn't return the results for all 3 million records even if I wait for over an hour. Option #2 returns in less than 10 minutes, but then option number three takes longer than 1 hour again.
I don't understand why the introduction of the date lag makes it take so long.
How can I
1. profile the queries to understand why they take so long?
2. write a better query for joining tables based on a 1-month date lag (where the day-of-month produced by the lag may not match)?
Here is an alternative approach:
SELECT b.Date,
       b.Id,
       (SELECT a.Var1
        FROM Table1 a
        WHERE a.Id = b.Id AND a.Date < b.Date
        ORDER BY a.Date DESC   -- DESC so the closest preceding date wins
        LIMIT 1
       ) AS Var1,
       b.Var2
FROM Table2 b;
Be sure the primary index is set up with id first and then date on Table1. Otherwise, create another index Table1(id, date).
Note that this assumes that the preceding date is for the preceding month.
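If that index isn't already in place, a minimal sketch of it (the index name is just illustrative):
CREATE INDEX idx_table1_id_date ON Table1 (Id, Date);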
Here's another alternative way to go about this:
SELECT thismonth.Date,
thismonth.Id,
thismonth.Var1 AS Var1_thismonth,
lastmonth.Var1 AS Var1_lastmonth
FROM Table2 AS thismonth
JOIN
(SELECT id, Var1,
DATE(DATE_FORMAT(Date,'%Y-%m-01')) as MonthStart
FROM Table2
) AS lastmonth
ON ( thismonth.id = lastmonth.id
AND thismonth.Date >= lastmonth.MonthStart + INTERVAL 1 MONTH
AND thismonth.Date < lastmonth.MonthStart + INTERVAL 2 MONTH
)
To get this to perform ideally, I think you're going to need a compound covering index on (id, Date, Var1).
It works by generating a derived table containing Id,MonthStart,Var1 and then joining the original table to it by a sequence of range scans. Hence the compound covering index.
The other answers gave very useful tips, but ultimately, without making significant modifications to the index structure of my data (which is not feasible at the moment), those methods would not work faster (in any meaningful sense) than what I had already tried in the question.
Ollie Jones gave me the idea to use date formatting, and coupling that with the TIMESTAMPDIFF function seems to make it passably fast, though I still welcome any comments explaining why the use of YEAR, MONTH, DATE_FORMAT, and TIMESTAMPDIFF have such wildly different performance properties.
SELECT b.Date,
b.Id,
b.Var2,
a.Date,
a.Id,
a.Var1
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND (TIMESTAMPDIFF(MONTH,
DATE_FORMAT(a.Date, '%Y-%m-01'),
DATE_FORMAT(b.Date, '%Y-%m-01')) = 1)
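As for the profiling part of the question, one modest starting point (a sketch, not a full answer) is to compare the execution plans of the slow and fast variants with EXPLAIN; a plan showing no usable key for the date condition is the usual symptom, since wrapping columns in YEAR()/MONTH() prevents index use on that side of the join:
EXPLAIN
SELECT b.Date, b.Id, a.Var1, b.Var2
FROM Table1 a
JOIN Table2 b
  ON a.Id = b.Id
 AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
 AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date);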

SELECT * FROM table while condition=true?

I want to select rows from a table while a condition is true,
SELECT * FROM (SELECT * FROM`table1` `t1` ORDER BY t1.date) `t2` WHILE t2.id!=5
When the condition becomes false, it should stop selecting any further rows.
Please help me; I have searched a lot and found many similar questions on Stack Overflow, but I can't get it to work.
Please don't just tell me about WHERE; I want a solution in SQL, not in PHP or anything else.
OK, the real problem is here:
SELECT *,
       (SELECT SUM(t2.amount)
        FROM (SELECT * FROM transaction AS t1 ORDER BY t1.date) `t2`
       ) AS total_per_transition
FROM transaction
Here I want to calculate the running balance for each transaction.
First find the first date where the condition fails, so where id=5:
SELECT date
FROM table1
WHERE id = 5
ORDER BY date
LIMIT 1
Then make the above a derived table (we call it lim) and join it to the original table to get all rows with previous dates: t.date < lim.date
SELECT t.*
FROM table1 AS t
JOIN
( SELECT date
FROM table1
WHERE id = 5
ORDER BY date
LIMIT 1
) AS lim
ON t.date < COALESCE(lim.date, '9999-12-31') ;
The COALESCE() is for the case when there are no rows at all with id=5 - and in that case we want all rows from the table.
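If the underlying goal from the edit is the running balance itself, one hedged sketch (assuming transaction has an amount column and date orders the rows) is a correlated subquery per row:
SELECT t.*,
       (SELECT SUM(t2.amount)
        FROM transaction t2
        WHERE t2.date <= t.date) AS total_per_transition
FROM transaction t
ORDER BY t.date;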

MySQL Redundant indexes alternative?

I'm new to MySQL, and I'm running this query,
SELECT item_id,amount FROM db.invoice_line WHERE item_id = 'xxx'
OR item_id = 'yyy'
...
AND invoice_id IN
(SELECT id_invoices FROM db.invoices
WHERE customer = 'zzzz'
AND transaction_date > DATE_SUB(NOW(), INTERVAL 6 MONTH)
AND sales_rep = 'aaa') ORDER BY item_id;
That is, select some columns from a table where a foreign key is found in another table.
The issue is that I would like to also have, in the results, the customer name. However, the customer name is not found in the invoice line table, it is found in the invoice table.
While I could naively duplicate the customer data into the invoice line table at creation time and on every insert, I was wondering whether there is a SQL way to select the proper row from the invoice table and include it in the result set.
Is the performance better if I just duplicate data?
Thanks,
Dane
How about something like this?
SELECT
invoice_line.item_id,
invoice_line.amount,
invoices.customer_name
FROM db.invoice_line
INNER JOIN db.invoices
ON invoice_line.invoice_id = invoices.id_invoices
WHERE invoices.customer = 'zzzz'
AND invoices.transaction_date > DATE_SUB(CURRENT_DATE, INTERVAL 6 MONTH)
AND invoices.sales_rep = 'aaa'
AND (invoice_line.item_id = 'xxx' OR invoice_line.item_id = 'yyy')
ORDER BY invoice_line.item_id;
Use a JOIN between the tables to achieve your result.

MySQL GROUP BY DateTime +/- 3 seconds

Suppose I have a table with 3 columns:
id (PK, int)
timestamp (datetime)
title (text)
I have the following records:
1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title
I need to GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.
There is a similar question here: Mysql DateTime group by 15 mins
I also found this: http://www.artfulsoftware.com/infotree/queries.php#106
I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
Does this make sense? Please let me know if further clarification is needed.
EDIT: For the set of numbers {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each grouping depends on whether a row is within 3 seconds of the previous one. I know this changes things a bit; I'm sorry for being wishy-washy on this.
I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.
I'm using Tom H.'s excellent idea but doing it a little differently here:
Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and find the rows that match those times.
Query #1 here should tell you which times are the beginnings of chains by finding which times do not have any times below them but within 3 seconds:
SELECT DISTINCT a.Timestamp
FROM Table a
LEFT JOIN Table b
ON (b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL
And then for each row, we can find the largest chain-starting timestamp that is less than our timestamp with Query #2:
SELECT Table.id, MAX(StartOfChains.TimeStamp) AS ChainStartTime
FROM Table
JOIN ([query #1]) StartOfChains
ON Table.Timestamp >= StartOfChains.TimeStamp
GROUP BY Table.id
Once we have that, we can GROUP BY it as you wanted.
SELECT COUNT(*) --or whatever
FROM Table
JOIN ([query #2]) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
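Assembled into a single statement, it might look like this (a sketch only; `Table` is the placeholder name from above, backticked because TABLE is a reserved word):
SELECT GroupingQuery.ChainStartTime, COUNT(*) AS rows_in_group
FROM `Table`
JOIN (
    SELECT `Table`.id, MAX(StartOfChains.Timestamp) AS ChainStartTime
    FROM `Table`
    JOIN (
        SELECT DISTINCT a.Timestamp
        FROM `Table` a
        LEFT JOIN `Table` b
          ON b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
         AND b.Timestamp < a.Timestamp
        WHERE b.Timestamp IS NULL
    ) StartOfChains ON `Table`.Timestamp >= StartOfChains.Timestamp
    GROUP BY `Table`.id
) GroupingQuery ON `Table`.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime;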
I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!
Now that I think that I understand your problem, based on your comment response to OMG Ponies, I think that I have a set-based solution. The idea is to first find the start of any chains based on the title. The start of a chain is going to be defined as any row where there is no match within three seconds prior to that row:
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
MT2.my_id IS NULL
Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs, you might want to throw the above results into a temporary table, as that would save you the multiple joins to the same subquery below.
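For reference, a minimal sketch of that temporary table (the table name is illustrative); the two copies of the subquery (SQ1 and SQ2) in the full query below could then both simply read FROM it:
CREATE TEMPORARY TABLE chain_starters AS
SELECT MT1.my_id, MT1.title, MT1.my_time
FROM My_Table MT1
LEFT OUTER JOIN My_Table MT2
  ON MT2.title = MT1.title
 AND (MT2.my_time < MT1.my_time
      OR (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id))
 AND MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE MT2.my_id IS NULL;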
SELECT
SQ1.my_id,
COUNT(*) -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
MT3.title = SQ1.title AND
MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
MT2.my_id IS NULL
) SQ2 ON
SQ2.title = SQ1.title AND
SQ2.my_time > SQ1.my_time AND
SQ2.my_time <= MT3.my_time
WHERE
SQ2.my_id IS NULL
This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.
Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.
EDIT: Changed the queries to handle exact matches by timestamp.
Warning: long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it stops doing anything, since we can't do recursive CTEs in MySQL.
I'm going to use this data as the example instead of yours:
id Timestamp
1 1:00:00
2 1:00:03
3 1:00:06
4 1:00:10
Here is the first query to write:
SELECT a.id as aid, b.id as bid
FROM Table a
JOIN Table b
ON (ABS(TIMESTAMPDIFF(SECOND, a.Timestamp, b.Timestamp)) <= 3)  -- i.e. a.Timestamp is within 3 seconds of b.Timestamp
It returns:
aid bid
1 1
1 2
2 1
2 2
2 3
3 2
3 3
4 4
Let's create a nice table to hold those things that won't allow duplicates:
CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) --important for later
)
Now the challenge is to find something like the transitive closure of that relation.
To do so, let's find the next level of links. By that I mean: since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3:
INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)
This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
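One rough way to automate that repetition is a small stored procedure that loops until the insert stops adding rows (the procedure name is made up; run it from the mysql client so the DELIMITER lines work):
DELIMITER //
CREATE PROCEDURE close_adjacency()
BEGIN
  REPEAT
    INSERT IGNORE INTO Adjacency(aid, bid)
    SELECT adj1.aid, adj2.bid
    FROM Adjacency adj1
    JOIN Adjacency adj2 ON (adj1.bid = adj2.aid);
  UNTIL ROW_COUNT() = 0 END REPEAT;   -- ROW_COUNT() is 0 once nothing new was inserted
END //
DELIMITER ;
CALL close_adjacency();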
Once this is over, you will have a transitively-closed relation like this:
aid bid
1 1
1 2
1 3 --added
2 1
2 2
2 3
3 1 --added
3 2
3 3
4 4
And now for the punchline:
SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid
returns:
aid Neighbors
1 1,2,3
2 1,2,3
3 1,2,3
4 4
So
SELECT DISTINCT Neighbors
FROM (
SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid
) Groupings
returns
Neighbors
1,2,3
4
Whew!
I like #Chris Cunningham's answer, but here's another take on it.
First, my understanding of your problem statement (correct me if I'm wrong):
You want to look at your event log as a sequence, ordered by the time of the event,
and partition it into groups, defining the boundary as being an interval of
more than 3 seconds between two adjacent rows in the sequence.
I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.
So, first our event log table:
--
-- our event log table
--
create table dbo.eventLog
(
id int not null ,
dtLogged datetime not null ,
title varchar(200) not null ,
primary key nonclustered ( id ) ,
unique clustered ( dtLogged , id )
)
Given the above understanding of the problem statement, the following query should give you the upper and lower bounds your groups. It's a simple, nested select statement with 2 group by to collapse things:
The innermost select defines the upper bound of each group. That upper boundary defines a group.
The outer select defines the lower bound of each group.
Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.
[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]
select dtFrom = min( t.dtFrom ) ,
dtThru = t.dtThru
from ( select dtFrom = t1.dtLogged ,
dtThru = min( t2.dtLogged )
from dbo.EventLog t1
left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
and datediff(second,t1.dtLogged,t2.dtLogged) > 3
group by t1.dtLogged
) t
group by t.dtThru
You could then pull rows from the event log and tag them with the group to which they belong thus:
select *
from ( select dtFrom = min( t.dtFrom ) ,
dtThru = t.dtThru
from ( select dtFrom = t1.dtLogged ,
dtThru = min( t2.dtLogged )
from dbo.EventLog t1
left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
and datediff(second,t1.dtLogged,t2.dtLogged) > 3
group by t1.dtLogged
) t
group by t.dtThru
) period
join dbo.EventLog t on t.dtLogged >= period.dtFrom
and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by period.dtFrom , period.dtThru , t.dtLogged
Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.
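If you do want that integral group number, here is a sketch (still SQL Server syntax, reusing the same derived table; groupNumber is an illustrative alias):
select dense_rank() over ( order by period.dtFrom ) as groupNumber ,
       t.*
from ( select dtFrom = min( t.dtFrom ) ,
              dtThru = t.dtThru
       from ( select dtFrom = t1.dtLogged ,
                     dtThru = min( t2.dtLogged )
              from dbo.EventLog t1
              left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                       and datediff(second,t1.dtLogged,t2.dtLogged) > 3
              group by t1.dtLogged
            ) t
       group by t.dtThru
     ) period
join dbo.EventLog t on t.dtLogged >= period.dtFrom
                   and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by groupNumber , t.dtLogged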
Simple query:
SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);