MySQL Group Join Optimization Issue

I'm trying to optimize this query. It returns multiple rows from building_rent_prices and building_weather, groups them, and calculates the average of a field from each. So far the tables are all under a million rows, yet the query takes several seconds. Does anyone know how I could optimize it, whether with composite indexes or by rewriting the query? I'm assuming it should be able to run in 100 ms or less, but so far it seems like it can't.
SELECT b.*
, AVG(r.rent)
, AVG(w.high_temp)
FROM buildings b
LEFT
JOIN building_rent_prices r
ON r.building_id = b.building_id
LEFT
JOIN building_weather w
ON w.building_id = b.building_id
WHERE w.date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
AND r.date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP
BY b.building_id
ORDER
BY AVG(r.rent) / b.square_feet DESC
LIMIT 10;
EXPLAIN said the following:
id  select_type  table                 type    Extra
 1  SIMPLE       building_rent_prices  range   Using where; Using index; Using temporary; Using filesort
 1  SIMPLE       buildings             eq_ref  Using where
 1  SIMPLE       building_weather      ref     Using where; Using index
I'm working on some test data; here is the CREATE TABLE DDL:
CREATE TABLE buildings (
building_id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255),
square_feet INT
);
CREATE TABLE building_weather (
building_weather_id INT PRIMARY KEY AUTO_INCREMENT,
building_id INT,
date DATE,
high_temp INT,
low_temp INT
);
CREATE TABLE building_rent_prices (
building_rent_price_id INT PRIMARY KEY AUTO_INCREMENT,
building_id INT,
date DATE,
rent DOUBLE
);
ALTER TABLE building_rent_prices ADD INDEX (building_id);
ALTER TABLE buildings ADD INDEX (building_id);
ALTER TABLE building_weather ADD INDEX (building_id);
This seems to be working in under 1 second, based on DRapp's answer, without indexes (I still need to test that it's valid):
select
B.*,
BRP.avgRent,
BW.avgTemp
from
( select building_id,
AVG( rent ) avgRent
from
building_rent_prices
where
date BETWEEN CURDATE() AND CURDATE() + 10
group by
building_id
order by
building_id ) BRP
JOIN buildings B
on BRP.building_id = B.building_id
left join ( select building_id,
AVG( high_temp ) avgTemp
from building_weather
where date BETWEEN CURDATE() AND CURDATE() + 10
group by building_id) BW
on BRP.building_id = BW.building_id
GROUP BY BRP.building_id
ORDER BY BRP.avgRent / 1 DESC
LIMIT 10;

Let's take a look at this query in detail. You want to report two different kinds of averages for each building. You need to compute those in separate subqueries; if you don't, you'll get a Cartesian combinatorial explosion.
One is an average of eleven days' worth of rent prices. You get that data with this subquery:
SELECT building_id, AVG(rent) rent
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id
This subquery can be optimized by a compound covering index on building_rent_prices, consisting of (date, building_id, rent).
The next is an average of five days' worth of temperature.
SELECT building_id, AVG(high_temp) high_temp
FROM building_weather
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
GROUP BY building_id
This can be optimized by a compound covering index on building_weather, consisting of (date, building_id, high_temp).
Finally, you need to join these two subqueries to your buildings table to generate the final result set.
SELECT buildings.*, a.rent, b.high_temp
FROM buildings
LEFT JOIN (
SELECT building_id, AVG(rent) rent
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id
) AS a ON buildings.building_id = a.building_id
LEFT JOIN (
SELECT building_id, AVG(high_temp) high_temp
FROM building_weather
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
GROUP BY building_id
) AS b ON buildings.building_id = b.building_id
ORDER BY a.rent / buildings.square_feet DESC
LIMIT 10
Once the two subqueries are optimized, this one doesn't need anything except the building_id primary key.
In summary, to speed up this query, create the two compound indexes mentioned on the building_rent_prices and building_weather queries.
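For reference, a sketch of that DDL, assuming the table and column names from the question (the index names are illustrative):
ALTER TABLE building_rent_prices ADD INDEX idx_prices_cover (date, building_id, rent);
ALTER TABLE building_weather ADD INDEX idx_weather_cover (date, building_id, high_temp);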

Don't use CURDATE() + 10; adding a plain number to a date does numeric arithmetic, not date arithmetic:
mysql> select CURDATE(), CURDATE() + 30, CURDATE() + INTERVAL 30 DAY;
+------------+----------------+-----------------------------+
| CURDATE()  | CURDATE() + 30 | CURDATE() + INTERVAL 30 DAY |
+------------+----------------+-----------------------------+
| 2015-03-15 |       20150345 | 2015-04-14                  |
+------------+----------------+-----------------------------+
Add INDEX(building_id) to the second and third tables.
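If they are missing, a minimal sketch of that DDL, assuming the table names from the question:
ALTER TABLE building_rent_prices ADD INDEX (building_id);
ALTER TABLE building_weather ADD INDEX (building_id);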
If those don't fix it, come back with a revised query and schema, and I will look deeper.

First, your query against the WEATHER table is only for 4 days, while the RENT PRICES table is for 10 days. Since you don't have any join correlation between the two, you will get a Cartesian product of about 40 records per building ID. Was that intentional, or just not identified as an oops?
Second, I would adjust the query as I have below; I have also adjusted BOTH the WEATHER and RENT PRICES sides to reflect the same date range. I start with a subquery of just the prices, grouped by building and date, then join to buildings, then another subquery of weather grouped by building and date. But here, I join from the rent prices subquery to the weather subquery on both building ID AND date, so it retains at most a 1:1 ratio. I don't know why weather is even a consideration spanning different date ranges.
However, to help with indexes, I would suggest the following:
Table                 Index on
buildings             (building_id)   <-- probably already exists as the PK
building_rent_prices  (date, building_id, rent)
building_weather      (date, building_id, high_temp)
The purpose of each index is to take advantage of the WHERE clause first (date), THEN the GROUP BY (building ID), and to be a COVERING INDEX (it includes rent, so the subquery never touches the table). The same reasoning applies to the building_weather index.
select
B.*,
BRP.avgRent,
BW.avgTemp
from
( select building_id,
AVG( rent ) avgRent
from
building_rent_prices
where
date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
group by
building_id
order by
building_id ) BRP
JOIN buildings B
on BRP.building_id = B.building_id
left join ( select building_id,
AVG( high_temp ) avgTemp
from
building_weather
where
date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
group by
building_id ) BW
on BRP.building_id = BW.building_id
GROUP BY
BRP.building_id
ORDER BY
BRP.avgRent / B.square_feet DESC
LIMIT 10;
CLARIFICATION...
I can't guarantee the execution order, but in essence, the two ( ... ) subqueries for the BRP and BW aliases would be executed quickly before any join took place. Since you wanted the average across the (in my example) 10 days rather than a per-day join, I have removed "date" as a component of the GROUP BY, so each subquery returns at most one row per building.
Now, joining to the buildings table at that 1:1:1 ratio limits the records in the final result set. This should take care of your concern about the average over the days in question.
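To verify that the covering indexes are actually used, EXPLAIN the prices subquery on its own; with the suggested (date, building_id, rent) index, the Extra column should include "Using index" (a temporary table may still appear for the GROUP BY, since the range scan on date prevents grouping in index order). A sketch, assuming the schema above:
EXPLAIN
SELECT building_id, AVG(rent)
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id;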

For anyone who has issues similar to mine: the solution is to GROUP each table you would like to join by building_id, so that you are joining one-to-one with every average. Ollie Jones's query, with JOIN rather than LEFT JOIN, is the closest answer if you do not want results that lack data in one of the tables. The other main issue I had was that I forgot to include the column used in AVG(low_temp) in the index. What I learned from this is that if you use a column in an aggregate function in your SELECT, it belongs in your index too. I added low_temp to it.
building_weather (date, building_id, high_temp, low_temp), as suggested by Ollie and DRapp:
ALTER TABLE building_weather ADD INDEX (date, building_id, high_temp, low_temp);
SELECT buildings.*, a.rent, b.high_temp, b.low_temp
FROM buildings
JOIN (
SELECT building_id, AVG(rent) rent
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id
) AS a ON buildings.building_id = a.building_id
JOIN (
SELECT building_id, AVG(high_temp) high_temp, AVG(low_temp) low_temp
FROM building_weather
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
GROUP BY building_id
) AS b ON buildings.building_id = b.building_id
ORDER BY a.rent / buildings.square_feet DESC
LIMIT 10

Related

Compute an average number of transactions per user in a readable manner

I have always struggled with these types of queries, so I'd like someone to check my approach to handling them.
I am asked to find how many transactions, on average, each user executes during a 12-hour timespan starting from their first transaction.
This is the data:
CREATE TABLE IF NOT EXISTS `transactions` (
`transactions_ts` timestamp,
`user_id` int(6) unsigned NOT NULL,
`transaction_id` bigint NOT NULL,
`item` varchar(200),
PRIMARY KEY (`transaction_id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`, `item`) VALUES
('2016-06-18 13:46:51.0', 13811335, 1322361417, 'glove'),
('2016-06-18 17:29:25.0', 13811335, 3729362318, 'hat'),
('2016-06-18 23:07:12.0', 13811335, 1322363995, 'vase'),
('2016-06-19 07:14:56.0', 13811335, 7482365143, 'cup'),
('2016-06-19 21:59:40.0', 13811335, 1322369619, 'mirror'),
('2016-06-17 12:39:46.0', 3378024101, 9322351612, 'dress'),
('2016-06-17 20:22:17.0', 3378024101, 9322353031, 'vase'),
('2016-06-20 11:29:02.0', 3378024101, 6928364072, 'tie'),
('2016-06-20 18:59:48.0', 13811335, 1322375547, 'mirror');
My approach is the following (with the steps and the query itself below):
1) For each distinct user_id, find their first transaction timestamp and the timestamp 12 hours after it. This is accomplished by the inner query aliased as t1.
2) Then, by an inner join to the second inner query (t2), I augment each row of the transactions table with the two values "first_trans" and "right_trans" from step 1.
3) The WHERE condition selects only those transaction timestamps that fall in the interval between the first_trans and right_trans timestamps.
4) The filtered table from step 3 is then aggregated as a distinct count of transaction ids per user.
5) The result of the four steps above is a table where each user has a count of transactions falling into the 12-hour interval from their first timestamp. I wrap it in another SELECT that sums the users' transaction counts and divides by the number of users, giving an average count per user.
I am quite certain that the end result is correct overall, but I keep thinking I might do without the 4th SELECT. Or perhaps the whole thing is somewhat clumsy; my aim was to make the query as readable as possible, not necessarily computationally optimal.
select
sum(dist_ts)/count(*) as avg_ts_per_user
from (
select
count(distinct transaction_id) as dist_ts,
us_id
from
(select
user_id as us_id,
min(transactions_ts) as first_trans,
min(transactions_ts) + interval 12 hour as right_trans
from transactions
group by us_id )
as t1
inner join
(select * from transactions )
as t2
on t1.us_id=t2.user_id
where transactions_ts >= first_trans
and transactions_ts < right_trans
group by us_id
) as t3
Fiddle demo
I don't think there is a mistake per se. The code can be slightly simplified (and neatened up a bit as follows):
select sum(dist_ts)/count(*) as avg_ts_per_user
from (
select count(distinct transaction_id) as dist_ts, us_id
from (
select user_id as us_id, min(transactions_ts) as first_trans, min(transactions_ts) + interval 12 hour as right_trans
from transactions
group by us_id
) as t1
inner join transactions as t2
on t1.us_id=t2.user_id and transactions_ts >= first_trans and transactions_ts < right_trans
group by us_id
) as t3
The (select * from transactions) as t2 was simplified above, and I somewhat arbitrarily moved a WHERE clause condition to the ON clause of the inner join.
My Fiddle Demo
Here is a second way that does not use inner joins:
select sum(cnt)/count(*) as avg_ts_per_user from (
select count(*) as cnt, t.user_id
from transactions t
where t.transactions_ts >= (select min(transactions_ts) from transactions where user_id = t.user_id)
and t.transactions_ts < (select min(transactions_ts) + interval 12 hour from transactions where user_id = t.user_id)
group by t.user_id
) sq
Another Fiddle
You should probably run EXPLAIN against the two queries to see which one runs better on your server. Also note that min(transactions_ts) is computed twice for each user. Is MySQL able to avoid the redundant calculation? I don't know. One possibility would be to create a temporary table consisting of user_id and the minimum transactions_ts, so that the value is computed once. This would only make sense if your table had lots of rows, and maybe not even then.
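A sketch of that temporary-table variant, assuming the schema above (user_first_trans is an illustrative name):
-- compute each user's first transaction timestamp once
CREATE TEMPORARY TABLE user_first_trans AS
SELECT user_id, MIN(transactions_ts) AS first_trans
FROM transactions
GROUP BY user_id;

SELECT SUM(cnt) / COUNT(*) AS avg_ts_per_user
FROM (
SELECT COUNT(*) AS cnt, t.user_id
FROM transactions t
INNER JOIN user_first_trans f
ON f.user_id = t.user_id
AND t.transactions_ts >= f.first_trans
AND t.transactions_ts < f.first_trans + INTERVAL 12 HOUR
GROUP BY t.user_id
) sq;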

What query will select rows for months that are more than a certain percent different from the previous month?

I have a table representing a car traveling a route, broken up into segments:
CREATE TABLE travels (
id int auto_increment primary key,
segmentID int,
month int,
year int,
avgSpeed int
);
-- sample data
INSERT INTO travels (segmentID, month, year, avgSpeed)
VALUES
(15,1,2014,80),
(15,1,2014,84),
(15,1,2014,82),
(15,2,2014,70),
(15,2,2014,68),
(15,2,2014,66);
The above schema and sample data are also available as a fiddle.
What query will identify segment IDs where average driving speed decreased by more than 10% compared to the previous month?
Here is my solution:
SQLFiddle demo
The key is to keep track of the relationship between the previous month and the next, so I compute year*100 + month and, after grouping by year and month, check for a difference of 1 (within a year) or 89 (December to January, e.g. 201501 - 201412) in the year*100+month field.
It is also a pity that MySQL does not support CTEs, which makes the query ugly with derived tables.
Code:
select s.month, s.speed, m.month as prevmonth, m.speed as sp,
       100 - s.speed / m.speed * 100 as speeddiff
from
( select segmentid, month, year*100+month as mark, avg(avgSpeed) as speed
  from travels
  group by segmentid, month, year*100+month ) as s,
( select segmentid, month, year*100+month as mark, avg(avgSpeed) as speed
  from travels
  group by segmentid, month, year*100+month ) as m
where s.segmentid = m.segmentid
  and (s.mark = m.mark + 1 or s.mark = m.mark + 89)
  and (m.speed - (m.speed/10)) > s.speed;
Here is the CTE version, which works on every DB except MySQL:
with t as (
  select segmentid, month, year*100+month as mark, avg(avgSpeed) as speed
  from travels
  group by segmentid, month, year*100+month
)
select s.month, s.speed, m.month as prevmonth, m.speed as sp,
       100 - s.speed / m.speed * 100 as speeddiff
from t s
inner join t m
  on s.segmentid = m.segmentid
  and (s.mark = m.mark + 1 or s.mark = m.mark + 89)
where (m.speed - (m.speed/10)) > s.speed;
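As an aside, MySQL 8.0+ (which postdates this question) supports both CTEs and window functions, so the month-to-previous-month comparison can be written with LAG. A sketch, assuming the travels table above:
SELECT segmentID, year, month, speed, prev_speed
FROM (
SELECT segmentID, year, month,
       year*100 + month AS mark,
       AVG(avgSpeed) AS speed,
       LAG(AVG(avgSpeed)) OVER w AS prev_speed,
       LAG(year*100 + month) OVER w AS prev_mark
FROM travels
GROUP BY segmentID, year, month
WINDOW w AS (PARTITION BY segmentID ORDER BY year, month)
) t
-- compare only consecutive months: difference 1 within a year, 89 for Dec -> Jan
WHERE mark - prev_mark IN (1, 89)
  AND speed < prev_speed * 0.9;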
You need to select each month, join the adjacent month (which is a little convoluted due to your table structure), and compute the decrease (or increase). Try the following query:
SELECT
t1.segmentID, t1.month, t1.year, AVG(t1.avgSpeed) as avgSpeed1,
AVG(t2.avgSpeed) as avgSpeed2,
1-(AVG(t1.avgSpeed)/AVG(t2.avgSpeed)) as decrease
FROM
travels t1
LEFT JOIN
travels t2
ON
t1.segmentID = t2.segmentID
AND CONCAT(t2.year,'-',LPAD(t2.month,2,'0'),'-01') = DATE_ADD(CONCAT(t1.year,'-',LPAD(t1.month,2,'0'),'-01'), INTERVAL -1 MONTH)
GROUP BY
t1.segmentID, t1.month, t1.year
HAVING
avgSpeed1/avgSpeed2 < .9
Here is the updated SQLFiddle - http://sqlfiddle.com/#!2/183c1/25
This requires a self join. This answer will get you started. You can work on the details.
select somefields
from yourtable t1 join yourtable t2 on t1.something = t2.something
where t1.month = whatever
and t2.month = t1.month + 1
and t2.speed <= t1.speed * .9

MySQL Simple Query Improvement with Large Rows

I have a simple query across a 20 million record table, and I need an index that will improve the SELECT statement for the following query:
SELECT count(item_id), count(distinct user_id)
FROM activity
INNER JOIN item on item.item_id = activity.item_id
WHERE item.item_id = 3839 and activity.created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
I have indexes on:
activity: activity_id (PRIMARY), item_id, created_at (all single-column indexes)
item: item_id (PRIMARY)
For items that have a lot of activity (around 600k rows), the query takes 4-5 seconds to run.
Any advice?
If there is a FOREIGN KEY constraint from activity to item (and assuming that user_id is in table activity), then your query is equivalent to:
SELECT COUNT(*), COUNT(DISTINCT user_id)
FROM activity
WHERE item_id = 3839
AND created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY) ;
which should be more efficient as it has to get data from only one table or just one index. An index on (item_id, created_at, user_id) would be useful.
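A sketch of that index as DDL, assuming the table and column names from the question:
ALTER TABLE activity ADD INDEX (item_id, created_at, user_id);
The equality on item_id comes first, then the range on created_at, and user_id is included so the index covers the whole query.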

Can this MySQL subquery be optimised?

I have two tables, news and news_views. Every time an article is viewed, the news id, IP address and date is recorded in news_views.
I'm using a query with a subquery to fetch the most viewed titles from news, by getting the total count of views in the last 24 hours for each one.
It works fine, except that it takes between 5-10 seconds to run, presumably because there are hundreds of thousands of rows in news_views and it has to go through the entire table before it can finish. The query is as follows; is there any way at all it can be improved?
SELECT n.title
, nv.views
FROM news n
LEFT
JOIN (
SELECT news_id
, count( DISTINCT ip ) AS views
FROM news_views
WHERE datetime >= SUBDATE(now(), INTERVAL 24 HOUR)
GROUP
BY news_id
) AS nv
ON nv.news_id = n.id
ORDER
BY views DESC
LIMIT 15
I don't think you need to calculate the count of views as a derived table:
SELECT n.id, n.title, count( DISTINCT nv.ip ) AS views
FROM news n
LEFT JOIN news_views nv
ON nv.news_id = n.id
WHERE nv.datetime >= SUBDATE(now(), INTERVAL 24 HOUR)
GROUP BY n.id, n.title
ORDER BY views DESC LIMIT 15
The best advice here is to run these queries through EXPLAIN to see what each one will actually do: index scans, table scans, estimated costs, etc. Avoid full table scans.
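For example, prefixing the rewritten query with EXPLAIN shows the access type, chosen index, and estimated rows examined for each table:
EXPLAIN
SELECT n.id, n.title, count( DISTINCT nv.ip ) AS views
FROM news n
LEFT JOIN news_views nv
ON nv.news_id = n.id
WHERE nv.datetime >= SUBDATE(now(), INTERVAL 24 HOUR)
GROUP BY n.id, n.title
ORDER BY views DESC LIMIT 15;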

SQL query to get duplicate records with different dates

I need to get records that are duplicates of each other except for the date field.
Table Sites:
field_id
reference
created
Every day we add a lot of records, so I need a function that extracts all existing records that have duplicates among the rows just added, in order to send some notifications.
The condition I can't get right is that the difference between the current day's records and the older data in the table should be between one day and 4 days.
Is there any simple query to do that without using a transaction?
I'm not sure I totally understand what you mean by duplicate records, but here's a basic date query (note that a column alias such as the_date can't be referenced in the WHERE clause, so the expression is repeated there):
SELECT field_id, reference, created, DATE(created) AS the_date
FROM Sites
WHERE DATE(created)
BETWEEN DATE( DATE_SUB( NOW(), INTERVAL 3 DAY ) )
AND DATE( NOW() )
I'm making several assumptions, such as:
- You don't want the "first" row returned
- Duplicates don't carry the date forward (the next row after the initial 4 days is not a duplicate)
- The 4 days means +4 days, so day 5 is included
So, my code is:
with originals as (
select s1.*
from sites as s1
where 0 = (
select count(*)
from sites as s2
where s1.field_id = s2.field_id
and s1.reference = s2.reference
and s1.created <> s2.created
and DATEDIFF(DAY,s2.created, s1.created) between 1 and 4
)
)
select s1.*
from sites as s1
inner join originals as o
on s1.field_id = o.field_id
and s1.reference = o.reference
and s1.created <> o.created
where DATEDIFF(DAY,o.created, s1.created) between 1 and 4
order by 1,2,3;
Here it is in a fiddle: http://sqlfiddle.com/#!3/9b407/20
This could be simpler if some conditions are relaxed.
Thanks a lot to everyone who tried to help me. I found this solution after a lot of testing:
SELECT `id`,`reference`,count(`config_id`) as c,`created` FROM `sites`
where datediff(date(current_date()),date(`created`)) < 4
group by `reference`
having c > 1
Thanks a lot for your help.