MySQL: How to construct a given query

I am not a MySQL guru at all, and I would really appreciate if someone takes some time to help me. I have three tables as shown below:
TEAM(teamID, teamName, userID)
YOUTH_TEAM(youthTeamID, youthTeamName, teamID)
YOUTH_PLAYER(youthPlayerID, youthPlayerFirstName, youthPlayerLastName, youthPlayerAge, youthPlayerDays, youthPlayerRating, youthPlayerPosition, youthTeamID)
And this is the query that I have now:
SELECT team.teamName, youth_team.youthTeamName, youth_player.*
FROM youth_player
INNER JOIN youth_team ON youth_player.youthTeamID = youth_team.youthTeamID
INNER JOIN team ON youth_team.teamID = team.teamID
WHERE youth_player.youthPlayerAge < 18
AND youth_player.youthPlayerDays < 21
AND youth_player.youthPlayerRating >= 5.5
What I would like to add to this query are more thorough checks, like the following:
if the player is 16 years old and his position is scorer, he must have a rating of at least 7 to be returned
if the player is 15 years old and his position is playmaker, he must have a rating of at least 5.5 to be returned
etc., etc.
How can I implement these requirements in my query (if possible), and would such a query be a bad solution? Would it be better to do the selection in PHP code (supposing I use PHP) instead of in the query?

Here is a possible solution with an additional "criteria/filter" table:
-- SAMPLE TEAMS: Yankees, Knicks:
INSERT INTO `team` VALUES (1,'Yankees',2),(2,'Knicks',1);
-- SAMPLE YOUTH TEAMS: Yankees Juniors, Knicks Juniors
INSERT INTO `youth_team` VALUES (1,'Knicks Juniors',1),(2,'Yankees Juniors',2);
-- SAMPLE PLAYERS
INSERT INTO `youth_player` VALUES
(1,'Carmelo','Anthony',16,20,7.5,'scorer',1),
(2,'Amar\'e','Stoudemire',17,45,5.5,'playmaker',1),
(3,'Iman','Shumpert',15,15,6.1,'playmaker',1),
(4,'Alex','Rodriguez',18,60,3.5,'playmaker',2),
(5,'Hiroki','Kuroda',16,17,8.7,'scorer',2),
(6,'Ichiro','Suzuki',19,73,8.3,'playmaker',2);
-- CRITERIA TABLE
CREATE TABLE `criterias` (
`id` int(11) NOT NULL,
`age` int(11) DEFAULT NULL,
`position` varchar(45) DEFAULT NULL,
`min_rating` double DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- SAMPLE CRITERIAS
-- AGE=16, POSITION=SCORER, MIN_RATING=7
-- AGE=15, POSITION=PLAYMAKER, MIN_RATING=5.5
INSERT INTO `criterias` VALUES (1,16,'scorer',7), (2,15,'playmaker',5.5);
Now your query could look like:
SELECT team.teamName, youth_team.youthTeamName, youth_player.*
FROM youth_player
CROSS JOIN criterias
INNER JOIN youth_team ON youth_player.youthTeamID = youth_team.youthTeamID
INNER JOIN team ON youth_team.teamID = team.teamID
WHERE
(
youth_player.youthPlayerAge < 18
AND youth_player.youthPlayerDays < 21
AND youth_player.youthPlayerRating >= 5.5
)
AND
(
youth_player.youthPlayerAge = criterias.age
AND youth_player.youthPlayerPosition = criterias.position
AND youth_player.youthPlayerRating >= criterias.min_rating
)
This yields (shortened results):
teamName youthTeamName youthPlayerName Age Days Rating Position
=============================================================================
Yankees "Knicks Juniors" Carmelo Anthony 16 20 7.5 scorer
Yankees "Knicks Juniors" Iman Shumpert 15 15 6.1 playmaker
Knicks "Yankees Juniors" Hiroki Kuroda 16 17 8.7 scorer
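The criteria-table idea can be tried out without a MySQL server. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the answer itself targets MySQL), loads the sample players and criteria from above, and leaves out the team joins for brevity. The helper name matching_players is made up for this illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE youth_player (
    youthPlayerID INTEGER PRIMARY KEY,
    youthPlayerFirstName TEXT,
    youthPlayerLastName TEXT,
    youthPlayerAge INTEGER,
    youthPlayerDays INTEGER,
    youthPlayerRating REAL,
    youthPlayerPosition TEXT,
    youthTeamID INTEGER
);
CREATE TABLE criterias (
    id INTEGER PRIMARY KEY,
    age INTEGER,
    position TEXT,
    min_rating REAL
);
INSERT INTO youth_player VALUES
    (1,'Carmelo','Anthony',16,20,7.5,'scorer',1),
    (2,'Amar''e','Stoudemire',17,45,5.5,'playmaker',1),
    (3,'Iman','Shumpert',15,15,6.1,'playmaker',1),
    (4,'Alex','Rodriguez',18,60,3.5,'playmaker',2),
    (5,'Hiroki','Kuroda',16,17,8.7,'scorer',2),
    (6,'Ichiro','Suzuki',19,73,8.3,'playmaker',2);
INSERT INTO criterias VALUES (1,16,'scorer',7), (2,15,'playmaker',5.5);
""")


def matching_players(connection):
    """Return first names of players that pass the global filter
    and match at least one row of the criteria table."""
    sql = """
        SELECT p.youthPlayerFirstName
        FROM youth_player AS p
        JOIN criterias AS c
          ON p.youthPlayerAge = c.age
         AND p.youthPlayerPosition = c.position
         AND p.youthPlayerRating >= c.min_rating
        WHERE p.youthPlayerAge < 18
          AND p.youthPlayerDays < 21
          AND p.youthPlayerRating >= 5.5
        ORDER BY p.youthPlayerID
    """
    return [row[0] for row in connection.execute(sql)]


names = matching_players(conn)
print(names)
```

With this layout, adding a rule is an INSERT into criterias rather than a change to the query text. One thing to watch: a player who matches two criteria rows would be returned twice, so the criteria should not overlap (or the query needs DISTINCT).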

Doing it in the query is quite fine, as long as it doesn't get too messy. You can do a lot in your query, but it may become hard to maintain. So if it gets too long and you want somebody else to be able to read it, you should split it up or move part of the logic into your PHP script.
As for your requirements, add this to your WHERE clause:
AND
(
(YOUTH_PLAYER.youthPlayerAge = 16 AND YOUTH_PLAYER.youthPlayerPosition = 'scorer' AND YOUTH_PLAYER.youthPlayerRating >= 7)
OR (YOUTH_PLAYER.youthPlayerAge = 15 AND YOUTH_PLAYER.youthPlayerPosition = 'playmaker' AND YOUTH_PLAYER.youthPlayerRating >= 5.5)
)

MySQL - Join row with the next N smaller rows

I have a table:
id timestamp
1 1
23 2
12 4
45 6
3 7
4 8
I need this result:
major minor
1 2
1 4
1 6
2 4
2 6
2 7
I need to join each number, with the next 3 smallest numbers. Since these numbers are inserted out of order, I can't use the ids.
Because the numbers are also not in regular intervals I cannot set a specific limit to find the max number to join with.
Solutions I have:
I could create a temp table and use an auto increment id to do this.
I can do this for a single number, and write a script to iterate through the table. This is the query for it (Going with this for now, till something better comes up):
SELECT * FROM
(SELECT id major_id, timestamp major_timestamp FROM timestamps WHERE interval_id=7 ORDER BY timestamp DESC limit 1) timestamps_major
LEFT JOIN
(SELECT id minor_id, timestamp minor_timestamp FROM timestamps WHERE timestamp < (SELECT timestamp FROM timestamps WHERE interval_id=7 ORDER BY timestamp DESC limit 1) ORDER BY timestamp DESC LIMIT 2) timestamps_minor
ON major_timestamp>minor_timestamp
This just needs to be done for all numbers once, and then once per day to calculate and store a moving average. So speed is not an issue.
Wondering what is the best way to approach this. Thanks.
EDIT:
This is the actual table with timestamps and ids. The example I posted is just simplified for the sake of the question.
CREATE TABLE `timestamps` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`interval_id` tinyint(3) unsigned NOT NULL,
`timestamp` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `interval_timestamp` (`interval_id`,`timestamp`),
KEY `interval_id` (`interval_id`),
KEY `timestamp` (`timestamp`),
CONSTRAINT `timestamps_ibfk_1` FOREIGN KEY (`interval_id`) REFERENCES `intervals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=75157 DEFAULT CHARSET=latin1
Here's a possible solution:
SELECT *
FROM mytable major inner join mytable minor
ON minor.timestamp > major.timestamp
WHERE (SELECT COUNT(*) FROM mytable m WHERE m.timestamp < minor.timestamp and m.timestamp > major.timestamp) < 3
ORDER BY major.timestamp, minor.timestamp
I'm definitely not confident this is the cleanest solution (and I didn't do anything to handle "ties" for equal timestamps), but it does do what you want so it might be something to build off of at a minimum.
All I am doing is joining the tables then counting the number of rows "between" the major and minor so that I don't get too many.
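To see the "count the rows in between" join in action, here is a small sketch that runs the same query shape against the simplified table from the question. It uses Python's sqlite3 module in place of MySQL (an assumption made only so the example is self-contained), and N is an illustrative name for the window size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 1), (23, 2), (12, 4), (45, 6), (3, 7), (4, 8)],
)

N = 3  # how many following rows to pair with each "major" row

# Pair every row with each later row, then keep a pair only if fewer than
# N rows lie strictly between the two timestamps.
pairs = conn.execute(
    """
    SELECT major.timestamp, minor.timestamp
    FROM mytable AS major
    JOIN mytable AS minor ON minor.timestamp > major.timestamp
    WHERE (SELECT COUNT(*)
           FROM mytable AS m
           WHERE m.timestamp > major.timestamp
             AND m.timestamp < minor.timestamp) < ?
    ORDER BY major.timestamp, minor.timestamp
    """,
    (N,),
).fetchall()

for major_ts, minor_ts in pairs:
    print(major_ts, minor_ts)
```

The first six pairs match the result table in the question. The correlated COUNT runs once per candidate pair, so the cost grows quickly with table size, which is acceptable here because the question states that speed is not an issue.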

MySQL query with multiple subqueries is too slow. How do I speed it up?

I have a MySQL query which does exactly what I want, but it takes anywhere between 110 and 130 seconds to run. The problem is that it works in tandem with software that times out 20 seconds after making the query.
Is there anything I can do to speed up the query? I'm considering moving the db over to another server, but are there any more elegant options before I go that route?
-- 1 Give me a list of IDs & eBayItemIDs
-- 2 where it is flagged as bottom tier
-- 3 Where it has been checked less than 168 times
-- 4 Where it has not been checked in the last hour
-- 5 Or where it was never checked but appears on the master list.
-- 1 Give me a list of IDs & eBayItemIDs
SELECT `id`, eBayItemID
FROM `eBayDD_Main`
-- 2 where it is flagged as bottom tier
WHERE `isBottomTier`='0'
-- 3 Where it has been checked less than 168 times
AND (`id` IN
(SELECT `mainid`
FROM `eBayDD_History`
GROUP BY `mainid`
HAVING COUNT(`mainID`) < 168)
-- 4 Where it has not been checked in the last hour
AND id IN
(SELECT `mainID`
FROM `eBayDD_History`
GROUP BY `mainID`
HAVING ((TIME_TO_SEC(TIMEDIFF(NOW(), MAX(`dateCollected`)))/60)/60) > 1))
-- 5 Or where it was never checked but appears on the master list.
OR (`id` IN
(SELECT `id`
FROM `eBayDD_Main`)
AND `id` NOT IN
(SELECT `mainID`
FROM `eBayDD_History`))
If I understand the logic correctly, you should be able to replace it with this:
select m.`id`, m.eBayItemID
from `eBayDD_Main` m left outer join
(select `mainID`, count(`mainID`) as cnt,
(TIME_TO_SEC(TIMEDIFF(NOW(), MAX(`dateCollected`)))/60)/60 as dc
from `eBayDD_History`
group by `mainID`
) hm
on m.`id` = hm.`mainID`
where (m.`isBottomTier` = '0' and hm.cnt < 168 and hm.dc > 1) or
hm.`mainID` is null;
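The rewrite replaces three IN subqueries with one aggregated derived table that is left-joined once. A rough, runnable sketch of that shape follows, using Python's sqlite3 module instead of MySQL (an assumption). The table names main and history, the sample rows, and the fixed NOW value are all invented for the illustration, and SQLite's julianday() stands in for the TIMEDIFF arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main (id INTEGER PRIMARY KEY, isBottomTier TEXT);
CREATE TABLE history (mainID INTEGER, dateCollected TEXT);
INSERT INTO main VALUES (1,'0'), (2,'0'), (3,'0'), (4,'1');
INSERT INTO history VALUES
    (1,'2013-07-21 08:00:00'), (1,'2013-07-21 10:00:00'),  -- last check 2 hours ago
    (2,'2013-07-21 11:50:00'),                             -- checked 10 minutes ago
    (4,'2013-07-20 09:00:00');                             -- wrong tier flag
""")

NOW = "2013-07-21 12:00:00"  # fixed "current time" so the result is reproducible

# One pass over history computes both the check count and the hours since
# the last check; rows of main without any history survive the LEFT JOIN
# with NULLs and are picked up by the IS NULL branch.
due_ids = [row[0] for row in conn.execute("""
    SELECT m.id
    FROM main AS m
    LEFT JOIN (SELECT mainID,
                      COUNT(*) AS cnt,
                      (julianday(:now) - julianday(MAX(dateCollected))) * 24 AS hours_since
               FROM history
               GROUP BY mainID) AS h
           ON m.id = h.mainID
    WHERE (m.isBottomTier = '0' AND h.cnt < 168 AND h.hours_since > 1)
       OR h.mainID IS NULL
    ORDER BY m.id
""", {"now": NOW})]

print(due_ids)
```

Item 1 is due again, item 3 was never checked, and items 2 and 4 are filtered out. On the real tables, an index on the history table's mainID column would likely matter as much as the rewrite itself.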

MySQL Calculating sum over pairwise time differences of log file

I have a table in MySQL to log user actions. Each row in the table corresponds to a user action, like login, logout, etc.
The table looks like:
CREATE TABLE IF NOT EXISTS `user_activity_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`action_type` smallint NOT NULL,
`action_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
id user_id action_type action_created
...
22 6 1 2013-07-21 14:31:14
23 6 2 2013-07-21 14:31:16
24 8 2 2013-07-21 14:31:18
25 8 1 2013-07-21 14:45:18
26 8 0 2013-07-21 14:45:25
27 8 1 2013-07-21 14:54:54
28 8 2 2013-07-21 15:09:11
29 6 1 2013-07-21 15:09:17
30 6 2 2013-07-21 15:09:29
...
Imagine that action 1 is login and 2 is logout, and that I want to find out the total time (in hours:minutes:seconds) the user with id 6 was logged in within a specific range of dates.
My first idea was to fetch all rows with either action 1 or 2 and calculate the date differences in PHP myself. This seems rather complicated, and I am sure this can be done in one query with MySQL, too!
What I tried was this:
SELECT TIMEDIFF(ual1.action_created, ual2.action_created) FROM user_activity_log
ual1,user_activity_log ual2 WHERE ual1.user_id = 6 AND ual2.user_id = 6 AND
ual1.action_type = 1 AND ual2.action_type = 2 AND
DATE(ual1.action_created) >= '2013-07-21' AND
DATE(ual1.action_created) <= '2013-07-21'
ORDER BY ual1.action_created
to select all login events from ual1 and all logout events from ual2 for the same user and then calculate the pairwise time difference for the day 2013-07-21, which does not really work, and I don't know why.
How can I calculate the total login time (sum over all time differences, date of action 2 - date of action 1)?
The correct result should be 2 seconds from log id pair 22,23 + 12 seconds from log id pair 29,30 = 14 seconds.
Thank you very much for your help in advance. Best regards
I think the easiest way to structure this type of query is using correlated subqueries (and, to be honest, I generally don't like correlated subqueries, but this is an exception). Your query would probably work with the right group by clause.
Here is an alternative method:
select TIMEDIFF(LogoutTS, action_created)
from (select ual.*,
(select ual2.action_created
from user_activity_log ual2
where ual2.user_id = ual.user_id and
ual2.action_type = 2 and
ual2.action_created > ual.action_created
order by ual2.action_created asc
limit 1
) as LogoutTS
from user_activity_log ual
where ual.user_id = 6 and
ual.action_type = 1
) ual
To get the total, you then need to do something like sum(TIMEDIFF(LogoutTS, action_created)). However, summing TIME values directly is not reliable in MySQL, so it is safer to convert to seconds first. It might look something like this:
select SUM(UNIX_TIMESTAMP(LogoutTS) - UNIX_TIMESTAMP(action_created))
Or:
select sec_to_time(SUM(UNIX_TIMESTAMP(LogoutTS) - UNIX_TIMESTAMP(action_created)))
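The pairing logic (for each login, take the earliest later logout of the same user, then sum the differences in seconds) can be checked against the sample rows from the question. This sketch uses Python's sqlite3 module in place of MySQL (an assumption), so strftime('%s', ...) plays the role of UNIX_TIMESTAMP and MIN() replaces the ORDER BY ... LIMIT 1 subquery.

```python
import datetime
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_activity_log (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    action_type INTEGER,
    action_created TEXT)""")
conn.executemany("INSERT INTO user_activity_log VALUES (?, ?, ?, ?)", [
    (22, 6, 1, "2013-07-21 14:31:14"),
    (23, 6, 2, "2013-07-21 14:31:16"),
    (24, 8, 2, "2013-07-21 14:31:18"),
    (25, 8, 1, "2013-07-21 14:45:18"),
    (26, 8, 0, "2013-07-21 14:45:25"),
    (27, 8, 1, "2013-07-21 14:54:54"),
    (28, 8, 2, "2013-07-21 15:09:11"),
    (29, 6, 1, "2013-07-21 15:09:17"),
    (30, 6, 2, "2013-07-21 15:09:29"),
])

# For every login (action_type = 1) of the user, find the first logout
# (action_type = 2) that comes after it, then add up the gaps in seconds.
total_seconds = conn.execute("""
    SELECT SUM(CAST(strftime('%s', logout_ts) AS INTEGER)
             - CAST(strftime('%s', action_created) AS INTEGER))
    FROM (SELECT ual.action_created,
                 (SELECT MIN(ual2.action_created)
                  FROM user_activity_log AS ual2
                  WHERE ual2.user_id = ual.user_id
                    AND ual2.action_type = 2
                    AND ual2.action_created > ual.action_created) AS logout_ts
          FROM user_activity_log AS ual
          WHERE ual.user_id = ? AND ual.action_type = 1) AS sessions
""", (6,)).fetchone()[0]

formatted = str(datetime.timedelta(seconds=total_seconds))
print(total_seconds, formatted)
```

A login with no later logout yields a NULL difference, which SUM ignores, so an open session simply does not count toward the total.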

Optimizing a MySQL query summing and averaging by multiple groups over a given date range

I'm currently working on a home-grown analytics system, currently using MySQL 5.6.10 on Windows Server 2008 (moving to Linux soon, and we're not dead set on MySQL, still exploring different options, including Hadoop).
We've just done a huge import, and what was a lightning-fast query for a small customer is now unbearably slow for a big one. I'm probably going to add an entirely new table to pre-calculate the results of this query, unless I can figure out how to make the query itself fast.
What the query does is take @StartDate and @EndDate as parameters and calculate, for every day of that range, the date, the number of new reviews on that date, a running total of the number of reviews (including any before @StartDate), and the daily average rating (if there is no information for a given day, the average rating is carried over from the previous day).
Available filters are age, gender, product, company, and rating type. Every review has 1-N ratings, containing at the very least an "overall" rating, but possibly more per customer/product, such as "Quality", "Sound Quality", "Durability", "Value", etc...
The API that calls this injects these filters based on user selection. If no rating type is specified, it uses "AND ratingTypeId = 1" in place of the AND clause comment in all three parts of the query I'll be listing below. All ratings are integers between 1 and 5, though that doesn't really matter to this query.
Here are the tables I'm working with:
CREATE TABLE `times` (
`timeId` int(11) NOT NULL AUTO_INCREMENT,
`date` date NOT NULL,
`month` char(7) NOT NULL,
`quarter` char(7) NOT NULL,
`year` char(4) NOT NULL,
PRIMARY KEY (`timeId`),
UNIQUE KEY `date` (`date`)
) ENGINE=MyISAM
CREATE TABLE `reviewCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`totalReviews` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`)
) ENGINE=MyISAM
CREATE TABLE `ratingCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`ratingTypeId` int(11) NOT NULL,
`negativeRatings` int(10) unsigned NOT NULL DEFAULT '0',
`positiveRatings` int(10) unsigned NOT NULL DEFAULT '0',
`neutralRatings` int(10) unsigned NOT NULL DEFAULT '0',
`totalRatings` int(10) unsigned NOT NULL DEFAULT '0',
`ratingsSum` double unsigned DEFAULT '0',
`totalRecommendations` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`,`ratingTypeId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`),
KEY `ratingTypeId_fk` (`ratingTypeId`)
) ENGINE=MyISAM
The 'times' table is pre-filled with every day from 1900-01-01 to 2049-12-31, and the two count tables are populated by an ETL script with a roll-up query grouped by company, product, age, gender, ratingType, etc...
What I'm expecting back from the query is something like this:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
2013-01-25 5505 16091 4.058400718778077
2013-01-27 2043 18134 3.992957746478873
2013-01-28 3280 21414 3.983625730994152
2013-01-29 4648 26062 3.921597633136095
...
2013-03-09 1608 60297 3.9409722222222223
2013-03-10 470 60767 3.7743682310469313
2013-03-11 1028 61795 4.036697247706422
2013-03-13 494 62289 3.857388316151203
2013-03-14 449 62738 3.8282208588957056
I'm pretty sure I could pre-calculate everything grouped by age, gender, etc..., except for the average, but I may be wrong on that. If I had three reviews for two products on one day, with all other groups different, and one had a rating of 2 and 5, and the other a 4, the first would have a daily average of 3.5, and the second 4. Averaging those averages would give me 3.75, when I'd expect to get 3.66667. Maybe I could do something like multiplying the average for that grouping by the number of reviews to get the total rating sum for the day, sum those up, then divide them by total ratings count at the end. Seems like a lot of extra work, but it may be faster than what I'm currently doing. Speaking of which, here's my current query:
SET @cumulativeCount :=
(SELECT coalesce(sum(rc.totalReviews), 0)
FROM reviewCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
);
SET @dailyAverageWithCarry :=
(SELECT SUM(rc.ratingsSum) / SUM(rc.totalRatings)
FROM ratingCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
AND rc.totalRatings > 0
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.date DESC LIMIT 1
);
SELECT
subquery.d AS `Date`,
subquery.newReviewsCount AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + subquery.newReviewsCount) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(subquery.dailyRatingAverage, @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
(
SELECT
dt.date AS d,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount,
SUM(rac.ratingsSum) / SUM(rac.totalRatings) AS dailyRatingAverage
FROM times dt
LEFT JOIN reviewCount rc ON dt.timeId = rc.createdOnTimeId
LEFT JOIN ratingCount rac ON dt.timeId = rac.createdOnTimeId
WHERE dt.date BETWEEN @StartDate AND @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.timeId
) AS subquery;
The query currently takes ~2 minutes to run, with the following row counts:
times 54787
reviewCount 276389
ratingCount 473683
age 122
gender 3
ratingType 28
product 70070
Any help would be greatly appreciated. I'd either like to make this query much faster, or if it would be faster to do so, to pre-calculate the values grouped by date, age, gender, product, company, and ratingType, then do a quick roll-up query on that table.
UPDATE #1: I tried Meherzad's suggestions of adding indexes to times and ratingCount with:
ALTER TABLE times ADD KEY `timeId_date_key` (`timeId`, `date`);
ALTER TABLE ratingCount ADD KEY `createdOnTimeId_totalRatings_key` (`createdOnTimeId`, `totalRatings`);
Then ran my initial query again, and it was about 1s faster (~89s), but still too slow. I tried Meherzad's suggested query, and had to kill it after a few minutes.
As requested, here is the EXPLAIN results from my query:
id|select_type|table|type|possible_keys|key|key_len|ref|rows|Extra
1|PRIMARY|<derived2>|ALL|NULL|NULL|NULL|NULL|6808032|NULL
2|DERIVED|dt|range|PRIMARY,timeId_date_key,date|date|3|NULL|88|Using index condition; Using temporary; Using filesort
2|DERIVED|rc|ref|PRIMARY,companyId_fk,createdOnTimeId|createdOnTimeId|4|dt.timeId|126|Using where
2|DERIVED|rac|ref|createdOnTimeId,createdOnTimeId_total_ratings_key|createdOnTimeId|4|dt.timeId|614|NULL
I checked the cache read miss rate as mentioned in the article on buffer sizes, and it was
Key_reads 58303
Key_read_requests 147411279
For a miss rate of 3.9551247635535405672723319902814e-4
UPDATE #2: Solved! The indices definitely helped, so I'll give credit for the answer to Meherzad. What actually made the most difference was realizing that calculating the rolling average and daily/cumulative review counts in the same query was joining those two huge tables together. I saw that the variable initialization was done in two separate queries, and decided to try separating the two big queries into subqueries and then joining them based on the timeId. Now it runs in 0.358s with the following query:
SET @StartDate = '2013-01-24';
SET @EndDate = '2013-04-24';
SELECT
@StartDateId:=MIN(timeId), @EndDateId:=MAX(timeId)
FROM
times
WHERE
date IN (@StartDate , @EndDate);
SELECT
@CumulativeCount:=COALESCE(SUM(totalReviews), 0)
FROM
reviewCount
WHERE
createdOnTimeId < @StartDateId
-- Add Filters
;
SELECT
@DailyAverage:=COALESCE(SUM(ratingsSum) / SUM(totalRatings), 0)
FROM
ratingCount
WHERE
createdOnTimeId < @StartDateId
AND totalRatings > 0
-- Add Filters
GROUP BY createdOnTimeId
ORDER BY createdOnTimeId DESC
LIMIT 1;
SELECT
t.date AS `Date`,
COALESCE(q1.newReviewsCount, 0) AS `NewReviews`,
(@CumulativeCount:=@CumulativeCount + COALESCE(q1.newReviewsCount, 0)) AS `CumulativeReviewsCount`,
(@DailyAverage:=COALESCE(q2.dailyRatingAverage,
COALESCE(@DailyAverage, 0))) AS `DailyRatingAverage`
FROM
times t
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount
FROM
reviewCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q1 ON t.timeId = q1.createdOnTimeId
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
SUM(rc.ratingsSum) / SUM(rc.totalRatings) AS dailyRatingAverage
FROM
ratingCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q2 ON t.timeId = q2.createdOnTimeId
WHERE
t.timeId BETWEEN @StartDateId AND @EndDateId;
I had assumed that two subqueries would be incredibly slow, but they were insanely fast because they weren't joining completely unrelated rows. It also pointed out the fact that my earlier results were way off. For example, from above:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
Should have been, and now is:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
The average was correct, but the join was screwing up the number of both new and cumulative reviews, which I verified with a single query.
I also got rid of the joins to the times table, instead determining the start and end date IDs in a quick initialization query, then just rejoined to the times table at the end.
Now the results are:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
2013-01-25 551 407878 4.058400718778077
2013-01-26 455 408333 3.838926174496644
2013-01-27 433 408766 3.992957746478873
2013-01-28 425 409191 3.983625730994152
...
2013-04-13 170 426066 3.874239350912779
2013-04-14 182 426248 3.585714285714286
2013-04-15 171 426419 3.6202531645569622
2013-04-16 0 426419 3.6202531645569622
2013-04-17 0 426419 3.6202531645569622
2013-04-18 0 426419 3.6202531645569622
2013-04-19 0 426419 3.6202531645569622
2013-04-20 0 426419 3.6202531645569622
2013-04-21 0 426419 3.6202531645569622
2013-04-22 0 426419 3.6202531645569622
2013-04-23 0 426419 3.6202531645569622
2013-04-24 0 426419 3.6202531645569622
The last few averages properly carry the earlier ones, too, since we haven't imported from that customer's data feed in about 10 days.
Thanks for the help!
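The effect described in UPDATE #2 (inflated review counts but a correct average when both roll-up tables are joined on the same day) can be reproduced on a toy data set. The sketch below uses Python's sqlite3 module rather than MySQL (an assumption), keeps only the columns that matter, and uses invented sample rows for a single day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reviewCount (createdOnTimeId INTEGER, totalReviews INTEGER);
CREATE TABLE ratingCount (createdOnTimeId INTEGER, ratingsSum REAL, totalRatings INTEGER);
-- one day (timeId 1): two review roll-up rows and three rating roll-up rows
INSERT INTO reviewCount VALUES (1, 3), (1, 2);
INSERT INTO ratingCount VALUES (1, 7.0, 2), (1, 4.0, 1), (1, 9.0, 3);
""")

# Joining both fact tables on the day pairs every review row with every
# rating row, so each review row is counted three times.
joined = conn.execute("""
    SELECT SUM(rc.totalReviews), SUM(rac.ratingsSum) / SUM(rac.totalRatings)
    FROM reviewCount AS rc
    JOIN ratingCount AS rac ON rac.createdOnTimeId = rc.createdOnTimeId
    GROUP BY rc.createdOnTimeId
""").fetchone()

# Aggregating each table on its own first, then joining the one-row-per-day
# results, avoids the multiplication.
separate = conn.execute("""
    SELECT q1.newReviews, q2.dailyAverage
    FROM (SELECT createdOnTimeId, SUM(totalReviews) AS newReviews
          FROM reviewCount GROUP BY createdOnTimeId) AS q1
    JOIN (SELECT createdOnTimeId, SUM(ratingsSum) / SUM(totalRatings) AS dailyAverage
          FROM ratingCount GROUP BY createdOnTimeId) AS q2
      ON q2.createdOnTimeId = q1.createdOnTimeId
""").fetchone()

print("joined:  ", joined)
print("separate:", separate)
```

The joined query reports 15 reviews instead of 5, while the average is unchanged because its numerator and denominator are multiplied by the same factor. That matches the observation that only the counts were off.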
Try this query.
You don't have the indexes needed to optimize your query:
Table times: add a compound index on (timeId, date)
Table ratingCount: add a compound index on (createdOnTimeId, totalRatings)
Since you add various other AND filters according to user input, create a compound index on those columns for each table, in the order in which you add them. For example, on table ratingCount: a compound index on (createdOnTimeId, totalRatings, ratingTypeId, ageId, genderId, productId, companyId). NOTE: this index is only useful if those constraints actually appear in the query.
I'd also check that your buffer pool is large enough to hold your indexes. You don't want indexes to be paged in and out of the buffer pool during a query.
Check your buffer pool size
BUFFER_SIZE
If you don't see any improvement in performance, please also post the EXPLAIN output for your query; it will help in understanding the problem properly.
I have tried to understand your query and written a new one; check whether it works or not.
SELECT
*
FROM
(SELECT
dt.timeId,
dt.date,
COALESCE(SUM(rc.totalReviews), 0) AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + COALESCE(SUM(rc.totalReviews), 0)) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(SUM(rac.ratingsSum) / SUM(rac.totalRatings), @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
times dt
LEFT JOIN
reviewCount rc
ON
dt.timeId = rc.createdOnTimeId
LEFT JOIN
ratingCount rac ON dt.timeId = rac.createdOnTimeId
JOIN
(SELECT @cumulativeCount:=0, @dailyAverageWithCarry:=0) tmp
WHERE
dt.date < @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY
dt.timeId
ORDER BY
dt.timeId
) AS subquery
WHERE
subquery.date>@StartDate;
Hope this helps....

Multiple sort depending on the current day and tomorrow (bus trips)

I am stuck on a big problem with the query below. Here j5 represents Friday and j6 represents Saturday (j1 to j7 map to Monday through Sunday).
As you know, buses have different schedules depending on the day of the week. Here, I am taking the next 5 trip departures after 25:00:00 on cal (j5) and/or after 01:00:00 on cal2 (j6). Bus schedules are built like this:
If it's 1 AM, then the current bus time is 25, 2 AM is 26, and so on. So if I want the departures for today after, let's say, 1 AM, I may get only 2 or 3, since the "bus" day ends soon. To solve this problem, I want to add the next departures from the following day (here, Saturday after Friday). But the next day starts at 00, like every normal day.
So what I want to do is: get all next trips for Friday (j5) after 25:00:00. If I don't have 5, then get the remaining departures for Saturday after 01:00:00 (since 25:00:00 = 01:00:00).
Example:
I get departures at 25:16:00, 25:46:00 and 26:16:00 for Friday. That's 3. I then want to get 2 more departures from the next day so that I end up with 5, for example 04:50:00 and 05:15:00.
So the next departures from stop X are: 25:16:00 (Friday), 25:46:00 (Friday), 26:16:00 (Friday), 04:50:00 (Saturday), 05:15:00 (Saturday).
I am having trouble sorting both sets of results by trips.trip_departure.
I know it may be complicated, and it's complicated for me to explain, but any help is welcome. Thanks a lot in advance!
PS: Using MySQL 5.1.49 and PHP 5.3.8
PS2: I want to avoid doing multiple queries in PHP, so I'd like to do this in one query, no matter what.
SELECT
trips.trip_departure,
trips.trip_arrival,
trips.trip_total_time,
trips.trip_direction
FROM
trips,
trips_assoc,
(
SELECT calendar_regular.cal_regular_id
FROM calendar_regular
WHERE calendar_regular.j5 = 1
) as cal,
(
SELECT calendar_regular.cal_regular_id
FROM calendar_regular
WHERE calendar_regular.j6 = 1
) as cal2
WHERE
trips.trip_id = trips_assoc.trip_id
AND
trips.route_id IN (109)
AND
trips.trip_direction IN (0)
AND
trips.trip_period_start <= "2011-11-25"
AND
trips.trip_period_end >= "2011-11-25"
AND
(
(
cal.cal_regular_id = trips_assoc.calendar_id
AND
trips.trip_departure >= "25:00:00"
)
OR
(
cal2.cal_regular_id = trips_assoc.calendar_id
AND
trips.trip_departure >= "01:00:00"
)
)
ORDER BY
trips.trip_departure ASC
LIMIT
5
EDIT Table structure :
Table calendar_regular
(j1 means Monday, j7 means Sunday, etc., as the column comments show)
`cal_regular_id` tinyint(3) unsigned NOT NULL AUTO_INCREMENT,
`j1` tinyint(1) NOT NULL COMMENT 'Lundi',
`j2` tinyint(1) NOT NULL COMMENT 'Mardi',
`j3` tinyint(1) NOT NULL COMMENT 'Mercredi',
`j4` tinyint(1) NOT NULL COMMENT 'Jeudi',
`j5` tinyint(1) NOT NULL COMMENT 'Vendredi',
`j6` tinyint(1) NOT NULL COMMENT 'Samedi',
`j7` tinyint(1) NOT NULL COMMENT 'Dimanche',
PRIMARY KEY (`cal_regular_id`),
KEY `j1` (`j1`),
KEY `j2` (`j2`),
KEY `j3` (`j3`),
KEY `j4` (`j4`),
KEY `j5` (`j5`),
KEY `j6` (`j6`),
KEY `j7` (`j7`)
Data :
cal_regular_id j1 j2 j3 j4 j5 j6 j7
1 0 0 0 0 1 0 0
2 0 0 0 1 1 0 0
3 1 1 1 1 1 0 0
4 0 0 0 0 0 1 0
5 0 0 0 0 0 0 1
Some buses are available only on certain days; this table defines which days of the week, and it is assigned to trips through the trips_assoc table.
Trips table
`agency_id` smallint(5) unsigned NOT NULL,
`trip_id` binary(16) NOT NULL,
`trip_period_start` date NOT NULL,
`trip_period_end` date NOT NULL,
`trip_direction` tinyint(1) unsigned NOT NULL,
`trip_departure` time NOT NULL,
`trip_arrival` time NOT NULL,
`trip_total_time` mediumint(8) NOT NULL,
`trip_terminus` mediumint(8) NOT NULL,
`route_id` mediumint(8) NOT NULL,
`shape_id` binary(16) NOT NULL,
`block` binary(16) DEFAULT NULL,
KEY `testing` (`route_id`,`trip_direction`),
KEY `trip_departure` (`trip_departure`)
trips_assoc table
`agency_id` tinyint(4) NOT NULL,
`trip_id` binary(16) NOT NULL,
`calendar_id` smallint(6) NOT NULL,
KEY `agency_id` (`agency_id`),
KEY `trip_id` (`trip_id`,`calendar_id`)
First off, NEVER let an outside entity dictate a non-unique join column. They can possibly (with authorization/authentication) dictate unique ones (like a deterministic GUID value). Otherwise, they get to dictate a natural key somewhere, and your database automatically assigns row ids for joining. Also, unless you're dealing with a huge number of joins (multiple dozens) over un-indexed rows, the performance is going to be far less of a factor than the headaches of dealing with it elsewhere.
So, from the look of things, you are storing bus schedules from multiple companies (something like what Google must be doing to provide public transit directions).
Here's how I would deal with this:
You're going to need a calendar file. This is useful for all business scenarios, but will be extremely useful here (note: don't put any route-related information in it).
Modify your agency table to control join keys. Agencies do not get to specify their ids, only their names (or some similar identifier). Something like the following should suffice:
agency
=============
id - identity, incrementing
name - Externally specified name, unique
Modify your route table to control join keys. Agencies should only be able to specify their (potentially non-unique) natural keys, so we need a surrogate key for joins:
route
==============
id - identity, incrementing
agency_id - fk reference to agency.id
route_identifier - natural key specified by agency, potentially non-unique.
- required unique per agency_id, however (or include variation for unique)
route_variation - some agencies use the same routes for both directions, but they're still different.
route_status_id - fk reference to route_status.id (potential attribute, debatable)
Please note that the route table shouldn't actually list the stops that are on the route - its sole purpose is to control which agency has which routes.
Create a location or address table. This will benefit you mostly in the fact that most transit companies tend to put multiple routes through the same locations:
location
=============
id - identity, incrementing
address - there are multiple ways to represent addresses in a database.
- if nothing else, separating the fields should suffice
lat/long - please store these properly, not as a single column.
- two floats/doubles will suffice, although there are some dedicated solutions.
At this point, you have two options for dealing with stops on a route:
Define a stop table, and list out all stops. Something like this:
stop
================
id - identity, incrementing
route_id - fk reference to route.id
location_id - fk reference to location.id
departure - Timestamp (date and time) when the route leaves the stop.
This of course gets large very quickly, but makes dealing with holiday schedules easy.
Define a schedule table set, and a schedule_override table set:
schedule
===================
id - identity, incrementing
route_id - fk reference to route.id
start_date - date schedule goes into effect.
schedule_stop
===================
schedule_id - fk reference to schedule.id
location_id - fk reference to location.id
departure - Time (time only) when the route leaves the stop
dayOfWeek - equivalent to whatever is in calendar.nameOfDay
- This does not have to be an id, so long as they match
schedule_override
===================
id - identity, incrementing
route_id - fk reference to route.id
effective_date - date override is in effect. Should be listed in the calendar file.
reason_id - why there's an override in effect.
schedule_override_stop
===========================
schedule_override_id - fk reference to schedule_override.id
location_id - fk reference to location.id
departure - time (time only) when the route leaves the stop
With this information, I can now get the information I need:
SELECT
FROM agency as a
JOIN route as b
ON b.agency_id = a.id
AND b.route_identifier = :(whatever 109 equates to)
AND b.route_variation = :(whatever 0 equates to)
JOIN (SELECT COALESCE(d.route_id, j.route_id) as route_id,
COALESCE(e.location_id, j.location_id) as location_id,
COALESCE(TIMESTAMP(c.date, e.departure),
TIMESTAMP(c.date, j.departure)) as departure_timestamp
FROM calendar as c
LEFT JOIN (schedule_override as d
JOIN schedule_override_stop as e
ON e.schedule_override_id = d.id)
ON d.effective_date = c.date
LEFT JOIN (SELECT f.route_id, f.start_date,
g.dayOfWeek, g.departure, g.location_id,
(SELECT MIN(h.start_date)
FROM schedule as h
WHERE h.route_id = f.route_id
AND h.start_date > f.start_date) as end_date
FROM schedule as f
JOIN schedule_stop as g
ON g.schedule_id = f.id) as j
ON j.start_date <= c.date
AND j.end_date > c.date
AND j.dayOfWeek = c.dayOfWeek
WHERE c.date >= :startDate
AND c.date < :endDate) as k
ON k.route_id = b.id
AND k.departure_timestamp >= :leaveAfter
JOIN location as m
ON m.id = k.location_id
AND m.(location information) = :(input location information)
ORDER BY k.departure_timestamp ASC
LIMIT 5
This will give a list of all departures leaving from the specified location, for the given route, between startDate and endDate (exclusive), and after the leaveAfter timestamp. Statement (equivalent) runs on DB2. It picks up changes to schedules, overrides for holidays, etc.
I think X-Zero's advice is the best solution, but I had some free time :) Please see below. I used CONCAT to build a timestamp-like value and then ordered by those two columns. I wrote this freehand, so it may contain errors. I used EXISTS because I read somewhere that it is faster than a join, but you can take just the CONCAT and ORDER BY parts of the query.
SELECT
trips.trip_departure,
trips.trip_arrival,
trips.trip_total_time,
trips.trip_direction,
CONCAT(trips.trip_period_start,' ',trips.trip_departure) as start,
CONCAT(trips.trip_period_end,' ',trips.trip_departure) as end
FROM trips
WHERE EXISTS
(
SELECT
trips_assoc.calendar_id
FROM
trips_assoc
WHERE
trips.trip_id = trips_assoc.trip_id
AND EXISTS
(
SELECT
calendar_regular.cal_regular_id
FROM
calendar_regular
WHERE
calendar_regular.cal_regular_id = trips_assoc.calendar_id
AND
(
calendar_regular.j5 = 1
OR
calendar_regular.j6 = 1
)
)
)
AND
trips.route_id IN (109)
AND
trips.trip_direction IN (0)
AND
trips.trip_period_start <= "2011-11-25"
AND
trips.trip_period_end >= "2011-11-25"
AND
(
trips.trip_departure >= "25:00:00"
OR
trips.trip_departure >= "01:00:00"
)
ORDER BY
TIMESTAMP(start) ASC,TIMESTAMP(end) ASC
LIMIT
5
EDIT: COPY/PASTE issue corrected
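The ordering problem itself (late-night Friday trips such as 25:16:00 must come before early Saturday trips such as 04:50:00) can also be handled by tagging each candidate with a day offset and sorting on that first. Below is a minimal sketch of that idea, using Python's sqlite3 module in place of MySQL (an assumption). The single trips table with a service_day column is a simplification invented for the example (in the question, the day comes from trips_assoc and calendar_regular), and the fixed-width time strings stand in for the TIME column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, flattened schema: one row per departure and service day.
conn.execute("CREATE TABLE trips (service_day TEXT, trip_departure TEXT)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", [
    ("fri", "24:40:00"), ("fri", "25:16:00"), ("fri", "25:46:00"), ("fri", "26:16:00"),
    ("sat", "00:30:00"), ("sat", "04:50:00"), ("sat", "05:15:00"), ("sat", "05:45:00"),
])

# day_offset = 0 for the current service day, 1 for the next one.
# Sorting by (day_offset, trip_departure) keeps Friday's 25:xx trips ahead
# of Saturday's 04:xx trips even though "04" sorts before "25".
next_departures = conn.execute("""
    SELECT trip_departure, service_day
    FROM (SELECT trip_departure, service_day, 0 AS day_offset
          FROM trips
          WHERE service_day = 'fri' AND trip_departure >= '25:00:00'
          UNION ALL
          SELECT trip_departure, service_day, 1 AS day_offset
          FROM trips
          WHERE service_day = 'sat' AND trip_departure >= '01:00:00') AS candidates
    ORDER BY day_offset, trip_departure
    LIMIT 5
""").fetchall()

for departure, day in next_departures:
    print(departure, day)
```

This relies on the assumption that the next service day has no departures earlier than the last late-night trip of the current one; if the two can overlap, both sides would have to be converted to real timestamps before sorting.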