MySQL - select distinct values against a range of columns

The following table will store exchange rates between various currencies over time:
CREATE TABLE `currency_exchange` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`currency_from` int(11) DEFAULT NULL,
`currency_to` int(11) DEFAULT NULL,
`rate_in` decimal(12,4) DEFAULT NULL,
`rate_out` decimal(12,4) DEFAULT NULL,
`exchange_date` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
How would I query it to fetch a list of the most recent exchange rates?
Essentially, the combined currency_from and currency_to columns identify a distinct exchange rate, and I want the row with the most recent date for each such pair.
For example, let's say I've got the data as:
INSERT INTO currency_exchange (id, currency_from, currency_to, rate_in, rate_out, exchange_date) VALUES
(1, 1, 2, 0.234, 1.789, '2012-07-23 09:24:34'),
(2, 2, 1, 0.234, 1.789, '2012-07-24 09:24:34'),
(3, 2, 1, 0.234, 1.789, '2012-07-24 09:24:35'),
(4, 1, 3, 0.234, 1.789, '2012-07-24 09:24:34');
I'd want it to select row IDs:
1 - as the most recent rate between currencies 1 and 2
3 - as the most recent rate between currencies 2 and 1
4 - as the most recent rate between currencies 1 and 3

The following query should work:
SELECT ce.*
FROM currency_exchange ce
LEFT JOIN currency_exchange newer
ON (newer.currency_from = ce.currency_from
AND newer.currency_to = ce.currency_to
AND newer.exchange_date > ce.exchange_date)
WHERE newer.id IS NULL
The trick of the self LEFT JOIN is to avoid resorting to a subquery, which may be very expensive on large datasets. It essentially looks for records for which no "newer" record exists.
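To make the pattern concrete, here is a small runnable sketch of the same self-LEFT-JOIN trick, using SQLite through Python's sqlite3 module (the join logic is identical in MySQL; the table and data mirror the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE currency_exchange (
    id INTEGER PRIMARY KEY,
    currency_from INTEGER, currency_to INTEGER,
    rate_in REAL, rate_out REAL,
    exchange_date TEXT
);
INSERT INTO currency_exchange VALUES
    (1, 1, 2, 0.234, 1.789, '2012-07-23 09:24:34'),
    (2, 2, 1, 0.234, 1.789, '2012-07-24 09:24:34'),
    (3, 2, 1, 0.234, 1.789, '2012-07-24 09:24:35'),
    (4, 1, 3, 0.234, 1.789, '2012-07-24 09:24:34');
""")

# Keep a row only if no "newer" row exists for the same currency pair:
# unmatched LEFT JOIN rows have newer.id IS NULL.
rows = conn.execute("""
    SELECT ce.id
    FROM currency_exchange ce
    LEFT JOIN currency_exchange newer
      ON newer.currency_from = ce.currency_from
     AND newer.currency_to   = ce.currency_to
     AND newer.exchange_date > ce.exchange_date
    WHERE newer.id IS NULL
    ORDER BY ce.id
""").fetchall()
print([r[0] for r in rows])  # [1, 3, 4]
```

As expected, row 2 is eliminated because row 3 is a newer rate for the same 2-to-1 pair.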
Alternatively, you could go for a simpler query (although it may, or may not, be slower):
SELECT *
FROM currency_exchange ce
NATURAL JOIN (
SELECT currency_from, currency_to, MAX(exchange_date) AS exchange_date
FROM currency_exchange
GROUP BY currency_from, currency_to
) AS most_recent

Insert the values of $currency_from and $currency_to in your dynamic query.
The query below will return the row whose exchange_date is nearest to the current time:
SELECT id FROM currency_exchange WHERE currency_from='$currency_from' AND currency_to='$currency_to' ORDER BY ABS( DATEDIFF( exchange_date, now() ) ) LIMIT 1

Related

How to INSERT a value based on the current date and a generated sequence number in MySQL?

I have this MySQL table:
CREATE TABLE bills
(
id_interess INT UNSIGNED NOT NULL,
id_bill VARCHAR(30) NULL,
PRIMARY KEY (id_interess)
) ENGINE=InnoDB;
And now I want to be able to manually insert a unique integer for id_interess and automatically generate id_bill so that it consists of the current date and an integer (the integer resets each new year, using a trigger), like this:
id_interess |id_bill |
------------+-----------+
1 |20170912-1 |
2 |20171030-2 |
6 |20171125-3 |
10 |20171231-4 |
200 |20180101-1 |
3 |20180101-2 |
8 |20180102-3 |
If anyone has a direct solution to this using only one query, I would be very glad! I only came up with a solution that uses three queries, but I still get some errors...
My newbie attempt: I created an additional column id_bill_tmp which holds integer part of id_bill like this:
CREATE TABLE bill
(
id_interess INT UNSIGNED NOT NULL,
id_bill_tmp INT UNSIGNED NULL,
id_bill VARCHAR(30) NULL,
PRIMARY KEY (id_interess)
) ENGINE=InnoDB;
Table from above would in this case look like this (note that on new year id_bill_tmp is reset to 1 and therefore I can't use AUTO_INCREMENT which can only be used on keys and keys need unique values in a column):
id_interess |id_bill_tmp |id_bill |
------------+--------------+-----------+
1 |1 |20170912-1 |
2 |2 |20171030-2 |
6 |3 |20171125-3 |
10 |4 |20171231-4 |
200 |1 |20180101-1 |
3 |2 |20180101-2 |
8 |3 |20180102-3 |
So for example to insert 1st row from the above table, table would have to be empty, and I would insert a value in three queries like this:
1st query:
INSERT INTO bills (id_interess) VALUES (1);
I do this first because I don't know how to increment a nonexistent value for id_bill_tmp, and this helped me to first get id_bill_tmp = NULL:
id_interess |id_bill_tmp |id_bill |
------------+--------------+-----------+
1 |[NULL] |[NULL] |
2nd query
Now I try to increment id_bill_tmp to become 1. I tried two queries; both fail with:
table is specified twice both as a target for 'update' and as a separate source for data
These are the queries I tried:
UPDATE bills
SET id_bill_tmp = (SELECT IFNULL(id_bill_tmp, 0)+1 AS id_bill_tmp FROM bills)
WHERE id_interess = 1;
UPDATE bills
SET id_bill_tmp = (SELECT max(id_bill_tmp)+1 FROM bills)
WHERE id_interess = 1;
3rd query:
The final step would be to reuse id_bill_tmp as integer part of id_bill like this:
UPDATE bills
SET id_bill = concat(curdate()+0,'-',id_bill_tmp)
WHERE id_interess = 1;
so that I finally get
id_interess |id_bill_tmp |id_bill |
------------+--------------+-----------+
1 |1 |20170912-1 |
So if anyone can help me with the 2nd query or even present a solution with a single query or even without using column id_bill_tmp it would be wonderful.
Solution #1 - with the extra column
Demo
http://rextester.com/GOTPA70741
SQL
INSERT INTO bills (id_interess, id_bill_tmp, id_bill) VALUES (
1, -- (Change this value appropriately for each insert)
IF(LEFT((SELECT id_bill FROM
(SELECT MAX(CONCAT(LEFT(id_bill, 8),
LPAD(SUBSTR(id_bill, 10), 10, 0))) AS id_bill
FROM bills) b1), 4) = DATE_FORMAT(CURDATE(),'%Y'),
IFNULL(
(SELECT id_bill_tmp
FROM (SELECT id_bill_tmp
FROM bills
WHERE CONCAT(LEFT(id_bill, 8),
LPAD(SUBSTR(id_bill, 10), 10, 0)) =
(SELECT MAX(CONCAT(LEFT(id_bill, 8),
LPAD(SUBSTR(id_bill, 10), 10, 0)))
FROM bills)) b2),
0),
0)
+ 1,
CONCAT(DATE_FORMAT(CURDATE(),'%Y%m%d'), '-' , id_bill_tmp));
Notes
The query looks slightly more complicated than it actually is because MySQL won't let you directly use a subselect from the same table that's being inserted into. This is circumvented by wrapping another subselect around it, as described here.
Solution #2 - without the extra column
Demo
http://rextester.com/IYES40010
SQL
INSERT INTO bills (id_interess, id_bill) VALUES (
1, -- (Change this value appropriately for each insert)
CONCAT(DATE_FORMAT(CURDATE(),'%Y%m%d'),
'-' ,
IF(LEFT((SELECT id_bill
FROM (SELECT MAX(CONCAT(LEFT(id_bill, 8),
LPAD(SUBSTR(id_bill, 10), 10, 0))) AS id_bill
FROM bills) b1), 4) = DATE_FORMAT(CURDATE(),'%Y'),
IFNULL(
(SELECT id_bill_tmp
FROM (SELECT SUBSTR(MAX(CONCAT(LEFT(id_bill, 8),
LPAD(SUBSTR(id_bill, 10), 10, 0))), 9)
AS id_bill_tmp
FROM bills) b2),
0),
0)
+ 1));
Notes
This is along the same lines as above but gets the numeric value that would have been in id_bill_tmp by extracting from the right part of id_bill from the 10th character position onwards via SUBSTR(id_bill, 10).
Step by step breakdown
CONCAT(...) assembles the string by concatenating its parts together.
DATE_FORMAT(CURDATE(),'%Y%m%d') formats the current date as yyyymmdd (e.g. 20170923).
The IF(..., <x>, <y>) is used to check whether the most recent date that is already present is for the current year: If it is then the numeric part should continue by incrementing the sequence, otherwise it is reset to 1.
LEFT(<date>, 4) gets the year from the most recent date - by extracting from the first 4 characters of id_bill.
SELECT MAX(...) AS id_bill FROM bills gets the most recent date + sequence number from id_bill and gives this an alias of id_bill. (See the notes above about why the subquery also needs to be given an alias (b1) and then wrapped in another SELECT). See the two steps below for how a string is constructed such that MAX can be used for the ordering.
CONCAT(LEFT(id_bill, 8), ...) is constructing a string that can be used for the above ordering by combining the date part with the sequence number padded with zeros. E.g. 201709230000000001.
LPAD(SUBSTR(id_bill, 10), 10, 0) pads the sequence number with zeros (e.g. 0000000001) so that MAX can be used for the ordering. (See the comment by Paul Spiegel to understand why this needs to be done - e.g. so that sequence number 10 is ordered just after 9 rather than just after 1.)
DATE_FORMAT(CURDATE(),'%Y') formats the current date as a year (e.g. 2017) for the IF comparison mentioned above.
IFNULL(<x>, <y>) is used for the very first row since no existing row will be found so the result will be NULL. In this case the numeric part should begin at 1.
SELECT SUBSTR(MAX(...), 9) AS id_bill_tmp FROM bills selects the most recent date + sequence number from id_bill (as described above) and then extracts its sequence number, which is always from character position 9 onwards. Again, this subquery needs to be aliased (b2) and wrapped in another SELECT.
+ 1 increments the sequence number. (Note that this is always done since 0 is used in the cases described above where the sequence number should be set to 1).
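The padding step above is the subtle part: MAX() over strings compares lexicographically, so without zero-padding, sequence number 10 sorts between 1 and 9 and the wrong "maximum" wins. A tiny Python sketch makes this visible; the hypothetical sort_key function mirrors CONCAT(LEFT(id_bill, 8), LPAD(SUBSTR(id_bill, 10), 10, 0)):

```python
bills = ["20170923-1", "20170923-9", "20170923-10"]

def sort_key(id_bill):
    # First 8 chars are the date; the sequence number starts after the '-'.
    date_part, seq = id_bill[:8], id_bill[9:]
    return date_part + seq.zfill(10)  # zfill plays the role of LPAD(..., 10, 0)

print(max(bills))                # '20170923-9'  -- plain string max is wrong
print(max(bills, key=sort_key))  # '20170923-10' -- padded comparison is right
```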
If you are certain to be inserting in chronological order, then this will both bump the number and eliminate the need for the annual trigger:
DROP FUNCTION fcn46309431;
DELIMITER //
CREATE FUNCTION fcn46309431 (_max VARCHAR(22))
RETURNS VARCHAR(22)
DETERMINISTIC
SQL SECURITY INVOKER
BEGIN
RETURN
CONCAT(DATE_FORMAT(CURDATE(), "%Y%m%d"), '-',
IF( LEFT(_max, 4) = YEAR(CURDATE()),
SUBSTRING_INDEX(_max, '-', -1) + 1,
1 ) );
END
//
DELIMITER ;
INSERT INTO se46309431 (id_interess, id_bill)
SELECT 149, fcn46309431(MAX(id_bill)) FROM se46309431;
SELECT * FROM se46309431;
(If you might insert out of chronological order, then the MAX(..) can mess up.)
A similar solution is shown here: https://www.percona.com/blog/2008/04/02/stored-function-to-generate-sequences/
What you could do is to create a sequence with a table, as shown there:
delimiter //
create function seq(seq_name char (20)) returns int
begin
update seq set val=last_insert_id(val+1) where name=seq_name;
return last_insert_id();
end
//
delimiter ;
CREATE TABLE `seq` (
`name` varchar(20) NOT NULL,
`val` int(10) unsigned NOT NULL,
PRIMARY KEY (`name`)
)
Then you need to populate the sequence values for each year, like so:
insert into seq values('2017',1);
insert into seq values('2018',1);
insert into seq values('2019',1);
...
(only need to do this once)
Finally, this should work:
insert into bills (id_interess, id_bill)
select
123,
concat(date_format(now(), '%Y%m%d-'), seq(date_format(now(), '%Y')));
Just replace 123 with some real/unique/dynamic id and you should be good to go.
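The sequence-table idea can be sketched outside MySQL too. SQLite has no stored functions or LAST_INSERT_ID(), so this Python sketch reads the value back inside the same transaction instead; it also seeds the counter at 0 so the first call returns 1 (the original seeds it at 1, so its first generated number would be 2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq (name TEXT PRIMARY KEY, val INTEGER NOT NULL)")
conn.execute("INSERT INTO seq VALUES ('2017', 0)")

def next_val(conn, seq_name):
    # Increment and read back within one transaction, so concurrent
    # callers cannot observe the same value.
    with conn:
        conn.execute("UPDATE seq SET val = val + 1 WHERE name = ?", (seq_name,))
        return conn.execute("SELECT val FROM seq WHERE name = ?",
                            (seq_name,)).fetchone()[0]

print(next_val(conn, '2017'))  # 1
print(next_val(conn, '2017'))  # 2
```

The key property, as in the stored-function version, is that increment and read happen atomically, so each caller gets a distinct number.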
I think you should redesign your approach to make life easier.
I would design your table as follows:
id_interess |id_counter |id_bill |
------------+--------------+-----------+
1 |1 |20170912 |
2 |2 |20171231 |
3 |1 |20180101 |
Your desired output for the first row would be "20170912-1", but you would merge id_counter and id_bill in your SQL query or in your application logic, not directly in the table (here is why).
Now you can write your SQL statements for that table.
Furthermore, I would advise against storing the counter in the table at all. You could read only the records' id and date from your database and calculate id_counter in your application (or even in your SQL query).
You could also declare your column id_counter as auto_increment and reset it each time, see here.
One approach to do this in a single query would be to just save the date in your table whenever you update any record. For the id_bill number, generate the sequence when you want to display the records.
Schema
CREATE TABLE IF NOT EXISTS `bill` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
`bill_date` date NULL
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
Query
select a.id,concat(DATE_FORMAT(a.bill_date,"%Y%m%d"),'-',a.no) id_bill
from(
select b.*,count(b2.bill_date) no
from bill b
join bill b2 ON (EXTRACT(YEAR FROM b.bill_date) = EXTRACT(YEAR FROM b2.bill_date)
and b.bill_date >= b2.bill_date)
group by b.id
order by b.bill_date,no
) a
The inner query returns the rank of each record per year by joining the table to itself; the outer query just formats the data into the desired view.
DEMO
If there can be more than one entry for the same date, the auto_increment id column can be used in the inner query to handle that case:
Updated Query
select a.id,concat(DATE_FORMAT(a.bill_date,"%Y%m%d"),'-',a.no) id_bill
from(
select b.*,count(b2.bill_date) no
from bill b
join bill b2 ON (EXTRACT(YEAR FROM b.bill_date) = EXTRACT(YEAR FROM b2.bill_date)
and b.id >= b2.id)
group by b.id
order by b.bill_date,no
) a
Updated Demo
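The updated self-join ranking can be sketched in runnable form. This uses SQLite via Python's sqlite3 with a simplified bill table; strftime stands in for MySQL's DATE_FORMAT and EXTRACT(YEAR ...), but the join and grouping are the same pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bill (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    bill_date TEXT
);
INSERT INTO bill (bill_date) VALUES
    ('2017-09-12'), ('2017-10-30'), ('2017-11-25'),
    ('2017-12-31'), ('2018-01-01'), ('2018-01-01');
""")

# Rank each bill within its year by counting rows of the same year with a
# smaller-or-equal id, then glue the formatted date and rank together.
rows = conn.execute("""
    SELECT b.id,
           strftime('%Y%m%d', b.bill_date) || '-' || COUNT(b2.id) AS id_bill
    FROM bill b
    JOIN bill b2
      ON strftime('%Y', b.bill_date) = strftime('%Y', b2.bill_date)
     AND b.id >= b2.id
    GROUP BY b.id
    ORDER BY b.id
""").fetchall()
print(rows)
```

Note how the counter restarts at 1 for the first 2018 row, with no trigger involved: the reset falls out of grouping the self-join by year.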
The following solution requires generated (virtual) columns (available in MySQL 5.7 and MariaDB).
CREATE TABLE bills (
id_interess INT UNSIGNED NOT NULL,
bill_dt DATETIME DEFAULT CURRENT_TIMESTAMP,
bill_year YEAR AS (year(bill_dt)),
year_position INT UNSIGNED NULL,
id_bill VARCHAR(30) AS (concat(date_format(bill_dt, '%Y%m%d-'), year_position)),
PRIMARY KEY (id_interess),
INDEX (bill_year, year_position)
) ENGINE=InnoDB;
bill_year and id_bill are not stored in the table. They are derived from other columns. However - bill_year is stored in the index, which we need to get the last position for a specific year efficiently (it would also work without the index).
To insert a new row with the current timestamp:
insert into bills(id_interess, year_position)
select 1, coalesce(max(year_position), 0) + 1
from bills
where bill_year = year(now());
You can also use a custom timestamp or date:
insert into bills(id_interess, bill_dt, year_position)
select 10, '2016-01-01', coalesce(max(year_position), 0) + 1
from bills
where bill_year = year('2016-01-01')
Demo: https://www.db-fiddle.com/f/8pFKQb93LqNPNaD5UhzVwu/0
To get even simpler inserts, you can create a trigger which will calculate year_position:
CREATE TRIGGER bills_before_insert BEFORE INSERT ON bills FOR EACH ROW
SET new.year_position = (
SELECT coalesce(max(year_position), 0) + 1
FROM bills
WHERE bill_year = year(coalesce(new.bill_dt, now()))
);
Now your insert statement would look like:
insert into bills(id_interess) values (1);
or
insert into bills(id_interess, bill_dt) values (11, '2016-02-02');
And the select statements:
select id_interess, id_bill
from bills
order by id_bill;
Demo: https://www.db-fiddle.com/f/55yqMh4E1tVxbpt9HXnBaS/0
Update
If you really, really need to keep your schema, you can try the following insert statement:
insert into bills(id_interess, id_bill)
select
#id_interess,
concat(
date_format(#date, '%Y%m%d-'),
coalesce(max(substr(id_bill, 10) + 1), 1)
)
from bills
where id_bill like concat(year(#date), '%');
Replace #id_interess and #date accordingly. For #date you can use CURDATE() but also any other date you want. There is no issue inserting dates out of order. You can even insert dates from 2016 when entries for 2017 already exist.
Demo: http://rextester.com/BXK47791
The LIKE condition in the WHERE clause can use an index on id_bill (if you define one), so the query only needs to read the entries from the same year. But there is no way to determine the last counter value efficiently with this schema: the engine has to read all rows for the specified year, extract the counter, and search for the MAX value. Besides the complexity of the insert statement, this is one more reason to change the schema.

Left Join Sum and Comparison

I have a stock buying program for which there can exist multiple sell transactions per buy transaction. I am trying to create a query that will pull up any stocks for which I still have shares invested. For example, if I buy 50 shares of a stock and sell 20 one day and 10 on another, I should still have 20 shares left over. I have done most of the hard work but I swear I am missing something small. My current query will not return a result if there is no matching row in the sell_transactions table. In my example, that is transaction_id 3 in buy_transactions, which should return 100 shares but returns nothing. The following code can be put into SQL Fiddle and worked on.
Schema
CREATE TABLE `buy_transactions` (
`buy_transactions_id` int(11),
`buy_transaction_date` date,
`symbol` varchar(50),
`shares` int(11),
`price_per_share` decimal(10,6));
insert into buy_transactions values (1,'2016-01-25','A',15,100.000000);
insert into buy_transactions values (2,'2014-03-16','A',20,30.000000);
insert into buy_transactions values (3,'2016-01-15','AA',100,60.000000);
insert into buy_transactions values (4,'2015-05-05','AA',500,60.000000);
CREATE TABLE `sell_transactions` (
`sell_transactions_id` int(11) NOT NULL AUTO_INCREMENT,
`sell_transaction_date` varchar(45) DEFAULT NULL,
`shares` int(11) DEFAULT NULL,
`price` decimal(10,6) DEFAULT NULL,
`related_buy_transaction` int(11) DEFAULT NULL,
PRIMARY KEY (`sell_transactions_id`));
insert into sell_transactions values (1, '2016-01-25', 5, 120.000000, 1);
insert into sell_transactions values (2, '2016-01-25', 10, 130.000000, 1);
insert into sell_transactions values (3, '2016-01-25', 10, 50.000000, 2);
insert into sell_transactions values (4, '2016-01-15', 500, 61.000000, 4);
Current Query
select bt.buy_transactions_id, bt.symbol, bt.shares - rt.SoldShares as remaining_stock
from buy_transactions bt
left join
(select related_buy_transaction, sum(shares) as SoldShares from sell_transactions group by related_buy_transaction) rt
on bt.buy_transactions_id = rt.related_buy_transaction
where bt.shares - rt.SoldShares > 0;
Current Query Results
buy_transactions_id symbol remaining_stock
2 A 10
SQLFiddle
Use coalesce:
select bt.buy_transactions_id
, bt.symbol
, bt.shares - coalesce(rt.SoldShares, 0) as remaining_stock
from buy_transactions bt
left join
( select related_buy_transaction
, sum(shares) as SoldShares
from sell_transactions
group by related_buy_transaction ) rt on bt.buy_transactions_id = rt.related_buy_transaction
where bt.shares > coalesce(rt.SoldShares, 0);
SQLFiddle
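A quick runnable check of the COALESCE fix, using SQLite through Python's sqlite3 with a pared-down schema (column names are simplified from the question; the LEFT JOIN and COALESCE behave the same in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE buy_transactions  (id INTEGER, symbol TEXT, shares INTEGER);
CREATE TABLE sell_transactions (buy_id INTEGER, shares INTEGER);
INSERT INTO buy_transactions VALUES (1,'A',15),(2,'A',20),(3,'AA',100),(4,'AA',500);
INSERT INTO sell_transactions VALUES (1,5),(1,10),(2,10),(4,500);
""")

# COALESCE turns the NULL from an unmatched LEFT JOIN into 0, so buys
# with no sells at all (id 3 here) survive the comparison.
rows = conn.execute("""
    SELECT bt.id, bt.symbol, bt.shares - COALESCE(rt.sold, 0) AS remaining
    FROM buy_transactions bt
    LEFT JOIN (SELECT buy_id, SUM(shares) AS sold
               FROM sell_transactions GROUP BY buy_id) rt
      ON bt.id = rt.buy_id
    WHERE bt.shares > COALESCE(rt.sold, 0)
    ORDER BY bt.id
""").fetchall()
print(rows)  # [(2, 'A', 10), (3, 'AA', 100)]
```

Without the COALESCE, id 3 would compute 100 - NULL, which is NULL, and the WHERE comparison would silently drop the row.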

Efficiently SELECT a DB row marked as "latest version" and while matching a given interval

I have a table like so:
id min max version data
1 1 10 1 a
2 11 20 1 b
3 21 30 1 c
4 1 10 2 a
5 11 20 2 b
6 21 30 2 c
min, max represent values of key. Each (min, max) row within the given version is guaranteed to have mutually exclusive key intervals.
Suppose I have a key value of 5 and I want the latest version of data for that key. This means I want to select the row with id = 4.
Normally I want to select the set with the latest version, but sometimes I may specify the version number explicitly.
What I have now is this:
select * from range_table where 5 between `min` and `max` and ver = 2;
Question: is there a way to select the max version automatically (max ver), without specifying it explicitly? (By "efficiently" I mean without examining all table rows.)
To Recreate Table
drop table range_table;
CREATE TABLE `range_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`min` int(11) NOT NULL,
`max` int(11) NOT NULL,
`ver` int(11) NOT NULL default 1,
`data` CHAR NOT NULL,
PRIMARY KEY (`id`),
unique key ver_min_max(ver, `min`, `max`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
insert into range_table (`min`,`max`, ver, data) values
(1, 10, 1, 'a'),
(11, 20, 1, 'b'),
(21, 30, 1, 'c'),
(1, 10, 2, 'a'),
(11, 20, 2, 'b'),
(21, 30, 2, 'd');
You could take the first row ordered by ver desc...
select * from range_table where 5 between `min` and `max` order by ver desc limit 1;
If you care about performance, then, depending on the size and/or selectivity of the columns, you can add an index to the min or max column. If the number of versions remains low for each min-max pair, your query will be well optimized.
Please try the following to always select the latest version
select * from range_table where #key between `min` and `max` and ver = (select max(a.ver) as max_ver from range_table as a where #key between a.`min` and a.`max`)
where #key would be a given key value.
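Both variants can be checked with a quick sketch using SQLite through Python's sqlite3 (the BETWEEN logic is identical in MySQL; the min and max columns are quoted to avoid any clash with the aggregate function names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE range_table (
    id INTEGER PRIMARY KEY, "min" INTEGER, "max" INTEGER,
    ver INTEGER, data TEXT
);
INSERT INTO range_table ("min", "max", ver, data) VALUES
    (1,10,1,'a'), (11,20,1,'b'), (21,30,1,'c'),
    (1,10,2,'a'), (11,20,2,'b'), (21,30,2,'c');
""")

key = 5

# Variant 1: take the first matching row ordered by ver descending.
row = conn.execute("""
    SELECT id, ver FROM range_table
    WHERE ? BETWEEN "min" AND "max"
    ORDER BY ver DESC LIMIT 1
""", (key,)).fetchone()
print(row)  # (4, 2)

# Variant 2: a subquery picks MAX(ver) among intervals containing the key.
row2 = conn.execute("""
    SELECT id, ver FROM range_table
    WHERE ? BETWEEN "min" AND "max"
      AND ver = (SELECT MAX(a.ver) FROM range_table a
                 WHERE ? BETWEEN a."min" AND a."max")
""", (key, key)).fetchone()
print(row2)  # (4, 2)
```

Both return id 4, the version-2 interval containing the key, matching the expected answer from the question.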

Optimizing a MySQL query summing and averaging by multiple groups over a given date range

I'm currently working on a home-grown analytics system, currently using MySQL 5.6.10 on Windows Server 2008 (moving to Linux soon, and we're not dead set on MySQL, still exploring different options, including Hadoop).
We've just done a huge import, and what was a lightning-fast query for a small customer is now unbearably slow for a big one. I'm probably going to add an entirely new table to pre-calculate the results of this query, unless I can figure out how to make the query itself fast.
What the query does is take @StartDate and @EndDate as parameters, and calculates, for every day of that range, the date, the number of new reviews on that date, a running total of number of reviews (including any before @StartDate), and the daily average rating (if there is no information for a given day, the average rating will be carried over from the previous day).
Available filters are age, gender, product, company, and rating type. Every review has 1-N ratings, containing at the very least an "overall" rating, but possibly more per customer/product, such as "Quality", "Sound Quality", "Durability", "Value", etc...
The API that calls this injects these filters based on user selection. If no rating type is specified, it uses "AND ratingTypeId = 1" in place of the AND clause comment in all three parts of the query I'll be listing below. All ratings are integers between 1 and 5, though that doesn't really matter to this query.
Here are the tables I'm working with:
CREATE TABLE `times` (
`timeId` int(11) NOT NULL AUTO_INCREMENT,
`date` date NOT NULL,
`month` char(7) NOT NULL,
`quarter` char(7) NOT NULL,
`year` char(4) NOT NULL,
PRIMARY KEY (`timeId`),
UNIQUE KEY `date` (`date`)
) ENGINE=MyISAM
CREATE TABLE `reviewCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`totalReviews` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`)
) ENGINE=MyISAM
CREATE TABLE `ratingCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`ratingTypeId` int(11) NOT NULL,
`negativeRatings` int(10) unsigned NOT NULL DEFAULT '0',
`positiveRatings` int(10) unsigned NOT NULL DEFAULT '0',
`neutralRatings` int(10) unsigned NOT NULL DEFAULT '0',
`totalRatings` int(10) unsigned NOT NULL DEFAULT '0',
`ratingsSum` double unsigned DEFAULT '0',
`totalRecommendations` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`,`ratingTypeId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`),
KEY `ratingTypeId_fk` (`ratingTypeId`)
) ENGINE=MyISAM
The 'times' table is pre-filled with every day from 1900-01-01 to 2049-12-31, and the two count tables are populated by an ETL script with a roll-up query grouped by company, product, age, gender, ratingType, etc...
What I'm expecting back from the query is something like this:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
2013-01-25 5505 16091 4.058400718778077
2013-01-27 2043 18134 3.992957746478873
2013-01-28 3280 21414 3.983625730994152
2013-01-29 4648 26062 3.921597633136095
...
2013-03-09 1608 60297 3.9409722222222223
2013-03-10 470 60767 3.7743682310469313
2013-03-11 1028 61795 4.036697247706422
2013-03-13 494 62289 3.857388316151203
2013-03-14 449 62738 3.8282208588957056
I'm pretty sure I could pre-calculate everything grouped by age, gender, etc..., except for the average, but I may be wrong on that. If I had three reviews for two products on one day, with all other groups different, and one had a rating of 2 and 5, and the other a 4, the first would have a daily average of 3.5, and the second 4. Averaging those averages would give me 3.75, when I'd expect to get 3.66667. Maybe I could do something like multiplying the average for that grouping by the number of reviews to get the total rating sum for the day, sum those up, then divide them by total ratings count at the end. Seems like a lot of extra work, but it may be faster than what I'm currently doing. Speaking of which, here's my current query:
SET @cumulativeCount :=
(SELECT coalesce(sum(rc.totalReviews), 0)
FROM reviewCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
);
SET @dailyAverageWithCarry :=
(SELECT SUM(rc.ratingsSum) / SUM(rc.totalRatings)
FROM ratingCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
AND rc.totalRatings > 0
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.date DESC LIMIT 1
);
SELECT
subquery.d AS `Date`,
subquery.newReviewsCount AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + subquery.newReviewsCount) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(subquery.dailyRatingAverage, @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
(
SELECT
dt.date AS d,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount,
SUM(rac.ratingsSum) / SUM(rac.totalRatings) AS dailyRatingAverage
FROM times dt
LEFT JOIN reviewCount rc ON dt.timeId = rc.createdOnTimeId
LEFT JOIN ratingCount rac ON dt.timeId = rac.createdOnTimeId
WHERE dt.date BETWEEN @StartDate AND @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.timeId
) AS subquery;
The query currently takes ~2 minutes to run, with the following row counts:
times 54787
reviewCount 276389
ratingCount 473683
age 122
gender 3
ratingType 28
product 70070
Any help would be greatly appreciated. I'd either like to make this query much faster, or if it would be faster to do so, to pre-calculate the values grouped by date, age, gender, product, company, and ratingType, then do a quick roll-up query on that table.
UPDATE #1: I tried Meherzad's suggestions of adding indexes to times and ratingCount with:
ALTER TABLE times ADD KEY `timeId_date_key` (`timeId`, `date`);
ALTER TABLE ratingCount ADD KEY `createdOnTimeId_totalRatings_key` (`createdOnTimeId`, `totalRatings`);
Then ran my initial query again, and it was about 1s faster (~89s), but still too slow. I tried Meherzad's suggested query, and had to kill it after a few minutes.
As requested, here is the EXPLAIN results from my query:
id|select_type|table|type|possible_keys|key|key_len|ref|rows|Extra
1|PRIMARY|<derived2>|ALL|NULL|NULL|NULL|NULL|6808032|NULL
2|DERIVED|dt|range|PRIMARY,timeId_date_key,date|date|3|NULL|88|Using index condition; Using temporary; Using filesort
2|DERIVED|rc|ref|PRIMARY,companyId_fk,createdOnTimeId|createdOnTimeId|4|dt.timeId|126|Using where
2|DERIVED|rac|ref|createdOnTimeId,createdOnTimeId_total_ratings_key|createdOnTimeId|4|dt.timeId|614|NULL
I checked the cache read miss rate as mentioned in the article on buffer sizes, and it was
Key_reads 58303
Key_read_requests 147411279
For a miss rate of 3.9551247635535405672723319902814e-4
UPDATE #2: Solved! The indices definitely helped, so I'll give credit for the answer to Meherzad. What actually made the most difference was realizing that calculating the rolling average and daily/cumulative review counts in the same query was joining those two huge tables together. I saw that the variable initialization was done in two separate queries, and decided to try separating the two big queries into subqueries and then joining them based on the timeId. Now it runs in 0.358s with the following query:
SET @StartDate = '2013-01-24';
SET @EndDate = '2013-04-24';
SELECT
@StartDateId:=MIN(timeId), @EndDateId:=MAX(timeId)
FROM
times
WHERE
date IN (@StartDate , @EndDate);
SELECT
@CumulativeCount:=COALESCE(SUM(totalReviews), 0)
FROM
reviewCount
WHERE
createdOnTimeId < @StartDateId
-- Add Filters
;
SELECT
@DailyAverage:=COALESCE(SUM(ratingsSum) / SUM(totalRatings), 0)
FROM
ratingCount
WHERE
createdOnTimeId < @StartDateId
AND totalRatings > 0
-- Add Filters
GROUP BY createdOnTimeId
ORDER BY createdOnTimeId DESC
LIMIT 1;
SELECT
t.date AS `Date`,
COALESCE(q1.newReviewsCount, 0) AS `NewReviews`,
(@CumulativeCount:=@CumulativeCount + COALESCE(q1.newReviewsCount, 0)) AS `CumulativeReviewsCount`,
(@DailyAverage:=COALESCE(q2.dailyRatingAverage,
COALESCE(@DailyAverage, 0))) AS `DailyRatingAverage`
FROM
times t
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount
FROM
reviewCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q1 ON t.timeId = q1.createdOnTimeId
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
SUM(rc.ratingsSum) / SUM(rc.totalRatings) AS dailyRatingAverage
FROM
ratingCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q2 ON t.timeId = q2.createdOnTimeId
WHERE
t.timeId BETWEEN @StartDateId AND @EndDateId;
I had assumed that two subqueries would be incredibly slow, but they were insanely fast because they weren't joining completely unrelated rows. It also pointed out the fact that my earlier results were way off. For example, from above:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
Should have been, and now is:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
The average was correct, but the join was screwing up the number of both new and cumulative reviews, which I verified with a single query.
I also got rid of the joins to the times table, instead determining the start and end date IDs in a quick initialization query, then just rejoined to the times table at the end.
Now the results are:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
2013-01-25 551 407878 4.058400718778077
2013-01-26 455 408333 3.838926174496644
2013-01-27 433 408766 3.992957746478873
2013-01-28 425 409191 3.983625730994152
...
2013-04-13 170 426066 3.874239350912779
2013-04-14 182 426248 3.585714285714286
2013-04-15 171 426419 3.6202531645569622
2013-04-16 0 426419 3.6202531645569622
2013-04-17 0 426419 3.6202531645569622
2013-04-18 0 426419 3.6202531645569622
2013-04-19 0 426419 3.6202531645569622
2013-04-20 0 426419 3.6202531645569622
2013-04-21 0 426419 3.6202531645569622
2013-04-22 0 426419 3.6202531645569622
2013-04-23 0 426419 3.6202531645569622
2013-04-24 0 426419 3.6202531645569622
The last few averages properly carry the earlier ones, too, since we haven't imported from that customer's data feed in about 10 days.
Thanks for the help!
Try this query
You don't have the necessary indexes to optimize your query:
Table times: add a compound index on (timeId, date)
Table ratingCount: add a compound index on (createdOnTimeId, totalRatings)
As you have already mentioned that you add various other AND filters according to user input, create a compound index on those columns, in the order you add them, for each respective table. Ex: Table ratingCount, compound index (createdOnTimeId, totalRatings, ratingType, age, gender, product, company). NOTE: This index will be useful only if you add these constraints in the query.
I'd also check to make sure your buffer pool is large enough to hold your indexes. You don't want indexes to be paging in and out of the buffer pool during a query.
Check your buffer pool size
BUFFER_SIZE
If you don't find any improvement in performance, please also post the EXPLAIN output for your query; it will help in understanding the problem properly.
I have tried to understand your query and made a new one; check whether it works or not.
SELECT
*
FROM
(SELECT
dt.timeId,
dt.date,
COALESCE(SUM(rc.totalReviews), 0) AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + COALESCE(SUM(rc.totalReviews), 0)) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(SUM(rac.ratingsSum) / SUM(rac.totalRatings), @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
times dt
LEFT JOIN
reviewCount rc
ON
dt.timeId = rc.createdOnTimeId
LEFT JOIN
ratingCount rac ON dt.timeId = rac.createdOnTimeId
JOIN
(SELECT @cumulativeCount := 0, @dailyAverageWithCarry := 0) tmp
WHERE
dt.date < @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY
dt.timeId
ORDER BY
dt.timeId
) AS subquery
WHERE
subquery.date > @StartDate;
Hope this helps....
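The @cumulativeCount and @dailyAverageWithCarry variables implement a running total and a carry-forward average: the average only updates on days that have ratings. A small Python sketch of that logic, on made-up daily rows (None marks a day with no ratings):

```python
# Each tuple: (newReviews, ratingsSum, totalRatings); None = no ratings that day.
# The first two rows echo the 2013-01-24/25 figures from the results above;
# the trailing zero-review days mimic the stalled data feed.
days = [(599, 2406, 599), (551, 2236, 551), (0, None, None), (0, None, None)]

cumulative = 0      # plays the role of @cumulativeCount
carry_avg = 0.0     # plays the role of @dailyAverageWithCarry
rows = []
for new_reviews, ratings_sum, total_ratings in days:
    cumulative += new_reviews            # @cumulativeCount := @cumulativeCount + NewReviews
    if total_ratings:                    # COALESCE(...): keep the last average when no ratings
        carry_avg = ratings_sum / total_ratings
    rows.append((new_reviews, cumulative, carry_avg))
```

Note that MySQL does not guarantee the evaluation order of user variables within a SELECT, so in practice this pattern relies on the GROUP BY/ORDER BY driving rows through in timeId order.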

selecting row with many child rows

I am hoping I can make myself understood enough! I have the following SQL query:
SELECT DATE_FORMAT(calendar_date,'%W %D, %M, %Y') AS calendar_date,calendar_entry_title,calendar_entry_teaser
FROM calendar_month
LEFT JOIN calendar_entry ON calendar_entry.calendar_id = calendar_month.calendar_id
ORDER BY calendar_date
Here are the details of the tables I am dealing with.
CREATE TABLE IF NOT EXISTS `calendar_entry` (
`calendar_entry_id` int(11) NOT NULL AUTO_INCREMENT,
`calendar_id` int(11) NOT NULL,
`school_id` int(11) NOT NULL,
`calendar_entry_title` varchar(250) NOT NULL,
`calendar_entry_teaser` varchar(250) NOT NULL,
`calendar_entry_text` text NOT NULL,
PRIMARY KEY (`calendar_entry_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;
--
-- Dumping data for table `calendar_entry`
--
INSERT INTO `calendar_entry` (`calendar_entry_id`, `calendar_id`, `school_id`, `calendar_entry_title`, `calendar_entry_teaser`, `calendar_entry_text`) VALUES
(1, 1, 1, 'School Event 1', 'School event information 1', 'This would be the full body of the text that would show on the full page for this given entry'),
(2, 1, 1, 'School Event 2', 'School event information 2', 'This would be the full body of the text that would show on the full page for this given entry');
CREATE TABLE IF NOT EXISTS `calendar_month` (
`calendar_id` int(11) NOT NULL AUTO_INCREMENT,
`school_id` int(11) NOT NULL,
`calendar_date` date NOT NULL,
PRIMARY KEY (`calendar_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;
--
-- Dumping data for table `calendar_month`
--
INSERT INTO `calendar_month` (`calendar_id`, `school_id`, `calendar_date`) VALUES
(1, 1, '2012-08-11'),
(2, 1, '2012-08-12');
The problem I have is that there are only 2 rows in the calendar_month table, and one of them has 2 related rows in the calendar_entry table. When I run my query it displays 3 rows. What I need is only 2 rows: the date that has two entries should be displayed as one row. Can this be done with how I have set it up?
Thanks
result -
Saturday 11th, August, 2012 School Event 1 School event information 1
Saturday 11th, August, 2012 School Event 2 School event information 2
Sunday 12th, August, 2012 NULL NULL
What I actually want -
Saturday 11th, August, 2012 School Event 1 School event information 1 School Event 2 School event information 2
Sunday 12th, August, 2012 NULL NULL
Did you try:
SELECT DATE_FORMAT(calendar_date,'%W %D, %M, %Y') AS calendar_date, TMP.var1
FROM calendar_month
LEFT JOIN
(SELECT GROUP_CONCAT(calendar_entry_title, ' ',calendar_entry_teaser) AS var1, calendar_id
FROM calendar_entry
GROUP BY calendar_id) AS TMP ON TMP.calendar_id = calendar_month.calendar_id
ORDER BY calendar_date
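The GROUP_CONCAT approach can be verified with a quick SQLite sketch (SQLite's group_concat takes a single expression plus an optional separator, so the title and teaser are joined with || instead of MySQL's multi-argument form, and DATE_FORMAT is dropped since SQLite lacks it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE calendar_month (calendar_id INTEGER PRIMARY KEY, school_id INTEGER,
                             calendar_date TEXT);
CREATE TABLE calendar_entry (calendar_entry_id INTEGER PRIMARY KEY, calendar_id INTEGER,
                             school_id INTEGER, calendar_entry_title TEXT,
                             calendar_entry_teaser TEXT);
INSERT INTO calendar_month VALUES (1, 1, '2012-08-11'), (2, 1, '2012-08-12');
INSERT INTO calendar_entry VALUES
  (1, 1, 1, 'School Event 1', 'School event information 1'),
  (2, 1, 1, 'School Event 2', 'School event information 2');
""")

# One output row per calendar date; entries for the same date are concatenated
rows = cur.execute("""
SELECT cm.calendar_date, tmp.var1
FROM calendar_month cm
LEFT JOIN (SELECT calendar_id,
                  GROUP_CONCAT(calendar_entry_title || ' ' || calendar_entry_teaser,
                               ', ') AS var1
           FROM calendar_entry
           GROUP BY calendar_id) AS tmp ON tmp.calendar_id = cm.calendar_id
ORDER BY cm.calendar_date
""").fetchall()
```

The date with two entries collapses into a single row, and the date with no entries still appears with a NULL, exactly the shape the question asks for.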
You can't get that kind of result from MySQL alone without running a subquery for each row, which will sooner or later overload your server. Instead, use PHP to walk through the result and group the rows by date.
example:
$query = "SELECT DATE_FORMAT(calendar_date,'%W %D, %M, %Y') AS calendar_date,
calendar_entry_title,
calendar_entry_teaser
FROM calendar_month
LEFT JOIN calendar_entry ON calendar_entry.calendar_id = calendar_month.calendar_id
ORDER BY calendar_date, calendar_entry_title";
$result = mysql_query($query);
$events = array();
// mysql_query() returns a result resource, not an array, so fetch row by row
while ($r = mysql_fetch_assoc($result)) {
$events[$r['calendar_date']][] = $r;
}
echo '<pre>';
print_r($events);
echo '</pre>';