Large mysql query with group by - mysql

I have a pricing history table with half a billion records. It is formatted like this:
Id, sku, vendor, price, datetime
What I want to do is get average price of all products by vendor for a certain date range. Most products are updated once every 3 days, but it varies.
So, this is the query I want to run:
SELECT
avg(price)
FROM table
WHERE
vendor='acme'
AND datetime > '12-15-2014'
AND datetime < '12-18-2014'
GROUP BY sku
This 3-day range is broad enough that I will definitely get at least one price sample, but some skus may have been sampled more than once, hence the GROUP BY to try to get only one instance of each sku.
The problem is, this query runs and runs and doesn't seem to finish (more than 15 minutes). There are around 500k unique skus.
Any ideas?
edit: corrected asin to sku

For MySQL to optimize this query, you need to create a composite index
(vendor, datetime, sku)
IN THIS PARTICULAR ORDER (it matters)
It is also worth trying another one
(vendor, datetime, sku, price)
since it may perform better (it is a so-called "covering index", so the query can be answered from the index alone).
Indexes with a different column order, like (datetime, vendor) (which is suggested in another answer), are of little use here: datetime is used in a range comparison, so the vendor column after it cannot narrow the scan.
Few notes:
The index will only help if the vendor='acme' AND datetime > '12-15-2014' AND datetime < '12-18-2014' filter matches a small part of the whole table (say, less than 10%).
MySQL does not support mm-dd-yyyy literals (at least it's not documented, see references), so I assume it must be yyyy-mm-dd instead.
Your comparison does not include the first second of December 15th, 2014, so you probably wanted datetime >= '2014-12-15' instead.
References:
http://dev.mysql.com/doc/refman/5.6/en/range-optimization.html
http://dev.mysql.com/doc/refman/5.6/en/date-and-time-literals.html
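The date-literal note above can also be handled on the client before the query is sent. What follows is only a minimal sketch in Python, assuming the application receives dates as mm-dd-yyyy strings like the ones in the question; the helper name to_mysql_date is hypothetical and not part of any MySQL driver.

```python
from datetime import datetime


def to_mysql_date(text: str) -> str:
    """Convert an mm-dd-yyyy string (assumed input format) to yyyy-mm-dd."""
    return datetime.strptime(text, "%m-%d-%Y").strftime("%Y-%m-%d")


# Bounds intended for: datetime >= start AND datetime < end
start = to_mysql_date("12-15-2014")
end = to_mysql_date("12-18-2014")
print(start, end)
```

Parsing with strptime also rejects malformed input (it raises ValueError), which is a cheap validation step before the value reaches the database.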

You need an index to support your query. Suggest you create an index on vendor and datetime like so:
CREATE INDEX pricing_history_date_vendor ON pricing_history (datetime, vendor);
Also, I assume you wanted to group by sku rather than undefined column asin.
Not to mention your non-standard SQL date format MM-dd-yyyy as pointed out by others in comments (should be yyyy-MM-dd).

Related

MySQL - Group By date/time functions on a large table

I have a bunch of financial stock data in a MySQL table. The data is stored in a 1min tick per row format (OHLC). From that data I'd like to create 30min/hourly/daily aggregates. The problem is that the table is enormous, and grouping by date functions on the timestamp column yields horrible performance.
Ex: The following query produces the right result but ends up taking too long.
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary index on the (market, timestamp) columns. And I have also added an additional index on the timestamp column. However, that is not of much help as the usage of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a different database than MySQL that provides specialized date/time indexes? if so what would be a good option?
One thing to note is that it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
MySQL version: 5.7
Thanks in advance for the help.
Edit: Here is what Explain shows on a smaller DB of the exact same format:

Why does MySQL drop my index when using DATE(`table`.`column`)

I have a MySQL innodb table with a few columns.
one of them is named "dateCreated" which is a DATETIME column and it is indexed.
My query:
SELECT
*
FROM
`table1`
WHERE
DATE(`dateCreated`) BETWEEN '2014-8-7' AND '2013-8-7'
MySQL for some reason refuses to use the index on the dateCreated column (even with USE INDEX or FORCE INDEX).
However, if I change the query to this:
SELECT
*
FROM
`table1`
WHERE
`dateCreated` BETWEEN '2014-8-7' AND '2013-8-7'
note the DATE(...) removal
MySQL uses the index just fine.
I could manage without using the DATE() function, but this is just weird to me.
I understand that maybe MySQL indexes the full date and time, and when searching on only a part of it, it gets confused or something. But there must be a way to use a partial date (let's say MONTH(...) or DATE(...)) and still benefit from the indexed column and avoid the full table scan.
Any thoughts..?
Thanks.
As you have observed, once you apply a function to that field you destroy access to the index.
It will also help if you don't use BETWEEN. The rationale for applying the function to the data is to make the data match the parameters. But there are just 2 parameter dates and several hundred? thousand? million? rows of data. Why not reverse this and change the parameters to suit the data (making it a "sargable" predicate)?
SELECT
*
FROM
`table1`
WHERE
( `dateCreated` >= '2013-08-07' AND `dateCreated` < '2014-08-07' )
;
Note that 2013-08-07 is used first, and this needs to be true if using BETWEEN as well: you will not get any results from BETWEEN if the first date is later than the second date.
Also note that exactly 12 months of data is contained >= '2013-08-07' AND < '2014-08-07', I presume this is what you are seeking.
Using the combination of date(dateCreated) and between would include 1 too many days as all events during '2014-08-07' would be included. If you deliberately wanted one year and 1 day then add 1 day to the higher date i.e. so it would be < '2014-08-08'
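Changing the parameters to suit the data can be done in application code. The following is a rough sketch in Python, not a definitive implementation: half_open_bounds is a hypothetical helper, and the %s placeholder style is an assumption that depends on the MySQL driver in use. It turns an inclusive pair of calendar days into the >= / < bounds described above, adding one day to the upper bound so that all events on the last day are included.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple


def half_open_bounds(first_day: date, last_day: date) -> Tuple[datetime, datetime]:
    """Return (start, end) so that start <= value < end covers every
    moment from the start of first_day through the end of last_day."""
    if first_day > last_day:
        raise ValueError("first_day must not be later than last_day")
    start = datetime.combine(first_day, time.min)
    # One day past last_day at midnight: an exclusive upper bound.
    end = datetime.combine(last_day + timedelta(days=1), time.min)
    return start, end


# Hypothetical parameterized query; the column stays bare, so an index
# on dateCreated remains usable.
QUERY = (
    "SELECT * FROM `table1` "
    "WHERE `dateCreated` >= %s AND `dateCreated` < %s"
)

start, end = half_open_bounds(date(2013, 8, 7), date(2014, 8, 7))
print(start, end)
```

For exactly 12 months of data, pass 2014-08-06 as the last day instead, which yields the < '2014-08-07' bound used in the query above.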

SmallInt vs Date in MySQL - Performance, Flexibility and Size

I am creating a table that stores data, actually counter, for products for each week.
Example:
id = 1
productId = 195
DateTime = 01/07/2012
Counter = 0
My question to you is about database storage space, query flexibility and performance.
Instead of the DateTime column, I thought about using a SmallInt 'WeekNumber' column.
I will decide on the Date that the weeks start (base date). Let's say 10/10/2012.
For each product and for each week, there will be a row that represents the total of something that I count on a daily basis (ie. Pageviews for a specific product page).
From what I've read:
Date column is 4 bytes
SmallInt is 2 bytes
I want to save as much space as possible, but I want to be able to query the database base on range of dates (august 2012 to September 2013), specific week in a specific year, etc.
Is this approach to the schema good, or will I find myself having problems with poor SQL performance, query flexibility, indexes, etc.?
Consider the sacrifice and complication you're going to make in order to save one byte....
In order to use the smallint you're going to pass every call to the data through a function to get its "week number" starting from your own arbitrary date.... This is neither more performant nor more clear.
The query is, likewise, not as flexible because each one will need to compare based on your magic "starting date" rather than just a date compare/group. Your queries will likely not be SARGable and will probably be slower
EDIT: From your comments you have a hard limit of 50GB.... that's a lot of space for an aggregation DB like you're discussing. You're inviting undue stress and loss of sustainability by complicating this.
According to MySQL, the DATE type is only 3 bytes compared to the 2 bytes for the SMALLINT
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
You are, therefore, going to save one byte per row (you say 2000 per week)... so let's say 2KB per week, 104 KB per year.....
If this table has no child tables (no foreign keys referencing it), to conserve space, you might consider omitting the surrogate primary key (id), and instead use a composite key (productId,date_) as the primary key. (From what you describe, it sounds as if you are going to want the combination of those columns to be UNIQUE, and both of those columns to be NOT NULL.)
If what you want to store is a "week" identifier rather than a DATE, there's no problem on the database side of things, as long as your queries aren't wrapping that column in an expression to get a DATE value to use in predicates. That is, for performance, your predicates are going to need to be on the bare "week identifier" column, e.g.
WHERE t.product_id = 195 AND t.week_id >= 27 AND t.week_id < 40
Predicates like that on the bare column will be sargable (that is, they allow an index to be used). You do NOT want to be wrapping that week_id column in an expression to return a DATE, and then using a WHERE clause on that expression. (Having expressions on the literal side of the comparison is not a problem... you just don't want them on the "table" side.)
That's really going to be the determining factor of whether you can use a week_id in place of a DATE column.
Using a "period id" in place of a DATE is fairly straightforward to implement for periods that are whole months. (It's also straightforward for "days", but is really of less benefit there.) Implementing this approach for "week" periods is more complicated, because of the handling you need for a week that is split between two years.
Consider, for example, that the last two days of this year (2012) are on Sunday and Monday, but Tuesday thru Saturday of that same week are in 2013. You'd need to decide whether that's two separate weeks, or whether that's the same week.
But the 1-byte savings (of SMALLINT vs DATE) isn't the real benefit. What the "week_id" column gets you (as I see it) is that you have a single id value that identifies a week. Consider the date values '2012-07-30', '2012-07-31' and '2012-08-01': they all really represent the same week. So you have multiple values for the week, such that a UNIQUE constraint on (product_id,date) doesn't really GUARANTEE (on the database side) that you don't have more than one row for the same week. (That's not an insurmountable problem of course, you can specify that you only store a Sunday (or Monday) date value.)
In summary,
To conserve space, I would first drop that surrogate id column, and make the combination of the product_id and the DATE be the primary key.
Then I would ONLY consider changing that DATE into a SMALLINT, if I could GUARANTEE that all queries would be referencing that bare SMALLINT column, and NOT referencing an expression that converts the SMALLINT column back into DATE.
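To keep predicates on the bare week_id column, the date-to-week conversion can happen on the literal side, before the query is built. Here is a minimal sketch in Python under stated assumptions: the base date is taken to be 2012-10-10 (the "10/10/2012" from the question), weeks are counted as whole 7-day blocks from that date, and the helper names week_id and week_id_range are hypothetical.

```python
from datetime import date
from typing import Tuple

BASE_DATE = date(2012, 10, 10)  # assumed start of week 0


def week_id(day: date, base: date = BASE_DATE) -> int:
    """Number of whole 7-day weeks between base and day (0 for the first week)."""
    delta = (day - base).days
    if delta < 0:
        raise ValueError("day is before the base date")
    return delta // 7


def week_id_range(first_day: date, last_day: date,
                  base: date = BASE_DATE) -> Tuple[int, int]:
    """Half-open [low, high) week_id bounds for every week that
    overlaps first_day..last_day."""
    return week_id(first_day, base), week_id(last_day, base) + 1


low, high = week_id_range(date(2013, 1, 1), date(2013, 3, 31))
# Intended use: WHERE t.week_id >= low AND t.week_id < high
print(low, high)
```

Counting from a fixed base date sidesteps the week-split-across-two-years question, at the cost that the ids no longer line up with calendar week numbers. Note that the range includes weeks that only partly overlap the requested dates.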

storing dates in mysql

Is it better to store dates in MySQL in three columns or in just one column? Which one is faster? Also, if I just want to do inserts with today's date in the format dd/mm/yy, how do I do that, and how do I do selects with that? Also, let's say I wanted to get results for all the Wednesdays, how do I do that? Or, let's say, for one date, the 25th of all the months and years, how do I do that?
Thanks People.
I am using PHP with Apache and Mysql.
What are the drawbacks of using the structure that I am proposing. I can easily get all the 25th by using the date table and I can get all the days using another column for days. How much difference would be there in the terms of speed between my proposed solution and using a DATE table?
You will want to use a proper column type, such as DATE, DATETIME, or TIMESTAMP, depending on your needs. They are built specifically to handle dates, and can more easily perform other functions (adding, comparing, etc.) that would be difficult to perform on 3 separate columns.
Read this for more info.
DAYOFWEEK(date) will give you a numeric representation for the day. In your case, 4 = Wednesday. DAYOFMONTH(date) will work for finding all 25th days of the month.
DAYNAME(date) will return the name of the weekday for date
Also, if I just want to do inserts with todays date in format dd/mm/yy ,how do I do that.
Well it depends on the format your date is passed in through your
form but you are going to want to store your date in YYYY-mm-dd format.
INSERT INTO my_table (timefieldname) VALUES ( '$date' );
and also how do I do selects with that.
SELECT timefieldname FROM my_table;
-- or you can format the date; this will give you month/day/year, e.g. 01/01/2012
SELECT DATE_FORMAT(timefieldname, '%m/%d/%Y') FROM my_table;
Also, lets say if I wanted to get results for all the wednesdays,
SELECT timefieldname FROM my_table WHERE DAYNAME(timefieldname) = 'Wednesday';
How do I do that or lets say one date 25th of all the months and years, how do i do that.
SELECT timefieldname FROM my_table WHERE DAY(timefieldname) = 25;
You can free up having to pass dates from your codebase and let mysql insert them for you, provided they are time stamps:
ALTER TABLE tablename ADD `timefieldname` TIMESTAMP ON UPDATE CURRENT_TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ;
It's not much of a speed boost, but it does reduce your need to code and validate variables passed to the database.

MySql Query Grouping by Time

I am trying to create a report to understand the time of day that orders are being placed, so I need to sum and group them by time. For example, I would like a sum of all orders placed between 1:00 and 1:59, then the next row listing the sum of all orders between 2:00 and 2:59, etc. The field is a datetime column, but for the life of me I haven't been able to find the right query to do this. Any suggestions sending me down the right path would be greatly appreciated.
Thanks
If by luck it is mysql and by sum of orders you mean the number of orders and not the value amount:
select date_format(date_field, '%Y-%m-%d %H') as the_hour, count(*)
from my_table
group by the_hour
order by the_hour
This kind of grouping (using a calculated field) will certainly not scale over time. If you really need to execute this specific GROUP BY/ORDER BY frequently, you should create an extra field (a TINYINT UNSIGNED field will suffice) storing the hour and place an INDEX on that column.
That is, of course, if your table is becoming quite big. If it is small (which cannot be stated in a mere number of records, because it is also a matter of server configuration and capabilities), you probably won't notice much difference in performance.