creating a series of time periods as rows - mysql

I want to write a query that, for any given start date in the past, returns one row per complete week-long interval from that date up to the present.
For instance, given a start date of November 13th 2010 and a present date of December 16th 2010, I want a result set like
+------------+------------+
| Start      | End        |
+------------+------------+
| 2010-11-15 | 2010-11-21 |
| 2010-11-22 | 2010-11-28 |
| 2010-11-29 | 2010-12-05 |
| 2010-12-06 | 2010-12-12 |
+------------+------------+
It doesn't go past December 12th because the week-long period that the present date falls in isn't complete yet.
I can't get a foothold on how I would even start to write this query.
Can I do this in a single query? Or should I use code for looping, and do multiple queries?

It's quite difficult (but not impossible) to create such a result set dynamically in MySQL, as it doesn't yet support any of the features I would use to do this in other databases: recursive CTEs, CONNECT BY, or generate_series.
Here's an alternative approach you can use.
Create and prepopulate a table containing all the possible rows from some date far in the past to some date far in the future. Then you can easily generate the result you need by querying this table with a WHERE clause, using an index to make the query efficient.
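A minimal sketch of that approach (the table and column names here are assumptions, not from the question):
-- Hypothetical prepopulated calendar of week-long periods.
CREATE TABLE week_periods (
    start_date DATE NOT NULL PRIMARY KEY,
    end_date   DATE NOT NULL
);

-- Once populated with Monday-to-Sunday weeks, the desired result
-- is a simple indexed range query:
SELECT start_date, end_date
FROM week_periods
WHERE start_date >= '2010-11-13'
  AND end_date < '2010-12-16';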
The drawbacks of this approach are quite obvious:
It takes up storage space unnecessarily.
If you query outside of the range that you populated the table with, you won't get any results. This means you will either have to populate the table with enough dates to last the lifetime of your application, or have a script add more dates every so often.
See also this related question:
How do I make a row generator in MySQL

Beware, this is just a concept idea: I do not have a MySQL installation at hand, so I cannot test it.
However, I would base this on a table containing integers, in order to emulate a series.
Something like:
CREATE TABLE integers_table (
    id INTEGER PRIMARY KEY
);
Followed by (warning, this is pseudocode):
INSERT INTO integers_table(0…32767);
(that should be enough weeks for the rest of our lives :-)
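One way to actually run that population step (just a sketch; the procedure name is arbitrary, and a one-off script or a cross join of digit tables would work just as well):
-- One-off population of 0..32767; row-by-row, so slow, but it only runs once.
DELIMITER //
CREATE PROCEDURE fill_integers()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i <= 32767 DO
        INSERT INTO integers_table (id) VALUES (i);
        SET i = i + 1;
    END WHILE;
END//
DELIMITER ;
CALL fill_integers();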
Then
FirstMondayInUnixTimestampUtc = 3600 * 24 * 4
SecondsPerDay = 3600 * 24
(since 1 Jan 1970 was a Thursday, the first Monday fell four days later. Beware, I did not cross-check! I might be off a few hours!)
And then (with the constants inlined, since plain SQL has no named constants):
CREATE VIEW weeks AS
SELECT id AS week_id,
       FROM_UNIXTIME(345600 + id * 86400 * 7) AS week_start,
       FROM_UNIXTIME(345600 + id * 86400 * 7 + 86400 * 6) AS week_end
FROM integers_table;
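Querying that view should then yield the original example's result:
SELECT week_start, week_end
FROM weeks
WHERE week_start >= '2010-11-13'
  AND week_end < '2010-12-16';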

Related

Sort Array results based on variable Algorithm - MYSQL & PHP

I have an array nested within a PHP while loop that outputs a set of forum posts a number of times. I want to sort the array results based on an algorithm; however, I do not want to hardcode the algorithm, so that I can test different variables at a later date. NB: I'm not looking to sort the items within the array, but rather the final output, which when looped will output the array 20+ times.
Currently I have 2 Tables - the Forum table with loads of rows (3000 +):
id | name | date_add | votes | ... |
1 | Test Name | 1234567890 | 2 | ... |
... | ... | ... | ... | ... |
The other table contains the Algorithm variables that I want to pass through to the calculation and has only 1 row:
id | vote_reduction | time_variable | gravity |
1 | 1 | 2 | 1.8 |
The specific algorithm I'm using sorts the information based on how long it has been live (in hours) and how many votes it has; the gravity factor makes it more sensitive to time. In full:
(votes - vote_reduction)/((Hours Live + time_variable) ^ gravity)
So far I've managed to get this far, and something is going wrong but I can't quite figure it out:
SELECT forum.*,
((forum.votes - algorithm.vote_reduction)/POW(((TIMESTAMPDIFF(HOUR, SYSDATE(), forum.date_add)) + algorithm.time_variable),algorithm.gravity)) AS algorithm.al,
forum.name, forum.id
FROM forum as forum
LEFT JOIN algorithm AS algorithm ON (algorithm.id='1')
ORDER BY algorithm.al
Any ideas?
I haven't tested the results of the algorithm, but the query returns a result for al if you just remove algorithm. from algorithm.al. I don't think you can make a column alias that acts like it's part of a table. What's confusing me is that you say that it's running on your machine. It's not running on SQL Fiddle and is throwing an error.
SELECT forum.*,
((forum.votes - algorithm.vote_reduction)/POW(((TIMESTAMPDIFF(HOUR, SYSDATE(), forum.date_add)) + algorithm.time_variable),algorithm.gravity)) AS al
FROM forum AS forum
LEFT JOIN algorithm AS algorithm ON (algorithm.id='1')
ORDER BY al
Link to SQL fiddle
There are a few errors in the code as follows:
Making an alias with the name "algorithm" clashes with a MySQL clause also called ALGORITHM.
The calculation (at least the way it is edited above) passes too many values to POW.
Encapsulating all declared aliases in ' ' makes the code more foolproof, but the ORDER BY clause doesn't like quotation marks (so remove them there).
The SYSDATE() and forum.date_add fields are in different formats, the latter being a Unix timestamp.
To fix (note that a column alias like timedif can't be reused inside the same SELECT list, so the expression is repeated, and a table alias must not be quoted):
SELECT forum.*,
       TIMESTAMPDIFF(HOUR, FROM_UNIXTIME(forum.date_add), NOW()) AS timedif,
       ((forum.votes - alg.vote_reduction) / POW((TIMESTAMPDIFF(HOUR, FROM_UNIXTIME(forum.date_add), NOW()) + alg.time_variable), alg.gravity)) AS al
FROM forum AS forum
LEFT JOIN algorithm AS alg ON (alg.id = '1')
ORDER BY al

Aggregate data tables

I am building a front-end to a largish DB (tens of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the frontend users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how rarely any one set of data will be seen, caching the query data in memcached or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Besides the data coming in from meters, the table is read-only; users don't ever have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
       MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) duplicate key update (not sure how the aggregation would be done here, or how to make sure the data is accurate: not counted twice and no missing rows)
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate roughly 100k rows into roughly 30k rows across 5 tables (_6h, _1d, _7d, _1m, _1y).
TL;DR What is the best way to view / store aggregate data for numerous reports that can't be cached effectively.
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
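A hedged sketch of that backlog variant, reusing the question's readings tables (the columns are taken from the question, but the trigger, procedure, and event names are illustrative, and it is an assumption here that readings_1d has a unique key on (company_id, datetime) so the upsert works):
CREATE TABLE readings_backlog LIKE readings;

-- Copy each new reading into the backlog as it arrives.
CREATE TRIGGER readings_after_insert
AFTER INSERT ON readings FOR EACH ROW
    INSERT INTO readings_backlog (company_id, datetime, reading, used, cost)
    VALUES (NEW.company_id, NEW.datetime, NEW.reading, NEW.used, NEW.cost);

DELIMITER //
CREATE PROCEDURE refresh_readings_1d()
BEGIN
    -- Fold the backlog into the daily aggregate, then clear it.
    -- Assumes readings_1d has UNIQUE KEY (company_id, datetime).
    -- (A production version would snapshot the backlog first so rows
    -- inserted while this runs are not lost by the DELETE.)
    INSERT INTO readings_1d (company_id, datetime, reading, used, cost)
    SELECT company_id, DATE(datetime), MAX(reading), SUM(used), SUM(cost)
    FROM readings_backlog
    GROUP BY company_id, DATE(datetime)
    ON DUPLICATE KEY UPDATE
        reading = GREATEST(reading, VALUES(reading)),
        used    = used + VALUES(used),
        cost    = cost + VALUES(cost);
    DELETE FROM readings_backlog;
END//
DELIMITER ;

-- Refresh periodically via the event scheduler (or a cron job calling the procedure):
CREATE EVENT refresh_readings_1d_hourly
ON SCHEDULE EVERY 1 HOUR
DO CALL refresh_readings_1d();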
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views

Selecting missing rows and grouping by date (with read-only access to DB)

I have a dataset from which I need to count all row occurrences, grouping by each day, and sum them into a dataset of the following format:
| date | count |
| 2001-01-01 | 11 |
| 2001-01-02 | 0 |
| 2001-01-03 | 4 |
The problem is that some of the data is missing for certain periods of time, and the missing dates should be created with a count of zero. I have searched various topics concerning this same issue, and from them I've learned that it's possible to solve this by creating a temporary calendar table to hold all the dates and joining the result dataset with the date table.
However, I only have read access to the database I'm using, so it's not possible for me to create a separate calendar table. Could this be solved in a single query only? If not, I could always do this in PHP, but I would prefer a more straightforward way.
EDIT: Just to clarify based on the questions asked in the comments: the missing dates are required for a specific, user-given time frame. E.g. the query could be:
SELECT date(timestamp), count(distinct id)
FROM `table`
WHERE date(timestamp) BETWEEN date('2001-01-01') AND date('2001-12-31')
GROUP BY date(timestamp)
SQL is really not made for this kind of job :/
That's possible but really really messy and I strongly discourage you from doing it.
The easiest way would be to have a separate calendar table, but as you said, you only have read access to your database.
The other one is to generate the sequence using this kind of trick:
SELECT @rownum := @rownum + 1 AS rownum, t.*
FROM (SELECT @rownum := 0) r, ("yourquery") t;
I won't get into it, as I already told you, it's really ugly :(
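That said, here is a rough sketch of what that kind of trick could look like for this problem: a derived table of digits generates every date of 2001 inline, and a LEFT JOIN fills the missing days with zero counts (table and column names follow the question's example query):
SELECT d.day AS date, COUNT(DISTINCT t.id) AS count
FROM (
    -- 0..999 from three digit tables, turned into consecutive dates
    SELECT DATE('2001-01-01') + INTERVAL (a.n + b.n * 10 + c.n * 100) DAY AS day
    FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
          UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
    CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
          UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
    CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
          UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
) d
LEFT JOIN `table` t ON date(t.timestamp) = d.day
WHERE d.day BETWEEN '2001-01-01' AND '2001-12-31'
GROUP BY d.day
ORDER BY d.day;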
try this...
SELECT Date, COUNT(*) Count
FROM yourtable
GROUP BY Date
This works for sure!!!
Let me know if it helped!

mysql query - optimizing existing MAX-MIN query for a huge table

I have a more or less correctly working query (as far as the result is concerned), but it takes about 45 seconds to be processed. That's definitely too long for presenting the data in a GUI.
So I need a much faster/more efficient query (something around a few milliseconds would be nice).
My data table has around 2,619,395 entries and is still growing.
Schema:
num | station | fetchDate | exportValue | error
1 | PS1 | 2010-10-01 07:05:17 | 300 | 0
2 | PS2 | 2010-10-01 07:05:19 | 297 | 0
923 | PS1 | 2011-11-13 14:45:47 | 82771 | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator for a properly working station
Working query:
select YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
from registros
where exportValue > 0 and error = 0
group by station, YEAR(fetchDate)
order by YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with between statements like 'between 2008-01-01 and 2008-01-02' to fetch the MIN(exportValue) and 'between 2008-12-30 and 2008-12-31' to grab the MAX(exportValue). Problem: a lot of queries, and no guarantee that there is any data in a given time range.
limiting the result set to my 10 stations only, using order by MIN(fetchDate). Problem: this also takes a long time to process.
Additional Info:
I'm using the query in a Java application (JPA 2.0), so it would be possible to do some post-processing on the result set if necessary.
Any help/approaches/ideas are very appreciated. Thanks in advance.
Adding suitable indexes will help.
2 compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
With these indexes in place, the query should be fast even on a table this size.
Suggestions:
Do you have a PK set on this table? Perhaps (station, fetchDate)?
Add indexes; you should experiment with indexes as rich.okelly suggested in his answer.
Depending on your experiments with indexes, try breaking the query into multiple statements inside one stored procedure; this way you will not lose time in network traffic between multiple queries sent from client to MySQL.
You mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; this is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code).
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a sort of summary table with the fields year, month, station, max(exportValue), min(exportValue). This means inserting summary records into the summary table at the end of each month; deleting, keeping, or moving the detail records to a separate table is your choice. A sketch of this follows below.
Since your table is growing rapidly (10 new entries every ~15 minutes), you should take the last suggestion into account. There is probably no need to keep the detailed history in one place; archiving data is a process that should be done as part of maintenance.
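A minimal sketch of that summary-table idea, assuming the question's registros schema (the summary table's name and column types are assumptions):
-- Hypothetical monthly summary table.
CREATE TABLE registros_summary (
    year            SMALLINT    NOT NULL,
    month           TINYINT     NOT NULL,
    station         VARCHAR(16) NOT NULL,
    max_exportValue INT,
    min_exportValue INT,
    PRIMARY KEY (year, month, station)
);

-- Run at the end of each month (shown here for November 2011):
INSERT INTO registros_summary
SELECT YEAR(fetchDate), MONTH(fetchDate), station,
       MAX(exportValue), MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
  AND fetchDate >= '2011-11-01' AND fetchDate < '2011-12-01'
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;
The yearly MAX-MIN report can then scan the small summary table instead of millions of detail rows.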

Retrieve missing dates from database via MySQL

So I have a table where I collect data for the jobs that I do. Each time I create a job I assign it a date. The problem with this is that the days on which I have no jobs aren't stored in the database, so when I graph my data I never see the days where I had zero jobs.
My current query looks like this:
SELECT job_data_date, SUM(job_data_invoice_amount) as job_data_date_income
FROM job_data
WHERE job_data_date >= '2010-05-05'
GROUP BY job_data_date
ORDER BY job_data_date;
The output is:
| job_data_date | job_data_date_income |
| 2010-05-17 | 125 |
| 2010-05-18 | 190 |
| 2010-05-20 | 170 |
As you can see from the example output, 2010-05-19 does not show up in the results because it was never stored there.
Is there a way to show the dates that are missing?
Thank you,
Marat
One idea is that you could have a table with all of the dates in it that you want to show and then do an outer join with that table.
So if you had a table called alldates with one column (job_data_date):
SELECT ad.job_data_date, COALESCE(SUM(jd.job_data_invoice_amount), 0) AS job_data_date_income
FROM alldates ad
LEFT OUTER JOIN job_data jd ON ad.job_data_date = jd.job_data_date
WHERE ad.job_data_date >= '2010-05-05'
GROUP BY ad.job_data_date
ORDER BY ad.job_data_date;
The down side is that you would need to keep this table populated with all of the dates you want to show.
There's no reasonable way to do this using pure SQL, on MySQL at least, without creating a table with every date ever devised. Your best option is to alter the application that's using the results of that query to fill in the holes itself. Rather than graphing only the values it received, construct its own set of values with a simple loop; counting up one day at a time, filling in values from the query wherever they're available.