MySQL and Perl using INTERVAL to calculate a total

I'd appreciate an example. I'm having trouble figuring out a statement that uses INTERVAL.
I need to count users within a range of dates, split into intervals whose length is entered by the user. Example: for the range '2010/05/05' to '2010/07/30' with an interval of 1m (one month), I need the total number of users for each one-month interval, e.g. 2010/05/05 to 2010/06/05 gives one total.
So far I got:
SELECT col1, client, COUNT(client) FROM table1, table2 WHERE col1 IN (condition) AND date BETWEEN '2010/05/01' AND '2010/07/30' AND DATE_ADD(CURDATE(),INTERVAL + 1 month) GROUP BY client;
Of course, this calculates the overall total, not a total per interval.
I also tried using Perl:
my @data; # data from the database
%date_hash = ($data[1] => $total); # $data[1] holds the begin and end dates the user entered
foreach $dates (keys %date_hash) { # iterate over the keys, since they are used for the lookup below
$date_hash{$dates} = $total;
print "Print hash: $dates $date_hash{$dates} \n";
}
Thank you in advance, :)

One possible SQL-only solution (I will offer an algorithm, not exact SQL code):
Create a temp table INTERVALS and populate it (probably easiest to do from Perl in a loop, though a MySQL loop would be sufficient) with data as follows (from 5/5/2010 to 12/10/2010, 1 month intervals):
| start_period | end_period | interval_number |
===============================================
| 2010/05/05 | 2010/06/04 | 1 |
| 2010/06/05 | 2010/07/04 | 2 |
| ...
| 2010/12/05 | 2010/12/10 | 8 |
Then run a query joining your table to the INTERVALS temp table:
SELECT client, COUNT(client)
, i.interval_number, i.start_period, i.end_period
FROM table1
JOIN INTERVALS i
ON table1.date >= i.start_period
AND table1.date <= i.end_period
WHERE col1 IN (condition)
GROUP BY client, i.interval_number, i.start_period, i.end_period
Please note that you can select other columns from table1 but only if you group on them as well.
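The rows of the INTERVALS table could be generated in application code before inserting them. The following is a minimal Python sketch of that loop, not the answerer's own code; the helper names `add_months` and `build_intervals` are made up for illustration, and it assumes whole-month steps measured from the start date.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Return d shifted by `months` calendar months, clamping the day to the month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def build_intervals(start, end, months=1):
    """Return (start_period, end_period, interval_number) rows covering start..end inclusive."""
    rows = []
    number = 0
    period_start = start
    while period_start <= end:
        number += 1
        # Always step from the original start date so a clamped day does not drift.
        next_start = add_months(start, number * months)
        period_end = min(next_start - timedelta(days=1), end)
        rows.append((period_start, period_end, number))
        period_start = next_start
    return rows


intervals = build_intervals(date(2010, 5, 5), date(2010, 12, 10))
for start_period, end_period, interval_number in intervals:
    print(start_period, end_period, interval_number)
```

Each tuple corresponds to one row of the temp table; the last interval is cut short at the end date, as in the table above.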

Related

Is it possible in MySQL to find the Min/Max but remove outliers first?

I have a table that holds scan datetime values. I want to find the start and stop scan times for the main portion of each user's scanning. The issue is that a user may perform some checks before or after the bulk of the scanning and generate a few more scans. The data might look as below.
....
| 2020-04-01 19:48:05 |
| 2020-04-01 19:48:22 |
| 2020-04-01 19:48:23 |
| 2020-04-01 19:48:48 |
| 2020-04-01 19:48:49 |
| 2020-04-01 20:45:33 |
+---------------------+
If I group by the date and grab the min/max of these values, my elapsed time will be much larger than the actual time. In the case above, the max would add almost 1 hour of extra time, which was not really spent scanning.
SELECT date, MIN(datetime), MAX(datetime) FROM table GROUP BY date
There might be 1 extra scan or there might be several scans at the beginning or the end of the data so throwing out the first and last data points is not really an option.
Hmmm . . . I think this is a gap and islands problem. You need some definition of when an outlier occurs. Say it is 5 minutes:
select min(datetime), max(datetime), count(*) as num_scans
from (select t.*,
             sum(case when prev_datetime > datetime - interval 5 minute then 0 else 1 end) over (order by datetime) as grp
      from (select t.*,
                   lag(datetime) over (order by datetime) as prev_datetime
            from t
           ) t
     ) t
group by grp;
I'm not sure how you distinguish actual scans from the outliers. Perhaps if there is more than one row or so. If that is the case, you can remove the outliers with logic such as having count(*) > 1.
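To make the grouping logic concrete, here is a small Python sketch of the same gaps-and-islands idea applied to the sample timestamps from the question. It is an illustration under the same 5-minute assumption, not a replacement for the SQL; the function name `split_into_islands` is hypothetical.

```python
from datetime import datetime, timedelta


def split_into_islands(timestamps, max_gap=timedelta(minutes=5)):
    """Group timestamps into runs where each scan is less than max_gap after the previous one."""
    islands = []
    for ts in sorted(timestamps):
        if islands and ts - islands[-1][-1] < max_gap:
            islands[-1].append(ts)  # continues the current island
        else:
            islands.append([ts])  # gap too large (or first row): start a new island
    return islands


scans = [
    datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    for s in (
        "2020-04-01 19:48:05",
        "2020-04-01 19:48:22",
        "2020-04-01 19:48:23",
        "2020-04-01 19:48:48",
        "2020-04-01 19:48:49",
        "2020-04-01 20:45:33",
    )
]

islands = split_into_islands(scans)
# Treat single-row islands as outliers, mirroring "having count(*) > 1".
main_islands = [island for island in islands if len(island) > 1]
for island in main_islands:
    print(min(island), max(island), len(island))
```

With this sample, the lone 20:45:33 scan forms its own island and is dropped by the count filter.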

Big Query Transpose [duplicate]

I have a Google Big Query table with the following columns:
date | user | group | value
----------------------------
date1 | user1 | group1 | 10
----------------------------
date1 | user2 | group1 | 5
----------------------------
date2 | user1 | group1 | 20
----------------------------
date2 | user2 | group1 | 10
---------------------------
etc...
Now I want to convert this to this:
group | date1 | date2
----------------------
group1 | 15 | 30
So I want to have the sum of value for each day per group. I wrote a query that looks like this:
SELECT date, group, value FROM [table] GROUP BY date, group, value
But how do I transpose this so that each column is a date and each row is a collection of totals for the value?
There is no nice way of doing this in BigQuery as of yet, but you can do it by following the idea below.
Step 1
Run the query below:
SELECT 'SELECT [group], ' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF([date] = "' + [date] + '", value, NULL)) as [d_' + REPLACE([date], '/', '_') + ']'
)
+ ' FROM YourTable GROUP BY [group] ORDER BY [group]'
FROM (
SELECT [date] FROM YourTable GROUP BY [date] ORDER BY [date]
)
As a result, you will get a string like the one below (formatted here for readability):
SELECT
[group],
SUM(IF([date] = "date1", value, NULL)) AS [d_date1],
SUM(IF([date] = "date2", value, NULL)) AS [d_date2]
FROM YourTable
GROUP BY [group]
ORDER BY [group]
Step 2
Just run the composed query above.
The result will look like this:
group d_date1 d_date2
group1 15 30
Note 1: Step 1 is helpful if you have many values to pivot, which would be too much manual work. In that case, Step 1 generates the query for you.
Note 2: these steps are easy to implement in any client of your choice, or you can just run them in the BigQuery Web UI.
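As one way of doing Step 1 in a client, the sketch below composes the same pivot statement in Python from a list of distinct dates. It assumes the dates were already fetched by a separate query and are trusted values (nothing is escaped); `build_pivot_query` is a hypothetical helper, and the table name `YourTable` is the placeholder used above.

```python
def build_pivot_query(dates, table="YourTable"):
    """Compose the legacy-SQL pivot statement with one SUM(IF(...)) column per date."""
    columns = ",\n".join(
        '  SUM(IF([date] = "{0}", value, NULL)) AS [d_{1}]'.format(d, d.replace("/", "_"))
        for d in dates
    )
    return (
        f"SELECT\n  [group],\n{columns}\n"
        f"FROM {table}\nGROUP BY [group]\nORDER BY [group]"
    )


query = build_pivot_query(["date1", "date2"])
print(query)
```

The printed string is what you would then submit as the Step 2 query.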
You can see more about pivoting in my other posts.
How to scale Pivoting in BigQuery?
Please note – there is a limitation of 10K columns per table - so you are limited with 10K organizations.
You can also see below as simplified examples (if above one is too complex/verbose):
How to transpose rows to columns with large amount of the data in BigQuery/SQL?
How to create dummy variable columns for thousands of categories in Google BigQuery?
Pivot Repeated fields in BigQuery

Mysql Update / Insert: copying historical data

I have some historical data tables in my Mysql database.
I want to repeat a day's historical data for another day in the same table.
Table structure, with some sample data:
Id | Date | Value
1 | 2012-04-30 | 5
2 | 2012-04-30 | 10
3 | 2012-04-30 | 15
I want to repeat those ids & values, but for a new date - e.g. 2012-05-01. i.e. adding:
1 | 2012-05-01 | 5
2 | 2012-05-01 | 10
3 | 2012-05-01 | 15
I feel that there should be a straightforward way of doing this... I've tried playing with UPDATE statements with sub-queries and using multiple LEFT JOINs, but haven't gotten there yet.
Any ideas on how I can do this?
EDIT: To clarify...
- I do NOT want to add these to a new table
- Nor do I want to change the existing records in the table.
- The ids are intentionally duplicated (they are a foreign_key to another table that records what the data refers to...).
INSERT INTO yourTable
SELECT ID, "2012-05-01" As Date, Value
FROM yourTable
WHERE Date = "2012-04-30"
Usually, your ID would be an autoincrement column, so having the same ID twice in the same table would not work. In that case, either use a different ID or a different table.
Different ID (next autoincrement):
INSERT INTO yourTable
SELECT NULL as ID, "2012-05-01" As Date, Value
FROM yourTable
WHERE Date = "2012-04-30"
Different table (referring to original ID):
INSERT INTO yourTable_hist
SELECT NULL as ID, ID as old_ID, "2012-05-01" As Date, Value
FROM yourTable
WHERE Date = "2012-04-30"
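For anyone who wants to try the INSERT ... SELECT pattern without touching real data, here is a minimal Python sketch using an in-memory SQLite database as a stand-in for MySQL (an assumption made only so the example is self-contained). It uses the sample rows from the question and keeps the duplicated ids, as the question requires; `copy_day` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (ID INTEGER, Date TEXT, Value INTEGER)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [(1, "2012-04-30", 5), (2, "2012-04-30", 10), (3, "2012-04-30", 15)],
)


def copy_day(conn, source_date, target_date):
    """Duplicate every row dated source_date as a new row dated target_date."""
    conn.execute(
        "INSERT INTO yourTable (ID, Date, Value) "
        "SELECT ID, ?, Value FROM yourTable WHERE Date = ?",
        (target_date, source_date),
    )


copy_day(conn, "2012-04-30", "2012-05-01")
rows = conn.execute(
    "SELECT ID, Date, Value FROM yourTable ORDER BY Date, ID"
).fetchall()
for row in rows:
    print(row)
```

The original rows are left unchanged and three new rows appear for the target date.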
Maybe something like this:
UPDATE Table1
SET Date=DATE_ADD(Date, INTERVAL 1 DAY)
Or if you want to insert them to a new table:
INSERT INTO Table1
SELECT
ID,
DATE_ADD(Date, INTERVAL 1 DAY),
Value
FROM
Table2

Query database in weekly interval

I have a database with a created_at column containing the datetime in Y-m-d H:i:s format.
The latest datetime entry is 2011-09-28 00:10:02.
I need the query to be relative to the latest datetime entry.
The first value in the query should be the latest datetime entry.
The second value in the query should be the entry closest to 7 days from the first value.
The third value should be the entry closest to 7 days from the second value.
REPEAT #3.
What I mean by "closest to 7 days from":
The following are dates, the interval I desire is a week, in seconds a week is 604800 seconds.
7 days from the first value is equal to 1316578202 (1317183002-604800)
the value closest to 1316578202 (7 days) is... 1316571974
unix timestamp | Y-m-d H:i:s
1317183002 | 2011-09-28 00:10:02 -> appear in query (first value)
1317101233 | 2011-09-27 01:27:13
1317009182 | 2011-09-25 23:53:02
1316916554 | 2011-09-24 22:09:14
1316836656 | 2011-09-23 23:57:36
1316745220 | 2011-09-22 22:33:40
1316659915 | 2011-09-21 22:51:55
1316571974 | 2011-09-20 22:26:14 -> closest to 7 days from 1317183002 (first value)
1316499187 | 2011-09-20 02:13:07
1316064243 | 2011-09-15 01:24:03
1315967707 | 2011-09-13 22:35:07 -> closest to 7 days from 1316571974 (second value)
1315881414 | 2011-09-12 22:36:54
1315794048 | 2011-09-11 22:20:48
1315715786 | 2011-09-11 00:36:26
1315622142 | 2011-09-09 22:35:42
I would really appreciate any help, I have not been able to do this via mysql and no online resources seem to deal with relative date manipulation such as this. I would like the query to be modular enough to be able to change the interval weekly, monthly, or yearly. Thanks in advance!
Answer #1 Reply:
SELECT
UNIX_TIMESTAMP(created_at)
AS unix_timestamp,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT max(created_at) - 7
FROM my_table
)
)
AS `random_1`,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT MAX(created_at) - 14
FROM my_table
)
)
AS `random_2`
FROM my_table
WHERE created_at =
(
SELECT MAX(created_at)
FROM my_table
)
Returns:
unix_timestamp | random_1 | random_2
1317183002 | 1317183002 | 1317183002
Answer #2 Reply:
RESULT SET:
This is the result set for a yearly interval:
id | created_at | period_index | period_timestamp
267 | 2010-09-27 22:57:05 | 0 | 1317183002
1 | 2009-12-10 15:08:00 | 1 | 1285554786
I desire this result:
id | created_at | period_index | period_timestamp
626 | 2011-09-28 00:10:02 | 0 | 0
267 | 2010-09-27 22:57:05 | 1 | 1317183002
I hope this makes more sense.
It's not exactly what you asked for, but the following example is pretty close....
Example 1:
select
floor(timestampdiff(SECOND, tbl.time, most_recent.time)/604800) as period_index,
unix_timestamp(max(tbl.time)) as period_timestamp
from
tbl
, (select max(time) as time from tbl) most_recent
group by period_index
gives results:
+--------------+------------------+
| period_index | period_timestamp |
+--------------+------------------+
| 0 | 1317183002 |
| 1 | 1316571974 |
| 2 | 1315967707 |
+--------------+------------------+
This breaks the dataset into groups based on "periods", where (in this example) each period is 7-days (604800 seconds) long. The period_timestamp that is returned for each period is the 'latest' (most recent) timestamp that falls within that period.
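The bucketing described here can be checked outside the database. The following Python sketch applies the same arithmetic (seconds since the most recent entry, integer-divided by the period length) to the timestamps listed in the question; `latest_per_period` is a hypothetical name used only for illustration.

```python
WEEK_SECONDS = 604800

# Unix timestamps copied from the sample data in the question.
timestamps = [
    1317183002, 1317101233, 1317009182, 1316916554, 1316836656,
    1316745220, 1316659915, 1316571974, 1316499187, 1316064243,
    1315967707, 1315881414, 1315794048, 1315715786, 1315622142,
]


def latest_per_period(timestamps, period_seconds=WEEK_SECONDS):
    """Map each period index to the most recent timestamp that falls inside that period."""
    most_recent = max(timestamps)
    latest = {}
    for ts in timestamps:
        # Same as floor(timestampdiff(SECOND, time, most_recent) / period_seconds).
        period_index = (most_recent - ts) // period_seconds
        latest[period_index] = max(ts, latest.get(period_index, ts))
    return latest


periods = latest_per_period(timestamps)
for period_index in sorted(periods):
    print(period_index, periods[period_index])
```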
The period boundaries are all computed based on the most recent timestamp in the database, rather than computing each period's start and end time individually based on the timestamp of the period before it. The difference is subtle - your question requests the latter (iterative approach), but I'm hoping that the former (approach I've described here) will suffice for your needs, since SQL doesn't lend itself well to implementing iterative algorithms.
If you really do need to determine each period based on the timestamp in the previous period, then your best bet is going to be an iterative approach -- either using a programming language of your choice (like php), or by building a stored procedure that uses a cursor.
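A sketch of that iterative approach in Python is shown below, as one possible reading of "closest to 7 days from the previous value". It assumes the timestamps have already been fetched from the database, considers only entries older than the previously selected one, and uses a hypothetical function name, `pick_closest_chain`.

```python
# Unix timestamps copied from the sample data in the question.
timestamps = [
    1317183002, 1317101233, 1317009182, 1316916554, 1316836656,
    1316745220, 1316659915, 1316571974, 1316499187, 1316064243,
    1315967707, 1315881414, 1315794048, 1315715786, 1315622142,
]


def pick_closest_chain(timestamps, step_seconds=604800):
    """Start at the latest entry, then repeatedly pick the older entry closest to one step back."""
    chain = [max(timestamps)]
    while True:
        target = chain[-1] - step_seconds
        older = [ts for ts in timestamps if ts < chain[-1]]
        if not older:
            break  # nothing earlier than the last selected entry
        chain.append(min(older, key=lambda ts: abs(ts - target)))
    return chain


chain = pick_closest_chain(timestamps)
for ts in chain:
    print(ts)
```

Changing `step_seconds` switches the interval. Note that the final pick is simply the oldest remaining entry once the data runs out, so a caller may want to add a tolerance check before accepting it.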
Edit #1
Here's the table structure for the above example.
CREATE TABLE `tbl` (
`id` int(10) unsigned NOT NULL auto_increment PRIMARY KEY,
`time` datetime NOT NULL
)
Edit #2
Ok, first: I've improved the original example query (see revised "Example 1" above). It still works the same way, and gives the same results, but it's cleaner, more efficient, and easier to understand.
Now... the query above is a group-by query, meaning it shows aggregate results for the "period" groups as I described above, not row-by-row results like a "normal" query. With a group-by query, you're limited to using aggregate columns only. Aggregate columns are those columns that are named in the group by clause, or that are computed by an aggregate function like MAX(time). It is not possible to extract meaningful values for non-aggregate columns (like id) from within the projection of a group-by query.
Unfortunately, MySQL doesn't generate an error when you try to do this (unless the ONLY_FULL_GROUP_BY SQL mode is enabled). Instead, it just picks an arbitrary value from within the grouped rows, and shows that value for the non-aggregate column in the grouped result. This is what's causing the odd behavior the OP reported when trying to use the code from Example #1.
Fortunately, this problem is fairly easy to solve. Just wrap another query around the group query, to select the row-by-row information you're interested in...
Example 2:
SELECT
entries.id,
entries.time,
periods.idx as period_index,
unix_timestamp(periods.time) as period_timestamp
FROM
tbl entries
JOIN
(select
floor(timestampdiff( SECOND, tbl.time, most_recent.time)/31536000) as idx,
max(tbl.time) as time
from
tbl
, (select max(time) as time from tbl) most_recent
group by idx
) periods
ON entries.time = periods.time
Result:
+-----+---------------------+--------------+------------------+
| id | time | period_index | period_timestamp |
+-----+---------------------+--------------+------------------+
| 598 | 2011-09-28 04:10:02 | 0 | 1317183002 |
| 996 | 2010-09-27 22:57:05 | 1 | 1285628225 |
+-----+---------------------+--------------+------------------+
Notes:
Example 2 uses a period length of 31536000 seconds (365 days), while Example 1 (above) uses a period of 604800 seconds (7 days). Other than that, the inner query in Example 2 is the same as the primary query shown in Example 1.
If a matching period_time belongs to more than one entry (i.e. two or more entries have the exact same time, and that time matches one of the selected period_time values), then the above query (Example 2) will include multiple rows for the given period timestamp (one for each match). Whatever code consumes this result set should be prepared to handle such an edge case.
It's also worth noting that these queries will perform much, much better if you define an index on your datetime column. For my example schema, that would look like this:
ALTER TABLE tbl ADD INDEX idx_time ( time )
If you're willing to go for the closest that is after the week is out then this'll work. You can extend it to work out the closest but it'll look so disgusting it's probably not worth it.
select unix_timestamp
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - 7
from my_table )
)
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - 14
from my_table )
)
from my_table
where sql_tstamp = ( select max(sql_tstamp)
from my_table )

MySQL: Return 0 if row doesn't exist

I've been bashing my head on this for a while, so now I'm here :) I'm a SQL beginner, so maybe this will be easy for you guys...
I have this query:
SELECT COUNT(*) AS counter, recur,subscribe_date
FROM paypal_subscriptions
WHERE recur='monthly' and subscribe_date > "2010-07-16" and subscribe_date < "2010-07-23"
GROUP BY subscribe_date
ORDER BY subscribe_date
Now the dates I've shown above are hard coded, my application will supply a variable date range.
Right now I'm getting a result table with rows only for dates where there is a value.
counter |recur | subscribe_date
2 | Monthly | 2010-07-18
3 | Monthly | 2010-07-19
4 | Monthly | 2010-07-20
6 | Monthly | 2010-07-22
I'd like to return 0 in the counter column if the date doesn't exist.
counter |recur | subscribe_date
0 | Monthly | 2010-07-16
0 | Monthly | 2010-07-17
2 | Monthly | 2010-07-18
3 | Monthly | 2010-07-19
4 | Monthly | 2010-07-20
0 | Monthly | 2010-07-21
6 | Monthly | 2010-07-22
0 | Monthly | 2010-07-23
Is this possible?
You will need a table of dates (new table added), and then you will have to do an outer join between that table and your query.
This question is also similar to another question. Answers can be quite similar.
Insert Dates in the return from a query where there is none
You will need a table of dates to group against. This is quite easy in MSSQL using CTE's like this - I'm not sure if MySQL has something similar?
Otherwise you will need to create a hard table as a one off exercise
EDIT : Give this a try:
SELECT COUNT(pp.subscribe_date) AS counter, dl.date, MIN(pp.recur)
FROM date_lookup dl
LEFT OUTER JOIN paypal pp
on (pp.subscribe_date = dl.date AND pp.recur ='monthly')
WHERE dl.date >= '2010-07-16' and dl.date <= '2010-07-23'
GROUP BY dl.date
ORDER BY dl.date
The subject of the query needs to be changed to the date_lookup table (the order of the Left Outer Join becomes important).
COUNT(*) isn't going to work since the 'date' record always exists; you need to count something in the PayPal table.
pp.recur = 'monthly' is now a join condition, not a filter, because of the LOJ.
Finally, showing pp.recur in the select list isn't going to work as-is: I've used an aggregate, but MIN(pp.recur) will return NULL if there are no PayPal records.
When you parameterize your query, you could just repeat the recur type filter in the select list.
Again, please excuse the MSSQL syntax:
SELECT COUNT(pp.subscribe_date) AS counter, dl.date, @ppRecur
FROM date_lookup dl
LEFT OUTER JOIN paypal pp
on (pp.subscribe_date = dl.date AND pp.recur = @ppRecur)
WHERE dl.date >= @DateFrom and dl.date <= @DateTo
GROUP BY dl.date
ORDER BY dl.date
Since there was no easy way to do this, I had to have the application fill in the blanks for me rather than have the database return the data I wanted. I do get a performance hit for this, but it was necessary for the completion of the report.
I will definitely look into making this return what I want from the DB in the near future. I'll give nonnb's solution a try.
thanks everyone!
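For reference, filling in the blanks on the application side can be quite short. The Python sketch below is only an illustration of that idea, not the code the poster actually used; it assumes the query results have been loaded into a dict keyed by date, and `fill_missing_dates` is a hypothetical helper.

```python
from datetime import date, timedelta


def fill_missing_dates(counts, start, end):
    """Return (day, count) pairs for every day in start..end, using 0 where counts has no entry."""
    days = (end - start).days + 1
    result = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        result.append((day, counts.get(day, 0)))
    return result


# Counts as returned by the original GROUP BY query (dates with no rows are absent).
counts = {
    date(2010, 7, 18): 2,
    date(2010, 7, 19): 3,
    date(2010, 7, 20): 4,
    date(2010, 7, 22): 6,
}

filled = fill_missing_dates(counts, date(2010, 7, 16), date(2010, 7, 23))
for day, counter in filled:
    print(counter, day)
```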