I've been trying for some time to create a query that counts, per day, all rows in a table that have a certain id in one column, and then groups them into weekly values based on the UNIX timestamp column. I have a medium-sized dataset with 37 million rows, and have been trying to run the following kind of query:
SELECT DATE(timestamp), COUNT(*) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))
However, I'm getting weird results: the query doesn't group the counts correctly and shows values that are too large in the resulting count column (I verified the errors by querying very small, specific datasets).
If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily row counts into weekly counts. How could this be done? The data is needed in this format:
2006-01-01 | 5
2006-01-08 | 10
so that the day timestamp is the first column and second is the amount of rows per week.
Your query is non-deterministic, so it is not surprising you are getting unexpected results. By this I mean you could run this query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)); the query therefore returns the timestamp of whichever row it comes across first for each startdate week, in ANY order.
Consider the following 2 rows (with timestamp in date format for ease of reading):
TimeStamp StartDate
20120601 20120601
20120701 20120601
Your query groups by WEEK(StartDate), which is 22 for both rows (in MySQL's default week mode). Since both rows evaluate to the same value, you would expect your results to have 1 row with a count of 2.
HOWEVER, DATE(Timestamp) is also in the select list, and since it is neither aggregated nor part of the GROUP BY, the query has no idea which Timestamp to return: '20120601' or '20120701'. So even on this small result set you have a 50:50 chance of getting:
TimeStamp COUNT
20120601 2
and a 50:50 chance of getting
TimeStamp COUNT
20120701 2
If you add more data to the dataset, like so:
TimeStamp StartDate
20120601 20120601
20120701 20120601
20120701 20120701
You could get
TimeStamp COUNT
20120601 2
20120701 1
or
TimeStamp COUNT
20120701 2
20120701 1
You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
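To make the effect concrete, here is a minimal Python sketch that imitates the behaviour described above. It is not how MySQL is implemented: the function names are made up for illustration, and "keep the timestamp of the first row seen" is an assumption standing in for "whichever row the engine happens to read first". The counts per week stay the same, but the date shown next to them changes with row order.

```python
from datetime import date

# The three example rows as (timestamp, startdate) pairs.
rows = [
    (date(2012, 6, 1), date(2012, 6, 1)),
    (date(2012, 7, 1), date(2012, 6, 1)),
    (date(2012, 7, 1), date(2012, 7, 1)),
]

def week_number(d):
    # %U numbers weeks with Sunday as the first day, like WEEK() in its default mode.
    return int(d.strftime("%U"))

def loose_group_by_week(rows):
    """Group by week of startdate, keeping the timestamp of the first row seen."""
    groups = {}
    for ts, start in rows:
        key = week_number(start)
        first_ts, count = groups.get(key, (ts, 0))
        groups[key] = (first_ts, count + 1)
    return sorted(groups.values())

in_order = loose_group_by_week(rows)
reordered = loose_group_by_week([rows[1], rows[0], rows[2]])

print(in_order)   # labels come from the rows in their original order
print(reordered)  # same counts, but the first group now carries a different date
```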
EDIT
Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):
SELECT DATE_ADD(CURRENT_TIMESTAMP, INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart
You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren't in your group by.
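As a rough illustration of grouping by the week-start date, the sketch below uses Python's sqlite3 module as a stand-in for MySQL, so the date functions differ: SQLite's strftime('%w', ...) plays the role of DAYOFWEEK. The table name events and the sample rows are assumptions made up for the example. Grouping by the actual week-start date also keeps weeks from different years apart, which a bare WEEK() number does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, column_group_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2006-01-01 09:00:00", 1),  # Sunday
        ("2006-01-03 10:30:00", 1),  # Tuesday, same week
        ("2006-01-07 23:59:59", 1),  # Saturday, same week
        ("2006-01-08 00:00:00", 1),  # next Sunday
        ("2006-01-10 12:00:00", 2),  # other group id, filtered out
    ],
)

# strftime('%w', ts) is 0 for Sunday, so subtracting that many days yields the
# Sunday that starts the week (the counterpart of 1 - DAYOFWEEK(ts) in MySQL).
weekly = conn.execute(
    """
    SELECT date(ts, '-' || strftime('%w', ts) || ' days') AS week_start,
           COUNT(*) AS row_count
    FROM events
    WHERE column_group_id = ?
      AND date(ts) BETWEEN ? AND ?
    GROUP BY week_start
    ORDER BY week_start
    """,
    (1, "2006-01-01", "2006-01-31"),
).fetchall()

for week_start, row_count in weekly:
    print(week_start, "|", row_count)
```

Every selected column is either the grouping expression or an aggregate, so the result no longer depends on row order.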
Try this
SELECT DATE(timestamp), COUNT(week(date(startdate))) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))
Related
I have some data that has year, month, date, column_x. The column_x can be missing or not missing. What I want to generate is the missing rate of column_x. In order to do so, I'm trying to create two columns that contains the total row number, which would be total_count, and count column, that represents the column_x == null.
I'm trying to create something like below:
total_count | count | year | month | date
60 | 20 | 2022 | 12 | 01
so I can do in future count / total_count to get some percentage.
However, I'm not sure how I can generate a query.
I tried subqueries but it's throwing me an error.. how can I achieve this through pyspark or sql subqueries? (I can register temp table and run sql queries as well)
What I want to generate is the missing rate of column_x.
You can do conditional counts. In MySQL:
select year, month, date,
count(*) as cnt_total,
count(column_x) as cnt_x_not_null,
sum(column_x is null) as cnt_x_null,
avg(column_x is null) as ratio_x_null
from mytable
group by year, month, date
The last expression (avg) gives you the ratio of rows where the column is null. This works because MySQL evaluates conditions as 0/1 in numeric context, so we can just use avg on top of the is null predicate.
Other columns in the resultset give more examples of conditional counts.
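For anyone who wants to try the conditional-count idea without a MySQL server, here is a small sketch using Python's sqlite3 module, which also evaluates IS NULL as 0/1 in a numeric context. The sample rows are invented, and the column names follow the question (year, month, date, column_x).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (year INT, month INT, date INT, column_x TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (2022, 12, 1, "a"),
        (2022, 12, 1, None),
        (2022, 12, 1, "b"),
        (2022, 12, 1, None),
        (2022, 12, 2, "c"),
    ],
)

missing_rates = conn.execute(
    """
    SELECT year, month, date,
           COUNT(*)              AS cnt_total,
           SUM(column_x IS NULL) AS cnt_x_null,   -- each NULL row contributes 1
           AVG(column_x IS NULL) AS ratio_x_null  -- mean of the 0/1 flags
    FROM mytable
    GROUP BY year, month, date
    ORDER BY year, month, date
    """
).fetchall()

for row in missing_rates:
    print(row)
```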
For a few days now, I have been trying to count records per hour in a MySQL database.
I have a table with a lot of records, with a column DATE and a column TIME: DATE holds the date of the record in the format 2022-05-19, and TIME holds the time of the record in the format 14:59:38.
What I am trying to do is count, for every single day, how many records I have per hour. Something like this:
DATE HOUR PCS
22-05-18 06-07 11
22-05-18 08-09 20
......... ..... ..
....... 21-22 33
I have tried many different ways but no success.
For example:
SELECT 'Date', count(*) FROM `root4`
where
DATE between '2022-05-01' and '2022-05-1' AND
TIME BETWEEN '06:11:05' AND '07:11:05'
Any help is highly appreciated.
I would recommend not using reserved words for columns, as you will have to escape them a lot. https://dev.mysql.com/doc/refman/8.0/en/keywords.html
If you stored TIME as a time or timestamp type, you can extract the hour using the HOUR() function and group by that:
SELECT
`DATE`,
HOUR(`TIME`) AS `HOUR`,
COUNT(1)
FROM your_table
GROUP BY
`DATE`,
HOUR(`TIME`)
If you happened to store it as text you can use REGEXP_SUBSTR to get the hour value from your time string.
SELECT
`DATE`,
CAST(REGEXP_SUBSTR(`TIME`, '[0-9]+') AS UNSIGNED) AS `HOUR`,
COUNT(1)
FROM your_table
GROUP BY
`DATE`,
CAST(REGEXP_SUBSTR(`TIME`, '[0-9]+') AS UNSIGNED)
You can format your HOUR column how you want, like displaying 01-02 instead of 1 by using CONCAT, but this is your basic setup.
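Here is one possible way to build the 06-07 style label, sketched with Python's sqlite3 module as a stand-in for MySQL. The sample rows are made up, and SQLite's substr and printf are used where MySQL would use HOUR, LPAD and CONCAT. Note that hour 23 comes out as 23-24 in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE root4 (date TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO root4 VALUES (?, ?)",
    [
        ("2022-05-18", "06:11:05"),
        ("2022-05-18", "06:59:59"),
        ("2022-05-18", "08:00:00"),
        ("2022-05-19", "14:59:38"),
    ],
)

hourly = conn.execute(
    """
    SELECT date,
           printf('%02d-%02d', h, h + 1) AS hour_range,  -- e.g. 6 becomes '06-07'
           COUNT(*) AS pcs
    FROM (SELECT date, CAST(substr(time, 1, 2) AS INTEGER) AS h FROM root4) AS t
    GROUP BY date, h
    ORDER BY date, h
    """
).fetchall()

for row in hourly:
    print(row)
```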
I have a sample table here with the following columns and sample records. I want to be able to sum my Cases column over a specific date range (the helper column).
I want to get my results this way:
Sum all cases WHERE date range is in between 2022-03-23 - 2022-04-01 and so on.
date range | Sum of Cases
2022-03-23-2022-04-01 | 5 (sample result only)
2022-03-24-2022-04-02 | 9 (sample result only)
The logic of the date range is always n - 9 days through n, where n is the row's date.
I've tried this type of query but it does not work. Is there a way for me to get this without having to use a query to create another column?
SELECT Date,
sum([QUERY 1]) as "Reports 7 days prev",
sum ([QUERY 2]) as "Reports 7 days after"
FROM REPORTS
GROUP BY Date
Data:
Date | BuyerID | Cases | Helper (Date Range)
4/1/2022 | 20001 | 2 | 2022-03-23-2022-04-01
4/1/2022 | 20001 | 1 | 2022-03-23-2022-04-01
4/2/2022 | 20002 | 3 | 2022-03-24-2022-04-02
4/5/2022 | 20003 | 5 | 2022-03-27-2022-04-05
4/7/2022 | 20004 | 6 | 2022-03-29-2022-04-07
4/7/2022 | 20005 | 9 | 2022-03-29-2022-04-07
Are you looking to get total cases for last X number of days? What does your initial data look like?
You can try something like:
Step 1: You aggregate all the cases for each date.
WITH CASES_AGG_BY_DATE AS
(
SELECT Date,
SUM(Cases) AS Total_Cases
FROM REPORTS
GROUP BY Date
),
Step 2: you aggregate the rolling cases sum for each date over the current row and the 7 preceding rows (this counts rows, not calendar days, so it only matches a span of days when every date is present)
LAST_7_DAY_AGG AS
(
SELECT Date, SUM(Total_Cases) OVER(ORDER BY Date ASC ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS sum_of_cases,
LAG(Date, 7) OVER(ORDER BY Date ASC) AS `7th_day`
FROM CASES_AGG_BY_DATE
)
Step 3: create final output and concatenate date and 7th day before that
SELECT Date, CONCAT(Date, "-", `7th_day`), sum_of_cases
FROM LAST_7_DAY_AGG;
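Because the sample data has gaps between dates, a window defined by calendar days can give different totals from one defined by row counts. The Python sketch below works through the same steps on the question's sample rows using a calendar window of n - 9 days through n. The function name and the days_back parameter are hypothetical, and this is only an illustration of the arithmetic, not a replacement for the SQL.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (date, buyer_id, cases).
reports = [
    (date(2022, 4, 1), 20001, 2),
    (date(2022, 4, 1), 20001, 1),
    (date(2022, 4, 2), 20002, 3),
    (date(2022, 4, 5), 20003, 5),
    (date(2022, 4, 7), 20004, 6),
    (date(2022, 4, 7), 20005, 9),
]

def rolling_case_sums(rows, days_back=9):
    """Return {"start-end": total cases} for each distinct date in rows."""
    per_day = defaultdict(int)
    for day, _buyer, cases in rows:
        per_day[day] += cases  # step 1: total cases per date

    result = {}
    for end in sorted(per_day):
        start = end - timedelta(days=days_back)
        # step 2: add up every date that falls inside [start, end]
        total = sum(c for d, c in per_day.items() if start <= d <= end)
        # step 3: label the total with its date range
        result[f"{start.isoformat()}-{end.isoformat()}"] = total
    return result

sums = rolling_case_sums(reports)
for date_range, total in sums.items():
    print(date_range, "|", total)
```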
I am looking to calculate moving averages over variable dates.
My database is structured:
id int
date date
price decimal
For example, I'd like to find out if the average price going back 19 days ever gets greater than the average price going back 40 days within the past 5 days. Each of those time periods is variable.
What I am getting stuck on is selecting a specific number of rows for subquery.
Select * from table
order by date
LIMIT 0 , 19
Knowing that there will only be 1 input per day, can I use the above as a subquery? After that the problem seems trivial....
If you only have one input per day, you don't need id; date can be your primary key. Am I missing something? Then sum the most recent rows in a subquery (a LIMIT placed directly on the aggregate query would apply only after SUM has already collapsed the rows):
SELECT SUM(price) AS totalPrice FROM (SELECT price FROM `table` ORDER BY date DESC LIMIT 19) AS recent
and divide: totalPrice / (total days).
I may not understand your question.
Yes you can use that as a sub-query like this:
SELECT
AVG(price)
FROM
(SELECT * FROM t ORDER BY date DESC LIMIT 10) AS t1;
This calculates the average price for the latest 10 rows.
see fiddle.
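Once the averages can be computed for any window, the comparison from the question (19-day average above the 40-day average at some point in the past 5 days) is a short loop. The sketch below does it in Python on a plain list of prices, assuming one price per day ordered oldest to newest. The function names and default windows are illustrative only.

```python
def moving_average(prices, end_index, window):
    """Average of the `window` prices ending at end_index (inclusive)."""
    start = end_index - window + 1
    if start < 0:
        raise ValueError("not enough history for this window")
    chunk = prices[start:end_index + 1]
    return sum(chunk) / len(chunk)

def short_above_long(prices, short=19, long=40, lookback=5):
    """True if the short average exceeded the long one on any of the last `lookback` days."""
    last = len(prices) - 1
    return any(
        moving_average(prices, i, short) > moving_average(prices, i, long)
        for i in range(last - lookback + 1, last + 1)
    )

rising = list(range(1, 61))       # steadily rising prices
falling = list(range(60, 0, -1))  # steadily falling prices

print(short_above_long(rising))
print(short_above_long(falling))
```

The list needs at least long + lookback - 1 entries, otherwise moving_average raises ValueError.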
Why does MySQL search all rows when I switch to a 1 year range?
--Table dates
id (int)
date (timestamp)
value (varchar)
PRIMARY(id), date_index(date)
1750 rows
Executing
EXPLAIN SELECT * FROM dates WHERE date BETWEEN '2011-04-27' AND '2011-04-28'
The rows column displays 18 rows.
If I extend the BETWEEN range in either direction, by 1 year for example, the rows column displays 1750 rows.
EXPLAIN SELECT * FROM dates WHERE date BETWEEN '2011-04-27' AND '2012-04-28'
EXPLAIN SELECT * FROM dates WHERE date BETWEEN '2010-04-27' AND '2011-04-28'
The optimizer builds the query plan depending on several things, including the amount and distribution of the data. My best guess would be that you don't have much more than a year's data, or that using the index for the year's worth of data wouldn't touch many fewer rows than the total table size.
If that doesn't sound right can you post up the output of:
SELECT MIN(date), MAX(date) FROM dates;
SELECT COUNT(*) FROM dates WHERE date BETWEEN '2011-04-27' AND '2012-04-28';
This article I wrote shows some examples of how the optimizer works too: What makes a good MySQL index? Part 2: Cardinality