I have a huge table in MySQL (more than 80 million rows covering 30 days) on which I want to run a GROUP BY query. I've already added an index on the column I want to group by.
I would like to know whether processing each day separately (30 runs in total) will be faster than processing all 30 days in a single query.
I'm using a Python script to work with the DB.
EDIT :
This is the query for 1 day (run 30 times, each time with the next date):
q1 = Select date(time) Date, count(user_name) from users where date(time) = '2014-03-02' group by Date;
and this is for all the 30 days (run just once):
q2 = Select date(time) Date, count(user_name) from users group by Date
So, it all boils down to:
Whether q1 is more efficient (run 30 times) or q2 (run once)
Processing each day separately will be faster.
One day's data is much smaller than the whole data set, so the GROUP BY needs far less space to do its work: the grouping is performed after only that single day's rows have been fetched.
Thanks
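As a side note, in either variant the WHERE clause wraps the indexed column in DATE(), which prevents MySQL from using an index on time for the filter. A minimal sketch of the one-day query rewritten as a range, assuming the index is on the time column:
-- Hypothetical rewrite of q1: a half-open range on `time` lets MySQL use an
-- index on `time`; DATE() stays only in the select list.
SELECT DATE(time) AS Date, COUNT(user_name)
FROM users
WHERE time >= '2014-03-02 00:00:00'
  AND time <  '2014-03-03 00:00:00'
GROUP BY Date;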
Related
I have a table that looks like this:
id | slot             | total
---+------------------+------
 1 | 2022-12-01T12:00 |   100
 2 | 2022-12-01T12:30 |   150
 3 | 2022-12-01T13:00 |   200
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here)
I want to sum the total up to the current moment in time (EDIT: WASN'T CLEAR INITIALLY, I WILL PROVIDE A LOWER SLOT BOUND, SO THE SUM WILL BE OVER SOME NUMBER OF DAYS/WEEKS, NOT OVER FULL TABLE). Let's say the time is currently 2022-12-01T12:45. If I run select * from my_table where slot < CURRENT_TIMESTAMP(),
then I get back records 1 and 2.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45, and so I want to find the proportion of the half hour slot of record 2 that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45 (assuming minute granularity), 50% of row 2 has elapsed, so I would expect the total to return as 150 / 2 = 75.
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30 would not work in that case.
select sum(fraction * total) as t from
(
    select total,
           LEAST(
               timestampdiff(
                   minute,
                   slot,
                   current_timestamp()
               ),
               30
           ) / 30 as fraction
    from my_table
    where slot <= current_timestamp()
) as sub
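Regarding the "any interval frequency" part: one way to avoid the hardcoded 30 is to derive each slot's length from the gap to the following slot with a window function. A rough sketch only, assuming MySQL 8+, back-to-back slots, and a hypothetical lower bound of '2022-11-01':
-- Sketch: slot_minutes comes from the gap to the next slot, so the 30-minute
-- assumption disappears. If the in-progress slot is the very last row in the
-- table, its slot_minutes is NULL and that row is skipped.
SELECT SUM(
           LEAST(TIMESTAMPDIFF(MINUTE, slot, CURRENT_TIMESTAMP()), slot_minutes)
           / slot_minutes * total
       ) AS t
FROM (
    SELECT slot, total,
           TIMESTAMPDIFF(MINUTE, slot, LEAD(slot) OVER (ORDER BY slot)) AS slot_minutes
    FROM my_table
    WHERE slot >= '2022-11-01'   -- hypothetical lower slot bound from the EDIT above
) AS x
WHERE slot <= CURRENT_TIMESTAMP();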
Consider computing your sum first, then subtracting the not-yet-elapsed portion of the last row's total. In order to keep that last row's total available, I'd prefer applying window functions instead of aggregation, and limiting the output to the last row.
SET @current_time = CURRENT_TIMESTAMP();

WITH cte AS (
    SELECT slot,
           SUM(total) OVER (ORDER BY slot) AS total,
           total AS rowtotal
    FROM my_table
    WHERE slot < @current_time
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE,
                                   slot,
                                   @current_time))
               / 30 * rowtotal AS total
FROM cte;
Note1: Adding an index on the slot field is likely to boost this query's performance.
Note2: If this runs against millions of rows, the timestamp could otherwise change between the statements that use it. You can store it in a variable before the query is run (or in another cte).
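For instance, the "another cte" variant might look like this (a sketch of the same query as above, just with the timestamp captured once in its own CTE):
WITH now_cte AS (
    SELECT CURRENT_TIMESTAMP() AS now_ts
),
cte AS (
    SELECT slot,
           SUM(total) OVER (ORDER BY slot) AS total,
           total AS rowtotal
    FROM my_table, now_cte
    WHERE slot < now_ts
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE, slot, (SELECT now_ts FROM now_cte)))
               / 30 * rowtotal AS total
FROM cte;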
Create a BTREE index on the slot column, as it has high selectivity.
I have a table of cellular invoices, relevant columns are Cellular_Account_id (INT), billing_end_date(DATE), and data_usage_GB.
There is a separate row for each account every month. I'm trying to get a list of accounts that have had no data usage for each of the past three months.
I'm pretty new to databases in general, so I'm not really even sure what syntax I should be searching for, or what approach I should be taking.
I can, of course, select WHERE data_usage_GB = 0.000 AND MONTH(billing_end_date) = month(current_date()) -1 but that only gives me the info in 1 month's range. I'm not sure how to group together the results where data_usage_GB = 0.000 for each of the last three months.
I'd group by the account, get the maximum date for each and then filter them using a having clause:
SELECT cellular_account_id
FROM invoices
GROUP BY cellular_account_id
HAVING MAX(billing_end_date) < DATE_SUB(CURRENT_DATE, INTERVAL 3 MONTH)
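If "no data usage" should instead mean that invoices exist but all show zero usage for each of the last three billing months, a rough sketch along the same GROUP BY / HAVING lines (assuming one invoice per account per month, as described, and non-negative usage values):
SELECT cellular_account_id
FROM invoices
WHERE billing_end_date >= DATE_SUB(CURRENT_DATE, INTERVAL 3 MONTH)
GROUP BY cellular_account_id
HAVING COUNT(*) >= 3               -- an invoice in each of the three months
   AND MAX(data_usage_GB) = 0;     -- and none of them shows any usage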
Some background first. We have a MySQL database with a "live currency" table. We use an API to pull the latest currency values for different currencies, every 5 seconds. The table currently has over 8 million rows.
Structure of the table is as follows:
id (INT 11 PK)
currency (VARCHAR 8)
value (DECIMAL
timestamp (TIMESTAMP)
Now we are trying to use this table to plot the data on a graph. We are going to have various different graphs, e.g: Live, Hourly, Daily, Weekly, Monthly.
I'm having a bit of trouble with the query. Using the Weekly graph as an example, I want to output data from the last 7 days, in 15 minute intervals. So here is how I have attempted it:
SELECT *
FROM currency_data
WHERE ((currency = 'GBP')) AND (timestamp > '2017-09-20 12:29:09')
GROUP BY UNIX_TIMESTAMP(timestamp) DIV (15 * 60)
ORDER BY id DESC
This outputs the data I want, but the query is extremely slow. I have a feeling the GROUP BY clause is the cause.
Also BTW I have switched off the sql mode 'ONLY_FULL_GROUP_BY' as it was forcing me to group by id as well, which was returning incorrect results.
Does anyone know of a better way of doing this query which will reduce the time taken to run the query?
You may want to create summary tables for each of the graphs you want to do.
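As a rough sketch of what such a summary table could look like for the 15-minute/weekly case (the table name, column types and the MAX() aggregate here are assumptions; use whatever the graph actually needs):
-- Hypothetical rollup table, refreshed periodically (e.g. by cron or a MySQL
-- event); the weekly graph then reads this instead of the raw 8M-row table.
CREATE TABLE currency_data_15min (
    currency   VARCHAR(8)     NOT NULL,
    slot_start TIMESTAMP      NOT NULL,
    value      DECIMAL(18,6)  NOT NULL,
    PRIMARY KEY (currency, slot_start)
);

INSERT INTO currency_data_15min (currency, slot_start, value)
SELECT currency,
       FROM_UNIXTIME(UNIX_TIMESTAMP(timestamp) DIV (15 * 60) * (15 * 60)) AS slot_start,
       MAX(value)
FROM currency_data
WHERE timestamp > NOW() - INTERVAL 7 DAY
GROUP BY currency, slot_start
ON DUPLICATE KEY UPDATE value = VALUES(value);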
If your data really is coming every 5 seconds, you can attempt something like:
SELECT *
FROM currency_data cd
WHERE currency = 'GBP' AND
timestamp > '2017-09-20 12:29:09' AND
UNIX_TIMESTAMP(timestamp) MOD (15 * 60) BETWEEN 0 AND 4
ORDER BY id DESC;
For both this query and your original query, you want an index on currency_data(currency, timestamp, id).
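For example (the index name is arbitrary):
CREATE INDEX idx_currency_ts_id ON currency_data (currency, timestamp, id);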
I'm reasonably new to Access and having trouble solving what should be (I hope) a simple problem - think I may be looking at it through Excel goggles.
I have a table named importedData into which I (not so surprisingly) import a log file each day. This log file is from a simple data-logging application on some mining equipment, and essentially it saves a timestamp and status for the point at which the current activity changes to a new activity.
A sample of the data looks like this:
This information is then filtered using a query to define the range I want to see information for, say from 29/11/2013 06:00:00 AM until 29/11/2013 06:00:00 PM
Now the object of this is to take a status entry's timestamp and get the time difference between it and the record on the subsequent row of the query results. As the equipment works for a 12hr shift, I should then be able to build a picture of how much time the equipment spent doing each activity during that shift.
In the above example, the equipment was in status "START_SHIFT" for 00:01:00, in status "DELAY_WAIT_PIT" for 06:08:26 and so-on. I would then build a unique list of the status entries for the period selected, and sum the total time for each status to get my shift summary.
You can use a correlated subquery to fetch the next timestamp for each row.
SELECT
i.status,
i.timestamp,
(
SELECT Min([timestamp])
FROM importedData
WHERE [timestamp] > i.timestamp
) AS next_timestamp
FROM importedData AS i
WHERE i.timestamp BETWEEN #2013-11-29 06:00:00#
AND #2013-11-29 18:00:00#;
Then you can use that query as a subquery in another query where you compute the duration between timestamp and next_timestamp. And then use that entire new query as a subquery in a third where you GROUP BY status and compute the total duration for each status.
Here's my version which I tested in Access 2007 ...
SELECT
sub2.status,
Format(Sum(Nz(sub2.duration,0)), 'hh:nn:ss') AS SumOfduration
FROM
(
SELECT
sub1.status,
(sub1.next_timestamp - sub1.timestamp) AS duration
FROM
(
SELECT
i.status,
i.timestamp,
(
SELECT Min([timestamp])
FROM importedData
WHERE [timestamp] > i.timestamp
) AS next_timestamp
FROM importedData AS i
WHERE i.timestamp BETWEEN #2013-11-29 06:00:00#
AND #2013-11-29 18:00:00#
) AS sub1
) AS sub2
GROUP BY sub2.status;
If you run into trouble or need to modify it, break out the innermost subquery, sub1, and test that by itself. Then do the same for sub2. I suspect you will want to change the WHERE clause to use parameters instead of hard-coded times.
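For example, the hard-coded times in sub1 could be replaced with Access parameters along these lines (a sketch; the parameter names are made up):
PARAMETERS [StartTime] DateTime, [EndTime] DateTime;
SELECT
    i.status,
    i.timestamp,
    (
        SELECT Min([timestamp])
        FROM importedData
        WHERE [timestamp] > i.timestamp
    ) AS next_timestamp
FROM importedData AS i
WHERE i.timestamp BETWEEN [StartTime] AND [EndTime];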
Note the query Format expression would not be appropriate if your durations exceed 24 hours. Here is an Immediate window session which illustrates the problem ...
' duration greater than one day:
? #2013-11-30 02:00# - #2013-11-29 01:00#
1.04166666667152
' this Format() makes the 25 hr. duration appear as 1 hr.:
? Format(#2013-11-30 02:00# - #2013-11-29 01:00#, "hh:nn:ss")
01:00:00
However, if you're dealing exclusively with data from 12 hr. shifts, this should not be a problem. Keep it in mind in case you ever need to analyze data which spans more than 24 hrs.
If subqueries are unfamiliar, see Allen Browne's page: Subquery basics. He discusses correlated subqueries in the section titled Get the value in another record.
I'm making a fitness logbook where indoor rowers can log their results.
To make it interesting and motivating, I'm implementing an achievement system.
I'd like to have an achievement that is awarded when someone rows more than 90 times within 24 weeks.
Does anybody have some hints on how I can implement this in MySQL?
The MySQL table for the logbook is pretty straightforward: id, userid, date (timestamp), etc. (the rest is omitted because it doesn't really matter).
The gist is that the span between the first row date and the last one can't exceed 24 weeks.
I assume from your application that you want the most recent 24 weeks.
In mysql, you do this as:
select lb.userid
from logbook lb
where datediff(now(), lb.date) <= 7*24
group by lb.userid
having count(*) >= 90
If you need it for an arbitrary 24-week period, can you modify the question?
Just do an SQL query to count the number of rows a user has between now and 24 weeks ago. This is a pretty straightforward query to run.
Look at using something with datediff in mysql to get the difference between now and 24 weeks ago.
After you have a script set up to do this, set up a cron job to run either every day or every week and do some automation on this.
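For a single user, that count might look something like this (a sketch; the table and column names follow the question, and the user id is hypothetical):
SELECT COUNT(*) AS rows_last_24_weeks
FROM logbook
WHERE userid = 42                                   -- hypothetical user id
  AND `date` >= DATE_SUB(NOW(), INTERVAL 24 WEEK);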
I think you should create a table achievers which you populate with the achievers of each day.
You can set up a recurring (daily, right before midnight) event in which you run a query like this:
delete from achievers;

insert into achievers (userid)
select userid
from logbook
where `date` <= current_timestamp
  and `date` > current_timestamp - interval 24 week
group by userid
having count(*) >= 90;
For events in mysql: http://dev.mysql.com/doc/refman/5.1/en/events-overview.html
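A sketch of what that recurring event could look like (the event name is made up; this assumes the event scheduler is enabled and, in the mysql client, a temporary DELIMITER change around the multi-statement body):
DELIMITER //
CREATE EVENT refresh_achievers
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY - INTERVAL 5 MINUTE   -- "right before midnight"
DO
BEGIN
    DELETE FROM achievers;
    INSERT INTO achievers (userid)
    SELECT userid
    FROM logbook
    WHERE `date` > CURRENT_TIMESTAMP - INTERVAL 24 WEEK
    GROUP BY userid
    HAVING COUNT(*) >= 90;
END//
DELIMITER ;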
This query will give you the list of users with the required activity over the last 24 weeks (168 days):
select userid, count(id) as total
from logbook
where `date` between DATE_SUB(CURDATE(), INTERVAL 168 DAY) and CURDATE()
group by userid
having count(id) >= 90;