MySQL Group By Into 10 Buckets

Let's say that I have a table that looks like:
1 | Test | Jan 10, 2017
...
10000 | Test | Jan 20, 2030
and I want to bucket the records in the table based on the date column, with a set number of 10 buckets regardless of the values of the dates. All I require is that each bucket covers a time range of equal length.
I understand that I could do something with
GROUP BY
YEAR(datefield),
MONTH(datefield),
DAY(datefield),
HOUR(datefield),
and subtract the smallest datefield from the largest and divide by 10 to get the time length covered by each bucket. However, is there already built-in functionality in MySQL that would do this? Doing the manual subtraction and division might lead to subtle edge cases. Am I on the right track by doing the subtraction and division for bucketing into a constant number of buckets?
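There is no built-in equal-width date bucketing in MySQL (NTILE() in 8.0+ splits by row count, not by time span), so the subtraction-and-division approach is the usual route. A minimal sketch, assuming a table named t (a placeholder name) and the datefield column from the question:
-- Derive a 0-9 bucket index by dividing each row's offset from the
-- minimum date by one tenth of the full span; the +1 keeps the maximum
-- date in bucket 9 instead of spilling into a bucket of its own.
SELECT FLOOR(10 * TIMESTAMPDIFF(SECOND, mm.min_dt, t.datefield)
                / (TIMESTAMPDIFF(SECOND, mm.min_dt, mm.max_dt) + 1)) AS bucket,
       COUNT(*) AS row_count
FROM t
CROSS JOIN (SELECT MIN(datefield) AS min_dt, MAX(datefield) AS max_dt FROM t) AS mm
GROUP BY bucket
ORDER BY bucket;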

Related

Nested sort in SELECT followed by Conditional INSERT based upon results of SELECT inquiry

I have been struggling with the following for some time.
The server I am using has MySQL ver 5.7 installed.
The issue:
I wish to take recorded tank level readings from one table, find the difference between the last two records for a particular tank, and multiply this by a factor to get a quantity used.
The extracted quantity (if it is +ve, else 0) is then to be inserted into another table for further use.
The Quant value extracted may be +ve or -ve as tanks fill and empty. I only require the used quantity, i.e. a falling level.
The two following tables are used:
Table 'tf_rdgs' sample (value1 is content height):

id | location | value1 | reading_time
---|----------|--------|-------------
1  | 18       | 1500   |
2  | 18       | 1340   |
3  | 9        | 1600   |
4  | 18       | 1200   |
5  | 9        | 1400   |
6  | 18       | 1765   | yyyy
7  | 18       | 1642   | xxxx
Table 'flow' example:

id | location | Quant | reading_time
---|----------|-------|-------------
1  | 18       | 5634  | dd-mm: HH-mm
2  | 18       | 0     | dd-mm: HH-mm
3  | 18       | 123   | current time
I do not need to go back over history and am only interested in the latest level readings as a new level reading is inserted.
I can get the following to work with a table of only one location.
INSERT INTO flow (location, Quant)
SELECT t1.location, (t2.value1 - t1.value1) AS Quant
FROM tf_rdgs t1
CROSS JOIN tf_rdgs t2 ON t1.reading_time > t2.reading_time
ORDER BY t2.reading_time DESC
LIMIT 1
It is not particularly efficient but works and gives the following return from the above table.
location | Quant
---------|------
18       | 123
For a table with mixed locations, including a WHERE t1.location = ... clause does not work.
The problems I am struggling with are:
How to nest the initial sorting by location for the subsequent query of the difference between the last two tank level readings.
A singular location search is OK rather than all tanks.
A conditional INSERT to insert the 'Quant' value only if it is +ve, or else insert a 0 if it is -ve (i.e. filling).
I have tried many permutations on these without success.
Once the above has been achieved, it needs to run on a conditional trigger (based upon the location of the inserted data) on the tf_rdgs table, activated upon each new reading inserted from the sensors on a particular tank.
I could achieve the above, with the exception of the conditional insert, if each tank had a dedicated table, but unfortunately I can't go there due to the existing data structure and usage.
Any direction or assistance on parts or the whole of this is much appreciated.
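A hedged sketch of one way to express this in MySQL 5.7 (no window functions), using only the table and column names above; the hard-coded location filter and the no-previous-reading handling are assumptions, not tested against the real data:
-- Take the latest reading for one location, look up the previous reading
-- for the same location, and clamp a negative difference (filling) to 0.
INSERT INTO flow (location, Quant)
SELECT r.location,
       GREATEST(
           COALESCE((SELECT prev.value1
                       FROM tf_rdgs prev
                      WHERE prev.location = r.location
                        AND prev.reading_time < r.reading_time
                      ORDER BY prev.reading_time DESC
                      LIMIT 1),
                    r.value1) - r.value1,   -- no earlier reading => 0
           0) AS Quant
FROM tf_rdgs r
WHERE r.location = 18            -- location of the newly inserted reading
ORDER BY r.reading_time DESC
LIMIT 1;
Inside an AFTER INSERT trigger on tf_rdgs, the hard-coded location would presumably be replaced by NEW.location, but that part depends on how the sensor inserts arrive.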

How do I select row with the most recent reduction in a column value from same columns value in previous row?

I have a set of inventory data where the amount increases at a given rate. For example, the inventory increases by ten units every day. However, from time to time there will be an inventory reduction that could be any amount. I need a query that can find me the most recent inventory reduction and return to me the sum of that deduction.
My table holds date and amount for numerous item IDs. In theory, what I am trying to do is select all amounts and dates for a given item ID, and then find the most recent reduction between two consecutive days' inventory amounts. Due to the fact that multiple items are tracked, there is no guarantee that the id column will be consecutive for a set of items.
Researching to find a solution to this has been completely overwhelming. It seems like window functions might be a good route to try, but I have never used them and don't even really have a concept of where to start.
While I could easily return the amounts and do the calculation in PHP, I feel the right thing to do here is harness SQL but my experience with more complex queries is limited.
ID  | ItemID | Date       | Amount
----|--------|------------|-------
1   | 2      | 2019-05-05 | 25
7   | 2      | 2019-05-06 | 26
34  | 2      | 2019-05-07 | 14
35  | 2      | 2019-05-08 | 15
67  | 2      | 2019-05-09 | 16
89  | 2      | 2019-05-10 | 5
105 | 2      | 2019-05-11 | 6
Given the data above, it would be nice to see a result like:
item id | date       | reduction
--------|------------|----------
2       | 2019-05-10 | 11
This is because the most recent inventory reduction is between id 67 and 89 and the amount of the reduction is 11 on May 10th 2019.
In MySQL 8+, you can use lag():
select t.*, (prev_amount - amount) as reduction
from (select t.*,
             lag(amount) over (partition by itemid order by date) as prev_amount
      from t
     ) t
where prev_amount > amount
order by date desc
limit 1;
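If you are on a version before MySQL 8.0 (no window functions), a hedged alternative is to fetch the previous amount with a correlated subquery; t is the same placeholder table name as above:
-- Emulate lag(): for each row, look up the latest earlier amount
-- for the same item, then keep only the most recent reduction.
select s.*, (prev_amount - amount) as reduction
from (select t.*,
             (select t2.amount
                from t t2
               where t2.itemid = t.itemid
                 and t2.date < t.date
               order by t2.date desc
               limit 1) as prev_amount
        from t
     ) s
where prev_amount > amount
order by date desc
limit 1;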

Calculated field in access with a loop

I want to use a calculated field in Access; however, the tricky part for me is when I have to run it through the dates. I have a database with multiple records for the same day, but with different times. Let's take this one for example:
Date | Report | Onblock Time
-----|--------|-------------
27/5 | 5:45   | 8:52
27/5 | 9:35   | 10:57
27/5 | 11:52  | 12:59
So, what I want to do is add 45 minutes to the first time that shows (in this case 5:45) and add 30 minutes to the last one (in this case 12:59). Once those two things are done, I want to calculate the difference between them.
I've tried [(Onblock Time + 0:30) - (Report - 0:45)] in the expression generator, and it seems to work. The problem I have is when I have to make it for a table that has thousands of records, with 4-6 a day. Is there any sort of automated loop, like a for each or anything like that?
Thanks in advance,
Jonathan
If I understood you right, you need a query which returns, for each day, the number of minutes between the minimum of ReportTime + 0:45 and the maximum of OnblockTime + 0:30. If so, the SQL for the query should be like this:
SELECT ReportDate,
       DateDiff("n", DateAdd("n", 45, Min([ReportTime])), DateAdd("n", 30, Max([OnblockTime]))) AS Diff
FROM TimeTable
GROUP BY ReportDate;
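For the sample rows above, assuming 27/5 is one ReportDate, this evaluates to DateDiff("n", 5:45 + 0:45, 12:59 + 0:30) = DateDiff("n", 6:30, 13:29) = 419 minutes.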

Spotfire intersect first 'n' periods

Is there a way to use an Over and Intersect function to get the average sales for the first 3 periods (not always consecutive months, sometimes a month is skipped) for each Employee?
For example:
EmpID 1 is 71.67 ((80 + 60 + 75)/3) despite skipping "3/1/2007"
EmpID 3 is 250 ((350 + 250 + 150)/3).
I'm not sure how EmpID 2 would work because there are just two data points.
I've used a workaround: a calculated column using DenseRank over Date, "asc", EmpID, then another Boolean calculated column where that DenseRank column is <= 3, and then Over functions over the Boolean = TRUE rows. But I want to figure out the correct way to do this.
There are Last 'n' Period functions but I haven't seen anything resembling a First 'n' Period function.
EmpID Date Sales
1 1/1/2007 80
1 2/1/2007 60
1 4/1/2007 75
1 5/1/2007 30
1 9/1/2007 100
2 2/1/2007 200
2 3/1/2007 100
3 12/1/2006 350
3 1/1/2007 250
3 3/1/2007 150
3 4/1/2007 275
3 8/1/2007 375
3 9/1/2007 475
3 10/1/2007 300
3 12/1/2007 200
I suppose the solution depends on where you want this data represented, but here is one example:
If((Rank([Date],"asc",[EmpID])<=3) and (Max(Rank([Date],"asc",[EmpID])) OVER ([EmpID])>=3),Avg([Sales]) over ([EmpID]))
You can insert this as a calculated column and it will give you what you want (assuming your data is sorted by date when imported).
You may want to see the row numbering, and in that case insert this as a calculated column as well and name it RN
Rank([Date],"asc",[EmpID])
Explanation
Rank([Date],"asc",[EmpID])
This part of the function is basically applying a row number (labeled as RN in the results below) to each EmpID grouping.
Rank([Date],"asc",[EmpID])<=3
This is how we are taking the top 3 rows regardless if Months are skipped. If your data isn't sorted, we'd have to create one additional calculated column but the same logic applies.
(Max(Rank([Date],"asc",[EmpID])) OVER ([EmpID])>=3)
This is where we are basically ignoring EmpID = 2, or any EmpID who doesn't have at least 3 rows. Removing this would give you the average (dynamically) for each EmpID based on their first 1, 2, or 3 months respectively.
Avg([Sales]) over ([EmpID])
Now that our data is limited to the rows we care about, just take the average for each EmpID.
@Chris - Here is the solution I came up with:
Step 1: Inserted a calculated column 'rank' with the expression below
DenseRank([Date],"asc",[EmpID])
Step 2: Created a cross table visualization from the data table and limited data with the expression below

Should I worry about 1B+ rows in a table?

I've got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let's say I want to keep track of daily views, for each day, for every article. If I have 1,000 user-written articles, the number of rows would compute to:
365 (1 year) * 1,000 => 365,000
Which is not too bad. But let's say the number of articles grows to 1M and, as time passes by, to 3 years. The number of rows would then compute to:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously, over time, this table will keep growing, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs handle situations like this quite commonly?
I plan on using the views data in our reports. Either break it down to months or even years. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look into how products like MRTG, via RRD, do their logging. The theory is you don't store all the data at maximum resolution indefinitely, but regularly aggregate them into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5-minutes, then 5-minute averages for the last hour, then hourly for a day, daily for a month, and so on.
So, for example, if you have a bunch of records like this for a single article:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 1 | 5 | day
2011 | 12 | 2 | 7 | day
2011 | 12 | 3 | 10 | day
2011 | 12 | 4 | 50 | day
You would then at regular periods create a new record(s) that summarises these data, in this example just the total count for the month
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 72 | month
Or the average per day:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 2.3 | month
Of course, you may need some flag to indicate the "summarised" status of the data; in this case I've used a 'type' column to distinguish the "raw" records from the processed ones, allowing you to purge the day records as required.
INSERT INTO statistics (article_id, year, month, day, count, type)
SELECT article_id, year, month, max(day), sum(count), 'month'
FROM statistics
WHERE type = 'day'
GROUP BY article_id, year, month, type
(I haven't tested that query, it's just an example)
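A matching purge step (equally untested, just a sketch using the example's month) could then remove the day-level rows that the monthly record now covers:
-- Remove raw day rows once the month has been summarised.
DELETE FROM statistics
WHERE type = 'day'
  AND year = 2011
  AND month = 12;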
The answer is "it depends". but yes, it will probably be a lot to deal with.
However - this is generally a problem of "cross that bridge when you need to". It's a good idea to think about what you could do if this becomes a problem for you in the future.. but it's probably too early to actually implement any suggestions until they're necessary.
My suggestion, if it ever occurs, is to not keep the individual records for longer than X months (where you adjust X according to your needs). Instead, store the aggregated data that you currently feed into your reports. What you'd do is run, say, a daily script that looks at your records, grabs any that are over X months old, creates a "daily_stats" object of some sort, and then deletes the originals (or better yet, archives them somewhere).
This will ensure that only X months' worth of data are ever in the db, but you still have quick access to an aggregated form of the stats for long-timeline reports.
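As a hedged sketch of that idea (the table names article_views and monthly_stats are assumptions, the column names follow the question's schema, and X is taken as 6 months):
-- Roll daily rows older than the cut-off into a monthly summary table,
-- then delete the originals. Archive first, or wrap both statements in
-- a transaction, so nothing is lost if a step fails.
INSERT INTO monthly_stats (article_id, year, month, views_count)
SELECT article_id, year, month, SUM(views_count)
FROM article_views
WHERE STR_TO_DATE(CONCAT(year, '-', month, '-', day), '%Y-%m-%d')
      < CURDATE() - INTERVAL 6 MONTH
GROUP BY article_id, year, month;

DELETE FROM article_views
WHERE STR_TO_DATE(CONCAT(year, '-', month, '-', day), '%Y-%m-%d')
      < CURDATE() - INTERVAL 6 MONTH;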
It's not something you need to worry about if you can put some practices in place.
Partition the table; this should make archiving easier to do (see the sketch after this list)
Determine how much data you need at present
Determine how much data you can archive
Ensure that the table has the right build, perhaps in terms of data types and indexes
Schedule for a time when you will archive partitions that meet the aging requirements
Schedule for index checking (and other table checks)
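As a minimal sketch of the partitioning point above (assuming a simplified variant of the table with a single view_date column instead of separate day/month/year columns; all names are illustrative only):
-- Range-partition by year; note that every unique key must include the
-- partitioning column, hence the composite primary key.
CREATE TABLE article_views_partitioned (
    article_id  INT NOT NULL,
    view_date   DATE NOT NULL,
    views_count INT NOT NULL DEFAULT 0,
    PRIMARY KEY (article_id, view_date)
)
PARTITION BY RANGE (YEAR(view_date)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Archiving a whole year later becomes a cheap metadata operation:
-- ALTER TABLE article_views_partitioned DROP PARTITION p2023;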
If you have a DBA in your team, then you can discuss it with him/her, and I'm sure they'll be glad to assist.
Also, like what is used in many data warehouses (and I just saw @Taryn's post, which I agree with): store aggregated data as well. What to aggregate is suggested fairly quickly by the data you keep in the involved table. If you have trouble with possible editing/updating of records, then it brings to light (even more) the fact that you will just have to set restrictions, like how much data to keep (which means this data is what can be modified), and have procedures and jobs in place to ensure that the aggregated data is checked/updated daily and can be updated/checked manually when any changes are made. This way, data integrity is maintained. Discuss with your DBA what other approaches you can take.
By the way, in case you didn't already know: aggregated data are normally needed for weekly or monthly reports, and many other reports based upon an interval. Adjust the granularity of your aggregation as needed, but not so much that it becomes too tedious or seemingly exaggerated.