I'm not even quite sure what to call what I am trying to do. I need a column that accumulates the total of its own calculations on previous rows. Here's a simplified example of the output I need (d is the column I need help with):
a b c d e
1 36 6 36 6
2 0 5 30 6
3 0 4 24 6
4 22 10 40 4
5 0 9 36 4
6 0 8 32 4
a is an autonumber column
b is user entered column
c is a column that usually counts down by one, but occasionally has other values added
d is equal to: the prior row of d + b - prior row of e
e is equal to d/c
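To make the recurrence concrete, here is a minimal Python sketch of the two definitions above, checked against the sample rows. The function name `accumulate` is made up for illustration, and it assumes the first row starts from a prior d and e of 0 and that c is never 0.

```python
from fractions import Fraction


def accumulate(rows):
    """rows: iterable of (a, b, c). Returns a list of (a, b, c, d, e).

    d = prior d + b - prior e, and e = d / c, with prior d and prior e
    assumed to be 0 before the first row. c must be non-zero.
    """
    out = []
    prev_d = Fraction(0)
    prev_e = Fraction(0)
    for a, b, c in rows:
        d = prev_d + b - prev_e
        e = d / c
        out.append((a, b, c, d, e))
        prev_d, prev_e = d, e
    return out


# Columns a, b, c from the example table in the question.
SAMPLE = [(1, 36, 6), (2, 0, 5), (3, 0, 4), (4, 22, 10), (5, 0, 9), (6, 0, 8)]
RESULT = accumulate(SAMPLE)

for a, b, c, d, e in RESULT:
    print(a, b, c, d, e)
```

Each row depends on the computed values of the row before it, which is why a plain running sum or a single lag() over existing columns cannot express it.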
It may help if I explain why I need this. I'm working on simulating the impact of rate shocks on a banking institution (commonly called stress testing). b is a column that represents some impact of an interest rate shock (example: a percentage change to the number of loans made). c is a column that represents the number of months before the market is expected to return to normal without any further rate shocks. d represents the impact on loans for that time period and e is the rate at which d is returning to zero.
Here is what I've tried so far and why it doesn't work:
SET @var := 0;
SELECT @var := @var + b - @var/c AS d
This would be my preferred solution, but I'm creating a view, and views are not allowed to reference user created variables. It has to be a view because I need to reference it in other queries and views.
SELECT (lag(d,1,0) OVER (ORDER BY a)) + b - (lag(e,1,0) OVER (ORDER BY a)) AS d
This doesn't work because d hasn't been defined yet. I just get an error that d isn't a column.
SELECT
((SELECT sum(temp.b)
FROM table as temp
WHERE temp.a <= a)
/(SELECT sum(temp2.c)
FROM table as temp2
WHERE temp2.a = a))
This doesn't work because I need to divide the reduced value of d, not just the sum of b.
d is the cumulative sum of the difference between b and c. This would be expressed as:
select a, b, c,
sum(b - c) over (order by a) as d,
sum(b - c) over (order by a) / c as e
from t;
I'm using window functions, even though the question is tagged MySQL. MySQL only supports them since version 8.0.
I'm not sure how stable the results would be in a VIEW, but you could use session variables.
SELECT a, b, c, @d := @d + b - @e AS d, @e := @d / c AS e
FROM t
, (SELECT @d := 0 AS initD, @e := 0 AS initE) As init
ORDER BY a;
Also, keep in mind that this relies on observed behavior rather than guaranteed behavior: MySQL doesn't officially guarantee that the select expressions are evaluated from left to right.
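A different technique that avoids session variables is a recursive CTE, which carries d and e forward one row at a time. The sketch below is not from any of the answers here. It runs the query against an in-memory SQLite database through Python's `sqlite3` module, and it assumes that `a` is gap-free (each row's predecessor is `a - 1`) and that `c` is never 0. Whether a particular MySQL 8 build accepts the same query inside a view, and how it types the `d` and `e` columns, would need to be checked there.

```python
import sqlite3

# Recursive CTE: the anchor row seeds d and e, each later row reads the
# previous row's d and e. Assumes column a has no gaps and c is never 0.
RECURSIVE_QUERY = """
WITH RECURSIVE acc(a, b, c, d, e) AS (
    SELECT a, b, c, b * 1.0, b * 1.0 / c
    FROM t
    WHERE a = (SELECT MIN(a) FROM t)
    UNION ALL
    SELECT t.a, t.b, t.c,
           acc.d + t.b - acc.e,
           (acc.d + t.b - acc.e) / t.c
    FROM acc JOIN t ON t.a = acc.a + 1
)
SELECT a, b, c, d, e FROM acc ORDER BY a
"""


def run_recursive(rows):
    """rows: list of (a, b, c). Returns the rows with d and e appended."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER PRIMARY KEY, b INTEGER, c INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(RECURSIVE_QUERY).fetchall()
    conn.close()
    return result


SAMPLE = [(1, 36, 6), (2, 0, 5), (3, 0, 4), (4, 22, 10), (5, 0, 9), (6, 0, 8)]
CTE_RESULT = run_recursive(SAMPLE)

for row in CTE_RESULT:
    print(row)
```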
I solved this with pure brute force. I created a permanent table, and then I created a trigger which deletes all rows of the table and then recreates the table using the variable solution. Thus any time the underlying data changes, MySQL automatically updates the table with the most current information. The table acts like a view in that it is updated in real time, but is a permanent table.
Unfortunately, MySQL doesn't offer a FOR EACH STATEMENT option for triggers, which means my code runs again for every row that gets updated, so if I update 100 rows, my table gets deleted and recreated 100 times.
Still looking for a better solution, but at least this works.
I'm on MySQL 5.7 and have a dataset in a table called bundles.
bundle_id single_amount item_count
1 100 1
2 20 3
3 15 2
SQL Fiddle: http://sqlfiddle.com/#!9/823c91/1/0
For example, the table means that for bundle 1 a customer bought one item priced at $100, and another customer bought 3 items priced at $20 each.
I want to get the top n data by individual item, not by a bundle. The straightforward idea is to flatten the data as the table below.
single_amount item_count
100 1
20 3
20 3
20 3
15 2
15 2
It's easy to do this in the service layer, but considering the potential size of the dataset, is there a way to do it on the MySQL side?
The simplest method is to create a table with a single number column and fill it with the values from 1 up to the largest value found in the item_count column.
Let's say, you have created table as numbers_table (number_col)
Then you can use that numbers table as follows:
select single_amount, item_count
from your_table t
join numbers_table n on t.item_count >= n.number_col
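As a quick check of the numbers-table join above, here is a small sketch that runs it against an in-memory SQLite database through Python's `sqlite3` module rather than MySQL. The helper name `flatten_bundles` and the `ORDER BY` are additions for illustration. The numbers table is filled from 1 to the largest `item_count`, as the answer describes.

```python
import sqlite3


def flatten_bundles(rows):
    """rows: list of (bundle_id, single_amount, item_count).

    Returns one (single_amount, item_count) row per individual item.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bundles ("
        "bundle_id INTEGER, single_amount INTEGER, item_count INTEGER)"
    )
    conn.executemany("INSERT INTO bundles VALUES (?, ?, ?)", rows)

    # Numbers table holding 1..max(item_count).
    conn.execute("CREATE TABLE numbers_table (number_col INTEGER)")
    max_count = max((r[2] for r in rows), default=0)
    conn.executemany(
        "INSERT INTO numbers_table VALUES (?)",
        [(n,) for n in range(1, max_count + 1)],
    )

    # Each bundle row matches item_count rows of the numbers table.
    result = conn.execute(
        """
        SELECT t.single_amount, t.item_count
        FROM bundles t
        JOIN numbers_table n ON t.item_count >= n.number_col
        ORDER BY t.single_amount DESC
        """
    ).fetchall()
    conn.close()
    return result


FLAT = flatten_bundles([(1, 100, 1), (2, 20, 3), (3, 15, 2)])
print(FLAT)
```

With the rows sorted by `single_amount`, a top-n query only needs a `LIMIT n` on top of this join.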
MySQL 8+ supports recursive CTEs (and actually CTEs in general). But earlier versions do not. To solve this, you need a numbers table of some sort. MySQL doesn't have one built-in, but you can generate one.
Based on your description of the problem, the bundles table is big enough, so you can use that:
SELECT n.n, b.*
FROM bundles b JOIN
(SELECT (@rn := @rn + 1) as n
FROM bundles b CROSS JOIN
(SELECT @rn := 0) params
LIMIT 100
) n
ON n.n <= b.item_count;
The LIMIT 100 is just for performance. You need at least as many rows as the maximum item_count.
Here is a SQL Fiddle.
I have two tables in my database
Table A with columns user_id, free_data, used_data
Table B with columns donor_id, receptor_id, share_data
Basically, a user (let's call him x) has some data in his account, which is represented by his entry in Table A. The data is stored in the free_data column. He can donate data to any other user (let's call him y), which will show up as an entry in Table B. The same amount of data gets deducted from user x's free_data column.
When the entry in Table B is created, an entry in Table A for user y is also created with a free_data value equal to share_data. Now user y can give away data to user z, and the process continues.
Each user keeps using their data, and the used_data entry in Table A keeps adding up to indicate how much data each user has used.
This is like a tree structure where there is an entry with all the data (the root node) which eventually gives data to others, who in turn give data to other nodes.
Now I would like to write an sql query such that, given a node x (id of entry in Table A), I should be able to sum up total data x has given & who all are beneficiaries at multiple level, all of their used_data need to be collated & showed against x.
Basically, I want to collate
Overall data x has donated.
How much of the donated data from x has been used up.
While the implementation is more graph-like, I am more interested to know if we assume it to be a tree below node x & can come up with a single sql query to be able to get the data I need.
Example
Table A
user_id, free_data, used_data
1 50 10
2 30 20
3 20 20
Table B
donor_id, receptor_id, share_data
1 2 30
1 3 20
Total data donated by 1 - 30 + 20 = 50
Total donated data used - 20 + 20 = 40
This is just one level, where 1 donated to 2 & 3. 2 in turn could donate to 4, and all that data needs to be collated in a bubbled-up fashion to calculate the overall donated data usage.
Yes, it's possible using a nested set model. There's a book by Joe Celko that describes it, but if you want to get straight into it there's an article that talks about it. Both pieces of collated data that you need can be retrieved by a single select statement like this:
SELECT * FROM TableB where `left` > some_value1 and `right` < some_value2
In the above example to get all the child nodes of "Portable Electronics" the query will be:
SELECT * FROM Electronics WHERE `left` > 10 and `right` < 19
The article describes how the left and right columns should be initialised.
If I understand the problem correctly, the following should give you the desired results:
SELECT B.donor_id AS donor_id, SUM(A.used_data) AS total_used_data FROM A
INNER JOIN B ON A.user_id = B.receptor_id GROUP BY B.donor_id;
Hope this will solve your problem now.
Try the query below (note that you will have to pass the user id in 2 places):
SELECT SUM(share_data) as total_donated, sum(used_data) as total_used FROM tablea
LEFT JOIN tableB
ON tableA.user_id = tableB.donor_id
WHERE user_id IN (select receptor_id as id
from (select * from tableb
order by donor_id, receptor_id) u_sorted,
(select @pv := '1') initialisation
where find_in_set(donor_id, @pv) > 0
and @pv := concat(@pv, ',', receptor_id)) OR user_id = 1;
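For comparison, on engines that support recursive CTEs the multi-level walk can be written without session variables. This is a different technique from the answers above, sketched against an in-memory SQLite database through Python's `sqlite3` module. It assumes that "data donated by x" means x's direct entries in Table B, and that "donated data used" is the sum of used_data over every beneficiary reachable from x at any depth. The helper name `donation_summary` is made up for illustration.

```python
import sqlite3

# beneficiaries collects every receptor reachable from :x.
# UNION (not UNION ALL) drops duplicates, so a cycle cannot loop forever.
QUERY = """
WITH RECURSIVE beneficiaries(user_id) AS (
    SELECT receptor_id FROM B WHERE donor_id = :x
    UNION
    SELECT B.receptor_id
    FROM B JOIN beneficiaries ON B.donor_id = beneficiaries.user_id
)
SELECT
    (SELECT COALESCE(SUM(share_data), 0)
       FROM B WHERE donor_id = :x) AS total_donated,
    (SELECT COALESCE(SUM(used_data), 0)
       FROM A
      WHERE user_id IN (SELECT user_id FROM beneficiaries)) AS total_used
"""


def donation_summary(a_rows, b_rows, x):
    """Returns (total_donated, total_used) for donor x."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE A (user_id INTEGER, free_data INTEGER, used_data INTEGER)"
    )
    conn.execute(
        "CREATE TABLE B (donor_id INTEGER, receptor_id INTEGER, share_data INTEGER)"
    )
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_rows)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_rows)
    row = conn.execute(QUERY, {"x": x}).fetchone()
    conn.close()
    return row


# Example data from the question.
TABLE_A = [(1, 50, 10), (2, 30, 20), (3, 20, 20)]
TABLE_B = [(1, 2, 30), (1, 3, 20)]
SUMMARY = donation_summary(TABLE_A, TABLE_B, 1)
print(SUMMARY)
```

If the structure really is a graph, x can end up among its own beneficiaries through a cycle, so the definition of "used" would need to be pinned down before relying on this.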
I have a mysql table-
User Value
A 1
A 12
A 3
B 4
B 3
B 1
C 1
C 1
C 8
D 34
D 1
E 1
F 1
G 56
G 1
H 1
H 3
C 3
F 3
E 3
G 3
I need to run a query which returns the 2nd distinct value that each user has.
That is, if any 2 values are accessed by every user, then based on the number of occurrences, pick the 2nd distinct value.
So in the data above, 1 & 3 are accessed by every user. 1 occurs more often than 3, so the 2nd distinct value will be 3.
So I thought first I will get all distinct user.
create table temp AS Select distinct user from table;
Then I will have an outer query-
Select value from table where value in (...)
Programmatically, I could iterate through each of the values a user has, like a Map, but I just couldn't write that as a Hive query.
This will return the second most frequent value from your list that spans all users. There isn't such a value in the table, which I expect is a typo in the data. In real data you will likely have multiple ties that you need to figure out how to handle.
Select value as second_distinct from
(select value, rank() over (order by occurrences desc) as rank
from
(SELECT value, unique_users, max(count_users) as count_users, count(value) as occurrences
from
(select value, size(collect_set(user) over (partition by value))
as count_users from my_table
) t
left outer join
(select count(distinct user) as unique_users from my_table
) t2 on (1=1)
where unique_users=count_users
group by value, unique_users
) t3
) t4
where rank = 2;
This works. It returns NULL because there is only one value that every user has seen (the value 1). Value 3 is not a solution because not every user has seen that value in your data. I expect you intended that 3 should be returned, but again it doesn't span all the users (user D did not see value 3).
Not sure how @invoketheshell's answer was marked correct; it doesn't run and it needs 6 MR jobs. This will get you there in 4 and is less code.
Query:
select value
from (
select value, value_count, rank() over (order by value_count desc) rank
from (
select value, count(value) value_count
from (
select value, num_users, max(num_users) over () max_users
from (
select value
, size(collect_set(user) over (partition by value)) num_users
from db.table ) x ) y
where num_users = max_users
group by value ) z ) f
where rank = 2
Output:
3
EDIT: Let me clarify my solution as there seems to be some confusion. The OP's example says
"So as above 1 & 3 is being accessed by each User ... "
As my comment below the question suggests, in the example given, user D never accesses value 3. I made the assumption that this was a typo and added this to the dataset and then added another 1 as well so that there are more 1's than 3's. So my code correctly returns 3, which was the desired output. If you run this script on the actual dataset it will also produce the correct output, which is nothing, because there isn't a "2nd Distinct". The only time it could produce an incorrect value is if there was no one specific number that was accessed by all users, which illustrates the point I was trying to make to @invoketheshell: if there is no single number that every user has accessed, running a query with 6 map-reduce jobs is an absurd way to find that out. Since we are using Hive, I believe it would be fair to assume that if this problem were a "real-world" problem, it would most likely be executed on at least hundreds of TBs of data (probably more). In the interest of preserving time and resources, it would behoove an individual to at least check that one number had been accessed by all users before running a massive query whose analysis hinges on that assumption being true.
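The logic both answers are debating can be sketched in a few lines of Python, which also makes the dataset issue easy to see. This is only an illustration, not Hive code: the function name is made up, it requires a value to be seen by every user (as in the first answer), and it mimics rank() by returning every value whose rank is exactly 2.

```python
from collections import Counter, defaultdict


def second_ranked_common_values(rows):
    """rows: list of (user, value).

    Keeps only values seen by every user, ranks them by number of
    occurrences (descending, rank() style), and returns those at rank 2.
    """
    users = {user for user, _ in rows}
    users_by_value = defaultdict(set)
    occurrences = Counter()
    for user, value in rows:
        users_by_value[value].add(user)
        occurrences[value] += 1

    common = [v for v, seen in users_by_value.items() if seen == users]

    result = []
    for v in common:
        # rank() = 1 + number of values with strictly more occurrences
        rank = 1 + sum(1 for w in common if occurrences[w] > occurrences[v])
        if rank == 2:
            result.append(v)
    return sorted(result)


# Data exactly as posted in the question.
ROWS = [
    ("A", 1), ("A", 12), ("A", 3), ("B", 4), ("B", 3), ("B", 1),
    ("C", 1), ("C", 1), ("C", 8), ("D", 34), ("D", 1), ("E", 1),
    ("F", 1), ("G", 56), ("G", 1), ("H", 1), ("H", 3), ("C", 3),
    ("F", 3), ("E", 3), ("G", 3),
]

AS_POSTED = second_ranked_common_values(ROWS)
WITH_D_FIXED = second_ranked_common_values(ROWS + [("D", 3)])
print(AS_POSTED, WITH_D_FIXED)
```

On the data as posted only the value 1 is shared by all users, so nothing has rank 2. Once user D also has a 3, the value 3 comes out second.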
I have a table (Table A) with a field of integers (Field B). For each row of Table A, I would like to construct a range of +/- 100 surrounding the integer value of Field B then find all values from Field B that are within these ranges. The query needs to be performed for all values in Field B. The query needs to return each row that is within each row range. Here is an example of what I am trying to do:
Table A
_______
A 1000
B 3000
C 5000
D 1090
Using the above Table A, the query would first find the ranges (+/- 100) for all integers in Field B.
900 - 1100
2900 - 3100
4900 - 5100
990 - 1190
The query would then iterate through these ranges and return rows from Table A that fall within the generated ranges. Using the above example, the query would return:
A 1000
A 1000
B 3000
C 5000
D 1090
D 1090
A and D are returned twice because they fall within their own ranges. How can I construct a query that will return each row that falls between the range of each row? Thanks in advance for the help.
SELECT t2.*
FROM tableA AS t1
INNER JOIN tableA AS t2 ON t2.fieldB >= (t1.fieldB - 100) AND t2.fieldB <= (t1.fieldB + 100)
Shouldn't A also be shown twice, since it's also in D's range? (that's the case with above query - if incorrect, please elaborate why ^^)
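The self-join above can be checked against the sample data with a short sketch that runs it on an in-memory SQLite database through Python's `sqlite3` module. The column name `fieldA` for the letter column and the `ORDER BY` are assumptions added for illustration, since the question does not name that column.

```python
import sqlite3


def rows_within_ranges(rows, width=100):
    """rows: list of (fieldA, fieldB).

    Returns one output row for every (range, row) pair where the row's
    fieldB lies within +/- width of the range's centre value.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tableA (fieldA TEXT, fieldB INTEGER)")
    conn.executemany("INSERT INTO tableA VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT t2.fieldA, t2.fieldB
        FROM tableA AS t1
        INNER JOIN tableA AS t2
          ON t2.fieldB >= (t1.fieldB - :w)
         AND t2.fieldB <= (t1.fieldB + :w)
        ORDER BY t2.fieldA
        """,
        {"w": width},
    ).fetchall()
    conn.close()
    return result


MATCHES = rows_within_ranges([("A", 1000), ("B", 3000), ("C", 5000), ("D", 1090)])
print(MATCHES)
```

A appears twice because it falls in both its own range and D's range (990 - 1190), and D appears twice for the mirror-image reason.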
Start with your inner-most pre-qualifier of every Table A record... Then re-join to table A again. I've added the qualifying Group ranges low and hi to show the qualifier basis you were looking for... In addition to D showing up twice, A should show up twice too as it qualifies the "D"s range too.
select
a2.ShowLetter,
a2.FieldB,
GrpRanges.RangeLow,
GrpRanges.RangeHi
from
( select distinct
a1.FieldB - 100 as RangeLow,
a1.FieldB + 100 as RangeHi
from
TableA a1 ) GrpRanges
JOIN TableA a2
on a2.FieldB between GrpRanges.RangeLow and GrpRanges.RangeHi
order by
a2.ShowLetter
I have a MySQL database where one column contains status codes. The column is of type int and the values will only ever be 100,200,300,400. It looks like below; other columns removed for clarity.
id | status
----------------
1 300
2 100
3 100
4 200
5 300
6 300
7 100
8 400
9 200
10 300
11 100
12 400
13 400
14 400
15 300
16 300
The id field is auto-generated and will always be sequential. I want to have a third column displaying a comma-separated string of the frequency distribution of the status codes of the previous 10 rows. It should look like this.
id | status | freq
-----------------------------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100 300,100,200,400 -- from rows 1-10
12 400 100,300,200,400 -- from rows 2-11
13 400 100,300,200,400 -- from rows 3-12
14 400 300,400,100,200 -- from rows 4-13
15 300 400,300,100,200 -- from rows 5-14
16 300 300,400,100 -- from rows 6-15
I want the most frequent code listed first. And where two status codes have the same frequency it doesn't matter to me which is listed first but I did list the smaller code before the larger in the example. Lastly, where a code doesn't appear at all in the previous ten rows, it shouldn't be listed in the freq column either.
And to be very clear the row number that the frequency string appears on does NOT take into account the status code of that row; it's only the previous rows.
So what have I done? I'm pretty green with SQL. I'm a programmer and I find this SQL language a tad odd to get used to. I managed the following self-join select statement.
select *, avg(b.status) freq
from sample a
join sample b
on (b.id < a.id) and (b.id > a.id - 11)
where a.id > 10
group by a.id;
Using the aggregate function avg, I can at least demonstrate the concept. The derived table b provides the correct rows to the avg function but I just can't figure out the multi-step process of counting and grouping rows from b to get a frequency distribution and then collapse the frequency rows into a single string value.
Also I've tried using standard stored functions and procedures in place of the built-in aggregate functions, but it seems the b derived table is out of scope or something. I can't seem to access it. And from what I understand writing a custom aggregate function is not possible for me as it seems to require developing in C, something I'm not trained for.
Here's sql to load up the sample.
create table sample (
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
status int
);
insert into sample(status) values(300),(100),(100),(200),(200),(300)
,(100),(400),(300),(300),(100),(400),(400),(400),(300),(300),(300)
,(100),(400),(100),(100),(200),(500),(300),(100),(400),(200),(100)
,(500),(300);
The sample has 30 rows of data to work with. I know it's a long question, but I just wanted to be as detailed as I could be. I've worked on this for a few days now and would really like to get it done.
Thanks for your help.
The only way I know of to do what you're asking is to use a BEFORE INSERT trigger. It has to be BEFORE INSERT because you want to update a value in the row being inserted, which can only be done in a BEFORE trigger. Unfortunately, that also means it won't have been assigned an ID yet, so hopefully it's safe to assume that at the time a new record is inserted, the last 10 records in the table are the ones you're interested in. Your trigger will need to get the values of the last 10 ID's and use the GROUP_CONCAT function to join them into a single string, ordered by the COUNT. I've been using SQL Server mostly and I don't have access to a MySQL server at the moment to test this, but hopefully my syntax will be close enough to at least get you moving in the right direction:
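create trigger sample_trigger BEFORE INSERT ON sample
FOR EACH ROW
BEGIN
DECLARE _freq varchar(50);
-- count each status among the 10 most recent rows, most frequent first
SELECT GROUP_CONCAT(tbl.status ORDER BY tbl.Occurrences DESC, tbl.status) INTO _freq
FROM (SELECT last10.status, COUNT(*) AS Occurrences
FROM (SELECT status FROM sample ORDER BY id DESC LIMIT 10) AS last10
GROUP BY last10.status) AS tbl;
-- assumes the sample table has a freq column to hold the string
SET new.freq = _freq;
END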
SELECT id, GROUP_CONCAT(status ORDER BY freq desc) FROM
(SELECT a.id as id, b.status, COUNT(*) as freq
FROM
sample a
JOIN
sample b ON (b.id < a.id) AND (b.id > a.id - 11)
WHERE
a.id > 10
GROUP BY a.id, b.status) AS sub
GROUP BY id;
SQL Fiddle
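To check the expected output independently of SQL, the sliding-window frequency string can be sketched in Python. The function name `freq_strings` is made up for illustration. It uses the first 16 status values from the load script, and breaks ties by listing the smaller code first, as in the question's example.

```python
from collections import Counter


def freq_strings(statuses, window=10):
    """statuses: list of status codes ordered by id, where id starts at 1.

    Returns {id: "code,code,..."} for every id that has `window` previous
    rows, with codes ordered by frequency (descending) and then by code.
    The current row's own status is not counted.
    """
    out = {}
    for i in range(window, len(statuses)):
        counts = Counter(statuses[i - window:i])  # the previous `window` rows
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
        out[i + 1] = ",".join(str(s) for s in ordered)
    return out


# First 16 values from the insert statement in the question.
STATUSES = [300, 100, 100, 200, 200, 300, 100, 400,
            300, 300, 100, 400, 400, 400, 300, 300]
FREQ = freq_strings(STATUSES)

for row_id in sorted(FREQ):
    print(row_id, STATUSES[row_id - 1], FREQ[row_id])
```

Codes that do not occur in the window never enter the counter, so they are left out of the string, as row 16 shows.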