Tree-like data collation in SQL (MySQL)

I have two tables in my database
Table A with columns user_id, free_data, used_data
Table B with columns donor_id, receptor_id, share_data
Basically, a user (let's call him x) has some data in his account, represented by his entry in Table A and stored in the free_data column. He can donate data to any other user (let's call him y), which shows up as an entry in Table B. The same amount of data is deducted from user x's free_data column.
When the entry in Table B is created, an entry in Table A for user y is also created with free_data equal to share_data. User y can then give away data to user z, and the process continues.
Each user keeps using their data, and used_data in Table A keeps adding up to show how much data each user has used.
This is like a tree structure: there is an entry with all the data (the root node), which gives data to others, who in turn give data to other nodes.
Now I would like to write an SQL query such that, given a node x (the id of an entry in Table A), I can sum up the total data x has given away and, for all beneficiaries at every level below x, collate their used_data and show it against x.
Basically, I want to collate
Overall data x has donated.
How much of the donated data from x has been used up.
While the implementation is more graph-like, I am mainly interested in whether, assuming it is a tree below node x, a single SQL query can get the data I need.
Example
Table A
user_id, free_data, used_data
1 50 10
2 30 20
3 20 20
Table B
donor_id, receptor_id, share_data
1 2 30
1 3 20
Total data donated by 1 - 30 + 20 = 50
Total donated data used - 20 + 20 = 40
This is just one level, where 1 donated to 2 & 3. 2 could in turn donate to 4, and all that data needs to be collated in a bubbled-up fashion to calculate the overall donated data usage.

Yes, it's possible using a nested set model. There's a book by Joe Celko that describes it, but if you want to get straight into it, there's an article that covers it. All of the collated data you need can be retrieved with a single SELECT statement like this:
SELECT * FROM TableB WHERE `left` > some_value1 AND `right` < some_value2
In the article's example, to get all the child nodes of "Portable Electronics", the query would be:
SELECT * FROM Electronics WHERE `left` > 10 and `right` < 19
The article describes how the left and right columns should be initialised.
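A minimal sketch of the nested-set idea applied to the donor tree, run against an in-memory SQLite database from Python so it can be tried without a MySQL server. The nodes table and its lft/rgt columns are hypothetical names chosen for this illustration (they avoid the reserved words left and right), the lft/rgt values are filled in by hand, and user 4 is an extra row added to show a second level.

```python
import sqlite3


def build_nested_set():
    """Create a hypothetical nodes table: user 1 donated to 2 and 3, and 2 donated to 4."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (user_id INTEGER PRIMARY KEY, used_data INTEGER,"
        " lft INTEGER, rgt INTEGER)"
    )
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?, ?)",
        [
            (1, 10, 1, 8),  # root: its interval encloses every other node
            (2, 20, 2, 5),
            (4, 5, 3, 4),   # child of 2 (not in the question's sample data)
            (3, 20, 6, 7),
        ],
    )
    return conn


def used_by_descendants(conn, user_id):
    """Sum used_data over every node strictly inside user_id's (lft, rgt) interval."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(child.used_data), 0)
        FROM nodes AS parent
        JOIN nodes AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.user_id = ?
        """,
        (user_id,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_nested_set()
    print(used_by_descendants(conn, 1))  # 45
    print(used_by_descendants(conn, 2))  # 5
```

The trade-off is that every new donation has to renumber lft/rgt for part of the tree, so this layout suits data that is read far more often than it changes.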

If I understand the problem correctly, the following should give you the desired results:
SELECT B.donor_id AS donor_id, SUM(A.used_data) AS total_used_data FROM A
INNER JOIN B ON A.user_id = B.receptor_id GROUP BY B.donor_id;

Hope this will solve your problem now.
Try the query below (note that you will have to pass the user id in two places):
SELECT SUM(share_data) as total_donated, sum(used_data) as total_used FROM tableA
LEFT JOIN tableB
ON tableA.user_id = tableB.donor_id
WHERE user_id IN (select receptor_id as id
from (select * from tableB
order by donor_id, receptor_id) u_sorted,
(select @pv := '1') initialisation
where find_in_set(donor_id, @pv) > 0
and @pv := concat(@pv, ',', receptor_id)) OR user_id = 1;
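For comparison, here is a sketch of the same collation written with a recursive common table expression instead of user variables. It assumes a database that supports WITH RECURSIVE (MySQL 8.0 or later does; the question does not say which version is in use) and is run here against in-memory SQLite from Python so it is self-contained. The function names are made up for this example, and "donated" is taken to mean direct donations only.

```python
import sqlite3


def build_sample():
    """Load the question's sample rows into tables A and B."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (user_id INTEGER, free_data INTEGER, used_data INTEGER)")
    conn.execute("CREATE TABLE B (donor_id INTEGER, receptor_id INTEGER, share_data INTEGER)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(1, 50, 10), (2, 30, 20), (3, 20, 20)])
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [(1, 2, 30), (1, 3, 20)])
    return conn


def donation_summary(conn, root_id):
    """Return (data donated directly by root_id, data used by all its beneficiaries)."""
    donated = conn.execute(
        "SELECT COALESCE(SUM(share_data), 0) FROM B WHERE donor_id = ?",
        (root_id,),
    ).fetchone()[0]
    used = conn.execute(
        """
        WITH RECURSIVE beneficiaries(user_id) AS (
            -- level 1: users who received data directly from the root
            SELECT receptor_id FROM B WHERE donor_id = ?
            UNION  -- UNION (not UNION ALL) drops repeats, so a cycle cannot loop forever
            -- deeper levels: users who received data from an earlier beneficiary
            SELECT B.receptor_id
            FROM B
            JOIN beneficiaries AS d ON B.donor_id = d.user_id
        )
        SELECT COALESCE(SUM(A.used_data), 0)
        FROM A
        JOIN beneficiaries ON beneficiaries.user_id = A.user_id
        """,
        (root_id,),
    ).fetchone()[0]
    return donated, used


if __name__ == "__main__":
    print(donation_summary(build_sample(), 1))  # (50, 40)
```

With the question's sample rows this gives 50 donated and 40 used, matching the totals worked out in the question.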

Related

Get Top n data from a set of grouped data

I'm on MySQL 5.7 and have a dataset in a table named bundles.
bundle_id | single_amount | item_count
---------------------------------------
1 | 100 | 1
2 | 20 | 3
3 | 15 | 2
SQL Fiddle: http://sqlfiddle.com/#!9/823c91/1/0
For example, the table means that in bundle 1 a customer bought one item priced at $100, and in bundle 2 another customer bought 3 items priced at $20 each.
I want to get the top n rows by individual item, not by bundle. The straightforward idea is to flatten the data as in the table below.
single_amount | item_count
---------------------------
100 | 1
20 | 3
20 | 3
20 | 3
15 | 2
15 | 2
It's easy to do this in the service layer, but considering the potential size of the dataset, is there a way to do it on the MySQL side?
The simplest method is to create a table with a single number column and fill it with the numbers from 1 up to the maximum value in the item_count column.
Let's say you have created that table as numbers_table (number_col).
Then you can use that numbers table as follows:
select single_amount, item_count
from your_table t
join numbers_table n on t.item_count >= n.number_col
MySQL 8+ supports recursive CTEs (and actually CTEs in general). But earlier versions do not. To solve this, you need a numbers table of some sort. MySQL doesn't have one built-in, but you can generate one.
Based on your description of the problem, the bundles table is big enough, so you can use that:
SELECT n.n, b.*
FROM bundles b JOIN
(SELECT (@rn := @rn + 1) as n
FROM bundles b CROSS JOIN
(SELECT @rn := 0) params
LIMIT 100
) n
ON n.n <= b.item_count;
The LIMIT 100 is just for performance. You obviously need at least as many rows as the maximum item_count.
Here is a SQL Fiddle.
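On a server that does support recursive CTEs (MySQL 8+, as mentioned above, not the asker's 5.7), the numbers table can be generated inline. The sketch below runs the idea against in-memory SQLite from Python with the question's three bundles, then takes the top n individual items. The function names are assumptions made for this example.

```python
import sqlite3


def build_bundles():
    """Load the question's three bundles."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bundles (bundle_id INTEGER, single_amount INTEGER, item_count INTEGER)"
    )
    conn.executemany("INSERT INTO bundles VALUES (?, ?, ?)", [(1, 100, 1), (2, 20, 3), (3, 15, 2)])
    return conn


def top_n_items(conn, n):
    """Repeat each bundle row item_count times, then keep the n most expensive items."""
    return conn.execute(
        """
        WITH RECURSIVE numbers(n) AS (
            SELECT 1
            UNION ALL
            -- stop once the largest item_count in the table has been reached
            SELECT n + 1 FROM numbers
            WHERE n < (SELECT MAX(item_count) FROM bundles)
        )
        SELECT b.single_amount, b.item_count
        FROM bundles AS b
        JOIN numbers ON numbers.n <= b.item_count
        ORDER BY b.single_amount DESC
        LIMIT ?
        """,
        (n,),
    ).fetchall()


if __name__ == "__main__":
    print(top_n_items(build_bundles(), 3))  # [(100, 1), (20, 3), (20, 3)]
```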

Hive Query that returns distinct value that each User has

I have a MySQL table:
User Value
A 1
A 12
A 3
B 4
B 3
B 1
C 1
C 1
C 8
D 34
D 1
E 1
F 1
G 56
G 1
H 1
H 3
C 3
F 3
E 3
G 3
I need to run a query that returns the 2nd distinct value that each user has.
That is, if any 2 values are accessed by every user, then, based on the number of occurrences, pick the 2nd value.
In the data above, 1 and 3 are accessed by each user. 1 occurs more often than 3, so the 2nd distinct value will be 3.
So I thought I would first get all distinct users.
create table temp AS Select distinct user from table;
Then I will have an outer query:
Select value from table where value in (...)
Programmatically, I could iterate through each user's values with something like a Map, but I just couldn't write that as a Hive query.
This will return the second most frequent value from your list that spans all users. There isn't one of these values in the table, which I expect is a typo in the data. In real data you will likely have multiple ties that you need to figure out how to handle.
Select value as second_distinct from
(select value, rank() over (order by occurrences desc) as rank
from
(SELECT value, unique_users, max(count_users) as count_users, count(value) as occurrences
from
(select value, size(collect_set(user) over (partition by value))
as count_users from my_table
) t
left outer join
(select count(distinct user) as unique_users from my_table
) t2 on (1=1)
where unique_users=count_users
group by value, unique_users
) t3
) t4
where rank = 2;
This works. It returns NULL because there is only one value that every user has accessed (value 1). Value 3 is not a solution because not every user has seen that value in your data. I expect you intended 3 to be returned, but again it doesn't span all the users (user D did not see value 3).
Not sure how @invoketheshell's answer was marked correct; it doesn't run and it needs 6 MR jobs. This will get you there in 4 and is less code.
Query:
select value
from (
select value, value_count, rank() over (order by value_count desc) rank
from (
select value, count(value) value_count
from (
select value, num_users, max(num_users) over () max_users
from (
select value
, size(collect_set(user) over (partition by value)) num_users
from db.table ) x ) y
where num_users = max_users
group by value ) z ) f
where rank = 2
Output:
3
EDIT: Let me clarify my solution as there seems to be some confusion. The OP's example says
"So as above 1 & 3 is being accessed by each User ... "
As my comment below the question suggests, in the example given, user D never accesses value 3. I assumed this was a typo, added that row to the dataset, and then added another 1 as well so that there are more 1's than 3's. So my code correctly returns 3, which was the desired output. If you run this script on the actual dataset it will also produce the correct output, which is nothing, because there isn't a "2nd distinct". The only time it could produce an incorrect value is if there were no single number accessed by all users, which illustrates the point I was trying to make to @invoketheshell: if there is no single number that every user has accessed, running a query with 6 map-reduce jobs is an absurd way to find that out. Since we are using Hive, I believe it is fair to assume that if this were a "real-world" problem, it would most likely be executed on at least 100s of TBs of data (probably more). In the interest of preserving time and resources, it would behoove an individual to at least check that one number had been accessed by all users before running a massive query whose analysis hinges on that assumption being true.
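The disagreement above is easy to check on the question's data. The sketch below is not either answer's Hive query: it swaps the collect_set and rank() approach for a plain GROUP BY ... HAVING with LIMIT/OFFSET, and runs on in-memory SQLite from Python. The column is named usr and the function name is made up for this example. Note that OFFSET breaks ties arbitrarily (here by value), whereas rank() would return every tied row.

```python
import sqlite3

# The rows exactly as listed in the question (user D has no row for value 3).
ROWS = [
    ("A", 1), ("A", 12), ("A", 3), ("B", 4), ("B", 3), ("B", 1),
    ("C", 1), ("C", 1), ("C", 8), ("D", 34), ("D", 1), ("E", 1),
    ("F", 1), ("G", 56), ("G", 1), ("H", 1), ("H", 3), ("C", 3),
    ("F", 3), ("E", 3), ("G", 3),
]


def second_common_value(rows):
    """Among values seen by every user, return the second most frequent, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (usr TEXT, value INTEGER)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?)", rows)
    row = conn.execute(
        """
        SELECT value
        FROM my_table
        GROUP BY value
        -- keep only values that every distinct user has at least once
        HAVING COUNT(DISTINCT usr) = (SELECT COUNT(DISTINCT usr) FROM my_table)
        ORDER BY COUNT(*) DESC, value
        LIMIT 1 OFFSET 1
        """
    ).fetchone()
    conn.close()
    return row[0] if row else None


if __name__ == "__main__":
    print(second_common_value(ROWS))               # None
    print(second_common_value(ROWS + [("D", 3)]))  # 3
```

On the data as posted only value 1 is shared by all eight users, so there is no second value; adding a (D, 3) row makes 3 the answer the asker expected.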

mysql update with a self referencing query

I have a table of surveys which contains (amongst others) the following columns
survey_id - unique id
user_id - the id of the person the survey relates to
created - datetime
ip_address - of the submission
ip_count - the number of duplicates
Due to the large record set, it's impractical to run this query on the fly, so I am trying to create an update statement which will periodically store a "cached" result in ip_count.
The purpose of ip_count is to show the number of duplicate ip_address survey submissions that have been received for the same user_id within a 12-month period (+/- 6 months of the created date).
Using the following dataset, this is the expected result.
survey_id user_id created ip_address ip_count #counted duplicates survey_id
1 1 01-Jan-12 123.132.123 1 # 2
2 1 01-Apr-12 123.132.123 2 # 1, 3
3 2 01-Jul-12 123.132.123 0 #
4 1 01-Aug-12 123.132.123 3 # 2, 6
6 1 01-Dec-12 123.132.123 1 # 4
This is the closest solution I have come up with so far, but the query fails to take the date restriction into account, and I am struggling to come up with an alternative method.
UPDATE surveys
JOIN(
SELECT ip_address, created, user_id, COUNT(*) AS total
FROM surveys
WHERE surveys.state IN (1, 3) # survey is marked as completed and confirmed
GROUP BY ip_address, user_id
) AS ipCount
ON (
ipCount.ip_address = surveys.ip_address
AND ipCount.user_id = surveys.user_id
AND ipCount.created BETWEEN (surveys.created - INTERVAL 6 MONTH) AND (surveys.created + INTERVAL 6 MONTH)
)
SET surveys.ip_count = ipCount.total - 1 # minus 1 as this query will match on its own id.
WHERE surveys.ip_address IS NOT NULL # ignore surveys where we have no ip_address
Thank you for your help in advance :)
A few (very) minor tweaks to what is shown above. Thank you again!
UPDATE surveys AS s
INNER JOIN (
SELECT x, count(*) c
FROM (
SELECT s1.id AS x, s2.id AS y
FROM surveys AS s1, surveys AS s2
WHERE s1.state IN (1, 3) # completed and verified
AND s1.id != s2.id # dont self join
AND s1.ip_address != "" AND s1.ip_address IS NOT NULL # not interested in blank entries
AND s1.ip_address = s2.ip_address
AND (s2.created BETWEEN (s1.created - INTERVAL 6 MONTH) AND (s1.created + INTERVAL 6 MONTH))
AND s1.user_id = s2.user_id # where completed for the same user
) AS ipCount
GROUP BY x
) n on s.id = n.x
SET s.ip_count = n.c
I don't have your table with me, so it's hard for me to write SQL that definitely works, but I can take a shot at this and hopefully help you.
First I would need to take the cartesian product of surveys against itself and filter out the rows I don't want
select s1.survey_id x, s2.survey_id y from surveys s1, surveys s2 where s1.survey_id != s2.survey_id and s1.ip_address = s2.ip_address and s2.created between (s1.created - interval 6 month) and (s1.created + interval 6 month)
The output of this should contain every pair of surveys that match (according to your rules) TWICE (once for each id in the 1st position and once for it to be in the 2nd position)
Then we can do a GROUP BY on the output of this to get a table that basically gives me the correct ip_count for each survey_id
(select x, count(*) c from (select s1.survey_id x, s2.survey_id y from surveys s1, surveys s2 where s1.survey_id != s2.survey_id and s1.ip_address = s2.ip_address and s2.created between (s1.created - interval 6 month) and (s1.created + interval 6 month)) as pairs group by x)
So now we have a table mapping each survey_id to its correct ip_count. To update the original table, we need to join that against this and copy the values over
So that should look something like
UPDATE surveys s INNER JOIN (ABOVE QUERY) n ON s.survey_id = n.x SET s.ip_count = n.c
There is some pseudo code in there, but I think the general idea should work
I have never had to update a table based on the output of another query myself before.. Tried to guess the right syntax for doing this from this question - How do I UPDATE from a SELECT in SQL Server?
Also, if I needed to do something like this for my own work, I wouldn't attempt to do it in a single query. It would be a pain to maintain and might have memory/performance issues. It would be best to have a script traverse the table row by row, updating a single row in a transaction before moving on to the next row. Much slower, but simpler to understand and possibly lighter on your database.
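The pair-then-count step described above can be sketched on the question's five rows. This runs on in-memory SQLite from Python, so the date arithmetic uses SQLite's date() modifiers where MySQL would use INTERVAL 6 MONTH, and the dates are rewritten as ISO strings. It also adds the same-user condition that the question asks for. The function names are assumptions made for this example.

```python
import sqlite3


def build_surveys():
    """Load the question's sample surveys (dates converted to ISO format)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE surveys (survey_id INTEGER, user_id INTEGER,"
        " created TEXT, ip_address TEXT)"
    )
    conn.executemany(
        "INSERT INTO surveys VALUES (?, ?, ?, ?)",
        [
            (1, 1, "2012-01-01", "123.132.123"),
            (2, 1, "2012-04-01", "123.132.123"),
            (3, 2, "2012-07-01", "123.132.123"),
            (4, 1, "2012-08-01", "123.132.123"),
            (6, 1, "2012-12-01", "123.132.123"),
        ],
    )
    return conn


def duplicate_counts(conn):
    """Map survey_id to the number of other surveys from the same user and IP within +/- 6 months."""
    rows = conn.execute(
        """
        SELECT s1.survey_id, COUNT(*)
        FROM surveys AS s1
        JOIN surveys AS s2
          ON s2.survey_id != s1.survey_id      -- do not pair a survey with itself
         AND s2.user_id = s1.user_id
         AND s2.ip_address = s1.ip_address
         AND s2.created BETWEEN date(s1.created, '-6 months')
                            AND date(s1.created, '+6 months')
        GROUP BY s1.survey_id
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(duplicate_counts(build_surveys()))
```

Surveys with no match (survey 3 here) do not appear in the result, so their ip_count would need to be reset to 0 separately. On this data survey 4 comes out as 2, which agrees with the two ids listed in the question's comment column rather than with the 3 shown in its ip_count column.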

MySQL, how to repeat same line x times

I have a query that outputs address order data:
SELECT ordernumber
, article_description
, article_size_description
, concat(NumberPerBox,' pieces') as contents
, NumberOrdered
FROM customerorder
WHERE customerorder.id = 1;
I would like the above line to be output NumberOrdered (e.g. 50,000) divided by NumberPerBox (e.g. 2,000) = 25 times.
Is there a SQL query that can do this? I'm not against using temporary tables to join against if that's what it takes.
I checked out the previous questions, however the nearest one:
is to be posible in mysql repeat the same result
Only gave answers that give a fixed number of rows, and I need it to be dynamic depending on the value of (NumberOrdered div NumberPerBox).
The result I want is:
Boxnr Ordernr as_description contents NumberOrdered
------+--------------+----------------+-----------+---------------
1 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
2 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
....
25 | CORDO1245 | Carrying bags | 2,000 pcs | 50,000
First, let me say that I am more familiar with SQL Server so my answer has a bit of a bias.
Second, I did not test my code sample and it should probably be used as a reference point to start from.
It would appear to me that this situation is a prime candidate for a numbers table. Simply put, it is a table (usually called "Numbers") that is nothing more than a single PK column of integers from 1 to n. Once you've used a Numbers table and are aware of how it works, you'll start finding many uses for it, such as querying for time intervals, string splitting, etc.
That said, here is my untested response to your question:
SELECT
IV.number as Boxnr
,customerorder.ordernumber
,customerorder.article_description
,customerorder.article_size_description
,concat(customerorder.NumberPerBox,' pieces') as contents
,customerorder.NumberOrdered
FROM
customerorder
INNER JOIN (
SELECT
Numbers.number
,customerorder.ordernumber
,customerorder.NumberPerBox
FROM
Numbers
INNER JOIN customerorder
ON Numbers.number BETWEEN 1 AND customerorder.NumberOrdered / customerorder.NumberPerBox
WHERE
customerorder.id = 1
) AS IV
ON customerorder.ordernumber = IV.ordernumber
As I said, most of my experience is in SQL Server. I reference http://www.sqlservercentral.com/articles/Advanced+Querying/2547/ (registration required). However, there appear to be quite a few resources available when I search for "SQL numbers table".
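A minimal sketch of the numbers-table join, assuming the numbers are generated with a recursive CTE rather than stored in a permanent Numbers table. It runs on in-memory SQLite from Python; the sample order row is built from the question's expected output, the size description is a made-up placeholder, and the "contents" string is left out because string concatenation syntax differs between SQLite and MySQL.

```python
import sqlite3


def build_orders():
    """One order: 50,000 pieces packed 2,000 per box."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customerorder (id INTEGER, ordernumber TEXT,"
        " article_description TEXT, article_size_description TEXT,"
        " NumberPerBox INTEGER, NumberOrdered INTEGER)"
    )
    conn.execute(
        "INSERT INTO customerorder VALUES"
        " (1, 'CORDO1245', 'Carrying bags', 'size n/a', 2000, 50000)"
    )
    return conn


def box_rows(conn, order_id):
    """Return one row per box: (Boxnr, ordernumber, description, NumberPerBox, NumberOrdered)."""
    return conn.execute(
        """
        WITH RECURSIVE numbers(number) AS (
            SELECT 1
            UNION ALL
            -- generate only as many numbers as the largest order needs
            SELECT number + 1 FROM numbers
            WHERE number < (SELECT MAX(NumberOrdered / NumberPerBox) FROM customerorder)
        )
        SELECT numbers.number AS Boxnr, c.ordernumber, c.article_description,
               c.NumberPerBox, c.NumberOrdered
        FROM customerorder AS c
        JOIN numbers ON numbers.number <= c.NumberOrdered / c.NumberPerBox
        WHERE c.id = ?
        ORDER BY numbers.number
        """,
        (order_id,),
    ).fetchall()


if __name__ == "__main__":
    rows = box_rows(build_orders(), 1)
    print(len(rows))  # 25
    print(rows[0])    # (1, 'CORDO1245', 'Carrying bags', 2000, 50000)
```

The division here is integer division because both columns are integers in SQLite; in MySQL the DIV operator mentioned in the question gives the same whole-box count.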

Obtain running frequency distribution from previous N rows of MySQL database

I have a MySQL database where one column contains status codes. The column is of type int and the values will only ever be 100,200,300,400. It looks like below; other columns removed for clarity.
id | status
----------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100
12 400
13 400
14 400
15 300
16 300
The id field is auto-generated and will always be sequential. I want to have a third column displaying a comma-separated string of the frequency distribution of the status codes of the previous 10 rows. It should look like this.
id | status | freq
-----------------------------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100 300,100,200,400 -- from rows 1-10
12 400 100,300,200,400 -- from rows 2-11
13 400 100,300,200,400 -- from rows 3-12
14 400 300,400,100,200 -- from rows 4-13
15 300 400,300,100,200 -- from rows 5-14
16 300 300,400,100 -- from rows 6-15
I want the most frequent code listed first. And where two status codes have the same frequency it doesn't matter to me which is listed first but I did list the smaller code before the larger in the example. Lastly, where a code doesn't appear at all in the previous ten rows, it shouldn't be listed in the freq column either.
And to be very clear the row number that the frequency string appears on does NOT take into account the status code of that row; it's only the previous rows.
So what have I done? I'm pretty green with SQL. I'm a programmer and I find this SQL language a tad odd to get used to. I managed the following self-join select statement.
select *, avg(b.status) freq
from sample a
join sample b
on (b.id < a.id) and (b.id > a.id - 11)
where a.id > 10
group by a.id;
Using the aggregate function avg, I can at least demonstrate the concept. The derived table b provides the correct rows to the avg function but I just can't figure out the multi-step process of counting and grouping rows from b to get a frequency distribution and then collapse the frequency rows into a single string value.
Also I've tried using standard stored functions and procedures in place of the built-in aggregate functions, but it seems the b derived table is out of scope or something. I can't seem to access it. And from what I understand writing a custom aggregate function is not possible for me as it seems to require developing in C, something I'm not trained for.
Here's sql to load up the sample.
create table sample (
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
status int
);
insert into sample(status) values(300),(100),(100),(200),(200),(300)
,(100),(400),(300),(300),(100),(400),(400),(400),(300),(300),(300)
,(100),(400),(100),(100),(200),(500),(300),(100),(400),(200),(100)
,(500),(300);
The sample has 30 rows of data to work with. I know it's a long question, but I just wanted to be as detailed as I could be. I've worked on this for a few days now and would really like to get it done.
Thanks for your help.
The only way I know of to do what you're asking is to use a BEFORE INSERT trigger. It has to be BEFORE INSERT because you want to update a value in the row being inserted, which can only be done in a BEFORE trigger. Unfortunately, that also means it won't have been assigned an ID yet, so hopefully it's safe to assume that at the time a new record is inserted, the last 10 records in the table are the ones you're interested in. Your trigger will need to get the values of the last 10 ID's and use the GROUP_CONCAT function to join them into a single string, ordered by the COUNT. I've been using SQL Server mostly and I don't have access to a MySQL server at the moment to test this, but hopefully my syntax will be close enough to at least get you moving in the right direction:
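DELIMITER //
-- Assumes sample already has a freq column, e.g. ALTER TABLE sample ADD COLUMN freq varchar(50);
create trigger sample_trigger BEFORE INSERT ON sample
FOR EACH ROW
BEGIN
DECLARE _freq varchar(50);
-- Count each status among the 10 most recent rows, then join the codes, most frequent first
SELECT GROUP_CONCAT(tbl.status ORDER BY tbl.Occurrences DESC, tbl.status) INTO _freq
FROM (SELECT last10.status, COUNT(*) AS Occurrences
FROM (SELECT status FROM sample ORDER BY id DESC LIMIT 10) AS last10
GROUP BY last10.status) AS tbl;
SET new.freq = _freq;
END//
DELIMITER ;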
SELECT id, GROUP_CONCAT(status ORDER BY freq desc) FROM
(SELECT a.id as id, b.status, COUNT(*) as freq
FROM
sample a
JOIN
sample b ON (b.id < a.id) AND (b.id > a.id - 11)
WHERE
a.id > 10
GROUP BY a.id, b.status) AS sub
GROUP BY id;
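To cross-check the expected freq column outside the database, here is a small Python sketch of the same rule: count the statuses in the previous 10 rows, list the most frequent first, and break ties by the smaller code as the question's example does. The function name and the tie-break choice are assumptions made for this illustration. The statuses are the first 16 values of the question's INSERT statement.

```python
from collections import Counter

# First 16 statuses from the question's INSERT statement (ids 1 to 16).
STATUSES = [300, 100, 100, 200, 200, 300, 100, 400,
            300, 300, 100, 400, 400, 400, 300, 300]


def running_freq(statuses, window=10):
    """For each row, list the status codes of the previous `window` rows, most frequent first."""
    result = []
    for i in range(len(statuses)):
        if i < window:
            result.append("")  # not enough history yet
            continue
        counts = Counter(statuses[i - window:i])  # excludes the current row
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
        result.append(",".join(str(s) for s in ordered))
    return result


if __name__ == "__main__":
    for row_id, freq in enumerate(running_freq(STATUSES), start=1):
        print(row_id, STATUSES[row_id - 1], freq)
```

For ids 11 to 16 this reproduces the freq strings shown in the question, which makes it a handy reference when testing the SQL version.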