Get Top n data from a set of grouped data - mysql

I'm on MySQL 5.7 and have a dataset in a table named bundles.
bundle_id single_amount item_count
1 100 1
2 20 3
3 15 2
SQL Fiddle: http://sqlfiddle.com/#!9/823c91/1/0
For example, bundle 1 means that a customer bought one item costing $100, and bundle 2 means that another customer bought 3 items costing $20 each.
I want to get the top n rows by individual item, not by bundle. The straightforward idea is to flatten the data as in the table below.
single_amount item_count
100 1
20 3
20 3
20 3
15 2
15 2
It's easy to do this in the service layer, but considering the potential size of the dataset, is there a way to do it on the MySQL side?

The simplest method is to create a table with a single number column and fill it with the values from 1 up to the maximum value found in the item_count column.
Let's say you have created that table as numbers_table (number_col).
Then you can use the numbers table as follows:
select single_amount, item_count
from your_table t
join numbers_table n on t.item_count >= n.number_col
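To illustrate the numbers-table join, here is a minimal sketch that runs the same kind of query against the sample data. It assumes an in-memory SQLite database (Python's standard sqlite3 module) as a stand-in for MySQL; the helper name top_items and the ORDER BY / LIMIT step are additions for illustration, not part of the answer above.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the join uses only standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bundles (bundle_id INTEGER, single_amount INTEGER, item_count INTEGER);
    INSERT INTO bundles VALUES (1, 100, 1), (2, 20, 3), (3, 15, 2);
    -- Numbers 1..3 cover the largest item_count in the sample data.
    CREATE TABLE numbers_table (number_col INTEGER);
    INSERT INTO numbers_table VALUES (1), (2), (3);
""")


def top_items(conn, n):
    """Return the n most expensive individual items (hypothetical helper)."""
    return conn.execute(
        """
        SELECT b.single_amount, b.item_count
        FROM bundles b
        JOIN numbers_table nt ON b.item_count >= nt.number_col
        ORDER BY b.single_amount DESC
        LIMIT ?
        """,
        (n,),
    ).fetchall()


flattened = top_items(conn, 100)
print(flattened)
```

Each bundle row is repeated once per number that does not exceed its item_count, which is what produces one row per individual item.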

MySQL 8+ supports recursive CTEs (and actually CTEs in general). But earlier versions do not. To solve this, you need a numbers table of some sort. MySQL doesn't have one built-in, but you can generate one.
Based on your description of the problem, the bundles table is big enough, so you can use that:
SELECT n.n, b.*
FROM bundles b JOIN
(SELECT (@rn := @rn + 1) as n
FROM bundles b CROSS JOIN
(SELECT @rn := 0) params
LIMIT 100
) n
ON n.n <= b.item_count;
The LIMIT 100 is just for performance. You obviously need at least as many rows as the maximum item_count.
Here is a SQL Fiddle.
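For readers on a version that does support recursive CTEs, the numbers can be generated inline instead. The sketch below is only an illustration: it runs on SQLite through Python's sqlite3 module, and the CTE name numbers and the max_count parameter are assumptions made for the example. Whether the statement needs adjusting for MySQL 8 has not been checked here.

```python
import sqlite3

# Assumption: SQLite stands in for a database with recursive CTE support.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bundles (bundle_id INTEGER, single_amount INTEGER, item_count INTEGER);
    INSERT INTO bundles VALUES (1, 100, 1), (2, 20, 3), (3, 15, 2);
""")

# Look up the largest item_count first so the recursion knows where to stop.
(max_count,) = conn.execute("SELECT MAX(item_count) FROM bundles").fetchone()

rows = conn.execute(
    """
    WITH RECURSIVE numbers(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM numbers WHERE n < :max_count
    )
    SELECT b.single_amount, b.item_count
    FROM bundles b
    JOIN numbers ON numbers.n <= b.item_count
    ORDER BY b.single_amount DESC
    """,
    {"max_count": max_count},
).fetchall()
print(rows)
```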

Related

Tree like data collation in SQL (Mysql)

I have two tables in my database
Table A with columns user_id, free_data, used_data
Table B with columns donor_id, receptor_id, share_data
Basically, a user (let's call him x) has some data in his account, represented by his entry in Table A and stored in the free_data column. He can donate data to any other user (let's call him y), which shows up as an entry in Table B. The same amount of data is deducted from user x's free_data column.
When the entry in Table B is created, an entry in Table A for user y is also created with free_data equal to share_data. User y can then give data to user z, and the process continues.
Each user keeps using their data, and used_data in Table A keeps adding up to indicate how much data each user has used.
This is like a tree structure: one entry holds all the data (the root node) and gives data to others, who in turn give data to other nodes.
Now I would like to write an SQL query such that, given a node x (the id of an entry in Table A), I can sum up the total data x has given and, for all beneficiaries at every level, collate their used_data and show it against x.
Basically, I want to collate
Overall data x has donated.
How much of the donated data from x has been used up.
While the implementation is more graph-like, I am more interested to know if we assume it to be a tree below node x & can come up with a single sql query to be able to get the data I need.
Example
Table A
user_id, free_data, used_data
1 50 10
2 30 20
3 20 20
Table B
donor_id, receptor_id, share_data
1 2 30
1 3 20
Total data donated by 1 - 30 + 20 = 50
Total donated data used - 20 + 20 = 40
This is just one level, where 1 donated to 2 & 3. 2 in turn could donate to 4, and all that data needs to be collated in a bubbled-up fashion to calculate the overall donated data usage.
Yes, it's possible using a nested set model. There's a book by Joe Celko that describes it, but if you want to get straight into it there's an article that talks about it. Both pieces of collated data that you need can be retrieved by a single select statement like this:
SELECT * FROM TableB where `left` > some_value1 and `right` < some_value2
In the above example to get all the child nodes of "Portable Electronics" the query will be:
SELECT * FROM Electronics WHERE `left` > 10 and `right` < 19
The article describes how the left and right columns should be initialised.
If I understand the problem correctly, the following should give you the desired results:
SELECT B.donor_id AS donor_id, SUM(A.used_data) AS total_used_data FROM A
INNER JOIN B ON A.user_id = B.receptor_id GROUP BY B.donor_id;
Hope this will solve your problem now.
Try below query(note that you will have to pass userid at 2 places):
SELECT SUM(share_data) as total_donated, sum(used_data) as total_used FROM tablea
LEFT JOIN tableB
ON tableA.user_id = tableB.donor_id
WHERE user_id IN (select receptor_id as id
from (select * from tableb
order by donor_id, receptor_id) u_sorted,
(select @pv := '1') initialisation
where find_in_set(donor_id, @pv) > 0
and @pv := concat(@pv, ',', receptor_id)) OR user_id = 1;
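To make the expected result concrete, here is a small sketch of the bubbled-up collation done outside SQL, assuming the two tables have been loaded into plain Python structures. The function name collate and the dictionary layout are hypothetical; total donated counts only x's direct donations, as in the question's example.

```python
from collections import defaultdict, deque

# Example rows from the question.
table_a = {
    1: {"free_data": 50, "used_data": 10},
    2: {"free_data": 30, "used_data": 20},
    3: {"free_data": 20, "used_data": 20},
}
table_b = [(1, 2, 30), (1, 3, 20)]  # (donor_id, receptor_id, share_data)


def collate(user_id, table_a, table_b):
    """Return (total donated by user_id, used_data of all beneficiaries below it)."""
    children = defaultdict(list)
    for donor, receptor, share in table_b:
        children[donor].append((receptor, share))

    total_donated = sum(share for _, share in children[user_id])

    # Breadth-first walk over every beneficiary at every level.
    total_used = 0
    seen = {user_id}
    queue = deque(receptor for receptor, _ in children[user_id])
    while queue:
        node = queue.popleft()
        if node in seen:  # guards against cycles in graph-like data
            continue
        seen.add(node)
        total_used += table_a[node]["used_data"]
        queue.extend(receptor for receptor, _ in children[node])
    return total_donated, total_used


print(collate(1, table_a, table_b))
```

The seen set is there because the question notes the real data is more graph-like than tree-like.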

Limit On Accumulated Column in MySQL

I'm trying to find an elegant way to write a query that only returns enough rows for a certain column to add up to at least n.
For example, let's say n is 50, and the table rows look like this:
id count
1 12
2 13
3 5
4 18
5 14
6 21
7 13
Then the query should return:
id count
1 12
2 13
3 5
4 18
5 14
Because the count column then adds up to at least n = 50 (62, to be exact).
It must return the results consecutively starting with the smallest id.
I've looked a bit into accumulators, like in this one: MySQL select "accumulated" column
But AFAIK, there is no way to have the LIMIT clause in an SQL query limit on an SUM instead of a row count.
I wish I could say something like this, but alas, this is not valid SQL:
SELECT *
FROM elements
LIMIT sum(count) > 50
Also, please keep in mind the goal here is to insert the result of this query into another table atomically, in an automated, performance-efficient fashion, so please no suggestions to use a spreadsheet or anything that's not SQL compatible.
Thanks
There are many ways to do this. One is by using Correlated Subquery
SELECT id,
count
FROM (SELECT *,
(SELECT IFNULL(Sum(count), 0)
FROM yourtable b
WHERE b.id < a.id) AS Run_tot
FROM yourtable a) ou
WHERE Run_tot < 50
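As a quick check of the running-total idea, the sketch below runs an equivalent correlated subquery on the question's sample rows. It assumes SQLite (via Python's sqlite3 module) in place of MySQL, uses COALESCE as a portable spelling of the null check, and assumes the table is called elements as in the question's pseudo-query.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; COALESCE works in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (id INTEGER PRIMARY KEY, count INTEGER);
    INSERT INTO elements VALUES (1, 12), (2, 13), (3, 5), (4, 18), (5, 14), (6, 21), (7, 13);
""")

threshold = 50
rows = conn.execute(
    """
    SELECT id, count
    FROM (SELECT a.id, a.count,
                 -- sum of all rows *before* this one; 0 for the first row
                 (SELECT COALESCE(SUM(b.count), 0)
                  FROM elements b
                  WHERE b.id < a.id) AS run_tot
          FROM elements a) ou
    WHERE run_tot < ?
    ORDER BY id
    """,
    (threshold,),
).fetchall()
print(rows)
```

A row is kept while the total of the rows before it is still below the threshold, so the row that pushes the sum past 50 is included and later rows are not.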

MySQL equally distributed random rows with WHERE clause

I have this table,
person_id int(10) pk
points int(6) index
other columns not very important
I have this random function which is very fast on a table with 10M rows:
SELECT person_id
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id >= r2.id
ORDER BY r1.person_id ASC
LIMIT 1
This is all great but now I wish to show only people with points > 0. Example table:
PERSON_ID POINTS
1 4
2 6
3 0
4 3
When I append AND points > 0 to the WHERE clause, person_id 3 can't be selected, so a gap is created: whenever the random value lands on person_id 3, person_id 4 is selected instead. This gives person 4 a bigger chance of being chosen. Does anyone have suggestions on how I can adjust the query so that it works with the WHERE clause and gives all rows the same chance of being selected?
Info table: The table is uniform, no gaps in person_id's. About 90% will have 0 points. I want to make the query for where points = 0 and points > 0.
Before someone will say, use rand(): this is not solution for tables with more than a few 100k rows.
Bonus question: will it be possible to select x random rows in 1 query, so I do not have to call this query a few times when i want more random rows?
Important note: performance is key, with 10M+ rows the query may not take much longer than the current query, which takes 0.0005 seconds, I prefer to stay under 0.05 second.
Last note: If you think the query will never be this fast with above requirements, but another solution is possible (like fetching 100 rows and showing x random which has more than 0 points), please tell :)
Really appreciate your help and all help is welcome :)
You could generate in-line gap-free id's for the records that you really want to work with, and generate then the random selector using the total number of records available.
Try with this (props to the chosen answer here for the row_number generator):
SELECT r1.*
FROM
(SELECT person_id,
@curRow := @curRow + 1 AS row_num
FROM persons as p,
(SELECT @curRow := 0) r0
WHERE points>0) r1
, (SELECT COUNT(1) * RAND() id
FROM persons
WHERE points>0) r2
-- compare against the gap-free row number, not person_id
WHERE r1.row_num>=r2.id
ORDER BY r1.row_num ASC
LIMIT 1;
You can mess with it in this sqlfiddle.
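The effect of gaps can be worked out exactly rather than by sampling. The sketch below is an illustration under a simplifying assumption: the query is modelled as "draw a uniform value in [0, upper) and take the smallest key that is greater than or equal to it". The function name selection_probabilities is hypothetical.

```python
from fractions import Fraction


def selection_probabilities(keys, upper):
    """Chance that each key is the smallest key >= a uniform draw from [0, upper)."""
    probabilities = {}
    previous = 0
    for key in sorted(keys):
        # A key wins for every draw between the previous key and itself.
        probabilities[key] = Fraction(key - previous, upper)
        previous = key
    return probabilities


# person_id values with points > 0 in the question's example table.
eligible_ids = [1, 2, 4]

# Selecting on person_id directly: draws go up to MAX(person_id) = 4.
by_person_id = selection_probabilities(eligible_ids, upper=4)

# Selecting on gap-free row numbers 1..3: draws go up to the row count.
row_numbers = range(1, len(eligible_ids) + 1)
by_row_number = selection_probabilities(row_numbers, upper=len(eligible_ids))

print(by_person_id)
print(by_row_number)
```

With the gap, person 4 absorbs the draws that would have gone to person 3; with dense row numbers every eligible row gets the same share.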

Reordering of column data in mysql

I have a table like so:
categoryID categoryName
----------------------------
1 A
2 B
3 C
Now I want the user to be able to order this data according to his will. I want to remember his preferred order for future. So I thought I'd add a column order to the table above and make it of type INT and AUTO_INCREMENT. So now I get a table like this:
categoryID categoryName order
-------------------------------------
1 A 1
2 B 2
3 C 3
4 D 4
My problem is - the user now decides, to bring categoryName with order 4 (D in example above) up to 2 (above B in example above) such that the table would now look like:
categoryID categoryName order
-------------------------------------
1 A 1
2 B 3
3 C 4
4 D 2
My question is - How should I go about assigning new values to the order column when a reordering happens. Is there a way to do this without updating all rows in the table?
One approach that comes to mind is to make the column a FLOAT and give an item an order of 1.5 if I want to place it between the rows with orders 1 and 2. In this case I keep losing precision as I reorder items.
EDIT:
Another is to update all rows between (m, n), where m and n are the source and destination orders respectively. But this would mean running (m-n) separate queries, wouldn't it?
Edit 2:
Assuming I take the FLOAT approach, I came up with this sql to compute the order value for an item that needs to be inserted after item with id = 2 (for example).
select ((
select `order` as nextHighestOrder
from `categories`
where `order` > (
select `order` as targetOrder
from `categories`
where `categoryID`=2)
limit 1) + (
select `order` as targetOrder
from `categories`
where `categoryID`=2)) / 2;
This gives me 3.5 which is what I wanted to achieve.
Is there a better way to write this? Notice that select order as targetOrder from categories where categoryID=2 is executed twice.
If the number of changes is rather small, you can generate a clumsy but rather efficient UPDATE statement if you know the ids of the involved items:
UPDATE categories
JOIN (
SELECT 2 as categoryID, 3 as new_order
UNION ALL
SELECT 3 as categoryID, 4 as new_order
UNION ALL
SELECT 4 as categoryID, 2 as new_order) orders
USING (categoryId)
SET `order` = new_order;
or (which I like less):
UPDATE categories
SET `order` = ELT (FIND_IN_SET (categoryID, '2,3,4'),
3, 4, 2)
WHERE categoryID in (2,3,4);
UPD:
Assuming that you know the current id of the category (or its name), its old position, and its new position, you can use the following query to move a category down the list (to move it up, change the BETWEEN condition and compute new_rank as rank + 1):
SET @id:=2, @cur_rank:=2, @new_rank:=4;
UPDATE t1
JOIN (
SELECT categoryID, (`rank` - 1) as new_rank
FROM t1
WHERE `rank` between @cur_rank + 1 AND @new_rank
UNION ALL
SELECT @id as categoryID, @new_rank as new_rank
) as r
USING (categoryID)
SET `rank` = new_rank;
The idea with FLOAT sounds reasonable, just don't show these numbers to a user :-)
Whenever the user moves an entry up or down, you can figure out the entries above and below it. Just take their order numbers and find the mean value; that is the new order for the entry that has been moved.
You could keep order as an integer and renumber all the items between a drag's source index and destination index, because they can't drag that far, especially with only 20-odd categories. Multi-item drags make this more complicated, however.
Float is easier, but since each move takes the midpoint you could very quickly run out of precision. I would write a test to check that it doesn't eventually stop working if you keep moving the 3rd item to the 2nd position over and over.
Example:
1,2,3
Move 3rd to 2nd
1,1.5,2
Move 3rd to 2nd
1,1.25,1.5
Move 3rd to 2nd
1,1.125,1.25
Do that in an excel spread sheet and you'll find the number becomes too small for floats to deal with in about 30 iterations.
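The kind of test suggested above could look like the following sketch. It repeats the "move 3rd to 2nd" step until the midpoint can no longer be told apart from a neighbour. Python floats are 64-bit doubles; the to_float32 helper (a hypothetical name) rounds through the struct module to approximate a 32-bit column such as MySQL's FLOAT. The exact limit seen in a spreadsheet or a real table may differ from what this prints.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit float (illustrative helper)."""
    return struct.unpack("f", struct.pack("f", value))[0]


def moves_until_collision(round_fn=float):
    """Count how many midpoint insertions between 1 and 2 succeed before a clash."""
    low, high = round_fn(1.0), round_fn(2.0)
    moves = 0
    while True:
        mid = round_fn((low + high) / 2)
        if mid == low or mid == high:
            return moves  # the new order value collides with a neighbour
        high = mid  # the moved item becomes the new upper neighbour
        moves += 1


double_moves = moves_until_collision()
single_moves = moves_until_collision(to_float32)
print(double_moves, single_moves)
```

Either way the number of safe moves is bounded by the mantissa width, so a periodic renumbering step would still be needed with the midpoint approach.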
OK, here's the same thing that @newtover suggests, but these 2 simple queries can be understood much more easily by any other developer, even an inexperienced one.
Let's say we have a table t1:
id name position
-------------------------------------
1 A 1
2 B 2
3 C 3
4 D 4
5 -E- 5
6 F 6
Let's move item 'E' with id=5 to 2nd position:
1) Increase positions for all items between the old position of item 'E' and the desired position of 'E' (positions 2, 3, 4)
UPDATE t1 SET position=position+1 WHERE position BETWEEN 2 AND 4
2) Now there is no item at position 2, so 'E' can take its place
UPDATE t1 SET position=2 WHERE id=5
Results, ordered by 'position'
id name position
-------------------------------------
1 A 1
5 -E- 2
2 B 3
3 C 4
4 D 5
6 F 6
Just 2 simple queries, no subqueries.
Restriction: column 'position' cannot be UNIQUE. But perhaps with some modifications it should work as well.
Haven't tested this on large datasets.
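The two-query approach generalises to moves in either direction. The sketch below is one way to wrap it, assuming SQLite (Python's sqlite3 module) in place of MySQL; move_item and ordered_names are hypothetical helper names, and both updates run in one transaction so readers never see a half-finished reorder.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the UPDATE statements are standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT, position INTEGER);
    INSERT INTO t1 VALUES (1, 'A', 1), (2, 'B', 2), (3, 'C', 3),
                          (4, 'D', 4), (5, 'E', 5), (6, 'F', 6);
""")


def move_item(conn, item_id, new_pos):
    """Move one row to new_pos, shifting the rows in between by one."""
    (old_pos,) = conn.execute(
        "SELECT position FROM t1 WHERE id = ?", (item_id,)
    ).fetchone()
    with conn:  # commit both updates together, or roll both back
        if new_pos < old_pos:
            # Moving up: rows from new_pos to old_pos - 1 slide down by one.
            conn.execute(
                "UPDATE t1 SET position = position + 1 WHERE position BETWEEN ? AND ?",
                (new_pos, old_pos - 1),
            )
        elif new_pos > old_pos:
            # Moving down: rows from old_pos + 1 to new_pos slide up by one.
            conn.execute(
                "UPDATE t1 SET position = position - 1 WHERE position BETWEEN ? AND ?",
                (old_pos + 1, new_pos),
            )
        conn.execute("UPDATE t1 SET position = ? WHERE id = ?", (new_pos, item_id))


def ordered_names(conn):
    return [name for (name,) in conn.execute("SELECT name FROM t1 ORDER BY position")]


move_item(conn, 5, 2)  # the example above: move 'E' to the 2nd position
print(ordered_names(conn))
```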

Obtain running frequency distribution from previous N rows of MySQL database

I have a MySQL database where one column contains status codes. The column is of type int and the values will only ever be 100,200,300,400. It looks like below; other columns removed for clarity.
id | status
----------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100
12 400
13 400
14 400
15 300
16 300
The id field is auto-generated and will always be sequential. I want to have a third column displaying a comma-separated string of the frequency distribution of the status codes of the previous 10 rows. It should look like this.
id | status | freq
-----------------------------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100 300,100,200,400 -- from rows 1-10
12 400 100,300,200,400 -- from rows 2-11
13 400 100,300,200,400 -- from rows 3-12
14 400 300,400,100,200 -- from rows 4-13
15 300 400,300,100,200 -- from rows 5-14
16 300 300,400,100 -- from rows 6-15
I want the most frequent code listed first. And where two status codes have the same frequency it doesn't matter to me which is listed first but I did list the smaller code before the larger in the example. Lastly, where a code doesn't appear at all in the previous ten rows, it shouldn't be listed in the freq column either.
And to be very clear the row number that the frequency string appears on does NOT take into account the status code of that row; it's only the previous rows.
So what have I done? I'm pretty green with SQL. I'm a programmer and I find this SQL language a tad odd to get used to. I managed the following self-join select statement.
select *, avg(b.status) freq
from sample a
join sample b
on (b.id < a.id) and (b.id > a.id - 11)
where a.id > 10
group by a.id;
Using the aggregate function avg, I can at least demonstrate the concept. The derived table b provides the correct rows to the avg function but I just can't figure out the multi-step process of counting and grouping rows from b to get a frequency distribution and then collapse the frequency rows into a single string value.
Also I've tried using standard stored functions and procedures in place of the built-in aggregate functions, but it seems the b derived table is out of scope or something. I can't seem to access it. And from what I understand writing a custom aggregate function is not possible for me as it seems to require developing in C, something I'm not trained for.
Here's sql to load up the sample.
create table sample (
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
status int
);
insert into sample(status) values(300),(100),(100),(200),(200),(300)
,(100),(400),(300),(300),(100),(400),(400),(400),(300),(300),(300)
,(100),(400),(100),(100),(200),(500),(300),(100),(400),(200),(100)
,(500),(300);
The sample has 30 rows of data to work with. I know it's a long question, but I just wanted to be as detailed as I could be. I've worked on this for a few days now and would really like to get it done.
Thanks for your help.
The only way I know of to do what you're asking is to use a BEFORE INSERT trigger. It has to be BEFORE INSERT because you want to update a value in the row being inserted, which can only be done in a BEFORE trigger. Unfortunately, that also means it won't have been assigned an ID yet, so hopefully it's safe to assume that at the time a new record is inserted, the last 10 records in the table are the ones you're interested in. Your trigger will need to get the values of the last 10 ID's and use the GROUP_CONCAT function to join them into a single string, ordered by the COUNT. I've been using SQL Server mostly and I don't have access to a MySQL server at the moment to test this, but hopefully my syntax will be close enough to at least get you moving in the right direction:
DELIMITER //
CREATE TRIGGER sample_trigger BEFORE INSERT ON sample
FOR EACH ROW
BEGIN
DECLARE _freq varchar(50);
-- Count each status among the 10 most recent rows, most frequent first
SELECT GROUP_CONCAT(tbl.status ORDER BY tbl.Occurrences DESC, tbl.status) INTO _freq
FROM (SELECT last10.status, COUNT(*) AS Occurrences
FROM (SELECT status FROM sample ORDER BY id DESC LIMIT 10) AS last10
GROUP BY last10.status) AS tbl;
SET new.freq = _freq;
END//
DELIMITER ;
SELECT id, GROUP_CONCAT(status ORDER BY freq desc) FROM
(SELECT a.id as id, b.status, COUNT(*) as freq
FROM
sample a
JOIN
sample b ON (b.id < a.id) AND (b.id > a.id - 11)
WHERE
a.id > 10
GROUP BY a.id, b.status) AS sub
GROUP BY id;
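For checking any SQL solution against the expected output, the same calculation can be sketched in a few lines outside the database. This assumes the statuses have been read into a list in id order; the function name previous_window_freq is hypothetical, and ties are broken by listing the smaller code first, as in the question's example.

```python
from collections import Counter

# First 16 statuses from the question's INSERT statement, in id order.
statuses = [300, 100, 100, 200, 200, 300, 100, 400,
            300, 300, 100, 400, 400, 400, 300, 300]


def previous_window_freq(statuses, window=10):
    """Map each id (1-based) to the codes of its previous `window` rows, most frequent first."""
    result = {}
    for index in range(window, len(statuses)):
        # The slice excludes the current row: only the rows before it count.
        counts = Counter(statuses[index - window:index])
        ranked = sorted(counts, key=lambda code: (-counts[code], code))
        result[index + 1] = ",".join(str(code) for code in ranked)
    return result


for row_id, freq in previous_window_freq(statuses).items():
    print(row_id, freq)
```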