MySQL complex nth row selection

I have 2 tables:
Types
+----+-------+
| id | name  |
+----+-------+
| 1  | name1 |
| 2  | name2 |
| 3  | name3 |
| 4  | name4 |
| 5  | name5 |
| 6  | name6 |
| 7  | name7 |
| .. | ..    |
+----+-------+

Data
+----+------+
| id | type |
+----+------+
| 1  | 1    |
| 2  | 5    |
| 3  | 7    |
| 4  | 4    |
| 5  | 2    |
| 6  | 6    |
| 7  | 3    |
| 8  | 5    |
| 9  | 5    |
| 10 | 4    |
| 11 | 1    |
| 12 | 2    |
| 13 | 6    |
| 14 | 5    |
| 15 | 2    |
| .. | ..   |
+----+------+
The Data table is very large; it contains millions of rows. I need to select 1000 rows, but the result has to span the whole table, so I want every nth row. I've done this using the answer from How to select every nth row in mySQL starting at n, but I need to add some more logic to it: I need a select query that takes every nth row of each of the types. I guess this sounds complicated, so I'll try to describe what I would like to achieve:
Let's say there are 7 types and the Data table has 7.5M rows: 0.5M rows for each of types 1, 2, 3 and 1.5M rows for each of types 4, 5, 6, 7 (just to be clear, the counts may not be equal across types).
I need 1000 records containing equal amounts of each type. With 7 types, each type can occur FLOOR(1000 / 7) = 142 times in the result set, so I need to select 142 rows per type from the Data table;
For types 1, 2, 3, which contain 0.5M rows each, that is ROUND(0.5M / 142), i.e. every 3521st row;
For types 4, 5, 6, 7, which contain 1.5M rows each, that is ROUND(1.5M / 142), i.e. every 10563rd row;
So result would look something like this:
Result
+-------+------+
| id | type |
+-------+------+
| 1 | 1 |
| 3522 | 1 |
| 7043 | 1 |
| .. | .. |
| .. | 2 |
| .. | 2 |
| .. | .. |
| .. | 3 |
| .. | 3 |
| .. | .. |
| .. | 4 |
| .. | 4 |
| .. | .. |
| .. | 5 |
| .. | 5 |
| .. | .. |
| .. | 6 |
| .. | 6 |
| .. | .. |
| .. | 7 |
| .. | 7 |
| .. | .. |
+-------+------+
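The per-type arithmetic above can be computed in a single pass. A minimal sketch (assuming the Data table above and the 142-rows-per-type target):

SELECT type,
       COUNT(*) AS cnt,
       ROUND(COUNT(*) / 142) AS nth
FROM data
GROUP BY type;

For the example distribution this returns nth = 3521 for types 1-3 and nth = 10563 for types 4-7.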
I could do this simply in any programming language with multiple queries: fetch each type's count from the Data table, do the maths, then select a single type at a time.
But I would like to do this purely in MySQL, using as few queries as possible.
EDIT
I'll try to explain in more detail what I want to achieve, with a real example.
I have a table with 1437823 rows. The table schema looks like this:
+---------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| type | int(11) | NO | | NULL | |
| counter | int(11) | NO | | NULL | |
| time | datetime | NO | | NULL | |
+---------+----------+------+-----+---------+----------------+
The per-type statistics for that table are:
+------+-----------+
| Type | Row Count |
+------+-----------+
| 1 | 135160 |
| 2 | 291416 |
| 3 | 149863 |
| 4 | 296293 |
| 5 | 273459 |
| 6 | 275929 |
| 7 | 15703 |
+------+-----------+
(P.S. Types count can change in time.)
Let's say I need to select sample data from a time interval. In the first version of the question I omitted time because I thought it was insignificant, but now I think it might matter when ordering, to improve performance.
So anyway, I need to select a sample of approximately 1000 rows with an equal chunk of data for each type, so that the statistics of the end result look like this:
I am selecting 1000 rows across 7 types, so ROUND(1000 / 7) = 143 rows per type;
+------+-----------+
| Type | Row Count |
+------+-----------+
| 1 | 143 |
| 2 | 143 |
| 3 | 143 |
| 4 | 143 |
| 5 | 143 |
| 6 | 143 |
| 7 | 143 |
+------+-----------+
So now I need to select 143 rows for each type, at equal gaps within the time interval. For a single type it would look something like this:
SET @start_date := '2014-04-06 22:20:21';
SET @end_date := '2015-02-20 16:20:58';
SET @nth := ROUND(
    (SELECT COUNT(*) FROM data WHERE type = 1 AND time BETWEEN @start_date AND @end_date)
    / ROUND(1000 / (SELECT COUNT(*) FROM types))
);

SELECT r.*
FROM (SELECT * FROM data WHERE type = 1 AND time BETWEEN @start_date AND @end_date) r
CROSS JOIN (SELECT @i := 0) s
HAVING (@i := @i + 1) MOD @nth = 1
Statistics:
+------+-----------+
| Type | Row Count |
+------+-----------+
| 1 | 144 |
+------+-----------+
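To check the arithmetic against the example numbers (assuming all 135160 type-1 rows fall inside the interval): @nth = ROUND(135160 / ROUND(1000 / 7)) = ROUND(135160 / 143) = 945, and the rows numbered 1, 946, 1891, ... amount to FLOOR(135159 / 945) + 1 = 144 rows, which matches the statistics above.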
This query gives me the needed results with tolerable performance, but I would need one query per type, which would hurt performance and would require concatenating the results into a single data set afterwards, since that's what I need for further processing. So I would like to do it in a single query, or at least get a single result set.
P.S. I can tolerate row count deviation in result set as long as type chunks are equal.

This should do what you want (tested on a table with 100 rows with TYPE=1, 200 rows with TYPE=2, 300 rows with TYPE=3, 400 rows with TYPE=4; with the value 10 in _c / 10, I get 40 rows, 10 of each type). Please check the performance, since I'm obviously using a smaller sample table than what you really have.
select * from
(select
    @n := @n + 1 _n,
    _c,
    data.*
 from
    (select
        type _t,
        count(*) _c
     from data
     group by type) _1
 inner join data on (_t = data.type)
 inner join (select @n := 0) _2
 order by data.type) _3
where mod(_n, floor(_c / 10)) = 0
order by type, id;
Although this gets the same number from each group, it isn't guaranteed to get the exact same number from each group, since there are obviously rounding inaccuracies introduced by the floor(_c / 10).
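To target the roughly 143 rows per type from the question instead of the hard-coded 10, the divisor can be derived from the types table first (a sketch, assuming a types table with one row per type as in the question):

SET @per_type := (SELECT ROUND(1000 / COUNT(*)) FROM types);

select * from
(select
    @n := @n + 1 _n,
    _c,
    data.*
 from
    (select type _t, count(*) _c from data group by type) _1
 inner join data on (_t = data.type)
 inner join (select @n := 0) _2
 order by data.type) _3
where mod(_n, floor(_c / @per_type)) = 0
order by type, id;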

What you want is a stratified sample. A good way to get a stratified sample is to order the rows by the type and assign a sequential number -- the numbering does not have to start over for each type.
You can then get 1000 rows by taking each nth value:
select d.*
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) vars
      order by type
     ) d
where mod(rn, floor(@rn / 1000)) = 1;
Note: The final comparison is getting 1 out of n rows to approximate 1000. It might be off by one or two depending on the number of values.
EDIT:
Oops, the above does a stratified sample that matches the original distribution of the types in the data. To get equal counts for each group, enumerate them randomly and choose the first "n" for each group:
select d.*
from (select d.*,
             (@rn := if(@t = type, @rn + 1,
                        if(@t := type, 1, 1)
                       )
             ) as rn
      from data d cross join
           (select @rn := 0, @t := -1) vars
      order by type, rand()
     ) d cross join
     (select count(*) as numtypes from types) as t
where rn <= 1000 / numtypes;
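On MySQL 8.0+ the enumeration no longer needs user variables; a window-function sketch of the same equal-counts idea (assuming the data and types tables from the question):

SELECT id, type
FROM (
    SELECT d.id, d.type,
           ROW_NUMBER() OVER (PARTITION BY d.type ORDER BY RAND()) AS rn
    FROM data d
) ranked
WHERE rn <= (SELECT ROUND(1000 / COUNT(*)) FROM types);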

Related

Select the first rows of a table (that has a column with int values) whose sum of values is less than the input

My input is 5 in this case, and I want to select the first two rows, delete the first row, and update the second one, putting the value 7 in place of 10.
I tried this query, but it's not enough:
SELECT SUM(`Qty in acquisto`) AS total,`Prezzo in acquisto`
FROM `book`
GROUP BY `Qty in acquisto`
HAVING COUNT(*) >5
You could use variables to get the rows of interest, together with the information you need to update the records:
SELECT *
FROM (
    SELECT `Qty in acquisto`,
           `Prezzo in acquisto`,
           @take := least(`Qty in acquisto`, @needed) as taken,
           `Qty in acquisto` - @take as adjusted_acquisto,
           @needed := @needed - @take as still_needed
    FROM book,
         (select @needed := 5) as init
    ORDER BY `Prezzo in acquisto` DESC) base
WHERE taken + still_needed > 0
The output for the sample data is:
| Qty in acquisto | Prezzo in acquisto | taken | adjusted_acquisto | still_needed |
|-----------------|--------------------|-------|-------------------|--------------|
| 2 | 1000 | 2 | 0 | 3 |
| 10 | 960 | 3 | 7 | 0 |
See SQL fiddle
In the innermost query, with alias init, you pass the number of books you need (5 in the example).
The column adjusted_acquisto then holds the value you need to perform the deletes and updates:
If that value is 0, delete the corresponding record.
If that value is not 0, update the Qty with that value.
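Applied to the sample output above, the follow-up statements would look like this (a sketch; it assumes `Prezzo in acquisto` uniquely identifies the two rows):

DELETE FROM book WHERE `Prezzo in acquisto` = 1000;
UPDATE book SET `Qty in acquisto` = 7 WHERE `Prezzo in acquisto` = 960;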
E.g.:
SELECT * FROM my_table;
+------+--------+
| id | amount |
+------+--------+
| 800 | 8 |
| 900 | 3 |
| 950 | 4 |
| 960 | 10 |
| 1000 | 2 |
+------+--------+
SELECT n.id
     , GREATEST(amount - @x, 0) new_amount
     , @x := GREATEST(@x - amount, 0) x
FROM my_table n
   , (SELECT @x := 5) vars
ORDER
   BY id DESC;
+------+--------+------------+------+
| id | amount | new_amount | x |
+------+--------+------------+------+
| 1000 | 2 | 0 | 3 |
| 960 | 10 | 7 | 0 |
| 950 | 4 | 4 | 0 |
| 900 | 3 | 3 | 0 |
| 800 | 8 | 8 | 0 |
+------+--------+------------+------+

Distinct order-number sequence for every customer

I have a table of orders. Each customer (identified by the email field) has his own orders. I need to give each customer a separate sequence of order numbers. Here is an example:
+-----------------+--------+
| email           | number |
+-----------------+--------+
| test@com.com    |      1 |
| example@com.com |      1 |
| test@com.com    |      2 |
| test@com.com    |      3 |
| client@aaa.com  |      1 |
| example@com.com |      2 |
+-----------------+--------+
Is it possible to do that in a simple way with MySQL?
If you want to update data in this table after an insert, first of all you need a primary key; a simple auto-increment column does the job.
After that you can try various scripts to fill the number column, but as you can see from the other answers, they are not such a "simple way".
I suggest assigning the order number in the insert statement, obtaining it with this simpler query:
select coalesce(max(`number`), 0) + 1
from orders
where email = 'test1@test.com'
If you want to do everything in a single insert (better for performance and to avoid concurrency problems):
insert into orders (email, `number`, other_field)
select email, coalesce(max(`number`), 0) + 1 as number, 'note...' as other_field
from orders where email = 'test1@test.com';
To be more confident that the same customer is never assigned two orders with the same number, I strongly suggest adding a unique constraint on the columns (email, number).
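A sketch of that constraint (the constraint name is arbitrary):

ALTER TABLE orders ADD CONSTRAINT uq_orders_email_number UNIQUE (email, `number`);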
Create a column order_number, then:
SELECT @i := 1000;
UPDATE yourTable SET order_number = (@i := @i + 1);
This keeps incrementing the value in the order_number column, starting right after 1000. You can change the starting value, or you can even use the primary key as the order number, since it is unique all the time.
I think you need one more column for this type of output.
Example
+------+------+
| i | j |
+------+------+
| 1 | 11 |
| 1 | 12 |
| 1 | 13 |
| 2 | 21 |
| 2 | 22 |
| 2 | 23 |
| 3 | 31 |
| 3 | 32 |
| 3 | 33 |
| 4 | 14 |
+------+------+
You can get this result:
+------+------+------------+
| i | j | row_number |
+------+------+------------+
| 1 | 11 | 1 |
| 1 | 12 | 2 |
| 1 | 13 | 3 |
| 2 | 21 | 1 |
| 2 | 22 | 2 |
| 2 | 23 | 3 |
| 3 | 31 | 1 |
| 3 | 32 | 2 |
| 3 | 33 | 3 |
| 4 | 14 | 1 |
+------+------+------------+
By running this query, which doesn't need any variable defined:
SELECT a.i, a.j, count(*) as row_number FROM test a
JOIN test b ON a.i = b.i AND a.j >= b.j
GROUP BY a.i, a.j
Hope that helps!
You can add the number in a SELECT statement without adding any columns to the orders table.
Try this:
SELECT email,
       (CASE email
            WHEN @email
            THEN @rownumber := @rownumber + 1
            ELSE @rownumber := if(@email := email, 1, 1)
        END) as number
FROM orders
JOIN (SELECT @rownumber := 0, @email := '') AS t
ORDER BY email
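On MySQL 8.0+ a window function avoids the variables entirely (a sketch; it assumes the orders table has an auto-increment id primary key to order by, as suggested above):

SELECT email,
       ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS number
FROM orders;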

MySQL - how do I select no more than x rows max with the same field value y?

This question is a bit tricky to formulate, so it has probably been asked before.
I am selecting rows from a table of interrelating data. I only want a maximum of n rows which share the same value x of some field/column to show up in my result set. There is a global limit: in essence I always want the query to return the same number of rows, with no more than n rows sharing value x. How do I do this?
Here's an example of the data (the dots are supposed to indicate that this table is large, let's say 20000 rows of data):
some_table
+----+----------+-------------+------------+
| id | some_id | some_column | another_id |
+----+----------+-------------+------------+
| 1 | 10 | value | 8 |
| 2 | 10 | value | 5 |
| 3 | 10 | value | 2 |
| 4 | 20 | value | 3 |
| 5 | 30 | value | 9 |
| 6 | 30 | value | 1 |
| 7 | 30 | value | 4 |
| 8 | 30 | value | 6 |
| 9 | 30 | value | 7 |
| 10 | 40 | value | 10 |
| .. | ... | ... | ... |
| .. | ... | ... | ... |
| .. | ... | ... | ... |
| .. | ... | ... | ... |
+----+----------+-------------+------------+
Now here's my select:
select * from some_table where some_column="value" order by another_id limit 6
But instead of returning the rows with another_id = 1 through 6, I want to get no more than 2 rows with the same value of some_id. In other words, I'd like to get:
result set
+----+----------+-------------+------------+
| id | some_id | some_column | another_id |
+----+----------+-------------+------------+
| 6  | 30       | value       | 1          |
| 3  | 10       | value       | 2          |
| 4  | 20       | value       | 3          |
| 7  | 30       | value       | 4          |
| 2  | 10       | value       | 5          |
| 10 | 40       | value       | 10         |
+----+----------+-------------+------------+
Note that the results are ordered by another_id, but there are no more than 2 results with the same value of some_id.
How can I best (meaning preferably in one query, and reasonably fast) get there? Thanks!
select id, some_id, some_column, another_id from (
    select
        t.*,
        @rn := if(@prev = some_id, @rn + 1, 1) as rownumber,
        @prev := some_id
    from some_table t
       , (select @prev := null, @rn := 0) var_init
    where some_column = "value"
    order by some_id, id
) sq where rownumber <= 2
order by another_id;
See it working live in an sqlfiddle.
First we order by some_id, id in the subquery to do the right calculations. Then we order by another_id in the outer query to get the correct final ordering.
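On MySQL 8.0+ the same capping can be expressed with a window function (a sketch, assuming the some_table from the question):

SELECT id, some_id, some_column, another_id
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY some_id ORDER BY another_id) AS rn
    FROM some_table t
    WHERE some_column = 'value'
) capped
WHERE rn <= 2
ORDER BY another_id
LIMIT 6;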

CSV formatted GROUP_CONCAT in MySQL

Let's say I have a Table A that I want to transform into Table B.
The values in Table B should always be CSV-formatted text with the same number of fields.
First, I need to know the largest number of values that any given category holds (in this case 3 values, in categories 1, 2 and 4);
Secondly, I need to use that number to "add" empty fields (",") to the end of the GROUP_CONCAT when a category has "missing" values.
I need this to have a "consistent" CSV in each cell. The application I'm using to process this data doesn't cope well with CSVs whose rows have different numbers of columns...
Table A
+----+----------+-------+
| id | category | value |
+----+----------+-------+
| 1 | 1 | a |
| 2 | 1 | b |
| 3 | 1 | c |
| 4 | 2 | d |
| 5 | 2 | e |
| 6 | 2 | f |
| 7 | 3 | g |
| 8 | 3 | h |
| 9 | 4 | i |
| 10 | 4 | j |
| 11 | 4 | k |
| 12 | 5 | l |
+----+----------+-------+
Table B
+--------------+---------------------+
| id(category) | value(group_concat) |
+--------------+---------------------+
| 1 | a,b,c |
| 2 | d,e,f |
| 3 | g,h, |
| 4 | i,j,k |
| 5 | l,, |
+--------------+---------------------+
EDITED (SQLFiddle):
http://sqlfiddle.com/#!2/825f8
First, to get the largest number of values that a given category holds:
select count(category) from tableA group by category order by count(category) desc limit 1;
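Equivalently, and arguably clearer (a small variant on the same idea, using the tablea schema from the fiddle):

select max(cnt)
from (select count(*) as cnt from tablea group by category) counts;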
Second, to add empty fields (",") to the end of the GROUP_CONCAT when a category has "missing" values, I created a function called unify_length.
This is the function:
delimiter $$
CREATE FUNCTION `unify_length`(csv_list CHAR(255), length INT) RETURNS char(255)
    DETERMINISTIC
BEGIN
    /* count the occurrences of ',' in the string and pad until there are length-1 of them */
    WHILE ((SELECT LENGTH(csv_list) - LENGTH(REPLACE(csv_list, ',', ''))) < length - 1) DO
        SET csv_list = CONCAT(csv_list, ',');
    END WHILE;
    RETURN csv_list;
END$$
And this is the function call, where length is the value returned by the first query:
select category, unify_length(GROUP_CONCAT(value), length) from tablea group by category;
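For what it's worth, the padding can also be done in a single query without a stored function, by comparing each group's count to the maximum (a sketch, assuming the tablea schema above; REPEAT(',', 0) yields an empty string):

SELECT a.category,
       CONCAT(GROUP_CONCAT(a.value), REPEAT(',', m.maxcnt - COUNT(*))) AS value
FROM tablea a
CROSS JOIN (SELECT MAX(cnt) AS maxcnt
            FROM (SELECT COUNT(*) AS cnt FROM tablea GROUP BY category) c) m
GROUP BY a.category, m.maxcnt;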

Join with positions

I have got tables baskets, fruits and basket_fruits (join-table: basket_id-fruit_id).
How can I return the position of each fruit in a basket, so that I get something like this?
+-----------+----------+----------------+
| basket_id | fruit_id | fruit_position |
+-----------+----------+----------------+
| 1         | 2        | 1              |
| 1         | 5        | 2              |
+-----------+----------+----------------+
Fruit position is just the row number in the returned joined table (it is not a column).
Schema:
baskets: id, title
fruits: id, title
basket_fruits: id, basket_id, fruit_id
MySQL does not support ranking functions, so you'll have to use subqueries:
SELECT basket_id, fruit_id,
       (
           SELECT COUNT(*)
           FROM basket_fruits bfi
           WHERE bfi.basket_id = bf.basket_id
             AND bfi.fruit_id <= bf.fruit_id
       ) AS fruit_position
FROM basket_fruits bf
WHERE basket_id = 1
or use session variables (faster but relies on implementation details which are not documented and may break in future releases):
SET @rn = 0;

SELECT basket_id, fruit_id, @rn := @rn + 1 AS fruit_position
FROM basket_fruits bf
WHERE basket_id = 1
ORDER BY fruit_id
I do not see any column in the basket_fruits table that I would consider weightable. If you simply want to attach some numbers to the data in that table, you could try this (it allows each basket to have its own positions counting from 1):
SET @current_group = NULL;
SET @current_count = NULL;

SELECT
    id, basket_id, fruit_id,
    CASE
        WHEN @current_group = basket_id THEN @current_count := @current_count + 1
        WHEN @current_group := basket_id THEN @current_count := 1
    END AS fruit_position
FROM basket_fruits
ORDER BY basket_id, id
Sample input:
+----+-----------+----------+
| id | basket_id | fruit_id |
+----+-----------+----------+
| 2 | 2 | 5 |
| 6 | 2 | 1 |
| 9 | 1 | 2 |
| 15 | 2 | 3 |
| 17 | 1 | 5 |
+----+-----------+----------+
Sample output:
+----+-----------+----------+----------------+
| id | basket_id | fruit_id | fruit_position |
+----+-----------+----------+----------------+
| 9 | 1 | 2 | 1 |
| 17 | 1 | 5 | 2 |
| 2 | 2 | 5 | 1 |
| 6 | 2 | 1 | 2 |
| 15 | 2 | 3 | 3 |
+----+-----------+----------+----------------+
SQL provides no guarantees on the order of returned rows, so fruit_position is likely to differ from one query to the next, most likely due to DML activity on your table.
If you really need a stable ordering, you should pick one of these:
Use existing columns as the ordering key, like the fruit name (if one exists).
Create a dedicated field, like seq_nr, that specifies the ordering of your fruits.
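A sketch of the second option (the column name seq_nr is illustrative):

ALTER TABLE basket_fruits ADD COLUMN seq_nr INT NOT NULL DEFAULT 0;

SELECT basket_id, fruit_id, seq_nr AS fruit_position
FROM basket_fruits
WHERE basket_id = 1
ORDER BY seq_nr;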