Row number for query results grouped by a column - mysql

I have a table that has the following columns:
id | fk_id | rcv_date
There may be multiple records with a common fk_id, which represents a foreign key id in a related table.
I need to create a query that will assign a row number to each record, grouped by fk_id, sorted by rcv_date.
I originally began with the following query, which works quite well for sorting and assigning row numbers:
SELECT @row:=@row +1 AS ordinality, c.fk_id, rcv_date
FROM (SELECT @row:=0) r, mytable c
ORDER BY rcv_date
However -- the row count and sorting is done across the entire dataset. I need the counting to be within a common fk_id. For example, the following sample data would return (the first column represents the row count/ordinality):
1 | 5 | 2011-10-01
2 | 5 | 2011-10-14
3 | 5 | 2011-11-02
4 | 5 | 2011-12-17
1 | 8 | 2011-09-03
2 | 8 | 2011-11-12
1 | 9 | 2011-10-08
2 | 9 | 2011-10-10
3 | 9 | 2011-11-19
The middle column represents the fk_id. As you can see, the sorting and row count is within the fk_id "grouping."
UPDATE
I have a query that seems to be working, but would like some input as to whether it can be improved:
SELECT IF(@last = c.fk_id, @row:=@row +1, @row:=1) AS ordinality, @last:=c.fk_id, c.fk_id, rcv_date
FROM (SELECT @row:=0) r, (SELECT @last:=0) l, mytable c
ORDER BY c.fk_id, rcv_date
So what this does is order by fk_id and then rcv_date -- which essentially handles my grouping. Then I use a second variable to compare the fk_id in the previous record with the current record: if it's the same, we increment the row; if different, we reset to 1.
My tests with real data appear to be working. I suspect it's a pretty inefficient query though, so if anyone has ideas for improving it, or sees possible flaws, I would love to hear them.

This should be pretty straightforward.
SELECT (CASE WHEN @fk <> fk_id THEN @row:=1 ELSE @row:=@row + 1 END) AS ordinality,
@fk:=fk_id, rcv_date
FROM (SELECT @row:=0) AS r,
(SELECT @fk:=0) AS f,
(SELECT fk_id, rcv_date FROM files ORDER BY fk_id, rcv_date) AS t
I ordered by fk_id first to ensure that all rows with the same foreign key come together (they may not be stored next to each other in the table), then applied your preferred ordering, i.e. by rcv_date. The query checks for a change in fk_id: if there is one, the row number variable is reset to 1; otherwise it is incremented. This is handled in the CASE expression. Notice that @fk:=fk_id comes after the CASE check; if it came first, the comparison would always see the current fk_id and the row number would never reset.
Edit: Just noticed your own solution which happened to be the same as I ended up with. Kudos! :)
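For readers on MySQL 8.0 or later, the same per-group numbering can be written with the ROW_NUMBER() window function instead of user variables (the MySQL manual notes that the evaluation order of expressions involving user variables is undefined). The following is only a sketch: it uses Python's bundled sqlite3 module as a stand-in for MySQL so that it can be run anywhere, assumes SQLite 3.25 or newer for window-function support, and loads the sample rows from the question in a deliberately shuffled order.

```python
import sqlite3

# SQLite (3.25+) stands in for MySQL 8.0+ here; the window-function
# syntax in the SELECT below is the same in both.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, fk_id INTEGER, rcv_date TEXT)"
)

# Sample rows from the question, inserted out of order on purpose.
conn.executemany(
    "INSERT INTO mytable (fk_id, rcv_date) VALUES (?, ?)",
    [
        (9, "2011-11-19"), (5, "2011-10-14"), (8, "2011-11-12"),
        (5, "2011-10-01"), (9, "2011-10-08"), (5, "2011-12-17"),
        (8, "2011-09-03"), (9, "2011-10-10"), (5, "2011-11-02"),
    ],
)

# Number the rows inside each fk_id partition, ordered by rcv_date.
numbered = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (PARTITION BY fk_id ORDER BY rcv_date) AS ordinality,
           fk_id,
           rcv_date
    FROM mytable
    ORDER BY fk_id, rcv_date
    """
).fetchall()

for row in numbered:
    print(row)
```

The numbering restarts at 1 for every fk_id, which matches the expected output listed in the question.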

Related

Please help to optimize MySql UPDATE with two tables and M:N row mapping

I'm post-processing traces for two different kinds of events, where the data is stored in tables A and B. Both tables have a producer ID and a time index value. While the same producer can trigger a record in both tables, the times at which the different events occur are independent, and events are much more frequent in table B.
I want to update table A such that, for every row in table A, a column value from table B is taken for the most recent row in table B for the same producer.
Example mappings between two tables:
Here is a simplified example with just one producer in both tables. The goal is not to get the oldest entry in table B, but rather the most recent entry in table B relative to a row in table A. I'm showing B.tIdx < A.tIdx in this example, but <= is just as good for my purposes; just a detail.
Table A                                    Table B
+----+------+----------------------+      +------+------+-------+
| ID | tIdx | NEW value SET FROM B |      | ID   | tIdx | value |
+----+------+----------------------+      +------+------+-------+
|  1 |    2 |                 12.5 |      |    1 |    1 |  12.5 |
|  1 |    4 |                  4.3 |      |    1 |    2 |   9.0 |
+----+------+----------------------+      |    1 |    3 |   4.3 |
                                          |    1 |    4 |   7.8 |
                                          |    1 |    5 |   6.2 |
                                          +------+------+-------+
The actual tables have thousands of different IDs, millions of rows, and nearly as many distinct time index values as rows. I'm having trouble coming up with an UPDATE that doesn't take days to complete.
The following UPDATE works, but executes far too slowly; it starts off at a rate of 100s of updates/s, but soon slows to roughly 5 updates/s.
UPDATE A AS x SET value =
(SELECT value
FROM B AS y
WHERE x.ID = y.ID AND x.tIdx > y.tIdx
ORDER BY y.tIdx DESC
LIMIT 1);
I've tried creating indexes for ID and tIdx separately, but also multi-column indexes with both orders (ID,tIdx) and (tIdx,ID). But even when the multi-column indexes exist, EXPLAIN shows that it only ever indexes on ID or tIdx, but not both together.
I was wondering if the solution is to create nested SELECTs, to first get a temporary table with a particular ID, and then find the 1 row in table B that will meet the time constraint for each tIdx for that particular ID. The following SELECT, with hardcoded ID and tIdx, works and is very fast, completing in 0.00 sec.
SELECT value, ID, tIdx
FROM (
SELECT value, ID, tIdx
FROM B
WHERE ID = 5216
) y
WHERE tIdx < 1253707
ORDER BY tIdx DESC LIMIT 1;
I'd like to incorporate this into an UPDATE somehow, but replace the hardcoded ID and tIdx with the ID,tIdx pair for each row in A.
Or try any other suggestion for a more efficient UPDATE statement.
This is my first post to stackoverflow. Sincere apologies in advance if I have violated any etiquette.
Update with Inner Join should do it, but it's going to get nasty to do this.
Update a INNER JOIN
(Select b.ID, maxb.atIdx, b.value
From b INNER JOIN (Select a.ID, a.tIdx as atIdx, max(b.tIdx) as bigb
From b INNER JOIN a
ON b.ID=a.ID
Where b.tIdx<=a.tIdx
Group By a.ID,a.tIdx) maxb
ON b.ID=maxb.ID and b.tIdx=maxb.bigb
) bestb ON a.ID=bestb.ID and a.tIdx=bestb.atIdx
Set a.value=bestb.value
To explain this it's best to start with the innermost SQL and work your way to the outermost UPDATE. To start, we need to join every record in table A to every record in table B for each ID. We can filter out the B records that are too recent and summarize that result for each table A record. That leaves us with the tIdx of the B table whose value goes into A for every record key in A. So then we join that to the B table to select the values to update, preserving the A-table's keys. That result is joined back to A to perform the update.
You'll have to see whether this is fast enough for you - I'm worried that this accesses the B table twice and the inner query creates A LOT of join combinations. I would pull out that inner query and see how long it runs by itself. On the positive side, they are all very simple, straightforward queries and they are connected by Inner Joins so there is some opportunity for efficiency in the query optimizer. I think indexes on a(ID,TIdx) [fast lookup to get the Update row] and b(ID) would be useful here.
One thing you can try is LEAD(), to see if that helps the performance (it is a window function, so it needs MySQL 8.0 or later):
UPDATE A a JOIN
(SELECT b.*,
LEAD(tIdx) OVER (PARTITION BY ID ORDER BY tIdx) AS next_tIdx
FROM B b
) b
ON a.ID = b.ID AND
a.tIdx >= b.tIdx AND
(b.next_tIdx IS NULL OR a.tIdx < b.next_tIdx)
SET a.value = b.value;
And for this you want an index on b(id, tidx).
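The LEAD() idea can be checked against the small example from the question. The sketch below is an illustration only: it runs on SQLite through Python's sqlite3 module rather than on MySQL (SQLite 3.25+ is assumed for window functions), it uses the strict B.tIdx < A.tIdx comparison from the question's table rather than the >= used in the answer above, and the index name b_id_tidx and the aliases x and y are made up for the example. Each B row is turned into a half-open interval (tIdx, next_tIdx], and every A row falls into exactly one interval per ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (ID INTEGER, tIdx INTEGER, value REAL);
    CREATE TABLE B (ID INTEGER, tIdx INTEGER, value REAL);
    CREATE INDEX b_id_tidx ON B (ID, tIdx);  -- hypothetical index name

    INSERT INTO A (ID, tIdx) VALUES (1, 2), (1, 4);
    INSERT INTO B VALUES (1, 1, 12.5), (1, 2, 9.0), (1, 3, 4.3),
                         (1, 4, 7.8), (1, 5, 6.2);
    """
)

# For each A row, find the B row whose interval (tIdx, next_tIdx] contains it.
matches = conn.execute(
    """
    SELECT x.ID, x.tIdx, y.value
    FROM A AS x
    JOIN (
        SELECT ID, tIdx, value,
               LEAD(tIdx) OVER (PARTITION BY ID ORDER BY tIdx) AS next_tIdx
        FROM B
    ) AS y
      ON x.ID = y.ID
     AND y.tIdx < x.tIdx
     AND (y.next_tIdx IS NULL OR x.tIdx <= y.next_tIdx)
    ORDER BY x.ID, x.tIdx
    """
).fetchall()

# Write the matched values back into A, one UPDATE per matched row.
conn.executemany(
    "UPDATE A SET value = ? WHERE ID = ? AND tIdx = ?",
    [(value, id_, tidx) for (id_, tidx, value) in matches],
)
updated = conn.execute("SELECT ID, tIdx, value FROM A ORDER BY ID, tIdx").fetchall()
print(updated)
```

With the sample data, the A rows at tIdx 2 and 4 pick up 12.5 and 4.3, the values shown in the question's "NEW value SET FROM B" column.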

SQL query to retrieve data from table

How to retrieve odd rows from the table?
In the Base table always Cr_id is duplicated 2 times.
Base table
I want a SELECT statement that retrieves only those c_id =1 where Cr_id is always first as shown in the output table.
Output table
Just see the base table and output table you should automatically know what I want, Thanx.
Just testing min date should be enough
drop table if exists t;
create table t(c_id int,cr_id int,dt date);
insert into t values
(1,56,'2020-12-17'),(56,56,'2020-12-17'),
(1,8,'2020-12-17'),(56,8,'2020-12-17'),
(123,78,'2020-12-17'),(1,78,'2020-12-18');
select c_id,cr_id,dt
from t
where c_id = 1 and
dt = (select min(dt) from t t1 where t1.cr_id = t.cr_id);
+------+-------+------------+
| c_id | cr_id | dt |
+------+-------+------------+
| 1 | 56 | 2020-12-17 |
| 1 | 8 | 2020-12-17 |
+------+-------+------------+
2 rows in set (0.002 sec)
What you're looking for could be "partition by", at least if you're working on mssql.
(In the future, please include more background, SQL is not just SQL)
https://codingsight.com/grouping-data-using-the-over-and-partition-by-functions/
I have an old query lying around that can put a sorting index on data that lacks one, although the underlying reason for needing it is almost certainly a bad data design.
Typically I use this query to remove bad data, but you can rewrite it as a join instead, so that you can identify the data you need.
I'm not posting that rewritten query here, to make the point that bad data design results in more work when the data is read afterwards, which seems to be the real root cause here.
DELETE t
FROM
(
SELECT ROW_NUMBER () OVER (PARTITION BY column_1 ,column_2, column_3 ORDER BY column_1,column_2 ,column_3 ) AS Seq
FROM Table
)t
WHERE Seq > 1

Return the query when count of a query is greater than a number?

I want to return all rows that have a certain value in a column, but only when that value occurs more than 5 times. For example, if column M contains the value 1 in more than 5 rows, the query should return every row where M is 1.
select *
from tab
where M = 1
group by id -- id is the primary key of the table
having count(M) > 5;
EDIT: Here is my table:
id | M | price
--------+-------------+-------
1 | | 100
2 | 1 | 50
3 | 1 | 30
4 | 2 | 20
5 | 2 | 10
6 | 3 | 20
7 | 1 | 1
8 | 1 | 1
9 | 1 | 1
10 | 1 | 1
11 | 1 | 1
Originally I just want to insert into a trigger so that if the number of M = 1's is greater than 5, then I want to create an exception. The query I asked for would be inserted into the trigger. END EDIT.
But my table is always empty. Can anyone help me out? Thanks!
Try this :
select *
from tab
where M in (select M from tab where M = 1 group by M having count(id) > 5);
SQL Fiddle Demo
please try
select *,count(M) from table where M=1 group by id having count(M)>5
Since you group on your PK (which seems a futile exercise), you are counting per ID, which will indeed always return 1.
As I explain after this code, this query is NOT good, it is NOT the answer, and I also explain WHY. Please do not expect this query to run correctly!
select *
from tab
where M = 1
group by M
having count(*) > 5;
Like this, you group on what you are counting, which makes a lot more sense. At the same time, this will have unexpected behaviour, as you are selecting all kinds of columns that are not in the GROUP BY or in any aggregate. I know MySQL is lenient on that, but I don't even want to know what it will produce.
Try indeed a subquery along these lines:
select *
from tab
where M in
(SELECT M
from tab
group by M
having count(*) > 5)
I've built a SQLFiddle demo (I used 'Test' as the table name out of habit) accomplishing this (I don't have a MySQL at hand now to test it).
-- Made up a structure for testing
CREATE TABLE Test (
id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
M int
);
SELECT id, M FROM Test
WHERE M IN (
SELECT M
FROM Test
WHERE M = 1
GROUP BY M
HAVING COUNT(M) > 5
)
The sub-query is a common "find the duplicates" kind of query, with the added condition of a specific value for the column M, and the requirement that the value occurs more than 5 times.
It will spit out a series of values of M which you can use to query the table against, ending with the rows you need.
You shouldn't use SELECT *; it's a bad practice in general. Don't retrieve data you aren't actually using, and if you are using it, take the little time needed to type in a list of fields: you'll likely see faster querying, and the code will be far more readable.
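To make the difference between the two groupings concrete, here is a small sketch that loads the table from the question and runs both forms. It uses Python's sqlite3 module in place of MySQL purely so it can be run without a server; the function name rows_if_frequent and the threshold parameter are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER PRIMARY KEY, M INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO tab (id, M, price) VALUES (?, ?, ?)",
    [
        (1, None, 100), (2, 1, 50), (3, 1, 30), (4, 2, 20), (5, 2, 10),
        (6, 3, 20), (7, 1, 1), (8, 1, 1), (9, 1, 1), (10, 1, 1), (11, 1, 1),
    ],
)

# Grouping by the primary key: every group holds exactly one row,
# so COUNT(M) is never above 1 and nothing passes the HAVING filter.
grouped_by_pk = conn.execute(
    "SELECT id FROM tab WHERE M = 1 GROUP BY id HAVING COUNT(M) > 5"
).fetchall()


def rows_if_frequent(value, threshold):
    """Return the rows with M = value, but only if more than `threshold` such rows exist."""
    return conn.execute(
        """
        SELECT id, M, price
        FROM tab
        WHERE M IN (
            SELECT M FROM tab WHERE M = ? GROUP BY M HAVING COUNT(*) > ?
        )
        ORDER BY id
        """,
        (value, threshold),
    ).fetchall()


frequent_ones = rows_if_frequent(1, 5)  # M = 1 occurs 7 times
frequent_twos = rows_if_frequent(2, 5)  # M = 2 occurs only twice
print(grouped_by_pk, [r[0] for r in frequent_ones], frequent_twos)
```

The first query comes back empty, which is the behaviour described in the question; the subquery form returns the seven M = 1 rows and nothing for M = 2.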

How does SELECT DISTINCT work in MySQL?

I have a table with multiple rows which have a same data. I used SELECT DISTINCT to get a unique row and it works fine. But when i use ORDER BY with SELECT DISTINCT it gives me unsorted data.
Can anyone tell me how distinct works?
Based on what criteria it selects the row?
From your comment earlier, the query you are trying to run is
Select distinct id from table where id2 =12312 order by time desc.
As I expected, here is your problem. Your select column and order by column are different. Your output rows are ordered by time, but that order doesn't necessarily need to be preserved in the id column. Here is an example.
id | id2 | time
-------------------
1 | 12312 | 34
2 | 12312 | 12
3 | 12312 | 48
If you run
SELECT * FROM table WHERE id2=12312 ORDER BY time DESC
you will get the following result
id | id2 | time
-------------------
3 | 12312 | 48
1 | 12312 | 34
2 | 12312 | 12
Now if you select only the id column from this, you will get
id
--
3
1
2
This is why your results are not sorted.
When you specify SELECT DISTINCT it will give you all the rows, eliminating duplicates from the result set. By "duplicates" I mean rows where all fields have the same values. For example, say you have a table that looks like:
id | num
--------------
1 | 1
2 | 3
3 | 3
SELECT DISTINCT * would return all rows above, whereas SELECT DISTINCT num would return two rows:
num
-----
1
3
Note that which actual row it selects (e.g. whether it's row 2 or row 3) is irrelevant, as the result would be indistinguishable.
Finally, DISTINCT should not affect how ORDER BY works.
Reference: MySQL SELECT statement
The behaviour you describe happens when you ORDER BY an expression that is not present in the SELECT clause. The SQL standard does not allow such a query but MySQL is less strict and allows it.
Let's try an example:
SELECT DISTINCT column1, column2
FROM table1
WHERE ...
ORDER BY column3
Let's say the content of the table table1 is:
id | column1 | column2 | column3
----+---------+---------+---------
1 | A | B | 1
2 | A | B | 5
3 | X | Y | 3
Without the ORDER BY clause, the above query returns following two records (without ORDER BY the order is not guaranteed):
column1 | column2
---------+---------
A | B
X | Y
But with ORDER BY column3 the order is also not guaranteed.
The DISTINCT clause operates on the values of the expressions present in the SELECT clause. If row #1 is processed first then (A, B) is placed in the result set and it is associated with row #1. Then, when row #2 is processed, the values of the SELECT expressions produce the record (A, B) that is already in the result set. Because of DISTINCT it is dropped. Row #3 produces (X, Y) that is also put in the result set. Then, the ORDER BY column3 clause makes the records be sorted in the result set as (A, B), (X, Y).
But if row #2 is processed before row #1 then, following the same logic exposed in the previous paragraph, the records in the result set are sorted as (X, Y), (A, B).
There is no rule imposed on the database engine about the order in which it processes the rows when it runs a query. The database is free to process the rows in whatever order it considers best for performance.
Your query is invalid SQL and the fact that it can return different results using the same input data proves it.
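One way to make such a query well defined is to say explicitly which time value should represent each distinct id, for example by grouping on id and ordering by an aggregate. The sketch below is only an illustration of that idea: it runs on SQLite via Python's sqlite3 module rather than MySQL, the table name events is made up, and the fourth row (a second entry for id 2) is added here so that there is a duplicate to collapse; it is not part of the example data in the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, id2 INTEGER, time INTEGER)")
conn.executemany(
    "INSERT INTO events (id, id2, time) VALUES (?, ?, ?)",
    [
        (1, 12312, 34),
        (2, 12312, 12),
        (3, 12312, 48),
        (2, 12312, 50),  # hypothetical duplicate id, added for this illustration
    ],
)

# GROUP BY collapses duplicates like DISTINCT does, and MAX(time) gives the
# sort a single, unambiguous value per id ("most recent time for that id").
ordered_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM events
        WHERE id2 = 12312
        GROUP BY id
        ORDER BY MAX(time) DESC
        """
    )
]
print(ordered_ids)
```

Here id 2 sorts first because its latest time is 50; whether MAX or MIN is the right representative depends on what the ordering is meant to express.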

Efficiency question - Selecting numeric data from one field

I have a pair of tables and I need to search for numeric values in Table1 that match associated IDs on Table2. For example:
Table1
ID | Item
1 Cat
3 Frog
9 Dog
11 Horse
Table2
Category | Contains
Group 1 1
Group 2 3|9
Group 3 3|9|11
Originally I was thinking a LIKE would work, but if I searched for "1", I'd end up matching "11". I looked into SETs, but the MySQL docs state that the maximum number of elements is 64 and I have over 200 rows of items in Table1. I could wrap each item id with a character (e.g. "|1|") but that doesn't seem very efficient. Each Group will have unique items (e.g., there won't be two Cats in the same Group).
I found a similar topic as my problem and one of the answers suggested making another table, but I don't understand how that would work. A new table containing what, exactly?
The other option I have is to split the Contains into 6 separate columns, since there's never going to be more than 6 items in a Group, but then I'm not sure how to search all 6 columns without relying on six OR queries:
Category | C1 | C2 | C3 | C4 (etc)
Group 1 1 null null null
Group 2 3 9 null null
Group 3 3 9 11 null
SELECT * FROM Table2 WHERE C1 = '1' OR C2 = '1' OR C3 = '1' etc.
I'm not sure what the most efficient way of handling this is. I could use some advice from those with more experience with normalizing this kind of data please. Thank you.
I think it'd be best to create another table to normalize your data, however what you're proposing is not exactly what I'd suggest.
Realistically what you are modeling is a many-to-many relationship between table1 and table2. This means that one row in table1 can be associated with many rows in table2, and vice versa.
In order to create this kind of relation, you need a third table, which we can call rel_table1_table2 for now.
rel_table1_table2 will contain only primary key values from the two associated tables, which in this case seem to be table1.ID and table2.Category.
When you want to associate a row in table1 with a row in table2, you'd add a row to rel_table1_table2 with the primary key values from table1 and table2 respectively.
Example:
INSERT INTO rel_table1_table2 (ID, Category) VALUES (1, "Group 1")
When you need to find out what Items belong to a Category, you'd simply query your association table, for example:
SELECT t1.Item from table1 t1 join rel_table1_table2 r on t1.ID=r.ID join table2 t2 on r.Category=t2.Category WHERE t2.Category="Group 3"
Does that make sense?
That "new" table would contain one row for each category an animal belongs to.
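-- column types are assumed; adjust them to your data
create table animal(
animal_id int not null
,name varchar(50) not null
,primary key(animal_id)
);
create table category(
category_id int not null
,name varchar(50) not null
,primary key(category_id)
);
create table animal_categories(
animal_id int not null
,category_id int not null
,primary key(animal_id, category_id)
);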
For your example data, the animal_categories table would contain:
+-------------+-----------+
| category_id | animal_id |
+-------------+-----------+
|           1 |         1 |
|           2 |         3 |
|           2 |         9 |
|           3 |         3 |
|           3 |         9 |
|           3 |        11 |
+-------------+-----------+
Instead of using "like" use "REGEXP" so that you don't get "11" when looking for "1"
Break Table2.Contains in another table which joins Item and Category:
Item Item_Category Category
------ -------------- ---------
ID (1)----(*)ItemID Name
Name CategoryID(*)-------(1) ID
Now, your query will look like:
SELECT Category.* FROM Category, Item_Category
WHERE (Item_Category.CategoryID = Category.ID)
AND (Item_Category.ItemID IN (1, 2, 3, 11))
It seems like your problem is the way you are using the rows in Table 2. In databases it should always trigger a red flag when you find yourself using a list of values in a row.
Rather than having each category be in a single row in table 2, how about using the same category in multiple rows, with the Contains column only storing a single value? Your example could be changed to:
Table 1
ID | Item
1 Cat
3 Frog
9 Dog
11 Horse
Table 2
Category | Contains
Group 1 1
Group 2 3
Group 2 9
Group 3 3
Group 3 9
Group 3 11
Now when you want to find out "What items does group 2 contain?", you can write a query for that which selects all of the "Group 2" category rows from Table 2. When you want to find out, "What is the name of item 9", you can write a query that selects a row from Table 1.
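The one-value-per-row layout can be tried out with the example data from the question. This is a rough sketch rather than a recommended schema: it runs on SQLite through Python's sqlite3 module instead of MySQL, and the table and column names (item, item_category, item_id) are chosen for the example. Because each membership is stored as its own integer row, looking up item 1 is an exact match and cannot accidentally hit item 11.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE item_category (
        category TEXT NOT NULL,
        item_id  INTEGER NOT NULL REFERENCES item (id),
        PRIMARY KEY (category, item_id)  -- an item appears at most once per group
    );

    INSERT INTO item VALUES (1, 'Cat'), (3, 'Frog'), (9, 'Dog'), (11, 'Horse');
    INSERT INTO item_category VALUES
        ('Group 1', 1),
        ('Group 2', 3), ('Group 2', 9),
        ('Group 3', 3), ('Group 3', 9), ('Group 3', 11);
    """
)

# Which groups contain item 1? An integer equality test, no LIKE needed.
groups_with_item_1 = [
    row[0]
    for row in conn.execute(
        "SELECT category FROM item_category WHERE item_id = ? ORDER BY category",
        (1,),
    )
]

# Which items are in Group 3? Join back to the item table for the names.
items_in_group_3 = [
    row[0]
    for row in conn.execute(
        """
        SELECT item.name
        FROM item_category
        JOIN item ON item.id = item_category.item_id
        WHERE item_category.category = ?
        ORDER BY item.id
        """,
        ("Group 3",),
    )
]
print(groups_with_item_1, items_in_group_3)
```

The composite primary key also enforces the rule from the question that a group never contains the same item twice.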