Fetching linked list in MySQL database - mysql

I have a MySQL database table with this structure:
table
id INT NOT NULL PRIMARY KEY
data ..
next_id INT NULL
I need to fetch the data in order of the linked list. For example, given this data:
id | next_id
----+---------
1 | 2
2 | 4
3 | 9
4 | 3
9 | NULL
I need to fetch the rows for id=1, 2, 4, 3, 9, in that order. How can I do this with a database query? (I can do it on the client end. I am curious if this can be done on the database side. Thus, saying it's impossible is okay (given enough proof)).
It would be nice to have a termination point as well (e.g. stop after 10 fetches, or when some condition on the row turns true) but this is not a requirement (can be done on client side). I (hope I) do not need to check for circular references.

Some brands of database (e.g. Oracle, Microsoft SQL Server) support extra SQL syntax to run "recursive queries", but MySQL before version 8.0 does not support any such solution (8.0 added recursive common table expressions via WITH RECURSIVE).
The problem you are describing is the same as representing a tree structure in a SQL database. You just have a long, skinny tree.
There are several solutions for storing and fetching this kind of data structure from an RDBMS. See some of the following questions:
"What is the most efficient/elegant way to parse a flat table into a tree?"
"Is it possible to make a recursive SQL query ?"
Since you mention that you'd like to limit the "depth" returned by the query, you can achieve this while querying the list this way:
SELECT * FROM mytable t1
LEFT JOIN mytable t2 ON (t1.next_id = t2.id)
LEFT JOIN mytable t3 ON (t2.next_id = t3.id)
LEFT JOIN mytable t4 ON (t3.next_id = t4.id)
LEFT JOIN mytable t5 ON (t4.next_id = t5.id)
LEFT JOIN mytable t6 ON (t5.next_id = t6.id)
LEFT JOIN mytable t7 ON (t6.next_id = t7.id)
LEFT JOIN mytable t8 ON (t7.next_id = t8.id)
LEFT JOIN mytable t9 ON (t8.next_id = t9.id)
LEFT JOIN mytable t10 ON (t9.next_id = t10.id);
It'll perform like molasses, and the result will come back all on one row (per linked list), but you'll get the result.
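On a server that does support recursive common table expressions (MySQL 8.0 and later does), the list can be walked row by row instead of with a fixed stack of joins. The following is a minimal sketch, not a tuned solution: it uses SQLite through Python's sqlite3 module purely so that it runs anywhere, the table name mytable is taken from the query above, and the depth column and the walk helper are names invented for illustration. The SQL text should carry over to MySQL 8.0+ largely unchanged, apart from the placeholder style of the driver.

```python
import sqlite3

# In-memory stand-in for the table in the question (assumed name: mytable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, next_id INTEGER NULL)")
conn.executemany(
    "INSERT INTO mytable (id, next_id) VALUES (?, ?)",
    [(1, 2), (2, 4), (3, 9), (4, 3), (9, None)],
)

# "depth" is a helper column introduced here; it orders the output and
# doubles as the termination point (and as a guard against cycles).
WALK_LIST = """
WITH RECURSIVE chain (id, next_id, depth) AS (
    SELECT id, next_id, 1 FROM mytable WHERE id = ?
    UNION ALL
    SELECT t.id, t.next_id, c.depth + 1
    FROM chain AS c
    JOIN mytable AS t ON t.id = c.next_id
    WHERE c.depth < ?
)
SELECT id FROM chain ORDER BY depth
"""


def walk(head_id, max_rows):
    """Return the ids reachable from head_id, in list order, at most max_rows of them."""
    return [row[0] for row in conn.execute(WALK_LIST, (head_id, max_rows))]


print(walk(1, 10))
```

Unlike the ten-way join, this returns one row per node, and the second parameter gives the "stop after N fetches" behaviour asked for in the question.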

If what you are trying to avoid is having several queries (one for each node) and you are able to add columns, then you could have a new column that links to the root node. That way you can pull in all the data at once by the root id, but you will still have to sort the list (or tree) on the client side.
So in this example you would have:
id | next_id | root_id
----+---------+---------
1 | 2 | 1
2 | 4 | 1
3 | 9 | 1
4 | 3 | 1
9 | NULL | 1
Of course, the disadvantage of this compared with traditional linked lists or trees is that the root cannot change without O(n) writes, where n is the number of nodes, because you would have to update the root id of every node. Fortunately, you should always be able to do this in a single UPDATE query unless you are dividing a list/tree in the middle.
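The client-side sort mentioned above is short once the rows for one root_id have been fetched. Here is a rough sketch, assuming the rows arrive as (id, next_id) tuples in arbitrary order; the function name order_linked_rows is made up for illustration.

```python
def order_linked_rows(rows, root_id):
    """Order rows fetched by root_id by following next_id from the root.

    rows: iterable of (id, next_id) tuples in any order.
    Stops at a NULL/unknown next_id, or if a node repeats (a cycle).
    """
    next_of = {row_id: next_id for row_id, next_id in rows}
    ordered = []
    seen = set()
    current = root_id
    while current is not None and current in next_of and current not in seen:
        ordered.append(current)
        seen.add(current)
        current = next_of[current]
    return ordered


# Rows as they might come back from "SELECT id, next_id ... WHERE root_id = 1".
fetched = [(3, 9), (9, None), (1, 2), (4, 3), (2, 4)]
print(order_linked_rows(fetched, 1))
```

Building the dictionary first keeps the whole sort at O(n), rather than rescanning the row list for every hop.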

This is less a solution and more of a workaround but, for a linear list (rather than the tree Bill Karwin mentioned), it might be more efficient to use a sort column on your list. For example:
CREATE TABLE `schema`.`my_table` (
`id` INT NOT NULL PRIMARY KEY,
`order` INT,
data ..,
INDEX `ix_order` (`order` ASC)
);
Then:
SELECT * FROM `schema`.`my_table` ORDER BY `order`;
This has the disadvantage of slower inserts (you have to reposition all sorted elements past the insertion point) but should be fast for retrieval because the order column is indexed.

Related

Please help to optimize MySql UPDATE with two tables and M:N row mapping

I'm post-processing traces for two different kinds of events, where the data is stored in tables A and B. Both tables have a producer ID and a time index value. While the same producer can trigger a record in both tables, the times at which the different events occur are independent, and events are much more frequent in table B.
I want to update table A such that, for every row in table A, a column value from table B is taken for the most recent row in table B for the same producer.
Example mappings between two tables:
Here is a simplified example with just one producer in both tables. The goal is not to get the oldest entry in table B, but rather the most recent entry in table B relative to a row in table A. I'm showing B.tIdx < A.tIdx in this example, but <= is just as good for my purposes; just a detail.
Table A
+----+------+----------------------+
| ID | tIdx | NEW value SET FROM B |
+----+------+----------------------+
| 1  | 2    | 12.5                 |
| 1  | 4    | 4.3                  |
+----+------+----------------------+
Table B
+----+------+-------+
| ID | tIdx | value |
+----+------+-------+
| 1  | 1    | 12.5  |
| 1  | 2    | 9.0   |
| 1  | 3    | 4.3   |
| 1  | 4    | 7.8   |
| 1  | 5    | 6.2   |
+----+------+-------+
The actual tables have thousands of different IDs, millions of rows, and nearly as many distinct time index values as rows. I'm having trouble coming up with an UPDATE that doesn't take days to complete.
The following UPDATE works, but executes far too slowly; it starts off at a rate of 100s of updates/s, but soon slows to roughly 5 updates/s.
UPDATE A AS x SET value =
(SELECT value
FROM B AS y
WHERE x.ID = y.ID AND x.tIdx > y.tIdx
ORDER BY y.tIdx DESC
LIMIT 1);
I've tried creating indexes for ID and tIdx separately, but also multi-column indexes with both orders (ID,tIdx) and (tIdx,ID). But even when the multi-column indexes exist, EXPLAIN shows that it only ever indexes on ID or tIdx, but not both together.
I was wondering if the solution is to create nested SELECTs, to first get a temporary table with a particular ID, and then find the 1 row in table B that will meet the time constraint for each tIdx for that particular ID. The following SELECT, with hardcoded ID and tIdx, works and is very fast, completing in 0.00 sec.
SELECT value, ID, tIdx
FROM (
SELECT value, ID, tIdx
FROM B
WHERE ID = 5216
) y
WHERE tIdx < 1253707
ORDER BY tIdx DESC LIMIT 1;
I'd like to incorporate this into an UPDATE somehow, but replace the hardcoded ID and tIdx with the ID,tIdx pair for each row in A.
Or try any other suggestion for a more efficient UPDATE statement.
This is my first post to stackoverflow. Sincere apologies in advance if I have violated any etiquette.
Update with Inner Join should do it, but it's going to get nasty to do this.
Update a INNER JOIN
(Select b.ID, maxb.atIdx, b.value
From b INNER JOIN (Select a.ID, a.tIdx as atIdx, max(b.tIdx) as bigb
From b INNER JOIN a
ON b.ID=a.ID
Where b.tIdx<=a.tIdx
Group By a.ID,a.tIdx) maxb
ON b.ID=maxb.ID and b.tIdx=maxb.bigb
) bestb ON a.ID=bestb.ID and a.tIdx=bestb.atIdx
Set a.value=bestb.value
To explain this it's best to start with the innermost SQL and work your way to the outermost UPDATE. To start, we need to join every record in table A to every record in table B for each ID. We can filter out the B records that are too recent and summarize that result for each table A record. That leaves us with the tIdx of the B table whose value goes into A for every record key in A. So then we join that to the B table to select the values to update, preserving the A-table's keys. That result is joined back to A to perform the update.
You'll have to see whether this is fast enough for you - I'm worried that this accesses the B table twice and the inner query creates A LOT of join combinations. I would pull out that inner query and see how long it runs by itself. On the positive side, they are all very simple, straightforward queries and they are connected by Inner Joins so there is some opportunity for efficiency in the query optimizer. I think indexes on a(ID,TIdx) [fast lookup to get the Update row] and b(ID) would be useful here.
One thing you can try is LEAD() (a window function, so this needs MySQL 8.0 or later) to see if that helps the performance:
UPDATE A JOIN
(SELECT b.*,
LEAD(tIDx) OVER (PARTITION BY id order by tIDx) as next_tIDx
FROM b
) b
ON a.id = b.id AND
a.tIDx >= b.tIDx AND
(b.next_tIDx IS NULL or a.tIDx < b.next_tIDx)
SET a.value = b.value;
And for this you want an index on b(id, tidx).
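Whichever rewrite is used, it is worth checking it against the small example mapping from the question before running it on millions of rows. The sketch below is only a correctness harness, not a performance fix: it loads the example data into SQLite through Python's sqlite3 module (an assumption made so it runs anywhere), creates the composite index on B(ID, tIdx), and applies the question's correlated-subquery UPDATE. The helper name fill_from_latest_b and its strict flag are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (ID INTEGER, tIdx INTEGER, value REAL);
CREATE TABLE B (ID INTEGER, tIdx INTEGER, value REAL);
CREATE INDEX ix_b_id_tidx ON B (ID, tIdx);
INSERT INTO A (ID, tIdx) VALUES (1, 2), (1, 4);
INSERT INTO B VALUES (1, 1, 12.5), (1, 2, 9.0), (1, 3, 4.3), (1, 4, 7.8), (1, 5, 6.2);
""")


def fill_from_latest_b(strict=True):
    """Set A.value from the most recent B row for the same ID.

    strict=True uses B.tIdx < A.tIdx (the question's example);
    strict=False uses B.tIdx <= A.tIdx (what both answers use).
    """
    op = "<" if strict else "<="
    conn.execute(f"""
        UPDATE A SET value = (
            SELECT B.value FROM B
            WHERE B.ID = A.ID AND B.tIdx {op} A.tIdx
            ORDER BY B.tIdx DESC
            LIMIT 1)
    """)
    return conn.execute("SELECT ID, tIdx, value FROM A ORDER BY ID, tIdx").fetchall()


print(fill_from_latest_b(strict=True))
print(fill_from_latest_b(strict=False))
```

The two calls give different values for the same rows of A, which is a reminder that the answers above use <= while the example table in the question was built with <.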

Does MySQL have a way to "coalesce" as an aggregate function?

I'm attempting to take an existing application and re-architect the schema to support new customer requests and fix several outstanding issues (mostly around our current schema being heavily denormalized). In doing so, I've reached an interesting problem which at first glance seems to have a simple solution, but I can't seem to find the function I'm looking for.
The application is a media organization tool.
Our Old Schema:
Our old schema had separate models for "Groups", "Subgroups", and "Videos". A Group could have many Subgroups (one-to-many) and a Subgroup could have many Videos (one-to-many).
There were certain fields that were shared among Groups, Subgroups, and Videos. For instance, the Google Analytics ID to be used when the Video was embedded on a page. Whenever we displayed the embed page we would first look if the value was set on the Video. If not, we checked its Subgroup. If not, we checked its Group. The query looked roughly like so (I wish this were the real query, but unfortunately our application was written over many years by many junior developers, so the truth is much more painful):
SELECT
v.id,
COALESCE(v.google_analytics_id, sg.google_analytics_id, g.google_analytics_id) as google_analytics_id
FROM
Videos v
LEFT JOIN Subgroups sg ON sg.id = v.subgroup_id
LEFT JOIN Groups g ON g.id = sg.group_id
Pretty straight-forward. Now the issue we've run into is that customers want to be able to nest groups arbitrarily deep, and our schema clearly only allows for 2 levels (and, in fact, necessitates two levels - even if you only want one)
New Schema (First Pass):
As a first pass, I knew we'd want a basic tree structure for the Groups, so I came up with this:
CREATE TABLE Groups (
id INT PRIMARY KEY,
name VARCHAR(255),
parent_id INT,
ga_id VARCHAR(20)
)
We can then easily nest up to N levels deep with N joins like so:
SELECT
v.id,
COALESCE(v.ga_id, g1.ga_id, g2.ga_id, g3.ga_id, ...) as ga_id
FROM
Videos v
LEFT JOIN Groups g1 ON g1.id = v.group_id
LEFT JOIN Groups g2 ON g2.id = g1.parent_id
LEFT JOIN Groups g3 ON g3.id = g2.parent_id
...
There are obvious flaws with this approach: we don't know how many parents there will be, so we don't know how many times we should JOIN, forcing us to implement a "max depth". Even with a max depth, if a person only has a single level of groups we still perform multiple JOINs, because our queries can't know how deep they need to go. MySQL offers recursive queries, but while looking into whether that was the right option I found a smarter schema that produced the same results.
New Schema (Take 2):
Looking into better ways to handle a tree structure, I learned about Adjacency Lists (my prior solution), Nested Sets, Materialized Paths, and Closure Tables. Other than Adjacency Lists (which depend on JOINs to grab the entire tree structure and so produces a single row with multiple columns per node on the tree), the other three solutions all return multiple rows for each node on the tree
I ended up going with a Closure Table solution like so:
CREATE TABLE Groups (
id INT PRIMARY KEY,
name VARCHAR(255),
ga_id VARCHAR(20)
)
CREATE TABLE Group_Closure (
ancestor_id INT,
descendant_id INT,
PRIMARY KEY (ancestor_id, descendant_id)
)
Now given a Video I can get all of its parents like so:
SELECT
v.id,
v.ga_id,
g.id,
g.ga_id
FROM
Videos v
JOIN Group_Closure gc ON v.group_id = gc.descendant_id
JOIN Groups g ON g.id = gc.ancestor_id;
This returns each group in the hierarchy as a separate row:
+------+---------+------+---------+
| v.id | v.ga_id | g.id | g.ga_id |
+------+---------+------+---------+
| 1 | abc123 | 2 | new_val |
| 1 | abc123 | 1 | default |
| 2 | NULL | 4 | xyz987 |
| 2 | NULL | 3 | NULL |
| 2 | NULL | 1 | default |
| 3 | NULL | 3 | NULL |
| 3 | NULL | 1 | default |
+------+---------+------+---------+
What I wish to do now is somehow achieve the same result I would have expected from using COALESCE on multiple self-joined Group tables: a single value for ga_id based on whichever node is "lowest" in the tree
Because I have multiple rows per Video, I suspect that this can be accomplished using GROUP BY and some kind of aggregate function:
SELECT
v.id,
COALESCE(v.ga_id, FIRST_NON_NULL(g.ga_id))
FROM
Videos v
JOIN Group_Closure gc ON v.group_id = gc.descendant_id
JOIN Groups g ON g.id = gc.ancestor_id
GROUP BY v.id, v.ga_id;
Note that because (ancestor, descendant) is my primary key, I believe the order of the group closure table can be guaranteed to always come back the same - meaning if I put the lowest node first, it will be the first row in the resulting query... If my understanding of this is incorrect, please let me know.
If you were to stick with an adjacency list, you could use a recursive CTE. This one traverses up from each video id value until it finds a non-NULL ga_id:
WITH RECURSIVE CTE AS (
SELECT id, ga_id, group_id
FROM videos
UNION ALL
SELECT CTE.id, COALESCE(CTE.ga_id, g.ga_id), g.parent_id
FROM `groups` g
JOIN CTE ON g.id = CTE.group_id AND CTE.ga_id IS NULL
)
SELECT id, ga_id
FROM CTE
WHERE ga_id IS NOT NULL
For my attempt to reconstruct your data from your question, this yields:
id ga_id
1 abc123
2 xyz987
3 default
Demo on dbfiddle
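If the closure table is kept instead, note that SQL gives no guaranteed row order without an ORDER BY, so relying on the primary key to put the "lowest" node first is not safe. One common variation, sketched here as an assumption rather than as part of the schema in the question, is to store a depth column in Group_Closure (0 for a group's row pointing at itself) and pick the nearest ancestor with a non-NULL ga_id. The example runs on SQLite through Python's sqlite3 module so it is self-contained; the data is a reconstruction from the result table in the question, and the group names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Groups" is double-quoted because GROUPS is a keyword in several SQL dialects.
# The depth column is hypothetical: it is not in the question's schema.
conn.executescript("""
CREATE TABLE "Groups" (id INTEGER PRIMARY KEY, name TEXT, ga_id TEXT);
CREATE TABLE Videos (id INTEGER PRIMARY KEY, group_id INTEGER, ga_id TEXT);
CREATE TABLE Group_Closure (
    ancestor_id INTEGER,
    descendant_id INTEGER,
    depth INTEGER,
    PRIMARY KEY (ancestor_id, descendant_id)
);
INSERT INTO "Groups" VALUES
    (1, 'g1', 'default'), (2, 'g2', 'new_val'), (3, 'g3', NULL), (4, 'g4', 'xyz987');
INSERT INTO Videos VALUES (1, 2, 'abc123'), (2, 4, NULL), (3, 3, NULL);
INSERT INTO Group_Closure VALUES
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0),
    (1, 2, 1), (1, 3, 1), (3, 4, 1), (1, 4, 2);
""")

# For each video: its own ga_id, else the ga_id of the closest ancestor that has one.
NEAREST_GA_ID = """
SELECT v.id,
       COALESCE(v.ga_id,
                (SELECT g.ga_id
                 FROM Group_Closure AS gc
                 JOIN "Groups" AS g ON g.id = gc.ancestor_id
                 WHERE gc.descendant_id = v.group_id
                   AND g.ga_id IS NOT NULL
                 ORDER BY gc.depth
                 LIMIT 1)) AS ga_id
FROM Videos AS v
ORDER BY v.id
"""

nearest = conn.execute(NEAREST_GA_ID).fetchall()
print(nearest)
```

The explicit ORDER BY gc.depth is what stands in for the wished-for FIRST_NON_NULL aggregate: it makes "first" mean "closest in the tree" rather than "whatever order the rows happened to come back in".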

Joining pre-defined, possibly non-existing keys with table data

In MySQL (or SQL in general), is it possible to generate a list of pre-defined identifiers, joined with matching table data?
Take for instance the following table data, let's call it my_table:
id | value
---+------
1 | 'a'
3 | 'c'
Now, I have a list of possible id values and would like to get a full list of these values, together with joined data from the table above. With a list [1, 2, 3, 4], the desired result is:
item | id | value
-----+------+------
1 | 1 | 'a'
2 | NULL | NULL
3 | 3 | 'c'
4 | NULL | NULL
Obviously, a query like SELECT * FROM my_table WHERE id IN (1, 2, 3, 4) yields only results for two rows (values 'a' and 'c').
For a solution, I am thinking along the line of some form of temporary table, fed with the full list of id's ([1, 2, 3, 4]) and left joining that with the table data, such as
SELECT t1.`item`, t2.`id`, t2.`value`
FROM
...
AS t1
LEFT JOIN `my_table` AS t2 ON t2.`id` = t1.`item`
But how do I do that?
Is this even possible? Or is it really necessary to compare the result with the initial list in external code? (This would be possible, but not trivial as in my case, the identifiers are not integers)
(The ultimate idea of this, is that I would like a result set from the DB with all input id's so that I can easily identify the non-existing records)
Update: I guess it boils down to the question: how can I get a result set such as
id
---
1
2
3
4
from a (My)SQL server without having this as data in some table, but from setting the data in some query?
A new approach flashed into my mind... using a union.
SELECT t1.`item`, t2.`id`, t2.`value`
FROM (
select 1 as `item`
union select 2
union select 3
union select 4
) AS t1
LEFT JOIN `my_table` AS t2 ON t2.`id` = t1.`item`
It answers the question, but it remains to be seen whether this is the 'best' answer. It works as long as the list of items is not too long (which is the case for me).
Does anyone have a better solution?
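When the list of identifiers lives in application code, the UNION derived table can be generated with bound parameters instead of being typed out, which also covers non-integer identifiers. This is a minimal sketch of that idea on SQLite through Python's sqlite3 module; lookup_all is a hypothetical helper name, and a MySQL driver would typically use %s rather than ? as the placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, "a"), (3, "c")])


def lookup_all(items):
    """Return one (item, id, value) row per requested item, matched or not."""
    items = list(items)
    if not items:
        return []
    # One "SELECT ? AS item" per wanted key, glued together with UNION ALL.
    derived = " UNION ALL ".join(["SELECT ? AS item"] * len(items))
    sql = (
        "SELECT t1.item, t2.id, t2.value "
        f"FROM ({derived}) AS t1 "
        "LEFT JOIN my_table AS t2 ON t2.id = t1.item "
        "ORDER BY t1.item"
    )
    return conn.execute(sql, items).fetchall()


for row in lookup_all([1, 2, 3, 4]):
    print(row)
```

Only the placeholders are interpolated into the SQL text; the values themselves are bound, so the approach stays safe for string keys. As with the hand-written UNION, it is only practical while the list is short.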

not sure why join query is returning resultset longer than i'd expect, and taking long to execute

I have reached an impasse with my knowledge regarding mysql joins, and the query I'm trying to execute is taking way too long... Although I'm only a short while into learning mysql on my own, I have put time into reading about the mechanics of indexes and joins, done many google searches and tried a few different query formats. To no avail, I need help please.
Firstly, I will say that my database is, at the moment, to be optimized for speed of select queries. I know I have a few too many indexes... my theory of learning mysql is to make a few too many indexes and examine what the mysql optimizer chooses for my purposes (determined by using explain) and then determine why it has chosen said index.
Anyhow, I have four tables: table1, table2, table3, table4...
table1.ID1 is the primary key, and other data in table1 might be divided into multiple content in table2.
table2.ID1 identifies every entry in table2 that is built upon content from table1
table2.ID2 is the primary key for table2
table3.ID2 identifies every entry in table3 that is built upon content from table2
table3.ID3 is the primary key for table3
table4.ID3 identifies every entry in table4 that is built upon content from table3
Not every entry in table1 has corresponding data in table2, and similarly table2 to table3, and table3 to table4.
What I need to do is retrieve the distinct values of ID2 that appear within a date range, and also only if the table2 content eventually appears in table4. The challenge I'm facing is that only table1 has a date column, and I need only entries that also appear in table4.
The following query takes approx 2 minutes.
select table2.ID2 from table1
left join table2 on
table1.ID1 = table2.ID1
left join table3 on
table3.ID2 = table2.ID2
left join table4 on
table4.ID3 = table3.ID3
where table1.Date between "2012-03-11" and "2012-03-18"
by using explain with the above query I see no reason why it should take so long.
+----+-------------+--------------+-------+----------------------+----------+---------+------------------------------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+----------------------+----------+---------+------------------------------+-------+--------------------------+
| 1 | SIMPLE | table1 | range | ... | Datekey | 9 | NULL | 17528 | Using where; Using index |
| 1 | SIMPLE | table2 | ref | ... | ID1key | 8 | mydata.table1.POSTID | 1 | |
| 1 | SIMPLE | table3 | ref | ... | ID2key | 8 | mydata.table2.SrcID | 20 | |
| 1 | SIMPLE | table4 | ref | ... | ID3key | 8 | mydata.table3.ParsedID | 10 | Using index |
+----+-------------+--------------+-------+----------------------+----------+---------+------------------------------+-------+--------------------------+
I've replaced the names of possible keys with '...' as it's not that important. In any case, a key is selected.
Moreover, the number of rows in the resultset in the query is much more than the purported matching 17528 rows in the explain resultset. How could it be more??
What am I doing wrong? I've also tried inner join with no luck. The way I interpret my query is as a 4-way Venn diagram, with very few rows meeting the overlapping criteria, and further optimized by an index on the date range.
I at least get the resultset that i want if I add 'distinct(table2.ID2)', but why am I otherwise getting a resultset much longer than what I'd expect, and why is it taking so long?
Sorry if any part of my question has been ambiguous, I'd be happy to clarify as needed.
Thanks,
Brian
EDIT:
All indexes refer to a BIGINT column, as I expect my database to get rather large and need quite a number of unique row identifiers... perhaps bigint is overkill and reducing the size of that column and/or the index would speed things up further.
Here's my final solution, based on the accepted answer below:
select ID2 from table2
where exists
(select 1 from table1
where table1.Date between "2012-03-11" and "2012-03-18" and table2.ID1 = table1.ID1
)
and exists
(select 1 from table3
where table3.ID2 = table2.ID2 and exists
(select 1 from table4 where table4.ID3 = table3.ID3)
)
Additionally, I realized I was missing a multi-field index, associating table2.ID1 and table2.ID2... After adding this index, this statement returns in about 11 seconds, and returns approx 20,000 rows.
I think this is reasonable considering the number of rows in each of my tables
table1: ~480,000
table2: ~480,000
table3: ~6,000,000
table4: ~60,000,000
Does this sound efficient? I'll accept the answer after I get confirmation this is the best performance I should expect. I'm running on a Xeon 3GHz system with 3gb mem, ubuntu 12.04, mysql 5.5.24
In all likelihood, your tables have multiple matches between them. Say table1 matches 5 rows in table2 and 10 rows in table3. Then you end up with 50 rows in the output.
To solve this, you need to limit your joins to one row per table.
One way is to use the in clause. If you are using the joins for filtering, then you can use a where clause instead:
where table2.id1 in (select table1.id1 from table1)
The "in" prevents duplicates.
The other alternative is to pre-aggregate the tables in subqueries (e.g. with GROUP BY) before joining them.
Mysql seems to prefer a slightly different construct for the where clause, from an optimization perspective:
where exists (select 1 from table1 where table1.id1 = table2.id1)
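The row multiplication described above is easy to reproduce on a toy data set. The sketch below uses SQLite through Python's sqlite3 module only so that it is runnable as-is; the table and column names follow the question, the data is invented, and the behaviour shown (one output row per matching pair for a join, one row per outer row for EXISTS) is the same in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table2 (ID2 INTEGER PRIMARY KEY, ID1 INTEGER);
CREATE TABLE table3 (ID3 INTEGER PRIMARY KEY, ID2 INTEGER);
INSERT INTO table2 VALUES (10, 1), (11, 1);
INSERT INTO table3 VALUES (100, 10), (101, 10), (102, 10);
""")

# A join emits one row per matching (table2, table3) pair.
joined = conn.execute(
    "SELECT table2.ID2 FROM table2 "
    "JOIN table3 ON table3.ID2 = table2.ID2"
).fetchall()

# EXISTS only asks whether at least one match exists, so each table2 row
# appears at most once.
filtered = conn.execute(
    "SELECT table2.ID2 FROM table2 "
    "WHERE EXISTS (SELECT 1 FROM table3 WHERE table3.ID2 = table2.ID2)"
).fetchall()

print(len(joined), len(filtered))
```

Here ID2 = 10 comes back three times from the join and once from the EXISTS form, which is why the joined result can be far larger than the row estimate for the driving table.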

How to find next free unique 4-digit number

In my db application I have a requirement for a unique, 4-digit number field for each customer. Up until 9999 I can just use autoincrements, but after that I will have to reuse numbers of customers that have been deleted (there won't be more than 5000 customers at a given time but there may be more than 9999 customers over the lifetime of the system).
Question 1: Is there a (My)SQL statement to find the next reusable free number?
Question 2: If I get the number, assign it to a new customer and save the customer all in one transaction, similar transactions taking place at the same time will be sequentialized by the database so the numbers won't collide, right?
You'd be better off storing a table with all 10,000 possible values defined, and an "in-use" flag on each. That way, releasing the number for re-use is a simple update to set "inuse=false".
Also makes finding the lowest available value a simple
SELECT idstring
FROM idstringtable
WHERE (available = 1)
ORDER BY idstring ASC
LIMIT 1
Doing that with appropriate locks/transactions would prevent two or more requests getting the same ID, and since it's a small table, doing a global table lock would not significantly impact performance.
Otherwise, you'd be stuck rummaging around your users table, trying to find the first "gap" in the numbering sequence.
If you MUST use this model (and I would recommend against it) then I would create a pool of "available" numbers and when creating the account, just grab the TOP 1 from that. Then, when a user is deleted return the number to the pool.
This is to find the first available slot:
select i1.id + 1 as FirstAvailable
from issues i1 left join issues i2 on (i1.id = i2.id - 1)
where i2.id is null
limit 1
This was run against a production Redmine instance to find the first missing id. Adjust accordingly to your needs.
The recommendations to use a separate table to track the IDs that are in use will work, but if you do not want to use a separate table to track used IDs you could do a self join to find a gap in the id numbers. The self join is pretty simple:
select top 1 t1.id + 1
from table t1
left join table t2 on t1.id = t2.id - 1
where t1.id < 10000
and t2.id is null
In MS SQL Server I use TOP 1 to get the topmost result; in MySQL the equivalent is a LIMIT 1 clause at the end of the query.
The above answer (by Adrian Carneiro) is fantastic, and works unless the table uses a different field as primary key and does NOT have a key for 'id'.
Given a table with a primary key of userid :-
MariaDB [unixua]> select userid, uid from accounts;
+---------+----------+
| userid | uid |
+---------+----------+
| acc0001 | 89814678 |
| acc0002 | 38000474 |
| acc0005 | 38000475 |
| acc0017 | 38000478 |
+---------+----------+
4 rows in set (0.00 sec)
We'd expect the lowest free number to be 38000476.
MariaDB [unixua]> SELECT t1.uid +1 FROM accounts t1
LEFT JOIN accounts t2 ON (t1.uid +1 = t2.uid)
WHERE t2.uid IS NULL AND t1.uid>38000474 LIMIT 1;
+-----------+
| t1.uid +1 |
+-----------+
| 89814679 |
+-----------+
1 row in set (0.00 sec)
But because MySQL/MariaDB is selecting the rows in primary key order, this fails and returns the number following the uid of "acc0001".
By adding a key to the uid column and only performing the SELECT on the "uid" column, MySQL/MariaDB will use the index to retrieve data (instead of reading the table). Since the index is "ordered", the result is different :-
MariaDB [unixua]> alter table accounts add unique index (uid);
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [unixua]> SELECT t1.uid +1 FROM accounts t1
LEFT JOIN accounts t2 ON (t1.uid +1 = t2.uid)
WHERE t2.uid IS NULL AND t1.uid>38000474 LIMIT 1;
+-----------+
| t1.uid +1 |
+-----------+
| 38000476 |
+-----------+
1 row in set (0.00 sec)
Make sure your table has a key for the Customer ID field (and that customer ID field is numeric).
This works because the optimiser can retrieve all necessary data for the select from the index (aka accounts.myi, not accounts.myd), not the table data.
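Since a query without ORDER BY has no guaranteed row order, a variant that does not depend on which index the optimiser picks is to sort explicitly before taking the first row. The following sketch reproduces the accounts example on SQLite through Python's sqlite3 module (an assumption made so it runs anywhere); first_free_uid is a made-up helper name, and an index on uid is still worth having for speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (userid TEXT PRIMARY KEY, uid INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        ("acc0001", 89814678),
        ("acc0002", 38000474),
        ("acc0005", 38000475),
        ("acc0017", 38000478),
    ],
)

# Same self join as above, plus ORDER BY so that LIMIT 1 always takes the
# lowest gap regardless of how the rows are read.
FIRST_GAP = """
SELECT t1.uid + 1
FROM accounts AS t1
LEFT JOIN accounts AS t2 ON t2.uid = t1.uid + 1
WHERE t2.uid IS NULL AND t1.uid > ?
ORDER BY t1.uid
LIMIT 1
"""


def first_free_uid(above):
    """Lowest uid + 1 that is unused, considering only rows with uid > above."""
    row = conn.execute(FIRST_GAP, (above,)).fetchone()
    return row[0] if row else None


print(first_free_uid(38000474))
```

Like the original query, this only finds gaps that follow an existing uid, so a free number below the lowest uid in the table would not be reported.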