Apache Drill: Providing a limit in the subquery for a lateral join is not returning the correct results - apache-drill

I am trying to create a simple query with an inner lateral join, but I want to restrict the join to a single result in the subquery:
select b.`CODE`
from foo.bar.`BRANCH` b
inner join lateral (
select branch_id
from foo.bar.`BRANCH_DISTANCE`
where branch_id=b.CODE
and distance < 100
limit 1
) on true
The BRANCH_DISTANCE table contains the distances between any two branches and I want to return all branches that are within 100 km of another branch, which is why in the subquery, as long as there is one record that contains the branch and its distance is less than 100, it should return the branch (and stop looking for any further matches).
But when I add the limit, the query returns only one record. On removing the limit, around 2000 records are returned.
If I replace the select b.CODE with select distinct b.CODE, I get around 500 results (which is the correct answer).
My objective is to avoid the distinct keyword in the select statement. That is why I added the limit to the subquery: the join should not be performed against every record in the BRANCH_DISTANCE table that contains the branch code and has distance < 100 (a branch can be less than 100 km away from more than one branch).

A join multiplies the number of result rows when the join column contains duplicate values (here, branch_id, b.CODE, or both may contain duplicates).
To restrict the match to at most one result per outer row, use an IN clause instead of a join.
Something like this should work as expected:
select b.`CODE`
from foo.bar.`BRANCH` b
where b.`CODE` in (
select branch_id
from foo.bar.`BRANCH_DISTANCE`
where distance < 100
)
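The difference between the join and the IN form can be seen on a tiny data set. The following is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in for Drill; the table contents and the other_id column are made up for illustration, and the plain join stands in for the lateral join without the limit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE branch (code TEXT PRIMARY KEY);
    CREATE TABLE branch_distance (branch_id TEXT, other_id TEXT, distance REAL);
    INSERT INTO branch VALUES ('A'), ('B'), ('C');
    -- Hypothetical data: A is near two branches, B near one, C near none.
    INSERT INTO branch_distance VALUES
        ('A', 'B', 40), ('A', 'C', 250), ('A', 'D', 90),
        ('B', 'A', 40), ('C', 'A', 250);
""")

# A plain join emits one row per matching distance record, so A appears twice.
join_rows = conn.execute("""
    SELECT b.code
    FROM branch b
    JOIN branch_distance d ON d.branch_id = b.code AND d.distance < 100
    ORDER BY b.code
""").fetchall()

# IN only tests membership, so each branch appears at most once.
in_rows = conn.execute("""
    SELECT b.code
    FROM branch b
    WHERE b.code IN (SELECT branch_id FROM branch_distance WHERE distance < 100)
    ORDER BY b.code
""").fetchall()

print(join_rows)  # [('A',), ('A',), ('B',)]
print(in_rows)    # [('A',), ('B',)]
```

Because IN is a membership test rather than a join, no distinct is needed to remove duplicates.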

Related

In query with joins and multi-table/field ORDER BY, how to set LIMIT offset to start from a particular row identified by a unique id field?

Suppose I have four tables: tbl1 ... tbl4. Each has a unique numerical id field. tbl1, tbl2 and tbl3 each has a foreign key field for the next table in the sequence. E.g. tbl1 has a tbl2_id foreign key field, and so on. Each table also has a field order (and other fields not relevant to the question).
It is straightforward to join all four tables to return all rows of tbl1 together with corresponding fields from the other three tables. It is also easy to order this result set by a specific ORDER BY combination of the order fields. It is also easy to return just the row that corresponds to some particular id in tbl1, e.g. WHERE tbl1.id = 7777.
QUESTION: what query most efficiently returns (e.g.) 100 rows, starting from the row corresponding to id=7777, in the order determined by the specific combination of order fields?
One approach would be to use ROW_NUMBER (or an emulation of it in MySQL versions < 8) to get the position of the id=7777 row, and then use that position in a new version of the same query to set the offset in the LIMIT clause (with a read lock in between). But can it be done in a single query?
# FIRST QUERY: get row number of result row where tbl1.id = 7777
SELECT x.row_number
FROM
(SELECT @row_number:=@row_number+1 AS row_number, tbl1.id AS id
FROM (SELECT @row_number:=0) AS t, tbl1
INNER JOIN tbl2 ON tbl2.id = tbl1.tbl2_id
INNER JOIN tbl3 ON tbl3.id = tbl2.tbl3_id
INNER JOIN tbl4 ON tbl4.id = tbl3.tbl4_id
WHERE <some conditions>
ORDER BY tbl4.order, tbl3.order, tbl2.order, tbl1.order
) AS x
WHERE id=7777;
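Store the row number from the above query and use it to bind :offset in the following query. Since the row number starts at 1 and LIMIT offsets start at 0, bind :offset to the row number minus one so that the id=7777 row itself is included.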
# SECOND QUERY : Get 100 rows starting from the one with id=7777
SELECT x.field1, x.field2, <etc.>
FROM
(SELECT @row_number:=@row_number+1 AS row_number, field1, field2
FROM (SELECT @row_number:=0) AS t, tbl1
INNER JOIN tbl2 ON tbl2.id = tbl1.tbl2_id
INNER JOIN tbl3 ON tbl3.id = tbl2.tbl3_id
INNER JOIN tbl4 ON tbl4.id = tbl3.tbl4_id
WHERE <same conditions as before>
ORDER BY tbl4.order, tbl3.order, tbl2.order, tbl1.order
) AS x
LIMIT :offset, 100;
Clarify question
In the general case, you won't ask for WHERE id1 > 7777. Instead, you have a tuple of (11,22,33,44) and you want to "continue where you left off".
Two discussions, with
That is messy, but not impossible. See "Iterating through a compound key". It gives an example of doing it with 2 columns; 4 columns coming from 4 tables is an extension of that approach.
A variation
Here is another discussion of such: https://dba.stackexchange.com/questions/164428/should-i-store-data-pre-ordered-rather-than-ordering-on-the-fly/164755#164755
In actually implementing such, I have found that letting the "100" (LIMIT) be flexible can be easier to think through. The idea is: reach forward 100 rows (with LIMIT 100,1). Let's say you get (111,222,333,444). If you are currently at (111, ...), then deal with id2/3/4. If it is, say, (113, ...), then do WHERE id1 < 113 and leave off any specification of id2/3/4. This means fetching fewer than 100 rows, but it lands you just shy of starting id1=113.
That is, it involves constructing a WHERE clause with between 1 and 4 conditions.
In all cases, your query says ORDER BY id1, id2, id3, id4. And the only use for LIMIT is in the probe to figure out how far ahead the 100th row is (with LIMIT 100,1).
I think I can dig out some old Perl code for that.
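The "continue where you left off" idea can be sketched with a two-column compound key. This is an illustration only, run against SQLite through Python's sqlite3 module rather than MySQL; the table t, its columns, and the helper next_page are hypothetical names, and the same WHERE pattern extends to four columns with more OR branches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id1 INTEGER, id2 INTEGER, PRIMARY KEY (id1, id2))")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(a, b) for a in range(1, 4) for b in range(1, 5)],  # 12 rows
)

def next_page(conn, after, page_size):
    """Return up to page_size rows that sort strictly after the tuple `after`."""
    a, b = after
    return conn.execute(
        """
        SELECT id1, id2
        FROM t
        WHERE id1 > ?                  -- a later id1 always qualifies
           OR (id1 = ? AND id2 > ?)    -- same id1: continue within it
        ORDER BY id1, id2
        LIMIT ?
        """,
        (a, a, b, page_size),
    ).fetchall()

page1 = next_page(conn, (0, 0), 5)
page2 = next_page(conn, page1[-1], 5)  # resume from the last row seen
page3 = next_page(conn, page2[-1], 5)
print(page1)
print(page2)
print(page3)
```

No offset is computed at any point: each page is located by the key of the last row already read, so the cost does not grow with how deep into the result set you are. To start from a given row inclusively, as in the question, the last comparison would be >= instead of >.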

Explaining MySQL query with multiple tables listed in FROM

Tables a and b are not directly related.
What effect does listing a, b in the FROM clause have on the results?
select * from a,b where b.id in (1,2,3)
Can you explain this SQL?
Since you haven't specified a relationship between a and b, this produces a cross product. It's equivalent to:
SELECT *
FROM a
CROSS JOIN b
WHERE b.id IN (1, 2, 3)
It will combine every row in a with the three selected rows from b. If a has 100 rows, the result will be 300 rows.
What you are using is a multitable SELECT.
Multitable SELECT (M-SELECT) is similar to the join operation. You
select values from different tables, use WHERE clause to limit the
rows returned and send the resulting single table back to the
originator of the query.
The difference with M-SELECT is that it would return multiple tables
as the result set. For more details: https://dev.mysql.com/worklog/task/?id=358
In other words, your query is:
SELECT *
FROM a
CROSS JOIN b
WHERE b.id in (1,2,3)
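The row arithmetic of the cross product can be checked directly. The sketch below assumes 100 rows in a and 5 rows in b (made-up data), and uses SQLite through Python's sqlite3 module as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1, 101)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1, 6)])

# Comma-separated tables with no join condition: every row of a pairs
# with every row of b that survives the WHERE filter.
comma_count = conn.execute(
    "SELECT COUNT(*) FROM a, b WHERE b.id IN (1, 2, 3)"
).fetchone()[0]

# The explicit CROSS JOIN spelling of the same query.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM a CROSS JOIN b WHERE b.id IN (1, 2, 3)"
).fetchone()[0]

print(comma_count, cross_count)  # 300 300
```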

How to join a derived table

I have a complex query which results in a table which includes a time column. There are always two rows with the same time.
The result also contains a value column. The value of two rows with the same time is always different.
I now want to extend the query to join the rows with the same time together. So my thought was to join the derived table like this:
SELECT A.time, A.value AS valueA, B.value as valueB FROM
(
OLD_QUERY
) AS A INNER JOIN A AS B ON
A.time=B.time AND
A.value <> B.value;
However, the JOIN A AS B part of the query does not work. A is not recognized as the derived table. MySQL is searching for a table A in the database and does not find it.
So the question is: How can I join a derived table?
You cannot join a single reference to a table (or subquery) to itself; a subquery must be repeated.
Example: You cannot even do
SELECT A.* FROM sometable AS A INNER JOIN A ...
The A after the INNER JOIN is invalid unless you actually have a real table called A.
You can insert the subquery's results into another table and use that, but it cannot be a true TEMPORARY table, as those cannot be joined to themselves or referenced twice at all in almost any query. By "referenced twice", I mean joined, unioned, or used in a "WHERE IN" subquery when it is already referenced in the FROM.
If nothing else distinguishes the rows, you can just use aggregation to get the two values:
select time, min(value), max(value)
from (<your query here>) a
group by time;
In MySQL 8+, you can use a CTE (common table expression):
with a as (
<your query here>
)
select a1.time, a1.value, a2.value
from a a1 join
a a2
on a1.time = a2.time and a1.value <> a2.value;
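Both suggestions can be tried on a small example. This sketch uses SQLite through Python's sqlite3 module as a stand-in for MySQL 8; the readings table is a hypothetical placeholder for the complex query. It also shows that joining on <> returns each pair twice (once in each order), whereas < keeps one row per time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (time TEXT, value INTEGER);
    INSERT INTO readings VALUES
        ('10:00', 1), ('10:00', 7),
        ('10:05', 3), ('10:05', 4);
""")

# The CTE is named once and referenced twice, which a derived table cannot do.
cte_sql = """
    WITH a AS (SELECT time, value FROM readings)
    SELECT a1.time, a1.value, a2.value
    FROM a AS a1
    JOIN a AS a2 ON a1.time = a2.time AND a1.value {op} a2.value
    ORDER BY a1.time, a1.value
"""
both_orders = conn.execute(cte_sql.format(op="<>")).fetchall()
one_per_time = conn.execute(cte_sql.format(op="<")).fetchall()

# The aggregation alternative needs no self-join at all.
aggregated = conn.execute("""
    SELECT time, MIN(value), MAX(value)
    FROM readings
    GROUP BY time
    ORDER BY time
""").fetchall()

print(len(both_orders))  # 4
print(one_per_time)      # [('10:00', 1, 7), ('10:05', 3, 4)]
print(aggregated)        # [('10:00', 1, 7), ('10:05', 3, 4)]
```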

How To Get The Number Interval Between Rows via MySQL

We have a table with two columns, ID and Value. The ID is the index of the table row, and the Value consists of a fixed string and a key (a number in hexadecimal), stored as a string in the database. Take 00001810010 as an example: the fixed string is 0000181 and the second part, 0010, is the key.
Table
ID Value
0 00001810000
1 00001810010
2 00001810500
3 00001810900
4 0000181090a
What I want to get from the above table is the Number Interval between rows, for above table the result is
[1, F], [11, 4FF], [501, 8FF], [901, 909]
I can read all the records into memory and handle them via C++, but is it possible to implement it through MySQL statements only? How?
I would be tempted to match up a row with the previous row with something like this:-
SELECT sub1.id AS this_row_id,
sub1.value AS this_row_value,
z.id AS prev_row_id,
z.value AS prev_row_value
FROM
(
SELECT a.id, a.value, MAX(b.id) AS bid
FROM some_table a
INNER JOIN some_table b
ON a.id > b.id
GROUP BY a.id, a.value
) sub1
INNER JOIN some_table z
ON z.id = sub1.bid
You might want to use LEFT OUTER JOINs rather than INNER JOINs depending on what you want for the first record (where there is no previous record to match on).
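Once each row is paired with its predecessor, the gaps follow from the two keys. The sketch below runs the pairing query against SQLite through Python's sqlite3 module and does the hexadecimal arithmetic in Python; the 7-character prefix length is taken from the example values. In MySQL itself the conversion could presumably stay in SQL with CONV(SUBSTRING(value, 8), 16, 10), which is not exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO some_table VALUES (?, ?)", [
    (0, "00001810000"), (1, "00001810010"), (2, "00001810500"),
    (3, "00001810900"), (4, "0000181090a"),
])

# Pair every row with the row that has the largest smaller id.
pairs = conn.execute("""
    SELECT z.value AS prev_row_value, sub1.value AS this_row_value
    FROM (
        SELECT a.id, a.value, MAX(b.id) AS bid
        FROM some_table a
        INNER JOIN some_table b ON a.id > b.id
        GROUP BY a.id, a.value
    ) sub1
    INNER JOIN some_table z ON z.id = sub1.bid
    ORDER BY sub1.id
""").fetchall()

PREFIX_LEN = 7  # length of the fixed string "0000181"
intervals = []
for prev_value, this_value in pairs:
    low = int(prev_value[PREFIX_LEN:], 16) + 1   # first unused key after prev
    high = int(this_value[PREFIX_LEN:], 16) - 1  # last unused key before this
    if low <= high:  # adjacent keys leave no gap
        intervals.append((format(low, "X"), format(high, "X")))

print(intervals)
# [('1', 'F'), ('11', '4FF'), ('501', '8FF'), ('901', '909')]
```

Read strictly as hexadecimal, the first gap ends at F, because 0010 is 16 in decimal.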

count(*) and count(column_name), what's the diff?

What's the difference between COUNT(*) and COUNT(column_name) in MySQL?
COUNT(*) counts all rows in the result set (or group if using GROUP BY).
COUNT(column_name) only counts those rows where column_name is NOT NULL. This may be slower in some situations even if there are no NULL values because the value has to be checked (unless the column is not nullable).
COUNT(1) is the same as COUNT(*) since 1 can never be NULL.
To see the difference in the results you can try this little experiment:
CREATE TABLE table1 (x INT NULL);
INSERT INTO table1 (x) VALUES (1), (2), (NULL);
SELECT
COUNT(*) AS a,
COUNT(x) AS b,
COUNT(1) AS c
FROM table1;
Result:
a b c
3 2 3
Depending on the column definition (i.e., whether your column allows NULL), you could get different results, and COUNT(column) could be slower in some situations, as Mark already said.
There is no performance difference between COUNT(*), COUNT(ColumnName), and COUNT(1).
Now, if you have COUNT(ColumnName), the database has to check whether the column has a NULL value, because NULLs are eliminated from aggregates. So COUNT(*) or COUNT(1) is preferable to COUNT(ColumnName), unless you want COUNT(DISTINCT ColumnName).
In most cases there's little difference, and COUNT(*) or COUNT(1) is generally preferred. However, there's one important situation where you must use COUNT(columnname): outer joins.
If you're performing an outer join from a parent table to a child table, and you want to get zero counts in rows that have no related items in the child table, you have to use COUNT(column in child table). When there are no matches, that column will be NULL, COUNT() skips it, and you get the desired zero count (COUNT() itself never returns NULL, so wrapping it in IFNULL() or COALESCE() is harmless but not required). If you use COUNT(*), it counts the row from the parent table, so you'll get a count of 1.
SELECT c.name, COALESCE(COUNT(o.id), 0) AS order_count
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id