Group by bucket (with NULL values) - mysql

I have the following tables:
entries (id, title, text, duplicate_bucket_id)
duplicate_buckets (id, comment)
So every entry can be in a duplicate bucket. Now I want to get all entries without the duplicates:
SELECT MIN(id) FROM entries GROUP BY duplicate_bucket_id
The problem with this query is that it also collapses all the entries without a duplicate_bucket_id into a single NULL group.
So I need something like this
(SELECT MIN(id) FROM entries WHERE duplicate_bucket_id IS NOT NULL GROUP BY duplicate_bucket_id)
UNION
(SELECT id FROM entries WHERE duplicate_bucket_id IS NULL)
This query gives me the correct result, but ActiveRecord can't use UNIONs.
Alternatively, I can use this query with a subquery:
SELECT * FROM entries WHERE duplicate_bucket_id IS NULL OR id IN
(SELECT MIN(id) FROM entries WHERE duplicate_bucket_id IS NOT NULL GROUP BY duplicate_bucket_id )
In this query, I must place any additional WHERE conditions both inside and outside of the subquery. So the query gets quite complicated, and I don't know yet how to use the Ransack gem with such a query...
The query would be simple if every "entry" were in a "duplicate_bucket", including buckets of size 1 (I could use *SELECT * FROM entries GROUP BY duplicate_bucket_id*). But I want to avoid putting an entry in a duplicate_bucket if the entry doesn't have a duplicate. Is there a simple query (no unions, no subqueries) to get all entries without their duplicates?
Dataset
entries(id, title, text, duplicate_bucket_id)
1, 'My title', 'Bla bla', 1
2, 'Hello', 'Jaha', 1
3, 'Test', 'Bla bla', 1
4, 'Foo', 'Bla', NULL
5, 'Bar1', '', 2
6, 'Bar2', '', 2
duplicate_buckets (id, comment)
1, 'This bucket has 3 entries'
2, 'Bar1 and Bar2 are duplicates!'
Result
1, 'My title', 'Bla bla', 1
4, 'Foo', 'Bla', NULL
5, 'Bar1', '', 2

ANSI/ISO SQL:
select *
from entries as e1
where not exists (select null from entries as e2 where e2.duplicate_bucket_id = e1.duplicate_bucket_id and e2.id < e1.id)
;
MySQL Terrible, Horrible, No Good, Very Bad syntax
select *
from entries
group by coalesce(-duplicate_bucket_id,id)
;
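To check the NOT EXISTS query against the dataset from the question, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs without a server; the query text itself is the one shown above, with an ORDER BY added for a stable result).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    "id INTEGER PRIMARY KEY, title TEXT, text TEXT, duplicate_bucket_id INTEGER)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [
        (1, "My title", "Bla bla", 1),
        (2, "Hello", "Jaha", 1),
        (3, "Test", "Bla bla", 1),
        (4, "Foo", "Bla", None),
        (5, "Bar1", "", 2),
        (6, "Bar2", "", 2),
    ],
)

# Keep a row only if no row in the same bucket has a smaller id.
# For a NULL bucket the equality is never true, so such rows are always kept.
rows = conn.execute(
    """
    SELECT e1.id, e1.title
    FROM entries AS e1
    WHERE NOT EXISTS (
        SELECT NULL
        FROM entries AS e2
        WHERE e2.duplicate_bucket_id = e1.duplicate_bucket_id
          AND e2.id < e1.id
    )
    ORDER BY e1.id
    """
).fetchall()

kept_ids = [row[0] for row in rows]
print(kept_ids)  # [1, 4, 5]
```

The NULL handling comes for free here: comparing NULL with anything is never true, so entry 4 has no "earlier duplicate" and survives.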

Related

Optimizing mySQL query to scale better

I want to query a database to retrieve rows for a list of names (the list is provided by the user in Python). I have two criteria. First, the results should appear in the order of the list the user provided, so if I say ...WHERE name = "Bob" OR name = "Alice", I want the results for Bob to come first, followed by those for Alice. Second, if a name is searched for twice, the result should contain it twice, so I want a way to write ...WHERE name = 'Bob' OR name = 'Bob' such that the rows for Bob appear twice in the result.
I came up with the following query:
SELECT * FROM
(SELECT *, 1 order_position FROM table WHERE name = 'Alice'
UNION ALL
SELECT *, 2 order_position FROM table WHERE name = 'Bob'
UNION ALL
SELECT *, 3 order_position FROM table WHERE name = 'Charlie'
UNION ALL
SELECT *, 4 order_position FROM table WHERE name = 'Dan'
) r ORDER BY order_position
This query works well, but when the user submits hundreds of names and there are hundreds of UNION ALL sections, the query becomes extremely slow. Is there a way to improve the performance of the query while maintaining the two criteria mentioned before?
SELECT *, CASE name WHEN 'Alice' THEN 1
WHEN 'Bob' THEN 2
WHEN 'Charlie' THEN 3
WHEN 'Dan' THEN 4
END AS order_position
FROM table
WHERE name IN ('Alice', 'Bob', 'Charlie', 'Dan')
ORDER BY order_position;
or without additional column:
SELECT *
FROM table
WHERE name IN ('Alice', 'Bob', 'Charlie', 'Dan')
ORDER BY CASE name WHEN 'Alice' THEN 1
WHEN 'Bob' THEN 2
WHEN 'Charlie' THEN 3
WHEN 'Dan' THEN 4
END;
P.S. For this particular set of names, ORDER BY name is enough.
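Since the name list comes from Python, the CASE expression can be generated with bound parameters instead of string concatenation. The following is only a sketch under assumptions: it runs against SQLite through the standard sqlite3 module rather than MySQL, and the table `people` and the helper `fetch_in_requested_order` are hypothetical names.

```python
import sqlite3


def fetch_in_requested_order(conn, names):
    """Return (id, name) rows ordered by the first appearance of name in `names`."""
    unique = list(dict.fromkeys(names))  # drop repeats, keep first position
    if not unique:
        return []
    in_marks = ", ".join("?" for _ in unique)
    whens = " ".join(f"WHEN ? THEN {pos}" for pos in range(1, len(unique) + 1))
    sql = (
        f"SELECT id, name FROM people WHERE name IN ({in_marks}) "
        f"ORDER BY CASE name {whens} END, id"
    )
    # Placeholders are bound in textual order: the IN list first, then the WHENs.
    return conn.execute(sql, unique + unique).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "Bob")],
)

ordered = fetch_in_requested_order(conn, ["Charlie", "Bob"])
print(ordered)  # [(3, 'Charlie'), (2, 'Bob'), (4, 'Bob')]
```

Only the positions (integers produced by the code) are interpolated into the SQL text; the user-supplied names always travel as parameters.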
How does this handle the requirement of repeating some results? – Willem Renzema
If you need the repeats, then you must convert the list to a rowset.
SELECT table.*
FROM table
JOIN ( SELECT 1 pos, 'Alice' name UNION ALL
SELECT 2 , 'Bob' UNION ALL
SELECT 3 , 'Charlie' UNION ALL
SELECT 4 , 'Bob' UNION ALL
SELECT 5 , 'Charlie' ) names USING (name)
ORDER BY names.pos
Somehow you have to construct the list of names with the order_position of each name.
You can do this in a query which uses UNION ALL to preserve duplicate names like this:
SELECT 'Alice' name, 1 order_position UNION ALL
SELECT 'Bob', 2 UNION ALL
SELECT 'Charlie', 3 UNION ALL
SELECT 'Dan', 4 UNION ALL
SELECT 'Alice', 1 UNION ALL
SELECT 'Bob', 2 UNION ALL
...............................
Then all you have to do is join it to the table:
SELECT t.*
FROM tablename t
INNER JOIN (
SELECT 'Alice' name, 1 order_position UNION ALL
SELECT 'Bob', 2 UNION ALL
SELECT 'Charlie', 3 UNION ALL
SELECT 'Dan', 4 UNION ALL
SELECT 'Alice', 1 UNION ALL
SELECT 'Bob', 2 UNION ALL
...............................
) n ON n.name = t.name
ORDER BY n.order_position;
In MySQL 8.0.19+ (which added the VALUES ROW(...) table value constructor) you can use a CTE:
WITH cte(name, order_position) AS (VALUES
ROW('Alice', 1), ROW('Bob', 2), ROW('Charlie', 3),
ROW('Dan', 4), ROW('Alice', 1), ROW('Bob', 2),
...................................................
)
SELECT t.*
FROM tablename t INNER JOIN cte c
ON c.name = t.name
ORDER BY c.order_position;
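The join against a list of (name, order_position) pairs can be built from a Python list with bound parameters. This is a sketch, not the answer's exact code: it assumes SQLite via the standard sqlite3 module (whose VALUES syntax has no ROW keyword), and the table `people` and the helper `fetch_with_repeats` are hypothetical.

```python
import sqlite3


def fetch_with_repeats(conn, names):
    """Return rows in list order; a name listed twice yields its rows twice."""
    if not names:
        return []
    value_rows = ", ".join("(?, ?)" for _ in names)
    # Flatten [(1, 'Bob'), (2, 'Alice'), ...] into [1, 'Bob', 2, 'Alice', ...].
    params = [value for pair in enumerate(names, start=1) for value in pair]
    sql = f"""
        WITH wanted(order_position, name) AS (VALUES {value_rows})
        SELECT wanted.order_position, t.id, t.name
        FROM wanted
        INNER JOIN people AS t ON t.name = wanted.name
        ORDER BY wanted.order_position, t.id
    """
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Charlie")],
)

result = fetch_with_repeats(conn, ["Bob", "Alice", "Bob"])
print(result)  # [(1, 2, 'Bob'), (2, 1, 'Alice'), (3, 2, 'Bob')]
```

Each list element gets its own position, so a repeated name becomes two rows in the derived table and therefore two matches in the join.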
The following will be "fast" if name is indexed:
WHERE name IN ('Alice', 'Bob', 'Charlie', 'Dan')
ORDER BY FIND_IN_SET(name, 'Alice,Bob,Charlie,Dan')
Note the syntax difference between Where and Order.
The following is likely to be somewhat slower because it cannot use any index but is simpler to code:
WHERE FIND_IN_SET(name, 'Alice,Bob,Charlie,Dan')
ORDER BY FIND_IN_SET(name, 'Alice,Bob,Charlie,Dan')
Note the restriction in FIND_IN_SET that commas cannot be used in the items.
In no case will CASE or FIND_IN_SET() use an index. (Cf "sargable")
Repeats
If, say, there are multiple "Bobs", then each of these has exactly the same effect as the above:
name IN ('Alice', 'Bob', 'Charlie', 'Bob', 'Dan')
FIND_IN_SET(name, 'Alice,Bob,Charlie,Bob,Dan')
That is, all the Bobs will be listed in the output before all the Charlies. Furthermore, no individual row is listed twice.
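The point about repeats is easy to confirm: listing a value twice inside IN (...) does not duplicate rows. A small sketch, assuming SQLite through Python's sqlite3 module in place of MySQL and a hypothetical `people` table (FIND_IN_SET is MySQL-specific and not shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Charlie")],
)

# IN is a membership test on each row, so a repeated value changes nothing.
listed_once = conn.execute(
    "SELECT name FROM people WHERE name IN ('Alice', 'Bob') ORDER BY id"
).fetchall()
listed_twice = conn.execute(
    "SELECT name FROM people WHERE name IN ('Alice', 'Bob', 'Bob') ORDER BY id"
).fetchall()

print(listed_once == listed_twice)  # True
```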

One row per group with multiple column sorting

Would like to return one row per group, where the one is selected by multiple sort columns. Treading lightly here in the land of greatest-n-per-group to avoid a duplicate question.
SCHEMA:
CREATE TABLE logs (
id INT NOT NULL,
ip_address INT NOT NULL,
status INT NOT NULL,
PRIMARY KEY (id)
);
DATA:
INSERT INTO logs (id, ip_address, status)
VALUES ('1', 19216800, 1),
('2', 19216801, 2),
('3', 19216800, 2),
('4', 19216803, 0),
('5', 19216804, 0),
('6', 19216803, 0),
('7', 19216804, 1);
CURRENT QUERY:
SELECT *
FROM logs
ORDER BY ip_address, status=1 DESC, id DESC
Note: sorting by status=1 effectively turns the status column into a boolean. The tie breaker after status=1 is id. This query currently returns the correct row for each ip_address first and then a bunch of other rows I don't want for that ip_address.
CURRENT OUTPUT:
1, 19216800, 1
3, 19216800, 2
2, 19216801, 2
6, 19216803, 0
4, 19216803, 0
7, 19216804, 1
5, 19216804, 0
WANTED OUTPUT:
1, 19216800, 1
2, 19216801, 2
6, 19216803, 0
7, 19216804, 1
Today my workaround is to filter in PHP with if ($lastIP == $row['ip_address']) continue;. But I would like to move this logic to MySQL.
Try this -
SELECT MIN(id), ip_address, status
FROM logs
GROUP BY ip_address, status
Since there are already hundreds of solutions for greatest-n-per-group problems in MySQL, I'm going to start answering these questions with CTE syntax with window functions, since that is now available in MySQL 8.0.3.
WITH sorted AS (
SELECT id, ip_address, status,
ROW_NUMBER() OVER (PARTITION BY ip_address ORDER BY status = 1 DESC, id DESC) AS rn
FROM logs
)
SELECT * FROM sorted WHERE rn = 1;
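As a check of the window-function approach against the sample data, the sketch below ranks rows per ip_address using the ordering from the question (status = 1 first, then highest id). It assumes SQLite 3.25 or newer through Python's sqlite3 module as a stand-in for MySQL 8.0; the final ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    "id INTEGER PRIMARY KEY, ip_address INTEGER NOT NULL, status INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO logs (id, ip_address, status) VALUES (?, ?, ?)",
    [
        (1, 19216800, 1),
        (2, 19216801, 2),
        (3, 19216800, 2),
        (4, 19216803, 0),
        (5, 19216804, 0),
        (6, 19216803, 0),
        (7, 19216804, 1),
    ],
)

# rn = 1 marks the best row of each ip_address partition.
best_rows = conn.execute(
    """
    WITH sorted AS (
        SELECT id, ip_address, status,
               ROW_NUMBER() OVER (
                   PARTITION BY ip_address
                   ORDER BY status = 1 DESC, id DESC
               ) AS rn
        FROM logs
    )
    SELECT id, ip_address, status
    FROM sorted
    WHERE rn = 1
    ORDER BY ip_address
    """
).fetchall()

for row in best_rows:
    print(row)
```

The four rows produced match the WANTED OUTPUT listed in the question.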
Here is a different way to think about the problem. You want to find the "best" row for each ip_address. In other words, you want to select rows where no better row exists.
This solution works for MySQL versions before 8.0. In other words, it works with the version you already have installed with RHEL 7. You can extend this technique easily for an arbitrary number of sort columns.
SELECT a.*
FROM (SELECT * FROM logs) a
LEFT JOIN (SELECT * FROM logs) b
ON (b.ip_address = a.ip_address AND (b.status=1) > (a.status=1))
OR (b.ip_address = a.ip_address AND (b.status=1) = (a.status=1) AND b.id > a.id)
WHERE b.id IS NULL
ORDER BY a.ip_address
If you have more columns to sort by, keep adding OR clauses to handle tie breaks and select the "best" row for each ip_address. Regardless of how complicated your subquery is or how many "ORDER BY" conditions you have, you will only need one LEFT JOIN with this technique.
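The "no better row exists" idea can be tried on the sample data as follows. This is a sketch assuming SQLite via Python's sqlite3 module instead of MySQL, using the schema's column name status and joining the table to itself directly rather than through derived tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    "id INTEGER PRIMARY KEY, ip_address INTEGER NOT NULL, status INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO logs (id, ip_address, status) VALUES (?, ?, ?)",
    [
        (1, 19216800, 1),
        (2, 19216801, 2),
        (3, 19216800, 2),
        (4, 19216803, 0),
        (5, 19216804, 0),
        (6, 19216803, 0),
        (7, 19216804, 1),
    ],
)

# b is a "better" row than a when it wins on status = 1,
# or ties on status = 1 and has the larger id.
# Rows of a with no better b (b.id IS NULL) are the winners.
winners = conn.execute(
    """
    SELECT a.id, a.ip_address, a.status
    FROM logs AS a
    LEFT JOIN logs AS b
      ON (b.ip_address = a.ip_address AND (b.status = 1) > (a.status = 1))
      OR (b.ip_address = a.ip_address AND (b.status = 1) = (a.status = 1)
          AND b.id > a.id)
    WHERE b.id IS NULL
    ORDER BY a.ip_address
    """
).fetchall()

winner_ids = [row[0] for row in winners]
print(winner_ids)  # [1, 2, 6, 7]
```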
Try this:
SELECT
l.`ip_address` , l.`status`
FROM
`logs` l
GROUP BY l.`ip_address`
ORDER BY l.`status` = 1 DESC

MySQL: troubles with HAVING COUNT 'exact', not 'at least'

I have a table that holds relations of users participating in conversations like follows:
CREATE TABLE `so` (
`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`user_id` int(11) NOT NULL,
`conversation_id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `so`
ADD UNIQUE KEY `uc` (`user_id`,`conversation_id`) USING BTREE;
INSERT INTO `so` (`id`, `user_id`, `conversation_id`) VALUES
(1, 1, 1),
(3, 1, 2),
(2, 2, 1),
(4, 2, 2),
(5, 3, 2);
According to the sample data, users 1 and 2 are in conversation 1, and users 1, 2, and 3 are in conversation 2.
I need to get unique conversation_id for the list of user ids.
My current query is:
SELECT conversation_id, COUNT(user_id) as usersCount
FROM so
WHERE user_id IN (1,2)
GROUP BY conversation_id
HAVING usersCount = 2
ORDER BY NULL
But it returns two rows, one for each conversation, while I expect only the row with conversation_id 1.
How can I select the row that belongs exactly to users 1 and 2, and not to 1, 2, 3? Thanks.
UPDATE:
I can't use subqueries or joins for performance reasons, because the user list in the query may contain up to 30 ids, and I'm afraid 30 subqueries is not an option.
You can use group_concat
select conversation_id
from so
group by conversation_id
having group_concat(user_id order by user_id) = '1,2';
To avoid full index scan, you can put your original query in a subquery:
SELECT a.conversation_id
FROM (
SELECT conversation_id
FROM so
WHERE user_id IN (1,2)
GROUP BY conversation_id
HAVING COUNT(conversation_id) = 2) a
JOIN so b ON a.conversation_id = b.conversation_id
GROUP BY a.conversation_id
HAVING COUNT(a.conversation_id) = 2;
Instead of checking the user_id in the WHERE clause, compare the number of rows that satisfy that condition with the total rows for each conversation.
SELECT conversation_id, COUNT(*) AS allCount, SUM(user_id IN (1, 2)) AS userCount
FROM so
GROUP BY conversation_id
HAVING allCount = 2 AND allCount = userCount
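For a user list of variable length (the question mentions up to 30 ids), the same comparison of counts can be parameterized. The sketch below assumes SQLite through Python's sqlite3 module rather than MySQL; the helper name `conversations_with_exactly` is hypothetical.

```python
import sqlite3


def conversations_with_exactly(conn, user_ids):
    """Return ids of conversations whose participants are exactly `user_ids`."""
    wanted = sorted(set(user_ids))
    if not wanted:
        return []
    marks = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT conversation_id
        FROM so
        GROUP BY conversation_id
        HAVING COUNT(*) = ?                        -- total participants
           AND SUM(user_id IN ({marks})) = ?       -- participants from the list
        ORDER BY conversation_id
    """
    n = len(wanted)
    return [row[0] for row in conn.execute(sql, [n, *wanted, n])]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE so ("
    "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
    "conversation_id INTEGER NOT NULL, UNIQUE (user_id, conversation_id))"
)
conn.executemany(
    "INSERT INTO so (id, user_id, conversation_id) VALUES (?, ?, ?)",
    [(1, 1, 1), (3, 1, 2), (2, 2, 1), (4, 2, 2), (5, 3, 2)],
)

print(conversations_with_exactly(conn, [1, 2]))     # [1]
print(conversations_with_exactly(conn, [1, 2, 3]))  # [2]
```

The unique key on (user_id, conversation_id) is what makes the counts meaningful: each user appears at most once per conversation, so matching both counts to the list length means the participant set is exactly the list.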
This answer is an alternative to those already given, and should be more efficient because it avoids subselects.
HAVING COUNT(user_id IN (1, 2) OR NULL) = 2 specifies that you want conversations containing both user 1 and user 2 (the expression is NULL, and therefore not counted, for every other user).
COUNT(user_id) = 2 says that there can only be 2 users in the conversation.
You could even remove COUNT(user_id) as usersCount from the result set if you don't actually use it.
SELECT conversation_id, COUNT(user_id) as usersCount
FROM so
GROUP BY conversation_id
HAVING COUNT(user_id IN (1, 2) OR NULL) = 2 AND
COUNT(user_id) = 2;
To avoid a full index scan you would have to use a WHERE clause, as @Fabricator has shown in his answer. Conditions on groups of rows require grouping first and then evaluating the aggregates, whereas a WHERE clause only applies conditions to single rows. How big is your table, out of interest?

SELECT WHERE IN - mySQL

let's say I have the following Table:
ID, Name
1, John
2, Jim
3, Steve
4, Tom
I run the following query
SELECT Id FROM Table WHERE NAME IN ('John', 'Jim', 'Bill');
I want to get something like:
ID
1
2
NULL or 0
Is it possible?
How about this?
SELECT Id FROM Table WHERE NAME IN ('John', 'Jim', 'Bill')
UNION
SELECT null;
Start by creating a subquery of names you're looking for, then left join the subquery to your table:
SELECT myTable.ID
FROM (
SELECT 'John' AS Name
UNION SELECT 'Jim'
UNION SELECT 'Bill'
) NameList
LEFT JOIN myTable ON NameList.Name = myTable.Name
This will return null for each name that isn't found. To return a zero instead, just start the query with SELECT COALESCE(myTable.ID, 0) instead of SELECT myTable.ID.
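The left-join idea can be sketched from Python as follows. Assumptions: SQLite via the standard sqlite3 module stands in for MySQL, the list is passed as a VALUES CTE with an explicit position so the output order is defined, and `my_table` and `ids_for_names` are hypothetical names.

```python
import sqlite3


def ids_for_names(conn, names):
    """Return (name, id) per requested name, with 0 when the name is missing."""
    if not names:
        return []
    value_rows = ", ".join("(?, ?)" for _ in names)
    params = [value for pair in enumerate(names) for value in pair]
    sql = f"""
        WITH name_list(pos, name) AS (VALUES {value_rows})
        SELECT name_list.name, COALESCE(my_table.id, 0)
        FROM name_list
        LEFT JOIN my_table ON my_table.name = name_list.name
        ORDER BY name_list.pos
    """
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "John"), (2, "Jim"), (3, "Steve"), (4, "Tom")],
)

found = ids_for_names(conn, ["John", "Jim", "Bill"])
print(found)  # [('John', 1), ('Jim', 2), ('Bill', 0)]
```

Because the name list is on the left side of the join, every requested name produces at least one output row, matched or not.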
The question is a bit confusing. "IN" is a valid operator in SQL and it means a match with any of the values:
SELECT Id FROM Table WHERE NAME IN ('John', 'Jim', 'Bill');
Is the same as:
SELECT Id FROM Table WHERE NAME = 'John' OR NAME = 'Jim' OR NAME = 'Bill';
In your question you seem to want a result for each of the values, in order. This is accomplished by combining the results with UNION ALL (plain UNION eliminates duplicates and can change the order):
SELECT max(Id) FROM Table WHERE NAME = 'John' UNION ALL
SELECT max(Id) FROM Table WHERE NAME = 'Jim' UNION ALL
SELECT max(Id) FROM Table WHERE NAME = 'Bill';
The above will return one Id (the max) if there are matches and NULL if there are none (e.g. for Bill). Note that in general more than one row can match some of the names in your list; I used "max" to select one. You may be better off keeping the loop over the values outside the query, or joining the (ID, Name) table with other tables in your database directly, instead of building the list of IDs and then using it.

Alternative to Intersect in MySQL

I need to implement the following query in MySQL.
(select * from emovis_reporting where (id=3 and cut_name= '全プロセス' and cut_name='恐慌') )
intersect
( select * from emovis_reporting where (id=3) and ( cut_name='全プロセス' or cut_name='恐慌') )
I know that intersect is not in MySQL. So I need another way.
Please guide me.
Microsoft SQL Server's INTERSECT "returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operand". This is different from a standard INNER JOIN or WHERE EXISTS query.
SQL Server
CREATE TABLE table_a (
id INT PRIMARY KEY,
value VARCHAR(255)
);
CREATE TABLE table_b (
id INT PRIMARY KEY,
value VARCHAR(255)
);
INSERT INTO table_a VALUES (1, 'A'), (2, 'B'), (3, 'B');
INSERT INTO table_b VALUES (1, 'B');
SELECT value FROM table_a
INTERSECT
SELECT value FROM table_b
value
-----
B
(1 rows affected)
MySQL
CREATE TABLE `table_a` (
`id` INT NOT NULL AUTO_INCREMENT,
`value` varchar(255),
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
CREATE TABLE `table_b` LIKE `table_a`;
INSERT INTO table_a VALUES (1, 'A'), (2, 'B'), (3, 'B');
INSERT INTO table_b VALUES (1, 'B');
SELECT value FROM table_a
INNER JOIN table_b
USING (value);
+-------+
| value |
+-------+
| B |
| B |
+-------+
2 rows in set (0.00 sec)
SELECT value FROM table_a
WHERE (value) IN
(SELECT value FROM table_b);
+-------+
| value |
+-------+
| B |
| B |
+-------+
With this particular question, the id column is involved, so duplicate values will not be returned, but for the sake of completeness, here's a MySQL alternative using INNER JOIN and DISTINCT:
SELECT DISTINCT value FROM table_a
INNER JOIN table_b
USING (value);
+-------+
| value |
+-------+
| B |
+-------+
And another example using WHERE ... IN and DISTINCT:
SELECT DISTINCT value FROM table_a
WHERE (value) IN
(SELECT value FROM table_b);
+-------+
| value |
+-------+
| B |
+-------+
There is a more efficient way of generating an intersect, by using UNION ALL and GROUP BY. Performance was about twice as good in my tests on large datasets.
Example:
SELECT t1.value from (
(SELECT DISTINCT value FROM table_a)
UNION ALL
(SELECT DISTINCT value FROM table_b)
) AS t1 GROUP BY value HAVING count(*) >= 2;
It is more efficient because, with the INNER JOIN solution, MySQL fetches the results of the first query and then, for each row, looks up the result in the second query. With the UNION ALL/GROUP BY solution, it fetches the results of the first query, then the results of the second query, and groups them all together at once.
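To see that the emulations agree, the sketch below runs the DISTINCT ... IN form and the UNION ALL / GROUP BY form on the table_a / table_b sample and compares them with a native INTERSECT. It assumes SQLite via Python's sqlite3 module, which supports INTERSECT directly but does not accept parentheses around the members of a compound SELECT, so those are dropped; it says nothing about relative performance in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE table_b (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO table_a VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "B")])
conn.executemany("INSERT INTO table_b VALUES (?, ?)", [(1, "B")])

queries = {
    "distinct_in": """
        SELECT DISTINCT value FROM table_a
        WHERE value IN (SELECT value FROM table_b)
    """,
    "union_group": """
        SELECT value FROM (
            SELECT DISTINCT value FROM table_a
            UNION ALL
            SELECT DISTINCT value FROM table_b
        ) AS t1
        GROUP BY value
        HAVING COUNT(*) >= 2
    """,
    "native_intersect": """
        SELECT value FROM table_a
        INTERSECT
        SELECT value FROM table_b
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

The DISTINCT inside each branch of the UNION ALL matters: without it, the two 'B' rows of table_a alone would reach a count of 2 even if table_b had no 'B'.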
Your query would always return an empty recordset since cut_name= '全プロセス' and cut_name='恐慌' will never evaluate to true.
In general, INTERSECT in MySQL should be emulated like this:
SELECT *
FROM mytable m
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE (o.col1 = m.col1 OR (m.col1 IS NULL AND o.col1 IS NULL))
AND (o.col2 = m.col2 OR (m.col2 IS NULL AND o.col2 IS NULL))
AND (o.col3 = m.col3 OR (m.col3 IS NULL AND o.col3 IS NULL))
)
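The IS NULL branches are there because INTERSECT treats two NULLs as matching, while plain equality does not. A small sketch of the difference, assuming SQLite via Python's sqlite3 module in place of MySQL and two-column versions of the hypothetical mytable / othertable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 TEXT)")
conn.execute("CREATE TABLE othertable (col1 INTEGER, col2 TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, None), (2, "x")])
conn.executemany("INSERT INTO othertable VALUES (?, ?)", [(1, None), (3, "y")])

# NULL-aware comparison: (1, NULL) in mytable matches (1, NULL) in othertable.
null_safe = conn.execute(
    """
    SELECT m.col1, m.col2
    FROM mytable AS m
    WHERE EXISTS (
        SELECT NULL
        FROM othertable AS o
        WHERE (o.col1 = m.col1 OR (m.col1 IS NULL AND o.col1 IS NULL))
          AND (o.col2 = m.col2 OR (m.col2 IS NULL AND o.col2 IS NULL))
    )
    """
).fetchall()

# Plain equality: NULL = NULL is not true, so the same row is lost.
plain = conn.execute(
    """
    SELECT m.col1, m.col2
    FROM mytable AS m
    WHERE EXISTS (
        SELECT NULL
        FROM othertable AS o
        WHERE o.col1 = m.col1 AND o.col2 = m.col2
    )
    """
).fetchall()

print(null_safe)  # [(1, None)]
print(plain)      # []
```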
If both your tables have columns marked as NOT NULL, you can omit the IS NULL parts and rewrite the query with a slightly more efficient IN:
SELECT *
FROM mytable m
WHERE (col1, col2, col3) IN
(
SELECT col1, col2, col3
FROM othertable o
)
I just checked it in MySQL 5.7 and am really surprised how no one offered a simple answer: NATURAL JOIN
When the tables (or SELECT results) have identical columns, you can use NATURAL JOIN as a way to find the intersection:
For example:
table1:
id, name, jobid
'1', 'John', '1'
'2', 'Jack', '3'
'3', 'Adam', '2'
'4', 'Bill', '6'
table2:
id, name, jobid
'1', 'John', '1'
'2', 'Jack', '3'
'3', 'Adam', '2'
'4', 'Bill', '5'
'5', 'Max', '6'
And here is the query:
SELECT * FROM table1 NATURAL JOIN table2;
Query Result:
id, name, jobid
'1', 'John', '1'
'2', 'Jack', '3'
'3', 'Adam', '2'
For completeness here is another method for emulating INTERSECT. Note that the IN (SELECT ...) form suggested in other answers is generally more efficient.
Generally for a table called mytable with a primary key called id:
SELECT a.id
FROM mytable AS a
INNER JOIN mytable AS b ON a.id = b.id
WHERE
(a.col1 = "someval")
AND
(b.col1 = "someotherval")
(Note that if you use SELECT * with this query you will get twice as many columns as are defined in mytable, because each joined row carries the columns of both a and b.)
The INNER JOIN here generates every permutation of row-pairs from your table. That means every combination of rows is generated, in every possible order. The WHERE clause then filters the a side of the pair, then the b side. The result is that only rows which satisfy both conditions are returned, just like intersection two queries would do.
Starting from MySQL 8.0.31 the INTERSECT is natively supported.
INTERSECT Clause:
SELECT ...
INTERSECT [ALL | DISTINCT] SELECT ...
[INTERSECT [ALL | DISTINCT] SELECT ...]
INTERSECT limits the result from multiple SELECT statements to those rows which are common to all.
Sample:
SELECT 1 AS col
INTERSECT
SELECT 1 AS col;
-- output
1
Break your problem into two statements. First, you want to select all rows if
(id=3 and cut_name= '全プロセス' and cut_name='恐慌')
is true. Second, you want to select all rows if
(id=3) and ( cut_name='全プロセス' or cut_name='恐慌')
is true. So we join both conditions with OR, because we want to select all rows if either of them is true.
select * from emovis_reporting
where (id=3 and cut_name= '全プロセス' and cut_name='恐慌') OR
( (id=3) and ( cut_name='全プロセス' or cut_name='恐慌') )
AFAIR, MySQL implements INTERSECT through INNER JOIN.
SELECT
campo1,
campo2,
campo3,
campo4
FROM tabela1
WHERE CONCAT(campo1,campo2,campo3,IF(campo4 IS NULL,'',campo4))
IN
(SELECT CONCAT(campo1,campo2,campo3,IF(campo4 IS NULL,'',campo4))
FROM tabela2);