Exotic GROUP BY In MySQL - mysql

Consider a typical GROUP BY statement in SQL: you have a table like
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| B | 2 |
| A | 3 |
| B | 4 |
+------+-------+
And you ask for
SELECT Name, SUM(Value) as Value
FROM table
GROUP BY Name
You'll receive
+------+-------+
| Name | Value |
+------+-------+
| A | 4 |
| B | 6 |
+------+-------+
In your head, you can imagine that SQL generates an intermediate sorted table like
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| A | 3 |
| B | 2 |
| B | 4 |
+------+-------+
and then aggregates together successive rows: the "Value" column has been given an aggregator (in this case SUM), so it's easy to aggregate. The "Name" column has been given no aggregator, and thus uses what you might call the "trivial partial aggregator": given two things that are the same (e.g. A and A), it aggregates them into a single copy of one of the inputs (in this case A). Given any other input it doesn't know what to do and is forced to begin aggregating anew (this time with the "Name" column equal to B).
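As a rough illustration of this mental model (not a claim about how MySQL actually executes GROUP BY), the "sort, then aggregate successive rows" idea can be sketched in Python. The function name `group_and_sum` is made up for this example.

```python
from itertools import groupby
from operator import itemgetter

# The sample table from above as (Name, Value) pairs.
rows = [("A", 1), ("B", 2), ("A", 3), ("B", 4)]

def group_and_sum(rows):
    """Sort by Name, then sum Value over each run of equal Names."""
    ordered = sorted(rows, key=itemgetter(0))  # the intermediate sorted table
    return [
        (name, sum(value for _, value in run))
        for name, run in groupby(ordered, key=itemgetter(0))
    ]

result = group_and_sum(rows)
print(result)  # [('A', 4), ('B', 6)]
```

Here `groupby` plays the role of the "trivial partial aggregator": it keeps extending the current run while the Name is unchanged and starts a new run otherwise.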
I want to do a more exotic kind of aggregation. My table looks like
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| BC | 2 |
| AY | 3 |
| AZ | 4 |
| B | 5 |
| BCR | 6 |
+------+-------+
And the intended output is
+------+-------+
| Name | Value |
+------+-------+
| A | 8 |
| B | 13 |
+------+-------+
Where does this come from? A and B are the "minimal prefixes" for this set of names: they occur in the data set and every Name has exactly one of them as a prefix. I want to aggregate data by grouping rows together when their Names have the same minimal prefix (and add the Values, of course).
In the toy grouping model from before, the intermediate sorted table would be
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| AY | 3 |
| AZ | 4 |
| B | 5 |
| BC | 2 |
| BCR | 6 |
+------+-------+
Instead of using the "trivial partial aggregator" for Names, we would use one that can aggregate X and Y together iff X is a prefix of Y; in that case it returns X. So the first three rows would be aggregated together into a row with (Name, Value) = (A, 8), then the aggregator would see that A and B couldn't be aggregated and would move on to a new "block" of rows to aggregate.
The tricky thing is that the value we're grouping by is "non-local": if A were not a name in the dataset, then AY and AZ would each be a minimal prefix. It turns out that the AY and AZ rows are aggregated into the same row in the final output, but you couldn't know that just by looking at them in isolation.
Miraculously, in my use case the minimal prefix of a string can be determined without reference to anything else in the dataset. (Imagine that each of my names is one of the strings "hello", "world", and "bar", followed by any number of z's. I want to group all of the Names with the same "base" word together.)
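To make the "hello / world / bar followed by z's" example concrete, here is a minimal Python sketch of computing the base word per row and grouping on it. The sample rows and the helper `base_word` are invented for illustration, and the sketch assumes no base word itself ends in "z".

```python
from collections import defaultdict

def base_word(name):
    # Assumption from the example: a Name is a base word plus trailing z's,
    # and no base word ends in "z", so stripping trailing z's recovers it.
    return name.rstrip("z")

# Hypothetical sample rows as (Name, Value) pairs.
rows = [("hello", 1), ("worldzz", 2), ("helloz", 3), ("bar", 4), ("barzzz", 5)]

sums = defaultdict(int)
for name, value in rows:
    sums[base_word(name)] += value

totals = dict(sums)
print(totals)  # {'hello': 4, 'world': 2, 'bar': 9}
```

The point is only that the grouping key depends on the row alone, not on the rest of the dataset.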
As I see it I have two options:
1) The simple option: compute the prefix for each row and group by that value directly. Unfortunately I have an index on the Name, and computing the minimal prefix (whose length depends on the Name itself) prevents me from using that index. This forces a full table scan, which is prohibitively slow.
2) The complicated option: somehow convince MySQL to use the "partial prefix aggregator" for Name. This runs into the "non-locality" problem above, but that's fine as long as we scan the table according to my index on Name, since then every minimal prefix will be encountered before any of the other strings it is a prefix of; we would never try to aggregate AY and AZ together if A were in the dataset.
In an imperative programming language #2 would be rather easy: extract rows one at a time, in alphabetical order, keeping track of the current prefix. If the new row's Name has the current prefix as a prefix, it goes in the bucket you're currently using. Otherwise, start a new bucket with the new row's Name as its prefix. In MySQL I am lost as to how to do it. Note that the set of minimal prefixes is not known beforehand.
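For reference, the row-at-a-time procedure described for option 2 might look like the following Python sketch. The name `sum_by_minimal_prefix` is hypothetical, and Python's code-point ordering stands in for the index order on Name (an assumption; a MySQL collation may order strings differently).

```python
def sum_by_minimal_prefix(rows):
    """One pass over rows sorted by Name, bucketing by the current minimal prefix."""
    buckets = []  # list of [prefix, total], in Name order
    for name, value in sorted(rows):
        if buckets and name.startswith(buckets[-1][0]):
            # Name extends the current prefix: same bucket.
            buckets[-1][1] += value
        else:
            # Name cannot be aggregated with the current prefix: new bucket.
            buckets.append([name, value])
    return [tuple(bucket) for bucket in buckets]

rows = [("A", 1), ("BC", 2), ("AY", 3), ("AZ", 4), ("B", 5), ("BCR", 6)]
result = sum_by_minimal_prefix(rows)
print(result)  # [('A', 8), ('B', 13)]
```

Because a string sorts before every string it is a prefix of, each minimal prefix is seen before the rows that belong to its bucket. If A were missing, AY and AZ would each start their own bucket.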

Edit 2
It occurred to me that if the table is ordered by Name, this would be a lot easier (and faster). Since I don't know if your data is sorted, I've included a sort in this query, but if the data is sorted, you can strip out (SELECT * FROM table1 ORDER BY Name) t1 and just use FROM table1
SELECT prefix, SUM(`Value`)
FROM (SELECT Name, Value, @prefix:=IF(Name NOT LIKE CONCAT(@prefix, '_%'), Name, @prefix) AS prefix
      FROM (SELECT * FROM table1 ORDER BY Name) t1
      JOIN (SELECT @prefix := '~') p
     ) t2
GROUP BY prefix
Edit
Having slept on the problem, I realised that there is no need to do the IN, it's enough to just have a WHERE NOT EXISTS clause on the JOINed table:
SELECT t1.Name, SUM(t2.Value) AS `Value`
FROM table1 t1
JOIN table1 t2 ON t2.Name LIKE CONCAT(t1.Name, '%')
WHERE NOT EXISTS (SELECT *
FROM table1 t3
WHERE t1.Name LIKE CONCAT(t3.Name, '_%')
)
GROUP BY t1.Name
Updated Explain (Name changed to UNIQUE key from PRIMARY)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | t1 | index | Name | Name | 11 | NULL | 6 | Using where; Using index; Using temporary; Using filesort |
| 1 | PRIMARY | t2 | ALL | NULL | NULL | NULL | NULL | 6 | Using where; Using join buffer (Block Nested Loop) |
| 3 | DEPENDENT SUBQUERY | t3 | index | NULL | Name | 11 | NULL | 6 | Using where; Using index |
Original Answer
Here is one way you could do it. First, you need to find all the unique prefixes in your table. You can do that by looking for all values of Name where it does not look like another value of Name with other characters on the end. This can be done with this query:
SELECT Name
FROM table1 t1
WHERE NOT EXISTS (SELECT *
FROM table1 t2
WHERE t1.Name LIKE CONCAT(t2.Name, '_%')
)
For your sample data, that will give
Name
A
B
Now you can sum all the values where the Name starts with one of those prefixes. Note we change the LIKE pattern in this query so that it also matches the prefix, otherwise we wouldn't count the values for A and B in your example:
SELECT t1.Name, SUM(t2.Value) AS `Value`
FROM table1 t1
JOIN table1 t2 ON t2.Name LIKE CONCAT(t1.Name, '%')
WHERE t1.Name IN (SELECT Name
FROM table1 t3
WHERE NOT EXISTS (SELECT *
FROM table1 t4
WHERE t3.Name LIKE CONCAT(t4.Name, '_%')
)
)
GROUP BY t1.Name
Output:
Name Value
A 8
B 13
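As a sanity check, both forms of the query (the IN version here and the shorter NOT EXISTS version from the Edit) can be run against the sample data. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for MySQL, so CONCAT(x, '%') is written as x || '%'; that substitution is an adaptation for SQLite, not part of the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (Name TEXT PRIMARY KEY, Value INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("A", 1), ("BC", 2), ("AY", 3), ("AZ", 4), ("B", 5), ("BCR", 6)],
)

# Original answer: restrict t1 to minimal prefixes with IN (...).
in_form = """
SELECT t1.Name, SUM(t2.Value) AS Value
FROM table1 t1
JOIN table1 t2 ON t2.Name LIKE t1.Name || '%'
WHERE t1.Name IN (SELECT Name
                  FROM table1 t3
                  WHERE NOT EXISTS (SELECT *
                                    FROM table1 t4
                                    WHERE t3.Name LIKE t4.Name || '_%'))
GROUP BY t1.Name
ORDER BY t1.Name
"""

# Edit: the same restriction applied directly with NOT EXISTS.
not_exists_form = """
SELECT t1.Name, SUM(t2.Value) AS Value
FROM table1 t1
JOIN table1 t2 ON t2.Name LIKE t1.Name || '%'
WHERE NOT EXISTS (SELECT *
                  FROM table1 t3
                  WHERE t1.Name LIKE t3.Name || '_%')
GROUP BY t1.Name
ORDER BY t1.Name
"""

in_result = conn.execute(in_form).fetchall()
not_exists_result = conn.execute(not_exists_form).fetchall()
print(in_result)          # [('A', 8), ('B', 13)]
print(not_exists_result)  # [('A', 8), ('B', 13)]
```

This only confirms the results on the six sample rows; it says nothing about how MySQL would plan or index either query.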
An EXPLAIN says that both of these queries use the index on Name, so should be reasonably efficient. Here is the result of the explain on my MySQL 5.6 server:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | t1 | index | PRIMARY | PRIMARY | 11 | NULL | 6 | Using index; Using temporary; Using filesort |
| 1 | PRIMARY | t3 | eq_ref | PRIMARY | PRIMARY | 11 | test.t1.Name | 1 | Using where; Using index |
| 1 | PRIMARY | t2 | ALL | NULL | NULL | NULL | NULL | 6 | Using where; Using join buffer (Block Nested Loop) |
| 3 | DEPENDENT SUBQUERY | t4 | index | NULL | PRIMARY | 11 | NULL | 6 | Using where; Using index |

Here are some hints on how to do the task. This locates any prefixes that are useful. That's not what you asked for, but the flow of the query and the usage of @variables, plus the need for 2 (actually 3) levels of nesting, might help you.
SELECT DISTINCT `Prev`
FROM
(
    SELECT @prev := @next AS 'Prev',
           @next := IF(LEFT(city, LENGTH(@prev)) = @prev, @next, city) AS 'Next'
    FROM ( SELECT @next := ' ' ) AS init
    JOIN ( SELECT DISTINCT city FROM us ) AS dedup
    ORDER BY city
) x
WHERE `Prev` = `Next` ;
Partial output:
+----------------+
| Prev |
+----------------+
| Alamo |
| Allen |
| Altamont |
| Ames |
| Amherst |
| Anderson |
| Arlington |
| Arroyo |
| Auburn |
| Austin |
| Avon |
| Baker |
Check the Al% cities:
mysql> SELECT DISTINCT city FROM us WHERE city LIKE 'Al%' ORDER BY city;
+-------------------+
| city |
+-------------------+
| Alabaster |
| Alameda |
| Alamo | <--
| Alamogordo | <--
| Alamosa |
| Albany |
| Albemarle |
...
| Alhambra |
| Alice |
| Aliquippa |
| Aliso Viejo |
| Allen | <--
| Allen Park | <--
| Allentown | <--
| Alliance |
| Allouez |
| Alma |
| Aloha |
| Alondra Park |
| Alpena |
| Alpharetta |
| Alpine |
| Alsip |
| Altadena |
| Altamont | <--
| Altamonte Springs | <--
| Alton |
| Altoona |
| Altus |
| Alvin |
+-------------------+
40 rows in set (0.01 sec)
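Outside SQL, the same "which names act as a prefix of a later name" scan could be sketched in Python as follows. The function name `useful_prefixes` is hypothetical, the city list is a subset of the names shown above, and Python's default string ordering is assumed to stand in for ORDER BY city.

```python
def useful_prefixes(names):
    """Return names that are a proper prefix of at least one later name in sorted order."""
    found = []
    current = None  # the prefix currently being tracked
    for name in sorted(set(names)):
        if current is not None and name.startswith(current):
            # name extends the tracked prefix; report the prefix once.
            if not found or found[-1] != current:
                found.append(current)
        else:
            current = name
    return found

cities = [
    "Alameda", "Alamo", "Alamogordo", "Alamosa", "Allen", "Allen Park",
    "Allentown", "Alliance", "Altamont", "Altamonte Springs", "Alton",
]
result = useful_prefixes(cities)
print(result)  # ['Alamo', 'Allen', 'Altamont']
```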

Related

Mysql optimized query and index for exclusion

Mysql optimized query and index with exclusion
When selecting from a high-volume table with criteria that exclude results, what are the possible alternatives?
For example, with the following table:
+----+---+---+----+----+
| id | A | B | C | D |
+----+---+---+----+----+
| 1 | a | b | c | d |
| 2 | a | b | c | d |
| 3 | a | b | c1 | d1 |
| 4 | a | b | c2 | d |
| 5 | a | b | c | d2 |
| 6 | a | b | c | d2 |
+----+---+---+----+----+
I would like to select all the tuples (C,D) where A=a and B=b and (C!=c or D!=d)
SELECT C,D FROM my_table WHERE A=a AND B=b AND (C!=c OR D!=d) GROUP BY C,D;
expected result:
(c1,d1)
(c2,d)
(c,d2)
I tried adding an index like this: CREATE INDEX idx_my_index ON my_table(A, B, C, D); but response times are still very long.
NB: I'm using MariaDB 10.3
The explain:
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
| 1 | SIMPLE | my_table | ref | idx_my_index | idx_my_index | 6 | const,const | 12055772 | Using where; Using index |
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
Is there some improvement I could make to my index or to the MariaDB configuration, or another way to write the select?
Specific solution: If we use this query as a subquery, we can use the FirstMatch strategy to avoid a full scan of the table. This is described at https://mariadb.com/kb/en/firstmatch-strategy/
SELECT * FROM my__second_table tbis
WHERE (tbis.C, tbis.D)
IN (SELECT C,D FROM my_table WHERE A=a AND B=b AND (C!=c OR D!=d));
Your index is optimal. Discussion:
INDEX(A, B, -- see Note 1
C, D) -- see note 2
Note 1: A,B can be in either order. These will be used for filtering on = to find possible rows. Confirmed by "const,const".
Note 2: C,D can be in either order. != does not work well for filtering, hence these come after A and B. They are included here to make the index "covering". Confirmed by "Using index".
"response times are still very long" -- 12M rows in the table? How many rows before the GROUP BY? How many rows in the result? (These might give us clues into where to go next.)
"Alternative". Probably SELECT DISTINCT ... instead of SELECT ... GROUP BY ... would run at the same speed. (But you could try it. Also, the EXPLAIN might be the same; the result should be the same.)
Please provide SHOW CREATE TABLE; it might give some more clues, such as NULL/NOT NULL and Engine type. (I don't hold out much hope here.)
Please provide EXPLAIN FORMAT=JSON SELECT ... -- This might give some more insight. Also: Turn on the Optimizer Trace.
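On the DISTINCT alternative mentioned above: a quick way to convince yourself that the two forms return the same rows is to run both on the sample table. The sketch below uses Python's sqlite3 as a stand-in for MariaDB (an assumption made only so the example is runnable), with the literals a, b, c, d quoted as strings. It checks results only and says nothing about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, A TEXT, B TEXT, C TEXT, D TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "b", "c", "d"),
        (2, "a", "b", "c", "d"),
        (3, "a", "b", "c1", "d1"),
        (4, "a", "b", "c2", "d"),
        (5, "a", "b", "c", "d2"),
        (6, "a", "b", "c", "d2"),
    ],
)
# Same index shape as in the question.
conn.execute("CREATE INDEX idx_my_index ON my_table(A, B, C, D)")

where = "WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd')"
grouped = conn.execute(
    f"SELECT C, D FROM my_table {where} GROUP BY C, D"
).fetchall()
distinct = conn.execute(
    f"SELECT DISTINCT C, D FROM my_table {where}"
).fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted lists.
print(sorted(grouped))   # [('c', 'd2'), ('c1', 'd1'), ('c2', 'd')]
print(sorted(distinct))  # [('c', 'd2'), ('c1', 'd1'), ('c2', 'd')]
```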

SQL Use Result from one Query for another Query

This is an excerpt from one table:
| id | type | other_id | def_id | ref_def_id|
| 1 | int | NULL | 5 | NULL |
| 2 | string | NULL | 5 | NULL |
| 3 | int | NULL | 5 | NULL |
| 20 | ref | 3 | NULL | 5 |
| 21 | ref | 4 | NULL | 5 |
| 22 | ref | 5 | NULL | 5 |
What I want is to find entries with type ref. Then I would for example have this one entry in my result:
| 22 | ref | 5 | NULL | 5 |
The problem I am facing is that I now want to combine this entry with other entries of the same table where def_id = 5.
So I would get all entries with def_id = 5 for this specific ref type as result. I somehow need the output from my first query, check what the ref_def_id is and then make another query for this id.
I really have problems to understand how to proceed. Any input is much appreciated.
If I understand correctly you need to find rows with a type of 'ref' and then use the values in their ref_def_id columns to get the rows with the same values in def_id. In that case you need to use a subquery for getting the rows with 'ref' type and combine it using either IN or EXISTS:
select *
from YourTable
where def_id in (select ref_def_id from YourTable where type='ref');
select *
from YourTable
where exists (select * from YourTable yt
where yt.ref_def_id=YourTable.def_id and yt.type='ref')
Both queries are equivalent. IN is easier to understand at first sight, but EXISTS allows more complex conditions (for example, you can use more than one column for combining with the subquery).
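To see the equivalence on the excerpt from the question, here is a small sketch using Python's sqlite3 as a stand-in database (an assumption for runnability). The outer table is given the alias t in the EXISTS form to make the correlation explicit; that alias is not in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable "
    "(id INTEGER, type TEXT, other_id INTEGER, def_id INTEGER, ref_def_id INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "int", None, 5, None),
        (2, "string", None, 5, None),
        (3, "int", None, 5, None),
        (20, "ref", 3, None, 5),
        (21, "ref", 4, None, 5),
        (22, "ref", 5, None, 5),
    ],
)

with_in = conn.execute("""
    SELECT id FROM YourTable
    WHERE def_id IN (SELECT ref_def_id FROM YourTable WHERE type = 'ref')
    ORDER BY id
""").fetchall()

with_exists = conn.execute("""
    SELECT id FROM YourTable t
    WHERE EXISTS (SELECT * FROM YourTable yt
                  WHERE yt.ref_def_id = t.def_id AND yt.type = 'ref')
    ORDER BY id
""").fetchall()

print(with_in)      # [(1,), (2,), (3,)]
print(with_exists)  # [(1,), (2,), (3,)]
```

The 'ref' rows themselves are not returned because their def_id is NULL, and NULL never compares equal to 5.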
Edit: since you comment that you need also the id from the 'ref' rows then you need to use a subquery:
select source_id, YourTable.*
from YourTable
join (select id as source_id, ref_def_id
from YourTable
where type='ref')
as refs on refs.ref_def_id=YourTable.def_id
order by source_id, id;
With this, for each 'ref' row you get all the rows whose def_id matches its ref_def_id.
use below query to get column from sub query.
select a.ref_def_id
from (select ref_def_id from YourTable where type='ref') as a;
What you are looking for is a subquery or even better a join operation.
Have a look here: http://www.mysqltutorial.org/mysql-left-join.aspx
Joins / the left join allows you to combine rows of tables within one query on a given condition. The condition could be id = 5 for your purpose.
You would seem to want aggregation:
select max(id) as id, type, max(other_id) as other_id,
max(def_id) as def_id, ref_def_id
from t
where type = 'ref'
group by type, ref_def_id

MySQL table order by one column when other column has a particular value

I have two mysql tables record_items,property_values with the following structure.
table : property_values (column REC is foreign key to record_items)
id(PK)|REC(FK)| property | value|
1 | 1 | name | A |
2 | 1 | age | 10 |
3 | 2 | name | B |
4 | 3 | name | C |
5 | 3 | age | 9 |
table: record_items
id(PK) |col1|col2 |col3|
1 | v11| v12 | v13|
2 | v21| v22 | v23|
3 | v31| v32 | v33|
4 | v41| v42 | v43|
5 | v51| v52 | v53|
The record_items table contains only basic information about the record, whereas the property_values table references record_items through a foreign key and stores each property and its value in a separate row.
Now I want to get the record_items sorted based on a particular property, say by age.
My HQL query will be like
Select distinct rec from PropertyValues where property="age" order by value;
But this query skips record 2, since it doesn't have an entry for the age property.
I expect the result to have the records which contains age property in sort order appended by those which don't have age property at all. How can I query that?
Here is a raw MySQL query which should do the trick:
SELECT t1.*
FROM record_items t1
LEFT JOIN property_values t2
ON t1.id = t2.REC AND
t2.property = 'age'
ORDER BY CASE WHEN t2.value IS NULL THEN 1 ELSE 0 END, t2.Value
I notice that your Value column in property_values is mixing numeric and text data. This won't work well for sorting purposes.
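To illustrate the sorting concern, here is a sketch that runs the query against the sample data with Python's sqlite3 (a stand-in for MySQL, assumed only for runnability). It compares ordering by the text value with ordering by the value cast to an integer. The extra t1.id tie-breaker and the CAST are additions for this example; in MySQL the cast would typically be written CAST(t2.value AS UNSIGNED).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_items (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.execute(
    "CREATE TABLE property_values "
    "(id INTEGER PRIMARY KEY, REC INTEGER, property TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO record_items VALUES (?, ?)",
    [(1, "v11"), (2, "v21"), (3, "v31"), (4, "v41"), (5, "v51")],
)
conn.executemany(
    "INSERT INTO property_values VALUES (?, ?, ?, ?)",
    [
        (1, 1, "name", "A"),
        (2, 1, "age", "10"),
        (3, 2, "name", "B"),
        (4, 3, "name", "C"),
        (5, 3, "age", "9"),
    ],
)

base = """
SELECT t1.id
FROM record_items t1
LEFT JOIN property_values t2
       ON t1.id = t2.REC AND t2.property = 'age'
ORDER BY CASE WHEN t2.value IS NULL THEN 1 ELSE 0 END, {sort_key}, t1.id
"""

# Text comparison: '10' sorts before '9'.
text_order = [row[0] for row in conn.execute(base.format(sort_key="t2.value"))]
# Numeric comparison after casting: 9 sorts before 10.
numeric_order = [
    row[0]
    for row in conn.execute(base.format(sort_key="CAST(t2.value AS INTEGER)"))
]

print(text_order)     # [1, 3, 2, 4, 5]
print(numeric_order)  # [3, 1, 2, 4, 5]
```

In both cases the records without an age property (2, 4, 5) come last, as requested.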

List Last record of each item in mysql

Each item (identified by Serial) in my table has many records, and I need to get the last record of each item, so I ran the code below:
SELECT ID,Calendar,Serial,MAX(ID)
FROM store
GROUP BY Serial DESC
It should show one record per item, with all columns taken from that item's last record, but the result looks like this:
-------------------------------------------------------------+
ID | Calendar | Serial | MAX(ID) |
-------------------------------------------------------------|
7031053 | 2016-05-14 14:05:14 79.5 | N10088 | 7031056 |
7053346 | 2016-05-14 15:17:28 79.8 | N10078 | 7053346 |
7051349 | 2016-05-14 15:21:29 86.1 | J20368 | 7051349 |
7059144 | 2016-05-14 15:50:27 89.6 | J20367 | 7059144 |
7045551 | 2016-05-14 15:15:15 89.2 | J20366 | 7045551 |
7056243 | 2016-05-14 15:25:34 85.2 | J20358 | 7056245 |
7042652 | 2016-05-14 15:18:33 83.9 | J20160 | 7042652 |
7039753 | 2016-05-14 11:48:16 87 | J20158 | 7039753 |
7036854 | 2016-05-14 15:18:35 87.5 | J20128 | 7036854 |
7033955 | 2016-05-14 15:20:45 83.4 | 9662 | 7033955 |
-------------------------------------------------------------+
The problem: why, for example, does the record for Serial N10088 show ID "7031053" while MAX(ID) is "7031056"? The same happens for J20358.
Each row should show the last record of its item, but my output does not!
If you want the row with the max value, then you need a join or some other mechanism.
Here is a simple way using a correlated subquery:
select s.*
from store s
where s.id = (
select max(s2.id)
from store s2
where s2.serial = s.serial
);
Your query uses a (mis)feature of MySQL that generates lots of confusion and is not particularly helpful: you have columns in the select that are not in the group by. What value do these get?
Well, in most databases the answer is simple: the query generates an error as ANSI specifies. MySQL pulls the values for the additional columns from indeterminate matching rows. That is rarely what the writer of the query intends.
For performance, add an index on store(serial, id).
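A small runnable sketch of the correlated-subquery approach, using Python's sqlite3 as a stand-in for MySQL (an assumption for runnability). The table is reduced to the ID and Serial columns; the IDs are taken from the ID and MAX(ID) columns of the output in the question, and the index name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (ID INTEGER PRIMARY KEY, Serial TEXT)")
conn.executemany(
    "INSERT INTO store VALUES (?, ?)",
    [
        (7031053, "N10088"),
        (7031056, "N10088"),
        (7053346, "N10078"),
        (7056243, "J20358"),
        (7056245, "J20358"),
    ],
)
# The suggested index; the name idx_store_serial_id is hypothetical.
conn.execute("CREATE INDEX idx_store_serial_id ON store(Serial, ID)")

last_rows = conn.execute("""
    SELECT s.ID, s.Serial
    FROM store s
    WHERE s.ID = (SELECT MAX(s2.ID)
                  FROM store s2
                  WHERE s2.Serial = s.Serial)
    ORDER BY s.ID
""").fetchall()

print(last_rows)
# [(7031056, 'N10088'), (7053346, 'N10078'), (7056245, 'J20358')]
```

Every returned row is a whole row of the table, so ID and any other column always come from the same record.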
try this one.
SELECT MAX(id), tbl.*
FROM store tbl
GROUP BY Serial
You can try with this also...
SELECT ID,Calendar,Serial
FROM store s0
where ID = (
SELECT MAX(id)
FROM store s1
WHERE s1.serial = s0.serial
);

SQL query in MySQL containing mathematical comparison

I need a SQL query that looks up values in table B by comparing them with the (random) values in table A. The values in table A were generated randomly. The values in table B are ordered as a cumulative distribution function. For each row of table A, the query should return the first row of table B that satisfies the criterion.
Table A:
+----+-------+
| ID | value |
+----+-------+
| 1 | 0.1234|
| 2 | 0.8923|
| 3 | 0.5221|
+----+-------+
Table B:
+----+-------+------+
| ID | value | name |
+----+-------+------+
| 1 | 0.2000| Alpha|
| 2 | 0.5000| Beta |
| 3 | 0.7500| Gamma|
| 4 | 1.0000| Delta|
+----+-------+------+
Result should be:
+----+-------+------+
| ID | value | name |
+----+-------+------+
| 1 | 0.1234| Alpha|
| 2 | 0.8923| Delta|
| 3 | 0.5221| Gamma|
+----+-------+------+
Value 0.1234 is smaller than all the values of B, but Alpha has smallest value.
Value 0.8923 is smaller than 1.000 --> Delta.
Value 0.5221 is smaller than both 0.7500 and 1.000 but 0.7500 is smallest --> Gamma.
This query works only if table A has one value:
select value, name
from B
where (select value from A) < value;
Any ideas how to get this work with full table A?
You can use subquery to get the data you need:
SELECT a.ID, a.value,
(SELECT b.name FROM TableB b WHERE a.value < b.value ORDER BY b.ID ASC LIMIT 1) as name
FROM TableA a
In this case, for each row in table A you find the first record in table B that has a larger number in column value. Depending on your requirements, the operator < might need to be changed to <=.
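For comparison, the same "first threshold greater than the value" lookup can be sketched in Python with a binary search over the cumulative values. This is only an illustration of the logic, not a replacement for the SQL; the helper name `lookup` is made up.

```python
from bisect import bisect_right

# Table B: cumulative thresholds in ascending order, with their names.
table_b = [(0.2, "Alpha"), (0.5, "Beta"), (0.75, "Gamma"), (1.0, "Delta")]
thresholds = [value for value, _ in table_b]

def lookup(x):
    """Name of the first row in table B whose value is strictly greater than x."""
    i = bisect_right(thresholds, x)  # first index with thresholds[i] > x
    return table_b[i][1] if i < len(table_b) else None

# Table A: (ID, value) pairs.
table_a = [(1, 0.1234), (2, 0.8923), (3, 0.5221)]
result = [(row_id, x, lookup(x)) for row_id, x in table_a]
print(result)
# [(1, 0.1234, 'Alpha'), (2, 0.8923, 'Delta'), (3, 0.5221, 'Gamma')]
```

Using `bisect_left` instead of `bisect_right` would correspond to changing < to <= in the query.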