MariaDB: use the whole row in a subquery

Usually subqueries compare single or multiple fields, and DELETE statements usually delete rows by an ID. Unfortunately I don't have an ID field, and I have to use a generic approach for different kinds of tables.
That's why I am working with a subquery that uses LIMIT and OFFSET to resolve rows.
I know this approach is risky, but is there any way to delete rows by subquerying and comparing the whole row?
DELETE FROM table WHERE * = ( SELECT * FROM table LIMIT 1 OFFSET 6 )
I am using the latest version of MariaDB.

This sounds like a really strange need, but who am I to judge? :)
I would simply rely on the primary key:
DELETE FROM table WHERE id_table = (SELECT id_table FROM table LIMIT 1 OFFSET 6)
Update: oh, so you don't have a primary key? You can join on the whole row this way (assuming it has five columns named a, b, c, d, e):
DELETE t
FROM table t
INNER JOIN (
    SELECT a, b, c, d, e
    FROM table
    ORDER BY a, b, c, d, e
    LIMIT 1 OFFSET 6
) ROW6 USING (a, b, c, d, e);
Any subset of columns (e.g. a, c, d) that uniquely identify a row will do the trick (and is probably what you need as a primary key anyway).
Edit: Added an ORDER BY clause as per The Impaler's excellent advice. That's what you get for knocking an example up quickly.
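For instance, a minimal sketch of the same delete, assuming the subset (a, c, d) uniquely identifies rows (names taken from the example above):
DELETE t
FROM table t
INNER JOIN (
    SELECT a, c, d
    FROM table
    ORDER BY a, c, d
    LIMIT 1 OFFSET 6
) ROW6 USING (a, c, d);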

DELETE FROM t
ORDER BY ... -- fill in as needed
LIMIT 6
(Works on any version)

Related

Optimize range query with group by

Having trouble with a query. Here is the outline:
Table structure:
CREATE TABLE `world` (
    `placeRef` int NOT NULL,
    `forenameRef` int NOT NULL,
    `surnameRef` int NOT NULL,
    `incidence` int NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb3;
ALTER TABLE `world`
    ADD KEY `surnameRef_forenameRef` (`surnameRef`,`forenameRef`),
    ADD KEY `forenameRef_surnameRef` (`forenameRef`,`surnameRef`),
    ADD KEY `forenameRef` (`forenameRef`,`placeRef`);
COMMIT;
This table has over 600,000,000 rows and contains data like:
placeRef  forenameRef  surnameRef  incidence
       1            1           2        100
       2            1           3        600
This represents the number of people with a given forename-surname combination in a place.
I would like to be able to query all the forenames that a surname is attached to, and then perform another search for where those forenames exist, with a count of the summed incidence. For example: get all the forenames of people who have the surname "Smith"; then get a list of all those forenames, grouped by place and with the summed incidence. I can do this with the following query:
SELECT placeRef, SUM( incidence )
FROM world
WHERE forenameRef IN
(
    SELECT DISTINCT forenameRef
    FROM world
    WHERE surnameRef = 214488
)
GROUP BY world.placeRef
However, this query takes about a minute to execute and will take more time if the surname being searched for is common.
The root problem is that performing a range query with a GROUP BY doesn't utilize the full index.
Any suggestions how the speed could be improved?
In my experience, if your query has a range condition (i.e. any kind of predicate other than = or IS NULL), the column for that condition is the last column in your index that can be used to optimize search, sort, or grouping.
In other words, suppose you have an index on columns (a, b, c).
The following uses all three columns. It is able to optimize the ORDER BY c because all rows matching the specific values of a and b are by definition tied on those columns, and the matching rows are already stored in order by c, so the ORDER BY is a no-op.
SELECT * FROM mytable WHERE a = 1 AND b = 2 ORDER BY c;
But the next example only uses columns a, b. The ORDER BY needs to do a filesort, because the index is not in order by c.
SELECT * FROM mytable WHERE a = 1 AND b > 2 ORDER BY c;
A similar effect is true for GROUP BY. The following uses a, b for row selection, and it can also optimize the GROUP BY using the index, because the rows for each distinct value of c are guaranteed to be stored together in the index. So it can count the rows for each value of c, and when it's done with one group, it is assured there will be no more rows later with that value of c.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;
But the range condition spoils that. The rows for each value of c are not grouped together; they may be scattered among the higher values of b.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
In this case, MySQL can't optimize the GROUP BY in this query. It must use a temporary table to count the rows per distinct value of c.
MySQL 8.0.13 introduced a new type of optimizer behavior, the Skip Scan Range Access Method. But as far as I know, it only applies to range conditions, not ORDER BY or GROUP BY.
It's still true that if you have a range condition, this spoils the index optimization of ORDER BY and GROUP BY.
Unless I don't understand the task, it seems like this works:
SELECT placeRef, SUM( incidence )
FROM world
WHERE surnameRef = 214488
GROUP BY placeRef;
Give it a try.
It would benefit from a composite index in this order:
INDEX(surnameRef, placeRef, incidence)
Is incidence being updated a lot? If so, leave it off the index.
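In DDL form, that suggestion could look like this (the index name is just illustrative):
ALTER TABLE world
    ADD KEY surnameRef_placeRef_incidence (surnameRef, placeRef, incidence);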
You should consider moving from MyISAM to InnoDB. It will need a suitable PK, probably
PRIMARY KEY(placeRef, surnameRef, forenameRef)
and it will take 2x-3x the disk space.
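A sketch of that conversion, assuming the suggested key really is unique for this data:
ALTER TABLE world
    ADD PRIMARY KEY (placeRef, surnameRef, forenameRef),
    ENGINE = InnoDB;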

MySQL: How do I efficiently reuse the results of a query in other queries?

I'm running the exact same query four times, twice as a subquery, gathering different information each time. What is the best way to pass the results of the first query to the other three so it doesn't have to run three more times?
On the average, it returns around 2,000 rows, but can be anywhere from 0 (in which case I skip the other three) to all. The primary table has nearly 300,000 rows, is growing by about 800 per day, rows are never deleted, and thousands of rows are updated throughout the day, many multiple times.
I looked into query cache, but it doesn't look like it has a bright future:
disabled-by-default since MySQL 5.6 / MariaDB 10.1.7
deprecated as of MySQL 5.7.20
removed in MySQL 8.0
I considered using GROUP_CONCAT with IN, but somehow I doubt that would work very well (if at all) with larger queries.
This is in a library I use to format the results for other scripts, so the original query can be nearly anything. Usually, it is on indexed columns, but can be horribly complicated using stored functions and take several minutes. It always involves the primary table, but may also join in other tables (but only to filter results from the primary table).
I am using Perl 5.16 and MariaDB 10.1.32 (will upgrade to 10.2 shortly) on CentOS 7. I am using prepare_cached and placeholders. The user this library runs as has SELECT-only access to tables plus EXECUTE on a couple stored functions, but I can change that if needed.
I've minimized the below as much as I can and used metasyntactic variables (inside angle brackets) as much as possible in an attempt to make the logic clear. id is 16 bytes and the primary key of the primary table (labeled a below).
I'm accepting three parameters as input. <tables> always includes a and may include a join like a join b on a.id=b.id. <where> can be simple like e=3 or horribly complex. I'm also getting an array of data for the placeholders, but I've left that out of the below because it doesn't affect the logic.
<search> = FROM <tables> WHERE (<where>)
<foo> = k < NOW() - INTERVAL 3 HOUR
<bar> = j IS NOT NULL OR <foo>
<baz> = j IS NULL AND k > NOW() - INTERVAL 3 HOUR
so <baz> is !<bar>. Every row should match one or the other
<where> often includes 1 or more of foo/bar/baz
SELECT a.id, b, c, d, <foo> x <search> ORDER BY e, id
SELECT COUNT(*) <search> AND <baz>
I really only need to know if any of the above rows match <baz>
SELECT c, COUNT(*) t, SUM(<bar>) o FROM a WHERE c IN (SELECT c <search> GROUP BY c) GROUP BY c
SELECT d, COUNT(*) t, SUM(<bar>) o FROM a WHERE d IN (SELECT d <search> GROUP BY d) GROUP BY d
The last two get a list of all unique c or d from the rows in the original query and then count how many total rows (not just the ones in the original query) have matching c or d and how many of those match <bar>. Those results get dumped into hashes so I can look up those counts while I iterate through the rows from the original query. I'm thinking running those two queries once is more efficient than running two smaller queries for each row.
Thank you.
Edited to add solution:
A temporary table was the answer, just not quite in the way Raymond suggested. Using EXPLAIN on my queries indicates that MariaDB was already using a temporary table for each, and deleting it when each was complete.
An inner join only returns rows that exist in both tables. So by making a temporary table of IDs that match my first SELECT, and then joining it to the primary table for the other SELECTs, I only get the data I want, without having to copy all that data to the temporary table.
"To create a temporary table, you must have the CREATE TEMPORARY TABLES privilege. After a session has created a temporary table, the server performs no further privilege checks on the table. The creating session can perform any operation on the table, such as DROP TABLE, INSERT, UPDATE, or SELECT." - https://dev.mysql.com/doc/refman/5.7/en/create-temporary-table.html
I also figured out that GROUP BY sorts its results by default (in MariaDB and older MySQL; the implicit sort was removed in MySQL 8.0), and you can get better performance when you don't need the data sorted by telling it not to.
DROP TEMPORARY TABLE IF EXISTS `temp`;
CREATE TEMPORARY TABLE temp AS ( SELECT a.id FROM <tables> WHERE <where> );
SELECT a.id, b, c, d, <foo> x FROM a JOIN temp ON a.id=temp.id ORDER BY e, id;
SELECT COUNT(*) FROM a JOIN temp ON a.id=temp.id WHERE <baz>;
SELECT c, COUNT(*) t, SUM(<bar>) o FROM a WHERE c IN (SELECT c FROM a JOIN temp ON a.id=temp.id GROUP BY c ORDER BY NULL) GROUP BY c ORDER BY NULL;
SELECT d, COUNT(*) t, SUM(<bar>) o FROM a WHERE d IN (SELECT d FROM a JOIN temp ON a.id=temp.id GROUP BY d ORDER BY NULL) GROUP BY d ORDER BY NULL;
DROP TEMPORARY TABLE IF EXISTS `temp`;
The best I could think of is using a TEMPORARY table.
P.S. I'm using valid MySQL SQL code mixed with the same pseudocode as the topic starter.
CREATE TEMPORARY TABLE <name> AS ( SELECT * FROM <tables> WHERE (<where>) )
<foo> = k < NOW() - INTERVAL 3 HOUR
<bar> = j IS NOT NULL OR <foo>
<baz> = j IS NULL AND k > NOW() - INTERVAL 3 HOUR
so <baz> is !<bar>. Every row should match one or the other
<where> often includes 1 or more of foo/bar/baz
SELECT a.id, b, c, d, <foo> x FROM <name> ORDER BY e, id
SELECT COUNT(*) FROM <name> WHERE <baz>
SELECT c, COUNT(*) t, SUM(<bar>) o FROM a WHERE c IN (SELECT c FROM <name> GROUP BY c) GROUP BY c
SELECT d, COUNT(*) t, SUM(<bar>) o FROM a WHERE d IN (SELECT d FROM <name> GROUP BY d) GROUP BY d

TSQL verify sort order / UNION ALL

CREATE PROCEDURE Test
AS
BEGIN
    SELECT * FROM (
        SELECT 1 AS a, 'test1' AS b, 'query1' AS c
        UNION ALL
        SELECT 2 AS a, 'test22' AS b, 'query22' AS c
        UNION ALL
        SELECT 2 AS a, 'test2' AS b, 'query2' AS c
        UNION ALL
        SELECT 3 AS a, 'test3' AS b, 'query3' AS c
        UNION ALL
        SELECT 4 AS a, 'test4' AS b, 'query4' AS c
    ) AS sample
    FOR XML RAW
END
Can we guarantee that the stored procedure returns results in the given order?
Normally they say that when we insert these SELECT queries into a temporary table, we can't guarantee the insertion order, so we have to use an ORDER BY clause. But most of the time it gives the same order. Can we force it to give some different order? Is this related to clustered and non-clustered indexes?
In the second case, can we enforce insertion order by adding an identity column?
When you insert data, SQL Server treats it as a set. When writing data to disk, it tries to take minimal space and starts by inserting rows into whatever free pages it finds in non-uniform extents. So when you query the data, the result depends on the order of the information in the cache and the order of the information read from disk. I think it is almost impossible to predict that order, as it depends on the work of the OS, other programs, and so on.
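For what it's worth, the only order SQL Server actually guarantees is one requested by an ORDER BY on the outermost query; here is a minimal sketch against the sample data above (the ordering column is chosen just for illustration):
SELECT * FROM (
    SELECT 1 AS a, 'test1' AS b, 'query1' AS c
    UNION ALL
    SELECT 2 AS a, 'test2' AS b, 'query2' AS c
) AS sample
ORDER BY a -- an explicit ORDER BY is the only reliable guarantee
FOR XML RAW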

processing blocks of records in composite primary key order

I'm using mysql and would like to process a very large table with a primary key of 4 parts in blocks of 10,000 (marshalling data to another system). The database is offline when I am doing the processing so I don't have to worry about any modifications. Say the primary key is (A, B, C, D) which are all integers. I first tried using LIMIT OFFSET to achieve this like this:
SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000 OFFSET 0;
Where I increased the offset by 10000 on each call. This seemed to get very slow as it got towards the higher rows in the table. Is it not possible to do this LIMIT OFFSET efficiently?
Then I tried a different approach that uses comparison on the composite primary key. I can get the first block like this:
SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;
If the last row of that block has A = a, B = b, C = c, and D = d then I can get the next block with:
SELECT * FROM LargeTable
WHERE
    A > a OR
    (A = a AND B > b) OR
    (A = a AND B = b AND C > c) OR
    (A = a AND B = b AND C = c AND D > d)
ORDER BY A, B, C, D LIMIT 10000;
And then repeat that for each block. This also seemed to slow down greatly as I got to the higher rows in the table. Is there a better way to do this? Am I missing something obvious?
Start processing data from the very start using just plain
SELECT *
FROM LargeTable
ORDER BY A, B, C, D
and fetch rows one by one in your client code. You can fetch 10,000 rows in your fetch loop if you want, or add a LIMIT 10000 clause. When you want to stop this block, remember the last tuple (A, B, C, D) that was processed; let's call it (A1, B1, C1, D1).
Now, when you want to restart from last point, fetch rows again one by one, but this time use tuple comparison in your WHERE clause:
SELECT *
FROM LargeTable
WHERE (A, B, C, D) > (A1, B1, C1, D1)
ORDER BY A, B, C, D
(you can also add a LIMIT 10000 clause if you don't want to rely on client code exiting the fetch loop prematurely).
The key to this solution is that MySQL correctly implements tuple comparison.
EDIT: mentioned that optional LIMIT 10000 can be added.
You're probably invoking a sequential scan of the table in some way.
Further, your conditional SELECT is not doing what you think it does; it short-circuits on the first condition, A > a.
It'll be more efficient if you skip the ORDER BY and LIMIT and use a statement like:
SELECT *
FROM LargeTable
WHERE A = a AND B = b AND C = c;
And just iterate through sets of a, b, and c.
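A minimal sketch of how the driving list of (A, B, C) combinations for that loop could be obtained (assuming the composite primary key lets this run as an index-only scan):
SELECT DISTINCT A, B, C
FROM LargeTable
ORDER BY A, B, C;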
A lot depends on the context in which you're doing your 'marshalling' operations, but is there a reason why you can't let the unconstrained SELECT run, and have your code do the grouping into blocks of 10,000 items?
In pseudo-code:
while (fetch_row succeeds)
{
    add row to marshalled data
    if (10,000 rows marshalled)
    {
        process 10,000 marshalled rows
        set number of marshalled rows to 0
    }
}
if (marshalled rows > 0)
{
    process N marshalled rows
}
LIMIT with OFFSET needs to scan and throw away rows until it reaches the ones you actually want, so it gets slower as the offset grows.
Here's an idea. Since your database is offline while you do this, the data doesn't actually have to stay in place during the job. Why not move all processed rows to another table while processing them? I'm not sure it will be faster; it depends on how many indexes the table has, but you should try it.
CREATE TABLE processed LIKE LargeTable;
-- for each block: read it, copy it to the new table, then remove it
SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;
INSERT INTO processed SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;
DELETE FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;
-- once LargeTable is empty:
DROP TABLE LargeTable;
RENAME TABLE processed TO LargeTable;

MySQL: Indexes on GROUP BY

I have a reasonably big table (>10,000 rows) which is going to grow much bigger fast. On this table I run the following query:
SELECT *, MAX(a) FROM table GROUP BY b, c, d
Currently EXPLAIN tells me that there are no keys, no possible keys and it's "Using temporary; Using filesort". What would the best key be for such a table?
What about a composite key on b+c+d+a?
Btw, SELECT * makes no sense when you have GROUP BY
A primary key on fields b, c, d would be nice, if applicable.
In that case you just do a
SELECT * FROM table1
GROUP BY <insert PRIMARY KEY here>
If not, put an index on b, c, d.
And maybe on a, depending on the performance.
If b,c,d are always used in unison, use a composite index on all three.
Very important! Always declare a primary key. Without it performance on InnoDB will suck.
To elaborate on #zerkms: you only need to put those columns in the GROUP BY clause that completely define the rows that you are selecting.
If you SELECT * that may be OK, but then the MAX(a) is not needed and neither is the GROUP BY.
Also note that the MAX(a) may come from a different row than the rest of the fields.
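A quick, hypothetical illustration of that pitfall (table and values invented here):
CREATE TABLE demo (a INT, b INT, e CHAR(1));
INSERT INTO demo VALUES (1, 1, 'x'), (2, 1, 'y');
-- Without ONLY_FULL_GROUP_BY, e is taken from an arbitrary row of the group,
-- so this can return ('x', 2): e from the first row, MAX(a) from the second.
SELECT e, MAX(a) FROM demo GROUP BY b;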
The only use case that does make sense is:
SELECT t1.*, COUNT(*) AS occurrence FROM t1
INNER JOIN t2 ON (t1.id = t2.manytoone_id)
GROUP BY t1.id
Where t1.id is the PK.
I think you need to rethink that query.
Ask a new question explaining what you want, with the real code.
And make sure to ask how to make the outcome determinate, so that all values shown are functionally dependent on the GROUP BY clause.
In the end what worked was a modification to the query as follows:
SELECT b, c, d, e, f, MAX(a) FROM table GROUP BY b, c, d
And creating an index on (b, c, d, e, f).
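In DDL form, that index could be created like this (the index name is just illustrative):
ALTER TABLE `table` ADD INDEX idx_b_c_d_e_f (b, c, d, e, f);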
Thanks a lot for your help: the tips here were very useful.