I'm struggling to write a DELETE query against a MariaDB 5.5.44 database.
The first of the two code samples below works fine, but I need to add a further WHERE condition, which is shown in the second sample.
I need to delete only those rows from polozkyTransakci whose puvodFaktury is not 'FAKTURA VO CZ' in the transakce_tmp table. I thought the WHERE clause with the inner SELECT in the second sample would do that, but it takes forever to process (about 40 minutes in my cloud-based ETL tool) and even then it does not leave the rows I want untouched.
1.
DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy';
2.
DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy'
AND idTransakce NOT IN (
SELECT idTransakce
FROM transakce_tmp
WHERE puvodFaktury = 'FAKTURA VO CZ');
Thanks a million for any help
David
IN is very bad for performance. Try using NOT EXISTS() instead:
DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy'
AND NOT EXISTS (SELECT 1
FROM transakce_tmp r
WHERE r.puvodFaktury = 'FAKTURA VO CZ'
AND r.idTransakce = polozkyTransakci.idTransakce );
Before you can performance tune, you need to figure out why it is not deleting the correct rows.
So start by doing SELECTs until you have the right rows identified. Build your SELECT up a bit at a time, checking the results at each stage to see if you are getting the rows you want.
Once you have the SELECT, you can convert it to a DELETE. When testing the delete, do it in a transaction and run some tests on the data that is left behind to ensure it deleted properly before rolling back or committing. Since you likely want to performance tune, I would suggest rolling back, so that you can then try again on the performance-tuned version and ensure you get the same results. Of course, you only want to do this on a dev server!
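A minimal sketch of that workflow against the tables from the question, assuming InnoDB (MyISAM tables ignore transactions); the verification SELECT is just an illustrative check, adapt it to your data:
START TRANSACTION;

DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy'
  AND NOT EXISTS (SELECT 1
                  FROM transakce_tmp r
                  WHERE r.puvodFaktury = 'FAKTURA VO CZ'
                    AND r.idTransakce = polozkyTransakci.idTransakce);

-- verify: every row that should have been kept is still there
SELECT COUNT(*)
FROM polozkyTransakci pt
JOIN transakce_tmp tt ON tt.idTransakce = pt.idTransakce
WHERE pt.typPolozky = 'odpocetZalohy'
  AND tt.puvodFaktury = 'FAKTURA VO CZ';

ROLLBACK; -- or COMMIT once you trust the result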
Now while I agree that not exists may be faster, some of the other things you want to look at are:
Do you have cascade deletes happening? If you end up deleting many child records, that could be part of the problem.
Are there triggers affecting the delete? Especially look to see whether someone set one up to run through things row by row instead of as a set. Row-by-row triggers are a very bad thing when you delete many records. For instance, suppose you are deleting 50K records and you have a delete trigger writing to an audit table. If it inserts into that table one record at a time, it is executed 50K times; if it inserts all the deleted records in one step, that single insert might take a bit longer, but the total execution is much shorter.
What indexing do you have and is it helping the delete out?
You will want to examine the explain plan for each of your queries to see how the query will actually be performed and whether your changes improve it.
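MariaDB 5.5 cannot EXPLAIN a DELETE directly, but you can EXPLAIN the equivalent SELECT; a sketch using the tables from this question:
EXPLAIN
SELECT pt.idTransakce
FROM polozkyTransakci pt
WHERE pt.typPolozky = 'odpocetZalohy'
  AND NOT EXISTS (SELECT 1
                  FROM transakce_tmp r
                  WHERE r.puvodFaktury = 'FAKTURA VO CZ'
                    AND r.idTransakce = pt.idTransakce);
A type column showing ALL (a full table scan) on either table is a hint that an index is missing.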
Performance tuning is a complex thing, and it is best to read up on it in detail in some of the performance tuning books available for your specific database.
I might be inclined to write the query as a LEFT JOIN, although I'm guessing this would have the same execution plan as NOT EXISTS:
DELETE pt
FROM polozkyTransakci pt LEFT JOIN
transakce_tmp tt
ON pt.idTransakce = tt.idTransakce AND
tt.puvodFaktury = 'FAKTURA VO CZ'
WHERE pt.typPolozky = 'odpocetZalohy' AND tt.idTransakce IS NULL;
I would recommend indexes, if you don't have them: polozkyTransakci(typPolozky, idTransakce) and transakce_tmp(idTransakce, puvodFaktury). These would work on the NOT EXISTS version as well.
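If they are missing, creating them would look like this (the index names are arbitrary):
CREATE INDEX idx_polozky_typ_id ON polozkyTransakci (typPolozky, idTransakce);
CREATE INDEX idx_tmp_id_puvod ON transakce_tmp (idTransakce, puvodFaktury);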
You can test the performance of these queries using SELECT:
SELECT pt.*
FROM polozkyTransakci pt LEFT JOIN
transakce_tmp tt
ON pt.idTransakce = tt.idTransakce AND
tt.puvodFaktury = 'FAKTURA VO CZ'
WHERE pt.typPolozky = 'odpocetZalohy' AND tt.idTransakce IS NULL;
The DELETE should be slower (due to the cost of logging transactions), but the performance should be comparable.
I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table, but that doesn't affect my problem here; these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
There are various ways to tackle the problem. Needless to say, it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. One option would be to optimize your query with CASE/COUNT(*) or CASE/LIMIT combinations to sort out empty tables; however, such conditional queries cost extra time themselves.
You could split the SQL code to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x.
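As a sketch of what that splitting might look like for the query in the question, each branch becomes its own simple join and the results are recombined with UNION (or in application code), skipping any branch whose temp table is known to be empty:
SELECT m.*
FROM Moment AS m
JOIN Temp1 ON m.source = Temp1.source
JOIN Temp2 ON m.source = Temp2.source
WHERE m.time > '2017-01-01'
UNION
SELECT m.*
FROM Moment AS m
JOIN Temp1 ON m.source = Temp1.source
JOIN Temp3 ON m.source = Temp3.source
WHERE m.time > '2016-11-15';
This is essentially what your benchmark update describes: running the clauses separately and combining the results was far faster than one combined query.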
You said that an API is involved; maybe you are able to handle the temporary "no data" tables differently, or even avoid storing them?
Another option would be to enable query caching:
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html
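On MySQL 5.5 that can be done at runtime, roughly like this (the 64 MB size is only an example; note the cache mainly pays off when the underlying tables change rarely, since every write invalidates the cached results for its table):
SET GLOBAL query_cache_type = ON;
SET GLOBAL query_cache_size = 67108864; -- 64 MB; tune for your workload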
Warning: This is a soft question, where you'll be answering to someone who has just started teaching himself SQL from the ground up. I haven't gotten my database software set up yet, so I can't provide tables to run queries against. Some patience required.
Warnings aside, I'm experimenting with basic SQL, but I'm having a rough time getting a clear answer about the inner workings of subqueries and their execution order within my query.
Let us say my query looks something like this:
SELECT * FROM someTable
WHERE someFirstValue = someSecondValue
AND EXISTS (
SELECT * FROM someOtherTable
WHERE someTable.someFirstValue = someOtherTable.someThirdValue
)
;
The reason I'm here is that I don't think I fully understand what is going on in this query.
Now, I don't want to seem lazy, so instead of just asking you to "tell me what's going on here", I'll provide my own theory first:
The first row in someTable is checked to see if someFirstValue is the same as someSecondValue in that row.
If it isn't, it goes onto the second row and checks it too. It continues like this until a row passes this little inspection.
If a row does pass, it opens up a new query. If the table produced by this query contains even a single row, it returns TRUE, but if it's empty it returns FALSE.
My theory ends here, and my confusion begins.
Will this inner query now run only against the rows that passed the first WHERE? Or will it check all the rows in someTable and someOtherTable?
Rephrased: will only the rows that passed the first WHERE be compared in the someTable.someFirstValue = someOtherTable.someThirdValue subquery?
Or will the subquery compare all the rows from someTable to all the rows in someOtherTable, regardless of which passed the first WHERE and which didn't?
UPDATE: Assume I'm using MySQL 5.5.32. If that matters.
The answer is that SQL is a descriptive language that describes the result set being produced from a query. It does not specify how the query is going to be run.
In your case the query has several options on how it might run, depending on the database engine, what the tables look like, and indexes. The query itself:
SELECT t.*
FROM someTable t
WHERE t.someFirstValue = t.someSecondValue AND
EXISTS (SELECT *
FROM someOtherTable t2
WHERE t.someFirstValue = t2.someThirdValue
);
Says: "Get me all columns from SomeTable where someFirstValue = someSecondValue and there is a corresponding row in someOtherTable where that's table column someThirdValue is the same as someFirstValue".
One possible way to approach this query would be to scan someTable, checking the first condition. When the two columns match, look up someFirstValue in an index on someOtherTable(someThirdValue) and keep the row if a match is found. As I say, this is one approach, and there are others.
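For that index-lookup route to be available, the index has to exist; a minimal sketch (the index name is arbitrary):
CREATE INDEX idx_someOtherTable_third ON someOtherTable (someThirdValue);
With it in place, EXPLAIN should show the subquery being satisfied from the index rather than by a full scan of someOtherTable.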
My expertise is not in MySQL, so I wrote this query and it is running increasingly slowly: 5 minutes or so with 100k rows in EquipmentData and 30k or so in EquipmentDataStaging (which to me is very little data):
CREATE TEMPORARY TABLE dataCompareTemp
SELECT eds.eds_id FROM equipmentdatastaging eds
INNER JOIN equipment e ON e.e_id_string = eds.eds_e_id_string
INNER JOIN equipmentdata ed ON e.e_id = ed.ed_e_id
AND eds.eds_ed_log_time=ed.ed_log_time
AND eds.eds_ed_unit_type=ed.ed_unit_type
AND eds.eds_ed_value = ed.ed_value
I am using this query to compare data rows pulled from a client's device to the current data sitting in their database. From here I take the temp table and use the IDs off it to make conditional decisions. I have e_id_string indexed and e_id indexed; everything else is not. I know it looks stupid that I have to compare all this information, but the client's system is spitting out redundant data and I am using this query to find it. Any help on this would be greatly appreciated, whether it is a different approach in SQL or in MySQL management. I feel like when I do stuff like this in MSSQL it handles the requests much better, but that is probably because I have something set up incorrectly.
TIPS
index all the columns used in ON or WHERE conditions
here that means indexing eds_e_id_string, eds_ed_log_time, eds_ed_unit_type, eds_ed_value, ed_e_id, ed_log_time, ed_unit_type and ed_value (see the sketch after these tips)
consider changing the syntax to SELECT STRAIGHT_JOIN ..., which forces MySQL to join the tables in the order listed; see the reference for more
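A sketch of those indexes as composites (index names are arbitrary; the column order mirrors the join conditions on both sides):
CREATE INDEX idx_eds_compare ON equipmentdatastaging (eds_e_id_string, eds_ed_log_time, eds_ed_unit_type, eds_ed_value);
CREATE INDEX idx_ed_compare ON equipmentdata (ed_e_id, ed_log_time, ed_unit_type, ed_value);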
DELETE FROM keywords
WHERE NOT EXISTS
    (SELECT keywords_relations.k_id
     FROM keywords_relations
     WHERE keywords.k_id = keywords_relations.k_id)
It is taking too long... I have 583,000 keywords (utf_unicode) and 1 million keywords_relations. In the past the query used to finish in 20-60 seconds, but I tried running it now and it hasn't finished in half an hour.
Could you please suggest what might be wrong? Also, are there any other alternatives to this query?
I am trying to delete the keywords from the keywords table whose IDs don't exist in the keywords_relations table.
Thanks
The site is http://domainsoutlook.com/
You can try going on it and see that all the queries are running slowly.
PS: The server crashed a few days ago, and an fsck check or something was carried out on the disk by my server maintenance support.
Indexes on keywords: k_id (primary), keywords (unique)
Indexes on keywords_relations: k_id (index)
try this instead and see if it makes a difference:
DELETE keywords
FROM keywords
LEFT JOIN keywords_relations
ON keywords.k_id = keywords_relations.k_id
WHERE keywords_relations.k_id IS NULL
WHERE NOT EXISTS (subquery) is known to cause performance issues in MySQL < 5.4; use LEFT JOIN instead.
WARNING: Test it before running on live database. I claim no responsibility for data lost.
DELETE k
FROM keywords AS k
LEFT JOIN keywords_relations AS kr USING (k_id)
WHERE kr.k_id IS NULL
I assume that you don't want to delete keywords if there are one or more keywords_relations rows belonging to them. So the first thing you could do is add a LIMIT 1 to your SELECT query.
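A sketch of that variant (worth measuring, since EXISTS usually stops at the first matching row anyway):
DELETE FROM keywords
WHERE NOT EXISTS
    (SELECT keywords_relations.k_id
     FROM keywords_relations
     WHERE keywords.k_id = keywords_relations.k_id
     LIMIT 1)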
You have surely set indexes on k_id, right? If yes, this actually shouldn't be a problem for MySQL...
At the moment, I have a table in mysql that records transactions. These transactions may be updated by users - sometimes never, sometimes often. However, I need to track changes to every field in this table. So, what I have at the moment is a TINYINT(1) field in the table called 'is_deleted', and when a transaction is 'updated' by the user, it simply updates the is_deleted field to 1 for the old record and inserts a brand new record.
This all works well, because I simply have to run the following sql statement to get all current records:
SELECT id, foo, bar, something FROM trans WHERE is_deleted = 0;
However, I am concerned that this table will grow unnecessarily large over time, and I am therefore thinking of actually deleting the old record and 'archiving' it off to another table (trans_deleted) instead. That means the trans table will only contain 'live' records, making SELECT queries that little bit faster.
It does mean, however, that updating records will take slightly longer, as it will run three queries:
1. INSERT INTO trans_deleted old record;
2. DELETE from trans WHERE id = 5;
3. INSERT INTO trans new records
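Wrapped in a transaction so a failure can't leave the record half-moved, the sequence might look like this (a sketch assuming InnoDB and the columns from the earlier SELECT; the new values are placeholders):
START TRANSACTION;
INSERT INTO trans_deleted (id, foo, bar, something)
    SELECT id, foo, bar, something FROM trans WHERE id = 5;
DELETE FROM trans WHERE id = 5;
INSERT INTO trans (foo, bar, something, is_deleted)
    VALUES ('new foo', 'new bar', 'new something', 0);
COMMIT;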
So, updating records will take a bit more work, but reading will be faster.
Any thoughts on this?
I would suggest a table trans and a table revision, where trans has the fields id and current_revision, and revision has the fields id, trans_id, foo and bar.
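A minimal DDL sketch of that layout (the column types are placeholders):
CREATE TABLE trans (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    current_revision INT NOT NULL
);

CREATE TABLE revision (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    trans_id INT NOT NULL,
    foo VARCHAR(255),
    bar VARCHAR(255),
    INDEX (trans_id)
);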
To get all current items then:
SELECT r.foo, r.bar FROM trans AS t
LEFT JOIN revision AS r ON t.id = r.trans_id
WHERE t.current_revision = r.id
If you now put indexes on r.id and r.trans_id, archiving won't make it much faster for you.
Well, typically you read much more often than you write (and you additionally say that some records may never be changed). So that's one reason to go for the archive table.
There's also another one: You also have to account for programmer time, not only processor time :) If you keep the archived rows in the same table with the live ones, you'll have to remember and take care of that in every single query you perform on that table. Not to speak of future programmers who may have to deal with the table... I'd recommend the archiving approach based on this factor alone, even if there wasn't any speed improvement!