Preventing duplicate rows based on a column (MySQL)?

I'm building a system that updates its local database from other APIs frequently. I have Python scripts set up as cron jobs, and they do the job almost fine.
However, the one flaw is that the scripts take ages to run. The first run is quick, but after that it takes nearly 20 minutes to work through a list of 200k+ items received from the third-party API.
The problem is that the script first fetches all the rows from the database and adds their must-be-unique column value to a list. Then, while going through the API results, it checks whether the current item's must-be-unique value exists in that list. This gets really heavy, as the list holds over 200k values.
Is there a way to check, in the INSERT query itself, that there is no duplicate based on a single column, and if there is, simply not add the new row?
Any help will be appreciated =)

If you add a UNIQUE key to the column(s) that must contain unique values, MySQL will complain when you insert a row that violates this constraint.
You then have three options (sketched below):
INSERT IGNORE will try to insert and, in case of a violation, do nothing.
INSERT ... ON DUPLICATE KEY UPDATE will try to insert and, in case of a violation, update the existing row to the new values.
REPLACE will try to insert and, in case of a violation, DELETE the offending existing row and INSERT the new one.
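For example, a minimal sketch of all three; the items table and api_id column are made-up stand-ins for your schema:

ALTER TABLE items ADD UNIQUE KEY uniq_api_id (api_id);

-- option 1: silently skip rows that would violate the key
INSERT IGNORE INTO items (api_id, title) VALUES (12345, 'foo');

-- option 2: update the existing row instead of inserting
INSERT INTO items (api_id, title) VALUES (12345, 'foo')
ON DUPLICATE KEY UPDATE title = VALUES(title);

-- option 3: delete the colliding row, then insert the new one
REPLACE INTO items (api_id, title) VALUES (12345, 'foo');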

Related

Adding a UNIQUE key to a large existing MySQL table which is receiving INSERTs/DELETEs

I have a very large table (tens of millions of rows), and a UNIQUE index needs to be added to a column on that table. I know for a fact that the table contains duplicate values on that key, which I need to clean up (by deleting rows or resetting the column to something unique that I can generate automatically). A plus is that the rows which are already duplicated no longer get modified.
What would be the right approach to a change like this, given that I will probably be using the Percona pt-osc tool and there are continuous deletes/inserts on the table? My plan was:
Add code that ensures no duplicate IDs get inserted anymore. I probably need a separate table for this temporarily, since I want the database, not the application, to enforce this for me: insert into the "shadow table" with a unique index in the same transaction as the main table, and roll back any insert that tries to use a duplicate value.
Backfill the table by zapping all invalid column values within the primary-key range below $current_pkey_value.
Then add the index and use pt-osc to change over the table.
Is there anything I am missing?
Since we use pt-online-schema-change, the synchronisation from the existing table to the temp table is performed with triggers. The tool actually has a special configuration key for this, --no-check-unique-key-change, which does exactly what we need: it agrees to perform the ALTER TABLE and sets up the triggers in such a way that if a conflict occurs, INSERT IGNORE is applied and the first row to use the now-unique value wins during synchronisation. For us this is a good tradeoff, because all the duplicates we have seen resulted from data races, not from actual conflicts in the value-generation process.
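An invocation could look roughly like this; the database, table, and key names here are invented for illustration:

pt-online-schema-change \
  --alter "ADD UNIQUE KEY uniq_external_id (external_id)" \
  --no-check-unique-key-change \
  D=mydb,t=mytable --execute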

re-inserting a table record and updating an auto increment primary index

I'm running MariaDB 5.5.56.
I'm looking to copy an entire row in a database, change one column, then insert the entire row back into the original table (I don't want to have to specify the individual fields, because there are a lot of them). The problem I'm running into is how to deal with an auto-increment/primary-key column.
example:
create temporary table t_ownership like ownership;
insert into t_ownership (select * from ownership where name='x' LIMIT 1);
update t_ownership set id='something else';
insert into ownership (select * from t_ownership);
I have a column "recno" that is an auto-increment and will create a collision in the database when I try to re-insert the slightly changed record back into the original table.
Something like this seems to work but doesn't result in an insert:
insert into ownership (select * from t_ownership) ON DUPLICATE KEY UPDATE recno=LAST_INSERT_ID(ownership.recno);
The above statement executes without error but does not add a row to table ownership.
So I think I'm close but not quite there...
What would be the best way to do this? I'd like to avoid an insert where I manually specify fields/values. I just need the insert to generate a new auto-increment recno value.
NULL values inserted into auto-increment fields simply get the next auto-increment value, exactly as if you had INSERTed without specifying the field; so you should be able to update the source (the temp copy) to have NULL in that field.
However, one potential issue in scenarios like yours is that CREATE TEMPORARY TABLE ... LIKE can produce a table that will not allow you to set such fields to NULL; this would require you to either ALTER the temporary table or create it in a more explicit manner. Either way, it makes code/queries that do not specify columns even more reliant on knowing the columns.
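Put together with the question's own statements, the whole flow might look like this; a sketch that assumes recno is an INT:

create temporary table t_ownership like ownership;
alter table t_ownership modify recno int null;       -- LIKE copies NOT NULL/AUTO_INCREMENT, so relax it
insert into t_ownership (select * from ownership where name='x' LIMIT 1);
update t_ownership set recno = NULL;                 -- blank the key before re-inserting
insert into ownership (select * from t_ownership);   -- the NULL recno receives the next auto-increment value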
Personally, I would take this route in the first place.
INSERT INTO theTable([list all but the auto-inc column])
SELECT [list all but the auto-inc column, with any replacements or modifications desired]
FROM ...[original query]...
It accomplishes the task in one query, makes the queries more self-documenting, and costs only a little extra typing (most of which a decent database browser or query builder will do for you).
The only real argument in favor of your current approach is that the table can be changed without necessarily breaking your queries; but that raises the question of whether it would be better for such table changes to break the queries, forcing them to be re-examined. If breakage is not an issue, fixing it is a minor revision; the alternative is queries that remain valid but have the potential to cause unexpected behavior by copying information they were never intended to.

Why not delete the old row and insert an updated row?

I have a table (MySQL) in which some rows need to be updated when a user desires.
I know the right way is to use an SQL UPDATE statement, and I'm not asking "Which is faster? Delete and insert, or just update?". But since writing the update operation for my table takes more code (because of the table's relations), why shouldn't I just delete the old row and insert the updated one?
Yes, you can delete and insert. But what keeps the record in your database if the program crashes a moment before it can insert the data back?
UPDATE keeps this from happening: it keeps the record in your database and changes only the values that need to be changed. It may be more complicated to use with your schema, but you can be certain that your record stays safe.
Finally, I got the answer!
In an RDBMS there are relations between records, and one record may have dependencies. In such situations you cannot delete and insert a new record, because foreign-key constraints cause data loss: records dependent on the main record (e.g. a user's posts depending on the user record) will be deleted along with it.
If you are in a situation with no record dependencies (not as an exception, but by the nature of the data model, as in many NoSQL stores) and updating a record is problematic (e.g. file checking), you can use this approach.
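A hypothetical illustration of that data loss; the users/posts schema is invented for the example:

CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(50));
CREATE TABLE posts (
    id INT PRIMARY KEY,
    user_id INT,
    FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE CASCADE
);
-- delete-and-reinsert of a user silently removes all of their posts:
DELETE FROM users WHERE id = 1;  -- every post with user_id = 1 is cascade-deleted here
INSERT INTO users (id, name) VALUES (1, 'same user, updated data');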

Performing an UPDATE or INSERT depending whether a row exists or not in MySQL

In MySQL, I'm trying to find an efficient way to perform an UPDATE if a row already exists in a table, or an INSERT if the row doesn't exist.
I've found two possible ways so far:
The obvious one: open a transaction, SELECT to find whether the row exists, INSERT if it doesn't or UPDATE if it does, then commit the transaction.
First INSERT IGNORE into the table (so no error is raised if the row already exists), then UPDATE (sketched below).
The second method avoids the transaction.
Which one do you think is more efficient, and are there better ways (for example using a trigger)?
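For reference, here is roughly what I mean by the second method; mytable, id, and col are placeholder names:

-- step 1: make sure the row exists (no error if it already does)
INSERT IGNORE INTO mytable (id) VALUES (1);
-- step 2: apply the new values unconditionally
UPDATE mytable SET col = 'x' WHERE id = 1;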
INSERT ... ON DUPLICATE KEY UPDATE
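For example, assuming a PRIMARY KEY or UNIQUE index on id (table and column names are placeholders):

INSERT INTO mytable (id, col) VALUES (1, 'x')
ON DUPLICATE KEY UPDATE col = 'x';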
You could also perform an UPDATE, check the number of rows affected, and if it's less than 1, conclude that no matching row was found and perform the INSERT.
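A sketch of that approach; the affected-row count comes from your client driver (e.g. cursor.rowcount in Python's MySQL drivers):

UPDATE mytable SET col = 'x' WHERE id = 1;
-- if the driver reports 0 affected rows, the row did not exist, so:
INSERT INTO mytable (id, col) VALUES (1, 'x');
-- caveat: MySQL also reports 0 affected rows when the row exists but nothing
-- changed, unless the connection sets the CLIENT_FOUND_ROWS flag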
There is another way - REPLACE.
REPLACE INTO myTable (col1) VALUES (value1)
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 12.2.5, “INSERT Syntax”.
In mysql there's a REPLACE statement that, I believe, does more or less what you want it to do.
REPLACE INTO would be a solution; it uses the UNIQUE index to decide whether to replace or insert.
REPLACE INTO
yourTable
SET
column = value;
Please be aware that this works differently from what you might expect: REPLACE is quite literal. It first checks whether there is a UNIQUE index collision that would prevent an INSERT, removes (DELETEs) all rows that collide, and then INSERTs the row you've given it.
This leads to subtle problems: for example, UPDATE triggers never fire (because no update ever occurs, only a delete and an insert), and unspecified columns revert to their defaults (because you must specify all values).
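A small, made-up demonstration of the revert-to-defaults pitfall:

CREATE TABLE t (id INT PRIMARY KEY, a INT DEFAULT 0, b INT DEFAULT 0);
INSERT INTO t VALUES (1, 5, 7);
REPLACE INTO t SET id = 1, a = 9;  -- the old row is deleted, then a new row inserted
SELECT * FROM t;                   -- id = 1, a = 9, b = 0: b silently reverted to its default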
If you're doing a lot of these, it might be worth writing them to a file, and then using 'LOAD DATA INFILE ... REPLACE ...'
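That could look something like this; the file path and format clauses are illustrative:

LOAD DATA INFILE '/tmp/rows.csv'
REPLACE INTO TABLE myTable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';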

MySQL is handling one SQL query at a time?

If you've got 100,000 users, is MySQL executing one SQL query at a time?
Because in my PHP code I check whether a certain row exists; if it doesn't, the code creates it. If it does, it just updates the row's counter.
It crossed my mind that perhaps 100 users might check whether the row exists at the same time, and when it doesn't, they would all create one row each.
If MySQL handles them sequentially, I know it won't be an issue: one user checks whether the row exists and creates it if not; the next user checks, finds it, and just updates the counter.
But if they all check whether it exists at the same time, and let's say it doesn't, then they each create a row and the table ends up full of duplicates.
Would be great if someone could shed some light on this topic.
Use a UNIQUE constraint or, if viable, make one of your data items the primary key, and the SQL server will prevent duplicate rows from being created. You can even use the INSERT ... ON DUPLICATE KEY UPDATE syntax to specify the alternate operation if the row already exists.
From your comments, it sounds like you could use the user_id as your primary key, in which case, you'd be able to use something like this:
INSERT INTO usercounts (user_id,usercount)
VALUES (id-goes-here,1)
ON DUPLICATE KEY UPDATE usercount=usercount+1;
If you put the check and the insert into a transaction with appropriate row locking (e.g. SELECT ... FOR UPDATE), you can avoid this problem: the check and the create then run as one atomic unit, and there shouldn't be any confusion.
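A minimal sketch of that, reusing the usercounts table from the other answer and assuming an InnoDB table:

START TRANSACTION;
SELECT usercount FROM usercounts WHERE user_id = 42 FOR UPDATE;  -- locks the row (or gap) until COMMIT
-- if the SELECT returned no row:
--   INSERT INTO usercounts (user_id, usercount) VALUES (42, 1);
-- otherwise:
--   UPDATE usercounts SET usercount = usercount + 1 WHERE user_id = 42;
COMMIT;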