Normalize data before loading into the database, or let the database handle it? - mysql

I have some data which I want to add to an existing MySQL database. The new data may contain entries that are already saved in the DB. Since some of my columns are unique, I get, as expected, an ER_DUP_ENTRY error.
Bulk Insert
Let's say I want to use the following statement to save "A", "B" and "C" in the column names of table mytable, and "A" is already saved there.
insert into mytable (names) values ("A"), ("B"), ("C");
Is there a way to directly use bulk insert to save "B" and "C" while ignoring "A"? Or do I have to build an insert statement for every new row? This leads to another question:
Normalize Data
Should I ensure that no duplicate entries are uploaded before the actual insert statement? In my case I would need to select the data from the database, eliminate duplicates and then perform the insert shown above. Or is that a task which is supposed to be done by the database?

If you have UNIQUE constraints that are blocking import, you have a few ways you can work around that:
INSERT IGNORE INTO mytable ...
If any individual rows violate a UNIQUE constraint, they are skipped. Other rows are inserted.
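Applied to the example in the question, that would look roughly like this (assuming the names column carries the UNIQUE constraint):
INSERT IGNORE INTO mytable (names) VALUES ("A"), ("B"), ("C");
-- "A" already exists, so it is skipped; "B" and "C" are inserted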
REPLACE INTO mytable ...
If a row violates a UNIQUE constraint, the existing row is DELETEd, then the new row is INSERTed. Keep in mind the side-effects of doing this, for example if foreign keys that cascade on delete reference the deleted row, or if the INSERT generates a new auto-increment id.
INSERT INTO mytable ... ON DUPLICATE KEY UPDATE ...
More flexibility. This does not delete the original row, but allows you to set new values for any columns you choose on a case by case basis. See also my answer to "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE"
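As a rough sketch of the same bulk insert with ON DUPLICATE KEY UPDATE (the update expression here is just an illustration):
INSERT INTO mytable (names) VALUES ("A"), ("B"), ("C")
ON DUPLICATE KEY UPDATE names = VALUES(names);
-- "B" and "C" are inserted; the existing "A" row is "updated" to the same value instead of raising an error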
If you want to use bulk-loading with mysqlimport or the SQL statement equivalent LOAD DATA INFILE, there are options that match the INSERT IGNORE or REPLACE solutions, but not the INSERT...ON DUPLICATE KEY UPDATE solution.
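For example, a minimal sketch of the bulk-load variant; the file path is made up, and you would adjust the field/line terminators to your data:
LOAD DATA INFILE '/tmp/names.csv'
    IGNORE            -- or REPLACE, mirroring INSERT IGNORE / REPLACE INTO
    INTO TABLE mytable (names);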
Read docs for more information:
https://dev.mysql.com/doc/refman/8.0/en/insert.html
https://dev.mysql.com/doc/refman/8.0/en/replace.html
https://dev.mysql.com/doc/refman/8.0/en/insert-on-duplicate.html
https://dev.mysql.com/doc/refman/8.0/en/mysqlimport.html
https://dev.mysql.com/doc/refman/8.0/en/load-data.html

In some situations, I like to do this:
1. LOAD DATA into a temp table.
2. Clean up the data.
3. Normalize as needed. (2 SQLs per column that needs normalizing -- details; see the sketch below.)
4. Augment Summary table(s). (INSERT .. ON DUPLICATE KEY .. SELECT x, y, count(*), sum(z), .. GROUP BY x,y)
5. Copy clean data from temp table to real table(s) ("Fact" table). (INSERT [IGNORE] .. SELECT [DISTINCT] .. or IODKU with SELECT.)
More on Normalizing:
I do it outside any transactions. There are multiple reasons why this is better.
At worst (as a result of other failures), I occasionally throw an unused entry in the normalization table. No big deal.
No burning of AUTO_INCREMENT ids (except in edge cases).
Very fast.
Since REPLACE is a DELETE plus INSERT it is almost guaranteed to be worse than IODKU. However, both burn ids when the rows exist.
If at all possible, do not "loop" through the rows; instead find SQL statements to handle them all at once.
Depending on the details, de-dup in step 2 (if lots of dups) or in step 5 (dups are uncommon).
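A minimal sketch of the "two SQLs per normalized column" idea from step 3, assuming a hypothetical lookup table names_dim(id, name UNIQUE) and a name_id column in the temp table; the table and column names are made up and the answer's own variant may differ in details:
-- 1) add any values not yet in the lookup table
INSERT IGNORE INTO names_dim (name)
    SELECT DISTINCT name FROM tmp_table;
-- 2) write the looked-up ids back into the staging rows
UPDATE tmp_table t
    JOIN names_dim d ON d.name = t.name
    SET t.name_id = d.id;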

Related

Adding a UNIQUE key to a large existing MySQL table which is receiving INSERTs/DELETEs

I have a very large table (dozens of millions of rows) and a UNIQUE index needs to be added to a column on that table. I know for a fact that the table does contain duplicated values on that key, which I need to clean up (by deleting rows/resetting the value of the column to something unique that I can automatically generate). A plus is that the rows which are already duplicated do not get modified anymore.
What would be the right approach to perform a change like this, given that I will probably be using the Percona pt-osc tool and there are continuous deletes/inserts on the table? My plan was:
1. Add code that ensures no dupe IDs get inserted anymore. Probably I need to add a separate table for this temporarily, since I want the database to enforce this for me and not the application: insert into a "shadow table" with a unique index in the same transaction as the insert into my main table, and roll back any insert that tries to insert a duplicate value.
2. Backfill the table by zapping all invalid column values which are within the primary key range below $current_pkey_value.
3. Then add the index and use pt-osc to change over the table.
Is there anything I am missing?
Since we use pt-online-schema-change, triggers perform the synchronisation from the existing table to the temp table. The tool actually has a special configuration key for this, --no-check-unique-key-change, which does exactly what we need: it agrees to perform the ALTER TABLE and sets up the triggers in such a way that, if a conflict occurs, INSERT IGNORE is applied and the first row having used the now-unique value wins during synchronisation. For us this is a good tradeoff, because all the duplicates we have seen resulted from data races, not from actual conflicts in the value generation process.

re-inserting a table record and updating an auto increment primary index

I'm running MariaDB 5.5.56.
I'm looking to copy an entire row in a table, change one column, then insert the entire row back into the original table (I don't want to have to specify the individual fields because there are a lot of them). The problem I'm running into is how to deal with an auto-increment/primary key column.
example:
create temporary table t_ownership like ownership;
insert into t_ownership (select * from ownership where name='x' LIMIT 1);
update t_ownership set id='something else';
insert into ownership (select * from t_ownership);
I have a column "recno" that is an auto-increment that will create a collision in the database when I try to re-insert the slightly changed record back into the original table.
Something like this seems to work but doesn't result in an insert:
insert into ownership (select * from t_ownership) ON DUPLICATE KEY UPDATE recno=LAST_INSERT_ID(ownership.recno);
The above statement executes without error but does not add a row to table ownership.
So I think I'm close but not quite there...
What would be the best way to do this? I'd like to avoid doing an insert where I manually specify field/values. I just need to regenerate a new A.I. recno column on the insert.
NULL values inserted into auto-increment fields simply get the next auto-increment value, behaving equivalently to INSERTing without specifying the field; so you should be able to update the source (temp copy) to have NULL for that field.
However, one potential issue in scenarios like yours is that CREATE TEMPORARY TABLE ... LIKE could produce a table that does not allow you to set such fields to NULL; this would require you to either ALTER the temporary table, or create it in a more explicit manner. Either way, it makes code/queries that do not list columns explicitly even more reliant on knowing the column definitions.
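A rough sketch of that approach using the tables from the question; the ALTER is only needed if the copied definition of recno does not allow NULL, and the column type shown is an assumption:
ALTER TABLE t_ownership MODIFY recno INT NULL;    -- assumption: recno is an INT
UPDATE t_ownership SET recno = NULL;
INSERT INTO ownership SELECT * FROM t_ownership;  -- NULL lets ownership generate a fresh auto-increment value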
Personally, I would take this route in the first place.
INSERT INTO theTable([list all but the auto-inc column])
SELECT [list all but the auto-inc column, with any replacements or modifications desired]
FROM ...[original query]...
It accomplishes the task in one query, makes the queries more self documenting, and only at the cost of a little typing (most of which a decent database browser, or query builder, will do for you).
The only real argument in favor of your current approach is that the table involved can be changed without necessarily breaking your queries; but that raises the question of whether it would be better for such table changes to break the queries, forcing them to be re-examined. If breaking them is not an issue, fixing them is a minor revision; the alternative is queries that remain valid but can cause unexpected behavior by copying information they were never intended to.

Do UPDATE in the first place and then INSERT for new data (reports) into mysql

I get a report in a tab-delimited file which stores some SKUs and their current quantities.
This means most of the time the inventory is the same and we just have to update the quantities.
But it can happen that a new SKU is in the list, which we have to insert instead of updating.
We are using an InnoDB table for storing those SKUs. At the moment we just split the file by tabs and line breaks and make an INSERT ... ON DUPLICATE KEY UPDATE query, which is quite inefficient, because INSERT is expensive in InnoDB, right? It is also tricky because when a list with a lot of SKUs (> 20k) comes in, it just takes some minutes.
So my solution for now is to just do a LOAD DATA INFILE into a tmp table and afterwards do the INSERT ... ON DUPLICATE KEY UPDATE, which should be faster, I think.
Also, is there another solution which does a simple UPDATE in the first place and, only if there are rows left over, performs an INSERT? This would be perfect, but I could not find anything about it. Is there a way to delete rows which returned an update: 1?
1. Sort the CSV file by the PRIMARY KEY of the table.
2. LOAD DATA INFILE into a separate table (as you said).
3. INSERT INTO real_table SELECT * FROM tmp_table ON DUPLICATE KEY UPDATE ... -- Note: This is a single INSERT.
Caveat: This may block the table from other uses during step 3. A solution: break the CSV into 1000-row chunks and COMMIT after each chunk.
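A hedged sketch of steps 2 and 3, assuming hypothetical sku/qty columns and file path; adjust the terminators and column lists to the real report:
CREATE TEMPORARY TABLE tmp_table LIKE real_table;
LOAD DATA INFILE '/tmp/report.tsv'
    INTO TABLE tmp_table
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (sku, qty);
INSERT INTO real_table (sku, qty)
    SELECT sku, qty FROM tmp_table
    ON DUPLICATE KEY UPDATE qty = VALUES(qty);  -- one statement: updates existing SKUs, inserts new ones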

Verify a query is going to work before executing another query in reverse order

Ok, I have an update function with a weird twist. Due to the nature of the structure, I run a delete query then insert query, rather than an actual "Update" query. They are specifically run in that order so that the new items inserted are not deleted. Essentially, items are deleted by an attribute id that matches in the insert query. Since the attribute is not a primary index, "ON DUPLICATE KEY UPDATE" is not working.
So here's the dilemma. During development and testing, the delete query will run without fail, but if I'm messing around with the input for the INSERT query and it fails, then the data has been deleted without being reinserted, which means regenerating new test data. Even worse, if it fails in production, the user will lose everything they were working on.
So, I know MySQL validates a query before it is actually run, so is it possible to make sure the INSERT query validates before running the DELETE query?
<cfquery name="delete" datasource="DSOURCE">
DELETE FROM table
WHERE colorid = 12
</cfquery>
<!--- check this query first before running delete --->
<cfquery name="insert" datasource="DSOURCE">
INSERT INTO table (Name, ColorID)
VALUES ("tom", 12)
</cfquery>
You have 2 problems.
Since the attribute is not a primary index, "ON DUPLICATE KEY UPDATE"
is not working.
The attribute doesn't have to be a PRIMARY KEY. It's sufficient if it's defined as a UNIQUE KEY, which you can do without penalties.
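For instance, a hypothetical way to add such a key, assuming colorid is the attribute in question and should indeed be unique (the index name is made up, and `table` is the placeholder table name from the question):
ALTER TABLE `table` ADD UNIQUE KEY uq_colorid (colorid);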
And number two: if you want to execute a series of queries in sequence, with ALL of them succeeding or none of them taking effect - the term is transaction. Either all succeed or nothing happens. Google MySQL transactions to get a better overview of how to use them.
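A minimal sketch of that, using the placeholder names from the question (in ColdFusion you would wrap the two cfquery calls in a cftransaction, but the SQL shape is the same):
START TRANSACTION;
DELETE FROM `table` WHERE colorid = 12;
INSERT INTO `table` (Name, ColorID) VALUES ("tom", 12);
COMMIT;   -- if the INSERT fails, issue ROLLBACK instead and the DELETE is undone
Note that this requires a transactional storage engine such as InnoDB; with MyISAM the DELETE cannot be rolled back.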
Since you use WHERE colorid = 12 as your delete criterion, colorid must be a unique key. This gives you two ways of approaching this with a single query:
UPDATE table SET Name="tom"
WHERE colorid=12
OR
REPLACE INTO table (Name, ColorID)
VALUES ("tom", 12)

Performing an UPDATE or INSERT depending on whether a row exists or not in MySQL

In MySQL, I'm trying to find an efficient way to perform an UPDATE if a row already exists in a table, or an INSERT if the row doesn't exist.
I've found two possible ways so far:
The obvious one: open a transaction, SELECT to find if the row exists, INSERT if it doesn't exist or UPDATE if it exists, commit transaction
first INSERT IGNORE into the table (so no error is raised if the row already exists), then UPDATE
The second method avoids the transaction.
Which one do you think is more efficient, and are there better ways (for example using a trigger)?
INSERT ... ON DUPLICATE KEY UPDATE
You could also perform an UPDATE and check the number of rows affected; if it's less than 1, then it didn't find a matching row, so perform the INSERT.
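A rough sketch of that approach, with made-up table/column names and assuming col1 is the unique key; the affected-row check happens in application code:
UPDATE myTable SET col2 = 'value2' WHERE col1 = 'value1';
-- in the application: if no rows were affected, fall back to
INSERT INTO myTable (col1, col2) VALUES ('value1', 'value2');
One subtlety: by default MySQL reports rows changed, not rows matched, so an UPDATE that writes identical values reports 0 affected rows even though the row exists; connecting with CLIENT_FOUND_ROWS (or re-checking with a SELECT) avoids a spurious INSERT attempt.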
There is another way - REPLACE.
REPLACE INTO myTable (col1) VALUES (value1)
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 12.2.5, “INSERT Syntax”.
In mysql there's a REPLACE statement that, I believe, does more or less what you want it to do.
REPLACE INTO would be a solution; it uses the UNIQUE INDEX to decide whether to replace or insert.
REPLACE INTO yourTable
SET column = value;
Please be aware that this works differently from what you might expect; the REPLACE is quite literal. It first checks whether there is a UNIQUE INDEX collision which would prevent an INSERT, removes (DELETEs) all rows which collide, and then INSERTs the row you've given it.
This, for example, leads to subtle problems like update triggers not firing (because they check for an update, which never occurs) or values reverting to their defaults (because you must specify all values).
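A small illustration of the second pitfall, with hypothetical columns and id as the PRIMARY KEY; qty silently falls back to its column default because REPLACE deletes the old row first:
-- existing row: (id = 1, name = 'A', qty = 5)
REPLACE INTO myTable (id, name) VALUES (1, 'A');
-- afterwards qty holds its DEFAULT value (or NULL), not 5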
If you're doing a lot of these, it might be worth writing them to a file, and then using 'LOAD DATA INFILE ... REPLACE ...'