I have two identical databases (A and B), each with a single table, running on separate offline machines.
Every day I export their data (as CSV) and "merge" it into a third database (C). I process A first, then B (I insert the contents of A into C, then the contents of B into C).
Now, it could happen that I get duplicate rows. I consider a row a duplicate if some field, for example "mail", already exists. I don't care whether the rest of the fields are the same.
How can I insert A and B into C excluding those rows that are duplicates?
Thanks in advance!
The easiest solution should be to create a unique index on the columns in question and run the second insert as INSERT IGNORE.
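A minimal sketch of that approach. The question does not give the schema, so the table name `contacts` and its columns are assumptions made up for illustration:

```sql
-- Hypothetical schema for database C; table and column names are assumptions.
CREATE TABLE contacts (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  mail VARCHAR(190) NOT NULL,
  name VARCHAR(190) NULL,
  UNIQUE KEY uq_contacts_mail (mail)  -- a duplicate is defined by mail alone
);

-- Rows exported from A
INSERT IGNORE INTO contacts (mail, name) VALUES
  ('ann@example.com', 'Ann'),
  ('bob@example.com', 'Bob');

-- Rows exported from B; bob@example.com already exists, so that row is skipped
INSERT IGNORE INTO contacts (mail, name) VALUES
  ('bob@example.com', 'Robert'),
  ('cid@example.com', 'Cid');

-- Three rows remain: ann, bob (the version from A) and cid
SELECT mail, name FROM contacts ORDER BY mail;
```

If the CSV files are loaded with LOAD DATA INFILE instead of INSERT statements, the IGNORE keyword of that statement skips rows that duplicate a unique key in the same way.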
Personally I use ON DUPLICATE KEY UPDATE, because INSERT IGNORE turns errors into warnings, and not only duplicate-key errors.
This can have side effects and result in behavior you don't expect. See this post for details on some of the side effects.
If you end up using the ON DUPLICATE KEY UPDATE syntax, it will also provide a means of changing your logic to update specific fields with new data should business requirements change.
For instance, you can tally how many times a duplicate record was inserted by saying ON DUPLICATE KEY UPDATE quantity = quantity+1.
The post referenced above has a ton more information.
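As a rough illustration of the tally idea, here is a sketch that assumes a hypothetical `contacts_tally` table keyed on `mail` with a `quantity` counter (none of these names come from the question):

```sql
-- Hypothetical table; `contacts_tally`, `mail` and `quantity` are assumed names.
CREATE TABLE contacts_tally (
  mail     VARCHAR(190) NOT NULL,
  name     VARCHAR(190) NULL,
  quantity INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (mail)
);

-- First sighting: the row is inserted with quantity = 1
INSERT INTO contacts_tally (mail, name) VALUES ('bob@example.com', 'Bob')
ON DUPLICATE KEY UPDATE quantity = quantity + 1;

-- Second sighting: the key already exists, so only the counter changes
INSERT INTO contacts_tally (mail, name) VALUES ('bob@example.com', 'Robert')
ON DUPLICATE KEY UPDATE quantity = quantity + 1;

-- One row remains: name is still 'Bob' and quantity is 2
SELECT mail, name, quantity FROM contacts_tally;
```

Only the columns named in the UPDATE clause change, which is why the name from the first insert survives.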
I'm currently trying to create a table of theater locations that only has three locations. I imported denormalized data that I tried to normalize with this statement:
insert into theater(`name`, email, address, phone)
select distinct theater, theater_email, theater_address, theater_phone
from denormalized_tickets;
When I comment out the first line and run it, I get the result I'm looking for.
When I run select * from theater; to look at the theater table, it returns each theater duplicated 12 times.
How should I solve this? Is there anything I'm overlooking?
As discussed in the comments above, INSERT creates new rows each time you execute it. If you do that multiple times, you may add more rows every time.
Vasya recommended creating a UNIQUE index to block new rows from being created with the same values. This may or may not be appropriate for a given table. For instance, what if you want to allow multiple rows to have the same values?
Another thing you might like to read about is MySQL's REPLACE statement. The syntax is similar to INSERT, but if there's a duplicate in column(s) of a primary key or unique key, it first deletes the old row and then inserts the new row. But this won't help if you don't have the unique key defined, because how would MySQL know it's a conflict?
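A sketch of how the two suggestions could look for the theater table from the question. It assumes that `name` alone identifies a theater and that the table can be emptied and rebuilt from denormalized_tickets; the index name is made up:

```sql
-- Start from an empty table so the unique key can be created.
-- This discards the duplicated rows; it assumes no other table
-- references theater yet.
TRUNCATE TABLE theater;

-- Assumption: a theater is identified by its name.
ALTER TABLE theater ADD UNIQUE KEY uq_theater_name (`name`);

-- Safe to run repeatedly: a row whose name already exists is deleted
-- and re-inserted rather than added a second time.
REPLACE INTO theater (`name`, email, address, phone)
SELECT DISTINCT theater, theater_email, theater_address, theater_phone
FROM denormalized_tickets;
```

Because REPLACE deletes and re-inserts, an AUTO_INCREMENT id on the row changes each time. Using INSERT IGNORE with the same SELECT keeps the existing row and its id instead.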
I'm trying to write a MySQL trigger for a table update (and a similar one for insert) that will take the updated columns and update corresponding columns in another table.
My set-up is this: I have one table (A) with several columns of numerical values and a record number Primary Key. I have another table (B) with identical column names but with short text descriptors that relate to each numerical value and also a record number as a Foreign Key referring to table A. Both of these tables may grow over time to include more columns - always matching each other - each with a simple predictable name (sticking with integers for now). All records are 1:1.
My hope was that I could write triggers for both update and insert on table A that would look at the numbers and, based on some simple logic, assign a descriptor to the corresponding record in table B (inserting that record in the case of the insert trigger). It got rather complicated quickly because I had to query INFORMATION_SCHEMA.COLUMNS to identify all current column names in table A, check each OLD vs NEW to verify that column was updated (for the update trigger anyway), do some logic to determine the appropriate descriptor, then INSERT/UPDATE the corresponding column in table B. I can't figure out how to set up a procedure/trigger that doesn't require storing column names in a variable to dynamically build an SQL statement. This is, of course, not allowed in a trigger and I have made some attempts at getting around this by moving the dynamic SQL statement into a separate stored procedure. None of this has worked and I've run into so many roadblocks, I'm coming to the conclusion that I'm going about this in entirely the wrong way.
Since I'm very new to database design, I just don't know what question to ask at this point other than, is there a better way or alternatively, is there a fix to my approach outlined above?
As always, I've searched thoroughly and not found any questions that answer mine but, if you see one that does, please point me that way!
I am considering using INSERT ON DUPLICATE KEY UPDATE for my application which routinely has to submit many rows to the database in one transaction. However I am slightly confused regarding one thing. The usage examples online seem to be many in their variations for this functionality.
The behavior I am looking for is this: insert the row if it does not already exist in the unique index, but if it does exist, simply return the ID and update nothing. Am I correct in assuming that this is the intended functionality of this statement?
Also I don't want to go creating dummy fields in my tables to utilize this functionality, as is suggested in many examples. That in my opinion is just bad practice.
Any advice is greatly appreciated. Below is an example from MySQL's website that illustrates something close to what I want, but the c=3 part is not explained there. I am wondering whether it is required to make LAST_INSERT_ID actually work or whether it's just part of their example. I have read that without some dummy operation after the LAST_INSERT_ID part, LAST_INSERT_ID won't work.
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), c=3;
Instead you can just SELECT the unique ID to determine whether it exists. If it does, just return it. Otherwise, do the INSERT and return the new ID.
You cannot do this with a single statement in MySQL. A SELECT statement returns existing values; data modification statements (including INSERT) do not return data. (They usually return a count of some sort.) This includes the INSERT...ON DUPLICATE KEY UPDATE statement—it does not return data.
You can probably do what you want with a stored procedure, but the procedure will contain more than one statement. If that doesn't work for you, then do as @Explosion Pills suggests and use a SELECT followed, if needed, by an INSERT.
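One way such a procedure might look, as a sketch only. The table `items`, its `code` column, and the procedure name `get_or_create_item` are hypothetical, and DELIMITER is a command of the mysql command-line client rather than SQL:

```sql
-- Hypothetical table and procedure; `items`, `code` and `get_or_create_item`
-- are assumed names, not taken from the question.
CREATE TABLE items (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  code VARCHAR(64) NOT NULL,
  UNIQUE KEY uq_items_code (code)
);

DELIMITER //
CREATE PROCEDURE get_or_create_item(IN p_code VARCHAR(64), OUT p_id INT UNSIGNED)
BEGIN
  -- A scalar subquery yields NULL when no row matches
  SET p_id = (SELECT id FROM items WHERE code = p_code);

  -- Insert only when the lookup found nothing
  IF p_id IS NULL THEN
    INSERT INTO items (code) VALUES (p_code);
    SET p_id = LAST_INSERT_ID();
  END IF;
END //
DELIMITER ;

CALL get_or_create_item('abc', @first_id);
CALL get_or_create_item('abc', @second_id);
SELECT @first_id, @second_id;  -- both variables hold the same id
```

Two sessions calling this at the same moment with the same new code could both reach the INSERT, in which case the unique key makes one of them fail with a duplicate-key error, so the caller should be prepared to retry.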
I have a web crawler. It gathers the links from the web pages I give it, but some of the links it retrieves are duplicated because of how the website is built. Is there a way in MySQL to overwrite data if a new row is exactly the same as an old row?
Say I have http://www.facebook.com in a link field.
If I then pick up http://www.facebook.com again, I would like the latter to overwrite the old row, so that I don't have clashes in my search engine.
I'm assuming that you want to update a last_updated date if the url already exists. Else there is no good reason to do an update.
INSERT INTO `scrapping_table`
(`url`)
VALUES
('www.facebook.com')
ON DUPLICATE KEY UPDATE
`date_updated` = NOW()
Look into ON DUPLICATE KEY actions:
http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
Basically, make the columns you're concerned with a unique key, write your insert statement, and then add
ON DUPLICATE KEY UPDATE col = overwriting value
If your link field is unique then you can use
INSERT INTO mytable (link_field, x_column, y_column) VALUES ('www.facebook.com', 'something new for x', 'something new for y')
ON DUPLICATE KEY UPDATE x_column = 'something new for x', y_column = 'something new for y'
Just make sure your link field is unique. If your table has more than one unique index, I suggest the second method below, because the MySQL documentation advises against using an ON DUPLICATE KEY clause on tables with multiple unique indexes.
Set your link field as unique.
Before inserting a row, try
SELECT primary_id FROM mytable WHERE link_field = 'www.facebook.com'
Count the number of rows this query returns.
=> If count > 0, UPDATE the row using the primary_id we just grabbed with the SELECT
=> If count == 0, just insert your row
Beware!
When operating a web crawler that will probably find millions of links,
you want to minimize the queries each "crawl" process fires.
Do you want to create a unique link table that will feed the bots, or do you want to prevent duplicate search results?
Unique URL pool table:
While crawling a page, save the URLs to an array (or list) and make sure (!in_array()) that it holds only unique values. You will find that each page you crawl includes a lot of repeated links, so clean them up before using SQL.
Convert the URLs to hashes ("simhash" of 32 digits [1,0]).
Now open a connection to the DB and check whether each one exists. If it does, dump it! Don't update (that makes a second process). Match the links using the hashes over an indexed table; it will be far faster.
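A sketch of what an indexed hash column could look like on the database side. The table and column names are assumptions, and MD5 is swapped in here for the simhash mentioned above purely because MySQL has it built in; it detects exact repeats only:

```sql
-- Hypothetical URL pool; table and column names are assumptions, and MD5
-- stands in for whatever hash the crawler actually computes.
CREATE TABLE url_pool (
  id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url      TEXT NOT NULL,
  url_hash BINARY(16) NOT NULL,
  UNIQUE KEY uq_url_pool_hash (url_hash)  -- lookups and duplicate checks hit this index
);

-- One statement per crawled page; repeats are dropped, never updated
INSERT IGNORE INTO url_pool (url, url_hash) VALUES
  ('http://www.facebook.com', UNHEX(MD5('http://www.facebook.com'))),
  ('http://www.facebook.com', UNHEX(MD5('http://www.facebook.com'))),
  ('http://www.example.com',  UNHEX(MD5('http://www.example.com')));

-- Two rows are stored
SELECT COUNT(*) FROM url_pool;
```

Indexing a fixed-width hash instead of the URL itself keeps the unique index small and sidesteps the length limit on indexed string columns.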
Preventing duplicate search results:
If you indexed the URLs with the method above, you should not find duplicate URLs. If you do, it means there is a problem in your crawling operation.
Even if you have duplicate values in another table and want to search it without returning duplicate results, you can use DISTINCT in your query.
Good luck!
I read this question, which highlights a solution to conditionally insert values into a table if they don't already exist. My question: is it possible to conditionally insert multiple values at once?
For instance, say I have a table that just contains user names (this is a pointless table, but let's keep it simple). The table's contents look like this:
matthew 20
mark 24
luke 25
john 56
buddy 68
A user enters jimmy 34, mark 25 and bobby 54 in a web form and submits, and I'd like to check whether those three values exist in the table already and insert the ones that don't in one statement. Yes, for this example, I'm assuming names are unique.
Here is a paraphrase of the code snippet from the question I linked to, adapted to this example:
INSERT INTO users(name)
SELECT 'jimmy'
FROM dual
WHERE NOT EXISTS (SELECT * FROM users
WHERE name = 'jimmy')
How can I adapt this for multiple values being inserted at once? It's also important that the solution work independently of the number of values entered. In my example, I give three (jimmy, mark and bobby) but there may only be one or there may be 20.
Second question: is this wise? I know that reducing the number of queries is desirable but is it worth it here? Should I just set up a for loop and loop through, alternately checking if a value exists and inserting if it doesn't?
Thanks for any help.
Sorry I don't have code that I've tried myself to show, I'm not even sure what to try here.
Update: added an extra column to the table. I wanted to keep things simple but need two columns to illustrate the fact that deleting a row and then inserting or updating are not what I want as they would favor the user's input over what is already in the table.
See INSERT IGNORE combined with a UNIQUE KEY on the field in question.
When a unique key constraint fails the whole row is silently ignored.
ALTER TABLE users ADD UNIQUE KEY `name` (`name`);
INSERT IGNORE INTO `users` (`name`, `age`)
VALUES ("jimmy", 22), ("bob", 45),
("luke", 300), ("john", 456);
Note: this will also suppress errors when datatypes mismatch and accurate conversion is impossible. (e.g. DECIMAL vs INT) MySQL will continue using the nearest result possible. (e.g. INT) You should ensure only pre-validated data is inserted with such statement.
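If adding a unique key is not an option, the NOT EXISTS pattern from the question can probably be extended to several rows by listing the candidates in a derived table. This is a sketch under the assumption that the table has `name` and `age` columns, as in the example data; the alias `candidate` is made up:

```sql
-- Assumes users(name, age) as in the example data from the question.
INSERT INTO users (name, age)
SELECT candidate.name, candidate.age
FROM (
  -- One SELECT per submitted value; the application builds this list
            SELECT 'jimmy' AS name, 34 AS age
  UNION ALL SELECT 'mark', 25
  UNION ALL SELECT 'bobby', 54
) AS candidate
WHERE NOT EXISTS (
  -- Skip any candidate whose name is already in the table
  SELECT 1 FROM users AS u WHERE u.name = candidate.name
);
```

With the example data, jimmy and bobby are inserted, while mark is skipped and keeps his existing age of 24. Unlike a unique key, this check does not protect against two sessions inserting the same name at the same time.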
I am not big on programming (especially PHP), but can you parse the string into separate names and then loop over them?
Something like this (hopefully the syntax is somewhat correct):
$names = explode(" ", $user_names);
for ($i = 0; $i < count($names); $i++)
{
    // function_sql_query() is a placeholder for your own function that
    // checks whether the name exists and adds it if it does not
    function_sql_query($names[$i]);
}
In addition, you can add rules to your DBMS to disallow duplicates in this attribute, so that only unique records are added; the DBMS will then enforce this for you.
You can use MySQL's REPLACE
http://dev.mysql.com/doc/refman/5.0/en/replace.html
OR INSERT ... ON DUPLICATE KEY UPDATE
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c;