I have a tab separated text file in the format
id | field 1 | field 2 ...
I want to insert this into a MySQL database with id as the primary key, but the text file may contain duplicate ids.
How can I make sure that there's just one entry corresponding to each id?
How can I choose between two lines that have the same id? (Yes, they might not be consistent, but it's okay to pick one over the other, such as the first or the last occurrence.)
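Read the text file line by line, parse each line, and use the INSERT ... ON DUPLICATE KEY UPDATE syntax. If the UPDATE clause overwrites the other fields, the last occurrence of each id wins; INSERT IGNORE would keep the first occurrence instead.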
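A minimal sketch of the two outcomes (first occurrence wins vs. last occurrence wins). It assumes Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the table names first_wins and last_wins and the sample rows are made up. SQLite's INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE, and INSERT OR REPLACE approximates INSERT ... ON DUPLICATE KEY UPDATE field1 = VALUES(field1).

```python
import sqlite3

# Hypothetical parsed lines: id 1 appears twice.
rows = [(1, "a"), (2, "b"), (1, "c")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_wins (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.execute("CREATE TABLE last_wins (id INTEGER PRIMARY KEY, field1 TEXT)")

# Skip any row whose id already exists (MySQL: INSERT IGNORE).
conn.executemany(
    "INSERT OR IGNORE INTO first_wins (id, field1) VALUES (?, ?)", rows
)

# Overwrite the stored row when the id already exists
# (MySQL: INSERT ... ON DUPLICATE KEY UPDATE field1 = VALUES(field1)).
conn.executemany(
    "INSERT OR REPLACE INTO last_wins (id, field1) VALUES (?, ?)", rows
)

first_wins = conn.execute("SELECT id, field1 FROM first_wins ORDER BY id").fetchall()
last_wins = conn.execute("SELECT id, field1 FROM last_wins ORDER BY id").fetchall()
print(first_wins)
print(last_wins)
```

Either way the primary key guarantees one row per id; the only choice is which duplicate survives.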
I would do a SELECT before INSERT and count the number of rows returned by the SELECT. Something like this:
SELECT * FROM yourTable WHERE yourTable.id = :id
If that returns a row, skip the insert and move on to the next line. Otherwise, insert it.
Edit: This is a check-then-insert strategy. It would be good to also add a unique constraint so that the database itself guarantees uniqueness. Something like:
ALTER TABLE yourTable ADD CONSTRAINT ukID UNIQUE (id)
Presuming a Unix shell, I'd do this:
awk '!x[$1]++' inputfile.tsv > uniqfile.tsv
Then run your import from uniqfile.tsv.
Edit: to be clear, that script de-duplicates the input file on the first field. A row is printed only if its first field has not been seen before (its counter in the hash x is still zero), so the first occurrence of each id is kept.
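For anyone without awk at hand, the same idea can be sketched in Python. This is only an illustration: the function name dedupe_by_first_field and its keep parameter are made up, and unlike awk's default whitespace splitting it splits strictly on tabs. It also shows how to keep the last occurrence instead of the first.

```python
def dedupe_by_first_field(lines, keep="first"):
    """Return one line per id, where the id is the first tab-separated field."""
    chosen = {}  # id -> line; dicts preserve insertion order (Python 3.7+)
    for line in lines:
        key = line.split("\t", 1)[0]
        # "first": only store an id the first time it is seen.
        # "last": always overwrite, so the final occurrence survives.
        if keep == "last" or key not in chosen:
            chosen[key] = line
    return list(chosen.values())


rows = ["1\ta\tb", "2\tc\td", "1\te\tf"]
first = dedupe_by_first_field(rows)
last = dedupe_by_first_field(rows, keep="last")
print(first)
print(last)
```

With keep="last" each id stays at the position where it first appeared, but carries the content of its final occurrence.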
Related
If I have a table that has these rows:
animal (primary)
-------
man
dog
cow
and I want to delete all the rows and insert my new rows (that may contain some of the same data), such as:
animal (primary)
-------
dog
chicken
wolf
I could simply do something like:
delete from animal;
and then insert the new rows.
But when I do that, for a split second, 'dog' won't be accessible through the SELECT statement.
I could simply insert ignore the new data and then delete the rest, one by one, but that doesn't feel like the right solution when I have a lot of rows.
Is there a way to insert the new data and then have MySQL automatically delete the rest afterward?
I have a program that selects data from this table every 5 minutes (and the code I'm writing now will be updating this table once every 30 minutes), so I would like to be as accurate as possible at all times, and I would rather have too many rows for a split second than too few rows for the same time.
Note: I know that this may seem like it is unnecessary but I just feel like if I leave too many of those unlikely possibilities in different places, there will be times where things go wrong.
You may want to use TRUNCATE instead of DELETE here. TRUNCATE is faster than DELETE and resets the table back to its empty state (meaning AUTO_INCREMENT counters are reset to their starting values as well).
Not sure why you're having problems with selecting a value that was deleted and re-added, maybe I'm missing some context. But if you're wiping the table clean, you might want to use truncate instead.
You could add another column timestamp and change the select statement to accommodate this scenario where it needs to check for the latest value.
If this is for school, I would argue that you need a timestamp and that is what your professor is looking for. You shouldn't need to truncate a table to get the latest values, you need to adjust the thinking behind the table and how you are querying data. Hope this helps!
Check out these:
How to make a mysql table with date and time columns?
Why not update values instead?
My other questions would be:
How are you loading this into the table?
What does that code look like?
Can you change the way you Select from the table?
What values are being "updated" and change in such a way that you need to truncate the entire table?
If you don't want to add a new column, there is another method.
1. First, update the table so that all existing rows are marked for later deletion. For example:
UPDATE `table_name` SET `animal`=CONCAT('MUST_BE_DELETED_', `animal`)
2. Second, insert the new rows.
3. Finally, remove all the marked rows:
DELETE FROM `table_name` WHERE `animal` LIKE 'MUST_BE_DELETED_%'
You could implement this with an updated_on column of type timestamp. You could even rely on default values, but let's go with an example without them.
I presume the table would look something like this:
CREATE TABLE `my_table` (
`animal` varchar(255) NOT NULL,
`updated_on` timestamp,
PRIMARY KEY (`animal`)
) ENGINE=InnoDB
This is just a dummy table example. What's important are the two queries later on.
You would simply perform a query to insert the data, such as:
insert into my_table(animal)
select animal from my_view where animal = 'dogs'
on duplicate key update
updated_on = current_timestamp;
Note that my_view stands for whatever table, view, or query supplies the values to insert into your table. Also note that the animal column needs a primary or unique key constraint for ON DUPLICATE KEY UPDATE to work in this example.
Then, you proceed with the following query, to "purge" (delete) the old values:
delete from my_table
where updated_on < (
select *
from (
select max(updated_on) from my_table
) as max_date
);
Note that you could create a separate view to obtain this max_date value for updated_on. That value is the timestamp of the rows inserted or updated by the previous query, so you can use it in a WHERE clause to delete the old records that you no longer want.
IMPORTANT NOTE:
Since you are running multiple queries that are supposed to act as a single operation, I'd advise you to run them within a single transaction and to roll back properly on failure (e.g. on MySQL exceptions). You might wish to use a stored procedure for that.
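A rough sketch of the "insert the new rows, then purge the old ones, all in one transaction" idea. Several things here are assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL, it swaps the updated_on timestamp for a made-up integer batch column so the example is deterministic, and the refresh helper is hypothetical. INSERT OR REPLACE stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (animal TEXT PRIMARY KEY, batch INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO my_table (animal, batch) VALUES (?, 1)",
    [("man",), ("dog",), ("cow",)],
)
conn.commit()


def refresh(conn, animals, batch):
    """Upsert the new rows, then delete every row not touched by this batch."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO my_table (animal, batch) VALUES (?, ?)",
            [(a, batch) for a in animals],
        )
        conn.execute("DELETE FROM my_table WHERE batch <> ?", (batch,))


refresh(conn, ["dog", "chicken", "wolf"], batch=2)
remaining = [
    row[0] for row in conn.execute("SELECT animal FROM my_table ORDER BY animal")
]
print(remaining)
```

Because both statements commit together, a reader on another connection sees either the old set of rows or the new one, and 'dog' never disappears in between.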
This seems like it should be simple, but I couldn't figure out a way to do it. Let's say I have a table with 5,000 rows, each with an ID (primary key) of 1–5000. I am blindly inserting a new value with an existing ID, and it could be something like 2677. What I want to happen is that if the ID already exists, it will use the auto_increment value, in this case 5001. That or the maximum existing value + 1.
Most importantly, I can't use PHP (or anything else other than SQL) to do this, because the output is a query that needs to be directly importable without errors.
I have looked at two similar questions on SO:
Can you use aggregate values within ON DUPLICATE KEY
– the problem here is that they're selecting from an existing table which I can't do.
on duplicate key update with a condition? – the problem here is that I have no information on the table I'm importing to (except the basic structure), and don't know what the maximum value is.
INSERT INTO table (column1,column2) VALUES (1,2) ON DUPLICATE KEY UPDATE id=VALUES(id)
Obviously this requires an id column with AUTO_INCREMENT.
Moreover if you later need to select the inserted id just like if it was a new Insert, you do:
ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(VALUES(id));
I'm trying to write a MySQL query that checks whether a column in a row is contained within another column in the same row. Is there a way to do that kind of query?
For example:
Key   Value    runHash
2500  tacos    night.2500.293849284
1775  windows  day.176555.43035842
I am trying to write a query that will return the second row and not the first because for the first row, Key is in runHash.
I tried to do:
select * from table where key not in runHash
However, this doesn't appear to be valid in MySQL.
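You are looking for like. To return only the rows where the key is not contained in runHash, negate it (key is a reserved word in MySQL, so it is quoted with backticks):
where runHash not like concat('%', `key`, '%')
You can put periods in the pattern as well, if those are important for your pattern matching.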
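To check the pattern against the sample rows from the question, here is a small sketch. It assumes Python's built-in sqlite3 in place of MySQL, so concat() is written with SQLite's || operator and the Key column is quoted with double quotes; the table name runs is made up. Since the goal is the row whose key is absent from runHash, the condition is NOT LIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE runs ("Key" TEXT, "Value" TEXT, runHash TEXT)')
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("2500", "tacos", "night.2500.293849284"),
        ("1775", "windows", "day.176555.43035842"),
    ],
)

# Keep only rows whose Key does not occur anywhere inside runHash.
not_contained = conn.execute(
    """
    SELECT "Key", "Value", runHash
    FROM runs
    WHERE runHash NOT LIKE '%' || "Key" || '%'
    """
).fetchall()
print(not_contained)
```

One caveat: LIKE treats any % or _ characters inside the key as wildcards, so this only behaves as a plain substring test when the keys contain neither.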
I am doing the following SQL tutorial: http://sql.learncodethehardway.org/book/ex11.html
and in this exercise the author says in the second paragraph:
In this situation, I want to replace my record with another guy but
keep the unique id. Problem is I'd have to either do a DELETE/INSERT
in a transaction to make it atomic, or I'd need to do a full UPDATE.
Could anyone explain to me what the problem is with doing an UPDATE, and when we might choose REPLACE instead of UPDATE?
The UPDATE code:
UPDATE person SET first_name = "Frank", last_name = "Smith", age = 100
WHERE id = 0;
Here is the REPLACE code:
REPLACE INTO person (id, first_name, last_name, age)
VALUES (0, 'Frank', 'Smith', 100);
EDIT: I guess another question I have is why would you ever do a DELETE/INSERT instead of just an UPDATE as is discussed in the quoted section?
According to the documentation, the difference is:
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.
So what it does:
Try to match the row using one of the available indexes;
If the row doesn't exist already: add a new one;
If the row exists already: delete the existing row and add a new one afterwards.
When might this be more useful than separate insert and update statements?
You can safely call it without worrying about whether the row already exists (one statement vs. two);
If you want related data to be removed when inserting or updating, you can use REPLACE: because it deletes the old row, it removes related data too (for example, child rows behind ON DELETE CASCADE foreign keys);
When triggers need to fire, and you expect an insert (bad reason, okay).
First, REPLACE isn't supported by all database engines.
Second, REPLACE inserts or replaces a record based on its primary key, whereas with UPDATE you can specify more elaborate conditions:
UPDATE person SET first_name = CONCAT('old ', first_name) WHERE age > 50
Also, UPDATE won't create records.
UPDATE will have no effect if the row does not exist.
INSERT OR REPLACE, on the other hand, will insert the row if it doesn't exist, or replace its values if it does.
UPDATE changes the values of existing records in a table based on a particular condition, so you can change one or many records in a single query.
INSERT OR REPLACE inserts a new record if it is not present in the table; otherwise it replaces the existing one. The replacement only happens if you provide the primary key value in the INSERT OR REPLACE query. If you forget to include the primary key value, a new record will be created in the table.
Case example:
UPDATE: You have to calculate wages with a formula based on column values. In this case you would always use an UPDATE query, since a single query can update multiple records.
INSERT OR REPLACE: Already mentioned in the link you shared.
How the REPLACE INTO statement works:
AS INSERT:
REPLACE INTO table_name (column1name, column2name, ...)
VALUES (value1, value2, ...);
AS UPDATE:
REPLACE INTO table_name SET column1name = value, column2name = value, ... ;
The REPLACE statement checks whether the intended data record's unique key value already exists in the table before inserting it as a new record or updating it.
The REPLACE INTO statement attempts to insert a new record or modify an existing record. In both cases, it checks whether the unique key of the proposed record already exists in the table. Suppose the check returns NO or FALSE. In that case, the REPLACE statement inserts the record just like an INSERT INTO statement.
Suppose the key value already exists in the table (in other words, a duplicate key). In that case, the REPLACE statement deletes the existing record of data and replaces it with a new record of data. This happens regardless of whether you use the first or the second REPLACE statement syntax.
When the REPLACE INTO statement is used to insert or modify data, it first determines whether the new data record already exists in the table by checking whether its PRIMARY or UNIQUE KEY matches one of the existing records.
If there is no matching key, REPLACE works like a normal INSERT statement. Otherwise, it deletes the existing record and replaces it with the new one. This is considered a sort of modification or update of an existing record. However, be careful here. Suppose you do not specify a value for a column in the SET clause. In that case, the REPLACE statement uses the default value (if a default value has been set). Otherwise, it's set to NULL.
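The warning about unspecified columns is easy to miss, so here is a small sketch contrasting UPDATE with REPLACE. It assumes Python's built-in sqlite3 as a stand-in for MySQL (SQLite also accepts REPLACE INTO with the same delete-then-insert behaviour); the fresh_db helper, the sample person row, and the DEFAULT 0 on age are made up for illustration.

```python
import sqlite3


def fresh_db():
    """Create an in-memory person table holding one hypothetical row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, "
        "age INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT INTO person VALUES (0, 'Joe', 'Bloggs', 40)")
    return conn


query = "SELECT id, first_name, last_name, age FROM person"

# UPDATE touches only the named column; the other values survive.
db1 = fresh_db()
db1.execute("UPDATE person SET first_name = 'Frank' WHERE id = 0")
after_update = db1.execute(query).fetchone()

# REPLACE deletes the old row and inserts a new one, so columns that are
# not listed fall back to their default (age) or to NULL (last_name).
db2 = fresh_db()
db2.execute("REPLACE INTO person (id, first_name) VALUES (0, 'Frank')")
after_replace = db2.execute(query).fetchone()

print(after_update)
print(after_replace)
```

This is the practical reason to prefer UPDATE when only some columns change: REPLACE silently discards whatever you did not restate.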
I am inserting some words into a two-column table with this command:
INSERT IGNORE INTO terms (term) VALUES ('word1'), ('word2'), ('word3');
How can I get the ID (primary key) of the row in which each word is inserted? I mean returning a value like "55,56,57" after executing the INSERT. Does MySQL have such a response?
The term column is UNIQUE. If a term already exists, MySQL will not insert it. Is it possible to return the reference for this duplication (i.e. the ID of the row in which the term exists)? A response like "55,12,56".
You can get it via SELECT LAST_INSERT_ID(); or by having your framework/MySQL library (in whatever language) call mysql_insert_id().
For the duplicates, that won't work. There you have to query the IDs after inserting.
Why not just:
SELECT ID
FROM terms
WHERE term IN ('word1', 'word2', 'word3')
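A sketch of that insert-then-select approach, returning the ids in the same order as the input words. It assumes Python's built-in sqlite3 in place of MySQL (INSERT OR IGNORE stands in for INSERT IGNORE), and the pre-existing 'word2' row is made up to show a duplicate being skipped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT UNIQUE)")
conn.execute("INSERT INTO terms (term) VALUES ('word2')")  # already present

words = ["word1", "word2", "word3"]

# Insert whatever is new; rows that would violate the UNIQUE constraint
# are skipped silently.
conn.executemany(
    "INSERT OR IGNORE INTO terms (term) VALUES (?)", [(w,) for w in words]
)

# Look up the ids of all words, whether they were just inserted or not.
placeholders = ",".join("?" * len(words))
found = dict(
    conn.execute(
        f"SELECT term, id FROM terms WHERE term IN ({placeholders})", words
    )
)
ids = [found[w] for w in words]
print(ids)
```

The lookup costs one extra query, but it covers new and pre-existing terms alike, which LAST_INSERT_ID() alone cannot do.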
First, to get the id just inserted, you can do something like:
SELECT LAST_INSERT_ID() ;
Careful: this only works right after your last INSERT query, and for a multiple-row insert it returns only the first ID!
Then, with the IGNORE option, I don't think it is possible to get the lines that were not inserted. When you run an INSERT IGNORE, you just tell MySQL to ignore the lines that would create a duplicate entry.
If you leave this option out, the INSERT will stop and you will get an error identifying the line concerned by the duplication.