I have a web crawler. It gathers the links from the web pages I give it, but when it is retrieving the links, some links are duplicated because of how the website is structured. Is there a way in MySQL to overwrite data if a new row is exactly the same as an old row?
Say I have http://www.facebook.com in a link field,
and I manage to pick up http://www.facebook.com again. I would like the latter to overwrite the old row, so I don't have clashes in my search engine.
I'm assuming that you want to update a last_updated date if the URL already exists; otherwise there is no good reason to do an update.
INSERT INTO `scrapping_table` (`url`)
VALUES ('www.facebook.com')
ON DUPLICATE KEY UPDATE `date_updated` = NOW();
Look into ON DUPLICATE KEY actions:
http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
Basically, make the columns you're concerned with a unique key, write your insert statement, and then add
ON DUPLICATE KEY UPDATE col = overwriting value
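For example, the url column from the snippet above could be made a unique key like this (the index name is just illustrative); the INSERT ... ON DUPLICATE KEY UPDATE shown earlier then has a key to fire on:
ALTER TABLE `scrapping_table` ADD UNIQUE KEY `uniq_url` (`url`);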
If your link field is unique, then you can use
INSERT INTO mytable (link_field, x_column, y_column) VALUES ('www.facebook.com', 'something new for x', 'something new for y')
ON DUPLICATE KEY UPDATE x_column = 'something new for x', y_column = 'something new for y'
Just make sure your link field is unique. If you have more unique fields in your table, I suggest using the second method below, because the documentation suggests avoiding an ON DUPLICATE KEY UPDATE clause on tables with multiple unique indexes.
Set your link field as unique.
Before inserting a row, try
SELECT primary_id FROM mytable WHERE link_field = 'www.facebook.com'
and count the number of rows returned by this SQL.
=> If count > 0, UPDATE the row using the primary_id we just grabbed through the SELECT.
=> If count == 0, just INSERT your row.
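In SQL terms, that check-then-write flow looks roughly like this (the date_updated column and the primary_id value of 42 are placeholders):
-- 1. Check whether the link already exists:
SELECT primary_id FROM mytable WHERE link_field = 'www.facebook.com';
-- 2a. A row came back: UPDATE it by the id we just grabbed:
UPDATE mytable SET date_updated = NOW() WHERE primary_id = 42;
-- 2b. No row came back: INSERT the link instead:
INSERT INTO mytable (link_field) VALUES ('www.facebook.com');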
Beware!
While operating a web crawler that will probably find millions of links,
you want to minimize the queries each "crawl" process fires...
Do you want to create a unique link table that will feed the bots, or do you want to prevent duplicate search results?
Unique URL pool table:
While crawling a page, save the URLs to an array (or list) and make sure (!in_array()) that it only holds unique values; you will find that each page you crawl includes a lot of repeated links, so clean them before using SQL.
Convert the URLs to hashes (e.g. a "simhash" of 32 binary digits).
Now open a connection to the db and check whether each hash already exists; if it does, dump the link! Don't update (that is a second operation). Match the links using the hashes over an indexed table; it will be far faster.
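A hedged sketch of such a hash-keyed URL pool (MD5 stands in here for whatever hash you choose; table and column names are placeholders):
-- The hash is the primary key, so the existence check uses the index:
CREATE TABLE url_pool (
  url_hash CHAR(32) NOT NULL,
  url      TEXT     NOT NULL,
  PRIMARY KEY (url_hash)
);
-- Check whether a cleaned link is already known before queuing it:
SELECT 1 FROM url_pool WHERE url_hash = MD5('http://www.facebook.com');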
Preventing duplicate search results:
If you indexed the URLs with the above methodology you should not find duplicate URLs; if you do, it means there is a problem in your crawling operation.
Even if you have duplicate values in another table and you want to search it without returning duplicate results, you can use DISTINCT in your query.
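For example (the table and columns are hypothetical):
-- Collapse repeated URLs when querying a table that may contain duplicates:
SELECT DISTINCT url
FROM search_results
WHERE keyword = 'facebook';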
good luck!
I'm trying to filter rows from the MySQL table where all the $_POST data from an online form is stored. Sometimes the user's internet connection stalls or the browser screws up, and the new page after form submission is not displayed (though the INSERT worked and the table row was created). They then hit refresh and submit their form twice, creating a duplicate row (identical except for the timestamp and auto-increment id columns).
I'd like to select unique form submissions. This has to be a really common task, but I can't seem to find something that lets me apply DISTINCT to every column except the timestamp and id in a succinct way (sort of like SELECT id, timestamp, DISTINCT everything_else FROM table). At the moment, I can do:
CREATE TEMPORARY TABLE IF NOT EXISTS temp1 AS (
SELECT DISTINCT everything,except,id,and,timestamp
FROM table1
);
SELECT * FROM table1 LEFT OUTER JOIN temp1
ON table1.everything = temp1.everything
...
;
My table has 20k rows with about 25 columns (classification features for a machine learning exercise). This query takes forever (I presume it traverses the 20k rows 20k times?); I've never even let it run to completion. What's the standard-practice way to do this?
Note: this question suggests adding an index on the relevant columns, but an index can have at most 16 key parts. Should I just choose the most likely unique ones? I can find about 700 duplicates in 2 seconds this way, but I can't be sure I'm not throwing away a unique row, because I also have to ignore some columns when specifying the index.
If you have a UNIQUE key (other than an AUTO_INCREMENT), simply use INSERT IGNORE ... to silently avoid duplicate rows. If you don't have a UNIQUE key, do you never need to find a row again?
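For instance, a minimal sketch, assuming the form table has a UNIQUE key over the columns that identify a submission (the table and column names are placeholders):
-- A duplicate submission is silently skipped instead of raising an error:
INSERT IGNORE INTO submissions (email, answer_1, answer_2)
VALUES ('user@example.com', 'foo', 'bar');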
If you have already allowed duplicates and you need to get rid of them, that is a different question.
I would try to eliminate the problem in the first place. There are techniques for this. The first one that comes to mind is to generate a random string and store it both in the session and in a hidden field in the form. This random string should be generated each time the form is displayed. When the user submits the form, check that the session key and the submitted key match, and make sure to generate a different key on each request. Thus when a user refreshes the page he will submit an old key and it will not match.
Another solution: if this data should always be unique in the database, check whether that exact data is already in the database before inserting. And if the data is unique by, let's say, the email address, you can create a unique key index; that field will then have to be unique in the table.
This has been discussed before; however, I cannot understand the answers I have found.
Essentially I have a table with three columns: memo, user and keyid (the last one is the primary key and AUTO_INCREMENT). I insert a pair of values (memo and user), but if I try to insert that same pair again it should not happen.
From what I found out, the methods to do this all depend on a unique key (which I've got, in keyid), but what I don't understand is that you still seem to need a second query just to get the keyid of the existing pair (or get nothing, in which case you go ahead with the insertion).
Is there any way to do all of this in a single query? Or am I understanding what I've read (using REPLACE or IGNORE) wrong?
You need to set a UNIQUE KEY on user + memo:
ALTER TABLE mytable
ADD CONSTRAINT unique_user_memo UNIQUE (memo,user)
and then use INSERT IGNORE or REPLACE according to your needs when inserting. Your current unique key is the primary key, which is all well and good, but you need a second one in order to disallow the insertion of duplicate data. If you do not create a new unique key on the two columns together, then you'll need to run a SELECT query before every insert to check whether the pair already exists.
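With that composite key in place, the whole operation fits in a single statement; the values below are placeholders:
-- Silently skip the insert if the (memo, user) pair already exists:
INSERT IGNORE INTO mytable (memo, user) VALUES ('buy milk', 'alice');
-- Or replace the existing row with a fresh one
-- (note: REPLACE deletes and re-inserts, so keyid gets a new value):
REPLACE INTO mytable (memo, user) VALUES ('buy milk', 'alice');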
I have a table with 55 columns. This table is going to be populated with data from a CSV file. I have created a PHP script which reads in the CSV file and inserts the records.
Whilst scanning through the CSV file I noticed there are some rows that are duplicates. I want to eliminate all duplicate records.
My question is, what would be the best way of doing this? I assume it will be either one of these two options:
Remove / skip duplicate records at source, i.e. duplicate records will not be inserted in the table.
Insert all records from the CSV file, then query the table to find and remove all duplicate records.
For option one, would this be possible to do using MS Excel or even just a text editor?
For option two, I came across some possible solutions, but surely this would result in a rather large query. I am looking for something short and simple. Is this at all possible to do?
A good way is to define a key for the table. A key is a set of fields that makes each record unique and on which all other fields depend. (In the worst case the key will consist of all the columns in your table, but usually you can define a smaller one.) Then you can use the database itself to enforce that key, for example with a primary key constraint or a unique index.
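As a hedged sketch, suppose columns col_a and col_b form that key (the names are placeholders); then the import script can simply skip duplicates:
-- Enforce the key at the database level:
ALTER TABLE csv_import ADD UNIQUE KEY uniq_record (col_a, col_b);
-- CSV rows that repeat an existing (col_a, col_b) pair are skipped:
INSERT IGNORE INTO csv_import (col_a, col_b, col_c)
VALUES ('value a', 'value b', 'value c');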
I have 2 equal databases (A and B), each with one table, running on separate offline machines.
Every day I export their data (as CSV) and "merge" it into a third database (C). I first process A, then B (I insert the contents of A into C, then the contents of B into C).
Now, it could happen that I get duplicate rows. I consider a row a duplicate if some field, for example "mail", already exists. I don't care whether the rest of the fields are the same.
How can I insert A and B into C excluding those rows that are duplicates?
Thanks in advance!
The easiest solution would be to create a unique index on the columns in question and run the second insert as INSERT IGNORE.
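For example, assuming the table in C is called merged and the exports from A and B have been loaded into staging tables staging_a and staging_b (all placeholder names; mail is the duplicate criterion from the question):
-- Enforce the duplicate rule on the merged table:
ALTER TABLE merged ADD UNIQUE KEY uniq_mail (mail);
-- Process A first, then B; rows whose mail already exists are skipped:
INSERT IGNORE INTO merged SELECT * FROM staging_a;
INSERT IGNORE INTO merged SELECT * FROM staging_b;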
Personally I use ON DUPLICATE KEY UPDATE, because using INSERT IGNORE causes any errors to be thrown as warnings.
This may have some side effects and may result in behavior you may not expect. See this post for details on some of the side effects.
If you end up using the ON DUPLICATE KEY UPDATE syntax, it will also provide a means of changing your logic to update specific fields with new data should business requirements change.
For instance, you can tally how many times a duplicate record was inserted by saying ON DUPLICATE KEY UPDATE quantity = quantity+1.
The post referenced above has a ton more information.
Hey, let me explain my problem. I have a MySQL table in which I store data feeds from, say, 5 different sites. I update the feeds once daily. I have a primary key, FeedId, which auto-increments. Now, when I update the feeds from a particular site, I delete the previous data for that site from my table and insert the new data. This way the new data fills the rows occupied by the previously deleted data, and if this time there are more feeds, the rest are appended at the end of the table. But the FeedId is incremented for all the new data.
What I want is for the feeds stored in the old locations to retain their previous IDs, and only the extra ones saved at the end to get new incremented IDs. Please help, as I can't figure out how to do that.
A better solution would be to set a unique key on the feed (aside from the auto-incremented key), then use INSERT ... ON DUPLICATE KEY UPDATE:
INSERT INTO feeds (name, url, etc, etc2, `update_count`)
VALUES ('name', 'url', 'etc', 'etc2', 1)
ON DUPLICATE KEY UPDATE
`etc` = VALUES(`etc`),
`etc2` = VALUES(`etc2`),
`update_count` = `update_count` + 1;
The benefit is that you're not incrementing the ids, and you're still doing it in one atomic query. Plus, you're only updating / changing what you need to change. (Note that I included the update_count column to show how to update a field)...
(Marking this post as deleted based on the comments.)
Try REPLACE INTO to merge the data.
More information at:
http://dev.mysql.com/doc/refman/5.0/en/replace.html