Delete duplicate rows, do not preserve one row - mysql

I need a query that goes through each entry in a database, checks whether a single value is duplicated elsewhere in the database, and if it is, deletes both entries (or all of them, if there are more than two).
The problem is that the entries are URLs, up to 255 characters, with no other way of identifying the row. Some existing answers on Stack Overflow don't work for me due to performance limitations, or they use a unique id, which obviously won't work when dealing with a string.
Long Version:
I have two databases containing URLs (and only URLs). One database has around 3,000 urls and the other around 1,000.
However, a large majority of the 1,000 urls were taken from the 3,000 url database. I need to merge the 1,000 into the 3,000 as new entries only.
For this, I made a third database with the combined URLs from both tables, about 4,000 entries. I need to find all duplicate entries in this database and delete them (both of them, leaving neither).
I have followed the queries from a few examples on this site, but whenever I try to delete both entries it ends up deleting all the entries or giving SQL errors.
Alternatively:
I have two databases, each containing one of the separate URL lists. I need to check each row from one database against the other to find any that aren't duplicates, and then add those to a third database.

Since you were looking for an SQL solution, here is one. Let's assume, for simplicity's sake, that your table has a single column; this will of course work for any number of fields:
CREATE TABLE `allkindsofvalues` (
    `value` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The following series of queries will accomplish what you are looking for:
CREATE TABLE allkindsofvalues_temp LIKE allkindsofvalues;

INSERT INTO allkindsofvalues_temp
SELECT * FROM allkindsofvalues akv1
WHERE (SELECT COUNT(*) FROM allkindsofvalues akv2 WHERE akv1.value = akv2.value) = 1;

DROP TABLE allkindsofvalues;
RENAME TABLE allkindsofvalues_temp TO allkindsofvalues;
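For completeness, the same pruning can be done in a single statement. This is a sketch, not part of the original answer: it wraps the duplicate list in a derived table because MySQL will not let a DELETE select from its own target table directly.

```sql
DELETE FROM allkindsofvalues
WHERE value IN (
    -- the derived table works around MySQL's "can't reopen target table" rule
    SELECT value FROM (
        SELECT value
        FROM allkindsofvalues
        GROUP BY value
        HAVING COUNT(*) > 1
    ) AS dupes
);
```

The temp-table version above avoids this restriction entirely, which is why it is often preferred for large tables.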

The OP wrote:
I've got my own PHP solution which is pretty hacky, but works.
I went with a PHP script to accomplish this, as I'm more familiar with PHP than MySQL.
This generates a simple list of urls that exist in the target database but not in both. If you have more than 7,000 entries to parse, this may take a while, and you will need to copy/paste the results into a text file or expand the script to store them back into a database. I'm just doing it manually to save time.
Note: Uses MeekroDB
<?php
require('meekrodb.2.1.class.php');

DB::$user = 'root';
DB::$password = '';
DB::$dbName = 'testdb';

$all = DB::query('SELECT * FROM old_urls LIMIT 7000');

foreach ($all as $row) {
    $test = DB::query('SELECT url FROM new_urls WHERE url=%s', $row['url']);
    if (!is_array($test) || count($test) == 0) {
        // No match in new_urls, so this url only exists in old_urls
        echo $row['url'] . "\n";
    }
}
?>

Related

is there a better way to use mysql_insert_id()

I have the following sql statement written in PHP.
$sql = 'INSERT INTO pictures (`picture`) VALUES ("'.$imgs[$i]['name'].'")';
$db->query($sql);
$imgs[$i]['sqlID'] = $this->id = mysql_insert_id();
$imgs[$i]['newImgName'] = $imgs[$i]['sqlID'].'_'.$imgs[$i]['name'];
$sql = 'UPDATE pictures SET picture="'.$imgs[$i]['newImgName'].'" WHERE id='.$imgs[$i]['sqlID'];
$db->query($sql);
Now this writes the image name to the database table pictures. After that is done, I get mysql_insert_id() and then update the picture name, prefixing it with the last id and an underscore.
I do this to make sure no two picture names can be the same, because all those pictures get saved in the same folder. Is there a way to save that ID already the first time I run the insert query? Or are there other, better ways to achieve this result?
Thanks for any advice.
Using the native auto_increment - there is no other way. You need to do the 3 steps you described.
As Dan Bracuk mentioned, you can create a stored proc to do the 3 queries (you can still get the insert id after executing it).
Other possible options are:
not storing the id in the filename - you can concatenate it later if you want (when selecting)
using an ad-hoc auto increment instead of the native one - I would not recommend it in this case, but it's possible
using some sort of UUID instead of auto increment
generating unique file names using the file system (Marcell Fülöp's answer)
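As a sketch of the UUID option (the file name 'photo.jpg' is just an illustrative placeholder, and this DDL-free approach assumes the existing `pictures` table), the unique part of the name can come straight from MySQL, avoiding the second UPDATE entirely:

```sql
-- UUID() is unique per call, so no insert-id round trip is needed
INSERT INTO pictures (`picture`)
VALUES (CONCAT(REPLACE(UUID(), '-', ''), '_', 'photo.jpg'));
```

The trade-off is a longer, non-sequential prefix instead of a compact numeric id.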
I don't think in this particular case it's reasonable to use MySQL insert Id in the file name. It might be required in some cases but you provided no information why it would be in this one.
Consider something like:
while (file_exists($folder . $filename)) {
    $filename = rand(1, 999) . '_' . $filename;
}
$imgs[$i]['newImgName'] = $filename;
Of course you can use a larger range for rand, or a loop counter if you want to systematically increment the number used to prefix the original file name.

Unique fields on mySQL table - generating promo codes

I am developing a PHP script and I have a table like this:
TABLE_CODE
code varchar 8
name varchar 30
This code column has to contain a code using random letters from A to Z and digits from 0 to 9, all uppercase, and it has to be unique. Something like
A4RTX33Z
I have created a method to generate this code using PHP. But this is an intensive task, because I have to query the database to see if the generated code is unique before proceeding, and the table may have a lot of records.
I know MySQL is a bag of tricks, but I don't have advanced knowledge of it, so I wonder if there's some mechanism that could be built into a table to run a script (or something) every time a new record is created on that table, to fill the code column with a unique value.
thanks
edit: What I wonder is if there's a way to create the code on the fly, as the record is being added to the table, with that code being unique.
Better to generate these codes in SQL. This is an 8-character random "promo code generator":
INSERT IGNORE INTO
    TABLE_CODE (code, name)
VALUES (
    UPPER(SUBSTRING(MD5(RAND()) FROM 1 FOR 8)), -- random 8 characters, fixed length
    'your code name'
);
Add a UNIQUE index on the code field as #JW suggested, and some error handling in PHP, because sometimes the generated value may not be unique, and MySQL will raise an error in that situation.
Adding a UNIQUE constraint on the code column is the first thing you would need to do. Then, to insert the code I would write a small loop like this:
// INSERT IGNORE will not generate an error if the code already exists;
// rather, the affected row count will be 0.
$stmt = $db->prepare('INSERT IGNORE INTO table_code (code, name) VALUES (?, ?)');
$name = 'whatever name';
do {
    $code = func_to_generate_code();
    $stmt->execute(array($code, $name));
} while (!$stmt->rowCount()); // repeat until at least one row affected
As the table grows, the number of loop iterations may increase, so if you feel it should only try, say, three times, you could add that as a loop condition and throw an error when it is exceeded.
By the way, I would suggest using transactions, so that if an error occurs after the code generation, rolling back will make sure the code is removed (and can be reused).

MySQL - Fastest way to check if data in InnoDB table has changed

My application is very database intensive. Currently, I'm running MySQL 5.5.19 and using MyISAM, but I'm in the process of migrating to InnoDB. The only problem left is checksum performance.
My application does about 500-1000 "CHECKSUM TABLE" statements per second at peak times, because the clients' GUI polls the database constantly for changes (it is a monitoring system, so it must be very responsive and fast).
With MyISAM, there are Live checksums that are precalculated on table modification and are VERY fast. However, there is no such thing in InnoDB. So, CHECKSUM TABLE is very slow...
I hoped to be able to check the table's last update time, but unfortunately this is not available in InnoDB either. I'm stuck now, because tests have shown that the application's performance drops drastically...
There are simply too many lines of code that update the tables, so implementing logic in the application to log table changes is out of the question...
The database ecosystem consists of one master and 3 slaves, so local file checks are not an option.
I thought of a method to mimic a checksum cache - a lookup table with two columns, table_name and checksum, updated by triggers when a table changes. But I have around 100 tables to monitor, and that means 3 triggers per table = 300 triggers. Hard to maintain, and I'm not sure this won't be a performance hog again.
So is there any FAST method to detect changes in InnoDB tables?
Thanks!
The simplest way is to add a nullable column of type TIMESTAMP with the ON UPDATE CURRENT_TIMESTAMP attribute.
Therefore, your inserts will not have to change, because the column accepts NULLs, and you can select only new and changed rows by saying:
SELECT * FROM `table` WHERE `mdate` > '2011-12-21 12:31:22'
Every time you update a row, this column will change automatically.
Here is some more information: http://dev.mysql.com/doc/refman/5.0/en/timestamp.html
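A column like that could be added with something along these lines. The column name `mdate` matches the SELECT above, but the exact DDL is a sketch, not from the original answer:

```sql
-- nullable, no DEFAULT, so existing INSERT statements keep working;
-- the column is set automatically whenever the row is updated
ALTER TABLE `table`
    ADD COLUMN `mdate` TIMESTAMP NULL
    ON UPDATE CURRENT_TIMESTAMP;
```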
To see deleted rows simply create a trigger which is going to log every deletion to another table:
DELIMITER $$
CREATE TRIGGER MyTable_Trigger
AFTER DELETE ON MyTable
FOR EACH ROW
BEGIN
    INSERT INTO MyTable_Deleted VALUES (OLD.id, NOW());
END$$
DELIMITER ;
I think I've found the solution. For some time I had been looking at Percona Server as a replacement for my MySQL servers, and now I think there is a good reason to switch.
Percona Server introduces many new INFORMATION_SCHEMA tables, like INNODB_TABLE_STATS, which isn't available in the standard MySQL server.
When you do:
SELECT rows, modified FROM information_schema.innodb_table_stats WHERE table_schema='db' AND table_name='table'
You get the actual row count and a counter. The official documentation says the following about this field:
If the value of modified column exceeds “rows / 16” or 2000000000, the
statistics recalculation is done when innodb_stats_auto_update == 1.
We can estimate the oldness of the statistics by this value.
So this counter wraps every once in a while, but you can make a checksum of the number of rows and the counter, and then with every modification of the table you get a unique checksum. E.g.:
SELECT MD5(CONCAT(rows,'_',modified)) AS checksum FROM information_schema.innodb_table_stats WHERE table_schema='db' AND table_name='table';
I was going to upgrade my servers to Percona Server anyway, so being tied to it is not an issue for me. Managing hundreds of triggers and adding fields to tables would be a major pain for this application, because it's very late in development.
This is the PHP function I've come up with to make sure that tables can be checksummed whatever engine and server is used:
function checksum_table($input_tables){
    if (!$input_tables) return false; // Sanity check
    $tables = is_array($input_tables) ? $input_tables : array($input_tables); // Make $tables always an array
    $where = "";
    $checksum = "";
    $found_tables = array();
    $tables_indexed = array();
    foreach ($tables as $table_name) {
        $tables_indexed[$table_name] = true; // Indexed array for faster searching
        if (strstr($table_name, ".")) { // If we are passing db.table_name
            $table_name_split = explode(".", $table_name);
            $where .= "(table_schema='".$table_name_split[0]."' AND table_name='".$table_name_split[1]."') OR ";
        } else {
            $where .= "(table_schema=DATABASE() AND table_name='".$table_name."') OR ";
        }
    }
    if ($where != "") { // Sanity check
        $where = substr($where, 0, -4); // Remove the last "OR "
        $get_chksum = mysql_query("SELECT table_schema, table_name, rows, modified FROM information_schema.innodb_table_stats WHERE ".$where);
        while ($row = mysql_fetch_assoc($get_chksum)) {
            if (isset($tables_indexed[$row['table_name']])) { // Not entirely foolproof, but saves some queries like "SELECT DATABASE()" to find out the current database
                $found_tables[$row['table_name']] = true;
            } elseif (isset($tables_indexed[$row['table_schema'].".".$row['table_name']])) {
                $found_tables[$row['table_schema'].".".$row['table_name']] = true;
            }
            $checksum .= "_".$row['rows']."_".$row['modified']."_";
        }
    }
    foreach ($tables as $table_name) {
        if (empty($found_tables[$table_name])) { // Table not found in information_schema.innodb_table_stats (probably not an InnoDB table, or not Percona Server)
            $get_chksum = mysql_query("CHECKSUM TABLE ".$table_name); // Checksumming the old-fashioned way
            $chksum = mysql_fetch_assoc($get_chksum);
            $checksum .= "_".$chksum['Checksum']."_";
        }
    }
    // Using crc32 because it's faster than md5(). Must be returned as a string
    // to prevent PHP's signed-integer problems.
    return sprintf("%s", crc32($checksum));
}
You can use it like this:
// checksum a single table in the current db
$checksum = checksum_table("test_table");
// checksum a single table in a db other than the current one
$checksum = checksum_table("other_db.test_table");
// checksum multiple tables at once; faster on Percona Server, because all tables are checksummed via one select
$checksum = checksum_table(array("test_table", "other_db.test_table"));
I hope this saves some trouble to other people having the same problem.

What's the fastest way to check if a URL already exists in a MySQL table? [duplicate]

This question already has an answer here:
If url already exists in url table in mysql. Break operation in php script
(1 answer)
Closed 9 months ago.
I have a varchar(255) column where I store URLs in a MySQL database. This column has a unique index.
When my crawler encounters a URL, it has to check the database to see if that URL already exists. If it exists, the crawler selects data about that entry. If it does not exist, the crawler adds the url. I currently do this with the following code:
$sql = "SELECT id, junk
        FROM files
        WHERE url = '$url'";
$results = $this->mysqli->query($sql);

// the file already exists in the system
if ($results->num_rows > 0) {
    // store data to variables
}
// the file does not exist yet... add it
else {
    // insert new file
    $sql = "INSERT INTO files( url )
            VALUES( '$url' )";
    $results = $this->mysqli->query($sql);
}
I realize there are lots of ways to do this. I've read that using a MySQL if/else statement could speed this up. Can someone explain how MySQL would handle that differently, and why that may be faster? Are there other alternatives I should test? My crawlers are doing a lot of checking like this, and speeding up this process could be a significant speed boost for my system.
First of all, URLs can get much longer than varchar(255).
Second of all, because they're that long you don't want to do string compares, it gets very slow as the table grows. Instead, create a column with a hash value and compare that.
You should index the hash column, of course.
As for the actual insert, an alternative is to put a unique constraint on the hash. Then do your inserts blindly, allowing SQL to reject the dupes. (But you'll have to put an exception handler into your code, which has its own overhead.)
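A sketch of the hash-column idea (the column and index names here are made up, and MD5 is only one possible hash; storing it as BINARY(16) keeps the index compact):

```sql
-- unique index on the hash doubles as the dupe-rejection constraint;
-- multiple NULLs are allowed, so the backfill below can run afterwards
ALTER TABLE files
    ADD COLUMN url_hash BINARY(16),
    ADD UNIQUE INDEX idx_url_hash (url_hash);

-- backfill existing rows, then compare hashes instead of full strings
UPDATE files SET url_hash = UNHEX(MD5(url));
SELECT id, junk FROM files WHERE url_hash = UNHEX(MD5('http://example.com/'));
```

The fixed-width 16-byte key compares much faster than a 255-character string as the table grows.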
If you are not using transactions, you can insert a new row only when no matching row exists by using INSERT ... SELECT with a NOT EXISTS condition (a plain INSERT ... VALUES does not accept a WHERE clause):
INSERT INTO files (url)
SELECT '$url' FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM files WHERE url = '$url');
I can't think of a one-line command to select and insert at the same time.
I would do the insert first and check for success (affected_rows), then select. If you check first and then insert, the possibility exists that the url gets inserted during that small time window, and you would need more code to handle that situation.
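That insert-first pattern might look like the following sketch, which assumes the unique index on url from the question (the literal URL stands in for the application's `$url` variable):

```sql
INSERT IGNORE INTO files (url) VALUES ('http://example.com/');
-- if the affected row count is 1, the url was new;
-- if it is 0, the url already existed, so fetch its data
SELECT id, junk FROM files WHERE url = 'http://example.com/';
```

Because the unique index rejects the duplicate atomically, there is no window in which another client can sneak the same url in between the check and the insert.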

processing data with perl - selecting for update usage with mysql

I have a table that stores data that needs to be processed, with id, status, and data columns. I'm currently selecting id, data where status = #, then doing an update immediately after the select, changing the status so that the row won't be selected again.
My program is multithreaded, and sometimes two threads grab the same id because they query the table at nearly the same time. I looked into SELECT ... FOR UPDATE; however, either I wrote the query wrong or I'm not understanding what it is used for.
My goal is to grab the id and data I need and set the status so that no other thread tries to grab and process the same data. Here is the code I tried. (I wrote it all together here for show purposes; in the real program the prepares are set at the beginning, so a prepare isn't done every time it runs, in case anyone was concerned.)
my $select = $db->prepare("SELECT id, data FROM `TestTable` WHERE _status=4 LIMIT ? FOR UPDATE") or die $DBI::errstr;
if ($select->execute($limit)) {
    while ($data = $select->fetchrow_hashref()) {
        my $update_status = $db->prepare("UPDATE `TestTable` SET _status = ?, data = ? WHERE _id=?");
        $update_status->execute(10, "", $data->{_id});
        push(@array_hash, $data);
    }
}
When I run this with multiple threads, I get many duplicate inserts when trying to insert after I process my transaction data.
I'm not terribly familiar with MySQL, and in the research I've done I haven't found anything that really cleared this up for me.
Thanks
As a sanity check, are you using InnoDB? MyISAM has zero transactional support, aside from faking it with full table locking.
I don't see where you're starting a transaction. MySQL's autocommit option is on by default, so starting a transaction and later committing would be necessary unless you turned off autocommit.
It looks like you simply rely on the database locking mechanisms. I googled perl dbi locking and found this:
$dbh->do("LOCK TABLES foo WRITE, bar READ");
my $sth  = $dbh->prepare("SELECT x,y,z FROM bar");
my $sth2 = $dbh->prepare("INSERT INTO foo SET a = ?");
$sth->execute();
while (my @ary = $sth->fetchrow_array()) {
    $sth2->execute($ary[0]);
}
$sth2->finish();
$sth->finish();
$dbh->do("UNLOCK TABLES");
Not really saying GIYF as I am also fairly novice at both MySQL and DBI, but perhaps you can find other answers that way.
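To make the FOR UPDATE approach from the question work, the key missing piece is an explicit transaction. A sketch follows; the status values mirror the question's code, and the specific id is only illustrative:

```sql
START TRANSACTION;
-- rows returned here stay locked until COMMIT, so another thread's
-- SELECT ... FOR UPDATE on the same rows will block instead of
-- grabbing duplicates
SELECT id, data FROM TestTable WHERE _status = 4 LIMIT 10 FOR UPDATE;
-- suppose the select returned id 42; mark it taken before committing
UPDATE TestTable SET _status = 10 WHERE id = 42;
COMMIT;
```

With autocommit on and no START TRANSACTION, the FOR UPDATE lock is released as soon as the select finishes, which would explain the duplicate grabs described above.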
Another option might be as follows, and this only works if you control all the code accessing the data. You can create a lock column in the table. When your code accesses the table, it does (pseudocode):
if row.lock != 1
    row.lock = 1
    read row
    update row
    row.lock = 0
    next
else
    sleep 1
    redo
Again, though, this trusts that all users/scripts that access this data will agree to follow this policy. If you cannot ensure that, then this won't work.
Anyway, that's all the knowledge I have on the topic. Good luck!