MySQL: Find and delete similar records - Updated with example

I'm trying to dedupe a table where I know there are 'close' (but not exact) duplicate rows that need to be removed.
I have a single table with 22 fields, and duplicates can be identified by comparing 5 of those fields. Of the remaining 17 fields (including the unique key), there are 3 fields that make every row unique, so a straightforward whole-row dedupe will not work.
I was looking at the multi-table delete method outlined here: http://blog.krisgielen.be/archives/111 but I can't make sense of the final line of code (AND M1.cd*100+M1.track > M2.cd*100+M2.track), as I am unsure what the cd*100 part achieves...
Can anyone assist me with this? I suspect I could do better by exporting the whole thing to Python, processing it there, and re-importing it, but then (1) I'm stuck on how to dedupe the strings anyway, and (2) I had to break the data into chunks just to import it into MySQL, because the import was timing out after 300 seconds, so getting it into MySQL in the first place was a whole debacle... (I am a complete novice at both MySQL and Python.)
The table is a dump of some 40 log files from testing. The test set for each log is about 20,000 files. The repeating values are either the test conditions, the file name/parameters, or the results of the tests.
SHOW CREATE TABLE:
CREATE TABLE `t1` (
`DROID_V` int(1) DEFAULT NULL,
`Sig_V` varchar(7) DEFAULT NULL,
`SPEED` varchar(4) DEFAULT NULL,
`ID` varchar(7) DEFAULT NULL,
`PARENT_ID` varchar(10) DEFAULT NULL,
`URI` varchar(10) DEFAULT NULL,
`FILE_PATH` varchar(68) DEFAULT NULL,
`NAME` varchar(17) DEFAULT NULL,
`METHOD` varchar(10) DEFAULT NULL,
`STATUS` varchar(14) DEFAULT NULL,
`SIZE` int(10) DEFAULT NULL,
`TYPE` varchar(10) DEFAULT NULL,
`EXT` varchar(4) DEFAULT NULL,
`LAST_MODIFIED` varchar(10) DEFAULT NULL,
`EXTENSION_MISMATCH` varchar(32) DEFAULT NULL,
`MD5_HASH` varchar(10) DEFAULT NULL,
`FORMAT_COUNT` varchar(10) DEFAULT NULL,
`PUID` varchar(15) DEFAULT NULL,
`MIME_TYPE` varchar(24) DEFAULT NULL,
`FORMAT_NAME` varchar(10) DEFAULT NULL,
`FORMAT_VERSION` varchar(10) DEFAULT NULL,
`INDEX` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`INDEX`)
) ENGINE=MyISAM AUTO_INCREMENT=960831 DEFAULT CHARSET=utf8
The only unique field is the primary key, `INDEX`.
Unique records can be identified by looking at DROID_V, Sig_V, SPEED, NAME and PUID.
Of the ~900,000 rows, I have about 10,000 dups that are either a single duplicate of a record or have up to 6 repetitions of the record.
Row examples: As Is
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
5;"v37";"slow";"12766";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"193977"
5;"v37";"slow";"12768";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"193978"
5;"v37";"slow";"12769";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"193979"
5;"v37";"slow";"12770";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"193980"
Row Example: As It should be
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
Please note, you can see from the index column at the end that I have cut out some other rows - I have only identified a very small set of repeating rows here. Please let me know if you need any more 'noise' from the rest of the DB.
Thanks.

I figured out a fix. I had been using a COUNT(*) that just counted everything in the table; by using COUNT(DISTINCT NAME) instead, I am able to weed out the dup rows that fit the dup criteria (as set out by the field selection in the WHERE clause).
Example:
SELECT `PUID`, `DROID_V`, `SIG_V`, `SPEED`, COUNT(DISTINCT NAME) AS Hit
FROM sourcelist, main_small
WHERE sourcelist.SourcePUID = 'MyVariableHere'
  AND main_small.NAME = sourcelist.SourceFileName
GROUP BY `PUID`, `DROID_V`, `SIG_V`, `SPEED`
ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;
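For the actual deletion step, the multi-table delete pattern from the linked article maps onto this table roughly as follows. This is only a sketch: it assumes (as described above) that DROID_V, Sig_V, SPEED, NAME and PUID together define a duplicate, and it keeps the copy with the lowest `INDEX`. Back the table up, or run the equivalent SELECT first, before deleting anything.
-- Delete every row for which an identical row (on the five key fields)
-- with a smaller `INDEX` exists, keeping only the earliest copy.
DELETE later_rows
FROM t1 AS later_rows
JOIN t1 AS earlier_rows
  ON  earlier_rows.DROID_V = later_rows.DROID_V
  AND earlier_rows.Sig_V   = later_rows.Sig_V
  AND earlier_rows.SPEED   = later_rows.SPEED
  AND earlier_rows.NAME    = later_rows.NAME
  AND earlier_rows.PUID    = later_rows.PUID
  AND earlier_rows.`INDEX` < later_rows.`INDEX`;
The `INDEX` comparison here plays the same role as the cd*100+track expression in the article: that expression simply folds two columns into one ordering key (valid as long as track stays below 100), so that of each duplicate pair only the row that sorts later gets deleted.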

Related

MySQL query issue with combining two tables

I have two tables:
`search_chat` (
`pubchatid` varchar(255) NOT NULL,
`profile` varchar(255) DEFAULT NULL,
`prefs` varchar(255) DEFAULT NULL,
`init` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`session` varchar(255) DEFAULT NULL,
`device` varchar(255) DEFAULT NULL,
`uid` int(10) DEFAULT NULL,
PRIMARY KEY (`pubchatid`)
)
and
`chats` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`chatlog` varchar(255) DEFAULT NULL,
`block` varchar(2) DEFAULT '',
`whenadded` datetime DEFAULT NULL,
`pubchatid1` varchar(255) DEFAULT NULL,
`pubchatid2` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
)
So basically people chat with each other through a search system based on preferences. The further apart their preferences are, the worse the match. The query I have is simple:
SELECT *
FROM search_chat
WHERE levenshtein(profile, "[user_prefs]") < 20
AND pubchatid <> "[user_pubchatid]"
ORDER BY
levenshtein(profile, "[user_prefs]")
LIMIT 1
It is a shitty query in itself, but it does the job (everything between "[ ]" is a variable I fill in, just to make that clear).
As you can see, this query only matches two people's preferences (prefs) against profiles (profile). So far so good.
I have been fiddling around trying to make this query also check whether they have had previous chats. That is where "chats" comes in. I cannot get the query to check for the right user and see whether they have an open chat.
In chats, the "search_chat.pubchatid" can be either "chats.pubchatid1" or "chats.pubchatid2".
So somehow I have to make these two tables work together, with chats ruling out options from search_chat.
Do you want something like this:
-- ... ( start of query as per your question )
and not exists (
select *
from chats
where ( ( chats.pubchatid1 = search_chat.pubchatid )
or ( chats.pubchatid2 = search_chat.pubchatid ) )
and -- ... add any restriction on how recent the chat was
)
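Assembled with the original query, the whole thing would look roughly like this (a sketch only; the restriction on how recent a chat counts is left as a comment, as in the answer above, and the "[...]" placeholders are the asker's variables):
SELECT *
FROM search_chat
WHERE levenshtein(profile, "[user_prefs]") < 20
  AND pubchatid <> "[user_pubchatid]"
  AND NOT EXISTS (
        SELECT 1
        FROM chats
        WHERE chats.pubchatid1 = search_chat.pubchatid
           OR chats.pubchatid2 = search_chat.pubchatid
        -- ... optionally AND a condition on chats.whenadded here
      )
ORDER BY
  levenshtein(profile, "[user_prefs]")
LIMIT 1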

Can't fetch a geolocation result from imported MySQL database

I've downloaded Software77's free geoip country DB. I've imported it using Sequel Pro - the resulting table looks to have the same number of lines as the .csv.
The following queries however, return 0 results:
SELECT * FROM s77_country WHERE ip_num_start = "3000762368";
SELECT country_code FROM s77_country WHERE 3000831958 BETWEEN ip_num_start AND ip_num_end
In the .csv, line 49046:
"3000762368","3001024511","ripencc","1269993600","RS","SRB","Serbia"
That looks to be in the range, so the result should have been "RS".
Here's the table setup:
CREATE TABLE `s77_country` (
`ip_num_start` int(11) DEFAULT NULL,
`ip_num_end` int(11) DEFAULT NULL,
`registry` varchar(255) DEFAULT NULL,
`assigned` bigint(11) DEFAULT NULL,
`country_code` varchar(255) DEFAULT NULL,
`country_code_long` varchar(255) DEFAULT NULL,
`country_name` varchar(255) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
What am I missing here?
You need to use BIGINT or INT UNSIGNED for your ip_num columns. IP numbers go above 2 billion, and signed INT is limited there.
Your maximum value for ip_num is theoretically 255*256^3 + 255*256^2 + 255*256 + 255, which is 4294967295, double what a signed INT can store.
http://dev.mysql.com/doc/refman/5.0/en/integer-types.html
Edit: INT UNSIGNED fits exactly that value.
Actually, there is one more aspect of the error you have missed.
The value 3000831958 doesn't fit into the range of a signed INT. Use BIGINT or INT(11) UNSIGNED instead for both ip_num_start and ip_num_end.
The record you have specified will not even be inserted into the table because of the range violation.
Your select query is fine.
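For example, the column change could look something like this (a sketch based on the table definition in the question; the data will likely need to be re-imported afterwards, since the out-of-range values were never stored correctly):
ALTER TABLE `s77_country`
  MODIFY `ip_num_start` INT UNSIGNED DEFAULT NULL,
  MODIFY `ip_num_end`   INT UNSIGNED DEFAULT NULL;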

MySQL Merged table duplicates

Here is what I currently have:
Archive tables (one for each year, 2008-2011) and 4 newly created tables for 2012 broken into quarters. All of these tables, including the new ones, have the same structure and keys. The naming convention for these is ARCHIVE_PLAYS. I then have a "live" table (called PLAYS) for current data, and a merged table that combines all of the tables so that I can run reports. The issue I have, which I didn't have before, is that this merged table is showing duplicates. The rows have the same primary keys, so this shouldn't be the case, right? It must have something to do with the new tables I just created, as I didn't have this issue before.
Structure:
**COMPANY**
COMPANY.MERGED_PLAYS
COMPANY.ARCHIVE_PLAYS_2008
COMPANY.ARCHIVE_PLAYS_2009
COMPANY.ARCHIVE_PLAYS_2010
COMPANY.ARCHIVE_PLAYS_2011
COMPANY.ARCHIVE_PLAYS_2012Q1
COMPANY.ARCHIVE_PLAYS_2012Q2
COMPANY.ARCHIVE_PLAYS_2012Q3
COMPANY.ARCHIVE_PLAYS_2012Q4
**COMPANY2**
COMPANY2.PLAYS
Each table, with the exception of MERGED_PLAYS, has the following CREATE:
CREATE TABLE `ARCHIVE_PLAYS_2011` (
`ENTRY_ID` BIGINT(20) NOT NULL,
`NODE_ID` VARCHAR(48) NOT NULL,
`HW_ID` VARBINARY(64) NOT NULL,
`LOG_DAY` DATE NOT NULL,
`ROW_NUMBER` INT(11) NOT NULL,
`NODE_NAME` VARCHAR(128) NOT NULL,
`FILE_NAME` VARCHAR(1024) NOT NULL,
`PRESENTATION_NAME` VARCHAR(1024) NULL DEFAULT NULL,
`SMIL_SEQUENCE_ID` VARCHAR(256) NULL DEFAULT NULL,
`SMIL_CONTENT_ID` VARCHAR(256) NULL DEFAULT NULL,
`PLAY_TIME_MS` BIGINT(20) NOT NULL,
`PLAY_TIME` TIME NOT NULL,
`STATUS_CODE` VARCHAR(48) NULL DEFAULT NULL,
`NUM_SCREENS_CONNECTED_AND_ON` INT(11) NULL DEFAULT NULL,
`NUM_SPEAKERS_CONNECTED_AND_ON` INT(11) NULL DEFAULT NULL,
`SCREEN_LAYOUT_MATCHES` CHAR(1) NULL DEFAULT NULL,
`ENTRY_PROCESSED` CHAR(1) NULL DEFAULT NULL,
`FILE_PATH` VARCHAR(1024) NULL DEFAULT NULL,
PRIMARY KEY (`NODE_ID`, `LOG_DAY`, `ROW_NUMBER`),
INDEX `PLAYLOG_ENTRY_ID` (`ENTRY_ID`),
INDEX `PLAYLOG_LOG_DAY` (`LOG_DAY`),
INDEX `PLAYLOG_LOG_DAY_PLAY_TIME` (`LOG_DAY`, `PLAY_TIME`),
INDEX `PLAYLOG_FILE_NAME` (`FILE_NAME`(600)),
INDEX `PLAYLOG_NODE_NAME` (`NODE_NAME`),
INDEX `PLAYLOG_FILE_NAME_NODE_NAME` (`FILE_NAME`(600), `NODE_NAME`),
INDEX `PLAYLOG_ENTRY_ID_PROCESSED` (`ENTRY_ID`, `ENTRY_PROCESSED`)
)
COLLATE='latin1_swedish_ci'
ENGINE=MyISAM;
A primary key only assures unique data within a single table. You must have duplicate records across multiple tables. Make sure you have deleted all of the 2012 data from the live table. Make sure there are no dups between any of the quarter tables.
Also, if the records are 100% duplicates, doing a UNION between all of your tables (instead of UNION ALL) will give you unique results; however, this will decrease query performance.
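One way to track down where the overlaps come from is to join pairs of the underlying tables on the primary key columns, for example (a sketch using the table names from the question; repeat it for each pair of tables you suspect):
-- Rows whose primary key exists in both the live table and one of the new quarter tables
SELECT p.`NODE_ID`, p.`LOG_DAY`, p.`ROW_NUMBER`
FROM COMPANY2.PLAYS AS p
JOIN COMPANY.ARCHIVE_PLAYS_2012Q1 AS a
  ON  a.`NODE_ID`    = p.`NODE_ID`
  AND a.`LOG_DAY`    = p.`LOG_DAY`
  AND a.`ROW_NUMBER` = p.`ROW_NUMBER`;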

Avoid UNION for two almost identical tables in MySQL

I'm not very good at MySQL, and I'm going to write a query that counts messages sent by a user, based on their type and the is_auto field.
Messages can be of type "small text message" or "newsletter". I created two entities with a few fields that differ between them. The important one is messages_count, which is absent from the newsletter table and is used in the query:
CREATE TABLE IF NOT EXISTS `small_text_message` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`messages_count` int(11) NOT NULL,
`username` varchar(255) NOT NULL,
`method` varchar(255) NOT NULL,
`content` longtext,
`sent_at` datetime DEFAULT NULL,
`status` varchar(255) NOT NULL,
`recipients_count` int(11) NOT NULL,
`customers_count` int(11) NOT NULL,
`sheduled_at` datetime DEFAULT NULL,
`sheduled_for` datetime DEFAULT NULL,
`is_auto` tinyint(1) NOT NULL,
`user_id` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
And:
CREATE TABLE `newsletter` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`subject` varchar(78) DEFAULT NULL,
`content` longtext,
`sent_at` datetime DEFAULT NULL,
`status` varchar(255) NOT NULL,
`recipients_count` int(11) NOT NULL,
`customers_count` int(11) NOT NULL,
`sheduled_at` datetime DEFAULT NULL,
`sheduled_for` datetime DEFAULT NULL,
`is_auto` tinyint(1) NOT NULL,
`user_id` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
I ended up with a UNION query. Can this query be shortened or optimized, given that the only difference is messages_count, which should always be 1 for newsletter?
SELECT
CONCAT('sms_', IF(is_auto = 0, 'user' , 'auto')) AS subtype,
SUM(messages_count * (customers_count + recipients_count)) AS count
FROM small_text_message WHERE status <> 'pending' AND user_id = 1
GROUP BY is_auto
UNION
SELECT
CONCAT('newsletter_', IF(is_auto = 0, 'user' , 'auto')) AS subtype,
SUM(customers_count + recipients_count) AS count
FROM newsletter WHERE status <> 'pending' AND user_id = 1
GROUP BY is_auto
I don't see any easy way to avoid a UNION (or UNION ALL) operation that returns the specified result set.
I would recommend you use a UNION ALL operator in place of the UNION operator. Then the execution plan will not include the step that eliminates duplicate rows. (You already have GROUP BY operations on each query, and there is no way that those two queries can produce an identical row.)
Otherwise, your query looks fine just as it is written.
(It's always a good thing to consider the question, might there be a better way? To get the result set you are asking for, from the schema you have, your query looks about as good as it's going to get.)
If you are looking for more general DB advice, I recommend restructuring the tables to factor the common elements into one table, perhaps called outbound_communication or something, with all of your common fields, and then having "sub tables" for the specific types to host the fields that are unique to each type. It does mean a simple JOIN is necessary to select all of the fields you want, but then again, it's normalized and actually makes situations like this one easier (one table holds all of the entities of interest). Additionally, you have the option of writing that JOIN just once as a "view", and then your existing code would not even need to change to keep seeing the two tables as before.
CREATE TABLE IF NOT EXISTS `outbound_communication` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`content` longtext,
`sent_at` datetime DEFAULT NULL,
`status` varchar(255) NOT NULL,
`recipients_count` int(11) NOT NULL,
`customers_count` int(11) NOT NULL,
`sheduled_at` datetime DEFAULT NULL,
`sheduled_for` datetime DEFAULT NULL,
`is_auto` tinyint(1) NOT NULL,
`user_id` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
CREATE TABLE `small_text_message` (
`outbound_communication_id` int(11) NOT NULL,
`messages_count` int(11) NOT NULL,
`username` varchar(255) NOT NULL,
`method` varchar(255) NOT NULL,
PRIMARY KEY (`outbound_communication_id`),
FOREIGN KEY (outbound_communication_id)
REFERENCES outbound_communication(id)
) ENGINE=InnoDB;
CREATE TABLE `newsletter` (
`outbound_communication_id` int(11) NOT NULL,
`subject` varchar(78) DEFAULT NULL,
PRIMARY KEY (`outbound_communication_id`),
FOREIGN KEY (outbound_communication_id)
REFERENCES outbound_communication(id)
) ENGINE=InnoDB;
Then selecting a text message looks like this:
SELECT *
FROM outbound_communication AS parent
JOIN small_text_message
ON parent.id = small_text_message.outbound_communication_id
WHERE parent.id = 1234;
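And the "write the JOIN just once as a view" idea could look roughly like this (a sketch against the restructured tables above; the view name is illustrative):
CREATE VIEW small_text_message_full AS
SELECT parent.*,
       stm.messages_count,
       stm.username,
       stm.method
FROM outbound_communication AS parent
JOIN small_text_message AS stm
  ON stm.outbound_communication_id = parent.id;
Existing queries against small_text_message could then read from small_text_message_full instead and still see all of the original columns.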
The nature of the query is inherently the union of the data from the small text message and the newsletter tables, so the UNION query is the only realistic formulation. There's no join of relevance between the two tables, for example.
So, I think you're very much on the right lines with your query.
Why are you worried about a UNION?

MySQL indexes creation strategy and inner logic

This question expects a generic answer to the broad problem of index creation on a MySQL database.
Let's take this example table:
CREATE TABLE IF NOT EXISTS `article` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`published` tinyint(1) NOT NULL DEFAULT '0',
`author_id` int(11) unsigned NOT NULL,
`modificator_id` int(11) unsigned DEFAULT NULL,
`category_id` int(11) unsigned DEFAULT NULL,
`title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`headline` text COLLATE utf8_unicode_ci NOT NULL,
`content` text COLLATE utf8_unicode_ci NOT NULL,
`url_alias` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`priority` mediumint(11) unsigned NOT NULL DEFAULT '50',
`publication_date` datetime NOT NULL,
`creation_date` datetime NOT NULL,
`modification_date` datetime NOT NULL,
PRIMARY KEY (`id`)
);
Over such a table there is a wide range of queries that could be performed on different criteria:
category_id
published
publication_date
e.g.:
SELECT id FROM article WHERE NOT published AND category_id = '2' ORDER BY publication_date;
On many tables you can see a wide range of state fields (like published here), date fields, or reference fields (like author_id or category_id). What strategy should be picked for creating indexes?
This can be broken down into the following points:
Should I make an index on every field that can be used in a query (either in a WHERE clause or an ORDER BY), even if this leads to a lot of indexes per table?
Should I also index fields that have only a small set of values, like booleans or enums? That only reduces the scope of the scan by a factor of n (assuming n distinct values, each used equally often).
I've read that MySQL prior to 5.0 used only one index per query; how does the system pick it? (By choosing the most restrictive one?)
How is an OR condition processed?
How much is this going to slow down inserts?
Does the choice of InnoDB or MyISAM change anything here?
I know the EXPLAIN statement can be used to check whether a query is optimized or not, but a bit of concrete theory would really be more constructive than a purely empirical approach!
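For concreteness, one composite index that the sample query above could use is sketched below (an illustration, not a general prescription; the index name is arbitrary). The equality columns come first, followed by the ORDER BY column, and rewriting NOT published as published = 0 typically makes it easier for the optimizer to use the index.
-- A possible index for: WHERE published = 0 AND category_id = ? ORDER BY publication_date
ALTER TABLE `article`
  ADD INDEX `idx_category_published_pubdate` (`category_id`, `published`, `publication_date`);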