Avoid duplicate rows, without reference to keys or indexes? - mysql

I have a MySQL table in which each row is a TV episode. It looks like this:
showTitle   | season | episode | episodeTitle | airdate    | absoluteEpisode
------------+--------+---------+--------------+------------+----------------
The X-Files | 5      | 12      | Bad Blood    | 1998-02-22 | 109
The X-Files | 5      | 13      | Patient X    | 1998-03-01 | 110
(Where absoluteEpisode is the episode's overall number counting from episode 1.)
It is populated using a Ruby program I wrote which fetches the data from a web service. Periodically, I'd like to run the program again to fetch new episodes. The question then becomes, how do I avoid adding duplicates of the already-existing rows? None of the columns in this table are suitable for use as a primary key or unique field.
I had two ideas. The first was to create a new column, md5, with an MD5 hash of all of those values, and make that a unique column, to prevent two rows with identical data from being added. That seems like it would work, but be messy.
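A minimal sketch of that first idea, assuming MySQL 5.7+ (for generated columns; on older versions the hash would have to be computed in Ruby before each INSERT):
-- Hash all identifying columns into one value and enforce uniqueness on it.
ALTER TABLE episodes
  ADD COLUMN row_md5 CHAR(32) AS (
    MD5(CONCAT_WS('|', showTitle, season, episode, episodeTitle, airdate, absoluteEpisode))
  ) STORED,
  ADD UNIQUE KEY uq_row_md5 (row_md5);
With that in place, an INSERT IGNORE would silently skip any row whose hash already exists.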
My second idea was to use this solution from Stack Overflow, but I can't quite get it to work. My SQL query is:
INSERT INTO `tv`.`episodes` (showTitle,episodeTitle,season,episode,date,absoluteEpisode)
SELECT '#{show}','#{title}','#{y['airdate']}' FROM `tv`.`episodes`
WHERE NOT EXISTS (SELECT * from `tv`.`episodes`
WHERE showTitle='#{show}' AND episodeTitle='#{title}' AND season='#{season_string}' AND episode='#{y['seasonnum']}' AND date='#{y['airdate']}' AND absoluteEpisode='#{y['epnum']}'")
The #{...} bits are Ruby variables. This gets me the obvious error: "You have an error in your SQL syntax."
Flipping through the books and documentation I can find on the subject, I'm still not sure how to execute this query properly, or whether it's even a smart way of solving my problem. I'd appreciate any advice!
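For reference, three things break that query: the stray quote at the very end, a SELECT list that supplies three values for a six-column list, and selecting FROM tv.episodes itself, which would insert one copy per existing row. A corrected sketch of the same pattern (untested; the interpolated Ruby values really ought to be bound parameters to avoid SQL injection):
INSERT INTO `tv`.`episodes` (showTitle,episodeTitle,season,episode,date,absoluteEpisode)
SELECT '#{show}','#{title}','#{season_string}','#{y['seasonnum']}','#{y['airdate']}','#{y['epnum']}'
FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM `tv`.`episodes`
    WHERE showTitle='#{show}' AND episodeTitle='#{title}' AND season='#{season_string}'
      AND episode='#{y['seasonnum']}' AND date='#{y['airdate']}' AND absoluteEpisode='#{y['epnum']}');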

Why not create a composite primary key from showTitle, season, and episode? That will solve the problem, because an episode number cannot repeat within the same season, and that applies per TV show.
Example:
X-Files ==> season 1 ==> episode 1 will be the primary key, as one unit.
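A sketch of that suggestion, assuming the existing rows contain no duplicates (otherwise the ALTER fails); once the key exists, INSERT IGNORE makes re-runs of the fetcher idempotent:
ALTER TABLE `tv`.`episodes`
  ADD PRIMARY KEY (showTitle, season, episode);

-- Re-running the Ruby fetcher now silently skips episodes that already exist:
INSERT IGNORE INTO `tv`.`episodes`
  (showTitle, season, episode, episodeTitle, airdate, absoluteEpisode)
VALUES ('The X-Files', 5, 12, 'Bad Blood', '1998-02-22', 109);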

Mysql Update one column of multiple rows in one query

I've looked over all of the related questions I could find, but none of them answers mine.
I have a table like this:
id | name | age | active | ... |
where "id" is the primary key, and the ... means there are something like 30 more columns.
The "active" column is of type TINYINT.
My task:
Update ids 1, 4, 12, 55, 111 (just an example; there can be 1000 different ids in total) to active = 1 in a single query.
I did:
UPDATE table SET active = 1 WHERE id IN (1,4,12,55,111)
It's inside a transaction, because I'm updating something else in the same process.
The engine is InnoDB.
My problem:
Someone told me that executing such a query is equivalent to executing 5 queries, because the IN will translate into that many ORs, which run one after another.
Eventually, instead of 1 query I get N, where N is the number of values in the IN.
He suggests creating a temp table, inserting all the new values into it, and then updating by a join.
Is he right, on both the equivalence and the performance?
What do you suggest? I thought INSERT INTO .. ON DUPLICATE KEY UPDATE might help, but I don't have all the data for each row, only its id, and I want to set active = 1 on it.
Maybe this query is better?
UPDATE table SET
active = CASE
WHEN id='1' THEN '1'
WHEN id='4' THEN '1'
WHEN id='12' THEN '1'
WHEN id='55' THEN '1'
WHEN id='111' THEN '1'
ELSE active END
WHERE id > 0; -- otherwise safe-update mode throws an error about updating without a WHERE clause on a key column, and I don't know if I can toggle safe mode off
Thanks.
It's the other way around: OR can sometimes be turned into IN, and IN is then executed efficiently, especially if there is an index on the column. If you have 1000 entries in the IN, it will do 1000 probes into the table based on id.
If you are running a new enough version of MySQL, I think you can do EXPLAIN EXTENDED UPDATE ...OR...; SHOW WARNINGS; to see this conversion.
The UPDATE ... CASE ... will probably tediously check each and every row.
It would probably be better for other users of the system if you broke the UPDATE up into multiple UPDATEs, each covering 100-1000 rows. More on chunking.
Where did you get the ids in the first place? If it was via a SELECT, then perhaps it would be practical to combine it with the UPDATE to make it one step instead of two.
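A sketch of the temp-table variant from the question, which also covers the combine-with-SELECT case (my_table stands in for the real table name):
CREATE TEMPORARY TABLE ids_to_activate (id INT PRIMARY KEY);
INSERT INTO ids_to_activate VALUES (1), (4), (12), (55), (111);
-- or, if the ids originally came from a SELECT, load them directly:
-- INSERT INTO ids_to_activate SELECT id FROM ... WHERE ...;

UPDATE my_table t
JOIN ids_to_activate i ON i.id = t.id
SET t.active = 1;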
I think the query below is better because it uses the primary key.
UPDATE table SET active = 1 WHERE id<=5

Mysql Auto Increment For Group Entries

I need to set up a table that will have two auto-increment fields. One field will be a standard primary key for each record added. The other field will be used to link multiple records together.
Here is an example.
field 1 | field 2
--------+--------
   1    |    1
   2    |    1
   3    |    1
   4    |    2
   5    |    2
   6    |    3
Notice that field 1 has the standard auto-increment. Field 2 has an "auto increment" that advances slightly differently: records 1, 2 and 3 were made at the same time, records 4 and 5 were made at the same time, and record 6 was made individually.
Would it be best to read the last entry for field 2 and then increment it by one in my PHP program? Just looking for the best solution.
You should have two separate tables.
ItemsToBeInserted
id, batch_id, field, field, field
BatchesOfInserts
id, created_time, field, field field
You would then create a batch record, and add the insert id for that batch to all of the items that are going to be part of the batch.
You get bonus points if you add a batch_hash field to the batches table and then check that each batch is unique so that you don't accidentally submit the same batch twice.
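A sketch of that flow with the columns above (MySQL assumed; the remaining columns are stand-ins and would need real values or defaults):
-- Create the batch record first, then stamp its id on every item in the batch.
INSERT INTO BatchesOfInserts (created_time) VALUES (NOW());
SET @batch_id := LAST_INSERT_ID();

INSERT INTO ItemsToBeInserted (batch_id, field)
VALUES (@batch_id, 'first item'), (@batch_id, 'second item');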
If you are looking for a more awful way to do it that only uses one table, you could do something like:
$batch = ...; // result of: SELECT MAX(BATCH_ID) + 1 AS NEW_BATCH_ID FROM myTable
and add that id to all of the inserted records. I wouldn't recommend that though. You will run into trouble down the line.
MySQL only offers one auto-increment column per table. You can't define two, nor does it make sense to do that.
Your question doesn't say what logic you want to use to control the incrementing of the second field you've called auto-increment. Presumably your PHP program will drive that logic.
Don't use PHP to query the largest ID number, then increment it and use it. If you do, your system is vulnerable to race conditions: if more than one instance of your PHP program tries that simultaneously, they will occasionally get the same number by mistake.
The Oracle DBMS has an object called a sequence which gives back guaranteed-unique numbers. But you're using MySQL. You can obtain unique numbers with a programming pattern like the following.
First create a table for the sequence. It has an auto-increment field and nothing else.
CREATE TABLE sequence (
    sequence_id INT NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (sequence_id)
);
Then when you need a unique number in your program, issue these three queries one after the other:
INSERT INTO sequence () VALUES ();
DELETE FROM sequence WHERE sequence_id < LAST_INSERT_ID();
SELECT LAST_INSERT_ID() AS sequence;
The third query is guaranteed to return a unique sequence number. This guarantee holds even if you have dozens of different client programs connected to your database. That's the beauty of AUTO_INCREMENT.
The second query (DELETE) keeps the table from getting big and wasting space. We don't care about any rows in the table except for the most recent one.
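Tying it back to the question, the fetched number becomes the shared value of field 2 (a sketch; the records table and its columns are stand-ins):
INSERT INTO sequence () VALUES ();
DELETE FROM sequence WHERE sequence_id < LAST_INSERT_ID();
SET @group_id := LAST_INSERT_ID();

-- field1 auto-increments per record; field2 links the group:
INSERT INTO records (field2) VALUES (@group_id), (@group_id), (@group_id);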

Alternative to inner join on very slow server 2

I have a very simple MySQL query on a remote Windows 7 server on which I cannot change most of the parameters. I only need to execute it once now, to create a table, but in upcoming projects I'll be confronted with the same issue.
The query is the following, has been running for 24 hours now, and is a basic filtering query:
CREATE TABLE compute_us_gum_2013P13
SELECT A.HHID, UPC, DIVISION, yearweek, CAL_DT, RETAILER, DEAL, RAW_PURCH_QTY,
UNITS,VOL,GROSS_DOL,NET_DOL, CREATE_DATE
FROM work_us_gum_2013P13_digital_purchases_with_yearweek A
INNER JOIN compute_us_gum_2013_digital_panelists B
on A.hhid = B.hhid;
Table A is quite big, around 250 million rows.
Table B is 5 million rows.
hhid is indexed on both tables. I haven't put a unique index on table B, though I could; would that change things dramatically?
My 12 GB of RAM is completely saturated (actually there's 1 GB free, but I think MySQL can't touch it). Of course I closed everything I could, and the processor is basically idle. The status of the query has been stuck on "sending data" for most of the time.
Table A also has a covering index on 7 columns, which I could drop as it's not used, but I don't think that would change anything, would it?
One big issue is that I cannot test a lot of things, because I wouldn't know whether an approach works until it works, and I think this query will be long no matter what. I also don't want to throw away the computation time that's already been spent.
If it helps, I could also keep only the columns HHID, UPC and yearweek (resp. bigint(20), bigint(20), and int(11)); the columns I would drop are only decimals and dates.
And what if I split table B into several parts? The operation is only a filtering one, so it can be done in several steps. Would I gain time? Even if I don't gain time but don't lose any either, at least I could see my progress.
Another possibility would be to directly delete rows from table A (and, if really necessary, columns), so I wouldn't have to write another table. Would that be faster?
I can change some database parameters if I send an email to my client, but that takes some time and is not suitable for a lot of tweaking and testing.
Any solution would be appreciated, even the dirtiest one :), I'm really stuck here.
EDIT:
Explain gives me this:
id | select_type | table | type  | possible_keys | key   | key_len | ref            | rows    | Extra
---+-------------+-------+-------+---------------+-------+---------+----------------+---------+------------
1  | SIMPLE      | B     | index | hhidx         | hhidx | 8       | NULL           | 5003865 | Using index
1  | SIMPLE      | A     | ref   | hhidx         | hhidx | 8       | ncsmars.B.hhid | 6       |
What is the engine? Is it InnoDB?
What are the primary keys of both tables?
Did you start both primary keys with HHID? (If HHID is not a candidate key for a table, you can create a composite key and put that field first.)
If you start both PKs with HHID and then join the tables on that field, disk seeks will be reduced dramatically, so you should see much better performance. If you cannot alter both tables, start with the smaller one: alter its PK to put HHID in first place, then check the execution plan.
ALTER TABLE compute_us_gum_2013_digital_panelists ADD PRIMARY KEY(HHID, [other necessary fields (if any)])
I expect it will do better than before.
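If the client ever allows altering the big table too, the same pattern applies there. A sketch only: the trailing key columns are guesses needed to make the key unique, and rebuilding a 250-million-row clustered index is itself a long operation:
ALTER TABLE work_us_gum_2013P13_digital_purchases_with_yearweek
  ADD PRIMARY KEY (HHID, UPC, yearweek);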

Using sql to keep track of words and their count

I have a situation where a user enters several words at a time, say {bat, ball, tennis, car, actor, ping}. I have a database table with the following structure:
word (PK) | count
----------+------
ball      | 4
cat       | 2
gear      | 1
I want to insert each word into the table. If the word is already present, increment its counter by 1; otherwise insert the word (as it is new) and set its count to 1.
Is it possible using a single query? If yes, how can I do it?
If your word column is truly the primary key, you should be able to do something like this.
INSERT INTO table_name (`word`, `count`) VALUES("ball", 1)
ON DUPLICATE KEY UPDATE `count` = `count` + 1
Pretty straightforward, taking advantage of the database to perform the update in the database layer.
Normally I avoid answers that amount to links, but in this case I think the question is really a two-part one, each part of which has been asked here before.
There are two steps you have to do.
You have to split your keyword set into a table of some flavor. Last I knew, MySQL did not have a string-split function, but how to do it has been asked several times on SO. See Can Mysql Split a column?
Then you can use INSERT ... ON DUPLICATE KEY UPDATE as discussed in "How do I update if exists, insert if not (AKA “upsert” or “merge”) in MySQL?"
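If the splitting already happens application-side (as with the example set in the question), the whole batch fits in one statement; a sketch using the table name from the answer above:
INSERT INTO table_name (`word`, `count`)
VALUES ('bat', 1), ('ball', 1), ('tennis', 1), ('car', 1), ('actor', 1), ('ping', 1)
ON DUPLICATE KEY UPDATE `count` = `count` + 1;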

Matching algorithm in SQL Server 2008

I have more than 3 million rows in my table. When a user tries to insert or update a row, I have to check the following conditions sequentially (a business requirement):
Does any existing row have the same address?
Does any existing row have the same postcode?
Does any existing row have the same DOB?
Obviously the newly inserted or updated row will match a lot of records in this table.
But the business requirement is that the matching process should stop at the first match (row) found, and that row has to be returned.
I can easily achieve this using a simple SELECT query, but it takes a very long time to find the match.
Please suggest a more efficient way to do this.
If you're just looking for a way to return after the first match, use TOP 1 (this is SQL Server, so TOP rather than MySQL's LIMIT).
You may want to maintain a table of either birth dates or postcodes and have each row link to a user, so that you can easily filter customers down to a smaller set. It would allow you to perform a much faster search on the database.
Example:
dob      | userID
---------+-------
1/1/1980 | 235
1/1/1980 | 482
1/1/1980 | 123
2/1/1980 | 521
In that scenario, you only have to read 3 rows from the large users table if your target date is 1/1/1980. It's via a primary key index, too, so it'll be really fast.
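A sketch of the first-match query itself, in T-SQL (the customers table and column names are assumptions; the CASE ranks address before postcode before DOB, so the top row honours the required order):
SELECT TOP (1) c.*
FROM customers AS c
WHERE c.address  = @address
   OR c.postcode = @postcode
   OR c.dob      = @dob
ORDER BY CASE
           WHEN c.address  = @address  THEN 1
           WHEN c.postcode = @postcode THEN 2
           ELSE 3
         END;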