SQL: Fastest Way to Dedupe to Canonical Ids

I have an interesting SQL task and thought I would ask the community if anyone knows a fast way to accomplish it. I have 2 slow solutions, but I'm wondering if I am missing something faster.
Here is the task:
Given a table A with a column that references the primary key of another table B (only logically, since these are MyISAM tables without foreign keys), we want to dedupe table B, update table A to use the canonical deduped id from table B, and then delete all but the canonical records from table B.
This might be easier to illustrate with a small example. Let's say table A is a person table and table B is a city table. Let's also say that the city table contains duplicate records that need deduping: row 1 and row 2 of table B both refer to Los Angeles.
Then in the person table, we want to update all persons in Los Angeles with city id 2 to have city id 1, and delete the duplicate row with city id 2 from the city table.
There may be many rows representing the duplicated value, not just 2; you get the point. Right now, I am querying all the cities out of the city table, grouping them into equivalence classes, looping over each equivalence class, nominating a canonical version (in this case I just choose the first), and performing 2 queries, the update and the delete:
update person set city_id = $canonical_city_id where city_id in ($list_of_dupes)
Then
delete from city where city_id in ($list_of_dupes) and city_id != $canonical_city_id
I think there may be a faster way, since we don't care which id is canonical: it could be the first, the last, or a random one, it doesn't matter. Can you think of a way to do this whole job in 1 SQL statement? What do you think is the fastest way?
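One set-based way to avoid the per-class loop is to build a mapping from every duplicate id to its canonical id once, then run a single UPDATE and a single DELETE against that mapping. The sketch below is only an illustration under assumptions: it treats two cities as duplicates when their name column is exactly equal, picks MIN(city_id) as the canonical id, and runs against SQLite through Python's built-in sqlite3 module so that it is self-contained. The name column, the city_map table, and the function names are hypothetical. In MySQL the UPDATE step could instead be written as a multi-table UPDATE with a JOIN on the mapping table.

```python
import sqlite3

def dedupe_cities(conn):
    """Collapse duplicate city rows onto MIN(city_id) per name and repoint person rows."""
    cur = conn.cursor()
    # Map each duplicate city_id to the smallest id sharing its name (the canonical id).
    # Assumption: "duplicate" means exact equality of the name column.
    cur.execute("""
        CREATE TEMP TABLE city_map AS
        SELECT c.city_id AS old_id, m.canonical_id
        FROM city AS c
        JOIN (SELECT name, MIN(city_id) AS canonical_id
              FROM city GROUP BY name) AS m ON m.name = c.name
        WHERE c.city_id <> m.canonical_id
    """)
    # One UPDATE repoints every person that references a duplicate.
    cur.execute("""
        UPDATE person
        SET city_id = (SELECT canonical_id FROM city_map
                       WHERE old_id = person.city_id)
        WHERE city_id IN (SELECT old_id FROM city_map)
    """)
    # One DELETE removes every non-canonical city row.
    cur.execute("DELETE FROM city WHERE city_id IN (SELECT old_id FROM city_map)")
    cur.execute("DROP TABLE city_map")
    conn.commit()

def dedupe_demo():
    """Build a tiny person/city database, dedupe it, and return both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE city (city_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
        INSERT INTO city VALUES (1,'Los Angeles'),(2,'Los Angeles'),(3,'Boston'),(4,'Los Angeles');
        INSERT INTO person VALUES (1,'Ann',1),(2,'Bob',2),(3,'Cy',3),(4,'Di',4);
    """)
    dedupe_cities(conn)
    cities = conn.execute("SELECT city_id, name FROM city ORDER BY city_id").fetchall()
    people = conn.execute("SELECT person_id, city_id FROM person ORDER BY person_id").fetchall()
    return cities, people
```

Choosing MIN(city_id) is arbitrary, which matches the requirement that any member of the class may serve as the canonical id. If the equivalence classes come from fuzzier matching done in application code, the mapping table could be bulk-loaded from there and the same UPDATE and DELETE reused.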

Related

MySQL : One to One Relationship Tables Merging

I am trying to simplify an application's database. In that database I have two tables, let's say Patient and MedicalRecord. I know that two tables are said to be in a one-to-one relationship iff any given row from Table-A can have at most one row in Table-B (meaning there can be zero matches).
But in my case it is not at most, it is exactly: every row in Patient should have exactly one row in MedicalRecord (no patient exists without a medical record).
The Patient table has all personal details of the patient, with his id as PK.
The MedicalRecord table has details like his blood group, haemoglobin, bp etc., with his id as both PK and FK to Patient.
My Question is, can I merge those two tables and create one table like,
PatientDetails : personal_details and blood-group, haemoglobin, bp etc

"bp" = "Blood pressure"? Then you must not combine the tables. Instead, it is 1:many -- each patient can have many sets of readings. It is very important to record and plot trends in the readings.
Put only truly constant values in the Patient -- name, birthdate (not age; compute that), sex, race (some races are more prone to certain diseases than others), not height/weight. Etc.
Sure, a patient may have a name change (marriage, legal action, etc), but that is an exception that does not affect the schema design, except to force you to use patient_id, not patient_name as a unique key.
Every patient must have a MedicalRecord? That is "business logic"; test it in the application; do not depend (in this case) on anything in the Database.
Both tables would have patient_id. Patient would have it as the PRIMARY KEY; MedicalRecord would have it INDEXed. That's all it takes to have 1:many.
In situations where the tables are really 1:1 (optionally 1:0/1), I do recommend merging the tables. (There are exceptions.)

If two tables have the same set of subrow values for a shared set of columns that is a superkey in both (SQL PRIMARY KEY or UNIQUE) then you can replace the two tables by their natural join. ("Natural join" is probably what you mean by "merge" but that is not a defined technical term.) Each original table will equal the projection of the join on that original's columns.
(1:1 means total on both sides, it does not mean 1:0-or-1, although most writing about cardinalities is sloppy & unclear.)

A correct way to make relationships between three tables

I am building a library database and I am stuck on one particular thing.
I have three tables: BookCopy, BookLoan and Members. It is not clear to me how to make the relationships between them, so that a member can borrow a book (or books) and all of this is correctly reflected in my database.
My idea was to have two many-to-many tables, so I added BookLoansMembers and BookCopiesBookLoans. I am not sure if this is correct, and even if it is, I have no idea how to script so many tables.
So, now I am wondering what would be the best thing to be done in this case and why?

I'm guessing your BookCopy is to account for having X copies of book Y, and in that sense "books" are not loaned, "copies" of them are, right?
I think the best course of action is probably to realize the BookLoan table should be the many-to-many table. A copy is loaned to a member at a time and then returned. BookLoan should have the ids for the copy and the member, the date loaned (as you have now, though it should be a datetime field, NOT a varchar one) and the date returned (like the loaned date, it should be a datetime, but should also be nullable to represent unreturned copies). You should also keep the unique (presumably auto-increment) id of the loan, as it is very possible a member might check out the same copy multiple times.
I am guessing that perhaps you were originally conceptualizing the "loan" similar to a sales transaction, which could work; but you would want a loanCopies table, and wouldn't want the dateReturned on the loan then since different copies could be returned independently.
Edit (additional observations):
isAvailable may be redundant if it is only based on whether the copy is checked out (if you want to withhold the book from circulation it might be appropriate though)
ISBN maxes at 13 characters according to Wikipedia (char can be a little more efficient than varchar under some circumstances)
you might want to consider a languages table that the copy can reference rather than using a string type field.
Edit (re: isAvailable):
If you just need to find the copies not loaned out, a simple query like this is all you need.
SELECT *
FROM BookCopy
WHERE idBookCopy NOT IN (
SELECT idBookCopy
FROM BookLoan
WHERE dateReturned IS NULL
);
The subquery gets the list of copies loaned out, and the NOT IN makes sure the copies in the results are not in that list.
If you want to prevent a copy from being loaned out (damaged, vandalized, etc...) an isAvailable "flag" could be a simple way to add such functionality; just add AND isAvailable = 1 to the outer query's WHERE conditions.

You can just have an m:m relationship between Members and BookCopy and use your BookLoan table as the junction (link) table. So you basically just have to add references from the Members and BookCopy tables to the BookLoan table:
BookLoan
---------------
idBookLoan
dateLoaned
dateReturned
idBookCopy FK -- add these two
idMember FK
Also consider making idBookCopy, idMember and dateLoaned together the composite primary key of your BookLoan table.
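Both answers end up with BookLoan as the link table between Members and BookCopy. A minimal sketch of that layout follows, run against SQLite through Python's sqlite3 module so it is self-contained. The name and title columns and the sample rows are assumptions made for illustration, and the dates are stored as TEXT only because SQLite has no DATETIME type; in MySQL they would be DATETIME columns. The sketch keeps the surrogate idBookLoan key so that the same member can borrow the same copy more than once.

```python
import sqlite3

def build_library():
    """Create the three tables with BookLoan as the link table and load sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE Members  (idMember   INTEGER PRIMARY KEY, name  TEXT NOT NULL);
        CREATE TABLE BookCopy (idBookCopy INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE BookLoan (
            idBookLoan   INTEGER PRIMARY KEY,
            idBookCopy   INTEGER NOT NULL REFERENCES BookCopy(idBookCopy),
            idMember     INTEGER NOT NULL REFERENCES Members(idMember),
            dateLoaned   TEXT NOT NULL,   -- DATETIME in MySQL
            dateReturned TEXT             -- NULL while the copy is still out
        );
        INSERT INTO Members  VALUES (1,'Ann'),(2,'Bob');
        INSERT INTO BookCopy VALUES (1,'Dune'),(2,'Dune'),(3,'Emma');
        INSERT INTO BookLoan (idBookCopy, idMember, dateLoaned, dateReturned) VALUES
            (1, 1, '2024-01-01', '2024-01-10'),
            (1, 1, '2024-02-01', NULL),
            (3, 2, '2024-02-05', '2024-02-20');
    """)
    return conn

def available_copies(conn):
    """Return ids of copies that have no open (unreturned) loan."""
    rows = conn.execute("""
        SELECT idBookCopy FROM BookCopy
        WHERE idBookCopy NOT IN (
            SELECT idBookCopy FROM BookLoan WHERE dateReturned IS NULL
        )
        ORDER BY idBookCopy
    """).fetchall()
    return [r[0] for r in rows]
```

In this sample data copy 1 has been borrowed twice by the same member and is currently out, so only copies 2 and 3 are reported as available.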

Adding a database record with foreign key

Let's say there is a database with two tables: one customer table and one country table. Each customer row contains (among other things) a countryId foreign key. Let's also assume that we are populating the database from a data file (i.e., it is not an operator that is selecting a country from a UI).
What is the best practice for this?
Should one query the database first and get all ID's for all countries, and then just supply the (now known) country id's in the insert query? This is not a problem for my 'country' example, but what if there is a large number of records in the table that is being referred?
Or should the insert query use a sub query to get the country id based on the country name? If so, what if the record for the country does not exist yet and has to be added?
Or another approach? Or does it depend? :)

I would suggest using a join in your insert query to get the country id based on the country name. However, I don't know whether that is possible with every DBMS, and you don't say which one you're using.
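As a rough sketch of that suggestion, the INSERT can resolve the country id by name itself, after a first pass that creates any country not yet present. This example runs on SQLite through Python's sqlite3 module; the table layouts, the UNIQUE constraint on the country name, and the function names are assumptions. INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3

def load_customers(conn, rows):
    """Insert (customer_name, country_name) pairs, creating missing countries first."""
    rows = list(rows)
    cur = conn.cursor()
    # Make sure every referenced country exists; the UNIQUE constraint on
    # country.name turns repeats into no-ops.
    cur.executemany("INSERT OR IGNORE INTO country (name) VALUES (?)",
                    [(country,) for _, country in rows])
    # Resolve the id by name inside the INSERT itself, so the loader never
    # has to fetch the whole country table up front.
    cur.executemany("""
        INSERT INTO customer (name, countryId)
        SELECT ?, id FROM country WHERE name = ?
    """, rows)
    conn.commit()

def load_demo():
    """Load three customers, one of which references a country not yet in the table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE country  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                               countryId INTEGER NOT NULL REFERENCES country(id));
        INSERT INTO country (name) VALUES ('France');
    """)
    load_customers(conn, [("Ann", "France"), ("Bob", "Chile"), ("Cy", "France")])
    customers = conn.execute("""
        SELECT c.name, k.name FROM customer AS c
        JOIN country AS k ON k.id = c.countryId
        ORDER BY c.name
    """).fetchall()
    n_countries = conn.execute("SELECT COUNT(*) FROM country").fetchone()[0]
    return customers, n_countries
```

For a very large referenced table this keeps the lookup inside the database, where an index on the name column can serve it, rather than pulling every id into the loader.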

Should Foreign Keys be used in a structure where multiple options can be selected by a user? If so, how so?

In MySQL, I was advised to store the multiple choice options for "Drugs" as a separate table user_drug where each row is one of the options selected by a particular user. I was also advised to create a 3rd table drug that describes each option selected in table user_drug. Here is an example:
user
id name income
1 Foo 10000
2 Bar 20000
3 Baz 30000
drug
id name
1 Marijuana
2 Cocaine
3 Heroin
user_drug
user_id drug_id
1 1
1 2
2 1
2 3
3 3
As you can see, table user_drug can contain the multiple drugs selected by a particular user, and table drug tells you what drug each drug_id is referring to.
I was told a Foreign Key should tie tables user_drug and drug together, but I've never dealt with Foreign Keys so I'm not sure how to do that.
Wouldn't it be easier to get rid of the drug table and simply store the TEXT value of each drug in user_drug? Why or why not?
If adding the 3rd table drug is better, then how would I implement the Foreign Key structure, and how would I normally retrieve the respective values using those Foreign Keys?
(I find it far easier to use just 2 tables, but I've heard Foreign Keys are helpful in that they ensure a proper value is entered, and that it is also a lot faster to search and sort for a drug_id than a text value, so I want to be sure.)

Wouldn't it be easier to get rid of the drug table and simply store the TEXT value of each drug in user_drug? Why or why not?
Easier, yes.
But not better.
Your data would not be normalized, and storing the table would waste a lot of space.
The index on that field would occupy far more space as well, wasting space again and slowing things down.
If you want to query a drop-down list of possible values, that's trivial with a separate table, but hard (read: slow) with just text in a field.
If you just drop text fields into 1 table, it's hard to keep misspellings out; with a separate lookup table, preventing misspellings is easy.
If adding the 3rd table drug is better, then how would I implement the Foreign Key structure
ALTER TABLE user_drug ADD FOREIGN KEY fk_drug(drug_id) REFERENCES drug(id);
and how would I normally retrieve the respective values using those Foreign Keys?
SELECT u.name, d.name as drug
FROM user u
INNER JOIN user_drug ud ON (ud.user_id = u.id)
INNER JOIN drug d ON (d.id = ud.drug_id)
Don't forget to declare the primary key for table user_drug as
PRIMARY KEY (user_id, drug_id)
Alternatively
You can use an enum
CREATE TABLE example (
id INTEGER UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
example ENUM('value1','value2','value3'),
other_fields .....
);
You don't get all the benefits of a separate table, but if you just want a few values (e.g. yes/no or male/female/unknown) and you want to make sure it's limited to only those values it's a good compromise.
It is also much more self-documenting and robust than magic constants (1 = male, 2 = female, 3 = unknown, ... but what happens if we insert 4?)

Wouldn't it be easier to get rid of the drug table and simply store
the TEXT value of each drug in user_drug? Why or why not?
Normally, you'd have lots of other columns on the drug table -- things like description, medical information, chemical properties, etc. In that case, you wouldn't want to duplicate all of that information on every record of the user_drug table. In this particular case however, you've only got one column, so that issue is not really a big deal.
Also, you want to be sure that the drug referenced in the user_drug table actually exists. For example, if you store the field as text, then you could have heroin and its related misspellings like haroin or herion. This will give you problems when you try to select all heroin records later. Using a foreign key to a lookup table forces the id to exist in that table, so you can be absolutely sure that all references to heroin are accurate.
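To make the "forces the id to exist" point concrete, here is a small sketch using the question's sample data. It runs on SQLite through Python's sqlite3 module rather than MySQL, so the column types and the PRAGMA line are SQLite-specific assumptions, and the helper names are hypothetical. Note that MySQL enforces foreign keys on InnoDB tables but not on MyISAM ones.

```python
import sqlite3

def build_drug_db():
    """Create user, drug and user_drug with foreign keys and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, income INTEGER);
        CREATE TABLE drug (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE user_drug (
            user_id INTEGER NOT NULL REFERENCES user(id),
            drug_id INTEGER NOT NULL REFERENCES drug(id),
            PRIMARY KEY (user_id, drug_id)
        );
        INSERT INTO user VALUES (1,'Foo',10000),(2,'Bar',20000),(3,'Baz',30000);
        INSERT INTO drug VALUES (1,'Marijuana'),(2,'Cocaine'),(3,'Heroin');
        INSERT INTO user_drug VALUES (1,1),(1,2),(2,1),(2,3),(3,3);
    """)
    return conn

def try_link(conn, user_id, drug_id):
    """Try to record a selection; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO user_drug VALUES (?, ?)", (user_id, drug_id))
        return True
    except sqlite3.IntegrityError:
        return False

def users_of(conn, drug_name):
    """Names of users who selected the given drug, found through the lookup table."""
    rows = conn.execute("""
        SELECT u.name FROM user AS u
        JOIN user_drug AS ud ON ud.user_id = u.id
        JOIN drug AS d ON d.id = ud.drug_id
        WHERE d.name = ?
        ORDER BY u.name
    """, (drug_name,)).fetchall()
    return [r[0] for r in rows]
```

Here try_link(conn, 3, 1) succeeds because drug 1 exists, while try_link(conn, 3, 99) is rejected because no drug row has id 99. A misspelled name can therefore only enter the data through the drug table itself, where it is a single row to correct.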

How to handle fragmentation of auto_increment ID column in MySQL

I have a table with an auto_increment field and sometimes rows get deleted so auto_increment leaves gaps. Is there any way to avoid this or if not, at the very least, how to write an SQL query that:
Alters the auto_increment value to be the max(current value) + 1
Return the new auto_increment value?
I know how to write part 1 and 2 but can I put them in the same query?
If that is not possible:
How do I "select" (return) the auto_increment value or auto_increment value + 1?

Renumbering will cause confusion. Existing reports will refer to record 99, and yet if the system renumbers it may renumber that record to 98, now all reports (and populated UIs) are wrong. Once you allocate a unique ID it's got to stay fixed.
Using ID fields for anything other than simple unique numbering is going to be problematic. Having a requirement for "no gaps" is simply inconsistent with the requirement to be able to delete. Perhaps you could mark records as deleted rather than delete them. Then there are truly no gaps. Say you are producing numbered invoices: you would have a zero value cancelled invoice with that number rather than delete it.
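The "mark as deleted" idea can be sketched as follows, using the invoice example. This is only an illustration on SQLite through Python's sqlite3 module; the invoice table, its amount and cancelled columns, and the function names are hypothetical.

```python
import sqlite3

def make_invoices():
    """Create a small invoice table with three numbered invoices."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoice (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            amount    INTEGER NOT NULL,
            cancelled INTEGER NOT NULL DEFAULT 0
        );
        INSERT INTO invoice (amount) VALUES (100), (250), (75);
    """)
    return conn

def cancel_invoice(conn, invoice_id):
    """Soft delete: keep the row and its number, zero the amount, flag it cancelled."""
    with conn:
        conn.execute("UPDATE invoice SET amount = 0, cancelled = 1 WHERE id = ?",
                     (invoice_id,))

def all_ids(conn):
    """Every invoice number ever issued, cancelled or not."""
    return [r[0] for r in conn.execute("SELECT id FROM invoice ORDER BY id")]

def active_ids(conn):
    """Invoice numbers that are still live."""
    return [r[0] for r in conn.execute(
        "SELECT id FROM invoice WHERE cancelled = 0 ORDER BY id")]
```

After cancelling invoice 2 and inserting a new one, the numbering stays contiguous because no row was ever removed, and ordinary queries simply filter on the cancelled flag.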

There is a way to manually insert the id even in an autoinc table. All you would have to do is identify the missing id.
However, don't do this. It can be very dangerous if your database is relational. It is possible that the deleted id was used elsewhere. When removed, it would not present much of an issue, perhaps it would orphan a record. If replaced, it would present a huge issue because the wrong relation would be present.
Consider that I have a table of cars and a table of people
car
carid
ownerid
name
person
personid
name
And that there is some simple data
car
1 1 Van
2 1 Truck
3 2 Car
4 3 Ferrari
5 4 Pinto
person
1 Mike
2 Joe
3 John
4 Steve
and now I delete person John.
person
1 Mike
2 Joe
4 Steve
If I added a new person, Jim, into the table, and he got an id which filled the gap, then he would end up getting id 3
1 Mike
2 Joe
3 Jim
4 Steve
and by relation, would be the owner of the Ferrari.

I generally agree with the wise people on this page (and duplicate questions) advising against reusing auto-incremented id's. It is good advice, but I don't think it's up to us to decide the rights or wrongs of asking the question, let's assume the developer knows what they want to do and why.
The answer is, as mentioned by Travis J, you can reuse an auto-increment id by including the id column in an insert statement and assigning the specific value you want.
Here is a point to put a spanner in the works: MySQL itself (at least 5.6 InnoDB) will reuse an auto-increment ID in the following circumstance:
Delete any number of rows with the highest auto-increment ids
Stop and start MySQL
Insert a new row
The inserted row will have an id calculated as max(id)+1; it does not continue from the id that was deleted.

As djna said in her/his answer, it's not good practice to alter database tables in such a way, and there is no need for it if you have chosen the right schema and data types. That said, regarding this part of your question:
I have a table with an auto_increment field and sometimes rows get deleted so auto_increment leaves gaps. Is there any way to avoid this?
If all of the following hold:
your table has too many gaps in its auto-increment column, probably as a result of many test INSERT queries,
you want to keep the id values from growing too large by removing the gaps, and
the id column is just a counter that no other column in your database relates to,
then this may be what you (or anyone else looking for such a thing) are looking for:
SOLUTION
Remove the original id column.
Add it again with auto_increment enabled.
But if you just want to reset the auto_increment to the first available value:
ALTER TABLE `table_name` AUTO_INCREMENT=1

Not sure if this will help, but in SQL Server you can reseed identity fields. It seems there is an ALTER TABLE statement in MySQL to achieve this. E.g., to set the id to continue at 59446:
ALTER TABLE table_name AUTO_INCREMENT = 59446;
I'm thinking you should be able to combine a query to get the largest value of auto_increment field, and then use the alter table to update as needed.
