MySQL Multi Duplicate Record Merging

A previous DBA managed a non-relational table with 2.4M entries, all with unique IDs. However, there are duplicate records with different data in each record, for example:
+----+---------+--------------+-------+------------+-------------+
| id | Name    | Address      | Phone | Email      | LastVisited |
+----+---------+--------------+-------+------------+-------------+
| 1  | bob     | 12 Some Road | 02456 |            |             |
| 2  | bobby   |              | 02456 | bob#domain |             |
| 3  | bob     | 12 Some Rd   | 02456 |            | 2010-07-13  |
| 4  | sir bob |              | 02456 |            |             |
| 5  | bob     | 12SomeRoad   | 02456 |            |             |
| 6  | mr bob  |              | 02456 |            |             |
| 7  | robert  |              | 02456 |            |             |
+----+---------+--------------+-------+------------+-------------+
This isn't the exact table (the real table has 32 columns); it's just to illustrate.
I know how to identify the duplicates; in this case I'm using the phone number. I've extracted the duplicates into a separate table - there are 730k entries in total.
What would be the most efficient way of merging these records (and flagging the unneeded records for deletion)?
I've looked at using UPDATE with INNER JOINs, but several WHERE clauses are needed, because I want to update the first record with data from subsequent records, where those subsequent records have additional data the former record does not.
I've looked at third-party software such as Fuzzy Dups, but I'd like a pure MySQL option if possible.
The end goal, then, is that I'd be left with something like:
+----+------+--------------+-------+------------+-------------+
| id | Name | Address      | Phone | Email      | LastVisited |
+----+------+--------------+-------+------------+-------------+
| 1  | bob  | 12 Some Road | 02456 | bob#domain | 2010-07-13  |
+----+------+--------------+-------+------------+-------------+
Should I be looking at looping in a stored procedure/function, or is there some really easy thing I've missed?

You have to create a PROCEDURE, but before that,
create your own temp_table like:
INSERT INTO temp_table (column1, column2, ...) SELECT column1, column2, ... FROM myTable GROUP BY phoneNumber
You have to create the above-mentioned physical table so that you can run a cursor on it.
Then create a PROCEDURE (myPROC) that does the following:
Open a cursor on temp_table and fetch the phoneNumber and id of the current row into local variables (L_id, L_phoneNum).
Here, too, you need to create a similar_tempTable which will contain the matching rows:
INSERT INTO similar_tempTable (column1, column2, ...) SELECT column1, column2, ... FROM myTable WHERE phoneNumber = L_phoneNum
The next step is to extract the values of each column you want from similar_tempTable, update them into the row of myTable where id = L_id, and delete the remaining duplicate rows from myTable.
And one more thing: truncate similar_tempTable after every iteration of the cursor.
Hope this will help.
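For reference, here is a minimal sketch of that cursor-based approach in one block. The names are hypothetical (a customers table with id, name, address, phone, email columns and a to_delete flag for marking rows), only three of the 32 columns are shown, and the cursor runs straight over a grouped SELECT instead of a separate temp table:

DELIMITER //
CREATE PROCEDURE merge_duplicates()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE l_id INT;
  DECLARE l_phone VARCHAR(32);
  -- One row per duplicate group: keep the lowest id, remember the phone number.
  DECLARE dup_cursor CURSOR FOR
    SELECT MIN(id), phone FROM customers GROUP BY phone HAVING COUNT(*) > 1;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN dup_cursor;
  merge_loop: LOOP
    FETCH dup_cursor INTO l_id, l_phone;
    IF done THEN
      LEAVE merge_loop;
    END IF;

    -- Collapse each column of the duplicate group into a single non-empty value
    -- (MAX ignores NULLs); repeat the same pattern for every column to merge.
    UPDATE customers k
    JOIN (
      SELECT MAX(NULLIF(name, ''))    AS name,
             MAX(NULLIF(address, '')) AS address,
             MAX(NULLIF(email, ''))   AS email
      FROM customers
      WHERE phone = l_phone
    ) d ON k.id = l_id
    SET k.name    = COALESCE(NULLIF(k.name, ''),    d.name),
        k.address = COALESCE(NULLIF(k.address, ''), d.address),
        k.email   = COALESCE(NULLIF(k.email, ''),   d.email);

    -- Flag the redundant rows for deletion (to_delete is an assumed flag column).
    UPDATE customers SET to_delete = 1
    WHERE phone = l_phone AND id <> l_id;
  END LOOP;
  CLOSE dup_cursor;
END //
DELIMITER ;

The UPDATE ... JOIN (SELECT ...) pattern simply repeats for each additional column you want merged into the surviving record.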

Related

MySql add relationships without creating dupes

I created a table (t_subject) like this:
| id | description | enabled |
|----|-------------|---------|
| 1  | a           | 1       |
| 2  | b           | 1       |
| 3  | c           | 1       |
And another table (t_place) like this:
| id | description | enabled |
|----|-------------|---------|
| 1  | d           | 1       |
| 2  | e           | 1       |
| 3  | f           | 1       |
Right now, data from t_subject is used for each of the t_place records to show HTML dropdowns with all the results from t_subject.
So I simply do
SELECT * FROM t_subject WHERE enabled = 1
Now, for just one of the t_place records, one record from t_subject should be hidden.
I don't want to simply remove it with JavaScript, since I want to be able to customize all of the dropdowns if anything changes.
So the first thing I thought of was to add a place_id column to t_subject.
But this means I would have to duplicate all of the t_subject records: I would have 3 of each, except one that would have 2.
Is there any way to avoid this?
I thought of adding an id_exclusion column to t_subject, so I would only have to duplicate a record whenever it is excluded from another id in t_place.
How bad would that be? This way I would have no duplicates, so far.
Hope all of this makes sense.
While you only need to exclude one course, I would still recommend setting up a full 'place-course' association. You essentially have a many-to-many relationship, despite not explicitly linking your tables.
I would recommend an additional 'bridging' or 'associative entity' table to represent which courses are offered at which places. This new table would have two columns - one foreign key for the ID of t_subject, and one for the ID of t_place.
For example (t_place_course):
| place_id | course_id |
|----------|-----------|
| 1        | 1         |
| 1        | 2         |
| 1        | 3         |
| 2        | 1         |
| 2        | 2         |
| 2        | 3         |
| 3        | 1         |
| 3        | 3         |
As you can see in my example above, place 3 doesn't offer course 2.
From here, you can simply query all of the courses available for a place by querying the place_id:
SELECT * from t_place_course WHERE place_id = 3
The above will return both courses 1 and 3.
You can optionally use a JOIN to get the other information about the course or place, such as the description:
SELECT `t_course`.`description`
FROM `t_course`
INNER JOIN `t_place_course`
ON `t_course`.`id` = `t_place_course`.`course_id`
INNER JOIN `t_place`
ON `t_place`.`id` = `t_place_course`.`place_id`
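For completeness, here is a hedged sketch of how that bridging table could be declared, with foreign keys pointing at the question's t_place and t_subject tables (the t_course name above corresponds to t_subject):

CREATE TABLE t_place_course (
  place_id  INT NOT NULL,
  course_id INT NOT NULL,
  -- Composite key: each subject can appear at most once per place.
  PRIMARY KEY (place_id, course_id),
  FOREIGN KEY (place_id)  REFERENCES t_place (id),
  FOREIGN KEY (course_id) REFERENCES t_subject (id)
);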

Mysql - Compare int field with comma separated field from another table

I have two tables in a MySQL database like this:
User:
id | userid | Username | Plan(VARCHAR) | Status |
---+--------+----------+---------------+--------+
1  | 1      | John     | 1,2,3         | 1      |
2  | 2      | Cynthia  | 1,2           | 1      |
3  | 3      | Charles  | 2,3,4         | 1      |
Plan: (planid is primary key)
planid(INT) | Plan_Name    | Cost | status |
------------+--------------+------+--------+
1           | Tamil Pack   | 100  | ACTIVE |
2           | English Pack | 100  | ACTIVE |
3           | SportsPack   | 100  | ACTIVE |
4           | KidsPack     | 100  | ACTIVE |
OUTPUT
id | userid | Username | Plan  | Planname                           |
---+--------+----------+-------+------------------------------------+
1  | 1      | John     | 1,2,3 | Tamil Pack,English Pack,SportsPack |
2  | 2      | Cynthia  | 1,2   | Tamil Pack,English Pack            |
3  | 3      | Charles  | 2,3,4 | English Pack,SportsPack,KidsPack   |
Since planid in the Plan table is an integer and a user can hold many plans, the plans are stored comma-separated in a varchar column, so when I try an IN condition it doesn't work.
SELECT * FROM plan WHERE find_in_set(plan_id,(select user.planid from user where user.userid=1))
This gets me the 3 rows from the Plan table, but I want the desired output shown above.
How do I do that? Any help, please.
A rewrite of your query that should work is as follows.
Query
SELECT
  all columns you need
  , GROUP_CONCAT(Plan.Plan_Name ORDER BY Plan.planid) AS Planname
FROM
  Plan
WHERE
  FIND_IN_SET(Plan.planid, (
    SELECT
      User.Plan
    FROM
      User
    WHERE User.userid = 1
  ))
GROUP BY
  all columns that are in the SELECT (NOT the GROUP_CONCAT function)
You can also use FIND_IN_SET in the ON clause of an INNER JOIN.
One problem is that the join won't ever use indexes.
Query
SELECT
  all columns you need
  , GROUP_CONCAT(Plan.Plan_Name ORDER BY Plan.planid) AS Planname
FROM
  User
INNER JOIN
  Plan
ON
  FIND_IN_SET(Plan.planid, User.Plan)
WHERE
  User.userid = 1
GROUP BY
  all columns that are in the SELECT (NOT the GROUP_CONCAT function)
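For illustration, filling the placeholders in with the question's columns (a sketch, not tested against the original schema) would give something close to the desired OUTPUT:

SELECT
  User.id
  , User.userid
  , User.Username
  , User.Plan
  , GROUP_CONCAT(Plan.Plan_Name ORDER BY Plan.planid) AS Planname
FROM User
INNER JOIN Plan
  ON FIND_IN_SET(Plan.planid, User.Plan)
GROUP BY
  User.id, User.userid, User.Username, User.Plan;
-- one row per user, with the matching plan names concatenated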
As I said in the comments, you should normalize the table structures and add a table User_Plan that holds the relations between the User and Plan tables.

Fastest way to insert data to database for one to many relationship

I am looking for the fastest way to insert data into a database.
Currently I have 2 tables, "User" and "User_Detail".
One "User" can have many "User_Detail" rows.
Example:
In the database, we have the Age and mail records for user "John".
User table
| Name   |
|--------|
| John   |
| Jason  |
| Wilson |
User_Detail table
| Usr_Name | Property | Value |
|----------+----------+-------|
| John     | Age      | 12    |
| John     | mail     | gmail |
| Wilson   | Age      | 31    |
I would like to write a query to add "uni" to ALL of the users.
The result will become like this.
User_Detail table
| Usr_Name | Property | Value |
|----------+----------+-------|
| John     | Age      | 12    |
| John     | mail     | gmail |
| John     | Uni      | 00000 |
| Wilson   | Age      | 31    |
| Wilson   | Uni      | 00000 |
| Jason    | Uni      | 00000 |
Are there any suggestions or ideas on how to insert this data?
I need the fastest way to do it, as I have around 10k users in my USER table.
It can be any language or database query, as long as it can insert the records into the database very fast.
First, consider normalizing your schema. Here is an in-depth discussion of EAV storage on dba.SE.
With your given design, this does the job:
INSERT INTO "User_Detail" ("Usr_Name", "Property", "Value")
SELECT "Name", 'Uni', '0000'
FROM "User";
In Postgres, I would also advise not to use mixed-case identifiers.
To insert a value in, just do a simple insert query.
INSERT INTO `User_Detail` (`Usr_Name`, `Property`, `Value`)
SELECT `Name`, 'H/P', 50012 FROM `User`
To make the inserted value be something different, you need to change that hard coded value 50012 to something that resolves to the number you want there.
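Applied to the question's actual goal, a hedged MySQL variant that adds a 'Uni' row only for users who do not already have one could look like this (table and column names taken from the question's examples):

-- Anti-join: insert 'Uni' only where no such detail row exists yet.
INSERT INTO `User_Detail` (`Usr_Name`, `Property`, `Value`)
SELECT u.`Name`, 'Uni', '00000'
FROM `User` u
LEFT JOIN `User_Detail` d
  ON d.`Usr_Name` = u.`Name` AND d.`Property` = 'Uni'
WHERE d.`Usr_Name` IS NULL;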

SQL Trigger Multiple Tables

I want to trigger an UPDATE on multiple SQL tables without creating a loop.
Let's say I have 2 tables:
Table: User_Names
----------------
| Name | Clark |
| Gen  | Male  |
| id   | 1     |
----------------
Table: User_Ages
----------------
| Age | 34   |
| Gen | Male |
| id  | 1    |
----------------
The ids are unique and refer to the same person. I want to update the column Gen in User_Names, and my trigger should update it in the other table. I also want this to happen when I change it in the User_Ages table, but if both update each other I'm creating a loop with the UPDATE trigger in MySQL. How do I prevent this loop? The point here is creating a SQL trigger.
I'm not going to address your original question directly, given the nature of your example. This is a normalization issue much more than a trigger issue.
In this case you should normalize your data and only store it in one place. The example above also suggests a slight misunderstanding of how to use rows and columns.
Given the example, a better layout would probably be:
Table: User_names
+----+---------+------+
| id | Name | gen |
+----+---------+------+
| 1 | Clark | Male |
+----+---------+------+
Table: User_Ages
+----+------+
| id | age |
+----+------+
| 1 | 34 |
+----+------+
When you want to retrieve both values, you'd just link them in your query, e.g.
SELECT user_names.id,name,gen,age FROM User_names JOIN User_Ages USING (id);
Would give you:
+----+---------+------+-----+
| id | Name | gen | age |
+----+---------+------+-----+
| 1 | Clark | Male | 34 |
+----+---------+------+-----+
Coming back to your original question: in a situation like that I'd question the original design. If it is really called for, then I'd pick one table that acts as a master and propagates the changes to the other table, e.g. define the trigger on the User_Names table and use it to populate the User_Ages table as well.
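If the two-table design really has to stay, a minimal one-way trigger might look like the sketch below. It assumes the question's layout with id and Gen as actual columns in both tables, and it deliberately defines no trigger on User_Ages, which is what prevents the loop:

DELIMITER //
CREATE TRIGGER trg_user_names_sync_gen
AFTER UPDATE ON User_Names
FOR EACH ROW
BEGIN
  -- Propagate Gen one way only; with no trigger on User_Ages there is nothing to fire back.
  IF NOT (NEW.Gen <=> OLD.Gen) THEN
    UPDATE User_Ages SET Gen = NEW.Gen WHERE id = NEW.id;
  END IF;
END //
DELIMITER ;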

How to store multiple values in a single column using less memory?

I have a table of users where one column stores the user's "roles".
We can assign multiple roles to a particular user.
I want to store the role IDs in that "roles" column.
But how can I store multiple values in a single column in a way that saves memory and is still easy to use? For example, a comma-delimited field is neither easy to use nor memory-efficient.
Any ideas?
If a user can have multiple roles, it is probably better to have a user_role table that stores this information. It is normalised, and will be much easier to query.
A table like:
user_id | role
--------+--------
1       | Admin
2       | User
2       | Admin
3       | User
3       | Author
A table like this will allow you to query for all users with a particular role, such as SELECT user_role.user_id, user.name FROM user_role JOIN user ON user.id = user_role.user_id WHERE role = 'Admin', rather than having to use string parsing to get details out of a column.
Amongst other things this will be faster, as you can index the columns properly, and it will take only marginally more space than any solution that puts multiple values into a single column - which is antithetical to what relational databases are designed for.
The reason this shouldn't be stored in a single column is that it is inefficient, for the reason DCoder states in the comment to this answer. To check if a user has a role, every row of the user table will need to be scanned, and then the "roles" column will have to be scanned using string matching - regardless of how this action is exposed, the RDBMS will need to perform string operations to parse the content. These are very expensive operations, and not at all good database design.
If you need to have a single column, I would strongly suggest that you no longer have a technical problem, but a people-management one. Adding additional tables to an existing database that is under development should not be difficult. If this isn't something you are authorised to do, explain to the right person why the extra table is needed - because munging multiple values into a single column is a bad, bad idea.
You can also use bitwise logic with MySQL. Each role_id must be a power of 2 (1, 2, 4, 8, 16, 32, ...).
role_id | label
--------+--------
1       | Admin
2       | User
4       | Author

user_id | name  | role
--------+-------+------
1       | John  | 1
2       | Steve | 3
3       | Jack  | 6
Bitwise logic allows you to select users by their roles:
SELECT * FROM users WHERE role & 1
-- returns all Admin users
SELECT * FROM users WHERE role & 5
-- returns all users who are admin or Author because 5 = 1 + 4
SELECT * FROM users WHERE role & 6
-- returns all users who are User or Author because 6 = 2 + 4
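Building on the same hypothetical users table and bit values, the bitmask can also be maintained with the same operators:

UPDATE users SET role = role | 4 WHERE user_id = 1;
-- grants the Author role (bit 4) to user 1
UPDATE users SET role = role & ~2 WHERE user_id = 2;
-- revokes the User role (bit 2) from user 2
SELECT * FROM users WHERE (role & 3) = 3;
-- returns users who are BOTH Admin and User because 3 = 1 + 2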
From what I got from your question:
Suppose you have two tables, a "meal" table and a "combo_meal" table. Now I think you want to store multiple meal_ids against one combo_meal_id without separating them with commas, and you said that this should make your DB more standard.
If I'm not getting your question wrong, then please read my suggestion below carefully. It may help you.
The first thing is that your concept is right; it will definitely give you a more standard DB.
For this you have to create one more table (for example: combo_meal_relation) for referencing the data of those two tables. Maybe a visible example will make it clear.
meal table
+----+--------+----------+-------+
| id | name   | serving  | price |
+----+--------+----------+-------+
| 1  | soup1  | 2 person | 12.50 |
| 2  | soup2  | 2 person | 15.50 |
| 3  | soup3  | 2 person | 23.00 |
| 4  | drink1 | 2 person | 4.50  |
| 5  | drink2 | 2 person | 3.50  |
| 6  | drink3 | 2 person | 5.50  |
| 7  | fruit1 | 2 person | 3.00  |
| 8  | fruit2 | 2 person | 3.50  |
| 9  | fruit3 | 2 person | 4.50  |
+----+--------+----------+-------+
combo_meal table
+----+------------+----------+
| id | combo_name | serving  |
+----+------------+----------+
| 1  | combo1     | 2 person |
| 2  | combo2     | 2 person |
| 4  | combo3     | 2 person |
+----+------------+----------+
combo_meal_relation
+----+---------------+---------+
| id | combo_meal_id | meal_id |
+----+---------------+---------+
| 1  | 1             | 1       |
| 2  | 1             | 2       |
| 3  | 1             | 3       |
| 4  | 2             | 4       |
| 5  | 2             | 2       |
| 6  | 2             | 7       |
+----+---------------+---------+
When you search the tables, this will give you results faster.
Search query:
SELECT m.*
FROM combo_meal_relation cmr
JOIN meal m
  ON m.id = cmr.meal_id
WHERE cmr.combo_meal_id = 1
Hopefully you understand :)
You could do something like this:
INSERT INTO table (id, roles) VALUES ('', '2,3,4');
Then, to find it, use FIND_IN_SET.
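For example (placeholder table and column names as in the line above):

SELECT * FROM `table` WHERE FIND_IN_SET('3', roles);
-- returns every row whose comma-separated roles column contains the role id 3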
As you might already know, storing multiple values in a cell goes against 1NF. If you're fine with that, using a JSON column type is a great option and has good methods for querying properly.
SELECT * FROM table_name
WHERE JSON_CONTAINS(column_name, '"value 2"', '$')
Will return any entry with json data like
[
"value",
"value 2",
"value 3"
]
You're using JSON, though, so remember that your query performance will go down the drain.