I have two tables from two different databases, and both contain lastName and firstName columns. I need to create a JOIN relationship between the two. The lastName columns match about 80% of the time, while the firstName columns match only about 20% of the time. Each table also has completely different personID primary keys.
Generally speaking, what would be some "best practices" and/or tips to use when I add a foreign key to one of the tables? Since I have about 4,000 distinct persons, any labor-saving tips would be greatly appreciated.
Sample mismatched data:
db1.table1                      db2.table2
23  Williams      Fritz         98  Williams  Frederick
25  Wilson-Smith  James         12  Smith     James Wilson
26  Winston       Trudy         73  Winston   Gertrude
Keep in mind: sometimes they match exactly, often they don't, and sometimes two different people will have the same first/last name.
You can join on multiple fields.
select *
from table1
inner join table2
on table1.firstName = table2.firstName
and table1.lastName = table2.lastName
From this you can determine how many 'duplicate' firstname / last name combos there are.
select table1.firstName, table2.lastName, count(*)
from table1
inner join table2
on table1.firstName = table2.firstName
and table1.lastName = table2.lastName
group by table1.firstName, table2.lastName
having count(*) > 1
Conversely, you can also determine the ones which match identically, and only once:
select table1.firstName, table2.lastName
from table1
inner join table2
on table1.firstName = table2.firstName
and table1.lastName = table2.lastName
group by table1.firstName, table2.lastName
having count(*) = 1
And this last query could be the basis for performing the bulk of your foreign key updates.
Names that match more than once between the tables will likely need some sort of manual intervention, unless there are other fields in the tables that can be used to differentiate them.
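As a rough sketch of how the "match exactly once" query could drive the bulk update, the example below uses Python's sqlite3 module as a stand-in for the real databases. The column table2_personID and the extra sample rows are assumptions made for illustration; only names that occur exactly once on each side get linked, and everything else is left NULL for manual review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- table2_personID is a hypothetical new foreign key column
    CREATE TABLE table1 (personID INTEGER PRIMARY KEY, lastName TEXT,
                         firstName TEXT, table2_personID INTEGER);
    CREATE TABLE table2 (personID INTEGER PRIMARY KEY, lastName TEXT,
                         firstName TEXT);
    -- illustrative rows: one mismatch, one clean match, one duplicate name
    INSERT INTO table1 (personID, lastName, firstName) VALUES
        (23, 'Williams', 'Fritz'), (30, 'Young', 'Ann'),
        (31, 'Zed', 'Bob'), (32, 'Zed', 'Bob');
    INSERT INTO table2 VALUES
        (98, 'Williams', 'Frederick'), (40, 'Young', 'Ann'), (41, 'Zed', 'Bob');
""")

# Link a row only when its name occurs exactly once in BOTH tables.
conn.execute("""
    UPDATE table1
    SET table2_personID = (
        SELECT MIN(t2.personID) FROM table2 t2
        WHERE t2.firstName = table1.firstName
          AND t2.lastName = table1.lastName)
    WHERE (SELECT COUNT(*) FROM table1 a
           WHERE a.firstName = table1.firstName
             AND a.lastName = table1.lastName) = 1
      AND (SELECT COUNT(*) FROM table2 b
           WHERE b.firstName = table1.firstName
             AND b.lastName = table1.lastName) = 1
""")

links = conn.execute(
    "SELECT personID, table2_personID FROM table1 ORDER BY personID"
).fetchall()
print(links)
```

Rows whose foreign key is still NULL afterwards are the ones that need fuzzy or manual matching.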
I have 2 tables:
Table 1: users
id  name  faculty_id  level_id
1   john  1           1
2   mark  1           1
3   sam   1           2
Table 2: subjects
id  title      faculty_id
1   physics    1
2   chemistry  1
3   english    2
SQL query:
SELECT count(subjects.id) FROM users INNER JOIN subjects ON users.faculty_id = subjects.faculty_id WHERE users.level_id = 1
I'm trying to get the count of subjects where users.level_id = 1, which should be 2 in this case (physics and chemistry).
But it's returning more than 2.
Why is that and how to get only 2?
I would recommend exists:
SELECT COUNT(*)
FROM subjects s
WHERE EXISTS (SELECT 1
FROM users u
WHERE u.faculty_id = s.faculty_id AND
u.level_id = 1
);
This counts subjects where a user exists with a level of 1.
You are joining users and subjects on faculty_id; this produces every combination of user and subject rows (2 users and 2 subjects makes 4 combined rows); change your query to SELECT users.*, subjects.* FROM... to see how this works.
count(subjects.id) counts the number of non-null subjects.id values in your results; you can just do count(distinct subjects.id).
The two tables are not directly related as none is parent to the other. The faculty table is parent to both tables and this is what relates the two tables indirectly.
When joining the faculties' students with the faculties' subjects per faculty, you get all combinations (john|physics, mark|physics, sam|physics, john|chemistry, mark|chemistry, ...). Whether John really has the subject Physics cannot even be gathered from the database. We see that John studies in a faculty containing the subjects Physics and Chemistry, but does every student have every subject belonging to their faculty? You probably know but we don't. That shows that in order to write proper queries, one should know their database :-)
Now you are joining the tables and get all students per faculty multiplied by all subjects per faculty. You limit this to level_id = 1, which gets you 2 students x 2 subjects = 4. You could use COUNT(*) for this, because you are counting rows. By applying COUNT(subjects.id) instead you are only counting rows for which the subject ID is not null, but that is true for all rows, because all four combined rows have either subject ID 1 (Physics) or 2 (Chemistry). Counting something that cannot be null makes no sense, except for counting distinct, as has already been suggested. You can use COUNT(DISTINCT subjects.id) to get the distinct number of subjects matching your conditions.
This, however, has two drawbacks. First, the query doesn't clearly show your intention. Why do you join all students with all subjects, when you are not really interested in the (four) combinations? Secondly, you are building an unnecessary intermediate result (four rows in your small example) that must be searched for duplicates, so these can be removed from the counting. That means more memory consumed and more work for the DBMS.
What you want to count is subjects. So select from the subjects table. Your condition is that a student exists with level 1 for the same faculty. Conditions belong in the WHERE clause. Use EXISTS as Gordon suggests in his answer or use IN which is slightly shorter to write and may hence be considered a tad more readable (but that boils down to personal preference, as EXISTS and IN express exactly the same thing here).
select count(*)
from subjects
where faculty_id in (select faculty_id from users where level_id = 1);
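To see the row multiplication concretely, here is a small sketch that loads the sample data from the question into an in-memory SQLite database (used here only as a convenient stand-in for the asker's DBMS) and compares the three counts discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT,
                        faculty_id INTEGER, level_id INTEGER);
    CREATE TABLE subjects (id INTEGER PRIMARY KEY, title TEXT,
                           faculty_id INTEGER);
    INSERT INTO users VALUES (1,'john',1,1), (2,'mark',1,1), (3,'sam',1,2);
    INSERT INTO subjects VALUES (1,'physics',1), (2,'chemistry',1),
                                (3,'english',2);
""")

JOIN = ("FROM users u INNER JOIN subjects s ON u.faculty_id = s.faculty_id "
        "WHERE u.level_id = 1")

# The original query: counts every user/subject combination.
join_count = conn.execute(f"SELECT COUNT(s.id) {JOIN}").fetchone()[0]

# Same join, but duplicates are removed before counting.
distinct_count = conn.execute(
    f"SELECT COUNT(DISTINCT s.id) {JOIN}").fetchone()[0]

# No join at all: count subjects whose faculty has a level 1 user.
in_count = conn.execute("""
    SELECT COUNT(*) FROM subjects
    WHERE faculty_id IN (SELECT faculty_id FROM users WHERE level_id = 1)
""").fetchone()[0]

print(join_count, distinct_count, in_count)  # 4 2 2
```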
You can just add "distinct" before subjects.id,
so that your SQL query looks like:
SELECT count(distinct subjects.id) FROM users INNER JOIN subjects ON users.faculty_id = subjects.faculty_id WHERE users.level_id = 1
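You want to count by level_id, but your code counts subjects.id. I would suggest first joining the two tables.
SELECT users.name, users.level_id,
subjects.title
FROM users
INNER JOIN subjects ON
users.faculty_id = subjects.faculty_id
After joining the tables, you can get the count by using that join as a derived table named new_table. Note that WHERE must come before GROUP BY.
SELECT level_id, COUNT(level_id)
FROM (SELECT users.level_id
FROM users
INNER JOIN subjects ON users.faculty_id = subjects.faculty_id) AS new_table
WHERE level_id = 1
GROUP BY level_id
(You have not mentioned group by in your code. Note that this counts joined rows, not distinct subjects.)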
I want to store data like the following in MySQL, with a unique constraint on user_id and lids:
recordid user_id lids length breadth
------------------------------------------------------------
1 1 l1,l2 10 5
2 1 l1 7 5
3 1 l1,l3,l2 10 10
4 1 l2,l3 25 15
My query patterns are:
Give me length & breadth where lids are l2,l1
Give me length & breadth where lids are l2,l3
Basically, the lids in the search input can come in any order, and the query should still return the correct length and breadth.
I know that we should not store comma-separated values in an RDBMS.
Question - How should I structure the DB to have unique user_id/lids combinations that return the correct length & breadth without much string manipulation?
I came up with a solution to query the DB like this -
select * from table1 where find_in_set('l2', lids) AND find_in_set('l1', lids);
then in code, check that the record has exactly 2 lids. But it is not a perfect solution, and I need guidance on it.
AddOn - A SpringBoot + JPA (Hibernate) specific solution will be great, where there is no requirement of writing native sql query
As per comments if I create a table for lids -
recordid(fk) lid
----------------------------------
1 l1
1 l2
2 l1
3 l1
3 l3
3 l2
4 l2
4 l3
Then how will I ensure that just 1 unique combination of lids should be available for the user?
and what will be my select query? Will it be like the following?
select * from table1, lids where table1.recordid = lids.recordid and lid IN ('l2','l3');
The IN operator will run an OR query instead of AND, which will give wrong results as well.
Do I have to group based on the recordid in lids table then apply where condition? Apologies, I'm totally confused as I have read many articles related to it and got distracted.
Okay, the question basically boils down to this: how to find whether a list/set exactly matches another list.
I want to find the recordid that has the EXACT list of lids I search for.
tl;dr: for design problems like this, think about entities, relationships, and sets.
You have two entities, records and lids. They have a many-to-many relationship.
Let's call your second table records_lids, to show that it's a many-to-many association table between records and lids. It has two columns, record_id and lid. When a row exists in that table it means that the record_id mentioned has the lid mentioned.
That table's primary key should be made of both its columns (record_id, lid). Because primary keys are unique, this prevents any record from having the same lid more than once.
Now, finding the set of record_id values with lid l1 is easy. You don't even need your first table.
SELECT record_id FROM records_lids WHERE lid = 'l1'
To find records with multiple lids, you need to take the logical intersection of the sets of records with each lid. You can do that like this: (https://www.db-fiddle.com/f/cLf4b6LDwMH9eFRTTheZJr/0)
SELECT record_id
FROM (SELECT record_id FROM records_lids WHERE lid = 'l1') l1
NATURAL JOIN (SELECT record_id FROM records_lids WHERE lid = 'l2') l2
NATURAL JOIN (SELECT record_id FROM records_lids WHERE lid = 'l3') l3
The NATURAL JOIN operations handle the intersection operation; the result only includes rows with matching record_id values. (Some other SQL database servers have long had the INTERSECT operator; MySQL only added it in version 8.0.31.)
You can also do it this way (https://www.db-fiddle.com/f/cLf4b6LDwMH9eFRTTheZJr/1).
SELECT record_id
FROM records_lids
WHERE lid IN ('l1','l2','l3')
GROUP BY record_id
HAVING COUNT(*) = 3
The HAVING clause is how you insist you want records with all three lids.
Once you have the set of record_ids, you can join that to your other table. (https://www.db-fiddle.com/f/cLf4b6LDwMH9eFRTTheZJr/2)
SELECT records.*
FROM (SELECT record_id FROM records_lids WHERE lid = 'l1') l1
NATURAL JOIN (SELECT record_id FROM records_lids WHERE lid = 'l2') l2
NATURAL JOIN (SELECT record_id FROM records_lids WHERE lid = 'l3') l3
NATURAL JOIN records
or (https://www.db-fiddle.com/f/cLf4b6LDwMH9eFRTTheZJr/3)
SELECT *
FROM records
WHERE record_id IN (
SELECT record_id
FROM records_lids
WHERE lid IN ('l1','l2','l3')
GROUP BY record_id
HAVING COUNT(*) = 3
)
Edit: I did not completely understand your question. You want to exclude records without an *exactly* matching set of lids. Try this (https://www.db-fiddle.com/f/cLf4b6LDwMH9eFRTTheZJr/4). It depends on a quirk of MySQL, which is that Boolean expressions like lid IN ('l1', 'l2') have the value 0 when false and 1 when true.
SELECT *
FROM records
WHERE record_id IN (
SELECT record_id
FROM records_lids
GROUP BY record_id
HAVING SUM(lid IN ('l1', 'l2')) = 2
AND COUNT(*) = 2
)
SQL is, at its heart, a language for manipulating sets. The design technique here is:
1. figure out your entities
2. work out the relationships between them
3. work out how to get the sets of entities you require
4. retrieve the rows you need matching the sets
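The exact-match query can be wrapped so that the search lids may arrive in any order. The sketch below is only an illustration: it runs against an in-memory SQLite database rather than MySQL, the helper name find_exact is made up, and it uses a portable CASE expression instead of relying on Boolean expressions evaluating to 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (record_id INTEGER PRIMARY KEY, user_id INTEGER,
                          length INTEGER, breadth INTEGER);
    CREATE TABLE records_lids (record_id INTEGER, lid TEXT,
                               PRIMARY KEY (record_id, lid));
    INSERT INTO records VALUES (1,1,10,5), (2,1,7,5), (3,1,10,10), (4,1,25,15);
    INSERT INTO records_lids VALUES
        (1,'l1'), (1,'l2'), (2,'l1'),
        (3,'l1'), (3,'l3'), (3,'l2'), (4,'l2'), (4,'l3');
""")

def find_exact(user_id, lids):
    """Return (length, breadth) rows whose lid set equals `lids` exactly."""
    wanted = sorted(set(lids))              # input order does not matter
    marks = ",".join("?" * len(wanted))     # one placeholder per lid
    sql = f"""
        SELECT r.length, r.breadth
        FROM records r
        WHERE r.user_id = ?
          AND r.record_id IN (
              SELECT record_id
              FROM records_lids
              GROUP BY record_id
              -- every wanted lid is present ...
              HAVING SUM(CASE WHEN lid IN ({marks}) THEN 1 ELSE 0 END) = ?
              -- ... and the record has no other lids
                 AND COUNT(*) = ?)
    """
    params = [user_id, *wanted, len(wanted), len(wanted)]
    return conn.execute(sql, params).fetchall()

print(find_exact(1, ["l2", "l1"]))  # [(10, 5)]
print(find_exact(1, ["l2", "l3"]))  # [(25, 15)]
```

Record 3 contains l1 and l2 as well, but it is excluded for the first search because its total lid count is 3.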
I have two tables. These are not them but it is the same principle:
Table:One (artists)
--------------
id (Primary Key)
name
best genre
Table:Two (artist teams)
-------------
id1 (Foreign Key)
id2 (Foreign Key)
I want to select the artist teams where their favorite genres are the same.
My work so far is
SELECT *
FROM Two INNER JOIN One
WHERE ( ).
I'm confused as to what to put in the WHERE clause.
I have no idea how to compare the values of the artist's genres to each other!
pseudo code for WHERE:
retrieve id#1's favourite genre
retrieve id#2's favourite genre
compare them
if equal display the related entity from table Two
I've searched for a while looking for a solution and I can't find anything
just like this. I believe it could be a bit of syntax that I'm missing.
Thanks for any help!
You need multiple joins to the "artists" table:
select t.*, a1.genre
from teams t join
artists a1
on t.id1 = a1.id join
artists a2
on t.id2 = a2.id and a2.genre = a1.genre;
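Here is a small runnable sketch of the double join, using an in-memory SQLite database. The table names artists and teams, the genre column, and the sample rows are all assumptions made for the example, since the question does not show the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT, genre TEXT);
    CREATE TABLE teams (id1 INTEGER REFERENCES artists(id),
                        id2 INTEGER REFERENCES artists(id));
    -- made-up sample data
    INSERT INTO artists VALUES (1,'Ada','jazz'), (2,'Ben','jazz'),
                               (3,'Cy','rock');
    INSERT INTO teams VALUES (1,2), (1,3);
""")

# Join artists twice, once per team member, then compare the genres.
rows = conn.execute("""
    SELECT a1.name, a2.name, a1.genre
    FROM teams t
    JOIN artists a1 ON t.id1 = a1.id
    JOIN artists a2 ON t.id2 = a2.id
    WHERE a1.genre = a2.genre
""").fetchall()

print(rows)  # [('Ada', 'Ben', 'jazz')]
```

Putting the genre comparison in WHERE instead of the second ON clause gives the same result for an inner join; it is purely a readability choice.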
I have a database as:
Student (ID, Name, Grade)
Likes (ID1, ID2)
where ID1 and ID2 in the last table are foreign keys referencing Student(ID).
Note: Liking isn't a mutual relation, e.g. it's not necessary that if (123, 456) is in the Likes table, then (456, 123) is also in the Likes table.
I have to write query for the following statement:
"For every pair of students, who both like each other, return the name and grade of both students. Include each pair only once, with the two names in alphabetical order."
So far I have retrieved the rows in which ID1 and ID2 mutually like each other:
SELECT s1.ID, s1.name, s2.ID, s2.name
FROM student s1, student s2, likes l
WHERE s1.ID = l.ID1 AND s2.ID = l.ID2
AND l.ID1 IN (SELECT ID2 FROM likes)
AND l.ID2 IN (SELECT ID1 FROM likes);
Could someone kindly help me avoid the duplicate pairs?
Database is: (If someone needs it)
INSERT INTO `student` VALUES (1025,'John',12),(1101,'Haley',10),(1247,'Alexis',11),(1304,'Jordan',12),(1316,'Austin',11),(1381,'Tiffany',9),(1468,'Kris',10),(1501,'Jessica',11),(1510,'Jordan',9),(1641,'Brittany',10),(1661,'Logan',12),(1689,'Gabriel',9),(1709,'Cassandra',9),(1782,'Andrew',10),(1911,'Gabriel',11),(1934,'Kyle',12);
INSERT INTO `likes` VALUES (1689,1709),(1709,1689),(1782,1709),(1911,1247),(1247,1468),(1641,1468),(1316,1304),(1501,1934),(1934,1501),(1025,1101);
and according to data entered:
DATA I GET
1689 Gabriel 1709 Cassandra
1709 Cassandra 1689 Gabriel
1501 Jessica 1934 Kyle
1934 Kyle 1501 Jessica
IDEAL DATA
1689 Gabriel 1709 Cassandra
1501 Jessica 1934 Kyle
Since the question is "how to avoid duplicate pairs":
When you join the likes table to itself to get the students who like each other, you get 2 rows for each pair.
You can discard one by comparing on some distinct value. ID is a great candidate:
select * -- put fields here
from likes li
join likes li2 on li2.ID1 = li.ID2 and li2.ID2 = li.ID1
-- join 2 students here
where li.ID1 < li.ID2
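A filled-in version of that outline might look like the sketch below, run here against the data from the question in an in-memory SQLite database. Because the assignment asks for the two names in alphabetical order, this variant filters on the names and falls back to the ID only as a tie-breaker for two students with the same name; that tie-breaker is an assumption, not part of the original outline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (ID INTEGER PRIMARY KEY, name TEXT, grade INTEGER);
    CREATE TABLE likes (ID1 INTEGER, ID2 INTEGER);
    INSERT INTO student VALUES
        (1025,'John',12),(1101,'Haley',10),(1247,'Alexis',11),
        (1304,'Jordan',12),(1316,'Austin',11),(1381,'Tiffany',9),
        (1468,'Kris',10),(1501,'Jessica',11),(1510,'Jordan',9),
        (1641,'Brittany',10),(1661,'Logan',12),(1689,'Gabriel',9),
        (1709,'Cassandra',9),(1782,'Andrew',10),(1911,'Gabriel',11),
        (1934,'Kyle',12);
    INSERT INTO likes VALUES
        (1689,1709),(1709,1689),(1782,1709),(1911,1247),(1247,1468),
        (1641,1468),(1316,1304),(1501,1934),(1934,1501),(1025,1101);
""")

pairs = conn.execute("""
    SELECT s1.name, s1.grade, s2.name, s2.grade
    FROM likes li
    -- the reverse row must exist for the liking to be mutual
    JOIN likes li2 ON li2.ID1 = li.ID2 AND li2.ID2 = li.ID1
    JOIN student s1 ON s1.ID = li.ID1
    JOIN student s2 ON s2.ID = li.ID2
    -- keep only the row where the first name sorts first
    WHERE s1.name < s2.name
       OR (s1.name = s2.name AND s1.ID < s2.ID)
    ORDER BY s1.name, s2.name
""").fetchall()

print(pairs)  # [('Cassandra', 9, 'Gabriel', 9), ('Jessica', 11, 'Kyle', 12)]
```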
Try the query below:
SELECT l1.*
FROM `likes` l1
LEFT JOIN `likes` l2 ON l1.id1 = l2.id2 AND l1.id2 = l2.id1
WHERE l2.id2 IS NOT NULL
AND l1.id1 < l1.id2
I have a table called friends, which has id and name, and a self-join table called friendship, which stores the relationship and includes friend_id and friend2_id.
How do I get the names of related friends if the name of a particular friend is given?
example
id name
1 jack
2 kurt
3 jim
and
friendship
f_id f1_id
1 3
So if I give 'jack' I should get jim back.
You could do this in one query or two queries, depending on what you want to accomplish.
A simple one could be:
SELECT
f_id,
f1_id
FROM
friendship
WHERE
f_id=1
OR
f1_id=1
And then you can get the specific friends with a statement like:
SELECT name FROM people WHERE id IN(2,3)
An alternative is to join the names table twice, but the hard part here is that your id might be in either f_id or f1_id, so that would need a UNION or something like (untested):
SELECT
p1.name,
p2.name
FROM
friendship
INNER JOIN
people AS p1
ON friendship.f_id = p1.id
INNER JOIN
people AS p2
ON friendship.f1_id = p2.id
WHERE
p1.id=1 OR p2.id=1
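The UNION approach mentioned above could be sketched as follows. This example uses the asker's table name friends and the sample rows from the question, runs in an in-memory SQLite database, and wraps the query in a hypothetical helper friends_of that takes the name directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendship (f_id INTEGER, f1_id INTEGER);
    INSERT INTO friends VALUES (1,'jack'), (2,'kurt'), (3,'jim');
    INSERT INTO friendship VALUES (1,3);
""")

def friends_of(name):
    """Names linked to `name`, whichever friendship column holds its id."""
    sql = """
        -- the person is stored in f_id, the friend in f1_id
        SELECT other.name
        FROM friends me
        JOIN friendship fs ON fs.f_id = me.id
        JOIN friends other ON other.id = fs.f1_id
        WHERE me.name = ?
        UNION
        -- the person is stored in f1_id, the friend in f_id
        SELECT other.name
        FROM friends me
        JOIN friendship fs ON fs.f1_id = me.id
        JOIN friends other ON other.id = fs.f_id
        WHERE me.name = ?
    """
    return [row[0] for row in conn.execute(sql, (name, name))]

print(friends_of("jack"))  # ['jim']
print(friends_of("jim"))   # ['jack']
```

UNION also removes duplicates, which matters if the same friendship is ever stored in both directions.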
I would thoroughly check the speed of these options, since they can be quite heavy on large numbers of records. If measurement shows that you need more performance, try an alternative. For example, if you always put the smaller people.id in f_id and the larger one in f1_id, you could run 2 queries and UNION them. Another alternative is to denormalize a little and cache the results if you need them frequently.
It would save you lots of joins, for example, if you added the names to the friendship table:
SELECT
f_id,
f1_id,
f_name,
f1_name
FROM
friendship
WHERE
f_id=1
OR
f1_id=1