Complex SQL substr comparison for dups - mysql

Slightly complex problem here that I'd like to solve in SQL.
I have duplicate person records like these:
There are many examples like this where the name was misspelled, so my inbound ETL code didn't detect them as duplicates.
I have a deduping workflow that culls suspected duplicates with the same first and last names and lets the user collapse them. The query for this page is below:
SELECT * FROM
((address
INNER JOIN ((person AS a
INNER JOIN (SELECT
idperson, last_name, first_name, middle, suffix
FROM
person
GROUP BY last_name , first_name
HAVING Count(*) > 1) AS b ON (a.first_name = b.first_name)
AND (a.last_name = b.last_name))
INNER JOIN constituent ON a.constituent_idconstituent = constituent.idconstituent
INNER JOIN constituent_address ON a.constituent_idconstituent = constituent_address.constituent_idconstituent) ON address.idaddress = constituent_address.address_idaddress)
INNER JOIN city ON address.city_idcity = city.idcity)
INNER JOIN
state ON city.state_idstates = state.idstates
WHERE
a.last_name = 'Cascarano'
ORDER BY a.last_name , a.first_name , a.middle , address.line_1 ASC
However, my example above, where the first names are spelled differently, isn't caught by this query.
Is there a substring or some other SQL trick I can apply here to chop up the first_name field and look for, say, a 75% letter match? I know I'm reaching...
Thanks!!!!!
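One way to approximate the "75% letter match" idea is to do the fuzzy comparison outside the database. The following is only a sketch under assumptions: the rows are hypothetical sample data, similar_first_names is a made-up helper, and the similarity measure is Python's difflib.SequenceMatcher ratio rather than anything built into MySQL. (MySQL does ship SOUNDEX() and SOUNDS LIKE for phonetic matching, which may be worth trying inside the query itself.)

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_first_names(people, threshold=0.75):
    """Return (id_a, id_b, score) for rows that share a last name and
    whose first names have a similarity ratio >= threshold.

    people: iterable of (idperson, first_name, last_name) tuples.
    """
    # Group rows by last name so only plausible duplicates are compared.
    by_last = {}
    for idperson, first, last in people:
        by_last.setdefault(last.lower(), []).append((idperson, first.lower()))

    pairs = []
    for group in by_last.values():
        for (id_a, first_a), (id_b, first_b) in combinations(group, 2):
            score = SequenceMatcher(None, first_a, first_b).ratio()
            if score >= threshold:
                pairs.append((id_a, id_b, round(score, 2)))
    return pairs


# Hypothetical sample rows: (idperson, first_name, last_name)
people = [
    (1, "Jonathan", "Cascarano"),
    (2, "Jonathon", "Cascarano"),
    (3, "Maria", "Cascarano"),
    (4, "Jonathan", "Smith"),
]
suspects = similar_first_names(people)
```

Here rows 1 and 2 are flagged as suspected duplicates, while row 3 (different first name) and row 4 (different last name) are not. The 0.75 threshold is a starting point to tune against real data.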

Related

Why use letters in front of each value in MySQL query?

Why would I use letters in front of each value in my query like this?
In the database, each of these values is WITHOUT the letter in front.
SELECT c.client_id, c.client_name, c.contactperson, c.internal_comment,
IFNULL(r.region, 'Alle byer') as region, c.phone, c.email,
uu.fullname as changed_by,
(select count(p.project_id)
from projects p
where p.client_id = c.client_id and (p.is_deleted != 1 or p.is_deleted is null)
) as numProjects
FROM clients c LEFT JOIN users uu ON c.db_changed_by = uu.id
LEFT JOIN regions r ON c.region_id = r.region_id
WHERE (c.is_deleted != 1 or c.is_deleted is null)
I have tried looking it up, but I can't find it anywhere.

When in SQL you need to use more than one table for a query, you can do this:
SELECT person.name, vehicle.id FROM person, vehicle;
Or you can shorten it by giving each table an alias, like this:
SELECT p.name, v.id FROM person p, vehicle v;
The aliases only shorten the query; they refer to the same tables and columns.

By "letters in front", I assume you mean the qualifiers on the columns c., uu. and so on. They indicate the table where the column comes from. In a sense, they are part of the definition of the column.
This is your query:
SELECT c.client_id, c.client_name, c.contactperson, c.internal_comment,
IFNULL(r.region, 'Alle byer') as region, c.phone, c.email,
uu.fullname as changed_by,
(select count(p.project_id)
from projects p
where p.client_id = c.client_id and (p.is_deleted != 1 or p.is_deleted is null)
) as numProjects
FROM clients c LEFT JOIN
users uu
ON c.db_changed_by = uu.id LEFT JOIN
regions r
ON c.region_id = r.region_id
WHERE (c.is_deleted != 1 or c.is_deleted is null)
In some cases, these are needed. Consider the on clause:
ON c.region_id = r.region_id
If you leave them out, you have:
ON region_id = region_id
The SQL compiler cannot interpret this, because it does not know where region_id comes from. Is it from clients or regions? If you used this in the select, you would have the same issue -- and it makes a difference because of the left join. This is also true in the correlated subquery.
In general, it is good practice to qualify column names for several reasons:
The query is unambiguous.
You (and others) readily know where columns are coming from.
If you modify the query and add a new table/subquery, you don't have to worry about naming conflicts.
If the underlying tables are modified to have new column names that are shared with other tables, then the query will still compile.
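To see the ambiguity problem in action, here is a small sketch using Python's built-in sqlite3 module with two made-up tables shaped like clients and regions. SQLite stands in for MySQL here, so the exact error text will differ, but the behaviour illustrated is the same: the qualified query runs, and the unqualified column is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY,
                          client_name TEXT, region_id INTEGER);
    INSERT INTO regions VALUES (1, 'Oslo');
    INSERT INTO clients VALUES (10, 'Acme', 1), (11, 'Globex', NULL);
""")

# Qualified columns: the engine knows which table each column comes from.
qualified = conn.execute("""
    SELECT c.client_name, r.region
    FROM clients c LEFT JOIN regions r ON c.region_id = r.region_id
    ORDER BY c.client_id
""").fetchall()

# Unqualified region_id exists in both tables, so the query is rejected.
try:
    conn.execute("""
        SELECT region_id
        FROM clients c LEFT JOIN regions r ON c.region_id = r.region_id
    """)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
```

The second query fails only because region_id exists in both tables; a column that exists in just one of them would still resolve without a qualifier.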

Suppose you are accessing two tables and both have a column with the same name, say 'Id'. In the query you can easily tell those columns apart using aliases, as in a.Id = b.Id, if the first table has the alias 'a' and the second 'b'. Otherwise it would be very difficult to tell which column belongs to which table, especially when the tables share several column names.

Select from table where names have same initial letters

I have been looking for a solution for this in SQL. I am trying to find records in one table that have the same first two characters and the same birth date. I thought about doing a self-join, but I doubt I am getting the right results. Here is my query, please tell me what's missing:
select p1.frst_name,
from person p1 inner join person p2
on upper(left(p1.frst_name,2)) like upper(left(p2.frst_name,2))
and upper(p1.last_name) LIKE upper(p2.last_name)
and p1.birth_date = p2.birth_date

Join on last_name and birth_date, since you want those to match exactly, then filter on the first two characters matching.
Whether you need upper() depends on the collation: MySQL's default collations compare strings case-insensitively, while Oracle's default comparison is case-sensitive, so keeping upper() is the safe choice.
Try...
select p1.frst_name
from person p1
-- inner join: MySQL has no FULL OUTER JOIN, and the WHERE filter would drop unmatched rows anyway
inner join person p2
on p1.last_name = p2.last_name
and p1.birth_date = p2.birth_date
where upper(left(p1.frst_name,2)) like upper(left(p2.frst_name,2))

Change LIKE to = (you want an exact match), and add a join condition to prevent rows from joining to themselves:
select p1.id, p1.frst_name, p1.last_name, p1.birth_date
from person p1
join person p2
on upper(left(p1.frst_name,2)) = upper(left(p2.frst_name,2))
and upper(p1.last_name) = upper(p2.last_name)
and p1.birth_date = p2.birth_date
and p1.id != p2.id
Without the addition of and p1.id != p2.id, every row would be returned, because of course every row would otherwise match itself.
The question was tagged with both mysql and oracle. The above query works in MySQL. For Oracle, which doesn't support left(col, 2), use substr(col, 1, 2) instead.
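As a quick check of the self-join approach, here is a sketch that runs it against a few made-up rows using Python's sqlite3 module. SQLite is only a stand-in for MySQL or Oracle here; it uses the substr(col, 1, 2) form. One deliberate variation, not part of the answer above: the last condition is p1.id < p2.id instead of p1.id != p2.id, so each matching pair is reported once rather than twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, frst_name TEXT,
                         last_name TEXT, birth_date TEXT);
    -- Hypothetical sample data
    INSERT INTO person VALUES
        (1, 'Jon',  'Doe', '1980-01-01'),
        (2, 'JOHN', 'doe', '1980-01-01'),
        (3, 'Jane', 'Doe', '1980-01-01'),
        (4, 'Jon',  'Doe', '1975-05-05');
""")

pairs = conn.execute("""
    SELECT p1.id, p2.id
    FROM person p1
    JOIN person p2
      ON upper(substr(p1.frst_name, 1, 2)) = upper(substr(p2.frst_name, 1, 2))
     AND upper(p1.last_name) = upper(p2.last_name)
     AND p1.birth_date = p2.birth_date
     AND p1.id < p2.id          -- skip self-matches and mirrored pairs
""").fetchall()
```

Only rows 1 and 2 pair up: row 3 starts with different letters and row 4 has a different birth date.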

return results of same ID from 2 tables

I'm using an open-source database, so its setup is a bit over my head.
It's basically like this:
A person's normal information is in the table 'person_per'.
There is custom information in the table 'person_custom'.
Both use 'per_ID' to organize.
select per_ID from person_custom where c3 like '2';
gives me the IDs of people who fit my search. I want to "join" (I think) their name, phone, etc. from the 'person_per' table using the ID as the "key" (terms I read that seem to fit).
How can I do that in a single query?

select per.*
from person_per per
inner join person_custom cus on cus.per_id = per.per_id
where cus.c3 = 2

You can retrieve columns from both tables with a single query (substitute your actual column names for the ones below):
SELECT p.name
, p.phone
, p.ect
, c.custom_col
FROM person_per p
JOIN person_custom c
ON c.per_ID = p.per_ID
WHERE c.c3 LIKE '2'
Use a JOIN operator between the table names, and include the "matching" criteria (predicate) in the ON clause.
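A runnable sketch of that join, using Python's sqlite3 module and invented sample rows. Only per_ID and c3 come from the question; the name and phone columns and the TEXT type of c3 are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: only per_ID and c3 are known from the question.
    CREATE TABLE person_per (per_ID INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE person_custom (per_ID INTEGER, c3 TEXT);
    INSERT INTO person_per VALUES
        (1, 'Alice', '555-0100'),
        (2, 'Bob',   '555-0101'),
        (3, 'Carol', '555-0102');
    INSERT INTO person_custom VALUES (1, '2'), (2, '5'), (3, '2');
""")

# The ON clause pairs rows by per_ID; the WHERE clause keeps the matches on c3.
matches = conn.execute("""
    SELECT p.per_ID, p.name, p.phone
    FROM person_per p
    JOIN person_custom c ON c.per_ID = p.per_ID
    WHERE c.c3 = ?
    ORDER BY p.per_ID
""", ("2",)).fetchall()
```

The search value is passed as a bound parameter rather than pasted into the SQL string, which is generally the safer habit when the value comes from user input.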

MySQL SQL statement with INNER JOIN

I have the following SQL statement
SELECT be.*, it.order_number
FROM songs AS be
INNER JOIN
(
SELECT song_id, order_number
FROM items
WHERE order_status = 1
) it
ON be.id = it.song_id
INNER JOIN orders AS or
ON it.order_number = or.order_number
WHERE be.active = 0
I can't seem to understand why this statement does not produce any results. When I remove the following line;
INNER JOIN orders AS or
ON it.order_number = or.order_number
It seems to produce results, and I know that the order_number does exist in the orders table, so it's clearly incorrect syntax, but I'm not sure where to go from here. Appreciate the help.

The problem in this particular instance is that or, used as the table alias in the query, is a reserved word. You can use it if you wish, but you'll have to quote it. MySQL quotes identifiers with backticks (double quotes only work as identifier quotes when the ANSI_QUOTES SQL mode is enabled), like so:
SELECT be.*, it.order_number
FROM songs AS be
INNER JOIN
(
SELECT song_id, order_number
FROM items
WHERE order_status = 1
) it
ON be.id = it.song_id
INNER JOIN orders AS `or`
ON it.order_number = `or`.order_number
WHERE be.active = 0
Generally though, for readability, I'd avoid such names. If you have to quote it or escape it, it's probably a bad name.
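The reserved-word issue is easy to reproduce. The sketch below uses Python's sqlite3 module with two invented tables; SQLite is a stand-in for MySQL, but it also treats OR as a keyword and accepts backtick-quoted identifiers for MySQL compatibility. The run helper is hypothetical and exists only for this demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (song_id INTEGER, order_number INTEGER);
    CREATE TABLE orders (order_number INTEGER PRIMARY KEY);
    INSERT INTO items VALUES (1, 100);
    INSERT INTO orders VALUES (100);
""")


def run(alias):
    """Run the join with the given table alias; return rows or the error text."""
    sql = (
        "SELECT it.song_id FROM items it "
        f"JOIN orders AS {alias} ON it.order_number = {alias}.order_number"
    )
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


unquoted = run("or")     # bare reserved word: rejected by the parser
quoted = run("`or`")     # backtick-quoted: accepted
renamed = run("o")       # a non-reserved alias avoids the problem entirely
```

The renamed variant is the one to prefer in practice, for the readability reason given above.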

SQL Server 2008 GROUP BY

I am stuck and I am trying to find a solution, so please help!
I have 3 tables:
Books (id, title, ...)
Authors (id, name, surname, ...)
book_authors (bookID, authorID, whatisdoing)
Now I want to retrieve all books depending on the title (user search) and all the other info (author name, surname). But, I want the book titles to be unique with only the first occurrence of the book_authors.whatisdoing to be shown.
In MS Access I achieved that with first function, but now first does not work and with min I didn't get the results I want.
Any help would be appreciated.
The query in Access was:
SELECT
First(book_authors.whatisdoing) AS FirstOfidiothta_ID,
First(authors.name) AS onoma,
First(authors.surname) AS eponymo,
books.ID, books.title, books.photoLink
FROM (books
INNER JOIN book_authors ON books.ID = book_authors.book_ID)
INNER JOIN authors ON book_authors.author_ID = authors.ID
GROUP BY
books.ID, books.title, books.photoLink, books.active
HAVING
(((books.title) Like '%" & textString & "%') AND
((books.active)=True) AND ((First(authors.active))=True))
ORDER BY
First(book_authors.whatisdoing), books.title

EDIT: Changed based on OP comments.
EDIT 2: Revised to correct flaws.
You might give this a try.
SELECT aba.name, aba.surname, aba.whatisdoing,
b.ID, b.title, b.photoLink
FROM Books b
JOIN
(
SELECT ba.bookID, ba.whatisdoing, a.name, a.surname,
ROW_NUMBER() OVER (PARTITION BY ba.bookID ORDER BY ba.whatisdoing) AS sequence
FROM book_authors ba
JOIN Authors a ON ba.authorID = a.id
WHERE a.active = 1
) AS aba ON b.id = aba.bookID
WHERE b.title LIKE '%' + @textString + '%'
AND b.active = 1
AND aba.sequence = 1
The ROW_NUMBER function numbers the rows within each group defined by the PARTITION BY clause, in the order defined by the ORDER BY clause, so sequence = 1 keeps only the first book_authors row for each book.
@textString is a SQL Server variable. I'm not sure how you would assign this in your situation.
The boolean data type is BIT in SQL Server, and the values are 0 for false and 1 for true.
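To see what ROW_NUMBER with PARTITION BY does, here is a cut-down sketch run through Python's sqlite3 module. It needs SQLite 3.25 or later for window functions. The data is invented, the tables are trimmed to a few columns, and whatisdoing is assumed to be a sortable number, which the question doesn't confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Trimmed-down, hypothetical versions of the three tables.
    CREATE TABLE Books (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Authors (id INTEGER PRIMARY KEY, surname TEXT);
    CREATE TABLE book_authors (bookID INTEGER, authorID INTEGER,
                               whatisdoing INTEGER);
    INSERT INTO Books VALUES (1, 'SQL Basics'), (2, 'Joins');
    INSERT INTO Authors VALUES (1, 'Adams'), (2, 'Baker');
    INSERT INTO book_authors VALUES (1, 2, 2), (1, 1, 1), (2, 2, 1);
""")

first_rows = conn.execute("""
    SELECT b.title, aba.surname, aba.whatisdoing
    FROM Books b
    JOIN (
        SELECT ba.bookID, ba.whatisdoing, a.surname,
               -- restart numbering at 1 for every book
               ROW_NUMBER() OVER (PARTITION BY ba.bookID
                                  ORDER BY ba.whatisdoing) AS seq
        FROM book_authors ba
        JOIN Authors a ON ba.authorID = a.id
    ) AS aba ON b.id = aba.bookID
    WHERE aba.seq = 1           -- keep only the first row per book
    ORDER BY b.id
""").fetchall()
```

Book 1 has two book_authors rows, but only the one with the lowest whatisdoing survives the seq = 1 filter, so each title appears exactly once.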
