Grouping by part of a postcode - mysql

I am developing a php/mysql system for a legal advice agency. Clients are recorded in a table ‘clients’ which contains (amongst others) the columns clientid and postcode. (Note this is in the UK)
A client then registers for advice and a matter is opened. The matters table contains columns matterid, client and legalid.
Legalid refers to a table ‘legal’ which has columns legalid and legal (legal is the type of legal advice e.g. employment, housing etc)
What I need to be able to do is to count the number of people receiving advice in particular areas grouped by the first part of the UK postcode. I think I can do this except I don’t know how to do the postcode grouping as the first part could be 2, 3 or 4 characters. For example, the postcodes might be E2 6TY or E14 7YU - I want to group by E2 and E14 etc. Also, in some cases a client doesn't want to provide the whole postcode so only the 1st part is entered.
Has anybody any guidance as to how I might be able to do this grouping?
Many thanks

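You can do
GROUP BY SUBSTRING_INDEX(postcode, ' ', 1)
SUBSTRING_INDEX returns everything before the first space, and the whole string when there is no space, so postcodes entered as only the first part are grouped correctly too.
There are lots of posts on Stack Overflow where people use and give examples of SUBSTRING_INDEX, so just do a search for it.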
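If you want to check the grouping logic outside the database, here is a minimal Python sketch of the same idea. The function names are made up for illustration, and it assumes postcodes are typed with a space between the two parts (a postcode entered as "E26TY" would not be split).

```python
from collections import Counter


def outward_code(postcode: str) -> str:
    """Return the first part of a UK postcode, e.g. 'E2' from 'E2 6TY'.

    If only the first part was entered, it is returned unchanged,
    which mirrors what SUBSTRING_INDEX(postcode, ' ', 1) does.
    """
    parts = postcode.strip().upper().split()
    return parts[0] if parts else ""


def count_by_outward_code(postcodes):
    """Count how many postcodes fall under each first part."""
    return Counter(outward_code(p) for p in postcodes)


if __name__ == "__main__":
    sample = ["E2 6TY", "E14 7YU", "e2 8aa", "E14"]
    for code, total in sorted(count_by_outward_code(sample).items()):
        print(code, total)
```

Upper-casing and trimming first avoids "e2" and "E2 " ending up in separate groups; in MySQL you would wrap the column in UPPER(TRIM(...)) for the same effect.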

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database , there are the following examples:
NAME DOB
Doe, John 1990-01-01
Doe, John A 1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this returns a large number of rows whose Name values are not similar at all. Is there any way I can narrow down my results to include only similar (but not exact duplicate!) Name values? It seems impossible, since I do not know exactly which rows have similar names, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to have known some content. Ideally, I would like to have the query limit results to rows with the first 3 or 4 characters being the same in the "Name" column.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have exactly similar names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.
This is a very large topic and effort depends on what you consider as "similar" and what the structure of the data is. For example are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing you can use several algorithms such as Jaro-Winkler, Levenshtein Distance, ngrams. But you should also consider "confidence" of match by looking at the other information as suggested above.
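To make the "narrow first, then compare names" idea concrete, here is a rough Python sketch that only compares names sharing the same DOB and scores them with character trigrams (Jaccard similarity). The function names and the 0.6 threshold are assumptions for illustration, not tuned values.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(text, n=3):
    """Set of character n-grams of the lower-cased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=3):
    """Share of n-grams the two strings have in common (0.0 to 1.0)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def similar_names_by_dob(rows, threshold=0.6):
    """rows: iterable of (name, dob). Only names with the same dob are compared."""
    by_dob = defaultdict(list)
    for name, dob in rows:
        by_dob[dob].append(name)
    matches = []
    for dob, names in by_dob.items():
        for a, b in combinations(names, 2):
            score = jaccard(a, b)
            # Skip exact duplicates; keep near matches only.
            if a != b and score >= threshold:
                matches.append((a, b, dob, round(score, 2)))
    return matches
```

Grouping by DOB first keeps the number of pairwise comparisons small, which is the point about resource use made above.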
The issue with matching addresses is that you have the same fuzzy logic problems, e.g. "1st" vs "first". So if going this route I would actually turn the addresses into GPS coordinates using another service and then accept records within X amount of distance.
And the age old issue with this is Matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in gender of name but then Terry, Tracy, etc can be either....
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as services by Melissa Data, or SQL Server Data Quality Services as a tool.
Update per comment about middle initial. If you always know the name will be the same except for the middle initial, then this task can be fairly simple and does not need any complicated algorithm. You could match based on one string + '%' being LIKE the other, then test that the lengths differ by exactly 2 and that the longer string has exactly 1 more space than the shorter one. Or you could make an attempt at cleansing/removing the middle initial; this can be a little complicated if the name has a space in it, e.g. Doe, Ann Marie. But you could do it by testing whether the 2nd to last character is a space.
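As a rough illustration of the middle initial check, here is a Python sketch. The function names are invented, and it assumes the longer name is the shorter name plus a space and a single letter (optionally followed by a full stop).

```python
from collections import defaultdict


def is_middle_initial_variant(short_name, long_name):
    """True if long_name is short_name plus ' ' and one letter, e.g. 'Doe, John A'."""
    prefix = short_name + " "
    if not long_name.startswith(prefix):
        return False
    rest = long_name[len(prefix):].rstrip(".")
    return len(rest) == 1 and rest.isalpha()


def find_middle_initial_pairs(rows):
    """rows: iterable of (name, dob). Returns (short, long, dob) for matching pairs."""
    by_dob = defaultdict(list)
    for name, dob in rows:
        by_dob[dob].append(name)
    pairs = []
    for dob, names in by_dob.items():
        for short_name in names:
            for long_name in names:
                if is_middle_initial_variant(short_name, long_name):
                    pairs.append((short_name, long_name, dob))
    return pairs
```

Names like "Doe, Ann Marie" are not flagged against "Doe, Ann", because the extra part is longer than one letter.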

Method to Match Multiple Columns Dynamically in one Table to another Table (e.g. Address, City, State, Zip)

I'm using SQL Server 2008 w T-SQL syntax. I know a little bit of C# and Python - so possibly could venture into those paths to make this work.
The Problem:
I have multiple databases where I have to match customers within each Database to a "Master Customer" file.
It's basically mapping those customers at those distributors to the supplier level.
There are 3-8 million customers for each Database (8 of them) that have to be matched to the Supplier Table (1800 customers).
I'd rather not have to do an Excel "Matching game" for about 3-4 weeks (30 million customers). I need some shortcuts as this is exhausting.
This is one Distributor Table:
select
master_cust_Num,
master_cust_name,
cust_shipto_name,
cust_shipto_address,
cust_shipto_address_2,
cust_shipto_city,
cust_shipto_state,
cust_shipto_zip
from
Distributor.sales
group by
master_cust_Num,
master_cust_name,
cust_shipto_name,
cust_shipto_address,
cust_shipto_address_2,
cust_shipto_city,
cust_shipto_state,
cust_shipto_zip
This is a small snippet of what the table yields:
And I'd have to match that to a table that looks like this:
Basically I need a function that will search the address lines in the Distributor DBs for a match or the closest match(es). I could do a "case when address like '%birch%'" to find all 'birch' street matches when distributor.zip_code = supplier.zip_code, but I'd rather not have to write each of those strings into the like statements.
Is there an XML trick that will pick out parts within the multiple distributor address columns to match that of the supplier address column? (or maybe 1 at a time, if that's easier)
Can I do a like '%[0-9][A-Z]%' type of search? I'm open to best practices (will award pts) on how to tackle this beast. I guess I'm not even sure how to tackle this other than brute force, by grouping by zip codes and working street matches from there.
The matching/searching 'like' function (or XML) or whatever would have to try to dynamically match one column say "Birch St" in the Supplier Address column to find all matches of "Birch Ave" "Birch St" "Birch Ave" "Birch" that had that same zip.
Do you have SQL Server Integration Services? If so, you could use the SSIS Fuzzy Lookup to look up customers in the master customer table by address.
If you're not familiar with that feature of SSIS, it will let you specify columns to compare, how many potential matches to return, and a threshold that a comparison has to meet to be reported to you.
Here's an article specifically about matching addresses: Using Fuzzy Lookup Transformations in SQL Server Integration Services
If you don't have it available ... well, then we start getting into implementing your own fuzzy-matching algorithm. Which is quite possible: it's just a bit more laborious.
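As a starting point for a home-grown approach, here is a minimal Python sketch of the "same zip, same street ignoring the suffix" idea from the question. It is not what SSIS Fuzzy Lookup does internally; the suffix list, function names and tuple layout are all assumptions for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, incomplete list of street-type words to ignore when comparing.
SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road", "dr", "drive",
            "blvd", "boulevard", "ln", "lane"}


def street_key(address):
    """Lower-case the address, drop punctuation and street-type words."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)


def match_by_zip(suppliers, distributors):
    """Both inputs: iterables of (id, address, zip). Returns (distributor_id, supplier_id)."""
    index = defaultdict(list)
    for supplier_id, address, zip_code in suppliers:
        index[(zip_code, street_key(address))].append(supplier_id)
    matches = []
    for distributor_id, address, zip_code in distributors:
        for supplier_id in index.get((zip_code, street_key(address)), []):
            matches.append((distributor_id, supplier_id))
    return matches
```

Indexing the 1800 supplier rows once means each of the millions of distributor rows costs a single dictionary lookup. Note that dropping "st" also drops it where it means "Saint", so the leftover unmatched rows would still need a fuzzier pass or a manual look.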

finding similar strings from mysql

I am trying to implement a customer list in an application backed by a MySQL database.
Currently we take the first and last name, DOB, and phone number of each person.
To search for a customer we may enter a phone number, and ideally it will find the person with that number. But I want a fail-safe: say the previous representative entered the phone number slightly incorrectly, for example the number is 7777777777 and the person accidentally typed 7777077777. I would still want it to show up in the query results, ordered by relevance, so 7777777777 would be the preferred, top result.
Select * from Customers WHERE Pnumb Like....
would be what I would expect however I would not know how to implement that.
Any help is greatly appreciated as I can not find much (any) information on this online.
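One possible way to get the "closest number first" ordering is edit distance (Levenshtein), computed in the application rather than in the LIKE clause. This is only a sketch: the function names and the max_distance cut-off are assumptions, and comparing every stored number this way may be too slow for a large table without first narrowing the candidates.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def rank_by_phone(query, numbers, max_distance=2):
    """Return stored numbers within max_distance edits of query, closest first."""
    scored = sorted((levenshtein(query, n), n) for n in numbers)
    return [n for distance, n in scored if distance <= max_distance]


if __name__ == "__main__":
    stored = ["7777077777", "7777777777", "1234567890"]
    print(rank_by_phone("7777777777", stored))
```

An exact match has distance 0, so it always sorts to the top, and a single mistyped digit has distance 1.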

Designing a quick-lookup address database

If I were to design a database in MySQL to the following specifications:
1) Over 25mil records
2) Columns of house number, street, town, city, postcode
3) street, town, city and postcode need to be fulltext-searchable (on the front-end side, the search will be running on AJAX off a text-input field with immediate drop-down results)
How would I design the above?
I was thinking with working with a single table - is this a bad idea? Am not sure whether to normalize across different tables, given this is address data. I am also thinking that if working with a single table, I would do a FULLTEXT index across the searchable fields.
I have not worked with such a large DB before. Is the above a bad idea?
UPDATE #1:
Decided to normalize the street and postcode columns, which are the only ones actually being searched on (re-checked the original specification). Did some quick math and cardinality of street name is 2% and postcodes 6% of total data set, so I think this is the best way forward.
Currently running the import of 29 million rows - will take about 5 hours. Will update again later on performance tests for the sake of wrapping up this question.
Your design sounds reasonable. But. Are you sure that the addresses in the database all conform to the " , , " format? What about "c/o" addresses ("care/of")? Unit/apartment/floor/suite numbers? What about specific building names ("Barack Obama, White House, Washington, DC")?
In the United States, there are various exceptions to this address layout. For example, there is something called "Rural Routes", whose format is "RR BOX " (described here). There are PO Boxes and military addresses. In fact, I just learned that the US Post Office has a publication describing all sorts of different address formats (here).
A more general form is something like "Address Line 1", "Address Line 2", "City", "Post Code". There are services that standardize addresses for much of the world, and even software available for this purpose.
Your idea of using full text search is a good idea. When looking for a partial match on a street name, for instance, it will be much faster.

De-dupe a list of hundreds of thousands of first name/last name/address/date of birth

I have a large data set which I know contains many duplicate records. Basically I have data on first name, last name, different address components and date of birth.
I think the best way to do this is to use the name and date of birth as chances are if these things match, it's the same person. There are probably lots of instances where there are slight differences in spelling (like typos missing a single letter) or use of name (ie: some might have a middle initial in first name column) which would be good to account for, but I'm not sure how to approach this.
Are there any tools or articles on going about this process? The data is all in a MySQL database and I have a basic proficiency in SQL.
You could get a sense of how much dedupe you have to do by something like:
select birthDate,last_name,soundex(first_name),count(*)
from table
group by birthDate,last_name,soundex(first_name)
having count(*) >1
This will list the people with the same birthdate, last_name, and similar first names. Soundex() isn't great, but this could help you get a sense of amount of deduping.
The query below would get you the alphabetically first first_name from each group of similarly named people. Hopefully this will give you some rough starting ideas.
select birthDate,last_name,soundex(first_name),min(first_name)
from table
group by birthDate,last_name,soundex(first_name)
having count(*) >1
With the second query, you could remove all occurrences of additional names, by using a DELETE where name not in, but that assumes you are willing to keep the lowest first_name and remove the rest...
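To show what the grouping does, here is a Python sketch of the same idea using a textbook American Soundex. MySQL's SOUNDEX() is not identical (for example, it can return codes longer than four characters), so treat this as an illustration of the approach rather than a reproduction of the query's exact output. The function names are made up.

```python
from collections import defaultdict

# Letter-to-digit table used by American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def duplicate_groups(rows):
    """rows: iterable of (birth_date, last_name, first_name).

    Returns {(birth_date, last_name, soundex): sorted first names} for groups
    with more than one row, like GROUP BY ... HAVING COUNT(*) > 1.
    """
    groups = defaultdict(list)
    for birth_date, last_name, first_name in rows:
        groups[(birth_date, last_name, soundex(first_name))].append(first_name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

The first entry of each sorted list corresponds to min(first_name) in the second query, i.e. the row you would keep if you go with the delete-the-rest approach.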