Multiple LIKE queries in Access - ms-access

I am using Access.
Scenario
At work I've got a table with around 300k rows which maps Person IDs to House IDs along with the associated information (First Name, Last Name, City, "Street + StreetNumber", Postal Code). Each person can live in n houses, and each house can be inhabited by n of those people.
When various persons visit me, I receive a table. This table is filled in by a human being, so there are no IDs in it, and it often has spelling errors and missing information. It should contain "First Name", "Last Name", "Street & Nr", "City", "Postal Code".
To integrate the data I need the persons' IDs. To counter the spelling-error problem I want to build a table which gives me results ordered by "matching priority".
The hand-filled table is called tbl_to_fill and has an empty Person ID column, an indexed AutoNumber, and First Name, Last Name, Street & Nr, City and Postal Code. The table with the relation information is called tbl_all.
So if I find an exact match (with a join query) from tbl_to_fill to tbl_all on "First Name", "Last Name" and "Postal Code", or on "First Name", "Last Name", "Street & Nr" and "City", it gets matching priority 1. If I find an exact match on only "Last Name" and "Postal Code", or on "Last Name", "City" and "Street & Nr", it gets matching priority 2. And there are a few more levels.
Then comes the tricky part:
Now I build a "tbl_filter" from "tbl_to_fill" with tweaked information: the street numbers are cut off, common spelling errors are replaced with a '*' (for example, a common spelling variation in German names is ph vs. f, as in Stefan and Stephan), city names are truncated after the last space " " found, and some more.
With this table I look for the same criteria as stated above, but with a "LIKE '*' & tbl_filter.Field & '*'" query. Those matches get the same matching priority as above, plus 10.
All those join queries and LIKE queries are then aggregated via a union query; let's call this Query 001 quni All rows.
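To illustrate the shape of that union, here is a minimal Access SQL sketch; the exact field names, the ID and [Person-ID] keys, and the priority values are assumptions based on the description above:

SELECT f.ID, a.[Person-ID], 1 AS matching_priority
FROM tbl_to_fill AS f INNER JOIN tbl_all AS a
    ON f.[First Name] = a.[First Name]
    AND f.[Last Name] = a.[Last Name]
    AND f.[Postal Code] = a.[Postal Code]
UNION ALL
SELECT f.ID, a.[Person-ID], 11 AS matching_priority
FROM tbl_filter AS f, tbl_all AS a
WHERE a.[Last Name] LIKE '*' & f.[Last Name] & '*'
    AND f.[Postal Code] = a.[Postal Code];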
I got this to work exactly like I want, but it takes AGES every time I run the last query.
My Question
Has someone done something like this before? What can I do to speed up the process?
As many of my matching criteria expect First Name and Last Name to match plus some more fields, should I first extract only matching rows from "tbl_all" via a make-table query and then run the corresponding queries against that?
Should I use a regex instead of LIKE queries on a field which contains all the information concatenated with a "-"?
Is there a better way to assign those priorities? Maybe all in one query via an IIf function?
SELECT ..., IIf(tbl_all.[First Name] = tbl_to_fill.[First Name], 1,
           IIf(...)
       ) AS matching_priority
FROM tbl_all;
I am a self-taught Access developer, so I often have problems knowing which approach is the most optimized.
I use VBA regularly and don't shy away from it, so if you have a solution via VBA, let me know.

I think you might simplify your approach a bit if you were to use fuzzy text search. The common way of doing this is to use the Levenshtein distance, i.e. the number of edits it takes to turn one string into another. A nice implementation of the Levenshtein distance is here:
Levenshtein Distance in Excel
In this manner, you can find the closest possible match on city, street, first name, last name, etc. You can even set up a "reasonable" limit, such as treating any record where the Levenshtein distance is > 10 as "unreasonable." I threw out 10, but it will vary depending on your data.
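If you paste a VBA implementation like the one linked above into a standard module as a public function (assumed here to be named Levenshtein), Access SQL can call it per row. A minimal sketch, with field names assumed from the question:

SELECT f.ID, a.[Person-ID],
       Levenshtein(f.[Last Name], a.[Last Name]) AS LastNameDist
FROM tbl_to_fill AS f, tbl_all AS a
WHERE f.[Postal Code] = a.[Postal Code]
    AND Levenshtein(f.[Last Name], a.[Last Name]) <= 3;

Note that the function is evaluated once per candidate row pair, which is exactly why the narrowing advice below matters.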
Some optimization notes:
Based on the fact that you have 300,000 rows, I would go so far as to say you still need to narrow your results a bit; reading all 300,000 records for each match is unreasonable. For example, if you had a state column (which I see you don't), then a reasonable limit would be to require that the state match. That takes your 300,000 down to a much lower number. You may also want to require that the first letter of the last name match, which narrows it down further. Etc., etc.
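A hedged sketch of that kind of narrowing in Access SQL (field names are assumptions); only the pairs this returns would then be scored with the fuzzy comparison:

SELECT f.ID, a.[Person-ID]
FROM tbl_to_fill AS f, tbl_all AS a
WHERE Left(f.[Last Name], 1) = Left(a.[Last Name], 1)
    AND f.[Postal Code] = a.[Postal Code];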
If you can, I would use a full RDBMS instead of Access to store the data and let the DB server do the heavy lifting. PostgreSQL, in particular, has nice fuzzy-search capabilities available through extensions.
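For instance, the fuzzystrmatch extension provides a levenshtein() function out of the box, and pg_trgm adds trigram similarity with index support. A minimal PostgreSQL sketch:

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT levenshtein('Stefan', 'Stephan');  -- returns 2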

One thing I have done in similar situations is to extract the first few characters of the last name, the first one or two characters of the first name, and the postal code, write them from both tables to temp tables, and do the matching query on the truncated tables. After some tinkering with how many characters to extract, I can usually find a balance between speed and false-positive matches that I can live with, then have a human review the resulting list. The speed difference can be significant: if instead of matching Schermerstien, Stephan to Schermerstien, Ste*an, you are matching Scher, St to Scher, St, you can see the processing advantage. But it only works if there is a small intersection between the tables and you can tolerate a human review step.
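In Access, such truncated keys could be built with a pair of make-table queries; a minimal sketch, with the temp table and field names assumed:

SELECT ID, Left([Last Name], 5) AS LastKey, Left([First Name], 2) AS FirstKey, [Postal Code]
INTO tmp_fill_keys
FROM tbl_to_fill;

SELECT [Person-ID], Left([Last Name], 5) AS LastKey, Left([First Name], 2) AS FirstKey, [Postal Code]
INTO tmp_all_keys
FROM tbl_all;

The matching query then joins tmp_fill_keys to tmp_all_keys on LastKey, FirstKey and [Postal Code], which can use indexes and avoids LIKE entirely.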

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return rows with similar values in the "Name" column.
My issue is that within my SQL database, there are rows like the following:
NAME         DOB
Doe, John    1990-01-01
Doe, John A  1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT Name, DOB, id, COUNT(*)
FROM Table
GROUP BY DOB
HAVING COUNT(*) > 1;
However, this results in an undesirable number of results where the Name is not similar at all. Is there any way I can narrow down my results to include only similar (but not exactly duplicate!) Names? It seems impossible, since I do not know exactly which rows have similar Names, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to have known some content. Ideally, I would like the query to limit results to rows where the first 3 or 4 characters of the "Name" column are the same.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend to do with these results is manually audit the rest of the information in each of the duplicate rows (each row has over 90 other columns that may or may not contain abstracted information that must be accurate) and then delete the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have exactly the same name up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.
This is a very large topic, and the effort depends on what you consider "similar" and on the structure of the data. For example, are you going to want to match Doe, Johnathan as well?
Several algorithms exist, but they can be extremely resource-intensive when matching on name alone if you have a large data set. That is why it typically works better to first narrow your possible matches using other attributes such as DOB, email, or address, and then compare names.
When comparing, you can use several algorithms such as Jaro-Winkler, Levenshtein distance, or n-grams. But you should also consider the "confidence" of a match by looking at the other information, as suggested above.
The issue with matching addresses is that you have the same fuzzy-logic problems: 1st vs. First. So if going this route, I would actually convert addresses into GPS coordinates using another service, then accept records within X amount of distance.
And the age-old issue with this is matching a husband and wife. I personally know a married couple who are both named Michael Hatfield. So you could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either...
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions like the services from Melissa Data or SQL Server Data Quality Services as a tool.
Update per the comment about middle initials: if you always know the name will be the same except for the middle initial, then this task can be fairly simple and does not need any complicated algorithm. You could match based on one string + '%' being LIKE the other, then test that the lengths differ by only 2 and that the longer string has one more space in it than the shorter one. Or you could make an attempt at cleansing/removing the middle initial; this can be a little complicated if the name has a space in it, like Doe, Ann Marie, but you could do it by testing whether the 2nd-to-last character is a space.
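A minimal MySQL sketch of that middle-initial match; the table name patients is an illustrative stand-in for the question's real table, while Name, DOB and id come from the question:

SELECT a.id AS id_without_initial, b.id AS id_with_initial, a.Name, b.Name
FROM patients AS a
JOIN patients AS b
    ON a.DOB = b.DOB
    AND b.Name LIKE CONCAT(a.Name, ' %')                -- b starts with a's full name
    AND CHAR_LENGTH(b.Name) = CHAR_LENGTH(a.Name) + 2;  -- a space plus one initial

This pairs 'Doe, John' with 'Doe, John A' while skipping exact duplicates and unrelated names.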

How to perform inexact matches on two data sets

I'm trying to compare two data sets (vendor masters) from two systems. We are moving to one system, so we want to avoid duplication. The issue is that the names, addresses, etc. could be slightly different: for example, the name might end in 'Inc' or 'Inc.', or the address could be 'St' or 'Street'. The vendor masters have been dumped to Excel, so I was thinking about pulling them into Access to compare them, but I'm not sure how to handle the inexact matches. The data fields I need to compare are: name, address, telephone number, federal tax ID (if populated), and contact name.
Here is how I would proceed. You will rarely get answers like this on Stack Exchange, since your question is not focused enough. This is a rather generic set of steps, not specific to a particular tool (i.e. database or spreadsheet). As I said in my comments, you'll need to search for specific answers (or ask new questions) about the particular tools you use as you go. Without knowing all the details: Access can certainly be useful in doing some preliminary matching, but you could also use Excel directly, or even Oracle SQL since you have it as a resource.
Back up your data.
Make a copy of your data for matching purposes.
Ensure that each record for both sets of data have a unique key (i.e. AutoNumber field or similar), so that until you have a confirmed match the records can always be separately identified.
Create new matched-key table and/or fields containing the list of matched unique key values.
Create new "matching" fields and copy your key fields into these new fields.
Scrub the data in all possible matching fields (a concrete sketch follows this list) by:
Removing periods and other punctuation
Choosing standard abbreviations and replacing all variations by the same value in all records. Example: replace "Incorporation" and "Inc." with "Inc"
Trimming excess spaces from the ends and between terms
Formatting all phone numbers exactly the same way, or better yet removing all spaces and punctuation for comparison purposes, excluding extension information: ##########
Parsing and splitting multi-term fields into separate fields: Name -> First, Middle, Last Name fields; Address -> street number, street name, extra address info.
The parsing process itself can identify and reconcile formatting differences.
Allows easier matching on terms separately.
Etc., etc.
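As a concrete example of scrubbing, a hedged sketch (the table tblMatchA and its fields are assumptions; Replace and Trim are available in Access, and most databases have equivalents):

UPDATE tblMatchA
SET MatchName = Replace(Replace(Trim([MatchName]), ".", ""), "Incorporated", "Inc"),
    MatchPhone = Replace(Replace(Replace([MatchPhone], "-", ""), "(", ""), ")", "");

Keeping each cleanup rule in its own Replace call (or its own query) makes the steps easy to verify and undo.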
Once the matching fields are sufficiently scrubbed, match on the different fields.
Define matching priorities, that is which field or fields are likely to produce reliable matches with the least amount of uncertainty.
For records containing Tax ID numbers, that seems like the most logical place to start, since an exact match on that number should either be valid or indicate mistakes in your data.
For each type of match, update the matched-key fields mentioned above
For each successive matching query, exclude records that already have a match in the matched-key table/fields (see the sketch after this list).
Refine and repeat all these steps until you are satisfied that all matches have been found.
Add all non-matched records to your final merged record set.
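A hedged sketch of one matching pass combined with the exclusion rule mentioned above, using assumed table and field names (tblMatchA, tblMatchB, tblMatchedKeys, KeyA, KeyB, TaxID):

INSERT INTO tblMatchedKeys (KeyA, KeyB)
SELECT a.KeyA, b.KeyB
FROM tblMatchA AS a INNER JOIN tblMatchB AS b
    ON a.TaxID = b.TaxID
WHERE a.TaxID IS NOT NULL
    AND a.KeyA NOT IN (SELECT KeyA FROM tblMatchedKeys);

Each later, fuzzier pass uses the same NOT IN filter so it only works on the records that are still unmatched.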
You never said how many records you have. If possible, it may be worth your organization's time to manually verify the automated matches by listing them side by side and tweaking them when needed.
But even if you successfully pair non-exact matches, someone still needs to decide which record to keep for the merged system. I imagine you might have matches on company name and tax ID, essentially verifying the match, but still have different addresses and/or contact names. There is no technical answer that will help you know which data to keep or discard. Once again, human review should be done to finalize the merged records. If you set this up correctly, a couple of human eyeballs could probably go through thousands of records in just a day.

Method to Match Multiple Columns Dynamically in one Table to another Table (e.g. Address, City, State, Zip)

I'm using SQL Server 2008 with T-SQL syntax. I know a little bit of C# and Python, so I could possibly venture down those paths to make this work.
The Problem:
I have multiple databases where I have to match customers within each Database to a "Master Customer" file.
It's basically mapping those customers at those distributors to the supplier level.
There are 3-8 million customers in each database (8 of them) that have to be matched to the Supplier Table (1800 customers).
I'd rather not have to play an Excel "matching game" for about 3-4 weeks (30 million customers). I need some shortcuts, as doing this by hand is exhausting.
This is one Distributor Table:
select
master_cust_Num,
master_cust_name,
cust_shipto_name,
cust_shipto_address,
cust_shipto_address_2,
cust_shipto_city,
cust_shipto_state,
cust_shipto_zip
from
Distributor.sales
group by
master_cust_Num,
master_cust_name,
cust_shipto_name,
cust_shipto_address,
cust_shipto_address_2,
cust_shipto_city,
cust_shipto_state,
cust_shipto_zip
This is a small snippet of what the table yields (screenshot not included):
And I'd have to match that to a table that looks like this (screenshot not included):
Basically I need a function that will search the address lines in the Distributor DBs for a match or the closest match(es). I could do a case when address like '%birch%' to find all 'Birch' street matches where distributor.zip_code = supplier.zip_code, but I'd rather not have to write each of those strings into the LIKE statements by hand.
Is there an XML trick that will pick out parts of the multiple distributor address columns to match against the supplier address column? (Or maybe one at a time, if that's easier.)
Can I do a like '%[0-9][A-Z]%' type of search? I'm open to best practices (will award pts) on how to tackle this beast. I guess I'm not even sure how to tackle this other than brute force: grouping by zip codes and working out street matches from there.
The matching/searching LIKE function (or XML, or whatever) would have to dynamically match one column, say "Birch St" in the Supplier Address column, finding all matches of "Birch Ave", "Birch St", "Birch" that have that same zip.
Do you have SQL Server Integration Services? If so, you could use the SSIS Fuzzy Lookup to look up customers in the master customer table by address.
If you're not familiar with that feature of SSIS, it will let you specify columns to compare, how many potential matches to return, and a threshold that a comparison has to meet to be reported to you.
Here's an article specifically about matching addresses: Using Fuzzy Lookup Transformations in SQL Server Integration Services
If you don't have it available ... well, then we start getting into implementing your own fuzzy-matching algorithm. That is quite possible, just a bit more laborious.
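If you end up rolling your own, T-SQL's built-in SOUNDEX-based DIFFERENCE() function (0 = no similarity, 4 = strong similarity) can serve as a crude first cut. A hedged sketch, with the Supplier table and its column names assumed:

SELECT d.cust_shipto_name, s.supplier_name,
       DIFFERENCE(d.cust_shipto_address, s.supplier_address) AS addr_score
FROM Distributor.sales AS d
JOIN Supplier AS s
    ON d.cust_shipto_zip = s.supplier_zip
WHERE DIFFERENCE(d.cust_shipto_address, s.supplier_address) >= 3;

Narrowing by zip first keeps the fuzzy comparison from running across all 3-8 million rows.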

Intelligent Comparison based Update - Access / VBA

I need to intelligently perform updates on an Access table.
Expert VBA / intelligent thinking would be required.
Table1 (for reference only)
CompanyCode: Text
RegionCategory: Number (1-99)
RegionCount: Number (0-25000)
Table2
InvoiceNumber: Number
CompanyCode: Text
NumRows: Number
RegionCode: four-digit Number
ConfirmationRemark: Y/N
Our objective is to put a Yes or No in the 'ConfirmationRemark' column.
Rules :
1. Select only those InvoiceNumbers which have exactly two rows in Table2 with different RegionCodes. These will have the same CompanyCode. RegionCategory is the first two digits of RegionCode.
2. For these two invoices, the difference between the two RegionCategory values must be greater than two.
3. Look up the RegionCount from Table1.
Decision Making :
We are now basically comparing two Invoices with different RegionCodes.
The idea is that the invoice with the higher RegionCount is the one to be marked Yes.
1. The difference between the RegionCounts must be considerable. 'Considerable': I am trying to determine what the right number would be; let us take 500 for now.
2. The invoice with the lower RegionCount should have a RegionCount of zero (best case) or very, very low. If the invoice with the lower RegionCount has a high value (> 200), then we cannot conclude successfully.
3. NumRows is preferred to be 1, or less than the other invoice's. This comparison is not mandatory, hence we shall have a provision to skip this check. Mark the other invoice as 'N'.
You have many ways to approach that type of complex update.
If you are lucky, you may be able to craft a SQL UPDATE statement that can include all the changes, but often you will have to resort to a combination of SELECT queries and custom VBA to filter them based on the results of calculations or lookups involving other data.
A few hints
Often, we tend to think about a problem in terms of 'what are the steps to get to the data that match the criteria'.
Sometimes, though, it's easier to turn the problem on its head and instead ask yourself 'what are the steps to get to the data that do not match the criteria'.
Because in your case the result is boolean, true or false, you could simply set the ConfirmationRemark field to True for all records and then update only those that should be set to False, instead of the other way around.
Break down each step (as you did) and try to find the simplest SELECT query that will return just the data you need for that step. If a step is too complex, break it down further.
Compose your broken down SELECT statements together to slowly build a more complex query that tends toward your goal.
Once you have gone as far as you can, either construct an UPDATE Table2 SET ConfirmationRemark=True WHERE InvoiceNumber IN (SELECT InvoiceNumber ....), or use VBA to go through the recordset of results from your complex SELECT statement and do some more checks before you update the field in code.
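For instance, rule 1 alone (exactly two rows per invoice, with different RegionCodes) could be isolated with a sketch like this; Min <> Max stands in for COUNT(DISTINCT), which Access SQL lacks:

SELECT InvoiceNumber
FROM Table2
GROUP BY InvoiceNumber
HAVING Count(*) = 2 AND Min(RegionCode) <> Max(RegionCode);

This SELECT can then be composed into the UPDATE ... WHERE InvoiceNumber IN (...) shell above, or opened as a recordset in VBA for the remaining checks.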
Some issues
Unfortunately, despite your efforts to document your situation, there are not enough details for us to really help:
you do not mention which fields are the primary keys (from what you say, it seems that Table2 could have multiple records with identical InvoiceNumbers)
the type of data you are dealing with is not obvious. You should include a sample of the data and identify which records should end up with ConfirmationRemark set.
your problem is really too localised, meaning it is too specific to your situation to be of value to anyone else, although I think that with a bit more detail your question could be of interest, if only to show an example of how to approach complex data updates in Access.

Searching a database of names

I have a MySQL database containing the names of a large collection of people. Each person in the database could have one or all of the following name types: first, last, middle, maiden or nick. I want to provide a way for people to search this database to see if a person exists in it.
Are there any off the shelf products that would be suited to searching a database of peoples names?
With a bit of ingenuity, MySQL will do just what you need. The following gives a few ideas of how this could be accomplished.
Your table: (I call it tblPersons)
PersonID (primary key of sorts)
First
Last
Middle
Maiden
Nick
Other columns for extra info (address, whatever...)
By keeping the table as-is, and building an index on each of the name-related columns, the following query provides an inefficient but plausible way of finding all persons whose name matches a particular name somehow (Jack in the example).
SELECT * from tblPersons
WHERE First = 'Jack' OR Last = 'Jack' OR Middle = 'Jack'
OR Maiden = 'Jack' OR Nick = 'Jack'
Note that the application is not bound to searching for a single name value across all the various name types. The user can also input a specific set of criteria, for example the First Name 'John', the Last Name 'Lennon' and the Profession 'Artist' (if such info is stored in the db), etc.
Also, note that even with this single-table approach, one of the features of your application could be to let the user tell the search logic whether this is a "given" name (like Paul, Samantha or Fatima) or a "surname" (like Black, McQueen or Dupont). The main purpose of this is that there are names that can be either (for example Lewis or Hillary), and by optionally being a bit more specific in their query, the end users can get SQL to automatically weed out many irrelevant records. We'll get back to this kind of feature in the context of an alternative, more efficient database layout.
Introducing a "Names" table.
Instead of (or in addition to) storing the various names in the tblPersons table, we can introduce an extra table and relate it to tblPersons.
tblNames
PersonID (used to relate with tblPersons)
NameType (single letter code, say F, L, M, U, N for First, Last...)
Name
We'd then have ONE record in tblPersons for each individual, but as many records in tblNames as they have names. When someone doesn't have a particular name (few people, for example, have a nickname), there is no need for a corresponding record in tblNames.
Then the query would become
SELECT [DISTINCT] * from tblPersons P
JOIN tblNames N ON N.PersonID = P.PersonID
WHERE N.Name = 'Jack'
Such a layout/structure would be more efficient. Furthermore, this query lends itself to offering the "given" vs. "surname" capability easily, just by adding to the WHERE clause:
AND N.NameType IN ('F', 'M', 'N') -- for the "given" names
(or)
AND N.NameType IN ('L', 'U', 'N') -- for the "surname" types. Note that we put
                                  -- the nickname in there, but could just as easily remove it.
Another benefit of this approach is that it would allow storing other kinds of names: for example, the SOUNDEX form of every name could be added under its own NameType, making it easy to find names even if the spelling is approximate.
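A hedged MySQL sketch of that idea, with 'S' as an assumed NameType code for the SOUNDEX form:

INSERT INTO tblNames (PersonID, NameType, Name)
SELECT PersonID, 'S', SOUNDEX(Name)
FROM tblNames
WHERE NameType = 'L';

SELECT P.* FROM tblPersons P
JOIN tblNames N ON N.PersonID = P.PersonID
WHERE N.NameType = 'S' AND N.Name = SOUNDEX('Jakc');  -- still finds 'Jack' despite the typo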
Finally, another improvement could be to introduce a separate lookup table containing the most common short forms of given names (Pete for Peter, Jack for John, Bill for William, etc.) and to use it for search purposes. (The name columns used for display would remain as provided in the source data, but the extra lookup/normalization at the search level would increase recall.)
You shouldn't need to buy a product to search a database; databases are built to handle queries.
Have you tried running your own queries on it? For example (I'm imagining what the schema looks like):
SELECT * FROM names WHERE first_name='Matt' AND last_name='Way';
If you've tried running some queries, what problems did you encounter that make you want to try a different solution?
What does the schema look like?
How many rows are there?
Have you tried indexing the data in any way?
Please provide some more information to help answer your question.