Grouping similar field data in MySQL

In MySQL, I have a table that accepts common data from multiple input channels and consists of ~100,000 rows.
One of the fields stores the name of an employee's functional manager. In the organisation, there are ~100 of these functional managers.
The issue I have is that, because there are multiple input channels, different reporting systems have used different name formats for these managers.
For example, John Smith could be stored as:
John Smith
Smith, John
Smith John
This is a bit of a nightmare now, as we are looking to use this functional manager field as a mechanism for reporting, which means we would need to sort or group by individual functional managers.
The data becomes legacy after each quarter, so we are happy to clean and format the functional manager field.
The question is: is there a simple way to group these managers even though their names are in different formats? I am looking for a way that does not involve going one by one through each functional manager with a statement like this:
UPDATE tablename SET fm_name = "John Smith" WHERE fm_name LIKE "%John%" AND fm_name LIKE "%Smith%";
For example, programmatically, I could take the first record, break the name into its first and last name strings, then match similar records and update them, then move on to the next record. Is something like that possible in MySQL, or would I be better off doing it in the layer above?
Any suggestions would be greatly appreciated.

If you can come up with a normalizing function name_normalize(string) that yields George H. W. Bush given either that exact input or Bush, George H. W., then you can do
GROUP BY name_normalize(name)
and get what you want without mucking around with the data in your table.
This is such a function. It hacks around with MySQL's string functions. https://dev.mysql.com/doc/refman/5.7/en/string-functions.html
IF(LOCATE(',', @name1) = 0, -- need to change?
  @name1, -- no, return original
  LEFT(CONCAT_WS(' ', -- yes, concatenate...
      TRIM(SUBSTRING_INDEX(@name1, ',', -1)), -- after last ,
      @name1), -- whole name
    LENGTH( -- cut to original name length
      REPLACE(@name1, ',', '')))) -- but without the comma
Substitute the name of your column for @name1. And beware, this is sensitive to the number of spaces after the comma: it assumes exactly one.
You'd be wise to define this function as a stored function. For one thing, you can handle the odd cases better. For another, it's kind of long to write in a query.
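If the cleanup is done in the layer above the database instead, an order-insensitive key is one way to treat John Smith, Smith, John and Smith John as the same manager. The following is a minimal Python sketch under that assumption; name_key and group_names are hypothetical helper names, not part of the answer above. Note that it would also merge two different people whose names are each other's reversal.

```python
import re
from collections import defaultdict


def name_key(raw: str) -> tuple:
    """Build an order-insensitive key: lowercase letter-only tokens, sorted."""
    tokens = re.findall(r"[^\W\d_]+", raw.lower())
    return tuple(sorted(tokens))


def group_names(names):
    """Map each key to the list of raw spellings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


groups = group_names(["John Smith", "Smith, John", "Smith John", "Jane Doe"])
for key, spellings in groups.items():
    print(key, spellings)
```

Each group can then be reviewed once and rewritten to a single canonical spelling with one UPDATE per group, rather than one hand-written statement per manager.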

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database, there are the following examples:
NAME DOB
Doe, John 1990-01-01
Doe, John A 1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this returns far too many rows whose Name values are not similar at all. Is there any way I can narrow down my results to include only similar (but not exactly duplicate!) Name values? It seems impossible, since I do not know exactly which rows have a similar Name, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to know some of the content. Ideally, I would like the query to limit results to rows where the first 3 or 4 characters of the "Name" column are the same.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have identical names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, and those names included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the rows from that first subset, the ones that have middle initials.
This is a very large topic and effort depends on what you consider as "similar" and what the structure of the data is. For example are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing, you can use several algorithms such as Jaro-Winkler, Levenshtein distance, or n-grams. But you should also consider the "confidence" of a match by looking at the other information, as suggested above.
The issue with matching addresses is that you have the same fuzzy logic problems: 1st vs. first. So if going this route, I would actually convert addresses into GPS coordinates using another service and then accept records within X amount of distance.
And the age-old issue with this is matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either....
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as the services offered by Melissa Data, or SQL Server Data Quality Services as a tool.....
Update per comment about middle initial: if you always know the names will be identical except for the middle initial, this task is fairly simple and needs no complicated algorithm. You could match rows where the longer name is LIKE the shorter name + '%', then check that the lengths differ by exactly 2 and that the longer string contains exactly one more space than the shorter one. Or you could try to cleanse/remove the middle initial; this gets a little complicated if the given name itself contains a space, as in Doe, Ann Marie, but you could handle it by testing whether the second-to-last character is a space.
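The cleansing approach can be sketched in Python as follows. This is a sketch under stated assumptions, not the answerer's code: rows are assumed to arrive as (id, name, dob) tuples, and strip_middle_initial and near_duplicates are hypothetical helper names. A trailing token counts as a middle initial only if it is a single character (optionally followed by a period) and at least two other tokens precede it.

```python
from collections import defaultdict


def strip_middle_initial(name: str) -> str:
    """Drop a trailing one-letter initial, e.g. 'Doe, John A' -> 'Doe, John'."""
    parts = name.strip().split()
    if len(parts) > 2 and len(parts[-1].rstrip(".")) == 1:
        parts = parts[:-1]
    return " ".join(parts)


def near_duplicates(rows):
    """Group rows sharing DOB and base name; keep groups whose raw names differ."""
    buckets = defaultdict(list)
    for row_id, name, dob in rows:
        key = (dob, strip_middle_initial(name).lower())
        buckets[key].append((row_id, name))
    return [grp for grp in buckets.values() if len({n for _, n in grp}) > 1]


rows = [
    (1, "Doe, John", "1990-01-01"),
    (2, "Doe, John A", "1990-01-01"),
    (3, "Roe, Ann Marie", "1985-05-05"),
]
print(near_duplicates(rows))
```

Bucketing on DOB first follows the advice above to narrow candidates with another attribute before comparing names; exact duplicates are deliberately excluded because the question asks for similar but not identical names.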

MS-Access: Text column transformations to clean data with no unique ID

I have about ten data sources I'm trying to aggregate in an Access DB to feed a set of Tableau dashboards. The files all contain employee data, the problem is, Employee_Name is inconsistent across the files, and only one file has a unique ID, so I can't perform any of the joins I need to.
The best solution is of course to get the source data with a common Employee_ID across all files, but I don't know if/when I can get that.
Currently, the name formats are as follows
FISHER, BOBBY M
FISHER BOBBY
FISHER, BOBBY M L
Fisher, Bobby M
Fisher Bobby M
Bobby M Fisher
Bobby Fisher
Bobby Fisher (note: two spaces)
Fisher Bobby M Jr.
And just to make it really fun:
Fisher, Bob Jr.
So all these names are equivalent, and all join under the same Employee_ID if that existed.
I know I can write an expression like StrConv(Replace(Replace([Employee Name],",",""),".",""),3) to handle some of the inconsistencies, but even if I do that for every table, it still won't catch Bob and Bobby, and I still have to string split and concatenate to end up with a "somewhat" robust, consistent Employee_Name to join on.
I could create a lookup table for each table to assign a unique ID, but that's a terrible solution as soon as you start adding more people to the original data.
Does anyone have any other ideas on how to approach this, or do I really just need to insist that I get the unique IDs, and that otherwise a sustainable solution isn't really possible.
Here is where I would start: remove all dots, commas and extra spaces, then split the name into substrings and keep the two longest ones. Those are your first and last name, hopefully. Compare each pair, checking for reversals: (StringA1=StringA2 AND StringB1=StringB2) OR (StringA1=StringB2 AND StringB1=StringA2). That should get you all the matches on full first and last name.
Depending on the size of your data, there may be a small enough remainder of unresolved matches to be done manually. If not you have to start checking things like StringA1 LIKE "*" & StringA2 & "*" instead of equality.
As long as the goal is to do as much as possible with code and fix the rest manually, that should get you almost there. If you want a fully automated, repeatable process you are probably better off waiting for the full original data.
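As a rough illustration of the "two longest substrings" idea, here is a Python sketch rather than Access code; two_longest_key is a hypothetical helper name. Using a set as the key makes the reversal check implicit, since a set ignores order.

```python
import re


def two_longest_key(raw: str) -> frozenset:
    """Strip dots and commas, then keep the two longest tokens as an unordered key."""
    cleaned = re.sub(r"[.,]", " ", raw).lower()
    tokens = cleaned.split()  # split() also collapses repeated spaces
    longest_two = sorted(tokens, key=len, reverse=True)[:2]
    return frozenset(longest_two)


variants = [
    "FISHER, BOBBY M",
    "FISHER BOBBY",
    "FISHER, BOBBY M L",
    "Fisher, Bobby M",
    "Fisher Bobby M",
    "Bobby M Fisher",
    "Bobby Fisher",
    "Bobby  Fisher",
    "Fisher Bobby M Jr.",
]
keys = {two_longest_key(v) for v in variants}
print(keys)
```

All nine variants collapse to the same key, while Fisher, Bob Jr. yields a different one, which is the kind of remainder that would still need manual review or a LIKE-style comparison.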

Is it possible to add an array value to a row in a MySql database

I am trying to build a CRUD operation with a form. Suppose a student is asked to register for, say, 4 subjects for an exam and has to choose them from a list. Is it possible for those subjects to appear in a single row in one column, or do I have to create a separate table for that?
It's generally good practice to add a separate table for that so that you can then use the information to, for example, find out which students need to take the economics exam. If you put comma separated values into a column, it's harder to get that information back.
If you're using newer versions of MySQL (5.7 or later), you could also check out the JSON column type which caters for storing more than just a single value in one column - but I'd still recommend using a separate table in most cases for good data design. Hope that helps!
It is possible. You could for instance write a string 'Math,Literature,Arts'.
However, you should only do this when you are never interested in the separate parts. If you never need a query to ask how many students registered for 'Math' or whether student 123 registered for 'Arts' etc., then no problem.
This would be a very rare case, though. Usually you are interested in the separate subjects, so store them separately, i.e. have a Student table, a subject table, and a student_subject table.
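The three-table layout can be sketched as below. To keep the example self-contained and runnable it uses Python's built-in sqlite3 module as a stand-in for MySQL; the column names and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- One row per (student, subject) registration.
CREATE TABLE student_subject (
    student_id INTEGER NOT NULL REFERENCES student(id),
    subject_id INTEGER NOT NULL REFERENCES subject(id),
    PRIMARY KEY (student_id, subject_id)
);
INSERT INTO student VALUES (123, 'Alice'), (124, 'Bob');
INSERT INTO subject VALUES (1, 'Math'), (2, 'Literature'), (3, 'Arts');
INSERT INTO student_subject VALUES (123, 1), (123, 3), (124, 1);
""")

# How many students registered for Math?
math_count = conn.execute(
    "SELECT COUNT(*) FROM student_subject ss "
    "JOIN subject s ON s.id = ss.subject_id WHERE s.name = ?",
    ("Math",),
).fetchone()[0]

# Did student 123 register for Arts?
takes_arts = conn.execute(
    "SELECT EXISTS (SELECT 1 FROM student_subject ss "
    "JOIN subject s ON s.id = ss.subject_id "
    "WHERE ss.student_id = ? AND s.name = ?)",
    (123, "Arts"),
).fetchone()[0] == 1

print(math_count, takes_arts)
```

Both questions become plain joins, whereas with a comma-separated string they would require string matching inside the column.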

The most efficient sql schema for searching names and lastnames

I'm creating a list of members on my site, and I want to enable them to look for each other by first name and last name, or by either one. The catch is that a user can have several names, such as given names and nicknames, and a person can also have more than one last name: their maiden name and then the last name after marriage.
Once users fill out their names and last names, each user could have several of each. For example, there could be a person with 3 names and 2 last names: names Eleonora, Ela, El and last names Smith, Brown.
Then if someone looks for Ela Brown, Eleonora Brown, Eleonora Smith or any other combination, they should find this person.
My question is: how should I set this all up in SQL (MySQL) so that the schema and search are efficient and fast? I didn't want to reinvent the wheel, so I turned to the pros and am asking the question here.
Thanks guys
P.S. I guess the standard solution would be to have a user table, an fname table, an lname table, a userfname table with userid and fnameid, and a userlname table with userid and lnameid, but I'm not sure if this is the best way to do this and whether or not search would be fast...
Do you need to differentiate between first names and last names?
I would suggest a Users table having UserID, and also a UsersNames table having UserID and Name, in a one-to-many relationship.
If you need, you could also add an IsLastName bit to the UsersNames table (or just a LastName column, but the bit is better imho)....
But this way you search one table and can easily locate user ID's, plus you don't limit the number of names each user can have.
EDIT:
You could easily take your input string and split it out too. So if somebody put in "John Smith" you could search for both or either name simply by splitting the string and using it in the WHERE clause using either OR or AND depending on your intended functionality.
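A minimal sketch of this layout and of the split-and-search step follows. It uses Python's built-in sqlite3 module in place of MySQL so it can run on its own; the table and column names (users, user_names, is_last_name) and the find_users helper are hypothetical, and the sample data reuses the Eleonora example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY);
CREATE TABLE user_names (
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    name TEXT NOT NULL COLLATE NOCASE,
    is_last_name INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_user_names_name ON user_names(name);
INSERT INTO users VALUES (1), (2);
INSERT INTO user_names VALUES
    (1, 'Eleonora', 0), (1, 'Ela', 0), (1, 'El', 0),
    (1, 'Smith', 1), (1, 'Brown', 1),
    (2, 'John', 0), (2, 'Smith', 1);
""")


def find_users(query: str, match_all: bool = True) -> list:
    """Split the input on whitespace; require all terms (AND) or any term (OR)."""
    terms = sorted({t.lower() for t in query.split()})
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    needed = len(terms) if match_all else 1
    sql = f"""
        SELECT user_id FROM user_names
        WHERE name IN ({placeholders})
        GROUP BY user_id
        HAVING COUNT(DISTINCT LOWER(name)) >= ?
        ORDER BY user_id
    """
    return [row[0] for row in conn.execute(sql, (*terms, needed))]


print(find_users("Ela Brown"))
print(find_users("John Brown", match_all=False))
```

The match_all flag corresponds to the AND versus OR choice mentioned above: with AND a user must own every search term, with OR a single matching name is enough.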
The last time I did something like this, I processed each name into a single column in a NAMES table: all names, first/last/middle. A second table holds a link to the person record in the PERSONS table.
So each NAME field gets linked to one or more PERSONS records. If I search for "Scott" I would find the name Scott in the NAMES table, find the links in the NAMES_TO_PERSONS(/PEOPLE?) table and then return all the records for that name, i.e. Scott Bruns, John Scott, David Scott Smith.
It worked very well with only a small amount of pre processing.
Text searching is what you need - use Lucene. I've used Lucene on several projects and it's truly amazing - not hard to use and ridiculously fast.
If, in your data model, users may have multiple but a bounded number of name types, then the simplest solution would be to create indices on each column that stores a name type. You would add a field for first name, last name, nickname, maiden name, etc. This model would be more performant than having a one-to-many names association.
You may also evaluate if there are general search requirements for the rest of the application or if you would like the search to be more flexible. In this case you can look into using a backend indexing process, such as with Lucene or using full text search. Initially, I would try to avoid this if possible, because it certainly complicates the project.

Searching a database of names

I have a MySQL database containing the names of a large collection of people. Each person in the database could have one or all of the following name types: first, last, middle, maiden or nick. I want to provide a way for people to search this database to see if a person exists in it.
Are there any off-the-shelf products that would be suited to searching a database of people's names?
With a bit of ingenuity, MySQL will do just what you need... The following gives a few ideas how this could be accomplished.
Your table: (I call it tblPersons)
PersonID (primary key of sorts)
First
Last
Middle
Maiden
Nick
Other columns for extra info (address, whatever...)
By keeping the table as-is, and building an index on each of the name-related columns, the following query provides an inefficient but plausible way of finding all persons whose name matches somehow a particular name. (Jack in the example)
SELECT * from tblPersons
WHERE First = 'Jack' OR Last = 'Jack' OR Middle = 'Jack'
OR Maiden = 'Jack' OR Nick = 'Jack'
Note that the application is not bound to only searching for one name value to be sought in all the various name types. The User can also input a specific set of criteria for example to search for the First Name 'John' and Last Name 'Lennon' and the Profession 'Artist' (if such info is stored in the db) etc.
Also, note that even with this single-table approach, one of the features of your application could be to let the user tell the search logic whether this is a "given" name (like Paul, Samantha or Fatima) or a "surname" (like Black, McQueen or Dupont). The main purpose of this is that there are names that can be either (for example Lewis or Hillary), and by being, optionally, a bit more specific in their query, the end users can get SQL to automatically weed out many irrelevant records. We'll get back to this kind of feature in the context of an alternative, more efficient database layout.
Introducing a "Names" table.
Instead of (or in addition to...) storing the various names in the tblPersons table, we can introduce an extra table and relate it to tblPersons.
tblNames
PersonID (used to relate with tblPersons)
NameType (single letter code, say F, L, M, U, N for First, Last...)
Name
We'd then have ONE record in tblPersons for each individual, but as many records in tblNames as they have names. When a person doesn't have a particular name (few people have a nickname, for example), there is no need for a corresponding record in tblNames.
Then the query would become
SELECT [DISTINCT] * from tblPersons P
JOIN tblNames N ON N.PersonID = P.PersonID
WHERE N.Name = 'Jack'
Such a layout/structure would be more efficient. Furthermore this query would lend itself to offer the "given" vs. "surname" capability easily, just by adding to the WHERE clause
AND N.NameType IN ('F', 'M', 'N') -- for the "given" names
(or)
AND N.NameType IN ('L', 'U', 'N') -- for the "surname" types. Note that
-- we put Nick name in there, but could just as easily remove it.
Another benefit of this approach is that it would allow storing other kinds of names in there. For example, the SOUNDEX form of every name could be added under its own NameType(s), making it easy to find names even if the spelling is approximate.
Finally, another improvement could be to introduce a separate lookup table containing the most common abbreviations of given names (Pete for Peter, Jack for John, Bill for William etc.), and to use this for search purposes. The name columns used for providing the display values would remain as provided in the source data, but the extra lookup/normalization at the level of the search would increase recall.
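The tblNames layout, the NameType filter and the abbreviation lookup can be tried out together with the sketch below. It runs on Python's built-in sqlite3 module instead of MySQL; the tblNicknames table, its column names, the search helper and the sample people are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblPersons (PersonID INTEGER PRIMARY KEY);
CREATE TABLE tblNames (
    PersonID INTEGER NOT NULL REFERENCES tblPersons(PersonID),
    NameType TEXT NOT NULL,              -- F, L, M, U (maiden), N (nick)
    Name TEXT NOT NULL COLLATE NOCASE
);
CREATE INDEX idx_names ON tblNames(Name, NameType);
-- Hypothetical lookup table: common short form -> full given name.
CREATE TABLE tblNicknames (
    Nickname TEXT NOT NULL COLLATE NOCASE,
    Canonical TEXT NOT NULL COLLATE NOCASE
);
INSERT INTO tblPersons VALUES (1), (2), (3);
INSERT INTO tblNames VALUES
    (1, 'F', 'John'), (1, 'L', 'Kennedy'), (1, 'N', 'Jack'),
    (2, 'F', 'Jack'), (2, 'L', 'Black'),
    (3, 'F', 'Peter'), (3, 'L', 'Jack');
INSERT INTO tblNicknames VALUES
    ('Pete', 'Peter'), ('Jack', 'John'), ('Bill', 'William');
""")

GIVEN = ("F", "M", "N")
SURNAME = ("L", "U", "N")


def search(name: str, name_types=None) -> list:
    """Find PersonIDs by name, also trying the full form of a known short form."""
    terms = [name] + [
        row[0] for row in conn.execute(
            "SELECT Canonical FROM tblNicknames WHERE Nickname = ?", (name,))
    ]
    sql = ("SELECT DISTINCT P.PersonID FROM tblPersons P "
           "JOIN tblNames N ON N.PersonID = P.PersonID "
           f"WHERE N.Name IN ({', '.join('?' for _ in terms)})")
    params = list(terms)
    if name_types:
        sql += f" AND N.NameType IN ({', '.join('?' for _ in name_types)})"
        params.extend(name_types)
    sql += " ORDER BY P.PersonID"
    return [row[0] for row in conn.execute(sql, params)]


print(search("Jack"))
print(search("Jack", SURNAME))
```

Because the expansion happens only at search time, the stored names stay exactly as supplied in the source data.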
You shouldn't need to buy a product to search a database, databases are built to handle queries.
Have you tried running your own queries on it? For example: (I'm imagining what the schema looks like)
SELECT * FROM names WHERE first_name='Matt' AND last_name='Way';
If you've tried running some queries, what problems did you encounter that makes you want to try a different solution?
What does the schema look like?
How many rows are there?
Have you tried indexing the data in any way?
Please provide some more information to help answer your question.