Any way to compare/match sentences with only a different word order? - mysql

I have two MySQL tables, each containing company address data. One table is more recent but has no telephone or website data. I want to merge them into a single table that is both recent and complete.
But for some companies the order of the words in the name differs, like this:
'Bakery Johnson' in table 1 and 'Johnson Bakery' in table 2.
Now I need to find a way to compare these values, as they're obviously the same company.
I think I will somehow have to split those names first, and then order the different parts alphabetically.
Has anybody done something like this before, and would you be willing to share some code or a function?
UPDATE:
I found a function that sorts words inside a string. I can use this to detect name swaps as described above. It's quite SLOW though...
See : MySQL: how to sort the words in a string using a stored function?
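A minimal sketch of the split-and-sort idea from the question, done in application code rather than in a stored function. It is written in Python purely for illustration; `word_key` and `match_by_word_key` are hypothetical helper names, and the sample rows are made up.

```python
import re

def word_key(name: str) -> str:
    """Return a canonical key: lowercase words, sorted alphabetically."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(words))

def match_by_word_key(recent, older):
    """Pair rows from two lists of (id, name) whose names share a word key."""
    index = {}
    for old_id, old_name in older:
        index.setdefault(word_key(old_name), []).append(old_id)
    return [
        (new_id, old_id)
        for new_id, new_name in recent
        for old_id in index.get(word_key(new_name), [])
    ]

# Made-up sample data standing in for the two company tables.
recent = [(1, "Bakery Johnson"), (2, "Smith & Sons")]
older = [(10, "Johnson Bakery"), (11, "Sons, Smith &"), (12, "Johnson Butchers")]

pairs = match_by_word_key(recent, older)
print(pairs)
```

Since the key depends only on the name, it could be computed once per row and stored in an extra indexed column, which would avoid re-sorting the words on every comparison.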

If your table is MyISAM you can run this query:
SELECT *
FROM mytable
WHERE MATCH(name) AGAINST ('+bakery +johnson' IN BOOLEAN MODE)
This will find all records containing the words bakery and johnson (and probably some other words too).
Creating a FULLTEXT index on the table:
CREATE FULLTEXT INDEX fx_mytable_name
ON mytable (name)
will speed up this query.
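To run this kind of search for every company name, the search string has to be built from the name itself. Here is a small sketch of that step in Python; `boolean_mode_query` is a hypothetical helper, and it assumes the `+` prefix is interpreted in boolean mode, where it marks a word as required.

```python
import re

def boolean_mode_query(name: str) -> str:
    """Build a search string that requires every word of `name` to be present.

    Punctuation is dropped so that characters with a special meaning in
    boolean mode (such as + - * " ( )) cannot leak into the search string.
    """
    words = re.findall(r"\w+", name.lower())
    return " ".join("+" + word for word in words)

print(boolean_mode_query("Bakery Johnson"))
```

The result would be passed as a bound parameter to AGAINST (... IN BOOLEAN MODE). Keep in mind that very short words and stopwords may not be indexed at all, depending on the server's full-text settings, so names made of short words may need a fallback.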

Stepping back a bit from your solution, you could handle this the way modern phones resolve duplicate contact names.
When the user finds something suspicious, you present them with the option:
Is this a duplicate? Use our [ Merge ] option
You are merging Bakery Johnson, please select the source/original item:
[ Johnson Bakery v ] (my amazing dropdown!)
Everything not already in Johnson Bakery gets ported to Bakery Johnson (orders, for example). You may also show an intermediate screen displaying what will be merged, or let the user pick: for example, they may want the address info from Johnson Bakery and the orders from both.
It is not self-correcting as you asked, but collaboration from the users may be more accurate than AI here. I also love low-tech solutions like this, so let us know what you ended up doing.

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that my SQL database contains rows like the following:
NAME           DOB
Doe, John      1990-01-01
Doe, John A    1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT Name, DOB, id, COUNT(*)
FROM Table
GROUP BY DOB
HAVING COUNT(*) > 1;
However, this returns a large number of rows whose Name values are not similar at all. Is there any way I can narrow down my results to include only rows with a similar (but not exactly duplicate!) Name? It seems impossible, since I do not know exactly which rows have a similar Name, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to know some of the content. Ideally, I would like the query to limit results to rows where the first 3 or 4 characters of the "Name" column are the same.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have identical names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, and those names included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the rows from that earlier subset that have middle initials.
This is a very large topic, and the effort depends on what you consider "similar" and on how the data is structured. For example, are you going to want to match Doe, Johnathan as well?
Several algorithms exist, but they can be extremely resource-intensive when matching on name alone in a large data set. That is why it typically works better to first narrow the possible matches using other attributes such as DOB, email or address, and only then compare names.
For the comparison itself you can use algorithms such as Jaro-Winkler, Levenshtein distance or n-grams. But you should also consider the "confidence" of a match by looking at the other information, as suggested above.
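A rough sketch of that two-step approach: group rows by DOB first, then compare names only within each group using Levenshtein distance. It is written in Python for illustration; the function names, the sample rows and the `max_distance=2` threshold are all assumptions, not something taken from the question.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similar_pairs(rows, max_distance=2):
    """rows: iterable of (id, name, dob). Names are compared only within one DOB."""
    blocks = defaultdict(list)
    for row_id, name, dob in rows:
        blocks[dob].append((row_id, name))
    result = []
    for candidates in blocks.values():
        for (id_a, name_a), (id_b, name_b) in combinations(candidates, 2):
            distance = levenshtein(name_a.lower(), name_b.lower())
            if 0 < distance <= max_distance:  # similar, but not identical
                result.append((id_a, id_b, distance))
    return result

# Made-up sample rows.
rows = [
    (1, "Doe, John", "1990-01-01"),
    (2, "Doe, John A", "1990-01-01"),
    (3, "Smith, Anna", "1990-01-01"),
    (4, "Doe, John A", "1985-06-15"),
]
print(similar_pairs(rows))
```

Grouping by DOB keeps the number of pairwise comparisons small, which is the point made above about narrowing the candidates before comparing names.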
The issue with matching on addresses is that you run into the same fuzzy-logic problems, for example "1st" vs "first". So if going this route, I would convert the addresses into GPS coordinates using another service and then accept records within X amount of distance.
And the age-old issue with this is matching a husband and wife. I personally know a married couple both named Michael Hatfield. You could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either...
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as the services offered by Melissa Data, or SQL Server Data Quality Services as a tool.
Update per the comment about middle initials: if you always know the name will be the same except for the middle initial, the task is fairly simple and needs no complicated algorithm. You could match on one string + '%' being LIKE the other, then check that the lengths differ by exactly 2 and that the longer string has exactly one more space than the shorter one. Alternatively, you could try to cleanse the data by removing the middle initial. This gets a little complicated when the name itself contains a space, as in Doe, Ann Marie, but you can handle it by testing whether the second-to-last character is a space.
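The middle-initial check can be sketched as a self-join. The example below runs against an in-memory SQLite database from Python so it can be tried without a server; the `patients` table and its rows are made up. It folds the length and space checks into a single LIKE pattern: the shorter name followed by a space and exactly one character (`_`). In MySQL the pattern would be built with CONCAT(s.name, ' _') rather than `||`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO patients (name, dob) VALUES (?, ?)",
    [
        ("Doe, John", "1990-01-01"),
        ("Doe, John A", "1990-01-01"),
        ("Doe, Johnathan", "1990-01-01"),
        ("Doe, Ann Marie", "1980-02-02"),
        ("Doe, Ann", "1980-02-02"),
    ],
)

# s = row without a middle initial, l = row with one.
matches = conn.execute(
    """
    SELECT s.id, s.name, l.id, l.name
    FROM patients AS s
    JOIN patients AS l
      ON l.dob = s.dob
     AND l.name LIKE s.name || ' _'
    ORDER BY s.id, l.id
    """
).fetchall()

for row in matches:
    print(row)
```

Note that any `%` or `_` already present in a name would be treated as a wildcard, so such names would need escaping first.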

Infinite Sub Category Ordering in MySQL

Given this table
Is it possible to write a SELECT query that produces an order such as this, and that also works regardless of the number of subcategories in the table? I can write queries that fetch the correct order for a hard-coded number of expected subcategories, but I run into difficulty when that number is unknown.
Miscellaneous
Personal
Bobby
Jane
Susie
Tom
Other
Work Lunches
Or would I have to adjust the schema?
Well, the best way is to manage the hierarchical data with the "nested set" model:
http://en.wikipedia.org/wiki/Nested_set_model
It may seem alien at first, but it is just fantastic once you get your head around it. (Yes, I am using it, and it works great.)
Of course this means you have to change your schema a bit (to include the left and right values), and the selects/inserts/updates are different. But you can select or re-attach whole branches in one go very easily.
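To give an idea of what this looks like in practice, here is a small sketch using an in-memory SQLite database from Python. The `categories` table, the `lft`/`rgt` column names and the hierarchy itself are assumptions made for the example; the category names are borrowed from the question. Ordering by the left value walks the tree depth-first at any depth, and counting the enclosing ancestors gives the indentation level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, lft INTEGER, rgt INTEGER)"
)
# Rows are inserted out of order on purpose; lft/rgt alone define the tree.
conn.executemany(
    "INSERT INTO categories (name, lft, rgt) VALUES (?, ?, ?)",
    [
        ("Work Lunches", 13, 14),
        ("Personal", 3, 12),
        ("Tom", 10, 11),
        ("Bobby", 4, 5),
        ("Miscellaneous", 1, 2),
        ("Susie", 8, 9),
        ("Jane", 6, 7),
    ],
)

# Each node joins to itself and to every ancestor whose range encloses it.
ordered = conn.execute(
    """
    SELECT node.name, COUNT(parent.id) - 1 AS depth
    FROM categories AS node
    JOIN categories AS parent
      ON node.lft BETWEEN parent.lft AND parent.rgt
    GROUP BY node.id, node.name, node.lft
    ORDER BY node.lft
    """
).fetchall()

for name, depth in ordered:
    print("    " * depth + name)

# A whole branch in one query: everything strictly inside Personal's range (3, 12).
subtree = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM categories WHERE lft > 3 AND rgt < 12 ORDER BY lft"
    )
]
```

The price is paid on writes: inserting or moving a node means renumbering the left and right values of the rows that come after it.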

MySQL table design, one row or more pr user?

Using MySQL, I have a table of users, a table of matches (updated with the actual results) and a table called users_picks. At first there will always be 10 football matches per gameweek per league, because there is only one league for now, but more leagues will come along eventually, and some of them have only 8 matches per gameweek.
In the users_picks table, should I store each 'pick' (by pick I mean both 'hometeam score' and 'awayteam score') in a separate row, or keep all 10 picks in one single row? Either way there would be a FK for user and gameweek. Putting all picks in one row would mean columns with appended numbers, like this:
Option 1: [pick_id, user_id, league_id, gameweek_id, match1_hometeam_score, match1_awayteam_score, match2_hometeam_score, match2_awayteam_score ... etc]
That option doesn't quite fill me with joy and looks a bit stupid, especially since there would be lots of potential NULLs in the db. The second option would eventually mean millions of rows, but would look like this:
Option 2: [pick_id, user_id, league_id, gameweek_id, match_id, hometeam_score, awayteam_score]
What's the best practice? And would it be a PITA to do all sorts of statistics using the second option, e.g. calculating how many matches a user has hit correctly in a specific round, how many all-time correct hits, etc.?
If I'm not making much sense, I'll gladly elaborate. I just want my table design to be good from the start, so I won't have a huge headache in a couple of months.
Thanks in advance.
The second choice is much better than the first. This is called database normalisation and makes querying easier, not harder. I would suggest reading the linked article, and the related descriptions of the various "normal forms", and aiming for a 3rd Normal Form data structure as a minimum.
To see the flaw in your first option, imagine that a new league with 11 matches per gameweek is added later. Or 400.
You should read up about database normalization.
When you have a 1:n relation, like in your case one team having many matches, you would create two tables. One table "teams" and a second table "matches" where each row includes the ID of the team which played the match.
In the same manner you should also have separate tables for users, picks and leagues.
Option two is better, provided you INDEX your table properly, since (as you indicate) it will grow quite large. The pick_id is the primary key, but also create an INDEX on the user_id field, as likely the most common query will be
SELECT * FROM `users_picks` WHERE `user_id`=?;
to get all the picks for a given user.
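To illustrate that statistics stay simple with one row per pick, here is a small sketch run against an in-memory SQLite database from Python. The `matches` column names (`home_score`, `away_score`), the index name and all sample rows are assumptions made for the example; `users_picks` follows Option 2 from the question, trimmed to the columns the queries need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matches (
    match_id INTEGER PRIMARY KEY,
    gameweek_id INTEGER,
    home_score INTEGER,
    away_score INTEGER
);
CREATE TABLE users_picks (
    pick_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    match_id INTEGER,
    hometeam_score INTEGER,
    awayteam_score INTEGER,
    UNIQUE (user_id, match_id)   -- one pick per user per match
);
CREATE INDEX idx_users_picks_user ON users_picks (user_id);
""")
conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 1), (2, 1, 0, 0), (3, 2, 3, 2)])
conn.executemany(
    "INSERT INTO users_picks (user_id, match_id, hometeam_score, awayteam_score) "
    "VALUES (?, ?, ?, ?)",
    [(7, 1, 2, 1), (7, 2, 1, 0), (7, 3, 3, 2), (8, 1, 1, 1), (8, 2, 0, 0)],
)

# Correct picks per user per gameweek.
per_gameweek = conn.execute("""
    SELECT p.user_id, m.gameweek_id, COUNT(*) AS correct
    FROM users_picks AS p
    JOIN matches AS m ON m.match_id = p.match_id
    WHERE p.hometeam_score = m.home_score
      AND p.awayteam_score = m.away_score
    GROUP BY p.user_id, m.gameweek_id
    ORDER BY p.user_id, m.gameweek_id
""").fetchall()

# All-time correct picks per user: same join, coarser grouping.
all_time = conn.execute("""
    SELECT p.user_id, COUNT(*) AS correct
    FROM users_picks AS p
    JOIN matches AS m ON m.match_id = p.match_id
    WHERE p.hometeam_score = m.home_score
      AND p.awayteam_score = m.away_score
    GROUP BY p.user_id
    ORDER BY p.user_id
""").fetchall()

print(per_gameweek, all_time)
```

Both queries work unchanged whether a gameweek has 8, 10 or 400 matches. Users with no correct picks simply do not appear in the result; a LEFT JOIN from the users table would be needed to list them with a zero.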

The most efficient sql schema for searching names and lastnames

I'm creating a list of members on my site, and I want to enable them to look for each other by first name, last name, or both. The catch is that a user can have several first names, such as given names plus nicknames, and also more than one last name, such as a maiden name and a married name.
Once users fill out their names, each user could have several first and last names. For example, there could be a person with 3 names and 2 last names: names Eleonora, Ela, El and last names Smith, Brown.
Then if someone looks for Ela Brown, Eleonora Brown, Eleonora Smith or any other combination, they should find this person.
My question is how I should set this all up in SQL (MySQL) so that the schema and the search are efficient and fast. I didn't want to reinvent the wheel, so I turned to the pros and am asking here.
Thanks guys
P.S. I guess the standard solution would be to have a user table, an fname table, an lname table, a userfname table with userid and fnameid, and a userlname table with userid and lnameid, but I'm not sure whether this is the best way to do it or whether search would be fast...
Do you need to differentiate between first names and last names?
I would suggest a Users table with a UserID,
and a UsersNames table with UserID and Name, in a one-to-many relationship.
If you need to, you could also add an IsLastName bit to the UsersNames table (or just a LastName column, but the bit is better imho)...
This way you search a single table and can easily locate user IDs, plus you don't limit the number of names each user can have.
EDIT:
You could easily take your input string and split it out too. So if somebody put in "John Smith" you could search for both or either name simply by splitting the string and using it in the WHERE clause using either OR or AND depending on your intended functionality.
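A minimal sketch of the split-and-search idea, run against an in-memory SQLite database from Python. The `users_names` table name, the `search_users` helper and the sample data are assumptions for illustration. The `require_all` flag switches between AND behaviour (every word must match one of the user's names) and OR behaviour (any word may match).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_names (user_id INTEGER, name TEXT)")
conn.execute("CREATE INDEX idx_users_names_name ON users_names (name)")
conn.executemany(
    "INSERT INTO users_names VALUES (?, ?)",
    [
        (1, "Eleonora"), (1, "Ela"), (1, "El"), (1, "Smith"), (1, "Brown"),
        (2, "Ela"), (2, "Jones"),
        (3, "John"), (3, "Brown"),
    ],
)

def search_users(text, require_all=True):
    """Return the ids of users whose names match the words in `text`."""
    tokens = sorted(set(text.lower().split()))
    if not tokens:
        return []
    placeholders = ", ".join("?" for _ in tokens)
    # AND: every distinct token must be matched; OR: one match is enough.
    needed = len(tokens) if require_all else 1
    sql = (
        "SELECT user_id FROM users_names "
        f"WHERE lower(name) IN ({placeholders}) "
        "GROUP BY user_id "
        "HAVING COUNT(DISTINCT lower(name)) >= ? "
        "ORDER BY user_id"
    )
    return [row[0] for row in conn.execute(sql, tokens + [needed])]

print(search_users("Ela Brown"))
print(search_users("Ela Brown", require_all=False))
```

Wrapping the column in lower() is only there because SQLite compares text case-sensitively by default, and it keeps the query from using the index on `name`. With a case-insensitive collation on the column, the plain column could be compared instead.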
The last time I did something like this, I processed each name into a single column in a NAMES table: all names, first/last/middle. A second table holds a link to the person record in the PERSONS table.
So each NAME field gets linked to one or more PERSONS records. If I search for "Scott", I find the name Scott in the NAMES table, follow the links in the NAMES_TO_PERSONS(/PEOPLE?) table and then return all the records for that name, i.e. Scott Bruns, John Scott, David Scott Smith.
It worked very well with only a small amount of pre processing.
Text searching is what you need - use Lucene. I've used Lucene on several projects and it's truly amazing - not hard to use and ridiculously fast.
If in your data model the users may have multiple, but a bounded number of, name types, then the simplest solution would be to create an index for each column that stores a name type. You would add a field for first name, last name, nickname, maiden name, etc. This model would be more performant than having a one-to-many names association.
You may also evaluate if there are general search requirements for the rest of the application or if you would like the search to be more flexible. In this case you can look into using a backend indexing process, such as with Lucene or using full text search. Initially, I would try to avoid this if possible, because it certainly complicates the project.

Searching a database of names

I have a MySQL database containing the names of a large collection of people. Each person in the database could have one or all of the following name types: first, last, middle, maiden or nick. I want to provide a way for people to search this database to see whether a person exists in it.
Are there any off-the-shelf products suited to searching a database of people's names?
With a bit of ingenuity, MySQL will do just what you need... The following gives a few ideas how this could be accomplished.
Your table: (I call it tblPersons)
PersonID (primary key of sorts)
First
Last
Middle
Maiden
Nick
Other columns for extra info (address, whatever...)
By keeping the table as-is, and building an index on each of the name-related columns, the following query provides an inefficient but plausible way of finding all persons whose name matches somehow a particular name. (Jack in the example)
SELECT * from tblPersons
WHERE First = 'Jack' OR Last = 'Jack' OR Middle = 'Jack'
OR Maiden = 'Jack' OR Nick = 'Jack'
Note that the application is not limited to searching for a single name value across all the name types. The user can also enter a specific set of criteria, for example First Name 'John', Last Name 'Lennon' and Profession 'Artist' (if such info is stored in the db).
Also, note that even with this single table approach, one of the features of your application could be to let the user tell the search logic whether this is a "given" name (like Paul, Samantha or Fatima) or a "surname" (like Black, McQueen or Dupont). The main purpose of this is that there are names that can be either (for example Lewis or Hillary), and by being, optionally, a bit more specific in their query, the end users can get SQL to automatically weed-out many irrelevant records. We'll get back to this kind of feature, in the context of alternative, more efficient database layout.
Introducing a "Names" table.
Instead of (or in addition to) storing the various names in the tblPersons table, we can introduce an extra table and relate it to tblPersons.
tblNames
PersonID (used to relate with tblPersons)
NameType (single letter code, say F, L, M, U, N for First, Last...)
Name
We'd then have ONE record in tblPersons for each individual, but as many records in tblNames as they have names. When a person lacks a particular name (few people have a nickname, for example), there is simply no corresponding record in tblNames.
Then the query would become
SELECT DISTINCT P.* FROM tblPersons P
JOIN tblNames N ON N.PersonID = P.PersonID
WHERE N.Name = 'Jack'
Such a layout/structure would be more efficient. Furthermore this query would lend itself to offer the "given" vs. "surname" capability easily, just by adding to the WHERE clause
AND N.NameType IN ('F', 'M', 'N') -- for the "given" names
(or)
AND N.NameType IN ('L', 'U', 'N') -- for the "surname" types. Note that
-- we put Nick name in there, but could just as easily remove it.
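Putting the join and the optional NameType filter together, a sketch could look like this. It runs against an in-memory SQLite database from Python; the `find_persons` helper, the index and the sample people are made up, while the table names and type codes follow the layout above.

```python
import sqlite3

# Type codes as described above; "N" (nickname) is allowed in both groups.
KINDS = {"given": ("F", "M", "N"), "surname": ("L", "U", "N")}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblPersons (PersonID INTEGER PRIMARY KEY);
CREATE TABLE tblNames (PersonID INTEGER, NameType TEXT, Name TEXT);
CREATE INDEX idx_names_name ON tblNames (Name, NameType);
""")
people = {
    1: [("F", "Jack"), ("L", "Black")],
    2: [("F", "Lewis"), ("L", "Hamilton")],
    3: [("F", "Jerry"), ("L", "Lewis")],
    4: [("F", "John"), ("N", "Jack"), ("L", "Kennedy")],
}
for person_id, names in people.items():
    conn.execute("INSERT INTO tblPersons (PersonID) VALUES (?)", (person_id,))
    conn.executemany(
        "INSERT INTO tblNames VALUES (?, ?, ?)",
        [(person_id, name_type, name) for name_type, name in names],
    )

def find_persons(name, kind=None):
    """Return PersonIDs with a matching name, optionally limited to a name kind."""
    sql = (
        "SELECT DISTINCT P.PersonID FROM tblPersons P "
        "JOIN tblNames N ON N.PersonID = P.PersonID "
        "WHERE N.Name = ?"
    )
    params = [name]
    if kind is not None:
        types = KINDS[kind]
        sql += " AND N.NameType IN (" + ", ".join("?" for _ in types) + ")"
        params.extend(types)
    sql += " ORDER BY P.PersonID"
    return [row[0] for row in conn.execute(sql, params)]

print(find_persons("Lewis"), find_persons("Lewis", "given"))
```

Searching for Lewis without a kind returns both people; adding the kind narrows it to the one who carries Lewis as a given name or as a surname.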
Another benefit of this approach is that it allows storing other kinds of names as well. For example, the SOUNDEX form of every name could be added under its own NameType(s), making it easy to find names even when the spelling is approximate.
Finally, another improvement could be to introduce a separate lookup table containing the most common abbreviations of given names (Pete for Peter, Jack for John, Bill for William, etc.) and to use it for search purposes. The name columns used for display would remain as provided in the source data, but the extra lookup/normalization at search time would increase recall.
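The abbreviation lookup can be sketched in a few lines. In this Python example a dictionary stands in for the lookup table; it holds only the three pairs mentioned above, and the helper names are hypothetical.

```python
# Stand-in for the lookup table: abbreviation -> canonical given name.
# Only the pairs mentioned above; a real table would be much larger.
NICKNAMES = {"pete": "peter", "jack": "john", "bill": "william"}

def canonical_given_name(name: str) -> str:
    """Map a given name to its canonical form for search purposes only."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

def same_given_name(a: str, b: str) -> bool:
    """True if both names normalize to the same canonical given name."""
    return canonical_given_name(a) == canonical_given_name(b)

print(same_given_name("Jack", "John"))
print(same_given_name("Pete", "Paul"))
```

The stored names are left untouched. Only the search term and the compared value are normalized, which matches the idea of keeping the display values as provided.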
You shouldn't need to buy a product to search a database; databases are built to handle queries.
Have you tried running your own queries on it? For example: (I'm imagining what the schema looks like)
SELECT * FROM names WHERE first_name='Matt' AND last_name='Way';
If you've tried running some queries, what problems did you encounter that makes you want to try a different solution?
What does the schema look like?
How many rows are there?
Have you tried indexing the data in any way?
Please provide some more information to help answer your question.