I'm not a Natural Language Processing student, yet I know it's not a trivial strcmp(n1, n2).
Here's what I've learned so far:
Comparing personal names can't be solved 100%.
There are ways to achieve a certain degree of accuracy.
The answer will be locale-specific; that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship between 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but to compare 2 names. e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))

def name_compare(name1, name2):
    return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
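For illustration, here is a minimal sketch of American Soundex in Python (my own hedged implementation; battle-tested versions exist in libraries such as jellyfish):

def soundex(name):
    # Letter groups and their Soundex digits.
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    digit = {ch: d for letters, d in codes.items() for ch in letters}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    out, prev = name[0].upper(), digit.get(name[0], "")
    for ch in name[1:]:
        d = digit.get(ch, "")
        if d and d != prev:
            out += d
        if ch not in "hw":  # h and w do not separate equal codes
            prev = d
    return (out + "000")[:4]

# soundex("Tsakala") -> 'T224'; soundex("Robert") == soundex("Rupert") -> True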
We've just been doing this sort of work non-stop lately and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-English names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match to Berry we would flip the word order around and then get the match.
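A rough Python sketch of that approach (the alias table here is a tiny stand-in for a real, locale-specific list):

ALIASES = {"berry": "bernard", "dick": "richard", "steve": "stephen"}

def canonical(name):
    # First and last word are assumed to be forename and surname;
    # anything in between (middle names, initials) is discarded.
    words = name.replace(",", " ").split()
    first, last = words[0].lower(), words[-1].lower()
    return (ALIASES.get(first, first), ALIASES.get(last, last))

def same_person(name1, name2):
    a, b = canonical(name1), canonical(name2)
    # If the order as given does not match, flip the word order and retry.
    return a == b or a == (b[1], b[0])

# same_person("Berry J. Tsakala", "Tsakala, Bernard") -> True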
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively, people-name matching is not that complicated. It's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched - and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list or may be able to look up details of the person on the internet - but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with the Tanimoto coefficient on UTF-8 input.
What works for languages that use diacritical signs is difflib.SequenceMatcher().
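A quick example (difflib is in the standard library and works on any Unicode string):

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# similarity("Berry Tsakala", "Bernard Tsakala") -> 0.857...
# Diacritics are just characters to SequenceMatcher, so "Jürgen" vs "Jurgen"
# still scores high instead of raising encoding problems.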
I need a function that would compare two strings and output an edit distance like Levenshtein, but only if the characters are homoglyphs in cursive. I have a list of those homoglyphs, so I could feed a custom list to this function.
Example
homoglyphs = [["o","a"],["rn","m","nn"],...] // In cursive they look alike
compare("Mory", "Mary", homoglyphs) // Levenshtein gives 1
compare("Mory", "Tory", homoglyphs) // Levenshtein gives 1, but I want false, 99 or -1
compare("Morio", "Mario", homoglyphs) // I expect a distance of 1
compare("Morio", "Maria", homoglyphs) // I expect a distance of 2
Tory should give a false result since there's no way someone would misread an M as a T. An A could be misread as an O, so it can count as 1.
The scoring could be different, I just need to know that Mory is probably Mary not Tory and Morio is a little more likely to be Mario than Maria.
Does something like this exist?
The key to your problem can be thought of like an IQ word association question.
Sound Glyph
--------- = ----------
Homophone Homoglyphs
Now if you know that there is a way to find similar-sounding words (homophones), then the same idea can be applied, but with glyphs (homoglyphs) instead of sounds.
The way to find similar sounding words is via Soundex (Sound Index).
So just do what Soundex does, but instead of having a mapping over similar homophones, use one over similar homoglyphs.
Once you convert each word (glyphs) input into a Glyphdex (Glyph Index) then you can compute the Levenshtein distance for the two Glyphdex.
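To make that concrete, here is a rough Python sketch of a glyphdex conversion (the homoglyph classes and canonical symbols are my assumptions; a weighted Levenshtein over the result would give finer grading):

HOMOGLYPHS = [["o", "a"], ["rn", "m", "nn"]]

# Map every glyph sequence to one canonical symbol per class.
# Uppercase symbols cannot collide with the lowercased input.
GLYPHDEX = {g: chr(65 + i) for i, cls in enumerate(HOMOGLYPHS) for g in cls}

def glyphdex(word):
    out, i = "", 0
    word = word.lower()
    while i < len(word):
        # Prefer the longest matching glyph sequence ("rn" before "r").
        for g in sorted(GLYPHDEX, key=len, reverse=True):
            if word.startswith(g, i):
                out += GLYPHDEX[g]
                i += len(g)
                break
        else:
            out += word[i]
            i += 1
    return out

# glyphdex("Mory") == glyphdex("Mary") -> True (both 'BAry', distance 0)
# glyphdex("Tory") -> 'tAry', still distance 1 from 'BAry', so Mary now
# wins over Tory. Note that "o"/"a" collapse completely here, which makes
# Morio/Mario/Maria identical; weighted substitution costs would grade them apart.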
Make sense?
If you are into cellular biology then codon translation into amino acids (ref) might make more sense. Many amino acids are coded by more than one 3 letter codon.
Note: Since the word glyphdex has been used prior to me writing this I cannot say I coined that word; however, the usages I currently find via Google (search) for the word are not in the same context as described here. So in the context of converting a sequence of glyphs into an index of similar sequences of glyphs, I will take credit.
This sounds like a duplicate question, as there are several questions similar to this, but they don't specifically ask this (or I just haven't found it! :) )
I have an array, this one has two distinct elements, "a" and "b", and a length of four total elements:
var list:Array = ["a","a","b","b"];
I'm looking for all combinations, using all elements, no duplicates.
This should yield:
aabb
abab
abba
bbaa
baba
baab
Searching for a solution for this has given me results similar to these:
a,b,ab,ba,aab,abb,aba, etc
or
a a b b, a a b b, a a b b, etc
Mind you, the application that would ultimately use this function would have two distinct elements, "a" and "b", and a length of 50 total elements:
var list:Array = ["a","a","a","a","a","a","a","a","a","a",
"a","a","a","a","a","a","a","a","a","a",
"a","a","a","a","a",
"b","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b"]
...so a brute force solution like I used with aabb wouldn't be feasible.
Any help, especially using AS3 code, would be appreciated, even if it is simply pointing me to the right google search :)
Here is a JavaScript answer that might get you started: Permutations in JavaScript? (they're both EcmaScript implementations so converting to ActionScript should only require minor changes)
It doesn't handle the uniqueness requirement, but it might point you in the right direction.
However, there are a few things you might need to consider first. I don't think it will be feasible to pre-compute all unique permutations upfront.
Based on this answer about unique permutations it looks like there are 50! / (25! * 25!) = 126,410,606,437,752 unique permutations for 25 a's and 25 b's.
To give an idea how large that number is: if each combination was 1 byte in memory (in practice it will be more than this) then that would be: 126410606437752 bytes = 126,410.6 gigabytes in memory.
Plus, the algorithm for generating the permutations has complexity O(n!), so quite apart from memory constraints, it might take far too long to generate the list of permutations.
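You can sanity-check that count with a couple of lines of Python (math.comb needs Python 3.8+):

from math import comb, factorial

print(comb(50, 25))                                      # 126410606437752
print(factorial(50) // (factorial(25) * factorial(25)))  # same value

So if the application really needs to walk the sequences, the only realistic option is to generate them lazily, one at a time, rather than building the full list.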
Pattern matching (as found in e.g. Prolog, the ML family languages and various expert system shells) normally operates by matching a query against data element by element in strict order.
In domains like automated theorem proving, however, there is a requirement to take into account that some operators are associative and commutative. Suppose we have data
A or B or C
and query
C or $X
Going by surface syntax this doesn't match, but logically it should match with $X bound to A or B because or is associative and commutative.
Is there any existing system, in any language, that does this sort of thing?
Associative-Commutative pattern matching has been around since 1981 and earlier, and is still a hot topic today.
There are lots of systems that implement this idea and make it useful; it means you can avoid writing complicated pattern matches when associativity or commutativity could be used to make the pattern match. Yes, it can be expensive; better that the pattern matcher do this automatically than that you do it badly by hand.
You can see an example in a rewrite system for algebra and simple calculus implemented using our program transformation system. In this example, the symbolic language to be processed is defined by grammar rules, and those rules that have A-C properties are marked. Rewrites on trees produced by parsing the symbolic language are automatically extended to match.
The Maude term rewriter implements associative and commutative pattern matching.
http://maude.cs.uiuc.edu/
I've never encountered such a thing, and I just had a more detailed look.
There is a sound computational reason for not implementing this by default: one essentially has to generate all combinations of the input before pattern matching, or generate the full cross-product worth of match clauses.
I suspect that the usual way to implement this would be to simply write both patterns (in the binary case), i.e., have patterns for both C or $X and $X or C.
Depending on the underlying organisation of data (it's usually tuples), this pattern matching would involve rearranging the order of tuple elements, which would be weird (particularly in a strongly typed environment!). If it's lists instead, then you're on even shakier ground.
Incidentally, I suspect that the operation you fundamentally want is disjoint union patterns on sets, e.g.:
foo (Or ({C} disjointUnion {X})) = ...
The only programming environment I've seen that deals with sets in any detail would be Isabelle/HOL, and I'm still not sure that you can construct pattern matches over them.
EDIT: It looks like Isabelle's function functionality (rather than fun) will let you define complex non-constructor patterns, except then you have to prove that they are used consistently, and you can't use the code generator anymore.
EDIT 2: The way I implemented similar functionality over n commutative, associative and transitive operators was this:
My terms were of the form A | B | C | D, while queries were of the form B | C | $X, where $X was permitted to match zero or more things. I pre-sorted these using lexicographic ordering, so that variables always occurred in the last position.
First, you construct all pairwise matches, ignoring variables for now, and recording those that match according to your rules.
{ (B,B), (C,C) }
If you treat this as a bipartite graph, then you are essentially solving the perfect matching ("marriage") problem. There exist fast algorithms for finding these.
Assuming you find one, then you gather up everything that does not appear on the left-hand side of your relation (in this example, A and D), and you stuff them into the variable $X, and your match is complete. Obviously you can fail at any stage here, but this will mostly happen if there is no variable free on the RHS, or if there exists a constructor on the LHS that is not matched by anything (preventing you from finding a perfect match).
Sorry if this is a bit muddled. It's been a while since I wrote this code, but I hope this helps you, even a little bit!
For the record, this might not be a good approach in all cases. I had very complex notions of 'match' on subterms (i.e., not simple equality), and so building sets or anything would not have worked. Maybe that'll work in your case though and you can compute disjoint unions directly.
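For what it's worth, when subterms do match by plain equality, the bookkeeping collapses to multiset operations and the bipartite machinery isn't needed. A hedged Python sketch (the representation is my assumption, not the original code):

from collections import Counter

def ac_match(term_atoms, query_atoms, has_rest_var):
    # Match a commutative/associative term such as A|B|C|D against
    # a query such as B|C|$X; return the binding for $X, or None.
    term, query = Counter(term_atoms), Counter(query_atoms)
    if any(term[a] < n for a, n in query.items()):
        return None  # some query atom is missing from the term
    rest = term - query  # whatever the query did not consume
    if rest and not has_rest_var:
        return None  # leftovers, but no variable to bind them to
    return sorted(rest.elements())

# ac_match("ABCD", "BC", True)  -> ['A', 'D']   ($X bound to A|D)
# ac_match("ABCD", "BC", False) -> None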
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be generic string distance algorithms such as
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, Extended ZIP Code, and also city)
Identify (lookup) some of these entities (for example a relatively short database table can include all the cities/towns in the targeted area)
Identify (lookup) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
Consider a modified SOUNDEX algorithm to help with normalization of misspelled tokens ("mover" vs. "mower")
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain allows aborting the evaluation early on the strength of some strong heuristics (e.g. different City => Correlation = 0, level of confidence = 95%, etc.).
An important consideration with search for correlations is the need to a priori compare every single item (here address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sort that can be used as [very rough] indicator of a possible correlation (for example a key made of the 5 digit ZIP-Code followed by the SOUNDEX value of the "primary" name).
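A small Python sketch of the tokenize-and-normalize steps above (the substitution dictionary is illustrative, nowhere near a real USPS-style table):

import re

ABBREVIATIONS = {"w": "west", "e": "east", "dr": "drive", "st": "street"}

def normalize(address):
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREVIATIONS.get(t, t) for t in tokens]

# normalize("W. Lawn Mower Dr. 54A")     -> ['west', 'lawn', 'mower', 'drive', '54a']
# normalize("West Lawnmower Drive 54 A") -> ['west', 'lawnmower', 'drive', '54', 'a']
# The token lists still differ ("lawn mower" vs "lawnmower"), which is exactly
# where the fuzzier, weighted token comparison rules above come in.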
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If your metric fulfils the following criteria then it helps:
the distance between an object and itself is zero (reflexivity)
the distance from a to b is the same in both directions (symmetry)
the distance from a to c is not more than the distance from a to b plus the distance from b to c (triangle inequality)
If your metric obeys these then you can arrange your objects in metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
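A self-contained Python sketch of both (plain dynamic-programming Levenshtein plus a small BK-tree; indicative only, not a tuned implementation):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):  # assumes a non-empty iterable
        it = iter(words)
        self.root = [next(it), {}]  # node = [word, {distance: child}]
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [word, {}]
                return

    def query(self, word, max_dist):
        hits, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = levenshtein(word, node[0])
            if d <= max_dist:
                hits.append((d, node[0]))
            # The triangle inequality prunes children outside [d-max, d+max].
            for dist, child in node[1].items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return sorted(hits)

# BKTree(["drive", "drove", "street"]).query("drvie", 2) -> [(2, 'drive'), (2, 'drove')]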
Disclaimer: I don't know any algorithm that does that, but would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome; please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., so corr(a, b) = |a - b| = 1):
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that they are the same if the correlation is above a given threshold.
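A literal Python sketch of that proposal (naive by the author's own admission; the 25 in the normalizer is just the a-z letter span):

def corr(s1, s2):
    # Average closeness of corresponding letters; 1.0 means identical.
    diffs = [abs(ord(a) - ord(b)) for a, b in zip(s1, s2)]
    return 1.0 - sum(diffs) / (25.0 * len(diffs))

def best_alignment(a, b):
    # Slide the shorter string along the longer one, keep the best score.
    short, long_ = sorted((a, b), key=len)
    return max(corr(short, long_[i:i + len(short)])
               for i in range(len(long_) - len(short) + 1))

# best_alignment("west lawnmover drive", "w lawnmower street") tries each
# offset and keeps the best; mismatched characters (including spaces) can
# even push scores negative, which is why normalization should run first.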
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.
I'm quite anal about form validation. So while creating a validator for a "date of birth" (DOB) field in one of my current projects for a job application form (platform/language is neutral in this context), I wanted something to prevent 'punky' inputs.
I used a date picker and restricted the max date to be XX years from the current day. XX makes sense for this scenario as anyone younger shouldn't even be applying for the job.
The validation error message is: You seem too young for the job.
Then I began to get adventurous. How about?
If DOB is more than 120 years ago, message: "You cannot be that old!!!"
If DOB is in the future, message: "You must be kidding, you are not born yet!!!"
In the end, I deployed without the last 2, too cheeky for my no-nonsense client.
I would like to know how far/much you guys would go to validate DOB fields for good usability (or humor)?
Similarly for dates like, "Date of marriage", "Year of graduation" etc...
PS: As I was about to submit this post, there's a warning under the title textbox:
"The question you're asking appears subjective and is likely to be closed."
Fingers crossed.
To add:
I'm quite surprised that some/most of the guys are not too concerned about the validation. I repeat one of my comments here:
If the user entered the date wrongly (something very obvious), whether by intent or by mistake, catching it is one of the purposes of the validators. When bad data goes into the system, the site owner only knows the input is wrong; he/she would not know the actual value without asking the user. If this field is highly important, it will not be a pretty scenario.
Think about the times you've filled out forms. How many times have you been frustrated because some "overly clever" programmer inserted some "validation" that just happened to be incorrect for your circumstance? I say, trust the user. Come to think of it, as time goes on I guess people are living longer and getting on the net at earlier ages, anyway. :P
don't forget you can also warn the user against unlikely values. In most cases, a typo is more likely than deliberately being awkward.
So for your application, maybe something like this:
Age < min. applicant age - error
Age > common retirement age - warning
Age > expected life span - error
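In code, those tiers could look something like this (a minimal Python sketch; the thresholds are illustrative, not recommendations):

from datetime import date

MIN_AGE, RETIREMENT_AGE, MAX_LIFESPAN = 16, 67, 120

def check_dob(dob, today=None):
    today = today or date.today()
    if dob > today:
        return ("error", "Date of birth is in the future.")
    # Age, corrected for whether the birthday has happened yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age > MAX_LIFESPAN:
        return ("error", "Date of birth is too far in the past.")
    if age < MIN_AGE:
        return ("error", "Applicant is below the minimum age.")
    if age > RETIREMENT_AGE:
        return ("warning", "Applicant is past common retirement age.")
    return ("ok", "")

# check_dob(date(1850, 1, 1)) -> ('error', 'Date of birth is too far in the past.')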
Validation vs. Correctness
The point of input validation is to ensure all elements are within the range allowed for and expected by further processing - i.e. if your database guarantees all applicants in the DB are 18 years or older, validate that. If your database also accepts school kids applying for internships, don't.
Everything unusual is just a warning. Yes, a value of 120 years is crazy; you should warn the user and possibly flag this record as suspicious / for review. However, there's no point in rejecting it (unless you have a business rule that e.g. all applicants are younger than 70).
Fake trust
Imagine what happens if you tell one user that "you rule out unlikely DOBs at the input". She might tell her co-worker that DOB is "already validated". He ends up with an unfounded trust that the applicant is 90, and if it were a fake you would have rejected it.
All further processing - by human or by computer - must still assume the DOB may be incorrect, if only because of a typo. You are trying to create a guarantee you can't actually make. Many users trust the computer they use every day more than a stranger; you are trying to enforce this trust, which is IMO a fallacy.
Transmutation
Many applications live much longer than the original implementer imagined, and quite a few will be used for purposes beyond his wildest dreams. Building in artificial limits that simplify neither the actual processing nor the job of the operator doesn't actually help.
(That puts me probably into the no-nonsense category of your client - but that's my way to be "anal about validation": knowing when to stop :))
I think validation is incredibly important, but not necessarily in your situation. Which isn't to say that your situation is trivial, I just have my own date-oriented nits to pick.
Specifically, my concerns are always in keeping things in logical order. If someone says they were born in 1802, that's fine (sorta); I just want their date of graduation to be greater than their date of birth. But you run into itchy little problems when it comes to time (as in hours and minutes). For instance, a user chooses 8:30 as the start time and then chooses 9:15 as the end time, but then realizes that the end time should be 8:45. They decide to change the 9 to an 8 with the intention of changing the minutes to :45. But my validation script is too busy saying "Hey, wait! 8:15 is before 8:30, nice try!", and yet I can't risk letting them leave it wrong, etc. etc.
For your situation specifically, I would lean toward what is ethically right. Because as it's been pointed out, someone could be entering a family history (with DOBs in the 1600's) or future purchases (with dates after today), so there is no realistic limit on dates in general. But there are limits to your scenario, i.e.:
If Age is less than legal working age (16 in most parts of the US), don't even offer anything higher than that year as an option (if you are using drop down).
If Age is beyond reasonable working age (which can be a sensitive subject) offer the highest value based on retirement age and simply add a ">" in front of that year. If someone is 75 and applying for an admin-level job, they will be more pleased that you made things simple rather than offended that you didn't have their year of birth listed. If anything, they will be impressed (I think) that you went this route instead of nothing at all, implying they shouldn't waste their time.
In the end you have a simple drop-down that is very easy to script (example in PHP):
$currentYear = date('Y');
echo "<select name=\"YearOfBirth\">";
for ($i = 16; $i <= 64; $i++) {
    $optionYear = $currentYear - $i;
    echo "<option value=\"$optionYear\">$optionYear</option>";
}
$greaterYear = $currentYear - 65;
echo "<option value=\">$greaterYear\">>$greaterYear</option>";
echo "</select>";
When asking living people for their birthdate, only reject values that are definitely wrong. Any birthdate in the future is definitely wrong. And I would draw a line and say that any birthdate before (say) 1880 is definitely wrong. Anything else is a valid birthdate.
So any birthdate that fails the above tests is rejected with a message at field level, like "This date is in the future/too far in the past. Please enter your birthdate."
Any other birthdate is valid (maybe the user really is 11 years old, or 108). But the overall form may be rejected by business rules. For example, "You must be at least 18 years old to apply."
The idea is to separate individual field validation from form validation. Conflating them yields complicated rules. Separating means you can re-use the rules for the field (e.g. "DOB of a living person must be between 1/1/1880 and today") in other contexts.
If you're doing this for anything professional - like a job application - I might not use "!" in messages to users. Take a look at any well done website you'd like, you're not going to find it in common use.
Valid date: check
Date not in future: maybe (I deal with medical applications, so I suppose you could be treating unborn babies)
Date not older than 120 years: probably
I'm not a big fan of over-engineering these things, particularly if a user mistake is relatively harmless and can be spotted and fixed easily. That's how I approach it anyway.
Valid Date:
I'll go to the extent of checking whether the date exists or not, i.e. leap years, 29th Feb and so on.
Date in the future:
We usually check the age (this year minus the DOB given), which must be at least a certain age to sign up.
Date older than 120 years or not:
I won't check. 200 years would be a safer limit? (in case a 121-year-old man wants to use the computer *chuckles*)
I think you should consider your actual requirements when designing validations. Yes, if the field is a date field (and perhaps more importantly if it stores a date but some less-than-stellar DBA made it a varchar), make sure only a valid date is submitted. This is critical. Invalid dates cause all sorts of issues with querying the data. If it is a date that must of necessity have occurred in the past, limit the date range to the present date or earlier.
After that, go with what your client wants. If they want to pay for you to eliminate people younger than working age, they will tell you. Imposing an upper age limit can get you into legal trouble for age discrimination. The client may not want you to do this either.
Humour is a pretty subjective thing and very project specific so it’s a bit difficult to answer along those lines. Having said that, if the application supports a formal process such as applying for a job I’d probably err on the side of caution and keep it pretty factual.
As for validation, I believe the effort you go to here should be proportional to the impact of invalid data making its way through from the UI. Going back to the job application form, I imagine there will be a human review process at some point, so the risk of invalid data is minimal whether the data was intentionally or inadvertently entered incorrectly.
If you’re worried about “punky” or bot driven inputs then use Captcha. Having said all that, I reckon you’re pretty safe with the validation rules you’ve used.
Well, I'm not a programmer (more of a BA), though I'm trying to gain some development skills as I think it may help me be a better BA. I've done a bit of VBA (don't laugh).
Anyway in thinking about this here's my two cents
1) Dropping the humour. What's funny to you now won't be to someone else. Furthermore, what's funny after two or three goes isn't funny after 25 or 30 - it's just tiresome, even if you are dealing with a jokey crowd!
2) I am coming round to the idea that unless you can definitively validate something as being plain wrong, e.g. you don't want to let someone enter a value < 0, then you should consider warning rather than prevention via dialogues or whatever the OS standard happens to be.
Hey, what do I know. In a week I'll have changed my mind (I'm a Business Analyst) and will be demanding instant responses from developers ;->
Let's just use two digit years everywhere. No one's going to be using our software after 1999!
Below are the checks that you can do while validating the DOB:
Calculate the age from the DOB and do the following checks:
AGE > XX [XX is the minimum age required to apply]
AGE < XX {should throw a message mentioning that you are not old enough}
AGE = XX
If there is no explicit upper age limit then take retirement age as the limit; otherwise verify against the given upper limit for the next two checks:
AGE < retirement age
AGE > retirement age {should throw a message mentioning that you are too old to apply}
AGE = retirement age
DOB is a valid date (i.e. a well-formed, existing date)
DOB is invalid:
Enter 0 in any of day/month/year
Enter some negative value
Enter some invalid date, e.g. 30th Feb or 32nd Jan
Enter a valid date with different separators (the date itself is valid, but the unexpected separator makes the input invalid)
Enter dates in different formats, such as dd/mm/yyyy, dd/mm/yy, dd/MON/yyyy, etc.
Enter some future date (invalid here, as your purpose is something different)
Being a perfectionist, I would go for 150 :D
As low as the chances are, people have passed 120, and who knows what will happen in the coming 30 years :D
I don't find it that important, however.
It all depends on the application. A line of business (LOB) application for order processing is very different to tracking historical or future data.
One can agree it needs to be a valid date, but consider that there are multiple calendars (e.g. the month number can be 13, the year can be over 5000).
Validate for an integer and try to be helpful; anything beyond that (an abusive / big-brother / over-engineered system) is, I think, a bad idea.
People should be allowed to lie on these forms if they wish; it's not a legal thing, it's a website.
Don't take it so seriously.
Just let the user pick a date. The user should be in control, not the system/developer. The only date you should avoid with respect to DOB is the future, as that is incorrect (i.e. preventing error by design). The date picker you provide should handle any date format issues.
And definitely do not throw up any cheeky exceptions/messages. Your message should aid the user in recognising and recovering from an error.
Hope that helps.