I'm quite anal about form validation. So while creating a validator for a "data of birth" (DOB) field in one of my current projects for a job application form (platform/language is neutral in this context), I wanted something to prevent 'punky' inputs.
I used a date picker and restricted the max date to be XX years from the current day. XX make sense for this scenario as anyone younger shouldn't be even applying for the job.
The validation error message is: You seem too young for the job.
Then I began to get adventurous. How about?
If DOB is more than 120 years ago, message: "You cannot be that old!!!"
If DOB is in the future, message: "You must be kidding, you are not born yet!!!"
In the end, I deployed without the last 2, too cheeky for my no-nonsense client.
I would like to know how far/much would you guys go to validate DOB fields for good usability (or humor)?
Similarly for dates like, "Date of marriage", "Year of graduation" etc...
PS: As I was about to submit this post, there's a warning under the title textbox:
"The question you're asking appears subjective and is likely to be closed."
Fingers crossed.
To add:
I'm quite surprised that some/most of the guys are not too concern about the validation. I repeat one of my comments here:
If the user entered the date wrongly (something very obvious) whether by intent or by mistake; that's one of the purposes of the validators to catch it. When data goes into the system, the site owner only know the input is wrong, he/she would not know the actual value without asking the user. If this field is highly important, it will not be a pretty scenario.
Think about the times you've filled out forms. How many times have you been frustrated because some "overly clever" programmer inserted some "validation" that just happened to be incorrect for your circumstance? I say, trust the user. Come to think of it, as time goes on I guess people are living longer and getting on the net at earlier ages, anyway. :P
don't forget you can also warn the user against unlikely values. In most cases, a typo is more likely than deliberately being awkward.
So for your application, maybe something like this:
Age < min. applicant age - error
Age > common retirement age - warning
Age > expected life span - error
Validation vs. Correctness
The point of input validation is to ensure all elements are within the range allowed for and expected by further processing - i.e. if your database guarantees all applicants in the DB are 18 years or older, validate that. If your database also accepts school kids applying for internships, don't.
Everything unusual is just a warning. Yes, a value of 120 years is crazy, you should warn the user and possibly flag this record as suspicous / for review. However, there's no point in rejecting it (unless you have a business rule that e.g. all applicants are younger than 70).
Fake trust
Imagine what happens if you tell one user that "you rule out unlikely DOBs at the input". She might tell her co-worker that DOB is "already validated". He ends up with an unfounded trust that the applicant is 90, and if it were a fake you would have rejected it.
All further processing - by human or by computer - must still assume the DOB may be incorrect - just because of a typo. You are trying to create a guarantee you can't actually make. Many users trust the computer they use every day more than a stranger, you are trying to enforce this trust - which is IMO s fallacy.
Transmutation
Many applications live much longer than the original implementer imagined, and quite some will be used for purposes beyond his wildest dreams. Building in artificial limits that neither simplify the actual processing nor the job of the operator don't actually help.
(That puts me probably into the no-nonsense category of your client - but thst's my way to be "anal about validation": knowing when to stop :))
I think validation is incredibly important, but not necessarily in your situation. Which isn't to say that your situation is trivial, I just have my own date-oriented nits to pick.
Specifically, my concerns are always in keeping things in logical order. If someone says they were born in 1802, that's fine (sorta), I just want their date of graduation to be greater than their date of birth. But you run into itchy little problems when it comes to time (as in hours and minutes), for instance, if a user chooses 8:30 as the start time and then chooses 9:15 as the end time, but then realizes that the end time was 8:45. They decide to change the 9 to an 8 with the intention of changing the minutes to :45. But my validation script is too busy saying "Hey Wait! 8:15 is before 8:30, nice try!" but I can't risk letting them leave it wrong, etc etc.
For your situation specifically, I would lean toward what is ethical right. Because as it's been pointed out, someone could be entering a family history (with DOBs in the 1600's) or future purchases (with dates after today), so there is no realistic limit on dates in general. But there are limits to your scenario, ie:
If Age is less than legal working age (16 in most parts of the US), don't even offer anything higher than that year as an option (if you are using drop down).
If Age is beyond reasonable working age (which can be a sensitive subject) offer the highest value based on retirement age and simply add a ">" in front of that year. If someone is 75 and applying for an admin-level job, they will be more pleased that you made things simple rather than offended that you didn't have their year of birth listed. If anything, they will be impressed (I think) that you went this route instead of nothing at all, implying they shouldn't waste their time.
In the end you have a simple drop down very easy to script (example in PHP):
$currentYear = date('Y');
echo "<select name=\"YearOfBirth\">";
for($i = 16; $i <= 64; $i++) {
$optionYear = $currentYear - $i;
echo "<option value=\"$optionYear\">$optionYear</option>";
}
$greaterYear = $currentYear - 65;
echo "<option value=\">$greaterYear\">>$greaterYear</option>";
echo "</select>";
When asking living people for their birthdate, only reject values that are definitely wrong. Any birthdate in the future is definitely wrong. And I would draw a line and say that any birthdate before (say) 1880 is definitely wrong. Anything else is a valid birthdate.
So any birthdate that fails the above tests is rejected with a message at field level, like "This date is in the future/too far in the past. Please enter your birthdate."
Any other birthdate is valid (maybe the user really is 11 years old, or 108). But the overall form may be rejected by business rules. For example, "You must be at least 18 years old to apply."
The idea is to separate individual field validation from form validation. Conflating them yields complicated rules. Separating means you can re-use the rules for the field (e.g. "DOB of a living person must be between 1/1/1880 and today") in other contexts.
If you're doing this for anything professional - like a job application - I might not use "!" in messages to users. Take a look at any well done website you'd like, you're not going to find it in common use.
Valid date: check
Date not in future: maybe (I deal with medical applications, so I suppose you could be treating unborn babies)
Date not older than 120 years: probably
I'm not a big fan of over-engineering these things, particular if a user mistake is relative harmless and can be spotted and fixed easily. That's how I approach it anyway.
Valid Date:
I'll go to the extend of checking whether this date exists or not. i.e. leap year 29th Feb and so on
Date in the future:
we usually check the age (this year - dob given) and must be at least a certain age to sign up.
Date older than 120 years or not:
I won't check. 200 years would be a safer limit? (in case a 121 year old man wants to use the computer *chuckles*)
I think you should consider your actual requirements when designing validations. Yes if the field is a date field (and perhaps more importantly if it stores a date but some less than stellar dba made it a varchar),make sure only a valid date is submitted. This is critical. Invalid dates cause all sorts of issues with querying the data. If it is a date that must of necessity have occurred in the past, limit the date range to the present date or earlier.
After that go with what your client wants. If they want to pay for you to eliminate people younger than work age, they will tell you. Disallowing a top age limit can get you into legal trouble for age discrimination. The client may not want you to do this either.
Humour is a pretty subjective thing and very project specific so it’s a bit difficult to answer along those lines. Having said that, if the application supports a formal process such as applying for a job I’d probably err on the side of caution and keep it pretty factual.
As for validation, I believe the effort so you go to here should be proportional to the impact of invalid data making its way through from the UI. Going back to the job application form, I imagine there will be a human review process at some time so the risk of invalid data is minimal whether the data was intentionally or inadvertently entered incorrectly.
If you’re worried about “punky” or bot driven inputs then use Captcha. Having said all that, I reckon you’re pretty safe with the validation rules you’ve used.
Well I'm not a programer (More of a BA) though I'm trying to gain some development skills as I think it may help me be a better BA. I've done a bit of VBA (Don't laugh).
Anyway in thinking about this here's my two cents
1) Dropping the humour. Whats funny to you now won't be to someone else. Furthermore, whats funny after two or three goes isn't funny after 25 or 30 - its just tiresome even if you are dealing with a jokey crowd!
2) I am coming round to the idea that unless you can definitively validate something as being plain wrong, E.g. you don't want to let someone enter a value < 0, then you should consider warning rather than prevention via dialogues or whatever the OS standard happens to be.
Hey what do I know, In a week I'll have changed my mind (I'm a Business Analyst) and will be demanding instant repsonses from developers ;->
Let's just use two digit years everywhere. No one's going to be using our software after 1999!
Below are the checks that you can do while validating the DOB:
calculate the age from the DOB and do the following checks
AGE > XX [XX is the min age required to apply]
AGE < XX {SHould throw a message mentioning that you are not old enough}
AGE = XX
If there is no upper limit of age then we can take it as retirement age else verify with the upper limit for the next two checks
AGE < Retirement Age
AGE > Retirement Age {Should throw a message mentiong that you are too old to apply}
AGE = retirement Age
DOB is a valid date (by giving valid date)
DOB is invalid -
Enter 0 in either of day/month/Year
Enter some negative Value
Enter some invalid date e.g. 30th feb or 32 Jan etc
Enter valid date with different separators (although the date is a valid one but due to different separators it will become an invalid one)
Enter date with different formats such as by giving dd/mm/yyyy, dd/mm/yy, dd/MON/yyyy etc.
Enter some future date (Invalid here as your purpose is something different)
being a perfectionist i would go here for 150 :D
as low as the chances are, people have passed the 120, and who know what shall happens in the coming 30 years :D
i don't find it that important however..
It all depends on the application. A line of business (LOB) application for order processing is very different to tracking historical or future data.
One can agree it needs to be a valid date, but consider there are multiple calendars (e.g. month number can be 13, year can be over 5000).
Validate for an integer and to be helpful; I think anything else: an abusive/big brother/over-enginereed system is a bad idea.
People should be allowed to lie on these forms if they wish; it's not a legal thing, it's a website.
Don't take it so seriously.
Just let the user pick a date. The user should be in control..not the system/developer. The only date you should avoid with respect to DOB is the future as that is incorrect (i.e. preventing error by design). The date picker you provide should handle any date format issues.
And definitely do not throw up any cheeky exceptions/messages. Your message should aid the user in recognising & recoverying from an error.
Hope that helps.
Related
Can anyone help me write a class, e.g. BigNumber.as (or BigInt.as) which will:
Allow for really really big numbers/integers.
Include a method to express a number in format "1.54 Million", "1.98 Vigintillion" and so on...
Allow the maximum number to stop only at the last number word (e.g. Million, Vigintillion, etc) in the defined list. (e.g. list built from here: https://en.wikipedia.org/wiki/Names_of_large_numbers under Standard dictionary numbers [Short scale])
I had an idea to have a class which contains 2 Number values ("value" and "timesMaxedOut"). When "value" >= Number.MAX_VALUE, it would then increment "timesMaxedOut" by 1 and reset "value" back to the difference that the value went over by.
The problem? It seems if you hit or surpass "MAX_VALUE" then the Number will reset to 0. I'm also sure it would then be difficult to properly multiply or divide numbers with this approach, as it would need to take into account "timesMaxedOut" just for the calculations to work correctly.
My goal is to write a game which would allow players to reach really big numbers, and play indefinitely essentially, but AS3 lacks very large number support it seems.
With Zaqar, imagine you have a message M in a queue Queue.
Let Queue.claim_timeout == Queue.message_ttl (please ignore the misuse of the terminology, but also feel free to correct).
Imagine I do a claim(M) and I do no delete!
What happens when the claim_timeout will be reached?
M is still available
that means the message_ttl was "suspended" while the message was claimed
M is autodeleted
because of the message_ttl which will also be reached
Something else.
Any explanation or comment will be of huge help.
Thanks,
Costin
It seams that the answer is 1:
M is still available
; that means the message_ttl was "suspended" while the message was claimed
I base my statement on the Zaqar docs: https://wiki.openstack.org/wiki/Zaqar/specs/api/v1#Claims
The server will extend the lifetime of claimed messages to be at least as long as the lifetime of the claim itself, plus a specified grace period to deal with crashed workers (up to 1209600 or 14 days including claim lifetime). If a claimed message would normally live longer than the grace period, it's expiration will not be adjusted.
I've gone through quite a few examples on here and I apologize if I'm asking a repeat question, as far as I can tell, I am not.
I have an SSRS report made that shows gross sales for certain aspects of our sales departments. They are broken down, in row, by "cost, gross profit, gross profit %, order count, total sales." The columns are the aspects of our sales. Web sales, phone sales, etc....
In the tablix I can format a text box to display the results as numbers, but as you can see, I have also Percentage and Count in there. I don't know how to format those within the context of the original text box format. So I know I have everything that shows under there as a number already, but how do I handle getting the percentage to show as a percentage and the count to show as a count?
For example, all the percentages currently show as, "$0.35" and various other numbers that follow that form. The count's currently appear as currency too.
I've used an example I found on here, "=Iif ( Me.Value = Floor ( Me.Value ) , "0%" , "0.00%" )," but all that did was make everything that showed up in that column, "0.00%" I am fairly new to SSRS and have been cramming consistently for the past two weeks, but I just cannot find help on this. Thank you in advance for anything you can offer.
Update: =IIF(Fields!LVS_Web.Value=0.00, "0%", format(Fields!LVS_Web.Value, "P"))
That worked... to a degree, but now everything is a percent.... thinking ELSE here but I don't know how ELSE goes in, I've not once seen the word ELSE.
Update 2: The thing that I've noticed is that in the statement, where it says, "=0.00, "0%"," that doesn't even really apply. I've just put that there because I'm new to this and I just needed an argument involved. I took the 0% and changed it to N under the condition that the number was < .99, hopeing I would just catch all of the decimals that fell below the value of 1. Like, "$.23", which later became 23.45%, so I COULD do that, but what I don't udnerstand is it made everything else, "N," instead of a number. Why is that? It doesn't make everything else, "P?"
I'm losing my damned mind.
There is also the fact that this is information being pulled from a stored procedure, I don't really know too much about those quite yet, I get assigned simple tasks ever so often as a stepping stool for learning. I don't really know what the query was, but I couldn't edit it if I wanted to. This can be done with expression formatting but my expression is too broad, but I get mixed results using Greater or Less than, and it's probably not the wisest thing to use since these numbers are not set in stone. My day is almost done, I've made very very little progress, but I had a good lunch. So success.
So I provided my own answer for this problem, and it works. Thanks me. Thanks to all the tried to help me and did help as well. I appreciate the effort strangers will put out for each other.
I've had a new problem develop, I need to display a time relative to the data being pulled. I can put NOW in there and get today's date, but if someone is pulling information from FEB, they may be a little off-put by the current date. I'll probably get this figured out soon, but if anyone can help in the meantime, I would appreciate it.
A standard principle is to separate data from display, so use the Value property to store the data in its native data type and use the Format property to display it how you want. So rather than use an expression formatting the Value property such as =Format(Fields.SomeField.Value, "0.00%") leave the Value as =Fields!SomeField.Value and set the Format property to P2.
This is especially important when exporting your report to Excel because if you have the right data type for your data it will export to Excel as the right data type. If you use the Format function it will export as text, making sorting and formula not work properly.
The easiest thing to do to control the formatting is use the standard numeric formats. Click on the cell or range of cells that you want to have a certain format and set the Format property. They are a format specifier letter followed by an optional digit for precision (number of decimal places). Some useful ones are:
C Currency with 2 decimal places (by default)
N4 Number with 4 decimal places
P0 Percentage with no decimal places
Click on the link above for the full list. Format the number cells as numbers and the percents as percents - you don't need to try to make one format string fit every cell.
These standard numeric formats also respect regional settings. You should set your report's Language property to =User!Language to use the user's regional settings rather than the report server's.
If the number is already * 100 eg. 9.5 should be shown as 9.5% then use the format:
0.00\%
9.5 -> 9.5%
0.34 -> 0.34%
This way you can use the standard number formatting and just add the % to the end. The \ escapes the %, preventing the *100 in formatting (which would make 9.5 show 950%.).
=iif(Fields!Metric.Value = "Gross Profit %",
Format(Fields!LVS_Web.Value,"P"),
iif(Fields!Metric.Value = "Order Count",
Format(Fields!LVS_Web.Value,"G4"),
Format(Fields!LVS_Web.Value,"C")))
This is what saved me and did what I wanted. There is another error, but it's my bosses fault, so now I get to laugh at him. Thanks everyone.
Source:
https://technet.microsoft.com/en-us/library/bb630415(v=sql.100).aspx
This is simple to use,
Percent of (the sum of line item totals for the current scope)/(the sum of line item totals for the dataset).
This value is formatted using FormatPercent specifying one decimal place.
="Percentage contributing to all sales: " & FormatPercent(Sum(Field!LineTotal.Value)/Sum(Field!LineTotal.Value,"Sales"),1)
When people talk about the use of "magic numbers" in computer programming, what do they mean?
Magic numbers are any number in your code that isn't immediately obvious to someone with very little knowledge.
For example, the following piece of code:
sz = sz + 729;
has a magic number in it and would be far better written as:
sz = sz + CAPACITY_INCREMENT;
Some extreme views state that you should never have any numbers in your code except -1, 0 and 1 but I prefer a somewhat less dogmatic view since I would instantly recognise 24, 1440, 86400, 3.1415, 2.71828 and 1.414 - it all depends on your knowledge.
However, even though I know there are 1440 minutes in a day, I would probably still use a MINS_PER_DAY identifier since it makes searching for them that much easier. Whose to say that the capacity increment mentioned above wouldn't also be 1440 and you end up changing the wrong value? This is especially true for the low numbers: the chance of dual use of 37197 is relatively low, the chance of using 5 for multiple things is pretty high.
Use of an identifier means that you wouldn't have to go through all your 700 source files and change 729 to 730 when the capacity increment changed. You could just change the one line:
#define CAPACITY_INCREMENT 729
to:
#define CAPACITY_INCREMENT 730
and recompile the lot.
Contrast this with magic constants which are the result of naive people thinking that just because they remove the actual numbers from their code, they can change:
x = x + 4;
to:
#define FOUR 4
x = x + FOUR;
That adds absolutely zero extra information to your code and is a total waste of time.
"magic numbers" are numbers that appear in statements like
if days == 365
Assuming you didn't know there were 365 days in a year, you'd find this statement meaningless. Thus, it's good practice to assign all "magic" numbers (numbers that have some kind of significance in your program) to a constant,
DAYS_IN_A_YEAR = 365
And from then on, compare to that instead. It's easier to read, and if the earth ever gets knocked out of alignment, and we gain an extra day... you can easily change it (other numbers might be more likely to change).
There's more than one meaning. The one given by most answers already (an arbitrary unnamed number) is a very common one, and the only thing I'll say about that is that some people go to the extreme of defining...
#define ZERO 0
#define ONE 1
If you do this, I will hunt you down and show no mercy.
Another kind of magic number, though, is used in file formats. It's just a value included as typically the first thing in the file which helps identify the file format, the version of the file format and/or the endian-ness of the particular file.
For example, you might have a magic number of 0x12345678. If you see that magic number, it's a fair guess you're seeing a file of the correct format. If you see, on the other hand, 0x78563412, it's a fair guess that you're seeing an endian-swapped version of the same file format.
The term "magic number" gets abused a bit, though, referring to almost anything that identifies a file format - including quite long ASCII strings in the header.
http://en.wikipedia.org/wiki/File_format#Magic_number
Wikipedia is your friend (Magic Number article)
Most of the answers so far have described a magic number as a constant that isn't self describing. Being a little bit of an "old-school" programmer myself, back in the day we described magic numbers as being any constant that is being assigned some special purpose that influences the behaviour of the code. For example, the number 999999 or MAX_INT or something else completely arbitrary.
The big problem with magic numbers is that their purpose can easily be forgotten, or the value used in another perfectly reasonable context.
As a crude and terribly contrived example:
while (int i != 99999)
{
DoSomeCleverCalculationBasedOnTheValueOf(i);
if (escapeConditionReached)
{
i = 99999;
}
}
The fact that a constant is used or not named isn't really the issue. In the case of my awful example, the value influences behaviour, but what if we need to change the value of "i" while looping?
Clearly in the example above, you don't NEED a magic number to exit the loop. You could replace it with a break statement, and that is the real issue with magic numbers, that they are a lazy approach to coding, and without fail can always be replaced by something less prone to either failure, or to losing meaning over time.
Anything that doesn't have a readily apparent meaning to anyone but the application itself.
if (foo == 3) {
// do something
} else if (foo == 4) {
// delete all users
}
Magic numbers are special value of certain variables which causes the program to behave in an special manner.
For example, a communication library might take a Timeout parameter and it can define the magic number "-1" for indicating infinite timeout.
The term magic number is usually used to describe some numeric constant in code. The number appears without any further description and thus its meaning is esoteric.
The use of magic numbers can be avoided by using named constants.
Using numbers in calculations other than 0 or 1 that aren't defined by some identifier or variable (which not only makes the number easy to change in several places by changing it in one place, but also makes it clear to the reader what the number is for).
In simple and true words, a magic number is a three-digit number, whose sum of the squares of the first two digits is equal to the third one.
Ex-202,
as, 2*2 + 0*0 = 2*2.
Now, WAP in java to accept an integer and print whether is a magic number or not.
It may seem a bit banal, but there IS at least one real magic number in every programming language.
0
I argue that it is THE magic wand to rule them all in virtually every programmer's quiver of magic wands.
FALSE is inevitably 0
TRUE is not(FALSE), but not necessarily 1! Could be -1 (0xFFFF)
NULL is inevitably 0 (the pointer)
And most compilers allow it unless their typechecking is utterly rabid.
0 is the base index of array elements, except in languages that are so antiquated that the base index is '1'. One can then conveniently code for(i = 0; i < 32; i++), and expect that 'i' will start at the base (0), and increment to, and stop at 32-1... the 32nd member of an array, or whatever.
0 is the end of many programming language strings. The "stop here" value.
0 is likewise built into the X86 instructions to 'move strings efficiently'. Saves many microseconds.
0 is often used by programmers to indicate that "nothing went wrong" in a routine's execution. It is the "not-an-exception" code value. One can use it to indicate the lack of thrown exceptions.
Zero is the answer most often given by programmers to the amount of work it would take to do something completely trivial, like change the color of the active cell to purple instead of bright pink. "Zero, man, just like zero!"
0 is the count of bugs in a program that we aspire to achieve. 0 exceptions unaccounted for, 0 loops unterminated, 0 recursion pathways that cannot be actually taken. 0 is the asymptote that we're trying to achieve in programming labor, girlfriend (or boyfriend) "issues", lousy restaurant experiences and general idiosyncracies of one's car.
Yes, 0 is a magic number indeed. FAR more magic than any other value. Nothing ... ahem, comes close.
rlynch#datalyser.com
I'm not a Natural Language Programming student, yet I know it's not trivial strcmp(n1,n2).
Here's what i've learned so far:
comparing Personal Names can't be solved 100%
there are ways to achieve certain degree of accuracy.
the answer will be locale-specific, that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but to compare 2 names. e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
c = [v for v in a if v in b]
return float(len(c)) / (len(a)+len(b)-len(c))
def name_compare(name1, name2):
return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
We've just been doing this sort of work non-stop lately and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-english names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match to Berry we would flip the word order around and then get the match.
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively people name matching is not that complicated. Its entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched - and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the aliases list or may be able to look up details of the person on the internet - but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with the Tanimoto using utf-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher()