How forgiving should form inputs be? - language-agnostic

I went to my bank website the other day and entered my account number with a trailing space. An error message popped up that said, "Account number must consist of numeric values only." I thought to myself, "Seriously?! You couldn't have just stripped the space for me?" If I were any less of a computer geek, I might even have thought, "What? There are only numbers in there!" (not being able to see the space).
The Calculator that comes with Ubuntu, on the other hand, merrily accepts spaces and commas, but oddly doesn't like trailing dots (without any ensuing digits).
So, that raises the question: exactly how forgiving should web forms be? I don't think trimming whitespace is too much to ask, but what about other integer fields?
Should they allow +/- signs?
How many spaces should be allowed between the sign and the number?
What about commas for thousands separators?
What about other parts of the world, where they use dots instead?
What if they're in between every 4 digits instead of every 3?
What about hexadecimal and octal representations?
Scientific notation?
What if I accidentally hit the quote button when I'm trying to hit enter, should that be stripped too?
It would be very easy for me to strip out all non-digit characters, and that would be extremely forgiving, but what if the user made an actual mistake that affects the input and should have been caught, but now I've just stripped it out?
What about things like phone numbers (which have a huge variety of formats), postal codes, zip codes, credit card numbers, usernames, emails, URLs (should I assume http? What about .com while I'm at it?)?
Where do you draw the line?

For something as important as banking, I don't mind it complaining about my input, especially if the other option is mistakenly transferring a bucketload of money into some stranger's account instead of my wife's (because of a missing or incorrect digit for example).
A classic example is one of my banks which disallows monetary values unless they have ".99" at the end (where 9 can be any digit of course). The vast majority of things I do are for exact dollar amounts and it's a tad annoying to have to always enter 500.00 instead of just 500.
But I'll be happier about that the first time I avoid accidentally paying somebody $5072 instead of $50.72 just because I forgot the decimal point. Actually, that's pretty unlikely since it also asks for confirmation and I'm pretty anal in controlling my money :-)
Having said that, the general rule I try to follow is "be liberal in what you accept, be strict in what you produce".
This allows other software using my output to expect a limited range of possibilities (making their lives easier). But it makes my software more useful if it can handle simple misteaks.

You draw the line at the point where the computer is guessing at what the correct input should be.
For example, a license key input box I wrote once accepts spaces and dashes and both upper and lower case, even though internally the keys were without said spaces, dashes and were all upper case. I could do that, since I knew that none of the keys actually had spaces or dashes.
Your example about URLs is another good one. I've noticed that when something like 'flowers' is typed into the address bar of a modern browser (I'm using Chrome), it knows it should search for it, since it's not a valid URL. If I type 'st' instead, it auto-corrects (or auto-suggests) 'stackoverflow.com', since it's a bookmark.
A well-written input system will complain when it would otherwise be forced to guess what the correct input should be.
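The license key case described above could be sketched roughly like this in Python. The function name and the assumption that keys consist only of ASCII letters and digits are hypothetical; the point is only that separators and case can be normalised safely because no real key contains them, while anything else is rejected rather than guessed at.

```python
import re


def normalize_license_key(raw):
    """Return the canonical form of a license key typed by a user.

    Assumption (not from the original post): canonical keys contain
    only the characters A-Z and 0-9.
    """
    # Spaces and dashes never occur in a real key, so dropping them
    # cannot change the meaning of the input.
    key = raw.replace(" ", "").replace("-", "").upper()
    # Anything else left over would force us to guess, so complain.
    if not re.fullmatch(r"[A-Z0-9]+", key):
        raise ValueError(f"not a valid license key: {raw!r}")
    return key
```

For example, `normalize_license_key(" abcd-1234 efgh ")` yields `"ABCD1234EFGH"`, while `"abcd_1234"` is rejected.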

Numeric input:
Stripping non-digits seems reasonable to me, but the problem is conflicting decimal notation. Some regions expect , (comma) to denote the decimal separator, while others use . (period). Unless the input would likely be in other bases, I would only assume base 10. If it's reasonable to assume non-base 10 input (base-16 for color input, for example), I would go with standard conventions for denoting the bases: leading 0 means base 8, leading 0x means base 16.
String input:
This gets a lot more complicated. It mostly depends on what the input is actually meant to represent. A username should exclude characters that will cause trouble, but the meaning of 'cause trouble' will vary depending on the use of the application and the system itself. URLs have a concrete definition of what qualifies, but that definition is rather broad. Fortunately, many languages come with tools to discern URLs, without you having to code your own parsing (whether the language does it perfectly or not is another question).
In the end, it's really a case-by-case basis. I do like paxadiablo's general rule, though: Accept as much as you can, output only what you must.

It totally depends on how the data is going to be used.
If the input is a monetary amount, for a transaction for example, then the inputted variable should be normalised to a set of standards for sure.
If it's simply a case of a phone number, then it is unlikely the stored data will provide any functional sort of use so you can be more forgiving.
There is nothing wrong with forcing a correct format to make the displayed data look nicer, but you have to balance user irritation against micro benefits.
Once you start collecting data you can scan through it and see what sort of patterns emerge, and you can auto strip off inputted format.

Where do you draw the line?
When the consequences of accepting "invalid" data outweigh the irritation of not accepting it.
Should they allow +/- signs?
If negative values are valid, then of course they should.
If not, then don't just silently strip minus signs, as it totally changes the meaning of the data. Stripping pluses is less of a problem.
What if [thousands separators are] in between every 4 digits instead of every 3?
In countries that use three-digit grouping, "1,0000" can be assumed to be a typo. But is it a typo for "10,000" or for "1,000"? I wouldn't dare guess, as a wrong guess could cost the user $9,000.
What about hexadecimal and octal representations?
Unless you're running the search feature for unicode.org, I can't imagine why anyone would use hexadecimal in a web form.
And "01234" is almost certainly intended to be 1234 instead of 668.
What about things like...credit card numbers
Please allow spaces or hyphens in credit card numbers. It's really annoying when I have to type an undelimited 16-digit number.
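As a rough illustration of the integer rules in this answer, here is a minimal Python sketch. The name `parse_integer` and the `allow_negative` flag are made up for the example, and it assumes comma thousands separators with three-digit grouping. It trims whitespace, accepts a leading sign, reads "01234" as decimal, and refuses to guess at "1,0000".

```python
import re

# Digits in well-formed groups of three, e.g. "1,234,567".
_GROUPED = re.compile(r"[+-]?[0-9]{1,3}(,[0-9]{3})+")
# Plain digits with an optional sign, e.g. "-42" or "01234".
_PLAIN = re.compile(r"[+-]?[0-9]+")


def parse_integer(text, allow_negative=True):
    """Parse a forgiving but unambiguous base-10 integer."""
    s = text.strip()  # surrounding whitespace is always harmless
    if _GROUPED.fullmatch(s):
        s = s.replace(",", "")
    elif not _PLAIN.fullmatch(s):
        # Covers "1,0000", "12a", "" and similar: do not guess.
        raise ValueError(f"cannot interpret {text!r} as an integer")
    if s.startswith("-") and not allow_negative:
        # Silently dropping the minus would change the meaning.
        raise ValueError("negative values are not allowed here")
    return int(s, 10)  # explicit base 10, so "01234" is 1234
```

With this sketch, `parse_integer(" 1,234,567 ")` gives 1234567 and `parse_integer("+42")` gives 42, while `parse_integer("1,0000")` raises `ValueError`.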

I think you're overreacting a little bit. If there's anything in the field that shouldn't be there, strip it. Otherwise, try to force the input into whatever format you want, and if it doesn't fit, reject it.

I would say "Accept anything but process only valid data".
Expect your users to behave like a computer noob. Validate the input data using regular expressions and other validators.
Search for standard regular expressions for urls, emails and stuff.
Throw in a regular exp like this "/(?:([a-zA-Z0-9][\s,]+))([a-zA-Z0-9]+)$/" for comma or space separated values. With minor tweaking this exp will work for any number of comma separated values.

The one that irritates me as a user is credit card numbers. Conventionally these appear as groups of 4 digits separated by spaces, but the odd web form will only accept a single string of digits with no spaces, and gives no indication that this is the format it's seeking. Telephone numbers are similar: humans often use spaces to improve clarity, and web forms sometimes accept the spaces and sometimes don't.
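Accepting the grouped form costs very little. A minimal Python sketch follows; the function name is hypothetical, and the 12 to 19 digit length range is an assumption for illustration rather than something stated in this thread.

```python
import re


def normalize_card_number(raw):
    """Strip the spaces and hyphens people naturally type in card numbers."""
    digits = re.sub(r"[ -]", "", raw.strip())
    # Assumed length range; adjust to whatever the payment provider requires.
    if not re.fullmatch(r"[0-9]{12,19}", digits):
        raise ValueError("card number must contain 12 to 19 digits")
    return digits
```

So `"4111 1111 1111 1111"` and `"4111-1111-1111-1111"` both become `"4111111111111111"`, and input containing any other character is still rejected.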

Related

How can I determine which regular expressions from a list possibly overlap

I have a table of regular expressions that are in an MySQL table that I match text against.
Is there a way, using MySQL or any other language (preferably Perl) that I can take this list of expressions and determine which of them MAY overlap. This should be independent of whatever text may be supplied to the expressions.
All of the expression have anchors.
Here is an example of what I am trying to get:
Expressions:
^a$
^b$
^ab
^b.*c
^batch
^catch
Result:
'^b.*c' and '^batch' MAY overlap
Thoughts?
Thanks,
Scott
Further explanation:
I have a list of user-created regexes and an imported list of strings that are to be matched against the regexes. In this case the strings are "clean" data (ie they are not user-created but imported from another source - they must not change).
When a user adds to the list of regexes I do not want any collisions on either the existing list of strings nor any future strings (which can not be guessed ahead of time - the only constraints being they are ASCII printable characters no longer than 255 characters).
A brute-force method would be to create a "rainbow" table of all of the permutations of strings and each time a regex is added run all of the regexes against the rainbow table. However I'd like to avoid this (I'm not even sure of the cost) and so was wondering aloud as to the possibility of an algorithm that would AT LEAST show which regexes in a list MAY collide.
I will punt on full REs. Even limiting to BREs and/or MySQL-pre-8.0 will be challenging. Here are some thoughts.
If end-anchored and no + or *, then calculate the length. The fixed length can be used as a discriminator. Also, it could be used for toning back the "brute force" by perhaps an order of magnitude.
Anything followed by + or * gets turned into .* for simplicity. (Re the "may collide" rule.)
Any RE with explicit characters (including those followed by +) becomes a discriminator in some situations. Eg, ^a.*b$ vs ^a.*c$.
For those anchored at the end, reverse the pattern and test it that way. (I don't know how difficult reversing is.)
If you can say that a particular character must be at any position, then use it as a discriminator: ^a.b.*c$ -- a in pos 1; b in pos 3; c at end. Perhaps this can be extended to character classes: ^\w may match, but ^\d and ^a.*\d$ can't.
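To make these ideas concrete, here is a minimal Python sketch for a deliberately tiny subset of regex syntax: a leading ^, literal characters, ".", an optional trailing $, and any item followed by * or + (which, as suggested above, is loosened to ".*"). Everything else raises an error. The function names are hypothetical, and the answer is only "may overlap": because of the loosening it can report pairs that never actually collide, but it should not miss a real collision within this subset.

```python
from functools import lru_cache
from itertools import combinations

STAR = "STAR"  # stands for ".*" (also used for any "x*" or "x+")
ANY = "ANY"    # stands for "."


def tokenize(pattern):
    """Turn a pattern from the supported subset into a tuple of tokens."""
    if not pattern.startswith("^"):
        raise ValueError("pattern must be anchored at the start")
    body = pattern[1:]
    end_anchored = body.endswith("$")
    if end_anchored:
        body = body[:-1]
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in "\\[](){}|?^$*+":
            raise ValueError(f"unsupported syntax in {pattern!r}")
        if i + 1 < len(body) and body[i + 1] in "*+":
            tokens.append(STAR)  # conservative: treat x* / x+ as .*
            i += 2
            continue
        tokens.append(ANY if ch == "." else ch)
        i += 1
    if not end_anchored:
        tokens.append(STAR)  # no $ means anything may follow
    return tuple(tokens)


def may_overlap(p, q):
    """True if some string could match both patterns."""
    a, b = tokenize(p), tokenize(q)

    @lru_cache(maxsize=None)
    def common(i, j):
        if i == len(a) and j == len(b):
            return True
        if i < len(a) and a[i] == STAR:
            # The star matches nothing, or absorbs b's next token.
            return common(i + 1, j) or (j < len(b) and common(i, j + 1))
        if j < len(b) and b[j] == STAR:
            return common(i, j + 1) or (i < len(a) and common(i + 1, j))
        if i < len(a) and j < len(b):
            if a[i] == ANY or b[j] == ANY or a[i] == b[j]:
                return common(i + 1, j + 1)
        return False

    return common(0, 0)


patterns = ["^a$", "^b$", "^ab", "^b.*c", "^batch", "^catch"]
overlaps = [(p, q) for p, q in combinations(patterns, 2) if may_overlap(p, q)]
```

For the six expressions in the question, `overlaps` contains only the pair `('^b.*c', '^batch')`. Character classes, alternation and escapes would need a real automaton-based intersection test, which this sketch does not attempt.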

How to stop a number input removing the first 0 in a number

For card payments we accept a security code of 3 digits.
In some instances on some browsers (likely to be older IE versions) we have had occurrences of a code with a 0 at the start (example 012) having the first 0 removed, thus only allowing the input of 12. This therefore invalidates the security code.
We have this as a number input to allow number input only on mobile devices, I've a feeling this is the cause. However, is there anything we can do to stop this from happening?
The current input code is:
<input type="number" pattern="[0-9]*" size="4" value="$securitycode" name="securitycode">
Many thanks in advance.
This behavior is according to the spec, so I don't think you can directly do something to prevent it.
If the user agent provides a user interface for selecting a number,
then the value must be set to the best representation of the number
representing the user's selection as a floating-point number.
Specifically, the smoking gun in the definition of "best representation" is
(11). Collect a sequence of characters that are ASCII digits, and interpret the resulting sequence as a base-ten integer. Multiply value
by that integer.
I am assuming that you want to keep the input type so that mobile user agents present to the user a UI better suited to the task of inputting a numeric code. So what you can do is, since you now know what the spec says, anticipate this behavior on the server side: pad the incoming value with zeroes.
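The server-side padding could look something like the following Python sketch. The function name is hypothetical, and the fixed length of 3 is taken from the question; the server language actually in use is not stated there.

```python
def restore_security_code(value, length=3):
    """Re-add leading zeroes that a number input may have dropped."""
    s = str(value).strip()
    # Only plain ASCII digits, and no more of them than the code allows.
    if not (s.isascii() and s.isdigit()) or len(s) > length:
        raise ValueError(f"invalid security code: {value!r}")
    return s.zfill(length)  # "12" -> "012"
```

This only works because the code has a known fixed length; a field whose length varies cannot be repaired this way.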
It seems changing the input type to "text" might resolve this issue.

What are "magic numbers" in computer programming?

When people talk about the use of "magic numbers" in computer programming, what do they mean?
Magic numbers are any number in your code that isn't immediately obvious to someone with very little knowledge.
For example, the following piece of code:
sz = sz + 729;
has a magic number in it and would be far better written as:
sz = sz + CAPACITY_INCREMENT;
Some extreme views state that you should never have any numbers in your code except -1, 0 and 1 but I prefer a somewhat less dogmatic view since I would instantly recognise 24, 1440, 86400, 3.1415, 2.71828 and 1.414 - it all depends on your knowledge.
However, even though I know there are 1440 minutes in a day, I would probably still use a MINS_PER_DAY identifier since it makes searching for them that much easier. Who's to say that the capacity increment mentioned above wouldn't also be 1440 and you end up changing the wrong value? This is especially true for the low numbers: the chance of dual use of 37197 is relatively low, the chance of using 5 for multiple things is pretty high.
Use of an identifier means that you wouldn't have to go through all your 700 source files and change 729 to 730 when the capacity increment changed. You could just change the one line:
#define CAPACITY_INCREMENT 729
to:
#define CAPACITY_INCREMENT 730
and recompile the lot.
Contrast this with magic constants which are the result of naive people thinking that just because they remove the actual numbers from their code, they can change:
x = x + 4;
to:
#define FOUR 4
x = x + FOUR;
That adds absolutely zero extra information to your code and is a total waste of time.
"magic numbers" are numbers that appear in statements like
if days == 365
Assuming you didn't know there were 365 days in a year, you'd find this statement meaningless. Thus, it's good practice to assign all "magic" numbers (numbers that have some kind of significance in your program) to a constant,
DAYS_IN_A_YEAR = 365
And from then on, compare to that instead. It's easier to read, and if the earth ever gets knocked out of alignment, and we gain an extra day... you can easily change it (other numbers might be more likely to change).
There's more than one meaning. The one given by most answers already (an arbitrary unnamed number) is a very common one, and the only thing I'll say about that is that some people go to the extreme of defining...
#define ZERO 0
#define ONE 1
If you do this, I will hunt you down and show no mercy.
Another kind of magic number, though, is used in file formats. It's just a value included as typically the first thing in the file which helps identify the file format, the version of the file format and/or the endian-ness of the particular file.
For example, you might have a magic number of 0x12345678. If you see that magic number, it's a fair guess you're seeing a file of the correct format. If you see, on the other hand, 0x78563412, it's a fair guess that you're seeing an endian-swapped version of the same file format.
The term "magic number" gets abused a bit, though, referring to almost anything that identifies a file format - including quite long ASCII strings in the header.
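A file-format magic number check along these lines might be sketched in Python as follows. The value 0x12345678 is the illustrative one from this answer, not a real format, and the function name is made up for the example.

```python
import struct

MAGIC = 0x12345678          # illustrative value from the text above
MAGIC_SWAPPED = 0x78563412  # the same four bytes read in the other order


def detect_byte_order(header):
    """Guess a file's byte order from its first four bytes.

    Returns "little", "big", or None if the magic number is absent.
    """
    if len(header) < 4:
        raise ValueError("header too short to contain a magic number")
    (value,) = struct.unpack("<I", header[:4])  # read as little-endian
    if value == MAGIC:
        return "little"
    if value == MAGIC_SWAPPED:
        return "big"  # writer used the opposite byte order
    return None
```

A header written with `struct.pack(">I", 0x12345678)` is reported as `"big"`, and one that starts with unrelated bytes gives `None`.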
http://en.wikipedia.org/wiki/File_format#Magic_number
Wikipedia is your friend (Magic Number article)
Most of the answers so far have described a magic number as a constant that isn't self describing. Being a little bit of an "old-school" programmer myself, back in the day we described magic numbers as being any constant that is being assigned some special purpose that influences the behaviour of the code. For example, the number 999999 or MAX_INT or something else completely arbitrary.
The big problem with magic numbers is that their purpose can easily be forgotten, or the value used in another perfectly reasonable context.
As a crude and terribly contrived example:
int i = 0;
while (i != 99999)
{
DoSomeCleverCalculationBasedOnTheValueOf(i);
if (escapeConditionReached)
{
i = 99999;
}
}
The fact that a constant is used or not named isn't really the issue. In the case of my awful example, the value influences behaviour, but what if we need to change the value of "i" while looping?
Clearly in the example above, you don't NEED a magic number to exit the loop. You could replace it with a break statement, and that is the real issue with magic numbers, that they are a lazy approach to coding, and without fail can always be replaced by something less prone to either failure, or to losing meaning over time.
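For comparison, the same loop without a sentinel value might look like this in Python. The doubling is only a stand-in for the "clever calculation", and the function and parameter names are invented for the sketch.

```python
def process_until_done(values, is_done):
    """Process values in order, stopping once is_done(value) is true."""
    processed = []
    for value in values:
        processed.append(value * 2)  # stand-in for the real calculation
        if is_done(value):
            break  # explicit exit; no special value like 99999 needed
    return processed
```

Because nothing depends on a reserved value, 99999 is now just another number the loop can process.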
Anything that doesn't have a readily apparent meaning to anyone but the application itself.
if (foo == 3) {
// do something
} else if (foo == 4) {
// delete all users
}
Magic numbers are special values of certain variables which cause the program to behave in a special manner.
For example, a communication library might take a Timeout parameter and it can define the magic number "-1" for indicating infinite timeout.
The term magic number is usually used to describe some numeric constant in code. The number appears without any further description and thus its meaning is esoteric.
The use of magic numbers can be avoided by using named constants.
Using numbers in calculations other than 0 or 1 that aren't defined by some identifier or variable (which not only makes the number easy to change in several places by changing it in one place, but also makes it clear to the reader what the number is for).
In simple and true words, a magic number is a three-digit number, whose sum of the squares of the first two digits is equal to the third one.
Ex-202,
as, 2*2 + 0*0 = 2*2.
Now, write a program in Java to accept an integer and print whether it is a magic number or not.
It may seem a bit banal, but there IS at least one real magic number in every programming language.
0
I argue that it is THE magic wand to rule them all in virtually every programmer's quiver of magic wands.
FALSE is inevitably 0
TRUE is not(FALSE), but not necessarily 1! Could be -1 (0xFFFF)
NULL is inevitably 0 (the pointer)
And most compilers allow it unless their typechecking is utterly rabid.
0 is the base index of array elements, except in languages that are so antiquated that the base index is '1'. One can then conveniently code for(i = 0; i < 32; i++), and expect that 'i' will start at the base (0), and increment to, and stop at 32-1... the 32nd member of an array, or whatever.
0 is the end of many programming language strings. The "stop here" value.
0 is likewise built into the X86 instructions to 'move strings efficiently'. Saves many microseconds.
0 is often used by programmers to indicate that "nothing went wrong" in a routine's execution. It is the "not-an-exception" code value. One can use it to indicate the lack of thrown exceptions.
Zero is the answer most often given by programmers to the amount of work it would take to do something completely trivial, like change the color of the active cell to purple instead of bright pink. "Zero, man, just like zero!"
0 is the count of bugs in a program that we aspire to achieve. 0 exceptions unaccounted for, 0 loops unterminated, 0 recursion pathways that cannot be actually taken. 0 is the asymptote that we're trying to achieve in programming labor, girlfriend (or boyfriend) "issues", lousy restaurant experiences and general idiosyncrasies of one's car.
Yes, 0 is a magic number indeed. FAR more magic than any other value. Nothing ... ahem, comes close.
rlynch#datalyser.com

Why 13 places in ROT13?

I understand the reasons for and against ROT13, but I'm wondering why specifically people have chosen 13 places to shift the alphabet? I understand it's halfway around, but is there an elegant reason to go -that- far, but not 12 or 14 spots?
It seems to me like making each letter "as far away" as possible from its starting position only is meaningful to a human who might recognize "close" characters (although I doubt this is possible/probable).
Anyone know the answer to this?
Because it has the nice property of being involutive, that is to say, ROT13(ROT13(alphaOnlyString)) = alphaOnlyString.
According to Wikipedia:
A shift of thirteen was chosen over other values, such as three as in the original Caesar cipher, because thirteen is the value for which encoding and decoding are equivalent, thereby allowing the convenience of a single command for both.
Probably cause it is its own inverse. The same algorithm can be used for "encryption" as well as "decryption".
Because shifting by 13 moves the characters half way around the alphabet (which has 26 places). So, to get back to plaintext you only need to shift it 13 moves again. This way, you don't have to have separate functions for encoding or decoding because the same operation will be encode or decode.
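The self-inverse property is easy to check with a small Python sketch of a general Caesar shift (the function name is made up for the example). Only a shift of 13 undoes itself on a 26-letter alphabet; any other shift k needs 26 - k to decode.

```python
import string


def caesar(text, shift):
    """Rotate ASCII letters by `shift` places, leaving other characters alone."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


message = "Hello, World!"
twice_13 = caesar(caesar(message, 13), 13)  # same operation applied twice
twice_3 = caesar(caesar(message, 3), 3)     # shifted by 6 in total
```

Here `twice_13` equals the original message, while `twice_3` does not; decoding a shift of 3 takes `caesar(text, 23)`.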

Should implicit octal encoding be removed or changed in programming languages?

I was looking at this question. Basically, having a leading zero causes the number to be interpreted as octal. I've run into this problem numerous times in multiple languages.
Why doesn't the language explicitly require you to specify octal with a function call or a type (in strongly typed languages) like:
oct variable = 2;
I can understand why hexadecimal (0x0234) has this format. Hex is pretty useful. An integer from the database will never have an x in it.
But octal numbers 0123 look like ints and are a pain to deal with. I've never used octal for anything.
Can anyone explain the rationale behind this usage? Is it just a bit of historical cruft?
It's largely historic. The best solution I've seen is in the new version of Python, where octal is indicated with a special prefix character "o", much like hexadecimal's "x" prefix:
0o10 == 0x8 == 8
99.9% of the reason it exists is to support chmod() calls, i.e. chmod(path, 0755).
It does rather seem like a format more like hex's would be superior.
It exists since working with 3-bit segments is almost as useful as working with 4-bit segments. This was more true in the past (e.g., seven-segment LEDs, chmod, etc.).
The real question is why haven't more languages adopted octal and binary notations in a more regular fashion:
10 == 0b1010 == 0o12 == 0x0A
I know that Python finally adopted the 0o prefix notation (e.g. 0o10)... not sure if they have adopted the binary one as well. I guess a better question is: why does this still trip people up?
I hate this too, I don't know why it's been carried forward into so many modern languages. I once knew someone who had a zip code like "09827" when he lived in NYC. Sometimes he had to input his zip code as "9827," because the leading zero would lead to error messages (since 9's and 8's are illegal characters in octal numbers).
Yes, it's historical. C uses this way to specify literals in octal, and possibly it was used somewhere before that.
I've experienced it in JavaScript, where parsing dates stops working in August. Up to July it works, as '07' parsed as octal is still seven, but '08' is not a valid octal number... (The solution is to specify the number base in the parseInt call.)
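The same "always state the base" advice carries over to other languages. As a sketch in Python 3 (not the JavaScript discussed above; the helper names are invented), `int()` with an explicit base of 10 treats leading zeroes as padding, while base 0 follows source-literal rules and refuses an ambiguous leading zero outright.

```python
def parse_decimal(text):
    """Parse user input as base 10; leading zeroes are just padding."""
    return int(text.strip(), 10)


def parse_literal(text):
    """Parse using Python literal prefixes: 0x, 0o, 0b, or plain decimal."""
    return int(text.strip(), 0)


august = parse_decimal("08")      # 8, not an error
octal = parse_literal("0o10")     # 8
hexadecimal = parse_literal("0x0A")  # 10
```

Calling `parse_literal("010")` raises `ValueError`, because Python 3 no longer accepts a bare leading zero as an octal marker.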
In C# there are no binary or octal literals; perhaps the reasoning is that you shouldn't be doing so much bit fiddling that the language needs them...
Personally, I blame the programmer in this case. Why are you formatting an integer by zero padding? Zero padding is for strings, not numeric types.