I'm not sure of the terminology here, so let me specify that when I say "verify" user input, I mean watch out for users claiming 30 Feb 2021 as their birthdays, rather than guarding against injection attacks.
Are there any guides to doing this correctly, or lists of common ways people do it wrong? Strategies for ensuring correct input even before it's entered (e.g., picking out of a calendar instead of typing into a text field)?
Note that I am not interested in language-specific answers (e.g., ASP.NET Validation Controls) but rather general strategies and principles.
The freer you make the input field, the more you have to check. Some languages may make it easy for you to verify that a text field is a valid date; others may not.
Then again, some users will resent clicking on a calendar control or three drop-downs to enter their birthdate. They may prefer to just type it in. That's a trade-off.
The term you are looking for is input validation.
As you point out, a control that makes it impossible to enter invalid data helps on the client side, but you still need to implement proper validation on the server.
I mean watch out for users claiming 30 Feb 2021 as their birthdays, rather than guarding against injection attacks
Why not do both? Is there a specific reason why you want to leave yourself open to injection attacks?
Assume that the user sends a string to the server, either one they typed themselves or one produced by a control you placed on the page. The first step is to find a library function that parses the string into typed data. In your example you could use DateTime.TryParse to parse the string into a date; it will fail for your example because 30 Feb 2021 is not a valid date. If you cannot find a library function for what you are trying to parse, you can write a parser yourself. Simple validations can often be expressed as a regular expression; for more complicated inputs you may need to write dedicated parsing code, perhaps with the help of a parser library if the input language is particularly complex.
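For illustration, the Java analogue would be strict parsing with java.time, which rejects impossible dates like 30 Feb 2021 (a minimal sketch, not tied to any particular framework):

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class DateParseDemo {
    public static void main(String[] args) {
        try {
            // The default ISO_LOCAL_DATE formatter resolves strictly,
            // so an impossible calendar date throws rather than rolling over.
            LocalDate d = LocalDate.parse("2021-02-30");
            System.out.println("Parsed: " + d);
        } catch (DateTimeParseException e) {
            // e.g. "Invalid date 'FEBRUARY 30'" - the typed parser rejects it for us.
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```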
The second step is to implement the business validation rules specific to your needs. For example, you know that a birth date must be in the past, but not too far in the past. This requires some judgement: it is not impossible that someone using your site is 100 years old, but it is safe to reject a claimed age of 200, since no one is known to have lived that long.
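A sketch of such a business rule in Java (the 130-year cutoff below is an arbitrary illustrative choice):

```java
import java.time.LocalDate;

public class BirthDateRule {
    // Arbitrary illustrative cutoff: nobody on record has reached 130 years.
    private static final int MAX_AGE_YEARS = 130;

    static boolean isPlausibleBirthDate(LocalDate birthDate, LocalDate today) {
        return !birthDate.isAfter(today)                                // in the past
                && birthDate.isAfter(today.minusYears(MAX_AGE_YEARS)); // not absurdly old
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2021, 6, 1);
        System.out.println(isPlausibleBirthDate(LocalDate.of(1921, 5, 4), today)); // true
        System.out.println(isPlausibleBirthDate(LocalDate.of(1821, 5, 4), today)); // false
        System.out.println(isPlausibleBirthDate(LocalDate.of(2022, 1, 1), today)); // false
    }
}
```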
I would recommend the Strategy design pattern, one of the patterns described by the "Gang of Four" (GoF). You may also have heard of related ideas such as inversion of control and dependency injection.
Anyway, in an object-oriented language, you create a class called Validator that validates data in a method called validate. You'll have to make validate accept some relevant form of input, or overload it with different methods for different kinds of data; or, if you have access to some form of generics, you can use that.
Next, the constructor of this class should take a ValidatorStrategy object as an argument, and the actual validation is delegated to that strategy object.
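A minimal Java sketch of that shape (all names here are hypothetical):

```java
// The strategy: each implementation encapsulates one validation rule.
interface ValidatorStrategy<T> {
    boolean validate(T input);
}

// The validator delegates to whatever strategy it was constructed with.
class Validator<T> {
    private final ValidatorStrategy<T> strategy;

    Validator(ValidatorStrategy<T> strategy) {
        this.strategy = strategy;
    }

    boolean validate(T input) {
        return strategy.validate(input);
    }
}

// One concrete strategy: reject null or blank strings.
class NonEmptyStrategy implements ValidatorStrategy<String> {
    @Override
    public boolean validate(String input) {
        return input != null && !input.trim().isEmpty();
    }
}

class Demo {
    public static void main(String[] args) {
        Validator<String> v = new Validator<>(new NonEmptyStrategy());
        System.out.println(v.validate("hello")); // true
        System.out.println(v.validate("   "));   // false
    }
}
```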
To take this even further, you could create an input form generator where you specify input fields with your own type names. These would generate different input fields depending on your front end (HTML, Android XML, Java Swing), and they would also determine how the input is validated.
Hmm... I wonder how to handle the case of two password input fields that must have identical content to validate. How would that look in the form-generating system? Perhaps one input type named "password" would generate a single masked field with no validation, and another type named "passwordsetter" would generate two masked fields with a validation strategy that compares the data from the two fields. Creating that validation strategy could be pretty tricky, though; a sketch follows.
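One way that tricky strategy might look, sketched against the same hypothetical interface (repeated here so the snippet stands alone):

```java
interface ValidatorStrategy<T> {
    boolean validate(T input);
}

// Strategy for the hypothetical "passwordsetter" type: the form submits both
// fields together, and the strategy simply compares them for equality.
class PasswordPairStrategy implements ValidatorStrategy<String[]> {
    @Override
    public boolean validate(String[] fields) {
        return fields != null
                && fields.length == 2
                && fields[0] != null
                && !fields[0].isEmpty()
                && fields[0].equals(fields[1]);
    }
}
```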
I just have a quick question: which is the better approach for validation, Joi or regular expressions?
Do very simple validation and then send an email containing a verification link. If you try to validate an email address beyond "an @ followed by a .", you will get it wrong unless you read the whole email spec and maintain a ridiculously complicated regex.
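For instance, a deliberately loose check (sketched here in Java), on the understanding that the emailed link does the real verification:

```java
import java.util.regex.Pattern;

public class LooseEmailCheck {
    // Intentionally loose: something before '@', something after it,
    // and a dot in the domain part. The verification link does the rest.
    private static final Pattern LOOSE =
            Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    static boolean looksLikeEmail(String s) {
        return s != null && LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("alice@example.com")); // true
        System.out.println(looksLikeEmail("not-an-email"));      // false
    }
}
```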
The answers here (including this one) will be biased, as your question is quite subjective.
I will try to be as objective as possible.
First, let us review what Joi and regular expressions are:
Joi is a JavaScript-only library that does schema validation of JavaScript objects. According to the example on its GitHub page, it uses regular expressions internally.
Regular expressions are common across languages (with some variations) and can be used, among other things, to validate strings.
If you are writing a web UI and want to validate the data filled in by the user, Joi is purpose-built for that, because it can validate whole objects.
If you just want to check that e-mails are well-formed, Joi is overkill; in that specific case you can live with just a regular expression.
Subjective part: there is a certain school of the software engineering community that dislikes regular expressions for readability reasons. I am one of them. If you want just e-mail validation, I'd recommend writing a small validator yourself (check that the string includes an '@' sign, that a valid domain name follows it, and so on).
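A sketch of such a hand-written check in Java (the rules below are illustrative, not an RFC-complete implementation):

```java
public class SimpleEmailValidator {
    // Illustrative rules only: exactly one '@', a non-empty local part,
    // and a domain containing a dot that is neither first nor last.
    static boolean isValid(String email) {
        if (email == null) return false;
        int at = email.indexOf('@');
        if (at <= 0 || at != email.lastIndexOf('@')) return false;
        String domain = email.substring(at + 1);
        int dot = domain.indexOf('.');
        return dot > 0 && dot < domain.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(isValid("bob@example.org")); // true
        System.out.println(isValid("bob@example"));     // false
        System.out.println(isValid("@example.com"));    // false
    }
}
```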
I'm working on a web service project where I display data obtained from a database, which in turn is built from users' inputs. Of course I want to protect the application from XSS attacks, so I sanitize the input by escaping HTML special characters. But I have the following problem: data returned from the server arrives already escaped, e.g. &lt; for the '<' sign, and on the front end a second sanitization pass occurs, turning it into &amp;lt;, which the browser then displays literally instead of rendering '<'. Is there a simple way around this, or should I sanitize in only one place (I presume the server would be the best option)?
Thanks for all answers.
You can't reliably sanitize user input; it's a losing battle. As soon as you think you've filtered out all the "bad" characters, someone will pass in an escape sequence or something else unexpected.
If you're using a database server, make sure all input is handled by pre-compiled stored procedures, and make sure the account the web app logs in as has only EXECUTE permissions. This prevents SQL injection and other mischief.
If you're worried about actual characters, make sure you have a "pass through OK characters" filter and not a "remove bad characters" filter. The number of "good characters" is finite, while the number of attack vectors is infinite.
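A toy allowlist filter in Java (the allowed set below is only an example; choose it per field):

```java
public class AllowlistFilter {
    // Keep only characters we explicitly trust; everything else is dropped.
    static String filter(String input) {
        return input.replaceAll("[^A-Za-z0-9 .,'-]", "");
    }

    public static void main(String[] args) {
        System.out.println(filter("O'Brien <script>alert(1)</script>"));
        // -> O'Brien scriptalert1script
    }
}
```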
As for your question about '<' characters: if the intended output is for user display, you can run the entire string through HttpServerUtility.HtmlEncode or its equivalent in whatever language you use. This converts the string into markup that displays properly in the browser without being interpreted.
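Java has no HttpServerUtility, but a minimal equivalent is easy to sketch (in production you would more likely reach for an existing library such as the OWASP Java Encoder). Crucially, run it exactly once, at output time; encoding on the server and again on the front end is exactly what produces the &amp;lt; problem from the question:

```java
public class HtmlEscape {
    // Minimal HTML-encoding sketch: escape the five characters that matter
    // in HTML text and attribute contexts.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<script>alert('x')</script>"));
        // -> &lt;script&gt;alert(&#39;x&#39;)&lt;/script&gt;
    }
}
```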
It doesn't look like you're having a problem escaping it, it looks like you're having a problem deciding if you need to escape it. Pick a standard and stick with it, then convert as necessary. If it normally comes in unescaped, just store it that way, and escape it when you want to display it.
The best way to sanitize untrusted data that is served back to users, in the context of XSS (for Spring Boot, say), is to use a template engine that suits your needs (e.g. JSP).
The template engine will automatically generate the HTML you need, escape it properly, and insert the content into the required placeholder (which also avoids broken-encoding issues with async requests).
Be careful to check whether your chosen engine escapes by default or needs a special directive to do so.
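For example, with JSP and JSTL, c:out escapes HTML by default (escapeXml is true unless you turn it off); the variable name below is hypothetical:

```jsp
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%-- <c:out> escapes <, >, &, ' and " by default (escapeXml="true") --%>
<p><c:out value="${comment.text}"/></p>
```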
I'm thinking of introducing a strongly typed (read: with a predefined schema) data interchange format for communication between our internal services, for example something like Thrift or Cap'n Proto.
At least two obvious advantages (to me) of using this over something like JSON are that:
1. you would KNOW the exact format of the data the service can expect, which leaves less room for ambiguity and errors in communication, and
2. the implementation generally deserializes the raw message for you and provides methods for accessing the resulting objects.
What are the practical disadvantages for going this route, versus something like JSON?
For context: our system consists of services written in Python and Java (and possibly other languages in the future) that communicate via HTTP endpoints and message brokers like RabbitMQ.
As with every strongly typed system, one of the major advantages is without a doubt that if you make mistakes, it fails early in the process, typically at the compilation stage, which is a good thing.
The second biggest advantage, IMHO, is what you already said: because the fields and types are well known, the compiler, libraries, and related code know what data to expect and can thus be written and organized more efficiently; in short, performance.
In contrast, a loosely typed system (like Avro), while allowing much greater flexibility without the need to recompile, comes with the other side of the same coin: it is prone to errors about the contents of a message at runtime.
This is because a loosely defined system defines only the syntax of a valid document (as XML does, for example) and leaves the message-level semantics of what's in the document to the upper layers. A strongly typed system has knowledge of those message-level semantics built in at compile time, so it is easy to decide whether a particular document or message is not only well-formed but also valid with regard to its contents. To do the same with a loosely defined system, you need to provide additional information at runtime (such as an XML schema) and validate the document against it.
Bottom line
Which system you prefer is more or less a matter of taste in most cases. I'd base the decision on how variable the data I have to deal with are. If a strongly typed system fits, I'd go that way, because I very much like being informed about errors and mistakes early.
However, if you need very flexible data structures, it may make more sense to go the other road. Although designing a loosely typed schema on top of a strongly typed system is certainly possible, it is somewhat contradictory, and you'll end up with something overly complicated and at the same time overly generic.
Typed
Having incoming messages type-tagged is very liberating, so long as it's possible to tell what an incoming message is without reading all of it. If so, you no longer care so much about message order, because it's easy for the recipient to handle whatever it is sent. You can have an application that just sits there taking whatever it gets and doing whatever is appropriate for each message.
Format
A schema language that allows you to define value and size constraints is very useful. It means that the sender of a message cannot accidentally send an invalid one. Moreover the receiver can automatically tell if an incoming message meets the schema. This is a real bonus in implementing a network service; the vast bulk of the message validation is done for you!
By size constraint, I mean that you can specify how long an array is in the schema and the generated code will refuse to handle arrays longer or shorter. By value constraints, imagine a message field called "bearing"; you might want to constrain that to be between 0 and 359.
These both allow you to make a clear, unambiguous statement about what the interface is and have it enforced automatically. How many security bugs have there been recently where some network interface data validation has been badly implemented...
Options
One serialisation standard that does all this is ASN.1. The tools I've used take an ASN.1 schema and produce code to serialise and deserialise, automatically checking that the value and size constraints have been met and also telling you what an incoming message type is. The tools for ASN.1 can be quite elderly and are in need of updating. If updated it would be ideal for every purpose, with both binary and text wire formats available.
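For illustration, the "bearing" example above might look like this in ASN.1 (a hypothetical module, not from any real protocol):

```asn1
-- Hypothetical module showing value and size constraints.
Nav DEFINITIONS ::= BEGIN

    Bearing ::= INTEGER (0..359)              -- value constraint

    Track ::= SEQUENCE {
        name     UTF8String (SIZE (1..32)),   -- size constraint on a string
        bearings SEQUENCE (SIZE (1..100)) OF Bearing
    }

END
```

Generated serialisers will refuse to encode or decode a message that violates these bounds, which is exactly the "validation done for you" described above.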
There are now JSON Schemas too, and they seem to support type, value, and size constraints. This might be what you're looking for.
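A small JSON Schema sketch expressing the same kinds of constraints (the field names are made up for the example):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "bearing":   { "type": "integer", "minimum": 0, "maximum": 359 },
    "waypoints": {
      "type": "array",
      "items": { "type": "string" },
      "minItems": 1,
      "maxItems": 16
    }
  },
  "required": ["bearing"]
}
```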
I'm fairly sure that Google Protocol Buffers doesn't do type tagging very well, and doesn't do value and size constraints. I've seen comments in GPB schemas along the lines of:
// mustn't be greater than 10.
If that's what is being written into a schema, the schema language is arguably inadequate...
I'm not sure about Thrift; I don't think it does value constraints (someone please correct me if I'm wrong!).
Disadvantages
Can't think of any! It can irritate developers; code they thought was good can be readily revealed to be producing junk messages, which annoys them intensely...
I'm trying to make a website that serves as the interface between a plotting program and the user's input file. The plotting program needs several parameters, which I can let the user enter with input tags. But it also needs user input for the legend that distinguishes the values in the input file, namely the range (boundary) of each value and the corresponding color for that range. I made a fieldset containing the required input elements for one range. When the user clicks "Add another range", the content of the fieldset is cleared so it is ready for new input, and the previously entered values are stored in a table below as a new row. Beside each row there is a "delete" button.
As the website is aimed at multiple users, this information must also be kept separate per user. Could someone tell me what approach I should use? The plotting program is written in Perl, and I'm using CGI for the website.
The approach should let the HTML part access the current values in the array, so I can display the entered ranges in the table dynamically, and it should allow deletion, modification, and addition of the entered range information. I'm thinking of a temporary database. In the end, though, I only need the final version of all the range info as a single string, so I can send it to the CGI program and format it correctly as input for the Perl plotting program.
Any help or hint is greatly appreciated! I'm a newbie in this area. Thank you very much for your time and help in advance!
JSON is pretty universal these days; use that. Many newer database systems, such as MongoDB, use JSON as a native storage format.
Most server-side languages can consume and produce JSON easily. JSON allows structured data, so it can represent more than simple arrays.
JSON is also very fast in the browser (compared to XML), since it maps directly onto native JavaScript objects.
If the data will stay purely in Perl, then FreezeThaw or Storable are the things to use. If your data is simple, there is nothing wrong with Diodeus' suggestion of JSON, but as things get complicated, those modules will handle the complexities of Perl data structures better.
If one has an object which can persist itself across executions (whether to a DB using an ORM, via something like Python's shelve module, etc.), should validation of that object's attributes be placed within the class representing it, or outside?
Or, rather: should the persistent object be dumb and expect whatever sets its values to be benevolent, or should it be smart and validate the data being assigned to it?
I'm not talking about type validation or user input validation, but rather things that affect the persistent object, such as ensuring that links/references to other objects exist, that numbers are unsigned, that dates aren't out of range, and so on.
Validation is part of encapsulation: an object is responsible for its internal state, and validation is part of maintaining that state.
It's like asking "should I let an object do a job and set its own variables, or should I use getters to fetch them all, do the work in an external function, and then use setters to write them back?"
Of course you should use a library to do most of the validation; you don't want to implement a "check unsigned values" function in every model, so you implement it in one place and let each model use it in its own code as it sees fit.
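A sketch of that split in Java (all names hypothetical): the checks are written once in a shared helper, and the persistent object applies them to guard its own state:

```java
import java.time.LocalDate;

// Shared validation helpers, implemented once and reused by every model.
final class Checks {
    static int requireUnsigned(int value, String field) {
        if (value < 0) throw new IllegalArgumentException(field + " must be >= 0");
        return value;
    }

    static LocalDate requirePast(LocalDate d, String field) {
        if (d == null || !d.isBefore(LocalDate.now()))
            throw new IllegalArgumentException(field + " must be a past date");
        return d;
    }
}

// The persistent object guards its own state instead of trusting callers.
class Person {
    private int loginCount;
    private LocalDate birthDate;

    void setLoginCount(int loginCount) {
        this.loginCount = Checks.requireUnsigned(loginCount, "loginCount");
    }

    void setBirthDate(LocalDate birthDate) {
        this.birthDate = Checks.requirePast(birthDate, "birthDate");
    }
}
```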
The object should validate the data input. Otherwise, every part of the application that assigns data has to apply the same set of tests, and every part that retrieves the persisted data has to handle the possibility that some other module hasn't done its checks properly.
Incidentally, I don't think this is an object-oriented thing; it applies to any data-persistence construct that takes input. Basically, you're talking about Design by Contract preconditions.
My policy is that, for the code as a whole to be robust, each object A should check as much as possible, as early as possible. But "as much as possible" needs explanation:
The internal coherence of each field B in A (type, range within the type, etc.) should be checked by B's own type. If B is a primitive field or a reused class, that is not possible, so object A should check it.
The coherence of related fields (if field B is null, then C must also be) is typically the responsibility of object A.
The coherence of a field B with code external to A is another matter. This is where the "POJO" approach (from Java, but applicable to any language) comes into play.
The POJO approach says that, with all the responsibilities and concerns we have in modern software (persistence and validation being only two of them), domain models end up messy and hard to understand. The problem is that these domain objects are central to understanding the whole application and to communicating with domain experts. Every time you read a domain object's code, you have to wade through the complexity of all these concerns, even though you may care about none, or only one, of them...
So, in the POJO approach, your domain objects must not carry code related to any of these concerns (which usually means no concern-specific interface to implement or superclass to extend).
All concerns except the domain itself live outside the object (though some simple information can still be provided, in Java usually via annotations, to parameterize the generic external code that handles a given concern).
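With Bean Validation, for example, the domain object carries only declarative hints while the validating machinery lives entirely outside it (a sketch using the standard javax.validation annotations):

```java
import java.time.LocalDate;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Past;
import javax.validation.constraints.PositiveOrZero;

// A plain domain object: no framework superclass or interface to implement,
// just annotations that an external validator (e.g. Hibernate Validator)
// interprets when asked to.
public class Customer {
    @NotNull
    private String name;

    @Past
    private LocalDate birthDate;

    @PositiveOrZero
    private int loginCount;

    // getters and setters omitted for brevity
}
```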
Also, the domain objects relate only to other domain objects, not to framework classes tied to one concern (such as validation or persistence). The domain model, with all its classes, can therefore be put in a separate "package" (project or whatever) with no dependencies on technical or concern-related code. This makes it much easier to understand the heart of a complex application without the complexity of all those secondary aspects.