How should one reasonably handle combining characters in UTF-8?

I am writing a website with a user chat function.
At some point a user decided to use diacritics to draw all over everyone's screens.
In response I removed all text outside the ASCII character range. I'd like to re-enable full UTF-8, but I don't know what to do about combining marks (characters that modify the base character they follow).
As you can see from the example below, Stack Overflow doesn't handle this problem either.
Malicious input: t̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀è̀̀̀̀̀̀̀x̀̀̀̀̀̀̀̀̀̀t̀̀̀̀̀̀̀̀̀̀̀̀̀
I feel like only one combining mark should be allowed, but that seems like an excessive thing to have to write myself, and I don't know whether any languages legitimately take two or three combining characters. I imagine Korean uses them extensively.
This seems like it should be a solved problem, but I can't find any useful information on the topic.
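For what it's worth, here is a minimal sketch of the cap-the-marks idea in Python (the limit of two marks per base character is an arbitrary assumption; some scripts legitimately stack several marks, so test against real text in your users' languages before deploying):

    import unicodedata

    def limit_combining_marks(text, max_marks=2):
        # Normalize to NFC first so ordinary accented characters ("e" + U+0300)
        # collapse into single precomposed code points and are left alone.
        text = unicodedata.normalize("NFC", text)
        out, run = [], 0
        for ch in text:
            if unicodedata.category(ch).startswith("M"):  # Mn/Mc/Me: combining marks
                run += 1
                if run > max_marks:
                    continue  # drop marks beyond the allowed run
            else:
                run = 0
            out.append(ch)
        return "".join(out)

Run on the malicious input above, this keeps at most two marks per letter, while normal accented text passes through unchanged thanks to the NFC step.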

Related

SQL: What are all the delimiters?

I am running some PHP on my website pulling content from a SQL database. Queries are working and all is good there. The only problem I am having is that when I upload my dataset (.csv format) using phpMyAdmin, an error occurs that drops all content after a certain row. Supposedly this is caused by SQL recognizing more columns in that row than intended. And unfortunately, this is not just a single occurrence. I cannot figure out exactly what the problem is, but most likely some values in the 'description' column contain delimiters that split it into multiple columns. Hopefully, by deleting/replacing all these delimiters the problem can be solved. I am rather new to SQL, though, and I cannot find a source that simply lays out all the potential delimiters I should consider. Can somebody help me out?
Thank you in advance and take care!
From personal experience, one can delimit by many different things. I've seen pipes (|) and commas (,), as well as tabs, fixed-width spacing, tildes (~), and colons (:).
Taken directly from https://en.wikipedia.org/wiki/Delimiter-separated_values:
"Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical packages, sometimes even without the user designating which delimiter has been used.[5][6] Despite that each of those applications has its own database design and its own file format (for example, accdb or xlsx), they can all map the fields in a DSV file to their own data model and format.[citation needed]
Typically a delimited file format is indicated by a specification. Some specifications provide conventions for avoiding delimiter collision, others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field.
Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or only quoting those data fields that contain the delimiter character. One problem with tab-delimited text files is that tabs are difficult to distinguish from spaces; therefore, there are sometimes problems with the files being corrupted when people try to edit them by hand. Another set of problems occur due to errors in the file structure, usually during import of file into a database (in the example above, such error may be a pupil's first name missing).
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With rising prevalence of web sites and other applications that store snippets of code in databases, simply using a " which occurs in every hyperlink and image source tag simply isn't sufficient to avoid this type of collision. Since colons (:), semi-colons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that isn't being used elsewhere."
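To make the quoted advice concrete, here is a small illustration (Python's csv module, used here as just one convenient way to show it) of how quoting fields avoids delimiter collision:

    import csv, io

    rows = [["id", "description"],
            ["1", 'contains a comma, a "quote", and a | pipe']]
    buf = io.StringIO()
    # QUOTE_MINIMAL (the default) quotes only fields containing the delimiter,
    # the quote character, or a newline; embedded quotes are doubled.
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    print(buf.getvalue())
    # id,description
    # 1,"contains a comma, a ""quote"", and a | pipe"

phpMyAdmin's CSV import dialog likewise lets you declare which character encloses columns, so a consistently quoted file should import without rows splitting apart.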

What collation must I use, utf8_general_ci, utf8_unicode_ci, or another, for all world languages?

We develop an Android app. The app accepts text from users and uploads it to a server (MySQL). This text is then read by other users.
While testing I found that Hindi text gets inserted into the column as '?????'. After searching SO, I changed the collation to utf8_general_ci.
I am new to collations. I want to let users input text in any language in the world and let others read it. What should I do? Accuracy is a must.
But I saw a comment where someone says: "You should never, ever use utf8_general_ci. It simply doesn’t work. It’s a throwback to the bad old days of ASCII stooopeeedity from fifty years ago. Unicode case-insensitive matching cannot be done without the foldcase map from the UCD. For example, “Σίσυφος” has three different sigmas in it; or how the lowercase of “TSCHüẞ” is “tschüß”, but the uppercase of “tschüß” is “TSCHÜSS”. You can be right, or you can be fast. Therefore you must use utf8_unicode_ci, because if you don’t care about correctness, then it’s trivial to make it infinitely fast."
Your question title is asking about collations, but in the body you say:
I want to let users input text in any language in the world and let others read it.
So, I'm assuming that is what you're specifically after. To clarify, collations affect how MySQL compares strings with each other, but they are not what ultimately opens up the possibility of storing Unicode characters.
For storage you need to ensure that the character set is defined correctly. MySQL allows you to specify character set and collation values at the column level, but it also allows you to specify defaults at the table and database level. In general I'd advise setting defaults at the database and table level and letting MySQL handle the rest when defining columns. Note that if columns already exist with a different character set, you'll need to investigate changing it. Depending on what you're using to communicate with MySQL, you may need to specify a character encoding for the connection too.
Note that utf8mb4 is an absolute must for the character set; do not use plain utf8. With utf8 you won't be able to store Unicode characters that take 4 bytes in UTF-8, such as emoji.
As for the collation to use, I don't have a strong recommendation, as it depends on what you're aiming for: speed or accuracy. There is a fair amount of information covering the topic in other answers.
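As a rough illustration of the character-set side (a sketch assuming the PyMySQL driver; the database, table, and credentials are made up, and utf8mb4_unicode_ci is shown as one reasonable choice, not the only one):

    import pymysql

    # Declare utf8mb4 on the connection as well as on the table itself.
    conn = pymysql.connect(host="localhost", user="app", password="secret",
                           database="chat", charset="utf8mb4")
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS messages (
                id INT AUTO_INCREMENT PRIMARY KEY,
                body TEXT
            ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
        """)
    conn.commit()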

Is it a bad idea to escape HTML before inserting into a database instead of upon output?

I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.
Other similar questions I've seen here seem to cover cases where HTML can still be used for formatting, so I'm asking about a case where HTML wouldn't be used at all.
You will also restrict yourself by escaping before inserting into your DB. Let's say you later decide to output not HTML but JSON, plain text, etc.
If you have stored escaped HTML in your DB, you would first have to unescape the stored value, just to re-escape it for the different format.
Also see this excellent OWASP article on XSS prevention.
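A tiny sketch of the unescape/re-escape round trip mentioned above (Python, purely illustrative):

    import html, json

    stored = "3 &lt; 4 &amp; 5 &gt; 2"   # HTML-escaped text sitting in the DB
    raw = html.unescape(stored)          # step 1: undo the HTML escaping
    print(json.dumps({"msg": raw}))      # step 2: re-encode for the new format
    # {"msg": "3 < 4 & 5 > 2"}

Storing the raw text would make step 1 unnecessary.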
Yes, because at some stage you'll want access to the original input entered. This is because...
You never know how you want to display it - in JSON, in HTML, as an SMS?
You may need to show it back to the user as is.
I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it is a regex, then look out for confused users who might type something like this...
3<4 :->
They'll only get the 3 if it is a regex.
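For instance, with a typical naive tag-stripping pattern (a hypothetical example, not necessarily what you're using):

    import re

    # The regex treats "<4 :->" as a tag and removes it.
    print(re.sub(r"<[^>]*>", "", "3<4 :->"))  # prints "3"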
Suppose you have the text R&B and store it escaped, as R&amp;B. If someone searches for R&B, it won't match with a search SQL like:
SELECT * FROM table WHERE title LIKE ?
The same for equality, sorting, etc.
Or if someone searches for life span, it could return extraneous matches from the escaped &lt;span&gt;'s. Though this is a bit orthogonal, and can be solved by using an external service like Elasticsearch, or by storing a raw-text version in another field, similar to what @limscoder suggested.
If you expose the data via an API, the consumers may not expect the data to be escaped. Adding documentation may help.
A few months later, a new team member joins. Being a well-trained developer, he always escapes HTML, only to find everything is now double-escaped (e.g. titles showing up as He said &quot;nuff&quot; instead of He said "nuff").
Some escaping functions have additional options. Forgetting to use the same functions/options while un-escaping could result in a different value than the original.
It's more likely to happen with multiple developers/consumers working on the same data.
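Here is a minimal sketch of the store-raw, escape-on-output approach (Python, with SQLite standing in for whatever database you actually use):

    import html
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")

    # Store the raw text exactly as the user typed it...
    raw = 'He said "nuff" & typed 3<4 :->'
    conn.execute("INSERT INTO comments (body) VALUES (?)", (raw,))

    # ...and escape only at the point of HTML output.
    (body,) = conn.execute("SELECT body FROM comments").fetchone()
    print(html.escape(body))
    # He said &quot;nuff&quot; &amp; typed 3&lt;4 :-&gt;

Searching, sorting, JSON output, and SMS all work against the raw column; only the HTML view pays the escaping cost.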
I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.

Finding special characters in MySQL database

Does anyone know of a quick and easy way to locate special characters that didn't get correctly converted when data was imported into MySQL?
I think this is an issue with data encoding (e.g. Latin-1 vs. UTF-8). Regardless of where the issue first occurred, I'm stuck with junk in my data that I need to remove.
There's unlikely to be an easy function for this, because, for example, a broken UTF-8 special character will consist of two valid ISO-8859-1 characters. So while there are patterns in what those broken characters look like, there is no sure-fire way of identifying them.
You could build a search+replace function to replace the most common occurrences in your language (e.g. Ãœ for Ü if imported from UTF-8 into ISO-8859-1).
That said, it would be best to restart the import with the correct settings, if at all possible.
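Where the damage does follow a single known pattern, the repair can be mechanical. A sketch of the Ãœ-for-Ü case (it assumes the UTF-8 bytes were misread as Windows-1252, and it will raise an exception where that assumption fails, which is exactly why targeted search-and-replace is the safer bulk approach):

    # "Ãœ" is what UTF-8 "Ü" looks like when its bytes are misread as
    # Windows-1252; reversing the misreading recovers the original.
    broken = "Ãœ"
    print(broken.encode("cp1252").decode("utf-8"))  # -> "Ü"

The Python library ftfy packages up heuristics of exactly this kind, if you'd rather not hand-roll the replacements.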

How would I find common misspellings of a given word using aspell or another tool

For a given word, I'd like to find the n closest misspellings. I was wondering whether an open-source spell checker like aspell would be useful in that context, unless you have other suggestions.
For example: 'health'
would give me: ealth, halth, heallth, healf, ...
Spelling correction tools take misspelled words and offer possible correctly spelled alternatives. You seem to want to go in the other direction.
Going from a correctly spelled word to a set of possible misspellings could probably be done by applying a set of mutation heuristics to common words; a rough sketch follows the list. These heuristics might do things like:
randomly adding or removing single characters
randomly applying transpositions of pairs of characters
changing characters to other characters based on keyboard layouts
applying common "point" misspellings, e.g. transposing "ie" to "ei", or doubling/un-doubling "l"s.
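A rough sketch of those mutation heuristics in Python (the keyboard-neighbour map is a tiny hand-made stand-in; a real one would cover the whole layout):

    import random, string

    NEIGHBOURS = {"a": "qsz", "e": "wrd", "h": "gjn", "l": "kop", "t": "rgy"}

    def mutate(word):
        i = random.randrange(len(word))
        op = random.choice(["delete", "insert", "transpose", "substitute"])
        if op == "delete":
            return word[:i] + word[i + 1:]
        if op == "insert":
            return word[:i] + random.choice(string.ascii_lowercase) + word[i:]
        if op == "transpose" and i < len(word) - 1:
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]
        # substitute with a keyboard neighbour (vowels as a crude fallback)
        return word[:i] + random.choice(NEIGHBOURS.get(word[i], "aeiou")) + word[i + 1:]

    def misspellings(word, n=5):
        found = set()
        while len(found) < n:
            candidate = mutate(word)
            if candidate != word:
                found.add(candidate)
        return found

    print(misspellings("health"))  # e.g. {'ealth', 'halth', 'healgh', ...}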
Going from a correctly spelled word to a set of common misspellings is really hard. Probably the only reliable way to do this would be to instrument a spelling checker package used by a large community of users, record the actual spelling corrections made using the spelling checker, and aggregate the results. That is probably (!) beyond the scope of your project.
On revisiting my answer, I think I've missed something.
My heuristics above are mostly for typing errors rather than misspellings. A typing error is where the user knows the correct spelling but mistypes the word. A misspelling is where the person doesn't know the correct spelling and relies on incorrect knowledge or intuition (i.e. a guess). Typical guesses are based on what the word sounds like, then picking a spelling that (if correct) would most likely be pronounced that way.
So a good heuristic for predicting misspellings would need to be based on what the word actually sounds like when spoken. That requires a phonetic dictionary (to go from the actual word to its pronunciation) and a set of rules for generating plausible spellings for the phonetic form. That's more complicated than simple heuristics for typing errors.
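To show the phonetic direction in miniature, here is a sketch that keys words by classic Soundex (a crude stand-in for a real phonetic dictionary; it only demonstrates that sound-alike misspellings cluster under one key):

    def soundex(word):
        groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
                  "l": "4", "mn": "5", "r": "6"}
        def code(ch):
            return next((d for g, d in groups.items() if ch in g), "")
        word = word.lower()
        key, prev = word[0].upper(), code(word[0])
        for ch in word[1:]:
            d = code(ch)
            if d and d != prev:
                key += d
            if ch not in "hw":  # h and w do not separate repeated codes
                prev = d
        return (key + "000")[:4]

    for w in ["health", "helth", "hellth", "heallth"]:
        print(w, soundex(w))  # all four print H430

Generating plausible spellings from the pronunciation itself (the reverse direction) still needs the phonetic dictionary and spelling rules described above.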