I'm using rich text editors (tiny_mce) pretty widely in a web application I'm developing. An issue I can't seem to get my head around is how to limit user input so that the length of generated content fits in my database field. Character counters, which I would normally use for non rich-text content, are pretty useless because the user does not see (or know or care about) the HTML being generated behind the scenes. i.e. If I limit the input to 100 characters and the user types 99 characters they will still likely exceed the character limit because of the html tags used to structure their content behind the scenes.
The options I see are:
Do nothing and hope for the best. Obviously not really an option. There is a good chance that the insert will fail ungracefully when attempting to save to the db (mysql in this case).
Validate the length of the generated HTML and provide the user with a message if it exceeds the allowed length. This will be confusing to the user who, to the best of his knowledge, has met the requirements for text entry. Telling him that he has exceeded the limit when, in fact, he has only entered 99 characters would only serve to confuse/enrage.
Make the database field super large with the hope that it will never be exceeded. Seems like a bad idea...
This must be a commonly encountered issue. What is the best solution here?
When the form is submitted, do a JavaScript check on the length of the field being submitted.
As for your options:
Q: Do nothing and hope for the best.
A: Won't work. Someone will try something.
Q: Validate the length of the generated HTML and provide user with a message if it exceeds allowed length.
A: I recently had a similar situation. I put a static message on the page that said the limit was 10,000, but I actually checked for 50,000. This worked for several months until someone copied and pasted out of MS Word. The user saw 5,000 characters, but it was encoded in way over 100,000
Q: Make the database field super large with the hope that it will never be exceeded. Seems like a bad idea...
A: Depending on your database and version, this could be tolerable. Not ideal, but tolerable.
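One way to reconcile the two limits, as a rough sketch: count only the characters the user can actually see for the advertised limit, and separately check the size of the generated markup against what the column can hold. The function names and limits below are made up for illustration, and this uses Python's standard html.parser on the server side rather than anything tiny_mce provides.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_length(html_fragment):
    """Number of characters the user sees, ignoring tags."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return len("".join(parser.parts))


def check_submission(html_fragment, visible_limit, storage_limit_bytes):
    """Return a list of problems; an empty list means the input is fine."""
    problems = []
    if visible_length(html_fragment) > visible_limit:
        problems.append("visible text too long")
    if len(html_fragment.encode("utf-8")) > storage_limit_bytes:
        problems.append("markup too large")
    return problems
```

With this split, the message shown for "visible text too long" matches what the user counted, while "markup too large" (the MS Word paste case) can get its own wording.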
I have a contenteditable div where users can enter up to about 20kb of text. As they're typing I want to keep autosaving that div to a mysql table so that they can later retrieve that text on their phone, tablet etc. So far I've been simply autosaving every 20 seconds by taking the whole text block and doing an update to the row in my table that holds that text.
I realize that method is really inefficient and won't work once my user base grows. Is there a better way to do this? E.g, somehow take a diff between the original text and what changed and save only that? Use node.js? If you've got some idea, let me know. Thanks.
Not all texts will be 20kb, so there doesn't need to be a direct problem. You could also choose to autosave less often, and you can autosave to a session too, and only write to the database less frequently.
You could calculate the differences between the stored version (which can be kept in the session too) and the typed version. The advantage is that you don't have to use the database so often, but processing will become harder, so the load on your webserver will increase too.
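A minimal sketch of the "save only the differences" idea, using Python's standard difflib. The patch format (a list of start, end, replacement tuples) is invented here for illustration. Note that computing the diff costs CPU time, which is the extra webserver load mentioned above.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Return a list of (start, end, replacement) edits that turn old into new.

    start and end are offsets into the old text.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_patch(old, patch):
    """Apply the edits from the end backwards so earlier offsets stay valid."""
    text = old
    for start, end, replacement in reversed(patch):
        text = text[:start] + replacement + text[end:]
    return text
```

The client (or the session layer) would send only the patch, and the side holding the stored version would call apply_patch to rebuild the full text.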
I'd maybe choose to do autosaves to a memory table (you can choose the storage type in MySQL), and write a separate process/cron job, that updates the physical table in bulk on regular intervals.
This reminds me of Google Docs; they have a pretty nice system for autosaving large text to the database. I don't know about the internal components of Google Docs, but that is something you might want to look into.
Google has much more bandwidth, so their method might not work for you, though it seems like they are constantly saving data to their servers.
What you could do is use JavaScript to save the data on the user's side, and only load the data into the database when the user leaves or clicks "save". This way, when they come back to the page, as long as the save file is there, they can get back to their file.
One way:
Use ajax to send the text back to the server every x seconds, or even at multiples of x characters (e.g. every 10, 20, 30 chars and so on).
Use server-side code (e.g. PHP) to do most of the work for this request (as it's fast and efficient), where you could hold the text in the session.
On each request the server side code could compare with the session text from the last request and would then give you the possibility to do one of the following:
a) If the start of the new text is the same as the old session text, then just update the field in the database with the difference, using, say, concat:
update table
set field=concat(field,'extra data...')
where id='xx'
and then set session text to new text.
b) If start text is different then you know you need to do a straight up update statement to change the field value to the new text.
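The a)/b) decision above can be sketched roughly as follows. This is only an illustration in Python rather than PHP; the table name documents, the column body, and the %s placeholder style are assumptions, not something from the original setup.

```python
def plan_update(previous, current):
    """Decide how to persist current, given the text saved on the last request."""
    if current == previous:
        return ("noop", "")
    if current.startswith(previous):
        # Case a): only the tail is new, so append it.
        return ("append", current[len(previous):])
    # Case b): the start changed, so rewrite the whole field.
    return ("replace", current)


def build_statement(action, payload, doc_id):
    """Return (sql, params) for a parameterized query, or None for a no-op."""
    if action == "append":
        return ("UPDATE documents SET body = CONCAT(body, %s) WHERE id = %s",
                (payload, doc_id))
    if action == "replace":
        return ("UPDATE documents SET body = %s WHERE id = %s",
                (payload, doc_id))
    return None
```

Passing the text as a parameter instead of pasting it into the SQL string also avoids quoting problems when the user types an apostrophe.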
Another way:
You only do a full update when necessary; otherwise, just a concat.
You can "hold" a certain number of requests in the session before you actually do the update, e.g. run the ajax every 5 seconds or every 5 chars, but have your server-side script run the update statement only on every 5th call, or after 20 chars, or after an additional 20 chars if the first part hasn't changed.
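A small sketch of "hold the requests and only write on every Nth call". The class name and the write callback are hypothetical; in practice the held text would live in the session and write would run the update statement.

```python
class AutosaveBuffer:
    """Keeps the latest text and writes it out only on every Nth call."""

    def __init__(self, write, every=5):
        self._write = write      # callback that persists the text
        self._every = every      # write on every Nth received request
        self._calls = 0
        self._latest = None

    def receive(self, text):
        """Record an autosave request; return True if it triggered a write."""
        self._latest = text
        self._calls += 1
        if self._calls % self._every == 0:
            self.flush()
            return True
        return False

    def flush(self):
        """Write any held text, e.g. when the user leaves the page."""
        if self._latest is not None:
            self._write(self._latest)
            self._latest = None
```

The flush call matters: without it, the last few keystrokes before the user closes the page would never reach the database.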
A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always assumed that it was done in order to preserve data quality and possibly to provide a better user experience as well.
Other more generic examples include Phone with Area Code (10 consecutive digits versus [3,3,4]) and of course SSN (9 digits versus [3,2,4])
It got me wondering whether there are any known standards out there on the topic? When do you split up your ID#? Specifically with regards to user experience and minimizing data entry errors.
I know there was some research into this, the most I can find at the moment is the Wikipedia article on Short-term memory, specifically chunking. There's also The Magical Number Seven, Plus or Minus Two.
When I'm providing IDs to end users, I personally like to break them up into blocks of 5, which appears to be the same convention the original designer of your system used. I've got no logical reason that I can give you for having picked this number other than it "feels right". Short of being able to spend a lot of money on carrying out a study, "gut instinct" and following conventions from other systems is probably the way to go.
That said, if you can make the UI more usable to the user by:
Automatically moving from the end of one field to the start of another when it's complete
Automatically moving from the start of one field to the prior field and deleting the last character when the user presses delete in an empty field that isn't the first one
OR
Replacing it with one long field that has some form of "input mask" on it (not sure if this is doable in plain HTML, but it may be feasible using one of the UI frameworks) so it appears like "_____ - _____ - _____ - ____" and ends up looking like "12345 - 54321 - 12345 - 1234"
It would almost certainly make them happier!
Don't know about standards, but from a personal point of view:
If there are multiple fields, make sure the cursor moves to the next field once a field is full.
If there's only one field, allow spaces/dashes/whatever to be used in that field because you can filter them out. It's really annoying when sites/programs force you to enter dates in "dd/mm/yyyy" format, for example, meaning the day/month must be padded with zeroes. "23/8/2010" should be acceptable.
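For the single-field case, filtering can be as small as the sketch below. It assumes the 19-digit account number with [5,5,5,4] grouping from the question; the function names are invented for the example.

```python
import re

GROUPS = (5, 5, 5, 4)  # grouping from the question: 19 digits in total


def normalize_account_number(raw):
    """Drop spaces and dashes; return the bare digits, or None if invalid."""
    digits = re.sub(r"[\s-]", "", raw)
    if re.fullmatch(r"[0-9]{%d}" % sum(GROUPS), digits) is None:
        return None
    return digits


def format_account_number(digits):
    """Show the number back to the user in readable blocks."""
    parts = []
    pos = 0
    for size in GROUPS:
        parts.append(digits[pos:pos + size])
        pos += size
    return " - ".join(parts)
```

The user can then type or paste "12345-54321 12345-1234", the application stores the bare digits, and the page displays the grouped form for confirmation.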
You need to consider the wider context of your particular application. There are always pros and cons of any design decision, but their impact changes depending on the situation, so you have to think every time.
Splitting the long number into several fields makes it easier to read, especially if you choose to divide the number the same way as most of your users. You can also often validate the input as soon as the user goes to the next field, so you indicate errors earlier.
On the other hand, users rarely type long numbers like that nowadays: most of the time they just copy-paste them from whatever note-keeping solution they have chosen, in whatever format they have it there. That means that a single field, without any limit on length or allowed characters, suddenly makes a lot of sense -- you can filter the characters out anyway (just make sure you display the final form of the number to the user at some point). There are also issues with moving the focus between fields, with browsers remembering previous values (you just have to select one number, not 4 parts of the same number then), etc.
In general, I would say that as browsers slowly become more and more usable, you should take advantage of the mechanisms they provide by using the stock solutions, and not inventing complex solutions on your own. You may be a step ahead of them today, but in two years the browsers will catch up and your site will suck.
I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.
Other similar questions here I've seen look like for cases where HTML can still be used for formatting, so I'm asking for a case where HTML wouldn't be used at all.
You will also restrict yourself if you perform the escaping before inserting into your db. Let's say you decide not to use HTML as output, but JSON, plain text, etc.
If you have stored escaped HTML in your db, you would first have to 'unescape' the value stored in the db, just to re-escape it into a different format.
Also see the excellent OWASP article on XSS prevention.
Yes, because at some stage you'll want access to the original input entered. This is because...
You never know how you want to display it - in JSON, in HTML, as an SMS?
You may need to show it back to the user as is.
I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it is a regex, then look out for confused users who might type something like this...
3<4 :->
They'll only get the 3 if it is a regex.
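To see the effect, here is a small sketch with a typical tag-stripping regex (the pattern is an assumption about what such a regex usually looks like), compared with escaping the same input at output time via Python's html.escape.

```python
import html
import re


def naive_strip_tags(text):
    """Remove anything that looks like <...>, the way a simple regex would."""
    return re.sub(r"<[^>]*>", "", text)


user_input = "3<4 :->"

# Everything from "<" up to the ">" of the smiley is treated as a tag.
stripped = naive_strip_tags(user_input)

# Escaping keeps every character and makes it safe to show in HTML.
escaped = html.escape(user_input)
```

Here stripped is "3", while escaped is "3&lt;4 :-&gt;", which a browser renders as the text the user actually typed.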
Suppose you have the text R&B, and store it as R&amp;B. If someone searches for R&B, it won't match with a search SQL:
SELECT * FROM table WHERE title LIKE ?
The same for equality, sorting, etc.
Or if someone searches for life span, it could return extraneous matches with the escaped <span>'s. Though this is a bit orthogonal, and can be solved by using an external service like Elasticsearch, or by storing a raw text version in another field; similar to what #limscoder suggested.
If you expose the data via an API, the consumers may not expect the data to be escaped. Adding documentation may help.
A few months later, a new team member joins. As a well-trained developer, he always uses HTML escaping, now only to see everything is double-escaped (e.g. titles are showing up like He said &quot;nuff&quot; instead of He said "nuff").
Some escaping functions have additional options. Forgetting to use the same functions/options while un-escaping could result in a different value than the original.
It's more likely to happen with multiple developers/consumers working on the same data.
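The mismatch and the double-escaping can be reproduced in a few lines. This sketch uses Python's html.escape as a stand-in for whatever escaping function the application uses; render_title is a made-up helper.

```python
import html

raw_title = 'He said "nuff"'

# What ends up in the database if escaping happens before the insert.
stored_escaped = html.escape(raw_title)


def render_title(value):
    """Escape at output time, as a well-behaved template would."""
    return "<h1>" + html.escape(value) + "</h1>"


# Raw text in the database: escaped exactly once on output.
good = render_title(raw_title)

# Escaped text in the database: the "&" of each entity gets escaped again.
double = render_title(stored_escaped)

# The stored value no longer equals what a user would type into a search box.
search_matches = (html.escape("R&B") == "R&B")
```

good contains &quot;nuff&quot;, which the browser shows as "nuff"; double contains &amp;quot;nuff&amp;quot;, which the browser shows as the literal text &quot;nuff&quot;; search_matches is False.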
I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.
I'm building a "Narrow your results by" feature similar to Best Buy's and NewEgg's. What is the best practice for storing the user's filter selections in a URL that can be shared/bookmarked?
The obvious choice is to simply keep all the user's selections in the query string. However, both of these examples are doing something far more cryptic:
Best Buy:
http://www.bestbuy.com/site/olstemplatemapper.jsp?id=pcat17080&type=page&qp=crootcategoryid%23%23-1%23%23-1~~q70726f63657373696e6774696d653a3e313930302d30312d3031~~cabcat0500000%23%230%23%2311a~~cabcat0502000%23%230%23%23o~~nf518||24363030202d2024383939&list=y&nrp=15&sc=abComputerSP&sp=%2Bcurrentprice+skuid&usc=abcat0500000
It appears they're assigning some unique value to the search and storing it temporarily on their side. Or perhaps wrapping their db id's in a bunch of garbage because they believe in security through obscurity?
Is there some inherent disadvantage to keeping things simple like this?
www.mydomain.com?color=blue&type=laptop
So when I select a 17" screen size as a filter, it would simply reload the page with the additional query string tacked on:
www.mydomain.com?color=blue&type=laptop&screen-size=17
Also, to clarify, I would likely use corresponding ids from the database in the URL to make validation and parsing easier/faster, but the question remains about whether there's some problem I'm missing in my simple approach.
Thanks in advance!
One of the first players in the faceted search domain was Endeca, and they are still used by many of the larger online stores (PC Connection, Home Depot, Walmart ...). You may want to take a look at their website.
There is a Drupal plug-in for faceted search. Check out the demo.
I don't think the URL composition matters much, but I actually think presenting the parameters in a readable form may be dangerous. One of the advantages of using "Guided search" is that you can avoid producing empty result sets by not allowing invalid parameter combinations. If the query-string is user-editable, they can come up with invalid combinations, circumventing the guided search.
I think the more human-readable manner, i.e. www.mydomain.com?color=blue&type=laptop&screen-size=17 is the better approach to take here. Just make sure you are sanitizing everything coming from the url before it gets to the database.
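A rough sketch of the readable query-string approach with validation. The allowed filter names and values are invented here; in a real shop they would come from the database. Anything not on the whitelist is dropped before it gets near a query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical whitelist of facets and their permitted values.
ALLOWED_FILTERS = {
    "color": {"blue", "red"},
    "type": {"laptop", "desktop"},
    "screen-size": {"13", "15", "17"},
}


def parse_filters(url):
    """Return only the known filters with known values from the URL."""
    query = parse_qs(urlsplit(url).query)
    filters = {}
    for name, values in query.items():
        allowed = ALLOWED_FILTERS.get(name)
        if allowed is None:
            continue  # unknown parameter: ignore it
        kept = [value for value in values if value in allowed]
        if kept:
            filters[name] = kept
    return filters


def build_query(filters):
    """Serialize filters back into a stable, shareable query string."""
    return urlencode(sorted(filters.items()), doseq=True)
```

Sorting the parameters in build_query means the same selection always produces the same URL, which helps with bookmarking and caching.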
The query string has a maximum length that is easy to reach (255?), which is probably the reason for the serialization.
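On the serialization itself: at least two segments of the Best Buy URL quoted in the question look like plain hex-encoded ASCII, which the sketch below decodes. The helper name is made up, and which parts count as a "segment" (the text after the leading q, and the text after ||) is a guess from reading the URL, not documented behaviour.

```python
def decode_hex_segment(segment):
    """Decode a hex string to ASCII text; raises ValueError if it is not valid."""
    return bytes.fromhex(segment).decode("ascii")


# The part of the qp parameter that follows "~~q".
date_filter = decode_hex_segment(
    "70726f63657373696e6774696d653a3e313930302d30312d3031"
)

# The part of the qp parameter that follows "nf518||".
price_filter = decode_hex_segment("24363030202d2024383939")
```

date_filter comes out as "processingtime:>1900-01-01" and price_filter as "$600 - $899". Hex encoding doubles the length of each segment, so for these parts at least it looks more like a way to keep awkward characters out of the URL than a way to shorten it.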
I'm designing a MySQL table for an authentication system for a high-traffic personal website. Every time a user comment, article, etc is displayed the following fields will be needed:
login
User Display
User Bio ( A little signature )
Website Account
YouTube Account
Twitter Account
Facebook Account
Lastfm Account
So everything is in one table to prevent the need to call sub-tables. So my question is:
Would there be any improvements if I combine Website, Youtube, Twitter, Facebook and Lastfm columns to one?
For example:
[website::something.com][youtube::youtube.com/something]
No, combining these columns would not result in any improvement. Indeed, it seems you would extend the overall length (by adding prefixes and separators), hence potentially worsening performance.
A few other tricks however, may help:
reduce the size of the values stored in "xxxAccount" columns, by removing altogether, or replacing with short-hand codes, the most common parts of these values (the examples shown indicate some kind of URL, whereby the beginning will likely be repeated).
depending on the average length of the bio, and typical text found therein, it may also be useful to find ways of shrinking its [storage] size, with simple replacement of common words, or possibly with actual compression (ZIP and such), although doing so may result in having to store the column in a BLOB column which may then become separated from the table, depending on the server implementation/configuration.
And, of course, independently from any improvements at the level of the database, the usage model described seems to call for caching this kind of data aggressively, to avoid the trip to SQL altogether.
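The first trick (dropping the repeated beginning of each account URL) might look like the sketch below. The prefixes are assumptions based on the example in the question, and the function names are hypothetical.

```python
# Assumed common prefix per service; only the remainder gets stored.
PREFIXES = {
    "youtube": "youtube.com/",
    "twitter": "twitter.com/",
}


def compact(service, url):
    """Strip the well-known prefix before storing the value."""
    prefix = PREFIXES[service]
    if not url.startswith(prefix):
        raise ValueError("unexpected %s URL: %r" % (service, url))
    return url[len(prefix):]


def expand(service, stored):
    """Rebuild the full URL when displaying the account."""
    return PREFIXES[service] + stored
```

Rejecting values that do not start with the expected prefix keeps compact and expand exact inverses of each other, at the cost of having to decide what to do with unusual URLs.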
Well, I don't think so. Think of it this way: you will need some way to split them, and that would require additional processing. And then why not just have one field in the whole table and have everything in that? :) Don't worry about the performance; it would be better with separate columns.
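To make the "additional processing" concrete, here is a sketch of what reading and writing the combined format from the question would involve. The parsing rules are guessed from the single example given, so treat them as an assumption.

```python
import re

# One [name::value] pair, as in the example from the question.
_PAIR = re.compile(r"\[([a-z]+)::([^\]]*)\]")


def split_accounts(combined):
    """Parse the combined column back into separate values."""
    return dict(_PAIR.findall(combined))


def join_accounts(accounts):
    """Build the combined column from separate values."""
    return "".join("[%s::%s]" % (name, value) for name, value in accounts.items())
```

Every page view would have to run split_accounts, and the combined string is longer than the values on their own because of the names, brackets and separators.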