Rails: How to avoid changing spaces and other special characters in URLs - html

I'm having some trouble with characters like spaces, plus signs, double quotes and accented Latin characters in Rails when adding them to URLs as parameters. They are always converted to numbers preceded by %, and this is causing a lot of trouble for us, since Portuguese uses a lot of those characters.
Everything works just fine when typing the character manually in the URL, but once Rails turns it into a link, it replaces the character.
Is there a way to avoid that?
Here's an example. Instead of
url?q=transgênico
we get
url?q=transg%C3%AAnico
This completely breaks our search and communication with other websites through parameters - which works fine when typing the special characters manually.
I must admit I could not really search for this, since English is not my first language and I have no clue what to search for. All my attempts returned no results, but I was probably using bad terms.
Using Rails 2.3.8.
Thank you in advance.

I believe you have to encode those characters because they are not valid in a URL: see the Uniform Resource Locators (URL) spec and the related discussion on Stack Overflow.

I'd try setting the :escape => false flag in link_to and the like. If that doesn't help, you'll probably have to monkey patch actionpack/lib/action_view/helpers/url_helper.rb.

Maudite is right. You can convert these URLs back to a 'nice' form before using them as search terms:
require 'cgi'
CGI.unescape 'url?q=transg%C3%AAnico'
produces:
"url?q=transgênico"

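The same round trip can be sketched outside Rails. This is a minimal illustration in Python (not the asker's stack), assuming the receiving side is free to decode the parameter before searching; the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

term = "transgênico"

# Percent-encode the UTF-8 bytes of the term, as a URL helper would.
encoded = quote(term)
print(encoded)  # transg%C3%AAnico

# Decode on the receiving side before using the value as a search term.
decoded = unquote("url?q=" + encoded)
print(decoded)  # url?q=transgênico
```

The encoded and the typed form carry the same value, so a receiver that decodes its parameters sees "transgênico" either way.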

Can you reverse engineer Saved Query link to Application Insights?

In Analytics, if I try to Export > Share a Link to Query then a URL is copied to my clipboard.
It has the following structure:
https://analytics.applicationinsights.io/subscriptions/[subscription id]/resourcegroups/Default-ApplicationInsights-[region]/components/[resource]?q=[alphanumeric string]&apptype=web
The alphanumeric string is some sort of encoding of the actual query. Why do I say that? Because it grows or shrinks according to the size of the query. I tried seeing if it was Base64 or UUencode but neither worked. Also I tried 5 a's and 5 b's followed by 10 c's in the query (arbitrary query) to see if I would see a pattern but that didn't help either.
Some analysis with Unix tools showed that the alphanumeric string is a character set with 0-9, A-Z, +, /, and =.
Does anybody know this format so that I can make arbitrary query URLs?
Alternately being able to submit parameters to the query would solve my problem. My motivation is to link to Application Insights from my website and go to dynamic queries.
Examples of the encoded part:
Query: aaaaabbbbbbccccccccccc
Encoding: ?q=H4sIAAAAAAAAA0tMBIIkMEhGAC4AHRlzExcAAAA%3D
Query: abcdefghijklmnopqrstuvwxyz0123456789
Encoding: ?q=H4sIAAAAAAAAA0tMSk5JTUvPyMzKzsnNyy8oLCouKS0rr6isMjA0MjYxNTO3sOQCANVo3%2FUlAAAA
There are 2 options to link a query.
Encoded query (works well for lengthy queries and special characters). The format is q=EncodedQuery. EncodedQuery is the query, encoded in the following way: (a) first it is compressed via gzip, and (b) then it is encoded using base64 encoding.
Plain text query. The format is query=QueryText. The downside is that the query length is (more) limited by browser's URL length limit. It may also not play well with special characters.
Hope that helps,
Yoram
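The encoded-query option described above (gzip, then base64) can be sketched as follows. This is a hedged illustration, not the portal's own code: the function names, the UTF-8 text encoding, and the sample query text are assumptions, and the exact bytes produced may differ from what the portal generates because gzip headers and compression levels vary between implementations.

```python
import base64
import gzip
from urllib.parse import quote, unquote


def encode_query(query_text: str) -> str:
    """gzip the query, base64 it, then percent-encode it for the q= parameter."""
    compressed = gzip.compress(query_text.encode("utf-8"))  # UTF-8 is an assumption
    b64 = base64.b64encode(compressed).decode("ascii")
    # safe="" so that '+', '/' and '=' are percent-encoded too (e.g. '=' -> %3D).
    return quote(b64, safe="")


def decode_query(q_value: str) -> str:
    """Reverse of encode_query: percent-decode, base64-decode, gunzip."""
    raw = base64.b64decode(unquote(q_value))
    return gzip.decompress(raw).decode("utf-8")


example = "requests | take 10"  # hypothetical query text
q = encode_query(example)
print("?q=" + q)
print(decode_query(q))
```

The leading "H4sI" seen in the examples is what a gzip header looks like after base64 encoding, which is consistent with this scheme.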

Text encoding problems in JSON.stringified() object

I have an index.html that sends text to a PHP script. This PHP script sends it on by POST (curl) to a Node.js server, inserted in a JSON message (UTF-8 encoded).
//Node.js server file (app.js) -- gets the json and shows it in a <script> to save it in client JS
render(index, {json:{string:"mystring"}})
//Template to render (index.ejs)
var data = <%=JSON.stringify(json)%>;
This lets me pass the variables in the JSON to data. The real JSON is much bigger than this; I wrote only the part that triggers the bug: the string it contains causes an "Invalid character" JS error. What should I do? Which encoding/decoding/escaping should I use?
I have UTF-8 everywhere, and all my other strings work, even with German or Arabic characters. In this particular case, it is the "mystring" value below that breaks the app:
If I remove the characters in the red circles, it works.
Here is the string as it is in the JSON i receive :
"Otto\nTheater-, Konzert- und Gpb\n\u2028\u2028Rhoasse\u00dfe 20\u2028\n51065 K\u00f6ln\n\nTelefon: 0000-000000-0\u2028\nTelefax: 0000-000000\n\nE-Mail: address#mail.com\u2028"
Because it is a user-entered text, I must handle this kind of characters. I don't have access to the PHP part of the code, only to the nodeJS and client JS. How can I find and remove/convert those chars in JS ?
<%- JSON.stringify(data).replace(/[\u0000\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g, "\\n") %>;
I ended up replacing the problematic Unicode characters (which are valid in JSON but not in JS code) with line breaks. This solves the problem.
JSON is commonly thought to be a subset of JavaScript, but it isn't quite. Due to an unfortunate oversight, the raw characters U+2028 and U+2029 are permitted in JSON string literals, but not in JavaScript string literals. In JavaScript, they are interpreted as newlines and so having one in a string literal is a syntax error.
Consequently this:
var data = <%=JSON.stringify(json)%>;
isn't safe. You can make it so by manually replacing them with string-literal-escaped versions:
JSON.stringify(json).replace(/\u2028/g, '\\u2028').replace(/\u2029/g, '\\u2029')
Typically it's best to avoid this kind of problem, and keep code and data strictly separated, by dropping the JSON data into an HTML data- attribute. It can then be read out of the DOM from the client-side script and passed through JSON.parse. Then the only kind of escaping you have to worry about is normal HTML-escaping, which hopefully your templating language does by default.
The other characters in your answer are actually okay for JS string literals, except for the control characters, which JSON also escapes.
It may well make sense to remove some of these characters anyway, as an input filtering step. It's unusual and almost always undesirable to have cruft like U+2028 in your data. You could consider filtering out the characters unsuitable for use in markup which include U+2028/9 and other bad things like bidi overrides that can mess up your page rendering.
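The escaping step can also be done wherever the JSON text is produced. A minimal sketch in Python (not the asker's Node.js stack), assuming the serialized text will be pasted into a script block; the helper name and the sample payload are made up for illustration.

```python
import json


def json_for_script(value) -> str:
    """Serialize to JSON, then escape the two separators JS string literals reject."""
    text = json.dumps(value, ensure_ascii=False)  # keep other non-ASCII characters raw
    # str.replace substitutes every occurrence, not just the first one.
    return text.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")


payload = {"string": "Otto\u2028\u2028Rhoasse\u00dfe 20\u2029"}
literal = json_for_script(payload)
print(literal)
```

The result is still valid JSON that parses back to the same data, because a \u2028 escape inside a JSON string decodes to the same character as the raw one.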

Box Net: What is the pattern used and characters allowed during search

I need to test the working of Box Net search in my application. For this I need more information about the search pattern. I see search results are compared with both file title and content.
Search shows different behaviour when I have file names with special characters. Will search work when file names contain special characters?
Following is the query I am using
boxSearch = client.getSearchManager().search(searchFileName, boxDefaultRequestObject);
Can you share the pattern used during search, the characters allowed, and which character combinations return results?
Here are some resources on search:
https://support.box.com/hc/en-us/articles/200519888-How-do-I-search-for-files-and-folders-in-Box-
Box's search returns folder/file names and content, and it also accepts booleans. Just don't use mixed case (aNd is NOT okay, while AND or and is okay).
Box also accepts special characters in uploads and search. See the description here, as this was a fairly recent product update that came in mid-2013.
Additional special character support – Box will add support for more types of special characters across the Box website, desktop and mobile apps. Once the change is live, Box products will support almost all printable characters (except / \ or empty file names; also will not support leading or trailing spaces on files and folders).

Best HTML encoder for Delphi?

Seems like my data is getting corrupted when using HTTPapp.HTMLEncode( string ): String;
HTMLEncode( 'Jo&hn D<oe' ); // returns 'Jo&am'
This is not correct, and is corrupting my data. Does anyone have suggestions for VCL components that work better? Other than spending my time encoding all the cases
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Update
After understanding more about HTML, I have found there is no need to encode the other characters referenced in my link. You would only need to know about the four HTML reserved characters being
&,<,>,"
The issue with the VCL HTTPApp.HTMLEncode( ) function comes from the buffer sizes it passes, combined with Delphi 2009/2010 making Unicode the default string type. This can be fixed the way that @mason says below, or with a call to WideFormatBuf( ) instead of the FormatBuf( ) that is currently in use.
Replacing the <, >, &, and " characters in a string is trivial. You could thus easily write your own routine for this. (And if your HTML page is UTF-8, there is absolutely no reason to encode any other characters, such as U+222B (the integral sign).)
But if you wish to stick to the Delphi RTL, then you can have a look at HTTPUtil.HTMLEscape with the exactly same signature as HTTPApp.HTMLEncode.
Or, have a look at this SO question.
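A hand-written routine for the four reserved characters is short. Here is a sketch in Python rather than Delphi, purely to show the order of replacements; the function name is made up, and it is not a port of HTTPApp.HTMLEncode.

```python
def html_encode(text: str) -> str:
    """Escape only the four reserved characters: & < > and the double quote."""
    # '&' must be replaced first, otherwise the '&' of the entities
    # produced by the later replacements would be escaped a second time.
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


print(html_encode("Jo&hn D<oe"))  # Jo&amp;hn D&lt;oe
```

Single quotes and non-ASCII characters are passed through unchanged, which matches the advice above for UTF-8 pages.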
You're probably using Delphi 2009 or 2010. It looks to me like they forgot to update HTMLEncode for Unicode. It's passing the wrong buffer lengths to FormatBuf.
The HTMLEncode routine is basically right, aside from that, and it's pretty short. You could probably just make your own copy. Everywhere it calls FormatBuf, it gives 5 parameters. The second and fourth are integer values. Double both of them in each call, (there are only four of them), and then it will work.
Also, you ought to open a QC report on this so it will get fixed.
Small hint: do not convert the single quote (') to &apos; - some browsers do not understand this code because &apos; comes from XML and is not a valid HTML 4 entity.
For details, see: "The Curse of &apos;" and "XHTML and '"
(Both Delphi units mentioned do not convert single quotes).

Handling MySQL Full Text Special Characters

When using MySQL full text search in boolean mode there are certain characters like + and - that are used as operators. If I do a search for something like "C++" it interprets the + as an operator. What is the best practice for dealing with these special characters?
The current method I am using is to convert all + characters in the data to _plus. It also converts &, @, / and # characters to a textual representation.
There's no way to do this nicely using MySQL's full text search. What you're doing (substituting special characters with a pre-defined string) is the only way to do it.
You may wish to consider using Sphinx Search instead. It apparently supports escaping special characters, and by all reports is significantly faster than the default full text search.
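The substitution approach can be sketched as one function that is applied both to the text before it is stored in the indexed column and to each search term before it goes into the query. Only the _plus token comes from the question; the other token names and the function name are hypothetical.

```python
# Map each special character to a textual token made of word characters.
REPLACEMENTS = {
    "+": "_plus",   # from the question
    "#": "_hash",   # hypothetical token name
    "&": "_and",    # hypothetical token name
    "/": "_slash",  # hypothetical token name
}


def normalize_term(text: str) -> str:
    """Replace operator-like characters so they are indexed as part of the word."""
    return "".join(REPLACEMENTS.get(ch, ch) for ch in text)


print(normalize_term("C++"))  # C_plus_plus
print(normalize_term("C#"))   # C_hash
```

Because the same mapping runs on both sides, a search for "C++" becomes a search for the ordinary word C_plus_plus, which is long enough not to be dropped by the minimum word length.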
MySQL is fairly brutal in what tokens it ignores when building its full text indexes. I'd say that where it encountered the term "C++" it would probably strip out the plus characters, leaving only C, and then ignore that because it's too short. You could probably configure MySQL to include single-letter words, but it's not optimised for that, and I doubt you could get it to treat the plus characters how you want.
If you have a need for a good internal search engine where you can configure things like this, check out Lucene which has been ported to various languages including PHP (in the Zend framework).
Or if you need this more for 'tagging' than text search then something else may be more appropriate.