encode/decodeURI for URLs with quotes - html

I'm having trouble displaying links to URLs with quotes in them and can't figure out a solution despite a load of examples on Stack Overflow! Here's the exact string I'm storing in my database (shows Adelaide, Antarctica)
https://www.google.com/maps/place/67%C2%B007'27.3%22S+68%C2%B008'56.0%22W/#-67.1447827,-68.3886741,71373m/data=!3m1!1e3!4m5!3m4!1s0x0:0x0!8m2!3d-67.124258!4d-68.148903
When I just try putting that into a href it links to...
https://www.google.com/maps/place/67%C2%B007 (i.e. breaks at the first single quote)
But when I try using href="encodeURI(theLink)" or href="encodeURIComponent(theLink)" it links to the same thing (I even tried the decode options in case I was thinking about it the wrong way, and had the same problem).
Does anyone have a recommendation on the best way to proceed here? I even tried the deprecated "escape" function which also won't work for me. Thanks for any thoughts at all!
(p.s. funnily enough as I'm writing this I see that even Stack Overflow's link is broken in exactly the same way - maybe it's not even possible?!)
EDIT: As requested by Clemzd - I'm using d3 to construct the links, so doing this...
anElement.append("text").html("<a href='" + myData[i].url + "'> a link name </a>");
Works great on everything but links with a single quote regardless of whether I do encodeURI(myData[i].url) or not

You use single quotes to delimit the value of the href attribute, so that value cannot contain unescaped single quotes. That's an issue with HTML markup encoding, not URL encoding.
You can either reverse your use of single and double quotes (encoded URLs cannot contain double quotes, but they can contain single quotes) or replace the single quotes in the URL with a character reference like &#39;. URL-encoding them as %27 would also work, but encodeURIComponent does not do that for you.
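Both options can be sketched roughly as follows. The helper names linkHtml and linkHtmlSingleQuoted are hypothetical (they are not part of d3), and the label is assumed to be trusted text; the returned string is what would be passed to .html().

```javascript
// Option 1 (hypothetical helper): delimit the attribute with double quotes
// and encode the characters that are special inside such a value.
function linkHtml(url, label) {
  const safeUrl = url.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return '<a href="' + safeUrl + '">' + label + "</a>";
}

// Option 2 (hypothetical helper): keep the single-quoted attribute and
// percent-encode apostrophes, which encodeURI/encodeURIComponent leave as-is.
function linkHtmlSingleQuoted(url, label) {
  const safeUrl = url.replace(/&/g, "&amp;").replace(/'/g, "%27");
  return "<a href='" + safeUrl + "'>" + label + "</a>";
}

const exampleUrl = "https://www.google.com/maps/place/67%C2%B007'27.3%22S";
console.log(linkHtml(exampleUrl, "a link name"));
console.log(linkHtmlSingleQuoted(exampleUrl, "a link name"));
```

Either string can then replace the hand-built markup, e.g. anElement.append("text").html(linkHtml(myData[i].url, "a link name")).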

There are many ways to solve your issue. All you need to know is that if your input may contain ' then you have to escape this character. Otherwise you will get something like anElement.append("text").html("<a href='" + https://www.google.com/maps/place/'link + "'> a link name </a>"); That can't be parsed correctly because of the '.
If you are sure that your link will never contain " then change your code and use " as the attribute delimiter instead.
If not, you can escape ' on the server side or the client side. For example, on the client side you can do:
function escapeJavascript(input){
    // Backslash escapes mean nothing inside an HTML attribute, so emit
    // character references for the characters that could end it.
    return input.replace(/&/g, "&amp;")
        .replace(/'/g, "&#39;")
        .replace(/"/g, "&quot;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}
And then use it like this: anElement.append("text").html("<a href='" + escapeJavascript(myData[i].url) + "'> a link name </a>");

Related

How to fix character encoding in Wordpress title for comma, apostrophe or quotes?

I'm using a custom function to fetch RSS feeds based on the Wordpress title.
Works great but if title contains a comma, apostrophe or quote it breaks the feed because it is submitting the html encoding as part of the RSS feed search URL.
The goal is to have the RSS feed search URL contain the exact text that's in the Wordpress title, without any html character encoding. I tried html_entity_decode(get_the_title()) and it gets rid of apostrophe and quotes, but it does not work for commas. I'm guessing need to do a str_replace to get rid of commas but not sure the best way to go about it and also incorporate with the html_entity_decode function.
Here's the custom function I'm using (as a little custom plugin) for now. Thanks for your help!
add_shortcode( 'custom_rss', 'execute_custom_rss_shortcode' );
function execute_custom_rss_shortcode() {
return do_shortcode('[wp_rss_retriever url="https://news.google.com/rss/search?q=' . get_the_title() . '&hl=en-US&gl=US&ceid=US%3Aen" items="10"]');
}
I would recommend using preg_replace to strip special characters out of the title, allowing only letters, numbers and spaces. I updated your code and posted an example:
add_shortcode( 'custom_rss', 'execute_custom_rss_shortcode' );
function execute_custom_rss_shortcode() {
// Decode HTML entities first, then remove every character that is not a letter, digit or space
$filtered_title = preg_replace('/[^a-zA-Z0-9 ]/', '', html_entity_decode(get_the_title(), ENT_QUOTES));
return do_shortcode('[wp_rss_retriever url="https://news.google.com/rss/search?q=' . $filtered_title . '&hl=en-US&gl=US&ceid=US%3Aen" items="10"]');
}
Tried this code. It filters the commas, quotes and apostrophes from the title so that the RSS feed is not broken by HTML entities. Maybe not the ideal way, so any other better solutions are welcome!
$filter_title = preg_replace("/&#?[a-z0-9]+;/i", '', get_the_title());
return do_shortcode('[feed-fetcher feeds="https://www.bing.com/news/search?q=' . str_replace(',', '', $filter_title) . '&format=rss" max="6"]');
}

Alternative to lookbehind with variable width

I have some html which contains a number of hyperlinks to html files, but they don't have any file extensions.
For example in the string <a href='variablelengthfilename'> I'm trying to match the trailing ' , so I can replace it with .html' (using a RegEx search in Notepad++) using something like this:
`(?<=href='[A-Za-z]*)'`
but that won't work because Notepad++ doesn't allow variable-length lookbehind assertions.
How else can I achieve this?
Thanks
Since you are working in Notepad++, here is a way to achieve what you are after:
Find what: \bhref='[^']*
Replace with: $&.html
The \bhref='[^']* regex matches a href as a whole word, then =' are matched literally, and [^']* matches 0 or more characters other than '. Note you will need to replace ' with " if the href value is inside double quotes.
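For readers who want to script the same replacement outside Notepad++, here is a minimal sketch in JavaScript, whose replace() also understands $& as "the whole match". The function name addHtmlExtension and the sample markup are assumptions for illustration only.

```javascript
// Hypothetical helper: append ".html" to every single-quoted href value.
// The pattern matches "href='" plus the value up to (not including) the
// closing quote, so no lookbehind is needed.
function addHtmlExtension(html) {
  return html.replace(/\bhref='[^']*/g, "$&.html");
}

const sample = "<a href='intro'>Intro</a> <a href='chapterOne'>One</a>";
console.log(addHtmlExtension(sample));
```

Note that this sketch appends .html unconditionally, so links that already have an extension would need to be excluded first.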
Assuming all your links look like that, why not just do a simple replace
'>
with
.html'>
?

How to parse links and escape html entities?

I have some user provided content that I want to render.
Obviously the content should be escaped, rails does this by default. However I also want to parse the text so that urls are presented as links.
There is an auto_link helper which does just that. However no matter what order I do this in I can't get the desired result.
Take content:
content
=> "<img src=\"foo\" />\\r\\n\\r\\nhttp://google.com"
If this is escaped, because the slashes in the url are escaped, auto_link will not work:
Rack::Utils.escape_html(content)
=> "&lt;img src=&quot;foo&quot; /&gt;\\r\\n\\r\\nhttp:&#x2F;&#x2F;google.com"
If I use auto_link first obviously the link will be escaped. Additionally auto_link strips unwanted content rather than escaping. If a script tag is present in the input I want it escaped not removed.
auto_link(content)
=> "<img src=\"foo\" />\\r\\n\\r\\nhttp://google.com"
Any idea how to get the desired output?
Thanks for any help.
You could strip out all the escaped whitespace characters with content.gsub!(/\\./, ""). Then you'll be able to use auto_link.
The solution I ended up using was ditching auto_link, letting Rack escape my content server side, and then parsing the links out of the text on the client side using https://github.com/gabrielizaias/urlToLink
$('p').urlToLink();
I've had success with:
auto_link(h(content))
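The "escape first, then link" order can be illustrated with a small JavaScript sketch. This is not how auto_link or urlToLink is implemented; escapeHtml and escapeAndLink are hypothetical helpers, and the URL pattern is deliberately simple.

```javascript
// Hypothetical helper: escape the characters that are special in HTML.
// Slashes are left alone so URLs stay recognisable afterwards.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Escape everything, then wrap http(s) URLs in anchors. The lookahead stops
// a URL at an escaped quote or angle bracket but lets "&amp;" through, so
// query strings survive.
function escapeAndLink(text) {
  const urlPattern = /https?:\/\/(?:(?!&(?:quot|#39|lt|gt);)\S)+/g;
  return escapeHtml(text).replace(
    urlPattern,
    (url) => '<a href="' + url + '">' + url + "</a>"
  );
}

console.log(escapeAndLink('<img src="foo" />\r\n\r\nhttp://google.com'));
```

Because the input is escaped before any markup is added, a script tag in the content ends up escaped rather than removed, and only the generated anchors are live HTML.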

Usage: Escape HTML problem

I ran into an interesting problem.
In our webpage a user can write their own description. We escape all text to make it easy to write (<3 shows up properly and isnt the start of a tag). This also avoids any problems with trying to inject their javascript code or hide something or do anything with html.
A side effect is when a user writes
Hi
My name is
shows up as
Hi My name is
Initially we (really I) wrote var desc = (SafeHtml)obj.desc.HtmlEscape.replace("\n", "\n<br>"); however, this doesn't replace anything because what really happens is that \n is escaped as &#10;, since all characters < 0x20 (I think) need an escape to be represented in HTML.
So my question is, am I doing things right? I changed the replace to ("&#10;", "\n<br/>");. Is this the right way? Escape everything and replace characters you deem 'legal'? ATM I can't think of any other characters to escape.
That's how I'd do it - escape everything, and then replace safe escaped sequences. That said, I don't think you need to escape all characters < 0x20 - I'd leave 0x0A (newline) and 0x0D (carriage return) alone in the escaping step, and then replace them with <br />. Doesn't make much difference though.
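The "escape everything, then re-introduce the safe markup" approach could look like the sketch below. It is written in JavaScript rather than the asker's environment, and escapeHtml and formatDescription are hypothetical names; the escaping step deliberately leaves newlines and carriage returns untouched, as suggested above.

```javascript
// Hypothetical helper: escape HTML-special characters only; line breaks
// pass through unchanged.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Escape first, then turn each line break (CRLF, CR or LF) into <br>.
function formatDescription(text) {
  return escapeHtml(text).replace(/\r\n|\r|\n/g, "<br>\n");
}

console.log(formatDescription("Hi\nMy name is <3"));
```

The order matters: doing the newline replacement before escaping would escape the inserted <br> tags as well.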

escaping html inside comment tags

escaping html is fine - it will remove <'s and >'s etc.
ive run into a problem where i am outputting a filename inside a comment tag eg. <!-- ${filename} -->
of course things can be bad if you dont escape, so it becomes:
<!-- <c:out value="${filename}"/> -->
the problem is that if the file has "--" in the name, all the html gets screwed, since youre not allowed to have <!-- -- -->.
the standard html escape doesnt escape these dashes, and i was wondering if anyone is familiar with a simple / standard way to escape them.
Definition of a HTML comment:
A comment declaration starts with <!, followed by zero or more comments, followed by >. A comment starts and ends with "--", and does not contain any occurrence of "--".
Of course the parsing of a comment is up to the browser.
Nothing strikes me as an obvious solution here, so I'd suggest you str_replace those double dashes out.
There is no good way to solve this. You can't just escape them because comments are read in plaintext. You will have to do something like put a space between the hyphens, or use some sort of code for hyphens (like [HYPHEN]).
Since it is obvious that you cannot directly display the '--'s, you can either encode them or use the fn:escapeXml or fn:replace functions for appropriate replacements.
JSTL documentation
There's no universally working way to escape those characters in HTML. Unless the - characters come in multiples of four, it depends on the browser: -- won't work in Firefox, but ---- will. For example, Internet Explorer 8 has no problem with them, and the same goes for Google Chrome. However, Firefox, even the latest version (3.0.4), doesn't handle these characters well.
You shouldn't be trying to HTML-escape, the contents of comments are not escapable and it's fine to have a bare ‘>’ or ‘&’ inside.
‘--’ is its own, unrelated problem and is not really fixable. If you don't need to recover the exact string, just do a replacement to get rid of them (eg. replace with ‘__’).
If you do need to get a string through completely unmolested to a JavaScript that will be reading the contents of the comment, use a string literal:
<!-- 'my-string' -->
which the script can then read using eval(commentnode.data). (Yes, a valid use for eval() at last!)
Then your escaping problem becomes how to put things in JS string literals, which is fairly easily solvable by escaping the ‘'’ and ‘-’ characters:
<!-- 'Bob\x27s\x2D\x2Dstring' -->
(You should probably also escape ‘<’, ‘&’ and ‘"’, in case you ever want to use the same escaping scheme to put a JS string literal inside a <script> block or inline handler.)
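A minimal sketch of that escaping scheme follows. The helper name toCommentSafeLiteral is an assumption, and the reading side uses new Function in place of eval(commentnode.data) only so the example runs without a DOM.

```javascript
// Hypothetical helper: build a single-quoted JS string literal in which
// quotes, hyphens, angle brackets, ampersands, backslashes and line
// terminators only appear as \xNN or \uNNNN escapes.
function toCommentSafeLiteral(value) {
  const escaped = value.replace(/[\\'"\-<>&\r\n\u2028\u2029]/g, (ch) => {
    const code = ch.charCodeAt(0);
    return code < 0x100
      ? "\\x" + code.toString(16).toUpperCase().padStart(2, "0")
      : "\\u" + code.toString(16).toUpperCase().padStart(4, "0");
  });
  return "'" + escaped + "'";
}

const literal = toCommentSafeLiteral("Bob's--string");
console.log("<!-- " + literal + " -->"); // <!-- 'Bob\x27s\x2D\x2Dstring' -->

// Reading side: evaluating the literal recovers the original string.
const recovered = new Function("return " + literal + ";")();
console.log(recovered === "Bob's--string"); // true
```

Since every hyphen is escaped, the literal can never contain "--", whatever the file name looks like.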