Urlize escaping %23 in href of text - html

I am trying to render text from a user with a url like this in it: https://example.com/%20%23654
I pass the url to urlize and I get this:
In[1]: outp = urlize('https://example.com/%20%23654'); print outp
Out[1]: u'https://example.com/%20%23654'
I understand that %20 escapes to a space and %23 to a hash, but why is it only escaping the hash in the href? Is this a bug? If it were intended, why is it not escaping the %20 to blank space?

I don't think this is a bug.
I see two parts to this question:
Why is it only unescaping the hash and not the space?
Why is it only doing the unescaping in the href and not in the visible linked text?
Here are my thoughts on the first:
A hash is a perfectly legal URL path character. It is most often used to go to anchors in HTML (example and link to docs in one!):
http://www.w3.org/TR/html4/struct/links.html#h-12.2
urlize realizes this. It unescapes the hash in the href. It works with any letter which is a legal URL character. Here is an example with the letter f:
>>> urlize('https://example.com/%66')
u'https://example.com/%66'
A space on the other hand is not a legal URL character (although it is often tolerated). Therefore, it remains encoded to %20 both in the link and in the visible link depiction.
The second part of the question is why is it only unescaping in the link but not in the visible depiction. That also makes sense. In the href, it does not matter whether you pass in https://example.com/%66 or https://example.com/f. The effect is the same, and the depiction is "under the hood." So urlize uses the simplest form, without the unnecessary encoding. On the other hand, the visible part is presented to the user. Therefore, urlize tries to preserve the exact depiction which it was passed in originally, as that is the least surprising thing to do.

Related

HTML Hrefs in Tomcat

I'm building a HTML string in tomcat and I notice that in my JSON object, my clickable href link is something like:
http://localhost/%22/https://myLinkHere.com/%22
This is a 2 part question. First, should it contain the http://localhost in front? And secondly, why is the %22 there?
Here is what my JSON href looks like in text:
linkDisplayName
This looks right to me, but I can't tell why the last %22 is there.
I think you won't need the localhost as long as you are supplying the relative path
The ascii code for %22 is " which is correctly referenced in your link.
HTML parsers are very lenient, which often leads to confusing behavior. Without the exact JSON it's hard to say for sure but there are a couple of obvious issues. Ultimately the issue is your HTML is malformed and/or mis-escaped.
%22 is " URL-encoded, which means that the quotes you've \-escaped are being included in the URL rather than surrounding them. That likely means that in the JSON they're double-escaped. That might mean it's \\" or something similar; try just a single backslash (\") or no backslash at all (").
Notice that the protocol (https:/) in your URL is also wrong; a URL starts with a protocol (like https) followed by a :, and generally followed by two slashes (//). Your URL follows the protocol with just a single slash, which makes it look like a relative URL rather than an absolute one. Browsers will prefix relative URLs with whatever they infer the current host to be, which in your context appears to be localhost.
The HTML should look like this:
linkDisplayName
So in summary no, the URL should probably not contain http://localhost, and it should not contain those %22s either. They're showing up because your JSON is malformed.

Unwanted characters being added to url in HTML

I'm trying to include a simple hyperlink in a website:
...Engineers (IEEE) projects:
So that it ends up looking like "...Engineers (IEEE) projects:" with "IEEE" being the hyperlink.
When I click on copy link address and paste the address, instead of getting
http://www.ieee.ucla.edu/
I get
http://www.ieee.ucla.edu/%C3%A2%E2%82%AC%C5%BD
and when I click on the link, it takes me to a 404 page.
Check the link. These special character are added automatically by browser (URL Encoding).
Url Encoding
Use this code and it will work::
IEEE
The proper format to add hyperlink to a html is as follow
(texts to be hyperlink)
and for better understanding go through this link http://www.w3schools.com/html/html_links.asp
%C3%A2%E2%82%AC%C5%BD represents „ which is when you get when a unicode „ is being parsed as Windows-1252 data.
Use straight quotes to delimit attribute values in your real code. You are doing this in the code you have included in the question, but that won't have the effect you are seeing. Presumably your codes are being transformed at some point in your real code.
Add appropriate HTTP headers and <meta> data to tell the browser what encoding your file is really using

A html space is showing as %2520 instead of %20

Passing a filename to the firefox browser causes it to replace spaces with %2520 instead of %20.
I have the following HTML in a file called myhtml.html:
<img src="C:\Documents and Settings\screenshots\Image01.png"/>
When I load myhtml.html into firefox, the image shows up as a broken image. So I right click the link to view the picture and it shows this modified URL:
file:///c:/Documents%2520and%2520Settings/screenshots/Image01.png
^
^-----Firefox changed my space to %2520.
What the heck? It converted my space into a %2520. Shouldn't it be converting it to a %20?
How do I change this HTML file so that the browser can find my image? What's going on here?
A bit of explaining as to what that %2520 is :
The common space character is encoded as %20 as you noted yourself.
The % character is encoded as %25.
The way you get %2520 is when your url already has a %20 in it, and gets urlencoded again, which transforms the %20 to %2520.
Are you (or any framework you might be using) double encoding characters?
Edit:
Expanding a bit on this, especially for LOCAL links. Assuming you want to link to the resource C:\my path\my file.html:
if you provide a local file path only, the browser is expected to encode and protect all characters given (in the above, you should give it with spaces as shown, since % is a valid filename character and as such it will be encoded) when converting to a proper URL (see next point).
if you provide a URL with the file:// protocol, you are basically stating that you have taken all precautions and encoded what needs encoding, the rest should be treated as special characters. In the above example, you should thus provide file:///c:/my%20path/my%20file.html. Aside from fixing slashes, clients should not encode characters here.
NOTES:
Slash direction - forward slashes / are used in URLs, reverse slashes \ in Windows paths, but most clients will work with both by converting them to the proper forward slash.
In addition, there are 3 slashes after the protocol name, since you are silently referring to the current machine instead of a remote host ( the full unabbreviated path would be file://localhost/c:/my%20path/my%file.html ), but again most clients will work without the host part (ie two slashes only) by assuming you mean the local machine and adding the third slash.
For some - possibly valid - reason the url was encoded twice. %25 is the urlencoded % sign. So the original url looked like:
http://server.com/my path/
Then it got urlencoded once:
http://server.com/my%20path/
and twice:
http://server.com/my%2520path/
So you should do no urlencoding - in your case - as other components seems to to that already for you. Use simply a space
When you are trying to visit a local filename through firefox browser, you have to force the file:\\\ protocol (http://en.wikipedia.org/wiki/File_URI_scheme) or else firefox will encode your space TWICE. Change the html snippet from this:
<img src="C:\Documents and Settings\screenshots\Image01.png"/>
to this:
<img src="file:\\\C:\Documents and Settings\screenshots\Image01.png"/>
or this:
<img src="file://C:\Documents and Settings\screenshots\Image01.png"/>
Then firefox is notified that this is a local filename, and it renders the image correctly in the browser, correctly encoding the string once.
Helpful link: http://support.mozilla.org/en-US/questions/900466
Try using this
file:///c:/Documents%20and%20Settings/screenshots/Image01.png
Whenever you are trying to open a local file in the browser using cmd or any html tag use "file:///" and replace spaces with %20 (url encoding of space)
The following code snippet resolved my issue. Thought this might be useful to others.
var strEnc = this.$.txtSearch.value.replace(/\s/g, "-");
strEnc = strEnc.replace(/-/g, " ");
Rather using default encodeURIComponent my first line of code is converting all spaces into hyphens using regex pattern /\s\g and the following line just does the reverse, i.e. converts all hyphens back to spaces using another regex pattern /-/g. Here /g is actually responsible for finding all matching characters.
When I am sending this value to my Ajax call, it traverses as normal spaces or simply %20 and thus gets rid of double-encoding.
Try this?
encodeURIComponent('space word').replace(/%20/g,'+')

Why is "&reg" being rendered as "®" without the bounding semicolon

I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be
http://ravercats.com/meow?foo=bar&region=catnip
is instead coming through as:
http://ravercats.com/meow?foo=bar®ion=catnip
I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:
&VALUE;
where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the ® entity, and it's wreaking all kinds of havoc throughout our system.
Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.
Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:
<html>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</html>
EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".
Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.
Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space) or otherwise always escape & as & whenever in doubt.
For reference, the full list of named character references that are recognized without a semicolon is:
AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil,
ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT,
Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN,
Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig,
agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy,
curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14,
frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt,
macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf,
ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg,
sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc,
ugrave, uml, uuml, yacute, yen, yuml
However, it should be noted that only when in an attribute value, named character references in the above list are not processed as such by conforming HTML5 parsers if the next character is a = or a alphanumeric ASCII character.
For the full list of named character references with or without ending semicolons, see here.
This is a very messy business and depends on context (text content vs. attribute value).
Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without trailing semicolon, if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as entity region has not been defined. XHTML makes the trailing semicolon required.
Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.
Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:
If the character reference is being consumed as part of an attribute,
and the last character matched is not a ";" (U+003B) character, and
the next character is either a "=" (U+003D) character or in the range
ASCII digits, uppercase ASCII letters, or lowercase ASCII letters,
then, for historical reasons, all the characters that were matched
after the U+0026 AMPERSAND character (&) must be unconsumed, and
nothing is returned.
Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But reg_test= is a different case, due to the underscore character.)
In text content, other rules apply. The construct &region= causes then a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
Maybe try replacing your & as &? Ampersands are characters that must be escaped in HTML as well, because they are reserved to be used as parts of entities.
1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):
In the above example, the & character should be encoded as &, like so:
2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, all possibly valid HTML entities are converted to HTML entities.
Here is a simple solution and it may not work in all instances.
So from this:
http://ravercats.com/meow?status=Online&region=Atlantis
To This:
http://ravercats.com/meow?region=Atlantis&status=Online
Because the &reg as we know triggers the special character ®
Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
Escape your output!
Simply enough, you need to encode the url format into html format for accurate representation (ideally you would do so with a template engine variable escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in php).
See your test case and then the correctly encoded html at this jsfiddle:
http://jsfiddle.net/tchalvakspam/Fp3W6/
Inactive code here:
<div>
Unescaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
It seems to me that what you have received from google is not an actual URL but a variable which refers to a url (query-string). So, thats why it's being parsed as registration mark when rendered.
I would say, you owe to url-encode it and decode it whenever processing it. Like any other variable containing special entities.
To prevent this from happening you should encode urls, which replaces characters like the ampersand with a % and a hexadecimal number behind it in the url.

How do I prevent the GET method from encoding HTML special characters in the URI?

I have a form using the GET method.
If values are submitted with special characters, they appear in the URI as:
?value=fudge%20and%20stuff
How do I make it clean?
I don't want to use the header function because this is happening within a page in drupal.
A URL cannot contain spaces and many other "special characters", therefore they get encoded. Unfortunately there isn't a lot you can do about it. The most you could do is some JavaScript trickery in the form, but I don't think it's worth it.
If %20 bothers you, you can substitute (GREP replace) the + character (?value=fudge+and+stuff) for better readability. Otherwise there's not a lot you can do. Other "exotic" characters will be similarly escaped, and need to be.
URL :?value=fudge%20and%20stuff
Encoded as: fulg< space > and < space >stuff