How do I stop Jsoup from encoding URL parameters?

I'm using Jsoup's parseBodyFragment() and parse() methods to work with blocks of code made up of script, noscript, and style tags. The goal isn't to clean them - just to select(), analyze, and output them. The select() portion works really well.
However, the issue is that it's automatically encoding the url parameters of src attributes. So, when the input is this:
<noscript>
<img height="1" width="1" style="display:none;" alt="" src="https://something.orother.com/i/cnt?txn_id=123&p_id=123"/>
</noscript>
I end up with this, returned from Jsoup, via the outerHTML() method:
<noscript>
<img height="1" width="1" style="display:none;" alt="" src="https://something.orother.com/i/cnt?txn_id=123&amp;p_id=123"/>
</noscript>
The issue is that the standard ampersand (&) in the URL parameter is being encoded and output as &amp;. Is there a way to disable this?
I'm looking for a way to get the html of the selected element without modification. Thanks!
Update (2/23/2016): Clarified problem. Also, found an issue on the Github repo describing the problem: https://github.com/jhy/jsoup/issues/372. Looks like this might not be possible.

The original HTML is invalid. An & which doesn't start a character reference must be expressed as &amp; in an HTML attribute value.
HTML parsers are expected to perform error recovery and generate a valid DOM.
Jsoup works by parsing the HTML into a DOM, letting you run queries on it, then exporting the DOM back to HTML afterwards.
You can't avoid white space normalisation, error recovery, or any of the other things that parsers do. The approach used by Jsoup to extract data is not designed to support the preservation of errors.
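The equivalence of the two forms can be checked outside Jsoup. The sketch below uses Python's standard html module (not Jsoup, purely as an illustration; the function names are made up) to show that the escaped attribute value decodes back to the original URL, so anything that parses the output HTML sees the same address:

```python
from html import escape, unescape

# URL from the question, as it should appear after HTML parsing.
RAW_URL = "https://something.orother.com/i/cnt?txn_id=123&p_id=123"


def to_attribute_value(url):
    """Escape a URL for use inside a double-quoted HTML attribute."""
    return escape(url, quote=True)


def from_attribute_value(value):
    """Decode character references the way an HTML parser would."""
    return unescape(value)


serialized = to_attribute_value(RAW_URL)
print(serialized)
print(from_attribute_value(serialized) == RAW_URL)
```

If the raw URL is needed for analysis, reading the attribute value from the parsed element rather than from the serialized HTML should avoid the escaping altogether.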

Related

escaping malformed URI reference

I followed a youtube tutorial on webpack and got a linter error in my HTML stating:
escaping malformed URI reference
for this image tag:
<img class="img-responsive" src=<%=require('./images/tech-town-showcase-students.JPG') %> alt="students meeting with tech business owner"/>
What does "escaping" mean here? The code still seems to run just fine. What do I need to do differently to avoid the linter error?
Try wrapping your ASP embedded code in double quotations like this:
src="<%=require('./images/tech-town-showcase-students.JPG') %>"
Leaving them out causes me headaches to no end. It rarely breaks anything, but the html validators act as though you've tried to escape the tag you're embedding ASP in early by omitting the quotes. At least, in my experience.

<img src="#"> Means in HTML?

I do have the following code in my HTML:
<img src="#" alt="image alternative text" />
What does src="#" mean? Because in HTML, I cannot have empty src attribute for an image tag. And if we do not have that attribute, visual studio 2012 will throw a suggestion.
Thanks
When you don't want any image to be loaded using src attribute (probably load the image dynamically later), you need to set src empty. But when you do that, browsers still send calls to server.
Browser behavior for empty src (Source: http://developer.yahoo.com/performance/rules.html)
Internet Explorer makes a request to the directory in which the page
is located. Safari and Chrome make a request to the actual page
itself. Firefox 3 and earlier versions behave the same as Safari and
Chrome, but version 3.5 addressed this issue[bug 444931] and no longer
sends a request. Opera does not do anything when an empty image src is
encountered.
To avoid the unnecessary call to server, instead of using empty src, src="#" could be used which forms a hash url and hash urls are not sent to server.
Let's say base url is : http://mysite.com/myapp/
src="" -> Absolute url: http://mysite.com/myapp/
src="#" -> Absolute url: http://mysite.com/myapp/#
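The resolution above can be reproduced with Python's urllib.parse as a rough sketch. It only shows how the reference resolves against the base URL and that the fragment is not part of the address a request would target; whether a given browser actually issues a request is a separate matter that this does not demonstrate. The fragment name "#top" and the helper names are made up for illustration:

```python
from urllib.parse import urldefrag, urljoin

BASE_URL = "http://mysite.com/myapp/"  # base URL from the example above


def resolve(src):
    """Resolve an img src value against the page's base URL."""
    return urljoin(BASE_URL, src)


def request_target(src):
    """Return the resolved URL with its fragment removed."""
    return urldefrag(resolve(src)).url


print(resolve(""))             # an empty reference resolves to the base URL
print(resolve("#top"))         # a fragment reference keeps the base path
print(request_target("#top"))  # the fragment is dropped from the request URL
```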
Most people use it as a placeholder for links just so that there are no errors when the code compiles. If a programmer gave you some code with the "#" as a placeholder, they probably want you to interchange it with a web URL.
As with the href attribute of an anchor element, <img src="#"> is roughly equivalent to <img src="thecurrenturl#" ..> (but see the fiddle example for why it's not identical).
As it is written src will never refer to a valid image resource but, presumably, something/someone can change the URL later or otherwise manipulate the element. Since the src attribute is required [1], substituting in such a valid "dummy" value appeases tools like the Visual Studio editor.
This fiddle shows the behavior which can be observed by using the DOM src property.
[1] "The src attribute must be present, and must contain a valid non-empty URL .."
More than likely just a placeholder subject to change dynamically based off of an event listener (or like @FabricioMatte suggested, a lazy loading technique.) It could also be just a placeholder to bypass any potential errors.
In src="#", the attribute value is a reference to the start of the document at the current base URL, according to the URL standard, STD 66 aka RFC 3986. You would need rather special arrangements to make that actually work as a reference to an image.
Why anyone would use such an attribute is a different question, and a yet another question is what is the problem that the construct is supposed to address.

iOS: Get Image From HTML String

I have an application that is pulling articles from the web and I need to retrieve the URL for the first image in an article. Here's an example of the code for these images:
<img alt="Twitter (zpower)" src="http://www.example.com/image.png" width="630" height="420">
I need to get just the value for the src. How would I do this?
You'll need to parse the HTML and extract the src attribute. You could do it by hand, but a better way is to rely on someone else's parsing library (for instance, ElementParser).
I'd like to second @ravuya's response, but also mention that you can use the built-in NSXMLParser to do the parsing for you.
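The parse-then-extract idea is the same in any language. As a minimal sketch in Python rather than Objective-C (the class and function names are invented for this example, and it uses the standard html.parser module, not ElementParser or NSXMLParser), an event-driven parser can record the src of the first img tag it sees:

```python
from html.parser import HTMLParser


class FirstImageSrc(HTMLParser):
    """Remember the src attribute of the first <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_src(html_text):
    """Return the src of the first image in html_text, or None."""
    parser = FirstImageSrc()
    parser.feed(html_text)
    return parser.src


article = ('<p>Intro</p><img alt="Twitter (zpower)" '
           'src="http://www.example.com/image.png" width="630" height="420">')
print(first_image_src(article))
```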

How to W3C validate php image resize script

I use a php image resize script which is invoked using:
<img src="/images/image.php?img=test.png&maxw=100&maxh=100" alt="This is a test image" />
but this does not pass W3C validation. Is there any way to get this to validate?
Since you haven't given an exact error message, I have to assume the validation fails because of the ampersands. Just take a look at the error description (which should also be linked directly from the validation report, so you could have easily found this on your own) to see how to solve this.
To avoid problems with both validators and browsers, always use &amp;
in place of & when writing URLs in HTML.
that said, just change your code to:
... src="/images/image.php?img=test.png&amp;maxw=100&amp;maxh=100" ...
It has nothing to do with PHP. All you need to do is turn those & characters into entities:
<img src="/images/image.php?img=test.png&amp;maxw=100&amp;maxh=100" alt="This is a test image" />
Really though, it's not that big of a deal. No browser (that I'm aware of) will misinterpret this, but if you want perfect validation then that's what you need to do.
If you output such URLs from PHP you can use htmlentities() to automatically convert e.g. & to &amp;
htmlentities — Convert all applicable characters to HTML entities
Example:
$path = "/images/image.php?img=test.png&maxw=100&maxh=100";
$path = htmlentities($path);
echo $path;
This would output this in your html:
/images/image.php?img=test.png&amp;maxw=100&amp;maxh=100
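The same two-step pattern (build the query string first, escape for HTML second) can be sketched in Python for comparison. This is not the PHP code above; html.escape stands in for htmlentities(), and img_tag is a hypothetical helper:

```python
from html import escape
from urllib.parse import urlencode


def img_tag(path, params, alt):
    """Build an <img> tag whose src is escaped for an HTML attribute."""
    # Step 1: join the query parameters with plain "&" to form the URL.
    url = path + "?" + urlencode(params)
    # Step 2: escape the finished URL (and alt text) for the HTML context.
    return '<img src="{}" alt="{}" />'.format(
        escape(url, quote=True), escape(alt, quote=True)
    )


tag = img_tag(
    "/images/image.php",
    {"img": "test.png", "maxw": 100, "maxh": 100},
    "This is a test image",
)
print(tag)
```

Keeping the two steps separate means the URL itself always uses a plain &, and only its HTML representation carries &amp;.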

What's the valid way to include an image with no src?

I have an image that I will dynamically populate with a src later with javascript but for ease I want the image tag to exist at pageload but just not display anything. I know <img src='' /> is invalid so what's the best way to do this?
Another option is to embed a blank image. Any image that suits your purpose will do, but the following example encodes a GIF that is only 26 bytes - from http://probablyprogramming.com/2009/03/15/the-tiniest-gif-ever
<img src="data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=" width="0" height="0" alt="" />
Edit based on comment below:
Of course, you must consider your browser support requirements. No support for IE7 or less is notable. http://caniuse.com/datauri
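As a quick sanity check of the embedded image, the base64 payload can be decoded and its header inspected. The sketch below (Python, with a made-up helper name) reads the signature and the logical screen size from the first ten bytes, which is where the GIF format stores them:

```python
import base64
import struct

# Payload of the data URI shown above.
TINY_GIF_B64 = "R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs="


def gif_info(b64_payload):
    """Return (signature, width, height, byte_count) for a base64 GIF."""
    data = base64.b64decode(b64_payload)
    signature = data[:6].decode("ascii")
    # Width and height are little-endian 16-bit values at offsets 6 and 8.
    width, height = struct.unpack("<HH", data[6:10])
    return signature, width, height, len(data)


print(gif_info(TINY_GIF_B64))
```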
While there is no valid way to omit an image's source, there are sources which won't cause server hits. I recently had a similar issue with iframes and determined //:0 to be the best option. No, really!
Starting with // (omitting the protocol) causes the protocol of the current page to be used, preventing "insecure content" warnings in HTTPS pages. Skipping the host name isn't necessary, but makes it shorter. Finally, a port of :0 ensures that a server request can't be made (it isn't a valid port, according to the spec).
This is the only URL which I found caused no server hits or error messages in any browser. The usual choice — javascript:void(0) — will cause an "insecure content" warning in IE7 if used on a page served via HTTPS. Any other port caused an attempted server connection, even for invalid addresses. (Some browsers would simply make the invalid request and wait for them to time out.)
This was tested in Chrome, Safari 5, FF 3.6, and IE 6/7/8, but I would expect it to work in any browser, as it should be the network layer which kills any attempted request.
These days IMHO the best short, sane & valid way for an empty img src is like this:
<img src="data:," alt>
or
<img src="data:," alt="Alternative Text">
The second example displays "Alternative Text" (plus broken-image-icon in Chrome and IE).
"data:," is a valid URI. An empty media-type defaults to text/plain. So it represents an empty text file and is equivalent to "data:text/plain,"
OT: All browsers understand plain alt. You can omit ="" , it's implicit per HTML spec.
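The claim about the default media type can be illustrated with Python's standard library, which handles data: URLs in urllib.request. This is only a sketch of how one non-browser implementation reads "data:,"; the helper name is hypothetical and browsers are not involved:

```python
from urllib.request import urlopen


def read_data_uri(uri):
    """Open a data: URI and return (content_type, body_bytes)."""
    with urlopen(uri) as response:
        return response.headers.get_content_type(), response.read()


# An empty media type falls back to text/plain, and the body is empty.
print(read_data_uri("data:,"))
# Spelling the media type out gives the same result.
print(read_data_uri("data:text/plain,"))
```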
I recommend dynamically adding the elements, and if using jQuery or other JavaScript library, it is quite simple:
http://api.jquery.com/appendTo/
http://api.jquery.com/prependTo/
http://api.jquery.com/html/
also look at prepend and append. Otherwise if you have an image tag like that, and you want to make it validate, then you might consider using a dummy image, such as a 1px transparent gif or png.
Use a truly blank, valid and highly compatible SVG, based on this article:
src="data:image/svg+xml;charset=utf8,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%3E%3C/svg%3E"
It will default in size to 300x150px as any SVG does, but you can work with that in your img element default styles, as you would possibly need in any case in the practical implementation.
I haven't done this in a while, but I had to go through this same thing once.
<img src="about:blank" alt="" />
Is my favorite - the //:0 one implies that you'll try to make an HTTP/HTTPS connection to the origin server on port zero (the tcpmux port?) - which is probably harmless, but I'd rather not do anyways. Heck, the browser may see the port zero and not even send a request. But I'd still rather it not be specified that way when that's probably not what you mean.
Anyways, the rendering of about:blank is actually very fast in all browsers that I tested. I just threw it into the W3C validator and it didn't complain, so it might even be valid.
Edit: Don't do that; it doesn't work on all browsers (it will show a 'broken image' icon as pointed out in the comments for this answer). Use the <img src='data:... solution below. Or if you don't care about validity, but still want to avoid superfluous requests to your server, you can do <img alt="" /> with no src attribute. But that is INVALID HTML so pick that carefully.
Test Page showing a whole bunch of different methods: http://desk.nu/blank_image.php - served with all kinds of different doctypes and content-types. - as mentioned in the comments below, use Mark Ormston's new test page at: http://memso.com/Test/BlankImage.html
As written in comments, this method is wrong.
I didn't find this answer before, but according to the W3C specs, a valid empty src value would be an anchor link, #.
Example: src="#", src="#empty"
The page validates successfully and no extra requests are made.
I found that simply setting the src to an empty string and adding a rule to your CSS to hide the broken image icon works just fine.
[src=''] {
visibility: hidden;
}
If you keep the src attribute empty, the browser will send a request to the current page URL.
Always put a 1x1 transparent image in the src attribute if you don't want any URL:
src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA="
I've found that using:
<img src="file://null">
will not make a request and validates correctly.
The browsers will simply block the access to the local file system.
But there might be an error displayed in console log in Chrome for example:
Not allowed to load local resource: file://null/
Building off of Ben Blank's answer, the only way that I got this to validate in the w3 validator was like so:
<img src="/./.:0" alt="">
I personally use an about:blank src and deal with the broken image icon by setting the opacity of the img element to 0.
<img src="invis.gif" />
Where invis.gif is a single pixel transparent gif. This won't break in future browser versions and has been working in legacy browsers since the '90s.
png should work too but in my tests, the gif was 43 bytes and the png was 167 bytes so the gif won.
P.S. Don't forget an alt attribute; validators like them too.
I know this is perhaps not the solution you are looking for, but it may help to show the user the size of the image before hand. Again, I don't fully understand the end goal but this site might help: https://via.placeholder.com
It's stupid easy to use and allows to show the empty image with the needed size.
Again, I understand you did not want to show anything, but this might be an elegant solution as well.
<img src="https://via.placeholder.com/300" style='width: 100%;' />
Simply, Like this:
<img id="give_me_src"/>