Embed HTML within a URL - html

Is it possible to embed HTML within a URL and then render that HTML in the browser itself?
In theory, what 'm thinking of works similarly to an URL like below:
http://"<h1>Hello World</h1>"
this would show a page with "Hello World" wrapped in a <h1> tag.
Of course, I understand that the above does not work in the real world for a wide range of reasons. Is there however a a way in which I can encode data within a URL and show render that data as HTML within the browser?
I understand that you could easily set up a webserver to do this, but I am interested in a solution which would work natively without any dependencies.

It's Data URL. Data URLs, URLs prefixed with the data: scheme, allow content creators to embed small files inline in documents. They were formerly known as "data URIs" until that name was retired by the WHATWG.
Data URLs are treated as unique opaque origins by modern browsers, rather than inheriting the origin of the settings object responsible for the navigation.
Syntax:
data:[<mediatype>][;base64],<data>
The HTML:
Test

This is what a data url does.
Go to welcome page

Related

Where to find entire HTML content in Chromium source code

I am currently trying to do this: once the webpage loads, find out if the URL is of a certain pattern (say www.wikipedia.com/*), then, if so, parse the HTML content of that webpage like one can do with BeautifulSoup, and check if the webpage has a div with class foo and id boo. Any idea where can I writ this code, that is, where can I get access to URL, where do I need to listen to to know that the webpage has finished loading following which I can look for the URL and HTML content, and where and how I can parse the HTML?
I tried going through the code in src/chrome/browser/tab_contents, I could not find any reasonable place where I can do all this.
Take a look at the following conceptual application layers which represent how Chromium displays web pages:
Image Source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit
The different layers are described as:
WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.
Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).
Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.
WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.
Browser: Represents the browser window, it contains multiple WebContentses.
Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).
Since your goal is to access and interpret the HTML content of a web page by element and/or class, you can look to the rendering process which uses Blink:
The renderers use the Blink open-source layout engine for interpreting and laying out HTML.
Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page:
WebDocument document = GetMainFrame()->GetDocument();
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// document.Url();
Cleanest would be via the chrome remote debugging protocol
Use the DOM methods to get the root DOM and walk, search, or query the dom
This would make testing simpler as well: you can implement the logic in your favourite scripting language using an existing client library (there are many) and once that works implement it in C++.
If this for some reason has to be inprocess within Chromium, as a next step start a thread that connects to this and performs the operations.
You need to use a server side library to parse the contents of a requested HTML page. In Java for example there is a library "jsoup" there might be another alternatives for other server side languages. The main problem you could find is a "forbiden access", due to security restrictions, but as you are not trying to access REST services or similar things but only parse pure HTML to found string patterns, it must be easily done with "jsoup". There was a project where similar things were programmed for accessing web sites pages & parse the response html string.
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
See: https://jsoup.org/

Is it possible to nest one data: URI inside another?

If I use a data URI to construct a src attribute for an HTML element, can it in turn have another data URI inside it?
I know you can't use data uri's for iframes (I'm actually trying to construct an OSDX document and pass it to the browser with an icon encoded in base64 but that's a really niche use case and this is more of a general question), but assuming you could, my use case would look like:
var iframe = document.createElement('iframe');
var icon = document.createElement('image');
var iSrc = '*[REALLY LONG STRING]*/';
iframe.src='data:text/html,<html><body><image src="'+iSrc+'" /></body</html>
document.body.appendChild(iframe);
Basically what I'm after is is there anything in a data uri that would break a parent data uri?
Yes you can. I really thought it was impossible, as did everyone I asked.
Example:
Pasting the following into your browser's URL bar should render a gmail logo in an html page that says hello world.
data:text/html,<html><body><p>hello world</p><img src="" /></body></html>
or for a shorter example courtesy of Pumbaa80:
data:text/html,<script src="data:text/javascript,alert('hello world')"></script>
MSDN explicitly supports this:
Data URIs can be nested.
An old blog entry talks a little bit more about embedding images within CSS using data: :
Neither dataURI spec nor any other mentions if dataURI’es can not be nested. So here’s the testcase where dataURI’ed CSS has dataURI’ed image embedded. IE8b1, Firefox3 and Safari applied the stylesheet and showed the image, Opera9.50 (build 9613) applies the stylesheet but doesn’t show the embedded image! So it seems that Opera9 doesn’t expect to get anything embedded inside of an already embedded resource! :D
But funny thing, as IE8b1 supports expressions and also supports nested data URI’es, it has the same potential security flaw as Firefox does (as described in the section above). See the testcase — embedded CSS has the following code: body { background: expression(a()); } which calls function a() defined in the javascript of the main page, and this function is called every time the expression is reevaluated. Though IE8b1 has limited expressions support (which is going to be explained in a separate post) you can’t use any code as the expression value, but you can only call already defined functions or use direct string values. So in order to exploit this feature we need to have a ready javascript function already located on the page and then we can just call it from the expression embedded in the stylesheet. That’s not very trivial obviously, but if you have a website that allows people to specify their own stylesheets and you want to be on the safe side, you have to either make sure you don’t have a javascript function that can cause any potential harm or filter expressions from people’s stylesheets.

How does a browser determine which objects to request?

When a browser receives the initial root HTML page, how does it determine exactly which other objects should be requested. Is there a list of HTML tags that the browser will always request the associated content from the server when they are detected?
I realize the need to implement an HTML parser for this, however I am not sure of all the individual tags and attributes that are important.
Browsers parse the HTML, and know which elements (with which attributes) require additional resources to be loaded.
i.e. They implement an HTML parser.

what is the # symbol in the url

I went to some photo sharing site, so when I click the photo, it direct me to a url like
www.example.com/photoshare.php?photoid=1234445
. and when I click the other photo in this page the url become
www.example.com/photoshare.php?photoid=1234445#3338901
and if I click other photos in the same page, the only the number behind # changes. Same as the pretty photo like
www.example.com/photoshare.php?album=holiday#!prettyPhoto[gallery2]/2/
.I assume they used ajax because the whole page seems not loaded, but the url is changed.
The portion of a URL (including and) following the # is the fragment identifier. It is special from the rest of the URL. The key to remember is "client-side only" (of course, a client could choose to send it to the server ... just not as a fragment identifier):
The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the server — of course the server typically helps to determine the MIME type, and the MIME type determines the processing of fragments. When an agent (such as a Web browser) requests a resource from a Web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.
This can be used to navigate to "anchor" links, like: http://en.wikipedia.org/wiki/Fragment_identifier#Basics (note how it goes the "Basics" section).
While this used to just go to "anchors" in the past, it is now used to store navigatable state in many JavaScript-powered sites -- gmail makes heavy use of it, for instance. And, as is the case here, there is some "photoshare" JavaScript that also makes use of the fragment identifier for state/navigation.
Thus, as suspected, the JavaScript "captures" the fragment (sometimes called "hash") changing and performs AJAX (or other background task) to update the page. The page itself is not reloaded when the fragment changes because the URL still refers to the same server resource (the part of the URL before the fragment identifier).
Newer browsers support the onhashchange event but monitoring has been supported for a long time by various polling techniques.
Happy coding.
It's called the fragment identifier. It identifies a "part" of the page. If there is an element with a name or id attribute equal to the fragment text, it will cause the page to scroll to that element. They're also used by rich JavaScript apps to refer to different parts of the app even though all the functionality is located on a single HTML page.
Lately, you'll often see fragments that start with "#!". Although these are still technically just fragments that start with the ! character, that format was specified by Google to help make these AJAXy pseudo-pages crawlable.
The '#' symbol in the context of a url (and other things) is called a hash, what comes after the hash is called a fragment. Using JavaScript you can access the fragment and use its contents.
For example most browsers implement a onhashchange event, which fires when the hash changes. Using JavaScript you can also access the hash from location.hash. For example, with a url like http://something.com#somethingelse
var frag = location.hash.substr(1);
console.log(frag);
This would print 'somethingelse' to the console. If we didn't use substr to remove the first character, it frag would be: '#somethingelse'.
Also, when you navigate to a URL with a hashtag, the browser will try and scroll down to an element which has an id corresponding to the fragment.
http://en.wikipedia.org/wiki/Fragment_identifier
It is the name attribute of an anchor URL: http://www.w3schools.com/HTML/html_links.asp
It is used to make a bookmark within an HTML page (and not to be confused with bookmarks in toolbars, etc.).
In your example, if you bookmarked the page with the # symbol in the URL, when you visit that bookmark again it will display the last image that you viewed, most likely an image that has the id of 3338901.
hey i used sumthing like this .... simple but useful
location.href = data.url.replace(/%2523/, '%23');
where data.url is my original url . It substitutes the # in my url

using encodeURI to display an entire page

Hi I am making a chrome extension. Where I save a page to the database as a string and then open it later as a dataURI scheme like:
d = 'data:text/html;charset=utf-8'+encodeURI('HTML TEXT')
location.reload(d);
The problem with this is that the page, say its name is http://X/, in which I executed the above command loses the javascript files in its head.
I considered using the document.write(d), if d has a string appeneded to it with the <head>...</head> of http://X/.
But this opens a big vulnerability problem for XSS. At this point I am trying to think of white listing tags when I save the original page... is there another way?
I'm not sure what you mean by http://X/, but if want copied website to retain its origin (i.e. have code you give run exactly as if it were downloaded from http://X/), then I'm afraid it's not possible with standard DOM methods (it would be a security vulnerability that bypasses same-origin security policy).
If you want to run 3rd party sourcecode safely, then use this:
<iframe sandbox src="data:…"></iframe>
You could modify the source and insert <base href="http://X/"> in there to make relative URLs work properly.