Mapping plain text back into HTML document - html

Situation: I have a group of strings that represent Named Entities that were extracted from something that used to be an HTML doc. I also have both the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of the strings in the stripped file.
I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:
Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.
Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.
Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.
(I was a bit iffy on how to tag this question, so feel free to re-tag it.)

Use a range utility method plus an annotation library such as one of the following:
artisan.js
annotator.js
vie.js

The free software Rangy JavaScript library is your friend. Regarding your two tasks:
Find the start / end points of the […] strings in the HTML doc. You can use Range#findText() from the TextRange extension. It indeed results in a DOM Level 2 Range compatible object [source].
Given that Range object, apply a styling […] to the range. This can be handled with the Rangy Highlighter module. If necessary, it will use multiple DOM elements for the highlighting to keep up a DOM tree structure.
Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by #Paul Sweatte. Using an annotation library would be a further extension on range library functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. It's just not required in your case, since you only want to render highlights, not allow users to add them.

Related

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script i'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.
You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttributes('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
Thanks all.

How to incorporate HTML tags for styling certain parts of localized strings (Polymer)

The web application I am working on uses resource strings for localization. The issue I am having is with styling certain parts of these strings. Let's say I want to display this string:
user1234 created a new document.
So in the resource file it would be localized like so:
{username} created a new document.
The issue is I also need <b></b> tags around {username}. I can't put these tags in the html file because I need it to apply just to the username, not to the whole localized string. So unless I split up the string into two localized strings (which I should definitely not do, because other languages do not necessarily have the same sentence structure), I have to put these html tags in the localized string itself:
<b>{username}</b> created a new document.
Even if we disregard best practices for a moment (of which I have read briefly) and go with this, this solution isn't working for me. I believe this is because the application is using Polymer (this seems to work with Angular). So if we stick by the following two requirements:
Use Polymer
Have the whole string together as one resource string
then there doesn't seem to be a way to style certain parts of the string. Does anyone know a solution?
I got it to work by setting the resource string to the inner HTML of the element which contains the string. So let's say the div containing the text has id="textElem", in the Javascript I set the inner HTML like so:
this.$.textElem.innerHTML = this.localize('user_created_document', 'username', this.username)
I suppose I should have specified in the question that my previous attempts of setting the string were just (a) simply binding the string to the property of an object and referencing that in the HTML, and (b) localizing the string directly in the HTML, neither of which worked.

Adding metadata to markdown text

I'm working on software creating annotations and would like my main data structure to be based around markdown.
I was thinking of working with an existing markdown editor, but hacking it so that certain tags, i.e. [annotation-id-001]Sample text.[/annotation-id-001] did not show up as rendered HTML; the above would output Sample text. in an HTML preview and link to a separate annotation with the ID 001.
My question is, is this the most efficient way to represent this kind of metadata inside of a markdown document? Also, if a user wants to legitimately use something like "[annotation-id-001]" as text inside of their document, I assume that I would have to make that string syntax illegal, correct?
I don't know what Markdown parser you use but you can abord your problem with different points of view:
first you can "hack" an existing parser to exclude your annotation tags from "classic" parsing and include them only in a certain mode
you can also use the internal "meta-data" information proposed by certain parsers (like MultiMarkdown or MarkdownExtended) and only write your annotations like meta-data with a reference to their final place in content
or, as mentionned by mb21, you can use simple links notation like [Sample text.](#annotation-id-001) or use footnotes like [Sample text.](^annotation-id-001) and put your annotations as footnotes.

Is it possible to nest one data: URI inside another?

If I use a data URI to construct a src attribute for an HTML element, can it in turn have another data URI inside it?
I know you can't use data uri's for iframes (I'm actually trying to construct an OSDX document and pass it to the browser with an icon encoded in base64 but that's a really niche use case and this is more of a general question), but assuming you could, my use case would look like:
var iframe = document.createElement('iframe');
var icon = document.createElement('image');
var iSrc = 'data:image/png;base64,/*[REALLY LONG STRING]*/';
iframe.src='data:text/html,<html><body><image src="'+iSrc+'" /></body</html>
document.body.appendChild(iframe);
Basically what I'm after is is there anything in a data uri that would break a parent data uri?
Yes you can. I really thought it was impossible, as did everyone I asked.
Example:
Pasting the following into your browser's URL bar should render a gmail logo in an html page that says hello world.
data:text/html,<html><body><p>hello world</p><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACwAAAAsCAMAAAApWqozAAAAz1BMVEX///+6xs7iTULj5+vW2tzk49TX3+Tq6tz09Orz9/jm7O6wu8aNm6WwMCjfRDnm5ube2Mmos7vt7++jrLP29/eyTk3d4+fj6+foWFLM1t2bNzeZpa2trKPv8/ajo6Xud2f0r6rv7uG3ubvd6eXIysq9s6jeopfSUknx8fHbXGKvTx7Hz9X09PTwhXHw9Pb++/fp7/Xu4d3SZlPSiXjuaFmMkZX39/vZy6/j8+v2+/v76uP7+/v0/Pewl4raMi/f5+7KTjitf4W2rI3/9vDzz8uC91bdAAAAAXRSTlMAQObYZgAAAmBJREFUeF6VldeunDAQQNeF3jvbe729p/f//6aMxzaBKHe5OZiVJY6OBvbBAyTvZaDJSy/xzpJ4pdJzz1tfOme5LD0vly45fETe4/1vUoK26bnPznUPzrPrmSJsgtrPpQnpPCmvp2/gukyE/HX6JtYor9/kpqUcY3ogvUxv5RhlSrbHenFzY/+0cxtYApZlGZod3W9JaqJspuSR1vTqw41U4UZb+S/zMAz35FbJt4QK6sUnW+naxWwoaJXBpBi37bxTjuchyl+0zLGs4npmVK1dHSLBiLhmlg+chLukFiLqV3dQ1i84p1SEv42CgLhYzmR5vqh1XIUXeypGoN+LAMoVz5yBk/EKZfsOQioOsnGFWXr/eXLCMs9EeehK2bab+NKCbQj2/mEymRSjpqzkR5CXOv4AWSqyM3BnRSDKw255CWtBW0BWcILyCmQsMywv5b/xS8WpzGJZzMywPFZjvANTYO10VoNfIxqOW2WlwhU/QnbSMDu1y0yWpSowdiqry6OArEW5ka1GNTaGsWqViwDL4z/lezClu4nBN+K/ypWSIywrNY4NxaqZWZSjThlEkQXLUna8lTZ+unWnDE6T1fbmtZlBxWxbFvHZ5NR8DUeUj5QejQ1mu24cv2C5aJW3R3qv1a4bX8TbIih+wAv6IPvDKCIrAusV8FEpyz5njFUV/ERdWFTBHVWwmGkyKKPcT2kyjvKFy1jPgnJ6IeTMg30vzCUZHBTc52mvzdKhz+GYcJIxTw8ukOKlVmd/SPk4kSdQ4nv8fJh7PriIw7Mn/yxPGW8dm06e269eOTwe/De/AW24fWb7B21TAAAAAElFTkSuQmCC" /></body></html>
or for a shorter example courtesy of Pumbaa80:
data:text/html,<script src="data:text/javascript,alert('hello world')"></script>
MSDN explicitly supports this:
Data URIs can be nested.
An old blog entry talks a little bit more about embedding images within CSS using data: :
Neither dataURI spec nor any other mentions if dataURI’es can not be nested. So here’s the testcase where dataURI’ed CSS has dataURI’ed image embedded. IE8b1, Firefox3 and Safari applied the stylesheet and showed the image, Opera9.50 (build 9613) applies the stylesheet but doesn’t show the embedded image! So it seems that Opera9 doesn’t expect to get anything embedded inside of an already embedded resource! :D
But funny thing, as IE8b1 supports expressions and also supports nested data URI’es, it has the same potential security flaw as Firefox does (as described in the section above). See the testcase — embedded CSS has the following code: body { background: expression(a()); } which calls function a() defined in the javascript of the main page, and this function is called every time the expression is reevaluated. Though IE8b1 has limited expressions support (which is going to be explained in a separate post) you can’t use any code as the expression value, but you can only call already defined functions or use direct string values. So in order to exploit this feature we need to have a ready javascript function already located on the page and then we can just call it from the expression embedded in the stylesheet. That’s not very trivial obviously, but if you have a website that allows people to specify their own stylesheets and you want to be on the safe side, you have to either make sure you don’t have a javascript function that can cause any potential harm or filter expressions from people’s stylesheets.

What are "data-url" and "data-key" attributes of <a> tag?

I've faced two strange attributes of an html tag . They are "data-url" and "data-key".
What are they and how can they be used?
For some reasons i can't show the exact example of the HTML file I've found them in, but here are some examples from the web with such tags:
data-key
data-key
data-url
PS: I've tried to Google, but no useful results were found.
When Should I Use the Data Attribute?
Custom data attributes are intended to store custom data private to the page or application, for which there are no more appropriate attributes or elements.
This time the data attribute is used to indicate the bubble value of the notification bubble.
Profile
This time is used to show the text for the tooltip.
This is the link
When Shouldn’t I Use the Data Attribute?
We shouldn’t use data attributes for anything which already has an already established or more appropriate attribute. For example it would be inappropriate to use:
<span data-time="20:00">8pm<span>
when we could use the already defined datetime attribute within a time element like below:
<time datetime="20:00">8pm</time>
Using Data Attributes With CSS (Attribute selectors)
[data-role="page"] {
/* Styles */
}
Using Data Attributes With jQuery (.attr())
Google
$('.button').click(function(e) {
e.preventDefault();
thisdata = $(this).attr('data-info');
console.log(thisdata);
});
Examples and info from here
Those are called HTML5 Custom Data attributes.
Custom data attributes are intended to store custom data private to
the page or application, for which there are no more appropriate
attributes or elements. These attributes are not intended for use by
software that is independent of the site that uses the attributes.
Every HTML element may have any number of custom data attributes
specified, with any value.
The reason why you can't find it in Google is because those attribute are custom attributes generated by user for their own usage.
From seeing your code it seems:
The person who has written this code, wants to store some additional
information with the elements. Not sure he may handle this in
Javascript too.
What you should do is to check the Javascript code completely,
whether he is handling those data attributes or if possible check
with him.
Since you code is using jQuery library, check for .data()
method. After a complete code review, if you find it has no use,
then feel free to remove.
data-* attributes are for adding arbitrary data to an element for use solely by the code (usually client side JavaScript) running on the site hosting the HTML.
In order to tell what the three examples you give are for, we would have to reverse engineer the code that comes with them (unless they are documented on the sites) since they don't have standard meanings.
A new feature being introduced in HTML 5 is the addition of custom data attributes. Simply, the specification for custom data attributes states that any attribute that starts with “data-” will be treated as a storage area for private data (private in the sense that the end user can’t see it – it doesn’t affect layout or presentation). This allows you to write valid HTML markup (passing an HTML 5 validator) while, simultaneously, embedding data within your page. A quick example:
<li class="user" data-name="John Resig" data-city="Boston"
data-lang="js" data-food="Bacon">
<b>John says:</b> <span>Hello, how are you?</span>
</li>