How to turn text into DOM elements using Clojurescript? (And Om?) - clojurescript

In a ClojureScript/Om app, I have a DOM and a string of HTML.
How can I translate that string of HTML into elements I can insert into the DOM?
I've started down the path of parsing the HTML with hickory, planning to then process the hickory data to create DOM elements, but I think there must be a simpler way I'm overlooking.
(I don't need to validate the HTML, I can assume it's safe and valid enough.)

You don't need to parse the HTML string. It's unnecessary overhead. React/Om supports something like DOM's innerHTML property. Just set the prop this way:
(om.dom/div #js {:dangerouslySetInnerHTML #js {:__html "<b>Bold!</b>"}} nil)
If you work with plain DOM without Om, set the innerHTML property like:
(let [div (. js/document getElementById "elId")]
(set! (. div -innerHTML) "<b>Bold!</b>"))

Aleš Roubíček answer is much better. I'll leave this un case it helps somebody.
Hickory offers a as-hiccup function. Hiccup uses Clojure data structures to represent HTML. You could feed those data structures to Clojurescript libraries that follow the same conventions:
Hiccups to generate regular DOM elements.
Sablono to generate React DOM elements and use with Om.
You could also use Kioo/Enfocus and instead of passing a file path, pass the string directly. This would be more direct and instead of using two libraries (Hickory + Sablono) you would use only one. The caveat is that Kioo and Enfocus follow the Enlive style of templating (which is great but has a learning curve) and the docs are focused on file paths and not strings (even though it is possible to pass strings).

Related

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script i'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.
You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttributes('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
Thanks all.

Why aren't NodeList / HtmlCollection seqable?

As a newcomer to Clojurescript it appears to me that every Clojurescript project will have some snippet of code like this:
(extend-type js/NodeList
ISeqable
(-seq [array] (array-seq array 0)))
Why isn't this part of the core library?
You have to think that clojurescript is a compiler to javascript as a language, not only browser JavaScript. You can also use it in other platforms like nodejs or with the QT library where NodeList does not exist (because it is part of the Dom api and not the standard language).
If you are looking for a way to create a sequence from a NodeList there is array-seq function.
(array-seq (js/document.querySelectorAll "div"))
With a patch for https://clojure.atlassian.net/browse/CLJS-3199 applied in ClojureScript 1.10.741 (seq (js/document.querySelectorAll "div")) now actually works out of the box.

How to set an attribute of a DOM element in Clojurescript?

I wish to set the "value" property of an "input" element using Clojurescript, but I am having trouble with the syntax of setProperties in goog.com. Has anyone got a working example?
Update
------
This seems to work:
(goog.dom.setProperties
(goog.dom/getElement "element-name")
(clj->js {:value "text"}))
If you need to create throwaway JS objects for use with JS APIs, you can do so directly using js-obj:
(js-obj "value" "text")
;; produces {"value": "text"} in the compiled output
Of course if you already have a ClojureScript map with the appropriate entries, clj->js will be more convenient.
More importantly, you might want to consider switching to a ClojureScript library for DOM manipulation. Several are available:
Luke VanderHart's Domina, which might have been the first one, is used by Enfocus (listed below) and Pedestal;
Prismatic's dommy, notable due to its own merits as well as the very entertaining blog posts about it on Prismatic's blog (which can serve as a great introduction to the benefits of macros: first one, second one, third one);
Creighton Kirkendall's Enfocus, which is in a nutshell an Enlive-like library for ClojureScript, which is awesome;
Kevin Lynagh's Singult, which is a Hiccup-style library for ClojureScript with cool functionality for merging in changes to the DOM, rather than rerendering from scratch.

Mapping plain text back into HTML document

Situation: I have a group of strings that represent Named Entities that were extracted from something that used to be an HTML doc. I also have both the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of the strings in the stripped file.
I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:
Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.
Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.
Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.
(I was a bit iffy on how to tag this question, so feel free to re-tag it.)
Use a range utility method plus an annotation library such as one of the following:
artisan.js
annotator.js
vie.js
The free software Rangy JavaScript library is your friend. Regarding your two tasks:
Find the start / end points of the […] strings in the HTML doc. You can use Range#findText() from the TextRange extension. It indeed results in a DOM Level 2 Range compatible object [source].
Given that Range object, apply a styling […] to the range. This can be handled with the Rangy Highlighter module. If necessary, it will use multiple DOM elements for the highlighting to keep up a DOM tree structure.
Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by #Paul Sweatte. Using an annotation library would be a further extension on range library functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. It's just not required in your case, since you only want to render highlights, not allow users to add them.

Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA kept between a set of tags which one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash
Since it is HTML I think this could work for you?
https://metacpan.org/pod/XML::XPath
XPath is the way.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($file_name);
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
my $data = $node->findvalue('.'); // the content of the node
print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
Use HTML parsing modules as described in answers to this Q - HTML::TreeBuilder or HTML::Parser.
Purely theoretically you could try doing this using Regular Expressions to do this but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters - too easy to get wrong, too hard to get well, and impossible to get 100% right since HTML is not a regular language.
You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.