Turning an HTML structure into a Clojure data structure

I have an HTML page with one structure that I want to turn into a Clojure data structure, and I'm hitting a mental block on how to approach this in an idiomatic way.
This is the structure I have:
<div class="group">
  <h2>title1</h2>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading1</h3>
    <a href="path1" />
  </div>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading2</h3>
    <a href="path2" />
  </div>
</div>
<div class="group">
  <h2>title2</h2>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading3</h3>
    <a href="path3" />
  </div>
</div>
Structure I want:
'(["Title1" "subhead1" "path1"]
  ["Title1" "subhead2" "path2"]
  ["Title2" "subhead3" "path3"]
  ["Title3" "subhead4" "path4"]
  ["Title3" "subhead5" "path5"]
  ["Title3" "subhead6" "path6"])
The repetition of titles is intentional.
I've read David Nolen's Enlive tutorial. It offers a good solution when there is parity between group and subgroup, but in this case the counts can vary.
Thanks for any advice.

You can use Hickory for parsing, and then Clojure has some very nice tools for transforming the parsed HTML to the form you want:
(require '[hickory.core :as html])

(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "group"))
(def subgroup? (classifier :div "subgroup"))
(def path? (classifier :a nil))

(defn identifier? [tag] (classifier tag nil))

(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))
Example:
(require '[clojure.pprint :as pprint])

(def data
  "<div class=\"group\">
     <h2>title1</h2>
     <div class=\"subgroup\">
       <p>unused</p>
       <h3>subheading1</h3>
       <a href=\"path1\" />
     </div>
     <div class=\"subgroup\">
       <p>unused</p>
       <h3>subheading2</h3>
       <a href=\"path2\" />
     </div>
   </div>
   <div class=\"group\">
     <h2>title2</h2>
     <div class=\"subgroup\">
       <p>unused</p>
       <h3>subheading3</h3>
       <a href=\"path3\" />
     </div>
   </div>")

(pprint/pprint (process data))
;; (["title1" "subheading1" "path1"]
;;  ["title1" "subheading2" "path2"]
;;  ["title2" "subheading3" "path3"])

The solution can be split into two parts:
Parsing: parse the HTML with a Clojure HTML parser, or any other parser.
Custom data structure: transform the parsed HTML into the shape you want; you can use clojure.walk for that if you like, as sketched below.
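For instance, here is a rough sketch of that approach (my own illustration, not a complete solution): it assumes Hickory as the parser, and element? and collect-elements are hypothetical helper names. It parses the fragment, then walks it to gather the element nodes, which you would still need to regroup into [title subheading path] tuples, as the Hickory answer above does more directly.

(require '[hickory.core :as hickory]
         '[clojure.walk :as walk])

;; Parse the fragment into plain Clojure maps (hickory format).
(defn parse-groups [html-str]
  (map hickory/as-hickory (hickory/parse-fragment html-str)))

;; Hypothetical helper: is this node a hickory element map?
(defn element? [x]
  (and (map? x) (= :element (:type x))))

;; Walk one parsed tree and collect every element node into a flat vector.
(defn collect-elements [tree]
  (let [acc (atom [])]
    (walk/postwalk (fn [x]
                     (when (element? x)
                       (swap! acc conj (select-keys x [:tag :attrs :content])))
                     x)
                   tree)
    @acc))

From the flat sequence of elements you can then pick out the :h2, :h3, and :a nodes per group and build the tuples you want.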

You can solve this problem with the tupelo.forest library. Here is an annotated unit test showing the approach. You can find more information in the API docs and both the unit tests and the example demos. Additional documentation is forthcoming.
(dotest
  (with-forest (new-forest)
    (let [html-str        "<div class=\"group\">
                             <h2>title1</h2>
                             <div class=\"subgroup\">
                               <p>unused</p>
                               <h3>subheading1</h3>
                               <a href=\"path1\" />
                             </div>
                             <div class=\"subgroup\">
                               <p>unused</p>
                               <h3>subheading2</h3>
                               <a href=\"path2\" />
                             </div>
                           </div>
                           <div class=\"group\">
                             <h2>title2</h2>
                             <div class=\"subgroup\">
                               <p>unused</p>
                               <h3>subheading3</h3>
                               <a href=\"path3\" />
                             </div>
                           </div>"
          enlive-tree     (->> html-str
                               java.io.StringReader.
                               en-html/html-resource
                               first)
          root-hid        (add-tree-enlive enlive-tree)
          tree-1          (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; it just keeps things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace predicate
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                   (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >>              (apply remove-hid blank-leaf-hids)        ; delete the whitespace nodes found
          tree-2          (hid->hiccup root-hid)
          >>              (is= tree-2 [:html
                                       [:body
                                        [:div {:class "group"}
                                         [:h2 "title1"]
                                         [:div {:class "subgroup"}
                                          [:p "unused"]
                                          [:h3 "subheading1"]
                                          [:a {:href "path1"}]]
                                         [:div {:class "subgroup"}
                                          [:p "unused"]
                                          [:h3 "subheading2"]
                                          [:a {:href "path2"}]]]
                                        [:div {:class "group"}
                                         [:h2 "title2"]
                                         [:div {:class "subgroup"}
                                          [:p "unused"]
                                          [:h3 "subheading3"]
                                          [:a {:href "path3"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths    (find-paths root-hid [:** :div :h2])
          >>              (is= (format-paths div-h2-paths)
                               [[{:tag :html}
                                 [{:tag :body}
                                  [{:class "group", :tag :div}
                                   [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                                [{:tag :html}
                                 [{:tag :body}
                                  [{:class "group", :tag :div}
                                   [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. each "group");
          ; it is the next-to-last (-2) hid in each path vector
          div-hids        (mapv #(idx % -2) div-h2-paths)

          ; for each of div-hids, find and collect the nested :h3 values
          div-h3-paths    (vec
                            (lazy-gen
                              (doseq [div-hid div-hids]
                                (let [h2-value  (find-leaf-value div-hid [:div :h2])
                                      h3-paths  (find-paths div-hid [:** :h3])
                                      h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                                  (doseq [h3-value h3-values]
                                    (yield [h2-value h3-value]))))))]
      (is= div-h3-paths
           [["title1" "subheading1"]
            ["title1" "subheading2"]
            ["title2" "subheading3"]]))))

Related

How do I extract a list of specific DOM attributes in a nested DOM tree?

I would like to traverse the DOM, find the node with the data-table attribute and extract its value, then obtain its children with the data-field (or additional) attributes and extract their values, saving the result as a list.
In the HTML example below, I have set up DOM attributes as anchor points in the DOM tree, which is meant to be converted into a model structure after traversing and extracting them.
<body>
  <div class="wrap" data-table="page"> Sample Text
    <p data-field="heading" class="format">Welcome to this page</p>
    <div class="flex-grid generic-card">
      <h1 class="card" data-field="intro">Text</h1>
      <div class="card" data-field="body"></div>
    </div>
  </div>
</body>
I am expecting the final result to be a flat list, something like (page . ("title" "intro" "body")).
With the following code I am able to traverse the nodes and extract data-table, but the problem is that I am not able to extract the data-field attributes attached to data-table.
I unsuccessfully tried a recursive approach based on repeating the dom-struct example and the dom-search function.
What I noticed is that libxml-parse-html-region returns empty strings containing newlines alongside the DOM nodes after parsing the DOM tree, which generates an error.
The purpose of this code is to extract the nodes from the tree recursively:
(require 'dom)
(defun dom-struct (x)
  (print (dom-attr x 'data-table))      ; extract the data-table attribute
  (print (dom-tag (dom-node x)))        ; extract the dom tag
  (print (dom-children (dom-node x)))   ; extract the children of a node, but I don't know how to extract the data-field attribute
  (print (dom-search (dom-children (dom-node x))
                     (lambda (node) (assq 'data-attribute (cadr node)))))
  (mapconcat #'dom-struct (dom-children (dom-node x)) ""))

(defun macro-structify (tag-entries)
  (with-temp-buffer
    (insert tag-entries)
    (let* ((mytags (libxml-parse-html-region (point-min) (point-max))))
      (dom-struct (car (dom-by-tag mytags 'body))))))
(let ((myskel "<html>
<head>
<title>Demo: Gradient Slide</title>
</head>
<link href=\"https://fonts.googleapis.com/css?family=Nunito+Sans\" rel=\"stylesheet\">
<link rel=\"stylesheet\" href=\"dist/build.css\">
<body data-table=\"layout\">
<header data-field=\"title\">
<h1>Skeleton Screen</h1>
</header>
<div class=\"wrap\" data-table=\"page\"> Sample Text <p data-field=\"heading\" class=\"format\" data-attribute=\"somethingsomething\">Welcome to this page</p>
<div class=\"flex-grid generic-card\">
<div class=\"card loading\" data-field=\"intro\">Text </div>
<div class=\"card loading\" data-field=\"body\"></div>
</div>
</div>
</body>
</html>"))
(macro-structify myskel))
Here's a solution using esxml-query from the esxml package. It looks for all nodes with a data-field attribute that are children of a div node with a data-table attribute, then collects their attribute values into a list.
(require 'dom)
(require 'esxml-query)
(let* ((myskel "<html>
<head>
<title>Demo: Gradient Slide</title>
</head>
<link href=\"https://fonts.googleapis.com/css?family=Nunito+Sans\" rel=\"stylesheet\">
<link rel=\"stylesheet\" href=\"dist/build.css\">
<body data-table=\"layout\">
<header data-field=\"title\">
<h1>Skeleton Screen</h1>
</header>
<div class=\"wrap\" data-table=\"page\"> Sample Text <p data-field=\"heading\" class=\"format\" data-attribute=\"somethingsomething\">Welcome to this page</p>
<div class=\"flex-grid generic-card\">
<div class=\"card loading\" data-field=\"intro\">Text </div>
<div class=\"card loading\" data-field=\"body\"></div>
</div>
</div>
</body>
</html>")
       (dom (with-temp-buffer
              (insert myskel)
              (libxml-parse-html-region (point-min) (point-max))))
       (table-node (esxml-query "div[data-table]" dom))
       (model-nodes (esxml-query-all "[data-field]" table-node))
       (model-data-table (dom-attr table-node 'data-table))
       (model-data-fields (mapcar (lambda (node) (dom-attr node 'data-field))
                                  model-nodes)))
  (cons model-data-table model-data-fields))
;; => ("page" "heading" "intro" "body")
The result is different from what you've specified for several reasons:
The whole HTML snippet contains a body tag with a data-table attribute before a div tag with a data-table attribute, but your HTML fragment looks at the latter, so I've changed the code to look for a div tag with a data-table attribute
There is a header tag with a data-field attribute set to "title" (the expected field), but it's part of the body tag with the data-table attribute set to "layout", not the div tag with the data-table attribute set to "page" (the actual field)
The remaining fields are as expected, but printed differently than specified, because in many Lisp languages, (foo . (bar baz)) is identical to (foo bar baz) and usually printed in the latter form

How can I extract text from an HTML element containing a mix of `p` tags and inner text?

I'm scraping a website with some poorly structured HTML using a Clojure wrapper around jsoup called Reaver. Here is an example of some of the HTML structure:
<div id="article">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div>
This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags, and some are contained directly within the div. I also need the headlines and anchor tag text.
I know how to parse and extract the text from all of the p, a, and h tags, and I can select for the div and extract the inner text from it, but the problem is that I end up with two selections of text that I need to merge somehow.
How can I extract the text from this div, so that all of the text from the p, a, h tags, as well as the inner text on the div, are extracted in order? The result should be paragraphs of text in the same order as what is in the HTML.
Here is what I am currently using to extract, but the inner div text is missing from the results:
(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)"
                        text)))
Note also that additional unwanted elements (e.g., aside, figure) appear in this div. These elements contain text, as well as nested elements with text, that should not be included in the result.
You could extract the entire article as a JSoup object (likely an Element), then convert it to an EDN representation using reaver/to-edn. Then you go through the :content of that and handle both strings (the result of TextNodes) and elements that have a :tag that interests you.
(Code by vaer-k)
(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

(defn text-elem? [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))

(defn extract-text [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
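As a quick sanity check (my own example, not part of the answer above), you can feed extract-text a small hand-built map in the same shape that reaver/to-edn produces:

(extract-text
  {:tag     :div
   :attrs   {:id "article"}
   :content ["Raw text "
             {:tag :p, :attrs nil, :content ["Some text"]}
             {:tag :aside, :attrs nil, :content ["unwanted text"]}
             {:tag :a, :attrs nil, :content ["anchor tag"]}]})
;; => "Raw text Some textanchor tag"

Note that the pieces are concatenated without separators; if you want spaces or paragraph breaks between them, add a delimiter in the str calls inside extract-text.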
You can solve this using the tupelo.forest library, which was presented in an unsession at Clojure/conj 2019 just last week.
Below is the solution written as a unit test. First some declarations and the sample data:
(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))
(def html-src
"<div id=\"article\">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div> ")
To start off, we add the HTML data (a tree) to the forest after collapsing the whitespace. This uses the Java "TagSoup" parser internally:
(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or (= tag :aside)
                                        (= tag :nav)
                                        (= tag :figure)
                                        (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))
The spyx-pretty shows the "bush" format of the data (similar to hiccup format):
:html-orig (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
   [{:tag :p, :value "Etc etc"}]]]]
So we can see the data has been loaded correctly. Now, we want to remove all of the unwanted nodes as identified by the find-paths-with. Then, print the modified tree:
(doseq [path unwanted-node-paths]
  (remove-path-subtree path))
(newline) (spyx-pretty :html-cleaned (hid->bush root-hid))
:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :p, :value "Some text"}]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :p, :value "Etc etc"}]]]]
At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:
(let [txt-accum (atom [])]
  (walk-tree root-hid
    {:enter (fn [path]
              (let [hid   (last path)
                    node  (hid->node hid)
                    value (:value node)] ; may or may not be present
                (when (string? value)
                  (swap! txt-accum append value))))})
To verify, we compare the found text nodes (ignoring whitespace) to the desired result:
(is-nonblank= (str/join \space @txt-accum)
  "Some text
   More text
   A headline
   More text
   Here is a paragraph made of some raw text directly in the div
   Another paragraph of text
   More raw text and this one has an
   anchor tag
   inside
   Etc etc")))))
For more details, see the README file and the API docs. Be sure to also view the Lightning Talk for an overview.

Using Antizer over AntDesign, how can we access things like Card.Meta?

How do I translate examples like this from the antd docs to the Antizer ClojureScript world? See the code from https://ant.design/components/card/:
<Card
  style={{ width: 300 }}
  cover={<img alt="example" src="https://gw.alipayobjects.com/zos/rmsportal/JiqGstEfoWAOHiTxclqi.png" />}
  actions={[<Icon type="setting" />, <Icon type="edit" />, <Icon type="ellipsis" />]}
>
  <Meta
    avatar={<Avatar src="https://zos.alipayobjects.com/rmsportal/ODTLcjxAfvqbxHnVXCYX.png" />}
    title="Card title"
    description="This is the description"
  />
</Card>
The translated Reagent code should look something like this:
[card {:style {:width 300}
       :cover (as-element
                [img {:src "https://gw.alipayo..."
                      :alt "example"}])
       :actions (map as-element
                     [[icon {:type "setting"}]
                      [icon {:type "edit"}]
                      [icon {:type "ellipsis"}]])}
 [meta {:avatar (as-element
                  [avatar {:src "https://zos.alipayobj..."}])
        :title "Card title"
        :description "This is the description"}]]
as-element is used to convert a Reagent element into a plain React element.
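If your version of Antizer does not expose a component for Card.Meta directly, one workaround is to adapt the nested React class yourself. This is only a sketch under a couple of assumptions (not the Antizer-documented way): that you are using Reagent, and that the antd bundle is reachable as js/antd (as with the cljsjs package); ant/card comes from antizer.reagent.

(ns example.card
  (:require [antizer.reagent :as ant]
            [reagent.core :as r]))

;; Wrap the nested React classes (Card.Meta and Avatar) as Reagent components.
(def card-meta (r/adapt-react-class (.. js/antd -Card -Meta)))
(def avatar    (r/adapt-react-class (.-Avatar js/antd)))

(defn demo-card []
  [ant/card {:style {:width 300}}
   [card-meta
    {:avatar      (r/as-element [avatar {:src "https://zos.alipayobjects.com/rmsportal/ODTLcjxAfvqbxHnVXCYX.png"}])
     :title       "Card title"
     :description "This is the description"}]])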

ClojureScript Dropdown Menu

I have an HTML page that contains a navigation bar at the top of the screen. The navigation bar has a search box; I want to be able to type into this box, hit enter, and have the results displayed as a dropdown menu.
<li><input type="text" id="search-bar" placeholder="Search"></li>
This is the HTML search input box. I have given it the id search-bar so that I can eventually create the dropdown menu in ClojureScript:
(when-let [section (. js/document (getElementById "search-bar"))]
  (r/render-component [search-bar-component] section))
Currently I have a search-form that looks like the following
(defn search-form []
  [:div
   [:p "What are you searching for? "
    [:input
     {:type :text
      :name :search
      :on-change #(do
                    (swap! fields assoc :search (-> % .-target .-value))
                    (search-index (:search @fields)))
      :value (:search @fields)}]]
   [:p (create-links @search-results)]])

(defn- search-component []
  [search-form])
This is my search-component.
What I want to happen is this: when you type something into the input box on the navbar (say "hello") and hit enter, it calls search-index from the search-form with the value you typed ("hello") as the parameter, and the results are then returned as a dropdown menu below.
search-form currently works as a form on an HTML page, where you input some text and the results are displayed below it. I want to move it onto the navbar instead of a separate page, so that the input form is in the navbar and the results are displayed below it.
How would I have to change my search-form in order to do this?
I think I can do something along the lines of this
(defn search-bar-form []
  [:div
   [:input
    {:type :text
     :name :search
     :on-change #(do
                   (swap! fields assoc :search (-> % .-target .-value))
                   (search-index (:search @fields)))
     :value (:search @fields)}]
   [:p (create-links @search-results)]])

(defn- search-bar-component []
  [search-form])
Any help would be much appreciated.
re-com provides a typeahead component. It looks like you're using Reagent, so you could use it directly; otherwise you can use it as inspiration.
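If you want to stay with plain Reagent instead, a minimal sketch of the navbar version could look like the following. It assumes your existing fields and search-results atoms and the search-index and create-links functions are in scope; the inline styles are just placeholders for whatever dropdown CSS you prefer.

(defn search-bar-form []
  [:div {:style {:position "relative"}}            ; anchor for the dropdown
   [:input
    {:type        :text
     :name        :search
     :placeholder "Search"
     :value       (:search @fields)
     :on-change   #(do
                     (swap! fields assoc :search (-> % .-target .-value))
                     (search-index (:search @fields)))}]
   ;; Show the results only when there are any, positioned under the input
   ;; so they overlay the page content below the navbar.
   (when (seq @search-results)
     [:div {:style {:position "absolute" :top "100%" :left 0}}
      (create-links @search-results)])])

(defn- search-bar-component []
  [search-bar-form])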

More than one parameter in clostache/render function?

I am new to Clojure and I am trying to make a page where you can see all the news items from a table on the left of the page, and only the sports news on the right. I tried to add a new parameter to clostache/render:
(defn render-template [template-file params param]
  (clostache/render (read-template template-file) params param))

(defn welcome []
  (render-template "index" {:sports (model/justSports)} {:news (model/all)}))
where the model/all and model/justSports are:
(defn all []
  (j/query mysql-db
           (s/select * :news)))

(defn justSports []
  (j/query mysql-db
           (s/select * :news ["genre = ?" "sports"])))
and the news should be shown like this:
<div style="background-color: #D3D3D3; width: 450px; height: 800px; position: absolute; right: 10px; margin-top: 10px; border-radius: 25px;">
  <sections>
    {{#sports}}
    <h2>{{title}}</h2>
    <p>{{text}}</p>
    {{/sports}}
  </sections>
</div>
<div class="container" style="width: 500px; height: 800px; position: absolute; left: 20px;">
  <h1>Listing Posts</h1>
  <sections>
    {{#news}}
    <h2>{{title}}</h2>
    <p>{{text}}</p>
    {{/news}}
  </sections>
</div>
But it doesn't work: only the data from the first parameter is shown on the page. How can I make this work?
P.S.
Don't mind the ugly css, I will work on that :)
The following should make it work:
(defn render-template [template-file params]
  (clostache/render (read-template template-file) params))

(defn welcome []
  (render-template "index" {:sports (model/justSports)
                            :news   (model/all)}))
render has three "arities":
(defn render
  "Renders the template with the data and, if supplied, partials."
  ([template]
   (render template {} {}))
  ([template data]
   (render template data {}))
  ([template data partials]
   (replace-all (render-template template data partials)
                [["\\\\\\{\\\\\\{" "{{"]
                 ["\\\\\\}\\\\\\}" "}}"]])))
You were calling the 3-arity overload, which takes [template data partials], so clostache treated the second map (the one with the :news key) as the partials. You want the 2-arity version, which takes just [template data]: pass a single map containing both the :news and :sports keys.
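For illustration only (with made-up titles), here is a minimal REPL check showing that both sections render when all the data lives in one map, assuming clostache is an alias for clostache.parser as in your code:

(require '[clostache.parser :as clostache])

(clostache/render
  "{{#sports}}<h2>{{title}}</h2>{{/sports}} {{#news}}<h2>{{title}}</h2>{{/news}}"
  {:sports [{:title "Derby recap"}]
   :news   [{:title "Election results"} {:title "Derby recap"}]})
;; => "<h2>Derby recap</h2> <h2>Election results</h2><h2>Derby recap</h2>"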