I'm trying to create an HTML page out of indented text.
For example, this text file:
1. hello
    - stack
    - overflow
    - how
    - are you
Will come out as:
<ol>
<li>hello</li>
<ul>
<li>stack</li> ...
so it will render as an indented list.
I thought it would be best to create a node tree, inspired by this answer to a similar problem in Python.
Here's my Go port of the code from that link; it doesn't work as intended and gets stuck in the recursion for some reason:
func (n *node) addChildren(nodes []node) {
    childLevel := nodes[0].textStart
    for len(nodes) > 0 {
        // pop
        tempNode := nodes[0]
        nodes = nodes[1:]
        if tempNode.textStart == childLevel {
            n.children = append(n.children, tempNode)
        } else if tempNode.textStart > childLevel {
            nodes = append([]node{tempNode}, nodes...)
            n.children[len(n.children)-1].addChildren(nodes)
        } else if tempNode.textStart <= n.textStart {
            nodes = append([]node{tempNode}, nodes...)
            return
        }
    }
}
In the end I have found Markdown to be an optimal tool for the task!
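A note on the stuck recursion: `nodes` is a slice passed by value, so the elements the recursive call consumes are still present in the caller's slice, and the same nodes get re-processed forever. A minimal sketch of one fix is to return the unconsumed remainder from each call (the text field and the sample indentation levels here are my assumptions, not from the question):

```go
package main

import "fmt"

type node struct {
	text      string
	textStart int // indentation depth, as in the question
	children  []node
}

// addChildren consumes nodes into n's subtree and returns the
// unconsumed remainder, keeping the caller and callee in sync.
func (n *node) addChildren(nodes []node) []node {
	childLevel := nodes[0].textStart
	for len(nodes) > 0 {
		tempNode := nodes[0]
		if tempNode.textStart < childLevel {
			return nodes // belongs to an ancestor level; hand back up
		}
		nodes = nodes[1:]
		if tempNode.textStart == childLevel {
			n.children = append(n.children, tempNode)
		} else {
			// deeper level: recurse into the most recently added child
			last := &n.children[len(n.children)-1]
			nodes = last.addChildren(append([]node{tempNode}, nodes...))
		}
	}
	return nodes
}

// buildTree assembles a sample list resembling the question's input.
func buildTree() *node {
	items := []node{
		{text: "hello", textStart: 0},
		{text: "stack", textStart: 2},
		{text: "overflow", textStart: 2},
		{text: "how", textStart: 4},
		{text: "are you", textStart: 2},
	}
	root := &node{textStart: -1}
	root.addChildren(items)
	return root
}

func main() {
	root := buildTree()
	fmt.Println(len(root.children))             // top-level items
	fmt.Println(len(root.children[0].children)) // children of "hello"
}
```

With this shape in place, walking the tree and emitting nested `<ol>`/`<ul>` lists is a straightforward recursive render.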
Parse between elements
e.g.:
<span>7:33</span>AM </dd>\n
<dt>Dinner</dt>\n <dd id=\"Dinner\">\n <span>12:23</span>PM </dd>\n <dt>Lunch</dt>\n <dd id=\"Lunch\">\n <span>2:43</span>PM </dd>\n
How do I get the "AM"/"PM" values?
let test: [String] = document.querySelectorAll("span").compactMap({ element in
    guard let span = document.querySelector("dt") else {
        return nil
    }
    return span.elementId
})
This just loops, printing "7:33" nine times :(
You can solve this the same way you would in any other browser; the problem is not HTMLKit-specific.
Since there is no way to select an HTML text node via CSS, you have to select its parent and then access the text via the textContent property, or access the parent node's child nodes.
So here are some options to solve your problem, using HTMLKit as an example and the following sample DOM:
let html = """
<html>
  <body>
    <dl>
      <dt>Breakfast</dt>
      <dd id="Breakfast"><span>10:00</span>AM</dd>
      <dt>Dinner</dt>
      <dd id="Dinner"><span>12:23</span>PM</dd>
    </dl>
  </body>
</html>
"""
let doc = HTMLDocument(string: html)
let elements = doc.querySelectorAll("dd")
Option 1: Select the dd elements and access the textContent
elements.forEach { ddElement in
    print(ddElement.textContent)
}
// Would produce:
// 10:00AM
// 12:23PM
Option 2: Select the dd elements and iterate through their child nodes, filtering out everything except HTMLText nodes:
elements.forEach { ddElement in
    let iter: HTMLNodeIterator = ddElement.nodeIterator(showOptions: [.text], filter: nil)
    iter.forEach { node in
        let textNode = node as! HTMLText
        print(textNode.textContent)
    }
}
// Would produce:
// 10:00
// AM
// 12:23
// PM
Option 3: Expanding on the previous option, you can provide a custom filter for the node iterator:
for dd in elements {
    let iter: HTMLNodeIterator = dd.nodeIterator(showOptions: [.text]) { node in
        if !node.textContent.contains("AM") && !node.textContent.contains("PM") {
            return .reject
        }
        return .accept
    }
    iter.forEach { node in
        let textNode = node as! HTMLText
        print(textNode.textContent)
    }
}
// Would produce:
// AM
// PM
Option 4: Wrap the AM and PM in their own <span> elements and access those, e.g. with dd > span selector:
doc.querySelectorAll("dd > span").forEach { elem in
    print(elem.textContent)
}
// Given the sample DOM would produce:
// 10:00
// 12:23
// If you wrap the AM/PM in spans as well, you would also get those in the output.
Your snippet produces: ["", ""] with the sample DOM from above. Here is why:
let test: [String] = doc.querySelectorAll("span")
    .compactMap { element in // element is a <span> HTMLElement
        // However, the element returned here is a <dt> element and not a <span>,
        // and it is the same first <dt> on every iteration
        guard let span = doc.querySelector("dt") else {
            return nil
        }
        // The <dt> elements in the DOM do not have IDs, hence an empty string is returned
        return span.elementId
    }
I hope this helps and clarifies some things.
I have the following, where I am trying to capture only the second case, in which the preceding text matches "But I want this one here". Currently, it captures both cases.
package main

import (
    "bytes"
    "fmt"
    "io"
    "strings"

    "golang.org/x/net/html"
)

func getTag(doc *html.Node, tag string) []*html.Node {
    var nodes []*html.Node
    var crawler func(*html.Node)
    crawler = func(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == tag {
            nodes = append(nodes, node)
            return
        }
        for child := node.FirstChild; child != nil; child = child.NextSibling {
            crawler(child)
        }
    }
    crawler(doc)
    return nodes
}

func main() {
    doc, _ := html.Parse(strings.NewReader(testHTML))
    nodes := getTag(doc, "a")
    var buf bytes.Buffer
    w := io.Writer(&buf)
    for i, node := range nodes {
        html.Render(w, node)
        if i < (len(nodes) - 1) {
            w.Write([]byte("\n"))
        }
    }
    fmt.Println(buf.String())
}
var testHTML = `<html><body>
I do not want this link here <a>link text</a>
But I want this one here <a>more link text</a>
</body></html>`
This outputs:
<a>link text</a>
<a>more link text</a>
I would like to match specific text that precedes an <a> tag and, if it matches, return the <a> node. For instance, passing in "But I want this one here" would return the "more link text" anchor. I've been told not to parse HTML with regex, but now I am stuck.
You're actually pretty close, because you are already using a proper parser (html.Parse from golang.org/x/net/html).
The trick here is that the various elements of the page are bound together conveniently, so you can use your existing crawler code with a later filtering function, if you like. (You could instead combine the filtering directly into the crawler.)
Each element node n is preceded by something unless it's the initial element in a block (first in the document, or a first child node), and that something is in n.PrevSibling. If its type is html.TextNode, you have a sequence of the form:
some text<a ...>thing</a>
and you can examine the "some text" in the previous node:
func wanted(re *regexp.Regexp, n *html.Node) bool {
    if n.PrevSibling == nil || n.PrevSibling.Type != html.TextNode {
        return false
    }
    return re.MatchString(n.PrevSibling.Data)
}
This won't be perfect, because you could have, e.g.:
text <font></font> broken <font></font>up<a>last link</a>
and the code will try to match against the string up, when you probably should put the text together into text broken up and pass that to the matcher. See more complete example here.
For starters, I'm seeing two kinds of problems with the functionality of my code. I can't seem to find the correct element with the function xmlXPathEvalExpression. In addition, I am receiving errors similar to:
HTML parser error : Unexpected end tag : a
This happens for what appears to be all tags in the page.
For some background, the HTML is fetched by CURL and fed into the parsing function immediately after. For the sake of debugging, the return statements have been replaced with printf.
std::string cleanHTMLDoc(std::string &aDoc, std::string &symbolString) {
    std::string ctxtID = "//span[id='" + symbolString + "']";
    htmlDocPtr doc = htmlParseDoc((xmlChar*) aDoc.c_str(), NULL);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression((xmlChar*) ctxtID.c_str(), context);
    if (xmlXPathNodeSetIsEmpty(result->nodesetval)) {
        xmlXPathFreeObject(result);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        printf("[ERR] Invalid XPath\n");
        return "";
    }
    else {
        int size = result->nodesetval->nodeNr;
        for (int i = size - 1; i >= 0; --i) {
            printf("[DBG] %s\n", result->nodesetval->nodeTab[i]->name);
        }
        return "";
    }
}
The parameter aDoc contains the HTML of the page, and symbolString contains the id of the item we're looking for; in this case yfs_l84_aapl. I have verified that this is an element on the page in the style span[id='yfs_l84_aapl'] or <span id="yfs_l84_aapl">.
From what I've read, the errors from the HTML parser are due to a missing namespace, but when attempting to use the XHTML namespace I received the same errors. When instead using htmlParseChunk to write out the DOM tree, I do not receive these errors, thanks to options such as HTML_PARSE_NOERROR. However, htmlParseDoc does not accept these options.
For the sake of information, I am compiling with Visual Studio 2015 and have successfully compiled and executed programs with this library before. My apologies for the poorly formatted code. I recently switched from writing Java in Eclipse.
Any help would be greatly appreciated!
[Edit]
It's not a pretty answer, but I made what I was looking to do work. Instead of looking through the DOM by my (assumed) incorrect XPath expression, I moved through tag by tag to end up where I needed to be, and hard-coded in the correct entry in the nodeTab attribute of the nodeSet.
The code is as follows:
std::string StockIO::cleanHTMLDoc(std::string htmlInput) {
    std::string ctxtID = "/html/body/div/div/div/div/div/div/div/div/span/span";
    xmlChar* xpath = (xmlChar*) ctxtID.c_str();
    htmlDocPtr doc = htmlParseDoc((xmlChar*) htmlInput.c_str(), NULL);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression(xpath, context);
    if (xmlXPathNodeSetIsEmpty(result->nodesetval)) {
        xmlXPathFreeObject(result);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        printf("[ERR] Invalid XPath\n");
        return "";
    }
    else {
        xmlNodeSetPtr nodeSet = result->nodesetval;
        xmlNodePtr nodePtr = nodeSet->nodeTab[1];
        return (char*) xmlNodeListGetString(doc, nodePtr->children, 1);
    }
}
I will leave this question open in hopes that someone will help elaborate upon what I did wrong in setting up my XPath expression.
I've parsed a website using htmlparser and I would like to find a specific value inside the parsed object, for example a string "$199", and keep tracking that element (by periodic re-parsing) to see whether the value is still "$199" or has changed.
After some painful manual searching with my eyes, I found that the string is located somewhere like this:
price = handler.dom[3].children[3].children[3].children[5].children[1].
children[3].children[3].children[5].children[0].children[0].raw;
So I'd like to know whether there are less painful methods. Thanks!
A tree based recursive search would probably be easiest to get the node you're interested in.
I've not used htmlparser and the documentation seems a little thin, so this is just an example to get you started and is not tested:
function getElement(el, val) {
    if (el.children && el.children.length > 0) {
        for (var i = 0, l = el.children.length; i < l; i++) {
            var r = getElement(el.children[i], val);
            if (r) return r;
        }
    } else {
        if (el.raw == val) {
            return el;
        }
    }
    return null;
}
Call getElement(handler.dom[3], '$199') and it will go through all the children recursively until it finds an element without any children, then compare its raw value with '$199'. Note this is a straight comparison; you might want to swap it for a regexp or similar.
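To make the idea concrete, here is the same function exercised against a hand-built object shaped like htmlparser's DOM nodes (children / raw); the structure and values are invented for illustration:

```javascript
function getElement(el, val) {
    if (el.children && el.children.length > 0) {
        for (var i = 0, l = el.children.length; i < l; i++) {
            var r = getElement(el.children[i], val);
            if (r) return r;
        }
    } else {
        if (el.raw == val) {
            return el;
        }
    }
    return null;
}

// Invented stand-in for handler.dom[3]: nested nodes with leaf `raw` values.
var dom = {
    children: [
        { children: [{ raw: 'Item' }, { raw: '$249' }] },
        { children: [{ children: [{ raw: '$199' }] }] }
    ]
};

var hit = getElement(dom, '$199');
console.log(hit ? hit.raw : 'not found'); // prints "$199"
```

For the periodic-tracking part, you can simply re-parse on a timer and re-run the search, comparing the returned node's raw value against the previous one.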
I am using Groovy's handy MarkupBuilder to build an HTML page from various source data.
One thing I am struggling to do nicely is build an HTML table and apply different style classes to the first and last rows. This is probably best illustrated with an example...
table() {
    thead() {
        tr() {
            th('class':'l name', 'name')
            th('class':'type', 'type')
            th('description')
        }
    }
    tbody() {
        // Add a row to the table for each item in myList
        myList.each {
            tr('class' : '????????') {
                td('class':'l name', it.name)
                td('class':'type', it.type)
                td(it.description)
            }
        }
    }
}
In the <tbody> section, I would like to set the class of the <tr> element to be something different depending whether the current item in myList is the first or the last item.
Is there a nice Groovy-ified way to do this without resorting to something manual to check item indexes against the list size using something like eachWithIndex{}?
You could use
if (it == myList.first()) {
    // First element
}
if (it == myList.last()) {
    // Last element
}
The answer provided by sbglasius may lead to incorrect results, e.g. when the list contains duplicate elements: an element from inside the list may equal the last one.
I'm not sure whether sbglasius could use is() instead of ==, but a correct answer could be:
myList.eachWithIndex { elt, i ->
    if (i == 0) {
        // First element
    }
    if (i == myList.size() - 1) {
        // Last element
    }
}
if (it.after.value != null) {
    ......
}
This works for maps.