Retrieve attributes and span using HTMLAgilityPack library - html

In this piece of HTML code:
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
<img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
</div>
<div class="release">
<h3>Wolf Eyes</h3>
<h4>
Lower Demos
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
Rock
Pop
</div>
</div>
I know how to parse it in other ways, but I would like to retrieve this Info using HTMLAgilityPack library:
Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year : 2013
Genres: Rock, Pop
URL : http://www.mp3crank.com/wolf-eyes/lower-demos-121866
Which are these html lines:
Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year : <span>2013</span>
Genre1: Rock
Genre2: Pop
URL : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"
This is what I'm trying, but I always get an "object reference not set" exception when trying to select a single node.
Sorry, I'm very new to HTML; I've tried to follow the steps of this question: "HtmlAgilityPack basic how to get title and link?"
Public Class Form1
Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing
Private Title As String = String.Empty
Private Cover As String = String.Empty
Private Genres As String() = {String.Empty}
Private Year As Integer = -0
Private URL as String = String.Empty
Private Sub Test() Handles MyBase.Shown
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
Genres = ¿select multiple nodes?
URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value
Next
End Sub
End Class

Your mistake here is trying to read an attribute of a child node from the parent node you've found.
When you call node.SelectSingleNode("//div[@class='release']") you get the correct div returned, but calling .Attributes returns just the attributes of the div tag itself, not those of any inner HTML elements.
It's possible to write XPath queries that select the sub-node, e.g. //div[@class='release']/a - see http://www.w3schools.com/xpath/xpath_syntax.asp for more information on XPath. Although the examples are for XML, most of the principles should apply to an HTML document.
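To make the difference between "attributes of the div" and "attributes of an element inside the div" concrete, here is a minimal sketch in Python using the standard library's ElementTree rather than HtmlAgilityPack. The markup is a hypothetical, well-formed stand-in for a single release block, not the asker's page, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one "release" block.
snippet = """
<div class="release">
  <h4><a href="http://example.com/album" title="Some Album">Some Album</a></h4>
</div>
"""

release = ET.fromstring(snippet)

# The div itself only carries a class attribute, so asking it for "title" finds nothing.
title_on_div = release.get("title")

# Descending to the a element first gives access to that element's own attributes.
link = release.find(".//a")
title_on_link = link.get("title")
href_on_link = link.get("href")

print(title_on_div, title_on_link, href_on_link)
```

The same idea applies in the VB.NET code: select the inner element first, then read its attributes.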
Another approach is to use further XPATH calls on the node you've found. I've amended your code to make it work using this approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
' Assumes we find the node and it has an a tag
Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value
Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value
Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)
Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
Dim genreLinks = genreNode.SelectNodes(".//a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Next
Note that in this code we're assuming the document is correctly formed and that each node/element/attribute exists and is correct. You might want to add a lot of error checking to this, e.g. If someNode Is Nothing Then ....
Edit: I've amended the code above slightly, to ensure each .SelectSingleNode uses the ".//" prefix - this ensures it works if there are several "item" nodes, otherwise it selects the first match from the document not the current node.
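The effect of scoping each search to the current node can be sketched in Python with ElementTree. This is an illustration only: the two-item document below is made up, and ElementTree stands in for HtmlAgilityPack, so the comparison is between searching from the current item and searching from the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two "item" blocks.
page = ET.fromstring("""
<body>
  <div class="item"><div class="release-year"><span>2013</span></div></div>
  <div class="item"><div class="release-year"><span>2011</span></div></div>
</body>
""")

items = page.findall(".//div[@class='item']")

# Searching from each item keeps the match inside that item.
years_scoped = [
    item.find(".//div[@class='release-year']/span").text for item in items
]

# Searching from the document root inside the loop returns the first match every time.
years_unscoped = [
    page.find(".//div[@class='release-year']/span").text for _ in items
]

print(years_scoped, years_unscoped)
```

The second list repeats the first item's year, which is the symptom you would see without the relative prefix.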
If you want a shorter XPATH solution, here is the same code using that approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value
Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)
Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Console.WriteLine()
Next

You were not that far from the solution. Two important notes:
// is a recursive (descendant) search. It can have a heavy performance impact, and it may also select nodes you don't want, so I suggest you only use it when the hierarchy is deep, complex, or variable and you don't want to specify the whole path.
There is a useful helper method on HtmlNode named GetAttributeValue which will get you an attribute's value even if the attribute does not exist (you need to specify the default value).
Here is a sample that seems to work:
' select the base/parent DIV (here we use a discriminant CLASS attribute)
' all select calls below will use this DIV element as a starting point
Dim node As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//div[@class='item']")
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Title :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("title", CStr(Nothing))))
' get to the IMG tag which is a child or grand child (//) of a 'thumb' DIV
Console.WriteLine(("Cover :" & node.SelectSingleNode("div[@class='thumb']//img").GetAttributeValue("src", CStr(Nothing))))
' get to the SPAN tag which is a child or grand child (//) of a 'release-year' DIV
Console.WriteLine(("Year :" & node.SelectSingleNode("div[@class='release-year']//span").InnerText))
' get all A elements which are child or grand child(//) of a 'genre' DIV
Dim nodes As HtmlNodeCollection = node.SelectNodes("div[@class='genre']//a")
Dim i As Integer
For i = 0 To nodes.Count - 1
Console.WriteLine(String.Concat(New Object() { "Genre", (i + 1), ":", nodes.Item(i).InnerText }))
Next i
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Url :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("href", CStr(Nothing))))
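The "give me the attribute or a default" idea from the notes above can be sketched in Python as well. This is only an analogy using the standard library's ElementTree: attr_or_default is a hypothetical helper written for this example (it is not part of HtmlAgilityPack), and the markup is invented. It also guards against the node itself being missing, which is the usual source of the null reference error.

```python
import xml.etree.ElementTree as ET

def attr_or_default(parent, path, name, default=""):
    """Return attribute `name` of the first node matching `path`, or `default`."""
    node = parent.find(path)
    if node is None:
        # The element was not found at all.
        return default
    # Element.get returns the default when the attribute is absent.
    return node.get(name, default)

# Hypothetical item with a thumb block but no release block.
item = ET.fromstring("""
<div class="item">
  <div class="thumb"><a href="http://example.com/a"><img src="cover.jpg" /></a></div>
</div>
""")

cover = attr_or_default(item, "div[@class='thumb']//img", "src")
alt = attr_or_default(item, "div[@class='thumb']//img", "alt", "n/a")
title = attr_or_default(item, "div[@class='release']//a", "title", "n/a")

print(cover, alt, title)
```

Here the missing alt attribute and the missing release block both fall back to "n/a" instead of raising an error.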

Related

Identify NextSibling in XMLHTTP response

I am still trying to learn about NextSibling and I am using XMLHTTP in excel VBA.
Here's the HTML for the element
<ul class="list-unstyled list-specification">
<li><span>ID</span> <span class="text-info">22928</span></li>
<li><span>Category</span> <span class="text-info">Mechanical</span></li>
<li><span>Discipline</span> <span class="text-info">Mechanical </span></li>
<li><span>Commodity</span> <span class="text-info">Pipe</span></li>
<li><span>Sub commodity</span> <span class="text-info">12 In Pipe </span></li>
<li><span>UOM</span> <span class="text-info">EA</span></li>
<li><span>Available quantity</span> <span class="text-info">30</span></li>
<li><span>Age</span> <span class="text-info">8</span></li>
</ul>
I used this line to select the spans inside each li (list item), so that I can identify the header of each part
Set post = html.querySelectorAll(".list-specification li span")
Then I used loops like that
For j = 0 To post.Length - 1
If post.Item(j).innerText = "ID" Then
Debug.Print post.Item(j).NextSibling.innerText
End If
Next j
I got an error when trying to use NextSibling, and I'm stuck on it. Can you guide me?
For example, ID is the first header in the list, and I would like to get its value with my approach.
I also got an error when trying nextElementSibling:
Sub Test()
Dim html As New HTMLDocument, post As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "Get", "C:\Sample.html", False
.send
html.body.innerHTML = .responseText
End With
Set post = html.querySelectorAll(".list-specification li span")
For i = 0 To post.Length - 1
If post.Item(i).innerText = "ID" Then
MsgBox post.Item(i).nextElementSibling.innerText: Exit For
End If
Next i
End Sub
Try doing another NextSibling and then you should find it working:
Set post = Html.querySelectorAll(".list-specification li span")
For j = 0 To post.Length - 1
If post.Item(j).innerText = "ID" Then
MsgBox post.Item(j).NextSibling.NextSibling.innerText
Exit For
End If
Next j
I was expecting the correct property to be nextElementSibling, but it seems VBA does not implement this.
The NonDocumentTypeChildNode.nextElementSibling read-only property
returns the element immediately following the specified one in its
parent's children list, or null if the specified element is the last
one in the list.
You can however, more correctly, simply take the next index in post i.e. post.item(1). You are collecting both headers and values in the same nodeList so you can use odd/even distinction to separate headers from values.
You can see this if you run the following in console:
post = document.querySelectorAll(".list-specification li span");
var res = ''; for (let [i] of Object.entries(post)) {res += post.item(`${i}`).innerText + ' '};console.log(res);
Spans are inline containers and you can see from html that you have a space between spans which is part of the parent li and this becomes a child text node. This is why your nextSibling hits a text node and errors with the attempt at .innerText accessor. You would want a text node property such as .nodeValue (if you were at the right node).
You can step through in the console and see the different properties in action.
As nextElementSibling is not implemented in VBA you would need to chain nextSibling, as per @Sim's answer, if you want to explore nextSibling to solve this particular navigation. However, note that a test of nodeType would avoid throwing an error as you could then apply the appropriate accessor.
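The nodeType test mentioned above can be illustrated with Python's xml.dom.minidom, which exposes the same DOM node types. This is a sketch under assumptions: the markup is one li from the question reduced to well-formed XML, and next_element_sibling is a hypothetical helper name. The same loop shape could be written in VBA against nextSibling and nodeType.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<ul><li><span>ID</span> <span class="text-info">22928</span></li></ul>'
)

def next_element_sibling(node):
    """Walk nextSibling until an element node is found; None if there is none."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != Node.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling

header = doc.getElementsByTagName("span")[0]

# The immediate sibling is the whitespace between the two spans: a text node.
first_sibling = header.nextSibling
is_text = first_sibling.nodeType == Node.TEXT_NODE

# Skipping non-element nodes lands on the value span.
value = next_element_sibling(header)
value_text = value.firstChild.nodeValue

print(is_text, repr(first_sibling.nodeValue), value_text)
```

Checking the node type first means the code never asks a text node for an element-only property.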

Compare index of 2 elements in a collection

Issue : I have some issues figuring out a way to select elements in my HTMLDocument which are under a certain point in the page.
In the following code sample, as you can see in the comments, I first select a part of them which respect my queryselector criteria
IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")
In this example I have 10 elements in this collection. Each of these elements is contained in a table which is its ancestor seven levels up.
MsgBox TypeName(IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")(2).ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode) ' HTMLTable
Some of those elements are in the same table.
You can see here the form which contains all the tables.
Now, the thing is that I want to select the innerHTML of only some of those elements, not all of them. The criterion for whether one of those 10 elements interests me is its position on the web page: I want all the elements which are below the message Part Usage. There is only one table containing the Part Usage text, so my idea was to check whether the table containing each element has a higher or lower index in the "form" collection.
If the index is higher I want this element, otherwise I discard it.
What I did for this is the following code :
1. I set the class name Bim on all the tables containing one or more of the 10 elements.
For Each Element In IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") ' here for all of the 10 numbers found with the queryselectorall we'll find their respective table in the collection (form) and set its class name to "Bim". But since some of the numbers are in the same table, we won't have 10 tables with the class name "Bim" at the end of the process. We'll have only x tables with the class name "Bim"
Element.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.ParentNode.className = "Bim"
Next
2. I set the ID Stop on the table containing the text Part Usage.
For Each Element In IEDoc.getElementsByClassName("SectionHead")
If Element.innerHTML = "Part Usage" Then
'MsgBox TypeName(Element.ParentNode.ParentNode.ParentNode)' HTMLTable
Element.ParentNode.ParentNode.ParentNode.ID = "Stop"
End If
Next
3. I check which tables with the class name Bim are under (= have a higher index than) the table with the ID Stop. For the table matching this criterion (there is actually only one), I apply IEDoc.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") inside it so that I get all the elements it contains, and more particularly their innerHTML.
For Each Element In IEDoc.getElementsByClassName("Bim") ' Here we check all the x tables which have the Classname "Bim"
If Element.indexInTheWholeForm > IEDoc.getElementById("Stop").indexInTheWholeForm Then ' and compare somehow whether their index in the (form) collection is higher than that of the table with the ID "Stop" (this is similar to checking whether the element is lower on the web page in this case; we only want the elements which have a higher index, i.e. those under the Part Usage table)
For Each Element2 In Element.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") ' Now we are in the table which contains the part numbers and we'll look for all the part numbers it contains by applying the queryselectorall again, but this time only in this specific table
array_parts2(iteration2) = Element.querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']")(iteration2).innerHTML
ActiveWorkbook.Worksheets(1).Cells(iteration2 + 1, 19) = array_parts2(iteration2)
iteration2 = iteration2 + 1
Next
End If
Next
Of course, what doesn't work is the indexInTheWholeForm property, which doesn't exist. Any ideas on how to do this?
Thanks for reading this far :)
Untested but I would do something like this (assuming I understood you correctly)
Sub Tester()
Const S_MATCH As String = "td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']"
Dim e, tbl, bHit As Boolean
'...
'load page etc
'...
'get all the matching rows and cycle through them
For Each e In IEDoc.querySelectorAll(S_MATCH)
'did we get to the table of interest yet?
If Not bHit Then
Set tbl = e.ParentNode.ParentNode.ParentNode.ParentNode. _
ParentNode.ParentNode.ParentNode
If IsPartUsageTable(tbl) Then bHit = True
End If
If bHit Then
'we reached the table of interest, so
' do something with e
End If
Next
End Sub
Function IsPartUsageTable(tbl) As Boolean
Dim e, rv As Boolean
For Each e In tbl.getElementsByClassName("SectionHead")
If e.innerHTML = "Part Usage" Then
rv = True
Exit For
End If
Next
IsPartUsageTable = rv
End Function
OK, as unexpected as it sounds, I think I found a solution to my own question. I will confirm that it works as soon as I can run it with my colleague.
I keep points 1 and 2 from my initial post and replace point 3 with the following:
For i = 0 To IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table").length - 1
If IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).ID = "Stop" Then
index_Part_Usage = i
Position_Part_Usage = index_Part_Usage + 1
Exit For
End If
Next
'MsgBox Position_Part_Usage
For i = 0 To IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table").length - 1
If IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).className = "Bim" Then
index = i
Position = index + 1
If index > index_Part_Usage Then
For Each Element2 In IEDoc.getElementsByTagName("form")(0).getElementsByTagName("table")(i).querySelectorAll("td[width='100'][class='ListMainCent'][rowSpan='1'][colSpan='1']") ' Now we are in the table which contains the part numbers and we'll look for all the part numbers it contains by applying the queryselectorall again, but this time only in this specific table
array_parts2(iteration2) = Element2.innerHTML
ActiveWorkbook.Worksheets(1).Cells(iteration2 + 1, 19) = array_parts2(iteration2)
iteration2 = iteration2 + 1
Next
End If
End If
Next i
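The idea behind this index comparison (position in a document-order collection stands in for "further down the page") can be sketched in Python with ElementTree. The form below is invented and much simpler than the real page; only the id Stop and the class name Bim are taken from the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical form: one marked table before the "Stop" table, one after it.
form = ET.fromstring("""
<form>
  <table class="Bim"><tr><td>A1</td></tr></table>
  <table id="Stop"><tr><td>Part Usage</td></tr></table>
  <table class="Bim"><tr><td>B1</td><td>B2</td></tr></table>
</form>
""")

# iter() yields elements in document order, so list positions act as indexes.
tables = list(form.iter("table"))
stop_index = next(i for i, t in enumerate(tables) if t.get("id") == "Stop")

# Keep only cells from "Bim" tables that come after the "Stop" table.
wanted = [
    td.text
    for i, t in enumerate(tables)
    if i > stop_index and t.get("class") == "Bim"
    for td in t.iter("td")
]

print(stop_index, wanted)
```

Because the comparison uses the position in one shared collection, no custom index property on the elements is needed.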

Loop Through HTML Elements and Nodes

I'm working on an HTML page highlighter project but ran into problems when a search term matches the name of an HTML tag, attribute, or class/ID; e.g. if the search terms are "media OR class OR content" then my find and replace would do this:
<link href="/css/DocHighlighter.css" <span style='background-color:yellow;font-weight:bold;'>media</span>="all" rel="stylesheet" type="text/css">
<div <span style='background-color:yellow;font-weight:bold;'>class</span>="container">
I'm using Lucene for highlighting and my current code (sort of):
InputStreamReader xmlReader = new InputStreamReader(xmlConn.getInputStream(), "UTF-8");
// Declared outside the if block so it is still in scope in the read loop below.
Highlighter hl = null;
if (searchTerms != null && !searchTerms.isEmpty()) {
QueryScorer qryScore = new QueryScorer(qp.parse(searchTerms));
hl = new Highlighter(new SimpleHTMLFormatter(hlStart, hlEnd), qryScore);
}
if (xmlReader != null) {
BufferedReader br = new BufferedReader(xmlReader);
String inputLine;
while ((inputLine = br.readLine()) != null) {
String tmp = inputLine.trim();
StringReader strReader = new StringReader(tmp);
HTMLStripCharFilter htm = new HTMLStripCharFilter(strReader.markSupported() ? strReader : new BufferedReader(strReader));
String tHL = hl.getBestFragment(analyzer, "", htm);
tmp = (tHL == null ? tmp : tHL);
// Append each processed line while tmp is still in scope.
xmlDoc += tmp;
}
br.close();
}
As you can see (if you understand Lucene highlighting) this does an indiscriminate find/replace. Since my document will be HTML and the search terms are dictated by users, there is no way for me to parse on certain elements or tags. Also, since the find/replace basically loops and appends the HTML to a string (the return type of the method), I have to keep all HTML tags and values in place and in order. I've tried using Jsoup to loop through the page, but it handles the HTML tag as one big result. I also tried TagSoup to remove the broken HTML caused by the problem, but it doesn't work correctly. Does anyone know how to loop through the elements and nodes (data values) of HTML?
I've been having the most luck with this
StringBuilder sb = new StringBuilder();
sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>");
Document doc = Jsoup.parse(txt.getResult());
Elements elements = doc.getAllElements();
for (Element e : elements) {
if (!(e.tagName().equalsIgnoreCase("#root"))) {
sb.append("<" + e.tagName() + e.attributes() + ">" + e.ownText() + "\n");
}// end if
}// end for
return sb;
The one snag I still get is the nesting isn't always "repaired" properly but still semi close. I'm working more on this.
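For what it's worth, the underlying idea (touch only text nodes, copy tags and attributes through unchanged) can be sketched with Python's standard html.parser. This is not Lucene or Jsoup; TextOnlyHighlighter and highlight are hypothetical names, the mark tags are an assumed highlight wrapper, and the sketch ignores comments, doctypes and script/style content.

```python
import html
import re
from html.parser import HTMLParser

class TextOnlyHighlighter(HTMLParser):
    """Re-emit HTML unchanged, wrapping search terms only inside text nodes."""

    def __init__(self, terms, start="<mark>", end="</mark>"):
        super().__init__(convert_charrefs=True)
        # Assumes a non-empty list of literal search terms.
        self.pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
        self.start, self.end = start, end
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the tag exactly as written, attributes included.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text content is searched; each piece is re-escaped on output.
        pos = 0
        for m in self.pattern.finditer(data):
            self.out.append(html.escape(data[pos:m.start()], quote=False))
            self.out.append(self.start + html.escape(m.group(0), quote=False) + self.end)
            pos = m.end()
        self.out.append(html.escape(data[pos:], quote=False))

def highlight(markup, terms):
    parser = TextOnlyHighlighter(terms)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

result = highlight('<div class="container">media in class</div>', ["media", "class"])
print(result)
```

The class attribute is left alone while the words "media" and "class" in the text are wrapped, which is the behaviour the question is after.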

Inserting line-break into XML so that it appears after XSL rendering in VB.NET

I have a System.Xml.XmlDocument object which is rendered onto a web page by using XSL. I want to insert a line break inside certain nodes in the XML object, so that when the XML is rendered using XSLT there is an actual line break there. My code to do this looks like this:
Dim parentNodes As System.Xml.XmlNodeList = objOutput.SelectNodes("//PARENT")
Dim currentParentValue As String = String.Empty
Dim resultParent As String = String.Empty
For Each par As System.Xml.XmlNode In parentNodes
currentParentValue = par.InnerText
Dim parArray As String() = currentParentValue.Split(";")
If parArray.Length > 2 Then
resultParent = String.Empty
Dim parCounter As Integer = 0
For Each Parent As String In parArray
parCounter = parCounter + 1
resultParent = resultParent + Parent + "; "
If (parCounter Mod 2) = 0 Then
resultParent = resultParent + "&#10;"
End If
Next
End If
par.InnerText = resultParent
Next
And in XSL:
<td width="50%" nowrap="nowrap">
<xsl:value-of select="STUDENT_DETAILS/PARENT"/>
</td>
However, it looks like xmlDocument is automatically escaping the next line character, so it just appears as text on the page, can anyone tell how to fix this?
If you change
<td width="50%" nowrap="nowrap">
<xsl:value-of select="STUDENT_DETAILS/PARENT"/>
</td>
to
<td width="50%" nowrap="nowrap">
<pre>
<xsl:value-of select="STUDENT_DETAILS/PARENT"/>
</pre>
</td>
the browser will render line breaks.
You can simply append "<br />" after your nodes; that will insert a line break between your two nodes.
Your problem revolves around this line:
resultParent = resultParent + "&#10;"
Now, you are probably trying to output your XML like this:
<PARENT>George Aaron&#10;Susan Lee Aaron&#10;Richard Elliot Aaron&#10;</PARENT>
However, this escaped &#10; entity is only relevant if the document has yet to be parsed. If it were a text document that subsequently gets read and parsed into an XML document, then the entities would be handled as expected. But you are working with an XML document that has already been parsed. Therefore, when you do resultParent = resultParent + "&#10;" it is actually going to insert a string of five characters into an existing text node, and because & is a special character, it gets escaped.
Now, what you can simply do is this...
resultParent = resultParent + chr(10)
But ultimately this will prove fruitless because HTML doesn't recognise line-break characters, so you would have to write your XSLT to replace the line break with a <br /> element.
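The escaping behaviour is not specific to .NET. As an illustration only, here is the same effect in Python's ElementTree, assuming the text being inserted is a character reference such as &#10; versus the line feed character itself; the names are made up for the example.

```python
import xml.etree.ElementTree as ET

parent = ET.Element("PARENT")

# Assigning the reference as text: the ampersand is escaped on output.
parent.text = "George Aaron;&#10;Susan Lee Aaron"
as_reference = ET.tostring(parent, encoding="unicode")

# Assigning the character itself: the output contains a real line feed.
parent.text = "George Aaron;\nSusan Lee Aaron"
as_character = ET.tostring(parent, encoding="unicode")

print(as_reference)
print(as_character)
```

In both libraries, text put into an already parsed document is treated as literal characters, so markup-like sequences are escaped when the document is serialized.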
If you wanted to do this in your VB code though, you could create new br elements yourself, and insert them
For Each par As System.Xml.XmlNode In parentNodes
currentParentValue = par.InnerText
par.InnerText = String.Empty
Dim parArray As String() = currentParentValue.Split(";")
For Each Parent As String In parArray
If Parent.Length > 0 Then
Dim person As XmlText = objOutput.CreateTextNode(Parent)
par.AppendChild(person)
par.AppendChild(objOutput.CreateElement("br"))
End If
Next
Next
So, this takes the PARENT node, clears it down, then adds a text node, and new br element for each parent. The output would then be like so, which would be much easier to output as HTML using XSLT
<PARENT>George Aaron<br />Susan Lee Aaron<br />Richard Elliot Aaron<br /></PARENT>
(It shouldn't be too hard to add the br after every second parent if required).
However, it may not necessarily be a good idea to put "presentational" information in an XML file. Suppose you later had to transform the XML into a different format? An alternative approach would be to separate each parent into its own element.
For Each par As System.Xml.XmlNode In parentNodes
currentParentValue = par.InnerText
par.InnerText = String.Empty
Dim parArray As String() = currentParentValue.Split(";")
For Each Parent As String In parArray
If Parent.Length > 0 Then
Dim person As XmlElement = objOutput.CreateElement("PERSON")
person.InnerText = Parent.Trim()
par.AppendChild(person)
End If
Next
Next
This would output something like this..
<PARENT>
<PERSON>George Aaron</PERSON>
<PERSON>Susan Lee Aaron</PERSON>
<PERSON>Richard Elliot Aaron</PERSON>
<PERSON>Albert Smith</PERSON>
</PARENT>
Displaying this as HTML would also be straight-forward
Hint: To display in groups of two, your XSLT may look something like this....
<xsl:for-each select="PERSON[position() mod 2 = 1]">
<xsl:value-of select="." />;
<xsl:value-of select="following-sibling::PERSON[1]" />
<br />
</xsl:for-each>

replace keyword within html string

I am looking for a way to replace keywords within a html string with a variable. At the moment i am using the following example.
returnString = Replace(message, "[CustomerName]", customerName, CompareMethod.Text)
The above will work fine if the html block is spread fully across the keyword.
eg.
<b>[CustomerName]</b>
However if the formatting of the keyword is split throughout the word, the string is not found and thus not replaced.
e.g.
<b>[Customer</b>Name]
The formatting of the string is out of my control and isn't foolproof. With this in mind what is the best approach to find a keyword within a html string?
Try using a regular expression. You can create your expressions here; I used this and it works well.
http://regex-test.com/validate/javascript/js_match
Use the textContent property instead of innerHTML if you're using JavaScript to access the content. That removes all tags from the content, giving you back a clean text representation of the customer's name.
For example, if the content looks like this:
<div id="name">
<b>[Customer</b>Name]
</div>
Then accessing its textContent property gives:
var name = document.getElementById("name").textContent.trim();
// sets name to "[CustomerName]" without the tags
which should be easy to process. Do a regex search now if you need to.
Edit: Since you're doing this processing on the server side, process the XML recursively and collect the text elements of each node. Since I'm not big on VB.Net, here's some pseudocode:
getNodeText(node) {
text = ""
for each node.children as child {
if child.type == TextNode {
text += child.text
}
else {
text += getNodeText(child);
}
}
return text
}
myXml = xml.load(<html>);
print getNodeText(myXml);
And then replace or whatever there is to be done!
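As a runnable counterpart to the pseudocode, here is a short Python sketch using ElementTree. It assumes the fragment is well-formed XML, and the replacement name is invented. Note that it produces plain text only, so the formatting tags are lost in the result.

```python
import xml.etree.ElementTree as ET

# The split keyword from the question, wrapped in a container element.
fragment = ET.fromstring('<div id="name"><b>[Customer</b>Name]</div>')

# itertext() walks the element and its descendants, yielding text in document order.
plain = "".join(fragment.itertext())

# With the tags out of the way, an ordinary string replace finds the keyword.
replaced = plain.replace("[CustomerName]", "Jane Doe")

print(plain, replaced)
```

If the markup has to be preserved, the replacement would need to be written back into the individual text nodes instead of the joined string.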
I have found what I believe is a solution to this issue. Well in my scenario it is working.
The html input has been tweaked to place each custom field or keyword within a div with a set id. I have looped through all of the elements within the html string using mshtml and have set the inner text to the correct value when a match is found.
e.g.
Function ReplaceDetails(ByVal message As String, ByVal customerName As String) As String
Dim returnString As String = String.Empty
Dim doc As IHTMLDocument2 = New HTMLDocument
doc.write(message)
doc.close()
For Each el As IHTMLElement In doc.body.all
If (el.id = "Date") Then
el.innerText = Now.ToShortDateString
End If
If (el.id = "CustomerName") Then
el.innerText = customerName
End If
Next
returnString = doc.body.innerHTML
Return returnString
End Function
Thanks for all of the input. I'm glad to have a solution to the problem.