HTMLAgilityPack getting <P> and <STRONG> text


Hey all, I am looking for a way to get the <P> and <STRONG> text out of this HTML code:
<DIV class=schedule_block>
<DIV class=channel_row><SPAN class=channel>
<DIV class=logo><IMG src='/images/channel_logos/WGNAMER.png'></DIV>
<P><STRONG>2</STRONG><BR>WGNAMER </P></SPAN>
using the HtmlAgilityPack.
I have been trying this:
For Each channel In doc.DocumentNode.SelectNodes(".//div[@class='channel_row']")
Dim info = New Dictionary(Of String, Object)()
With channel
info!Logo = .SelectSingleNode(".//img").Attributes("src").Value
info!Channel = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(0).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(2).InnerText
End With
.......
I can get the Logo but it comes up with a blank string for the Channel and for the Station it says
Index was out of range. Must be non-negative and less than the size of
the collection.
I've tried all types of combinations:
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(1).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(1).ChildNodes(3).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(1).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(2).InnerText
info!Station = .SelectSingleNode(".//span[@class='channel']").ChildNodes(0).ChildNodes(3).InnerText
What do I need to do in order to correct this?

If the whitespace is actually there, it counts as a child node. So:
Dim channelSpan = .SelectSingleNode(".//span[@class='channel']")
info!Channel = channelSpan.ChildNodes(3).ChildNodes(0).InnerText
info!Station = channelSpan.ChildNodes(3).ChildNodes(2).InnerText
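To see why the index is 3 rather than 1, it can help to list the child nodes one by one. The sketch below is in Python with the standard-library xml.dom.minidom, not HtmlAgilityPack, and it runs on a hand-made, well-formed copy of the markup (attributes quoted, <img/> and <br/> closed); those adjustments are assumptions made only so that an XML parser accepts the fragment. The DOM rule it illustrates is the same: line breaks between tags become text nodes and take up index positions.

```python
from xml.dom import minidom

# Well-formed stand-in for the question's markup (assumption: attributes are
# quoted and void tags are closed so that an XML parser accepts it).
MARKUP = (
    '<span class="channel">\n'
    '<div class="logo"><img src="/images/channel_logos/WGNAMER.png"/></div>\n'
    '<p><strong>2</strong><br/>WGNAMER </p></span>'
)

span = minidom.parseString(MARKUP).documentElement


def describe(node):
    """Return the tag name for an element, or the repr of the text for a text node."""
    return node.tagName if node.nodeType == node.ELEMENT_NODE else repr(node.data)


span_children = [describe(child) for child in span.childNodes]

# Index 3 is the <p>, because two whitespace text nodes sit before and after the <div>.
paragraph = span.childNodes[3]
channel = paragraph.childNodes[0].firstChild.data  # text inside <strong>
station = paragraph.childNodes[2].data.strip()     # text that follows <br/>

for index, name in enumerate(span_children):
    print(index, name)
print(channel, station)
```

Indexing by position stays fragile if the page's whitespace ever changes; selecting the <p> and <strong> elements by name is usually sturdier.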

Related

VBA MS Word content controls messed order

I have a table with about 15 content controls. The content controls have different titles.
Now, I copy-paste the table with content controls a couple of times, and later, get different values into every single content control from the database. Since the content controls from different tables share the same name, I thought of looping through number of tables using something like this
seqNo = 1
For Each t in MyTables
ActiveDocument.SelectContentControlsByTitle("title1").Item(seqNo).Range.Text = "some value 1 from DB"
ActiveDocument.SelectContentControlsByTitle("title2").Item(seqNo).Range.Text = "some value 2 from DB"
' and so on
seqNo = seqNo + 1
Next
The problem is when I use this code, my content controls don't get filled in sequentially. I mean, for example, content control with title title1 from table1 isn't filled with its value, instead, content control with title title1 from table4 gets that value. And this mess goes around really bad: values from table 2 can end up in table 4, 9, 10 and so forth.
I think the order of content controls gets messed up somehow when I copy-paste the tables.
Any clue how to get it right?

Didn't really find why this happens, but went with giving unique names to the content controls, like title1, title2, and so on, and then looping through all of them to set the needed values.

Oh my god yes... I have stumbled upon the same annoying issue too. My workaround has been to change the titles in code after the copy, then paste and change the titles of the pasted copy too (see below). Now my issue is that this takes WAY too long to run, since I'm filling out many of these templates in my code. I'm currently at a loss as to how to speed this process up, or what different approach I should be using.
objWord.ActiveDocument.Range(start:=objWord.ActiveDocument.Tables(3).Range.Rows(1).Range.start, End:=objWord.ActiveDocument.Tables(3).Range.Rows(5).Range.End).Copy
objWord.Selection.EndKey Unit:=wdStory
objDoc.SelectContentControlsByTitle("Date").Item(1).Title = "Date1"
objDoc.SelectContentControlsByTitle("StartTime").Item(1).Title = "StartTime1"
objDoc.SelectContentControlsByTitle("EndTime").Item(1).Title = "EndTime1"
objDoc.SelectContentControlsByTitle("Mins").Item(1).Title = "Mins1"
objDoc.SelectContentControlsByTitle("Note").Item(1).Title = "Note1"
objDoc.SelectContentControlsByTitle("Grp").Item(1).Title = "Grp1"
objDoc.SelectContentControlsByTitle("acc1").Item(1).Title = "acc1_1"
objDoc.SelectContentControlsByTitle("acc2").Item(1).Title = "acc2_1"
objDoc.SelectContentControlsByTitle("acc3").Item(1).Title = "acc3_1"
objDoc.SelectContentControlsByTitle("acc4").Item(1).Title = "acc4_1"
objDoc.SelectContentControlsByTitle("acc5").Item(1).Title = "acc5_1"
objDoc.SelectContentControlsByTitle("acc6").Item(1).Title = "acc6_1"
objDoc.SelectContentControlsByTitle("acc7").Item(1).Title = "acc7_1"
objDoc.SelectContentControlsByTitle("acc8").Item(1).Title = "acc8_1"
For j = 2 To UBound(Narray)
objWord.Selection.Paste
objDoc.SelectContentControlsByTitle("Date").Item(1).Title = "Date" & j
objDoc.SelectContentControlsByTitle("StartTime").Item(1).Title = "StartTime" & j
objDoc.SelectContentControlsByTitle("EndTime").Item(1).Title = "EndTime" & j
objDoc.SelectContentControlsByTitle("Mins").Item(1).Title = "Mins" & j
objDoc.SelectContentControlsByTitle("Note").Item(1).Title = "Note" & j
objDoc.SelectContentControlsByTitle("Grp").Item(1).Title = "Grp" & j
objDoc.SelectContentControlsByTitle("acc1").Item(1).Title = "acc1_" & j
objDoc.SelectContentControlsByTitle("acc2").Item(1).Title = "acc2_" & j
objDoc.SelectContentControlsByTitle("acc3").Item(1).Title = "acc3_" & j
objDoc.SelectContentControlsByTitle("acc4").Item(1).Title = "acc4_" & j
objDoc.SelectContentControlsByTitle("acc5").Item(1).Title = "acc5_" & j
objDoc.SelectContentControlsByTitle("acc6").Item(1).Title = "acc6_" & j
objDoc.SelectContentControlsByTitle("acc7").Item(1).Title = "acc7_" & j
objDoc.SelectContentControlsByTitle("acc8").Item(1).Title = "acc8_" & j
Next

Reading HTML page using Libreoffice Basic

I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that will read the name of a noble House of Westeros from a cell (e.g. Stark), and output the Words of that House by looking it up on the relevant page on A Wiki of Ice and Fire. It should work like this:
Here is the pseudocode:
Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox-body"" // Finds the info box for the page.
Read Each Row in the table until Row begins Words
Read the contents of the next <td> tag, and return this as a string.
My problem is with the second line, I don't know how to read a HTML file. How should I do this in LibreOffice Basic?

There are two main issues with this.
1. Performance
Your UDF will need to fetch the HTTP resource for every cell in which it is used.
2. HTML
Unfortunately there is no HTML parser in OpenOffice or LibreOffice. There is only an XML parser. That's why we cannot parse HTML directly with the UDF.
This will work, but it is slow and not very universal:
Public Function FETCHHOUSE(sHouse as String) as String
sURL = "http://awoiaf.westeros.org/index.php/House_" & sHouse
oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
on error goto falseHouseName
oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
on error goto 0
dim delimiters() as long
sContent = oInpDataStream.readString(delimiters(), false)
lStartPos = instr(1, sContent, "<table class=" & chr(34) & "infobox infobox-body" )
if lStartPos = 0 then
FETCHHOUSE = "no infobox on page"
exit function
end if
lEndPos = instr(lStartPos, sContent, "</table>")
sTable = mid(sContent, lStartPos, lEndPos-lStartPos + 8)
lStartPos = instr(1, sTable, "Words" )
if lStartPos = 0 then
FETCHHOUSE = "no Words on page"
exit function
end if
lEndPos = instr(lStartPos, sTable, "</tr>")
sRow = mid(sTable, lStartPos, lEndPos-lStartPos + 5)
oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
oOptions.searchString = "<td[^<]*>"
oTextSearch.setOptions(oOptions)
oFound = oTextSearch.searchForward(sRow, 0, Len(sRow))
If oFound.subRegExpressions = 0 then
FETCHHOUSE = "Words header but no Words content on page"
exit function
end if
lStartPos = oFound.endOffset(0) + 1
lEndPos = instr(lStartPos, sRow, "</td>")
sWords = mid(sRow, lStartPos, lEndPos-lStartPos)
FETCHHOUSE = sWords
exit function
falseHouseName:
FETCHHOUSE = "House name does not exist"
End Function
A better way would be to get the needed information from a Web API offered by the Wiki. Do you know the people behind the Wiki? If so, you could suggest this to them.
Greetings
Axel

CDO.Message w/ multiple address@school.edu.au won't send

So here is a fun one. Messages won't be sent for certain situations. Here's an instance that I found. Note that it doesn't produce an error.
Set objMail = server.CreateObject("CDO.Message")
Set obj_conf = server.CreateObject("CDO.Configuration")
Set obj_fields = obj_conf.Fields
obj_fields("http://schemas.microsoft.com/cdo/configuration/sendusing") = 2
obj_fields("http://schemas.microsoft.com/cdo/configuration/smtpserver") = "smtp.school.edu.au"
obj_fields("http://schemas.microsoft.com/cdo/configuration/smtpauthenticate") = "0"
obj_fields("http://schemas.microsoft.com/cdo/configuration/smtpserverport") = "25"
obj_fields.Update
Set objMail.Configuration = obj_conf
objMail.From = """Person 1"" <Person1@school.edu.au>;"
objMail.ReplyTo = """Person 1"" <Person1@school.edu.au>;"
objMail.Subject = "school COMM TEST"
objMail.TextBody = "Comm testing"
objMail.To = """Person 2"" <Person2@school.edu.au>;"
objMail.cc = """My Name"" <myEmail@soemthing.com>;"
objMail.bcc = """Person 2"" <Person2@school.edu.au>;"
objMail.AddAttachment "..\Attachment.htm"
objMail.send
Set objMail = Nothing
Set obj_conf = Nothing
Set obj_fields = Nothing
If I were to not add a ReplyTo or Attachment then it sends fine. There may be a few other combinations to get it to work. But why won't the current settings work, and why will it work without an attachment or without a ReplyTo? Thanks for any input!

After looking at their attachment I noticed they had a line with: <br><br>
I replaced them, along with the other tags, with <br /> and it now works. Very interesting, strange and convoluted issue. But glad it's over!

Matlab text string/html parse

I am trying to get information from a website (html) into MATLAB. I am able to get the html from online into a string using:
urlread('http://www.websiteNameHere.com...');
Once I have the string I have a very LONG string variable, containing the entire html file contents. From this variable, I am looking for the value/characters in very specific classes. For example, the html/website will have a bunch of lines, and then will have the classes of interest in the following form:
...
<h4 class="price">
<span class="priceSort">$39,991</span>
</h4>
<div class="mileage">
<span class="milesSort">19,570 mi.</span>
</div>
...
<h4 class="price">
<span class="priceSort">$49,999</span>
</h4>
<div class="mileage">
<span class="milesSort">9,000 mi.</span>
</div>
...
I need to be able to get the information between <span class="priceSort"> and </span>; i.e., $39,991 and $49,999 in the above example. What is the best way to go about this? If the start and end tags were unique to each field (such as <price> and </price>), I would have no problem...
I also need to know the most robust method, since I would like to be able to find <span class="milesSort"> and other information of this sort too. Thanks!

Try this and let us know if it works for you -
url_data = urlread('http://www.websiteNameHere.com...');
start_string = '<span class="priceSort">'; %// For your next case, edit this to <span class="milesSort">
stop_string = '</span>';
N1 = numel(start_string);
N2 = numel(stop_string);
start_string_ind = strfind(url_data,start_string);
for count1 = 1:numel(start_string_ind)
relative_stop_string_ind = strfind(url_data(start_string_ind(count1)+N1:end),stop_string);
string_found_start_ind = start_string_ind(count1)+N1;
string_found = url_data(string_found_start_ind:string_found_start_ind+relative_stop_string_ind(1)-2);
disp(string_found);
end

Simple solution using strsplit
s = urlread('http://www.websiteNameHere.com...');
x = 'class="priceSort">'; %starting string x
y = 'class="milesSort">'; %starting string y
z = '</span>'; %ending string z
s2 = strsplit(s,x); %split for starting string x
s3 = strsplit(s,y); %split for starting string y
result1 = cell(size(s2,2)-1,1); %create cell array 1
result2 = cell(size(s3,2)-1,1); %create cell array 2
%loop through values ignoring first value
%(change ind=2:size(s2,2) to ind=1:size(s2,2) to see why)
%starting string x loop
for ind=2:size(s2,2)
m = strsplit(s2{1,ind},z);
result1{ind-1} = m{1,1};
end
%starting string y loop
for ind=2:size(s3,2)
m = strsplit(s3{1,ind},z);
result2{ind-1} = m{1,1};
end
Hope this helps
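For comparison, the same start-marker/end-marker idea can be written as a single regular expression. This is a Python sketch rather than MATLAB, run on the sample HTML from the question; the helper name extract_by_class is made up for illustration, and the pattern assumes the class attribute is written exactly as shown (double quotes, no other attributes on the span).

```python
import re

# Sample taken from the question's HTML excerpt.
HTML = """
<h4 class="price">
<span class="priceSort">$39,991</span>
</h4>
<div class="mileage">
<span class="milesSort">19,570 mi.</span>
</div>
<h4 class="price">
<span class="priceSort">$49,999</span>
</h4>
<div class="mileage">
<span class="milesSort">9,000 mi.</span>
</div>
"""


def extract_by_class(html, class_name):
    """Return the text of every <span class="class_name">...</span> in html."""
    # Non-greedy (.*?) stops at the first closing </span> after each opening tag.
    pattern = r'<span class="' + re.escape(class_name) + r'">(.*?)</span>'
    return re.findall(pattern, html, flags=re.DOTALL)


prices = extract_by_class(HTML, "priceSort")
miles = extract_by_class(HTML, "milesSort")
print(prices)
print(miles)
```

Passing the class name as a parameter covers the milesSort case without copying the loop. Like the string-splitting answers, it breaks if the site changes its markup.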

Retrieve attributes and span using HTMLAgilityPack library

In this piece of HTML code:
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
<img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
</div>
<div class="release">
<h3>Wolf Eyes</h3>
<h4>
Lower Demos
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
Rock
Pop
</div>
</div>
I know how to parse it in other ways, but I would like to retrieve this Info using HTMLAgilityPack library:
Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year : 2013
Genres: Rock, Pop
URL : http://www.mp3crank.com/wolf-eyes/lower-demos-121866
Which are these html lines:
Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year : <span>2013</span>
Genre1: Rock
Genre2: Pop
URL : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"
This is what I'm trying, but I always get an object reference not set exception when trying to select a single node,
Sorry but I'm very newbie with HTML, I've tried to follow the steps of this question HtmlAgilityPack basic how to get title and link?
Public Class Form1
Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing
Private Title As String = String.Empty
Private Cover As String = String.Empty
Private Genres As String() = {String.Empty}
Private Year As Integer = -0
Private URL as String = String.Empty
Private Sub Test() Handles MyBase.Shown
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop trough the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
Genres = ¿select multiple nodes?
URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value
Next
End Sub
End Class

Your mistake here is trying to access an attribute of a child node through the node you've found.
When you call node.SelectSingleNode("//div[@class='release']") you get the correct div returned, but calling .Attributes returns just the attributes of the div tag itself, not those of any inner HTML elements.
It's possible to write XPATH queries that select the sub-node, e.g. //div[@class='release']/a - see http://www.w3schools.com/xpath/xpath_syntax.asp for more information on XPATH. Although the examples are for XML, most of the principles should apply to an HTML document.
Another approach is to use further XPATH calls on the node you've found. I've amended your code to make it work using this approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
' Assumes we find the node and it contains an a tag
Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value
Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value
Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)
Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
Dim genreLinks = genreNode.SelectNodes(".//a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Next
Note that in this code we're assuming the document is correctly formed and that each node/element/attribute exists and is correct. You might want to add a lot of error checking to this, e.g. If someNode Is Nothing Then ....
Edit: I've amended the code above slightly, to ensure each .SelectSingleNode uses the ".//" prefix - this ensures it works if there are several "item" nodes, otherwise it selects the first match from the document not the current node.
If you want a shorter XPATH solution, here is the same code using that approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value
Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)
Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Console.WriteLine()
Next
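The effect of the ".//" prefix described in the edit note can be reproduced on a tiny document. The sketch below uses Python's standard-library xml.etree.ElementTree instead of HtmlAgilityPack, with a made-up two-item sample. ElementTree has no absolute "//" on an element, so the "search the whole document" case is imitated by querying the root inside the loop; that substitution is an assumption of this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample with two items (not the real page).
DOC = ET.fromstring(
    '<body>'
    '<div class="item"><div class="release-year"><span>2013</span></div></div>'
    '<div class="item"><div class="release-year"><span>2011</span></div></div>'
    '</body>'
)

items = DOC.findall(".//div[@class='item']")

# Relative search: each lookup starts at the current item, so every item
# reports its own year.
relative_years = [
    item.find(".//div[@class='release-year']/span").text for item in items
]

# Document-wide search inside the loop: the first match in the whole document
# is returned every time, whichever item is being processed.
root_years = [
    DOC.find(".//div[@class='release-year']/span").text for _ in items
]

print(relative_years)
print(root_years)
```

The second list repeats the first item's year for every item, which is the symptom to look for when a loop keeps returning the same values.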

You were not that far from the solution. Two important notes:
// is a recursive call. It can have some heavy performance impact, and also it may select nodes you don't want, so I suggest you only use it when the hierarchy is deep or complex or variable, and you don't want to specify the whole path.
There is a useful helper method on HtmlNode named GetAttributeValue which will get you an attribute's value even if the attribute does not exist (you need to specify the default value).
Here is a sample that seems to work:
' select the base/parent DIV (here we use a discriminant CLASS attribute)
' all select calls below will use this DIV element as a starting point
Dim node As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//div[@class='item']")
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Title :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("title", CStr(Nothing))))
' get to the IMG tag which is a child or grand child (//) of a 'thumb' DIV
Console.WriteLine(("Cover :" & node.SelectSingleNode("div[@class='thumb']//img").GetAttributeValue("src", CStr(Nothing))))
' get to the SPAN tag which is a child or grand child (//) of a 'release-year' DIV
Console.WriteLine(("Year :" & node.SelectSingleNode("div[@class='release-year']//span").InnerText))
' get all A elements which are child or grand child(//) of a 'genre' DIV
Dim nodes As HtmlNodeCollection = node.SelectNodes("div[@class='genre']//a")
Dim i As Integer
For i = 0 To nodes.Count - 1
Console.WriteLine(String.Concat(New Object() { "Genre", (i + 1), ":", nodes.Item(i).InnerText }))
Next i
' get to the A tag which is a child or grand child (//) of a 'release' DIV
Console.WriteLine(("Url :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("href", CStr(Nothing))))
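The "default value" idea combines well with a check for a missing node, which is the usual source of the object-reference exception from the question. Here is a rough Python sketch of that pattern using the standard-library xml.etree.ElementTree; the helper attribute_or_default and the trimmed-down sample markup are made up for illustration and are not part of HtmlAgilityPack.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed sample (assumption: only the 'thumb' block is present,
# and the <a> tag deliberately has no title attribute).
ITEM = ET.fromstring(
    '<div class="item">'
    '<div class="thumb">'
    '<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"><img/></a>'
    '</div>'
    '</div>'
)


def attribute_or_default(parent, path, name, default):
    """Return attribute `name` of the first node matching `path`, else `default`.

    Covers both failure modes: the node is missing, or the node exists but
    lacks the attribute.
    """
    node = parent.find(path)
    if node is None:
        return default
    return node.get(name, default)


url = attribute_or_default(ITEM, "div[@class='thumb']/a", "href", "(no link)")
title = attribute_or_default(ITEM, "div[@class='thumb']/a", "title", "(no title)")
missing = attribute_or_default(ITEM, "div[@class='release']/a", "href", "(no link)")

print(url)
print(title)
print(missing)
```

In the VB.NET sample above, the equivalent guard would be an Is Nothing check on the result of SelectSingleNode before calling GetAttributeValue.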
