VB.NET: getting content from a div with HtmlAgilityPack (HTML embedded in JSON)

Flow:
1. (OK) I download a JSON document.
2. (OK) I parse a value from the JSON object that contains HTML.
3. (NOT OK) I display the values inside div.countries.
My code:
Dim webClient As New System.Net.WebClient
Dim result As String = webClient.DownloadString("http://example.com/countries.json")
Dim values As JObject = JObject.Parse(result)
Dim finalHTML As String = values.GetValue("countries_html")
Basically, the finalHTML variable looks like this:
<div class="country_name">USA</div>
<div class="country_name">Ireland</div>
<div class="country_name">Australia</div>
I'm stuck and don't know how to move on.
I need to go over every div.country_name and get its inner text. Hope that makes sense.

Since the finalHTML string already contains only the target div elements, you can simply load the string into an HtmlDocument object and use a bit of LINQ to project the divs into a collection of InnerText strings (IEnumerable, List(Of T), or whatever best suits your needs):
....
Dim finalHTML As String = values.GetValue("countries_html")
Dim doc = New HtmlDocument()
doc.LoadHtml(finalHTML)
Dim countries = doc.DocumentNode.Elements("div").Select(Function(o) o.InnerText.Trim())
'print the result as comma separated text to console:
Console.WriteLine(String.Join(",", countries))
Dotnetfiddle Demo
Output:
USA,Ireland,Australia

Here's a nice article on using HtmlAgilityPack (HAP): http://www.mikesdotnetting.com/article/273/using-the-htmlagilitypack-to-parse-html-in-asp-net.

Related

VB.net extract variable value inside <span> tags?

I'm trying to extract a decimal value that may vary from inside HTML using VB.net.
As sort of a test, here is the code I'm using:
Dim result As String = "<td class='fl'><label>Balance:</label></td><td nowrap class='fd'><span>$999,999.99</span></td></tr></table></td>"
Dim RegexResult = Regex.Match(result, "^(\$|)([1-9]\d{0,2}(\,\d{3})*|([1-9]\d*))(\.\d{2})?$")
Console.WriteLine(RegexResult)
FYI, I found that expression here:
In this example, the extracted result should be: $999999.99. This will then be modified to strip the dollar sign.
Regex result, when viewed in the Visual Studio console is {}. How do I modify the expression to account for the <span> tags?
Even if your regex worked now, don't use regex to parse dynamic HTML content.
Use an available HTML parser like HtmlAgilityPack, that's a much more reliable solution:
Dim html = "<td class='fl'><label>Balance:</label></td><td nowrap class='fd'><span>$999,999.99</span></td></tr></table></td>"
Dim doc As New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
Dim td = doc.DocumentNode.SelectSingleNode("//*[contains(@class,'fd')]")
Dim spanText = td.Descendants("span").First().InnerText
Dim balance As Decimal
Dim usCulture = New CultureInfo("en-us")
Dim valid = Decimal.TryParse(spanText, NumberStyles.Currency, usCulture, balance)
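The parsing step at the end of that answer can be tried on its own, without any HTML involved. The following is a minimal sketch, assuming the span text has already been extracted; the module name and the hard-coded sample string are illustrative only. It shows that NumberStyles.Currency lets Decimal.TryParse accept both the dollar sign and the thousands separators, so no manual stripping is needed.

```vbnet
Imports System
Imports System.Globalization

Module BalanceParsingSketch
    Sub Main()
        ' Sample text as it would come out of the <span>; an assumed input.
        Dim spanText As String = "$999,999.99"
        Dim usCulture As New CultureInfo("en-US")
        Dim balance As Decimal

        ' NumberStyles.Currency accepts the currency symbol and thousands separators.
        If Decimal.TryParse(spanText, NumberStyles.Currency, usCulture, balance) Then
            ' Format with the invariant culture: no "$" and no commas in the output.
            Console.WriteLine(balance.ToString("0.00", CultureInfo.InvariantCulture))
        Else
            Console.WriteLine("Could not parse: " & spanText)
        End If
    End Sub
End Module
```

With the sample input this prints 999999.99. Checking the Boolean result of TryParse matters in practice, because the span may be missing or may hold non-numeric text.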

vb.net from string to listbox line by line

I made a web request to get the HTML of a website, and then I extract the
wanted links with HtmlAgilityPack,
like this:
'webrequest'
Dim rt As String = TextBox1.Text
Dim wRequest As WebRequest
Dim WResponse As WebResponse
Dim SR As StreamReader
wRequest = WebRequest.Create(rt)
WResponse = wRequest.GetResponse
SR = New StreamReader(WResponse.GetResponseStream)
rt = SR.ReadToEnd
TextBox2.Text = rt
'htmlagility to extract the links'
Dim htmlDoc1 As New HtmlDocument()
htmlDoc1.LoadHtml(rt)
Dim links = htmlDoc1.DocumentNode.SelectNodes("//*[@id='catlist-listview']/ul/li/a")
Dim hrefs = links.Cast(Of HtmlNode).Select(Function(x) x.GetAttributeValue("href", ""))
'join the `hrefs`, separated by newline, into one string'
textbox3.text = String.Join(Environment.NewLine, hrefs)
The links look like this:
http://wantedlink1
http://wantedlink2
http://wantedlink3
http://wantedlink4
http://wantedlink5
http://wantedlink6
http://wantedlink7
Now I want to add every line in the string to a ListBox instead of a TextBox,
one item per line.
There are about 400 such links.
In your case hrefs is already an IEnumerable(Of String). Joining the items into one string and then splitting that string again just to make it work is a roundabout approach. Since String.Split() returns an array, you probably only need to project hrefs into an array for .AddRange() to work:
ListBox1.Items.AddRange(hrefs.ToArray())
Use the AddRange method of the listbox's items collection and pass it the lines array of the textbox.
AddRange
Lines
Hint: It's one line of code.
It's OK, I found the answer:
Dim linklist = String.Join(Environment.NewLine, hrefs)
Dim parts As String() = linklist.Split(New String() {Environment.NewLine},
StringSplitOptions.None)
ListBox1.Items.AddRange(parts)
This adds all 400 links to the ListBox.

VB.net Fill Textbox with HTML string

I have a string from an HTML enabled email of something like so:
</div><span color="red">Hi how are you?!</span></div><table><tr><td>Company Information</td></tr></table>
and so on (it's a long string, but you get the idea; it contains <div>, <span>, <table> elements and so forth).
I want to display the text in the text box rendered like HTML, that is, laid out according to the <table>, <span> and other elements, with the literal tag text removed from the text box's content.
I need this because I get a page error: the <span> and similar tags are treated as a security issue.
My current way of reading the whole text and removing the offending tags is like so:
If Not DsAds.Tables(0).Rows(0).Item(0) Is DBNull.Value Then
Dim bodyInitial As String = DsAds.Tables(0).Rows(0).Item(0).ToString()
Dim newbody As String = bodyInitial.Replace("<br>", vbNewLine)
newbody = newbody.Replace("<b>", "")
newbody = newbody.Replace("</b>", "")
Bodylisting.Text = newbody
End If
I tried to incorporate the following:
Dim bodyInitial As String = DsAds.Tables(0).Rows(0).Item(0).ToString()
Dim myEncodedString As String
' Encode the string.
myEncodedString = bodyInitial.HttpUtility.HtmlEncode(bodyInitial)
Dim myWriter As New StringWriter()
' Decode the encoded string.
HttpUtility.HtmlDecode(bodyInitial, myWriter)
but I get errors with HttpUtility and strings.
Question:
So my question is, is there a way to actually see the HTML formatting and fill the textbox that way, or do I have to stick with my .Replace method?
<asp:TextBox ID="Bodylisting" runat="server" style="float:left; margin:10px; padding-bottom:500px;" Width="95%" TextMode="MultiLine" ></asp:TextBox>
I suggest you investigate HtmlAgilityPack. This library includes a parser giving you the ability to 'select' the <span> tags. Once you have them, you can strip them out, or grab the InnerHtml and do further processing. This is an example of how I do something similar with it.
Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlDocument()
    htmlDoc.LoadHtml(html)
    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
        If Not invalidNodes Is Nothing Then
            For Each node As HtmlNode In invalidNodes
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
    End If
    Return htmlDoc.DocumentNode.WriteContentTo()
End Function

Html Agility Pack return Null for a Nodecollection

I am trying to parse a PHP page that contains many tables. When I try to select those tables, the collection comes back null. Why?
Dim web As New HtmlAgilityPack.HtmlWeb()
Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load("URL")
Dim html As String
Dim tabletag As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//table")
Dim tableNode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")
If Not tabletag Is Nothing Then
html = tableNode.InnerText
End If
The page certainly has some tables; I just don't understand why the result is null.
I loaded your URL up in a browser and viewed the source and there actually aren't any tables that I could find. Wrong URL?
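Whatever the cause, both SelectNodes and SelectSingleNode return Nothing when the XPath matches no node, so each result needs its own check before it is used. The following is a minimal sketch of that check, assuming the HtmlAgilityPack package is referenced; the module name and the inline sample markup are hypothetical and stand in for the downloaded page.

```vbnet
Imports System
Imports HtmlAgilityPack

Module NullSafeSelectSketch
    Sub Main()
        ' Hypothetical sample markup that contains no <table> element at all.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<html><body><p>No tables here</p></body></html>")

        ' SelectSingleNode returns Nothing when the XPath matches no node.
        Dim tableNode As HtmlNode =
            doc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")

        ' Test the node that is actually dereferenced, not a different variable.
        If tableNode Is Nothing Then
            Console.WriteLine("No matching table in the loaded markup.")
        Else
            Console.WriteLine(tableNode.InnerText)
        End If
    End Sub
End Module
```

If the check reports no match while the browser shows tables, comparing the raw downloaded markup (for example doc.DocumentNode.OuterHtml) with the page source is a reasonable next step, since tables generated by script do not appear in the downloaded HTML.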

VB.NET ~ how does one navigate to a website and download the html then parse out code to only display input elements?

I have tried a few things like converting HTML to XML and then using an XML navigator to get input elements but I get lost whenever I start this process.
What I am trying to do is to navigate to a website which will be loaded using textbox1.text
Then download the HTML and parse out the input elements, such as username, password, etc., and place each element by type (id or name) into the RichTextBox with the attribute beside the name.
Example.
Username id="username"
Password id="password"
Any clues on how to properly implement an HTML-to-XML converter, reader, or parser?
Thanks
It sounds like you just need a good HTML parsing library (instead of trying to use an XML parser). The HTML Agility Pack often fits this need. There are other options as well.
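As a rough illustration of the HTML Agility Pack route, here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. The module name and the inline sample markup are hypothetical; in the real program the string would be the downloaded page source. It selects every input element and prints its id and name attributes.

```vbnet
Imports System
Imports HtmlAgilityPack

Module InputListingSketch
    Sub Main()
        ' Hypothetical sample markup standing in for the downloaded page.
        Dim html As String =
            "<div><input type='text' id='username'>" &
            "<input type='password' id='password' name='pwd'></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when the page has no <input> elements.
        Dim inputs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//input")
        If inputs IsNot Nothing Then
            For Each node As HtmlNode In inputs
                ' GetAttributeValue falls back to the given default when the attribute is absent.
                Dim idValue As String = node.GetAttributeValue("id", "")
                Dim nameValue As String = node.GetAttributeValue("name", "")
                Console.WriteLine("id=""" & idValue & """ name=""" & nameValue & """")
            Next
        End If
    End Sub
End Module
```

Each printed line could instead be appended to the RichTextBox; the console output is used here only to keep the sketch independent of a form.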
Something like the code below uses a StreamReader to read the source of the page into the string result:
Dim uri As String = "https://www.yourUrl.com"
Dim objRequest As HttpWebRequest = CType(WebRequest.Create(uri), HttpWebRequest)
objRequest.Method = "GET"
Dim result As String
'Using blocks dispose the response and the reader even if an exception is thrown
Using objResponse As HttpWebResponse = CType(objRequest.GetResponse(), HttpWebResponse)
    Using sr As New StreamReader(objResponse.GetResponseStream())
        result = sr.ReadToEnd()
    End Using
End Using
Then use a regular expression (regex) to extract the attributes needed, for example something like this:
Dim pattern As String = "(?<=Username id="")\w+"
Dim m0 As MatchCollection = Regex.Matches(result, pattern, RegexOptions.Singleline)
Dim strUserID As String = ""
For Each m As Match In m0
    'extract the value for the username id (the last match wins)
    strUserID = m.Value
Next
You'll need to change the pattern so it picks up the other attributes you want to find, but that shouldn't be difficult.