Get text from <p> using HtmlAgilityPack - html

I am using this code to return the text from an HTML <p> tag:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);
string query = doc.DocumentNode.SelectSingleNode("//p/text()").InnerText;
if (query.Length > 60)
{
    query = query.Substring(0, 60) + "...";
}
The problem is that if the <p> tag contains another tag, this does not return the text. For example:
<p><img src="http://localhost:49171/Images/MyImages/80ef7d03-6a8b-49e2-a4da-fa9f5f1773dd.jpg" alt="" />Thank you for choosing Microsoft Windows 8.1 Pro. This is a license agreement between you and Microsoft Corporation (or, based on where you live, one of its affiliates) </p>
In my code, query returns "Images/MyImages/80ef7d03-6a8b-49e2-a4da-fa9f5f1773dd.jpg".
Can anybody please help me retrieve the line "Thank you for choosing Microsoft Windows 8.1 Pro."
instead of "Images/MyImages/80ef7d03-6a8b-49e2-a4da-fa9f5f1773dd.jpg"?
Thanks in advance...

Since every text node in HtmlAgilityPack has the name #text you can do the following:
string query = doc.DocumentNode.Descendants("p")
.First()
.ChildNodes.First(node => node.Name == "#text").InnerText;
This takes the first <p> node in the document and selects the inner text of the first text node that is a direct child of the <p> node.
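For comparison, here is a minimal Java sketch of the same idea (first direct text child of the first <p>). It uses the JDK's built-in XPath engine as a stand-in for HtmlAgilityPack, so it assumes the markup is well-formed XHTML; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FirstParagraphText {
    // Returns the first direct text child of the first <p>, trimmed.
    // Assumption: the input is well-formed; the JDK XML parser rejects tag soup.
    static String firstTextOfFirstParagraph(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // (//p)[1] is the first <p> in document order;
            // text()[1] is its first direct text-node child.
            return xpath.evaluate("(//p)[1]/text()[1]",
                    new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div><p><img src=\"photo.jpg\" alt=\"\" />"
                + "Thank you for choosing Microsoft Windows 8.1 Pro.</p></div>";
        System.out.println(firstTextOfFirstParagraph(xhtml));
    }
}
```

The <img> element is skipped because text() only matches text nodes, not element children.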

Related

Libgdx: How to show HTML text in a label?

I have a string like this:
"noun<br> an expression of greeting <br>- every morning they exchanged polite hellos<br> <font color=dodgerblue> ••</font> Syn: hullo, hi, howdy, how-do-you-do<be>"
I want to show it in a label as rich text. For example, instead of showing the <br> tags, the text must go to the next line.
In Android we can do that with:
Html.fromHtml(myHtmlString)
but I don't know how to do it in libGDX.
I tried to use Jsoup, but it removes all tags and does not go to the next line for a <br> tag, for example:
Jsoup.parse(myHtmlString).text()
Jsoup.parse returns a Document made up of many elements, not a single string, so you are only seeing the first part. You can assemble the complete string yourself by walking the elements, or try:
Document doc = Jsoup.parse(yourHtmlInput);
String htmlString = doc.toString();
String htmlText = "<p>This is an <strong>Example</strong></p>";
//this will convert your HTML text into normal text
String normalText = Jsoup.parse(htmlText).text();
In Kotlin I use this code:
var definition = "my html string"
definition = definition.replace("<br>", "\n")
definition = definition.replace("<[^>]*>".toRegex(), "")

How to use the text between HTML tags to access an element - Selenium WebDriver

I have following HTML code.
<span class="ng-binding" ng-bind="::result.display">All Sector ETFs</span>
<span class="ng-binding" ng-bind="::result.display">China Macro Assets</span>
<span class="ng-binding" ng-bind="::result.display">Consumer Discretionary (XLY)</span>
<span class="ng-binding" ng-bind="::result.display">Consumer Staples (XLP)</span>
As can be seen, the tags are identical on every line except for the text between them.
How can I access each of the above lines separately, based on the text between the tags?
Use the below as the XPath:
//span[text()='All Sector ETFs']
You can use the XPath function text() for that.
For example,
//span[text()="All Sector ETFs"]
finds the first span.
You can use the following XPath to find the desired element based on its text:
String text = "Your text";
// text may be ==> All Sector ETFs, China Macro Assets, Consumer Discretionary (XLY), Consumer Staples (XLP)
String xPath = "//*[contains(text(),'" + text + "')]";
With this you can find each element.
Hope it helps. :)
Hi, please do it like below.
Way One
public static void main(String[] args) {
    WebDriver driver = new FirefoxDriver();
    List<WebElement> mySpanTags = driver.findElements(By.xpath("your xpath"));
    System.out.println("Total number of tags: " + mySpanTags.size());
    // print the value of the tags one by one,
    // or do whatever you want to do with a specific tag
    for (int i = 0; i < mySpanTags.size(); i++) {
        System.out.println("Value in the tag is: " + mySpanTags.get(i).getText());
        // either perform the next operation inside this for loop
        if (mySpanTags.get(i).getText().equals("Consumer Staples (XLP)")) {
            // perform your operation here
            mySpanTags.get(i).click(); // clicks on the span tag
        }
    }
    // or perform the next operations on the span tag here, outside the for loop;
    // in this case use the index of a specific tag (e.g. below)
    mySpanTags.get(3).click(); // clicks on the 4th span tag
}
Way Two
Find the tag directly: //span[text()='Consumer Staples (XLP)']
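To see how the exact-match and contains() expressions differ, here is a small sketch that runs them against the sample spans with the JDK's own XPath engine instead of a browser. The wrapping <div>, the class name and the helper method are assumptions made for the example; in Selenium the same expressions would go into By.xpath(...).

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class SpanTextMatch {
    // The spans from the question, wrapped in a root element so the XML parser accepts them.
    static final String SAMPLE =
            "<div>"
            + "<span class=\"ng-binding\">All Sector ETFs</span>"
            + "<span class=\"ng-binding\">China Macro Assets</span>"
            + "<span class=\"ng-binding\">Consumer Discretionary (XLY)</span>"
            + "<span class=\"ng-binding\">Consumer Staples (XLP)</span>"
            + "</div>";

    // Counts how many nodes the given XPath expression selects.
    static int countMatches(String xml, String expression) {
        try {
            Double count = (Double) XPathFactory.newInstance().newXPath().evaluate(
                    "count(" + expression + ")",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Exact match selects one span; contains() may select several.
        System.out.println(countMatches(SAMPLE, "//span[text()='Consumer Staples (XLP)']"));
        System.out.println(countMatches(SAMPLE, "//span[contains(text(),'Consumer')]"));
    }
}
```

An exact text()= comparison is the safer choice when the labels share a common prefix, as "Consumer Discretionary" and "Consumer Staples" do here.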

How to retrieve specific data from html using XPath?

Hey guys, I am having a hard time trying to get the stock price from a site using XPath.
The HTML is this:
<span class=" price">
<meta content="14.400" itemprop="price">
14.400
<span itemprop="priceCurrency"> BRL</span>
</span>
The paths I tried in order to retrieve the 14.400 value (all of them returning null) were:
@"//span[@class=' price']";
@"/span[@class=' price']";
@"span[@class=' price']";
@"//meta[@itemprop='price']";
@"/html/body/div[2]/div/div/div/div[2]/span/meta";
@"//html/body/div[2]/div/div/div/div[2]/span/meta";
After trying a lot more the closest I could get to what I need was using this xPath:
@"//span[@class=' price']/meta";
to get this log:
2014-02-07 13:50:39.616 manejoderisco[2838:60b] {
nodeAttributeArray = (
{
attributeName = itemprop;
nodeContent = price;
},
{
attributeName = content;
nodeContent = "14.280";
}
);
nodeName = meta;
}
But it is still returning a null value...
I finally managed to create the correct XPath, which is this one:
@"//span/meta/@content";
The HTML you are trying to parse isn't well formed, since there is no closing tag for meta.
However, if you are indeed able to catch the meta tag, you may want to select the content:
//span[@class=' price']/meta/@content
Or, if you need the first text field,
//span[@class=' price']//text()[1]
might work as well.
Don't forget that when you use //span/meta you are selecting the meta node itself, so <meta content="14.400" itemprop="price">14.400 (ending wherever, depending on what is evaluating your XPath, since the HTML is malformed). If you want the content, you need to select either the @content attribute or the text field with text().
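Here is a small sketch of both selections (attribute and text node) using the JDK's XPath engine. Note the assumptions: the <meta> tag has been self-closed so that the snippet is well-formed XML, and the class and helper names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class PriceXPath {
    // The markup from the question, with <meta ... /> self-closed (assumption)
    // because the JDK XML parser only accepts well-formed input.
    static final String SAMPLE =
            "<span class=\" price\">\n"
            + "<meta content=\"14.400\" itemprop=\"price\" />\n"
            + "14.400\n"
            + "<span itemprop=\"priceCurrency\"> BRL</span>\n"
            + "</span>";

    // Evaluates an XPath expression and returns its string value.
    static String eval(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Option 1: read the content attribute of the meta element.
        System.out.println(eval(SAMPLE, "//span[@class=' price']/meta/@content"));
        // Option 2: take the first non-blank direct text node of the span.
        System.out.println(eval(SAMPLE,
                "normalize-space(//span[@class=' price']/text()[normalize-space()][1])"));
    }
}
```

The [normalize-space()] predicate is there because the first text node of the span is only a line break; without it, text()[1] would select that whitespace instead of the price.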

replace keyword within html string

I am looking for a way to replace keywords within an HTML string with a variable. At the moment I am using the following example.
returnString = Replace(message, "[CustomerName]", customerName, CompareMethod.Text)
The above works fine if the HTML formatting wraps the whole keyword,
e.g.
<b>[CustomerName]</b>
However, if the formatting is split partway through the keyword, the string is not found and thus not replaced,
e.g.
<b>[Customer</b>Name]
The formatting of the string is out of my control and isn't foolproof. With this in mind, what is the best approach to find a keyword within an HTML string?
Try using a regular expression. You can create your expressions here; I used this and it works well.
http://regex-test.com/validate/javascript/js_match
Use the textContent property instead of innerHTML if you're using JavaScript to access the content. That removes all tags from the content and gives you back a clean text representation of the placeholder.
For example, if the content looks like this:
<div id="name">
<b>[Customer</b>Name]
</div>
Then accessing its textContent property gives:
var name = document.getElementById("name").textContent.trim();
// sets name to "[CustomerName]" without the tags
which should be easy to process. Do a regex search now if you need to.
Edit: Since you're doing this processing on the server side, process the XML recursively and collect the text of each node. Since I'm not big on VB.Net, here's some pseudocode:
getNodeText(node) {
    text = ""
    for each node.children as child {
        if child.type == TextNode {
            text += child.text
        }
        else {
            text += getNodeText(child);
        }
    }
    return text
}
myXml = xml.load(<html>);
print getNodeText(myXml);
And then replace or whatever there is to be done!
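The pseudocode above could look roughly like this in Java, using the JDK's DOM parser. This is only a sketch: it assumes the fragment is well-formed, the class and method names are invented, and the replacement discards the original formatting tags.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeTextCollector {
    // Recursively concatenates the text of all text nodes below the given node.
    static String getNodeText(Node node) {
        StringBuilder text = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            } else {
                text.append(getNodeText(child));
            }
        }
        return text.toString();
    }

    // Parses a well-formed fragment (assumption) and returns its tag-free text.
    static String textOf(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return getNodeText(root);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String text = textOf("<div id=\"name\"><b>[Customer</b>Name]</div>");
        // The keyword is now contiguous, so a plain replace finds it.
        System.out.println(text.replace("[CustomerName]", "Alice"));
    }
}
```

In practice Node.getTextContent() returns the same concatenated text; the explicit recursion is shown only because it mirrors the pseudocode.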
I have found what I believe is a solution to this issue. Well in my scenario it is working.
The html input has been tweaked to place each custom field or keyword within a div with a set id. I have looped through all of the elements within the html string using mshtml and have set the inner text to the correct value when a match is found.
e.g.
Function ReplaceDetails(ByVal message As String, ByVal customerName As String) As String
    Dim returnString As String = String.Empty
    Dim doc As IHTMLDocument2 = New HTMLDocument
    doc.write(message)
    doc.close()
    ' Set the inner text of every element whose id matches a known keyword
    For Each el As IHTMLElement In doc.body.all
        If (el.id = "Date") Then
            el.innerText = Now.ToShortDateString
        End If
        If (el.id = "CustomerName") Then
            el.innerText = customerName
        End If
    Next
    returnString = doc.body.innerHTML
    Return returnString
End Function
Thanks for all of the input. I'm glad to have a solution to the problem.

HTML Agility Pack - Get Page Summary

How would I use the HTML Agility Pack to get the First Paragraph of text from the body of an HTML file. I'm building a DIGG style link submission tool, and want to get the title and the first paragraph of text. Title is easy, any suggestions for how I might get the first paragraph of text from the body? I guess it could be within P or DIV depending on the page.
Is this HTML that you control? If so, you could give the p an id or a class and find it via
//p[@id=\"YOUR ID\"] or //p[@class=\"YOUR CLASS\"]
EDIT:
Since you don't control the html, maybe the below will work. It takes all the HtmlTextNodes and tries to find a grouping of text greater than the threshold specified. It's far from perfect but might get you going in the right direction.
String summary = FindSummary(page.DocumentNode);
private const int THRESHOLD = 50;
private String FindSummary(HtmlAgilityPack.HtmlNode node) {
    foreach (HtmlAgilityPack.HtmlNode childNode in node.ChildNodes) {
        if (childNode.GetType() == typeof(HtmlAgilityPack.HtmlTextNode)) {
            if (childNode.InnerText.Length >= THRESHOLD) {
                return childNode.InnerText;
            }
        }
        String summary = FindSummary(childNode);
        if (summary.Length >= THRESHOLD) {
            return summary;
        }
    }
    return String.Empty;
}
The Agility Pack uses XPath for querying the HTML, so once the document is loaded you just use a simple XPath statement. Something like...
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
HtmlNode firstParagraph = htmldoc.DocumentNode.SelectSingleNode("//p");
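A note on positional predicates: in XPath, //p[1] matches every <p> that is the first <p> among its siblings, while (//p)[1] is the first <p> in the whole document. The sketch below shows the difference with the JDK's XPath engine; the sample markup, class name and helpers are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FirstParagraphXPath {
    // Two containers, each with its own first <p> (hypothetical sample markup).
    static final String SAMPLE =
            "<body><div><p>A</p><p>B</p></div><div><p>C</p></div></body>";

    // Counts the nodes selected by an XPath expression.
    static int count(String xml, String expression) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath().evaluate(
                    "count(" + expression + ")",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    // Returns the string value of the first node selected by the expression.
    static String text(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "//p[1]"));   // first <p> of each parent
        System.out.println(count(SAMPLE, "(//p)[1]")); // first <p> overall
        System.out.println(text(SAMPLE, "(//p)[1]"));
    }
}
```

For a page summary, the document-wide form is usually the one you want, since pages often have a first <p> inside every container.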