How would I use the HTML Agility Pack to get the first paragraph of text from the body of an HTML file? I'm building a Digg-style link submission tool and want to get the title and the first paragraph of text. The title is easy; any suggestions for how I might get the first paragraph of text from the body? I guess it could be within a P or a DIV, depending on the page.
Is this HTML that you control? If so, you could give the p an id or a class and find it via
//p[@id="YOUR ID"] or //p[@class="YOUR CLASS"]
EDIT:
Since you don't control the HTML, maybe the code below will work. It walks all the HtmlTextNodes and tries to find a run of text longer than the specified threshold. It's far from perfect but might get you going in the right direction.
String summary = FindSummary(page.DocumentNode);

private const int THRESHOLD = 50;

private String FindSummary(HtmlAgilityPack.HtmlNode node) {
    foreach (HtmlAgilityPack.HtmlNode childNode in node.ChildNodes) {
        // A text node long enough to look like a real paragraph wins.
        if (childNode.GetType() == typeof(HtmlAgilityPack.HtmlTextNode)) {
            if (childNode.InnerText.Length >= THRESHOLD) {
                return childNode.InnerText;
            }
        }

        // Otherwise recurse into the child's subtree.
        String summary = FindSummary(childNode);
        if (summary.Length >= THRESHOLD) {
            return summary;
        }
    }
    return String.Empty;
}
The Agility Pack uses XPath for querying the HTML it loads, so you can just use a simple XPath statement. Something like...
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
// (//p)[1] is the first <p> in document order; //p[1] would match the first <p> child of every parent.
HtmlNode firstParagraph = htmldoc.DocumentNode.SelectSingleNode("(//p)[1]");
I have been tasked with coding a web crawler that goes through several URLs (around 400, but the list could grow), each with a completely different HTML structure, and extracts the links containing certain information. The only thing the program knows beforehand is the keywords it should search for; the HTML structure and any semantic cues as to where to look for those keywords are unknown.
So far, I have used the request-promise module for Node.js to send a request to the URL where the search for keywords will take place:
const htmlResult = await request.get(url);
htmlResult stores the response as a string, and I can save it as either a .txt or an .html file if needed.
The problem I have is that I don't know how to instruct the program how to extract a URL based on words that aren't necessarily present in the url string. An example might help clarify:
<a href="site/with/no/keywords-just-a-random-string" title="Keywords might be here, but title attribute might be absent"><span class="img"><img data-cfsrc="/thumbpdf/618a8nb4.jpg" alt="" style="display:none;visibility:hidden;"><noscript><img src="/thumbpdf/8bfa84.jpg" alt=""></noscript></span>
<h2>KEYWORDS ARE IN THIS TAG, WHICH IN TURN IS INSIDE THE <a> TAG</h2>
<span class="date--type">2 Nov 2021 </span>
<span class="tag">
oher stuff with no keywords in it</span>
</a>
As you can see, this tag has a complex structure. The keywords I need to parse are inside an h2 tag which, in turn, is inside the a tag. But the a tag could also be like this:
<a href="...">KEYWORDS TO PARSE</a>
Here the keywords are simply within the a tag.
My question, thus, is: how do I parse htmlResult (either as a string or saved as a .txt/.html file), and, once I get a match, instruct the program to extract the URL that is within the bounds of the a tag wherein I got the match of keywords?
As I am using Node.js, I am open to using any tool available.
Could someone offer some advice on how to tackle this challenge?
Thanks so much in advance.
This is very quick and dirty, and I'm sure it can be further streamlined, but it should get you at least closer to where you need to be.
This assumes a bunch of <div> elements, each containing one of your <a> elements, all in one document (see link below). It uses XPath to locate the data:
function xpathEval(xpath, context) {
    return document.evaluate(xpath, context, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
}

const desiredHrefs = [];

let targets = xpathEval("//div[@class='container']", document);
for (let i = 0; i < targets.snapshotLength; i++) {
    let attribs = xpathEval('.//*/@*', targets.snapshotItem(i)),
        texts = xpathEval('.//*/text()', targets.snapshotItem(i));

    for (let k = 0; k < attribs.snapshotLength; k++) {
        let attribData = attribs.snapshotItem(k).textContent;
        if (attribData.includes("trainer") && attribData.includes("dog")) {
            //either
            //console.log(targets.snapshotItem(i).querySelector('a').getAttribute('href'))
            //or
            let href = xpathEval('.//a/@href', targets.snapshotItem(i));
            desiredHrefs.push(href.snapshotItem(0).textContent);
        }
    }

    for (let j = 0; j < texts.snapshotLength; j++) {
        let data = texts.snapshotItem(j).nodeValue.trim().toLowerCase();
        if (data.includes("trainer") && data.includes("dog")) {
            //either
            //console.log(targets.snapshotItem(i).querySelector('a').getAttribute('href'))
            //or
            let href = xpathEval('.//a/@href', targets.snapshotItem(i));
            desiredHrefs.push(href.snapshotItem(0).textContent);
        }
    }
}

for (let href of [...new Set(desiredHrefs)])
    console.log(href);
You can see it in action here.
I've been learning regex, and for that I have been working on HackerRank problems. I came across a problem where I am asked to remove the HTML formatting and keep only what is inside an anchor tag's reference (the value of the href part) and the text inside the tag, then present the two separated by a comma.
I came up with the following code to extract such information:
public static void main(String[] args) {
    Scanner s = new Scanner(System.in);
    int n = s.nextInt();
    s.nextLine();
    for (int i = 0; i < n; i++) {
        String line = s.nextLine();
        Pattern p = Pattern.compile("(.*)<a href=\"([^\"]+)\"([^<>]*)>(<\\w+>)*([^<>]+)</a>(</\\w+>)*");
        Matcher m = p.matcher(line);
        while (m.find()) {
            System.out.println(m.group(2).trim() + "," + m.group(5).trim());
        }
    }
}
This code, when presented with cases such as <p><a href="folder/page">text</a></p>, passes and outputs folder/page,text
But if the input has multiple <a> tags, it will only grab the last occurrence of it and output that, instead of outputting all possible matches for that single input. Why is this happening? Please don't feel obliged to answer my question fully if you think I can answer it myself with just a hint. Thank you for any answers in advance
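A hint rather than a full answer: the leading greedy (.*) swallows everything up to the last <a href= on the line, so group 2 and group 5 only ever capture the final anchor, and the next find() starts at the end of the line with nothing left to match. Dropping that leading group (the group numbers then shift down by one) lets find() visit every anchor. A rough sketch with a made-up input line:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorMatches {
    public static void main(String[] args) {
        // Hypothetical line containing two anchors.
        String line = "<p><a href=\"folder/page\">text</a> and <a href=\"other/page\">more</a></p>";
        // Same pattern as above, minus the leading (.*).
        Pattern p = Pattern.compile("<a href=\"([^\"]+)\"([^<>]*)>(<\\w+>)*([^<>]+)</a>(</\\w+>)*");
        Matcher m = p.matcher(line);
        while (m.find()) {
            // Prints folder/page,text then other/page,more
            System.out.println(m.group(1).trim() + "," + m.group(4).trim());
        }
    }
}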
I'm working on an HTML page highlighter project but ran into problems when a search term is the name of an HTML attribute or a class/ID value; e.g., if the search terms are "media OR class OR content" then my find and replace would do this:
<link href="/css/DocHighlighter.css" <span style='background-color:yellow;font-weight:bold;'>media</span>="all" rel="stylesheet" type="text/css">
<div <span style='background-color:yellow;font-weight:bold;'>class</span>="container">
I'm using Lucene for highlighting and my current code (sort of):
InputStreamReader xmlReader = new InputStreamReader(xmlConn.getInputStream(), "UTF-8");

Highlighter hl = null;
if (searchTerms != null && !searchTerms.isEmpty()) {
    QueryScorer qryScore = new QueryScorer(qp.parse(searchTerms));
    hl = new Highlighter(new SimpleHTMLFormatter(hlStart, hlEnd), qryScore);
}

if (xmlReader != null) {
    BufferedReader br = new BufferedReader(xmlReader);
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
        String tmp = inputLine.trim();
        StringReader strReader = new StringReader(tmp);
        HTMLStripCharFilter htm = new HTMLStripCharFilter(strReader.markSupported() ? strReader : new BufferedReader(strReader));
        String tHL = hl.getBestFragment(analyzer, "", htm);
        tmp = (tHL == null ? tmp : tHL);
        xmlDoc += tmp;
    }
    br.close();
}
As you can see (if you understand Lucene highlighting), this does an indiscriminate find/replace. Since my document will be HTML and the search terms are dictated by users, there is no way for me to parse on certain elements or tags. Also, since the find/replace basically loops and appends the HTML to a string (the return type of the method), I have to keep all HTML tags and values in place and in order. I've tried using Jsoup to loop through the page, but it handles the HTML as one big result. I also tried TagSoup to remove the broken HTML caused by the problem, but it doesn't work correctly. Does anyone know how to basically loop through the elements and nodes (data values) of HTML?
I've been having the most luck with this
StringBuilder sb = new StringBuilder();
sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>");
Document doc = Jsoup.parse(txt.getResult());
Elements elements = doc.getAllElements();
for (Element e : elements) {
    if (!(e.tagName().equalsIgnoreCase("#root"))) {
        sb.append("<" + e.tagName() + e.attributes() + ">" + e.ownText() + "\n");
    }// end if
}// end for
return sb;
The one snag I still get is the nesting isn't always "repaired" properly but still semi close. I'm working more on this.
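Another option I've been looking at is letting Jsoup walk the tree itself, so only text nodes get touched and the nesting stays intact. This is just a sketch, assuming a reasonably recent Jsoup version (where NodeTraversor.traverse is a static method) and made-up input markup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class WalkNodes {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div class=\"container\"><b>bold</b> plain text</div>");

        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Only the text is visited here, so a highlighter can rewrite
                    // this value without ever touching tags or attributes.
                    System.out.println("text:  " + ((TextNode) node).text());
                } else {
                    System.out.println("open:  <" + node.nodeName() + ">");
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (!(node instanceof TextNode)) {
                    System.out.println("close: </" + node.nodeName() + ">");
                }
            }
        }, doc.body());
    }
}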
Is there a way to count the letters of the inner text of an HTML element, without counting the letters of inner elements' texts?
I tried out the .getText() method of WebElement from the Selenium library, but this also counts the inner texts of inner web elements (e.g. "<body><div>test</div></body>" results in 4 letters for the "div" and the "body" element, instead of 0 for the "body" element).
Do I have to use an additional HTML parsing library, and when yes which one would you recommend?
I'm using Java 7...
Based on this answer for a similar question, I cooked you a solution:
The piece of JavaScript takes an element, iterates over all its child nodes and if they're text nodes, it reads them and returns them concatenated:
var element = arguments[0];
var text = '';
for (var i = 0; i < element.childNodes.length; i++) {
    if (element.childNodes[i].nodeType === Node.TEXT_NODE) {
        text += element.childNodes[i].textContent;
    }
}
return text;
I saved this script into a script.js file and loaded it into a single String via FileUtils.readFileToString(). You can use Guava's Files.toString(), too. Or just embed it into your Java code.
final String script = FileUtils.readFileToString(new File("script.js"), "UTF-8");
JavascriptExecutor js = (JavascriptExecutor)driver;
...
WebElement element = driver.findElement(By.anything("myElement"));
String text = (String)js.executeScript(script, element);
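For the markup from the question, a quick check would look like this (reusing the script and js from above; the locators are just placeholders):
// Example markup from the question: <body><div>test</div></body>
WebElement body = driver.findElement(By.tagName("body"));
WebElement div = driver.findElement(By.tagName("div"));

String bodyOwnText = (String) js.executeScript(script, body);
String divOwnText = (String) js.executeScript(script, div);

System.out.println(bodyOwnText.trim().length()); // 0 - <body> has no direct text
System.out.println(divOwnText.trim().length());  // 4 - "test"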
I am looking for a way to replace keywords within an HTML string with a variable. At the moment I am using the following example.
returnString = Replace(message, "[CustomerName]", customerName, CompareMethod.Text)
The above will work fine if the HTML formatting wraps the whole keyword, e.g.
<b>[CustomerName]</b>
However, if formatting tags split the keyword, the string is not found and thus not replaced, e.g.
<b>[Customer</b>Name]
The formatting of the string is out of my control and isn't consistent. With this in mind, what is the best approach to finding a keyword within an HTML string?
Try using a regex expression. You can create and test your expressions here; I used this site and it works well.
http://regex-test.com/validate/javascript/js_match
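As a rough illustration of the kind of pattern that hint points at (my own sketch, in Java, with a made-up replacement value), you can let the expression tolerate tags in the middle of the placeholder. Bear in mind that a plain textual replacement like this can leave unbalanced tags behind, which is part of why the eventual solution below works on the parsed document instead:
import java.util.regex.Pattern;

public class KeywordAcrossTags {
    public static void main(String[] args) {
        String html = "<b>[Customer</b>Name]";
        // Allow any run of tags or whitespace between the two halves of the placeholder.
        Pattern p = Pattern.compile("\\[Customer(?:<[^>]*>|\\s)*Name\\]");
        String result = p.matcher(html).replaceAll("Jane Doe"); // hypothetical customer name
        System.out.println(result); // "<b>Jane Doe" - note the now-unbalanced <b>
    }
}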
Use the textContent property instead of innerHTML if you're using JavaScript to access the content. That should strip all tags from the content and give you back a clean text representation of the customer's name.
For example, if the content looks like this:
<div id="name">
<b>[Customer</b>Name]
</div>
Then accessing its textContent property gives:
var name = document.getElementById("name").textContent;
// sets name to "[CustomerName]" without the tags
which should be easy to process. Do a regex search now if you need to.
Edit: Since you're doing this processing on the server side, process the XML recursively and collect the text elements of each node. Since I'm not big on VB.Net, here's some pseudocode:
getNodeText(node) {
    text = ""
    for each node.children as child {
        if child.type == TextNode {
            text += child.text
        }
        else {
            text += getNodeText(child)
        }
    }
    return text
}
myXml = xml.load(<html>);
print getNodeText(myXml);
And then replace or whatever there is to be done!
I have found what I believe is a solution to this issue; at least, in my scenario it is working.
The HTML input has been tweaked to place each custom field or keyword within a div with a set id. I have looped through all of the elements within the HTML string using mshtml and set the inner text to the correct value when a match is found.
e.g.
Function ReplaceDetails(ByVal message As String, ByVal customerName As String) As String
    Dim returnString As String = String.Empty
    Dim doc As IHTMLDocument2 = New HTMLDocument
    doc.write(message)
    doc.close()

    For Each el As IHTMLElement In doc.body.all
        If (el.id = "Date") Then
            el.innerText = Now.ToShortDateString
        End If
        If (el.id = "CustomerName") Then
            el.innerText = customerName
        End If
    Next

    returnString = doc.body.innerHTML
    Return returnString
End Function
Thanks for all of the input. I'm glad to have a solution to the problem.