How to find the HTML tag node position with Html Agility Pack

I am trying to find the start/end positions of different Html tags inside my Html string by using Html Agility Pack.
Sample html string:
This is a custom made html string that will serve as an example for the StackOverflow question described above.
After successfully running the code I need to get two arrays holding the start and end indexes of the a tags, as follows:
int[] startIndex = new int[] { 11, 124 };
int[] endIndex = new int[] { 68, 176 };
Where 11 and 124 are the index positions that mark the beginning of each a tag, and 68 and 176 are the last index positions of the same tags.
I know that the Html Agility Pack HtmlNode exposes a LinePosition value that gives me the start index, and together with the InnerHtml.Length of the element I can calculate the end index position of the HTML element.
I was able to count the a elements by using:
int aNodesCount = htmlDoc.DocumentNode.SelectNodes("//a").Count;
And now I need to iterate through all of them and get the LinePosition value of each one. This is where I find myself stuck.

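Well, that was pretty simple, so I will post an answer for myself and for others who hit the same problem:
// Lists are used instead of int[] because Add() is not available on arrays.
var startIndex = new List<int>();
var endIndex = new List<int>();
// LinePosition is the column within the node's line, so this assumes single-line HTML.
foreach (HtmlNode aNode in htmlDoc.DocumentNode.SelectNodes("//a"))
{
    startIndex.Add(aNode.LinePosition);
    endIndex.Add(aNode.LinePosition + aNode.OuterHtml.Length);
}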
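For readers who want to sanity-check such indexes without the library, here is a minimal Java sketch that computes start and end offsets with a regular expression. It is a stand-in technique, not Html Agility Pack: the class name TagPositions, the pattern, and the sample string are assumptions made for illustration, and the regex only copes with simple, non-nested a elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class TagPositions {
    // Matches a simple, non-nested <a ...>...</a> element (illustrative only).
    private static final Pattern A_TAG =
        Pattern.compile("<a\\b[^>]*>.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns {start, endExclusive} offsets for every a element in the string.
    static List<int[]> find(String html) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = A_TAG.matcher(html);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        // Made-up sample input.
        String html = "This is <a href=\"one.html\">one</a> and <a href=\"two.html\">two</a>.";
        for (int[] span : find(html)) {
            System.out.println(span[0] + ".." + span[1] + " " + html.substring(span[0], span[1]));
        }
    }
}
```

Note that the end offset here is exclusive (one past the last character of the tag), which is also what adding OuterHtml.Length to the start position produces.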

Related

Regex only captures the last occurrence of a match in html format

I've been learning regex and for that I have been working on Hackerrank problems. I came across a problem where I am asked to remove html format and only keep whatever is inside an anchor tag's reference (the value of the href part), and the text inside the tag, then present this separated by a comma.
I came up with the following code to extract such information:
public static void main(String[] args) {
    Scanner s = new Scanner(System.in);
    int n = s.nextInt();
    s.nextLine();
    for (int i = 0; i < n; i++) {
        String line = s.nextLine();
        Pattern p = Pattern.compile("(.*)<a href=\"([^\"]+)\"([^<>]*)>(<\\w+>)*([^<>]+)</a>(</\\w+>)*");
        Matcher m = p.matcher(line);
        while (m.find()) {
            System.out.println(m.group(2).trim() + "," + m.group(5).trim());
        }
    }
}
This code, when presented with cases such as <p><a href="folder/page">text</a></p>, passes and outputs folder/page,text
But if the input has multiple <a> tags, it only grabs the last occurrence and outputs that, instead of outputting every match for that single line. Why is this happening? Please don't feel obliged to answer my question fully if you think I can answer it myself with just a hint. Thank you in advance for any answers.
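A likely explanation, sketched here rather than confirmed against the original challenge input: the leading (.*) is greedy, so it first consumes the whole line and then backtracks only as far as the last "<a href=". One call to find() therefore swallows every earlier anchor, and the next call has nothing left to search. The sketch below compares the question's pattern with a variant that drops the leading (.*). The class name AnchorExtractor, the named groups, and the sample line are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the group names were added for readability.
class AnchorExtractor {
    // The question's pattern: the greedy "(.*)" backtracks only to the LAST "<a href=".
    static final Pattern GREEDY = Pattern.compile(
        "(.*)<a href=\"(?<href>[^\"]+)\"([^<>]*)>(<\\w+>)*(?<text>[^<>]+)</a>(</\\w+>)*");

    // Variant without the leading "(.*)": each find() resumes after the previous anchor.
    static final Pattern PER_ANCHOR = Pattern.compile(
        "<a href=\"(?<href>[^\"]+)\"[^<>]*>(?:<\\w+>)*(?<text>[^<>]+)</a>");

    // Collects "href,text" for every match of the given pattern in the line.
    static List<String> extract(Pattern pattern, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            out.add(m.group("href").trim() + "," + m.group("text").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample line with two anchors.
        String line = "<p><a href=\"a/b\">first</a> and <a href=\"c/d\">second</a></p>";
        System.out.println(extract(GREEDY, line));     // only the last anchor
        System.out.println(extract(PER_ANCHOR, line)); // both anchors
    }
}
```

Making the prefix reluctant, (.*?), is another way to get every match, but removing it is simpler because the prefix is never used.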

Loop Through HTML Elements and Nodes

I'm working on an HTML page highlighter project but ran into problems when a search term is the name of an HTML tag, an attribute, or a class/ID name; e.g., if the search terms are "media OR class OR content" then my find and replace would do this:
<link href="/css/DocHighlighter.css" <span style='background-color:yellow;font-weight:bold;'>media</span>="all" rel="stylesheet" type="text/css">
<div <span style='background-color:yellow;font-weight:bold;'>class</span>="container">
I'm using Lucene for highlighting and my current code (sort of):
InputStreamReader xmlReader = new InputStreamReader(xmlConn.getInputStream(), "UTF-8");
Highlighter hl = null; // declared outside the if so the loop below can see it
if (searchTerms != null && !searchTerms.isEmpty()) {
    QueryScorer qryScore = new QueryScorer(qp.parse(searchTerms));
    hl = new Highlighter(new SimpleHTMLFormatter(hlStart, hlEnd), qryScore);
}
if (xmlReader != null) {
    BufferedReader br = new BufferedReader(xmlReader);
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
        String tmp = inputLine.trim();
        StringReader strReader = new StringReader(tmp);
        HTMLStripCharFilter htm = new HTMLStripCharFilter(strReader.markSupported() ? strReader : new BufferedReader(strReader));
        String tHL = (hl == null) ? null : hl.getBestFragment(analyzer, "", htm);
        tmp = (tHL == null ? tmp : tHL);
        xmlDoc += tmp; // append inside the loop, where tmp is in scope
    }
    br.close();
}
As you can see (if you understand Lucene highlighting), this does an indiscriminate find/replace. Since my document will be HTML and the search terms are dictated by users, there is no way for me to parse on certain elements or tags. Also, since the find/replace basically loops and appends the HTML to a string (the return type of the method), I have to keep all HTML tags and values in place and in order. I've tried using Jsoup to loop through the page, but it handles the HTML tag as one big result. I also tried TagSoup to remove the broken HTML caused by the problem, but it doesn't work correctly. Does anyone know how to loop through the elements and nodes (data values) of HTML?
I've been having the most luck with this:
StringBuilder sb = new StringBuilder();
sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>");
Document doc = Jsoup.parse(txt.getResult());
Elements elements = doc.getAllElements();
for (Element e : elements) {
    if (!(e.tagName().equalsIgnoreCase("#root"))) {
        sb.append("<" + e.tagName() + e.attributes() + ">" + e.ownText() + "\n");
    } // end if
} // end for
return sb;
The one snag I still get is the nesting isn't always "repaired" properly but still semi close. I'm working more on this.
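One way to keep tag names and attributes from being highlighted is to highlight only the text between tags. The sketch below does that with plain java.util.regex instead of Lucene or Jsoup, so it is a simplified stand-in rather than a drop-in fix. The class name TextOnlyHighlighter, the method names, and the sample markup are assumptions, and the naive "<[^>]*>" tag pattern ignores comments, script blocks, and a ">" inside attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: wraps a search term only where it occurs outside tags.
class TextOnlyHighlighter {
    // Naive tag matcher (assumption: no '>' inside attribute values).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String highlight(String html, String term, String start, String end) {
        Pattern termPattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(html);
        int last = 0;
        while (tags.find()) {
            // Text before this tag gets highlighted; the tag itself is copied verbatim.
            out.append(mark(html.substring(last, tags.start()), termPattern, start, end));
            out.append(tags.group());
            last = tags.end();
        }
        out.append(mark(html.substring(last), termPattern, start, end));
        return out.toString();
    }

    private static String mark(String text, Pattern term, String start, String end) {
        // "$0" is the whole match; the markers are quoted so '$' and '\' stay literal.
        return term.matcher(text).replaceAll(
            Matcher.quoteReplacement(start) + "$0" + Matcher.quoteReplacement(end));
    }

    public static void main(String[] args) {
        String html = "<div class=\"container\">A class act</div>";
        System.out.println(highlight(html, "class", "<b>", "</b>"));
    }
}
```

Because tags are copied through untouched, the original order and nesting of the markup are preserved, which matches the requirement to keep all HTML tags and values in place.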

Split the data (text) of a text field into two strings, and use each string in a different section of a Visualforce page

I have an account with a Note in the Notes and Attachments related list.
I want to display the first 8 lines of the text, out of 50 lines, in one section of a Visualforce page,
and the remaining 42 lines in another section of the same Visualforce page.
What I know:
I need to split notes.body into two substrings: one with 8 lines of text and a second one with 42 lines. I think I need to use div tags to get this done on the Visualforce page, but I might be wrong.
Please suggest how I can achieve this.
I appreciate the help.
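You can split the text in the controller by using the split() method, like this:
String[] allLines = allText.split('\n');
String first8Lines = '';
String restLines = '';
// Guard against notes shorter than 8 lines; re-append the newline that split() removed.
for (Integer i = 0; i < Math.min(8, allLines.size()); i++) {
    first8Lines += allLines[i] + '\n';
}
// Start at index 8 (not 7) so that line 8 is not repeated in the second part.
for (Integer i = 8; i < allLines.size(); i++) {
    restLines += allLines[i] + '\n';
}
And then put {!first8Lines} in one place and {!restLines} in another.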
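The same head/rest split can be sketched in Java for comparison. This is an analogue written for illustration, not Apex controller code: the class name NoteSplitter and the method splitAt are assumptions, and the cut point is passed in rather than hard-coded to 8.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing the split of a text into its first N lines and the remainder.
class NoteSplitter {
    // Returns {head, rest}; head holds at most headLines lines.
    static String[] splitAt(String text, int headLines) {
        List<String> lines = Arrays.asList(text.split("\n", -1));
        int cut = Math.min(headLines, lines.size()); // tolerate short notes
        String head = String.join("\n", lines.subList(0, cut));
        String rest = String.join("\n", lines.subList(cut, lines.size()));
        return new String[] { head, rest };
    }

    public static void main(String[] args) {
        String[] parts = splitAt("line1\nline2\nline3", 2);
        System.out.println("HEAD:\n" + parts[0]);
        System.out.println("REST:\n" + parts[1]);
    }
}
```

Using a single cut index for both halves avoids the off-by-one risk of two independently written loops.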

Re-stacking MovieClips in an Array

I am trying to build something similar to the game SameGame (i.e., the blocks above the removed blocks fall downward). Before trying this with an Array that contains MovieClips, this code worked with int values. With MovieClips in the array, it does not seem to work the same way.
With int values, example:
popUp(0, 4): Before: 1,2,3,4,5,6,7,8,9,10; After: 1,2,3,4,6,7,8,9,10
But with MovieClips:
popUp(0, 4): Before: 1,2,3,4,5,6,7,8,9,10; After: 1,2,3,4
// Assume the numbers are movieclips XD
Basically, it strips everything else, rather than just the said block >_<
Here's the whole method. Basically, two extra arrays juggle the values above the soon-to-be removed value, remove the value, then re-stack it to the original array.
What could be wrong with this? And am I doing the right thing for what I really wanted to emulate?
function popUp(col:uint, row:uint)
{
    var tempStack:Array = new Array();
    var extraStack:Array = new Array();
    tempStack = IndexArray[col];
    removeChild(tempStack[0]);
    for(var ctr:uint = tempStack.length-(row+1); ctr > 0; ctr--)
    {
        removeChild(tempStack[ctr]);
        extraStack.push(tempStack.pop());
        trace(extraStack);
    }
    tempStack.pop();
    for(ctr = extraStack.length; ctr > 0; ctr--)
    {
        tempStack.push(extraStack.pop());
        //addChild(tempStack[ctr]);
    }
    IndexArray[col] = tempStack;
}
PS: If it's not too much to ask, are there free step-by-step guides on making a SameGame in AS3 (I fear I might not be doing things right)? Thanks in advance =)
I think you just want to remove an element and have everything after that index shift down a place to fill the gap. There's a built-in Array method for this: splice(startIndex:int, deleteCount:uint).
Parameters:
startIndex - the index to start removing elements from
deleteCount - the number of elements to remove
var ar:Array = ["hello","there","sir"];
ar.splice(1, 1);
// ar is now ["hello", "sir"]
As per the question:
Here's an example with different types of elements:
var ar:Array = [new MovieClip(), "some string", new Sprite(), 8];
ar.splice(2, 1);
trace(ar); // [object MovieClip], some string, 8
And a further example to show the indexes being changed:
trace(ar[2]); // was [object Sprite], is now 8
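For comparison, the same remove-and-shift behaviour can be sketched in Java, where List.remove(int) plays the role of splice(index, 1). The class name ColumnStack and the method popAt are assumptions for illustration; the sample values reproduce the popUp(0, 4) example from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one column of blocks, index 0 at the bottom.
class ColumnStack {
    // Removes the block at the given row; every block above it shifts down by one.
    static <T> T popAt(List<T> column, int row) {
        return column.remove(row); // row is an int, so this is remove(int index)
    }

    public static void main(String[] args) {
        List<Integer> column = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            column.add(i);
        }
        popAt(column, 4);
        System.out.println(column); // [1, 2, 3, 4, 6, 7, 8, 9, 10]
    }
}
```

No temporary stacks are needed, because the list itself closes the gap; the element type does not matter, so the same call works for display objects as well as numbers.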

HTML Agility Pack - Get Page Summary

How would I use the HTML Agility Pack to get the first paragraph of text from the body of an HTML file? I'm building a Digg-style link submission tool and want to get the title and the first paragraph of text. The title is easy; any suggestions for how I might get the first paragraph of text from the body? I guess it could be within a P or a DIV, depending on the page.
Is this html that you control? If so, you could give the p an id or a class and find it via
//p[@id=\"YOUR ID\"] or //p[@class=\"YOUR CLASS\"]
EDIT:
Since you don't control the html, maybe the below will work. It takes all the HtmlTextNodes and tries to find a grouping of text greater than the threshold specified. It's far from perfect but might get you going in the right direction.
String summary = FindSummary(page.DocumentNode);
private const int THRESHOLD = 50;
private String FindSummary(HtmlAgilityPack.HtmlNode node) {
    foreach (HtmlAgilityPack.HtmlNode childNode in node.ChildNodes) {
        if (childNode.GetType() == typeof(HtmlAgilityPack.HtmlTextNode)) {
            if (childNode.InnerText.Length >= THRESHOLD) {
                return childNode.InnerText;
            }
        }
        String summary = FindSummary(childNode);
        if (summary.Length >= THRESHOLD) {
            return summary;
        }
    }
    return String.Empty;
}
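The Agility Pack uses XPath for querying the HTML, so you can just use a simple XPath statement. Something like...
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
// SelectSingleNode returns the first match in document order; "//p[1]" would instead match every p that is the first p child of its parent.
HtmlNode firstParagraph = htmldoc.DocumentNode.SelectSingleNode("//p");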