Counting inner text letters of HTML element - html

Is there a way to count the letters in the inner text of an HTML element without counting the text of its child elements?
I tried the getText() method of WebElement from the Selenium library, but it also includes the text of nested elements (e.g. for "<body><div>test</div></body>" it reports 4 letters for both the "div" and the "body" element, instead of 0 for the "body" element).
Do I have to use an additional HTML parsing library, and if so, which one would you recommend?
I'm using Java 7...

Based on this answer for a similar question, I cooked up a solution for you:
The piece of JavaScript takes an element, iterates over all its child nodes and if they're text nodes, it reads them and returns them concatenated:
var element = arguments[0];
var text = '';
for (var i = 0; i < element.childNodes.length; i++) {
    if (element.childNodes[i].nodeType === Node.TEXT_NODE) {
        text += element.childNodes[i].textContent;
    }
}
return text;
I saved this script into a script.js file and loaded it into a single String via FileUtils.readFileToString() from Apache Commons IO. You can use Guava's Files.toString(), too. Or just embed the script in your Java code.
final String script = FileUtils.readFileToString(new File("script.js"), "UTF-8");
JavascriptExecutor js = (JavascriptExecutor)driver;
...
WebElement element = driver.findElement(By.id("myElement")); // any By locator works here
String text = (String)js.executeScript(script, element);
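To get from the returned string to an actual letter count, something like the following might do. It is only a minimal sketch in plain JavaScript: ownText and countLetters are hypothetical helper names, 3 is the numeric value of Node.TEXT_NODE, and "letter" is assumed to mean any Unicode letter (swap the regex if you would rather count every non-whitespace character).

```javascript
// Concatenate only the direct text-node children of an element.
// Works on real DOM elements and on any object shaped like one.
function ownText(element) {
  var text = '';
  for (var i = 0; i < element.childNodes.length; i++) {
    var child = element.childNodes[i];
    if (child.nodeType === 3) { // 3 === Node.TEXT_NODE
      text += child.textContent;
    }
  }
  return text;
}

// Count letters only (assumption: "letter" means a Unicode letter).
function countLetters(text) {
  var matches = text.match(/\p{L}/gu);
  return matches ? matches.length : 0;
}
```

If you put both functions into the script and end it with return countLetters(ownText(arguments[0]));, executeScript should hand the count back as a number (a Long on the Java side) instead of a String, so the cast would need to change accordingly.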

Related

Chrome extension: Add style to element if it contains particular text

I want to add styling to an element only if it contains a particular string. i.e. if(el contains str) {el:style}.
If I wanted just the links containing w3.org to be pink, how would I find <a href="http://www.w3.org/1999/xhtml">article</a> inside the innerHTML and then style the word "article" on the page?
So far, I can turn ALL the links pink but I can't selectively target the ones containing "www.w3.org".
var links = [...document.body.getElementsByTagName("a")];
for (var i = 0; i < links.length; i++) {
    links[i].style["color"] = "#FF00FF";
}
How would I apply this ONLY to the elements containing the string "w3.org"?
I thought this would be so simple at first! Any and all help is appreciated.
While getElementsByTagName can't filter by partial href values or by contained text when it builds the initial list, you can filter the list after the fact using plain JavaScript:
var links = [...document.body.getElementsByTagName("a")];
for (var i = 0; i < links.length; i++) {
    if (links[i]['href'].indexOf('www.w3.org') == -1) { continue; }
    links[i].style["color"] = "#FF00FF";
}
Assuming you want to filter by the href, that is. If you mean the literal text, you would use links[i]['text'] instead.
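The same filter can be pulled out into small reusable functions. This is just a sketch: linksContaining and colorLinks are made-up names, and the property argument ('href' or 'text') is my own addition to cover both cases mentioned above.

```javascript
// Return the links whose href (or visible text) contains the needle.
// Accepts an array or an array-like collection such as an HTMLCollection.
function linksContaining(links, needle, property) {
  property = property || 'href'; // 'href' or 'text'
  return Array.prototype.filter.call(links, function (link) {
    return String(link[property]).indexOf(needle) !== -1;
  });
}

// Apply a colour to every link in the list.
function colorLinks(links, color) {
  links.forEach(function (link) {
    link.style.color = color;
  });
}

// In a page this could be called as:
// colorLinks(linksContaining(document.getElementsByTagName('a'), 'w3.org'), '#FF00FF');
```

If only the href matters, a CSS substring attribute selector such as document.querySelectorAll('a[href*="w3.org"]') is another option. Note that it matches the attribute as written in the markup, while the href property used above is the resolved absolute URL.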

Loop Through HTML Elements and Nodes

I'm working on an HTML page highlighter project but ran into problems when a search term is the name of an HTML tag, an attribute, or a class/ID name; e.g. if the search terms are "media OR class OR content" then my find and replace would do this:
<link href="/css/DocHighlighter.css" <span style='background-color:yellow;font-weight:bold;'>media</span>="all" rel="stylesheet" type="text/css">
<div <span style='background-color:yellow;font-weight:bold;'>class</span>="container">
I'm using Lucene for highlighting and my current code (sort of):
InputStreamReader xmlReader = new InputStreamReader(xmlConn.getInputStream(), "UTF-8");
Highlighter hl = null; // declared here so it is still in scope inside the loop
if (searchTerms != null && !searchTerms.isEmpty()) { // compare content, not references
    QueryScorer qryScore = new QueryScorer(qp.parse(searchTerms));
    hl = new Highlighter(new SimpleHTMLFormatter(hlStart, hlEnd), qryScore);
}
if (xmlReader != null) {
    BufferedReader br = new BufferedReader(xmlReader);
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
        String tmp = inputLine.trim();
        StringReader strReader = new StringReader(tmp);
        HTMLStripCharFilter htm = new HTMLStripCharFilter(strReader.markSupported() ? strReader : new BufferedReader(strReader));
        String tHL = hl.getBestFragment(analyzer, "", htm);
        tmp = (tHL == null ? tmp : tHL);
        xmlDoc += tmp; // append inside the loop, where tmp is in scope
    }
    br.close();
}
As you can see (if you understand Lucene highlighting), this does an indiscriminate find/replace. Since my document will be HTML and the search terms are dictated by users, there is no way for me to parse on certain elements or tags. Also, since the find/replace basically loops and appends the HTML to a string (the return type of the method), I have to keep all HTML tags and values in place and in order. I've tried using Jsoup to loop through the page, but it handles the HTML tag as one big result. I also tried TagSoup to remove the broken HTML caused by the problem, but it doesn't work correctly. Does anyone know how to loop through the elements and nodes (data values) of HTML?
I've been having the most luck with this:
StringBuilder sb = new StringBuilder();
sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>");
Document doc = Jsoup.parse(txt.getResult());
Elements elements = doc.getAllElements(); // getAllElements() returns an Elements collection
for (Element e : elements) {
    if (!(e.tagName().equalsIgnoreCase("#root"))) {
        sb.append("<" + e.tagName() + e.attributes() + ">" + e.ownText() + "\n");
    } // end if
} // end for
return sb;
The one snag I still get is the nesting isn't always "repaired" properly but still semi close. I'm working more on this.

highlight words in html using regex & javascript - almost there

I am writing a jquery plugin that will do a browser-style find-on-page search. I need to improve the search, but don't want to get into parsing the html quite yet.
At the moment my approach is to take an entire DOM element and all nested elements and simply run a regex find/replace for a given term. In the replace I will simply wrap a span around the matched term and use that span as my anchor to do highlighting, scrolling, etc. It is vital that no characters inside any html tags are matched.
This is as close as I have gotten:
(?<=^|>)([^><].*?)(?=<|$)
It does a very good job of capturing all characters that are not in an html tag, but I'm having trouble figuring out how to insert my search term.
Input: Any html element (this could be quite large, eg <body>)
Search Term: 1 or more characters
Replace Txt: <span class='highlight'>$1</span>
UPDATE
The following regex does what I want when I'm testing with http://gskinner.com/RegExr/...
Regex: (?<=^|>)(.*?)(SEARCH_STRING)(?=.*?<|$)
Replacement: $1<span class='highlight'>$2</span>
However, I am having some trouble using it in my JavaScript. With the following code Chrome is giving me the error "Invalid regular expression: /(?<=^|>)(.*?)(Mary)(?=.*?<|$)/: Invalid group".
var origText = $('#'+opt.targetElements).data('origText');
var regx = new RegExp("(?<=^|>)(.*?)(" + $this.val() + ")(?=.*?<|$)", 'gi');
$('#'+opt.targetElements).each(function() {
    var text = origText.replace(regx, '$1<span class="' + opt.resultClass + '">$2</span>');
    $(this).html(text);
});
It's breaking on the group (?<=^|>) - is this something clumsy or a difference in the Regex engines?
UPDATE
The reason this regex is breaking on that group is that JavaScript did not support regex lookbehinds at the time (they were only added in ECMAScript 2018, so older engines still reject them). For reference & possible solutions: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript.
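One way to get a similar effect without any lookbehind is to split the markup into tags and text first and only run the replace on the text parts. The following is a rough sketch under a few assumptions: highlightOutsideTags and escapeRegExp are hypothetical names, the search term is treated as literal text, and the markup is assumed not to contain a ">" inside attribute values, nor script, style or comment blocks that should be skipped.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap every occurrence of term that sits outside a tag in a highlight span.
function highlightOutsideTags(html, term, cssClass) {
  var termRe = new RegExp('(' + escapeRegExp(term) + ')', 'gi');
  // Splitting on a capturing group keeps the tags in the result array:
  // even indexes hold text, odd indexes hold tags.
  return html.split(/(<[^>]*>)/).map(function (part, i) {
    if (i % 2 === 1) {
      return part; // a tag: leave it untouched
    }
    return part.replace(termRe, '<span class="' + cssClass + '">$1</span>');
  }).join('');
}
```

With the plugin code above, this would replace the origText.replace(regx, ...) call, e.g. highlightOutsideTags(origText, $this.val(), opt.resultClass).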
Just use jQuery's built-in text() method. It will return all the characters in a selected DOM element.
For the DOM approach (docs for the Node interface): Run over all child nodes of an element. If the child is an element node, run recursively. If it's a text node, search in the text (node.data), and if you want to highlight/change something, truncate the node's text at the found position, then insert a highlight span with the matched text and another text node for the rest of the text.
Example code (adjusted, origin is here):
(function iterate_node(node) {
    if (node.nodeType === 3) { // Node.TEXT_NODE
        var text = node.data,
            pos = text.search(/any regular expression/g), // indexOf also applicable
            length = 5; // or whatever you found
        if (pos > -1) {
            node.data = text.substr(0, pos); // split into a part before...
            var rest = document.createTextNode(text.substr(pos+length)); // a part after
            var highlight = document.createElement("span"); // and a part between
            highlight.className = "highlight";
            highlight.appendChild(document.createTextNode(text.substr(pos, length)));
            node.parentNode.insertBefore(rest, node.nextSibling); // insert after
            node.parentNode.insertBefore(highlight, node.nextSibling);
            iterate_node(rest); // maybe there are more matches
        }
    } else if (node.nodeType === 1) { // Node.ELEMENT_NODE
        for (var i = 0; i < node.childNodes.length; i++) {
            iterate_node(node.childNodes[i]); // run recursive on DOM
        }
    }
})(content); // any dom node
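In the example, length is hard-coded and search() only reports the position. If the pattern can match strings of varying length, both values can be taken from a single exec() call. A small sketch, with firstMatch being a hypothetical helper name:

```javascript
// Return the position and length of the first match, or null if there is none.
function firstMatch(text, pattern) {
  // Work on a copy without the g flag so a leftover lastIndex cannot interfere.
  var re = new RegExp(pattern.source, pattern.flags.replace('g', ''));
  var m = re.exec(text);
  return m ? { pos: m.index, length: m[0].length } : null;
}
```

Inside iterate_node you would then use the returned pos and length in place of the two variables. Patterns that can match the empty string should probably be rejected, since a zero-length match would make the recursion on the rest node never terminate.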
There's also highlight.js, which might be exactly what you want.

JSFL: convert text from a textfield to a HTML-format string

I've got a deceptively simple question: how can I get the text from a text field AND include the formatting? Going through the usual docs I found out it is possible to get the text only. It is also possible to get the text formatting, but this only works if the entire text field uses only one kind of formatting. I need the precise formatting so that I convert it to a string with html-tags.
Personally I need this so I can pass it to a custom-made text field component that uses HTML for formatting. But it could also be used to simply export the contents of any text field to any other format. This could be of interest to others out there, too.
Looking for a solution elsewhere I found this:
http://labs.thesedays.com/blog/2010/03/18/jsfl-rich-text/
Which seems to do the reverse of what I need, convert HTML to Flash Text. My own attempts to reverse this have not been successful thus far. Maybe someone else sees an easy way to reverse this that I’m missing? There might also be other solutions. One might be to get the EXACT data of the text field, which should include formatting tags of some kind(XML, when looking into the contents of the stored FLA file). Then remove/convert those tags. But I have no idea how to do this, if at all possible. Another option is to cycle through every character using start- and endIndex, and storing each formatting kind in an array. Then I could apply the formatting to each character. But this will result in excess tags. Especially for hyperlinks! So can anybody help me with this?
A bit late to the party, but the following function takes a JSFL static text element as input and returns an HTML string (using the Flash-friendly <font> tag) based on the styles found in its textRuns array. It uses a bit of basic regex to clean up some tags, double spaces etc. and to convert \r and \n to <br/> tags. It's probably not perfect, but hopefully it's easy enough to see what's going on so you can change or fix it.
function tfToHTML(p_tf)
{
    var textRuns = p_tf.textRuns;
    var html = "";
    for ( var i=0; i<textRuns.length; i++ )
    {
        var textRun = textRuns[i];
        var chars = textRun.characters;
        chars = chars.replace(/\n/g,"<br/>");
        chars = chars.replace(/\r/g,"<br/>");
        chars = chars.replace(/ {2}/g," "); // collapse double spaces
        chars = chars.replace(/\. <br\/>/g,".<br/>"); // drop the space before a line break
        var attrs = textRun.textAttrs;
        var font = attrs.face;
        var size = attrs.size;
        var bold = attrs.bold;
        var italic = attrs.italic;
        var colour = attrs.fillColor;
        if ( bold )
        {
            chars = "<b>"+chars+"</b>";
        }
        if ( italic )
        {
            chars = "<i>"+chars+"</i>";
        }
        chars = "<font size=\""+size+"\" face=\""+font+"\" color=\""+colour+"\">"+chars+"</font>";
        html += chars;
    }
    return html;
}
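Regarding the worry about excess tags: if neighbouring runs turn out to carry identical formatting, they could be merged before the HTML is built. This is only a sketch, not something tested inside Flash. mergeRuns and sameFormat are hypothetical helpers, and only the five attributes read by tfToHTML are compared, so any other attribute you care about would have to be added to the comparison.

```javascript
// Two runs can share one tag if every attribute that affects the markup is equal.
function sameFormat(a, b) {
  return a.face === b.face && a.size === b.size && a.bold === b.bold &&
         a.italic === b.italic && a.fillColor === b.fillColor;
}

// Merge neighbouring runs with identical formatting into a single run.
function mergeRuns(textRuns) {
  var merged = [];
  for (var i = 0; i < textRuns.length; i++) {
    var run = textRuns[i];
    var last = merged[merged.length - 1];
    if (last && sameFormat(last.textAttrs, run.textAttrs)) {
      last.characters += run.characters;
    } else {
      // Copy into a plain object so the original runs are not modified.
      merged.push({ characters: run.characters, textAttrs: run.textAttrs });
    }
  }
  return merged;
}
```

Since tfToHTML only reads the textRuns property, it could be fed the merged list as tfToHTML({ textRuns: mergeRuns(p_tf.textRuns) }).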

HTML Agility Pack - Get Page Summary

How would I use the HTML Agility Pack to get the First Paragraph of text from the body of an HTML file. I'm building a DIGG style link submission tool, and want to get the title and the first paragraph of text. Title is easy, any suggestions for how I might get the first paragraph of text from the body? I guess it could be within P or DIV depending on the page.
Is this HTML that you control? If so, you could give the p an id or a class and find it via
//p[@id=\"YOUR ID\"] or //p[@class=\"YOUR CLASS\"]
EDIT:
Since you don't control the html, maybe the below will work. It takes all the HtmlTextNodes and tries to find a grouping of text greater than the threshold specified. It's far from perfect but might get you going in the right direction.
String summary = FindSummary(page.DocumentNode);

private const int THRESHOLD = 50;

private String FindSummary(HtmlAgilityPack.HtmlNode node) {
    foreach (HtmlAgilityPack.HtmlNode childNode in node.ChildNodes) {
        if (childNode.GetType() == typeof(HtmlAgilityPack.HtmlTextNode)) {
            if (childNode.InnerText.Length >= THRESHOLD) {
                return childNode.InnerText;
            }
        }
        String summary = FindSummary(childNode);
        if (summary.Length >= THRESHOLD) {
            return summary;
        }
    }
    return String.Empty;
}
The Agility Pack uses XPath for querying the HTML, so you can just use a simple XPath statement. Something like...
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(content);
HtmlNode firstParagraph = htmldoc.DocumentNode.SelectSingleNode("(//p)[1]"); // a bare //p[1] would match the first <p> under every parent