How do I test for the percentage of bold text on a webpage? - html

I have an instance where I need to test how page content is styled (not necessarily only with CSS).
For example, a test (cucumber) I would like to write is:
In order to standardize text weight
As a webmaster
I want to be told the percentage of bold text on the page
The problem is that I'm having a hard time figuring out how to actually generate this result. Looking at various HTML testing frameworks (Selenium, Watir, Capybara), it seems I can only test for the presence of tags or CSS classes, not the calculated visual result.
In Firebug, I can see the calculated CSS result (which works for <strong>, <b>, and font-weight:bold definitions), but I need to be able to put this into a testing framework to run under CI.

In Watir, you can get access to an element's font-weight by directly accessing the win32ole object. For example:
ie.div(:index, 1).document.currentStyle.fontWeight
This will give you a number representing the weight, as described in http://www.w3schools.com/cssref/pr_font_weight.asp
What I think you would then need to do is iterate through all elements on the page, checking each element's fontWeight and how much text it contains. The way you do that will depend on the page you are testing.
Solution 1 - If all text is in divs that are leaf nodes:
If all your text is in leaf nodes like this:
<body>
<div style='font-weight:bold'>Bold</div>
<div>Plain</div>
</body>
You could easily do:
bold_text = 0
plain_text = 0
ie.divs.each do |x|
  if x.document.currentStyle.fontWeight >= 700
    bold_text += x.text.length
  else
    plain_text += x.text.length
  end
end
Solution 2 - If styles interact or using multiple elements:
If not all of the text is in leaf nodes, or you use other tags like <b> (see the example HTML below), you would need a more complicated check. This is because .text returns all text in the element, including that of its child elements.
<body>
<div style='font-weight:normal'>
Start
<div style='font-weight:bold'>Bold1</div>
<div style='font-weight:bold'>Bold2</div>
End
</div>
<b>Bold Text</b>
</body>
In this case, I believe the following works for most cases (but may need refinement):
# Counting letters, but you could easily change to words.
bold_count = 0
plain_count = 0
# Check all elements, though you can restrict this to a particular containing element if desired.
node_list = ie.document.getElementsByTagName("*")
0.upto(node_list.length - 1) do |i|
  # Name the node so it is easier to work with.
  node = node_list["#{i}"]
  # Determine if the text for the current node is bold or not.
  # Note that this works in IE. You might need to modify it for other browsers.
  bold = node.currentStyle.fontWeight >= 700
  # Go through the childNodes. If the node is text, count it. Otherwise ignore it.
  node.childNodes.each do |child|
    unless child.nodeValue.nil?
      if bold
        bold_count += child.nodeValue.length
      else
        plain_count += child.nodeValue.length
      end
    end
  end
end
# Output the number of bold and non-bold characters. These can be used to determine your percentage.
puts bold_count
puts plain_count
It is not a very Watir-like solution, but hopefully solves your problem.
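To turn the two counts into the percentage the question asks for, the same walk can be sketched outside the browser. This is only an illustration in JavaScript: the node model (`weight`, `children`) and the function names are hypothetical stand-ins for the real DOM and its computed styles, and trimming each text node is an assumption about how padding whitespace should be treated.

```javascript
// Hypothetical node model: { weight?: number, children: Array<string | node> }.
// A missing weight means "inherit from the parent", mirroring how font-weight cascades.
function countBoldChars(node, inheritedWeight = 400, totals = { bold: 0, plain: 0 }) {
  const weight = node.weight ?? inheritedWeight;
  for (const child of node.children) {
    if (typeof child === "string") {
      // Only direct text children are counted here, so nested text is never counted twice.
      const length = child.trim().length;
      if (weight >= 700) totals.bold += length;
      else totals.plain += length;
    } else {
      countBoldChars(child, weight, totals);
    }
  }
  return totals;
}

// Percentage of bold characters; an empty page counts as 0% bold.
function boldPercentage(totals) {
  const all = totals.bold + totals.plain;
  return all === 0 ? 0 : (100 * totals.bold) / all;
}

// Model of the "Solution 2" HTML above.
const page = {
  children: [
    {
      weight: 400,
      children: [
        "Start",
        { weight: 700, children: ["Bold1"] },
        { weight: 700, children: ["Bold2"] },
        "End",
      ],
    },
    { weight: 700, children: ["Bold Text"] },
  ],
};

const pageTotals = countBoldChars(page);
console.log(pageTotals, boldPercentage(pageTotals).toFixed(1) + "%");
```

In a real browser the weight would come from the computed style of each text node's parent rather than from a hand-built tree.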

Related

RegExp to search text inside HTML tags

I'm having some difficulty using a RegExp to search for text between HTML tags. This is for a search function that searches the text of an HTML page without matching characters inside the tags or attributes. When a match is found, I surround it with a div and assign it a highlight class to highlight the search words in the page. If the RegExp also matches inside tags or attributes, the HTML code gets corrupted.
Here is the HTML code:
<html>
<span>assigned</span>
<span>Assigned > to</span>
<span>assigned > to</span>
<div>ticket assigned to</div>
<div id="assigned" class="assignedClass">Ticket being assigned to</div>
</html>
and the current RegExp I've come up with is:
(?<=(>))assigned(?!\<)(?!>)/gi
which matches if assigned or Assigned is the start of text in a tag, but not on the others. It does a good job of ignoring the attributes and tags but it is not working well if the text does not start with the search string.
Can anyone help me out here? I've been working on this for an hour now but can't find a solution (RegExp noob here...).
UPDATE 2
https://regex101.com/r/ZwXr4Y/1 shows the remaining problem regarding HTML entities and HTML comments.
The remaining problem when searching is that HTML entities are not ignored; all text inside HTML entities and comments should be ignored. So when searching for "b", it should not match inside an HTML entity, even if the entity sits correctly between HTML tags.
Update #2
Regex:
(<)(script[^>]*>[^<]*(?:<(?!\/script>)[^<]*)*<\/script>|\/?\b[^<>]+>|!(?:--\s*(?:(?:\[if\s*!IE]>\s*-->)?[^-]*(?:-(?!->)-*[^-]*)*)--|\[CDATA[^\]]*(?:](?!]>)[^\]]*)*]])>)|(e)
Usage:
html.replace(/.../g, function(match, p1, p2, p3) {
  return p3 ? "<div class=\"highlight\">" + p3 + "</div>" : match;
})
Explanation:
As you went through more situations, I had to modify the regex to cover more cases. This version covers almost all of them. How it works:
Captures all <script> tags and their contents
Captures all CDATA blocks
Captures all HTML tags (opening / closing)
Captures all HTML comments (as well as IE conditional comments)
Captures all targeted strings, defined in the last group, inside the remaining text (here it is (e))
Doing so lets us quickly manipulate our target, e.g. wrap it in tags as shown in the usage section. Performance-wise, I tried to write it in a way that performs well.
This regex doesn't provide a 100% guarantee of matching the correct positions (closer to 99%), but it should give the expected results most of the time and can easily be modified later.
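The "match the markup first, then the target" idea can be shown with a much smaller pattern. The sketch below is not the regex from this answer: it only skips plain tags (no script, CDATA or comment handling), and the helper name `highlight` is made up for illustration.

```javascript
// Simplified sketch: the first alternative consumes anything that looks like a
// tag, so the search term can only match in the text that is left over.
function highlight(html, term) {
  // Escape regex metacharacters so the term is matched literally.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("(<[^>]*>)|(" + escaped + ")", "gi");
  return html.replace(pattern, (match, tag, hit) =>
    hit ? '<div class="highlight">' + hit + "</div>" : match
  );
}

const sample = '<div id="assigned" class="assignedClass">Ticket being assigned to</div>';
console.log(highlight(sample, "assigned"));
```

Because tags are consumed by the first group and returned unchanged, the id and class attributes in the sample are left alone and only the word in the text is wrapped.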
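Try this:
string.match(/<.{1,15}>(.*?)<\/.{1,15}>/g)
The pattern <.{1,15}>(.*?)</.{1,15}> captures anything between an opening and a closing HTML tag, so in
<any> Content </any>
the content is the target. For example, given
<div> this is the content </div>
the capture group holds "this is the content" (with its surrounding spaces).
Note that with the g flag, match() returns the whole matches including the tags; use matchAll() or exec() in a loop to read the capture group.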

Extract whitespace-collapsed text from html as it would be rendered

I use an html parser (Neko) in order to extract the free-text of an html document.
Since I'm interested in the text's semantics, I must pay special attention to the distance between words as it appears in the browser.
For example:
<H1>My
title</H1>
<P>Hello
World</P>
Is rendered as:
My title
Hello World
whereas putting the paragraph inside <pre> tags, or applying this style:
<style>
p { white-space:pre; }
</style>
would result in:
My title
Hello
World
which I would like to treat differently, since in that case "Hello" is not semantically tied to the word "World". As said in other posts, there's a difference between what parsing does and what rendering does. I'm interested in the connection between words as it appears after rendering, since parsing obviously doesn't collapse whitespace the way a browser does.
Is there any way to extract whitespace-collapsed text from HTML as it's read in the browser?
I have not used Neko before, but you will need to access the styles of the elements and see if the white-space property is set to pre, pre-wrap, or pre-line.
If it is either pre or pre-wrap, do not modify the text.
Else if it is pre-line, only replace groups of spaces/tabs with a single space.
Else, replace any whitespace group in the text with a single space.
Here's an example using jQuery:
function getRenderedText(obj) {
  var text = obj.text();
  var renderedText;
  switch (obj.css('white-space')) {
    case 'pre':
    case 'pre-wrap':
      // Whitespace is preserved as written.
      renderedText = text;
      break;
    case 'pre-line':
      // The g flag is needed so that every run is collapsed, not just the first.
      renderedText = text.replace(/[ \t]+/g, ' ');
      break;
    default:
      renderedText = text.replace(/\s+/g, ' ');
  }
  return renderedText;
}
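The same rule can be tried without jQuery or a browser by passing the white-space value in directly. This is a rough sketch: `collapseWhitespace` is a made-up name, only the values discussed above are distinguished, and finer points of real rendering (such as spaces being dropped around line breaks) are ignored.

```javascript
// Collapse whitespace in `text` roughly the way the given white-space value would.
// Any value other than pre, pre-wrap or pre-line is treated like "normal".
function collapseWhitespace(text, whiteSpace) {
  switch (whiteSpace) {
    case "pre":
    case "pre-wrap":
      return text; // whitespace is preserved as written
    case "pre-line":
      return text.replace(/[ \t]+/g, " "); // spaces and tabs collapse, newlines survive
    default:
      return text.replace(/\s+/g, " "); // every whitespace run becomes one space
  }
}

const raw = "Hello\n  World";
console.log(JSON.stringify(collapseWhitespace(raw, "normal")));   // "Hello World"
console.log(JSON.stringify(collapseWhitespace(raw, "pre-line"))); // "Hello\n World"
console.log(JSON.stringify(collapseWhitespace(raw, "pre")));      // "Hello\n  World"
```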
Take a look at this basic info on w3schools:
http://www.w3schools.com/cssref/pr_text_white-space.asp
and a somewhat better explanation with examples:
http://css-tricks.com/almanac/properties/w/whitespace/
I also think you have to put "Hello" in one <p> and "World" in another for the effect to work; otherwise they both go to the right.

how to apply font properties on <span> while passing html to pdf using itextsharp

I am converting HTML to PDF using iTextSharp and I want to set the font size for <span> tags. How can I do this?
Currently I am using:
StyleSheet styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.SPAN, HtmlTags.FONTSIZE, "9f");
string contents = File.ReadAllText(Server.MapPath("~/PDF TEMPLATES/DeliveryNote.html"));
List<IElement> parsedHtmlElements = HTMLWorker.ParseToList(new StringReader(contents), styles);
But it didn't work.
The constants listed in HtmlTags are actually a hodgepodge of HTML tags and HTML and CSS properties and values and it can be a little tricky sometimes figuring out what to use.
In your case try HtmlTags.SIZE instead of HtmlTags.FONTSIZE and you should get what you want.
EDIT
I've never really seen a good tutorial on what properties do what, I usually just go directly to the source code. For instance, in the ElementFactory class there's a method called GetFont() that shows how font information is parsed. Specifically on line 130 (of revision 229) you'll see where the HtmlTags.SIZE is used. However, the actual value for the size is parsed in ChainedProperties in a method called AdjustFontSize(). If you look at it you'll see that it first looks for a value that ends with pt such as 12pt. If it finds that then it drops the pt and parses the number literally. If it doesn't end with pt it jumps over to HtmlUtilities to a method called GetIndexedFontSize(). This method is expecting either values like +1 and -1 for relative sizes or just integers like 2 for indexed sizes. Per the HTML spec user agents are supposed to accept values 1 through 7 for the font size and map those to a progressively increasing font size list. What this means is that your value of 9f is actually not a valid value to pass to this, you should probably be passing 9pt instead.
Anyway, you kind of have to jump around in the source to figure out what's being parsed where.

variable or tag

Hello, I've never understood the difference between a variable and a tag. Can anybody help me? Is there one at all, and how bad is it if you mix them up?
An example of tags would be all the <bracketed> things in this HTML snippet:
<p>This is a sentence with <em>tags</em>. Tags add <b>meaning</b> to text.</p>
The tags add meaning to the text ("this is a <p>aragraph", "this should be <em>phasized", "this should be <b>old"). Any consumer (any program or human reading this text) may do with it what he likes. A web browser would choose to render the text like so:
This is a sentence with tags. Tags add meaning to text.
Other consumers may display the text as-is including the tags, or may discard the tags. Tags that do not exist as part of an agreed standard are ignored. As such tags can't be declared, they're just used.
Variables are a mathematical thing and do not exist in HTML. HTML is a passive markup language. Variables OTOH are used in calculations and computations:
var a = 5;
var b = 10;
var c = a + b;
b = 42;
a, b and c are variables that hold values. Variables are declared into existence (var a), they do not exist before you declare that you want to use them and their names are completely arbitrary (as opposed to HTML tags, which are agreed upon in advance in the HTML spec). Their value varies (e.g. the value of b changes from 10 to 42), hence "variables".
CSS is sort of a mix. In CSS, you can declare styles:
.foobar {
  font-size: 200%;
}
This says that any HTML element (tag) with the class "foobar" should have a font size of 200%. This is declared arbitrarily, i.e. you can choose any name for .foobar and add new styles at any time. CSS did not traditionally have variables, though; custom properties (declared like --main-size and read back with var()) were added to the standard later.
Hope that helps.

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by a visitor to that page. That includes 'masking' out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but I don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document contains no nested tags, regex still requires that every tag be properly closed.
If you are using PHP, for simplicity, I strongly recommend using DOM (Document Object Model) to parse/extract HTML documents. A DOM library exists for most programming languages.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
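For comparison, the same match-and-dispatch idea can be sketched in JavaScript. This is an illustration under the same assumptions as the VB.Net version, not a replacement for a real parser: the names `MARKUP_OR_TEXT` and `replaceInText` are made up, and the pattern shares the weaknesses raised in the other answers (for example an unescaped > inside an attribute value).

```javascript
// Markup alternatives come first so they consume script/style blocks, comments
// and tags; only the final group captures visible text.
const MARKUP_OR_TEXT =
  /<(script|style)\b[\s\S]*?<\/\1>|<!--[\s\S]*?-->|<[\s\S]*?>|([^<>]+)/gi;

// Replace every occurrence of `from` with `to`, but only inside text runs.
function replaceInText(html, from, to) {
  return html.replace(MARKUP_OR_TEXT, (match, _tag, text) =>
    text === undefined ? match : text.split(from).join(to)
  );
}

const source =
  '<p class="Original word">Original word</p>' +
  '<script>var s = "Original word";</script>';
console.log(replaceInText(source, "Original word", "Replacement word"));
```

Only the paragraph text changes here; the class attribute and the string inside the script block are returned exactly as they were matched.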
For your information:
Instead of regex, with jQuery it's possible to extract only the text from HTML markup. You can use the following pattern, where htmlString holds the markup:
$("<div/>").html(htmlString).text()