Extract whitespace-collapsed text from html as it would be rendered - html

I use an HTML parser (Neko) to extract the free text of an HTML document.
Since I'm interested in the text's semantics, I must pay special attention to the distance between words as it appears in the browser.
For example:
<H1>My
title</H1>
<P>Hello
World</P>
is rendered as:
My title
Hello World
whereas wrapping the paragraph in <pre> tags, or applying this style:
<style>
p { white-space:pre; }
</style>
would result in:
My title
Hello
World
which I would like to treat differently, since in that case "Hello" is not semantically tied to the word "World". As said in other posts, there's a difference between what parsing does and what rendering does. I'm interested in the connection between words as it appears after rendering, since parsing obviously doesn't collapse whitespace the way a browser does when displaying the page.
Is there any way to extract whitespace-collapsed text from HTML as it is read in a browser?

I have not used Neko before, but you will need to access the styles of the elements and see if the white-space property is set to pre, pre-wrap, or pre-line.
If it is either pre or pre-wrap, do not modify the text.
Else if it is pre-line, only replace groups of spaces/tabs with a single space.
Else, replace any whitespace group in the text with a single space.
Here's an example using jQuery:
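function getRenderedText(obj) {
    var text = obj.text();
    var renderedText;
    switch (obj.css('white-space')) {
        case 'pre':
        case 'pre-wrap':
            // whitespace is preserved as written
            renderedText = text;
            break;
        case 'pre-line':
            // collapse runs of spaces/tabs but keep line breaks
            // (the g flag is needed to replace every run, not just the first)
            renderedText = text.replace(/[ \t]+/g, ' ');
            break;
        default:
            // collapse every run of whitespace, including line breaks
            renderedText = text.replace(/\s+/g, ' ');
    }
    return renderedText;
}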
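If you are not working in a browser, the same three rules can be sketched in plain Python. This is only a minimal sketch: the function name collapse_whitespace and the idea that you already know the element's white-space value (passed in as a string) are assumptions of mine, not something Neko gives you directly.

```python
import re

def collapse_whitespace(text, white_space="normal"):
    """Approximate browser whitespace collapsing for a single run of text.

    `white_space` is the CSS white-space value assumed to apply to the text.
    """
    if white_space in ("pre", "pre-wrap"):
        # whitespace is preserved as written
        return text
    if white_space == "pre-line":
        # collapse runs of spaces/tabs, keep line breaks
        return re.sub(r"[ \t]+", " ", text)
    # any other value: collapse every run of whitespace, line breaks included
    return re.sub(r"\s+", " ", text)

print(repr(collapse_whitespace("Hello\nWorld")))
print(repr(collapse_whitespace("Hello\nWorld", "pre")))
```

Like the jQuery version, it does not trim leading or trailing whitespace and knows nothing about block boundaries between elements.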

Take a look at this basic info on w3schools:
http://www.w3schools.com/cssref/pr_text_white-space.asp
and this page, which explains it a bit better with examples:
http://css-tricks.com/almanac/properties/w/whitespace/
I also think that you have to put "Hello" in one <p> and "World" in another for the effect to work.
Otherwise they both go to the right.

Related

HTML text don't rendering as in code lines [duplicate]

I have an MVC3 app that has a details page. As part of that I have a description (retrieved from a db) that has spaces and new lines. When it is rendered the new lines and spaces are ignored by the html. I would like to encode those spaces and new lines so that they aren't ignored.
How do you do that?
I tried HTML.Encode but it ended up displaying the encoding (and not even on the spaces and new lines but on some other special characters)
Just style the content with white-space: pre-wrap;.
div {
    white-space: pre-wrap;
}
<div>
This is some text with some extra spacing and a
few newlines along with some trailing spaces
and five leading spaces thrown in
for good
measure
</div>
Have you tried using the <pre> tag?
<pre>
Text with
multiple line breaks embedded between pre tag
will work and
also tabs..will work
it will preserve the formatting..
</pre>
You can use white-space: pre-line to preserve line breaks in formatting. There is no need to manually insert html elements.
.popover {
    white-space: pre-line;
}
Or add style="white-space: pre-line;" to your HTML element.
You would want to replace all spaces with &nbsp; (non-breaking space) and all new lines \n with <br> (line break in HTML). This should achieve the result you're looking for.
body = body.replace(' ', '&nbsp;').replace('\n', '<br>');
Something of that nature.
I tried the white-space: pre-wrap; technique suggested by pete, but if the string was long and continuous it ran out of the container and didn't wrap, for a reason I didn't have time to investigate. If you are having the same problem: I ended up using <pre> tags with the following CSS, and everything was good to go.
pre {
    font-size: inherit;
    color: inherit;
    border: initial;
    padding: initial;
    font-family: inherit;
}
As you mentioned on @Developer's answer, I would probably HTML-encode on user input. If you are worried about XSS, you probably never need the user's input in its original form, so you might as well escape it (and replace spaces and newlines while you are at it).
Note that escaping on input means you should either use @Html.Raw or create an MvcHtmlString to render that particular input.
You can also try
System.Security.SecurityElement.Escape(userInput)
but I think it won't escape spaces either. So in that case, I suggest simply doing this in .NET:
System.Security.SecurityElement.Escape(userInput).Replace(" ", "&nbsp;").Replace("\n", "<br>")
on user input.
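The order of the two steps matters: escape first, then insert the &nbsp; and <br> markup, otherwise the escaping step would mangle the entities you just added. Here is the same idea as a minimal Python sketch (the function name to_display_html is made up for illustration; html.escape is from the standard library):

```python
from html import escape

def to_display_html(user_input):
    # 1. Escape &, <, > and quotes so the input cannot inject markup.
    escaped = escape(user_input)
    # 2. Only then add the markup that preserves spaces and line breaks.
    return escaped.replace(" ", "&nbsp;").replace("\n", "<br>")

print(to_display_html("a <b>\nc"))
```

Note that replacing every space with &nbsp; also prevents the text from wrapping at those spaces, which may or may not be what you want.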
And if you want to dig deeper into usability, perhaps you can do an XML parse of the user's input (or play with regular expressions) to only allow a predefined set of tags.
For instance, allow
<p>, <span>, <strong>
... but don't allow
<script> or <iframe>
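To illustrate the allow-list idea, here is a rough Python sketch built on the standard library's html.parser. The names ALLOWED_TAGS, AllowListFilter and filter_tags are hypothetical, and this is not a vetted sanitizer: it only shows the principle of re-emitting known tags (without their attributes) and escaping everything else.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "span", "strong"}  # hypothetical allow-list

class AllowListFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Re-emit allowed tags only, and deliberately drop all attributes.
        if tag in ALLOWED_TAGS:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always escaped, including text found inside dropped tags.
        self.parts.append(escape(data))

def filter_tags(user_input):
    parser = AllowListFilter()
    parser.feed(user_input)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(filter_tags('<p>Hi <strong>there</strong><script>alert(1)</script></p>'))
```

The <script> tags themselves are dropped, while their content is kept as escaped plain text. For anything security-sensitive, a maintained sanitizing library is the safer choice.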
There is a simple way to do it. I tried it in my app and it worked pretty well.
Just write:
$text = $row["text"];
echo nl2br($text);

Getting two different result from same xpath

HTML Code:
<div class="deviceName truncate"><a ng-href="#" href="#" style="">Hello  World</a></div>
In each <a> element, the link text contains a double space, as in "Hello  World".
Retrieving the information into a List:
List<WebElement> findAllUserName = driver.findElements(
        By.xpath("//div[@class='deviceName truncate']//a[text()]"));
for (WebElement webElement : findAllUserName) {
    String findUserText = webElement.getText();
    System.out.println(findUserText);
}
It gives list results with a single space: "Hello World".
How should I overcome this when comparing text?
The concern behind this is that I want to compare a list element with a given string:
driver.findElement(By.xpath("//div[@class='deviceName truncate']//a[contains(text(),'" + name + "')]"))
and this XPath takes the double space into account.
If you check how your browser renders the link, you will see that it also shows a single space. See this discussion: Do browsers remove whitespace in between text of any html tag
So you should either add a style to your page like this:
a {
    white-space: pre;
}
That will make all your links unformatted.
Or try to inject the style into the particular element on the fly, as shown here: How can i set new style of element using selenium web-driver
Here is an example with two identical links but with different styles set.
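If changing the page style is not an option, another way to look at it is to normalize both strings before comparing them, so the double space in the source and the single space returned by getText() no longer matter. A minimal sketch in Python (the helper names are made up for illustration; for spaces, tabs and newlines this mirrors what XPath's normalize-space() function does):

```python
def normalize_space(s):
    # Split on any run of whitespace and rejoin with single spaces;
    # this also strips leading and trailing whitespace.
    return " ".join(s.split())

def same_visible_text(a, b):
    return normalize_space(a) == normalize_space(b)

print(same_visible_text("Hello  World", "Hello World"))
```

One difference to be aware of: Python's str.split() treats more characters as whitespace (for example the non-breaking space) than normalize-space() does.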

Why does Qt::mightBeRichText() not detact HTML table tags as rich text?

I'm using an HTML table within a QML Text component. My problem is that textFormat: Text.AutoText does not automatically recognize my HTML table as rich text (QML Text documentation).
Searching for a solution, I found "HTML formatting in QML Text", which is quite close to my problem.
The solution given there, just setting textFormat: Text.RichText, I already knew. But I cannot use it, because setting textFormat: Text.RichText also changes how the contentWidth of the QML Text component behaves.
Text {
    id: myPlainText
    width: 500
    wrapMode: Text.Wrap
    text: "Hallo stackoverflow.com"
    textFormat: Text.AutoText
}
Text {
    id: myRichText
    width: 500
    wrapMode: Text.Wrap
    text: "Hallo stackoverflow.com"
    textFormat: Text.RichText
}
Accessing myPlainText.contentWidth gives me the width actually used by the text, even if it is shorter than 500.
Accessing myRichText.contentWidth always gives me 500.
For me, the actually used width (which is what contentWidth reports when no RichText is involved) is important for layout reasons, as this is what my component is mostly used for. Hitting the width limit (e.g. 500) for HTML tables would be OK, even though I would prefer knowing the actual table width.
From the documentation:
If the text format is Text.AutoText the Text item will automatically determine whether the text should be treated as styled text. This determination is made using Qt::mightBeRichText() which uses a fast and therefore simple heuristic. It mainly checks whether there is something that looks like a tag before the first line break. Although the result may be correct for common cases, there is no guarantee.
As you can see, it distinguishes between plain and styled text.
The third category, RichText, is not supported by AutoText.
This means that for AutoText you need to resort to the reduced set of tags listed in the documentation:
<b></b> - bold
<strong></strong> - bold
<i></i> - italic
<br> - new line
<p> - paragraph
<u> - underlined text
<font color="color_name" size="1-7"></font>
<h1> to <h6> - headers
<a href=""> - anchor
<img src="" align="top,middle,bottom" width="" height=""> - inline images
<ol type="">, <ul type=""> and <li> - ordered and unordered lists
<pre></pre> - preformatted
&gt; &lt; &amp;
If you need the width of your text, try to use
myRichText.implicitWidth
This will give you the width of the text if it is not wrapped.
Probably, due to its advanced possibilities, RichText always works with the maximum contentWidth. For the same reason it is not possible to use e.g. elide together with RichText. The unexpected behavior of contentWidth, however, seems like a bug to me, either in the source or, more likely, in the documentation.

RegExp to search text inside HTML tags

I'm having some difficulty using a RegExp to search for text between HTML tags. This is for a search function that searches the text of an HTML page without matching characters in the tags or attributes of the HTML. When a match is found, I surround it with a div and assign it a highlight class to highlight the search words in the HTML page. If the RegExp also matches inside tags or attributes, the HTML code becomes corrupt.
Here is the HTML code:
<html>
<span>assigned</span>
<span>Assigned > to</span>
<span>assigned > to</span>
<div>ticket assigned to</div>
<div id="assigned" class="assignedClass">Ticket being assigned to</div>
</html>
and the current RegExp I've come up with is:
/(?<=(>))assigned(?!\<)(?!>)/gi
which matches if assigned or Assigned is at the start of the text in a tag, but not in the other cases. It does a good job of ignoring the attributes and tags, but it does not work well if the text does not start with the search string.
Can anyone help me out here? I've been working on this for an hour now but can't find a solution (RegExp noob here).
UPDATE 2
https://regex101.com/r/ZwXr4Y/1 shows the remaining problem regarding HTML entities and HTML comments.
When searching, the remaining problem is that an HTML entity such as &nbsp; is not ignored; all text inside HTML entities and comments should be ignored. So when searching for "b" it should not match inside the entity, even if the HTML entity is correctly placed between HTML tags.
Update #2
Regex:
(<)(script[^>]*>[^<]*(?:<(?!\/script>)[^<]*)*<\/script>|\/?\b[^<>]+>|!(?:--\s*(?:(?:\[if\s*!IE]>\s*-->)?[^-]*(?:-(?!->)-*[^-]*)*)--|\[CDATA[^\]]*(?:](?!]>)[^\]]*)*]])>)|(e)
Usage:
html.replace(/.../g, function(match, p1, p2, p3) {
    return p3 ? "<div class=\"highlight\">" + p3 + "</div>" : match;
})
Explanation:
As you went through more and more different situations, I had to modify the RegEx to cover more possible cases. But now I have come up with this one, which covers almost all cases. How it works:
Captures all <script> tags and their contents
Captures all CDATA blocks
Captures all HTML tags (opening / closing)
Captures all HTML comments (as well as IE conditional statements)
Captures all targeted strings, defined in the last group, inside the remaining text (here it is (e))
Doing so lets us quickly manipulate our target, e.g. wrap it in tags as shown in the usage section. Performance-wise, I tried to write it in a way that performs well.
This RegEx doesn't provide a 100% guarantee of matching the correct positions (it does about 99% of the time), but it should give the expected results most of the time and can easily be modified later.
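The underlying trick (let the pattern consume everything that must be skipped first, and only act when the last group matched) can be shown with a much simpler pattern. The following Python sketch is an illustration only: its pattern is a deliberately simplified stand-in, not the regex above, the function name highlight is made up, and it assumes a non-empty search term and reasonably well-formed HTML.

```python
import re

def highlight(html, term):
    # Group 1 swallows comments, script blocks, tags and entities;
    # group 2 is the search term in whatever text remains.
    pattern = re.compile(
        r"(<!--.*?-->|<script\b[^>]*>.*?</script>|<[^>]+>|&#?\w+;)"
        r"|(" + re.escape(term) + r")",
        re.IGNORECASE | re.DOTALL,
    )

    def repl(match):
        if match.group(2) is not None:
            # A real hit in the text: wrap it.
            return '<div class="highlight">' + match.group(2) + "</div>"
        # Markup, comment or entity: return it unchanged.
        return match.group(0)

    return pattern.sub(repl, html)

print(highlight('<div id="assigned" class="assignedClass">Ticket being assigned to</div>', "assigned"))
```

Because the alternatives are tried left to right at each position, the attribute values inside the opening tag are consumed as part of the tag and never reach the search-term group.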
Try this:
string.match(/<.{1,15}>(.*?)<\/.{1,15}>/g)
The pattern <.{1,15}>(.*?)</.{1,15}> matches anything that sits between HTML tags, so the content of
<any> Content </any>
will be the target or the result for example
<div> this is the content </content>
"this is the content" this is the result

Inserting LTR marks automatically

I am working with bidirectional text (mixed English and Hebrew) for a project. The text is displayed in HTML, so sometimes an LTR or RTL mark (&lrm; or &rlm;) is required to make 'weak characters' like punctuation display properly. These marks are not present in the source text due to technical limitations, so we need to add them in order for the final displayed text to appear correct.
For instance, the following text: (example: מדגם) sample renders as sample (מדגם :example) in right-to-left mode. The corrected string would look like &lrm;(example:&lrm; מדגם) sample and would render as sample (מדגם (example:.
We'd like to do on-the-fly insertion of these marks rather than re-authoring all the text. At first this seems simple: just append an &lrm; to each instance of punctuation. However, some of the text that needs to get modified on-the-fly contains HTML and CSS. The reasons for this are unfortunate and unavoidable.
Short of parsing HTML/CSS, is there a known algorithm for on-the-fly insertion of Unicode directional marks (pseudo-strong characters)?
I don't know of an algorithm to insert directional marks into an HTML string safely without parsing it. Parsing the HTML into a DOM and manipulating the text nodes is the safest way of ensuring you don't accidentally add directional marks to text inside <script> and <style> tags.
Here is a short Python script which might help you transform your files automatically (written here for Python 3). The logic should be easy to translate into other languages if necessary. I'm not familiar enough with the RTL rules you're trying to encode, but you can tweak the regexp r'(\W)([^\W]+)(\W)' and the substitution pattern '\u200e\\1\\2\\3\u200e' to get your expected result:
import re
import lxml.html

# a non-word character, a run of word characters, then another non-word character
_RE_REPLACE = re.compile(r'(\W)([^\W]+)(\W)', re.M)

def _replace(text):
    if not text:
        return text
    # wrap each match in left-to-right marks (U+200E)
    return _RE_REPLACE.sub('\u200e\\1\\2\\3\u200e', text)

text = '''
<html><body>
<div>sample (\u05de\u05d3\u05d2\u05dd :example)</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>
'''

# convert the text into an html dom
tree = lxml.html.fromstring(text)
body = tree.find('body')

# iterate over all descendants of the <body> tag
for node in body.iterdescendants():
    # transform the tail text that follows the current element
    node.tail = _replace(node.tail)
    # ignore text inside script and style tags
    if node.tag in ('script', 'style'):
        continue
    # transform text inside the current html tag
    node.text = _replace(node.text)

# render the modified tree back to html
print(lxml.html.tostring(tree, encoding='unicode'))
Output:
python convert.py
<html><body>
<div>sample (מדגם ‎:example)‎</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>