In what cases do browsers create multiple adjacent text nodes? - html

According to MDN,
New documents have a single Text node for each block of text. Over time, more Text nodes may be created as the document's content changes
I'm running into a rare bug in a project that I think is triggered by multiple text nodes being created in a single element when I only expect there to be one, but I can't reproduce it. Is there any way I can trigger this browser behavior, particularly in iOS Safari?
To illustrate, I manually made a div with two text nodes. I'm trying to figure out when the browser would take a single text node and split it in two like in the attached image

At least one case of WebKit (and therefore Safari) unexpectedly splitting text nodes is documented in WebCore's HTMLConstructionSite.cpp (lines 584–592), which refers to WebKit bug #55898. The limit comes from Text.h, which sets defaultLengthLimit to 1 << 16 (65536).
I'm not entirely sure which code path triggers this, since adding long text to a node using textContent or appendChild(textNode) both created a single text node even with long text. However, I did manage to replicate this behavior with innerHTML.
Example:
// empty <p> element
let p = document.getElementById("test");
p.innerHTML = "a".repeat(65536+100);
console.log(p.childNodes.length); // 2
Obviously HTMLConstructionSite.cpp is related to parsing HTML so it would make sense that it applies to innerHTML, but I have no idea if some other places in WebCore use the text splitting textNode creation too. I hope this helps to track down the problem at least.
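To check whether an element has ended up with split text, something like the following could help. It is only a sketch: the helper names countAdjacentTextPairs and joinedText are made up for illustration, and the node type constant is hard-coded so the helpers also run outside a browser.

```javascript
// Node.TEXT_NODE is 3 in the DOM; hard-coded here so the helpers do not need a browser.
const TEXT_NODE = 3;

// Counts how many times a text node is immediately followed by another text node.
function countAdjacentTextPairs(element) {
  const children = Array.from(element.childNodes);
  let pairs = 0;
  for (let i = 1; i < children.length; i++) {
    if (children[i - 1].nodeType === TEXT_NODE && children[i].nodeType === TEXT_NODE) {
      pairs++;
    }
  }
  return pairs;
}

// Concatenates the direct child text nodes, however many pieces the text was split into.
function joinedText(element) {
  return Array.from(element.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.data)
    .join("");
}
```

In a browser, calling normalize() on the element merges adjacent text nodes back into one, so code that assumes a single text node can call it first or read the text through a helper like joinedText instead of firstChild.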

Related

contenteditable div in UIWebView - new lines are not saved when clicking on done

I have the following div in UIWebView:
<div contenteditable="true"></div>
If the user inserts a new line (using the return key on the on-screen keyboard) and then taps Done in the grey previous/next/done bar above the keyboard, the lines are combined into one line.
How can I avoid it?
Perhaps this JSFiddle can shed some light onto what's happening within your application. If you type some lines in the top DIV (gray background color), the HTML code that you get as the return value of its innerHTML property will first display in a textarea field below it (including HTML tags formatting). As you will soon see it's not merely what you'd expect to handle in your application ('line one' + CRLF + 'line two'...), but it also contains HTML elements separating lines one from another. That's how browsers are able to display contenteditable DIVs as if they're 'memo' type controls - by parsing their HTML (that's what browsers do). This HTML formatted text is also how your application receives user submitted text, and there you have to decide what to do with this formatting. You can either strip them away (which is, I suspect, how you set that object's property and it deals with that for you) replacing HTML elements like <DIV></DIV> and so on with a space character, or choose (with your control's property, or in code) to handle this formatting whichever way you'd like them to be handled. I'm not familiar with UIWebView though, and you'll have to find on your own how to retrieve complete HTML formatted values that you want to apply to the next DIV element that you're displaying (or same one that you're assigning new values to).
UPDATE: After searching the web for UIWebView reference, I've actually stumbled across one related thread on SO that shows how to retrieve innerHTML value of an element in your underlying HTML document:
//where 'wView' is your UIWebView
NSString *webText = [wView stringByEvaluatingJavaScriptFromString:@"document.getElementById('inputDIV').innerHTML"];
This way you'd be able to retrieve the whole innerHTML string contained within the contenteditable DIV that you use in a webText string variable and parse its HTML formatted text to whatever suits your needs better. Note though, that different browsers format contenteditable DIVs differently when Enter Key is pressed and some will return the next line enclosed in a new DIV, while others might enclose it in paragraph P and/or end the line with a break <BR> or <BR />, when shift+enter were used together to move to the next line. You will have to account for all these possibilities when processing your input string. Refer to the JSFiddle script I wrote using your UIWebView component to check what formatting applies.
Of course, in your case, it might be simpler to replace your contenteditable DIV with a textarea, whose value is plain text with \n (LF) line endings rather than HTML. DIVs however are easier to design, so choose whichever suits your needs better.
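As a rough illustration of the parsing step described above, here is a sketch that turns a contenteditable innerHTML string into plain text with \n line breaks. The function name contentEditableHtmlToText is hypothetical, and the code assumes lines are separated only by DIV, P, or BR elements. Empty lines (which some browsers emit as a DIV containing only a BR) are not treated specially, so treat it as a starting point rather than a complete solution.

```javascript
// Converts contenteditable markup into plain text with "\n" between lines.
function contentEditableHtmlToText(html) {
  return html
    .replace(/<br\s*\/?>/gi, "\n")             // <br> or <br /> ends a line
    .replace(/<(div|p)(\s[^>]*)?>/gi, "\n")    // an opening block wrapper starts a new line
    .replace(/<[^>]+>/g, "")                   // drop all remaining tags
    .replace(/&nbsp;/g, " ")                   // decode the few entities we expect
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")                    // decode &amp; last to avoid double decoding
    .replace(/^\n+/, "");                      // no leading blank line when the first line is wrapped
}
```

The resulting string can then be handed back to the native side in the same way the innerHTML value is retrieved above.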
Cheers!
I don't believe there's a solution to this from the Objective-C side of the stack. The standard HTML element only delivers a single string. It might be possible to achieve through some JavaScript magic or similar on the web end of things.
My HTML skills are not up to scratch, but if you also control that end, perhaps changing the element to a textarea might help?

Using neutral <div> as word boundary?

I have a .html file containing text content like:
<div> The study concludes that 1+1 = 2. (Author in Journal..., Page ...) Another study finds...</div>
Now when viewing this in Firefox, I want to be able to conveniently copy the text in the () brackets. But 2 left mouseclicks only mark one word like "Journal", and 3 clicks mark the content of the whole div.
So my idea was to put the brackets in another div like:
<div> The study concludes that 1+1 = 2. <div>(Author in Journal..., Page ...)</div> Another study finds...</div>
But this leads to the () text being pushed into a new line, but the text flow shouldn't be altered at all, I just want to achieve the copy+paste behavior. Is there a way to achieve this? I thought about applying a div class to the () and canceling the attributes in the .css file, but somehow it did not work.
Essentially a triple click will mark a paragraph. So even if you were able to make your inner div inline (which is very simple, you can use style="display:inline"), the browser's text-analysis engine would still read it as one paragraph (or one block) and use the standard behaviour: mark the paragraph.
So basically: no, not if you use only CSS. You have to use JavaScript to identify a triple click on the element and mark it.
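A minimal sketch of that JavaScript approach follows. It assumes the bracketed text is wrapped in an inline element (the span.cite selector in the comment is hypothetical) and relies on the click event's detail property holding the click count. The document and window objects are passed in as parameters purely so the function can be exercised outside a browser.

```javascript
// Selects the whole contents of `element` when the click is the third in a row.
// Returns true if the selection was changed.
function selectContentsOnTripleClick(event, element, doc, win) {
  if (event.detail !== 3) return false; // detail holds the click count for mouse clicks
  const range = doc.createRange();
  range.selectNodeContents(element);
  const selection = win.getSelection();
  selection.removeAllRanges();          // replace the browser's paragraph selection
  selection.addRange(range);
  return true;
}

// Browser wiring, assuming a hypothetical <span class="cite"> around the bracketed text:
// document.querySelectorAll("span.cite").forEach((span) =>
//   span.addEventListener("click", (e) =>
//     selectContentsOnTripleClick(e, span, document, window)));
```

Using a span instead of a nested div keeps the text flow unchanged, so no extra CSS is needed.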

How to resolve issue where table column is too narrow?

I'm new on this particular project, and I've been tasked with resolving an issue that's appearing in IE8.
If you check http://funds.ft.com/ETFHomepage.aspx, there's a section called "News". In that section, there's a column called "Most Popular ETFs". This should be the same width as the "Recently Viewed ETFs" column.
For reference, this page is appearing correctly in Firefox. Can somebody please point out what I can do with CSS or (some other means)* to resolve this?
*I know the best way to resolve this issue is to scrap the terrible design and implement it correctly!! :-) -- we're actually doing that right now. It's a big job, so it's taking a long time. In the meantime, however, we have to fix the bugs as they appear. Thanks
Update: just to note what I've said to Hristo, "I think the problem is with the table (rather, nested tables) on the left. The table in the center has its width defined by the image, and the table on the right doesn't have an image so it gets crushed"
Well the reason this is happening is because of the url you have under the "Alphaville: Overcoming the Volcker rule, with ETFs" header. Since the url has no whitespace in it, the table tries to give it space. So there are a couple of ways to fix this problem:
Plain text urls aren't very becoming on a webpage (especially when they're not in anchor tags so you can click on them.) Could you update the content so that you don't have a raw url in your content?
If you must be able to handle long lines of text with no whitespace then you need to figure out how to change the layout of the page so it forces the text to either wrap or clip to fit the container. Try playing around with putting "table-layout: fixed" on your tables to force the column widths to be sized based on the table's specifications only (instead of content). Firefox seems to be wrapping on dashes and slashes in the url whereas IE only wants to wrap on the dashes in the url.
I would say your layout is fine, and you just need to fix the content generation so it doesn't include any long plain text urls (option 1 above)
EDIT: If you do decide to go with option 2 above, then look into the CSS rule "word-break: break-all". It started out as an IE-only property (it has since been standardized), and it forces the text to break as soon as it reaches the end of the container. Not good for words, but it works for URLs. So you couldn't apply this to the whole news table, but you could apply it to just the cell that contains the URL.

TextField autoSize+italics cuts off last character

In actionscript 3, my TextField has :
CSS styling
embedded fonts
textAlign : CENTER
autoSize : CENTER
... when italics are used, the rightmost character gets slightly cut off (especially caps).
It basically seems that it fails detecting the right size.
I've had this problem before but just wondered is there a nice workaround (instead of checking textWidth or offsetting text etc.)?
Initialize your textField as you always do, using multiline, autosize, htmlText...
Then do this little trick :
// saving wanted width and height plus 1px to get some space for last char
var savedWidth = myTextField.width + 1;
var savedHeight = myTextField.height + 1;
// removing autoSize, which is the origin of the problem I think
myTextField.autoSize = "none";
// now manually autoSizing the textField with saved values
myTextField.width = savedWidth;
myTextField.height = savedHeight;
Not that it is much comfort to you, but Flash sometimes has trouble with this seemingly simple task. CSS styling of html TextField was a nice addition but it has caused headaches for text-rendering. In fact I very rarely use CSS for styling text for that reason. I can only imagine that combining bold, italic and normal type faces within the HTML causes Flash to get some of the width calculations wrong which causes autoSize to set the mask a tiny bit short. I hope very much that the new text rendering engine in Flash Player 10 will finally fix these issues (it certainly looks better in theory).
So my solution is never to use HTML with the exception being when I require <a> links in my text ... and there are even some tricky text shifting issues there. In those cases I avoid mixing different font weights and font styles within the same text field. All other cases I use TextFormat directly on TextField.
I suppose if you can't get out of your current architecture (for some reason) you could try adding &nbsp; to the end of your HTML-encoded strings. Or you could manually set the width of the field and not rely on autoSize (as you have mentioned). But if you keep on the CSS/HTML route you may find another new and painful limitation just when you don't want it.
I've had issues with TextField masks behaving differently in the Flash preview, and in the actual browser plugin. Usually, and this is strange to me, it would appear more correctly in the browser. Have you tried running the swf in a browser to see if the problem is actually an annoyance rather than a permanent problem?
I had said this:
My non-ideal approach to solving this is to attach a change event to the TextField which always adds a space after the last character of the field. And then to remember to trim this space off when using the value.
But that didn't take into account that this probably doesn't have a change event and that it's an HTML-rendered text field. To add a trailing space in the HTML text field, throw in an &nbsp;. Again, that's not really fixing the problem.

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text.
For example, it would pick the div "content" in the following HTML:
<html>
<body>
<div id="header">This is the header we don't care about</div>
<div id="content">This is the <b>Main Page</b> content. it is the
longest block of text in this document and should be chosen as
most likely being the important page content.</div>
</body>
</html>
I have come up with a few ideas, such as traversing the HTML document tree to its leaves, adding up the length of the text, and only seeing what other text the parent has if the parent gives us more content than the children do.
Has anyone ever tried something like this, or know of an algorithm that can be applied? It doesn't have to be solid, but as long as it can guess a container that contains most of the page content text (for articles or blog posts, for example), that would be awesome.
One word: Boilerpipe
Here's roughly how I would approach this:
// get array of all elements (body is used as parent here but you could use whatever)
var elms = document.body.getElementsByTagName('*');
var nodes = Array.prototype.slice.call( elms, 0 );
// get inline elements out of the way (incomplete list)
nodes = nodes.filter(function (elm) {
  return !/^(a|br?|hr|code|i(ns|mg)?|u|del|em|s(trong|pan))$/i.test( elm.nodeName );
});
// sort elements by most text first
nodes.sort(function(a,b){
  if (a.textContent.length == b.textContent.length) return 0;
  if (a.textContent.length > b.textContent.length) return -1;
  return 1;
});
Using ancestry functions like a.compareDocumentPosition(b), you can also sink elements during sorting (or after), depending on how complex this thing needs to be.
You will also have to formulate a level on which you want to select the node. In your example, the 'body' node has an even larger amount of text in it. So you have to formulate what a 'parent element' exactly is.
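One possible way to formulate that level, sketched below, is to descend from the root for as long as a single child element still holds most of the text. The function name pickContentElement and the 0.6 threshold are assumptions made up for illustration, not something the question specifies.

```javascript
// Returns the deepest element that still holds at least `minShare` of the root's text.
function pickContentElement(root, minShare = 0.6) {
  const total = root.textContent.length;
  let best = root;
  let current = root;
  while (current) {
    best = current;
    // Step into a child element that still holds enough of the text, if there is one.
    current = Array.from(current.children).find(
      (child) => child.textContent.length >= total * minShare
    );
  }
  return best;
}
```

With the sample document from the question, the body holds all of the text, the "content" div holds most of it, and the b element holds very little, so the descent would stop at the "content" div.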
You could create an app that looks for contiguous block of text disregarding formatting tags (if required). You could do this by using a DOM parser and walking the tree, keeping track of the immediate parent (because that is your output).
Start from the parent nodes and traverse the tree. For each node that is just formatting, continue the 'count' within that sub-block, counting the characters of the content.
Once you find the block with the most content, traverse back up the tree to its parent to get your answer.
I think your solution relies on how you traverse the DOM and keep track of the nodes that you are scanning.
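The traversal described above could be sketched roughly as follows. The list of formatting tags is an incomplete assumption, and the names ownTextLength and findLargestTextBlock are made up for illustration. Text inside formatting elements is counted toward the nearest non-formatting ancestor.

```javascript
const TEXT_NODE = 3;
const ELEMENT_NODE = 1;
// Tags treated as pure formatting (an assumed, incomplete list).
const INLINE_TAGS = new Set(["A", "B", "I", "U", "EM", "STRONG", "SPAN", "CODE"]);

// Text length owned by `node`: its own text nodes plus text inside formatting children.
function ownTextLength(node) {
  let length = 0;
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      length += child.data.trim().length;
    } else if (child.nodeType === ELEMENT_NODE && INLINE_TAGS.has(child.nodeName.toUpperCase())) {
      length += ownTextLength(child);
    }
  }
  return length;
}

// Walks the tree and returns the non-formatting element that owns the most text.
function findLargestTextBlock(root) {
  let best = { node: null, length: 0 };
  (function walk(node) {
    if (node.nodeType !== ELEMENT_NODE) return;
    if (!INLINE_TAGS.has(node.nodeName.toUpperCase())) {
      const length = ownTextLength(node);
      if (length > best.length) best = { node, length };
    }
    for (const child of node.childNodes) walk(child);
  })(root);
  return best.node;
}
```

In a browser this would be called as findLargestTextBlock(document.body). Because body has no text of its own in the question's example, it does not win over the "content" div.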
What language are you using? Any other details for your project? There may be language specific or package specific tools you could use as well.
I can also say that word banks are a great help. Any lists of common 'advertisey' words like twitter and click and several capitalized nouns in a row. Having a POS tagger can improve accuracy. For news sites, a list of all known major cities in the world can help separate. In fact, you can almost scrape a page without even looking at the HTML.