Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text.
For example, it would pick the div "content" in the following HTML:
<html>
<body>
<div id="header">This is the header we don't care about</div>
<div id="content">This is the <b>Main Page</b> content. it is the
longest block of text in this document and should be chosen as
most likely being the important page content.</div>
</body>
</html>
I have come up with a few ideas, such as traversing the HTML document tree to its leaves, adding up the length of the text, and only seeing what other text the parent has if the parent gives us more content than the children do.
Has anyone ever tried something like this, or does anyone know of an algorithm that could be applied? It doesn't have to be bulletproof; as long as it can guess a container that holds most of the page's content text (for articles or blog posts, for example), that would be awesome.

One word: Boilerpipe

Here's roughly how I would approach this:
// get array of all elements (body is used as parent here but you could use whatever)
var elms = document.body.getElementsByTagName('*');
var nodes = Array.prototype.slice.call( elms, 0 );
// get inline elements out of the way (incomplete list)
nodes = nodes.filter(function (elm) {
  return !/^(a|br?|hr|code|i(ns|mg)?|u|del|em|s(trong|pan))$/i.test( elm.nodeName );
});
// sort elements by most text first
nodes.sort(function (a, b) {
  return b.textContent.length - a.textContent.length;
});
Using ancestry functions like a.compareDocumentPosition(b), you can also sink elements during sorting (or after), depending on how complex this thing needs to be.

You will also have to formulate a level on which you want to select the node. In your example, the 'body' node has an even larger amount of text in it. So you have to formulate what a 'parent element' exactly is.
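One possible way to pin down that level, sketched under assumptions: start at the root and keep descending while a single child element still carries most of the current node's text. The function name pickDominantContainer and the 0.6 threshold are made up for illustration; the function only relies on the standard textContent and children properties.

```javascript
// Hypothetical helper: descend from `root` while one child element still
// holds at least `share` (a fraction between 0 and 1) of the current text.
function pickDominantContainer(root, share = 0.6) {
  let current = root;
  for (;;) {
    const total = current.textContent.trim().length;
    // Look for a child element that carries most of the current node's text.
    const next = Array.from(current.children).find(
      (child) => total > 0 && child.textContent.trim().length >= share * total
    );
    if (!next) return current; // no dominant child: this is the chosen level
    current = next;
  }
}

// Browser usage (not executed here):
// const container = pickDominantContainer(document.body);
```

With the example from the question, the walk would move from body into the "content" div and stop there, because the nested b element holds only a small part of that div's text. The threshold is a guess and would likely need tuning per site.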

You could create an app that looks for contiguous blocks of text, disregarding formatting tags (if required). You could do this by using a DOM parser and walking the tree, keeping track of the immediate parent (because that is your output).
Start from the parent nodes and traverse the tree. For each node that is just formatting, continue the 'count' within that sub-block. The count is the number of characters of content.
Once you find the block with the most content, traverse back up the tree to its parent to get your answer.
I think your solution relies on how you traverse the DOM and keep track of the nodes that you are scanning.
What language are you using? Any other details for your project? There may be language specific or package specific tools you could use as well.
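The walk described in this answer could be sketched roughly as follows, in JavaScript to match the other answer. This is only an illustration: FORMATTING_TAGS is an assumed, incomplete list, and ownTextLength and findLargestTextBlock are hypothetical names. The code uses only the standard nodeType, nodeName, nodeValue and childNodes properties.

```javascript
// Assumed (incomplete) set of formatting tags whose text counts toward the enclosing block.
const FORMATTING_TAGS = new Set(["A", "B", "I", "U", "EM", "STRONG", "SPAN", "CODE", "DEL", "INS"]);

function isFormatting(node) {
  return FORMATTING_TAGS.has(node.nodeName.toUpperCase());
}

// Count the characters that belong directly to `node`, continuing the
// count inside formatting-only children.
function ownTextLength(node) {
  let count = 0;
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      count += child.nodeValue.trim().length; // 3 === text node
    } else if (child.nodeType === 1 && isFormatting(child)) {
      count += ownTextLength(child);
    }
  }
  return count;
}

// Visit every element under `root` and remember the non-formatting
// element with the highest count.
function findLargestTextBlock(root) {
  let best = { node: null, length: 0 };
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (!isFormatting(node)) {
      const length = ownTextLength(node);
      if (length > best.length) best = { node, length };
    }
    for (const child of node.childNodes) {
      if (child.nodeType === 1) stack.push(child); // 1 === element node
    }
  }
  return best;
}

// Browser usage (not executed here):
// const { node, length } = findLargestTextBlock(document.body);
```

Because each element is scored only on its own text, body scores close to zero in the question's example and the "content" div wins. Whitespace is trimmed per text node, so the lengths are approximate.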

I can also say that word banks are a great help: for example, lists of common 'advertisey' words like twitter and click, or a check for several capitalized nouns in a row. Having a POS tagger can improve accuracy. For news sites, a list of all known major cities in the world can help separate. In fact, you can almost scrape a page without even looking at the HTML.

Related

In what cases do browsers create multiple adjacent text nodes?

According to MDN,
New documents have a single Text node for each block of text. Over time, more Text nodes may be created as the document's content changes
I'm running into a rare bug in a project that I think is triggered by multiple text nodes being created in a single element when I only expect there to be one, but I can't reproduce it. Is there any way I can trigger this browser behavior, particularly in iOS Safari?
To illustrate, I manually made a div with two text nodes. I'm trying to figure out when the browser would take a single text node and split it in two.

At least one case of particularly Safari/WebKit unexpectedly breaking textNodes seems to be documented somehow in WebCore’s HTMLConstructionSite.cpp lines 584–592, which refers to WebKit bug #55898. The limit comes from Text.h which sets defaultLengthLimit to 1 << 16 (65536).
I’m not entirely sure which part triggers this, since adding long text to node using textContent or appendChild(textNode) both created a single text node even with long text. However, I did manage to replicate this behavior with innerHTML.
Example:
// empty <p> element
let p = document.getElementById("test");
p.innerHTML = "a".repeat(65536+100);
console.log(p.childNodes.length); // 2
Obviously HTMLConstructionSite.cpp is related to parsing HTML so it would make sense that it applies to innerHTML, but I have no idea if some other places in WebCore use the text splitting textNode creation too. I hope this helps to track down the problem at least.
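If the bug really comes from code that assumes a single text node, one defensive option is to read all direct text children and join them. The sketch below assumes exactly that and nothing about when WebKit splits; directText is a hypothetical helper name.

```javascript
// Hypothetical helper: concatenate every direct Text child of `element`
// instead of assuming element.firstChild holds the whole string.
function directText(element) {
  let out = "";
  for (const child of element.childNodes) {
    if (child.nodeType === 3) out += child.data; // 3 === Node.TEXT_NODE
  }
  return out;
}

// Browser usage (not executed here), reusing the <p id="test"> from the example:
// const p = document.getElementById("test");
// console.log(directText(p).length);
```

This returns the same string whether the browser created one text node or several, so the calling code no longer depends on the split.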

XPath - get text from whole document except text from specified elements

I'm trying to figure out how to get text using XPath and exclude some tags.
Let's say (for illustration) I want to get all text from this page's body tag (so all visible text), but I don't want my text to contain text from tags with class="comment-copy" i.e. I don't want text to include comments.
I tried this but it doesn't work. It returns text including comments.
//body//text()[not(*[contains(@class,"comment-copy")])]
Do you have any idea?
EDIT:
Probably figured it out but maybe there are better or faster approaches so I won't delete the question.
//body//text()[not(ancestor-or-self::*[contains(@class,"comment-copy")])]

You were very close.
Just change
//body//text()[not(*[contains(@class,"comment-copy")])]
to
//body//text()[not(contains(../@class,"comment-copy"))]
Note that this will only exclude immediate children text() nodes of comment-copy marked elements. Your follow-up XPath will exclude all descendant text() nodes beneath comment-copy marked elements.
Note: You might want to beef up the robustness of the @class test; see Xpath: Find element with class that contains spaces.
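As a sketch of what a more robust class test could look like, the snippet below builds the expression as a string using the common XPath 1.0 idiom of padding the class attribute with spaces, so that "comment-copy" matches only as a whole class token. The helper names hasClass and textOutsideClass are hypothetical; document.evaluate is shown only in comments as one way to run the expression in a browser.

```javascript
// Hypothetical helper: XPath 1.0 test that matches `name` as a whole class token.
function hasClass(name) {
  return `contains(concat(" ", normalize-space(@class), " "), " ${name} ")`;
}

// All text nodes under <body> that have no ancestor element carrying the class.
function textOutsideClass(name) {
  return `//body//text()[not(ancestor::*[${hasClass(name)}])]`;
}

const commentFreeTextXPath = textOutsideClass("comment-copy");

// Browser usage (not executed here):
// const result = document.evaluate(commentFreeTextXPath, document, null,
//   XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
// for (let i = 0; i < result.snapshotLength; i++) {
//   console.log(result.snapshotItem(i).nodeValue);
// }
```

The ancestor axis is enough here because a text node cannot itself carry a class attribute.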

How to define an HTML element on the side and reference it later on?

I am a low-level programmer and new to HTML.
I have the HTML body, which holds the structure of my page.
One of the elements in this body is a long dropdown list. I was thinking that it makes sense to have this list defined separately, at the bottom of the file or at the top, and only reference it inside the body, so the full structure stays reasonably sized and easy to read.
Is this something I can actually do? Is this a reasonable request?

I would consider populating this dropdown list with Javascript code if it really is that long.
For example, you can make an array of the values/names of the select options you need to create and then iteratively add options elements to a select element. If you give us an idea of the select you're using, we can help you come up with a way. What have you tried thus far?
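A minimal sketch of that idea, assuming the page contains an empty select element with the id "country". The countryOptions data, the id, and the populateSelect name are all made up for illustration; createElement and appendChild are the standard DOM calls.

```javascript
// Hypothetical data: one entry per option to create. This array can live
// at the top of the script, away from the page structure.
const countryOptions = [
  { value: "de", label: "Germany" },
  { value: "fr", label: "France" },
  { value: "jp", label: "Japan" }
];

// Append one <option> per entry to `select`.
// `doc` is the document object used to create the elements.
function populateSelect(select, items, doc) {
  for (const item of items) {
    const option = doc.createElement("option");
    option.value = item.value;
    option.textContent = item.label;
    select.appendChild(option);
  }
  return select;
}

// Browser usage (not executed here), assuming <select id="country"></select>:
// populateSelect(document.getElementById("country"), countryOptions, document);
```

The body then only needs the empty select element, which keeps the markup short.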

Using neutral <div> as word boundary?

I have a .html file containing text content like:
<div> The study concludes that 1+1 = 2. (Author in Journal..., Page ...) Another study finds...</div>
Now when viewing this in Firefox, I want to be able to conveniently copy the text in the () brackets. But 2 left mouseclicks only mark one word like "Journal", and 3 clicks mark the content of the whole div.
So my idea was to put the brackets in another div like:
<div> The study concludes that 1+1 = 2. <div>(Author in Journal..., Page ...)</div> Another study finds...</div>
But this leads to the () text being pushed into a new line, but the text flow shouldn't be altered at all, I just want to achieve the copy+paste behavior. Is there a way to achieve this? I thought about applying a div class to the () and canceling the attributes in the .css file, but somehow it did not work.

Essentially, a triple click marks a paragraph. So even if you made your inner div inline (which is very simple: you can use style="display:inline"), the browser's text-analysis engine would still read it as one paragraph (or one block) and apply the standard behaviour: mark the whole paragraph.
So basically: no, not if you use only CSS. You have to use JavaScript to identify a triple click on the element and mark it.
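A rough sketch of the JavaScript route, assuming the bracketed text is wrapped in an inline element such as a span with the class "cite" (both the wrapper and the class name are assumptions). It relies on event.detail holding the click count for click events and on the standard Range and Selection APIs; how this interacts with the browser's own triple-click selection may vary.

```javascript
// Select the whole contents of `target` (for example a span around the brackets).
function selectContents(target, doc, win) {
  const range = doc.createRange();
  range.selectNodeContents(target);
  const selection = win.getSelection();
  selection.removeAllRanges();
  selection.addRange(range);
}

// Build a click handler that reacts only to the third click in a row.
function onTripleClick(doc, win) {
  return function (event) {
    if (event.detail === 3) selectContents(event.currentTarget, doc, win);
  };
}

// Browser usage (not executed here), assuming <span class="cite">(...)</span>:
// document.querySelectorAll("span.cite").forEach((el) =>
//   el.addEventListener("click", onTripleClick(document, window)));
```

Using an inline span instead of a nested div also avoids the unwanted line break described in the question.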

How can I access an element by using its DOM hierarchy(parent element)?

I want to access an element through the DOM hierarchy, via its parent nodes. I am trying to find the DOM hierarchy with Firebug and want something like <parent_node1>.<child_node1>.<child_node2> (not document.getElementById or document.getElementsByName) to access an element.
I want to automate a scenario in which I have column headers and corresponding values, and I want to test whether the values present under each column header are correct.
I am thinking of using the DOM to automate this case. But how can I find the DOM hierarchy?
What I see through Inspect Element in Firebug is a list of events and elements, which does not look like a hierarchical node structure. Can somebody help in this regard, please?

As discussed, you probably mean the DOM Element properties like element.childNodes, element.firstChild or similar.
Have a look at the DOM Element property reference over at JavaScriptKit, you'll get a good overview there how to access the hierarchy.
var currentTD = document.getElementsByTagName("td")[0];
var currentTable = document.getElementsByTagName("table")[0];
currentTD.parentNode // contains the TR element the TD resides in.
currentTable.childNodes // contains THEAD TBODY and TFOOT if present.
DOM Tables even have more properties like a rows collection and a cells collection.
A word of caution: these collections are live, so iterating over them and reading collection.length in each iteration can be really slow, because the DOM may have to be queried each time to get the length.
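One way to sidestep that cost, sketched with the rows and cells collections mentioned above: copy the live collection into an array once, or read the length once before the loop. The function name cellTexts is hypothetical.

```javascript
// Collect the trimmed text of every cell in `table`.
function cellTexts(table) {
  const texts = [];
  const rows = Array.from(table.rows); // one-time snapshot of the live rows collection
  for (const row of rows) {
    const cells = row.cells;
    // The length is read once per row instead of on every iteration.
    for (let i = 0, n = cells.length; i < n; i++) {
      texts.push(cells[i].textContent.trim());
    }
  }
  return texts;
}

// Browser usage (not executed here):
// console.log(cellTexts(document.getElementsByTagName("table")[0]));
```

A snapshot no longer reflects later changes to the table, which is usually what a read-only test wants.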

document.getElementById and document.getElementsByTagName are themselves DOM operations. They take an object within the DOM (the document object; getElementsByTagName can also be called on any element) and return a single element or a collection of zero or more elements, respectively. From there you can do other DOM operations on the results, like getting children, parents or siblings, changing values, etc.
All DOM operations come down to:
Take a starting point. This is often document, though the first thing we do is so often a call to document.getElementById or document.getElementsByTagName that we could really consider that result the starting point.
Find the element or elements we are interested in relative to the starting point, whether through document.getElementById* or startingPoint.getElementsByTagName, perhaps combined with some test (e.g. only working on those with a particular class name, or those that have children of particular types).
Read and/or change certain values, add new child nodes and/or delete nodes.
In a case like yours the starting point will be one or more tables found by document.getElementById(someID), document.getElementById(someID).getElementsByTagName('table')[0], or similar. From that table, myTable.getElementsByTagName('th') will get you the column headings. Depending on the structure, and what you are doing with it, you could just select corresponding elements from myTable.getElementsByTagName('td') or go through each row and then work on curRow.getElementsByTagName('td').
You could also just use firstChild, childNodes etc., though it's normally more convenient to have elements you don't care about filtered out by tag name.
*Since there can only be one element with a given id in a document, getElementById is defined on document rather than on individual elements. To act only if the element is a descendant of the current element, a check such as currentElement.contains(found) can be used.
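For the header-and-values check in the question, the steps above might be combined as in the following sketch. It assumes a simple table with one header row of th cells and data rows of td cells; readTable is a hypothetical name, and only getElementsByTagName and textContent are used.

```javascript
// Turn a table into one record per data row, keyed by column heading.
function readTable(table) {
  const headers = Array.from(
    table.getElementsByTagName("th"),
    (th) => th.textContent.trim()
  );
  const records = [];
  for (const row of Array.from(table.getElementsByTagName("tr"))) {
    const cells = Array.from(row.getElementsByTagName("td"));
    if (cells.length === 0) continue; // header row: no <td> cells
    const record = {};
    cells.forEach((td, i) => {
      record[headers[i]] = td.textContent.trim();
    });
    records.push(record);
  }
  return records;
}

// Browser usage (not executed here), with someID as in the answer:
// const records = readTable(document.getElementById(someID));
// records.forEach((r) => console.log(r));
```

A test can then compare each record against the expected values per column heading. Tables with colspan, rowspan or nested tables would need extra handling that this sketch leaves out.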
