JSoup Select Tag Recursive Search - html

I recently started using JSoup to parse HTML documents. I went through the JSoup tutorial and found that the select method might be what I am looking for.
What I am trying to accomplish is to find all elements in an HTML document that have a certain class. To test this, I tried it on the Amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are being used and then tried to integrate this into a small code snippet. In this example I found the following element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
    if (deal.text().contains("ItemMatch")) {
        System.out.println("Found deal: " + deal.text());
    }
}
Unfortunately, I can't get the element I am looking for; deals always has a size of 0. I tried modifying my selector to use only some of the classes, I added the id attribute, and so on. Nevertheless, I do not get the elements (in this case they are nested inside some others). If I try an element that is higher up in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), it is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions)? I expected to be able to define a selector and have JSoup traverse and search the whole DOM tree.

No, you do not have to specify the full DOM hierarchy. Your test should work if the elements are really part of the DOM. I suspect that they are not part of the DOM as it is loaded by JSoup. The reason might be that the inner DOM nodes are filled in by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want, you can either look into the AJAX calls directly and analyze them, or move on to another solution like Selenium WebDriver, which runs a real browser including a working JavaScript engine.
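To illustrate both points without JSoup, here is a minimal Python sketch using only the standard library. It is not JSoup's implementation: ClassFinder and find_by_class are hypothetical helpers, the markup is a made-up stand-in for the Amazon page, and the "dealContainer" class is invented for the example. It shows that a class match does not depend on nesting depth, and that nothing can be matched if the server only delivers an empty container.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect the text of every <tag> that carries all of the wanted classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes)
        self.matches = []
        self._buffer = None   # list of text chunks while inside a match
        self._nested = 0      # same-named tags opened inside the match

    def handle_starttag(self, tag, attrs):
        if self._buffer is not None:
            if tag == self.tag:
                self._nested += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if tag == self.tag and self.wanted <= classes:
            self._buffer = []
            self._nested = 0

    def handle_endtag(self, tag):
        if self._buffer is None or tag != self.tag:
            return
        if self._nested:
            self._nested -= 1
        else:
            self.matches.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def find_by_class(html, tag, classes):
    finder = ClassFinder(tag, classes)
    finder.feed(html)
    finder.close()
    return finder.matches


# What the browser shows after JavaScript has run (assumed structure).
RENDERED = """
<div class="a-section a-spacing-none slotContainer">
  <div class="dealContainer">
    <a href="#"><span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine">ItemMatch mattress topper</span></a>
  </div>
</div>
"""

# What a client without JavaScript might receive (assumed): an empty shell.
SHELL = '<div class="a-section a-spacing-none slotContainer"></div>'

print(find_by_class(RENDERED, "span", ["a-color-link", "dealTitleTwoLine"]))
print(find_by_class(SHELL, "span", ["a-color-link", "dealTitleTwoLine"]))
print(find_by_class(SHELL, "div", ["slotContainer"]))
```

The outer container is found in both documents, which matches what you observed; the nested span is only found when it is actually present in the HTML that was downloaded.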

Related

Scraping pseudo-elements from a website with XPath

I want to extract data from a website, but it seems that the elements I want to extract are not "accessible". I also discovered that they seem to be pseudo-elements. I can see that their tags are marked with a # before them in my web inspector.
Moreover, I can't extract the text I want to access using XPath. There is a point in the CSS "cascade tree" beyond which I can't extract the content of a tag; you can see it below.
Here I can extract information up to the tag 'content fond'. But when I ask for the tag "fos_comment_thread", which is the tag just below it, the result is empty. It is this tag in particular that is a pseudo-element, as are the ones that follow it. However, the text I want to access is even deeper in this part of the CSS tree...
Input
reponse.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']").extract()
Output
['<div id="foc_comment_thread"<div>']
Input
reponse.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']/div[@id='fos_comment_thread']").extract()
Output
[]
I don't understand why I can't extract it. I think it is because the rest of my tags are pseudo-elements, but I haven't found a way to solve the problem...
The first thing you need to do is stop relying on your web-inspector tool and look at the raw HTML of the website.
Web inspectors take into account the transformations made by JavaScript and may show you the updated HTML after JavaScript execution, which Scrapy obviously can't see.
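As a rough illustration, here is a sketch with Python's standard-library ElementTree instead of Scrapy. The markup is an assumption about what the raw page might look like: an empty placeholder div that JavaScript fills in later. ElementTree only supports a subset of XPath and needs well-formed markup, so this is a demonstration of the idea, not a replacement for Scrapy selectors.

```python
import xml.etree.ElementTree as ET

# Assumed raw HTML as delivered by the server, before any JavaScript runs.
RAW_HTML = """
<html><body>
  <div class="row">
    <div class="span9 forum">
      <div class="content fond">
        <div id="fos_comment_thread"></div>
      </div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Attribute tests in XPath are written with @, e.g. [@class='row'].
container_path = ".//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']"
thread = root.find(container_path + "/div[@id='fos_comment_thread']")

# The placeholder exists, but it has no children to extract:
# the comments only appear once a browser has executed the page's scripts.
comments = list(thread)
print(thread.get("id"), len(comments))
```

If the placeholder is empty in the raw HTML, the data usually comes from a separate request, which you can look for in the network tab of the browser's developer tools.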

Algorithm to develop an article extractor

I have undertaken a project to extract the main content from any web page. For example, if I input the URL of a news article, it should return only the article part. The first step is getting the source code of the given URL; there are many ways to do that. After getting the HTML of the web page, I keep only the part inside the <body> tag, because the article will obviously be somewhere inside the body.
After this, I select each div element and check how much text it contains. At the end, I select the div with the most text inside it.
Another approach I am considering: for each <p> element, I check its parent. At the end, I select the div that has the most direct <p> children. To understand it better, check this tree: Tree of an HTML
I know these methods are basic, and that is why I am asking this question. I would like to hear the community's suggestions. What approaches do you use?
I like the idea of implementing your own 'News' crawler...
A few suggestions:
Check the source ('Right Click' > 'Inspect' in Chrome) of some popular sites (e.g. The New York Times); search for common HTML element names, ids or classes they use to identify the different blocks in the HTML; for instance: divs with 'story' or 'story-body' ids.
I would go with the word count, but also use a dictionary of common phrases, which are likely to appear in a news article.
I would search for the block within 'header' and 'footer', excluding comments section or advertisements (again, by searching the values of the object id or class names).
Start your crawling from the main page, it will probably have references to the sub pages or articles - once you have the reference (e.g. a header or article name), it will help you navigate in the sub page itself.
In any case, I suggest working with the Java jsoup library - it will make your life easier; use it with the jQuery-like selectors.
Good luck.
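The "div with the most direct <p> children" heuristic from the question can be sketched in a few lines. This is a minimal Python example using only the standard library, not a production extractor: ParagraphCounter and best_div are hypothetical names, the sample markup is invented, and the sketch assumes every div is properly closed.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphCounter(HTMLParser):
    """Count, for every div, how many <p> elements are its direct children."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open elements
        self.candidates = []   # (label, direct_p_count) for each closed div

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        if tag == "p" and self.stack and self.stack[-1]["tag"] == "div":
            self.stack[-1]["p_count"] += 1
        attrs = dict(attrs)
        label = attrs.get("id") or attrs.get("class") or "(anonymous)"
        self.stack.append({"tag": tag, "label": label, "p_count": 0})

    def handle_endtag(self, tag):
        if all(node["tag"] != tag for node in self.stack):
            return  # stray end tag, ignore it
        while True:
            node = self.stack.pop()
            if node["tag"] == "div":
                self.candidates.append((node["label"], node["p_count"]))
            if node["tag"] == tag:
                break


def best_div(html):
    """Return (label, count) of the div with the most direct <p> children."""
    counter = ParagraphCounter()
    counter.feed(html)
    counter.close()
    return max(counter.candidates, key=lambda c: c[1], default=None)


SAMPLE = """
<body>
  <div id="header"><p>Site name</p></div>
  <div id="story-body"><p>First paragraph.</p><p>Second.</p><p>Third.</p></div>
  <div id="footer"><div class="links"><p>About</p><p>Contact</p></div></div>
</body>
"""

print(best_div(SAMPLE))
```

In practice you would probably combine this count with the other signals mentioned above, such as text length and id or class names like 'story-body', because comment sections can also contain many paragraphs.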

Save generated HTML using Canopy

Can a website's generated HTML be saved using Canopy? Looking at the documentation under 'Getting Started', I could not find anything related.
You can run arbitrary JavaScript using js, and document.documentElement.outerHTML returns the current DOM, so
let html = js "return document.documentElement.outerHTML" |> string
does the trick.
Canopy is a wrapper around Selenium that provides some useful helper functions. But it also provides access to the Selenium IWebElement instances in case you need them, via the element function (halfway down the page; there don't seem to be internal anchors in that page so I couldn't link directly to the function). Then once you have the IWebElement object, your problem becomes similar to this one, where the answer seems to be elem.getAttribute("innerHTML") where elem is the element whose content you want (which might even be the html element). Note that the innerHTML attribute is not a standard DOM attribute, so this won't work with all Selenium drivers; it will be dependent on which browser you're running in. But it apparently works on all major Web browsers.
See Get HTML Source of WebElement in Selenium WebDriver using Python for a related question using Python, which has more discussion about whether the innerHTML attribute will work in all browsers. If it doesn't, Canopy also has the js function, which you could leverage to run some JavaScript to get the HTML you're looking for -- but if you're having trouble with that, you probably need to ask a JavaScript question rather than an F# question.

Is it possible to nest one data: URI inside another?

If I use a data URI to construct a src attribute for an HTML element, can it in turn have another data URI inside it?
I know you can't use data URIs for iframes (I'm actually trying to construct an OSDX document and pass it to the browser with an icon encoded in base64, but that's a really niche use case and this is more of a general question), but assuming you could, my use case would look like:
var iframe = document.createElement('iframe');
var icon = document.createElement('img');
var iSrc = '*[REALLY LONG STRING]*/';
iframe.src = 'data:text/html,<html><body><img src="' + iSrc + '" /></body></html>';
document.body.appendChild(iframe);
Basically, what I'm after is: is there anything in a data URI that would break a parent data URI?
Yes you can. I really thought it was impossible, as did everyone I asked.
Example:
Pasting the following into your browser's URL bar should render a gmail logo in an html page that says hello world.
data:text/html,<html><body><p>hello world</p><img src="" /></body></html>
or for a shorter example courtesy of Pumbaa80:
data:text/html,<script src="data:text/javascript,alert('hello world')"></script>
MSDN explicitly supports this:
Data URIs can be nested.
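If you build such URIs programmatically, one cautious way to keep the inner URI from interfering with the outer one is to percent-encode each layer as you wrap it, so that characters such as quotes or # in the inner payload cannot be misread at the outer level. Here is a minimal Python sketch of that idea; data_uri is a hypothetical helper, not part of any library, and the payload is the short script example from above.

```python
from urllib.parse import quote, unquote


def data_uri(mime, text):
    """Build a non-base64 data URI with a fully percent-encoded payload."""
    return f"data:{mime}," + quote(text, safe="")


# Inner layer: a tiny script, percent-encoded once.
inner = data_uri("text/javascript", "alert('hello world')")

# Outer layer: an HTML page that references the inner URI.
page = f'<script src="{inner}"></script>'
outer = data_uri("text/html", page)

# Decoding the outer payload once gives back the page with the inner URI intact,
# because the inner escapes were themselves escaped (% became %25).
decoded_page = unquote(outer.split(",", 1)[1])

print(inner)
print(outer)
```

Each level of nesting adds one level of escaping, so the string grows quickly; that is a size concern rather than a correctness one.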
An old blog entry talks a little bit more about embedding images within CSS using data: :
Neither dataURI spec nor any other mentions if dataURI’es can not be nested. So here’s the testcase where dataURI’ed CSS has dataURI’ed image embedded. IE8b1, Firefox3 and Safari applied the stylesheet and showed the image, Opera9.50 (build 9613) applies the stylesheet but doesn’t show the embedded image! So it seems that Opera9 doesn’t expect to get anything embedded inside of an already embedded resource! :D
But funny thing, as IE8b1 supports expressions and also supports nested data URI’es, it has the same potential security flaw as Firefox does (as described in the section above). See the testcase — embedded CSS has the following code: body { background: expression(a()); } which calls function a() defined in the javascript of the main page, and this function is called every time the expression is reevaluated. Though IE8b1 has limited expressions support (which is going to be explained in a separate post) you can’t use any code as the expression value, but you can only call already defined functions or use direct string values. So in order to exploit this feature we need to have a ready javascript function already located on the page and then we can just call it from the expression embedded in the stylesheet. That’s not very trivial obviously, but if you have a website that allows people to specify their own stylesheets and you want to be on the safe side, you have to either make sure you don’t have a javascript function that can cause any potential harm or filter expressions from people’s stylesheets.

What do people mean by "DOM Manipulation" and how would I do that?

I always hear people talk about DOM this, manipulate the DOM, change the DOM, traverse the DOM; but what exactly does this mean?
What is the DOM and why would I want to do something with it?
The DOM is basically an API you use to interact with the document, and it is available in many languages as a library (JS is one of those languages). The browser converts all the HTML in your web page to a tree based on the nesting. Pop open Firebug and look at the HTML structure. That is the tree I'm talking about.
If you want to change any HTML you can interact with the DOM API in order to do so.
<html>
<head><script src="file.js"></script></head>
<body>blah</body>
</html>
In file.js I can reference the body using:
onload = function() {
    document.getElementsByTagName('body')[0].style.display='none';
}
getElementsByTagName is a method of the document object. I am manipulating the body element, which is a DOM element. If I wanted to traverse the document and find, say, a span, I could do this:
onload = function() {
    var els = document.getElementsByTagName('*');
    for ( var i = els.length; i--; ) {
        if ( els[i].nodeType == 1 && els[i].nodeName.toLowerCase() == 'span' ) {
            alert( els[i] )
        }
    }
}
I am traversing the nodeList given back by getElementsByTagName in the snippet above, and looking for a span based on the nodeName property.
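Since the DOM is language-neutral, the same traversal can be sketched outside the browser. Here is a small example with Python's standard-library xml.dom.minidom, which implements a subset of the same API; the markup is made up for the example and has to be well-formed XML for this parser.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    "<html><head/><body><p>blah <span>one</span></p><span>two</span></body></html>"
)

spans = []
# "*" matches every element, just like in the browser version above.
for el in doc.getElementsByTagName("*"):
    if el.nodeType == Node.ELEMENT_NODE and el.nodeName.lower() == "span":
        spans.append(el.firstChild.data)  # text inside the span

print(spans)
```

The method names (getElementsByTagName, nodeType, nodeName) are the same as in the JavaScript snippet, which is the point of having a standard API.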
It means working with the Document Object Model, which is an API for working with XML-like documents.
From w3 on the DOM:
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page. This is an overview of DOM-related materials here at W3C and around the web.
One of the most commonly used functions in DOM work is:
getElementById
Manipulating/Changing the DOM means using this API to change the document (add elements, remove elements, move elements around etc...).
Traversing the DOM means navigating it - selecting specific elements, iterating over groups of elements etc...
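A minimal sketch of manipulation (adding, changing and removing nodes), again using Python's standard-library xml.dom.minidom rather than a browser; the document is invented for the example.

```python
from xml.dom import minidom

doc = minidom.parseString("<html><body><p id='intro'>blah</p></body></html>")
body = doc.getElementsByTagName("body")[0]

# Add an element: create it, give it a text node, attach it to the tree.
note = doc.createElement("p")
note.appendChild(doc.createTextNode("added by script"))
body.appendChild(note)

# Change an element: set an attribute on it.
body.setAttribute("style", "display:none")

# Remove an element: detach the original paragraph from its parent.
intro = body.getElementsByTagName("p")[0]
body.removeChild(intro)

result = body.toxml()
print(result)
```

In a browser the calls look almost identical (document.createElement, appendChild, removeChild), and the page updates as soon as the tree changes.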
In short:
When a web page is loaded, the browser creates a Document Object Model of the page, which is an object oriented representation of an HTML document, that acts as an interface between JavaScript and the document itself and allows the creation of dynamic web pages.
Source: w3schools - HTML DOM
Document Object Model
This is the DOM. Either an XML, or HTML, or similar document. All of those terms mean to parse the document and/or make changes to it (usually by using some available tools like JavaScript or C#).
The best example of a DOM when people use those terms is the HTML document in a browser. You might want to manipulate the DOM in this case to add something to the web page.