Scraping pseudo-elements from a website with XPath - html

I want to extract data from a website, but the elements I want do not seem to be "accessible". They also appear to be pseudo-elements: in my web inspector their tags are marked with a leading #.
Moreover, I can't extract the text I want with XPath. There is a point in the page's element tree below which I can't extract the content of a tag, as you can see below.
I can extract information down to the 'content fond' div. But when I ask for the "fos_comment_thread" div, which sits just below it, the result is empty. That div is exactly where the pseudo-elements start, and everything under it is one too. The text I want is even deeper in this part of the tree.
Input
response.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']").extract()
Output
['<div id="fos_comment_thread"></div>']
Input
response.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']/div[@id='fos_comment_thread']").extract()
Output
[]
I don't understand why the extraction fails. I think it is because the remaining tags are pseudo-elements, but I haven't found a solution.

The first thing to do is stop relying on your web inspector and look at the raw HTML of the page instead.
Web inspectors reflect the transformations made by JavaScript and may show you the HTML as updated after script execution, which Scrapy does not see because it does not run JavaScript.
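One way to confirm this is to check whether the target div has any children in the HTML the server actually sent (in Scrapy, that is what response.text holds). The sketch below uses only the Python standard library; the class name, the helper function, and the two sample strings are made up for illustration and are not taken from the site in question.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ChildCounter(HTMLParser):
    """Count the tags nested inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False   # True once the target element was seen
        self.depth = 0       # > 0 while the parser is inside the target
        self.children = 0    # number of tags seen inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.children += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def count_children(html, target_id):
    """Return (found, number_of_nested_tags) for the element with target_id."""
    parser = ChildCounter(target_id)
    parser.feed(html)
    parser.close()
    return parser.found, parser.children


# Hypothetical raw HTML as the server might send it: the container is empty.
raw_html = '<div class="content fond"><div id="fos_comment_thread"></div></div>'
# Hypothetical HTML after JavaScript has filled the container in the browser.
rendered_html = '<div id="fos_comment_thread"><p>First comment</p></div>'

print(count_children(raw_html, "fos_comment_thread"))
print(count_children(rendered_html, "fos_comment_thread"))
```

If the count is zero for the raw HTML, the comments are loaded later by a script, and no XPath expression run against the raw response will find them.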

Related

Markdown TOC with Special Characters?

I am trying to create a TOC for my Markdown blog.
The methods I found here: Markdown to create pages and table of contents?
...do not work for me because I name all of my headers # _</>_ The Setup. I use CSS to style the emphasized </> part, which gives each header a nice colored icon next to it. If I simply use `# The Setup`, it works great.
The special characters cause issues whenever I try to use [The Setup](#The-Setup).
I tried a few things like [The Setup](#_</>_-The-Setup), among others, but I cannot get it to work.
If someone can point me in the right direction I would greatly appreciate it. Also, if anyone has a better way of adding custom icons next to headers, I think that would be the better way to go about it.
As always, thanks in advance.
The general solution is to examine the rendered HTML output to see what the tool is converting the special characters to, in the HTML's element ID. Every tool could handle the conversion differently (it could convert special characters to -, _, or just remove special characters). Some examples:
<h1 id="_____the-setup">The Setup</h1>
<h1 id="-the-setup">The Setup</h1>
<h1 id="the-setup">The Setup</h1>
Once you have identified the exact id that the tool is using, then you use that value as the heading link in the markdown's table of contents. For example:
[The Setup](#_____the-setup)
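If you can get hold of the rendered HTML, the lookup can be automated. The following is a minimal Python sketch that reads heading ids out of rendered HTML and prints matching table-of-contents links; the class and function names are made up for this example, and the sample string simply reuses one of the ids shown above.

```python
from html.parser import HTMLParser


class HeadingIds(HTMLParser):
    """Collect (id, text) pairs for h1..h6 elements."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # finished (id, text) pairs
        self._current = None  # [id, text] of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = [dict(attrs).get("id"), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            heading_id, text = self._current
            self.headings.append((heading_id, text.strip()))
            self._current = None


def toc_links(rendered_html):
    """Build one Markdown link per heading that carries an id."""
    parser = HeadingIds()
    parser.feed(rendered_html)
    parser.close()
    return [f"[{text}](#{hid})" for hid, text in parser.headings if hid]


sample = '<h1 id="_____the-setup">The Setup</h1>'
print(toc_links(sample))
```

The output uses whatever id the tool generated, so the links only hold for the tool that produced the HTML.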
Now, the tricky part is that not every Markdown tool exports the rendered HTML, and VS Code is one of those that do not. The workaround for VS Code is:
1. Open the Markdown preview (which renders to HTML internally).
2. Open the VS Code Developer Tools (Help > Toggle Developer Tools).
3. Use DevTools to inspect the element (in this case, the heading element for "The Setup").
I see that VS Code set the id to the-setup, so in the Markdown table of contents I write [The Setup](#the-setup). Now the table-of-contents hyperlink works in VS Code. Caveat: it might not work in other Markdown tools if they generate a different HTML element ID!
Another shortcut now available in VS Code (1.70, July 2022) is that Markdown can autocomplete the header ID: you just type #, and it lists the valid IDs.

JSoup Select Tag Recursive Search

I recently started working with JSoup to parse HTML documents. I went through the JSoup tutorial and found that the select method might be what I am looking for.
What I am trying to accomplish is to find all elements in an HTML document that have a certain class. To test this, I tried it on the Amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are used, and then tried to integrate this into a small code snippet. In this example I found the following element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
    if (deal.text().contains("ItemMatch")) {
        System.out.println("Found deal: " + deal.text());
    }
}
Unfortunately I can't get the element I am looking for: deals always has a size of 0. I tried modifying my select to use only some of the classes, I added the id attribute, and so on. Nevertheless, I do not get the elements (in this case they are nested inside some others). If I try an element that is above this one in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), it is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions)? I expected to be able to define a selector and have JSoup traverse and search the whole DOM tree.
No, you do not have to specify the full DOM hierarchy. Your test should work if the elements are really part of the DOM. I suspect they are not part of the DOM as loaded by JSoup, probably because the inner DOM nodes are filled in by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want, you can either inspect and analyze the AJAX calls directly, or move to another solution such as Selenium WebDriver, which drives a real browser with a working JavaScript engine.
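To illustrate the first option: once you have found the AJAX request in the browser's network tab, its response is often JSON that can be filtered directly, with no HTML parsing at all. The Python sketch below only shows the idea. The payload shape (a "deals" list with "title" fields), the function name, and the sample data are all assumptions made for this example and say nothing about what Amazon really returns.

```python
import json


def matching_titles(payload, needle):
    """Return the titles in a (hypothetical) deals payload containing needle."""
    data = json.loads(payload)
    return [
        deal["title"]
        for deal in data.get("deals", [])
        if needle in deal.get("title", "")
    ]


# Made-up response body standing in for what an AJAX call might return.
sample_payload = json.dumps(
    {
        "deals": [
            {"title": "ItemMatch mattress topper"},
            {"title": "Some other product"},
        ]
    }
)

for title in matching_titles(sample_payload, "ItemMatch"):
    print("Found deal: " + title)
```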

RegEx to Filter some specific tags

I'm developing ASP code that reads external websites and parses them through the HTMLDocument interface (the "HTMLFILE" object) so I can navigate the content via the DOM. But some pages throw an error:
'htmlfile error 80070057 Invalid Argument.'
After a lot of research, I discovered that some HTML tags are, for reasons I don't know, not rendered or handled correctly by the HTMLFILE object, and they cause this error.
Because ASP is so old and there is not much material available today to dig through, I'm convinced that I have to preprocess the page before sending it to the HTMLFILE object, and the best way I have come up with is to use a regex.
But I'm running into some problems (partly because I don't have much practice).
I need to reliably locate the HTML tag blocks that HTMLFILE does not accept so that I can remove them.
For Example:
<head>
<script> ....... </script>
<style> ....... </style>
</head>
<body>
<iframe> ........ </iframe>
<div> ..... </div>
<table>.....</table>
I have to match the full script, style and iframe blocks, leaving the rest of the document intact.
Over the last few days I've done some research and have almost got it:
<(?:script|embed|object|frameset|frame|iframe|meta|style).+(.|\s)*?>$
I also tried to match single-line tags (for example '<BR>'), but I'm totally confused now and there are some inconsistencies; for example, some lines that close other tags are wrongly selected.
I know the best way would be to find out why HTMLFILE is throwing the error, but the error gives no further information to debug with.
Thanks for your time and patience.
Here is the regex candidate:
<(script|meta|style|embed|object|frameset|frame|iframe)[\s\S]*?<\/(script|meta|style|embed|object|frameset|frame|iframe)>
DEMO with explanation
EDIT
Update with lazy match for [\s\S]*?
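As a quick check of how this pattern behaves, here is a small Python sketch that applies it with re.sub. The question is about ASP, so this only demonstrates the regex itself; the IGNORECASE flag, the function name, and the sample markup are additions made for this example.

```python
import re

# The candidate pattern from above, split over two lines for readability.
# re.IGNORECASE is an addition so that <SCRIPT> is matched as well.
BLOCK_RE = re.compile(
    r"<(script|meta|style|embed|object|frameset|frame|iframe)[\s\S]*?"
    r"<\/(script|meta|style|embed|object|frameset|frame|iframe)>",
    re.IGNORECASE,
)


def strip_blocks(html):
    """Remove every block matched by BLOCK_RE and keep the rest untouched."""
    return BLOCK_RE.sub("", html)


sample = (
    "<head><script>var a = 1;</script><style>p { color: red; }</style></head>"
    "<body><iframe src='x'></iframe><div>kept</div></body>"
)
print(strip_blocks(sample))
# prints: <head></head><body><div>kept</div></body>
```

Note that the opening and closing tag names are matched independently, so a tag without a closing tag, such as meta, makes the match run on to the next closing tag from the list.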
Regex is not the best tool for that (take a look here), but if you really want to, I think in simple cases you can also use one regex for all tags, including nested ones:
(?=(<([^>]+)>([\s\S]*?)<\/\2>))
DEMO
The 1st group holds the whole captured part, the 2nd group captures just the tag name, and the 3rd group captures the content of the tag. Because the whole pattern is a lookahead, it does not actually consume any text; it only captures fragments. However, you can usually get the start/end index of each match and use it as you want.
I still think you should reconsider using regex, but the syntax used above is quite useful, so it is worth knowing how to use it.
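Here is a short Python sketch of how the lookahead pattern can be used; the function name and the sample string are made up for the example. Because each match is zero-width, the useful data is in the groups and in the match position.

```python
import re

# The lookahead pattern from above: group 2 is the tag, group 3 its content.
NESTED_TAG_RE = re.compile(r"(?=(<([^>]+)>([\s\S]*?)<\/\2>))")


def capture_tags(html):
    """Return (start_index, tag, content) for every tag pair that matches."""
    return [
        (m.start(), m.group(2), m.group(3))
        for m in NESTED_TAG_RE.finditer(html)
    ]


sample = "<div><b>x</b></div>"
for start, tag, content in capture_tags(sample):
    print(start, tag, content)
# prints:
# 0 div <b>x</b>
# 5 b x
```

Since the closing tag is matched with a backreference to everything between < and >, an opening tag with attributes (for example <div class="a">) is not matched by this pattern.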

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull certain elements out of the HTML. The pages I am scraping contain script tags with a bunch of info in them, but there is only one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
There is other content above and below it. The URL is different for each page, except of course for the domain and some of the subfolders that store the images.
How would I go about searching the source for this specific line and cutting out just the URL? I think I need regular expressions, since the URLs are dynamic.
The "gsub" method does something similar to what I want, with its ability to use a /regex/. But I don't want to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, I guess this is what you're looking for:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
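A slightly stricter variant anchors on the 'image' key and stops at the closing quote, so the trailing quote and comma are not captured (a greedy http.+ runs to the end of the line). This is a sketch in Python rather than Ruby; the pattern and the function name are an illustration, not part of the answer above, and the pattern assumes the value is always wrapped in single quotes.

```python
import re

# Assumes the line has the form 'image':'<url>' with single quotes.
IMAGE_RE = re.compile(r"'image'\s*:\s*'(https?://[^']+)'")


def find_image_url(source):
    """Return the first image URL found in source, or None if there is none."""
    match = IMAGE_RE.search(source)
    return match.group(1) if match else None


source = """
'title':'something',
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
'other':'stuff',
"""
print(find_image_url(source))
# prints: http://ut5.example.com/t/231/3_b_643435.jpg
```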

Extracting meta tags attribute using wget

I have a file with URLs listed line by line. I need to extract the "keywords" from the meta tags of each page, i.e. if there is a meta tag for "keywords", I want to get its "content" value.
Example: if the web-page has this meta-tag:
<meta name="keywords" content="wikipedia,encyclopedia">
then for that URL I want "wikipedia,encyclopedia" to be extracted.
One approach is to download the web page using "wget" and then parse it with a standard HTML parser.
I was wondering whether there is a better way to do this without downloading the entire web page.
No, you have to download the whole page, or interrupt the download after receiving some amount of data (which is even worse and much more complicated to do because, AFAIK, it cannot be done with wget and you would have to code your own wget).
If you're comfortable with some PHP, you should be able to put something together pretty easily by wrapping a loop around QueryPath.
Swiping an example from the docs, this:
require 'QueryPath/QueryPath.php';
$url = 'http://example.com';
print qp($url, 'title')->text();
...will go out and get the document at example.com, extract the text of the title tag and output it.
It'd only take a little more work to make that look for meta keywords tags and extract the content attribute, especially if you're already familiar with jQuery. (It's a bit of a simplification, but a large chunk of QueryPath is more or less implementing a "server-side jQuery.")
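For comparison, the same lookup can be sketched with nothing but the Python standard library instead of PHP and QueryPath. The class and function names here are made up, and the sketch works on HTML that has already been downloaded (for example by wget); fetching each URL from the file is left out.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Remember the content attribute of the first <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.keywords is not None:
            return
        attributes = dict(attrs)
        # Attribute names arrive lowercased; compare the value case-insensitively.
        if (attributes.get("name") or "").lower() == "keywords":
            self.keywords = attributes.get("content")


def extract_keywords(html):
    """Return the keywords string, or None if the page has no such meta tag."""
    parser = KeywordsParser()
    parser.feed(html)
    parser.close()
    return parser.keywords


page = '<html><head><meta name="keywords" content="wikipedia,encyclopedia"></head></html>'
print(extract_keywords(page))
# prints: wikipedia,encyclopedia
```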
If you pursue this programmatic method and have further questions, they should probably go on the main Stack Overflow site where there's also an active querypath tag.
Here you have another solution:
http://simplehtmldom.sourceforge.net
I haven't tried it yet!