I'm new to web scraping so I'm fooling around with scrapy and trying to crawl a certain website.
I'm working with the scrapy shell on windows and just trying to establish the proper XPath to a particular element I want to access. The element is a schedule, this is the HTML:
I'm trying to access the rv-schedule-module and all its sub-nodes. I'm able to access all nodes up until the rv-schedule-module however beyond that all XPath calls return null. For instance:
The progression of calls returns data until I want to access a div underneath the rv-schedule-module. That call returns null.
What am I doing wrong?
Just as I suspected that content is dynamically created because it's handled by javascript!
When you inspect the element it will be there but if you check the page source it won't. Scrapy by itself doesn't handle javascript, you'll need something like scrapy-splash or Selenium.
There is a really good post of the all mighty Alex explaining how to use it - https://stackoverflow.com/a/30378765/2781701
Related
The XPath selector in Scrapy shell response.xpath('//div[#class="chr-lot-header__bid-details"]//span[#class="chr-lot-header__value-field"] returns an empty list while the same XPath selector selects the right html tag in the "Elements" tab of my Chrome browser.
Here's the website the XPath selector is intended for:
https://www.christies.com/en/lot/lot-5973059
The output I want the XPath selector to produce is "GBP 11,282,500".
I just checked the website you mentioned in your mentioned is getting the required information dynamically loaded, which means it can not be scrapped directly. Because scrapy only scrap statically available data not dynamically loaded data. To scrap dynamically loaded data either you need to mimic real time browser like you can use selenium/playwright and integrate there library inside your scrapy code or you can try to find API calls in network tab, which this required data is being loaded/fetched.
I started learning Rub/Rails about a month ago, but haven't been able to find many resources specific to my issue.
I understand that in HTML/JS you can do something like:
let elements = document.getElementByName('name')
Is there a way in rails to get elements that share the class/id/name?
How can we interact with those elements? for example: if a div with a specific name already exists, append some data from our rails application to that div instead of creating a new one.
Thank you in advance.
Unless I'm really missing the point, what you're asking for isn't possible.
DOM manipulation (using Javascript) is something that happens client-side, in the browser; the browser requests a page, the server responds with an HTML document, and then the browser builds the DOM and we go from there, running Javascript, and potentially inspecting and manipulating the DOM.
Ruby on Rails is server-side; in the above description, it would be involved in the "the server responds" step, but there is no DOM at that point; it's simply generating an HTML document, using models / a view / a controller.
I am trying to collect results of a google search page in GoLang using the goquery library. In order to achieve this, I am collecting all nodes of a goquery selection with goquery. The problem is that the selection returned by Find("*") does not seem to contain all the nodes of the HTML document. Question: does the method collect ALL nodes with the whole tree structure or not ? If not, is there a method to collect them all ?
I tried using the goquery Find("*") method applied to the whole document selection. So nodes with certain attributes are not returned, although they are in the HTML document. For instance, nodes with are not recognized
alltags := doc.Find("*") //doc is the HTML doc with the Google search
The selection does not contain the div tags with class="srg". The same applies to other class values such as "bkWMgd", "rc" for example.
This has happened to me before. I was trying to web scrape with python beautiful soup package and the same thing was happening.
Later it turned out that the html markup returned when trying to fetch it was actually the markup the server returned after finding a bot. I solved this by setting the User-Agent to Mozilla/5.0.
Hope this helps in your quest to solve this.
You can start by updating the code for the fetch request you have performed.
I recently tried to work with JSoup to parse HTML documents, I went through the turorial on JSoup and found that the select-Method might be what I am looking for.
What I try to accomplish is to find all elements in a html document which possess a certain class. To test that, I tried this with the amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are being used and then I tried to integrate this into a small code snippet. In this example I found the follwing element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
if (deal.text().contains("ItemMatch")) {
System.out.println("Found deal: " + deal.text());
}
}
Unfortunately I can't get the element I am looking for. deals has always the size of 0. I tried to modify my select with only part of the classes, I added the id-attribute and so on. Nevertheless, I do not get the elements (in this case these are nested into some others). If I try an element which is above this element in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), this is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions? I expected to be able to define a selector and JSoup would travers and search the whole DOM-tree.
No, you do not have to specify the full DOM hierarchy. Your test should work, if the elements are really part of the DOM. I suspect that they might not be part of DOM as it is loaded be JSoup. The reason might me, that the inner DOM nodes are filled by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want you can either look into the AJAX calls directly and analyze them, or you move on to another solution like selenium webdriver, which runs a real browser including a working JavaScript engine.
I am writing a program for managing an inventory. It serves up html based on records from a postresql database, or writes to the database using html forms.
Different functions (adding records, searching, etc.) are accessible using <a></a> tags or form submits, which in turn call functions using http.HandleFunc(), functions then generate queries, parse results and render these to html templates.
The search function renders query results to an html table. To keep the search results page ideally usable and uncluttered I intent to provide only the most relevant information there. However, since there are many more details stored in the database, I need a way to access that information too. In order to do that I wanted to have each table row clickable, displaying the details of the selected record in a status area at the bottom or side of the page for instance.
I could try to follow the pattern that works for running the other functions, that is use <a></a> tags and http.HandleFunc() to render new content but this isn't exactly what I want for a couple of reasons.
First: There should be no need to navigate away from the search result page to view the additional details; there are not so many details that a single record's full data should not be able to be rendered on the same page as the search results.
Second: I want the whole row clickable, not merely the text within a table cell, which is what the <a></a> tags get me.
Using the id returned from the database in an attribute, as in <div id="search-result-row-id-{{.ID}}"></div> I am able to work with individual records but I have yet to find a way to then capture a click in Go.
Before I run off and write this in javascript, does anyone know of a way to do this strictly in Go? I am not particularly adverse to using the tried-and-true js methods but I am curious to see if it could be done without it.
does anyone know of a way to do this strictly in Go?
As others have indicated in the comments, no, Go cannot capture the event in the browser.
For that you will need to use some JavaScript to send to the server (where Go runs) the web request for more information.
You could also push all the required information to the browser when you first serve the page and hide/show it based on CSS/JavaScript event but again, that's just regular web development and nothing to do with Go.