I am attempting to get a list of all the satellite-data download links for a given year/month on the European Space Agency's CryoSat-2 website (https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02). No matter which web-scraping or HTML-reading package I use, the links are never included. Below is one example of such an attempt with the URL provided, but it is by no means my only attempt. I am looking for an explanation of why the links that initiate the download of individual files aren't extracted, and for a way to obtain them.
library(textreadr)
html_string<- 'https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02'
html_read<- read_html(html_string)
html_read
[1] "Layer 1" "European Space Agency"
[3] "CryoSat-2 Science Server" "The access and use of CryoSat-2 products are regulated by the"
[5] "ESA's Data Policy" "and subject to the acceptance of the specific"
[7] "Terms & Conditions" "."
[9] "Users accessing CryoSat-2 products are intrinsically acknowledging and accepting the above." "Name"
[11] "Modified" "Size"
[13] "ESA European Space Agency"
OK, here is a solution. In this kind of case, where you can't get the info with regular scraping (rvest and so on), there are two options:
either you get the info with RSelenium, which can be tedious,
or you inspect the page and monitor the XHR requests (with Firefox's developer tools, for example: Network tab, filtered to XHR).
You will find out that the data are loaded from URLs that look like:
https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02
These URLs look as if they serve HTML, but if you open them they are not HTML: they are JSON. The webpage simply displays the requested JSON dynamically.
So you can simply get the info as follows:
url <- 'https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02'
library(jsonlite)
fromJSON(url)
$success
[1] TRUE
$is_writable
[1] FALSE
$results
mtime size name
1 1616543429 37713 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2 1616543428 845594 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3 1616543364 37713 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4 1616543363 528578 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5 1616543321 37713 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6 1616543322 387650 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7 1616543360 37713 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8 1616543359 456414 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9 1616543328 37713 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10 1616543327 385998 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc
path
1 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc
This should give you all the info you need. If you tweak the URL a bit, you should be able to get the listing for any date you want.
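To make that reusable, here is a minimal sketch that wraps the listing URL shown above in a small helper function. It assumes the same query parameters keep working and that pos is simply a paging offset into the listing (an assumption; the example above used pos=5500):

library(jsonlite)

# Build the listing URL for a given year/month and return the file table
# (pos is assumed to be a paging offset into the directory listing).
list_cryosat <- function(year, month, maxfiles = 500, pos = 0) {
  url <- sprintf(
    "https://science-pds.cryosat.esa.int/?do=list&maxfiles=%d&pos=%d&file=Cry0Sat2_data/SIR_SAR_L2/%d/%02d",
    maxfiles, pos, year, month
  )
  fromJSON(url)$results
}

files <- list_cryosat(2013, 2)
head(files$path)  # relative paths of the .HDR / .nc files

Presumably, for a month with more than maxfiles entries you would page through the listing by increasing pos.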
The problem seems to be that the page is dynamic. It probably has some JS code and only loads the links after it runs. So when you get the HTML from the link, you're only getting the base page (before the JS runs).
I can think of two possible solutions:
You can try to use Selenium, which emulates a user in the browser, so it will load the page completely, but the set-up can be a bit involved. See https://www.r-bloggers.com/2014/12/scraping-with-selenium/ for an intro.
The page probably sends an HTTP request to get the links from an API; you can try to figure out the exact request. The Network tab in your browser is a good place to start.
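For option 1, here is a rough RSelenium sketch. It assumes you have a working local Selenium driver and that a fixed wait is enough for the file table to render; both may need adjusting:

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)  # starts a local Selenium server
remote <- driver$client
remote$navigate("https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02")
Sys.sleep(10)  # crude wait for the JavaScript to populate the file listing

links <- remote$getPageSource()[[1]] %>%
  read_html() %>%
  html_elements("a") %>%
  html_attr("href")

remote$close()
driver$server$stop()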
I found this reddit post: https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/
I would like to use the API in such a way that I can get all the comments from this post.
I tried looking into the documentation of this API (e.g. https://github.com/pushshift/api) and this does not seem to be possible. If I could somehow get the LINK_ID pertaining to this reddit post, I think I would be able to do it.
Is this possible to do?
Thanks!
The Link Id of the post is in the URL
https://www.reddit.com/r/obama/comments/xgsxy7 <-- id
You could even search https://www.reddit.com/xgsxy7 to get the information.
If you fetch the endpoint https://www.reddit.com/xgsxy7.json you get the post and its comments as JSON; you then need to walk the object to find them.
JS example:
const data = fetchedJSONObject;
const comments = data[1].data.children.map(comment => comment.data.body); // to get the text body
And you can just analyze the JSON object and get all the data you want from it: if the comment has some nested replies to it, time created, author, etc.
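If you are working from R like most of the examples here, a minimal jsonlite sketch of the same idea (assuming the public .json endpoint is reachable without authentication; reddit may rate-limit the default user agent):

library(jsonlite)

# The .json endpoint returns a two-element array: [post listing, comment listing]
post <- fromJSON("https://www.reddit.com/xgsxy7.json", simplifyVector = FALSE)

# Take the body text of each top-level comment; "more" stubs have no body, so drop NULLs
bodies   <- lapply(post[[2]]$data$children, function(child) child$data$body)
comments <- unlist(Filter(Negate(is.null), bodies))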
I would recommend WebScrapingAPI's extract_rules feature, which returns an array of elements extracted with a CSS selector. For example, I used [data-testid='comment'] as the CSS selector in the following GET request:
https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/&render_js=1&extract_rules={"comments":{"selector":"[data-testid='comment']", "output":"text"}}
And I got:
{
  "comments": [
    "I wonder what's the most number of living ex-presidents there have been at one time?",
    "The highest number is six—occurring in four different periods in history. The most recent period was 2017-2018 before GHW Bush died.",
    "I don't understand what the first half of your title is doing there, other than to confuse and cause a person to have to read the whole title a couple of times to work out that all the living ex-presidents are invited to QEII's DC memorial service.",
    "Agreed, OP is pretty awful at writing headlines.",
    "Former disgraced president trump",
    "No, he's still disgraced.",
    "If the link is behind a paywall, or for an ad-free version:outline.comOr if you want to see the full original page:archive.org or archive.fo or 12ft.ioOr Google cache:https://www.google.com/search?q=site:https://www.townandcountrymag.com/society/politics/a41245384/donald-trump-barack-obama-george-bush-queen-elizabeth-memorial/I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns."
  ]
}
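If you would rather send that request from R, here is a rough httr sketch of the same call. The query parameters are copied from the URL above, <YOUR_API_KEY> stays a placeholder, and I am assuming the API responds with plain JSON:

library(httr)
library(jsonlite)

resp <- GET(
  "https://api.webscrapingapi.com/v1",
  query = list(
    api_key       = "<YOUR_API_KEY>",  # placeholder, replace with your own key
    url           = "https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/",
    render_js     = 1,
    extract_rules = '{"comments":{"selector":"[data-testid=\'comment\']","output":"text"}}'
  )
)

comments <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$comments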
I'm trying to scrape global daily counts of cases and deaths from JHU: https://coronavirus.jhu.edu/
In the web inspector the counts appear to be stored inside the <figure> elements, but when I try to use the following code to access them, all I can find are placeholders:
library(rvest)
url = "https://coronavirus.jhu.edu/"
website = read_html(url)
cases <- website %>%
  html_nodes(css = "figure")
cases
produces the following:
{xml_nodeset (4)}
[1] <figure><figcaption>Global Confirmed</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[2] <figure><figcaption>Global Deaths</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[3] <figure><figcaption>U.S. Confirmed</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[4] <figure><figcaption>U.S. Deaths</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
So I can access these elements, but all that's stored in them is "Loading...", whereas the actual count appears on the site and in the web inspector. I'm new to this, so I appreciate any help you can give me. Thank you!
The numbers you're interested in are updated by querying another source, so the page loads initially with the "Loading..." that you're seeing. You can see this in action if you refresh the page - initially, there's only "Loading..." in the boxes that's later filled in. The trick here is to find the source that supplies that information and request that, rather than the page itself. Here, the page that https://coronavirus.jhu.edu/ pulls from is https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json, so we can query that directly.
url <- "https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json"
website = read_html(url)
website %>%
html_element(xpath = "//p") %>%
html_text() %>%
jsonlite::fromJSON() %>%
as.data.frame()
returning a neat data frame of
generated updated cases.global cases.US deaths.global deaths.US
1 1.637691e+12 1.637688e+12 258453277 47902038 5162675 772588
You were off to a good start by using the web inspect, but finding sources is a little trickier and usually requires using Chrome's "Network" tab instead of the default "Elements". In this case, I was able to find the source by slowing down the request speed (enabling throttling to "Slow 3G") and then watching for network activity that occurred after the initial page load. The only major update came from the URL I've suggested above:
(In the Network tab timeline, the green bar in the top left was the original page loading; the second, lower green bar was the next major update.)
which I could then access directly in Chrome (copy/paste URL) to see the raw JSON.
As an additional note, because you're web scraping, I'd recommend the polite package to obey the website's scraping rules; its summary of the site's robots.txt is included below:
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
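For reference, a minimal sketch of how that summary is produced with polite (assuming the package's default user agent); printing the session object gives roughly the output above:

library(polite)

# bow() reads the site's robots.txt, records the crawl delay, and reports whether the path is scrapable
session <- bow("https://coronavirus.jhu.edu/")
session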
I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type a chemical code to pull up the information on the next page (using RSelenium to input text and click elements). So far I have been able to build a short script that gets me through the first step, i.e. it pulls up the correct page. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off this website, mainly because the website does not assign a new HTML address: it gives me the same .aspx address for any chemical I search. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have good thoughts on how I should get the data off the page after clicking the element? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step I need is to scrape the table from the next page!
library("RSelenium")
checkForServer()
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()
First of all, your tool choice is questionable: you don't need a real browser for this site, plain HTTP requests are enough.
Secondly, in your case the request flow is:
POST the search form to the "permanent" URL,
which responds with a 302 redirect to a new URL, http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx in your case,
then GET that new URL and parse the results table.
Thirdly, what is the ultimate output you are after?
Whether automating this is worth it really depends on how much data you need; for a handful of chemicals, doing it manually may be quicker. A rough sketch of the request-based approach is below.
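Here is a rough httr/rvest sketch of that flow. It is only a sketch: the form-field names (Tbox_cas, Butsearch) are taken from the selectors in your RSelenium code, the button value is a guess, and ASP.NET pages usually also require their hidden __VIEWSTATE / __EVENTVALIDATION fields, so inspect the real form before relying on it:

library(httr)
library(rvest)

base_url <- "http://limitvalue.ifa.dguv.de/"

# Fetch the search page first and copy its hidden ASP.NET fields (__VIEWSTATE etc.)
form_page <- read_html(base_url)
hidden    <- html_elements(form_page, "input[type='hidden']")
post_body <- setNames(as.list(html_attr(hidden, "value")), html_attr(hidden, "name"))

post_body$Tbox_cas  <- "64-19-7"  # field names assumed from the question's CSS selectors
post_body$Butsearch <- "Search"   # guessed button value

# POST the form; httr follows the 302 redirect to WebForm_ueliste2.aspx automatically
result <- POST(base_url, body = post_body, encode = "form")

# Parse the chemical-information table(s) out of the results page
tables <- html_table(read_html(content(result, as = "text", encoding = "UTF-8")))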
I am able to get the news data, like the post title and description, but how do I get the news image using the feedzilla API?
Nothing is mentioned in the API documentation:
http://w.feedzilla.com/api-overview
When you get your list of articles (e.g. http://api.feedzilla.com/v1/categories/26/articles.json), you can see that some of them have an attribute called 'enclosures', where feedzilla puts information about an article's images. As an example, here is an article containing an image, as JSON:
{"enclosures":[{"length":3489,"media_type":"image\/jpeg","uri":"http:\/\/news.com.au.feedsportal.com\/c\/34564\/f\/632403\/e\/1\/s\/36174fe5\/l\/0Lresources30Bnews0N0Bau0Cimages0C20A140C0A10C190C122680A50C4485310Erosetta0Bjpg\/448531-rosetta.jpg"}],"publish_date":"Sun, 19 Jan 2014 14:00:00 +0100","source":"Cnet","source_url":"http:\/\/feeds.feedburner.com\/newscomauworldnewsndm","summary":"ONE of the most ambitious missions in the history of space goes into high-risk\u000amode today when a comet-chasing probe wakes up.\u000a\u000a\u000a\u000a\u000a\u000a\u000a\u000a\u000a","title":"Comet-chasing probe to wake up (Cnet)","url":"http:\/\/news.feedzilla.com\/en_us\/stories\/top-news\/353840990?client_source=api&format=json"}
Note that this attribute is optional.
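If you are consuming the API from R, a minimal jsonlite sketch for pulling the image URIs out (assuming the response wraps the article list in an articles field, as the v1 endpoint shown above did, and remembering that enclosures is optional):

library(jsonlite)

articles <- fromJSON("http://api.feedzilla.com/v1/categories/26/articles.json")$articles

# 'enclosures' becomes a list column; keep only the articles that actually carry one
image_uris <- unlist(lapply(articles$enclosures, function(e) if (!is.null(e)) e$uri))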
I am considering starting a project so that I can learn more and keep the things I have learned thus far from getting rusty.
A lot of the project will be new things so I thought I would come here and ask for advice on what to do and how to go about doing it.
I enjoy Photoshop and toying around with it, so I thought I would mix my project with something like that. I decided my program will do something along the lines of grabbing new Photoshop resources and putting them in their own folders on my computer (from deviantART for now).
For now I want to focus on a page like this:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9
I'm not fluent in reading HTML source, so it is a bit hard for me to see what is going on in the page.
But let's say I am on that page with the following options chosen:
Sorted by Popular
Sorted by All Time
Sorted by 24 Items Per Page
My goal is to individually go to each thumbnail and grab the following:
The Author
The Title
The Description
Download the File (create folder based on title name)
Download the Image (place in folder with the file above)
Create text file with the author, title, and description in it
I would like to do that for each of the 24 items on the page and then go to the next page and do the same. (I am thinking of just going through the first five pages as I don't have too much interest in trying out brushes that aren't too popular)
So, I'm posting this for a sense of direction and perhaps some help on how to parse such a page to get what I'm looking for. I'm sure this project will keep me busy for a while, but I'm hoping it will become useful in teaching me things.
Any help and suggestions are always appreciated.
EDIT
Each page is made up of 24 of these:
<div class="tt-a" usericon="http://a.deviantart.net/avatars/s/h/shad0w-gfx.gif" collect_rid="1:19982524">
<span class="shad0w" style="background-image: url ("http://sh.deviantart.net/shad0w/x/107/150/logo3.png");">
<a class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005" href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo-0">Shad0ws Blood Brush Set</a>
My assumption is that I want to grab all my information from the:
<a class="t" ... >
since it contains the title, the author, and the link to the page where the download URL and large image are located.
If this sounds correct, how would one go about getting that info for each object on the page (24 per page)? I would assume by using CyberNeko. I'm just not exactly sure how to get down to the level where the <a class="t"> element is located, and how to do that for each of them on the page.
EDIT #2
I have some test code that looks like this:
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

def divs = []
def client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
def page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0")
divs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
divs.each { println it }
The XPath is correct, but it prints out:
<?xml version="1.0" encoding="UTF-8"?><a href="http://Shad0w-GFX.deviantart.com/
art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplicat
ions%2Fpsbrushes&qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-G
FX, Jun 28, 2005">Shad0ws Blood Brush Set
Can you explain what I need to do to just get the href out of there? Is there a simple way to do it with HtmlUnit?
Meeting the requirements you've listed above is actually pretty easy. You can probably do it with a simple Groovy script of about 50 lines. Here's how I would go about it:
The URL of the first page is
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0
To get the next page, simply increase the value of the offset parameter by 24:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=24
So now you know how to construct the URLs for the pages you need to work with. To download the content of this page use:
def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'
// get the content as a byte array
byte[] pageContent = new URL(pageUrl).bytes
// or get the content as a String
String pageContentAsString = new URL(pageUrl).text
Now all you need to do is parse out the elements of the content that you're interested in and save them to files. For the parsing, you should use an HTML parser like CyberNeko or Jericho.