I've got a problem with rvest on Facebook. I've web-scraped many things with R before, so I understand how, for example, html_nodes works. I always use SelectorGadget and everything works. This time SelectorGadget doesn't work on the Facebook site, so I have to cope with the raw HTML.
Let's say I've got this site https://www.facebook.com/avanti/posts/1017920341583065 and I want to extract article title ('Karnawałowe stylizacje F&F'). How can I do it?
I've tried so far:
library("rvest")
link_fb <- "http://www.fb.com/103052579736517_1017920341583065"
html_strony <- read_html(link_fb)
html_text(html_nodes(html_strony, ".mbs._6m6"))
but it doesn't work. I'd be really grateful for any help.
PS: I need to get this title from the post itself, not by clicking through the link, because it could be different there.
I think you should use the Facebook API to download content and information from Facebook: see the Rfacebook R package and the Facebook API documentation: https://developers.facebook.com/
You can also write your own R-to-Facebook-API connection with the httr package. Good luck!
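For example, a minimal httr sketch against the Graph API might look like the following. The API version and the field names are assumptions, and you need a valid access token from a registered Facebook app:
library(httr)

# Assumption: you have registered an app at https://developers.facebook.com/ and obtained a token
token <- "YOUR_ACCESS_TOKEN"
post_id <- "103052579736517_1017920341583065"

# 'name' should hold the title of the shared link; verify the field names against the Graph API docs
res <- GET(paste0("https://graph.facebook.com/v2.8/", post_id),
           query = list(fields = "message,name,link", access_token = token))
content(res)
The Rfacebook package wraps the same API, e.g. getPost(post_id, token), if you prefer not to build the requests yourself.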
I am new to scraping with Python and BeautifulSoup4, and I have no knowledge of HTML. To practice, I am trying to use it on the Carrefour website to extract the price and the price per kilogram of the product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup

barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class': 'ebx-result-price__value'})
    print(searchingprice)
    searchingpriceperkg = bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'})
    print(searchingpriceperkg)
But I do not get any result at all
Here is a screenshot of the HTML code:
What am I doing wrong? I tried it with another website and it seems to work there.
The problem here is that you're scraping a page with JavaScript-generated content. Basically, the page that you're grabbing with requests doesn't actually contain the thing you're trying to extract; it contains a bunch of JavaScript. When your browser goes to the page, it runs that JavaScript, which generates the content, so the rendered page you see in your browser is not the same thing returned by the request itself. The page contains instructions for your browser to build the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape this page you'll need to look into other solutions that can handle JavaScript-generated content:
Web-scraping JavaScript page with Python
Alternatively, the JavaScript generates content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print out the page that it receives. You'll see that within that page there are requests to other locations to get the info you're asking for. You might be able to change your request to go not to the page where you view the info, but to the location that page gets its data from.
Please help me.
I'm trying to scrape the splits table, but I can't manage to do it and I don't understand why.
This is the url:
https://www.strava.com/activities/1983801964
This is the credential to login:
email=trytest@tiscali.it
password=12345678
This is my code:
library(rvest)

pgsession <- html_session("https://www.strava.com/login")
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, email = "trytest@tiscali.it", password = "12345678")
submit_form(pgsession, filled_form)
page <- jump_to(pgsession, "https://www.strava.com/activities/1983801964")
page %>% html_nodes(xpath = '//*[@id="contents"]')
And I get {xml_nodeset (0)}
I tried everything, for example:
page %>% html_nodes("body") %>% html_text()
But I can't get this information. Please help me!
Thanks in advance
I cannot find the split data in the HTML. Therefore, it may not be possible to scrape the splits from the HTML like this.
Alternatively, you can download the raw activity data. Link: https://support.strava.com/hc/en-us/articles/216918437-Exporting-your-Data-and-Bulk-Export
Edit: you may also be able to use this method to download Strava data: https://scottpdawson.com/export-strava-workout-data/
Edit 2: The splits are contained in a DIV called "splits-container". But the source HTML is likely modified by JavaScript after the page is loaded. This means you will probably not be able to scrape the data without running the JavaScript first. Hope this helps.
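If you do want to try running the JavaScript first, one option is to drive a browser from R with RSelenium and parse the rendered source. Below is a rough, untested sketch; it assumes a Selenium server is already running on localhost:4444, that the login fields can be reached via #email and #password, and that the splits end up in an element matching .splits-container after rendering (all of these selectors are assumptions to verify):
library(RSelenium)
library(rvest)

# Assumes a Selenium server is already running on localhost:4444
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L)
remDr$open()

# Log in through the browser so the activity page renders with your session
remDr$navigate("https://www.strava.com/login")
remDr$findElement("css selector", "#email")$sendKeysToElement(list("trytest@tiscali.it"))
remDr$findElement("css selector", "#password")$sendKeysToElement(list("12345678"))
remDr$findElement("css selector", "button[type='submit']")$clickElement()

# Load the activity and give the JavaScript time to build the splits
remDr$navigate("https://www.strava.com/activities/1983801964")
Sys.sleep(5)

rendered <- read_html(remDr$getPageSource()[[1]])
splits <- html_nodes(rendered, ".splits-container")  # assumes "splits-container" is a class
html_text(splits)

remDr$close()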
I want to scrape information from this page for each month with a few search parameters, download all the articles the search returns, and look for some information in them.
Scraping works fine with a CSS selector, for example for getting the article titles:
library(rvest)
library(stringr)
browseURL("http://www.sueddeutsche.de/news")
#headings Jan 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, css = ".entrylist__title")
headings1 <- html_text(headings_nodes1)
headings1 <- str_replace_all(headings1, "\\n|\\t|\\r", "") %>% str_trim()
head(headings1)
headings1
But now I want to download the articles for every entrylist__link that the search returns (for example here).
How can I do that? I followed the advice here, because the URLs aren't regular and have different numbers for each article at the end, but it doesn't work.
Somehow I'm not able to get the entrylist__link information via its href attribute.
I think getting all the links together in a vector is the biggest problem.
Can someone give me suggestions on how to get this to work?
Thank you in advance for any help.
If you right-click on the page and click Inspect (I'm using the Chrome web browser), you can see more detail of the underlying HTML. I was able to pull all the links under the headings:
library(rvest)
browseURL("http://www.sueddeutsche.de/news")
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, ".entrylist__link, a")
html_links <- html_attr(headings_nodes1, "href")
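From there, one way to visit each article is sketched below, building on the objects above. The a.entrylist__link selector and the generic "p" paragraph selector are assumptions you should verify against the page source:
# Target only the entry-list anchors (assumption: they carry the article URLs)
article_nodes <- html_nodes(url_parsed1, "a.entrylist__link")
article_links <- unique(html_attr(article_nodes, "href"))
article_links <- article_links[!is.na(article_links)]

# Download each article and pull out its paragraph text
articles <- lapply(article_links, function(u) {
  page <- read_html(u)
  html_text(html_nodes(page, "p"))
})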
I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type a chemical code to pull up the information on the next page (using RSelenium to input text and click elements). So far I have been able to build a short script that gets me through the first step, i.e. it pulls up the correct page. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off this website, mainly because the website does not assign a new URL for each chemical; it returns the same .aspx address for every search. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have good thoughts on how I should get the data after clicking the element? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step is to scrape the table out of the next page!
library("RSelenium")
checkForServer()
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()
First of all, your tool choice is wrong.
Secondly, in your case the request flow is:
POST to the "permanent" URL
302 redirect to a new URL, which in your case is http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx
GET the new URL
Thirdly, what's the ultimate output you are after?
It really depends on how much data you are after. Otherwise, just do it manually.
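For illustration, here is a rough httr/rvest sketch of that POST-then-GET flow. The form field names (Tbox_cas, Butsearch) and the hidden ASP.NET fields are assumptions based on the element ids in your script and on typical WebForms pages, so check them against the actual form before relying on this:
library(httr)
library(rvest)

base_url <- "http://limitvalue.ifa.dguv.de/"

# GET the search page first to pick up the hidden ASP.NET form fields (names are assumptions)
start <- GET(base_url)
start_html <- read_html(content(start, "text"))
viewstate <- html_attr(html_node(start_html, "#__VIEWSTATE"), "value")
eventvalidation <- html_attr(html_node(start_html, "#__EVENTVALIDATION"), "value")

# POST the search; the field names Tbox_cas and Butsearch are assumptions
resp <- POST(base_url,
             body = list(`__VIEWSTATE` = viewstate,
                         `__EVENTVALIDATION` = eventvalidation,
                         Tbox_cas = "64-19-7",
                         Butsearch = "Search"),
             encode = "form")

# httr should follow the 302 redirect to WebForm_ueliste2.aspx; parse the result tables
result_html <- read_html(content(resp, "text"))
html_table(result_html)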
I'm scraping the NFL website with R. R might not be the best tool for this, but that is not my question here.
I can usually get everything I want, but for the first time I have run into a problem.
In the present case I want to get info from, let's say, this page:
http://www.nfl.com/player/j.j.watt/2495488/profile
The info I want to get is there, under "Draft".
Using xpathSApply(parsedPage, xmlGetAttr, name = "onclick") I get only NULL, and I do not understand why.
I could retrieve the information elsewhere in the code and then paste pieces together to recover the address, but I find it much easier and clearer to get it all at once.
How can I get this using R (or possibly C)? I do not know much about JavaScript, so I would be happy to avoid it.
Thanks in advance for the help.
The reason is that there are no "onclick" attributes in the source code. See (in Chrome):
view-source:http://www.nfl.com/player/j.j.watt/2495488/profile
The onclick attributes are added via JavaScript. Because of that, you need a parser that executes the JS.
In R you can use RSelenium for that as follows:
require(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- remDr$getPageSource()
require(rvest)
doc <- read_html(doc[[1]])
doc %>% html_nodes(".HOULink") %>% xml_attr("onclick")
remDr$close()
#shutdown
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")
For me this resulted in:
[1] "s_objectID=\"http://www.nfl.com/teams/houstontexans/profile?team=HOU_1\";return this.s_oc?this.s_oc(e):true"
[2] "s_objectID=\"http://www.houstontexans.com/_2\";return this.s_oc?this.s_oc(e):true"
[3] "s_objectID=\"http://www.nfl.com/gamecenter/2015122004/2015/REG15/texans#colts/watch_1\";return this.s_oc?this.s_oc(e):true"
...
You can also use a headless browser like PhantomJS; see https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
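For completeness, a minimal sketch of the headless variant along the lines of that vignette (this assumes PhantomJS is installed and an older RSelenium release where phantom() is still available):
library(RSelenium)

# Start PhantomJS (phantom() comes from older RSelenium versions)
pJS <- phantom()
Sys.sleep(5)  # give the binary a moment to start

remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- remDr$getPageSource()

remDr$close()
pJS$stop()  # stop the PhantomJS process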