I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type in a chemical code to pull up the information on the next page (using RSelenium to input text and click elements). So far I have been able to build a short script that gets me through the first step, i.e. it pulls up the correct page. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off this website, mainly because the site does not give each search result its own URL; it returns the same .aspx address for any chemical I search. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have any thoughts on how I should get the data off the page after clicking the element? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step I need is to scrape the table from the next page.
library("RSelenium")
checkForServer()
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()
First of all, your tool choice is wrong: you do not need a full browser (RSelenium) for this; plain HTTP requests are enough.
Secondly, in your case the request flow is:
POST to the "permanent" url
302 redirect to a new url, which is http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx in your case
GET the new url
Thirdly, what's the ultimate output you are after?
It really depends on how much data you need; for a handful of chemicals, doing it manually may be faster than automating it.
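If you do want to automate it, here is a minimal sketch of that POST-and-follow-redirect flow using rvest's session helpers (the field name Tbox_cas and button name Butsearch are assumptions taken from the question's CSS selectors; inspect html_form(sess) to confirm them, and adjust the table selector on the result page):

library(rvest)

sess <- session("http://limitvalue.ifa.dguv.de/")
form <- html_form(sess)[[1]]                       # the ASP.NET search form
form <- html_form_set(form, Tbox_cas = "64-19-7")  # field name assumed from the #Tbox_cas selector
result <- session_submit(sess, form, submit = "Butsearch")

# the result page (WebForm_ueliste2.aspx) should contain the substance table
tbl <- result %>% html_element("table") %>% html_table()

rvest carries the ASP.NET hidden fields (__VIEWSTATE, __EVENTVALIDATION) forward automatically, which is the fiddly part of doing the POST by hand.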
I am trying to webscrape some data from ESPN using R, and I am having trouble getting past the login. I don't know if it is because ESPN prevents webscraping or if I am missing something. Here is my code:
library(rvest)
url = "https://fantasy.espn.com/football/league/draftrecap?seasonId=2015&leagueId=1734728"
pgsession<-html_session(url)
read_html(url) #to make sure I am in the ESPN login page not the league page
I think this is the step where I go wrong; I don't know how to find the correct form needed for the login.
fantform <-html_form(pgsession)[[1]]
fantform #to check the form I have
The response I get from checking this form is below. It doesn't seem right, but if I change the form number I get an "error: subscript out of bounds".
<form> '<unnamed>' (GET )
<field> (search) :
The rest of the code I have is below but I am pretty sure I am stuck at this part.
filled_form <- html_form_set(fantform, "usernanerow" = "username", "passwordrow" = "password")
submit_form(pgsession,filled_form)
Fantasy_league <- jump_to(pgsession, "https://fantasy.espn.com/football/league/draftrecap?seasonId=2015&leagueId=1734728")
I am very grateful for all responses/help. Thank you in advance!
They seem to make it really hard to find the login page, and therefore hard to find the login form; as you noted, the form you get above isn't the right one.
If you can find the login page, just replace your first URL with the URL of that login page.
Apart from that, the rest of the code is correct.
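As a rough sketch of that pattern (the login URL below is only a placeholder, the field names passed to html_form_set() must match whatever html_form() actually reports, and if ESPN renders its login form with JavaScript then rvest alone will not see it):

library(rvest)

login_url <- "https://www.espn.com/login"    # placeholder: swap in the real login page once you find it
pgsession <- html_session(login_url)
pgform    <- html_form(pgsession)[[1]]       # confirm this really is the login form
print(pgform)                                # check the actual field names here
filled_form <- html_form_set(pgform, username = "your_username", password = "your_password")
pgsession   <- submit_form(pgsession, filled_form)

Fantasy_league <- jump_to(pgsession,
  "https://fantasy.espn.com/football/league/draftrecap?seasonId=2015&leagueId=1734728")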
If you're happy to use existing packages, it seems there is already one that might help with what you need:
https://cran.r-project.org/web/packages/fflr/vignettes/fantasy-football.html
I am new to scraping with Python and BeautifulSoup4, and I have no knowledge of HTML. To practice, I am trying to use them on the Carrefour website to extract the price and price per kilogram of a product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup

barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class': 'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any results at all.
Here is a screenshot of the HTML code:
What am I doing wrong? I tried it with another website and it seems to work.
The problem here is that you're scraping a page with JavaScript-generated content. Basically, the page that you're grabbing with requests doesn't actually contain the thing you're trying to extract - it contains a bunch of JavaScript. When your browser loads the page, it runs that JavaScript, which generates the content, so the rendered page you see in your browser is not the same thing returned by the request itself. The page contains instructions for your browser to build the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape from this page, you'll need to look into other solutions that can handle javascript generated content:
Web-scraping JavaScript page with Python
Alternatively, the JavaScript generates content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print the page that it receives. You'll see that within that page there are requests to other locations for the info you're asking for. You might be able to change your request so that it goes not to the page where you view the info, but to the location that page gets its data from.
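If you want to stay with this page, here is a minimal sketch that renders the JavaScript with Selenium first and then parses the result with BeautifulSoup as before (the CSS classes are the ones from the question and may well differ in the rendered page):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()                     # needs chromedriver on your PATH
driver.get('https://www.carrefour.es/?q=5449000000996')
time.sleep(5)                                   # crude wait for the JavaScript to finish
html = driver.page_source                       # the rendered HTML, not the raw response
driver.quit()

bs = BeautifulSoup(html, 'lxml')
print(bs.find_all('strong', {'class': 'ebx-result-price__value'}))
print(bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'}))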
Please help me.
I'm trying to scrape the splits table, but I can't manage to do it and I don't understand why.
This is the url:
https://www.strava.com/activities/1983801964
This is the credential to login:
email=trytest@tiscali.it
password=12345678
This is my code:
pgsession<-html_session("https://www.strava.com/login")
pgform<-html_form(pgsession)[[1]]
filled_form <- set_values(pgform, email = "trytest@tiscali.it", password = "12345678")
submit_form(pgsession, filled_form)
page<-jump_to(pgsession, "https://www.strava.com/activities/1983801964")
page %>% html_nodes(xpath = '//*[@id="contents"]')
And I get {xml_nodeset (0)}
I tried everything, also
page%>%html_nodes("body")%>%html_text()
But I can't get this information, please help me!!
Thanks in advance
I cannot find the split data in the HTML. Therefore, it may not be possible to scrape the splits from the HTML like this.
Alternatively, you can download the raw activity data. Link: https://support.strava.com/hc/en-us/articles/216918437-Exporting-your-Data-and-Bulk-Export
Edit: you may also be able to use this method to download Strava data: https://scottpdawson.com/export-strava-workout-data/
Edit 2: The splits are contained in a DIV called "splits-container". However, the source HTML is likely modified by JavaScript after the page is loaded, which means you will probably not be able to scrape the data without running the JavaScript first. Hope this helps.
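Building on Edit 2, here is a hedged sketch that runs the JavaScript with RSelenium, logs in, and then reads the "splits-container" element (the login field selectors and the assumption that the container holds a table are guesses and may need adjusting):

library(RSelenium)
library(rvest)

rd <- rsDriver(browser = "firefox", verbose = FALSE)
br <- rd$client

# log in first (field selectors are assumptions -- check the login page)
br$navigate("https://www.strava.com/login")
br$findElement("css selector", "#email")$sendKeysToElement(list("trytest@tiscali.it"))
br$findElement("css selector", "#password")$sendKeysToElement(list("12345678"))
br$findElement("css selector", "button[type='submit']")$clickElement()

# open the activity and give the JavaScript time to build the splits
br$navigate("https://www.strava.com/activities/1983801964")
Sys.sleep(5)

page   <- read_html(br$getPageSource()[[1]])
splits <- page %>% html_node(".splits-container table") %>% html_table()

br$close()
rd$server$stop()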
I'm trying to build a Rainmeter widget that fetches the nearest town to the user and displays it onscreen. I'm currently trying to use the WebParser plugin with this website, but it doesn't seem to be working. The code I've adapted from the example on the Rainmeter website is below - any ideas?
[Rainmeter]
Author=Rainmeter staff
Update=1000
;[WEBSITE MEASURES]===============================
[MeasureWebsite]
Measure=Plugin
Plugin=WebParser
UpdateRate=1800
URL=http://locationdetection.mobi/
RegExp="(?siU)<span style="color:white;">town_city:</span> <b>"(.*)"</b>.*"
[MeasureTown]
Measure=Plugin
Plugin=WebParser
Url=[MeasureWebsite]
StringIndex=1
;[DISPLAY METERS]==================================
[TextStyle]
X=2
Y=17
FontFace=Segoe UI
FontSize=32
FontColor=#454442
StringStyle=Bold
Antialias=1
[MeterTown]
MeasureName=MeasureTown
Meter=String
MeterStyle=TextStyle
Y=2
I'm not sure if it's too late but... your URL doesn't seem to contain any town_city information... at least not in the source code.
Rainmeter reads the raw HTML code from the URL you give it and runs it through the provided RegExp. Without that information in the page source, you can never get it to work.
I am considering starting a project so that I can learn more and keep the things I have learned thus far from getting rusty.
A lot of the project will be new things so I thought I would come here and ask for advice on what to do and how to go about doing it.
I enjoy Photoshop and toying around with it, so I thought I would mix my project with something like that. So I decided my program will do something along the lines of grabbing new resources for Photoshop and putting them in their own folders on my computer (from deviantART for now).
For now I want to focus on a page like this:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9
I'm not fluent in HTML, so it is a bit hard for me to see exactly what is going on in the page source.
But lets say I am on that page and I have the following options chosen:
Sorted by Popular
Sorted by All Time
Sorted by 24 Items Per Page
My goal is to individually go to each thumbnail and grab the following:
The Author
The Title
The Description
Download the File (create folder based on title name)
Download the Image (place in folder with the file above)
Create text file with the author, title, and description in it
I would like to do that for each of the 24 items on the page and then go to the next page and do the same. (I am thinking of just going through the first five pages as I don't have too much interest in trying out brushes that aren't too popular)
So, I'm posting this for a sense of direction and perhaps some help on how to parse such a page to get what I'm looking for. I'm sure this project will keep me busy for awhile, but I'm hoping it will become useful in teaching me things.
Any help and suggestions are always appreciated.
EDIT
Each page is made up of 24 of these:
<div class="tt-a" usericon="http://a.deviantart.net/avatars/s/h/shad0w-gfx.gif" collect_rid="1:19982524">
<span class="shad0w" style="background-image: url ("http://sh.deviantart.net/shad0w/x/107/150/logo3.png");">
<a class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005" href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo-0">Shad0ws Blood Brush Set</a>
My assumption is, I want to grab all my information from the:
<a class="t" ... >
Since it contains the title, author, and link to where the download url and large image is located.
If this sounds correct, how would one go about getting that info for each object on the page (24 per page)? I would assume by using CyberNeko. I'm just not exactly sure how to get down to the level where the <a class="t"> element is located, and how to do that for each of them on the page.
EDIT #2
I have some test code that looks like this:
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0")
divs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
divs.each { println it }
The XPath is correct, but it prints out:
<?xml version="1.0" encoding="UTF-8"?><a href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005">Shad0ws Blood Brush Set
Can you explain what I need to do to just get the href out of there? Is there a simple way to do it with HtmlUnit?
Meeting the requirements you've listed above is actually pretty easy. You can probably do it with a simple Groovy script of about 50 lines. Here's how I would go about it:
The URL of the first page is
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0
To get the next page, simply increase the value of the offset parameter by 24:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=24
So now you know how to construct the URLs for the pages you need to work with. To download the content of this page use:
def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'
// get the content as a byte array
byte[] pageContent = new URL(pageUrl).bytes
// or get the content as a String
String pageContentAsString = new URL(pageUrl).text
Now all you need to do is parse out the elements of the content that you're interested in and save them to files. For the parsing, you should use an HTML parser like CyberNeko or Jericho.
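Since you already have HtmlUnit working (and it uses CyberNeko under the hood), here is a hedged sketch of the extraction step, which also answers your Edit #2: getHrefAttribute() gives you just the href. The simplified XPath and the idea of reading the author out of the title attribute are assumptions based on the snippet in your question:

import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlAnchor

def client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

(0..4).each { pageIndex ->                        // first five pages, 24 items each
    def page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=${pageIndex * 24}")
    page.getByXPath("//a[@class='t']").each { HtmlAnchor a ->
        def detailUrl = a.getHrefAttribute()      // just the href, no surrounding markup
        def title     = a.getTextContent()
        def author    = a.getAttribute("title")   // e.g. "... by ~Shad0w-GFX, Jun 28, 2005"
        println "$title -> $detailUrl ($author)"
    }
}

From each detail URL you can then download the file and the preview image, and write the author/title/description text file alongside them.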