Scrape webpage but cannot find page links - html

I am trying to scrape some data from the following webpage using Selenium and Beautiful Soup, but when I inspect the HTML I cannot locate the page number link.
http://quote.eastmoney.com/center/boardlist.html#concept_board
Would greatly appreciate any help.

import time
import pandas as pd
from selenium import webdriver

browser = webdriver.Chrome()

df_all = pd.DataFrame()
for j in range(1, 18):
    browser.get('http://quote.eastmoney.com/center/boardlist.html#concept_board')
    mtable = browser.find_element_by_id('table_wrapper-table')
    content = browser.find_element_by_class_name('paginate_input')
    button_go = browser.find_element_by_link_text('GO')
    content.clear()
    content.send_keys(str(j))
    time.sleep(2)
    browser.find_element_by_link_text('GO').click()
    time.sleep(5)
    mtable = browser.find_element_by_id('table_wrapper-table')
    for row in mtable.find_elements_by_css_selector('tr'):
        i = 0
        for cell in row.find_elements_by_tag_name('td'):
            i += 1
            if i == 2:
                print(cell.text, cell.find_elements_by_css_selector("a")[0].get_attribute("href"))

OK, a couple of things here:
You are trying to get a URL, but didn't use the proper syntax; there is no HTTP protocol provided (see the code below).
I am not sure whether you are trying to locate just the page number, or to click through to the next page and so on via the Go button.
Here is the code for what you have provided so far.
driver.get("https://quote.eastmoney.com/center/boardlist.html#concept_board")
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, 'paginte_go'))).click()  # locate the 'Go' button and click it to go to the next page
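A fuller sketch of the paging loop, combining this with the selectors from the question, might look something like the code below. It assumes the 'paginate_input' class, the 'GO' link text and the 'table_wrapper-table' id from the question still match the live page, and it uses the Selenium 4 find_element(By, ...) style:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

browser = webdriver.Chrome()
browser.get('http://quote.eastmoney.com/center/boardlist.html#concept_board')
wait = WebDriverWait(browser, 10)

for j in range(1, 18):
    # type the page number into the pager box from the question
    page_input = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'paginate_input')))
    page_input.clear()
    page_input.send_keys(str(j))
    # click the 'GO' link; the table element stays in place, so pause briefly for the refresh
    wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'GO'))).click()
    time.sleep(2)
    table = browser.find_element(By.ID, 'table_wrapper-table')
    for row in table.find_elements(By.CSS_SELECTOR, 'tr'):
        cells = row.find_elements(By.TAG_NAME, 'td')
        links = cells[1].find_elements(By.TAG_NAME, 'a') if len(cells) > 1 else []
        if links:
            print(cells[1].text, links[0].get_attribute('href'))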

Related

I am trying to webscrape a password-protected ESPN site using R

I am trying to webscrape some data from ESPN using R and I am having trouble getting past the login. I don't know if it is just because ESPN prevents webscraping or if I am missing something. Here is my code:
library(rvest)
url = "https://fantasy.espn.com/football/league/draftrecap?seasonId=2015&leagueId=1734728"
pgsession<-html_session(url)
read_html(url) #to make sure I am in the ESPN login page not the league page
After this step I think I go wrong; I don't know how to find the correct form needed for the login.
fantform <-html_form(pgsession)[[1]]
fantform #to check the form I have
The response I get from checking this form is below. It doesn't seem right, but if I change the form number I get an "error: subscript out of bounds".
<form> '<unnamed>' (GET )
<field> (search) :
The rest of the code I have is below but I am pretty sure I am stuck at this part.
filled_form <- html_form_set(pgform, "usernanerow" = "username", "passwordrow" = "password")
submit_form(pgsession,filled_form)
Fantasy_league <- jump_to(pgsession, "https://fantasy.espn.com/football/league/draftrecap?seasonId=2015&leagueId=1734728")
I am very grateful for all responses/help. Thank you in advance!
They seem to make it really hard to find the login page, and therefore difficult to find the login form; as you stated, the form you get above isn't right.
If you can find the login page, then just replace your first URL with whatever the URL of the login page is.
Apart from that, the rest of the code is correct.
If you're happy to use existing packages it seems like there's already one that might help with what you need:
https://cran.r-project.org/web/packages/fflr/vignettes/fantasy-football.html

Scraping prices with BeautifulSoup4 in Python3

I am new to scraping with Python and BeautifulSoup4, and I have no knowledge of HTML. To practice, I am trying to use it on the Carrefour website to extract the price and price per kilogram of a product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup

barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class': 'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any result at all
What am I doing wrong? I tried it with another website and it seems to work.
The problem here is that you're scraping a page with Javascript-generated content. Basically, the page that you're grabbing with requests actually doesn't have the thing you're grabbing from it - it has a bunch of javascript. When your browser goes to the page, it runs the javascript, which generates the content - so the page you see in the rendered version in your browser is not the same thing returned from the actual page itself. The page contains instructions for your browser to write the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape from this page, you'll need to look into other solutions that can handle javascript generated content:
Web-scraping JavaScript page with Python
Alternatively, the javascript generates content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print out the page that it receives. You'll see that within that page there are requests to other locations to get the info you're asking for. You might be able to change your request to not go to the page where you view the info, but to the location that page gets its data from.
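A minimal sketch of that exercise, using the barcode URL from the question (no other assumptions):

import requests
from bs4 import BeautifulSoup

url = 'https://www.carrefour.es/?q=5449000000996'
html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml')

# print what requests actually received: mostly <script> tags, not the rendered prices
print(soup.prettify())

# the elements searched for in the question are not in this raw response
print(soup.find_all('strong', {'class': 'ebx-result-price__value'}))  # expected: []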

How to use R to download a file from webpage when there is no specific file embedded on the page

Is there any possible way to extract a file from a website with download.file() in R when there is no direct link to the file on the page?
I have this URL:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0
There is a link to export a CSV file to my working directory, but when I right-click on the export data hyperlink on the webpage and copy the link address,
it turns out to be the following script
javascript:__doPostBack('LeaderBoard1$cmdCSV','')
instead of a URL that gives me access to the CSV file.
Is there any solution to tackle this problem?
You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should for you as well with minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to select your active directory.
library(RSelenium)
library(here)
Here's the URL you provided:
url <- paste0(
  "https://www.fangraphs.com/leaders.aspx",
  "?pos=all",
  "&stats=bat",
  "&lg=all",
  "&qual=y",
  "&type=8",
  "&season=2016",
  "&month=0",
  "&season1=2016",
  "&ind=0"
)
Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."
button_id <- "LeaderBoard1_cmdCSV"
We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):
filename <- "FanGraphs Leaderboard.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (using the chromever argument) works for me. YMMV; check the best way to start a browser session for you.
An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.
driver <- rsDriver(
  browser = "chrome",
  chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client
Using the browser client, navigate to the page and click that button.
Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.
buttons <- list()
browser$navigate(url)
while (length(buttons) == 0) {
  buttons <- browser$findElements(button_id, using = "id")
}
buttons[[1]]$clickElement()
Then wait for the file to show up in your downloads folder, and move it to the current project directory:
while (!file.exists(file.path(download_location, filename))) {
  Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))
Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.
browser$close()
server$stop()
And you're on your merry way!
Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element and using them requires almost no knowledge of website language. But if you don't have an ID to use, above where I specify using = "id", you have a lot of other options:
using = "xpath"
using = "css selector"
using = "name"
using = "tag name"
using = "class name"
using = "link text"
using = "partial link text"
Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.
Start in Chrome by pressing Control+Shift+J to get the Developer Console. In the upper left corner of the panel that shows up is a little icon for selecting elements.
Click that, and then click on the element you want.
That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.
And that gives you your code!
buttons <- browser$findElements(
  "#linkAccount > div > div.label-account",
  using = "css selector"
)
buttons[[1]]$clickElement()
Boom.

Error when getting website table data using Python Selenium - multiple tables and unable to locate element

I am trying to get info from the Brazilian stock market (BMF BOVESPA). The website has several tables, but my code is not able to get them.
The code below aims to get all the data from the table "Ações em Circulação no Mercado", one of the last tables on the webpage.
I have tried the options below, but neither worked for me:
content = browser.find_element_by_css_selector('//div[#id="div1"]')
and
table = browser.find_element_by_xpath(('//*[#id="div1"]/div/div/div1/table/tbody'))
Thanks in advance for taking my question.
from selenium import webdriver
from time import sleep
url = "http://bvmf.bmfbovespa.com.br/cias-Listadas/Empresas-
Listadas/ResumoEmpresaPrincipal.aspx?codigoCvm=19348&idioma=pt-br"
browser = webdriver.Chrome()
browser.get(url)
sleep(5)  # wait for the website to load
content = browser.find_element_by_css_selector('//div[#id="div1"]')
As an alternative, the code below reaches the same webpage:
url = "http://bvmf.bmfbovespa.com.br/cias-Listadas/Empresas-Listadas/BuscaEmpresaListada.aspx?idioma=pt-br"
Ticker='ITUB4'
browser = webdriver.Chrome()
browser.get(url)
sleep(2)
browser.find_element_by_xpath(('//*[#id="ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_txtNomeEmpresa_txtNomeEmpresa_text"]')).send_keys(Ticker)
browser.find_element_by_xpath(('//*[#id="ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnBuscar"]')).click();
content = browser.find_element_by_id('div1')
Selenium with Python documentation (unofficial)
Hi there,
Selenium provides the following methods to locate elements in a page:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
Why doesn't your code work? Because you're not using the correct method to locate the element:
you're using an XPath inside a CSS selector.
content = browser.find_element_by_css_selector('//div[#id="div1"]') #this part is wrong
Instead, you can do this if you want to select div1:
content = browser.find_element_by_id('div1')
Here's the corrected code:
url = "http://bvmf.bmfbovespa.com.br/cias-Listadas/Empresas-
Listadas/BuscaEmpresaListada.aspx?idioma=pt-br"
Ticker='ITUB4'
browser = webdriver.Chrome()
browser.get(url)
sleep(2)
browser.find_element_by_xpath(('//*[#id="ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_txtNomeEmpresa_txtNomeEmpresa_text"]')).send_keys(Ticker)
browser.find_element_by_xpath(('//*[#id="ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnBuscar"]')).click()
I tested it and it worked :)
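If you then want to pull the rows of the "Ações em Circulação no Mercado" table, a rough sketch along these lines may work once the company page has loaded. It assumes the table really is rendered inside the div1 element from your snippet; the exact nesting on the page may differ:

sleep(5)  # wait for the company page to render
content = browser.find_element_by_id('div1')
for row in content.find_elements_by_tag_name('tr'):
    cells = [cell.text for cell in row.find_elements_by_tag_name('td')]
    if cells:
        print(cells)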
Mark it as the best answer if I helped you :)

Saving R timevis timeline in html webpage

I'm sure I'm missing something basic, but I'm not currently able to find my way out of this problem.
Is there a way to save a simple (non-Shiny) timevis timeline as an HTML webpage from code?
I've successfully done it using the RStudio export button, but I would like to include the step in the code.
htmlwidgets::saveWidget() doesn't work properly, as the saved webpage is incomplete, e.g. the zoom buttons are missing, even with minimal code:
library(timevis)

myTimeline <- timevis(
  data.frame(
    id = 1:2,
    content = c("one", "two"),
    start = c("2016-01-10", "2016-01-12")
  )
)
htmlwidgets::saveWidget(myTimeline, "myTimeLine.html")
Thanks in advance for any help and advice!
There is an open issue on GitHub about this.
The workaround is to use selfcontained = FALSE:
htmlwidgets::saveWidget(myTimeline, "myTimeLine.html", selfcontained = FALSE)
If you want to use a self-contained version (e.g. because you want to serve this htmlwidget via plumber), the issue is the missing zoom buttons.
If you modify the output HTML content to re-include the zoom buttons properly, everything works fine.