How to download files from an HTML form in R - html

When I press Ctrl+S and save this page in my web browser:
http://www.kegg.jp/kegg-bin/show_pathway?zma00944+default%3dred+cpd:C01514+cpd:C05903+cpd:C01265+cpd:C01714
I get the HTML page plus a folder containing some PNG files. I'm interested in the PNG files, which follow a known pattern.
Is there a way to download them in the same way from R?
I'm trying:
download.file("http://www.kegg.jp/kegg-bin/show_pathway?zma00944+default%3dred+cpd:C01514+cpd:C05903+cpd:C01265+cpd:C01714","form.html", mode = "wb")
but that downloads only the HTML page, not the associated PNGs.
Thanks

This will get you part of the way there:
source("http://bioconductor.org/biocLite.R")
biocLite("KEGGREST")
library(png)
library(KEGGREST)
# keggGet() with the "image" option fetches the pathway diagram as a raster array
img <- keggGet(c("zma00944","default=red","cpd:C01514","cpd:C05903","cpd:C01265","cpd:C01714"), "image")
tmp <- tempfile(fileext = ".png")  # img/tmp rather than png/t, to avoid masking the png package and base::t()
writePNG(img, tmp)
browseURL(tmp)
Unfortunately it does not do the red highlighting which you probably want. I'm not sure if that can be done through the REST API.
So instead you could download the URL as you already have, parse it for the PNG path, and then download that:
download.file("http://www.kegg.jp/kegg-bin/show_pathway?zma00944+default%3dred+cpd%3aC01514+cpd%3aC05903+cpd%3aC01265+cpd%3aC01714", "form.html")
lines <- readLines("form.html")
imgUrl <- lines[grep('img src="/', lines)]
url <- paste0("http://www.kegg.jp/", strsplit(imgUrl, '"')[[1]][2])
download.file(url, "file.png")
browseURL("file.png")

Related

Beautifulsoup scraping "lazy faded" images

I am looking for a way to parse the images on a web page. Many posts already exist on the subject, and I was inspired by many of them, in particular:
How Can I Download An Image From A Website In Python
The script presented in this post works very well, but I have encountered a type of image whose saving I haven't managed to automate. On the website, inspecting the web page gives me:
<img class="lazy faded" data-src="Uploads/Media/20220315/1582689.jpg" src="Uploads/Media/20220315/1582689.jpg">
And when I parse the page with BeautifulSoup4, I get this (content of the fonts.gstatic.com Source section):
<a class="ohidden" data-size="838x1047" href="Uploads/Media/20220315/1582689.jpg" itemprop="contentUrl">
<img class="lazy" data-src="Uploads/Media/20220315/1582689.jpg" />
</a>
The given URL is not a full web URL that can be used to download the image from anywhere, but a link to the "Sources" section of the web page (Ctrl + Shift + I on the webpage), where the image is.
When I hover over the src link in the page's source code, I can see the true full URL under "Current source". This information is located in the Elements/Properties panel of the DevTools (Ctrl + Shift + I on the webpage), but I don't know how to automate saving the images, either by directly using the link to access the page sources, or by accessing the full address to download the images. Do you have any ideas?
PS: I found this article about lazy-loading images, but my HTML knowledge isn't enough to find a solution to my problem (https://davidwalsh.name/lazyload-image-fade)
I'm not too familiar with web scraping or its benefits. However, I found this article that you can use as a reference, and I hope it helps!
Reference
That said, here is the code and everything you need in one place.
First you have to find the webpage you want to download the images from, which is your decision.
Now we have to get the URLs of the images: create an empty list, open the page, select the image links, loop through them, and append each one to the list.
url = ""
link_list[]
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response, "html.parser")
image_list = soup.select('div.boxmeta.clearfix > h2 > a')
for image_link in image_list:
link_url = image_link.attrs['href']
link_list.append(link_url)
This should theoretically look for any a tag whose href links an image page and append it to that list.
Now we have to get the tags of the image file.
for page_url in link_list:
    page_html = urllib.request.urlopen(page_url)
    page_soup = BeautifulSoup(page_html, "html.parser")
    img_list = page_soup.select('div.seperator > a > img')
This should find all of the div tags that separate from the primary main div class, look for an a tag and then the img tag.
output_folder = Path("images")  # destination folder; it was never defined in the original
output_folder.mkdir(exist_ok=True)
for img in img_list:
    img_url = img.attrs['src']
    file_name = re.search(r".*/(.*png|.*jpg)$", img_url)
    save_path = output_folder.joinpath(file_name.group(1))  # was `filename`, a NameError
Now we are going to try to download that data inside a try/except block, continuing the loop above.
    try:
        image = requests.get(img_url)
        open(save_path, 'wb').write(image.content)
        print(save_path)
    except ValueError:
        print("ValueError!")
I think you are talking about relative versus absolute paths.
A path like Uploads/Media/20220315/1582689.jpg is a relative path.
The main difference between absolute and relative paths is that absolute URLs always include the domain name of the site with http://www. Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. --- ref.
So in your case try this:
import requests
from bs4 import BeautifulSoup
from PIL import Image
URL = 'YOUR_URL_HERE'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')
for img in soup.find_all("img"):
    # Prefer the lazy-loading attribute, falling back to src when data-src is missing
    src = img.get('data-src') or img.get('src')
    if not src:
        continue
    # Get the image's absolute URL by resolving the relative path against the page URL
    absolute_path = requests.compat.urljoin(URL, src)
    # Download the image
    image = Image.open(requests.get(absolute_path, stream=True).raw)
    image.save(absolute_path.split('/')[-1].split('?')[0])

How to export htmlTable from the viewer to a word document in R

I'm working with the htmlTable function. While the table renders perfectly in the RStudio viewer, I am unable to save it, copy and paste it, or screenshot it such that it looks equally nice in the Word document I am writing. I was wondering if there is a way to export or save the image so that the table shows up just as nicely in the Word document.
Here is an example table.
output <-
  matrix(paste("Example", LETTERS[1:16]),
         ncol = 4, byrow = TRUE)
library(htmlTable)
htmlTable(output)
You can display your table as a grid graphic using the grid.table function from the gridExtra package, and save it as an image using the ggsave function from the ggplot2 package.
library(ggplot2)
library(gridExtra)
ggsave(filename = "~/DirectoryName/imageName.png", plot = grid.table(output))
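If you prefer not to draw to a device first, a minimal sketch of an alternative: tableGrob builds the same table as a grob without plotting it, and ggsave accepts grobs directly.
ggsave(filename = "imageName.png", plot = gridExtra::tableGrob(output))
The saved PNG can then be inserted into Word like any other image.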

How to use R to download a file from a webpage when there is no specific file embedded on the page

Is there any way to download a file from a website with download.file() in R when there is no direct file link on the page?
I have this url
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0
There is a link to export a CSV file to my working directory, but when I right-click the export data hyperlink on the webpage and copy the link address,
it turns out to be the following script
javascript:__doPostBack('LeaderBoard1$cmdCSV','')
instead of a URL which would give me access to the CSV file.
Is there any solution to tackle this problem?
You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should for you as well with minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to select your active directory.
library(RSelenium)
library(here)
Here's the URL you provided:
url <- paste0(
  "https://www.fangraphs.com/leaders.aspx",
  "?pos=all",
  "&stats=bat",
  "&lg=all",
  "&qual=y",
  "&type=8",
  "&season=2016",
  "&month=0",
  "&season1=2016",
  "&ind=0"
)
Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."
button_id <- "LeaderBoard1_cmdCSV"
We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):
filename <- "FanGraphs Leaderboard.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (using the chromever argument) works for me. YMMV; check the best way to start a browser session for you.
An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.
driver <- rsDriver(
  browser = "chrome",
  chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client
Using the browser client, navigate to the page and click that button.
Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.
buttons <- list()
browser$navigate(url)
# poll until the button exists on the page
while (length(buttons) == 0) {
  buttons <- browser$findElements(button_id, using = "id")
}
buttons[[1]]$clickElement()
Then wait for the file to show up in your downloads folder, and move it to the current project directory:
while (!file.exists(file.path(download_location, filename))) {
  Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))
Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.
browser$close()
server$stop()
And you're on your merry way!
Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element and using them requires almost no knowledge of website language. But if you don't have an ID to use, above where I specify using = "id", you have a lot of other options:
using = "xpath"
using = "css selector"
using = "name"
using = "tag name"
using = "class name"
using = "link text"
using = "partial link text"
Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
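For instance, a minimal sketch of guarding against an empty result (the selector here is hypothetical):
links <- browser$findElements("a.export-link", using = "css selector")  # hypothetical selector
if (length(links) == 0) {
  stop("no matching element found on the page")
}
links[[1]]$clickElement()  # if several elements matched, they are all in the list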
XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.
Start in Chrome by pressing Ctrl+Shift+J to get the Developer Console. In the upper left corner of the panel that shows up is a little icon for selecting elements.
Click that, and then click on the element you want.
That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.
And that gives you your code!
buttons <- browser$findElements(
  "#linkAccount > div > div.label-account",
  using = "css selector"
)
buttons[[1]]$clickElement()
Boom.

Find absolute html path given relative href using R

I'm new to HTML, but I'm playing with a script to download all the PDF files that a given webpage links to (for fun, and to avoid boring manual work), and I can't find where in the HTML document I should look for the data that completes relative paths - I know it must be possible, since my web browser can do it.
Example: I'm trying to scrape lecture notes linked from this page on ocw.mit.edu using the R package rvest. Looking at the raw HTML, or accessing the href attribute of the "a" nodes, I only get relative paths:
library(rvest)
url <- paste0("https://ocw.mit.edu/courses/",
"electrical-engineering-and-computer-science/",
"6-006-introduction-to-algorithms-fall-2011/lecture-notes/")
# Read webpage and extract all links
links_all <- read_html(url) %>%
html_nodes("a") %>%
html_attr("href")
# Extract only href ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)
links_pdf[1]
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"
The easiest solution that I have found as of today is the url_absolute(x, base) function from the xml2 package. For the base parameter, you use the URL of the page you retrieved the source from.
This seems less error-prone than trying to extract the base URL of the address via regexp.
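Applied to the example above, a minimal sketch (reusing url and links_pdf from the question; the download loop is my addition):
library(xml2)
# resolve the relative hrefs against the page they came from
links_pdf_abs <- url_absolute(links_pdf, base = url)
links_pdf_abs[1]
for (link in links_pdf_abs) {
  download.file(link, destfile = basename(link), mode = "wb")
}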

Embed image in HTML r markdown document that can be shared

I have an R markdown document which is created using a shiny app, saved as a HTML. I have inserted a logo in the top right hand corner of the output, which has been done using the following code:
<script>
$(document).ready(function() {
$head = $('#header');
$head.prepend('<img src=\"FILEPATH/logo.png\" style=\"float: right;padding-right:10px;height:125px;width:250px\"/>')
});
</script>
However, when I save the HTML output and share it, the user of course cannot see the logo, since the code is trying to find a file path that will not exist on their computer.
So, my question is: is there a way to include the logo in the output without the use of file paths? Ideally I don't want to upload the image to the web and change the source to a web address.
You can encode an image file to a data URI with knitr::image_uri. If you want to add it to your document, you can add the HTML code produced by the following command in your header instead of your script:
htmltools::img(src = knitr::image_uri("FILEPATH/logo.png"),
               alt = 'logo',
               style = 'float: right;padding-right:10px;height:125px;width:250px')
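One way to wire that in, as a minimal sketch (assuming an html_document output and a logo.png sitting next to the .Rmd file): write the tag out as an HTML fragment and include it via the YAML header.
# build the <img> tag with the image embedded as a data URI, then save the fragment
logo <- htmltools::img(src = knitr::image_uri("logo.png"),
                       alt = 'logo',
                       style = 'float: right;padding-right:10px;height:125px;width:250px')
writeLines(as.character(logo), "header_logo.html")
Then in the YAML header:
output:
  html_document:
    includes:
      in_header: header_logo.html
Because the image travels inside the HTML as a data URI, the shared file is fully self-contained.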