How to use R to download a file from a webpage when there is no specific file embedded on the page - html

Is there any way to download a file from a website with download.file() in R when there is no direct file URL on the page?
I have this url
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0
There is a link to export a CSV file to my working directory, but when I right-click the export data hyperlink on the webpage and copy the link address,
it turns out to be the following script
javascript:__doPostBack('LeaderBoard1$cmdCSV','')
instead of a URL that gives me access to the CSV file.
Is there any solution to this problem?

You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should for you as well with minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to locate your project directory.
library(RSelenium)
library(here)
Here's the URL you provided:
url <- paste0(
  "https://www.fangraphs.com/leaders.aspx",
  "?pos=all",
  "&stats=bat",
  "&lg=all",
  "&qual=y",
  "&type=8",
  "&season=2016",
  "&month=0",
  "&season1=2016",
  "&ind=0"
)
Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."
button_id <- "LeaderBoard1_cmdCSV"
We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):
filename <- "FanGraphs Leaderboard.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
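The USERPROFILE trick is Windows-specific. If you're on macOS or Linux, something along these lines should point at the usual Downloads folder, though you may need to adjust it for your setup:
# Assumption: the browser downloads to "Downloads" under your home folder
# on macOS/Linux; edit this if your browser is configured differently.
if (.Platform$OS.type == "windows") {
  download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
} else {
  download_location <- file.path(Sys.getenv("HOME"), "Downloads")
}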
Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (using the chromever argument) works for me. YMMV; check the best way to start a browser session for you.
An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.
driver <- rsDriver(
  browser = "chrome",
  chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client
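If that particular version string errors out for you, one way to see which values of chromever you can pass is to ask binman, which RSelenium pulls in as a dependency (a quick check, not something you need in the script itself):
# Lists the chromedriver versions available to RSelenium on this machine;
# pick one whose major version matches your installed Chrome.
binman::list_versions("chromedriver")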
Using the browser client, navigate to the page and click that button.
Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.
buttons <- list()
browser$navigate(url)
while (length(buttons) == 0) {
  buttons <- browser$findElements(button_id, using = "id")
}
buttons[[1]]$clickElement()
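That loop spins as fast as it can; if you'd rather be a little gentler on the browser, and bail out instead of hanging forever if the button never shows up, a sketch like this does the same thing with a short pause and a timeout:
buttons <- list()
browser$navigate(url)
timeout <- Sys.time() + 30  # give up after 30 seconds
while (length(buttons) == 0 && Sys.time() < timeout) {
  buttons <- browser$findElements(button_id, using = "id")
  Sys.sleep(0.1)  # brief pause between checks
}
if (length(buttons) == 0) stop("Export button never appeared")
buttons[[1]]$clickElement()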
Then wait for the file to show up in your downloads folder, and move it to the current project directory:
while (!file.exists(file.path(download_location, filename))) {
  Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))
Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.
browser$close()
server$stop()
And you're on your merry way!
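If you want the data in R right away, you can read in the file you just moved (assuming the CSV parses cleanly with read.csv's defaults):
leaderboard <- read.csv(here(filename), check.names = FALSE)
head(leaderboard)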
Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element and using them requires almost no knowledge of website language. But if you don't have an ID to use, above where I specify using = "id", you have a lot of other options:
using = "xpath"
using = "css selector"
using = "name"
using = "tag name"
using = "class name"
using = "link text"
using = "partial link text"
Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
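So in practice it's worth checking what came back before you click anything. A small sketch, reusing the export button's ID from above:
matches <- browser$findElements("LeaderBoard1_cmdCSV", using = "id")
if (length(matches) == 0) {
  stop("No matching element found on the page")
} else if (length(matches) > 1) {
  warning("Multiple matches; clicking the first one")
}
matches[[1]]$clickElement()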
XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.
Start in Chrome by pressing Control+Shift+J to get the Developer Console. In the upper left corner of the panel that shows up is a little icon for selecting elements:
Click that, and then click on the element you want:
That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.
And that gives you your code!
buttons <- browser$findElements(
  "#linkAccount > div > div.label-account",
  using = "css selector"
)
buttons[[1]]$clickElement()
Boom.

Related

How to import this link by importxml in google sheets?

I think the XPath is not standard, but I do not know how to fetch data from this site:
IMPORTXML("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=36773155987365094#","/html/body/div[4]/form/div[3]/div[9]/span/div[2]/div[2]")
Since on this page you need to click on a tab to open the data you want, it isn't possible to reach it directly from the original URL.
So I opened Google's Dev Tools, went to the Network tab, and selected Fetch/XHR:
After that, I copied the text شركت سرمايه گذاري البرز-سهامي عام-, pressed CONTROL + F to search across all the requests, and searched for that text; it was found in a single instance:
http://www.tsetmc.com/Loader.aspx?Partree=15131T&c=IRO3ETLZ0007
Equipped with this URL, when you open it, you get just that tab:
Well, then the hardest part is done; now just build the desired path:
=IMPORTXML(
"http://www.tsetmc.com/Loader.aspx?Partree=15131T&c=IRO3ETLZ0007",
"//tr[#class='sh']"
)
Which resulted in what you want:
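If you'd rather pull the same rows into R instead of Google Sheets, a rough rvest sketch with the same XPath looks like this (assuming that tab's page is served as static HTML; if it's built by JavaScript you'd need something like the RSelenium approach from the first answer):
library(rvest)
# Same URL and XPath as the IMPORTXML call above.
page <- read_html("http://www.tsetmc.com/Loader.aspx?Partree=15131T&c=IRO3ETLZ0007")
rows <- html_elements(page, xpath = "//tr[@class='sh']")
html_text2(rows)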

Finding out how to get this specific URL

In another question a user started from this url (1), which contains a data table, and somehow converted it into this url (2) in order to scrape it using JSON and Beautiful Soup.
My question is: how do you get the second, scrape-friendly URL given the first URL?
The user who somehow got the 2nd URL was asked how he got it, but it has been a while and he never responded. Here is a link to the original thread.
You can do that by using Google Chrome Developer Tools (and with other browsers as well).
Open Google Chrome browser and navigate to the first URL
Open Developer Tools (⌘ + option + i)
Head to "Network" tab
Click on "Preserve log" and on "XHR" (since it's an XMLHttpRequest)
Reload the page and you'll see the XMLHttpRequest to the second URL
Note: In this case I guessed that it was loaded by an XHR, but I'd recommend clicking "All" instead of "XHR" next time. You'll see more results and you'll need to filter and/or take some more time to find the relevant call/request, but it'll be more accurate.
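Once you've spotted the request in the Network tab, replaying it outside the browser is usually just a GET plus a JSON parse. A sketch in R with httr and jsonlite, where the URL is only a placeholder for whatever address you find there (the original poster did the equivalent in Python with Beautiful Soup):
library(httr)
library(jsonlite)
xhr_url <- "https://example.com/api/data"  # placeholder: paste the XHR URL you found
resp <- GET(xhr_url)
data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))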

Is it possible to use Watir-Webdriver to interact with Polymer?

I just updated my Chrome browser (Version 50.0.2661.75) and have found that the chrome://downloads page has changed such that my automated tests can no longer interact with it. Previously, I had been using Watir-Webdriver to clear the downloads page, delete files from my machine, etc, without too much difficulty.
It looks like Google is using Polymer on this page, and there are new (to me) elements like paper-button that Watir-Webdriver doesn't recognize. Even browser.img(:id, 'file-icon').present? returns false when I can clearly see that the image is on the page.
Is automating a page made with Polymer (specifically the chrome://downloads page) a lost cause until changes are made to Watir-Webdriver, or is there a solution to this problem?
Given that the download items are accessible in Javascript and that Watir allows Javascript execution (as #titusfortner pointed out), it's possible to automate the new Downloads page with Watir.
Note that the shadow-root elements (aka "local DOM" in Polymer) can be queried with $$.
Here's an example Javascript that logs the icon presence and filename of each download item and removes the items from the list. Copy and paste the snippet into Chrome's console to test (verified in Chrome 49.0.2623.112 on OS X El Capitan).
(function() {
  var items = document
    .querySelector('downloads-manager')
    .$$('iron-list')
    .querySelectorAll('downloads-item');
  Array.from(items).forEach(item => {
    let hasIcon = typeof item.$$('#file-icon') !== 'undefined';
    console.log('hasIcon', hasIcon);
    let filename = item.$$('#file-link').textContent;
    console.log('filename', filename);
    item.$.remove.click();
  });
})();
UPDATE: I verified the Javascript with Watir-Webdriver in OS X (with ChromeDriver 2.21). It works the same as in the console for me (i.e., I see the console logs, and the download items are removed). Here are the steps to reproduce:
Run the following commands in a new irb shell (copy+paste):
require 'watir-webdriver'
b = Watir::Browser.new :chrome
In the newly opened Chrome window, download several files to create some download items, and then open the Downloads tab.
Run the following commands in the irb shell (copy+paste):
script = "(function() {
var items = document
.querySelector('downloads-manager')
.$$('iron-list')
.querySelectorAll('downloads-item');
Array.from(items).forEach(item => {
let hasIcon = typeof item.$$('#file-icon') !== 'undefined';
console.log('hasIcon', hasIcon);
let filename = item.$$('#file-link').textContent;
console.log('filename', filename);
item.$.remove.click();
});
})();"
b.execute_script(script)
Observe the Downloads tab no longer contains download items.
Open the Chrome console from the Downloads tab.
Observe the console shows several lines of hasIcon true and the filenames of the downloaded items.
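If you happen to be driving Chrome from R with RSelenium (as in the first answer) rather than Watir, the same JavaScript can be injected with executeScript; a sketch, assuming script holds the snippet above and browser is the client from rsDriver():
# RSelenium analogue of Watir's b.execute_script(script)
browser$executeScript(script, args = list())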
Looks like Google put the elements inside the Shadow DOM, which isn't supported by the Selenium/Watir/WebDriver spec (yet). There might be a way to obtain the element via JavaScript (browser.execute_script(<...>)), but it is experimental at best for now.
Attempting to automate a Polymer page, I found I was able to access the web elements by asking Polymer to use the shady DOM, by adding ?dom=shady to the URL, as in the example on this page https://www.polymer-project.org/1.0/docs/devguide/settings:
http://example.com/test-app/index.html?dom=shady
Adding the dom parameter to request that Polymer use the shady DOM may be worth a try.

Bookmarking PDF in Google Chrome

Lately I've been reading a lot of PDF books using Google Chrome. To go to a particular page, you can simply append #page=23 to the url (as in file:///C:/my_book.pdf#page=23). This is a nice and easy way to bookmark your current page number to continue reading the book later.
My question:
What's a way to find out what page you're currently in within the book?
OR
What's a Chrome plugin that bookmarks PDF files within your file system?
I've tried a few extensions, but they don't work unless the book is on a server (as in http://localhost/my_book.pdf), which is not desired in my case.
Thanks!
As of Chrome version 31+, the page number is displayed next to the scroll bar when you scroll. I'm not sure exactly which version added this feature.
There's a Chrome extension called "PDF Bookmark"; it is free and works in my case.
Here's the link for your reference.
Does it automatically update the hash tag to the page number?
If so, you could use document.location.hash as follows:
currentPage = document.location.hash.split("=");
currentPage = currentPage[1];
Not very user friendly, but you can append the document.documentElement.scrollTop property value to the URL.
On the console:
> document.documentElement.scrollTop
<- 4000
Bookmark it as "file://path/to/pdf.pdf#4000",
and then, when you reopen it, use that value to set the same property:
document.documentElement.scrollTop = 4000
A simple user script should be able to do this.

How to get links in Excel to open in single browser tab

I have an Excel spreadsheet with HTML links in one column. The links are being generated by a Perl script via Win32::OLE like so (inside a loop with index $i):
my $range = $Sheet->Range("B".$row);
my $link = "http://foobar.com/show.pl?id=$i";
$Sheet->Hyperlinks->Add({Anchor=>$range,Address=>$link,TextToDisplay=>"Link to $i"});
Currently, every time I click one of these links, it opens in a new browser tab. Since there are a lot of these links I wind up with 20 tabs after working with the sheet for a while. This is a pain in the behind because I periodically have to go through and close them.
Is there some way to get these links to open in the same browser tab? I don't know if it's possible to specify the HTML equivalent of an anchor target with a constant name using the Hyperlinks->Add method, or if this would even do the job.
Depends on which browser you are using:
For Firefox, see this link: Force Firefox To Open Links In Same Tab.
It requires setting an option in about:config: browser.link.open_newwindow = 1
For IE: Tools / Options / General / Tabs / Settings,
then set "Open links from other programs in" to "The current tab or window".
Try using Spreadsheet::WriteExcel
#!/usr/bin/perl -w
use strict;
use Spreadsheet::WriteExcel;
# Create a new workbook called simple.xls and add a worksheet
my $workbook = Spreadsheet::WriteExcel->new('Example.xls');
my $worksheet = $workbook->add_worksheet();
# The general syntax is write($row, $column, $token). Note that row and
# column are zero indexed
# Write a hyperlink
$worksheet->write(10, 0, 'http://perldoc.perl.org/');
$worksheet->write(11, 0, 'http://stackoverflow.com/');
__END__
The hyperlinks are opened in the same browser tab (works fine in IE, Firefox, and Chrome).
So I have found the docs for the Excel object model, and there are no properties of a hyperlink that would indicate this is possible. Thanks for the responses, though.