In another question, a user started from this URL (1), which contains a data table, and somehow converted it into this URL (2) in order to scrape it using JSON and Beautiful Soup.
My question is: given the first URL, how do you find the second, scrape-friendly URL?
The user who found the second URL was asked how he got it, but it has been a while and he never responded. Here is a link to the original thread.
You can do that by using Google Chrome Developer Tools (and with other browsers as well).
Open Google Chrome browser and navigate to the first URL
Open Developer Tools (⌘ + Option + I on macOS, Ctrl + Shift + I on Windows/Linux)
Head to "Network" tab
Click on "Preserve log" and on "XHR" (since it's an XMLHttpRequest)
Reload the page and you'll see the XMLHttpRequest to the second URL
Note: In this case I guessed that it was loaded by an XHR, but next time I'd recommend clicking "All" instead of "XHR". You'll see more results and will need to filter or spend more time finding the relevant request, but you are less likely to miss it.
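If the Network tab lists too many requests to scan by eye, one option is to export the log from DevTools as a HAR file and filter it with a short script. The sketch below assumes the HAR 1.2 layout (log.entries[].request.url and log.entries[].response.content.mimeType); findJsonRequests is a hypothetical helper name and the sample data is made up.

```javascript
// Sketch: list candidate data URLs from a HAR export of the Network tab.
// Assumes the HAR 1.2 layout; `findJsonRequests` is a hypothetical helper,
// not part of any library.
function findJsonRequests(har) {
  const entries = (har && har.log && har.log.entries) || [];
  return entries
    .filter((entry) => {
      const content = (entry.response && entry.response.content) || {};
      const mimeType = content.mimeType || "";
      // Keep responses whose MIME type mentions JSON.
      return mimeType.toLowerCase().includes("json");
    })
    .map((entry) => entry.request.url);
}

// Made-up sample standing in for a real export.
const sampleHar = {
  log: {
    entries: [
      {
        request: { url: "https://example.com/page.html" },
        response: { content: { mimeType: "text/html" } },
      },
      {
        request: { url: "https://example.com/api/table?id=1" },
        response: { content: { mimeType: "application/json; charset=utf-8" } },
      },
    ],
  },
};

console.log(findJsonRequests(sampleHar));
```

With a real export you would load the file with JSON.parse first and pass the result in; the URLs that come back are the ones worth opening directly.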
I think the XPath is not standard, but I do not know how to fetch data from this damn site:
IMPORTXML("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=36773155987365094#","/html/body/div[4]/form/div[3]/div[9]/span/div2/div2")
Because you need to click on a tab on this page to open what you want, IMPORTXML cannot reach the data through that URL directly.
So I opened Chrome DevTools and, under Network, selected Fetch/XHR:
After that, I copied the text شركت سرمايه گذاري البرز-سهامي عام-, pressed Control + F to search across all requests, and searched for that text. It was found in a single request:
http://www.tsetmc.com/Loader.aspx?Partree=15131T&c=IRO3ETLZ0007
Opening this URL shows only that tab:
With the hardest part done, all that remains is to write the desired path:
=IMPORTXML(
"http://www.tsetmc.com/Loader.aspx?Partree=15131T&c=IRO3ETLZ0007",
"//tr[@class='sh']"
)
Which resulted in what you want:
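The "search every response for a known piece of text" step can also be done offline against a HAR export of the Network tab. This is only a sketch: it assumes the HAR 1.2 fields response.content.text and response.content.encoding, it is meant to run under Node.js (it uses Buffer), and findRequestsContaining is a hypothetical helper name.

```javascript
// Sketch: find which requests returned a body containing `needle`.
// Assumes HAR 1.2 fields; bodies may be stored as plain text or base64.
function findRequestsContaining(har, needle) {
  const entries = (har && har.log && har.log.entries) || [];
  return entries
    .filter((entry) => {
      const content = (entry.response && entry.response.content) || {};
      if (typeof content.text !== "string") return false;
      // Decode base64 bodies before searching them.
      const body =
        content.encoding === "base64"
          ? Buffer.from(content.text, "base64").toString("utf8")
          : content.text;
      return body.includes(needle);
    })
    .map((entry) => entry.request.url);
}
```

Calling findRequestsContaining(har, "some text copied from the page") returns the URLs whose responses contain that text, which is the same information the DevTools search gives you.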
Is there any way to download a file from a website using download.file() in R when the site does not expose a direct link to the file?
I have this URL:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0
There is a link to export a CSV file, which I want to save to my working directory, but when I right-click the export data hyperlink on the page and copy the link address,
it turns out to be the following script
javascript:__doPostBack('LeaderBoard1$cmdCSV','')
instead of a URL that gives me access to the CSV file.
Is there any way to tackle this problem?
You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should for you as well with minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to select your active directory.
library(RSelenium)
library(here)
Here's the URL you provided:
url <- paste0(
"https://www.fangraphs.com/leaders.aspx",
"?pos=all",
"&stats=bat",
"&lg=all",
"&qual=y",
"&type=8",
"&season=2016",
"&month=0",
"&season1=2016",
"&ind=0"
)
Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."
button_id <- "LeaderBoard1_cmdCSV"
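That ID lines up with the postback link from the question. As a rough illustration (in JavaScript, since that is what the link itself contains), the sketch below pulls the event target out of the javascript:__doPostBack(...) link and guesses the element ID from it. The "$" to "_" mapping is an assumption based on ASP.NET's default client ID scheme, and both helper names are hypothetical; always confirm the real ID with "Inspect."

```javascript
// Hypothetical helper: extract the event target and argument from a
// javascript:__doPostBack('target','argument') link.
function parseDoPostBack(href) {
  const match = /__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)/.exec(href);
  if (!match) return null;
  return { target: match[1], argument: match[2] };
}

// Assumption: the page uses ASP.NET's default client ID scheme, where the
// "$" separators of the postback target become "_" in the element's id.
function guessClientId(target) {
  return target.replace(/\$/g, "_");
}

const link = "javascript:__doPostBack('LeaderBoard1$cmdCSV','')";
const parsed = parseDoPostBack(link);
console.log(parsed.target);                // LeaderBoard1$cmdCSV
console.log(guessClientId(parsed.target)); // LeaderBoard1_cmdCSV
```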
We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):
filename <- "FanGraphs Leaderboard.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (using the chromever argument) works for me. YMMV; check the best way to start a browser session for you.
An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.
driver <- rsDriver(
browser = "chrome",
chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client
Using the browser client, navigate to the page and click that button.
Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.
buttons <- list()
browser$navigate(url)
while (length(buttons) == 0) {
buttons <- browser$findElements(button_id, using = "id")
}
buttons[[1]]$clickElement()
Then wait for the file to show up in your downloads folder, and move it to the current project directory:
while (!file.exists(file.path(download_location, filename))) {
Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))
Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.
browser$close()
server$stop()
And you're on your merry way!
Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element and using them requires almost no knowledge of website language. But if you don't have an ID to use, above where I specify using = "id", you have a lot of other options:
using = "xpath"
using = "css selector"
using = "name"
using = "tag name"
using = "class name"
using = "link text"
using = "partial link text"
Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.
Start in Chrome by pressing Control+Shift+J to open the Developer Console. In the upper left corner of the panel that shows up is a little icon for selecting elements:
Click that, and then click on the element you want:
That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.
And that gives you your code!
buttons <- browser$findElements(
"#linkAccount > div > div.label-account",
using = "css selector"
)
buttons[[1]]$clickElement()
Boom.
When you type in an invalid address, Chrome displays a grey page that says "Oops! Google Chrome could not find X. Did you mean Y?"
Because this is not an HTTP page but rather one of the browser's built-in things, I can't put a content script in it and can't control it, so my extension is frozen until the user manually goes to another page.
Since the extension is supposed to be able to control the browser on its own, it's very important that anytime this page opens, it automatically goes back to a page I do have content script access to, and then displays a message instead.
Is this impossible?
You can use the chrome.webNavigation.onErrorOccurred event to detect such errors and redirect to a different page if you want (the extension needs the "webNavigation" permission in its manifest). Unless you have an extremely good reason to do so, I strongly recommend against implementing such a feature, because it might break the user's expectations of how the browser behaves.
Nevertheless, sample code:
chrome.webNavigation.onErrorOccurred.addListener(function(details) {
if (details.frameId === 0) {
// Main frame
chrome.tabs.update(details.tabId, {
url: chrome.runtime.getURL('error.html?error=' + encodeURIComponent(details.error))
});
}
});
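On the receiving side, error.html has to read the error back out of its query string. A minimal sketch of that round trip follows; the helper names, the extension ID, and the error string are made up for illustration, and inside the page itself you would pass window.location.href to the reader.

```javascript
// Hypothetical helper mirroring the URL built in the listener above.
function buildErrorPageUrl(baseUrl, error) {
  return baseUrl + "?error=" + encodeURIComponent(error);
}

// Hypothetical helper for error.html: read the "error" query parameter.
function readErrorFromUrl(pageUrl) {
  // searchParams undoes the percent-encoding applied when building the URL.
  return new URL(pageUrl).searchParams.get("error");
}

// Assumed extension ID and error string, for illustration only.
const pageUrl = buildErrorPageUrl(
  "chrome-extension://abcdefghijklmnop/error.html",
  "net::ERR_NAME_NOT_RESOLVED"
);
console.log(readErrorFromUrl(pageUrl)); // net::ERR_NAME_NOT_RESOLVED
```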
According to the docs the only pages an extension can override are:
The bookmarks manager
The history
The new-tab
So an extension can't change, control, or affect the browser's behaviour regarding the "Oops!..." page.
Lately I've been reading a lot of PDF books using Google Chrome. To go to a particular page, you can simply append #page=23 to the url (as in file:///C:/my_book.pdf#page=23). This is a nice and easy way to bookmark your current page number to continue reading the book later.
My question:
What's a way to find out what page you're currently on within the book?
OR
What's a Chrome plugin that bookmarks PDF files within your file system?
I've tried a few extensions, but they don't work unless the book is on a server (as in http://localhost/my_book.pdf), which is not desired in my case.
Thanks!
As of Chrome version 31+ the page number is displayed by the scroll bar if you scroll at all. I'm not sure when (what version) this feature was added.
There's a Chrome extension called "PDF Bookmark". It is free and works in my case.
Here's the link for your reference.
Does it automatically update the hash tag to the page number?
If so, you could use document.location.hash as follows:
var currentPage = document.location.hash.split("=")[1];
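Splitting on "=" breaks if the fragment carries more than one parameter (for example #page=23&zoom=100). A slightly more defensive sketch, assuming the fragment uses the key=value form shown in the question; pageFromHash is a hypothetical helper name:

```javascript
// Hypothetical helper: read the page number from a fragment such as
// "#page=23" or "#page=23&zoom=100". Returns null when no page is present.
function pageFromHash(hash) {
  const params = new URLSearchParams(hash.replace(/^#/, ""));
  const page = Number.parseInt(params.get("page"), 10);
  return Number.isNaN(page) ? null : page;
}

console.log(pageFromHash("#page=23"));          // 23
console.log(pageFromHash("#page=23&zoom=100")); // 23
console.log(pageFromHash(""));                  // null
```

In the browser you would call it as pageFromHash(document.location.hash).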
It's not very user friendly, but you can append the value of the document.documentElement.scrollTop property to the URL.
In the console:
> document.documentElement.scrollTop
<- 4000
Bookmark it as "file://path/to/pdf.pdf#4000"
Then, when you reopen it, use that value to set the same property:
document.documentElement.scrollTop = 4000
A simple user script should be able to do this.
Is it possible to have a print option that bypasses the print dialog?
I am working on a closed system and would like to be able to pre-define the print dialog settings; and process the print as soon as I click the button.
From what I am reading, the way to do this varies for each browser. For example, IE would use ActiveX. Chrome / Firefox would require extensions. Based on this, it appears I'll have to write an application in C++ that can handle parameters passed by the browser to auto print with proper formatting (for labels). Then I'll have to rewrite it as an extension for Chrome / Firefox. The end result is that users on our closed system will have to download / install these features depending on which browser they use.
I'm hoping there is another way to go about this, but this task most likely runs up against browser security restrictions.
I ended up implementing a custom application that works very similarly to the Nexus Mod Manager. I wrote a C# application that registers a custom Application URI Scheme. Here's how it works:
User clicks "Print" on the website.
Website links user to "CustomURL://Print/{ID}".
Application is launched by Windows via the custom URI scheme.
Application communicates with the pre-configured server to confirm the print request and in my case get the actual print command.
The application then uses the C# RawPrinterHelper class to send commands directly to the printer.
This approach required an initial download from the user, and a single security prompt from Windows when launching the application the first time. I also implemented some JavaScript magic to make it detect whether the print job was handled or not. If it wasn't, it asks them to download the application.
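To make the link format in the steps above concrete, here is a small sketch of building and parsing such a link. It only illustrates the "CustomURL://Print/{ID}" pattern; the helper names are hypothetical, the real application described here is written in C#, and the detection logic mentioned above is not reproduced.

```javascript
// Hypothetical helper: build the link the website would point the user to.
function buildPrintLink(id) {
  return "CustomURL://Print/" + encodeURIComponent(String(id));
}

// Hypothetical helper: recover the ID from such a link, or null if the
// link does not follow the pattern.
function parsePrintLink(link) {
  // Matched case-insensitively: URI schemes are case-insensitive, and
  // being lenient about "Print" is a choice made in this sketch.
  const match = /^customurl:\/\/print\/([^\/?#]+)$/i.exec(link);
  return match ? decodeURIComponent(match[1]) : null;
}

console.log(buildPrintLink(1234));                     // CustomURL://Print/1234
console.log(parsePrintLink("CustomURL://Print/1234")); // 1234
console.log(parsePrintLink("https://example.com/"));   // null
```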
I know this is a late reply, but here's a solution I'm using. I have only used this with IE, and have not tested it with any other browser.
The Sub Print below effectively replaces the default print function.
<script language='VBScript'>
Sub Print()
OLECMDID_PRINT = 6
OLECMDEXECOPT_DONTPROMPTUSER = 2
OLECMDEXECOPT_PROMPTUSER = 1
call WB.ExecWB(OLECMDID_PRINT, OLECMDEXECOPT_DONTPROMPTUSER,1)
End Sub
document.write "<object ID='WB' WIDTH=0 HEIGHT=0 CLASSID='CLSID:8856F961-340A-11D0-A96B-00C04FD705A2'></object>"
</script>
Then use JavaScript's window.print(), tied to a hyperlink or a button, to execute the print command.
If you want to automatically print when the page loads, then put the code below near tag.
<script type="text/javascript">
window.onload=function(){self.print();}
</script>
This answer is for the Firefox browser.
Open File > Page Setup
Make all the headers and footers blank
Set the margins to 0 (zero)
In the address bar of Firefox, type about:config
Search for print.always_print_silent and double click it
Change it from false to true
This lets you skip the Print pop up box that comes up, as well as skipping the step where you have to click OK, automatically printing the right sized slip.
If print.always_print_silent does not come up
Right click on a blank area of the preference window
Select new > Boolean
Enter "print.always_print_silent" as the name (without quotes)
Click OK
Select true for the value
You may also want to check what is listed for print.print_printer
You may have to choose Generic/Text Only (or whatever your receipt printer might be named)
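If you need the same setting on several machines, the preference can also be placed in a user.js file in the Firefox profile folder, which Firefox reads at startup. This is a config fragment rather than a script, and it assumes your Firefox version still honors the preference named in the steps above:

```javascript
// user.js in the Firefox profile folder (read at startup).
// Assumes the preference name from the steps above is still supported.
user_pref("print.always_print_silent", true);
```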
The general answer is: NO, you cannot do this in the general case, but there are some cases where you might be able to.
Check
http://justtalkaboutweb.com/2008/05/09/javascript-print-bypass-printer-dialog-in-ie-and-firefox/
If you were allowed to do such a thing anyway, it would be a security issue, since a malware script could silently send print jobs to a visitor's printer.
I found an awesome Firefox plugin that solves this issue. Try the seamless printing plugin for Firefox, which prints from a web application without showing a print dialog.
Open Firefox
Search for the add-on named seamless printing and install it
After successful installation, the print dialog is bypassed whenever the user prints anything.
I was able to solve the problem with this library: html2pdf.js (https://github.com/eKoopmans/html2pdf.js)
Assuming you have access to it, you could do something like this (taken from the GitHub repository):
var element = document.getElementById('element-to-print');
html2pdf(element);