How to extract data from HTML into R

I have a link that contains a table. The first thing I tried was to look for a download button, but unfortunately there isn't one. Then I tried to use the XML package in R to fetch the data from individual nodes and build up a data frame myself.
In order to do this I need to know which node (or HTML tag) I want to extract. So I right-clicked in the web browser to inspect the element and found the tag that contains the table I want.
The content of the table starts at <fieldset id="result". We can also see in the browser that the first row of the table is <li class="vesselResultEntry removeBackground">.
Then, when I used R to download this HTML, I found that all the <li> tags relating to the table are gone, replaced by <li class="toRemove"/>. Here is my R code, by the way:
library(XML)
url <- "http://www.fao.org/figis/vrmf/finder/search/#stats"
webpage <- readLines(url)
htmlpage <- htmlParse(webpage, asText = TRUE)
data <- xpathSApply(htmlpage, "//ul[@id='searchResultsContainer']")
data
# <ul id="searchResultsContainer" class="clean resultsContainer"><li class="toRemove"></li></ul>
What I'm trying to do in the code is simply to see whether I can fetch the content of a specific tag. Clearly the rows I want are not in the object (webpage) I saved.
So my questions are:
Is there a way to download the table I want by any means (ideally in R)?
Is there some kind of protection on this website that prevents me from downloading the whole HTML as text and fetching the data from it?
I'd much appreciate any suggestions.

The page you're trying to fetch is assembled dynamically on the browser side on load. The content you get by directly fetching the url does not contain the data you see when you view the DOM. That data is loaded later from a separate URL.
I took a look and the URL in question is:
http://www.fao.org/figis/vrmf/finder/services/public/vessels/search?c=true&gd=true&nof=false&not=false&nol=false&ps=30&o=0&user=NOT_SET
I'm not sure what most of the query string means, but it's clear that ps is "page size" and o is "offset". Page size seems to cap at 200; above that it is forced to 30. The URL returns JSON, so you'll need some way to parse that. The data embedded in the responses says there are 231047 entries, so you'll have to make multiple requests to get it all.
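If it helps, here is a rough R sketch of that approach, using jsonlite to parse the responses. The query-string flags other than ps and o are copied verbatim from the URL above, and the shape of the parsed JSON (in particular, which element holds the vessel rows) is an assumption you would need to check against a real response:
library(jsonlite)  # fromJSON() can read directly from a URL

base <- paste0("http://www.fao.org/figis/vrmf/finder/services/public/vessels/search",
               "?c=true&gd=true&nof=false&not=false&nol=false&user=NOT_SET")

page_size <- 200                 # above 200 the service reportedly falls back to 30
total     <- 231047              # reported in the responses; re-read it from the first call
offsets   <- seq(0, total - 1, by = page_size)

pages <- lapply(offsets, function(o) {
  Sys.sleep(1)                   # be gentle with the server
  fromJSON(paste0(base, "&ps=", page_size, "&o=", o))
})
# Which element of each parsed response holds the vessel rows is an
# assumption; inspect str(pages[[1]]) and then bind those pieces together.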
Data providers usually do not appreciate people scraping their data like that. You might want to look around for a downloadable version.

Related

Cannot collect all nodes of Google search result with goquery: some nodes are missing

I am trying to collect the results of a Google search page in Go using the goquery library. To achieve this, I am collecting all nodes of the document selection with goquery. The problem is that the selection returned by Find("*") does not seem to contain all the nodes of the HTML document. Question: does this method collect ALL nodes of the whole tree structure or not? If not, is there a method to collect them all?
I tried applying the goquery Find("*") method to the whole document selection, but nodes with certain attributes are not returned, even though they are in the HTML document. For instance, div nodes with class="srg" are not recognized.
alltags := doc.Find("*") //doc is the HTML doc with the Google search
The selection does not contain the div tags with class="srg". The same applies to other class values such as "bkWMgd" and "rc".
This has happened to me before, when I was scraping with Python's Beautiful Soup package; the same thing was happening.
It turned out that the HTML markup returned by the fetch was actually the markup the server returns once it decides the client is a bot. I solved this by setting the User-Agent header to Mozilla/5.0.
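As an illustration of the fix (sketched here in R with httr, since the rest of this page leans on R; in Go you would set the same header on the http.Request before handing the response body to goquery):
library(httr)

# Send a browser-like User-Agent so the server does not serve its "bot" page.
# The user-agent string below is only an example.
resp <- GET("https://www.google.com/search?q=goquery",
            user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
page <- content(resp, as = "text")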
You can start by updating the fetch request your code performs.
Hope this helps in your quest to solve this.

Send and receive data to and from a website using the TWebbrowser component in Delphi

I'm creating a VCL application with Delphi 10.3 and want to support some web functionality by having the user enter the ISBN of a book into a TEdit component and then passing this value to the search field on this website: https://isbnsearch.org, after which the website looks up the ISBN and displays the author of the book. I want to somehow access the information (i.e. the author) presented in the search result and use it again in my application.
This is my GUI, for a better idea of what I want to accomplish.
What code can I use for this? Any other feasible suggestions or approaches are welcome.
When performing a search on that website, it simply loads a page with a specific URL query string...
https://isbnsearch.org/search?s=suess
The above example is when I search for "suess", so you can easily concatenate a search URL.
You can use any HTTP component, such as TIdHTTP, to load this search page, then use an HTML parser to scrape the page and read what you need. Much, much easier than trying to read through the TWebBrowser.
In the end, you won't actually display the HTML (I mean you can if you want to), but the idea is to read the data and display it in your own format.
On that specific page, start by locating the ul element with id searchresults; each li element inside it is an individual result. Unfortunately, this website uses pagination and only shows 10 results per page. To get further pages, call the same page again with an extra parameter: &p=2 for the 2nd page, &p=3 for the 3rd page, and so on.
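To make the selector and pagination pattern concrete, here is a rough sketch written in R with rvest, purely as an illustration (the parsing idea is the same from Delphi with any HTML parser); the element names come from the description above and should be verified against the live page:
library(rvest)

# Hypothetical sketch: page 2 of a search for "suess".
page    <- read_html("https://isbnsearch.org/search?s=suess&p=2")
# Each result is expected to be an <li> inside <ul id="searchresults">.
results <- html_nodes(page, "ul#searchresults li")
html_text(results)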
On the other hand, that is the worst way to acquire such information. What you should be doing is using a proper API which gives you machine-friendly data. The service you are referencing doesn't appear to offer one, but here's an example of one which does:
https://openlibrary.org/dev/docs/api/books - this also appears to provide MUCH more information than the site you're using.
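As a sketch of the API route, something along these lines might work; the exact endpoint and the response fields (including the authors structure) are assumptions to be checked against the linked Open Library documentation:
library(jsonlite)

isbn <- "9780394800165"   # example ISBN
url  <- paste0("https://openlibrary.org/api/books?bibkeys=ISBN:", isbn,
               "&format=json&jscmd=data")
book <- fromJSON(url)
# The response is expected to be keyed by "ISBN:<isbn>", with author names
# under $authors$name; verify against a sample response before relying on it.
book[[paste0("ISBN:", isbn)]]$authors$name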

Not scraping the HTML source, but the actual website

I am working on a project where I want to scrape a page like this in order to get the city of origin. I tried to use the CSS selector ".type-12~ .type-12+ .type-12", but I do not get the text into R.
Link:
https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description
I use rvest and the read_html function.
However, it seems that the source has some scripts in it. Is there a way to scrape the website after the scripts have returned their results (as you see it in a browser)?
PS: I looked at similar questions but did not find the answer.
Code:
library(rvest)

main.names <- read_html("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")
names1 <- main.names %>%         # feed `main.names` to the next step
  html_nodes("div.mb0-md") %>%   # get the CSS nodes
  html_text()                    # extract the text
You should not do it. They provide an API, which you can find here: https://status.kickstarter.com/api
Using APIs or Ajax/JSON calls is usually better because:
The server isn't overloaded by your scraper visiting every link it can find and causing unnecessary traffic. That is bad for the speed of your program and bad for the servers of the site you are scraping.
You don't have to worry that a changed class name or id will break your code.
Especially the second point should interest you, since it can take hours to find out which class isn't returning a value anymore.
But to answer your question:
When you use the right scraper you can find everything you want. What tools are you using? There are ways to get the data either before the site's scripts have run or after. You can execute the JS of the site separately and look for hidden content, or for things like display:none CSS classes...
It really depends on what you are using and how you use it.
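One concrete thing to try before reaching for a headless browser (RSelenium or similar) is to check whether the raw HTML already ships the data as embedded JSON, which script-rendered pages often do. A rough sketch, under the unverified assumption that this Kickstarter page embeds the project data in a <script> tag:
library(rvest)

raw <- read_html("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")

# Pull the text of every <script> tag and keep the ones mentioning the field
# you are after (here "location"); inspect the survivors by hand for JSON.
scripts    <- html_text(html_nodes(raw, "script"))
candidates <- scripts[grepl("location", scripts, fixed = TRUE)]
length(candidates)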

What additional data to send in URL request?

Short version: How do I know how to phrase additional data (like specific options on the page that display different HTML but belong to the same URL) when fetching a URL with urllib?
Long version:
I am having trouble figuring out how to handle properties of a URL request that are determined not by the link URL itself but, presumably, by other information that your browser usually sends.
To be more precise:
This page contains a table that I want to read with Python, but the length of the table depends on the number of items per page you choose at the bottom left (i.e. the number of items in what I get from urllib.request.urlopen is the default of 50 or so, not the complete table). Clicking the button for, e.g., 400 items per page doesn't change the URL, so I expect that this information is sent some other way. I understand that urllib can send additional data besides the URL, but it is unclear to me how to figure out how to phrase "give me the whole table" (or "give me 400 items per page") in that data.
Studying the .html file I get from saving the webpage in my browser didn't give me any hints, and I lack the vocabulary to search for answers on the web (googling "urllib request parameter" is too vague).
Hence I'd be completely satisfied if someone would point me to a duplicate of this question.
Thanks in advance :)
For everyone else finding this question, I'll elaborate on the answer @deceze gave in the comments:
Open the webpage you want to read in your browser.
Open your browser's network panel (in Chromium this is [Ctrl+Shift+I], or right-click > Inspect).
Go to the "Network" tab (at least in Chromium).
Do whatever you want your program to do, and the empty network panel list will fill with a lot of data.
Find your request in the list of events (one of the very first ones is probably the right one), click it, and select "Headers".
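Once the "Headers" tab shows you the real request, you replicate it from code with whatever HTTP client you use. Purely as an illustration (the question uses Python's urllib, but the pattern is the same; the host, path, and parameter names below are placeholders to be replaced by what the network panel actually shows), in R with httr it might look like:
library(httr)

resp <- GET(
  "https://example.com/path/from/headers/tab",    # the "Request URL" shown in the panel
  add_headers(`User-Agent` = "Mozilla/5.0",       # copy any headers the site appears to require
              Referer      = "https://example.com/the/table/page"),
  query = list(itemsPerPage = 400, page = 1)      # names taken from "Query String"/"Form Data"
)
content(resp, as = "text")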

Can Go capture a click event in an HTML document it is serving?

I am writing a program for managing an inventory. It serves up HTML based on records from a PostgreSQL database, or writes to the database using HTML forms.
Different functions (adding records, searching, etc.) are accessible via <a></a> tags or form submits, which in turn call handlers registered with http.HandleFunc(); those handlers then generate queries, parse the results, and render them to HTML templates.
The search function renders query results to an HTML table. To keep the search results page usable and uncluttered, I intend to show only the most relevant information there. However, since there are many more details stored in the database, I need a way to access that information too. To do that, I wanted to have each table row clickable, displaying the details of the selected record in a status area at the bottom or side of the page, for instance.
I could try to follow the pattern that works for the other functions, that is, use <a></a> tags and http.HandleFunc() to render new content, but this isn't exactly what I want, for a couple of reasons.
First: there should be no need to navigate away from the search results page to view the additional details; there are few enough details that a single record's full data could be rendered on the same page as the search results.
Second: I want the whole row clickable, not merely the text within a table cell, which is what the <a></a> tags get me.
Using the id returned from the database in an attribute, as in <div id="search-result-row-id-{{.ID}}"></div>, I am able to work with individual records, but I have yet to find a way to then capture a click in Go.
Before I run off and write this in JavaScript, does anyone know of a way to do this strictly in Go? I am not particularly averse to using the tried-and-true JS methods, but I am curious to see whether it could be done without them.
does anyone know of a way to do this strictly in Go?
As others have indicated in the comments, no, Go cannot capture the event in the browser.
For that you will need some JavaScript to send the request for more information to the server (where Go runs).
You could also push all the required information to the browser when you first serve the page and hide/show it based on CSS/JavaScript events, but again, that's just regular web development and has nothing to do with Go.