Can't XML or HTML Parse This Website in R

I need to scrape this page to get the value of the comment, as well as the Document and Submitter Information on the right side.
https://www.regulations.gov/document?D=FDA-2014-N-1207-7673
I've tried using read_html() and read_xml() from the xml2 package with no luck. I've tried getURLContent() followed by xmlParse() and htmlParse() from RCurl.
I even tried simply readLines(), which does not actually get me the content of the website.
I suppose I don't have a great understanding of how this all works. With previous websites I have always been able to scrape with simply html_parse(), html_nodes() and html_attr(). How can I scrape this website?

Related

returning character(0) when scraping with rvest

I'm trying to do some web scraping with rvest. I'm new to R, so I have a bit of a knowledge barrier. I want to scrape the following URL:
https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85
That directs to a website in Arabic, but I don't think you need to be able to read Arabic to advise me. Basically, this is the first results page for a specific search term on this website (which is not a search engine). What I want to do is use rvest to scrape this page to return a list of the titles of the hyperlinks returned by the search. Using selectorgadget, I identified that the node containing those titles is called ".h2NewsTitle". However, when I try to scrape that node using the code below, all I get in return is "character(0)":
library(tidyverse)
library(rvest)
read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85") %>%
  html_nodes(".h2NewsTitle") %>%
  html_text()
I don't think the issue here has to do with the Arabic text itself. I'm pretty sure everything is in UTF-8, and I can scrape other nodes on the same page and return Arabic text without issue. For example, the code below returns the Arabic text "بحث أسبوعي", which corresponds to the Arabic text in that node on the page itself:
read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85") %>%
  html_nodes("WeeklySearch") %>%
  html_text()
So I'm unsure why it is when I try to scrape the ".h2NewsTitle" node, I just get character(0) in return. I wonder if it has to do with some elements being rendered with JavaScript or something. This is a bit outside my expertise, so any advice on how to proceed would be appreciated. I'd like to continue using R, but am open to switching to Python/Beautiful Soup or something if it's better suited for this.
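One way to narrow this down is to check whether the class appears at all in the HTML the server sends, before any JavaScript runs. The following is a minimal Python 3 sketch using only the standard library; the HTML strings are hypothetical stand-ins rather than the real page, and the helper name `text_for_class` is made up for illustration. It also shows that class matching is case-sensitive, so ".h2Newstitle" and ".h2NewsTitle" are different selectors.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element that carries a given class.

    Simplification: assumes each matching element has a closing tag.
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.open_tag = None  # tag name of the matching element we are inside
        self.depth = 0        # nesting depth of that same tag name
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.open_tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:  # exact, case-sensitive comparison
                self.open_tag, self.depth = tag, 1
                self.results.append("")
        elif tag == self.open_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.depth -= 1
            if self.depth == 0:
                self.open_tag = None

    def handle_data(self, data):
        if self.open_tag is not None:
            self.results[-1] += data


def text_for_class(html, class_name):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextCollector(class_name)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical static HTML: the results container is empty because a script
# would fill it in later, inside the browser.
static_html = '<div class="WeeklySearch">weekly</div><div id="results"></div>'
# Hypothetical HTML as it might look after the script has run.
rendered_html = '<h2 class="h2NewsTitle"><a href="#">Title</a></h2>'

print(text_for_class(static_html, "WeeklySearch"))
print(text_for_class(static_html, "h2NewsTitle"))
print(text_for_class(rendered_html, "h2NewsTitle"))
print(text_for_class(rendered_html, "h2Newstitle"))
```

An empty list for a class that is visible in the browser suggests that the element is either added by JavaScript after the page loads or that the selector does not match exactly.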

not scraping the html source, but the actual website

I am working on a project where I want to scrape a page like this, in order to get the city of origin. I tried to use the css selector: ".type-12~ .type-12+ .type-12" However I do not get the text into R.
Link:
https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description
I use rvest and the read_html function.
However, it seems that the source has some scripts in it. Is there a way to scrape the website after the scripts have returned their results (as you see it in a browser)?
PS I looked at similar questions but did not find the answer.
Code:
library(rvest)
main.names <- read_html("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description") # download and parse the page's static HTML
names1 <- main.names %>% # feed the parsed page to the next step
  html_nodes("div.mb0-md") %>% # get the nodes matching the CSS selector
  html_text() # extract the text
You should not do it. They provide an API, which you can find here: https://status.kickstarter.com/api
Using APIs or Ajax/JSON calls is usually better because:
The server isn't overloaded by a scraper that visits every link it can find and causes unnecessary traffic. That is bad for the speed of your program and bad for the servers of the site you are scraping.
You don't have to worry that a changed class name or id will break your code.
The second point especially should interest you, since it can take hours to find which class no longer returns a value.
But to answer your question:
With the right scraper you can find everything you want. What tools are you using? It is possible to get the data before or after the page has loaded. You can execute the site's JS separately and find hidden content, or find things like display:none CSS classes...
It really depends on what you are using and how you use it.
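To illustrate why a JSON response is less fragile than CSS selectors, here is a minimal Python sketch. The payload and its field names ("project", "location", "city") are invented for illustration and are not the actual response format of any Kickstarter endpoint; the function name `city_of_origin` is likewise hypothetical.

```python
import json

# Hypothetical payload: a real endpoint will use its own field names.
payload = '{"project": {"name": "Example", "location": {"city": "Example City"}}}'


def city_of_origin(raw_json):
    """Return the city from a JSON document, or None if it is missing."""
    data = json.loads(raw_json)
    location = data.get("project", {}).get("location") or {}
    return location.get("city")


print(city_of_origin(payload))
print(city_of_origin("{}"))
```

The lookup depends on field names rather than on page layout, so a redesign of the site's HTML would not affect it.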

Python Web Scrape Index

I am VERY new to web scraping in any shape or form, I've been trying to get into Python and I heard that web scraping was a good way to expose myself to Python. So, after many Google searches I finally came down to the use of two highly recommended modules: Requests and BeautifulSoup. I've read up a fair amount on both and have a basic understanding on how to use them.
I found a very basic website (basic in that there isn't much content or javascript and the like, making parsing the HTML a lot easier) and I have the following code:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://www.basicwebs.co.uk/contact.htm').text)
for row in soup('div',{'id': 'Layer1'})[0].h2('font'):
    tds = row.text
    print tds
This code works. It produces the following result:
BASIC
WEBS
Contact details
Contact details
Which, if you spend a few minutes inspecting the code on this page, is the correct result (I assume). Now, the thing is, while this code works, what if I wanted to get a different part of the page? Like the little paragraph on the page that states "If you are interested in having a website designed and hosted by us, please contact us either by e-mail or telephone." - my understanding would be to simply change the index number to the corresponding header that this text is found under, but when I change it I get a message that the list index is out of range.
Can anybody help? (as simple as you can make it, if possible)
I'm using Python 2.7.8
The text you require is surrounded by a font tag with the attribute size=3, so one way to get it is to select the first occurrence of that tag like this:
font_elements = soup('font', {'size': 3})
if font_elements:
    print font_elements[0].text
RESULT:
If you are interested in having a website designed
and hosted by us, please contact us either by e-mail or telephone.
You can do this directly:
soup('font',{'size': '3'})[0].text
However, I want to draw your attention to the mistake you made before.
soup('div',{'id': 'Layer1'})
This returns a list of every div element with id='Layer1', and there can be more than one. The HTML you were trying to parse has only one such element, so any index other than 0 is out of range.
You can use an interactive Python interpreter such as bpython or ipython to inspect what an object actually contains. Happy hacking!
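The out-of-range error can be reproduced without BeautifulSoup, since the search result behaves like an ordinary list. The sketch below is written for Python 3 (the question uses Python 2.7, where `print` is a statement); the list `matches` is a stand-in for a one-element search result, and the helper `pick` is a hypothetical name.

```python
def pick(matches, index, default=None):
    """Return matches[index] if that position exists, else default."""
    if -len(matches) <= index < len(matches):
        return matches[index]
    return default


# Stand-in for the result of searching a page that has exactly one match.
matches = ["<div id='Layer1'>...</div>"]

print(len(matches))      # number of matches found
print(pick(matches, 0))  # the only element
print(pick(matches, 1))  # no second element, so the default is returned

try:
    matches[1]
except IndexError as error:
    print("IndexError:", error)
```

Checking the length of the result before indexing makes it clear whether the selector or the index is the problem.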
# Python 3 (urllib.request does not exist in Python 2)
from urllib.request import urlopen
from bs4 import BeautifulSoup

web_address = 'http://www.basicwebs.co.uk/contact.htm'
html = urlopen(web_address)
bs = BeautifulSoup(html.read(), 'html.parser')
# Take the first <h2 align="left"> element and print the text of each child.
contact_info = bs.findAll('h2', {'align': 'left'})[0]
for info in contact_info:
    print(info.get_text())

Automate Web Applications -parsing HTML Data

I want to automate a web application. The application parses an HTML page and pulls the inner text of all HTML tags that meet some condition, for example a span tag whose class="spanclass_1":
This is span tag...
That is, a span with a particular class, which the app parses and pulls in.
The main pain point is that I should not use the developer's code to automate that same HTML parsing.
I want to automate checking that the parsing is done correctly, simply by using the parsed data shown in the UI.
Any help would be great.
I appreciate your time reading this.
(Note: the span tag is not shown)
Thanks buddies.
There are not enough details.
Is this HTML page just a file on the local filesystem, or is it a web page on the internet?
Do you have access to the pages? Can you modify them? If yes, just add JavaScript to the page that extracts the data and posts it to the server.
If not, then it depends on the language you program in.
Find a good framework for parsing HTML, load the page, parse it, and extract the data. There are several possible situations:
Worst case: the page is generated on the client side using JS.
Best case: the page is XHTML (you are lucky; any XML parser will build a DOM and let you extract the data).
In between: the page is plain HTML (try several HTML parsers to find the one that suits you best).
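For the best case, a minimal sketch in Python (chosen only for illustration, since the question does not name a language) uses the standard-library XML parser on a well-formed XHTML page. The page content below is a hypothetical stand-in built around the class name from the question, and `spans_with_class` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this XML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical page used only for illustration.
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <span class="spanclass_1">This is span tag...</span>
    <span class="other">ignored</span>
  </body>
</html>"""


def spans_with_class(xhtml, class_name):
    """Return the text of every span element carrying the given class."""
    root = ET.fromstring(xhtml)
    return [
        "".join(span.itertext()).strip()
        for span in root.iter(XHTML_NS + "span")
        if class_name in span.get("class", "").split()
    ]


print(spans_with_class(page, "spanclass_1"))
```

This only works when the page is well-formed XML; for ordinary HTML the parser will raise an error, which is where a tolerant HTML parser is needed instead.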

HTML parsing in Clojure

I'm looking for a good way to parse HTML in Clojure.
Exactly what I'm trying to do is get content of a web page with crawler and then get content of some HTML tags or their attributes.
So I have the URL of the page, and I get the HTML as a String, but how do I get the data I need?
Use https://github.com/cgrand/enlive
It allows you to select and retrieve elements with CSS-like selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with tag-soup but I can tell that enlive works well for most scraping.