Skipping lines that contain only whitespace in R - html

I have a problem reading some HTML subpages. Most of them work just fine, but some, e.g. http://www-history.mcs.st-andrews.ac.uk/Biographies/De_Morgan.html, have empty lines in H1 and H3. Because of that, my data.frame is a total mess for those people, e.g.:
data frame example. The frame contains 4 columns: "Name", "Date and place of birth", "Date and place of death", "Link". I'm supposed to make a table in LaTeX, but because of those whitespace-only rows my table at some points shifts in the wrong direction, and a person's name ends up as his date of birth, and so on. To read the sites I simply use a loop from j=1 to length(LinkWlasciwy):
library(rvest)
matematyk <- LinkWlasciwy[j] %>%
  read_html() %>%
  html_nodes(selektor1) %>%
  html_text()
where selektor1="h3 font , h1". After that I save the contents to a .txt file and read it in another script, where I'm supposed to build a .tex file from this data. In my opinion it would be best to just delete the lines in the file that contain only whitespace (space, \n, etc.). In my txt file, for example:
Marie-Sophie Germain| 1 April 1776
in Paris, France| 27 June 1831
in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|
As a separator I use " | ". Not all of the whitespace lines are the same: some contain only a single space, some two, and so on. All I want is to bring every wrong record to this form:
Marie-Sophie Germain| 1 April 1776 in Paris, France| 27 June 1831 in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|
I had to delete http:// from the text samples because I don't have 10 reputation yet and they are counted as links.

You can use the stringi library:
library(stringi)
line<-c("Marie-Sophie Germain| 1 April 1776",
" ",
"in Paris, France| 27 June 1831",
" ",
"in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|")
line2 <- line[stri_count_regex(line, "^\\s*$") == 0]
line2
stri_paste(line2, collapse=" ")
Result:
[1] "Marie-Sophie Germain| 1 April 1776 in Paris, France| 27 June 1831 in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|"

Web Scraping with rvest and xml2

I'm trying to scrape the date and policy type for COVID-related announcements from this URL: https://covid19.healthdata.org/united-states-of-america/alabama
The first date I'm trying to pull is the "April 4th, 2020" date for Alabama's Stay at Home Order.
As far as I can tell (as I am new to this), it has the xpath:
"//[#id="root"]/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span"
I've been using the following lines to try to retrieve it:
data <- read_html(url) %>%
html_nodes("span.ant-statistic-content-value")
data <- read_html(url) %>%
html_nodes(xpath = "//*[@id='root']/div/main/div[3]/div[1]/div[2]/div[1]/div[1]/div/div/span")
Neither are able to pull the information I'm looking for. Any help would be appreciated!
The data for this page is stored in a series of JSON files. If you use your browser's developer tools and look at the Network requests of type XHR, you should see a list of JSON endpoints like the ones used below. Right-click a request name to copy its URL.
This script should get you started:
library(jsonlite)
#obtain the list of locations
locations<-fromJSON("https://covid19.healthdata.org/api/metadata/location?v=7", flatten = TRUE)
head(locations[, 1:9])
#get list of US locations
US <- locations$children[locations$location_name == "United States of America"]
head(US[[1]])
#Get data frame from interventions
#Create link with desired location_id (569 is Virginia)
#paste0("https://covid19.healthdata.org/api/data/intervention?location=", "569")
Interventions <- fromJSON("https://covid19.healthdata.org/api/data/intervention?location=569", flatten = TRUE)
Interventions
# date_reported covid_intervention_id location_id covid_intervention_measure_id covid_intervention_measure_name
# 1 2020-03-30 00:00:00 110 569 1 People instructed to stay at home
# 2 2020-03-16 00:00:00 258 569 2 Educational facilities closed
# 3 2020-04-19 00:00:00 437 569 7 Assumed_implemented_date
#Repeat for other links of interest
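To get back to the original question about Alabama, the same endpoint should work once you know Alabama's location_id. A sketch, assuming the flattened children data frame exposes location_name and location_id columns (inspect head(US[[1]]) to confirm):
#assumed columns: location_name, location_id
us_states <- US[[1]]
alabama_id <- us_states$location_id[us_states$location_name == "Alabama"]
Interventions_AL <- fromJSON(paste0("https://covid19.healthdata.org/api/data/intervention?location=", alabama_id), flatten = TRUE)
Interventions_AL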

Removing special characters in Drupal Views CSV Data Export

I have a CSV file, a Data Export from Books and Chapters in Drupal 7.
After exporting to a CSV file, the body of the chapter shows an "Â " in the source code for each non-breaking space. The same thing happens for an apostrophe: "’" for each apostrophe (').
An example of the Body in my CSV file:
  This book will discuss Title I (General Requirements) and Title IV (Miscellaneous Provisions).  The FMLA was effective on August 5, 1993, six months after its passage.Â
The same text in Drupal:
This book will discuss Title I (General Requirements) and Title IV (Miscellaneous Provisions). The FMLA was effective on August 5, 1993, six months after its passage.
The same text in my Drupal wysiwyg source code:
This book will discuss Title I (General Requirements) and Title IV (Miscellaneous Provisions). The FMLA was effective on August 5, 1993, six months after its passage.
From what I can see, each &nbsp; turns into "Â " in the CSV file.
In the Data Export View, I have the Rewrite Results of Content: Body (Body) set to strip HTML tags. Moreover, I have tried adding the tag to the preserved-tags list and excluding it as well.
This is probably a beginner error. Does anyone know how to remove or replace these artifacts so the export shows the proper text without the odd markup ("Â ", "’")?
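For what it's worth, "Â " is the classic symptom of UTF-8 text being decoded as Latin-1/Windows-1252: the non-breaking space Drupal emits for &nbsp; is the byte pair C2 A0 in UTF-8, which Latin-1 reads as "Â" plus a space. A quick illustration in R (used here only because the rest of this page is R; the likely fix is to open the exported CSV as UTF-8, e.g. via your spreadsheet's import dialog):
x <- "\u00a0"                            #non-breaking space, i.e. &nbsp;
iconv(x, from="latin1", to="UTF-8")      #reinterpret its UTF-8 bytes as Latin-1
# [1] "Â "  <- the stray "Â " seen in the CSV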

How to extract text from several "div class" elements (HTML) using R?

My goal is to extract info from this html page to create a database:
https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing
One of the variables is the price of the apartments. I've identified that some listings have a div class="row_price" element which contains the price (example A), but others don't have this element, and therefore no price (example B). Hence I would like R to read the observations without a price as NA; otherwise the database gets scrambled, with a listing taking the price of the observation that follows.
Example A
<div class="listing_column listing_row_price">
<div class="row_price">
$ 14,800
</div>
<div class="row_info">Ayer 19:53</div>
Example B
<div class="listing_column listing_row_price">
<div class="row_info">Ayer 19:50</div>
I think that if I extract the text from "listing_row_price" up to the beginning of "row_info" into a character vector, I will be able to get my desired output, which is:
...
10 4000
11 14800
12 NA
13 14000
14 8000
...
But so far I've gotten this one, and another one full of NAs.
...
10 4000
11 14800
12 14000
13 8000
14 8500
...
Commands I used that didn't get what I want:
html1<-read_html("file.html")
title<-html_nodes(html1,"div")
html1<-toString(title)
pattern1<-'div class="row_price">([^<]*)<'
title3<-unlist(str_extract_all(title,pattern1))
title3<-title3[c(1:35)]
pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
title3<-unlist(str_extract(title3,pattern2))
title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))
I also tried pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<', which I thought would match the "listing_row_price" part, then, if it exists, the "row_price" part, then capture the digits, and finally match the < that follows.
There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. Here is a reasonably simple strategy that works in this case:
library(rvest)
page <- read_html('page.html')
# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')
# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x) {
  ifelse(length(html_children(x)) == 2,
         html_text(html_children(x)[1]),
         NA)
})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
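An alternative, if your rvest is recent enough (0.3+): html_node (singular) returns a missing node for listings without a match, and html_text turns that into NA, so the NA handling comes for free:
#one .row_price per listing; listings without one yield NA
prices <- as.integer(gsub('[^0-9]', '', html_text(html_node(listings, '.row_price'))))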

Extract text from HTML node tree with R

I'm currently trying to scrape text from an HTML tree that I've parsed as follows:
require(RCurl)
require(XML)
query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)
query.IMDB
query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")
My first attempt was just to use grep on the resulting vector, but this fails.
data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable
My next attempt was to use grep on the individual elements of df.IMDB:
vect <- numeric(length(df.IMDB))
for (i in 1:length(df.IMDB)){
vect[i] <- data[grep("Users rated this", "", df.IMDB)]
}
but this also throws the closure not subsettable error.
Finally trying the above function without data[] around the grep throws
Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero
I'm actually hoping to eventually replace everything except a number of the form [0-9].[0-9] following the given text string with blank space, but I'm starting with a simpler version to get the thing working.
Can anyone advise which function I should be using to edit the text in each element of df.IMDB?
No need to use grep here (avoid regular expressions with HTML files). Use the handy readHTMLTable function from the XML package:
library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire 8.2 2,694
2 Bart the Genius 7.8 1,167
3 Homer's Odyssey 7.5 1,005
4 There's No Disgrace Like Home 7.9 1,017
5 Bart the General 8.0 992
6 Moaning Lisa 7.4 988
This gives you the table of ratings. You may want to convert UserVotes to numeric.
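For example, stripping the comma thousands separators before converting:
ratings <- readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]]
ratings$UserVotes <- as.numeric(gsub(",", "", ratings$UserVotes))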

Filter only one element/field of a Twitter JSON file

I crawled Twitter JSON from the Streaming API and got a file with thousands of lines of JSON data. However, this data contains lots of fields, such as "creation date", "source", "tweet text", etc.
I actually want to filter for the word "iphone" in the tweet text. However, if I filter with Unix grep, it matches not only the "tweet text" field but also the "source" field. That means a tweet that does not contain the word "iphone" but was tweeted from Twitter for iPhone, as stated in the "source" field, will also be matched.
Is there any way to filter this JSON on only one field (in my case, the "tweet text" field)?
Here's the example of one JSON line:
{"created_at":"Tue Aug 20 03:48:27 +0000 2013","id":369667218608369666,"id_str":"369667218608369666","text":"#Mattyb_chyeah_ yeah I'm only watching him! :)","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":369666992334073856,"in_reply_to_status_id_str":"369666992334073856","in_reply_to_user_id":1557571363,"in_reply_to_user_id_str":"1557571363","in_reply_to_screen_name":"Mattyb_chyeah_","user":{"id":1325959333,"id_str":"1325959333","name":"MattyBRapsTexas","screen_name":"MattyBRapsTexas","location":"Atlanta,Georgia","url":"http:\/\/www.instagram.com\/mattybrapstexas","description":"3 RT 6 Mentions He followed me on 4\/15\/13 6\/17\/13 Maddi Jane followed me on 6\/18\/13 #8:25pm! Cimorelli also follows Pizza Hut mentioned me 2 times on 7\/26\/13","protected":false,"followers_count":1095,"friends_count":426,"listed_count":8,"created_at":"Thu Apr 04 02:34:56 +0000 2013","favourites_count":226,"utc_offset":-14400,"time_zone":"Eastern Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":3447,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/378800000313651225\/afee0cc2286882eeb15f21ed7fae334a_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/378800000313651225\/afee0cc2286882eeb15f21ed7fae334a_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1325959333\/1376759786","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"Mattyb_chyeah_","name":"MattyB (\u2661_\u2661\u2740)","id":1557571363,"id_str":"1557571363","indices":[0,15]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"
What are you using for your grep regex? If you are just using 'iphone', then yes, you'll get multiple hits. You can expand your regex to match iphone only in the text section, before the source:
grep '"text":".*iphone.*","source":' myfile.txt
will search for the pattern iphone after "text" but before "source". It will ignore iphone in the rest of the line.
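If you'd rather parse the JSON than pattern-match it, here is a sketch in R with jsonlite, assuming one complete tweet object per line (as the Streaming API emits) and the myfile.txt name from above:
library(jsonlite)
lines <- readLines("myfile.txt")
#parse each line and keep only tweets whose text field mentions iphone
texts <- vapply(lines, function(l) fromJSON(l)$text, character(1), USE.NAMES=FALSE)
matches <- lines[grepl("iphone", texts, ignore.case=TRUE)]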