Here is a big web document: https://gallica.bnf.fr/ark:/12148/bpt6k5619759j.texteBrut . I know how to extract the text with
library(rvest)
library(magrittr)
page_url <- "https://gallica.bnf.fr/ark:/12148/bpt6k5619759j.texteBrut"
page_html <- read_html(page_url)
document <- page_html %>%
  html_nodes("p") %>%
  html_text()
document
[1] "Rappel de votre demande:"
[2] "Format de téléchargement: : Texte"
[3] "Vues 1 à 544 sur 544"
[4] "Nombre de pages: 544"
[5] "Notice complète:"
[6] "Titre : Oeuvres complètes de Molière : accompagnées de notes tirées de tous les commentateurs avec des remarques nouvelles. Monsieur de Pourceaugnac / par M. Félix Lemaistre"
[7] "Auteur : Molière (1622-1673). Auteur du texte"
[8] "Auteur : Voltaire (1694-1778). Auteur du texte"
[9] "Auteur : La Harpe, Jean François de (1739-1803). Auteur du texte"
[10] "Auteur : Auger, Louis-Simon (1772-1829). Auteur du texte"
However, it's important for me to track the page from which the text was extracted. The start and end of a page are actually represented by a horizontal line, as you can see at https://gallica.bnf.fr/ark:/12148/bpt6k5619759j.texteBrut. So instead of retrieving a vector in which each element represents a row of the document, I want a list in which each element is a page, and each page is a vector in which each element is a row of the document. Something like
[[1]]
[1] "avurrbbihevyupsexvgymphjhdiqtfxzlwrbzpuqqpcxtlyrmyfxewydqnwqpinafaajvhylgaerlqilsvlwnscbiwoyinwjoudu"
[2] "gcgyuizpzznacdnrucvcjajjkbfahvlqqcoudbhpvuuvgrefpglnweznrimuzuydbzjzvhqezmjqtndzdhvvvbnhyipujusjmbhf"
[3] "caugvpyabksaqgktlrcoghkgjaqglpicgcngovvecesasevcdsmimysvrojvpwhbewxfwhdysvdcwmgxlziajwhilclecnkobmnc"
[4] "vuskqpyfqvqexilxqbhviqbdhhldprgdhifwzvhhvcclmljdgqmzsjrvlosftjshpuhxyjfsmfkqsxhaafysgesxwtoechrtekhy"
[[2]]
[1] "muvahkvftgglaphbzfehpnzvemhzixawlvadoxncmtmtzhqjlciozhgspnrusbkycgoqovxslusonmgqehbajbwpcldjquxchsvx"
[2] "pnhpzpbhjvqhehmlchncmgnhapaoqncvezaphilrpqguetutczpydrqthgdhwjtmlfhgvqvofdcylefrmergbkkwnsxlojgyaagw"
[3] "okjhxdpliykzbmdaghtgnsqftxhgpmkpsmiknuugejnrqmzaxqdljnbroxensegyxpikhzwkfzrqairvdhcvglcelnexvcypjkrx"
[4] "ftrbacjpwgmiuwbprvdkfpplycthukvycsyrjwsrokrrvcylzaxxdsgwlctglqaylegeflnlodttkiincavtncxttegstkgvvqgo"
[[3]]
[1] "ndnsdtqxpatoigobldauekhqdbcgvyqmcwyvmcvaredlrfjafiidwvcczqmufvufwjtdhordkaauukjezkyaodffohbzrnhwvioi"
[2] "ywryphperpsnbuspbfengmlllevavpbebfquiguvahshxdleyutvknsfiqcvrsirajqkzppbutsfbspjoirnqacoipcfxisugrto"
[3] "ivuzuxpflzqyphbnsdwvrqwcblxfagdflhqpgldnxkpuhzlhapueowofcgnakgwajgnaaqcvqxzwmorcmjybljsioulscnnntbmx"
[4] "cpbjxincbyrdasbrgrfdzxdzlmogfjmezgdkswpmcjrrlonsvgsaccrjvpbholodgsdcwslpsylslhoxliarkbighsmffoxprffb"
library(stringi)
library(rvest)
library(tidyverse)
Cache the page since it's big and loads really slowly:
if (!file.exists("~/Data/forso.html")) {
read_html(
"https://gallica.bnf.fr/ark:/12148/bpt6k5619759j.texteBrut"
) -> pg
write_lines(as.character(pg), "~/Data/forso.html")
}
Read it in as lines. This is usually a really bad idea for working with HTML, but it's better for this process since the XPath required for dealing with text between sequences of tags is gnarly and slow (even just finding the <hr> elements felt kinda slow using html_nodes()):
doc <- read_lines("~/Data/forso.html")
Now, find all the <hr> elements, ignoring the first two since they belong to the intro/metadata section:
pos <- which(doc == "<hr>")[-(1:2)]
Create start/end index marker positions for the text:
starts <- head(pos, -1)
ends <- tail(pos, -1)
Iterate along the start/end positions, extract the text, split it into lines and make a data frame:
map_df(seq_along(starts), ~{
  start <- starts[.x]
  end <- ends[.x]
  data_frame(
    pg = .x,
    txt = read_html(paste0(doc[start:end], collapse="\n")) %>%
      html_children() %>%
      html_text() %>%
      stri_split_lines() %>%
      flatten_chr() %>%
      list()
  )
}) -> xdf
Take a look:
xdf
## # A tibble: 542 x 2
## pg txt
## <int> <list>
## 1 1 <chr [4]>
## 2 2 <chr [2]>
## 3 3 <chr [13]>
## 4 4 <chr [1]>
## 5 5 <chr [35]>
## 6 6 <chr [19]>
## 7 7 <chr [22]>
## 8 8 <chr [18]>
## 9 9 <chr [16]>
## 10 10 <chr [36]>
## # ... with 532 more rows
Another look:
glimpse(xdf)
## Observations: 542
## Variables: 2
## $ pg <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, ...
## $ txt <list> [<"OEUVRES COMPLETES ", "DE MOLIERE ", "TOMI: III ", "">, <"PARIS. — I1IP. SIMON RAÇON ET COUP., RUE D...
One more:
str(head(xdf))
## Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 2 variables:
## $ pg : int 1 2 3 4 5 6
## $ txt:List of 6
## ..$ : chr "OEUVRES COMPLETES " "DE MOLIERE " "TOMI: III " ""
## ..$ : chr "PARIS. — I1IP. SIMON RAÇON ET COUP., RUE D'ERFURTH, 1. " ""
## ..$ : chr "OEUVRES COMPLETES " "DE MOLIERE " "NOUVELLE ÉDITION " "ACe-OJIPAfi NEES DE NOTES TIRÉES DE TOUS L, E S COMMENTATEURS AVEC DES REMARQUES NOUVELLES " ...
## ..$ : chr ""
## ..$ : chr "OEUVRES " "COMPLÈTES " "DE MOLIÈRE " "MONSIEUR DE POURCEAUGNAC' " ...
## ..$ : chr "MONSIEUR DE POURCEAUGNAC. " "MATASSINS dansants. DEUX AVOCATS chantants. DEUX PROCUREURS dansants. DEUX SERGENTS dansants. TROUPE DE MASQUES"| __truncated__ "La scène est à Paris. " "ACTE PREMIER " ...
This captures empty lines as well, but I have no idea what you need beyond what you described.
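If you do specifically want the list-of-pages structure from the question rather than a list-column tibble, one possible conversion of the xdf object above (just a small sketch) is:
# one character vector of lines per page, named by page number
pages <- setNames(xdf$txt, xdf$pg)
pages[[1]]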
Another approach
As @hrbrmstr already mentioned in his answer, XPath is not very friendly if you want to extract nodes between other nodes... things get very inefficient, very fast.
So keep in mind that the following code will take several minutes to complete (or longer, depending on your machine)... (maybe another user can speed things up using this answer as a base).
Having said that:
library( xml2 )
library( data.table )
library( magrittr )  # for the %>% pipe
# get the contents of the webpage
doc <- read_html( "https://gallica.bnf.fr/ark:/12148/bpt6k5619759j.texteBrut" )
# determine how many hr-tags/nodes there are in the document
hr <- length( xml_nodes( doc, "hr") )
# create an empty list
l <- list()
# fill the list with a loop. This seems to take forever, but it works!
# Just be patient (and get a cup of coffee. Or two...).
for( i in seq(1, hr, by = 1) ) {
  # set up the xpath:
  # get all p-nodes after the i-th hr-node that have exactly i preceding hr-nodes
  xpath_ <- paste0( ".//hr[", i, "]/following-sibling::p[count(preceding-sibling::hr)=", i, "]" )
  l[[i]] <- xml_find_all( doc, xpath = xpath_ ) %>% xml_text() %>% data.table()
}
Some results:
l[1:5]
# [[1]]
# Empty data.table (0 rows) of 1 col: .
#
# [[2]]
# Empty data.table (0 rows) of 1 col: .
#
# [[3]]
# .
# 1: OEUVRES COMPLETES
# 2: DE MOLIERE
# 3: TOMI: III
#
# [[4]]
# .
# 1: PARIS. — I1IP. SIMON RAÇON ET COUP., RUE D'ERFURTH, 1.
#
# [[5]]
# .
# 1: OEUVRES COMPLETES
# 2: DE MOLIERE
# 3: NOUVELLE ÉDITION
# 4: ACe-OJIPAfi NEES DE NOTES TIRÉES DE TOUS L, E S COMMENTATEURS AVEC DES REMARQUES NOUVELLES
# 5: PAR FÉLIX L E M A I T R E
# 6: P R É C É D É E
# 7: DE LA VIE DE MOLIÈRE PAR VOLTAIRE
# 8: TOME TROISIEME
# 9: PARIS
# 10: GARNIER FRÈRES, LIBRAIRES-ÉDITEURS
# 11: G, RUE DES SAINTS-PÈRES, ET P A L A I S-R 0 V A I., 213
# 12: 8 6 7
Or bind everything together in a data.table:
dt <- rbindlist(l, use.names = TRUE, idcol = "page")
# page .
# 1: 3 OEUVRES COMPLETES
# 2: 3 DE MOLIERE
# 3: 3 TOMI: III
# 4: 4 PARIS. — I1IP. SIMON RAÇON ET COUP., RUE D'ERFURTH, 1.
# 5: 5 OEUVRES COMPLETES
# 6: 5 DE MOLIERE
# 7: 5 NOUVELLE ÉDITION
# 8: 5 ACe-OJIPAfi NEES DE NOTES TIRÉES DE TOUS L, E S COMMENTATEURS AVEC DES REMARQUES NOUVELLES
# 9: 5 PAR FÉLIX L E M A I T R E
# 10: 5 P R É C É D É E
# 11: 5 DE LA VIE DE MOLIÈRE PAR VOLTAIRE
# 12: 5 TOME TROISIEME
# 13: 5 PARIS
# 14: 5 GARNIER FRÈRES, LIBRAIRES-ÉDITEURS
# 15: 5 G, RUE DES SAINTS-PÈRES, ET P A L A I S-R 0 V A I., 213
# 16: 5 8 6 7
# 17: 7 OEUVRES
# 18: 7 COMPLÈTES
# 19: 7 DE MOLIÈRE
# 20: 7 MONSIEUR DE POURCEAUGNAC'
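And if you want the per-page list structure the question asked for, a possible last step (a sketch; note the value column is literally named "." in the data.table above, and pages whose data.table was empty simply don't appear):
# split the text column by page number -> one character vector per page
pages <- split(dt[["."]], dt$page)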
Finding an index of all the hr nodes is a straightforward way to go about it. The mutate section is the most notable part; it uses %in% and cumsum().
# set up and read
library(rvest)
library(xml2)
library(dplyr)
page_url <- "https://gallica.bnf.fr/ark:/12148/bpt6k5619759j.texteBrut"
page_html <- read_html(page_url)
# filter to body only, so no need to deal with child nodes
allbodynodes <- page_html %>%
  xml_node('body')
# get all nodes and all hr nodes to compare later
# the first could be put into the pipeline, but it's clearer to me here
allnodes <- allbodynodes %>%
  xml_nodes('*')
allhr <- allbodynodes %>%
  xml_nodes('hr')
alltext <- allnodes %>%
  html_text(trim = T) %>%                 # convert to text only
  as.data.frame(stringsAsFactors = F) %>% # put into a data frame
  select(maintext = '.') %>%              # give the text a variable name
  mutate(
    ishr = allnodes %in% allhr,           # check which nodes were <hr> (now blank)
    page = cumsum(ishr) + 1               # add the page number by cumulating across the hr markers
  ) %>%
  filter(!ishr) %>%                       # get rid of the blank hr lines
  select(-ishr)                           # drop the ishr helper column
# split into a list of sorts if desired
alltextlist <- split(alltext$maintext, alltext$page)
I hope there's a more succinct way to create the index (preferably within the dplyr pipeline), but I haven't found it yet.
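One possible shortening (a sketch, not tested against this page) is to build the hr flag from xml_name() instead of the %in% comparison, which keeps everything inside the pipeline:
# same result as above, but the <hr> flag comes straight from the node names
alltext2 <- data.frame(maintext = html_text(allnodes, trim = TRUE),
                       stringsAsFactors = FALSE) %>%
  mutate(ishr = xml_name(allnodes) == "hr",
         page = cumsum(ishr) + 1) %>%
  filter(!ishr) %>%
  select(-ishr)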
Related
I am stuck on a simple web scrape.
My goal is to scrape Morningstar.com to retrieve the education of the managers associated to a fund name.
First off, let me say that I am not familiar at all with this operation. However, I did my best to provide some code.
For example, consider the following webpage
http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US
The problem is that the page dynamically loads the section I am targeting, so it doesn't actually get pulled in by read_html()
So what I did was to access the data loaded in my section of interest.
Specifically, I did:
# edit: added packages required
library(xml2)
library(rvest)
library(stringi)
# original code
tmp_url <- "http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US"
pg <- read_html(tmp_url)
tmp <- length(html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]"))
html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]") %>%
html_text() %>%
stri_split_lines() %>%
.[[1]] -> js_lines
idx <- which(stri_detect_fixed(js_lines, '\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t='))
start <- nchar("\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t=")+1
id <- substr(js_lines[idx],start, start+9)
tab <- read_html(paste0("http://financials.morningstar.com/oprn/c-managers.action?&t=",id,"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"), options = "HUGE")
The object tab contains the information I need.
What I need to do now is to create a dataframe associating to each manager name, his or her manager education.
I could try to do this by transforming my object into a string, then extracting the characters following the word "Education".
This looks extremely inefficient, though.
I was wondering if anyone can provide some guidance.
This thing really is a mess - nice work getting the links and downloading the info.
Poking around a lot and taking various detours, this is the best I could come up with:
Clean Up
First there is some cleanup to do. Instead of directly downloading and parsing the document in one step we will:
download the document as text
clean up the text a little to get the JSON
parse the JSON
extract the HTML item
do some further cleaning
finally parse the HTML
url <-
paste0(
"http://financials.morningstar.com/oprn/c-managers.action?&t=",
id,
"®ion=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"
)
txt <-
readLines(url, warn = FALSE)
json <-
txt %>%
gsub("^jsonp\\d+\\(", "", .) %>%
gsub("\\)$", "", .)
json_parsed <-
jsonlite::fromJSON(json)
html_clean <-
json_parsed$html %>%
gsub("\t", "", .)
html_parsed <-
read_html(html_clean)
First Round of Node Extraction
Next we use some black magic node extraction trickery. Basically the trick goes like this: if we have a node set (the thing you get when using html_nodes), we can use further XPath queries to drill down.
The first node set (cvs) captures the basic path to the CV entries in the table.
The second node set (info_tmp) drills down a little further to get those parts of the CV entries where further information ("Other Assets Managed", "Education", etc.) is stored.
cvs <-
html_parsed %>%
html_nodes(xpath = "/html/body/table/tbody/tr[not(@align='left')]")
info_tmp <-
cvs %>%
html_nodes(xpath = "td/table/tbody")
Building up Data.Frame 1
There is a little problem with the table. Each CV entry lives in its own table row. For name, from, to and description there is always exactly one item per CV entry, but for "Other Assets Managed", "Education", etc. this is not true.
Therefore, information extraction is done in two parts.
df <-
cvs %>%
lapply(
FUN =
function(x){
tmp <-
x %>%
html_nodes(xpath = "th") %>%
html_text() %>%
gsub(" +", "", .)
data.frame(
name = stri_extract(tmp, regex = "[. \\w]+"),
from = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}"),
to = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}")
)
}
) %>%
do.call(rbind, .)
df$description <-
info_tmp %>%
html_nodes(xpath = "tr[1]/td[1]") %>%
html_text()
df$cv_id <- seq_len(nrow(df))
Building Up Data.Frame 2
Now some more html_nodes() trickery... If we use html_nodes() on the result set of html_nodes(), we get all matching nodes and none of the non-matching nodes. This is a problem, since we might get 1, 0 or multiple nodes per node-set node, basically destroying any information about where those newly selected nodes came from.
There is a solution, however: we can use lapply to query each element of a node set independently from the others, thereby preserving information about the original structure.
extract_key_value_pairs <-
function(i, info_tmp){
cv_id <-
seq_along(info_tmp)
key <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[1]")) %>%
html_text()
if ( length(tmp) == 0 ) {
return("")
}else{
return(tmp)
}
}
)
value <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[2]")) %>%
html_text() %>%
stri_trim_both() %>%
stri_split(fixed = "\n") %>%
lapply(X = ., stri_trim_both)
if ( length(tmp) == 0 ) {
return("")
}else{
return(unlist(tmp))
}
}
)
df <-
mapply(
cv_id = cv_id,
key = key,
value = value,
FUN =
function(cv_id, key, value){
data.frame(
cv_id = cv_id,
key = key,
value = value
)
},
SIMPLIFY = FALSE
) %>%
do.call(rbind, .)
df[df$key != "",]
}
df2 <-
lapply(
X = c(3, 5, 7),
FUN = extract_key_value_pairs,
info_tmp = info_tmp
) %>%
do.call(rbind, .)
Results
df
## name from to description cv_id
## 1 Kurt J. Lauber 03/20/2013 03/20/2013 Mr. Lauber ... 1
## 2 Noah J. Monsen 02/28/2018 02/28/2018 Mr. Monsen ... 2
## 3 Lauri Brunner 09/30/2018 09/30/2018 Ms. Brunne ... 3
## 4 Darren M. Bagwell 02/29/2016 02/29/2016 Darren M. ... 4
## 5 David C. Francis 10/07/2011 10/07/2011 Francis is ... 5
## 6 Michael A. Binger 04/14/2010 04/14/2010 Binger has ... 6
## 7 David E. Heupel 04/14/2010 04/14/2010 Mr. Heupel ... 7
## 8 Matthew D. Finn 03/30/2007 03/30/2007 Mr. Finn h ... 8
## 9 Scott Vergin 03/30/2007 03/30/2007 Vergin has ... 9
## 10 Frederick L. Plautz 11/01/1995 11/01/1995 Plautz has ... 10
## 11 Clyde E. Bartter 01/01/1994 01/01/1994 Bartter is ... 11
## 12 Wayne C. Stevens 01/01/1994 01/01/1994 Stevens is ... 12
## 13 Julian C. Ball 07/16/1987 07/16/1987 Ball is a ... 13
df2
## cv_id key value
## 1 Other Assets Managed
## 2 Other Assets Managed
## 3 Other Assets Managed
## 4 Certification CFA
## 4 Other Assets Managed
## 5 Certification CFA
## 5 Education M.B.A. University of Pittsburgh, 1978
## 5 Education B.A. University of Pittsburgh, 1977
## 5 Other Assets Managed
## 6 Certification CFA
## 6 Education M.B.A. University of Minnesota, 1991
## 6 Education B.S. University of Minnesota, 1987
## 6 Other Assets Managed
## 7 Other Assets Managed
## 8 Certification CFA
## 8 Education B.A. University of Pennsylvania, 1984
## 8 Education M.B.A. University of Michigan, 1990
## 8 Other Assets Managed
## 9 Certification CFA
## 9 Education M.B.A. University of Minnesota, 1980
## 9 Education B.A. St. Olaf College, 1976
## 9 Other Assets Managed
## 10 Education M.S. University of Wisconsin, 1981
## 10 Education B.B.A. University of Wisconsin, 1979
## 10 Other Assets Managed
## 11 Certification CFA
## 11 Education M.B.A. Western Reserve University, 1964
## 11 Education B.A. Baldwin-Wallace College, 1953
## 11 Other Assets Managed
## 12 Certification CFA
## 12 Education M.B.A. University of Wisconsin,
## 12 Education B.B.A. University of Miami,
## 12 Other Assets Managed
## 13 Certification CFA
## 13 Education B.A. Kent State University, 1974
## 13 Education J.D. Cleveland State University, 1984
## 13 Other Assets Managed
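Since the original goal was to associate each manager name with his or her education, one possible final step (a sketch using base merge() on the cv_id key shown above):
# one row per manager/education entry; managers without an Education row get NA
education <- merge(df[, c("cv_id", "name")],
                   df2[df2$key == "Education", c("cv_id", "value")],
                   by = "cv_id", all.x = TRUE)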
I don't have a solution, as this is not an area I have worked with before. However, with brute force you can probably get the table, assuming you have a list of rules that can parse the text into a data frame.
Thought I'd share what I have, though:
# get the text
f <- xml_text(tab)
# split up, this bit is tricky..
split_f <- strsplit(f, split="\\\\t", perl=TRUE)[[1]]
split_f <- strsplit(split_f, split="\\\\n", perl=TRUE)
split_f <- unlist(split_f)
split_f <- trimws(split_f)
# find ones to remove
sort(table(split_f), decreasing = T)[1:5]
split_f <- split_f[split_f!="—"]
split_f <- split_f[split_f!=""]
# manually found where to split
keep <- split_f[2:108]
# text looks ok, but would need rules to extract the rows in to a data.frame
View(keep)
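As a purely illustrative example of such a rule (this is an assumption about the cleaned-up text, not something verified here): if a label line such as "Education" were always followed directly by its value line, you could pair them up like this:
# hypothetical rule: take the line after each "Education" label as its value
idx <- which(keep == "Education")
edu <- data.frame(key = keep[idx], value = keep[idx + 1], stringsAsFactors = FALSE)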
I'm trying to extract JSON data that sits in a column of a CSV file. So far I've come to the point where I've extracted the column in the right format, but the formatting is only correct when the variable type is factor. And I can't convert a factor to JSON using the jsonlite package.
[1] {"id":509746197991998767,"visibility":{"percentage":100,"time":149797,"visible1":true,"visible2":false,"visible3":false,"activetab":true},"interaction":{"mouseovercount":1,"mouseovertime":1426,"videoplaytime":0,"engagementtime":0,"expandtime":0,"exposuretime":35192}}
Another approach is to use stringsAsFactors = F when importing, but I'm struggling to get the formatting right, where each entry looks like this:
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Am I missing something obvious here? I simply want to extract the JSON data that sits inside a CSV file.
Here's a small example of the CSV file:
"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"
Regards,
Frederik.
df <- readr::read_csv('"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"',
col_types = "-cc")
Using jsonlite::fromJSON on each separate value, then tidyr::unnest
library(dplyr)
f <- function(.x)
if (is.na(.x) || .x == "") data.frame()[1, ] else
as.data.frame(jsonlite::fromJSON(.x))
df %>%
tidyr::unnest(UnloadVars = lapply(UnloadVars, f)) %>%
mutate_at(vars(ends_with("id")), as.character)
# A tibble: 6 x 16
# CookieID id visibility.percentage visibility.time visibility.visible1 visibility.visible2 visibility.visible3 visibility.activetab interaction.mouseovercount interaction.mouseovertime interaction.videoplaytime interaction.engagementtime interaction.expandtime interaction.exposuretime clicks.x clicks.y
# <chr> <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 -8857188784608690176 509746197991998784 100 149797 TRUE FALSE FALSE TRUE 1 1426 0 0 0 35192 NA NA
# 2 -1695626857458244096 2917654329769114112 46 0 FALSE FALSE FALSE TRUE NA NA NA NA NA NA NA NA
# 3 437299165071669184 2252707957388071936 99 10168 TRUE FALSE FALSE TRUE 0 0 0 0 0 542 105 449
# 4 292660729552227520 <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
# 6 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
I used readr::read_csv to read in your sample data set.
> df <- readr::read_csv('~/sample.csv')
Parsed with column specification:
cols(
CookieID = col_double(),
UnloadVars = col_character()
)
As you can see, the UnloadVars column is read in as character, not factor. If I now examine the first value in the UnloadVars column, I see the following, which matches what you get:
> df$UnloadVars[1]
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Now, I use jsonlite::fromJSON,
> j <- jsonlite::fromJSON(df$UnloadVars[1])
> j
$id
[1] 5.097462e+17
$visibility
$visibility$percentage
[1] 100
$visibility$time
[1] 149797
$visibility$visible1
[1] TRUE
$visibility$visible2
[1] FALSE
$visibility$visible3
[1] FALSE
$visibility$activetab
[1] TRUE
$interaction
$interaction$mouseovercount
[1] 1
$interaction$mouseovertime
[1] 1426
$interaction$videoplaytime
[1] 0
$interaction$engagementtime
[1] 0
$interaction$expandtime
[1] 0
$interaction$exposuretime
[1] 35192
Which I believe is what you need since JSONs are parsed as lists in R.
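To run that over the whole column rather than a single value, a minimal sketch (skipping the empty entry in row 4 of the sample) could be:
# parse every non-empty UnloadVars entry; empty strings become NULL
parsed <- lapply(df$UnloadVars, function(x) {
  if (is.na(x) || x == "") NULL else jsonlite::fromJSON(x)
})
# e.g. pull out all the ids (NA where there was no JSON)
sapply(parsed, function(p) if (is.null(p)) NA else p$id)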
It can be very tricky to deal with JSON data. As a general guideline, you should always strive to have your data in a data frame. This, however, is not always possible. In this specific case, I don't see a way you can have both visibility and interaction values at once in a nicely formatted data frame.
What I will do next is to extract the information from interaction into a data frame.
Load required packages and read the data
library(purrr)
library(dplyr)
library(tidyr)
df <- read.csv("sample.csv", stringsAsFactors = FALSE)
Then remove invalid JSON:
# remove rows without JSON (in this case, the 4th row)
df <- df %>%
dplyr::filter(UnloadVars != "")
Transform each JSON string into a list and put them into the UnloadVars column. In case you didn't know, it is possible to have a list column in a data frame. This can be very useful.
out <- data_frame(CookieID = numeric(),
UnloadVars = list())
for (row in 1:nrow(df)) {
new_row <- data_frame(CookieID = df[row, ]$CookieID,
UnloadVars = list(jsonlite::fromJSON(df[row, ]$UnloadVars)))
out <- bind_rows(out, new_row)
}
out
We can now extract the IDs from the lists in UnloadVars. This is straightforward because there is only one ID per list.
out <- out %>%
mutate(id = map_chr(UnloadVars, ~ .$id))
This final part can seem a bit intimidating, but what I am doing here is taking the interaction part of the UnloadVars column and putting it into an interaction column. I then transform each row of interaction, which is a list, into a data frame with two columns: key and value. key contains the name of the interaction metric and value its value. I finally unnest it, so we get rid of list columns and end up with a nicely formatted data frame.
unpack_list <- function(obj) {
  as.data.frame(obj) %>%
    gather(key)
}
df_interaction <- out %>%
  mutate(interaction = map(UnloadVars, ~ .$interaction)) %>%
  mutate(interaction = map(interaction, unpack_list)) %>%
  unnest(interaction)
df_interaction
The solution is not very elegant, but gets the job done. You could apply the same logic to extract information from visibility.
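For completeness, the same logic applied to visibility might look like this (a sketch following the pattern above, not a tested addition):
df_visibility <- out %>%
  mutate(visibility = map(UnloadVars, ~ .$visibility)) %>%
  mutate(visibility = map(visibility, unpack_list)) %>%
  unnest(visibility)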
I'm trying to read the HTML data regarding Greyhound bus timings. An example can be found here. I'm mainly concerned with getting the schedule and status data off the table, but when I execute the following code:
library(XML)
url<-"http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
greyhound<-readHTMLTable(url)
greyhound<-greyhound[[2]]
This just produces a table of data that isn't even on the page. I'm not sure why it's grabbing that, as opposed to the schedule and status table I'm after.
You cannot retrieve the data using readHTMLTable because the route results are rendered by a JavaScript script. So you should select that script and parse it to extract the right information.
Here is a solution that does this:
Extract the JavaScript script that contains the JSON data
Extract the JSON data from the script using a regular expression
Parse the JSON data into an R list
Reshape the resulting list into a table (a data.table here)
The code may look short, but it is really compact (it took me an hour to produce it)!
library(XML)
library(httr)
library(jsonlite)
library(data.table)
url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
dc <- htmlParse(GET(url))
# the 5th <script> node holds the stopArray.push() calls with the stop data
script <- xpathSApply(dc, "//script/text()", xmlValue)[[5]]
# split on each push( call; the leading "{" is re-added below before parsing
res <- strsplit(script, "stopArray.push({", fixed = TRUE)[[1]][-1]
dcast(point ~ name,
      data = rbindlist(
        Map(function(x, y) {
          x  <- paste('{', sub(');|);.*docum.*', "", x))
          dx <- unlist(fromJSON(x))
          data.frame(point = y, name = names(dx), value = dx)
        }, res, seq_along(res)),
        fill = TRUE
      )[name != "polyline"])
The resulting table:
point category direction id lat linkName lon
1: 1 2 empty 562310 41.878589630127 Chicago_Amtrak_IL -87.6398544311523
2: 2 2 empty 560252 41.8748474121094 Chicago_IL -87.6435165405273
3: 3 1 empty 561627 41.7223281860352 Chicago_95th_&_Dan_Ryan_IL -87.6247329711914
4: 4 2 empty 260337 41.6039199829102 Gary_IN -87.3386917114258
5: 5 1 empty 260447 40.4209785461426 Lafayette_e_IN -86.8942031860352
6: 6 2 empty 260392 39.7617835998535 Indianapolis_IN -86.161018371582
7: 7 2 empty 250305 39.1079406738281 Cincinnati_OH -84.5041427612305
name shortName ticketName
1: Chicago Amtrak: 225 S Canal St, IL 60606 Chicago Amtrak, IL CHD
2: Chicago: 630 W Harrison St, IL 60607 Chicago, IL CHD
3: Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628 Chicago 95th & Dan Ryan, IL CHD
4: Gary: 100 W 4th Ave, IN 46402 Gary, IN GRY
5: Lafayette (e): 401 N 3rd St, IN 47901 Lafayette (e), IN XIN
6: Indianapolis: 350 S Illinois St, IN 46225 Indianapolis, IN IND
7: Cincinnati: 1005 Gilbert Ave, OH 45202 Cincinnati, OH CIN
As @agstudy notes, the data is rendered into the HTML by JavaScript; it's not delivered via HTML directly from the server. Therefore, you can (a) use something like RSelenium to scrape the rendered content, or (b) extract the data from the <script> tags that contain the data.
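Option (a) isn't pursued further here, but a minimal RSelenium sketch (assuming a working local Selenium/driver setup; no selectors shown) might look like:
library(RSelenium)
library(rvest)
drv <- rsDriver(browser = "firefox")   # spins up a local Selenium server + browser
rem <- drv$client
rem$navigate("http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016")
rendered <- read_html(rem$getPageSource()[[1]])  # parse the fully rendered DOM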
To explain @agstudy's work (option (b)), we observe that the data is contained in a series of stopArray.push() commands in one of the (many) script tags. For example:
stopArray.push({
"id" : "562310",
"name" : "Chicago Amtrak: 225 S Canal St, IL 60606",
"shortName" : "Chicago Amtrak, IL",
"ticketName" : "CHD",
"category" : 2,
"linkName" : "Chicago_Amtrak_IL",
"direction" : "empty",
"lat" : 41.87858963012695,
"lon" : -87.63985443115234,
"polyline" : "elr~Fnb|uOmC##nG?XBdH#rC?f#?P?V#`AlAAn#A`CCzBC~BE|CEdCA^Ap#A"
});
Now, this is json data contained inside each function call. I tend to think that if someone has gone to the work of formatting data in a machine-readable format, well golly we should appreciate it!
The tidyverse approach to this problem is as follows:
Download the page using the rvest package.
Identify the appropriate script tag to use by employing an xpath expression that searches for all script tags that contain the string url =.
Use a regular expression to pull out everything inside each stopArray.push() call.
Fix the formatting of the resulting object by (a) separating each block with commas, (b) surrounding the string by [] to indicate a json list.
Use jsonlite::fromJSON to convert into a data.frame.
Note that I hide the polyline column near the end, since it's too large to preview appropriately.
library(tidyverse)
library(rvest)
library(stringr)
library(jsonlite)
url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
page <- read_html(url)
page %>%
html_nodes(xpath = '//script[contains(text(), "url = ")]') %>%
html_text() %>%
str_extract_all(regex("(?<=stopArray.push\\().+?(?=\\);)", multiline = T, dotall = T), F) %>%
unlist() %>%
paste(collapse = ",") %>%
sprintf("[%s]", .) %>%
fromJSON() %>%
select(-polyline) %>%
head()
#> id name
#> 1 562310 Chicago Amtrak: 225 S Canal St, IL 60606
#> 2 560252 Chicago: 630 W Harrison St, IL 60607
#> 3 561627 Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628
#> 4 260337 Gary: 100 W 4th Ave, IN 46402
#> 5 260447 Lafayette (e): 401 N 3rd St, IN 47901
#> 6 260392 Indianapolis: 350 S Illinois St, IN 46225
#> shortName ticketName category
#> 1 Chicago Amtrak, IL CHD 2
#> 2 Chicago, IL CHD 2
#> 3 Chicago 95th & Dan Ryan, IL CHD 1
#> 4 Gary, IN GRY 2
#> 5 Lafayette (e), IN XIN 1
#> 6 Indianapolis, IN IND 2
#> linkName direction lat lon
#> 1 Chicago_Amtrak_IL empty 41.87859 -87.63985
#> 2 Chicago_IL empty 41.87485 -87.64352
#> 3 Chicago_95th_&_Dan_Ryan_IL empty 41.72233 -87.62473
#> 4 Gary_IN empty 41.60392 -87.33869
#> 5 Lafayette_e_IN empty 40.42098 -86.89420
#> 6 Indianapolis_IN empty 39.76178 -86.16102
I'm trying to get data for book prices from an API (http://www.knigoed.info/api-prices.html) based on ISBNs.
The idea is to submit a vector of ISBNs to a function and get back a data frame with all available info (or at least a data frame with prices from different vendors).
isbns<- c("9785170922789", "9785170804801", "9785699834174", "9785699717255", "9785170869237")
getISBNprice <- function(ISBN, source="http://www.knigoed.info/api/Prices?code=") {
pathA <- source
for (i in 1:length(ISBN)) {
ISB <- ISBN[i]
AAA <- paste(pathA, ISB, "&sortPrice=DESC&country=RU", sep="")
document <- fromJSON(AAA, flatten = FALSE)
dfp <- document$prices
dfp <- cbind(dfp,ISB )
# dfp <- cbind(dfp,BookID=document$bookId)
# dfp <- cbind(dfp,Title=document$title)
# dfp <- cbind(dfp,Author=document$author)
# dfp <- cbind(dfp,Publisher=document$publisher)
# dfp <- cbind(dfp,Series=document$series)
# dfp <- cbind(dfp,Picture=document$picture)
if (!exists("AAAA")) {AAAA<- dfp} else {bind_rows(AAAA, dfp) }
}
AAAA
}
But the function produces these warnings:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
3: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
4: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
It's easiest to make a list from the start, which will make simplifying later easier. The purrr package can make working with lists much easier, though the usages here can be replaced with base's lapply and mapply/Map if you prefer (a base-R sketch follows after the output below).
library(purrr)
# Paste is vectorized, so make a list of URLs all at once.
# `httr` can make a URL out of a list of named parameters, if it's more convenient.
results <- paste0("http://www.knigoed.info/api/Prices?code=",
isbns,
"&sortPrice=DESC&country=RU") %>%
# Iterate over vector of URLs, using fromJSON to pull and parse the request.
# map, like lapply, will put the results into a list.
map(jsonlite::fromJSON, flatten = FALSE)
# Grab "prices" element of each top-level list element
results %>% map('prices') %>%
# Iterate in parallel (like mapply/Map) over prices and isbns, making a data.frame of
# each. map2_df will coerce the resulting list of data.frames to a single data.frame.
map2_df(isbns, ~data.frame(isbn = .y, .x, stringsAsFactors = FALSE)) %>%
# For pretty printing
tibble::as_data_frame()
## # A tibble: 36 x 10
## isbn shopId name domain
## <chr> <chr> <chr> <chr>
## 1 9785170922789 29 Магистр booka.ru
## 2 9785170922789 3 Лабиринт labirint.ru
## 3 9785170922789 20 LitRes.ru litres.ru
## 4 9785170804801 29 Магистр booka.ru
## 5 9785170804801 2 Read.ru read.ru
## 6 9785170804801 3 Лабиринт labirint.ru
## 7 9785170804801 63 Эксмо eksmo.ru
## 8 9785170804801 1 OZON.ru ozon.ru
## 9 9785170804801 4 My-shop.ru my-shop.ru
## 10 9785170804801 1 OZON.ru ozon.ru
## # ... with 26 more rows, and 6 more variables: url <chr>, available <lgl>, downloadable <lgl>,
## # priceValue <dbl>, priceSuffix <chr>, year <int>
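As mentioned above, the purrr calls map directly onto base R; a rough equivalent (a sketch, not run against the live API) would be:
# base-R version of the pipeline: build URLs, fetch, extract "prices", bind rows
urls    <- paste0("http://www.knigoed.info/api/Prices?code=", isbns,
                  "&sortPrice=DESC&country=RU")
results <- lapply(urls, jsonlite::fromJSON, flatten = FALSE)
prices  <- lapply(results, `[[`, "prices")
out     <- do.call(rbind,
                   Map(function(p, isbn) data.frame(isbn = isbn, p, stringsAsFactors = FALSE),
                       prices, isbns))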
I'm scraping http://www.progarchives.com/album.asp?id= and get a warning message:
Warning message:
XML content does not seem to be XML:
http://www.progarchives.com/album.asp?id=2
http://www.progarchives.com/album.asp?id=3 http://www.progarchives.com/album.asp?id=4
http://www.progarchives.com/album.asp?id=5
The scraper works for each page separately but not for the urls b1=2:b2=1000.
library(RCurl)
library(XML)
getUrls <- function(b1,b2){
root="http://www.progarchives.com/album.asp?id="
urls <- NULL
for (bandid in b1:b2){
urls <- c(urls,(paste(root,bandid,sep="")))
}
return(urls)
}
prog.arch.scraper <- function(url){
SOURCE <- getUrls(b1=2,b2=1000)
PARSED <- htmlParse(SOURCE)
album <- xpathSApply(PARSED,"//h1[1]",xmlValue)
date <- xpathSApply(PARSED,"//strong[1]",xmlValue)
band <- xpathSApply(PARSED,"//h2[1]",xmlValue)
return(c(band,album,date))
}
prog.arch.scraper(urls)
Here's an alternate approach with rvest and dplyr:
library(rvest)
library(dplyr)
library(pbapply)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
pg <- html(sprintf(base_url, id))
data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
stringsAsFactors=FALSE)
}
albums <- bind_rows(pblapply(2:10, get_album_info))
head(albums)
## Source: local data frame [6 x 3]
##
## album date band
## 1 FOXTROT Studio Album, released in 1972 Genesis
## 2 NURSERY CRYME Studio Album, released in 1971 Genesis
## 3 GENESIS LIVE Live, released in 1973 Genesis
## 4 A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
## 6 GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
I didn't feel like barraging the site with a ton of requests, so bump up the sequence for your own use. pblapply gives you a free progress bar.
To be kind to the site (especially since it doesn't explicitly prohibit scraping), you might want to throw a Sys.sleep(10) at the end of the get_album_info function.
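Concretely, that just means storing the result, pausing, then returning it, e.g. (a sketch of the modified function):
get_album_info <- function(id) {
  pg <- html(sprintf(base_url, id))
  res <- data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
                    date  = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
                    band  = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
                    stringsAsFactors = FALSE)
  Sys.sleep(10)  # be polite between requests
  res
}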
UPDATE
To handle server errors (in this case 500, but it'll work for others, too), you can use try:
library(rvest)
library(dplyr)
library(pbapply)
library(data.table)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
pg <- try(html(sprintf(base_url, id)), silent=TRUE)
if (inherits(pg, "try-error")) {
data.frame(album=character(0), date=character(0), band=character(0))
} else {
data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
stringsAsFactors=FALSE)
}
}
albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))
## album date band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
## 2: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
## 3: AD INFINITUM Studio Album, released in 1998 Ad Infinitum
You won't get any entries for the errant pages (in this case it just returns id 9, 10 and 30's entries).
Instead of xpathApply(), you could subset the first node in the node sets of each path and call xmlValue() on that. Here's what I came up with,
library(XML)
library(RCurl)
## define the urls and xpath queries
urls <- sprintf("http://www.progarchives.com/album.asp?id=%s", 2:10)
path <- c(album = "//h1", date = "//strong", band = "//h2")
## define a re-usable curl handle for the c-level nodes
curl <- getCurlHandle()
## allocate the result list
out <- vector("list", length(urls))
## do the work
for(u in urls) {
content <- getURL(u, curl = curl)
doc <- htmlParse(content, useInternalNodes = TRUE)
out[[u]] <- lapply(path, function(x) xmlValue(doc[x][[1]]))
free(doc)
}
## structure the result
data.table::rbindlist(out)
# album date band
# 1: FOXTROT Studio Album, released in 1972 Genesis
# 2: NURSERY CRYME Studio Album, released in 1971 Genesis
# 3: GENESIS LIVE Live, released in 1973 Genesis
# 4: A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
# 5: FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
# 6: GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
# 7: GULLIBLES TRAVELS Studio Album, released in 1985 Abel Ganz
# 8: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 9: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
Update: To handle id queries that do not exist, we can write a condition with RCurl::url.exists() that handles the bad ones. So the following function getAlbums() returns a character vector of either the fetched xml values or NA, depending on the status of the url. You can change that if you want, of course. That was just a method that came to mind in the wee hours.
getAlbums <- function(url, id = numeric(), xPath = list()) {
  urls <- sprintf("%s?id=%d", url, id)
  curl <- getCurlHandle()
  out <- vector("list", length(urls))
  for(u in urls) {
    out[[u]] <- if(url.exists(u)) {
      content <- getURL(u, curl = curl)
      doc <- htmlParse(content, useInternalNodes = TRUE)
      lapply(xPath, function(x) xmlValue(doc[x][[1]]))
    } else {
      warning(sprintf("returning 'NA' for urls[%d] ", id[urls == u]))
      structure(as.list(xPath[NA]), names = names(xPath))
    }
    if(exists("doc")) free(doc)
  }
  data.table::rbindlist(out)
}
url <- "http://www.progarchives.com/album.asp"
id <- c(9:10, 23, 28, 29, 30)
path <- c(album = "//h1", date = "//strong", band = "//h2")
getAlbums(url, id, path)
# album date band
# 1: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 2: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
# 3: NA NA NA
# 4: NA NA NA
# 5: NA NA NA
# 6: AD INFINITUM Studio Album, released in 1998 Ad Infinitum
#
# Warning messages:
# 1: In albums(url, id, path) : returning 'NA' for urls[23]
# 2: In albums(url, id, path) : returning 'NA' for urls[28]
# 3: In albums(url, id, path) : returning 'NA' for urls[29]