I am trying to extract variable values from 16 .img files and bind them to a pre-existing structure, and while the code runs, it only returns NA values. I've tried converting the CRS, but that hasn't changed anything.
library(raster)
library(sp)
library(rgdal)
library(sf)
library(exactextractr)
setwd("~/preds")
pres.abs <- st_read("~spp106mort.PPsA.SF.shp") # load in GIS data
etpt_5.r <- raster("etpt_5.img", band=1)
etpt_6.r <- raster("etpt_6.img", band=1)
etpt_sprin.r <- raster("etpt_sprin.img", band=1)
exp1nrm.r <- raster("exp1nrm.img", band=1)
exp3nrm.r <- raster("exp3nrm.img", band=1)
exp5nrm.r <- raster("exp5nrm.img", band=1)
mind_yr_av.r <- raster("mind_yr_av.img", band=1)
prad_sw_di.r <- raster("prad_sw_di.img", band=1)
prec_w_hal.r <- raster("prec_w_hal.img", band=1)
prec_winte.r <- raster("prec_winte.img", band=1)
rough_1k.r <- raster("rough_1k.img", band=1)
tave_s_hal.r <- raster("tave_s_hal.img", band=1)
tave_sprin.r <- raster("tave_sprin.img", band=1)
tmax_s_hal.r <- raster("tmax_s_hal.img", band=1)
tmax_summe.r <- raster("tmax_summe.img", band=1)
topos.r <- raster("topos.img", band=1)
allvars <- stack(etpt_5.r, etpt_6.r, etpt_sprin.r, exp1nrm.r, exp3nrm.r, exp5nrm.r,
mind_yr_av.r, prad_sw_di.r, prec_w_hal.r, prec_winte.r, rough_1k.r,
tave_s_hal.r, tave_sprin.r, tmax_s_hal.r, tmax_summe.r, topos.r)
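As a side note, the sixteen individual raster() calls can be collapsed into a loop over a filename vector; a minimal sketch, assuming the same working directory and the same file names as above:

```r
library(raster)

# Same sixteen file names as the individual raster() calls above, band 1.
files <- c("etpt_5.img", "etpt_6.img", "etpt_sprin.img", "exp1nrm.img",
           "exp3nrm.img", "exp5nrm.img", "mind_yr_av.img", "prad_sw_di.img",
           "prec_w_hal.img", "prec_winte.img", "rough_1k.img", "tave_s_hal.img",
           "tave_sprin.img", "tmax_s_hal.img", "tmax_summe.img", "topos.img")
allvars <- stack(lapply(files, raster, band = 1))
```

The layer names in the stack come from the file names, so the result matches the hand-built version.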
## converting pres.abs to new crs
pres.abs.old <- pres.abs
newcrs <- crs(allvars)
pres.abs.new <- st_transform(pres.abs, newcrs)
crs(allvars)
Coordinate Reference System:
Deprecated Proj.4 representation: +proj=longlat +datum=WGS84 +no_defs
WKT2 2019 representation:
GEOGCRS["WGS 84 (with axis order normalized for visualization)",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic longitude (Lon)",east,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
AXIS["geodetic latitude (Lat)",north,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]]
projection(allvars)
[1] "+proj=longlat +datum=WGS84 +no_defs"
crs(pres.abs.new)
Coordinate Reference System:
Deprecated Proj.4 representation: +proj=longlat +datum=WGS84 +no_defs
WKT2 2019 representation:
GEOGCRS["WGS 84 (with axis order normalized for visualization)",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic longitude (Lon)",east,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
AXIS["geodetic latitude (Lat)",north,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]]
projection(pres.abs.new)
[1] "+proj=longlat +datum=WGS84 +no_defs"
So the CRS and projection match, but when it comes time to extract, here's what happens:
head(pres.abs.new)
Simple feature collection with 6 features and 12 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -96.00105 ymin: 23.00031 xmax: -96.00105 ymax: 23.00031
Geodetic CRS: WGS 84 (with axis order normalized for visualization)
UNIQUEID wgs_xF wgs_yF MORT106 exp5nrm topos rough_1k prad_sw_di tmax_s_hal etpt_6 mind_yr_av prec_w_hal
1 Z0401030100988520 -109.6594 33.25329 1 -23 -1.307692 25.61479 -5511 262.3333 101 -1676.667 34.83333
2 Z0401030301183589 -109.4448 33.20588 1 -48 -5.153846 51.95938 -5622 278.6667 101 -1708.083 29.83333
3 Z0401030401180927 -109.2798 33.42938 1 21 19.923077 39.72136 -5198 258.0000 100 -1680.667 33.50000
4 Z0401030501183471 -109.3880 33.38541 1 0 18.230770 27.97424 -5506 250.8333 100 -1585.667 35.83333
5 Z0401030601186305 -109.3887 33.34027 1 9 24.153847 34.67730 -5456 262.8333 101 -1679.917 33.00000
6 Z0401030701187126 -109.2232 33.56368 1 110 -33.538460 61.15303 -5409 248.5000 99 -1495.417 34.16667
geometry
1 POINT (-96.00105 23.00031)
2 POINT (-96.00105 23.00031)
3 POINT (-96.00105 23.00031)
4 POINT (-96.00105 23.00031)
5 POINT (-96.00105 23.00031)
6 POINT (-96.00105 23.00031)
extracted <- extract(allvars, pres.abs.new[, c("wgs_xF", "wgs_yF")])
etpt_5 etpt_6 etpt_sprin exp1nrm exp3nrm exp5nrm mind_yr_av prad_sw_di prec_w_hal prec_winte rough_1k tave_s_hal tave_sprin tmax_s_hal
[1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
tmax_summe topos
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] NA NA
[5,] NA NA
[6,] NA NA
This is weird because each of the rasters I imported has distinct minimum and maximum values, so I know they're not actually all NA:
prec_winte.r
class : RasterLayer
dimensions : 1518, 1478, 2243604 (nrow, ncol, ncell)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -114.6167, -102.3, 29.49167, 42.14167 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : prec_winte.img
names : prec_winte
values : 3.333333, 140.3333 (min, max)
I think the issue depends on where in the raster or data frame I'm pulling the values from, but I can't figure out where!
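A side note on diagnosing this: the head() output shows every geometry at the identical point (-96.00105, 23.00031), which lies outside the raster extent printed above (roughly -114.6 to -102.3 longitude, 29.5 to 42.1 latitude), and extract() returns NA for points that fall outside a raster's extent. A hedged diagnostic sketch, reusing the objects defined earlier and assuming the wgs_xF/wgs_yF columns hold the true coordinates:

```r
library(raster)
library(sf)

# Compare where the points sit against where the rasters live.
st_bbox(pres.abs.new)   # every feature at (-96.00105, 23.00031)
extent(allvars)         # roughly -114.6..-102.3, 29.5..42.1

# If wgs_xF / wgs_yF are the real longitudes and latitudes, extracting
# from a plain two-column data frame bypasses the suspect geometry
# (subsetting an sf object keeps its geometry, so drop it first):
xy <- st_drop_geometry(pres.abs.new)[, c("wgs_xF", "wgs_yF")]
extracted <- extract(allvars, xy)
```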
I have been trying to make a conditional sum based on a data.frame that has duplicates. I want to sum the rows that have an identical permno and date and create a separate column with this information, filling in NA's or preferably 0's.
My data set looks like this:
data.frame(crsp)
permno date PAYDT DISTCD divamt FACPR FACSHR PRC RET
1 10022 19280929 19281001 1272 0.25 0 0 71.00 0.045208
2 10022 19280929 19281001 1232 1.00 0 0 71.00 0.045208
3 10022 19281031 NA NA NA NA NA 73.50 0.035211
4 10022 19281130 NA NA NA NA NA 72.50 -0.013605
5 10022 19281231 19290202 1232 1.00 0 0 68.00 -0.044828
6 10022 19281231 19290202 1272 0.25 0 0 68.00 -0.044828
7 10022 19290131 NA NA NA NA NA 73.75 0.084559
8 10022 19290228 NA NA NA NA NA 69.00 -0.064407
9 10022 19290328 19290401 1232 1.00 0 0 65.00 -0.039855
10 10022 19290328 19290401 1272 0.25 0 0 65.00 -0.039855
11 10022 19290430 NA NA NA NA NA 67.00 0.030769
12 10022 19290531 NA NA NA NA NA 64.75 -0.033582
First, I combined permno and date to make a unique identifier:
crsp$permnodate = paste(as.character(crsp$permno),as.character(crsp$date),sep="")
Second, I tried to sum the duplicates and put the result into a new frame:
crsp_divsingl <- aggregate(crsp$divamt, by = list(permnodate = crsp$permnodate), FUN = sum, na.rm = TRUE)
However, I am unable to transfer this information back correctly to the original data.frame(crsp), as the columns have different lengths, so cbind and cbind.fill don't let me match them up. Specifically, I want the sum of the divamts attached to the first of each unique permnodate, so it corresponds in length with the rest of the data.frame. I have not had success with merge or match either.
I haven't tried loop functions yet, or managed to create any if or ifelse functions with success. Basically, this can be done in Excel with VLOOKUP or an INDEX/MATCH formula, but it is trickier in R than I first thought.
Help is much appreciated.
Best regards
Troels
You can use duplicated and merge to achieve this more easily. I've written an example. You'll have to alter this for your purposes, but hopefully it will put you on the right track:
# Creating a fake sample dataset.
set.seed(9)
permno <- 10022:10071 # Allowing 50 possible permno's.
date <- 19280929:19280978 # Allow 50 possible dates.
value <- c(NA, 1:9) # Allowing NA or a value from 1 through 9.
# Creating fake data frame.
crsp <- data.frame(permno = sample(permno, 1000, TRUE), date = sample(date, 1000, TRUE), value = sample(value, 1000, TRUE))
# Loading a function that uses duplicated to get both the duplicated rows and the original rows.
fullDup <- function(x) {
bool <- duplicated(x) | duplicated(x, fromLast = TRUE)
return(bool)
}
# Getting the duplicated rows.
crsp.dup <- crsp[fullDup(crsp[, c("permno", "date")]), ] # fullDup flags every row that shares permno and date with another row, including the first occurrence.
# Now aggregate.
crsp.dup[is.na(crsp.dup)] <- 0 # Converting NA values to 0.
crsp.dup <- aggregate(value ~ permno + date, crsp.dup, sum)
names(crsp.dup)[3] <- "value.dup" # Changing the name of the value column.
# Now merge back in with the original dataset.
crsp <- merge(crsp, crsp.dup, by = c("permno", "date"), all.x = TRUE)
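If the goal is simply a per-group sum column on the original frame (the VLOOKUP-style fill described in the question), base R's ave() does it in one step, with no merge needed; a self-contained sketch on a cut-down version of the question's data:

```r
# Toy version of the crsp frame from the question.
crsp <- data.frame(
  permno = c(10022, 10022, 10022, 10022),
  date   = c(19280929, 19280929, 19281031, 19281130),
  divamt = c(0.25, 1.00, NA, NA)
)

# Group-wise sum of divamt (NA treated as 0), repeated onto every row
# of the group, so the column length always matches the frame.
crsp$divamt_sum <- ave(ifelse(is.na(crsp$divamt), 0, crsp$divamt),
                       paste(crsp$permno, crsp$date), FUN = sum)
crsp$divamt_sum
# 1.25 1.25 0.00 0.00
```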
I have a json text file which reads
{"type":"session.ended","v":2,"post.source":"1306210600-col001.sv1.movenetworks.com:2097#5","post.ip4":"75.114.187.146","post.rtime_secs":1371794661,"post.rtime_text":"2013-06-21 06:04:20","post.time_secs":1371794596,"post.time_text":"2013-06-21 06:03:16","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":65,"id.session":"3625657","id.sub":"2370b726-b96e-11e2-b3eb-1231380e1adf","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3050X","device.info":"Roku;3050X","device.serial":"12C241003940","device.version":"2.5.0.7","device.uuid":"51ce2255-62ad-5778-b2d7-b9543c1476c6","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130329","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"48154","ip.dma":505,"ip.dma_name":"unknown","ip.city":"livonia","ip.country":"united states","ip.region":"michigan","ip.continent":"north america","ip.isp":"bright house networks llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1338713","clip.pub":"CBD46B77","asset.id":524768,"asset.pub":"CBD46B77","asset.length_ms":1800000,"asset.guid":"b7c12c09dc5aec832e142a00b0f191fa","asset.title":"Diya Aur Bati Hum","asset.type":"captured","asset.adult":false,"asset.franchise_guid":"941496e1452b4fce9acfe7b3339924eb","asset.franchise_title":"Diya Aur Bati 
Hum","container.id":14,"container.pub":"CBD46B77","container.guid":"3caa6afd715e4c57ac4750d29e449a9c","container.title":"SPLUS","usage.elapsed_ms":2312,"usage.viewed_ms":392,"usage.stage":"mainVideo","exp.idle_ms":350,"exp.stalls":0,"exp.stalled_ms":0,"exp.frame_renders":0,"exp.frame_drops":0,"exp.ghost_session_ms":0,"exp.ghost_sessions":0,"exp.qmx_stale_ms":0,"exp.qmx_error_ms":0,"exp.qss_late_ms":0,"exp.qss_error_ms":0,"exp.fom":0,"exp.fom_weight":0,"data.dl_bytes":228,"data.ul_bytes":406,"http.oks":2,"http.errors":0,"http.timeouts":0,"net.throughput":8977,"http.slows":0,"data.bitrate_mean":1968,"data.bitrate_stddev":0,"data.bitrate_median":1950,"data.bitrate_modes":[1950],"data.streamlets":1,"data.late_streamlets":0,"data.all_streamlets":0,"data.bf_streamlets":0,"data.ab_streamlets":0}
{"type":"session.started","v":2,"post.source":"1306210600-col004.sv1.movenetworks.com:2183#6","post.ip4":"63.225.172.43","post.rtime_secs":1371794671,"post.rtime_text":"2013-06-21 06:04:31","post.time_secs":1371794660,"post.time_text":"2013-06-21 06:04:20","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":11,"id.session":"232169818","id.sub":"55d514ba-3858-11e2-91a7-12313d08e01f","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3100X","device.info":"Roku;3100X","device.serial":"13C2AE061481","device.version":"2.5.0.37","device.uuid":"7f5654d5-3aa7-5a5f-bb2b-8084da358942","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130615","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"98115","ip.dma":819,"ip.dma_name":"unknown","ip.city":"seattle","ip.country":"united states","ip.region":"washington","ip.continent":"north america","ip.isp":"qwest communications company llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1339170","clip.pub":"CBD46B77","asset.id":522015,"asset.pub":"CBD46B77","asset.length_ms":7200000,"asset.guid":"c6938cfa200a21e90dce41f5ed131cc2","asset.title":"Spark Top 20","asset.type":"captured","asset.adult":false,"container.id":277,"container.pub":"CBD46B77","container.guid":"03e3a689e245457bba2f98c30ef931fa","container.title":"BIGMGC","usage.stage":"mainVideo","exp.idle_ms":5772}
I want to load it into R and convert it to a data frame. Here the field names are part of the data, and the rows have unequal numbers of fields (13 rows in total).
Any help will be appreciated.
Here's one way to do it:
file <- '[
{"type":"session.ended","v":2,"post.source":"1306210600-col001.sv1.movenetworks.com:2097#5","post.ip4":"75.114.187.146","post.rtime_secs":1371794661,"post.rtime_text":"2013-06-21 06:04:20","post.time_secs":1371794596,"post.time_text":"2013-06-21 06:03:16","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":65,"id.session":"3625657","id.sub":"2370b726-b96e-11e2-b3eb-1231380e1adf","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3050X","device.info":"Roku;3050X","device.serial":"12C241003940","device.version":"2.5.0.7","device.uuid":"51ce2255-62ad-5778-b2d7-b9543c1476c6","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130329","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"48154","ip.dma":505,"ip.dma_name":"unknown","ip.city":"livonia","ip.country":"united states","ip.region":"michigan","ip.continent":"north america","ip.isp":"bright house networks llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1338713","clip.pub":"CBD46B77","asset.id":524768,"asset.pub":"CBD46B77","asset.length_ms":1800000,"asset.guid":"b7c12c09dc5aec832e142a00b0f191fa","asset.title":"Diya Aur Bati Hum","asset.type":"captured","asset.adult":false,"asset.franchise_guid":"941496e1452b4fce9acfe7b3339924eb","asset.franchise_title":"Diya Aur Bati 
Hum","container.id":14,"container.pub":"CBD46B77","container.guid":"3caa6afd715e4c57ac4750d29e449a9c","container.title":"SPLUS","usage.elapsed_ms":2312,"usage.viewed_ms":392,"usage.stage":"mainVideo","exp.idle_ms":350,"exp.stalls":0,"exp.stalled_ms":0,"exp.frame_renders":0,"exp.frame_drops":0,"exp.ghost_session_ms":0,"exp.ghost_sessions":0,"exp.qmx_stale_ms":0,"exp.qmx_error_ms":0,"exp.qss_late_ms":0,"exp.qss_error_ms":0,"exp.fom":0,"exp.fom_weight":0,"data.dl_bytes":228,"data.ul_bytes":406,"http.oks":2,"http.errors":0,"http.timeouts":0,"net.throughput":8977,"http.slows":0,"data.bitrate_mean":1968,"data.bitrate_stddev":0,"data.bitrate_median":1950,"data.bitrate_modes":[1950],"data.streamlets":1,"data.late_streamlets":0,"data.all_streamlets":0,"data.bf_streamlets":0,"data.ab_streamlets":0}
,{"type":"session.started","v":2,"post.source":"1306210600-col004.sv1.movenetworks.com:2183#6","post.ip4":"63.225.172.43","post.rtime_secs":1371794671,"post.rtime_text":"2013-06-21 06:04:31","post.time_secs":1371794660,"post.time_text":"2013-06-21 06:04:20","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":11,"id.session":"232169818","id.sub":"55d514ba-3858-11e2-91a7-12313d08e01f","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3100X","device.info":"Roku;3100X","device.serial":"13C2AE061481","device.version":"2.5.0.37","device.uuid":"7f5654d5-3aa7-5a5f-bb2b-8084da358942","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130615","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"98115","ip.dma":819,"ip.dma_name":"unknown","ip.city":"seattle","ip.country":"united states","ip.region":"washington","ip.continent":"north america","ip.isp":"qwest communications company llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1339170","clip.pub":"CBD46B77","asset.id":522015,"asset.pub":"CBD46B77","asset.length_ms":7200000,"asset.guid":"c6938cfa200a21e90dce41f5ed131cc2","asset.title":"Spark Top 20","asset.type":"captured","asset.adult":false,"container.id":277,"container.pub":"CBD46B77","container.guid":"03e3a689e245457bba2f98c30ef931fa","container.title":"BIGMGC","usage.stage":"mainVideo","exp.idle_ms":5772}
]'
You need the function fromJSON of the RJSONIO package:
library(RJSONIO)
json <- fromJSON(file, nullValue = NA)
Replace (empty) lists by NA and convert to data frames:
dat <- lapply(json, function(j) {
as.data.frame(replace(j, sapply(j, is.list), NA))
})
Create a single data frame:
library(plyr)
res <- rbind.fill(dat)
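As an aside, when each JSON object sits on its own line (newline-delimited JSON), the jsonlite package can read and flatten such input directly, filling missing fields with NA; a minimal self-contained sketch (note the source file above has a line break inside one string, which would need repairing first):

```r
library(jsonlite)

# Two records with unequal fields, mimicking the session log above.
txt <- c('{"type":"session.ended","v":2,"exp.stalls":0}',
         '{"type":"session.started","v":2,"exp.idle_ms":5772}')
dat <- stream_in(textConnection(txt))
# 'dat' is a 2-row data frame; fields absent from a record come back as NA.
```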
The result (res):
type v
1 session.ended 2
2 session.started 2
post.source
1 1306210600-col001.sv1.movenetworks.com:2097#5
2 1306210600-col004.sv1.movenetworks.com:2183#6
post.ip4 post.rtime_secs post.rtime_text
1 75.114.187.146 1371794661 2013-06-21 06:04:20
2 63.225.172.43 1371794671 2013-06-21 06:04:31
post.time_secs post.time_text post.time_date
1 1371794596 2013-06-21 06:03:16 2013-06-21
2 1371794660 2013-06-21 06:04:20 2013-06-21
post.time_hour post.late_secs id.session
1 6 65 3625657
2 6 11 232169818
id.sub
1 2370b726-b96e-11e2-b3eb-1231380e1adf
2 55d514ba-3858-11e2-91a7-12313d08e01f
id.partner id.service
1 0CB48A664E514CA48D378D152574EDBB BBTV (CBD46B77)
2 0CB48A664E514CA48D378D152574EDBB BBTV (CBD46B77)
device.make device.model device.info device.serial
1 Roku 3050X Roku;3050X 12C241003940
2 Roku 3100X Roku;3100X 13C2AE061481
device.version device.uuid
1 2.5.0.7 51ce2255-62ad-5778-b2d7-b9543c1476c6
2 2.5.0.37 7f5654d5-3aa7-5a5f-bb2b-8084da358942
device.os device.os_version device.platform
1 Linux 2.6.35 Roku
2 Linux 2.6.35 Roku
device.platform_vendor device.platform_version
1 Move Networks 1.0.0.20130329
2 Move Networks 1.0.0.20130615
device.licenses user.type ip.provider ip.postal_code
1 NA Subscriber netacuity 48154
2 NA Subscriber netacuity 98115
ip.dma ip.dma_name ip.city ip.country ip.region
1 505 unknown livonia united states michigan
2 819 unknown seattle united states washington
ip.continent ip.isp ip.asn
1 north america bright house networks llc 0
2 north america qwest communications company llc 0
ip.asn_owner clip.id clip.pub asset.id asset.pub
1 ? 1338713 CBD46B77 524768 CBD46B77
2 ? 1339170 CBD46B77 522015 CBD46B77
asset.length_ms asset.guid
1 1800000 b7c12c09dc5aec832e142a00b0f191fa
2 7200000 c6938cfa200a21e90dce41f5ed131cc2
asset.title asset.type asset.adult
1 Diya Aur Bati Hum captured FALSE
2 Spark Top 20 captured FALSE
asset.franchise_guid asset.franchise_title
1 941496e1452b4fce9acfe7b3339924eb Diya Aur Bati Hum
2 <NA> <NA>
container.id container.pub
1 14 CBD46B77
2 277 CBD46B77
container.guid container.title
1 3caa6afd715e4c57ac4750d29e449a9c SPLUS
2 03e3a689e245457bba2f98c30ef931fa BIGMGC
usage.elapsed_ms usage.viewed_ms usage.stage
1 2312 392 mainVideo
2 NA NA mainVideo
exp.idle_ms exp.stalls exp.stalled_ms
1 350 0 0
2 5772 NA NA
exp.frame_renders exp.frame_drops exp.ghost_session_ms
1 0 0 0
2 NA NA NA
exp.ghost_sessions exp.qmx_stale_ms exp.qmx_error_ms
1 0 0 0
2 NA NA NA
exp.qss_late_ms exp.qss_error_ms exp.fom
1 0 0 0
2 NA NA NA
exp.fom_weight data.dl_bytes data.ul_bytes http.oks
1 0 228 406 2
2 NA NA NA NA
http.errors http.timeouts net.throughput http.slows
1 0 0 8977 0
2 NA NA NA NA
data.bitrate_mean data.bitrate_stddev
1 1968 0
2 NA NA
data.bitrate_median data.bitrate_modes data.streamlets
1 1950 1950 1
2 NA NA NA
data.late_streamlets data.all_streamlets
1 0 0
2 NA NA
data.bf_streamlets data.ab_streamlets
1 0 0
2 NA NA
I am trying to read data from tables similar to http://www.fec.gov/pubrec/fe1996/hraz.htm using R, but have been unable to make progress. I realize that to do so I need XML and RCurl, but despite the numerous examples on the web concerning similar problems, I have not been able to resolve this one.
The first issue is that the page renders as a table but is not coded as one. Treating it as an XML document, I can access the "data" in the table, but because there are several tables I would like to get, I don't believe this is the most elegant solution.
Treating it as an HTML document might work better, but I am relatively unfamiliar with xpathApply and do not know how to get at the actual "data" in the table, since it is not bracketed by anything (i.e. no <i>...</i> or <b>...</b> tags).
I have had some success with XML files in the past, but this is my first attempt at doing something similar with HTML files. These files in particular seem to have less structure than other examples I have seen.
Any help is much appreciated.
Assuming you can read the HTML output into a text file (the equivalent of copying and pasting from your web browser),
this should get you a good chunk of the way there:
# x is the output from the website
library(stringr)
library(data.table)
# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)
# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]
# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)
# grab the straggling district number, use it as a name and remove it
# end of first line
eofl <- str_locate(districts, "\n")[,2]
# trim white space and assign as name
names(districts) <- str_trim(substr(districts, 1, eofl))
# remove first line
districts <- substr(districts, eofl+1, nchar(districts))
# replace the ending '-------' and trim white space
districts <- str_trim(str_replace_all(districts, "---*", ""))
# Adjust the delimiter (this is the tricky part)
## two or more spaces act as a separator
districts <- str_replace_all(districts, " +", "\t")
## lines that are total tallies are missing two columns,
## so we need to add two extra delims, after the first and third columns.
# this function does that:
padDelims <- function(section, splton) {
# split into lines
section <- strsplit(section, splton)[[1]]
# identify lines starting with totals
LinesToFix <- str_detect(section, "^Total")
# pad appropriate columns
section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])
# any rows missing delims, pad at end
counts <- str_count(section, "\t")
toadd <- max(counts) - counts
section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd)
# paste it back together and return
paste(section, collapse=splton)
}
districts <- lapply(districts, padDelims, splton="\n")
# read each table, simultaneously adding the district column
districtTables <-
lapply(names(districts), function(d)
data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
# ... or without adding district number:
## lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))
# flatten it
votes <- do.call(rbind, districtTables)
setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
Sample table:
votes
Candidate Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
1: Salmon, Matt R 33672 100.00 135634.00 60.18 1
2: Total Party Votes: 33672 NA NA NA 1
3: NA NA NA NA 1
4: Cox, John W(D)/D 1942 100.00 89738.00 39.82 1
5: Total Party Votes: 1942 NA NA NA 1
6: NA NA NA NA 1
7: Total District Votes: 35614 NA 225372.00 NA 1
8: Pastor, Ed D 29969 100.00 81982.00 65.01 2
9: Total Party Votes: 29969 NA NA NA 2
10: NA NA NA NA 2
...
51: Hayworth, J.D. R 32554 100.00 121431.00 47.57 6
52: Total Party Votes: 32554 NA NA NA 6
53: NA NA NA NA 6
54: Owens, Steve D 35137 100.00 118957.00 46.60 6
55: Total Party Votes: 35137 NA NA NA 6
56: NA NA NA NA 6
57: Anderson, Robert LBT 148 100.00 14899.00 5.84 6
58: NA NA NA NA 6
59: Total District Votes: 67839 NA 255287.00 NA 6
60: NA NA NA NA 6
61: Total State Votes: 368185 NA 1356446.00 NA 6
Candidate Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
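For completeness, the snippet above assumes `x` already holds the page text. One hedged way to obtain it (URL taken from the question; the tag-stripping regex is crude and assumes the data sits in a mostly preformatted page, so treat it only as a starting point):

```r
# Read the raw page and strip markup so only the fixed-width text remains.
x <- paste(readLines("http://www.fec.gov/pubrec/fe1996/hraz.htm"),
           collapse = "\n")
x <- gsub("<[^>]+>", "", x)  # crude tag removal; not robust general HTML parsing
```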