Zero-Inflated Poisson Regression

Zero-Inflated Poisson Regression - regression

I want to predict the number of visits of a place of interest, i.e. restaurants, and my data has the following characteristics: my visits data is count data, it is highly skewed, there are a high amount of zeros present and there are some numbers acting like outliers. I cannot remove the zeros because they are indicating that either there are not visits at that specific time or there are no places of interest in that area. I can't remove the outliers either because it indicates the maximum number of visits during that time period. My dependent variable is the number of visits per census tract, and my independent variables involve census data such as total population, age groups (15-24, 25-34, 35-44, 45-44, 55-64), household size, household income (less than $10K...$149.9K), population by gender, transportation (total vehicles, total public, total walked, total other), and the number of places of interests per tract. Below the distribution of the number of visits:
Since the distribution is not normal, I worked the generalized linear model using a Poisson distribution. However, the variance is greater than the mean, indicating over-dispersion. Since there is a high number of zeros, my intuition tells me to use the zero-inflated poisson or the negative binomial. When I ran the zero-inflated poisson model using the zeroinfl() function in R, I got the following results:
Call:
zeroinfl(formula = Vsts_p_ ~ Ttl_ppl + A_15__2 + A_25__3 + A_35__4 + A_45__5 +
A_55__6 + Hshlds_ + L__1000 + X10000_ + X15000_ + X25000_ + X35000_ +
X50000_ + X75000_ + X100000 + Totl_Ml + Ttl_vhc + Ttl_pbl + Ttl_wlk +
Ttl____ + POIs, data = rest_train, family = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-9.6015 -0.6256 -0.2359 -0.1049 82.1648
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.7918192 NA NA NA
Ttl_ppl -0.0007789 NA NA NA
A_15__2 0.0004508 NA NA NA
A_25__3 -0.0009964 NA NA NA
A_35__4 0.0008019 NA NA NA
A_45__5 -0.0031752 NA NA NA
A_55__6 -0.0006723 NA NA NA
Hshlds_ 0.0019657 NA NA NA
L__1000 -0.0499632 NA NA NA
X10000_ 0.1235269 NA NA NA
X15000_ -0.1172473 NA NA NA
X25000_ 0.0497449 NA NA NA
X35000_ -0.0297751 NA NA NA
X50000_ -0.1157051 NA NA NA
X75000_ -0.0102363 NA NA NA
X100000 -0.0120035 NA NA NA
Totl_Ml 0.0011945 NA NA NA
Ttl_vhc -0.0004261 NA NA NA
Ttl_pbl -0.0091919 NA NA NA
Ttl_wlk -0.0042964 NA NA NA
Ttl____ 0.0073077 NA NA NA
POIs 0.0583964 NA NA NA
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.834e+00 NA NA NA
Ttl_ppl 1.235e-03 NA NA NA
A_15__2 -1.707e-03 NA NA NA
A_25__3 -4.341e-03 NA NA NA
A_35__4 -6.557e-04 NA NA NA
A_45__5 -5.858e-03 NA NA NA
A_55__6 -2.595e-03 NA NA NA
Hshlds_ -1.411e-04 NA NA NA
L__1000 -5.746e-02 NA NA NA
X10000_ 6.327e-02 NA NA NA
X15000_ 2.452e-02 NA NA NA
X25000_ -2.395e-02 NA NA NA
X35000_ 7.081e-03 NA NA NA
X50000_ -5.557e-03 NA NA NA
X75000_ 8.765e-03 NA NA NA
X100000 2.724e-03 NA NA NA
Totl_Ml 9.600e-04 NA NA NA
Ttl_vhc 7.418e-04 NA NA NA
Ttl_pbl -1.734e-05 NA NA NA
Ttl_wlk 1.022e-02 NA NA NA
Ttl____ -9.307e-04 NA NA NA
POIs -1.009e+00 NA NA NA
Number of iterations in BFGS optimization: 50
Log-likelihood: -6258 on 44 Df
Why I'm getting NA results? I ran the glm poisson model and it ran without any issues. My intention is to predict the number of visits using the generalized geographical weighted regression (GWR) model but first, I want to see which of the independent variables influence the dependent variable, I cannot use the correlation plot to do that because of the distribution not being normal. I need to select the appropriate independent variables to run the ggwr model.

Related

Extracting Raster Variables but only getting NA's

I am trying to extract variable values from 16 .img files and bind it to a pre-existing structure, and while the code runs, it only spits back NA values. I've tried converting the crs but it hasn't done anything.
library(raster)
library(sp)
library(rgdal)
library(sf)
library(exactextractr)
setwd("~/preds")
pres.abs <- st_read("~spp106mort.PPsA.SF.shp") # load in GIS data
etpt_5.r <- raster("etpt_5.img", band=1)
etpt_6.r <- raster("etpt_6.img", band=1)
etpt_sprin.r <- raster("etpt_sprin.img", band=1)
exp1nrm.r <- raster("exp1nrm.img", band=1)
exp3nrm.r <- raster("exp3nrm.img", band=1)
exp5nrm.r <- raster("exp5nrm.img", band=1)
mind_yr_av.r <- raster("mind_yr_av.img", band=1)
prad_sw_di.r <- raster("prad_sw_di.img", band=1)
prec_w_hal.r <- raster("prec_w_hal.img", band=1)
prec_winte.r <- raster("prec_winte.img", band=1)
rough_1k.r <- raster("rough_1k.img", band=1)
tave_s_hal.r <- raster("tave_s_hal.img", band=1)
tave_sprin.r <- raster("tave_sprin.img", band=1)
tmax_s_hal.r <- raster("tmax_s_hal.img", band=1)
tmax_summe.r <- raster("tmax_summe.img", band=1)
topos.r <- raster("topos.img", band=1)
allvars <- stack(etpt_5.r, etpt_6.r, etpt_sprin.r, exp1nrm.r, exp3nrm.r, exp5nrm.r,
mind_yr_av.r, prad_sw_di.r, prec_w_hal.r, prec_winte.r, rough_1k.r,
tave_s_hal.r, tave_sprin.r, tmax_s_hal.r, tmax_summe.r, topos.r)
## converting pres.abs to new crs
pres.abs.old <- pres.abs
newcrs <- crs(allvars)
pres.abs.new <- st_transform(pres.abs, newcrs)
crs(allvars)
Coordinate Reference System:
Deprecated Proj.4 representation: +proj=longlat +datum=WGS84 +no_defs
WKT2 2019 representation:
GEOGCRS["WGS 84 (with axis order normalized for visualization)",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic longitude (Lon)",east,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
AXIS["geodetic latitude (Lat)",north,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]]
projection(allvars)
[1] "+proj=longlat +datum=WGS84 +no_defs"
crs(pres.abs.new)
Coordinate Reference System:
Deprecated Proj.4 representation: +proj=longlat +datum=WGS84 +no_defs
WKT2 2019 representation:
GEOGCRS["WGS 84 (with axis order normalized for visualization)",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic longitude (Lon)",east,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]],
AXIS["geodetic latitude (Lat)",north,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433,
ID["EPSG",9122]]]]
projection(pres.abs.new)
[1] "+proj=longlat +datum=WGS84 +no_defs"
So I got the crs and projection the same, but when it comes time to extract, here's what happens
head(pres.abs.new)
Simple feature collection with 6 features and 12 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -96.00105 ymin: 23.00031 xmax: -96.00105 ymax: 23.00031
Geodetic CRS: WGS 84 (with axis order normalized for visualization)
UNIQUEID wgs_xF wgs_yF MORT106 exp5nrm topos rough_1k prad_sw_di tmax_s_hal etpt_6 mind_yr_av prec_w_hal
1 Z0401030100988520 -109.6594 33.25329 1 -23 -1.307692 25.61479 -5511 262.3333 101 -1676.667 34.83333
2 Z0401030301183589 -109.4448 33.20588 1 -48 -5.153846 51.95938 -5622 278.6667 101 -1708.083 29.83333
3 Z0401030401180927 -109.2798 33.42938 1 21 19.923077 39.72136 -5198 258.0000 100 -1680.667 33.50000
4 Z0401030501183471 -109.3880 33.38541 1 0 18.230770 27.97424 -5506 250.8333 100 -1585.667 35.83333
5 Z0401030601186305 -109.3887 33.34027 1 9 24.153847 34.67730 -5456 262.8333 101 -1679.917 33.00000
6 Z0401030701187126 -109.2232 33.56368 1 110 -33.538460 61.15303 -5409 248.5000 99 -1495.417 34.16667
geometry
1 POINT (-96.00105 23.00031)
2 POINT (-96.00105 23.00031)
3 POINT (-96.00105 23.00031)
4 POINT (-96.00105 23.00031)
5 POINT (-96.00105 23.00031)
6 POINT (-96.00105 23.00031)
extracted <- extract(allvars, pres.abs.new[, c("wgs_xF", "wgs_yF")])
etpt_5 etpt_6 etpt_sprin exp1nrm exp3nrm exp5nrm mind_yr_av prad_sw_di prec_w_hal prec_winte rough_1k tave_s_hal tave_sprin tmax_s_hal
[1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
tmax_summe topos
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] NA NA
[5,] NA NA
[6,] NA NA
This is weird because each of the rasters I imported have distinct minimum and maximum values, so I know that they're not actually NA
prec_winte.r
class : RasterLayer
dimensions : 1518, 1478, 2243604 (nrow, ncol, ncell)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -114.6167, -102.3, 29.49167, 42.14167 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : prec_winte.img
names : prec_winte
values : 3.333333, 140.3333 (min, max)
I think the issue is dependent on where in the raster or dataframe I'm pulling the values from, but I can't figure out where!

Scrape nested html structure

I would like to scrape the data from this site, without losing the information from the nested structure. Consider the name benodanil, which not only belongs to benzanilide fungicides, but also to anilide fungicides and amide fungicides. It's not necessarily always 3 classes, but at least one and up to many. So, ideally, I'd want a data.frame that looks as such:
name
class1
class2
class3
...
benodanil
benzanilide fungicides
anilide fungicides
amide fungicides
NA
aureofungin
antibiotic fungicides
NA
NA
NA
...
...
...
...
I can scrape the data, but can't wrap my head around how to handle the information in the nested structure. What I tried so far:
require(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
# extract lists
li = html_nodes(site, 'li')
# extract unorder lists
ul = html_nodes(site, 'ul')
# loop idea
l = list()
for (i in seq_along(li)) {
li1 = html_nodes(li[i], 'a')
name = na.omit(unique(html_attr(li1, 'href')))
clas = na.omit(unique(html_attr(li1, 'name')))
l[[i]] = list(name = name,
clas = clas)
}
An additional problem is, that some names occur more than one time, such as bixafen. Hence, I guess the job has to be done iteratively.

library(dplyr)
library(tidyr)
library(rvest)
url = 'http://www.alanwood.net/pesticides/class_fungicides.html'
site = read_html(url)
a <- site %>% html_nodes('li ul a')
tibble(name = a %>% html_attr('href'),
class = a %>% html_attr('name')) %>%
fill(class) %>%
filter(!is.na(name)) %>%
mutate(name = sub('\\.html', '', name)) %>%
group_by(name) %>%
mutate(col = paste0('class', row_number())) %>%
pivot_wider(names_from = col, values_from = class) %>%
ungroup()
# A tibble: 189 x 4
# name class1 class2 class3
# <chr> <chr> <chr> <chr>
# 1 benalaxyl acylamino_acid_fungici… anilide_fungicides NA
# 2 benalaxyl-m acylamino_acid_fungici… anilide_fungicides NA
# 3 furalaxyl acylamino_acid_fungici… furanilide_fungicides NA
# 4 metalaxyl acylamino_acid_fungici… anilide_fungicides NA
# 5 metalaxyl-m acylamino_acid_fungici… anilide_fungicides NA
# 6 pefurazoate acylamino_acid_fungici… NA NA
# 7 valifenalate acylamino_acid_fungici… NA NA
# 8 bixafen anilide_fungicides picolinamide_fungici… pyrazolecarboxamide_fungic…
# 9 boscalid anilide_fungicides NA NA
#10 carboxin anilide_fungicides NA NA
# … with 179 more rows
Extract name and class from the webpage, fill the NA values with the previous non-NA, drop rows with NA values and get the data in wide format.

Conditional sum on data.frame based on duplicates

I have been trying to make a conditional sum based on a data.framethat has duplicates. I want to sum the ones that has an identical permno and date and create a separate column with this information filling in NA's or preferable 0's.
My data set looks like this:
data.frame(crsp)
permno date PAYDT DISTCD divamt FACPR FACSHR PRC RET
1 10022 19280929 19281001 1272 0.25 0 0 71.00 0.045208
2 10022 19280929 19281001 1232 1.00 0 0 71.00 0.045208
3 10022 19281031 NA NA NA NA NA 73.50 0.035211
4 10022 19281130 NA NA NA NA NA 72.50 -0.013605
5 10022 19281231 19290202 1232 1.00 0 0 68.00 -0.044828
6 10022 19281231 19290202 1272 0.25 0 0 68.00 -0.044828
7 10022 19290131 NA NA NA NA NA 73.75 0.084559
8 10022 19290228 NA NA NA NA NA 69.00 -0.064407
9 10022 19290328 19290401 1232 1.00 0 0 65.00 -0.039855
10 10022 19290328 19290401 1272 0.25 0 0 65.00 -0.039855
11 10022 19290430 NA NA NA NA NA 67.00 0.030769
12 10022 19290531 NA NA NA NA NA 64.75 -0.033582
First, I have created permno + date to make a unique pickup-code
crsp$permnodate = paste(as.character(crsp$permno),as.character(crsp$date),sep="")
Second, I have then tried to sum the duplicates and making this into a new frame:
crsp_divsingl <- aggregate(crsp$divamt, by = list(permnodate = crsp$permnodate), FUN = sum, na.rm = TRUE)
However, I am unable to transfer this information back correctly to the original data.frame(crsp), as the columns have different lenghts where cbind and cbind.fill don't allow me to match this correctly. Specifically, I want the sum of the divamts for one/the first of the unique permnodates so it corresponds with the remaining data.frame in length. I have not had succed with merge or match either.
I haven't tried loop functions yet or managed to create any if or ifelse functions with succes. Basically, this can be done in excel with the VLOOKUP or the index.match formula, however, this is more tricky in R than I first thought.
Help is much appreciated.
Best regards
Troels

You can use duplicated and merge to achieve this more easily. I've written an example. You'll have to alter this for your purposes, but hopefully it will put you on the right track:
# Creating a fake sample dataset.
set.seed(9)
permno <- 10022:10071 # Allowing 50 possible permno's.
date <- 19280929:19280978 # Allow 50 possible dates.
value <- c(NA, 1:9) # Allowing NA or a 0 through 9 value.
# Creating fake data frame.
crsp <- data.frame(permno = sample(permno, 1000, TRUE), date = sample(date, 1000, TRUE), value = sample(value, 1000, TRUE))
# Loading a function that uses duplicated to get both the duplicated rows and the original rows.
fullDup <- function(x) {
bool <- duplicated(x) | duplicated(x, fromLast = TRUE)
return(bool)
}
# Getting the duplicated rows.
crsp.dup <- crsp[fullDup(crsp[, c("permno", "date")]), ] # fullDup returns a boolean of all the rows that were duplicated to another row by permno and date including the first row.
# Now aggregate.
crsp.dup[is.na(crsp.dup)] <- 0 # Converting NA values to 0.
crsp.dup <- aggregate(value ~ permno + date, crsp.dup, sum)
names(crsp.dup)[3] <- "value.dup" # Changing the name of the value column.
# Now merge back in with the original dataset.
crsp <- merge(crsp, crsp.dup, by = c("permno", "date"), all.x = TRUE)

How to convert json file into dataframe in R?

I have a json text file which reads
{"type":"session.ended","v":2,"post.source":"1306210600-col001.sv1.movenetworks.com:2097#5","post.ip4":"75.114.187.146","post.rtime_secs":1371794661,"post.rtime_text":"2013-06-21 06:04:20","post.time_secs":1371794596,"post.time_text":"2013-06-21 06:03:16","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":65,"id.session":"3625657","id.sub":"2370b726-b96e-11e2-b3eb-1231380e1adf","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3050X","device.info":"Roku;3050X","device.serial":"12C241003940","device.version":"2.5.0.7","device.uuid":"51ce2255-62ad-5778-b2d7-b9543c1476c6","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130329","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"48154","ip.dma":505,"ip.dma_name":"unknown","ip.city":"livonia","ip.country":"united states","ip.region":"michigan","ip.continent":"north america","ip.isp":"bright house networks llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1338713","clip.pub":"CBD46B77","asset.id":524768,"asset.pub":"CBD46B77","asset.length_ms":1800000,"asset.guid":"b7c12c09dc5aec832e142a00b0f191fa","asset.title":"Diya Aur Bati Hum","asset.type":"captured","asset.adult":false,"asset.franchise_guid":"941496e1452b4fce9acfe7b3339924eb","asset.franchise_title":"Diya Aur Bati Hum","container.id":14,"container.pub":"CBD46B77","container.guid":"3caa6afd715e4c57ac4750d29e449a9c","container.title":"SPLUS","usage.elapsed_ms":2312,"usage.viewed_ms":392,"usage.stage":"mainVideo","exp.idle_ms":350,"exp.stalls":0,"exp.stalled_ms":0,"exp.frame_renders":0,"exp.frame_drops":0,"exp.ghost_session_ms":0,"exp.ghost_sessions":0,"exp.qmx_stale_ms":0,"exp.qmx_error_ms":0,"exp.qss_late_ms":0,"exp.qss_error_ms":0,"exp.fom":0,"exp.fom_weight":0,"data.dl_bytes":228,"data.ul_bytes":406,"http.oks":2,"http.errors":0,"http.timeouts":0,"net.throughput":8977,"http.slows":0,"data.bitrate_mean":1968,"data.bitrate_stddev":0,"data.bitrate_median":1950,"data.bitrate_modes":[1950],"data.streamlets":1,"data.late_streamlets":0,"data.all_streamlets":0,"data.bf_streamlets":0,"data.ab_streamlets":0}
{"type":"session.started","v":2,"post.source":"1306210600-col004.sv1.movenetworks.com:2183#6","post.ip4":"63.225.172.43","post.rtime_secs":1371794671,"post.rtime_text":"2013-06-21 06:04:31","post.time_secs":1371794660,"post.time_text":"2013-06-21 06:04:20","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":11,"id.session":"232169818","id.sub":"55d514ba-3858-11e2-91a7-12313d08e01f","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3100X","device.info":"Roku;3100X","device.serial":"13C2AE061481","device.version":"2.5.0.37","device.uuid":"7f5654d5-3aa7-5a5f-bb2b-8084da358942","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130615","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"98115","ip.dma":819,"ip.dma_name":"unknown","ip.city":"seattle","ip.country":"united states","ip.region":"washington","ip.continent":"north america","ip.isp":"qwest communications company llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1339170","clip.pub":"CBD46B77","asset.id":522015,"asset.pub":"CBD46B77","asset.length_ms":7200000,"asset.guid":"c6938cfa200a21e90dce41f5ed131cc2","asset.title":"Spark Top 20","asset.type":"captured","asset.adult":false,"container.id":277,"container.pub":"CBD46B77","container.guid":"03e3a689e245457bba2f98c30ef931fa","container.title":"BIGMGC","usage.stage":"mainVideo","exp.idle_ms":5772}
I want to load it in R and convert to a dataframe.Here field names are a part of the data and also we have unequal no of fields in each row (a total of 13 rows)
Any help will be appreciated.

Here's one way to do it:
file <- '[
{"type":"session.ended","v":2,"post.source":"1306210600-col001.sv1.movenetworks.com:2097#5","post.ip4":"75.114.187.146","post.rtime_secs":1371794661,"post.rtime_text":"2013-06-21 06:04:20","post.time_secs":1371794596,"post.time_text":"2013-06-21 06:03:16","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":65,"id.session":"3625657","id.sub":"2370b726-b96e-11e2-b3eb-1231380e1adf","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3050X","device.info":"Roku;3050X","device.serial":"12C241003940","device.version":"2.5.0.7","device.uuid":"51ce2255-62ad-5778-b2d7-b9543c1476c6","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130329","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"48154","ip.dma":505,"ip.dma_name":"unknown","ip.city":"livonia","ip.country":"united states","ip.region":"michigan","ip.continent":"north america","ip.isp":"bright house networks llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1338713","clip.pub":"CBD46B77","asset.id":524768,"asset.pub":"CBD46B77","asset.length_ms":1800000,"asset.guid":"b7c12c09dc5aec832e142a00b0f191fa","asset.title":"Diya Aur Bati Hum","asset.type":"captured","asset.adult":false,"asset.franchise_guid":"941496e1452b4fce9acfe7b3339924eb","asset.franchise_title":"Diya Aur Bati Hum","container.id":14,"container.pub":"CBD46B77","container.guid":"3caa6afd715e4c57ac4750d29e449a9c","container.title":"SPLUS","usage.elapsed_ms":2312,"usage.viewed_ms":392,"usage.stage":"mainVideo","exp.idle_ms":350,"exp.stalls":0,"exp.stalled_ms":0,"exp.frame_renders":0,"exp.frame_drops":0,"exp.ghost_session_ms":0,"exp.ghost_sessions":0,"exp.qmx_stale_ms":0,"exp.qmx_error_ms":0,"exp.qss_late_ms":0,"exp.qss_error_ms":0,"exp.fom":0,"exp.fom_weight":0,"data.dl_bytes":228,"data.ul_bytes":406,"http.oks":2,"http.errors":0,"http.timeouts":0,"net.throughput":8977,"http.slows":0,"data.bitrate_mean":1968,"data.bitrate_stddev":0,"data.bitrate_median":1950,"data.bitrate_modes":[1950],"data.streamlets":1,"data.late_streamlets":0,"data.all_streamlets":0,"data.bf_streamlets":0,"data.ab_streamlets":0}
,{"type":"session.started","v":2,"post.source":"1306210600-col004.sv1.movenetworks.com:2183#6","post.ip4":"63.225.172.43","post.rtime_secs":1371794671,"post.rtime_text":"2013-06-21 06:04:31","post.time_secs":1371794660,"post.time_text":"2013-06-21 06:04:20","post.time_date":"2013-06-21","post.time_hour":6,"post.late_secs":11,"id.session":"232169818","id.sub":"55d514ba-3858-11e2-91a7-12313d08e01f","id.partner":"0CB48A664E514CA48D378D152574EDBB","id.service":"BBTV (CBD46B77)","device.make":"Roku","device.model":"3100X","device.info":"Roku;3100X","device.serial":"13C2AE061481","device.version":"2.5.0.37","device.uuid":"7f5654d5-3aa7-5a5f-bb2b-8084da358942","device.os":"Linux","device.os_version":"2.6.35","device.platform":"Roku","device.platform_vendor":"Move Networks","device.platform_version":"1.0.0.20130615","device.licenses":[],"user.type":"Subscriber","ip.provider":"netacuity","ip.postal_code":"98115","ip.dma":819,"ip.dma_name":"unknown","ip.city":"seattle","ip.country":"united states","ip.region":"washington","ip.continent":"north america","ip.isp":"qwest communications company llc","ip.asn":0,"ip.asn_owner":"?","clip.id":"1339170","clip.pub":"CBD46B77","asset.id":522015,"asset.pub":"CBD46B77","asset.length_ms":7200000,"asset.guid":"c6938cfa200a21e90dce41f5ed131cc2","asset.title":"Spark Top 20","asset.type":"captured","asset.adult":false,"container.id":277,"container.pub":"CBD46B77","container.guid":"03e3a689e245457bba2f98c30ef931fa","container.title":"BIGMGC","usage.stage":"mainVideo","exp.idle_ms":5772}
]'
You need the function fromJSON of the RJSONIO package:
library(RJSONIO)
json <- fromJSON(file, nullValue = NA)
Replace (empty) lists by NA and convert to data frames:
dat <- lapply(json, function(j) {
as.data.frame(replace(j, sapply(j, is.list), NA))
})
Create a single data frame:
library(plyr)
res <- rbind.fill(dat)
The result (res):
type v
1 session.ended 2
2 session.started 2
post.source
1 1306210600-col001.sv1.movenetworks.com:2097#5
2 1306210600-col004.sv1.movenetworks.com:2183#6
post.ip4 post.rtime_secs post.rtime_text
1 75.114.187.146 1371794661 2013-06-21 06:04:20
2 63.225.172.43 1371794671 2013-06-21 06:04:31
post.time_secs post.time_text post.time_date
1 1371794596 2013-06-21 06:03:16 2013-06-21
2 1371794660 2013-06-21 06:04:20 2013-06-21
post.time_hour post.late_secs id.session
1 6 65 3625657
2 6 11 232169818
id.sub
1 2370b726-b96e-11e2-b3eb-1231380e1adf
2 55d514ba-3858-11e2-91a7-12313d08e01f
id.partner id.service
1 0CB48A664E514CA48D378D152574EDBB BBTV (CBD46B77)
2 0CB48A664E514CA48D378D152574EDBB BBTV (CBD46B77)
device.make device.model device.info device.serial
1 Roku 3050X Roku;3050X 12C241003940
2 Roku 3100X Roku;3100X 13C2AE061481
device.version device.uuid
1 2.5.0.7 51ce2255-62ad-5778-b2d7-b9543c1476c6
2 2.5.0.37 7f5654d5-3aa7-5a5f-bb2b-8084da358942
device.os device.os_version device.platform
1 Linux 2.6.35 Roku
2 Linux 2.6.35 Roku
device.platform_vendor device.platform_version
1 Move Networks 1.0.0.20130329
2 Move Networks 1.0.0.20130615
device.licenses user.type ip.provider ip.postal_code
1 NA Subscriber netacuity 48154
2 NA Subscriber netacuity 98115
ip.dma ip.dma_name ip.city ip.country ip.region
1 505 unknown livonia united states michigan
2 819 unknown seattle united states washington
ip.continent ip.isp ip.asn
1 north america bright house networks llc 0
2 north america qwest communications company llc 0
ip.asn_owner clip.id clip.pub asset.id asset.pub
1 ? 1338713 CBD46B77 524768 CBD46B77
2 ? 1339170 CBD46B77 522015 CBD46B77
asset.length_ms asset.guid
1 1800000 b7c12c09dc5aec832e142a00b0f191fa
2 7200000 c6938cfa200a21e90dce41f5ed131cc2
asset.title asset.type asset.adult
1 Diya Aur Bati Hum captured FALSE
2 Spark Top 20 captured FALSE
asset.franchise_guid asset.franchise_title
1 941496e1452b4fce9acfe7b3339924eb Diya Aur Bati Hum
2 <NA> <NA>
container.id container.pub
1 14 CBD46B77
2 277 CBD46B77
container.guid container.title
1 3caa6afd715e4c57ac4750d29e449a9c SPLUS
2 03e3a689e245457bba2f98c30ef931fa BIGMGC
usage.elapsed_ms usage.viewed_ms usage.stage
1 2312 392 mainVideo
2 NA NA mainVideo
exp.idle_ms exp.stalls exp.stalled_ms
1 350 0 0
2 5772 NA NA
exp.frame_renders exp.frame_drops exp.ghost_session_ms
1 0 0 0
2 NA NA NA
exp.ghost_sessions exp.qmx_stale_ms exp.qmx_error_ms
1 0 0 0
2 NA NA NA
exp.qss_late_ms exp.qss_error_ms exp.fom
1 0 0 0
2 NA NA NA
exp.fom_weight data.dl_bytes data.ul_bytes http.oks
1 0 228 406 2
2 NA NA NA NA
http.errors http.timeouts net.throughput http.slows
1 0 0 8977 0
2 NA NA NA NA
data.bitrate_mean data.bitrate_stddev
1 1968 0
2 NA NA
data.bitrate_median data.bitrate_modes data.streamlets
1 1950 1950 1
2 NA NA NA
data.late_streamlets data.all_streamlets
1 0 0
2 NA NA
data.bf_streamlets data.ab_streamlets
1 0 0
2 NA NA

Reading fixed width format text tables from HTML page

I am trying to read data from tables similar to the following http://www.fec.gov/pubrec/fe1996/hraz.htm using R but have been unable to make progress. I realize that to do so I need to use XML and RCurl but in spite of the numerous other examples on the web concerning similar problems I have not been able to resolve this one.
The first issue is that the table is only a table when viewing it but is not coded as such. Treating it as an xml document I can access the "data" in the table but because there are several tables I would like to get I don't believe this to be the most elegant solution.
Treating it as an html document might work better but I am relatively unfamiliar with xpathApply and do not know how to get at the actual "data" in the table since it is not bracketed by anything (i.e. a i-/i or b-/b).
I have had some success using xml files in the past but this is my first attempt at doing something similar with html files. These files in particular seem to have less structure then other examples I have seen.
Any help is much appreciated.

Assuming you can read the html output into a text file (the equivalent of copying+pasting form your web browser),
this should get you a good chunk of the way there:
# x is the output from the website
library(stringr)
library(data.table)
# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)
# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]
# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)
# grab the straggling district number, use it as a name and remove it
# end of first line
eofl <- str_locate(districts, "\n")[,2]
# trim white space and assign as name
names(districts) <- str_trim(substr(districts, 1, eofl))
# remove first line
districts <- substr(districts, eofl+1, nchar(districts))
# replace the ending '-------' and trime white space
districts <- str_trim(str_replace_all(districts, "---*", ""))
# Adjust delimeter (this is the tricky part)
## more than two spaces are a spearator
districts <- str_replace_all(districts, " +", "\t")
## lines that are total tallies are missing two columns.
## thus, need to add two extra delims. After the first and third columns
# this function will
padDelims <- function(section, splton) {
# split into lines
section <- strsplit(section, splton)[[1]]
# identify lines starting with totals
LinesToFix <- str_detect(section, "^Total")
# pad appropriate columns
section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])
# any rows missing delims, pad at end
counts <- str_count(section, "\t")
toadd <- max(counts) - counts
section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd)
# paste it back together and return
paste(section, collapse=splton)
}
districts <- lapply(districts, padDelims, splton="\n")
# reading the table and simultaneously addding the district column
districtTables <-
lapply(names(districts), function(d)
data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
# ... or without adding district number:
## lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))
# flatten it
votes <- do.call(rbind, districtTables)
setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
Sample table:
votes
Candidate Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
1: Salmon, Matt R 33672 100.00 135634.00 60.18 1
2: Total Party Votes: 33672 NA NA NA 1
3: NA NA NA NA 1
4: Cox, John W(D)/D 1942 100.00 89738.00 39.82 1
5: Total Party Votes: 1942 NA NA NA 1
6: NA NA NA NA 1
7: Total District Votes: 35614 NA 225372.00 NA 1
8: Pastor, Ed D 29969 100.00 81982.00 65.01 2
9: Total Party Votes: 29969 NA NA NA 2
10: NA NA NA NA 2
...
51: Hayworth, J.D. R 32554 100.00 121431.00 47.57 6
52: Total Party Votes: 32554 NA NA NA 6
53: NA NA NA NA 6
54: Owens, Steve D 35137 100.00 118957.00 46.60 6
55: Total Party Votes: 35137 NA NA NA 6
56: NA NA NA NA 6
57: Anderson, Robert LBT 148 100.00 14899.00 5.84 6
58: NA NA NA NA 6
59: Total District Votes: 67839 NA 255287.00 NA 6
60: NA NA NA NA 6
61: Total State Votes: 368185 NA 1356446.00 NA 6
Candidate Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Zero-Inflated Poisson Regression - regression

Related

Extracting Raster Variables but only getting NA's

Scrape nested html structure

Conditional sum on data.frame based on duplicates

How to convert json file into dataframe in R?

Reading fixed width format text tables from HTML page

Categories

Resources