Data left out when reading json online to R

I'm trying to read online JSON data into R with the code below:
library('jsonlite')
address <- 'https://data.cityofchicago.org/resource/qnmj-8ku6.json'
sample <- fromJSON(address)
The code runs and returns a correctly formatted table, but it only contains 1,000 observations, while the original city portal dataset has more than 200,000. I'm not sure what to fix to download the whole dataset. Please help.

You're using the wrong link to get the data. You can find the correct link by going to 'Export' on the dataset's page:
library(jsonlite)
address <- "https://data.cityofchicago.org/api/views/qnmj-8ku6/rows.json?accessType=DOWNLOAD"
sample <- fromJSON(address)
length(sample)
# [1] 2
length(sample[[2]])
# [1] 274228
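Alternatively, you can stay with the resource endpoint: Socrata's SODA API returns only 1000 rows by default, but accepts $limit (and $offset for paging) as query parameters. A minimal sketch, assuming the portal honors these parameters:
library(jsonlite)
# Ask the resource endpoint for more than the default 1000 rows
address <- "https://data.cityofchicago.org/resource/qnmj-8ku6.json?$limit=50000"
chunk <- fromJSON(address)
nrow(chunk)
# Page through the rest in 50,000-row chunks via $offset
more <- fromJSON(paste0(address, "&$offset=50000"))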
You may also want to get it as a .csv to make it easier to work with straight away:
address <- "https://data.cityofchicago.org/api/views/qnmj-8ku6/rows.csv?accessType=DOWNLOAD"
sample_csv <- read.csv(address)
nrow(sample_csv)
# [1] 274228
str(sample_csv)
# 'data.frame': 274228 obs. of 22 variables:
# $ ID : int 10512552 10517063 10517120 10518590 10518648
# $ Case.Number : Factor w/ 274219 levels "HA107183","HA156050",..
# $ Date : Factor w/ 112977 levels "01/01/2014 01:00:00 AM",..
# $ Block : Factor w/ 27499 levels "0000X E 100TH PL",..
# $ IUCR : Factor w/ 331 levels "0110","0141",..
# $ Primary.Type : Factor w/ 33 levels "ARSON","ASSAULT",..
# $ Description : Factor w/ 310 levels "$500 AND UNDER",..
# ... etc

Difference in Difference in R (Callaway & Sant'Anna)

I'm trying to implement the did package by Callaway and Sant'Anna in my master's thesis, but I'm running into errors when I run the did code and when I try to view the summary.
did1 <- att_gt(yname = "countgreen",
               gname = "signing_year",
               idname = "investorid",
               tname = "dealyear",
               data = panel8)
This code warns me that:
"Be aware that there are some small groups in your dataset.
Check groups: 2006,2007,2008,2011. Dropped 109 observations that had missing data.overlap condition violated for 2009 in time period 2001Not enough control units for group 2009 in time period 2001 to run specified regression"
This error is repeated several hundred times.
Does this mean I need to re-match my treatment firms to control firms using a 1:3 ratio (treat:control) rather than the 1:1 I used previously?
Then when I run this code:
summary(did1)
I get this message:
Error in Math.data.frame(list(`mpobj$group` = c(2009L, 2009L, 2009L, 2009L, : non-numeric variable(s) in data frame: mpobj$att
I'm really not too sure what this means.
Can anyone help troubleshoot?
Thanks,
Rory
I don't know the did package, but I can try to answer about summary(did1).
If you do str(did1), you should see something like this:
'data.frame': 6 obs. of 7 variables:
$ cluster : int 1 2 3 4 5 6
$ price_scal : num -0.572 -0.132 0.891 1.091 -0.803 ...
$ hd_scal : num -0.778 0.63 0.181 -0.24 0.244 ...
$ ram_scal : num -0.6937 0.00479 0.46411 0.00653 -0.31204 ...
$ screen_scal: num -0.457 2.642 -0.195 2.642 -0.325 ...
$ ads_scal : num 0.315 -0.889 0.472 0.47 -0.822 ...
$ trend_scal : num -0.604 1.267 -0.459 -0.413 1.156 ...
But in your case you should see that mpobj$att is a factor or a character column. Fixing that may also make the did code run.
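For example, if str(did1) shows the att component stored as a factor, converting it back to numeric may get summary() past the error. A minimal sketch, assuming mpobj$att in the error corresponds to did1$att and that it really is a factor:
str(did1$att)  # inspect the class of the offending component
# Convert a factor to numeric via character so the values survive
did1$att <- as.numeric(as.character(did1$att))
summary(did1)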

rjson::fromJSON returns only the first item

I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column.
I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?
Here's an example that mimics my problem:
> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)
$`3x`
[1] 20

$`6y`
[1] 23
The desired output is something like:
# a data frame for both variables
3x 6y
1 20 23
2 60 50
or,
# a data frame for each variable
3x
1 20
2 60
6y
1 23
2 50
What you are looking for is actually a combination of lapply and some application of rbind or related.
I'll extend your data a little, just to have more than 2 elements.
eg <- c("{\"3x\": 20, \"6y\": 23}",
"{\"3x\": 60, \"6y\": 50}",
"{\"3x\": 99, \"6y\": 72}")
library(jsonlite)
Using base R, we can do
do.call(rbind.data.frame, lapply(eg, fromJSON))
# X3x X6y
# 1 20 23
# 2 60 50
# 3 99 72
You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.
Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)
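If you want to see the do.call-versus-Reduce difference for yourself, here is a quick benchmarking sketch (timings depend on your machine, so none are shown):
big <- rep(eg, 2000)                            # ~6000 JSON strings
parsed <- lapply(big, fromJSON)
system.time(do.call(rbind.data.frame, parsed))  # rbind called once
system.time(Reduce(rbind, parsed))              # rbind called N - 1 times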
If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps
eg <- c(eg, "{\"3x\": 99}")
then you'll notice that the base R solution no longer works by default.
do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) :
# numbers of columns of arguments do not match
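There are techniques to normalize the elements so that you can be assured of matches. One base R sketch (my own, not the only way) pads each parsed record with NA for any missing key before binding:
# Parse once, collect the union of all keys, pad missing ones with NA
parsed <- lapply(eg, fromJSON)
all_keys <- unique(unlist(lapply(parsed, names)))
padded <- lapply(parsed, function(x) {
  x[setdiff(all_keys, names(x))] <- NA
  as.data.frame(x[all_keys], check.names = FALSE)
})
do.call(rbind, padded)
#   3x 6y
# 1 20 23
# 2 60 50
# 3 99 72
# 4 99 NA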
If you're not averse to a tidyverse package, though, dplyr::bind_rows handles the padding for you:
library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
# `3x` `6y`
# <int> <int>
# 1 20 23
# 2 60 50
# 3 99 72
# 4 99 NA
Though you cannot use the dollar-method quite as directly (the names begin with digits), you can still use [[ or backticks:
eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99

Importing Data from a json file in R

I have a JSON data file that I want to import into R. I tried searching for similar posts, but they either get the data from URLs or the syntax gave errors.
Let's say the name of the JSON file is "JsData.json".
How can I get the data from JsData.json into R and convert it into Excel/CSV format for a better picture?
To confirm, this is the output using the rjson package. The file parameter has to be explicitly specified here; otherwise the function will treat its argument as a JSON string and throw an error.
myList = rjson::fromJSON(file = "JsData.json")
myList
# [[1]]
# [[1]]$key
# [1] "type1|new york, ny|NYC|hit"
#
# [[1]]$doc_count
# [1] 12
#
# [[2]]
# [[2]]$key
# [1] "type1|omaha, ne|Omaha|hit"
#
# [[2]]$doc_count
# [1] 8
#
# [[3]]
# [[3]]$key
# [1] "type2|yuba city, ca|Yuba|hit"
#
# [[3]]$doc_count
# [1] 9
In order to convert this to data frame, you can do:
do.call(rbind, lapply(myList, data.frame))
# key doc_count
# 1 type1|new york, ny|NYC|hit 12
# 2 type1|omaha, ne|Omaha|hit 8
# 3 type2|yuba city, ca|Yuba|hit 9
Writing the data frame out with write.csv(..., row.names = FALSE) should work, since Excel opens comma-separated .csv files directly. Note that write.csv ignores a sep argument (with a warning) and always writes commas; use write.table(..., sep = "\t") if you need a different delimiter.
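For comparison, jsonlite simplifies an array of objects like this one straight into a data frame, which skips the do.call step entirely (a sketch assuming the same JsData.json file):
df <- jsonlite::fromJSON("JsData.json")  # array of objects -> data frame
df
#                            key doc_count
# 1   type1|new york, ny|NYC|hit        12
# 2    type1|omaha, ne|Omaha|hit         8
# 3 type2|yuba city, ca|Yuba|hit         9
write.csv(df, "JsData.csv", row.names = FALSE)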
And the JsData.json data looks like this:
[{"key":"type1|new york, ny|NYC|hit","doc_count":12},
{"key":"type1|omaha, ne|Omaha|hit","doc_count":8},
{"key":"type2|yuba city, ca|Yuba|hit","doc_count":9}]

Importing/Conditioning a file.txt with a "kind" of json structure in R

I want to import a .txt file into R, but the format is really special: it looks like JSON, but I don't know how to import it. Here is an example of my data:
{"datetime":"2015-07-08 09:10:00","subject":"MMM","sscore":"-0.2280","smean":"0.2593","svscore":"-0.2795","sdispersion":"0.375","svolume":"8","sbuzz":"0.6026","lastclose":"155.430000000","companyname":"3M Company"},{"datetime":"2015-07-07 09:10:00","subject":"MMM","sscore":"0.2977","smean":"0.2713","svscore":"-0.7436","sdispersion":"0.400","svolume":"5","sbuzz":"0.4895","lastclose":"155.080000000","companyname":"3M Company"},{"datetime":"2015-07-06 09:10:00","subject":"MMM","sscore":"-1.0057","smean":"0.2579","svscore":"-1.3796","sdispersion":"1.000","svolume":"1","sbuzz":"0.4531","lastclose":"155.380000000","companyname":"3M Company"}
To deal with this I used this code:
test1 <- read.csv("C:/Users/test1.txt", header = FALSE)
## Imported as 5 observations (the 5th all empty) of 1700 variables,
## when in fact the data are 40 observations of 11 variables: the file
## has 4 lines of data (plus one empty line), and several records of
## 11 variables are placed next to each other on each line.
# Get the different lines
part1=test1[1:10]
part2=test1[11:20]
part3=test1[21:30]
part4=test1[31:40]
...
## Remove the empty line (there were an empty line after each)
part1=part1[-5,]
part2=part2[-5,]
part3=part3[-5,]
...
## Rename the columns
names(part1)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
names(part2)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
names(part3)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
...
## Assemble data to have one dataset
data=rbind(part1,part2,part3,part4,part5,part6,part7,part8,part9,part10)
## Format Date Time
times <- as.POSIXct(data$`Date Time`, format='{datetime:%Y-%m-%d %H:%M:%S')
data$`Date Time` <- times
## Keep only the Date
data$Date <- as.Date(times)
## Format data - Remove text
data$Subject <- gsub("subject:", "", data$Subject)
data$Sscore <- gsub("sscore:", "", data$Sscore)
...
So my code works to reconstruct the data, but it is clumsy and long. I know there are better ways to do it, so if you could help me with that I would be very grateful.
There are many packages that read JSON, e.g. rjson, jsonlite, RJSONIO (they will turn up in a Google search) - just pick one and give it a go.
e.g.
library(jsonlite)
json.text <- '{"datetime":"2015-07-08 09:10:00","subject":"MMM","sscore":"-0.2280","smean":"0.2593","svscore":"-0.2795","sdispersion":"0.375","svolume":"8","sbuzz":"0.6026","lastclose":"155.430000000","companyname":"3M Company"},{"datetime":"2015-07-07 09:10:00","subject":"MMM","sscore":"0.2977","smean":"0.2713","svscore":"-0.7436","sdispersion":"0.400","svolume":"5","sbuzz":"0.4895","lastclose":"155.080000000","companyname":"3M Company"},{"datetime":"2015-07-06 09:10:00","subject":"MMM","sscore":"-1.0057","smean":"0.2579","svscore":"-1.3796","sdispersion":"1.000","svolume":"1","sbuzz":"0.4531","lastclose":"155.380000000","companyname":"3M Company"}'
x <- fromJSON(paste0('[', json.text, ']'))
x
#              datetime subject  sscore  smean svscore sdispersion svolume  sbuzz     lastclose companyname
# 1 2015-07-08 09:10:00     MMM -0.2280 0.2593 -0.2795       0.375       8 0.6026 155.430000000  3M Company
# 2 2015-07-07 09:10:00     MMM  0.2977 0.2713 -0.7436       0.400       5 0.4895 155.080000000  3M Company
# 3 2015-07-06 09:10:00     MMM -1.0057 0.2579 -1.3796       1.000       1 0.4531 155.380000000  3M Company
I pasted the '[' and ']' around your JSON because you have multiple JSON elements (the rows in the data frame above), and for this to be well-formed JSON it needs to be an array, i.e. [ {...}, {...}, {...} ] rather than {...}, {...}, {...}.
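To apply the same trick to the .txt file itself rather than to a pasted string, you can read the raw text and wrap it the same way. A sketch, reusing the path from your code:
library(jsonlite)
# Read the raw lines, collapse to one string, wrap in [ ], then parse
raw <- readLines("C:/Users/test1.txt", warn = FALSE)
data <- fromJSON(paste0("[", paste(raw, collapse = ""), "]"))
str(data)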

permuting a path while keeping end vertices constant?

I have animal movement paths from GPS collars (the animal's location was recorded every 2h). To study how the actual path compares to random paths I need to generate alternate paths by randomly distributing the original route segments between the actual beginning and end locations (first and last vertices). I thought a good way to go would be to use the permute.vertices function in igraph. However, I cannot figure out how to keep the first and last vertices constant.
Here is a sample data set:
I'm starting out with a matrix of from-coordinates and to-coordinates that define the steps:
library(igraph)
path <- matrix(c(-111.52, -111.49, -111.48, -111.47, -111.46,
                 35.34, 35.35, 35.33, 35.32, 35.31,
                 -111.49, -111.48, -111.47, -111.46, -111.5,
                 35.35, 35.33, 35.32, 35.31, 35.4),
               nrow = 5, ncol = 4)
path <- as.data.frame(path)
names(path) <- c("From.x", "From.y", "To.x", "To.y")
From <- 0:(nrow(path) - 1)
To <- 1:nrow(path)
path <- cbind(From, To, path)
Turning the data.frame into a graph:
path <- graph.data.frame(path, directed = FALSE)
V(path)
Randomly permuting the vertices:
path2 <- permute.vertices(path, permutation = sample(vcount(path)))
V(path2)
How could I write the code so that the first and last vertices always stay "0" and "5"? (or, depending on the path, a different number than "5")
I also then need to extract the coordinates from the permuted path and get them into a matrix. I tried it with the tkplot.getcoords command, but am not sure how to transform them back (I suppose tkplot transforms them somehow).
tkplot(path2)
tkplot.getcoords(1, norm = TRUE)
I'm using RStudio on Windows 8.
Then just permute the rest of the vertices, and keep 0 and 5:
perm <- c(1, sample(2:(vcount(path) - 1)), vcount(path))
perm
# [1] 1 4 5 3 2 6
path2 <- permute.vertices(path, permutation = perm)
V(path2)
# Vertex sequence:
# [1] "0" "4" "3" "1" "2" "5"
For your other question, please explain better what you want, because I am not sure what kind of matrix you want to create.