Here is a link to a page that contains a table:
http://pitzavod.ru/products/upakovka/
When I read it with pd.read_html I do get a list, but 1) it is not nested, so when converted to a dataframe it is not readable, and 2) it contains integers from 0 up to the number of rows in the table on the website.
The list I get looks like:
[ 0 1 \
0 Показатели Марка целлюлозы
1 ОСН NaN
2 Механическая прочность при размоле в мельнице ... 10 000 740 520
3 Степень делигнификации, п.е. 28 - 45
4 Сорность - число соринок в условной массе 500г... 6500
5 Влажность, % не более 20
2
0 Методы испытаний
1 NaN
2 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 ГОСТ 10070
4 ГОСТ 14363.3
5 ГОСТ 16932 ]
Is there a way to easily clean this pandas output, or do I need to parse the website properly? Thank you.
That's because read_html always returns a list (even if there is only one table).
pandas.read_html: Read HTML tables into a list of DataFrame objects.
You need to index it with [0]:
df = pd.read_html("http://pitzavod.ru/products/upakovka/")[0]
Output (showing the last two columns):
1 2
0 Марка целлюлозы Методы испытаний
1 ОСН Методы испытаний
2 10 000 740 520 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 28 - 45 ГОСТ 10070
4 6500 ГОСТ 14363.3
5 20 ГОСТ 16932
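If you also want the first row of that table to act as the column header, here is a minimal follow-up sketch, assuming the first row really does hold the column labels, as the output above suggests:
import pandas as pd

# read_html returns a list of tables; take the first one
df = pd.read_html("http://pitzavod.ru/products/upakovka/")[0]

# promote the first row to the column header and drop it from the data
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)
Alternatively, read_html accepts a header argument (e.g. header=0), which should give the same result in one step.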
Category/Date   jul  aug  sep  oct  nov  dec  ...  jan
AA on 2020Jun     5    6    3    8    2    7  ...    4
AA on 2020May     7    3    2    6    5    5  ...    7
Difference       -2    1    1    2   -3    2  ...   -3
I am using a WCF query to fetch the data. The dataset looks like:
Category - AA
Date - 2020Jun
Month1 4
Month12 7
Month11 2
Month10 8
I am using a switch statement like the one below to display each column value:
switch(Reportdate.Month = 1, Fields.Month1.Value,
       Reportdate.Month = 12, Fields.Month12.Value)
The problem I am facing is with finding the difference in each column.
I have a CSV file with 1461 attributes that I want to load into a pandas data frame. The problem is that many rows do not have values for the trailing consecutive columns, so pandas raises a parsing error because of the irregular row lengths. How can I fill those missing trailing columns with missing values and load the CSV file into the data frame?
Edit 1:
The dataset CSV file looks like this:
a,b,c,d,e,f,g,h,i
1,2,4,5
1,0,9,8,7,6,5,4,7
1,3,5,6,7
6,7,8,8,9,4,5,3,5
I want a pandas data frame like the one below:
a b c d e f g h i
1 2 4 5 ? ? ? ? ?
1 0 9 8 7 6 5 4 7
1 3 5 6 7 ? ? ? ?
6 7 8 8 9 4 5 3 5
NaN in place of ? is fine.
The short rows simply don't have enough commas, which causes the unequal-length problem.
It seems you can use the names parameter of read_csv to supply column names from a range (if the attributes are columns):
import pandas as pd
from io import StringIO

temp = u"""
a,v
c,v,f,r
b,g
y"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
# with real data, change 4 to 1461
names = range(4)
df = pd.read_csv(StringIO(temp), names=names)
print(df)
0 1 2 3
0 a v NaN NaN
1 c v f r
2 b g NaN NaN
3 y NaN NaN NaN
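For comparison, reading the same input without names reproduces the kind of parsing error described in the question: pandas infers two columns from the first non-blank line and then meets a longer row. A minimal sketch of that (the exact error wording may vary between pandas versions):
import pandas as pd
from io import StringIO

temp = u"""
a,v
c,v,f,r
b,g
y"""
try:
    # no names given: pandas expects 2 fields from the first line, then sees 4
    df = pd.read_csv(StringIO(temp))
except pd.errors.ParserError as err:
    print(err)  # roughly: "Expected 2 fields in line 3, saw 4"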
EDIT:
temp = u"""a,b,c,d,e,f,g,h,i
1,2,4,5
1,0,9,8,7,6,5,4,7
1,3,5,6,7
6,7,8,8,9,4,5,3,5"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp))
print(df)
a b c d e f g h i
0 1 2 4 5 NaN NaN NaN NaN NaN
1 1 0 9 8 7.0 6.0 5.0 4.0 7.0
2 1 3 5 6 7.0 NaN NaN NaN NaN
3 6 7 8 8 9.0 4.0 5.0 3.0 5.0
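Applied to the real file, the same idea would look roughly like the sketch below, assuming the file is called 'filename.csv' as in the comments above and has no header line:
import pandas as pd

# 1461 integer labels guarantee a column for every possible field,
# so shorter rows are padded with NaN instead of raising a parsing error
df = pd.read_csv('filename.csv', names=range(1461))
# if the file does have a header line, add header=0 so that line is
# consumed and replaced by these names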
I am trying to read an HTML table using the package XML, but even though it looks easy, I haven't managed to do it. I tried everything, but the names of the columns are always set by R to V1, V2, V3, …
This is the code:
require(XML)
tbl <- readHTMLTable("http://facedata.ornl.gov/ornl/npp_98-08.html",
                     header = c("year", "ring", "CO2", "stem", "root", "leaf", "fine root", "NPP"),
                     skip.rows = c(1, 2), colClasses = c(rep("factor", 3), rep("numeric", 5)))
Many thanks for your help
The first row of the table is causing trouble. It may be easiest to remove it:
library(XML)
appURL <- "http://facedata.ornl.gov/ornl/npp_98-08.html"
doc <- htmlParse(appURL)
removeNodes(doc["//table/tr[1]"]) # remove the first row with the troublesome header
myTable <- readHTMLTable(doc, which = 1)
> head(myTable)
Year Plot CO2 Stem Coarse Root Leaf Fine Root Total NPP
1 1998 1 elev 1540 127 362 168 2197
2 1998 2 elev 1487 139 418 175 2219
3 1998 3 amb 1085 112 333 231 1762
4 1998 4 amb 1204 113 368 185 1870
5 1998 5 amb 1136 109 382 56 1683
6 1999 1 elev 1218 98 475 295 2086
I have a list of data.frames (dfList) from which I'd like to generate directed weighted networks in igraph in R.
The edge-weight variable is "ValueUSD", while vertices are identified by the numbers in the columns "Reporter" and "Partner".
Here follows a function I prepared, called "write.graphs", to be used with lapply on my "dfList".
write.graphs <- function(filename) {
  d <- graph.data.frame(filename[c("Reporter", "Partner")], directed = TRUE)
  d <- set.edge.attribute(d, "weight", value = filename$ValueUSD)
  d
}
graphs<-lapply(dfList, write.graphs)
Everything works perfectly. If I check for vertex names I get:
graphs[1]$names
NULL
But troubles emerge when I want to use names in place of numbers to identify my vertices, using the corresponding columns "ReporterN" and "PartnerN" in each of my data.frames in dfList.
Here you can see what my data.frames look like:
dfList[1]
$`Aug 2014`
Reporter YearPeriod Year Period Commodity Partner NetWeightKg ValueUSD Price PartnerN ReporterN
1 76 201408 2014 8 150910 0 4472917 22028271 4.924811 World Brazil
2 76 201408 2014 8 150910 32 380891 1533948 4.027262 Argentina Brazil
3 76 201408 2014 8 150910 152 239776 1336057 5.572105 Chile Brazil
4 76 201408 2014 8 150910 251 289 2164 7.487889 France Brazil
5 76 201408 2014 8 150910 300 27592 170658 6.185054 Greece Brazil
6
This is the message I get:
> graphs <- lapply(dfList, write.graphs)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In graph.data.frame(filename[c("ReporterN", "PartnerN")], ... :
In `d' `NA' elements were replaced with string "NA"
2: In graph.data.frame(filename[c("ReporterN", "PartnerN")], ... :
In `d' `NA' elements were replaced with string "NA"
3: In graph.data.frame(filename[c("ReporterN", "PartnerN")], ... :
If it helps, I checked:
class(dfList[1]$PartnerN)
[1] "NULL"
Any suggestions? Can anyone explain what is happening?
Thanks a lot, Umberto.