I am collecting sports data for a database I will be creating soon, and one of the columns I have for one of my tables is player birthdays. I am using R to scrape and collect the data, and here is what it currently looks like:
head(my_df)
Name BirthDate Age Place Height Weight
1 Mike NA NA NA
2 Austin August 17, 1994 22.295 Roseville, California 73 210
3 Koby January 19, 1997 20.140 Wolfforth, Texas 70 255
4 Ty 1991 26.000 Amarillo, Texas 70 165
5 Cole NA NA NA
6 Jeff July 24, 1995 21.320 Boulder, Colorado 72 200
dput(head(my_df))
structure(list(Name = c("Mike", "Austin", "Koby", "Ty", "Cole",
"Jeff"), BirthDate = c("", "August 17, 1994", "January 19, 1997",
"1991", "", "July 24, 1995"), Age = c(NA, 22.295, 20.14, 26,
NA, 21.32), Place = c("", "Roseville, California", "Wolfforth, Texas",
"Amarillo, Texas", "", "Boulder, Colorado"), Height = c(NA, 73,
70, 70, NA, 72), Weight = c(NA, 210L, 255L, 165L, NA, 200L)), .Names = c("Name",
"BirthDate", "Age", "Place", "Height", "Weight"), row.names = 25:30, class = "data.frame")
The BirthDate column has two inconsistencies that make appropriate formatting difficult:
there are entirely missing values
for some columns, I have the birthyear but am missing Day and Month.
I'm relatively new with databases and am trying to plan ahead with this. Does anybody have any thoughts on the best way to handle this? To be more specific, how can I write the R code to format the column the right way?
Not a complete answer but a direction to try...
Genealogy databases, which have similar problems, will often have an extra column (a sort of meta-date column) that expresses confidence in, or quality of, the birth-date column. This can then be used at query-time to mediate the results returned. Could be "estimated", "not given", "year only" to suit your app.
If we intend to do calculations on date as date, or on year and month columns, for example: the age of the player, how many of them were born on January, etc. then we can can separate it into 3 numeric columns.
do.call(rbind,
lapply(my_df$BirthDate, function(i){
x <- rev(unlist(strsplit(i, " ")))
res <- data.frame(
BirthDate = i,
YYYY = as.numeric(x[1]),
DD = as.numeric(sub(",", "", x[2], fixed = TRUE)),
MM = match(x[3], month.name))
res$DOB <- as.Date(paste(res$DD, res$MM, res$YYYY, sep = "/"),
format = "%d/%m/%Y")
res
}))
# BirthDate YYYY DD MM DOB
# 1 NA NA NA <NA>
# 2 August 17, 1994 1994 17 8 1994-08-17
# 3 January 19, 1997 1997 19 1 1997-01-19
# 4 1991 1991 NA NA <NA>
# 5 NA NA NA <NA>
# 6 July 24, 1995 1995 24 7 1995-07-24
Ok, as per a previous question (here) I've now managed to read a load of JSON data into R and to get the data into a data frame. here's the code:-
getCall <- GET("http://long-url.com",
authenticate("myusername", "password"))
contJSON <- content(getCall)
contJSON = sub("\n\r\n", "", contJSON)
df1 <- fromJSON(sprintf("[%s]", gsub("\n", ",", contJSON)), asText=TRUE)
df <- data.frame(matrix(unlist(df1), nrow=31, byrow=T))
Which gets me a data frame that looks as follows:-
head(df[,1:8])
X1 X2 X3 X4 X5 X6 X7 X8
1 2013-05-01 33682 11838 8023 3815 84 177.000000 177.000000
2 2013-05-02 32622 11626 7945 3681 58 210.000000 210.000000
3 2013-05-03 28467 11102 7786 3316 56 186.000000 186.000000
4 2013-05-04 20884 9031 6670 2361 51 7.000000 7.000000
5 2013-05-05 20481 8782 6390 2392 58 1.000000 1.000000
6 2013-05-06 25175 10019 7082 2937 62 24.000000 24.000000
However, there are no column names in my data frame. When I search for "names" in my JSON object R returns "NULL" so that doesn't give me anything useful.
I am wondering if there is any simple way (that might be repeatable on more general cases) to get the names for the column headers from the JSON schema.
I'm aware there are similar questions elsewhere on the site, but this one did not appear to be covered.
EDIT: As per the comment, here is the structure of the contJSON object.
"{\"metricDate\":\"2013-05-01\",\"pageCountTotal\":\"33682\",\"landCountTotal\":\"11838\",\"newLandCountTotal\":\"8023\",\"returnLandCountTotal\":\"3815\",\"spiderCountTotal\":\"84\",\"goalCountTotal\":\"177.000000\",\"callGoalCountTotal\":\"177.000000\",\"callCountTotal\":\"237.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"74.68\"}\n{\"metricDate\":\"2013-05-02\",\"pageCountTotal\":\"32622\",\"landCountTotal\":\"11626\",\"newLandCountTotal\":\"7945\",\"returnLandCountTotal\":\"3681\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"210.000000\",\"callGoalCountTotal\":\"210.000000\",\"callCountTotal\":\"297.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"70.71\"}\n{\"metricDate\":\"2013-05-03\",\"pageCountTotal\":\"28467\",\"landCountTotal\":\"11102\",\"newLandCountTotal\":\"7786\",\"returnLandCountTotal\":\"3316\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"186.000000\",\"callGoalCountTotal\":\"186.000000\",\"callCountTotal\":\"261.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"71.26\"}\n{\"metricDate\":\"2013-05-04\",\"pageCountTotal\":\"20884\",\"landCountTotal\":\"9031\",\"newLandCountTotal\":\"6670\",\"returnLandCountTotal\":\"2361\",\"spiderCountTotal\":\"51\",\"goalCountTotal\":\"7.000000\",\"callGoalCountTotal\":\"7.000000\",\"callCountTotal\":\"44.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.08\",\"callConversionPerc\":\"15.91\"}\n{\"metricDate\":\"2013-05-05\",\"pageCountTotal\":\"20481\",\"landCountTotal\":\"8782\",\"newLandCountTotal\":\"6390\",\"returnLandCountTotal\":\"2392\",\"spiderCountTotal\":\"58\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"12.50\"}\n{\"metricDate\":\"2013-05-06\",\"pageCountTotal\":\"25175\",\"landCountTotal\":\"10019\",\"newLandCountTotal\":\"7082\",\"returnLandCountTotal\":\"2937\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"24.000000\",\"callGoalCountTotal\":\"24.000000\",\"callCountTotal\":\"47.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.24\",\"callConversionPerc\":\"51.06\"}\n{\"metricDate\":\"2013-05-07\",\"pageCountTotal\":\"35892\",\"landCountTotal\":\"12615\",\"newLandCountTotal\":\"8391\",\"returnLandCountTotal\":\"4224\",\"spiderCountTotal\":\"62\",\"goalCountTotal\":\"239.000000\",\"callGoalCountTotal\":\"239.000000\",\"callCountTotal\":\"321.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.89\",\"callConversionPerc\":\"74.45\"}\n{\"metricDate\":\"2013-05-08\",\"pageCountTotal\":\"34106\",\"landCountTotal\":\"12391\",\"newLandCountTotal\":\"8389\",\"returnLandCountTotal\":\"4002\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"221.000000\",\"callGoalCountTotal\":\"221.000000\",\"callCountTotal\":\"295.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"74.92\"}\n{\"metricDate\":\"2013-05-09\",\"pageCountTotal\":\"32721\",\"landCountTotal\":\"12447\",\"newLandCountTotal\":\"8541\",\"returnLandCountTotal\":\"3906\",\"spiderCountTotal\":\"54\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"280.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.66\",\"callConversionPerc\":\"73.93\"}\n{\"metricDate\":\"2013-05-10\",\"pageCountTotal\":\"29724\",\"landCountTotal\":\"11616\",\"newLandCountTotal\":\"8063\",\"returnLandCountTotal\":\"3553\",\"spiderCountTotal\":\"139\",\"goalCountTotal\":\"207.000000\",\"callGoalCountTotal\":\"207.000000\",\"callCountTotal\":\"301.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.78\",\"callConversionPerc\":\"68.77\"}\n{\"metricDate\":\"2013-05-11\",\"pageCountTotal\":\"22061\",\"landCountTotal\":\"9660\",\"newLandCountTotal\":\"6971\",\"returnLandCountTotal\":\"2689\",\"spiderCountTotal\":\"52\",\"goalCountTotal\":\"3.000000\",\"callGoalCountTotal\":\"3.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.03\",\"callConversionPerc\":\"7.50\"}\n{\"metricDate\":\"2013-05-12\",\"pageCountTotal\":\"23341\",\"landCountTotal\":\"9935\",\"newLandCountTotal\":\"6960\",\"returnLandCountTotal\":\"2975\",\"spiderCountTotal\":\"45\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"12.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-13\",\"pageCountTotal\":\"36565\",\"landCountTotal\":\"13583\",\"newLandCountTotal\":\"9277\",\"returnLandCountTotal\":\"4306\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"246.000000\",\"callGoalCountTotal\":\"246.000000\",\"callCountTotal\":\"324.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.81\",\"callConversionPerc\":\"75.93\"}\n{\"metricDate\":\"2013-05-14\",\"pageCountTotal\":\"35260\",\"landCountTotal\":\"13797\",\"newLandCountTotal\":\"9375\",\"returnLandCountTotal\":\"4422\",\"spiderCountTotal\":\"59\",\"goalCountTotal\":\"212.000000\",\"callGoalCountTotal\":\"212.000000\",\"callCountTotal\":\"283.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"74.91\"}\n{\"metricDate\":\"2013-05-15\",\"pageCountTotal\":\"35836\",\"landCountTotal\":\"13792\",\"newLandCountTotal\":\"9532\",\"returnLandCountTotal\":\"4260\",\"spiderCountTotal\":\"94\",\"goalCountTotal\":\"187.000000\",\"callGoalCountTotal\":\"187.000000\",\"callCountTotal\":\"258.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.36\",\"callConversionPerc\":\"72.48\"}\n{\"metricDate\":\"2013-05-16\",\"pageCountTotal\":\"33136\",\"landCountTotal\":\"12821\",\"newLandCountTotal\":\"8755\",\"returnLandCountTotal\":\"4066\",\"spiderCountTotal\":\"65\",\"goalCountTotal\":\"192.000000\",\"callGoalCountTotal\":\"192.000000\",\"callCountTotal\":\"260.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.50\",\"callConversionPerc\":\"73.85\"}\n{\"metricDate\":\"2013-05-17\",\"pageCountTotal\":\"29564\",\"landCountTotal\":\"11721\",\"newLandCountTotal\":\"8191\",\"returnLandCountTotal\":\"3530\",\"spiderCountTotal\":\"213\",\"goalCountTotal\":\"166.000000\",\"callGoalCountTotal\":\"166.000000\",\"callCountTotal\":\"222.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.42\",\"callConversionPerc\":\"74.77\"}\n{\"metricDate\":\"2013-05-18\",\"pageCountTotal\":\"23686\",\"landCountTotal\":\"9916\",\"newLandCountTotal\":\"7335\",\"returnLandCountTotal\":\"2581\",\"spiderCountTotal\":\"56\",\"goalCountTotal\":\"5.000000\",\"callGoalCountTotal\":\"5.000000\",\"callCountTotal\":\"34.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.05\",\"callConversionPerc\":\"14.71\"}\n{\"metricDate\":\"2013-05-19\",\"pageCountTotal\":\"23528\",\"landCountTotal\":\"9952\",\"newLandCountTotal\":\"7184\",\"returnLandCountTotal\":\"2768\",\"spiderCountTotal\":\"57\",\"goalCountTotal\":\"1.000000\",\"callGoalCountTotal\":\"1.000000\",\"callCountTotal\":\"14.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.01\",\"callConversionPerc\":\"7.14\"}\n{\"metricDate\":\"2013-05-20\",\"pageCountTotal\":\"37391\",\"landCountTotal\":\"13488\",\"newLandCountTotal\":\"9024\",\"returnLandCountTotal\":\"4464\",\"spiderCountTotal\":\"69\",\"goalCountTotal\":\"227.000000\",\"callGoalCountTotal\":\"227.000000\",\"callCountTotal\":\"291.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"78.01\"}\n{\"metricDate\":\"2013-05-21\",\"pageCountTotal\":\"36299\",\"landCountTotal\":\"13174\",\"newLandCountTotal\":\"8817\",\"returnLandCountTotal\":\"4357\",\"spiderCountTotal\":\"77\",\"goalCountTotal\":\"164.000000\",\"callGoalCountTotal\":\"164.000000\",\"callCountTotal\":\"221.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.24\",\"callConversionPerc\":\"74.21\"}\n{\"metricDate\":\"2013-05-22\",\"pageCountTotal\":\"34201\",\"landCountTotal\":\"12433\",\"newLandCountTotal\":\"8388\",\"returnLandCountTotal\":\"4045\",\"spiderCountTotal\":\"76\",\"goalCountTotal\":\"195.000000\",\"callGoalCountTotal\":\"195.000000\",\"callCountTotal\":\"262.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.57\",\"callConversionPerc\":\"74.43\"}\n{\"metricDate\":\"2013-05-23\",\"pageCountTotal\":\"32951\",\"landCountTotal\":\"11611\",\"newLandCountTotal\":\"7757\",\"returnLandCountTotal\":\"3854\",\"spiderCountTotal\":\"68\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"231.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.44\",\"callConversionPerc\":\"72.29\"}\n{\"metricDate\":\"2013-05-24\",\"pageCountTotal\":\"28967\",\"landCountTotal\":\"10821\",\"newLandCountTotal\":\"7396\",\"returnLandCountTotal\":\"3425\",\"spiderCountTotal\":\"106\",\"goalCountTotal\":\"167.000000\",\"callGoalCountTotal\":\"167.000000\",\"callCountTotal\":\"203.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.54\",\"callConversionPerc\":\"82.27\"}\n{\"metricDate\":\"2013-05-25\",\"pageCountTotal\":\"19741\",\"landCountTotal\":\"8393\",\"newLandCountTotal\":\"6168\",\"returnLandCountTotal\":\"2225\",\"spiderCountTotal\":\"78\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"28.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-26\",\"pageCountTotal\":\"19770\",\"landCountTotal\":\"8237\",\"newLandCountTotal\":\"6009\",\"returnLandCountTotal\":\"2228\",\"spiderCountTotal\":\"79\",\"goalCountTotal\":\"0.000000\",\"callGoalCountTotal\":\"0.000000\",\"callCountTotal\":\"8.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.00\",\"callConversionPerc\":\"0.00\"}\n{\"metricDate\":\"2013-05-27\",\"pageCountTotal\":\"26208\",\"landCountTotal\":\"9755\",\"newLandCountTotal\":\"6779\",\"returnLandCountTotal\":\"2976\",\"spiderCountTotal\":\"82\",\"goalCountTotal\":\"26.000000\",\"callGoalCountTotal\":\"26.000000\",\"callCountTotal\":\"40.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"0.27\",\"callConversionPerc\":\"65.00\"}\n{\"metricDate\":\"2013-05-28\",\"pageCountTotal\":\"36980\",\"landCountTotal\":\"12463\",\"newLandCountTotal\":\"8226\",\"returnLandCountTotal\":\"4237\",\"spiderCountTotal\":\"132\",\"goalCountTotal\":\"208.000000\",\"callGoalCountTotal\":\"208.000000\",\"callCountTotal\":\"276.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.67\",\"callConversionPerc\":\"75.36\"}\n{\"metricDate\":\"2013-05-29\",\"pageCountTotal\":\"34190\",\"landCountTotal\":\"12014\",\"newLandCountTotal\":\"8279\",\"returnLandCountTotal\":\"3735\",\"spiderCountTotal\":\"90\",\"goalCountTotal\":\"179.000000\",\"callGoalCountTotal\":\"179.000000\",\"callCountTotal\":\"235.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.49\",\"callConversionPerc\":\"76.17\"}\n{\"metricDate\":\"2013-05-30\",\"pageCountTotal\":\"33867\",\"landCountTotal\":\"11965\",\"newLandCountTotal\":\"8231\",\"returnLandCountTotal\":\"3734\",\"spiderCountTotal\":\"63\",\"goalCountTotal\":\"160.000000\",\"callGoalCountTotal\":\"160.000000\",\"callCountTotal\":\"219.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.34\",\"callConversionPerc\":\"73.06\"}\n{\"metricDate\":\"2013-05-31\",\"pageCountTotal\":\"27536\",\"landCountTotal\":\"10302\",\"newLandCountTotal\":\"7333\",\"returnLandCountTotal\":\"2969\",\"spiderCountTotal\":\"108\",\"goalCountTotal\":\"173.000000\",\"callGoalCountTotal\":\"173.000000\",\"callCountTotal\":\"226.000000\",\"onlineGoalCountTotal\":\"0.000000\",\"conversionPerc\":\"1.68\",\"callConversionPerc\":\"76.55\"}\n\r\n"
One thing that works is to split on newlines, call from JSON on each row, then recombine the result.
contJSON <- sub("\n\r\n", "", contJSON) #as before
rowJSON <- strsplit(contJSON, "\n")[[1]]
row <- lapply(rowJSON, fromJSON)
as.data.frame(do.call(rbind, row))