I have data saved in JSON format that I want to work with, but I am very new to R and have little idea of how to handle the data. You can see below what I managed to achieve. But first, my code:
library(rjson)
json_file <- "C:\\Users\\Saonkfas\\Desktop\\WOWPAPI\\wowpfinaljson.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
I was able to loop over the data:
for (x in json_data){print (x)}
Although the output looks pretty raw:
[[1]]
[[1]]$wins
[1] "118"
[[1]]$losses
[1] "40"
# And so on
Note that the JSON is somewhat nested. I could create tables with Python, but R seems much more complicated.
Edit:
My JSON:
{
"play1": [
{
"wins": "118",
"losses": "40",
"max_killed": "7",
"battles": "158",
"plane_id": "4401",
"max_ground_object_destroyed": "3"
},
{
"wins": "100",
"losses": "58",
"max_killed": "7",
"battles": "158",
"plane_id": "2401",
"max_ground_object_destroyed": "3"
},
{
"wins": "120",
"losses": "38",
"max_killed": "7",
"battles": "158",
"plane_id": "2403",
"max_ground_object_destroyed": "3"
}
],
"play2": [
{
"wins": "12",
"losses": "450",
"max_killed": "7",
"battles": "158",
"plane_id": "4401",
"max_ground_object_destroyed": "3"
},
{
"wins": "150",
"losses": "8",
"max_killed": "7",
"battles": "158",
"plane_id": "2401",
"max_ground_object_destroyed": "3"
},
{
"wins": "120",
"losses": "328",
"max_killed": "7",
"battles": "158",
"plane_id": "2403",
"max_ground_object_destroyed": "3"
}
]
}
fromJSON returns a list; you can use the *apply functions to go through each element.
It's fairly straightforward (once you know what to do!) to convert it to a "table" (data frame is the correct R terminology).
library(rjson)
# You can pass directly the filename
my.JSON <- fromJSON(file="test.json")
# Loop through each "play"
df <- lapply(my.JSON, function(play) {
  # Convert each group to a data frame.
  # This assumes you have 6 elements each time.
  data.frame(matrix(unlist(play), ncol = 6, byrow = TRUE))
})
# Now you have a list of data frames, connect them together in
# one single dataframe
df <- do.call(rbind, df)
# Make column names nicer, remove row names
colnames(df) <- names(my.JSON[[1]][[1]])
rownames(df) <- NULL
df
wins losses max_killed battles plane_id max_ground_object_destroyed
1 118 40 7 158 4401 3
2 100 58 7 158 2401 3
3 120 38 7 158 2403 3
4 12 450 7 158 4401 3
5 150 8 7 158 2401 3
6 120 328 7 158 2403 3
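If the fixed six-column assumption makes you nervous, a slightly more defensive variant (a sketch, not from the original answer) builds each row from its own field names instead:
library(rjson)
my.JSON <- fromJSON(file = "test.json")
# One single-row data frame per record, then stack everything;
# this keeps working even if the field order changes between records.
df <- do.call(rbind, lapply(my.JSON, function(play) {
  do.call(rbind, lapply(play, function(rec) as.data.frame(rec, stringsAsFactors = FALSE)))
}))
rownames(df) <- NULL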
I find jsonlite to be a little more user-friendly for this task. Here is a comparison of three JSON parsing packages (biased in favor of jsonlite):
library(jsonlite)
data <- fromJSON('path/to/file.json')
data
#> $play1
# wins losses max_killed battles plane_id max_ground_object_destroyed
# 1 118 40 7 158 4401 3
# 2 100 58 7 158 2401 3
# 3 120 38 7 158 2403 3
#
# $play2
# wins losses max_killed battles plane_id max_ground_object_destroyed
# 1 12 450 7 158 4401 3
# 2 150 8 7 158 2401 3
# 3 120 328 7 158 2403 3
If you want to collapse those list names into a new column, I recommend dplyr::bind_rows rather than do.call(rbind, data):
library(dplyr)
data <- bind_rows(data, .id = 'play')
# Source: local data frame [6 x 7]
# play wins losses max_killed battles plane_id max_ground_object_destroyed
# (chr) (chr) (chr) (chr) (chr) (chr) (chr)
# 1 play1 118 40 7 158 4401 3
# 2 play1 100 58 7 158 2401 3
# 3 play1 120 38 7 158 2403 3
# 4 play2 12 450 7 158 4401 3
# 5 play2 150 8 7 158 2401 3
# 6 play2 120 328 7 158 2403 3
Beware that the columns may not have the type you expect (notice the columns are all characters since all of the numbers were quoted in the provided JSON data)!
Edit Nov. 2017: One approach to type conversion would be to use mutate_if to guess the intended type of character columns.
data <- mutate_if(data, is.character, type.convert, as.is = TRUE)
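For a quick feel for what type.convert() does here (a minimal sketch, not part of the original edit): it re-parses character vectors, so quoted numbers become numeric while genuinely textual columns stay character.
str(type.convert(c("118", "40"), as.is = TRUE))
#  int [1:2] 118 40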
I prefer tidyjson over rjson and jsonlite as it has an easy workflow for converting multilevel nested JSON objects to two-dimensional tables. Your problem can be easily solved using this package from GitHub.
devtools::install_github("sailthru/tidyjson")
library(tidyjson)
library(dplyr)
> json %>% as.tbl_json %>% gather_keys %>% gather_array %>%
+ spread_values(
+ wins = jstring("wins"),
+ losses = jstring("losses"),
+ max_killed = jstring("max_killed"),
+ battles = jstring("battles"),
+ plane_id = jstring("plane_id"),
+ max_ground_object_destroyed = jstring("max_ground_object_destroyed")
+ )
Output
document.id key array.index wins losses max_killed battles plane_id max_ground_object_destroyed
1 1 play1 1 118 40 7 158 4401 3
2 1 play1 2 100 58 7 158 2401 3
3 1 play1 3 120 38 7 158 2403 3
4 1 play2 1 12 450 7 158 4401 3
5 1 play2 2 150 8 7 158 2401 3
6 1 play2 3 120 328 7 158 2403 3
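As with the jsonlite approach above, every value comes back as character because the numbers are quoted in the source JSON. The same fix applies; a sketch, assuming the pipeline output above was assigned to a variable (call it out):
library(dplyr)
# Re-type the quoted numbers; true text columns are left as character
out <- mutate_if(out, is.character, type.convert, as.is = TRUE)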
Related
I am building my first web scraper using Python and BS4. I wanted to investigate time-trial data from the 2018 KONA Ironman World Championship. What is the best method for converting JSON to CSV?
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

with open('data.json', 'w') as jsonfile:
    json.dump(parse_table(soup), jsonfile)

print(json.dumps(parse_table(soup), indent=3))
JSON output contains the name of the athlete followed by their division, gender, and overall rank as well as swim, bike and run time:
{
"Avila, Anthony 2470": [ {
"div_rank": "138", "gender_rank": "1243", "overall_rank": "1565", "swim": "01:20:11", "bike": "05:27:59", "run": "04:31:56"
}
],
"Lindgren, Mikael 1050": [ {
"div_rank": "151", "gender_rank": "872", "overall_rank": "983", "swim": "01:09:06", "bike": "05:17:51", "run": "03:49:20"
}
],
"Umezawa, Kazuyoshi 1870": [ {
"div_rank": "229", "gender_rank": "1589", "overall_rank": "2186", "swim": "01:17:22", "bike": "06:14:45", "run": "07:16:21"
}
],
"Maric, Bojan 917": [ {
"div_rank": "162", "gender_rank": "923", "overall_rank": "1065", "swim": "01:03:22", "bike": "05:13:56", "run": "04:01:45"
}
],
"Nishioka, Maki 2340": [ {
"div_rank": "6", "gender_rank": "52", "overall_rank": "700", "swim": "00:58:40", "bike": "05:19:10", "run": "03:33:58"
}...
}
There are a few things you could look into. You could work with pandas and its .read_json(). Or, as I did, just iterate over the keys and values and put them into a DataFrame. Once you have a DataFrame, you can simply write it out as CSV.
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests
import pandas as pd

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

jsonObj = parse_table(soup)

result = pd.DataFrame()
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    result = result.append(temp_df)

result = result.reset_index(drop=True)
result.to_csv('path/to/filename.csv', index=False)
Output:
print (result)
bike ... name
0 05:27:59 ... Avila, Anthony 2470
1 05:17:51 ... Lindgren, Mikael 1050
2 06:14:45 ... Umezawa, Kazuyoshi 1870
3 05:13:56 ... Maric, Bojan 917
4 05:19:10 ... Nishioka, Maki 2340
5 04:32:26 ... Rana, Ivan 18
6 04:49:08 ... Spalding, Joel 1006
7 04:50:10 ... Samuel, Mark 2479
8 06:45:57 ... Long, Felicia 1226
9 05:24:33 ... Mccarroll, Charles 1355
10 06:36:36 ... Freeman, Roger 154
11 --:--:-- ... Solis, Eduardo 1159
12 04:55:29 ... Schlohmann, Thomas 1696
13 05:39:18 ... Swinson, Renee 1568
14 04:40:41 ... Mechin, Antoine 2226
15 05:23:18 ... Hammond, Serena 1548
16 05:15:10 ... Hassel, Diana 810
17 06:15:59 ... Netto, Laurie-Ann 1559
18 --:--:-- ... Mazur, Maksym 1412
19 07:11:19 ... Weiskopf-Larson, Sue 870
20 05:49:02 ... Sosnowska, Aleksandra 1921
21 06:45:48 ... Wendel, Sandra 262
22 04:39:46 ... Oosterdijk, Tom 2306
23 06:03:01 ... Moss, Julie 358
24 06:24:58 ... Borgio, Alessandro 726
25 05:07:42 ... Newsome, Jason 1058
26 04:44:46 ... Wild, David 2008
27 04:46:06 ... Weitz, Matti 2239
28 04:41:05 ... Gyde, Sam 1288
29 05:27:55 ... Yamauchi, Akio 452
... ... ...
2442 04:38:36 ... Lunn, Paul 916
2443 05:27:27 ... Van Soest, John 1169
2444 06:07:56 ... Austin, John 194
2445 05:20:26 ... Mcgrath, Scott 1131
2446 04:53:27 ... Pike, Chris 1743
2447 05:23:20 ... Ball, Duncan 722
2448 05:33:26 ... Fauske Haferkamp, Cathrine 1222
2449 05:17:34 ... Vocking, Peter 641
2450 05:15:30 ... Temme, Travis 1010
2451 07:14:14 ... Sun, Shiyi 2342
2452 04:52:14 ... Li, Peng Cheng 2232
2453 06:26:26 ... Lloyd, Graham 148
2454 04:44:42 ... Bartelle, Daniel 1441
2455 04:51:58 ... Overmars, Koen Pieter 1502
2456 05:23:24 ... Muroya, Koji 439
2457 05:45:42 ... Brown, Ani De Leon 1579
2458 06:42:16 ... Peters, Nancy 370
2459 06:43:07 ... Albarote, Lezette 1575
2460 04:50:45 ... Mohr, Robert 1990
2461 07:17:40 ... Hirose, Roen 497
2462 05:12:10 ... Girardi, Bruno 312
2463 04:59:44 ... Cowan, Anthony 966
2464 06:03:59 ... Hoskens, Rudy 433
2465 04:32:20 ... Baker, Edward 1798
2466 05:11:07 ... Svetana, Ushakova 2343
2467 05:56:06 ... Peterson, Greg 779
2468 05:22:15 ... Wallace, Patrick 287
2469 05:53:14 ... Lott, Jon 914
2470 05:00:29 ... Goodlad, Martin 977
2471 04:34:59 ... Maley, Joel 1840
[2472 rows x 7 columns]
ADDITIONAL
Just to also point out: the data is already returned to you as a JSON structure. The catch is that you'd need to work out the query parameters to iterate over the pages, which is much slower than your code, so there is a trade-off, unless you look into the query parameters that return all 2460 results at once rather than 30 per page. But it is another option for getting that JSON structure.
You can then take the JSON structure, normalize it into a DataFrame, and save it as CSV.
import requests
from pandas.io.json import json_normalize
import pandas as pd

request_url = 'http://m.ironman.com/Handlers/EventLiveResultsMobile.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

params = {
    'year': '2018',
    'race': 'worldchampionship',
    'q': '',
    'p': '',
    'so': 'orank',
    'sd': ''}

response = requests.get(request_url, headers=headers, params=params)
jsonObj = response.json()
lastPage = jsonObj['lastPage']

result = pd.DataFrame()
for page in range(1, lastPage):
    params['p'] = str(page)  # advance the page parameter before each request
    print('Processed Page: ' + params['p'])
    response = requests.get(request_url, headers=headers, params=params)
    jsonObj = response.json()
    temp_df = json_normalize(jsonObj['records'])
    result = result.append(temp_df)

result = result.reset_index(drop=True)
result.to_csv('path/to/filename.csv', index=False)
I am trying to scrape the results of the Polish elections that were held this weekend, but I have run into a problem: a random-looking float is prepended to every integer.
I have tried using htmltab, but it did not work; as you can see, a random number is added:
library(htmltab)
url <- "https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council"
tmp <- htmltab::htmltab(doc = url, which = 1)
tmp
Wyszczególnienie Liczba
2 Mieszkańców 0.972440432 755 957
3 Wyborców 0.977263472 273 653
4 Obwodów 0.99998061 940
I checked in the HTML what the problem is:
library(xml2)
library(rvest)
webpage <- xml2::read_html(url)
a <- webpage %>%
rvest::html_nodes("tbody")
a[1]
<tbody>\n<tr>\n<td>Mieszkańców</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.97244043</span>2 755 957</td>\n </tr>\n<tr>\n<td>Wyborców</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.97726347</span>2 273 653</td>\n </tr>\n<tr>\n<td>Obwodów</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.9999806</span>1 940</td>\n </tr>\n</tbody>"
I assume the problem is with <span class="hidden">, but how do I get rid of it?
EDIT
I need the info from the 9th table, which holds the results of the parties:
Nr listy  Komitet wyborczy  Głosów na kandydatów komitetu  Kandydatów  % głosów ważnych
12 KOMITET WYBORCZY WYBORCÓW Z DUTKIEWICZEM DLA DOLNEGO ŚLĄSKA 93 260 45 8.29%
9 KOMITET WYBORCZY WYBORCÓW WOLNOŚĆ W SAMORZĄDZIE 15 499 46 1.38%
8 KOMITET WYBORCZY WYBORCÓW KUKIZ'15 53 800 41 4.78%
1 KOMITET WYBORCZY WYBORCÓW BEZPARTYJNI SAMORZĄDOWCY 168 442 46 14.98%
11 KOMITET WYBORCZY WOLNI I SOLIDARNI 9 624 38 0.86%
7 KOMITET WYBORCZY RUCH NARODOWY RP 14 874 38 1.32%
10 KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ 320 908 45 28.53%
2 KOMITET WYBORCZY POLSKIE STRONNICTWO LUDOWE 58 820 46 5.23%
6 KOMITET WYBORCZY PARTII RAZEM 18 087 44 1.61%
3 KOMITET WYBORCZY PARTIA ZIELONI 19 783 36 1.76%
5 KOALICYJNY KOMITET WYBORCZY SLD LEWICA RAZEM 61 889 46 5.50%
4 KOALICYJNY KOMITET WYBORCZY PLATFORMA.NOWOCZESNA KOALICJA OBYWATELSKA 289 831 46 25.77%
EDIT 2
I have found a solution, though not the most elegant one:
# https://stackoverflow.com/questions/7963898/extracting-the-last-n-characters-from-a-string-in-r
substrRight <- function(x, n){
  substr(x, nchar(x) - n + 1, nchar(x))
}

library(dplyr)

tmp <- htmltab::htmltab(doc = url, which = 9)

tmp2 <- xml2::read_html(url) %>%
  rvest::html_nodes("tbody") %>%
  magrittr::extract2(9) %>%
  rvest::html_nodes("tr") %>%
  rvest::html_nodes("td") %>%
  rvest::html_nodes("span") %>%
  rvest::html_text() %>%
  matrix(ncol = 4, byrow = TRUE) %>%
  data.frame()

names(tmp) <- c("a", "b", "c", "d", "e", "f", "g")

tmp3 <- cbind(tmp, tmp2) %>%
  mutate(n_to_delete = nchar(X1),
         c1 = as.character(c),
         n_whole = nchar(c1),
         c2 = substrRight(c1, n_whole - n_to_delete),
         c3 = gsub(" ", "", c2),
         c4 = as.numeric(c3)) %>%
  select(b, c4)

names(tmp3) <- c("party", "n_of_votes")
Solving the original question:
You can remove those nodes before the conversion to a table:
library(rvest)
pg <- read_html("https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council")
tbl_1 <- html_nodes(pg, xpath=".//table[@class = 'stat_table']")[1]
xml_remove(html_nodes(tbl_1, xpath=".//span[@class='hidden']"))
html_table(tbl_1)
## [[1]]
## Wyszczególnienie Liczba
## 1 Mieszkańców 2 755 957
## 2 Wyborców 2 273 653
## 3 Obwodów 1 940
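html_table() returns a list of data frames here; a small follow-on sketch (my addition, not part of the original answer) to pull the table out and make the space-separated counts numeric:
df <- html_table(tbl_1)[[1]]
# "2 755 957" -> 2755957: drop the thousands separators before converting
df$Liczba <- as.numeric(gsub(" ", "", df$Liczba))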
Solving the updated requirements:
library(rvest)
pg <- read_html("https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council")
Let's target that particular table. Using the "View Source" version of the document, we can go for the header that precedes the table and then move on to the table itself:
target_tbl <- html_node(pg, xpath=".//header[contains(., 'mandatów pomiędzy')]/following-sibling::table")
Still get rid of the hidden spans:
xml_remove(html_nodes(target_tbl, xpath=".//span[@class='hidden']"))
Now we need to know how many real columns there are, since the table has one of those daft multi-row headers with <td>s that span multiple columns:
length(
  html_nodes(target_tbl, xpath = ".//tbody/tr[1]") %>%
    html_nodes("td")
) -> n_cols
Now we pull out each column, set good column names, turn it all into a data frame, and remove the junk column that just feeds the filled-in bars:
as.data.frame(
  setNames(
    lapply(1:n_cols, function(.idx) {
      html_nodes(target_tbl, xpath = sprintf(".//tbody/tr/td[%s]", .idx)) %>%
        html_text(trim = TRUE)
    }),
    c(
      "nr_listy", "komitet_wyborczy", "głosów_na_kandydatów_komitetu",
      "kandydatów", "mandatów", "pct_głosów_ważnych", "junk",
      "udział_w_podziale_mandatów"
    )
  ),
  stringsAsFactors = FALSE
) -> xdf
xdf$junk <- NULL
str(xdf)
## 'data.frame': 12 obs. of 7 variables:
## $ nr_listy : chr "1" "2" "3" "4" ...
## $ komitet_wyborczy : chr "KOMITET WYBORCZY WYBORCÓW BEZPARTYJNI SAMORZĄDOWCY" "KOMITET WYBORCZY POLSKIE STRONNICTWO LUDOWE" "KOMITET WYBORCZY PARTIA ZIELONI" "KOALICYJNY KOMITET WYBORCZY PLATFORMA.NOWOCZESNA KOALICJA OBYWATELSKA" ...
## $ głosów_na_kandydatów_komitetu: chr "168 442" "58 820" "19 783" "289 831" ...
## $ kandydatów : chr "46" "46" "36" "46" ...
## $ mandatów : chr "6" "1" "0" "13" ...
## $ pct_głosów_ważnych : chr "14.98%" "5.23%" "1.76%" "25.77%" ...
## $ udział_w_podziale_mandatów : chr "Tak" "Tak" "Nie" "Tak" ...
I don't think piping makes the lapply() block more readable but just in case it's preferred:
lapply(1:n_cols, function(.idx) {
  html_nodes(target_tbl, xpath = sprintf(".//tbody/tr/td[%s]", .idx)) %>%
    html_text(trim = TRUE)
}) %>%
  setNames(c(
    "nr_listy", "komitet_wyborczy", "głosów_na_kandydatów_komitetu",
    "kandydatów", "mandatów", "pct_głosów_ważnych", "junk",
    "udział_w_podziale_mandatów"
  )) %>%
  as.data.frame(stringsAsFactors = FALSE) -> xdf
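Either way, the columns are still all character ("168 442", "14.98%"). An optional cleanup sketch (my addition, not part of the answer):
xdf$głosów_na_kandydatów_komitetu <- as.numeric(gsub(" ", "", xdf$głosów_na_kandydatów_komitetu))
xdf$pct_głosów_ważnych <- as.numeric(sub("%", "", xdf$pct_głosów_ważnych)) / 100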
Data Preparation
comp <-
c('[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]',
'[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]')
id = c(1,2)
jsonData = as.data.frame(id,comp)
jsonData
id
[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}] 1
[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}] 2
I am not sure why 'comp' did not come out as a column name, or why 'id' ends up second even though it is defined first. It also gives an error if I write as.data.frame(comp, id).
Now I am dealing with JSON data
library(jsonlite)
library(tidyverse)
library(dplyr)
data <- jsonData %>% mutate(x = lapply(comp,fromJSON)) %>% unnest(x)
data
id id1 name
1 1 28 Google
2 1 12 Microsoft
3 2 32 Microsoft
4 2 878 Facebook
Is there a better way to deal with JSON in R, such as a library that directly converts JSON into normal columns? I am currently working with small data, so it looks easy, but I have multiple columns containing JSON input and it is too much of a performance hit for my report.
JSON is text. Text parsing is slow. Also, I'm not sure why library(dplyr) is there, since dplyr comes with the tidyverse. And you should consider reading up on how to make data frames.
Regardless, we'll make a representative example: 500,000 rows:
library(tidyverse)
data_frame(
  id = rep(c(1L, 2L), 250000),
  comp = rep(c(
    '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]',
    '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
  ), 250000)
) -> xdf
There are many JSON processing packages in R; test out a few. This approach uses ndjson, which has a function flatten() that takes a character vector of JSON strings and makes a "completely flat" structure from it.
I'm only using different data frame variables for explanatory clarity and benchmarking later.
pull(xdf, comp) %>%
ndjson::flatten() %>%
bind_cols(select(xdf, id)) -> ydf
That makes:
ydf
## Source: local data table [500,000 x 5]
##
## # A tibble: 500,000 x 5
## `0.id` `0.name` `1.id` `1.name` id
## <dbl> <chr> <dbl> <chr> <int>
## 1 28. Google 12. Microsoft 1
## 2 32. Microsoft 878. Facebook 2
## 3 28. Google 12. Microsoft 1
## 4 32. Microsoft 878. Facebook 2
## 5 28. Google 12. Microsoft 1
## 6 32. Microsoft 878. Facebook 2
## 7 28. Google 12. Microsoft 1
## 8 32. Microsoft 878. Facebook 2
## 9 28. Google 12. Microsoft 1
## 10 32. Microsoft 878. Facebook 2
## # ... with 499,990 more rows
We can turn that back into a more tidy data frame:
bind_rows(
  select(ydf, id = id, id1 = `0.id`, name = `0.name`),
  select(ydf, id = id, id1 = `1.id`, name = `1.name`)
) %>%
  mutate(id1 = as.integer(id1))
## Source: local data table [1,000,000 x 3]
##
## # A tibble: 1,000,000 x 3
## id id1 name
## <int> <int> <chr>
## 1 1 28 Google
## 2 2 32 Microsoft
## 3 1 28 Google
## 4 2 32 Microsoft
## 5 1 28 Google
## 6 2 32 Microsoft
## 7 1 28 Google
## 8 2 32 Microsoft
## 9 1 28 Google
## 10 2 32 Microsoft
## # ... with 999,990 more rows
Now, we'll benchmark with 1,000 rows since I'm not waiting for the full 500,000 run to microbenchmark:
data_frame(
  id = rep(c(1L, 2L), 500),
  comp = rep(c(
    '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]',
    '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
  ), 500)
) -> xdf
microbenchmark::microbenchmark(
  faster = {
    pull(xdf, comp) %>%
      ndjson::flatten() %>%
      bind_cols(select(xdf, id)) -> ydf
    bind_rows(
      select(ydf, id = id, id1 = `0.id`, name = `0.name`),
      select(ydf, id = id, id1 = `1.id`, name = `1.name`)
    ) %>%
      mutate(id1 = as.integer(id1))
  }
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## faster 12.46409 13.71483 14.73997 14.40582 15.47529 21.09543 100
So:
15ms for 1,000 rows
15ms * 500 = 7.5s for 500,000
If you're not pedantic about the id1 column needing to be an integer, you can likely shave off a few ms.
There are other approaches. And, if you regularly work with columns of JSON data, I highly recommend checking out Apache Drill and the sergeant package.
I am relatively new to R and to this site. I am trying to read in a CSV file that has multiple symbols, with OHLCV data and the date as a string in YYYYMMDD format.
Data format example
I have tried:
data <- read.csv(file="DFM.csv", sep=",", dec=".", header=TRUE, col.names = c("Symbols", "Date", "Open", "High", "Low", "Close", "Volume"), stringsAsFactors = FALSE)
> class(data)
[1] "data.frame"
> head(data)
Symbols Date Open High Low Close Volume
1 DIB 20160630 5.03 5.12 5.03 5.11 6171340
2 DIB 20160629 5.10 5.11 5.02 5.02 5241741
3 DIB 20160628 5.05 5.11 5.02 5.07 5258839
4 DIB 20160627 5.01 5.11 5.01 5.03 5038589
5 DIB 20160626 4.94 5.04 4.90 5.02 10593471
6 DIB 20160623 5.14 5.14 5.09 5.12 3069970
as.Date(data$Date, format="%Y%m%d") # didn't work
Somehow I need to load it in getSymbols() so I can use chart_Series() to plot the charts. Can anyone help?
Using your example data, here is one possible solution to import the file, convert the Date column, split the data by Symbol, and arrange it so that the individual objects (stocks) can be charted in a straightforward way:
First and last 3 lines of original file data (allStocks):
> both(allStocks)
Symbol Date Open High Low Close Volme
1 DIB 20160630 5.03 5.12 5.03 5.11 6171340
2 DIB 20160629 5.10 5.11 5.02 5.02 5241741
3 DIB 20160628 5.05 5.11 5.02 5.07 5258839
Symbol Date Open High Low Close Volme
16 CBD 20160627 5.6 5.6 5.6 5.6 0
17 CBD 20160626 5.6 5.6 5.6 5.6 0
18 CBD 20160623 5.6 5.6 5.6 5.6 0
Let's start by converting the Date column:
allStocks$Date <- as.Date(as.character(allStocks$Date), format="%Y%m%d")
Next, split allStocks by Symbol, which gives you a list where each element represents an individual stock named after its Symbol:
allStocks <- split(allStocks,allStocks$Symbol)
Next, get rid of the Symbol column to prepare for an xts object (the xts package is needed here):
library(xts)
allStocks <- lapply(allStocks, function(x) as.xts(x[, 3:7], order.by = x[, 2]))
and finally convert the list into individual xts objects, each representing a stock named after its Symbol:
list2env(allStocks,envir=.GlobalEnv)
You should now have 3 nicely formatted objects in your GlobalEnvironment ready to be charted.
e.g., str() and the first and last lines of the stock DIB:
> str(DIB)
An ‘xts’ object on 2016-06-23/2016-06-30 containing:
Data: num [1:6, 1:5] 5.14 4.94 5.01 5.05 5.1 5.03 5.14 5.04 5.11 5.11 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Open" "High" "Low" "Close" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
NULL
> both(DIB)
Open High Low Close Volme
2016-06-23 5.14 5.14 5.09 5.12 3069970
2016-06-26 4.94 5.04 4.90 5.02 10593471
2016-06-27 5.01 5.11 5.01 5.03 5038539
Open High Low Close Volme
2016-06-28 5.05 5.11 5.02 5.07 5258839
2016-06-29 5.10 5.11 5.02 5.02 5241741
2016-06-30 5.03 5.12 5.03 5.11 6171340
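From here charting is direct; a minimal sketch, assuming the quantmod package is installed (chart_Series() accepts an OHLCV xts object, so there is no need to go through getSymbols()):
library(quantmod)
chart_Series(DIB)  # plot the OHLC series for one symbol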
I have a file containing over 1500 json objects that I want to work with in R. I've been able to import the data as a list, but am having trouble coercing it into a useful structure. I want to create a data frame containing a row for each json object and a column for each key:value pair.
I've recreated my situation with this small, fake data set:
[{"name":"Doe, John","group":"Red","age (y)":24,"height (cm)":182,"wieght (kg)":74.8,"score":null},
{"name":"Doe, Jane","group":"Green","age (y)":30,"height (cm)":170,"wieght (kg)":70.1,"score":500},
{"name":"Smith, Joan","group":"Yellow","age (y)":41,"height (cm)":169,"wieght (kg)":60,"score":null},
{"name":"Brown, Sam","group":"Green","age (y)":22,"height (cm)":183,"wieght (kg)":75,"score":865},
{"name":"Jones, Larry","group":"Green","age (y)":31,"height (cm)":178,"wieght (kg)":83.9,"score":221},
{"name":"Murray, Seth","group":"Red","age (y)":35,"height (cm)":172,"wieght (kg)":76.2,"score":413},
{"name":"Doe, Jane","group":"Yellow","age (y)":22,"height (cm)":164,"wieght (kg)":68,"score":902}]
Some features of the data:
- The objects all contain the same number of key:value pairs, although some of the values are null.
- There are two non-numeric columns per object (name and group).
- name is the unique identifier; there are 10 or so groups.
- Many of the name and group entries contain spaces, commas, and other punctuation.
Based on this question: R list(structure(list())) to data frame, I tried the following:
json_file <- "test.json"
json_data <- fromJSON(json_file)
asFrame <- do.call("rbind.fill", lapply(json_data, as.data.frame))
With both my real data and this fake data, the last line gives me this error:
Error in data.frame(name = "Doe, John", group = "Red", `age (y)` = 24, :
arguments imply differing number of rows: 1, 0
You just need to replace your NULLs with NAs:
require(RJSONIO)
json_file <- '[{"name":"Doe, John","group":"Red","age (y)":24,"height (cm)":182,"wieght (kg)":74.8,"score":null},
{"name":"Doe, Jane","group":"Green","age (y)":30,"height (cm)":170,"wieght (kg)":70.1,"score":500},
{"name":"Smith, Joan","group":"Yellow","age (y)":41,"height (cm)":169,"wieght (kg)":60,"score":null},
{"name":"Brown, Sam","group":"Green","age (y)":22,"height (cm)":183,"wieght (kg)":75,"score":865},
{"name":"Jones, Larry","group":"Green","age (y)":31,"height (cm)":178,"wieght (kg)":83.9,"score":221},
{"name":"Murray, Seth","group":"Red","age (y)":35,"height (cm)":172,"wieght (kg)":76.2,"score":413},
{"name":"Doe, Jane","group":"Yellow","age (y)":22,"height (cm)":164,"wieght (kg)":68,"score":902}]'
json_file <- fromJSON(json_file)
json_file <- lapply(json_file, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
Once you have a non-null value for each element, you can call rbind without getting an error:
do.call("rbind", json_file)
name group age (y) height (cm) wieght (kg) score
[1,] "Doe, John" "Red" "24" "182" "74.8" NA
[2,] "Doe, Jane" "Green" "30" "170" "70.1" "500"
[3,] "Smith, Joan" "Yellow" "41" "169" "60" NA
[4,] "Brown, Sam" "Green" "22" "183" "75" "865"
[5,] "Jones, Larry" "Green" "31" "178" "83.9" "221"
[6,] "Murray, Seth" "Red" "35" "172" "76.2" "413"
[7,] "Doe, Jane" "Yellow" "22" "164" "68" "902"
This is very simple if you use either library(jsonlite) or library(jsonify). Both of these handle the null values, converting them to NA, and they preserve the data types.
Data
json_file <- '[{"name":"Doe, John","group":"Red","age (y)":24,"height (cm)":182,"wieght (kg)":74.8,"score":null},
{"name":"Doe, Jane","group":"Green","age (y)":30,"height (cm)":170,"wieght (kg)":70.1,"score":500},
{"name":"Smith, Joan","group":"Yellow","age (y)":41,"height (cm)":169,"wieght (kg)":60,"score":null},
{"name":"Brown, Sam","group":"Green","age (y)":22,"height (cm)":183,"wieght (kg)":75,"score":865},
{"name":"Jones, Larry","group":"Green","age (y)":31,"height (cm)":178,"wieght (kg)":83.9,"score":221},
{"name":"Murray, Seth","group":"Red","age (y)":35,"height (cm)":172,"wieght (kg)":76.2,"score":413},
{"name":"Doe, Jane","group":"Yellow","age (y)":22,"height (cm)":164,"wieght (kg)":68,"score":902}]'
jsonlite
library(jsonlite)
jsonlite::fromJSON( json_file )
# name group age (y) height (cm) wieght (kg) score
# 1 Doe, John Red 24 182 74.8 NA
# 2 Doe, Jane Green 30 170 70.1 500
# 3 Smith, Joan Yellow 41 169 60.0 NA
# 4 Brown, Sam Green 22 183 75.0 865
# 5 Jones, Larry Green 31 178 83.9 221
# 6 Murray, Seth Red 35 172 76.2 413
# 7 Doe, Jane Yellow 22 164 68.0 902
str( jsonlite::fromJSON( json_file ) )
# 'data.frame': 7 obs. of 6 variables:
# $ name : chr "Doe, John" "Doe, Jane" "Smith, Joan" "Brown, Sam" ...
# $ group : chr "Red" "Green" "Yellow" "Green" ...
# $ age (y) : int 24 30 41 22 31 35 22
# $ height (cm): int 182 170 169 183 178 172 164
# $ wieght (kg): num 74.8 70.1 60 75 83.9 76.2 68
# $ score : int NA 500 NA 865 221 413 902
jsonify
library(jsonify)
jsonify::from_json( json_file )
# name group age (y) height (cm) wieght (kg) score
# 1 Doe, John Red 24 182 74.8 NA
# 2 Doe, Jane Green 30 170 70.1 500
# 3 Smith, Joan Yellow 41 169 60.0 NA
# 4 Brown, Sam Green 22 183 75.0 865
# 5 Jones, Larry Green 31 178 83.9 221
# 6 Murray, Seth Red 35 172 76.2 413
# 7 Doe, Jane Yellow 22 164 68.0 902
str( jsonify::from_json( json_file ) )
# 'data.frame': 7 obs. of 6 variables:
# $ name : chr "Doe, John" "Doe, Jane" "Smith, Joan" "Brown, Sam" ...
# $ group : chr "Red" "Green" "Yellow" "Green" ...
# $ age (y) : int 24 30 41 22 31 35 22
# $ height (cm): int 182 170 169 183 178 172 164
# $ wieght (kg): num 74.8 70.1 60 75 83.9 76.2 68
# $ score : int NA 500 NA 865 221 413 902
To remove null values, use the nullValue parameter:
json_data <- fromJSON(json_file, nullValue = NA)
asFrame <- do.call("rbind.fill", lapply(json_data, as.data.frame))
This way there won't be any unnecessary quotes in your output.
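(Note that nullValue is an argument of RJSONIO::fromJSON, as used in the accepted answer above; rjson's fromJSON does not accept it.)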
library(rjson)
Lines <- readLines("yelp_academic_dataset_business.json")
business <- as.data.frame(t(sapply(Lines, fromJSON)))
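For line-delimited JSON like the Yelp dump, jsonlite's streaming reader is an alternative worth knowing; a minimal sketch, assuming the file is NDJSON (one JSON object per line):
library(jsonlite)
business <- stream_in(file("yelp_academic_dataset_business.json"))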
You may try this to load JSON data into R:
dplyr::bind_rows(fromJSON(file_name))
Changing the package from rjson to jsonlite fixed it for me.
So instead of this:
fromAPIPlantsPages <- rjson::fromJSON(content(apiGetPlants,type="text",encoding = "UTF-8"))
dfPlantenAPI <- as.data.frame(fromAPIPlantsPages)
I changed it to this:
fromAPIPlantsPages <- jsonlite::fromJSON(content(apiGetPlants,type="text",encoding = "UTF-8"))
dfPlantenAPI <- as.data.frame(fromAPIPlantsPages)
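The swap works because jsonlite::fromJSON() simplifies JSON arrays of objects into data frames by default (simplifyDataFrame = TRUE), whereas rjson::fromJSON() always returns nested lists.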