Python and BeautifulSoup: How to convert JSON into CSV

I am building my first web-scraper using Python and BS4. I wanted to investigate time-trial data from the 2018 KONA Ironman World Championship. What is the best method for converting JSON to CSV?
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests
sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = \
            [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

with open('data.json', 'w') as jsonfile:
    json.dump(parse_table(soup), jsonfile)
print(json.dumps(parse_table(soup), indent=3))
The JSON output contains the name of the athlete followed by their division, gender, and overall rank, as well as their swim, bike, and run times:
{
"Avila, Anthony 2470": [ {
"div_rank": "138", "gender_rank": "1243", "overall_rank": "1565", "swim": "01:20:11", "bike": "05:27:59", "run": "04:31:56"
}
],
"Lindgren, Mikael 1050": [ {
"div_rank": "151", "gender_rank": "872", "overall_rank": "983", "swim": "01:09:06", "bike": "05:17:51", "run": "03:49:20"
}
],
"Umezawa, Kazuyoshi 1870": [ {
"div_rank": "229", "gender_rank": "1589", "overall_rank": "2186", "swim": "01:17:22", "bike": "06:14:45", "run": "07:16:21"
}
],
"Maric, Bojan 917": [ {
"div_rank": "162", "gender_rank": "923", "overall_rank": "1065", "swim": "01:03:22", "bike": "05:13:56", "run": "04:01:45"
}
],
"Nishioka, Maki 2340": [ {
"div_rank": "6", "gender_rank": "52", "overall_rank": "700", "swim": "00:58:40", "bike": "05:19:10", "run": "03:33:58"
}...
}

There are a few things you could look into. You could work with pandas' .read_json(). Or do what I did: just iterate over the keys and values you already have and throw them into a dataframe. Once you have a dataframe, you can just write the csv.
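If you go the read-it-back route, note that pd.read_json doesn't flatten this particular structure cleanly (each value is a one-element list of dicts), so here is a minimal sketch that loads the data.json written above with the json module and flattens it by hand (results.csv is just a placeholder name):
import json
import pandas as pd

# Read the file produced by json.dump() in the question
with open('data.json') as f:
    raw = json.load(f)

# Flatten: one row per record, carrying the athlete's name along
rows = [dict(rec, name=name) for name, recs in raw.items() for rec in recs]
pd.DataFrame(rows).to_csv('results.csv', index=False)
And here is the full version of the iterate-and-append approach: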
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests
import pandas as pd
sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = \
            [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

jsonObj = parse_table(soup)
result = pd.DataFrame()
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    result = result.append(temp_df)
result = result.reset_index(drop=True)
result.to_csv('path/to/filename.csv', index=False)
Output:
print (result)
bike ... name
0 05:27:59 ... Avila, Anthony 2470
1 05:17:51 ... Lindgren, Mikael 1050
2 06:14:45 ... Umezawa, Kazuyoshi 1870
3 05:13:56 ... Maric, Bojan 917
4 05:19:10 ... Nishioka, Maki 2340
5 04:32:26 ... Rana, Ivan 18
6 04:49:08 ... Spalding, Joel 1006
7 04:50:10 ... Samuel, Mark 2479
8 06:45:57 ... Long, Felicia 1226
9 05:24:33 ... Mccarroll, Charles 1355
10 06:36:36 ... Freeman, Roger 154
11 --:--:-- ... Solis, Eduardo 1159
12 04:55:29 ... Schlohmann, Thomas 1696
13 05:39:18 ... Swinson, Renee 1568
14 04:40:41 ... Mechin, Antoine 2226
15 05:23:18 ... Hammond, Serena 1548
16 05:15:10 ... Hassel, Diana 810
17 06:15:59 ... Netto, Laurie-Ann 1559
18 --:--:-- ... Mazur, Maksym 1412
19 07:11:19 ... Weiskopf-Larson, Sue 870
20 05:49:02 ... Sosnowska, Aleksandra 1921
21 06:45:48 ... Wendel, Sandra 262
22 04:39:46 ... Oosterdijk, Tom 2306
23 06:03:01 ... Moss, Julie 358
24 06:24:58 ... Borgio, Alessandro 726
25 05:07:42 ... Newsome, Jason 1058
26 04:44:46 ... Wild, David 2008
27 04:46:06 ... Weitz, Matti 2239
28 04:41:05 ... Gyde, Sam 1288
29 05:27:55 ... Yamauchi, Akio 452
... ... ...
2442 04:38:36 ... Lunn, Paul 916
2443 05:27:27 ... Van Soest, John 1169
2444 06:07:56 ... Austin, John 194
2445 05:20:26 ... Mcgrath, Scott 1131
2446 04:53:27 ... Pike, Chris 1743
2447 05:23:20 ... Ball, Duncan 722
2448 05:33:26 ... Fauske Haferkamp, Cathrine 1222
2449 05:17:34 ... Vocking, Peter 641
2450 05:15:30 ... Temme, Travis 1010
2451 07:14:14 ... Sun, Shiyi 2342
2452 04:52:14 ... Li, Peng Cheng 2232
2453 06:26:26 ... Lloyd, Graham 148
2454 04:44:42 ... Bartelle, Daniel 1441
2455 04:51:58 ... Overmars, Koen Pieter 1502
2456 05:23:24 ... Muroya, Koji 439
2457 05:45:42 ... Brown, Ani De Leon 1579
2458 06:42:16 ... Peters, Nancy 370
2459 06:43:07 ... Albarote, Lezette 1575
2460 04:50:45 ... Mohr, Robert 1990
2461 07:17:40 ... Hirose, Roen 497
2462 05:12:10 ... Girardi, Bruno 312
2463 04:59:44 ... Cowan, Anthony 966
2464 06:03:59 ... Hoskens, Rudy 433
2465 04:32:20 ... Baker, Edward 1798
2466 05:11:07 ... Svetana, Ushakova 2343
2467 05:56:06 ... Peterson, Greg 779
2468 05:22:15 ... Wallace, Patrick 287
2469 05:53:14 ... Lott, Jon 914
2470 05:00:29 ... Goodlad, Martin 977
2471 04:34:59 ... Maley, Joel 1840
[2472 rows x 7 columns]
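If you'd rather not involve pandas at all, the standard library csv module can write the same CSV from the parsed structure. A small sketch, again assuming the data.json written by your code (the column order below is my own choice):
import csv
import json

with open('data.json') as f:
    raw = json.load(f)

fieldnames = ['name', 'div_rank', 'gender_rank', 'overall_rank', 'swim', 'bike', 'run']
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for name, recs in raw.items():
        for rec in recs:
            writer.writerow(dict(rec, name=name))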
ADDITIONAL
Just also to point out: the site already returns this data as JSON. The catch is that you'd need to work out the query parameters and iterate over the pages, which is much slower than your code since it returns 30 results at a time per page, so there is a trade-off, unless you dig into the query parameters and find a way to return all ~2,460 results at once. But it is another option for getting that JSON structure.
You can then take the JSON structure, normalize it into a dataframe, and save it as csv.
import requests
from pandas.io.json import json_normalize
import pandas as pd

request_url = 'http://m.ironman.com/Handlers/EventLiveResultsMobile.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
page = ''
params = {
    'year': '2018',
    'race': 'worldchampionship',
    'q': '',
    'p': page,
    'so': 'orank',
    'sd': ''}

response = requests.get(request_url, headers=headers, params=params)
jsonObj = response.json()
lastPage = jsonObj['lastPage']

result = pd.DataFrame()
for page in range(1, lastPage + 1):
    page = str(page)
    print('Processed Page: ' + page)
    # Update the page parameter; without this every request returns the same page
    params['p'] = page
    response = requests.get(request_url, headers=headers, params=params)
    jsonObj = response.json()
    temp_df = json_normalize(jsonObj['records'])
    result = result.append(temp_df)

result = result.reset_index(drop=True)
result.to_csv('path/to/filename.csv', index=False)
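One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the loop above would fail. A sketch of the same loop for newer pandas, reusing request_url, headers, params, and lastPage from above, collecting the per-page frames in a list and concatenating once at the end:
frames = []
for page in range(1, lastPage + 1):
    params['p'] = str(page)
    response = requests.get(request_url, headers=headers, params=params)
    # pd.json_normalize replaces pandas.io.json.json_normalize in pandas >= 1.0
    frames.append(pd.json_normalize(response.json()['records']))

result = pd.concat(frames, ignore_index=True)
result.to_csv('path/to/filename.csv', index=False)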

Related

ValueError while using CTGAN library generating the sample

I am using the CTGAN library in a Colab notebook. I have passed it a tabular dataset with one categorical feature, and I have specified the categorical feature as given in the documentation. The model training completes without error, but I am getting a ValueError while generating simulated data.
How do I resolve this error?
A reproducible example is below.
import pandas as pd
import numpy as np
import seaborn as sns
from ctgan import CTGAN
iris = sns.load_dataset('iris')
iris.head()
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(iris['species'].unique())
iris['species'] = pd.DataFrame(le.transform(iris['species']))
data = iris.copy()
ctgan_model = CTGAN(epochs=2,batch_size=50,verbose = True)
ctgan_model.fit(data)
n_ctgan_generated_data = 2000
synthetic_data = ctgan_model.sample(n_ctgan_generated_data)  # the fitted model is ctgan_model
Complete error message
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-199b6dc04389> in <module>
1 n_ctgan_generated_data = 2000
----> 2 synthetic_data = ctgan.sample(n_ctgan_generated_data)
/usr/local/lib/python3.8/dist-packages/ctgan/synthesizers/base.py in wrapper(self, *args, **kwargs)
48 def wrapper(self, *args, **kwargs):
49 if self.random_states is None:
---> 50 return function(self, *args, **kwargs)
51
52 else:
/usr/local/lib/python3.8/dist-packages/ctgan/synthesizers/ctgan.py in sample(self, n, condition_column, condition_value)
475 data = data[:n]
476
--> 477 return self._transformer.inverse_transform(data)
478
479 def set_device(self, device):
/usr/local/lib/python3.8/dist-packages/ctgan/data_transformer.py in inverse_transform(self, data, sigmas)
211 column_data = data[:, st:st + dim]
212 if column_transform_info.column_type == 'continuous':
--> 213 recovered_column_data = self._inverse_transform_continuous(
214 column_transform_info, column_data, sigmas, st)
215 else:
/usr/local/lib/python3.8/dist-packages/ctgan/data_transformer.py in _inverse_transform_continuous(self, column_transform_info, column_data, sigmas, st)
185 def _inverse_transform_continuous(self, column_transform_info, column_data, sigmas, st):
186 gm = column_transform_info.transform
--> 187 data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))
188 data.iloc[:, 1] = np.argmax(column_data[:, 1:], axis=1)
189 if sigmas is not None:
/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
670 )
671 else:
--> 672 mgr = ndarray_to_mgr(
673 data,
674 index,
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
322 )
323
--> 324 _check_values_indices_shape_match(values, index, columns)
325
326 if typ == "array":
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
391 passed = values.shape
392 implied = (len(index), len(columns)-1)
--> 393 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
394
395
ValueError: Shape of passed values is (2000, 2), indices imply (2000, 3)
Is this an issue with the library, where I would have to change some source code?

Append information in the th tags to td rows

I am an economist struggling with coding and data scraping.
I am scraping data from the main (and only) table on this webpage (https://www.oddsportal.com/basketball/europe/euroleague-2013-2014/results/). I can retrieve all the information in the td HTML tags with Python Selenium by referring to the class element. The same goes for the th tag, which stores the date and the stage of the competition. In my final dataset, I would like the information stored in the th tag as two extra columns (date and stage of the competition) next to the other values in the table. Basically, for each match, I would like to have the date and the stage of the competition on its row, not as the head of each group of matches.
The only solution I came up with is to index all the rows (with both th and td tags) and build a while loop to append the information in the th tags to the td rows whose index is lower than the next index for the th tag. Hope I made myself clear (if not, I will try to give a more graphical explanation). However, I am not able to code such a logic construct due to my poor coding abilities. I do not know if I need two loops to iterate through the different tags (td and th), and if so, how to do that. If you have any easier solution, it is more than welcome!
Thanks in advance for the precious help!
code below:
from selenium import webdriver
import time
import pandas as pd

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019']

# Define empty data
data_keys = ["Season", "Match_Time", "Home_Team", "Away_Team", "Home_Odd", "Away_Odd", "Home_Score",
             "Away_Score", "OT", "N_Bookmakers"]
data = dict()
for key in data_keys:
    data[key] = list()
del data_keys

# Define 'driver' variable and launch browser
#path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
#path office pc
path = "C:/Users/aldi/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)

# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1
        # Get url and navigate it
        page_str = (1 - len(str(page_num))) * '0' + str(page_num)
        url = "https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)

        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break

        try:
            # Teams
            for el in driver.find_elements_by_class_name('name.table-participant'):
                el = el.text.strip().split(" - ")
                data["Home_Team"].append(el[0])
                data["Away_Team"].append(el[1])
                data["Season"].append(season_filt)
            # Scores
            for el in driver.find_elements_by_class_name('center.bold.table-odds.table-score'):
                el = el.text.split(":")
                if el[1][-3:] == " OT":
                    data["OT"].append(True)
                    el[1] = el[1][:-3]
                else:
                    data["OT"].append(False)
                data["Home_Score"].append(el[0])
                data["Away_Score"].append(el[1])
            # Match times
            for el in driver.find_elements_by_class_name("table-time"):
                data["Match_Time"].append(el.text)
            # Odds
            i = 0
            for el in driver.find_elements_by_class_name("odds-nowrp"):
                i += 1
                if i % 2 == 0:
                    data["Away_Odd"].append(el.text)
                else:
                    data["Home_Odd"].append(el.text)
            # N_Bookmakers
            for el in driver.find_elements_by_class_name("center.info-value"):
                data["N_Bookmakers"].append(el.text)
            # TODO think of inserting the dates list in the dataframe even if it has a different size (19 rows and not 50)
        except:
            pass

driver.quit()
data = pd.DataFrame(data)
data.to_csv("data_odds.csv", index=False)
I would like to add this information to my dataset as two additional columns:
for el in driver.find_elements_by_class_name("first2.tl")[1:]:
el = el.text.strip().split(" - ")
data["date"].append(el[0])
data["stage"].append(el[1])
A few things I would change here.
Don't overwrite variables. You store elements in your el variable, then you overwrite the elements with your strings. It may work for you here, but you may get yourself into trouble with that practice later on, especially since you are iterating through those elements. It makes debugging harder, too.
I know Selenium has ways to parse the html, but I personally feel BeautifulSoup is a tad easier to parse with and a little more intuitive if you are simply trying to pull data out of the html. So I went with BeautifulSoup's .find_previous() to get the tag that precedes each group of games, which is what gives you the date and stage content.
Lastly, I like to construct a list of dictionaries to make up the data frame. Each item in the list is a dictionary where the key is the column name and the value is the data. You sort of do the opposite in creating a dictionary of lists. There is nothing wrong with that, but if the lists don't all have the same length, you'll get an error when trying to create the dataframe. Whereas with my way, if for whatever reason a value is missing, it will still create the dataframe and just have a null or nan for the missing data (see the small illustration after this list).
There may be more work you need to do with the code to go through the pages, but this gets you the data in the form you need.
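A tiny illustration of that difference with toy data (not from the scraper):
import pandas as pd

# A list of dicts tolerates a missing key; pandas fills it with NaN
rows = [{'a': 1, 'b': 2}, {'a': 3}]   # second dict has no 'b'
print(pd.DataFrame(rows))

# A dict of unequal-length lists raises instead:
# pd.DataFrame({'a': [1, 3], 'b': [2]})  # ValueError: All arrays must be of the same length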
Code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup
import re

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019']

# Define 'driver' variable and launch browser
path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(path)

rows = []
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1
        # Get url and navigate it
        page_str = (1 - len(str(page_num))) * '0' + str(page_num)
        url = "https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)

        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break

        try:
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            table = soup.find('table', {'id': 'tournamentTable'})
            trs = table.find_all('tr', {'class': re.compile('.*deactivate.*')})
            for each in trs:
                teams = each.find('td', {'class': 'name table-participant'}).text.split(' - ')
                scores = each.find('td', {'class': re.compile('.*table-score.*')}).text.split(':')

                ot = False
                for score in scores:
                    if 'OT' in score:
                        ot = True  # assignment, not "ot == True", which compares and discards the result
                scores = [x.replace('\xa0OT', '') for x in scores]

                matchTime = each.find('td', {'class': re.compile('.*table-time.*')}).text

                # Odds
                i = 0
                for each_odd in each.find_all('td', {'class': "odds-nowrp"}):
                    i += 1
                    if i % 2 == 0:
                        away_odd = each_odd.text
                    else:
                        home_odd = each_odd.text

                # Note: soup.find returns the first info-value on the page
                n_bookmakers = soup.find('td', {'class': 'center info-value'}).text

                # The preceding th holds "date - stage" for this group of games
                date_stage = each.find_previous('th', {'class': 'first2 tl'}).text.split(' - ')
                date = date_stage[0]
                stage = date_stage[1]

                row = {'Season': season_filt,
                       'Home_Team': teams[0],
                       'Away_Team': teams[1],
                       'Home_Score': scores[0],
                       'Away_Score': scores[1],
                       'OT': ot,
                       'Match_Time': matchTime,
                       'Home_Odd': home_odd,
                       'Away_Odd': away_odd,
                       'N_Bookmakers': n_bookmakers,
                       'Date': date,
                       'Stage': stage}
                rows.append(row)
        except:
            pass

driver.quit()
data = pd.DataFrame(rows)
data.to_csv("data_odds.csv", index=False)
Output:
print(data.head(15).to_string())
Season Home_Team Away_Team Home_Score Away_Score OT Match_Time Home_Odd Away_Odd N_Bookmakers Date Stage
0 2013-2014 Real Madrid Maccabi Tel Aviv 86 98 False 18:00 -667 +493 7 18 May 2014 Final Four
1 2013-2014 Barcelona CSKA Moscow 93 78 False 15:00 -135 +112 7 18 May 2014 Final Four
2 2013-2014 Barcelona Real Madrid 62 100 False 19:00 +134 -161 7 16 May 2014 Final Four
3 2013-2014 CSKA Moscow Maccabi Tel Aviv 67 68 False 16:00 -278 +224 7 16 May 2014 Final Four
4 2013-2014 Real Madrid Olympiacos 83 69 False 18:45 -500 +374 7 25 Apr 2014 Play Offs
5 2013-2014 CSKA Moscow Panathinaikos 74 44 False 16:00 -370 +295 7 25 Apr 2014 Play Offs
6 2013-2014 Olympiacos Real Madrid 71 62 False 18:45 +127 -152 7 23 Apr 2014 Play Offs
7 2013-2014 Maccabi Tel Aviv Olimpia Milano 86 66 False 17:45 -217 +179 7 23 Apr 2014 Play Offs
8 2013-2014 Panathinaikos CSKA Moscow 73 72 False 16:30 -106 -112 7 23 Apr 2014 Play Offs
9 2013-2014 Panathinaikos CSKA Moscow 65 59 False 18:45 -125 +104 7 21 Apr 2014 Play Offs
10 2013-2014 Maccabi Tel Aviv Olimpia Milano 75 63 False 18:15 -189 +156 7 21 Apr 2014 Play Offs
11 2013-2014 Olympiacos Real Madrid 78 76 False 17:00 +104 -125 7 21 Apr 2014 Play Offs
12 2013-2014 Galatasaray Barcelona 75 78 False 17:00 +264 -333 7 20 Apr 2014 Play Offs
13 2013-2014 Olimpia Milano Maccabi Tel Aviv 91 77 False 18:45 -286 +227 7 18 Apr 2014 Play Offs
14 2013-2014 CSKA Moscow Panathinaikos 77 51 False 16:15 -303 +247 7 18 Apr 2014 Play Offs

json decode issue backtesting with zipline

I'm running the test code below on an Ubuntu server. I'm trying to backtest a simple buy-and-hold strategy with zipline, mostly just to make sure I've got everything installed correctly. I'm pretty new to zipline, and I'm getting the error below and am not sure why.
I'm following the steps in this blog post:
https://towardsdatascience.com/introduction-to-backtesting-trading-strategies-7afae611a35e
I'm running the code in a jupyter notebook.
When I run the command below beforehand to verify the bundle has been ingested, this is what it shows:
code:
!zipline bundles
output:
apple-prices-2017-2019 2020-06-06 20:33:05.356288
apple-prices-2017-2019 2020-06-05 06:33:39.834841
apple-prices-2017-2019 2020-06-05 06:29:53.091904
apple-prices-2017-2019 2020-06-05 06:26:56.583051
csvdir <no ingestions>
quandl 2020-06-06 20:25:58.737940
quandl 2020-06-06 20:18:01.977412
quantopian-quandl 2020-06-04 16:15:57.717373
quantopian-quandl 2020-05-29 05:59:54.967114
The code I'm running in the Jupyter notebook is below, along with the error. Does anyone see what the issue is, and can you suggest how to fix it?
code:
%%zipline --start 2016-1-1 --end 2017-12-31 --capital-base 1050.0 -o buy_and_hold.pkl

# imports
from zipline.api import order, symbol, record

# parameters
selected_stock = 'AAPL'
n_stocks_to_buy = 10

def initialize(context):
    context.has_ordered = False

def handle_data(context, data):
    # record price for further inspection
    record(price=data.current(symbol(selected_stock), 'price'))

    # trading logic
    if not context.has_ordered:
        # placing order, negative number for sale/short
        order(symbol(selected_stock), n_stocks_to_buy)
        # setting up a flag for holding a position
        context.has_ordered = True
error:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-10-d11030c42e8b> in <module>()
----> 1 get_ipython().run_cell_magic('zipline', '--start 2016-1-1 --end 2017-12-31 --capital-base 1050.0 -o buy_and_hold.pkl', "\n# imports\nfrom zipline.api import order, symbol, record\n\n# parameters\nselected_stock = 'AAPL'\nn_stocks_to_buy = 10\n\ndef initialize(context):\n context.has_ordered = False \n\ndef handle_data(context, data):\n # record price for further inspection\n record(price=data.current(symbol(selected_stock), 'price'))\n \n # trading logic\n if not context.has_ordered:\n # placing order, negative number for sale/short\n order(symbol(selected_stock), n_stocks_to_buy)\n # setting up a flag for holding a position\n context.has_ordered = True")
~/anaconda3/envs/py36/lib/python3.5/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2165 magic_arg_s = self.var_expand(line, stack_depth)
2166 with self.builtin_trap:
-> 2167 result = fn(magic_arg_s, cell)
2168 return result
2169
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/__main__.py in zipline_magic(line, cell)
309 '%s%%zipline' % ((cell or '') and '%'),
310 # don't use system exit and propogate errors to the caller
--> 311 standalone_mode=False,
312 )
313 except SystemExit as e:
~/anaconda3/envs/py36/lib/python3.5/site-packages/click/core.py in main(self, args, prog_name, complete_var, standalone_mode, **extra)
780 try:
781 with self.make_context(prog_name, args, **extra) as ctx:
--> 782 rv = self.invoke(ctx)
783 if not standalone_mode:
784 return rv
~/anaconda3/envs/py36/lib/python3.5/site-packages/click/core.py in invoke(self, ctx)
1064 _maybe_show_deprecated_notice(self)
1065 if self.callback is not None:
-> 1066 return ctx.invoke(self.callback, **ctx.params)
1067
1068
~/anaconda3/envs/py36/lib/python3.5/site-packages/click/core.py in invoke(*args, **kwargs)
608 with augment_usage_errors(self):
609 with ctx:
--> 610 return callback(*args, **kwargs)
611
612 def forward(*args, **kwargs): # noqa: B902
~/anaconda3/envs/py36/lib/python3.5/site-packages/click/decorators.py in new_func(*args, **kwargs)
19
20 def new_func(*args, **kwargs):
---> 21 return f(get_current_context(), *args, **kwargs)
22
23 return update_wrapper(new_func, f)
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/__main__.py in run(ctx, algofile, algotext, define, data_frequency, capital_base, bundle, bundle_timestamp, start, end, output, trading_calendar, print_algo, metrics_set, local_namespace, blotter)
274 local_namespace=local_namespace,
275 environ=os.environ,
--> 276 blotter=blotter,
277 )
278
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/utils/run_algo.py in _run(handle_data, initialize, before_trading_start, analyze, algofile, algotext, defines, data_frequency, capital_base, data, bundle, bundle_timestamp, start, end, output, trading_calendar, print_algo, metrics_set, local_namespace, environ, blotter)
157 trading_calendar=trading_calendar,
158 trading_day=trading_calendar.day,
--> 159 trading_days=trading_calendar.schedule[start:end].index,
160 )
161 first_trading_day =\
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/finance/trading.py in __init__(self, load, bm_symbol, exchange_tz, trading_calendar, trading_day, trading_days, asset_db_path, future_chain_predicates, environ)
101 trading_day,
102 trading_days,
--> 103 self.bm_symbol,
104 )
105
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/data/loader.py in load_market_data(trading_day, trading_days, bm_symbol, environ)
147 # date so that we can compute returns for the first date.
148 trading_day,
--> 149 environ,
150 )
151 tc = ensure_treasury_data(
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/data/loader.py in ensure_benchmark_data(symbol, first_date, last_date, now, trading_day, environ)
214
215 try:
--> 216 data = get_benchmark_returns(symbol)
217 data.to_csv(get_data_filepath(filename, environ))
218 except (OSError, IOError, HTTPError):
~/anaconda3/envs/py36/lib/python3.5/site-packages/zipline/data/benchmarks.py in get_benchmark_returns(symbol)
33 'https://api.iextrading.com/1.0/stock/{}/chart/5y'.format(symbol)
34 )
---> 35 data = r.json()
36
37 df = pd.DataFrame(data)
~/anaconda3/envs/py36/lib/python3.5/site-packages/requests/models.py in json(self, **kwargs)
894 # used.
895 pass
--> 896 return complexjson.loads(self.text, **kwargs)
897
898 #property
~/anaconda3/envs/py36/lib/python3.5/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
317 parse_int is None and parse_float is None and
318 parse_constant is None and object_pairs_hook is None and not kw):
--> 319 return _default_decoder.decode(s)
320 if cls is None:
321 cls = JSONDecoder
~/anaconda3/envs/py36/lib/python3.5/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
~/anaconda3/envs/py36/lib/python3.5/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This post solved my problem; I only needed the first two steps listed in it. You need to update the benchmarks.py and loader.py files in zipline/data:
Getting JSONDecodeError: Expecting value: line 1 column 1 (char 0) with Python + Zipline + Docker + Jupyter

How to import JSON into R and convert it to table?

I want to play with data that is saved in JSON format, but I am very new to R and have little clue how to work with it. You can see below what I managed to achieve. But first, my code:
library(rjson)
json_file <- "C:\\Users\\Saonkfas\\Desktop\\WOWPAPI\\wowpfinaljson.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
I was able to loop over the data:
for (x in json_data){print (x)}
Although the output looks pretty raw:
[[1]]
[[1]]$wins
[1] "118"
[[1]]$losses
[1] "40"
# And so on
Note that the JSON is somewhat nested. I could create tables with Python, but R seems much more complicated.
Edit:
My JSON:
{
"play1": [
{
"wins": "118",
"losses": "40",
"max_killed": "7",
"battles": "158",
"plane_id": "4401",
"max_ground_object_destroyed": "3"
},
{
"wins": "100",
"losses": "58",
"max_killed": "7",
"battles": "158",
"plane_id": "2401",
"max_ground_object_destroyed": "3"
},
{
"wins": "120",
"losses": "38",
"max_killed": "7",
"battles": "158",
"plane_id": "2403",
"max_ground_object_destroyed": "3"
}
],
"play2": [
{
"wins": "12",
"losses": "450",
"max_killed": "7",
"battles": "158",
"plane_id": "4401",
"max_ground_object_destroyed": "3"
},
{
"wins": "150",
"losses": "8",
"max_killed": "7",
"battles": "158",
"plane_id": "2401",
"max_ground_object_destroyed": "3"
},
{
"wins": "120",
"losses": "328",
"max_killed": "7",
"battles": "158",
"plane_id": "2403",
"max_ground_object_destroyed": "3"
}
],
fromJSON returns a list; you can use the *apply functions to go through each element.
It's fairly straightforward (once you know what to do!) to convert it to a "table" (data frame is the correct R terminology).
library(rjson)
# You can pass directly the filename
my.JSON <- fromJSON(file="test.json")
df <- lapply(my.JSON, function(play) # Loop through each "play"
{
    # Convert each group to a data frame.
    # This assumes you have 6 elements each time
    data.frame(matrix(unlist(play), ncol=6, byrow=T))
})
# Now you have a list of data frames, connect them together in
# one single dataframe
df <- do.call(rbind, df)
# Make column names nicer, remove row names
colnames(df) <- names(my.JSON[[1]][[1]])
rownames(df) <- NULL
df
wins losses max_killed battles plane_id max_ground_object_destroyed
1 118 40 7 158 4401 3
2 100 58 7 158 2401 3
3 120 38 7 158 2403 3
4 12 450 7 158 4401 3
5 150 8 7 158 2401 3
6 120 328 7 158 2403 3
I find jsonlite to be a little more user friendly for this task. Here is a comparison of three JSON parsing packages (biased in favor of jsonlite)
library(jsonlite)
data <- fromJSON('path/to/file.json')
data
#> $play1
# wins losses max_killed battles plane_id max_ground_object_destroyed
# 1 118 40 7 158 4401 3
# 2 100 58 7 158 2401 3
# 3 120 38 7 158 2403 3
#
# $play2
# wins losses max_killed battles plane_id max_ground_object_destroyed
# 1 12 450 7 158 4401 3
# 2 150 8 7 158 2401 3
# 3 120 328 7 158 2403 3
If you want to collapse those list names into a new column, I recommend dplyr::bind_rows rather than do.call(rbind, data)
library(dplyr)
data <- bind_rows(data, .id = 'play')
# Source: local data frame [6 x 7]
# play wins losses max_killed battles plane_id max_ground_object_destroyed
# (chr) (chr) (chr) (chr) (chr) (chr) (chr)
# 1 play1 118 40 7 158 4401 3
# 2 play1 100 58 7 158 2401 3
# 3 play1 120 38 7 158 2403 3
# 4 play2 12 450 7 158 4401 3
# 5 play2 150 8 7 158 2401 3
# 6 play2 120 328 7 158 2403 3
Beware that the columns may not have the type you expect (notice the columns are all characters since all of the numbers were quoted in the provided JSON data)!
Edit Nov. 2017: One approach to type conversion would be to use mutate_if to guess the intended type of character columns.
data <- mutate_if(data, is.character, type.convert, as.is = TRUE)
I prefer tidyjson over rjson and jsonlite as it has an easy workflow for converting multilevel nested JSON objects to two-dimensional tables. Your problem can be easily solved using this package from GitHub.
devtools::install_github("sailthru/tidyjson")
library(tidyjson)
library(dplyr)
> json %>% as.tbl_json %>% gather_keys %>% gather_array %>%
+ spread_values(
+ wins = jstring("wins"),
+ losses = jstring("losses"),
+ max_killed = jstring("max_killed"),
+ battles = jstring("battles"),
+ plane_id = jstring("plane_id"),
+ max_ground_object_destroyed = jstring("max_ground_object_destroyed")
+ )
Output
document.id key array.index wins losses max_killed battles plane_id max_ground_object_destroyed
1 1 play1 1 118 40 7 158 4401 3
2 1 play1 2 100 58 7 158 2401 3
3 1 play1 3 120 38 7 158 2403 3
4 1 play2 1 12 450 7 158 4401 3
5 1 play2 2 150 8 7 158 2401 3
6 1 play2 3 120 328 7 158 2403 3

Getting imported json data into a data frame

I have a file containing over 1500 json objects that I want to work with in R. I've been able to import the data as a list, but am having trouble coercing it into a useful structure. I want to create a data frame containing a row for each json object and a column for each key:value pair.
I've recreated my situation with this small, fake data set:
[{"name":"Doe, John","group":"Red","age (y)":24,"height (cm)":182,"wieght (kg)":74.8,"score":null},
{"name":"Doe, Jane","group":"Green","age (y)":30,"height (cm)":170,"wieght (kg)":70.1,"score":500},
{"name":"Smith, Joan","group":"Yellow","age (y)":41,"height (cm)":169,"wieght (kg)":60,"score":null},
{"name":"Brown, Sam","group":"Green","age (y)":22,"height (cm)":183,"wieght (kg)":75,"score":865},
{"name":"Jones, Larry","group":"Green","age (y)":31,"height (cm)":178,"wieght (kg)":83.9,"score":221},
{"name":"Murray, Seth","group":"Red","age (y)":35,"height (cm)":172,"wieght (kg)":76.2,"score":413},
{"name":"Doe, Jane","group":"Yellow","age (y)":22,"height (cm)":164,"wieght (kg)":68,"score":902}]
Some features of the data:
The objects all contain the same number of key:value pairs, although some of the values are null.
There are two non-numeric columns per object (name and group).
name is the unique identifier; there are 10 or so groups.
Many of the name and group entries contain spaces, commas and other punctuation.
Based on this question: R list(structure(list())) to data frame, I tried the following:
json_file <- "test.json"
json_data <- fromJSON(json_file)
asFrame <- do.call("rbind.fill", lapply(json_data, as.data.frame))
With both my real data and this fake data, the last line gives me this error:
Error in data.frame(name = "Doe, John", group = "Red", `age (y)` = 24, :
arguments imply differing number of rows: 1, 0
You just need to replace your NULLs with NAs:
require(RJSONIO)
json_file <- '[{"name":"Doe, John","group":"Red","age (y)":24,"height (cm)":182,"wieght (kg)":74.8,"score":null},
{"name":"Doe, Jane","group":"Green","age (y)":30,"height (cm)":170,"wieght (kg)":70.1,"score":500},
{"name":"Smith, Joan","group":"Yellow","age (y)":41,"height (cm)":169,"wieght (kg)":60,"score":null},
{"name":"Brown, Sam","group":"Green","age (y)":22,"height (cm)":183,"wieght (kg)":75,"score":865},
{"name":"Jones, Larry","group":"Green","age (y)":31,"height (cm)":178,"wieght (kg)":83.9,"score":221},
{"name":"Murray, Seth","group":"Red","age (y)":35,"height (cm)":172,"wieght (kg)":76.2,"score":413},
{"name":"Doe, Jane","group":"Yellow","age (y)":22,"height (cm)":164,"wieght (kg)":68,"score":902}]'
json_file <- fromJSON(json_file)
json_file <- lapply(json_file, function(x) {
    x[sapply(x, is.null)] <- NA
    unlist(x)
})
Once you have a non-null value for each element, you can call rbind without getting an error:
do.call("rbind", json_file)
name group age (y) height (cm) wieght (kg) score
[1,] "Doe, John" "Red" "24" "182" "74.8" NA
[2,] "Doe, Jane" "Green" "30" "170" "70.1" "500"
[3,] "Smith, Joan" "Yellow" "41" "169" "60" NA
[4,] "Brown, Sam" "Green" "22" "183" "75" "865"
[5,] "Jones, Larry" "Green" "31" "178" "83.9" "221"
[6,] "Murray, Seth" "Red" "35" "172" "76.2" "413"
[7,] "Doe, Jane" "Yellow" "22" "164" "68" "902"
This is very simple if you use either library(jsonlite) or library(jsonify).
Both of these handle the null values, converting them to NA, and they preserve the data types.
Data
json_file <- '[{"name":"Doe, John","group":"Red","age (y)":24,"height (cm)":182,"wieght (kg)":74.8,"score":null},
{"name":"Doe, Jane","group":"Green","age (y)":30,"height (cm)":170,"wieght (kg)":70.1,"score":500},
{"name":"Smith, Joan","group":"Yellow","age (y)":41,"height (cm)":169,"wieght (kg)":60,"score":null},
{"name":"Brown, Sam","group":"Green","age (y)":22,"height (cm)":183,"wieght (kg)":75,"score":865},
{"name":"Jones, Larry","group":"Green","age (y)":31,"height (cm)":178,"wieght (kg)":83.9,"score":221},
{"name":"Murray, Seth","group":"Red","age (y)":35,"height (cm)":172,"wieght (kg)":76.2,"score":413},
{"name":"Doe, Jane","group":"Yellow","age (y)":22,"height (cm)":164,"wieght (kg)":68,"score":902}]'
jsonlite
library(jsonlite)
jsonlite::fromJSON( json_file )
# name group age (y) height (cm) wieght (kg) score
# 1 Doe, John Red 24 182 74.8 NA
# 2 Doe, Jane Green 30 170 70.1 500
# 3 Smith, Joan Yellow 41 169 60.0 NA
# 4 Brown, Sam Green 22 183 75.0 865
# 5 Jones, Larry Green 31 178 83.9 221
# 6 Murray, Seth Red 35 172 76.2 413
# 7 Doe, Jane Yellow 22 164 68.0 902
str( jsonlite::fromJSON( json_file ) )
# 'data.frame': 7 obs. of 6 variables:
# $ name : chr "Doe, John" "Doe, Jane" "Smith, Joan" "Brown, Sam" ...
# $ group : chr "Red" "Green" "Yellow" "Green" ...
# $ age (y) : int 24 30 41 22 31 35 22
# $ height (cm): int 182 170 169 183 178 172 164
# $ wieght (kg): num 74.8 70.1 60 75 83.9 76.2 68
# $ score : int NA 500 NA 865 221 413 902
jsonify
library(jsonify)
jsonify::from_json( json_file )
# name group age (y) height (cm) wieght (kg) score
# 1 Doe, John Red 24 182 74.8 NA
# 2 Doe, Jane Green 30 170 70.1 500
# 3 Smith, Joan Yellow 41 169 60.0 NA
# 4 Brown, Sam Green 22 183 75.0 865
# 5 Jones, Larry Green 31 178 83.9 221
# 6 Murray, Seth Red 35 172 76.2 413
# 7 Doe, Jane Yellow 22 164 68.0 902
str( jsonify::from_json( json_file ) )
# 'data.frame': 7 obs. of 6 variables:
# $ name : chr "Doe, John" "Doe, Jane" "Smith, Joan" "Brown, Sam" ...
# $ group : chr "Red" "Green" "Yellow" "Green" ...
# $ age (y) : int 24 30 41 22 31 35 22
# $ height (cm): int 182 170 169 183 178 172 164
# $ wieght (kg): num 74.8 70.1 60 75 83.9 76.2 68
# $ score : int NA 500 NA 865 221 413 902
To remove null values, use the nullValue parameter:
json_data <- fromJSON(json_file, nullValue = NA)
asFrame <- do.call("rbind.fill", lapply(json_data, as.data.frame))
This way there won't be any unnecessary quotes in your output.
library(rjson)
Lines <- readLines("yelp_academic_dataset_business.json")
business <- as.data.frame(t(sapply(Lines, fromJSON)))
You may try this to load the JSON data into R:
dplyr::bind_rows(fromJSON(file_name))
Changing the package from rjson to jsonlite fixed it for me.
So instead of this:
fromAPIPlantsPages <- rjson::fromJSON(content(apiGetPlants,type="text",encoding = "UTF-8"))
dfPlantenAPI <- as.data.frame(fromAPIPlantsPages)
I changed it to this:
fromAPIPlantsPages <- jsonlite::fromJSON(content(apiGetPlants,type="text",encoding = "UTF-8"))
dfPlantenAPI <- as.data.frame(fromAPIPlantsPages)