JSON dataset features not being loaded properly?

I have this code:
import pandas as pd
import json

file = "/Users/mickelborg/Desktop/Dataset/2018/Carbon_Minoxide_(CO)_2018.json"
with open(file, 'r') as j:
    contents = json.loads(j.read())

oxide = pd.DataFrame.from_dict(contents, orient='index')
oxide
I'm trying to get a readout of the JSON dataset by the features/columns, but they don't seem to load properly.
Currently this is the output that I have:
[screenshot of the DataFrame output]
As can be seen from the screenshot, the data loads incorrectly: each value of "county_code" should have its own row in the dataset, along with all the other features.
What am I doing wrong in this regard?
Thanks a lot for your help!
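A hedged sketch of one likely fix: if the file actually holds a list of record dicts (one dict per observation, which the expected one-row-per-"county_code" layout suggests), then orient='index' wrongly treats each top-level key as a row label. Passing the parsed list straight to pandas gives one row per record instead:
import json
import pandas as pd

file = "/Users/mickelborg/Desktop/Dataset/2018/Carbon_Minoxide_(CO)_2018.json"
with open(file, 'r') as j:
    contents = json.load(j)

# Assumption: contents is a list of record dicts; the question does not
# show the file's actual structure. json_normalize also flattens nesting.
oxide = pd.json_normalize(contents)
print(oxide.head())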

Related

Storing data, originally in a dictionary sequence to a dataframe in the json format of a webpage

I'm new to pandas. How do I store data that originally comes as a dictionary sequence into a DataFrame, in the JSON format of a webpage?
I am interpreting the question as meaning that you have the URL of the webpage you want to read. Inspect that URL and check whether the data you need is available in JSON format. If it is, you will find a URL containing all the data. We need that URL in the following code:
First, import the pandas module.
import pandas as pd
import requests
import json

URL = "url of the webpage having the json file"
r = requests.get(URL)
data = r.json()
Create the dataframe df.
df = pd.json_normalize(data)  # pd.io.json.json_normalize in pandas versions before 1.0
Print the dataframe to check whether you have received the required one.
print(df)
I hope this answers your question.
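For a quick illustration of what json_normalize does with nested records (hypothetical data, not taken from any particular webpage):
import pandas as pd

# Hypothetical nested records, just to show the flattening behaviour.
data = [
    {"id": 1, "station": {"name": "A", "lat": 40.7}},
    {"id": 2, "station": {"name": "B", "lat": 41.2}},
]

df = pd.json_normalize(data)
print(df.columns.tolist())  # ['id', 'station.name', 'station.lat']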

Pull Data from TMX Using Python 3.6.8

About two months ago I asked a question about pulling data from the CME in the json format. I was successfully able to pull the appropriate data with your help.
I want to remind everyone that I am still pretty new to Python, so please bear with me if my question is relatively straightforward.
I am trying to pull data in JSON format again, but from a different website, and things do not appear to be cooperating. In particular, I am trying to pull the following data:
https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742
This is what I have tried.
import pandas as pd
import json
import requests
cadGovt = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
sample_data = requests.get(cadGovt)
sample_data.encoding = 'utf-8'
test = sample_data.json()
print(test)
I would like to get a json of the information (which is literally just a table that has term, description, bid yield, ask yield, change, bid price, ask price, change).
Instead I am getting 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'.
If anyone has any guidance or advice that would be greatly appreciated.
That's because the page you're getting is not returning JSON but an HTML page. So when you try to use
test = sample_data.json()
you're trying to parse HTML as JSON, which won't work. You can scrape the data off the page instead. Here's an example in bs4 you can try; it's a bit rough around the edges, but it should work.
import requests as r
from bs4 import BeautifulSoup

url = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
response = r.get(url)
soup = BeautifulSoup(response.text, 'lxml')

for tr in soup.find_all('tr'):
    print(tr.text + "\n")
You can get the individual TDs like this:
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
    print([td.text for td in tds])
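A possible next step is to assemble those cells into a DataFrame. This is a sketch under the assumption that each data row has the eight cells the question lists; the column names are assumed from the question's description (the two duplicate 'change' columns are renamed for clarity), since the page's exact markup isn't shown:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://api.tmxmoney.com/marketactivity/candeal?ts=1567086212742'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Assumed column names, based on the question's description of the table.
columns = ['term', 'description', 'bid yield', 'ask yield',
           'yield change', 'bid price', 'ask price', 'price change']

rows = []
for tr in soup.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) == len(columns):  # skip header and malformed rows
        rows.append(cells)

df = pd.DataFrame(rows, columns=columns)
print(df)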

Json Parsing from API With Dicts

I am writing a piece of code to retrieve certain information from the League of Legends api.
I have everything working fine and printing to my console, and I have even managed to access the data and print only the information I need. The only issue is that there are 299 values I would like printed, and I can only manage to print one at a time. This would obviously be the worst way to sort through it, as it would take forever to write the program. I have spent over 3 days researching and watching videos with no success so far.
Below is the code I currently have (minus imports).
url = ('https://na1.api.riotgames.com/lol/league/v4/challengerleagues/by-queue/RANKED_SOLO_5x5?api_key=RGAPI-b5187110-2f16-48b4-8b0c-938ae5bddccb')
r = requests.get(url)
response_dict = r.json()
print(response_dict['entries'][0]['summonerName'])
print(response_dict['entries'][1]['summonerName'])
When I attempt to index entries like '[0:299]' I get the following error: list indices must be integers or slices, not str.
I would simply convert the list of dictionaries within entries into a DataFrame. You then have all the info nicely organised and can access specific items easily, including your summonerName column.
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

#url = yourURL
res = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
soup = bs(res.content, 'lxml')
data = json.loads(soup.select_one('p').text)

df = pd.DataFrame(data['entries'])
print(df)
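With the DataFrame built, the column the question is after can then be printed in one go:
print(df['summonerName'])  # all entries at once, no per-index prints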
You can loop over the index; that'll print them all out:
for i in range(300):
    print(response_dict['entries'][i]['summonerName'])
When you use response_dict['entries'][M:N], you create a new list of dictionaries, and each dictionary has to be taken out of that list before you can reference ['summonerName'] directly.
If you print(response_dict['entries'][0:3])
you'll see what I mean.
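For instance, a slice combined with a list comprehension pulls one field out of every entry:
# Slicing returns a list of dicts; extract the field from each one.
names = [entry['summonerName'] for entry in response_dict['entries'][0:3]]
print(names)

# The error in the question comes from skipping that step:
# response_dict['entries'][0:3]['summonerName']  # list indices must be integers or slices, not str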

how to load large csv with many fields to Spark

Happy New Year!!!
I know this type of question has been asked and answered before; however, mine is different:
I have a large CSV with 100+ fields and 100MB+ in size, and I want to load it into Spark (1.6) for analysis. The CSV's header looks like the attached sample (only one line of the data):
Thank you very much.
UPDATE 1(2016.12.31.1:26pm EST):
I used the following approach and was able to load the data (a sample with a limited number of columns). However, I need to automatically assign the header (from the CSV) as the field names in the DataFrame, but the DataFrame comes out with default column names (_1, _2, ...) instead of the header.
Can anyone tell me how to do that? Note that I want to avoid any manual approach.
>>> import csv
>>> rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
>>> rdd = rdd.mapPartitions(lambda x: csv.reader(x))
>>> rdd.take(5)
>>> df = rdd.toDF()
>>> df.show(5)
As noted in the comments, you can use spark.read.csv for Spark 2.0.0+ (https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html):
df = spark.read.csv('your_file.csv', header=True, inferSchema=True)
Setting header to True parses the header row into the DataFrame's column names. Setting inferSchema to True infers each column's type from the data (but slows down reading).
See also here:
Load CSV file with Spark
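Since the question targets Spark 1.6, where spark.read.csv is not available, here is a minimal sketch of promoting the CSV's first row to column names without typing them manually, building on the csv.reader approach already shown (assuming the header really is the first line of the file):
import csv

rdd = sc.textFile('file:///root/Downloads/data/flight201601short.csv')
rdd = rdd.mapPartitions(lambda x: csv.reader(x))

# Use the first row as column names and drop it from the data.
header = rdd.first()
data = rdd.filter(lambda row: row != header)

df = data.toDF(header)  # toDF accepts a list of column names
df.show(5)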

"Not JSON compliant number" building a geojson file

I'm trying to build a GeoJSON file with the Python geojson module, consisting of a regular 2-D grid of points whose 'properties' are associated with geophysical variables (velocity, temperature, etc.). The information comes from a netCDF file.
So the code is something like this:
from netCDF4 import Dataset
import numpy as np
import geojson

ncfile = Dataset('20140925-0332-n19.nc', 'r')
u = ncfile.variables['Ug'][:,:]  # [T,Z,Y,X]
v = ncfile.variables['Vg'][:,:]
lat = ncfile.variables['lat'][:]
lon = ncfile.variables['lon'][:]

features = []
for i in range(0, len(lat)):
    for j in range(0, len(lon)):
        coords = (lon[j], lat[i])
        features.append(geojson.Feature(geometry=geojson.Point(coords),
                                        properties={"u": u[i, j], "v": v[i, j]}))
In this case each point has velocity components in its 'properties' object. The error I receive is on the features.append() line, with the following message:
ValueError: -5.4989638 is not JSON compliant number
which corresponds to a longitude value. Can someone explain to me what can be wrong?
I simply used a conversion to float, and it eliminated that error without needing numpy:
coords = (float(lon[j]),float(lat[i]))
I found the solution. The geojson module only supports standard Python numeric types, while NumPy defines many more of its own. Unfortunately, the netCDF4 module needs NumPy to load arrays from netCDF files. I solved it using the numpy.asscalar() method, as explained here. So, in the code above, for example:
coords = (lon[j],lat[i])
is replaced by
coords = (np.asscalar(lon[j]),np.asscalar(lat[i]))
and it also works for the rest of the variables coming from the netCDF file.
Anyway, thanks Bret for your comment, which gave me the clue to solve it.
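Note that np.asscalar() was deprecated and later removed from NumPy; the .item() method is the drop-in replacement. A minimal sketch of the fixed loop plus serialization, reusing the variable names from the question's code (the output filename here is made up):
import geojson

features = []
for i in range(0, len(lat)):
    for j in range(0, len(lon)):
        # .item() converts a NumPy scalar to the native Python type,
        # which geojson can serialize.
        coords = (lon[j].item(), lat[i].item())
        features.append(geojson.Feature(
            geometry=geojson.Point(coords),
            properties={"u": u[i, j].item(), "v": v[i, j].item()}))

with open('grid.geojson', 'w') as f:
    geojson.dump(geojson.FeatureCollection(features), f)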