I have a two-fold issue and am looking for clues as to how to approach it.
I have a JSON file that is formatted like this:
{
    "code": 2000,
    "data": {
        "1": {
            "attribute1": 40,
            "attribute2": 1.4,
            "attribute3": 5.2,
            "attribute4": 124,
            "attribute5": "65.53%"
        },
        "94": {
            "attribute1": 10,
            "attribute2": 4.4,
            "attribute3": 2.2,
            "attribute4": 12,
            "attribute5": "45.53%"
        },
        "96": {
            "attribute1": 17,
            "attribute2": 9.64,
            "attribute3": 5.2,
            "attribute4": 62,
            "attribute5": "51.53%"
        }
    },
    "message": "SUCCESS"
}
My goals are to:
Sort the data by any of the attributes.
Grab the top 5 entries (there are around 100 of them in total), depending on how they are sorted, then...
Output the data in a table, e.g.:
These are sorted by: attribute5
---
attribute1 | attribute2 | attribute3 | attribute4 | attribute5
40         | 1.4        | 5.2        | 124        | 65.53%
17         | 9.64       | 5.2        | 62         | 51.53%
10         | 4.4        | 2.2        | 12         | 45.53%
*also, attribute5 above is a string value
Admittedly, my knowledge here is very limited.
I attempted to mimic the method used here:
python sort list of json by value
I managed to open the file and I can extract the key values from a sample row:
import json

jsonfile = "path-to-my-file.json"
with open(jsonfile) as j:
    data = json.load(j)

k = data["data"]["1"].keys()
print(k)

total = data["data"]
for row in total:
    v = data["data"][row].values()
    print(v)
this outputs:
dict_keys(['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5'])
dict_values([1, 40, 1.4, 5.2, 124, '65.53%'])
dict_values([94, 10, 4.4, 2.2, 12, '45.53%'])
dict_values([96, 17, 9.64, 5.2, 62, '51.53%'])
Any point in the right direction would be GREATLY appreciated.
Thanks!
If you don't mind using pandas, you could do it like this:
import pandas as pd

# data is the dict loaded from the JSON file, as in the question
rows = [v for k, v in data["data"].items()]
df = pd.DataFrame(rows)

# To get the top 5 rows by an attribute, choose ascending or descending
# order with the ascending keyword; head() returns the first 5 rows.
df.sort_values('attribute1', ascending=True).head()
This will allow you to sort by any attribute you need at any time and print out a table.
That will produce output like the following, depending on what you sort by (here, attribute1 ascending):
   attribute1  attribute2  attribute3  attribute4 attribute5
1          10        4.40         2.2          12     45.53%
2          17        9.64         5.2          62     51.53%
0          40        1.40         5.2         124     65.53%
I'll leave this answer here in case you don't want to use pandas, but the answer from @MatthewBarlowe is way less complicated, and I recommend that.
For sorting by a specific attribute, this should work:
import json
SORT_BY = "attribute4"
with open("test.json") as j:
    data = json.load(j)

items = data["data"]
sorted_keys = sorted(items, key=lambda key: items[key][SORT_BY], reverse=True)
Now, sorted_keys is a list of the keys in order of the attribute they were sorted by.
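Since you only want the top 5, you can slice that list before printing; a small usage sketch (the top5_keys name is just an example):
top5_keys = sorted_keys[:5]  # keep only the first five keys after sorting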
Then, to print this as a table, I used the tabulate library. The final code for me looked like this:
from tabulate import tabulate
import json
SORT_BY = "attribute4"
with open("test.json") as j:
    data = json.load(j)

items = data["data"]
sorted_keys = sorted(items, key=lambda key: items[key][SORT_BY], reverse=True)

print(f"\nSorted by: {SORT_BY}")
print(
    tabulate(
        [[key, *items[key].values()] for key in sorted_keys],
        headers=["Column", *items["1"].keys()],
    )
)
When sorting by 'attribute5', this outputs:
Sorted by: attribute5
  Column    attribute1    attribute2    attribute3    attribute4  attribute5
--------  ------------  ------------  ------------  ------------  ------------
       1            40          1.4            5.2           124  65.53%
      96            17          9.64           5.2            62  51.53%
      94            10          4.4            2.2            12  45.53%
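Because attribute5 is stored as a string (e.g. "65.53%"), the sort above compares text; that happens to give the right order for these values, but for a truly numeric sort you could convert inside the key function. A minimal sketch, assuming every value in that column ends with '%':
sorted_keys = sorted(
    items,
    key=lambda key: float(str(items[key]["attribute5"]).rstrip("%")),
    reverse=True,
)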
How can I fix the save/write command in Octave? Running Import_Data gives:
error: octave_base_value::save_ascii(): wrong type argument 'object'
DATASET.TSERIES = csvread('MR_AER_DATASET1.csv');
DATASET.LABEL = {'DATES','T_PI','T_CI','m_PI','m_CI','APITR','ACITR','PITB','CITB','GOV'};
DATASET.VALUE = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ];
DATASET.UNIT = [ 0, 2, 2, 2, 2, 2, 2, 1, 1, 1 ];
save ('DATASET', 'DATASET');
I am making a predictive model to predict revenue and am trying to parse the 'cast' values from the data frame, as they are not stored as lists or dicts.
x['cast']
And the output is
0 [{'cast_id': 4, 'character': 'Lou', 'credit_id...
1 [{'cast_id': 1, 'character': 'Mia Thermopolis'...
2 [{'cast_id': 5, 'character': 'Andrew Neimann',...
3 [{'cast_id': 1, 'character': 'Vidya Bagchi', '...
4 [{'cast_id': 3, 'character': 'Chun-soo', 'cred...
5 [{'cast_id': 6, 'character': 'Pinocchio (voice...
6 [{'cast_id': 23, 'character': 'Clyde', 'credit...
7 [{'cast_id': 2, 'character': 'Himself', 'credi...
8 [{'cast_id': 1, 'character': 'Long John Silver...
9 [{'cast_id': 24, 'character': 'Jonathan Steinb...
Name: cast, dtype: object
I need all the 'character' values in a list, but when I try
x['cast'][0]['character']
it throws this error:
TypeError: string indices must be integers
Help me out with this please.
First convert the JSON-like strings to lists of dictionaries, then take the value from the first dict in each list by key:
import ast

mask = x['cast'].notna()
x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(ast.literal_eval)
# alternative (only works if the strings are valid JSON, i.e. double-quoted):
# x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(pd.io.json.loads)

x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(lambda v: v[0].get('character', 'not match data'))
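If you want every 'character' per row rather than only the first, you could add a step like this right after the literal_eval line (the 'characters' column name is just an example):
x['characters'] = x['cast'].apply(
    lambda v: [d.get('character') for d in v] if isinstance(v, list) else v
)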
EDIT:
If there is still a problem, use Series.str.extract:
import numpy as np
import pandas as pd

x = pd.DataFrame({'cast': [[{'cast_id': 4, 'character': 'Lou'}], np.nan]})
x['cat'] = x['cast'].astype(str).str.extract("'character': '([^']+)'")
print(x)
                                    cast  cat
0  [{'cast_id': 4, 'character': 'Lou'}]  Lou
1                                    NaN  NaN
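str.extract only returns the first match per row; if you need all of them with this string-based approach, Series.str.findall with the same pattern should give a list of matches per row (a sketch):
x['cast'].astype(str).str.findall(r"'character': '([^']+)'")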
Using Python 3 and numpy, I am trying to read and manipulate a CSV. My intent is to find all buildings that are over 50,000 square feet, the data for which is in column 6. The interpreter returns an error stating, "Line # (got 1 columns instead of 11)." I think that my issue is registering the data type as a string, but I have tried different data types and cannot get the script to work.
import numpy as np
dataframe = np.genfromtxt('buildingsv1.csv', dtype=str, skip_header=1, delimiter="none",usecols=(6))
headers = next(dataframe)
for row in dataframe:
if 50000 in row(6):
print(row)
np.savetxt('buildingsv2')
SOLUTION (using Pandas instead of Numpy)
import pandas as pd

total_df = pd.read_csv('buildingsv1.csv', keep_default_na=False, na_values=[""])

# Build a new DataFrame of 4 columns
total_df[['PARCELID', 'KIVAPIN', 'ADDRESS', 'APN']]

# Filter to buildings of 50,000 square feet or more
total_df[total_df.sqft >= 50000]
A version of the raw dataset is available. I am using a desktop version with machine-readable headings and more columns.
Here's a general idea using Pandas (which is built on Numpy).
import pandas as pd
import numpy as np
# I generated df below but you'd want to read the data with pd.read_csv() like so
#df = pd.read_csv('buildingsv1.csv')
df = pd.DataFrame(np.random.rand(10, 6) * 100000,
                  columns=['Column' + str(i) for i in range(1, 7)])
new_df = df[df['Column6'] >= 50000]
It's good practice to check dtypes in Pandas using df.dtypes. Your data will need to be numeric first to filter over 50,000.
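For example, a minimal sketch on your own file (the column name 'sqft' is an assumption; use whatever your sixth column is actually called):
buildings = pd.read_csv('buildingsv1.csv')
buildings['sqft'] = pd.to_numeric(buildings['sqft'], errors='coerce')  # non-numeric entries become NaN
over_50k = buildings[buildings['sqft'] >= 50000]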
If your numeric data has commas (ex: 50,000), it can be problematic. Here's an example with a column that contains commas.
>>> df1 = pd.DataFrame({'Other Data': [2, 3, 44, 5, 65, 6], 'Commas1': [' 68,028,616 ', ' 162,470,071 ', ' 135,393,045 ', ' 89,981,894 ', ' 74,787,888 ', ' 173,610,498 ']})
>>> df1
        Commas1  Other Data
0    68,028,616           2
1   162,470,071           3
2   135,393,045          44
3    89,981,894           5
4    74,787,888          65
5   173,610,498           6
>>> df1.dtypes
Commas1       object
Other Data     int64
dtype: object
One way to convert the Commas1 column is to use a regex:
df1['Commas1'] = df1['Commas1'].str.replace(r'[^\d\.]', '', regex=True).astype('int64')
>>> df1
     Commas1  Other Data
0   68028616           2
1  162470071           3
2  135393045          44
3   89981894           5
4   74787888          65
5  173610498           6
>>> df1.dtypes
Commas1       int64
Other Data    int64
dtype: object
The takeaway is that Commas1 has been converted to an integer dtype in this example. You can change int64 to float64, for example, if you need floats instead of ints.
Here's a sample run with a comma-delimited CSV (using numpy).
Simulate a file with a list of lines:
In [168]: txt="""name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""
In [169]: txt=txt.splitlines()
Load with genfromtxt:
In [170]: data = np.genfromtxt(txt,dtype=None, delimiter=',')
In [171]: data
Out[171]:
array([['name', ' val1', ' val2', ' val3'],
       ['me', ' 23', ' 34', ' 34'],
       ['you', ' 34', ' 22', ' 35'],
       ['he', ' 22', ' 66', ' 66'],
       ['she', ' 36', '32', '36']],
      dtype='|S5')
Oops, it loaded strings, because the first line contains the column names.
Skip the first line:
In [174]: data = np.genfromtxt(txt,dtype=None, skip_header=1,delimiter=',')
In [175]: data
Out[175]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
       ('she', 36, 32, 36)],
      dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
It deduced the column types correctly, but gave them generic names. Use names=True to take the column names from the file's header:
In [176]: data = np.genfromtxt(txt,dtype=None, names=True,delimiter=',')
In [177]: data
Out[177]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
       ('she', 36, 32, 36)],
      dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
data is a 1d array, with 4 records; the fields of those records are defined in the dtype.
Now we can display rows from this array according to some column criteria:
In [179]: for row in data:
   .....:     if row['val2'] > 32:
   .....:         print(row)
   .....:
('me', 23, 34, 34)
('he', 22, 66, 66)
One record:
In [181]: data[0]
Out[181]: ('me', 23, 34, 34)
One field (column):
In [182]: data['name']
Out[182]:
array(['me', 'you', 'he', 'she'],
      dtype='|S3')
Those selected values can be collected into a new array with an expression like:
In [205]: data1=data[data['val2']>32]
In [206]: data1
Out[206]:
array([('me', 23, 34, 34), ('he', 22, 66, 66)],
      dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
Writing a matching csv isn't quite so nice with numpy. It has a savetxt that writes data in columns, but you have to specify format and header.
In [207]: header='name, val1, val2, val3'
In [208]: fmt='%10s, %4d, %4d, %4d'
In [209]: np.savetxt('test.csv',data1, fmt=fmt,header=header)
In [210]: cat test.csv
# name, val1, val2, val3
'me', 23, 34, 34
'he', 22, 66, 66
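Putting it together for the original file, a rough sketch (the field name 'sqft' for the sixth column is a guess; genfromtxt takes the real names from the header row):
data = np.genfromtxt('buildingsv1.csv', dtype=None, names=True,
                     delimiter=',', encoding=None)
big = data[data['sqft'] >= 50000]          # rows over 50,000 square feet
np.savetxt('buildingsv2.csv', big, fmt='%s', delimiter=',')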
I have a JSON data source that is a list of objects. Some of the object properties are themselves lists. I want to turn the whole thing into a data frame, preserving the lists as data frame values.
Example JSON data:
[{
    "id": "A",
    "p1": [1, 2, 3],
    "p2": "foo"
}, {
    "id": "B",
    "p1": [4, 5, 6],
    "p2": "bar"
}]
Desired data frame:
  id  p2      p1
1  A foo 1, 2, 3
2  B bar 4, 5, 6
Failed attempt 1
I have found this nicely straightforward way of parsing my JSON:
unlisted_data <- lapply(fromJSON(json_str), function(x){unlist(x)})
data.frame(do.call("rbind", unlisted_data))
However, the unlisting process spreads the list values across multiple columns:
  id p11 p12 p13  p2
1  A   1   2   3 foo
2  B   4   5   6 bar
I expected that calling unlist with the recursive = FALSE option would take care of this, but it doesn't.
Failed attempt 2
I noticed that I can almost do this with the I function:
> data.frame(I(parsed_json[[1]]))
   parsed_json..1..
id                A
p1          1, 2, 3
p2              foo
But the rows and columns are reversed. Transposing the result mangles the repeated data:
> t(data.frame(I(parsed_json[[1]])))
                 id  p1        p2
parsed_json..1.. "A" Numeric,3 "foo"
The jsonlite package can handle this just fine:
library(jsonlite)
fromJSON(txt)
#  id      p1  p2
#1  A 1, 2, 3 foo
#2  B 4, 5, 6 bar
fromJSON(txt)$p1
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 4 5 6