Convert EIA JSON to DataFrame - Python 3.6 - json

I am trying to convert the JSON file from http://api.eia.gov/bulk/INTL.zip to a DataFrame.
Below is my code:
import os, sys, json
import pandas as pd
sourcePath = r"D:\Learn\EIA\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data']) # Delete if blank/NA
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data]) # DF2.data contains lists; convert to DataFrame
Error:
Traceback (most recent call last):
  File "D:\python\pyCharm\EIA\EIAINTL2018May.py", line 11, in <module>
    DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data])
  File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2326, in __setitem__
    self._setitem_array(key, value)
  File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
I am stuck; please help with this.
I need results like below: Date & Value are present as lists in the DF.data column.
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data]).iloc[:,0:2] # This is not working
New code changes after jezrael's solution:
import os, sys, ast
import pandas as pd
sourcePath = r"C:\sunil_plus\dataset\EIAINTL2018May\8_updation2018Aug2\source\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data'])
DF2['Date'] = [[x[0] for x in item] for item in DF2.data]
DF2['Values'] = [[x[1] for x in item] for item in DF2.data]
DF_All = pd.DataFrame(); DF4 = pd.DataFrame()
for series_id in DF2['series_id']:
    DF3 = DF2.loc[DF2['series_id'] == series_id]
    DF4['DateF'] = [item for item in DF3.Date]      # Here I need to convert list values to rows
    DF4['ValuesF'] = [item for item in DF3.Values]  # Here I need to convert list values to rows
    # Above code not working as expected
    DF3 = DF3[['series_id', 'name', 'units', 'geography', 'f']]  # Need only these columns
    DF5 = pd.concat([DF3, DF4], axis=1).ffill()  # Concat to get DateF & ValuesF values
    DF_All = DF_All.append(DF5)

You can use 2 list comprehensions to match the first and second values of the nested lists:
sourcePath = r"D:\Learn\EIA\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data'])
DF2['Date'] = [[x[0] for x in item] for item in DF2.data]
DF2['Values'] = [[x[1] for x in item] for item in DF2.data]
print (DF2.head())
series_id name \
0 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
1 INTL.51-8-SRB-MMTCD.A CO2 Emissions from the Consumption of Natural ...
2 INTL.51-8-SSD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
3 INTL.51-8-SUN-MMTCD.A CO2 Emissions from the Consumption of Natural ...
4 INTL.51-8-SVK-MMTCD.A CO2 Emissions from the Consumption of Natural ...
units geography f \
0 Million Metric Tons MKD A
1 Million Metric Tons SRB A
2 Million Metric Tons SSD A
3 Million Metric Tons SUN A
4 Million Metric Tons SVK A
data \
0 [[2015, 0.1], [2014, (s)], [2013, (s)], [2012,...
1 [[2015, 4.1], [2014, 3.5], [2013, 4.2], [2012,...
2 [[2011, --], [2010, --], [2006, --], [2003, --...
3 [[2006, --], [2003, --], [2002, --], [2001, --...
4 [[2015, 9.1], [2014, 8.8], [2013, 11], [2012, ...
Date \
0 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
1 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
2 [2011, 2010, 2006, 2003, 2002, 2001, 2000, 199...
3 [2006, 2003, 2002, 2001, 2000, 1999, 1998, 199...
4 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
Values
0 [0.1, (s), (s), 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, ...
1 [4.1, 3.5, 4.2, 5.2, 4.4, 4.1, 3.2, 4.2, 4.1, ...
2 [--, --, --, --, --, --, --, --, --, --, --, -...
3 [--, --, --, --, --, --, --, --, --, --, --, -...
4 [9.1, 8.8, 11, 10, 11, 12, 10, 12, 12, 13, 14,...
EDIT: You can repeat rows and create 2 new columns:
sourcePath = 'INTL.txt'
DF = pd.read_json(sourcePath, lines=True)
cols = ['series_id', 'name', 'units', 'geography', 'f', 'data']
DF2 = DF[cols].dropna(subset=['data'])
DF3 = DF2.join(pd.DataFrame(DF2.pop('data').values.tolist())
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('data')
               ).reset_index(drop=True)
DF3[['Date', 'Value']] = pd.DataFrame(DF3['data'].values.tolist())
#if want remove original data column
#DF3[['Date', 'Value']] = pd.DataFrame(DF3.pop('data').values.tolist())
print (DF3.head())
series_id name \
0 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
1 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
2 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
3 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
4 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
units geography f data Date Value
0 Million Metric Tons MKD A [2015, 0.1] 2015 0.1
1 Million Metric Tons MKD A [2014, (s)] 2014 (s)
2 Million Metric Tons MKD A [2013, (s)] 2013 (s)
3 Million Metric Tons MKD A [2012, 0.2] 2012 0.2
4 Million Metric Tons MKD A [2011, 0.2] 2011 0.2
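On newer pandas (0.25 or later), DataFrame.explode can produce the same repeated rows with less code. This is only a minimal sketch, assuming DF2 as built in the EDIT above:
DF3 = DF2.explode('data').reset_index(drop=True)   # one row per [year, value] pair
DF3[['Date', 'Value']] = pd.DataFrame(DF3['data'].tolist(), index=DF3.index)
print (DF3.head())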

Related

Dynamically Flatten JSON response from API gives one Huge row

I am trying to dynamically flatten a JSON response from an API request but am getting only one row with all the records back. Kindly assist or point me in the right direction.
My JSON response looks like this:
import requests, json
URL='https://data.calgary.ca/resource/848s-4m4z.json'
data = json.loads(requests.get(URL).text)
data
[{'sector': 'NORTH',
'community_name': 'THORNCLIFFE',
'group_category': 'Crime',
'category': 'Theft FROM Vehicle',
'count': '9',
'resident_count': '8474',
'date': '2018-03-01T12:00:00.000',
'year': '2018',
'month': 'MAR',
'id': '2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9',
'geocoded_column': {'latitude': '51.103099554741',
'longitude': '-114.068779421169',
'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
':#computed_region_4a3i_ccfj': '2',
':#computed_region_p8tp_5dkv': '4',
':#computed_region_4b54_tmc4': '2',
':#computed_region_kxmf_bzkv': '192'},
{'sector': 'SOUTH',
'community_name': 'WOODBINE',
'group_category': 'Crime',
'category': 'Theft FROM Vehicle',
'count': '3',
'resident_count': '8866',
'date': '2019-11-01T00:00:00.000',
'year': '2019',
'month': 'NOV',
'id': '2019-NOV-WOODBINE-Theft FROM Vehicle-3',
'geocoded_column': {'latitude': '50.939610852207664',
'longitude': '-114.12962865374453',
'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
':#computed_region_4a3i_ccfj': '1',
':#computed_region_p8tp_5dkv': '6',
':#computed_region_4b54_tmc4': '5',
':#computed_region_kxmf_bzkv': '43'}
]
Here is my code:
import pandas as pd

# Function for flattening json
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        # If the nested key-value pair is of dict type
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        # If the nested key-value pair is of list type
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

# Driver code
# print(flatten_json(data))
newf = flatten_json(data)
pd.json_normalize(newf)
It returns
[enter image description here](https://i.stack.imgur.com/i6mUe.png)
While am expecting the data in the following format
[enter image description here](https://i.stack.imgur.com/mXNtU.png).
json_normalize gives me the data in the expected format, but I need a way to dynamically parse different JSON request formats (programmatically).
To get your dataframe into the correct form you can use this example (data is your list from the question):
import requests
import pandas as pd
from ast import literal_eval
url = "https://data.calgary.ca/resource/848s-4m4z.json"
df = pd.DataFrame(requests.get(url).json())
df = pd.concat(
    [
        df,
        df.pop("geocoded_column")
            .apply(pd.Series)
            .add_prefix("geocoded_column_"),
    ],
    axis=1,
)
df["geocoded_column_human_address"] = df["geocoded_column_human_address"].apply(
    literal_eval
)
df = pd.concat(
    [
        df,
        df.pop("geocoded_column_human_address")
            .apply(pd.Series)
            .add_prefix("addr_"),
    ],
    axis=1,
)
print(df.head().to_markdown(index=False))
Prints:
| sector | community_name | group_category | category | count | resident_count | date | year | month | id | :#computed_region_4a3i_ccfj | :#computed_region_p8tp_5dkv | :#computed_region_4b54_tmc4 | :#computed_region_kxmf_bzkv | geocoded_column_latitude | geocoded_column_longitude | addr_address | addr_city | addr_state | addr_zip |
|--------|----------------|----------------|----------|-------|----------------|------|------|-------|----|------------------------------|------------------------------|------------------------------|------------------------------|---------------------------|----------------------------|--------------|-----------|------------|----------|
| NORTH | THORNCLIFFE | Crime | Theft FROM Vehicle | 9 | 8474 | 2018-03-01T12:00:00.000 | 2018 | MAR | 2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9 | 2 | 4 | 2 | 192 | 51.1031 | -114.069 | | | | |
| SOUTH | WOODBINE | Crime | Theft FROM Vehicle | 3 | 8866 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WOODBINE-Theft FROM Vehicle-3 | 1 | 6 | 5 | 43 | 50.9396 | -114.13 | | | | |
| SOUTH | WILLOW PARK | Crime | Theft FROM Vehicle | 4 | 5328 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WILLOW PARK-Theft FROM Vehicle-4 | 3 | 5 | 6 | 89 | 50.9566 | -114.056 | | | | |
| SOUTH | WILLOW PARK | Crime | Commercial Robbery | 1 | 5328 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WILLOW PARK-Commercial Robbery-1 | 3 | 5 | 6 | 89 | 50.9566 | -114.056 | | | | |
| WEST | LINCOLN PARK | Crime | Commercial Break & Enter | 5 | 2617 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-LINCOLN PARK-Commercial Break & Enter-5 | 1 | 2 | 8 | 42 | 51.0101 | -114.13 | | | | |
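For the dynamic flattening the question asks about, pd.json_normalize (pandas 1.0+) flattens nested dicts such as geocoded_column on its own. This is a minimal alternative sketch along those lines, not the answer above; the inner human_address string still needs literal_eval, as before:
import requests
import pandas as pd
from ast import literal_eval

url = "https://data.calgary.ca/resource/848s-4m4z.json"
records = requests.get(url).json()
# sep="_" turns nested keys into columns such as geocoded_column_latitude
df = pd.json_normalize(records, sep="_")
# human_address is itself a JSON-like string, so expand it separately
addr = (df.pop("geocoded_column_human_address")
          .apply(literal_eval)
          .apply(pd.Series)
          .add_prefix("addr_"))
df = pd.concat([df, addr], axis=1)
print(df.head())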

Insert list json objects into row based on other column values in dataframe

I have a dataframe with the following columns:
ID  A1  B1  C1  A2  B2  C2  A3  B3  C3
AA   1   3   6               4   0   6
BB   5   5   4   6   7   9
CC   5   5   5
I want to create a new column called Z that takes each row, groups it into a JSON list of records, and renames the columns to just their letter keys. After the JSON column is constructed, I want to drop all the other columns and keep only Z and ID.
Here is the output desired:
ID Z
AA [{"A":1, "B":3,"C":6},{"A":4, "B":0,"C":6}]
BB [{"A":5, "B":5,"C":4},{"A":6, "B":7,"C":9}]
CC [{"A":5, "B":5,"C":5}]
Here is my current attempt:
df2 = df.groupby(['ID']).apply(lambda x: x[['A1', 'B1', 'C1', 'A2', 'B2', 'C2',
                                            'A3', 'B3', 'C3']].to_dict('records')).to_frame('Z').reset_index()
The problem is that I cannot rename the columns so that only the letter remains and the number is removed, as in the example above. Running the code above also does not separate each group of 3 columns into its own object, so I do not get two objects per list. I would like to accomplish this in Pandas if possible. Any guidance is greatly appreciated.
Pandas solution
Convert the columns to a MultiIndex by splitting and expanding around a regex delimiter, then stack the dataframe to turn it into a MultiIndex series. Then group on level=0 and apply to_dict to create the records per ID.
s = df.set_index('ID')
s.columns = s.columns.str.split(r'(?=\d+$)', expand=True)
s.stack().groupby(level=0).apply(pd.DataFrame.to_dict, 'records').reset_index(name='Z')
Result
ID Z
0 AA [{'A': 1.0, 'B': 3.0, 'C': 6.0}, {'A': 4.0, 'B': 0.0, 'C': 6.0}]
1 BB [{'A': 5.0, 'B': 5.0, 'C': 4.0}, {'A': 6.0, 'B': 7.0, 'C': 9.0}]
2 CC [{'A': 5.0, 'B': 5.0, 'C': 5.0}]
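If Z must hold actual JSON strings (double-quoted, as in the desired output) rather than Python dicts, the record lists can be serialized afterwards. A minimal sketch, reusing s from above:
import json
out = s.stack().groupby(level=0).apply(pd.DataFrame.to_dict, 'records').reset_index(name='Z')
out['Z'] = out['Z'].apply(json.dumps)   # each cell becomes a JSON string like '[{"A": 1.0, ...}]'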
Have you tried going line by line? I am not very good with pandas and Python, but I came up with this code. Hope it works for you.
toAdd = []
for row in dataset.values:
    toAddLine = {}
    i = 0
    for data in row:
        if data is not None:
            toAddLine["New Column Name " + dataset.columns[i]] = data
        i = i + 1
    toAdd.append(toAddLine)
dataset['Z'] = toAdd
dataset['Z']
import json

# create a column-name map for renaming the related columns
columns = dataset.columns
columns_map = {}
for i in columns:
    columns_map[i] = f"new {i}"

def change_row_to_json(row):
    new_dict = {}
    for index, value in enumerate(row):
        new_dict[columns_map[columns[index]]] = value
    return json.dumps(new_dict, indent=4)

dataset.loc[:, 'Z'] = dataset.apply(change_row_to_json, axis=1)
dataset = dataset[["ID", "Z"]]
I just added a few lines to Shubham's code and it worked for me:
import pandas as pd
from numpy import nan
data = pd.DataFrame({'ID': {0: 'AA', 1: 'BB', 2: 'CC'}, 'A1': {0: 1, 1: 5, 2: 5}, 'B1': {0: 3, 1: 5, 2: 5}, 'C1': {0: 6, 1: 4, 2: 5}, 'A2': {0: nan, 1: 6.0, 2: nan}, 'B2': {0: nan, 1: 7.0, 2: nan}, 'C2': {0: nan, 1: 9.0, 2: nan}, 'A3': {0: 4.0, 1: nan, 2: nan}, 'B3': {0: 0.0, 1: nan, 2: nan}, 'C3': {0: 6.0, 1: nan, 2: nan}} )
data
data.index = data.ID
data.drop(columns=['ID'],inplace=True)
data
data.columns = data.columns.str.split(r'(?=\d+$)', expand=True)
d = data.stack().groupby(level=0).apply(pd.DataFrame.to_dict, 'records').reset_index(name='Z')
d.index = d.ID
d.drop(columns=['ID'],inplace=True)
d.to_dict()['Z']
Now we can see we get the desired output. Thanks, @Shubham Sharma, for the answer. I think this might help.

Sort and Select Top 5 JSON values

I have a two-fold issue and am looking for clues as to how to approach it.
I have a JSON file that is formatted as such:
{
  "code": 2000,
  "data": {
    "1": {
      "attribute1": 40,
      "attribute2": 1.4,
      "attribute3": 5.2,
      "attribute4": 124,
      "attribute5": "65.53%"
    },
    "94": {
      "attribute1": 10,
      "attribute2": 4.4,
      "attribute3": 2.2,
      "attribute4": 12,
      "attribute5": "45.53%"
    },
    "96": {
      "attribute1": 17,
      "attribute2": 9.64,
      "attribute3": 5.2,
      "attribute4": 62,
      "attribute5": "51.53%"
    }
  },
  "message": "SUCCESS"
}
My goals are to:
I would first like to sort the data by any of the attributes.
There are around 100 of these; I would like to grab the top 5 (depending on how they are sorted), then...
Output the data in a table e.g.:
These are sorted by: attribute5
---
attribute1 | attribute2 | attribute3 | attribute4 | attribute5
40 | 1.4  | 5.2 | 124 | 65.53%
17 | 9.64 | 5.2 | 62  | 51.53%
10 | 4.4  | 2.2 | 12  | 45.53%
*also, attribute5 above is a string value
Admittedly, my knowledge here is very limited.
I attempted to mimic the method used here:
python sort list of json by value
I managed to open the file and I can extract the key values from a sample row:
import json

jsonfile = "path-to-my-file.json"
with open(jsonfile) as j:
    data = json.load(j)

k = data["data"]["1"].keys()
print(k)

total = data["data"]
for row in total:
    v = data["data"][str(row)].values()
    print(v)
this outputs:
dict_keys(['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5'])
dict_values([1, 40, 1.4, 5.2, 124, '65.53%'])
dict_values([94, 10, 4.4, 2.2, 12, '45.53%'])
dict_values([96, 17, 9.64, 5.2, 62, '51.53%'])
Any point in the right direction would be GREATLY appreciated.
Thanks!
If you don't mind using pandas, you could do it like this:
import pandas as pd
rows = [v for k, v in data["data"].items()]
df = pd.DataFrame(rows)
# to get the top 5 values by an attribute, choose ascending or descending
# with the `ascending` keyword; head() returns the top 5 rows
df.sort_values('attribute1', ascending=True).head()
This will allow you to sort by any attribute you need at any time and print out a table, which will produce output like this depending on what you sort by:
attribute1 attribute2 attribute3 attribute4 attribute5
0 40 1.40 5.2 124 65.53%
1 10 4.40 2.2 12 45.53%
2 17 9.64 5.2 62 51.53%
I'll leave this answer here in case you don't want to use pandas, but the answer from @MatthewBarlowe is way less complicated and I recommend that.
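One caveat with sorting by attribute5: it is a percentage string, so a plain sort compares text rather than numbers. A minimal sketch (assuming the df built above; the attribute5_num column name is made up for illustration) that strips the '%' first:
df["attribute5_num"] = df["attribute5"].str.rstrip("%").astype(float)  # "65.53%" -> 65.53
df.sort_values("attribute5_num", ascending=False).head()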
For sorting by a specific attribute, this should work:
import json
SORT_BY = "attribute4"
with open("test.json") as j:
data = json.load(j)
items = data["data"]
sorted_keys = list(sorted(items, key=lambda key: items[key][SORT_BY], reverse=True))
Now, sorted_keys is a list of the keys in order of the attribute they were sorted by.
Then, to print this as a table, I used the tabulate library. The final code for me looked like this:
from tabulate import tabulate
import json
SORT_BY = "attribute4"
with open("test.json") as j:
data = json.load(j)
items = data["data"]
sorted_keys = list(sorted(items, key=lambda key: items[key][SORT_BY], reverse=True))
print(f"\nSorted by: {SORT_BY}")
print(
tabulate(
[
[sorted_keys[i], *items[sorted_keys[i]].values()]
for i, _ in enumerate(items)
],
headers=["Column", *items["1"].keys()],
)
)
When sorting by 'attribute5', this outputs:
Sorted by: attribute5
Column attribute1 attribute2 attribute3 attribute4 attribute5
-------- ------------ ------------ ------------ ------------ ------------
1 40 1.4 5.2 124 65.53%
96 17 9.64 5.2 62 51.53%
94 10 4.4 2.2 12 45.53%
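Since the goal is the top 5 only, sorted_keys can be sliced before building the table. A small sketch on top of the code above:
top5 = sorted_keys[:5]   # keep only the five highest by SORT_BY
print(
    tabulate(
        [[key, *items[key].values()] for key in top5],
        headers=["Column", *items[top5[0]].keys()],
    )
)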

Convert R data table column from JSON to data table

I have a column that contains JSON data, as in the following example:
library(data.table)
test <- data.table(a = list(1, 2, 3),
                   info = list("{'duration': '10', 'country': 'US'}",
                               "{'duration': '20', 'country': 'US'}",
                               "{'duration': '30', 'country': 'GB', 'width': '20'}"))
I want to convert the last column to equivalent R storage, which would look similar to:
res <- data.table(a = list(1, 2, 3),
                  duration = list(10, 20, 30),
                  country = list('US', 'US', 'GB'),
                  width = list(NA, NA, 20))
Since I have 500K rows with different contents, I am looking for a quick way to do this.
A variation without the need to separate out the JSON string
library(data.table)
library(jsonlite)
test[, info := gsub("'", "\"", info)]
test[, rbindlist(lapply(info, fromJSON), use.names = TRUE, fill = TRUE)]
# duration country width
# 1: 10 US NA
# 2: 20 US NA
# 3: 30 GB 20
Parse the JSON first, then build the data.frame (or data.table):
json_string <- paste(c("[{'duration': '10', 'country': 'US'}",
                       "{'duration': '20', 'country': 'US'}",
                       "{'duration': '30', 'country': 'GB'}",
                       "{'width': '20'}]"), collapse=", ")
# JSON standard requires double quotes
json_string <- gsub("'", "\"", json_string)
library("jsonlite")
fromJSON(json_string)
# duration country width
# 1 10 US <NA>
# 2 20 US <NA>
# 3 30 GB <NA>
# 4 <NA> <NA> 20
This isn't exactly what you asked for; since your JSON doesn't associate 'width' with the previous record, you might need to do some manipulation first:
json_string <- paste(c("[{'duration': '10', 'country': 'US'}",
                       "{'duration': '20', 'country': 'US'}",
                       "{'duration': '30', 'country': 'GB', 'width': '20'}]"),
                     collapse=", ")
json_string <- gsub("'", "\"", json_string)
df <- jsonlite::fromJSON(json_string)
data.table::as.data.table(df)
# duration country width
# 1: 10 US NA
# 2: 20 US NA
# 3: 30 GB 20

Manipulating data in CSV

Using Python 3 and numpy, I am trying to read and manipulate a CSV. My intent is to find all buildings that are over 50,000 square feet, the data for which is in column 6. The interpreter returns an error stating, "Line # (got 1 columns instead of 11)." I think that my issue is registering the data type as a string, but I have tried different data types and cannot get the script to work.
import numpy as np
dataframe = np.genfromtxt('buildingsv1.csv', dtype=str, skip_header=1, delimiter="none", usecols=(6))
headers = next(dataframe)
for row in dataframe:
    if 50000 in row(6):
        print(row)
np.savetxt('buildingsv2')
SOLUTION (using Pandas instead of Numpy)
import pandas as pd
total_df = pd.read_csv('buildingsv1.csv', keep_default_na=False, na_values=[""])
#Build new DataFrame of 4 columns
total_df[['PARCELID', 'KIVAPIN', 'ADDRESS', 'APN']]
total_df[total_df.sqft >= 50000]
A version of the raw dataset is available. I am using a desktop version with machine-readable headings and more columns.
Here's a general idea using Pandas (which is built on Numpy).
import pandas as pd
import numpy as np
# I generated df below but you'd want to read the data with pd.read_csv() like so
#df = pd.read_csv('buildingsv1.csv')
df = pd.DataFrame(np.random.rand(10, 6)*100000,
                  columns=['Column'+str(i) for i in range(1, 7)])
new_df = df[df['Column6'] >= 50000]
It's good practice to check dtypes in Pandas using df.dtypes. Your data will need to be numeric first to filter over 50,000.
If your numeric data has commas (ex: 50,000), it can be problematic. Here's an example with a column that contains commas.
>>> df1 = pd.DataFrame({'Other Data': [2, 3, 44, 5, 65, 6], 'Commas1': [' 68,028,616 ', ' 162,470,071 ', ' 135,393,045 ', ' 89,981,894 ', ' 74,787,888 ', ' 173,610,498 ']})
>>> df1
Commas1 Other Data
0 68,028,616 2
1 162,470,071 3
2 135,393,045 44
3 89,981,894 5
4 74,787,888 65
5 173,610,498 6
>>> df1.dtypes
Commas1 object
Other Data int64
dtype: object
One way to convert Commas1 column is to use regex:
df1['Commas1'] = df1['Commas1'].str.replace(r'[^\d\.]', '').astype('int64')
>>> df1
Commas1 Other Data
0 68028616 2
1 162470071 3
2 135393045 44
3 89981894 5
4 74787888 65
5 173610498 6
>>> df1.dtypes
Commas1 int64
Other Data int64
dtype: object
The takeaway is that Commas1 has been converted to an integer datatype in this example. You can change int64 to float64, for example, if you need floats instead of ints.
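As a minimal sketch of that float variant, assuming Commas1 still holds the original comma-separated strings (not the already-converted ints from above):
df1["Commas1"] = df1["Commas1"].str.replace(r"[^\d\.]", "", regex=True).astype("float64")
df1.dtypes   # Commas1 is now float64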
Here's a sample run with a comma delimited csv (with numpy)
Simulate a file with a list of lines.
In [168]: txt="""name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""
In [169]: txt=txt.splitlines()
Load with genfromtxt:
In [170]: data = np.genfromtxt(txt,dtype=None, delimiter=',')
In [171]: data
Out[171]:
array([['name', ' val1', ' val2', ' val3'],
['me', ' 23', ' 34', ' 34'],
['you', ' 34', ' 22', ' 35'],
['he', ' 22', ' 66', ' 66'],
['she', ' 36', '32', '36']],
dtype='|S5')
Oops, it loaded strings, because the first line contains the column names.
Skip the first line:
In [174]: data = np.genfromtxt(txt,dtype=None, skip_header=1,delimiter=',')
In [175]: data
Out[175]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
It deduced the column types correctly, but gave them generic names. Use names=True to take the column headers from the file:
In [176]: data = np.genfromtxt(txt,dtype=None, names=True,delimiter=',')
In [177]: data
Out[177]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
data is a 1d array, with 4 records; the fields of those records are defined in the dtype.
Now we can display rows from this array according to some column criteria:
In [179]: for row in data:
   .....:     if row['val2']>32:
   .....:         print(row)
   .....:
('me', 23, 34, 34)
('he', 22, 66, 66)
One record:
In [181]: data[0]
Out[181]: ('me', 23, 34, 34)
One field (column):
In [182]: data['name']
Out[182]:
array(['me', 'you', 'he', 'she'],
dtype='|S3')
Those selected values can be collected into a new array with an expression like:
In [205]: data1=data[data['val2']>32]
In [206]: data1
Out[206]:
array([('me', 23, 34, 34), ('he', 22, 66, 66)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
Writing a matching CSV isn't quite so nice with numpy. It has savetxt, which writes data in columns, but you have to specify the format and header.
In [207]: header='name, val1, val2, val3'
In [208]: fmt='%10s, %4d, %4d, %4d'
In [209]: np.savetxt('test.csv',data1, fmt=fmt,header=header)
In [210]: cat test.csv
# name, val1, val2, val3
'me', 23, 34, 34
'he', 22, 66, 66
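For comparison, the same filter-and-write round trip is shorter with pandas (which the earlier answer already recommends). A minimal sketch using the same txt lines from above:
import io
import pandas as pd

# rebuild the simulated file text; skipinitialspace handles the space after each comma
df = pd.read_csv(io.StringIO("\n".join(txt)), skipinitialspace=True)
df[df['val2'] > 32].to_csv('test.csv', index=False)   # write only the matching rows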