I have a delimited file whose columns contain both JSON and key=value pairs. I need to parse this data into a DataFrame.
Below is the record format
**trx_id|name|service_context|status**
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
I need to convert all the information from this record into this format:
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently I parse the key=value data manually with sep=';|[|]', strip the part before the '=', and update the column names.
For the JSON, I run the command below; however, the result replaces the existing table and contains only the parsed JSON.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to avoid this manual process for this type of data?
The hint below should be sufficient to solve the problem.
Do it part by part for each column and then merge the pieces together (you will need to remove the original columns once you have split them into multiple columns):
import ast
import pandas as pd
from pandas import json_normalize

# flatten the JSON part of service_context: take the chunk after the ';', then the text after the '='
x = json_normalize(df3['service_context'].apply(lambda s: ast.literal_eval(s.split(';')[1].split('=', 1)[1]))).add_prefix('payload.')

# expand the list of counter dicts into positional columns
y = pd.DataFrame(x['payload.counter'].apply(lambda c: [i['counter_type'] for i in c]).to_list())
y = y.rename(columns={0: 'counter_type', 1: 'counter_info'})

for row in x['payload.product']:
    z1 = json_normalize(row)
    z2 = json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
The intermediate frames x and y then hold the flattened payload columns and the two counter columns, respectively.
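To finish the hint, here is a minimal sketch of the final merge, assuming `df3` still holds the original `trx_id`, `name`, and `status` columns and that `x` and `y` are kept as built above (the `result` name is just for illustration):

result = pd.concat(
    [df3.drop(columns=['service_context']),           # original columns minus the raw one
     x.drop(columns=['payload.counter', 'payload.product']),  # keep only the scalar payload columns
     y.add_prefix('payload.counter.')],                # counter_type / counter_info columns
    axis=1)
print(result.columns.tolist())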
It's really a 3-step approach:
use the primary pipe | delimiter
extract key/value pairs
normalize the JSON
import pandas as pd
import io, json
# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
sep="|", header=None, names=["trx_id","name","data","status"])
df2 = pd.concat([
    df,
    # split out the ';'-delimited sub-columns in the 3rd column
    pd.DataFrame(
        [[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)

# extract the JSON payload into columns. Embedded lists are left as-is, as these are
# many-to-many relationships that need to be worked out by the data owner
df3 = pd.concat([
    df2,
    pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()
], axis=1)
output
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
Use with caution: explode() the embedded lists
df3p = df3["payload.product"].explode().apply(pd.Series)

df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
    pd.json_normalize(
        df3p.join(df3p["customer_spec"].apply(pd.Series))
            .explode("resource_pecification")
            .to_dict(orient="records")
    )
)
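If you prefer the two-column counter layout from the question (counter_type / counter_info) over exploded rows, a possible alternative is to expand each list in place; this is only a sketch, and the `df4` name and target column names are assumptions:

# Turn each list of counter dicts into positional columns instead of extra rows
counters = pd.DataFrame(
    df3["payload.counter"].apply(lambda lst: [d["counter_type"] for d in lst]).to_list(),
    index=df3.index,
).rename(columns={0: "payload.counter.counter_type", 1: "payload.counter.counter_info"})

# Replace the raw list column with the expanded columns
df4 = pd.concat([df3.drop(columns=["payload.counter"]), counters], axis=1)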
Related
Looking for a way to modify the script below to produce a single CSV from multiple JSON files. It should include multiple rows, each row returning values for the same fields but tied to a single JSON file (row 1 = JSON 1, row 2 = JSON 2, etc.). The following produces a CSV with one row of data.
import pandas as pd
df = pd.read_json("pywu.cache.json")
df = df.loc[["station_id", "observation_time", "weather", "temperature_string", "display_location"],"current_observation"].T
df = df.append(pd.Series([df["display_location"]["latitude"], df["display_location"]["longitude"]], index=["latitude", "longitude"]))
df = df.drop("display_location")
print(df['latitude'], df['longitude'])
df = pd.to_numeric(df, errors="ignore")
pd.DataFrame(df).T.to_csv("CurrentObs.csv", index=False, header=False, sep=",")
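One possible way to get one row per JSON file is to loop over the files and collect each observation as a dict before writing the CSV once. This is only a sketch; the glob pattern and field names are taken from the snippet above and are assumptions about the real data:

import glob
import pandas as pd

rows = []
for path in glob.glob("*.json"):  # hypothetical pattern matching the cached files
    df = pd.read_json(path)
    obs = df["current_observation"]
    rows.append({
        "station_id": obs["station_id"],
        "observation_time": obs["observation_time"],
        "weather": obs["weather"],
        "temperature_string": obs["temperature_string"],
        "latitude": obs["display_location"]["latitude"],
        "longitude": obs["display_location"]["longitude"],
    })

# One row per JSON file, same fields in every row
pd.DataFrame(rows).to_csv("CurrentObs.csv", index=False)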
I want to output an empty dataframe to a CSV file. I use this code:
df.repartition(1).write.csv(path, sep='\t', header=True)
But because there is no data in the dataframe, Spark won't write the header to the CSV file.
Then I modified the code to:
if df.count() == 0:
    empty_data = [f.name for f in df.schema.fields]
    df = ss.createDataFrame([empty_data], df.schema)
    df.repartition(1).write.csv(path, sep='\t')
else:
    df.repartition(1).write.csv(path, sep='\t', header=True)
It works, but I want to ask whether there is a better way, without the count function.
df.count() == 0 will make your driver program retrieve the count of all your dataframe partitions across the executors.
In your case I would check len(df.take(1)) == 0 instead (in Scala, df.take(1).isEmpty; Spark >= 2.1). Still slow, but preferable to a raw count().
To write only the header:
cols = '\t'.join(df.columns)
with open('./cols.csv', 'w') as f:
f.write(cols)
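Putting both ideas together, a possible sketch (using `df` and `path` from the question; the local header file name is just an example):

# Write only the header locally when the dataframe has no rows, otherwise let Spark
# write the data with a header as usual.
if len(df.take(1)) == 0:
    with open('./cols.csv', 'w') as f:
        f.write('\t'.join(df.columns))
else:
    df.repartition(1).write.csv(path, sep='\t', header=True)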
I have a .csv file with integer values that can have an NA value, which represents missing data.
Example file:
-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179
When trying to read it with
import tensorflow as tf
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)
It throws the following error later, on sess.run(_), on the 2nd iteration:
InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA
Is there a way to interpret the string "NA" as NaN or a similar value while reading the CSV in TensorFlow?
I recently ran into the same problem. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, and then converting it to float.
# Set up reading from CSV files
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
NUM_COLUMNS = XX # Specify number of expected columns
# Read values as string, set "NA" for missing values.
record_defaults = [[tf.cast("NA", tf.string)]] * NUM_COLUMNS
decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
# Replace every occurrence of "NA" with "-1"
no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"]*NUM_COLUMNS, decoded)
# Convert to float, combine to a single tensor with stack.
float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
But long term I plan on switching to TFRecords, because reading CSV is too slow for my needs.
I have seen similar questions being asked and responded to. However, no answer seems to address my specific needs.
The following code, which I took and adapted to suit my needs, successfully imports the files and relevant columns. However, it appends rows onto the df and does not merge the columns based on keys.
import glob
import pandas as pd
import os
path = r'./csv_weather_data'
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat(pd.read_csv(f, skiprows=47, skipinitialspace=True, usecols=['Year','Month','Day','Hour','DBT'],) for f in all_files)
Typical data structure is the following:
Year Month Day Hour DBT
1989 1 1 0 7.8
1989 1 1 100 8.6
1989 1 1 200 9.2
I would like to achieve the following:
import all csv files contained in a folder into a pandas df
merge the first 4 columns into 1 column of datetime values
merge all imported CSVs, using the newly created datetime value as an index, and add the DBT columns, with each DBT column taking the name of the imported CSV (DBT is the Dry Bulb Temperature of that weather file).
Any advice?
You should divide the problem into two steps:
First, define your import function. Here you need to build the datetime and set it as the index.
def my_import(f):
    df = pd.read_csv(f, skiprows=47, skipinitialspace=True, usecols=['Year', 'Month', 'Day', 'Hour', 'DBT'])
    # zero-pad month/day, and convert Hour (which appears to be in hundreds: 0, 100, 200, ...)
    # so that the fixed-width '%Y%m%d%H' format parses correctly
    df.loc[:, 'Date'] = pd.to_datetime(
        df.apply(lambda x: '%04d%02d%02d%02d' % (x['Year'], x['Month'], x['Day'], x['Hour'] // 100), axis=1),
        format='%Y%m%d%H')
    df.drop(['Year', 'Month', 'Day', 'Hour'], axis=1, inplace=True)
    df = df.set_index('Date')  # set_index returns a new frame, so assign it back
    return df
Then you concatenate by columns (axis=1):
df = pd.concat({f : my_import(f) for f in all_files}, axis = 1)
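If each DBT column should carry the name of its source file, as asked, one possible follow-up (a sketch; `all_files` as above) keeps only the DBT column per file and labels it with the file's base name:

import os

# Map each file's base name to its DBT series (indexed by the Date built in my_import)
frames = {os.path.splitext(os.path.basename(f))[0]: my_import(f)['DBT'] for f in all_files}

# Columns are named after the files; rows align on the shared datetime index
df = pd.concat(frames, axis=1)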
I have a CSV file formatted as follows:
somefeature,anotherfeature,f3,f4,f5,f6,f7,lastfeature
0,0,0,1,1,2,4,5
And I try to read it as a pandas Series (using pandas daily snapshot for Python 2.7).
I tried the following:
import pandas as pd
types = pd.Series.from_csv('csvfile.txt', index_col=False, header=0)
and:
types = pd.read_csv('csvfile.txt', index_col=False, header=0, squeeze=True)
But both just won't work: the first one gives a random result, and the second just imports a DataFrame without squeezing.
It seems like pandas can only recognize as a Series a CSV formatted as follows:
f1, value
f2, value2
f3, value3
But when the feature keys are in the first row instead of the first column, pandas does not want to squeeze it.
Is there something else I can try? Is this behaviour intended?
Here is the way I've found:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.ix[0,:]
Seems a bit stupid to me, as squeeze should already do this. Is this a bug or am I missing something?
/EDIT: Best way to do it:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.transpose()[0]  # here we convert the DataFrame into a Series
This is the most stable way to get a row-oriented CSV line into a pandas Series.
BTW, the squeeze=True argument is useless for now, because as of today (April 2013) it only works when the CSV has one value per row (the f1,value layout above); see the official doc:
http://pandas.pydata.org/pandas-docs/dev/io.html#returning-series
This works. squeeze still works, but it won't work alone; index_col needs to be set to zero, as below:
series = pd.read_csv('csvfile.csv', header = None, index_col = 0, squeeze = True)
In [28]: df = pd.read_csv('csvfile.csv')
In [29]: df.ix[0]
Out[29]:
somefeature 0
anotherfeature 0
f3 0
f4 1
f5 1
f6 2
f7 4
lastfeature 5
Name: 0, dtype: int64
ds = pandas.read_csv('csvfile.csv', index_col=False, header=0)
X = ds.iloc[0, :]  # .ix is deprecated; use .iloc to grab the first row
Since pandas' value selection logic is
DataFrame -> Series = DataFrame[Column] -> Values = Series[Index]
I suggest:
df = pandas.read_csv("csvfile.csv")
s = df[df.columns[0]]
from pandas import read_csv
series = read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
Since none of the answers above worked for me, here is another one, recreating the Series manually from the DataFrame.
import io
import pandas as pd

# create example series
series = pd.Series([0, 1, 2], index=["a", "b", "c"])
series.index.name = "idx"
print(series)
print()
# create csv
series_csv = series.to_csv()
print(series_csv)
# read csv
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
indx = df.index
vals = [df.iloc[i, 0] for i in range(len(indx))]
series_again = pd.Series(vals, index=indx)
print(series_again)
Output:
idx
a 0
b 1
c 2
dtype: int64
idx,0
a,0
b,1
c,2
idx
a 0
b 1
c 2
dtype: int64
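For what it's worth, the manual rebuild above can be shortened: once the CSV is back in a DataFrame, the first data column already is the Series you want. A small follow-up sketch, using the df from the snippet above:

# Equivalent to the vals/indx loop above: take the first (and only) data column
series_again = df.iloc[:, 0]
print(series_again)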