I have a JSON file like the one below. How can I make a DataFrame out of it? I want the main keys to become the index and the subkeys to become columns.
{
"PACK": {
"labor": "Recycle",
"actual": 0,
"Planned": 2,
"max": 6
},
"SORT": {
"labor": "Mix",
"actual": 10,
"Planned": 4,
"max": 3
}
}
The expected output is something like the table below. I tried to use df.T, but it does not work. Any help on this is appreciated.
      actual  planned
PACK       0        2
SORT      10        4
You can read your JSON file into a dict, then create a DataFrame with the dict values as data and the dict keys as the index.
import json
import pandas as pd
with open('test.json') as f:
    data = json.load(f)
df = pd.DataFrame(data.values(), index=data.keys())
print(df)
        labor  actual  Planned  max
PACK  Recycle       0        2    6
SORT      Mix      10        4    3
Then select the columns you need. Column names are case-sensitive, so use 'Planned' exactly as it appears in the JSON:
df = df[['actual', 'Planned']]
Pandas can read JSON files in many formats. For your use case, the following option should read your data the way you want:
pd.read_json(json_file, orient="index")
More information about the orient option can be found at the official documentation.
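As a rough illustration of the orient="index" option, here is a minimal sketch that reads the JSON from the question out of an in-memory string instead of a file. The variable names (raw, subset) are made up for this example, and io.StringIO only stands in for the real file path.

```python
import io

import pandas as pd

# Inline copy of the JSON from the question (stands in for the file on disk).
raw = """
{
  "PACK": {"labor": "Recycle", "actual": 0, "Planned": 2, "max": 6},
  "SORT": {"labor": "Mix", "actual": 10, "Planned": 4, "max": 3}
}
"""

# orient="index" turns the outer keys into the row index
# and the inner keys into columns.
df = pd.read_json(io.StringIO(raw), orient="index")

# Column names are case-sensitive: the key is "Planned", not "planned".
subset = df[["actual", "Planned"]]
print(subset)
```

If lowercase headers are wanted, as in the expected output, subset.rename(columns=str.lower) should give them.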
Related
I was wondering if there is a way to remove or replace null/empty square brackets in JSON or in a pandas DataFrame. I tried replacing them after converting to string via .astype(str). That works, but it converts all JSON values into strings, so I cannot process the data further with the same structure. I would appreciate any solution or recommendation. Thanks.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame({"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]})
print(df)
# Output
Here is one way to do it:
df = df.applymap(lambda x: pd.NA if isinstance(x, list) and not x else x)
print(df)
# Output
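Newer pandas releases (2.1 and later) deprecate DataFrame.applymap, so the same replacement can also be sketched column by column with Series.map, which should work across versions. The helper name empty_list_to_na is invented for this example.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]})


def empty_list_to_na(value):
    # Only empty lists are replaced; every other value is returned untouched,
    # so non-empty lists and strings keep their original type.
    if isinstance(value, list) and not value:
        return pd.NA
    return value


# Series.map applies the function element by element, one column at a time.
cleaned = df.apply(lambda col: col.map(empty_list_to_na))
print(cleaned)
```

Because nothing is cast to str, the non-empty list in col1 is still a real list afterwards.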
I have multiple JSON files that need to be converted into one CSV file.
These are the example JSON files:
tryout1.json
{
"Product":{
"one":"Desktop Computer",
"two":"Tablet",
"three":"Printer",
"four":"Laptop"
},
"Price":{
"five":700,
"six":250,
"seven":100,
"eight":1200
}}
tryout2.json
{
"Product":{
"one":"dell xps tower",
"two":"ipad",
"three":"hp office jet",
"four":"macbook"
},
"Price":{
"five":500,
"six":200,
"seven":50,
"eight":1000
}}
This is the Python code that I wrote to convert those 2 JSON files:
import pandas as pd
df1 = pd.read_json('/home/mich/Documents/tryout.json')
print(df1)
df2 = pd.read_json('/home/mich/Documents/tryout2.json')
print(df2)
df = pd.concat([df1, df2])
df.to_csv ('/home/mich/Documents/tryout.csv', index = None)
result = pd.read_csv('/home/mich/Documents/tryout.csv')
print(result)
But I didn't get the result I need. How can I put the first JSON file in one column (both product and price) and the second file in the next column?
You can first create a combined column of products and prices in each DataFrame, then concatenate the two.
I am using axis=1 since I want them combined side by side (as columns).
axis=0 would combine them by rows.
import pandas as pd
df1 = pd.read_json('/home/mich/Documents/tryout.json')
df1['product_price'] = df1['Product'].fillna(df1['Price'])
df2 = pd.read_json('/home/mich/Documents/tryout2.json')
df2['product_price'] = df2['Product'].fillna(df2['Price'])
pd.concat([df1['product_price'], df2['product_price']],axis=1)
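To make the idea easier to try without the files, here is a small self-contained sketch that uses in-memory copies of the two JSON documents. The helper product_price and the column names tryout1/tryout2 are assumptions made for this example only.

```python
import pandas as pd

# In-memory copies of the two JSON files from the question.
tryout1 = {
    "Product": {"one": "Desktop Computer", "two": "Tablet",
                "three": "Printer", "four": "Laptop"},
    "Price": {"five": 700, "six": 250, "seven": 100, "eight": 1200},
}
tryout2 = {
    "Product": {"one": "dell xps tower", "two": "ipad",
                "three": "hp office jet", "four": "macbook"},
    "Price": {"five": 500, "six": 200, "seven": 50, "eight": 1000},
}


def product_price(data, name):
    frame = pd.DataFrame(data)
    # Each row has a value in exactly one of the two columns,
    # so filling the gaps in Product with Price merges them into one column.
    return frame["Product"].fillna(frame["Price"]).rename(name)


# axis=1 places the two Series side by side, aligned on the shared index.
result = pd.concat(
    [product_price(tryout1, "tryout1"), product_price(tryout2, "tryout2")],
    axis=1,
)
print(result)
```

From here, result.to_csv(path, index=False) would write the two columns to a single CSV file.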
I have a delimited file in which one column contains key=value pairs, including a JSON payload. I need to parse this data into a DataFrame.
Below is the record format:
trx_id|name|service_context|status
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
I need to convert all the information in these records into this format:
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently I parse the key=value data manually with sep=';|[|]', remove the part behind the '=', and update the column names.
For the JSON, I run the command below, but the result replaces the existing table and contains only the parsed JSON.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to avoid manual steps when processing this type of data?
The hint below should be sufficient to solve the problem.
Do it part by part for each column and then merge the parts together (once a column has been split into multiple columns, drop the original):
import json
import pandas as pd
# df3 is assumed to hold the parsed records, with the raw "type=...;payload={...}" text in 'service_context'
x = pd.json_normalize(df3['service_context'].apply(lambda s: json.loads(s.split('payload=', 1)[1])).tolist()).add_prefix('payload.')
# one column per entry of the 'counter' list
y = pd.DataFrame(x['payload.counter'].apply(lambda c: [i['counter_type'] for i in c]).to_list())
y = y.rename(columns={0: 'counter_type', 1: 'counter_info'})
for row in x['payload.product']:
    z1 = pd.json_normalize(row)
    z2 = pd.json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
It's really a 3-step approach:
use the primary pipe | delimiter
extract the key/value pairs
normalize the JSON
import pandas as pd
import io, json
# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
sep="|", header=None, names=["trx_id","name","data","status"])
df2 = pd.concat([
    df,
    # split out the ;-delimited key=value sub-columns in the 3rd column
    pd.DataFrame(
        [[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)
# extract the JSON payload into columns. This leaves embedded lists as they are, since these are
# many-to-many relationships that need to be worked out by the data owner
df3 = pd.concat([df2,
    pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)
output
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
use with caution - explode() embedded lists
df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)
I'm currently working with PySpark and the Great Language Game dataset, which contains several samples as JSON objects like the one shown below.
Each of these samples represents an instance of the game, in which a person listened to an audio file of a spoken language and then had to choose, out of four possible languages, which one she had just heard.
I now want to aggregate all these games on, say, the "target" and "guess" fields and then count the number of games for each ("target", "guess") pair.
Can someone give me some help on how to get this done?
I already had a look at the PySpark documentation, but as I'm quite new to Python/PySpark I didn't really understand how the aggregate function works.
{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
"choices": ["Hindi", "Lao", "Maltese", "Turkish"],
"guess": "Maltese", "date": "2013-08-19", "country": "AU"}
Converting JSON data into a PySpark DataFrame can be done this way:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import json
sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)
def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string
json_list = [{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
"choices": ["Hindi", "Lao", "Maltese", "Turkish"],
"guess": "Maltese", "date": "2013-08-19", "country": "AU"}]
json_string = convert_single_object_per_line(json_list)
df = sqlContext.createDataFrame([json.loads(line) for line in json_string.splitlines()])
[In]:df
[Out]:
DataFrame[choices: array<string>, country: string, date: string, guess: string, sample: string, target: string]
[In]:df.show()
[Out]:
+--------------------+-------+----------+-------+--------------------+-------+
| choices|country| date| guess| sample| target|
+--------------------+-------+----------+-------+--------------------+-------+
|[Hindi, Lao, Malt...| AU|2013-08-19|Maltese|af0e25c7637fb0dcd...|Turkish|
+--------------------+-------+----------+-------+--------------------+-------+
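The count per ("target", "guess") pair that the question asks about is a group-and-count. On a PySpark DataFrame like the one above, this is usually written as df.groupBy("target", "guess").count(). The plain-Python sketch below shows the same computation on a few made-up records (they are not taken from the dataset), so the result is easy to check by hand.

```python
from collections import Counter

# Hypothetical sample records with the same field names as the dataset.
games = [
    {"target": "Turkish", "guess": "Maltese"},
    {"target": "Turkish", "guess": "Turkish"},
    {"target": "Turkish", "guess": "Maltese"},
    {"target": "Hindi", "guess": "Hindi"},
]

# Group by the (target, guess) pair and count how many games fall in each group.
pair_counts = Counter((game["target"], game["guess"]) for game in games)

for (target, guess), n_games in pair_counts.most_common():
    print(target, guess, n_games)
```

Each key of pair_counts plays the role of one row in the grouped PySpark result, and the value corresponds to its count column.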
I have a DataFrame df that is the result of some pre-processing. It has around 10,000 rows.
I save this DataFrame as CSV as follows:
df.coalesce(1).write.option("sep",";").option("header","true").csv("output/path")
Now I want to save this DataFrame as a txt file in which each row is a JSON string, with the column names used as the attribute names in the JSON strings.
For example:
df =
col1 col2 col3
aa 34 55
bb 13 77
json_txt =
{"col1": "aa", "col2": "34", "col3": "55"}
{"col1": "bb", "col2": "13", "col3": "77"}
Which is the best way to do it?
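You can use the write.json API to save a DataFrame in JSON format:
df.coalesce(1).write.json("output path of json file")
The code above creates an output directory containing a single JSON part file. If you want a text file of JSON strings instead, you can use the toJSON API:
df.toJSON.rdd.coalesce(1).saveAsTextFile("output path to text file")
The line above uses Scala syntax. In PySpark, toJSON() is a method that already returns an RDD of strings, so the equivalent call is df.toJSON().coalesce(1).saveAsTextFile("output path to text file").
I hope the answer is helpful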
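Since the DataFrame is small (around 10,000 rows), another option is to collect it to pandas with toPandas() and write one JSON object per line there. The sketch below uses a hand-built pandas DataFrame as a stand-in for the result of df.toPandas(); the names pdf, json_txt and records are made up for this example.

```python
import json

import pandas as pd

# Hand-built stand-in for the result of df.toPandas().
pdf = pd.DataFrame({"col1": ["aa", "bb"], "col2": [34, 13], "col3": [55, 77]})

# orient="records" with lines=True emits one JSON object per row,
# using the column names as the attribute names.
json_txt = pdf.to_json(orient="records", lines=True)
print(json_txt)

# Parse the text back to confirm that every line is a valid JSON object.
records = [json.loads(line) for line in json_txt.splitlines() if line]
```

Passing a file path as the first argument of to_json writes the same text to disk. Numeric columns stay numeric in the output; if string values are needed, as in the example in the question, the columns would have to be cast first, for instance with pdf.astype(str).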