I have a dataframe df:
d = {'col1': [1, 2,0,55,12,3], 'col3': ['A','A','A','B','B','B'] }
df = pd.DataFrame(data=d)
df
col1 col3
0 1 A
1 2 A
2 0 A
3 55 B
4 12 B
6 3 B
and want to build a Json from it, as the results looks like this :
json_result = { 'A' : [1,2,0], 'B': [55,12,3] }
basically, I would like for each group of the col3 to affect an array of its corresponding values from the dataframe
Aggregate list and then use Series.to_json:
print (df.groupby('col3')['col1'].agg(list).to_json())
{"A":[1,2,0],"B":[55,12,3]}
or if need dictionary use Series.to_dict:
print (df.groupby('col3')['col1'].agg(list).to_dict())
{'A': [1, 2, 0], 'B': [55, 12, 3]}
Related
I have dataframe with the following columns:
ID A1 B1 C1 A2 B2 C2 A3 B3 C3
AA 1 3 6 4 0 6
BB 5 5 4 6 7 9
CC 5 5 5
I want to create a new column called Z that takes each row, groups them into a JSON list of records, and renames the column as their key. After the JSON column is constructed, I want to drop all the columns and keep Z and ID only.
Here is the output desired:
ID Z
AA [{"A":1, "B":3,"C":6},{"A":4, "B":0,"C":6}]
BB [{"A":5, "B":5,"C":4},{"A":6, "B":7,"C":9}]
CC [{"A":5, "B":5,"C":5}]
Here is my current attempt:
df2 = df.groupby(['ID']).apply(lambda x: x[['A1', 'B1', 'C1',
'A2', 'B2', 'C2', 'A3', 'B3', 'C3']].to_dict('records')).to_frame('Z').reset_index()
The problem is that I cannot rename the columns so that only the letter remains and the number is removed like the example above. Running the code above also does not separate each group of 3 into one object as opposed to creating two objects in my list. I would like to accomplish this in Pandas if possible. Any guidance is greatly appreciated.
Pandas solution
Convert the columns to MultiIndex by splitting and expanding around a regex delimiter, then stack the dataframe to convert the dataframe to multiindex series, then group the dataframe on level=0 and apply the to_dict function to create records per ID
s = df.set_index('ID')
s.columns = s.columns.str.split(r'(?=\d+$)', expand=True)
s.stack().groupby(level=0).apply(pd.DataFrame.to_dict, 'records').reset_index(name='Z')
Result
ID Z
0 AA [{'A': 1.0, 'B': 3.0, 'C': 6.0}, {'A': 4.0, 'B': 0.0, 'C': 6.0}]
1 BB [{'A': 5.0, 'B': 5.0, 'C': 4.0}, {'A': 6.0, 'B': 7.0, 'C': 9.0}]
2 CC [{'A': 5.0, 'B': 5.0, 'C': 5.0}]
Have you tried to go line by line?? I am not very good with pandas and python. But I have me this code. Hope it works for you.
toAdd = []
for row in dataset.values:
toAddLine = {}
i = 0
for data in row:
if data is not None:
toAddLine["New Column Name "+dataset.columns[i]] = data
i = i +1
toAdd.append(toAddLine)
dataset['Z'] = toAdd
dataset['Z']
# create a columns name map for chang related column
columns = dataset.columns
columns_map = {}
for i in columns:
columns_map[i] = f"new {i}"
def change_row_to_json(row):
new_dict = {}
for index, value in enumerate(row):
new_dict[columns_map[columns[index]]] = value
return json.dumps(new_dict, indent = 4)
dataset.loc[:,'Z'] = dataset.apply(change_row_to_json, axis=1)
dataset= dataset[["ID", "Z"]]
I just add a few lines on subham codes and it worked for me
import pandas as pd
from numpy import nan
data = pd.DataFrame({'ID': {0: 'AA', 1: 'BB', 2: 'CC'}, 'A1': {0: 1, 1: 5, 2: 5}, 'B1': {0: 3, 1: 5, 2: 5}, 'C1': {0: 6, 1: 4, 2: 5}, 'A2': {0: nan, 1: 6.0, 2: nan}, 'B2': {0: nan, 1: 7.0, 2: nan}, 'C2': {0: nan, 1: 9.0, 2: nan}, 'A3': {0: 4.0, 1: nan, 2: nan}, 'B3': {0: 0.0, 1: nan, 2: nan}, 'C3': {0: 6.0, 1: nan, 2: nan}} )
data
data.index = data.ID
data.drop(columns=['ID'],inplace=True)
data
data.columns = data.columns.str.split(r'(?=\d+$)', expand=True)
d = data.stack().groupby(level=0).apply(pd.DataFrame.to_dict, 'records').reset_index(name='Z')
d.index = d.ID
d.drop(columns=['ID'],inplace=True)
d.to_dict()['Z']
Now we can see we get desired output thanks, #shubham Sharma, for the answer I think this might help
I am quite new to Pyspark, here is what I try to do, below is the table, type is ArrayType(DoubleType), ArrayType(DecimalType)
A
B
[1,2]
[2,4]
[1,2,4]
[1,3,3]
What I want to do is treat A and B as np.array, then pass a function to do calculation.
def func(row):
a = row.A
b = row.B
res = some-function(a,b)
return res
What I am trying now is
res = a.rdd.map(func)
resDF = res.toDF(res)
resDF.show()
But I am receiving the following error, could someone guide me a bit here? Thank you.
TypeError: schema should be StructType or list or None, but got: PythonRDD[167] at RDD at PythonRDD.scala:53
You can use pandas_udf
sample data
df = spark.createDataFrame([
([1,2], [2,4]),
([1,2,4], [1,3,3]),
], 'a array<int>, b array<int>')
df.show()
+---------+---------+
|a |b |
+---------+---------+
|[1, 2] |[2, 4] |
|[1, 2, 4]|[1, 3, 3]|
+---------+---------+
create new column with pandas_udf
#F.pandas_udf("array<int>")
def func(a, b):
return a * b
df.withColumn('c', func('a', 'b')).show()
+---------+---------+----------+
| a| b| c|
+---------+---------+----------+
| [1, 2]| [2, 4]| [2, 8]|
|[1, 2, 4]|[1, 3, 3]|[1, 6, 12]|
+---------+---------+----------+
I'm unable to make this JSON:
{
“profiles”: {
“1”: {
“id”: “1”,
“property1”: “value1”,
“property2”: “value2”
},
“2”: {
“id”: “2”,
“property1”: “value21”,
“property2”: “value22”
}
}}
To this format
Desired output
Id Property1 Property2
1 Value1 Value2
2 Value21 Value22
I've attempted different approaches, that just result in one col all data.
Can someone please orient me on this?
Based on this example:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
I would suggest something like:
your_json = {<your_json>}
property1 = []
property2 = []
for key, value in your_json.items():
for k, v in value.items():
property1.append(v['property1'])
property2.append(v['property2'])
data = {'property1': property1, 'property2': property2}
tt = pd.DataFrame.from_dict(data)
print(tt)
my test code:
local jsonc = require "jsonc"
local x = {
a = 1,
b = 2,
c = 3,
d = 4,
e = 5,
}
for k, v in pairs(x) do
print(k,v)
end
print(jsonc.stringify(x))
output:
a 1
c 3
b 2
e 5
d 4
{"a":1,"c":3,"b":2,"e":5,"d":4}
someone help:
from for pairs output, lua store table by key hash order, how can i change it?
i need output: {"a":1,"b":2,"c":3,"d":4,"e":5}
thanks
Lua tables can't preserve the order of their keys. There are two possible solutions.
You can store the keys in a separate array and iterate through that whenever you need to iterate through the table:
local keys = {'a', 'b', 'c', 'd', 'e'}
Or, instead of a hash table, you can use an array of pairs:
local x = {
{'a', 1},
{'b', 2},
{'c', 3},
{'d', 4},
{'e', 5},
}
I have a JSON data source that is a list of objects. Some of the object properties are themselves lists. I want to turn the whole thing into a data frame, preserving the lists as data frame values.
Example JSON data:
[{
"id": "A",
"p1": [1, 2, 3],
"p2": "foo"
},{
"id": "B",
"p1": [4, 5, 6],
"p2": "bar"
}]
Desired data frame:
id p2 p1
1 A foo 1, 2, 3
2 B bar 4, 5, 6
Failed attempt 1
I have found this nicely straightforward way of parsing my JSON:
unlisted_data <- lapply(fromJSON(json_str), function(x){unlist(x)})
data.frame(do.call("rbind", unlisted_data))
However, the unlisting process spreads my repeated value across multiple columns:
id p11 p12 p13 p2
1 A 1 2 3 foo
2 B 4 5 6 bar
I expected that calling unlist with the recursive = FALSE option would take care of this, but it doesn't.
Failed attempt 2
I noticed that I can almost do this with the I function:
> data.frame(I(parsed_json[[1]]))
parsed_json..1..
id A
p1 1, 2, 3
p2 foo
But the rows and columns are reversed. Transposing the result mangles the repeated data:
> t(data.frame(I(parsed_json[[1]])))
id p1 p2
parsed_json..1.. "A" Numeric,3 "foo"
The jsonlite package can handle this just fine:
library(jsonlite)
fromJSON(txt)
# id p1 p2
#1 A 1, 2, 3 foo
#2 B 4, 5, 6 bar
fromJSON(txt)$p1
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 4 5 6