I have a list of JSON values (actually a text file where every line is one JSON object), like this:
{ "id": 1, "name": "john", "age": 18, "education": "master" }
{ "id": 2, "name": "jack", "job": "clerk" }
...
Some of the values can be missing (e.g. the first item has no "job" value, and the second item has no "education" or "age").
I need to create a data frame in R and fill all missing column values with NA (a column should exist if a field with that name appears in at least one row). What is the easiest way to achieve this?
What I have done so far: I installed the "rjson" package and parsed these lines into R lists. Let's assume that the lines variable is a character vector of lines.
library(rjson)
# lines <- ...  (initialize "lines" here: a character vector with one JSON object per element)
jsons <- lapply(lines, fromJSON)
The "jsons" variable is now a "list of lists" (every JSON object is converted to a list, in R terminology). How do I convert it to a data.frame?
I want to see the following data frame for the example I provided:
"id" | "name" | "age" | "education" | "job"
-------------------------------------------
1    | "john" | 18    | "master"    | NA
2    | "jack" | NA    | NA          | "clerk"
From plyr, you can use rbind.fill to add the NAs for you:
library(plyr)
rbind.fill(lapply(jsons, data.frame))
# id name age education job
# 1 1 john 18 master <NA>
# 2 2 jack NA <NA> clerk
or rbindlist from data.table:
library(data.table)
rbindlist(jsons, fill=TRUE)
and bind_rows from dplyr:
library(dplyr)
bind_rows(lapply(jsons, data.frame))
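Future me, correcting past me's mistakes: it would make more sense to use jsonlite's stream_in, which reads line-delimited JSON from a connection.
library(jsonlite)
stream_in(file(txtfile))  # txtfile is the path to the text file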
# To test on `txt` from below, try:
# stream_in(textConnection(txt))
# Found 2 records...
# Imported 2 records. Simplifying...
# id name age education job
#1 NA john 18 master <NA>
#2 2 jack NA <NA> clerk
Use the jsonlite package's fromJSON function, after making a few inline edits to your original text data (I've also edited the first piece of id data to include an explicit null value, to show that it deals with this):
fromJSON(paste0("[", gsub("}\n", "},\n", txt), "]"))
# id name age education job
#1 NA john 18 master <NA>
#2 2 jack NA <NA> clerk
All I did was add a little formatting to wrap all the JSON lines together in [ and ] and add a comma at the end of each closing } - resulting in an output like the below which can be processed all at once by jsonlite::fromJSON:
[{"1":"one"},{"2":"two"}]
Where txt was your lines of data as presented, with a null in the id variable:
txt <- "{ \"id\": null, \"name\": \"john\", \"age\": 18, \"education\": \"master\" }
{ \"id\": 2, \"name\": \"jack\", \"job\": \"clerk\" }"
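For readers who want to see what the "fill missing columns with NA" step amounts to, here is a minimal sketch of the same logic in Python rather than R, using only the standard library. It is an illustration, not a replacement for the R functions above; the variable names (`records`, `columns`, `table`) are my own, and `None` plays the role of NA.

```python
import json

# Sample input mirroring the two lines from the question.
lines = [
    '{ "id": 1, "name": "john", "age": 18, "education": "master" }',
    '{ "id": 2, "name": "jack", "job": "clerk" }',
]

# Parse every non-empty line into a dict.
records = [json.loads(line) for line in lines if line.strip()]

# Collect column names in first-seen order across all records.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Build one row per record; dict.get returns None for a missing field.
table = [[record.get(col) for col in columns] for record in records]

print(columns)
for row in table:
    print(row)
```

A column appears as soon as any single record mentions it, which is the behaviour the question asks for.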
I have a dataframe in the format below.
Input:
id | Name_type | Name | Car
---------------------------------
1  | First     | rob  | Nissan
2  | First     | joe  | Hyundai
1  | Last      | dent | Infiniti
2  | Last      | Kent | Genesis
I need to transform this into a single JSON column per key column (id), prefixing each column name with that row's Name_type value, as shown below.
Result expected:
id | json_column
----------------
1  | {"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}
2  | {"First_Name":"joe","First_Car":"Hyundai","Last_Name":"Kent","Last_Car":"Genesis"}
With the piece of code below
from pyspark.sql.functions import to_json, struct
column_set = ['Name', 'Car']
df = df.withColumn("json_data", to_json(struct([df[x] for x in column_set])))
I was able to generate data as
id | Name_type | Json_data
--------------------------------------------------
1  | First     | {"Name":"rob", "Car": "Nissan"}
2  | First     | {"Name":"joe", "Car": "Hyundai"}
1  | Last      | {"Name":"dent", "Car": "infiniti"}
2  | Last      | {"Name":"kent", "Car": "Genesis"}
So I was able to create a JSON column using to_json for a given row.
But I can't figure out how to prefix the column names with the row's Name_type value and combine the rows into one JSON object per key column.
To do what you want, you first need to manipulate your input dataframe a little bit. You can do this by grouping by the id column, and pivoting around the Name_type column like so:
from pyspark.sql.functions import first
df = spark.createDataFrame(
[
("1", "First", "rob", "Nissan"),
("2", "First", "joe", "Hyundai"),
("1", "Last", "dent", "Infiniti"),
("2", "Last", "Kent", "Genesis")
],
["id", "Name_type", "Name", "Car"]
)
output = df.groupBy("id").pivot("Name_type").agg(first("Name").alias('Name'), first("Car").alias('Car'))
output.show()
+---+----------+---------+---------+--------+
| id|First_Name|First_Car|Last_Name|Last_Car|
+---+----------+---------+---------+--------+
| 1| rob| Nissan| dent|Infiniti|
| 2| joe| Hyundai| Kent| Genesis|
+---+----------+---------+---------+--------+
Then you can reuse the same code you already had to get the result you want, but with 4 columns instead of 2:
from pyspark.sql.functions import to_json, struct
column_set = ['First_Name','First_Car', 'Last_Name', 'Last_Car']
output = output.withColumn("json_data", to_json(struct([output[x] for x in column_set])))
output.show(truncate=False)
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|id |First_Name|First_Car|Last_Name|Last_Car|json_data |
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|1 |rob |Nissan |dent |Infiniti|{"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}|
|2 |joe |Hyundai |Kent |Genesis |{"First_Name":"joe","First_Car":"Hyundai","Last_Name":"Kent","Last_Car":"Genesis"}|
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
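To make the pivot-then-serialize idea concrete without a Spark session, here is a small plain-Python sketch of roughly what the two steps compute. It is an approximation for illustration only, not how Spark implements them; the names `pivoted` and `json_by_id` are my own.

```python
import json
from collections import defaultdict

rows = [
    ("1", "First", "rob", "Nissan"),
    ("2", "First", "joe", "Hyundai"),
    ("1", "Last", "dent", "Infiniti"),
    ("2", "Last", "Kent", "Genesis"),
]
value_columns = ["Name", "Car"]

# Group by id and build "<Name_type>_<column>" keys, like the pivot does.
pivoted = defaultdict(dict)
for row_id, name_type, *values in rows:
    for column, value in zip(value_columns, values):
        # setdefault keeps the first value seen, mirroring first().
        pivoted[row_id].setdefault(f"{name_type}_{column}", value)

# Serialize each group to a compact JSON string, like to_json(struct(...)).
json_by_id = {
    row_id: json.dumps(fields, separators=(",", ":"))
    for row_id, fields in pivoted.items()
}

for row_id, payload in json_by_id.items():
    print(row_id, payload)
```

The key point is that the new column names come from combining the pivot value with the aggregated column's alias, which is why the pivot has to happen before to_json.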
In a Groovy script running on a Jenkins pipeline, I am using Jenkins' readFile function to read a CSV file.
Example CSV:
name   | val1 | val2
----------------------
John   | 2    | 122
John   | 2    | 012
Bertha | 2    | 0021
John   | 3    | 20
Philip | 3    | 12022
Bertha | 3    | 162021
John   | 3    | 2022
What I am trying to achieve is to call another function once for each distinct value in the "name" column.
The Groovy script flow would be something like:
call functionX (name, rest of values) with:
name   | val1 | val2
----------------------
John   | 2    | 122
John   | 2    | 012
John   | 3    | 20
John   | 3    | 2022
then call functionX (name, rest of values) with:
name   | val1 | val2
----------------------
Philip | 3    | 12022
then call functionX (name, rest of values) with:
name   | val1 | val2
----------------------
Bertha | 2    | 0021
Bertha | 3    | 162021
Note:
The order (John, Philip, Bertha) is not important!
I think I can achieve this with closures, but I'm not quite sure since I'm pretty new to the topic.
Is this something like what you are looking for?
def functionX(name, val1, val2) {
    if (name == 'name') return  // skip the header row
    println("Name: $name, V1: $val1, V2: $val2")
}

new File('names.csv').readLines().sort { it }.each {
    println it
    functionX(*(it.split(',')))
}
Output:
Bertha,2,21
Name: Bertha, V1: 2, V2: 21
Bertha,3,162021
Name: Bertha, V1: 3, V2: 162021
John,2,12
Name: John, V1: 2, V2: 12
John,2,122
Name: John, V1: 2, V2: 122
John,3,20
Name: John, V1: 3, V2: 20
John,3,2022
Name: John, V1: 3, V2: 2022
Philip,3,12022
Name: Philip, V1: 3, V2: 12022
name,val1,val2
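The snippet above calls the function once per row, in sorted order. If the goal is one call per distinct name with all of that name's rows, the rows need to be grouped first. Here is a minimal sketch of that grouping idea, written in Python rather than Groovy purely as an illustration; `function_x` is a hypothetical stand-in for functionX, and the CSV text is the sample from the question.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question; in practice this text would come from the file.
csv_text = """name,val1,val2
John,2,122
John,2,012
Bertha,2,0021
John,3,20
Philip,3,12022
Bertha,3,162021
John,3,2022
"""


def function_x(name, rows):
    """Hypothetical stand-in for functionX: receives one name and all its rows."""
    return f"{name}: {rows}"


# DictReader consumes the header line, so it is never treated as data.
groups = defaultdict(list)
for record in csv.DictReader(io.StringIO(csv_text)):
    # Values stay strings, so leading zeros such as "0021" are preserved.
    groups[record["name"]].append((record["val1"], record["val2"]))

# One call per distinct name.
results = [function_x(name, rows) for name, rows in groups.items()]
for line in results:
    print(line)
```

In Groovy, the built-in groupBy method on collections should allow the same shape of solution, though I have not tested that inside a Jenkins pipeline.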
I have a two-fold issue and am looking for clues as to how to approach it.
I have a JSON file that is formatted like this:
{
  "code": 2000,
  "data": {
    "1": {
      "attribute1": 40,
      "attribute2": 1.4,
      "attribute3": 5.2,
      "attribute4": 124,
      "attribute5": "65.53%"
    },
    "94": {
      "attribute1": 10,
      "attribute2": 4.4,
      "attribute3": 2.2,
      "attribute4": 12,
      "attribute5": "45.53%"
    },
    "96": {
      "attribute1": 17,
      "attribute2": 9.64,
      "attribute3": 5.2,
      "attribute4": 62,
      "attribute5": "51.53%"
    }
  },
  "message": "SUCCESS"
}
My goals are to:
1. Sort the data by any of the attributes.
2. Grab the top 5 of the roughly 100 entries (depending on how they are sorted), then...
3. Output the data in a table, e.g.:
These are sorted by: attribute5
---
attribute1 | attribute2 | attribute3 | attribute4 | attribute5
40         | 1.4        | 5.2        | 124        | 65.53%
17         | 9.64       | 5.2        | 62         | 51.53%
10         | 4.4        | 2.2        | 12         | 45.53%
*Also, attribute5 above is a string value.
Admittedly, my knowledge here is very limited.
I attempted to mimic the method used here:
python sort list of json by value
I managed to open the file and I can extract the key values from a sample row:
import json

jsonfile = "path-to-my-file.json"
with open(jsonfile) as j:
    data = json.load(j)

# keys of a sample row
k = data["data"]["1"].keys()
print(k)

# values of every row
total = data["data"]
for row in total:
    v = data["data"][str(row)].values()
    print(v)
this outputs:
dict_keys(['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5'])
dict_values([1, 40, 1.4, 5.2, 124, '65.53%'])
dict_values([94, 10, 4.4, 2.2, 12, '45.53%'])
dict_values([96, 17, 9.64, 5.2, 62, '51.53%'])
Any point in the right direction would be GREATLY appreciated.
Thanks!
If you don't mind using pandas you could do it like this
import pandas as pd
rows = [v for k,v in data["data"].items()]
df = pd.DataFrame(rows)
# Sort by any attribute (set ascending=False for descending order);
# head() returns the first 5 rows of the sorted frame.
df.sort_values('attribute1', ascending=True).head()
This will allow you to sort by any attribute you need at any time and print out a table.
Which will produce output like this depending on what you sort by
attribute1 attribute2 attribute3 attribute4 attribute5
0 40 1.40 5.2 124 65.53%
1 10 4.40 2.2 12 45.53%
2 17 9.64 5.2 62 51.53%
I'll leave this answer here in case you don't want to use pandas, but the answer from @MatthewBarlowe is way less complicated and I recommend that.
For sorting by a specific attribute, this should work:
import json

SORT_BY = "attribute4"

with open("test.json") as j:
    data = json.load(j)

items = data["data"]
sorted_keys = list(sorted(items, key=lambda key: items[key][SORT_BY], reverse=True))
Now, sorted_keys is a list of the keys in order of the attribute they were sorted by.
Then, to print this as a table, I used the tabulate library. The final code for me looked like this:
from tabulate import tabulate
import json

SORT_BY = "attribute4"

with open("test.json") as j:
    data = json.load(j)

items = data["data"]
sorted_keys = list(sorted(items, key=lambda key: items[key][SORT_BY], reverse=True))

print(f"\nSorted by: {SORT_BY}")
print(
    tabulate(
        [
            [sorted_keys[i], *items[sorted_keys[i]].values()]
            for i, _ in enumerate(items)
        ],
        headers=["Column", *items["1"].keys()],
    )
)
When sorting by 'attribute5', this outputs:
Sorted by: attribute5
Column attribute1 attribute2 attribute3 attribute4 attribute5
-------- ------------ ------------ ------------ ------------ ------------
1 40 1.4 5.2 124 65.53%
96 17 9.64 5.2 62 51.53%
94 10 4.4 2.2 12 45.53%
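One caveat when sorting by attribute5: it is a string, so it compares character by character, and a value like "9.5%" would rank above "65.53%". A minimal sketch of one way around that, which also keeps only the top 5, is below. It assumes the strings always look like "<number>%"; the helper `sort_value`, the constant `TOP_N` and the trimmed sample data are my own additions.

```python
import json

# Trimmed-down sample with the same shape as the file in the question.
raw = """
{"code": 2000,
 "data": {
   "1":  {"attribute1": 40, "attribute5": "65.53%"},
   "94": {"attribute1": 10, "attribute5": "45.53%"},
   "96": {"attribute1": 17, "attribute5": "51.53%"}
 },
 "message": "SUCCESS"}
"""
items = json.loads(raw)["data"]


def sort_value(value):
    """Turn '65.53%' into 65.53; leave numbers unchanged."""
    if isinstance(value, str) and value.endswith("%"):
        return float(value.rstrip("%"))
    return value


SORT_BY = "attribute5"
TOP_N = 5

# Sort the keys by the numeric value of the chosen attribute, largest first,
# and keep at most TOP_N of them.
top_keys = sorted(
    items, key=lambda key: sort_value(items[key][SORT_BY]), reverse=True
)[:TOP_N]
print(top_keys)
```

The resulting `top_keys` list can be passed to the tabulate call in place of `sorted_keys`.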
I have a list of objects as JSON. Each object has two properties: id (string) and arg (number).
When I use pandas.read_json(...), the resulting DataFrame has the id interpreted as number as well, which causes problems, since information is lost.
import pandas as pd
json = '[{ "id" : "1", "arg": 1 },{ "id" : "1_1", "arg": 2}, { "id" : "11", "arg": 2}]'
df = pd.read_json(json)
I'd expect to have a DataFrame like this:
id arg
0 "1" 1
1 "1_1" 2
2 "11" 2
I get
id arg
0 1 1
1 11 2
2 11 2
and suddenly, the once unique id is not so unique anymore.
How can I tell pandas to stop doing that?
My search so far only yielded results where people were trying to achieve the opposite: having columns of strings interpreted as numbers. I definitely don't want that in this case!
If you set the dtype parameter to False, read_json will not infer the types automatically:
df = pd.read_json(json, dtype=False)
Use the dtype parameter to prevent the id column from being cast to numbers:
df = pd.read_json(json, dtype={'id':str})
print (df)
id arg
0 1 1
1 1_1 2
2 11 2
print (df.dtypes)
id object
arg int64
dtype: object
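As for why "1_1" turns into 11 in the first place: my guess (an assumption, not something confirmed in this thread) is that the automatic conversion ends up using Python-style integer parsing, which accepts underscores as digit separators. The standard-library sketch below shows that parsing rule, and that a plain JSON parse leaves the ids untouched.

```python
import json

json_text = '[{ "id" : "1", "arg": 1 },{ "id" : "1_1", "arg": 2}, { "id" : "11", "arg": 2}]'

# Python accepts underscores as digit separators when parsing integers,
# so "1_1" and "11" collapse to the same number once converted.
assert int("1_1") == int("11") == 11

# A plain JSON parse performs no such conversion: the ids stay strings.
records = json.loads(json_text)
ids = [record["id"] for record in records]
print(ids)
```

This is why switching off or overriding the type inference, as in the answers above, keeps the ids unique.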
I have a tab-separated ('\t') text file. The first two columns are text and the third one is in JSON format, like {type: [{a: a1, timestamp: 1}, {a:a2, timestamp: 2}]}
How can I put it into a data frame correctly?
I would like to parse a line like factor1\tparam1\t{type: [{a: a1, timestamp: 1}, {a:a2, timestamp: 2}]} into a data frame like
factor_column param_column a_column ts_column
factor1 param1 a1 1
factor1 param1 a2 2
I have saved that one line of text you have provided into a file called 'parseJSON.txt'. You can then read the file in as per usual using read.table, then make use of library(jsonlite) to parse the 3rd column.
I've also formatted the line of text to include quotes around the JSON code:
factor1 param1 {"type": [{"a": "a1", "timestamp": 1}, {"a":"a2", "timestamp": 2}]}
library(jsonlite)
dat <- read.table("parseJSON.txt",
                  sep="\t",
                  header=FALSE,
                  quote="")
#parse 3rd column using jsonlite
js <- fromJSON(as.character(dat[1,3]))
js is now a list
> js
$type
a timestamp
1 a1 1
2 a2 2
which can be combined with the first two columns of dat
res <- cbind(dat[,1:2],js$type)
names(res) <- c("factor_column", "param_column", "a_column", "ts_column")
which gives
> res
factor_column param_column a_column ts_column
1 factor1 param1 a1 1
2 factor1 param1 a2 2
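The same split-then-expand idea can be sketched in a few lines of Python with only the standard library. This is an illustration of the logic, not part of the R answer, and like the R version it assumes the third column has been made valid JSON (quoted keys and strings).

```python
import json

# One tab-separated line; the third field is valid JSON (keys and strings quoted).
line = 'factor1\tparam1\t{"type": [{"a": "a1", "timestamp": 1}, {"a":"a2", "timestamp": 2}]}'

# Split on the first two tabs only, in case the JSON itself contains tabs.
factor, param, payload = line.split("\t", 2)

# Emit one output row per element of the "type" array,
# repeating the first two columns on every row.
rows = [
    {
        "factor_column": factor,
        "param_column": param,
        "a_column": entry["a"],
        "ts_column": entry["timestamp"],
    }
    for entry in json.loads(payload)["type"]
]

for row in rows:
    print(row)
```

For a file with many lines, the same steps would run once per line and the resulting rows would be concatenated.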