I have a tab-separated ('\t') text file. The first two columns are text and the third is in a JSON-like format, e.g. {type: [{a: a1, timestamp: 1}, {a: a2, timestamp: 2}]}
How can I put it into a data frame correctly?
I would like to parse a line like factor1\tparam1\t{type: [{a: a1, timestamp: 1}, {a: a2, timestamp: 2}]} into a data frame like:
factor_column  param_column  a_column  ts_column
factor1        param1        a1        1
factor1        param1        a2        2
I have saved the one line of text you provided into a file called 'parseJSON.txt'. You can then read the file in as usual using read.table, then use library(jsonlite) to parse the 3rd column.
I've also formatted the line of text to include quotes around the JSON keys and string values, so that it is valid JSON:
factor1 param1 {"type": [{"a": "a1", "timestamp": 1}, {"a":"a2", "timestamp": 2}]}
library(jsonlite)
dat <- read.table("parseJSON.txt",
                  sep = "\t",
                  header = FALSE,
                  quote = "")
# parse the 3rd column (of the first row) using jsonlite
js <- fromJSON(as.character(dat[1, 3]))
js is now a list:
> js
$type
a timestamp
1 a1 1
2 a2 2
which can be combined with the first two columns of dat:
res <- cbind(dat[,1:2],js$type)
names(res) <- c("factor_column", "param_column", "a_column", "ts_column")
which gives:
> res
factor_column param_column a_column ts_column
1 factor1 param1 a1 1
2 factor1 param1 a2 2
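For comparison, here is the same reshaping in pandas; a minimal sketch, assuming the quoted-JSON version of the file shown above (it also handles files with more than one row):
import json
import pandas as pd

dat = pd.read_csv("parseJSON.txt", sep="\t", header=None,
                  names=["factor_column", "param_column", "js"],
                  quoting=3)  # 3 = csv.QUOTE_NONE, keeps the JSON quotes intact
# expand each row's JSON list into one output row per array element
rows = [
    (r.factor_column, r.param_column, item["a"], item["timestamp"])
    for r in dat.itertuples(index=False)
    for item in json.loads(r.js)["type"]
]
res = pd.DataFrame(rows, columns=["factor_column", "param_column",
                                  "a_column", "ts_column"])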
I have a PySpark dataframe where columns have JSON string values like this:
col1                       col2
{"d1":"2343","v1":"3434"}  {"id1":"123"}
{"d1":"2344","v1":"3435"}  {"id1":"124"}
I want to update "col1" JSON string values with "col2" JSON string values to get this:
col1                                   col2
{"d1":"2343","v1":"3434","id1":"123"}  {"id1":"123"}
{"d1":"2344","v1":"3435","id1":"124"}  {"id1":"124"}
How to do this in PySpark?
Since you're dealing with string-type columns, you can remove the last } from "col1", remove the first { from "col2", and join the strings together with a comma , as the delimiter.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('{"d1":"2343","v1":"3434"}', '{"id1":"123"}'),
     ('{"d1":"2344","v1":"3435"}', '{"id1":"124"}')],
    ["col1", "col2"])
Script:
df = df.withColumn(
    "col1",
    F.concat_ws(
        ",",
        F.regexp_replace("col1", r"}$", ""),
        F.regexp_replace("col2", r"^\{", "")
    )
)
df.show(truncate=0)
# +-------------------------------------+-------------+
# |col1 |col2 |
# +-------------------------------------+-------------+
# |{"d1":"2343","v1":"3434","id1":"123"}|{"id1":"123"}|
# |{"d1":"2344","v1":"3435","id1":"124"}|{"id1":"124"}|
# +-------------------------------------+-------------+
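If the string surgery seems too fragile (for example, it would produce invalid JSON if "col2" were ever an empty object {}), a map-based version is an alternative. This is a sketch, assuming Spark 2.4+ for map_concat, string-only values, and no duplicate keys between the two objects:
df = df.withColumn(
    "col1",
    F.to_json(
        F.map_concat(
            F.from_json("col1", "map<string,string>"),  # parse each JSON object into a map
            F.from_json("col2", "map<string,string>")
        )
    )
)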
I have a file that looks like this:
'{"Name": "John", "Age": 23}'
'{"Name": "Mary", "Age": 21}'
How can I read this file and get a pyspark dataframe like this:
Name | Age
"John" | 23
"Mary" | 21
First read the file in as text, then use the from_json function to split each row into the two columns.
df = spark.read.load(path_to_your_file, format='text')
df = df.selectExpr("from_json(trim('\\'' from value), 'Name string,Age int') as data").select('data.*')
df.show(truncate=False)
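For reference, the same thing can be written with DataFrame API functions instead of a SQL expression; a sketch under the same assumption that every line is wrapped in single quotes:
from pyspark.sql import functions as F
df = spark.read.text(path_to_your_file)
df = df.select(
    F.from_json(
        F.regexp_replace("value", r"^'|'$", ""),  # strip the wrapping single quotes
        "Name string, Age int"
    ).alias("data")
).select("data.*")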
I have a list of objects as JSON. Each object has two properties: id (string) and arg (number).
When I use pandas.read_json(...), the resulting DataFrame has the id interpreted as a number as well, which causes problems since information is lost.
import pandas as pd
json = '[{ "id" : "1", "arg": 1 },{ "id" : "1_1", "arg": 2}, { "id" : "11", "arg": 2}]'
df = pd.read_json(json)
I'd expect to have a DataFrame like this:
id arg
0 "1" 1
1 "1_1" 2
2 "11" 2
Instead, I get:
id arg
0 1 1
1 11 2
2 11 2
and suddenly, the once unique id is not so unique anymore.
How can I tell pandas to stop doing that?
My search so far only yielded results where people were trying to achieve the opposite - having columns of strings interpreted as numbers - which is exactly what I don't want in this case!
If you set the dtype parameter to False, read_json will not infer the types automatically:
df = pd.read_json(json, dtype=False)
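A quick way to verify; a minimal sketch (newer pandas versions want a file-like object rather than a raw JSON string, hence the StringIO):
from io import StringIO
import pandas as pd
json = '[{ "id" : "1", "arg": 1 },{ "id" : "1_1", "arg": 2}, { "id" : "11", "arg": 2}]'
df = pd.read_json(StringIO(json), dtype=False)
print(df.dtypes)  # id: object, arg: int64 - the ids stay strings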
Use the dtype parameter to prevent casting id to numbers:
df = pd.read_json(json, dtype={'id':str})
print (df)
id arg
0 1 1
1 1_1 2
2 11 2
print (df.dtypes)
id object
arg int64
dtype: object
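Note the difference between the two approaches: dtype=False turns off type inference for every column, while dtype={'id': str} pins down only the id column and lets the remaining columns be inferred as usual.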
I have JSON data as below:
{
    "X": "abc",
    "Y": 1,
    "Z": 4174,
    "t_0": {
        "M": "bm",
        "T": "sp",
        "CUD": 4,
        "t_1": {
            "CUD": "1",
            "BBC": "09",
            "CPR": -127
        },
        "EVV": "10.7000",
        "BBC": -127,
        "CMIX": "25088"
    },
    "EYR": "sp"
}
The problem is that converting this to a pandas data frame creates two columns with the same name, CUD: one under t_0 and another under t_0.t_1. But they are different events. How can I prepend the JSON path to the column names so that I can tell the two apart, something like t_0_CUD and t_0_t_1_CUD?
My code is below:
df = pd.io.json.json_normalize(json_data)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
If you use only the first part of your solution, it returns what you need, except that . is used instead of _:
df = pd.io.json.json_normalize(json_data)
print (df)
X Y Z EYR t_0.M t_0.T t_0.CUD t_0.t_1.CUD t_0.t_1.BBC t_0.t_1.CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0.EVV t_0.BBC t_0.CMIX
0 10.7000 -127 25088
If you need _:
df.columns = df.columns.str.replace('.', '_', regex=False)
print (df)
X Y Z EYR t_0_M t_0_T t_0_CUD t_0_t_1_CUD t_0_t_1_BBC t_0_t_1_CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0_EVV t_0_BBC t_0_CMIX
0 10.7000 -127 25088
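For what it's worth, newer pandas can produce the underscore-separated names in a single step via the sep argument (this assumes pandas >= 1.0, where json_normalize is a top-level function):
df = pd.json_normalize(json_data, sep='_')
# columns come out as t_0_M, t_0_CUD, t_0_t_1_CUD, ... directly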
I have a list of JSON values (actually it's a text file where every line is one JSON object), like this:
{ "id": 1, "name": "john", "age": 18, "education": "master" }
{ "id": 2, "name": "jack", "job": "clerk" }
...
Some of the values can be missing (e.g. the first item doesn't have a "job" value and the second item doesn't have "education" and "age").
I need to create a data frame in R and fill all missing column values with NAs (for every field name that exists in at least one row). What is the easiest way to achieve this?
What I have already done: I installed the "rjson" package and parsed these lines into R lists. Let's assume that the lines variable is a character vector of lines.
library(rjson)
lines <- ...  # initialize the "lines" variable here
jsons <- sapply(lines, fromJSON)
"jsons" variable became "list of lists" (every JSON object is converted to list in R terminology). How to convert it to data.frame?
I want to see the following data frame for the example I provided:
"id" | "name" | "age" | "education" | "job"
-------------------------------------------
  1  | "john" |  18   | "master"    |  NA
  2  | "jack" |  NA   |  NA         | "clerk"
From plyr you can use rbind.fill to add the NAs for you:
library(plyr)
rbind.fill(lapply(jsons, data.frame))
# id name age education job
# 1 1 john 18 master <NA>
# 2 2 jack NA <NA> clerk
or from data.table:
library(data.table)
rbindlist(jsons, fill=TRUE)
and from dplyr:
library(dplyr)
bind_rows(lapply(jsons, data.frame))
Future me, correcting past me's mistakes: it would make more sense to use jsonlite's stream_in, which reads newline-delimited JSON from a connection.
stream_in(txtfile)  # txtfile must be a connection, e.g. file("yourdata.json")
# To test on `txt` from below, try:
# stream_in(textConnection(txt))
# Found 2 records...
# Imported 2 records. Simplifying...
# id name age education job
#1 NA john 18 master <NA>
#2 2 jack NA <NA> clerk
Use the jsonlite package's fromJSON function after making a few inline edits to your original text data (I've also edited the first id value to be an explicit null, to show that this case is handled):
fromJSON(paste0("[", gsub("}\n", "},\n", txt), "]"))
# id name age education job
#1 NA john 18 master <NA>
#2 2 jack NA <NA> clerk
All I did was add a little formatting: wrapping all the JSON lines together in [ and ], and adding a comma after each closing } - resulting in output like the below, which can be processed all at once by jsonlite::fromJSON:
[{"1":"one"},{"2":"two"}]
Where txt was your lines of data as presented, with a null in the id variable:
txt <- "{ \"id\": null, \"name\": \"john\", \"age\": 18, \"education\": \"master\" }
{ \"id\": 2, \"name\": \"jack\", \"job\": \"clerk\" }"