I have a Pandas DataFrame that I need to convert to JSON. The to_json() DataFrame method results in an acceptable format, but it converts my DataFrame index to strings (e.g. 0 becomes "0.0"). I need "0".
The DataFrame comes from JSON using the pd.io.json.read_json() method, which sets the index to float64.
Input JSON:
{"chemical": {"1": "chem2", "0": "chem1"},
"type": {"1": "pesticide", "0": "pesticide"}}
DataFrame (from read_json()):
chemical type
0 chem1 pesticide
1 chem2 pesticide
Produced JSON (from to_json()):
{"chemical": {"0.0": "chem1", "1.0": "chem2"},
"type": {"0.0": "pesticide", "1.0": "pesticide"}}
Needed JSON:
{"chemical": {"0": "chem1", "1": "chem2"},
"type": {"0": "pesticide", "1": "pesticide"}}
@shx2 pointed me in the right direction, but I changed my approach to creating the DataFrame from JSON.
Instead of using the read_json() method on a JSON string, I used the pd.DataFrame.from_dict() method on the JSON as a Python dictionary to create the DataFrame. This results in df.index.dtype == dtype('O').
I had to set dtype='float64' in the from_dict() method to set the correct dtype for the non-string entries.
pd_obj = pd.DataFrame.from_dict(request.json["inputs"], dtype='float64')
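A minimal standalone sketch of the same idea, using a plain dict in place of request.json["inputs"] (the data is just the example from the question):
import pandas as pd

# Same payload as in the question; in my code this comes from request.json["inputs"]
inputs = {"chemical": {"1": "chem2", "0": "chem1"},
          "type": {"1": "pesticide", "0": "pesticide"}}

# Building the frame from the dict keeps the index keys as strings (dtype 'O'),
# so to_json() writes "0" and "1" rather than "0.0" and "1.0".
# (dtype='float64' is only needed when some columns hold numbers, as noted above.)
df = pd.DataFrame.from_dict(inputs)
print(df.index.dtype)   # object
print(df.to_json())     # index keys come out as "0" / "1"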
Seems like the dtype of the index is float (check df.index.dtype). You need to convert it to int:
df.index = df.index.astype(int)
df.to_json()
=> {"chemical": {"0": "chem1", "1": "chem2"}, "type": {"0": "pesticide", "1": "pesticide"}}
Related
Say I have a simple json object like
{"a" :"1", "b":"2", "c":"3"}
In ruby, how can I iterate through that object just so I can get the values 1 2 and 3.
Assuming you have the object as a string:
require 'json'
json_obj = '{"a" :"1", "b":"2", "c":"3"}'
values = JSON.parse(json_obj).values
will provide you with the array
["1", "2", "3"]
JSON.parse parses the JSON string into a Ruby object, in this case an instance of Hash. The Hash class has a values method which returns an array containing the values of each hash entry.
Here's a pure ruby approach (does not require JSON module) that can be used for parsing a simple json string like that:
First convert your string to an array by using a regex that matches and captures anything enclosed in quotation marks:
json_str = '{"a" :"1", "b":"2", "c":"3"}'
json_arr = json_str.scan(/"(.*?)"/).map(&:join)
#=> ["a", "1", "b", "2", "c", "3"]
You could stop there and just iterate through that array to grab every second element using something like each_slice(2).map(&:last), or you could finish the conversion to a hash. There are several ways to do this, but here's one using Hash[] along with the splat operator:
json_hash = Hash[*json_arr]
#=> {"a"=>"1", "b"=>"2", "c"=>"3"}
You can then simply access the values:
json_hash.values
#=> ["1", "2", "3"]
All in one line:
Hash[*'{"a" :"1", "b":"2", "c":"3"}'.scan(/"(.*?)"/).map(&:join)].values
I have a Mongo change stream (a pymongo application) that is continuously getting the changes in collections. These change documents, as received by the program, are sent to Azure Event Hubs. A Spark notebook has to read the documents as they arrive in Event Hubs and do schema matching (match the fields in the document with the Spark table columns) against the Spark table for that collection. If there are fewer fields in the document than in the table, columns have to be added with Null.
I am reading the events from Event Hub like below.
spark.readStream.format("eventhubs").options(**config).load()
As stated in the documentation, the original message is in the 'body' column of the dataframe, which I am converting to a string. Now I have the Mongo document as a JSON string in a streaming dataframe. I am facing the issues below.
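For reference, the read and the cast look roughly like this (config stands in for the Event Hubs connection settings, and json_str is just the name I give the casted column):
from pyspark.sql.functions import col

# config is a placeholder for the Event Hubs connection settings
raw_df = spark.readStream.format("eventhubs").options(**config).load()

# The payload arrives as binary in the 'body' column; cast it to a string
json_df = raw_df.select(col("body").cast("string").alias("json_str"))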
I need to extract the individual fields in the mongo document. This is needed to compare what fields are present in the spark table and what is not in Mongo document. I saw a function called get_json_object(col,path). This essentially returns a string again and I cannot individually select all the columns.
If from_json can be used to convert the JSON string to a struct type, I cannot specify the schema because we have close to 70 collections (and a corresponding number of Spark tables), each sending Mongo docs with anywhere from 10 to 450 fields.
If I can convert the JSON string in streaming dataframe to a JSON object whose schema can be inferred by the dataframe (something like how read.json can do), I can use SQL * representation to extract the individual columns, do few manipulations & then save the final dataframe to the spark table. Is it possible to do that? What is the mistake I am making?
Note: a streaming DF doesn't support the collect() method to individually extract the JSON string from the underlying RDD and do the necessary column comparisons. Using Spark 2.4 & Python in the Azure Databricks environment 4.3.
Below is the sample data I get in my notebook after reading the events from event hub and casting it to string.
{
"documentKey": "5ab2cbd747f8b2e33e1f5527",
"collection": "configurations",
"operationType": "replace",
"fullDocument": {
"_id": "5ab2cbd747f8b2e33e1f5527",
"app": "7NOW",
"type": "global",
"version": "1.0",
"country": "US",
"created_date": "2018-02-14T18:34:13.376Z",
"created_by": "Vikram SSS",
"last_modified_date": "2018-07-01T04:00:00.000Z",
"last_modified_by": "Vikram Ganta",
"last_modified_comments": "Added new property in show_banners feature",
"is_active": true,
"configurations": [
{
"feature": "tip",
"properties": [
{
"id": "tip_mode",
"name": "Delivery Tip Mode",
"description": "Tip mode switches the display of tip options between percentage and amount in the customer app",
"options": [
"amount",
"percentage"
],
"default_value": "tip_percentage",
"current_value": "tip_percentage",
"mode": "multiple or single"
},
{
"id": "tip_amount",
"name": "Tip Amounts",
"description": "List of possible tip amount values",
"default_value": 0,
"options": [
{
"display": "No Tip",
"value": 0
}
]
}
]
}
]
}
}
I would like to separate out the fullDocument in the sample above. When I use get_json_object, I get fullDocument in another streaming dataframe as a JSON string, not as an object. As you can see, there are some array types in fullDocument which I can explode (the documentation says explode is supported in streaming DFs, but I haven't tried it), but there are also some objects (struct types) from which I would like to extract the individual fields. I cannot use the SQL '*' notation because what get_json_object returns is a string and not the object itself.
It is convincing that JSON with this much variation in schema is better handled with the schema specified explicitly. So my takeaway is that in a streaming environment with widely varying incoming schemas, it is always better to specify the schema. I am therefore proceeding with get_json_object and from_json, reading the schema from a file.
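Roughly along these lines (the path and the json_str column name are just examples; I'm assuming the schema file contains the JSON that df.schema.json() produces for that collection):
import json
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType

# Example path only; the file holds the JSON produced earlier by df.schema.json()
with open("/dbfs/schemas/configurations.json") as f:
    schema = StructType.fromJson(json.load(f))

# Parse the JSON string column into a struct and flatten fullDocument with '*'
parsed = (json_df
          .withColumn("doc", from_json(col("json_str"), schema))
          .select("doc.fullDocument.*"))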
My requirement is to pass a DataFrame as an input parameter to a Scala class which saves the data in JSON format to HDFS.
The input parameter looks like this:
case class ReportA(
parm1: String,
parm2: String,
parm3: Double,
parm4: Double,
parm5: DataFrame
)
I have created a JSON object for this parameter like:
def write(xx: ReportA) = JsObject(
"field1" -> JsString(xx.parm1),
"field2" -> JsString(xx.parm2),
"field3" -> JsNumber(xx.parm3),
"field4" -> JsNumber(xx.parm4),
"field5" -> JsArray(xx.parm5)
)
parm5 is a DataFrame, and I want to convert it to a JSON array.
How can I convert the dataframe to Json array?
Thank you for your help!!!
A DataFrame can be seen as the equivalent of a plain old table in a database, with rows and columns. You can't just get a simple array from it; the closest you would come to an array would be with the following structure:
{
"col1": [val1, val2, ..],
"col2": [val3, val4, ..],
"col3": [val5, val6, ..]
}
To achieve a similar structure, you could use the toJSON method of the DataFrame API to get an RDD[String] and then call collect on it (be careful of OutOfMemory exceptions).
You now have an Array[String], which you can simply transform into a JSON array depending on the JSON library you are using.
Beware though: this seems like a rather unusual way to use Spark. You generally don't collect an RDD or a DataFrame directly into one of your objects; you usually write it out to a storage solution.
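Not the spray-json code itself, but a rough PySpark sketch of the same toJSON-and-collect idea, with dummy data standing in for parm5 and assuming an active SparkSession named spark:
import json

# Dummy stand-in for parm5
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# toJSON yields one JSON string per row; collect brings them to the driver
rows = df.toJSON().collect()

# Parse each row and build a single JSON array from them
json_array = json.dumps([json.loads(r) for r in rows])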
I have a huge JSON file, a small part from it as follows:
{
"socialNews": [{
"adminTagIds": "",
"fileIds": "",
"departmentTagIds": "",
........
........
"comments": [{
"commentId": "",
"newsId": "",
"entityId": "",
....
....
}]
}]
.....
}
I have applied lateral view explode on socialNews as follows:
val rdd = sqlContext.jsonFile("file:///home/ashish/test")
rdd.registerTempTable("social")
val result = sqlContext.sql("select * from social LATERAL VIEW explode(socialNews) social AS comment")
Now I want to convert this result (a DataFrame) back to JSON and save it to a file, but I am not able to find any Scala API to do the conversion.
Is there any standard library to do this or some way to figure it out?
val result: DataFrame = sqlContext.read.json(path)
result.write.json("/yourPath")
The write method is in the DataFrameWriter class and should be accessible to you on DataFrame objects. Just make sure that your rdd is of type DataFrame and not of the deprecated type SchemaRDD. You can explicitly provide the type definition val data: DataFrame or convert to a DataFrame with toDF().
If you have a DataFrame there is an API to convert back to an RDD[String] that contains the json records.
val df = Seq((2012, 8, "Batman", 9.8), (2012, 8, "Hero", 8.7), (2012, 7, "Robot", 5.5), (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
df.toJSON.saveAsTextFile("/tmp/jsonRecords")
df.toJSON.take(2).foreach(println)
This should be available from Spark 1.4 onward. Call the API on the result DataFrame you created.
The APIs available are listed here
sqlContext.read().json(dataFrame.toJSON())
When you run your spark job as
--master local --deploy-mode client
Then,
df.write.json('path/to/file/data.json') works.
If you run on a cluster (e.g. submitted from the head node) with --master yarn --deploy-mode cluster, a better approach is to write the data to AWS S3 or Azure Blob storage and read it from there.
df.write.json('s3://bucket/path/to/file/data.json') works.
If you still can't figure out a way to convert a DataFrame into JSON, you can use the built-in Spark functions to_json or toJSON.
Let me know if you have a sample DataFrame and the JSON format you want to convert to.
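For example, a small sketch with the built-in functions (assuming a SparkSession named spark; the columns are made up):
from pyspark.sql.functions import struct, to_json

# Made-up example frame
df = spark.createDataFrame([(2012, "Batman", 9.8)], ["year", "title", "rating"])

# to_json turns a struct of columns into a JSON string column
df.select(to_json(struct(*df.columns)).alias("json")).show(truncate=False)

# toJSON does the same per row, returning the JSON strings as an RDD
print(df.toJSON().first())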
I have a 3rd party service which returns me the following response:
JSON 1
{"Bag":{"Type":{"$":"LIST"},"Source":{"$":"ABC"},"Id":{"$":"151559458"},"Name":{"$":"Bag list"},"Source":{"$":"ABC"},"CustomerId":{"$":"abc#gmail.com"},"DateTime":{"$":"2014-07-17T12:36:01Z"}}}
But I have to format this JSON into the following format:
JSON2
{"Bag":{"Type":"LIST","Source":"ABC","Id":"151559458","Name":"Bag list","Source":"ABC","CustomerId":"abc#gmail.com","DateTime":"2014-07-17T12:36:01Z"}}
And vice versa: from the client I get JSON2 and I have to send that response to the service in JSON1 format.
Given the input in the first JSON schema, transform it to a language specific data structure. You can typically find a library to do this.
Transform your language specific data structure, using the idioms of whichever language you are using, to the equivalent of the second JSON schema.
Transform the language specific data structure to JSON text. You can typically find a library to do this.
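For example, a rough Python sketch of those three steps, assuming every leaf value in JSON2 should get its {"$": ...} wrapper back when going the other way:
import json

def unwrap(node):
    # JSON1 -> JSON2: replace every {"$": value} wrapper with the bare value
    if isinstance(node, dict):
        if set(node) == {"$"}:
            return node["$"]
        return {k: unwrap(v) for k, v in node.items()}
    if isinstance(node, list):
        return [unwrap(v) for v in node]
    return node

def wrap(node):
    # JSON2 -> JSON1: put the {"$": value} wrapper back around every leaf value
    if isinstance(node, dict):
        return {k: wrap(v) for k, v in node.items()}
    if isinstance(node, list):
        return [wrap(v) for v in node]
    return {"$": node}

json1 = json.loads('{"Bag":{"Type":{"$":"LIST"},"Id":{"$":"151559458"}}}')
json2 = unwrap(json1)            # {"Bag": {"Type": "LIST", "Id": "151559458"}}
print(json.dumps(wrap(json2)))   # back to the JSON1 shape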
You can use the jq tool: http://stedolan.github.io/jq
Then the conversion is a console one-liner:
$ jq '{Bag: .Bag | with_entries({key, value: .value."$"})}' file.json
And the result is
{
"Bag": {
"Type": "LIST",
"Source": "ABC",
"Name": "Bag\nlist",
"Id": "151559458",
"DateTime": "2014-07-17T12:36:01Z",
"CustomerId": "abc#gmail.com"
}
}