I am working with a Postgres database that uses json as a column data type, and I am having trouble extracting values from the JSON document. I've done some research and tried a variety of approaches, including:
json_array_elements
response ->> 'filterEntryId'
json_populate_recordset(null::obj, table.column -> 'filterEntryId')
but have not been successful. I am starting to think the problem is the way the JSON is stored in the column, i.e. that it starts with a '[' instead of a '{' (a JSON array rather than a single object).
Below is an example of the value of the json field.
[{
  "filterEntryId": 373,
  "length": 3,
  "locale": "en",
  "matched": "dog",
  "quality": 1.0,
  "root": "dog",
  "severity": "mild",
  "start": 2,
  "tags": ["Vulgarity"],
  "type": "blacklist"
}]
Just figured it out. I was misusing the json_array_elements function.
In case anyone stumbles across this, here is the correct way to query the JSON:
select
json_array_elements(column) ->> 'filterEntryId'
from table
Essentially you first expand the array with json_array_elements and then grab what you need from each element. This is necessary because the value in the column is a JSON array (it starts with '[') rather than a single object.
Feel free, anyone, to expand on my explanation.
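To make the answer above concrete, here is a self-contained sketch. The table name (responses) and column name (response) are hypothetical stand-ins, since the real names were placeholders in the query above:

```sql
-- Hypothetical setup: a json column holding an array of objects,
-- like the sample document in the question.
create table responses (response json);
insert into responses values
  ('[{"filterEntryId": 373, "matched": "dog", "severity": "mild"}]');

-- json_array_elements unnests the array into one row per element;
-- ->> then extracts a field from each element as text.
select elem ->> 'filterEntryId' as filter_entry_id
from responses,
     json_array_elements(response) as elem;
```

Putting the set-returning function in the FROM clause (an implicit lateral join) also keeps things working when an array contains more than one element: you get one output row per array element.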
Related
I am trying to convert some data that I am receiving into a parquet table that I can eventually use for reporting, but feel like I am missing a step.
I receive files that are CSVs in the format "id", "event", "source", where the "event" column is a GZIP-compressed JSON string. I've been able to set up a dataframe that extracts the three columns, including unzipping the JSON string. So I now have a table with
id | event | source | unencoded_event
Where the unencoded_event is the JSON string.
What I'd like to do at this point is to take that one string column of JSON and parse it out into individual columns. Based on a comment from another developer (that the process of converting to parquet is smart enough to just use the first row of my results to figure out schema), I've tried this:
df1 = spark.read.json(df.select("unencoded_event").rdd).write.format("parquet").saveAsTable("test")
But this just gives me a single-column table with a _corrupt_record column that contains the JSON string again.
What I'm trying to get to is to take schema:
{
"agent"
--"name"
--"organization"
"entity"
--"name"
----"type"
----"value"
}
And get the table to, ultimately, look like:
AgentName | Organization | EventType | EventValue
Is the step I'm missing just explicitly defining the schema or have I oversimplified my approach?
Potential complications here: the JSON schema is actually more involved than above; I've been assuming I can expand out the full schema into a wider table and then just return the smaller set I care about.
I have also tried taking a single result from the file (so, a single JSON string), saving it as a JSON file, and reading from that. Doing so works, i.e., spark.read.json("myJSON.json") parses the string into the arrays I was expecting. This is also true if I copy multiple strings into the file.
This doesn't work if I take my original results and try to save them. If I try to save just the column of strings as a json file
dfWrite = df.select(col("unencoded_event"))
dfWrite.write.mode("overwrite").json(write_location)
then read them back out, this doesn't behave the same way: each row is still treated as a string.
I did find one solution that works. This is not a perfect solution (I'm worried that it's not scalable), but it gets me to where I need to be.
I can select the data using get_json_object() for each column I want (sorry, I've been fiddling with column names and the like over the course of the day):
dfResults = df.select(
    get_json_object("unencoded_event", "$.agent[0].name").alias("userID"),
    get_json_object("unencoded_event", "$.entity[0].identifier.value").alias("itemID"),
    get_json_object("unencoded_event", "$.entity[0].detail[1].value").alias("itemInfo"),
    get_json_object("unencoded_event", "$.recorded").alias("timeStamp"))
The big thing I don't love about this is that it appears I can't use filter/search options with get_json_object(). That's fine for the foreseeable future, because right now I know where all the data should be and don't need to filter.
I believe I can also use from_json() but that requires defining the schema within the notebook. This isn't a great option because I only need a small part of the JSON, so it feels like unnecessary effort to define the entire schema. (I also don't have control over what the overall schema would be, so this becomes a maintenance issue.)
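For anyone unfamiliar with the path syntax, the JSONPath expressions above can be sanity-checked in plain Python. The payload below is hypothetical (invented field values shaped like the schema sketch in the question); get_json_object follows the same $.a[0].b navigation:

```python
import json

# Hypothetical event payload shaped like the schema sketched in the question.
unencoded_event = json.dumps({
    "agent": [{"name": "alice", "organization": "acme"}],
    "entity": [{
        "identifier": {"value": "item-42"},
        "detail": [{"value": "first"}, {"value": "second"}],
    }],
    "recorded": "2020-01-01T00:00:00Z",
})

doc = json.loads(unencoded_event)

# Each line mirrors one get_json_object(col, "$...") path from the answer.
user_id   = doc["agent"][0]["name"]               # $.agent[0].name
item_id   = doc["entity"][0]["identifier"]["value"]  # $.entity[0].identifier.value
item_info = doc["entity"][0]["detail"][1]["value"]   # $.entity[0].detail[1].value
timestamp = doc["recorded"]                          # $.recorded

print(user_id, item_id, item_info, timestamp)
```

Note that [0] and [1] are fixed indexes: this is exactly why get_json_object can't filter or search within the arrays — you have to know the position in advance.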
I am asked to enter a stringified JSON value in a column named show_cast, in the format of Release: and Date:, but I don't have any idea how to do that.
The table simply consists of three columns: a token, a time, and a genre. The time has to be a stringified JSON.
I tried simply typing it with { and plain :, but that doesn't seem to work. They say that {{release:,time:},{release:,date:},,,,} should be converted to a stringified form and then inserted into the database. I don't know how to do that, and I don't see any resource like this out there. To be honest, I didn't even know about this until I was given the task.
insert into show_reality values("project_123","{{release:2017,date:04-11},{release:2019,date:12-03}}","Action");
I have done this, but I don't think it is stringified JSON.
Thanks in advance.
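For what it's worth, "stringified JSON" usually just means the JSON text of a structure. Assuming the {{release:,date:},...} notation above is meant to describe an array of release/date objects (an assumption — the question doesn't pin this down), a sketch of producing the value looks like:

```python
import json

# Hypothetical data matching what the {{release:,date:},...} notation
# in the question appears to describe.
times = [
    {"release": 2017, "date": "04-11"},
    {"release": 2019, "date": "12-03"},
]

# "Stringifying" is just serializing the structure to JSON text;
# the resulting string is what goes into the text column.
stringified = json.dumps(times)
print(stringified)
# → [{"release": 2017, "date": "04-11"}, {"release": 2019, "date": "12-03"}]
```

The {{...},{...}} form in the INSERT above is not valid JSON (objects need quoted keys, and a list of objects belongs inside [...]), which is likely why it was rejected as "not stringified".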
We use Presto's JSON capabilities quite heavily, and one thing that is missing for us is the ability to return null when the JSON is not valid; that way, SQL statements that use JSON functions will not break if there is a problem with the JSON format.
Initially I thought it could be done with some combination of JSON_PARSE and NULLIF, but I couldn't manage to pull it off.
Is there a way to do this kind of validation?
Thanks
You can use the try function to prevent the json functions from failing the query. For example, SELECT try(json_parse('bad json')) will return null instead of failing the query.
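To sketch how that composes in a larger query (table and column names here are hypothetical), try() turns the parse failure into a null that the rest of the expression propagates:

```sql
-- Hypothetical table events(id, payload varchar): rows with invalid JSON
-- yield NULL for name instead of failing the whole query.
SELECT id,
       json_extract_scalar(try(json_parse(payload)), '$.name') AS name
FROM events;

-- Or keep only the rows whose payload parses as valid JSON:
SELECT id
FROM events
WHERE try(json_parse(payload)) IS NOT NULL;
```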
I am parsing JSON into ABAP structures, and it works:
DATA cl_oops TYPE REF TO cx_dynamic_check.
DATA(text) = `{"TEXT":"Hello ABAP, I'm JSON!","CODE":"123"}`.
TYPES: BEGIN OF ty_structure,
text TYPE string,
code TYPE char3,
END OF ty_structure.
DATA : wa_structure TYPE ty_structure.
TRY.
text = |\{"DATA":{ text }\}|.
CALL TRANSFORMATION id OPTIONS clear = 'all'
SOURCE XML text
RESULT data = wa_structure.
WRITE: wa_structure-text , wa_structure-code.
CATCH cx_transformation_error INTO cl_oops.
WRITE cl_oops->get_longtext( ).
ENDTRY.
The interesting part is that CODE and TEXT are case sensitive. For most external systems, having all-caps identifiers is ugly, so I have been trying to parse {"text":"Hello ABAP, I'm JSON!","code":"123"} without any success. I looked into the options, I looked at whether a changed copy of the id transformation might accomplish this, and I googled it, but I have no idea how to accomplish this.
Turns out that SAP has a sample program on how to do this.
There is basically an out-of-the-box transformation that does this for you, called demo_json_xml_to_upper. The name is a bit unfortunate, so I would suggest renaming the transformation and adding it to the customer namespace.
I am a bit bummed that this only works through xstrings, so debugging it becomes a pain. But it works perfectly and solved my problem.
My solution to this is low tech. I spent hours looking for a simple way out of this mess: the JSON response could have the field names in lower or camel case. Here it is: if you know the field names (and obviously you do, because your table has the same column names), just replace the lower-case name with an upper-case one in your xstring.
If in your table the field is USERS_ID and in the JSON xstring it is users_ID - go for that:
REPLACE ALL OCCURRENCES OF 'users_ID' IN ls_string WITH 'USERS_ID'.
Do the same for all fields and the object name and call transformation ID.
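An end-to-end sketch of that approach (hedged: lv_xjson and ls_data are hypothetical names, and the payload is assumed to be UTF-8-encoded):

```abap
" Sketch: convert the xstring payload to a string, fix the case, transform.
DATA(lv_json) = cl_abap_codepage=>convert_from( lv_xjson ). " assumes UTF-8
REPLACE ALL OCCURRENCES OF 'users_ID' IN lv_json WITH 'USERS_ID'.
" Repeat for the other field names, then run the standard id transformation.
lv_json = |\{"DATA":{ lv_json }\}|.
CALL TRANSFORMATION id SOURCE XML lv_json
                       RESULT data = ls_data.
```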
Say I have a text field with JSON data like this:
{
"id": {
"name": "value",
"votes": 0
}
}
Is there a way to write a query which would find id and then increment the votes value?
I know I could just retrieve the JSON data, update what I need, and reinsert the updated version, but I wonder: is there a way to do this without running two queries?
UPDATE `sometable`
SET `somefield` = JSON_REPLACE(`somefield`, '$.id.votes', JSON_EXTRACT(`somefield` , '$.id.votes')+1)
WHERE ...
Edit
As of MySQL 5.7.8, MySQL supports a native JSON data type that enables efficient access to data in JSON documents.
JSON_EXTRACT will allow you to access a particular JSON element in a JSON field, while JSON_REPLACE will allow you to update it.
To specify the JSON element you wish to access, use a string with the format
'$.[top element].[sub element].[...]'
So in your case, to access id.votes, use the string '$.id.votes'.
The SQL code above demonstrates putting all this together to increment the value of a JSON field by 1.
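As a local sanity check: SQLite (available in any recent Python via sqlite3) ships json_extract/json_replace functions with the same names and path syntax, so the increment pattern above can be exercised without a MySQL server. This is only an illustration of the pattern, not MySQL itself; it assumes a SQLite build with the JSON functions enabled (the default in recent versions):

```python
import sqlite3

# SQLite's json_extract/json_replace mirror the MySQL functions used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (somefield TEXT)")
conn.execute(
    """INSERT INTO sometable VALUES ('{"id": {"name": "value", "votes": 0}}')"""
)

# Same shape as the MySQL UPDATE: read the current value, add 1,
# and write the result back into the JSON document in one statement.
conn.execute("""
    UPDATE sometable
    SET somefield = json_replace(somefield, '$.id.votes',
                                 json_extract(somefield, '$.id.votes') + 1)
""")

row = conn.execute(
    "SELECT json_extract(somefield, '$.id.votes') FROM sometable"
).fetchone()
print(row[0])  # → 1
```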
I think for a task like this you're stuck using a plain old SELECT followed by an UPDATE (after you parse the JSON, increment the value you want, and then serialize the JSON back).
You should wrap these operations in a single transaction, and if you're using InnoDB then you might also consider using SELECT ... FOR UPDATE: http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
This is sort of a tangent, but I thought I'd also mention that this is the type of operation that a NoSQL database like MongoDB is quite good at.