Parsing JSON into data structures with lower-case field names

I am parsing JSON into ABAP structures, and it works:
DATA cl_oops TYPE REF TO cx_dynamic_check.
DATA(text) = `{"TEXT":"Hello ABAP, I'm JSON!","CODE":"123"}`.

TYPES: BEGIN OF ty_structure,
         text TYPE string,
         code TYPE char3,
       END OF ty_structure.

DATA wa_structure TYPE ty_structure.

TRY.
    text = |\{"DATA":{ text }\}|.
    CALL TRANSFORMATION id
      OPTIONS clear = 'all'
      SOURCE XML text
      RESULT data = wa_structure.
    WRITE: wa_structure-text, wa_structure-code.
  CATCH cx_transformation_error INTO cl_oops.
    WRITE cl_oops->get_longtext( ).
ENDTRY.
The interesting part is that CODE and TEXT are case sensitive. For most external systems, all-caps identifiers are ugly, so I have been trying to parse {"text":"Hello ABAP, I'm JSON!","code":"123"} without any success. I looked into the transformation options, I checked whether a modified copy of the ID transformation might accomplish this, and I googled it, but I had no idea how to make it work.

It turns out that SAP ships a sample program that shows how to do this.
There is basically an out-of-the-box transformation that does it for you, called demo_json_xml_to_upper. The name is a bit unfortunate, so I would suggest copying the transformation under a better name into the customer namespace.
I am a bit bummed that it only works through xstrings, so debugging it becomes a pain, but it works perfectly and solved my problem.

My solution to this is low tech. I spent hours looking for a simple way out of the mess that the JSON response could have its field names in lower or camel case. Here it is: if you know the field names - and obviously you do, because your target structure has the same field names - just replace each lower-case name in the JSON string with its upper-case equivalent.
If the field in your structure is USERS_ID and in the JSON it is users_ID, go for:
REPLACE ALL OCCURRENCES OF 'users_ID' IN ls_string WITH 'USERS_ID'.
Do the same for every field and for the object name, then call transformation ID.
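Language aside, the idea generalizes. A small illustrative sketch in Python (a hypothetical helper, not part of the original ABAP answer): including the surrounding quotes and the colon in the replacement keeps a value that happens to contain the same text from being touched.

import json  # only needed if you want to validate the result afterwards

def uppercase_keys(raw, fields):
    # Upper-case only the known key names; values are left untouched.
    for f in fields:
        raw = raw.replace('"%s":' % f, '"%s":' % f.upper())
    return raw

print(uppercase_keys('{"text":"Hello ABAP, I\'m JSON!","code":"123"}', ['text', 'code']))
# prints {"TEXT":"Hello ABAP, I'm JSON!","CODE":"123"}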

Related

Creating External Table with Redshift Spectrum from nested JSON

I'm creating an external table from JSON data with input format org.apache.hadoop.mapred.TextInputFormat, output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, and SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the JSON is a highly nested JSON object called groups. The nested data doesn't follow a strict schema, so not all JSON objects within groups have the same attributes. I'm having trouble accessing the attributes of groups and I suspect that I am not casting groups to the proper data type.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried the SUPER type; when I select groups I get the entire JSON, but when I select an attribute of groups, such as select groups.sellersAuths from ... or select groups."sellersAuths" from ..., I get "relation groups does not exist".
I've also tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>; however, when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table, and I'm not sure if my use case is what the SUPER type is for.
I've also tried using JSON_PARSE after casting the data to VARCHAR, or to SUPER or a struct, but that presents issues as well.
Thanks a ton for reading!
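Since the nested objects under groups don't follow a strict schema, one low-effort sanity check before settling on a struct definition is to enumerate the key paths that actually occur in the raw lines. A rough Python sketch for local inspection only (the file name is an assumption, and this is not a Redshift answer):

import json

paths = set()

def walk(node, prefix=""):
    # Record every leaf path, e.g. sellersAuths.mws_region.USAmazon
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, prefix + "." + key if prefix else key)
    else:
        paths.add(prefix)

with open("sample.json") as fh:          # one JSON document per line, as in the sample above
    for line in fh:
        walk(json.loads(line)["groups"])

for p in sorted(paths):
    print(p)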

How do I convert a column of JSON strings into a parquet table

I am trying to convert some data that I am receiving into a parquet table that I can eventually use for reporting, but feel like I am missing a step.
I receive CSV files with the format "id", "event", "source", where the "event" column is a GZIP-compressed JSON string. I've been able to set up a dataframe that extracts the three columns, including unzipping the JSON string. So I now have a table with
id | event | source | unencoded_event
Where the unencoded_event is the JSON string.
What I'd like to do at this point is take that one string column of JSON and parse it out into individual columns. Based on a comment from another developer (that the process of converting to parquet is smart enough to just use the first row of my results to figure out the schema), I've tried this:
df1 = spark.read.json(df.select("unencoded_event").rdd).write.format("parquet").saveAsTable("test")
But this just gives me a single column table with a column of _corrupt_record that just has the JSON string again.
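A likely reason for the _corrupt_record column: df.select(...).rdd yields Row objects, and when the JSON reader stringifies them it sees text like Row(unencoded_event='{...}'), which is not valid JSON. A minimal sketch of one way around that, assuming unencoded_event holds one JSON document per row, is to map the rows down to bare strings first:

# Hand spark.read.json() plain JSON strings instead of Row objects.
json_rdd = df.select("unencoded_event").rdd.map(lambda row: row.unencoded_event)

parsed = spark.read.json(json_rdd)   # schema is inferred from the JSON text itself
parsed.write.format("parquet").saveAsTable("test")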
What I'm trying to get to is to take schema:
{
"agent"
--"name"
--"organization"
"entity"
--"name"
----"type"
----"value"
}
And get the table to, ultimately, look like:
AgentName | Organization | EventType | EventValue
Is the step I'm missing just explicitly defining the schema or have I oversimplified my approach?
Potential complications here: the JSON schema is actually more involved than above; I've been assuming I can expand out the full schema into a wider table and then just return the smaller set I care about.
I have also tried taking a single result from the file (so, a single JSON string), saving it as a JSON file and reading from it. Doing so works, i.e. spark.read.json("myJSON.json") parses the string into the arrays I was expecting. This is also true if I copy in multiple strings.
This doesn't work if I take my original results and try to save them. If I try to save just the column of strings as a json file
dfWrite = df.select(col("unencoded_event"))
dfWrite.write.mode("overwrite").json(write_location)
then read them back out, it doesn't behave the same way: each row is still treated as a string.
I did find one solution that works. This is not a perfect solution (I'm worried that it's not scalable), but it gets me to where I need to be.
I can select the data using get_json_object() for each column I want (sorry, I've been fiddling with column names and the like over the course of the day):
dfResults = df.select(get_json_object("unencoded_event", "$.agent[0].name").alias("userID"),
get_json_object("unencoded_event", "$.entity[0].identifier.value").alias("itemID"),
get_json_object("unencoded_event", "$.entity[0].detail[1].value").alias("itemInfo"),
get_json_object("unencoded_event", "$.recorded").alias("timeStamp"))
The big thing I don't love about this is that it appears I can't use filter/search options with get_json_object(). That's fine for the foreseeable future, because right now I know where all the data should be and don't need to filter.
I believe I can also use from_json() but that requires defining the schema within the notebook. This isn't a great option because I only need a small part of the JSON, so it feels like unnecessary effort to define the entire schema. (I also don't have control over what the overall schema would be, so this becomes a maintenance issue.)
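For what it's worth, from_json() doesn't need the complete schema: a partial StructType covering only the fields of interest is enough, and anything not declared is simply dropped. A rough sketch, with nested field names guessed from the JSON paths used above (treat them as assumptions):

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Partial schema: only the pieces needed for the report.
partial_schema = StructType([
    StructField("agent", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("organization", StringType()),
    ]))),
    StructField("recorded", StringType()),
])

# Use a new column name so the existing "event" column is not overwritten.
parsed = df.withColumn("event_struct", from_json(col("unencoded_event"), partial_schema))

dfResults = parsed.select(
    col("event_struct.agent").getItem(0).getField("name").alias("userID"),
    col("event_struct.agent").getItem(0).getField("organization").alias("organization"),
    col("event_struct.recorded").alias("timeStamp"),
)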

MySQL REGEXP with JSON array

I have a JSON string stored in the database and I need a SQL COUNT with a WHERE condition on the contents of that JSON string. It needs to work on MySQL 5.5.
The only solution I found that could work is to use the REGEXP operator in the SQL query.
Here is my JSON string stored in the custom_data column:
{"language_display":["1","2","3"],"quantity":1500,"meta_display:":["1","2","3"]}
https://regex101.com/r/G8gfzj/1
I now need to create a SQL sentence:
SELECT COUNT(..) WHERE custom_data REGEXP '[HELP_HERE]'
The condition I am looking for is that language_display has to contain either 1, 2 or 3 - or whatever value I define when I build the SQL sentence.
So far I have come up with this regex, but it does not work:
(?:\"language_display\":\[(?:"1")\])
where 1 is replaced with the value I am looking for. In general I could also search for "1" (with quotes), but that would also be found in the meta_display array, which can hold different values.
I am not good with REGEX! Any suggestions?
I used the following regex to get matches on your test string
\"language_display\":\[(:?\"[0-9]\"\,)*?\"3\"(:?\,\"[0-9]\")*?\]
https://regex101.com/ is a free online regex tester; it seems to work great. Start small and work big.
Sorry it doesn't work for you. It must be failing on the non-greedy '*?'; perhaps try it without the '?'.
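One way to sidestep the lazy quantifiers entirely (MySQL 5.5's REGEXP uses a POSIX engine that doesn't support them) is to anchor on the language_display array and only match inside its brackets, so a "1" in meta_display can never hit. A quick Python sketch of that idea against the sample string - Python's re is just a convenient test bed here, not the MySQL engine:

import re

custom_data = '{"language_display":["1","2","3"],"quantity":1500,"meta_display:":["1","2","3"]}'

wanted = "1"   # the value being searched for
# Stay inside the language_display brackets: [^\]]* cannot cross the closing ].
pattern = r'"language_display":\[[^\]]*"' + re.escape(wanted) + '"'

print(bool(re.search(pattern, custom_data)))   # True

The same shape in SQL would be roughly custom_data REGEXP '"language_display":\\[[^]]*"1"'; the exact backslash escaping depends on the client and server settings, so verify it against your own data.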
Have a look at how to serialize this data, with an eye to making the language_display fields searchable:
How to store a list in a column of a database table
Even if you get your current idea working, it will be painfully slow. You are better off processing each row once and generating something that is more easily searched via SQL; even a field containing a comma-separated list would be better.

Accessing data from JSON field in Postgres

I am working with a Postgres database that uses json as a column data type, and I am having issues extracting values from the JSON document. I've done some research and tried a variety of approaches, including
json_array_elements
response ->> 'filterEntryId'
json_populate_recordset(null::obj, table.column -> 'filterEntryId')
but have not been successful. I am starting to think it is down to the way the JSON is stored in the column, i.e. that it starts with a '[' instead of a '{'.
Below is an example of the value of the json field.
[{
"filterEntryId":373,
"length":3,
"locale":"en",
"matched":"dog",
"quality":1.0,
"root":"dog",
"severity":"mild",
"start":2,
"tags":["Vulgarity"],
"type":"blacklist"
}]
Just figured it out: I was misusing the json_array_elements function.
In case anyone stumbles across this, here is the correct way to query the JSON:
select
json_array_elements(column) ->> 'filterEntryId'
from table
Essentially you first unnest the array into its elements and then grab what you need from each one. It has to be done this way because the column holds a JSON array rather than an object (the '[' around the data), so applying ->> 'filterEntryId' to it directly gets you nowhere.
Feel free, anyone, to expand on my explanation.
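To expand on that a little with a plain-Python analogy (just an illustration of the shape of the data, not Postgres code): the document is a list, so you have to step into the array before you can read the key, which is exactly what json_array_elements does on the SQL side.

import json

doc = '[{"filterEntryId": 373, "matched": "dog", "severity": "mild"}]'

data = json.loads(doc)            # a list, because the document starts with '['
print(data[0]["filterEntryId"])   # step into the array first, then read the key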

Search in MySQL database - unserialized data

Situation:
I have a User model. Its "meta_data" attribute is a "text" type field in the database.
In the model it is serialized by a custom class ( serialize :meta_data, CustomJsonSerializer.new ).
This means that when I have an instance of a user, I can work with meta_data like a Hash:
User.first.meta_data['username']
Problem:
I need to write a search function that finds users by a given string. I can do it by manually building the search query in Rails, e.g. User.where("email LIKE '%#{string}%'")...
But what about meta_data? Should I search this field with a LIKE statement too? If I do, it will decrease the relevance of the found records.
For example:
I have 2 users. One of them has the username "patrick", the other "sergio".
The meta_data in the database will look like this:
1) {username: patrick}
2) {username: sergio}
I want to find sergio, so I enter the search string "ser" => but I get 2 results instead of one. The meta_data string "{uSERname: Patrick}" also contains "ser", which makes that record an irrelevant match.
Do you have any idea how to solve it?
That's really the problem with serialized data. In theory, the serialization could use an algorithm that is very unsearchable; it could apply Huffman encoding or other compression and store the result in binary. You are relying on the assumption that the serialization uses JSON and that your string will still be findable as a substring in the serialized form.
The problem you are actually having is a further issue: other data in the serialization (such as the field names) can mess up your results.
In general, if you serialize data, you are making a choice to not be searchable.
So a solution would be to add an additional field that you populate in a way you control. Have a values field and store a pipe (|) delimited value that you can search. So if the data is {firstname: "Patrick", lastname: "Stern"}, your meta_values field might be "Patrick|Stern".
Also, don't use the where method with a string that uses #{} expansion of input values. That makes it vulnerable to SQL injection attacks. Instead use:
where("meta_values is like :pattern", pattern: "%#{string}%")
I know that may not look very different, but ActiveRecord will run the value through its sanitization this way. If someone puts a semicolon in string, ActiveRecord will escape it in the search condition.