How do I convert a column of JSON strings into a parquet table?

I am trying to convert some data that I am receiving into a parquet table that I can eventually use for reporting, but feel like I am missing a step.
I receive CSV files with the format "id", "event", "source", where the "event" column is a GZIP-compressed JSON string. I've been able to set up a dataframe that extracts the three columns, including unzipping the JSON string. So I now have a table that has
id | event | source | unencoded_event
Where the unencoded_event is the JSON string.
What I'd like to do at this point is to take that one string column of JSON and parse it out into individual columns. Based on a comment from another developer (that the process of converting to parquet is smart enough to just use the first row of my results to figure out schema), I've tried this:
df1 = spark.read.json(df.select("unencoded_event").rdd).write.format("parquet").saveAsTable("test")
But this just gives me a single-column table whose only column, _corrupt_record, contains the JSON string again.
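A likely cause, for what it's worth: spark.read.json() expects plain JSON strings, but df.select("unencoded_event").rdd is an RDD of Row objects, so each record fails to parse and lands in _corrupt_record. A minimal sketch, assuming the df and unencoded_event described above, that maps each Row down to its string before letting Spark infer the schema:

# spark.read.json() wants JSON strings, not Row objects, so unwrap each Row first.
json_strings = df.select("unencoded_event").rdd.map(lambda row: row[0])
parsed = spark.read.json(json_strings)   # schema inferred by sampling the strings
parsed.write.format("parquet").saveAsTable("test")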
What I'm trying to get to is to take schema:
{
"agent"
--"name"
--"organization"
"entity"
--"name"
----"type"
----"value"
}
And get the table to, ultimately, look like:
AgentName | Organization | EventType | EventValue
Is the step I'm missing just explicitly defining the schema or have I oversimplified my approach?
Potential complications here: the JSON schema is actually more involved than above; I've been assuming I can expand out the full schema into a wider table and then just return the smaller set I care about.
I have also tried taking a single result from the file (so, a single JSON string), saving it as a JSON file, and reading from it. Doing so works, i.e., spark.read.json("myJSON.json") parses the string into the arrays I was expecting. This is also true if I copy multiple strings into the file.
This doesn't work if I take my original results and try to save them. If I try to save just the column of strings as a json file
dfWrite = df.select(col("unencoded_event"))
dfWrite.write.mode("overwrite").json(write_location)
and then read them back out, they don't behave the same way: each row is still treated as a single string.
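One hedged explanation for that: dfWrite.write.json() wraps each value as {"unencoded_event": "<escaped string>"}, so reading the file back gives a single string column again instead of parsed fields. A rough sketch, reusing the names above, that writes the raw strings as plain text so the round trip does parse (it assumes the JSON strings contain no literal newlines):

# Write the raw JSON strings as text lines so they are not re-wrapped and escaped.
df.select(col("unencoded_event")).write.mode("overwrite").text(write_location)
reparsed = spark.read.json(write_location)   # each line is now a JSON document, so fields are inferred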

I did find one solution that works. This is not a perfect solution (I'm worried that it's not scalable), but it gets me to where I need to be.
I can select the data using get_json_object() for each column I want (sorry, I've been fiddling with column names and the like over the course of the day):
dfResults = df.select(get_json_object("unencoded_event", "$.agent[0].name").alias("userID"),
                      get_json_object("unencoded_event", "$.entity[0].identifier.value").alias("itemID"),
                      get_json_object("unencoded_event", "$.entity[0].detail[1].value").alias("itemInfo"),
                      get_json_object("unencoded_event", "$.recorded").alias("timeStamp"))
The big thing I don't love about this is that it appears I can't use filter/search options with get_json_object(). That's fine for the foreseeable future, because right now I know where all the data should be and don't need to filter.
I believe I can also use from_json() but that requires defining the schema within the notebook. This isn't a great option because I only need a small part of the JSON, so it feels like unnecessary effort to define the entire schema. (I also don't have control over what the overall schema would be, so this becomes a maintenance issue.)
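That said, from_json() does not necessarily mean hand-writing the whole schema: schema_of_json() (Spark 2.4+) can infer a DDL schema string from one sample event, and then only the nested fields of interest need to be selected. A rough sketch under those assumptions; the field paths are illustrative and follow the schema sketched earlier:

from pyspark.sql import functions as F

# Infer a DDL schema string from one representative event.
# Note: a single sample only captures fields present in that row.
sample = df.select("unencoded_event").first()[0]
ddl_schema = spark.range(1).select(F.schema_of_json(F.lit(sample))).first()[0]

# Parse every row with the inferred schema, then keep only the columns of interest.
with_struct = df.withColumn("event_struct", F.from_json("unencoded_event", ddl_schema))
dfParsed = with_struct.select(
    F.col("event_struct.agent")[0]["name"].alias("AgentName"),
    F.col("event_struct.agent")[0]["organization"].alias("Organization"),
)

The maintenance concern still applies, though: if the upstream schema changes, the inferred schema changes with the sample rather than with a definition you control.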

Related

Creating External Table with Redshift Spectrum from nested JSON

I’m creating an external table from JSON data with input format org.apache.hadoop.mapred.TextInputFormat, output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, and SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the JSON is a highly nested object called groups. The nested data doesn't follow a strict schema, so not all JSON within groups has the same attributes. I'm having trouble accessing the attributes of groups, and I suspect that I am not casting groups to the proper data type.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried using the SUPER type: when I select groups I get the entire JSON, but when I select an attribute of groups, such as select groups.sellersAuths from ... or select groups."sellersAuths" from ..., I get "relation groups does not exist".
I've tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>; however, when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table, or whether my use case is what the SUPER type is for.
I've also tried using JSON_PARSE after casting the data to VARCHAR, SUPER, or struct, but that presents issues as well.
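For what it's worth, a quick look at the sample rows suggests why a struct<key ..., value ...> cast comes back NULL: groups is not a struct with literal fields named key and value, it is nested maps whose keys (sellersAuths, mws_region, USAmazon, ...) vary from record to record. A small Python sketch, only to inspect the nesting of the first sample line above:

import json

# The "groups" attribute of the first sample record, shortened to the part that matters.
sample = ('{"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},'
          '"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},'
          '"sellersDeAuths":{"mws_region":{"EUAmazon":0},'
          '"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}')

groups = json.loads(sample)["groups"]

# Every level is a dict keyed by data-dependent names, i.e. map-like,
# not a struct with fixed "key"/"value" fields.
for group_name, attrs in groups.items():
    for attr_name, counts in attrs.items():
        for label, value in counts.items():
            print(group_name, attr_name, label, value)
# sellersAuths mws_region USAmazon 1
# sellersAuths created_by SWIPE 1
# ...

So the column either needs the real group and attribute names spelled out as nested struct fields, or a type that can navigate arbitrary keys; struct fields only match JSON attributes with exactly those names.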
Thanks a ton for reading!

How to convert date from csv file into integer

I have to send data from a CSV file into a SQL database.
The problem starts when I try to convert the data into int. It wasn't my idea and I really can't do much about this data type. When I try, this problem pops up:
Data Conversion 2: Data conversion failed while converting column
"pr_czas" (387) to column "C pr_dCz_id" (14). The conversion returned
status value 2 and status text "The value could not be converted
because of a potential loss of data.".
I already tried to ignore this problem, but then other problems came up, so there is no way around solving it.
I have to convert this data from the CSV file, where it comes in as str(50), into int(4).
It must be int4; that's one of the requirements. I don't know what to do.
This is the data I'm trying to put into int4 (see the pr_czas column in the screenshot), along with its data type.
Before this, I tried to do the same thing with just DD.MM.YYYY but got the same result.
Given an input column named [pr_czas] that contains string values like 31.01.2020 00:00, which appear to be date-times in the format "DD.MM.YYYY HH:MM", I would like to express that as the whole number DDMMYYYYHHMM.
Add a derived column to your data flow and call it new_pr_czas.
The logic I'm going to use is a series of REPLACE calls, with the final result cast to an integer: replace the periods, the colon, and the space, all with nothing.
(DT_I8)REPLACE(REPLACE(REPLACE([pr_czas], ".", ""), ":", ""), " ", "")
This is an easy case, but there are things to note.
An integer/int32/I4 has a maximum value of about 2.1 billion (2,147,483,647).
310120200000 is too large to fit into that space, so you would need to make it a bigint/int64/I8. If I remember your previous question correctly, you were having trouble with a lookup task, so this data type mismatch might hurt you there.
The other thing to be aware of is that leading zeros will be dropped when the value is converted to a number, because they are not significant. If you need to retain the leading zeros, then you're working with a string data type. This is an advantage of working with the ISO standard, but if your data expects DD first, then far be it from me to say otherwise.
If you need to slice your date into another format, then you'll want a few derived columns. The first one will generate a string column for each piece of pr_czas: year, month, day, hour, and minute. You'll use SUBSTRING for this, and FINDSTRING to locate the periods, the space, and the colon.
The next derived column will put those string pieces back together in the new format and cast the result to I8. Why split it up? Because you can't debug it all in one shot, but you can put a data viewer between two derived columns to figure out where a slice went awry.
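To make the size issue concrete, here is the same replace-and-cast logic sketched in Python (the value is the example from above; the SSIS expression earlier is what actually runs in the package):

# Rough illustration of the derived-column logic, not SSIS code.
pr_czas = "31.01.2020 00:00"

# Equivalent of the nested REPLACE calls: strip '.', ':' and ' '.
digits = pr_czas.replace(".", "").replace(":", "").replace(" ", "")
value = int(digits)                       # 310120200000

print(value > 2**31 - 1)                  # True: too big for int32/DT_I4, hence DT_I8

# Re-slicing the pieces, which is what the SUBSTRING/FINDSTRING derived columns would do.
day, month, year = pr_czas[0:2], pr_czas[3:5], pr_czas[6:10]
hour, minute = pr_czas[11:13], pr_czas[14:16]
print(int(year + month + day + hour + minute))   # 202001310000, still needs I8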

Can I select all same fields of a json field regardless of their path

I have a lot of JSON documents stored in a json column in Postgres. There are plenty of places where the key "warning" can appear. Unfortunately I cannot get a JSON schema, so I cannot know in advance exactly where all the warning keys will show up. So I would like to do something like this:
select report #> '{*,warning}' from foo;
Is there some way to use wildcards in paths? Or is the only way to dynamically traverse a JSON value, say key by key, recursively in a PL/pgSQL function (if it is even possible to return a set of cursors as one big cursor)?
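There does not appear to be wildcard support in the #> path syntax, so recursion is the usual fallback. Just to illustrate the "traverse key by key" idea outside the database, a small Python sketch that collects every value stored under a "warning" key at any depth (the report document here is made up):

import json

def find_warnings(node, path=()):
    """Yield (path, value) for every 'warning' key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "warning":
                yield path + (key,), value
            yield from find_warnings(value, path + (key,))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_warnings(item, path + (i,))

report = json.loads('{"a": {"warning": "low disk"}, "b": [{"warning": "stale data"}]}')
for path, value in find_warnings(report):
    print(path, value)
# ('a', 'warning') low disk
# ('b', 0, 'warning') stale data

Inside Postgres, the same walk could presumably be expressed with a recursive query over json_each()/jsonb_each(), but that is beyond this sketch.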
EDIT:
Interestingly, the good old xml data type can do exactly what I need, so I am a bit puzzled why we cannot do the same operations on JSON documents, like:
select xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<root><oldtowns><town>Toronto</town><town>Ottawa</town></oldtowns><newtowns><town>Toronto</town><town>Ottawa</town></newtowns></root>');
select * from xmltable('//town' PASSING by ref '<root><oldtowns><town>Toronto</town><town>Ottawa</town></oldtowns><newtowns><town>Toronto</town><town>Ottawa</town></newtowns></root>' columns town varchar path 'text()')

Parsing json into data structures with lower case field names

I am parsing JSON into ABAP structures, and it works:
DATA cl_oops TYPE REF TO cx_dynamic_check.
DATA(text) = `{"TEXT":"Hello ABAP, I'm JSON!","CODE":"123"}`.

TYPES: BEGIN OF ty_structure,
         text TYPE string,
         code TYPE char3,
       END OF ty_structure.

DATA: wa_structure TYPE ty_structure.

TRY.
    text = |\{"DATA":{ text }\}|.
    CALL TRANSFORMATION id OPTIONS clear = 'all'
                           SOURCE XML text
                           RESULT data = wa_structure.
    WRITE: wa_structure-text , wa_structure-code.
  CATCH cx_transformation_error INTO cl_oops.
    WRITE cl_oops->get_longtext( ).
ENDTRY.
The interesting part is that CODE and TEXT are case-sensitive. For most external systems, all-caps identifiers are ugly, so I have been trying to parse {"text":"Hello ABAP, I'm JSON!","code":"123"} without any success. I looked into the options, checked whether a modified copy of the id transformation might accomplish this, and googled it, but had no idea how to accomplish it.
It turns out that SAP has a sample program showing how to do this.
There is basically an out-of-the-box transformation that does this for you, called demo_json_xml_to_upper. The name is a bit unfortunate, so I would suggest renaming the transformation and adding it to the customer namespace.
I am a bit bummed that this only works through xstrings, so debugging it becomes a pain. But it works perfectly and solved my problem.
My solution to this is low-tech. I spent hours looking for a simple way out of this mess, where the JSON response could have the field names in lower or camel case. Here it is: if you know the field names (and obviously you do, because your table has the same column names), just replace the lower-case name with an upper-case one in your xstring.
If the field in your table is USERS_ID and in the JSON xstring it is users_ID, go for this:
replace all occurrences of 'users_ID' in ls_string with 'USERS_ID'.
Do the same for all fields and the object name and call transformation ID.

MySqlImport - Import a date field not in the proper format

I have a CSV file that has (among other fields) a date field in a format like:
17DEC2009
When I do a mysqlimport, the other fields are imported properly, but this field remains 0000-00-00 00:00:00
How can I import this date properly? Do I have to run a sed/awk command on the file first to put it into a proper format? If so, what would that be like? Does the fact that the month is spelled out instead of a number matter?
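If preprocessing the file first (the sed/awk route asked about above) turns out to be the easier path, a rough Python sketch could rewrite the column before import; the file names and column position here are hypothetical:

import csv
from datetime import datetime

# Hypothetical layout: the date sits in the third column (index 2) of input.csv.
with open("input.csv", newline="") as src, open("fixed.csv", "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    for row in reader:
        # '17DEC2009' -> '2009-12-17'; Python's %b matches abbreviated month
        # names case-insensitively (assuming an English locale), so DEC is fine.
        row[2] = datetime.strptime(row[2], "%d%b%Y").strftime("%Y-%m-%d")
        writer.writerow(row)

That said, the answers that follow show it can also be handled entirely inside MySQL.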
STR_TO_DATE() enables you to convert a string to a proper DATE within the query. It expects the date string and a format string.
Check the examples in the manual entry to figure out the correct format.
I think it should be along the lines of %d%b%Y (however, %b corresponds to abbreviated month names like Dec rather than DEC, so you will have to try out whether it works).
I had this issue in the past. What I had to do was use LOAD DATA and set the appropriate expression here:
[SET col_name = expr,...]
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Here is the approach I took to solve a similar problem. My use case was a bit more complex, with many columns, but I'll keep it simple to present the solution.
I have a Persons table with (Id int auto-generated, name varchar(100), DOB date), and a few million rows of (name, DOB) data need to be populated from a CSV file.
Created an additional column in the Persons table, something like (varchar_DOB varchar(25)).
Imported the data using the mysqlimport utility into the columns (name, varchar_DOB).
Executed an UPDATE query that populated the DOB column using str_to_date(varchar_DOB, 'format').
Now the DOB column is populated with the expected data.
The same logic can be applied to other kinds of data formatting as well, such as double, timestamp, etc.