MemSQL get keys from array objects as CSV - mysql

I have a JSON field with a value in a format like the one below.
[
{"key1": 100},
{"key2": 101},
{"key3": 102},
{"key4": 103}
]
I want to convert this to a value like key1,key2,key3,key4

SingleStore (formerly MemSQL) is ANSI SQL compliant and MySQL wire protocol compatible, making it easy to connect with existing tools.
Depending on your JSON use case, SingleStore supports Persisted Computed Columns for JSON blobs.
The JSON Standard is fully supported by SingleStore.
SingleStore also supports built-in JSON functions to streamline JSON blob extraction.
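For instance, here is a minimal sketch of what the built-in extraction functions and a persisted computed column might look like, assuming a hypothetical table t with a JSON column j holding the array from the question and zero-based array indexes in keypaths (check the exact function names and syntax against your MemSQL/SingleStore version):

CREATE TABLE t (id INT PRIMARY KEY, j JSON);
INSERT INTO t VALUES (1, '[{"key1": 100}, {"key2": 101}, {"key3": 102}, {"key4": 103}]');

-- Built-in extraction: array elements are addressed by index in the keypath
SELECT JSON_EXTRACT_JSON(j, 1) AS second_element,       -- {"key2":101}
       JSON_EXTRACT_DOUBLE(j, 1, 'key2') AS key2_value  -- 101
FROM t;

-- Persisted computed column that materializes one extracted value
ALTER TABLE t ADD COLUMN key2_persisted AS JSON_EXTRACT_DOUBLE(j, 1, 'key2') PERSISTED DOUBLE;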

Related

How to query json file located at s3 using presto

I have a JSON file stored in an Amazon S3 location, and I want to query this JSON file using Presto. How can I achieve this?
Option 1 - Presto on EMR with the built-in json_extract function
I am assuming that you have already launched Presto on EMR.
The easiest way to do this would be to use the json_extract function that comes by default with Presto.
So imagine you have a JSON file on S3 like this:
{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}
{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}
{"a": "a_value3", "b": { "bb": "bb_value3" }, "c": "c_value3"}
...
...
Each row represents a JSON tree object.
So you can simply define a table in Presto with a single field of string type, and then easily query the table with json_extract.
SELECT json_extract(json_field, '$.b.bb') as extract
FROM my_table
The result would be something like:
| extract   |
|-----------|
| bb_value1 |
| bb_value2 |
| bb_value3 |
This can be a fast and easy way to read a JSON file using Presto, but unfortunately it doesn't scale well to big JSON files.
Some presto docs on json_extract: https://prestodb.github.io/docs/current/functions/json.html#json_extract
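For reference, a hedged sketch of what that single-string-column table might look like when declared through Presto's Hive connector (the catalog, schema, table name, and S3 path below are placeholders):

CREATE TABLE hive.default.my_table (
    json_field VARCHAR
)
WITH (
    format = 'TEXTFILE',
    external_location = 's3://my-bucket/json-data/'
);

With TEXTFILE and no column delimiters in play, each whole line of the file typically lands in json_field (the default field delimiter rarely appears in JSON), which is exactly what the json_extract query above operates on.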
Option 2 - Presto on EMR with a specific SerDe for JSON files
You can also customize your Presto installation in the bootstrap phase of your EMR cluster by adding custom plugins or SerDe libraries.
So you just have to choose one of the available JSON SerDe libraries (e.g. org.openx.data.jsonserde.JsonSerDe) and follow their guide to define a table that matches the structure of the JSON file.
You will be able to access the fields of the JSON file in a similar way to json_extract (using dotted notation), and it should be faster and scale well on big files. Unfortunately, this method has two main problems:
1) Defining a table for complex files is like being in hell.
2) You may get internal Java cast exceptions, because the data in the JSON cannot always be cast cleanly by the SerDe library.
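As a hedged sketch, a Hive table definition for the sample documents above using the openx SerDe could look roughly like this (table name and location are placeholders); once it is registered in the metastore, Presto (or Hive) can reach nested fields with dotted notation:

CREATE EXTERNAL TABLE my_json_table (
    a STRING,
    b STRUCT<bb: STRING>,
    c STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/json-data/';

SELECT b.bb FROM my_json_table;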
Option 3 - Athena Built-In JSON Serde
https://docs.aws.amazon.com/athena/latest/ug/json.html
It seems that some JSON SerDes are also built into Athena. I have personally never tried these, but they are managed by AWS, so it should be easier to set everything up.
Rather than installing and running your own Presto service, there are some other options you can try:
Amazon Athena is a fully-managed Presto service. You can use it to query large datastores in Amazon S3, including compressed and partitioned data.
Amazon S3 Select allows you to run a query on a single object stored in Amazon S3. This is possibly simpler for your particular use-case.
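As a rough, hedged example, an S3 Select expression over one of the JSON-lines objects shown earlier might look like this (S3Object is S3 Select's fixed alias for the object being queried; you run the expression through the S3 console, SDKs, or CLI with JSON input serialization):

SELECT s.a, s.b.bb
FROM S3Object s
WHERE s.c = 'c_value1'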

Multi-line JSON file querying in hive

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.
I have an S3 bucket with multi-line indented .json files (don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
Is there a SerDe format out there that is able to parse multi-line indented .json files?
If there isn't a SerDe format to do this:
Is there a best practice for dealing with files like this?
Should I plan on flattening these records out using a different tool like python?
Is there a standard way of writing custom SerDe formats, so I can write one myself?
Example file body:
[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]
There is unfortunately no SerDe that supports multi-line JSON content. There is the specialized CloudTrail SerDe that supports a format similar to yours, but it's hard-coded only for the CloudTrail JSON format; at least it shows that this is theoretically possible. Currently there is no way to write your own SerDes to use with Athena, though.
You won't be able to consume these files with Athena, you will have to use EMR, Glue, or some other tool to reformat them into JSON stream files first.

NiFi non-Avro JSON Reader/Writer

It appears that the standard Apache NiFi readers/writers can only parse JSON input based on Avro schema.
Avro schema is limiting for JSON, e.g. it does not allow valid JSON properties starting with digits.
The JoltTransformJSON processor can help here (it doesn't impose Avro limitations on how the input JSON may look), but it seems that this processor does not support batch FlowFiles. It is also not based on the record readers and writers (maybe because of that).
Is there a way to read arbitrary valid batch JSON input, e.g. in multi-line form
{"myprop":"myval","12345":"12345",...}
{"myprop":"myval2","12345":"67890",...}
and transform it to another JSON structure, e.g. one defined by a JSON schema, using e.g. a JSON Patch transformation, without writing my own processor?
Update
I am using Apache NiFi 1.7.1
Update 2
Unfortunately, @Shu's suggestion did not work. I am getting the same error.
I reduced the case to a single UpdateRecord processor that reads JSON with numeric properties and writes JSON without such properties, using the
myprop : /data/5836c846e4b0f28d05b40202
mapping. Still the same error :(
it does not allow valid JSON properties starting with digits?
This bug, NIFI-4612, was fixed in NiFi 1.5. We can use an AvroSchemaRegistry to define your schema and change the Validate Field Names property to false.
Then we can have Avro schema field names starting with digits.
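For example, with Validate Field Names set to false, a schema like this made-up one (the record name is arbitrary) can be registered in the AvroSchemaRegistry even though one field name starts with digits:

{
  "type": "record",
  "name": "message",
  "fields": [
    { "name": "myprop", "type": "string" },
    { "name": "12345", "type": "string" }
  ]
}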
For more details refer to this link.
Is there a way to read arbitrary valid batch JSON input, e.g. in multi-line form?
This bug, NIFI-4456, was fixed in NiFi 1.7. If you are not using that version of NiFi, we can work around it by creating an array of JSON messages with a comma delimiter, using the flow below.
Flow:
1. SplitText // split the flowfile with a 1-line count
2. MergeRecord // merge the flowfiles into one
3. ConvertRecord
For more details regarding this particular issue, refer to this link (I have explained it with the flow).

Avro Schema: force to interpret value (map, array) as string

I want to convert JSON to Avro via NiFi. Unfortunately the JSON has complex types as values that I want to see as a simple string!
JSON:
"FLAGS" : {"FLAG" : ["STORED","ACTIVE"]}
How can I tell AVRO to simply store "{"FLAG" : ["STORED","ACTIVE"]}" or "[1,2,3,"X"]" as a string?
Thank you sincerely!
The JSON-to-Avro conversion performed by NiFi's ConvertJSONToAvro processor does not really do transformation in the same step. There is a very limited ability to transform based on the Avro schema, mostly omitting input data fields in the output, but it won't coerce a complex structure into a string.
Instead, you should do a JSON-to-JSON transformation first, then convert your summarized JSON to Avro. I think what you are looking for is a structure like this:
{
"FLAGS": "{\"FLAG\":[\"STORED\",\"ACTIVE\"]}"
}
NiFi's JoltTransformJSON and ExecuteScript processors are great for this. If your records are simple enough, maybe even a combination of EvaluateJsonPath ($.FLAGS) and ReplaceText ({ "FLAGS": "${flags:escapeJson()}" }) would do.

How does a client know the datatype of JSON RestResponse

While developing a client application using one of our existing REST services, I have the choice for using JSON or XML responses. The XML responses are described by XSD files with schema information.
With these XML schemas I can determine what datatype a certain result must be, and the client can use that information when presenting the data to the user, or when the client asks the user to change a property. (How is quite another question, btw, as I cannot find any multiplatform Delphi implementation of XML that supports XSD schemas... but like I said: that's another question.)
The alternative is to use a JSON response type, but then the client cannot determine the specific datatype of a property because everything is sent as a string.
How would a client know that one of those properties is an index into an enumerated type, or an integer number, or an amount, or maybe a reference to another object by its ID? (These are just examples.)
I would think that the client should not contain "hardcoded" info on the structure of the response, or am I wrong in assuming that?
JSON doesn't have a rich type system like XML does, nor a schema system for describing things like enumerations and references. But JSON has only a few data types, and the general formatting of the JSON is self-describing in terms of what data type any given value is using (see the official JSON spec for more details):
a string is always wrapped in quotation marks:
"fieldname": "fieldvalue"
a numeric value is digit characters without quotations:
"fieldname": 12345
an object is always wrapped in curly braces:
"fieldname": { ... object data ... }
an array is always wrapped in square brackets:
"fieldname": [ ... array data ... ]
a boolean is always a fixed true or false without quotations:
"name": true
"name": false
a null is always a fixed null without quotations:
"name": null
Anything beyond that will require the client to have external knowledge of the data that is being sent (like a schema in XML, since XML itself does not describe data types at all).
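For illustration, here is a small made-up response that combines those forms; a client can tell each property's JSON type purely from the formatting:

{
  "name": "example",
  "count": 12345,
  "price": 9.99,
  "active": true,
  "parent": null,
  "tags": ["a", "b"],
  "details": { "bb": "bb_value1" }
}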