Multi-line JSON file querying in hive - json

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.
I have an S3 bucket with multi-line indented .json files (don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
Is there a SerDe format out there that is able to parse multi-line indented .json files?
If there isn't a SerDe format to do this:
Is there a best practice for dealing with files like this?
Should I plan on flattening these records out using a different tool like python?
Is there a standard way of writing custom SerDe formats, so I can write one myself?
Example file body:
[
{
"id": 1,
"name": "ryan",
"stuff": {
"x": true,
"y": [
123,
456
]
}
},
...
]

Unfortunately, there is no SerDe that supports multi-line JSON content. The specialized CloudTrail SerDe supports a format similar to yours, but it is hard-coded for the CloudTrail JSON format; it does at least show that this is theoretically possible. Currently there is no way to write your own SerDes for use with Athena, though.
You won't be able to consume these files with Athena directly. You will have to use EMR, Glue, or some other tool to reformat them into JSON stream files (one record per line) first.
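As a rough illustration of the reformatting step, here is a minimal Python sketch that turns one multi-line, indented JSON array into one-record-per-line JSON. The function name to_json_lines is made up for this example, and the sketch assumes each file is small enough to load into memory; reading from and writing back to S3 is left out.

```python
import json
from typing import Iterator


def to_json_lines(document: str) -> Iterator[str]:
    """Yield one compact, single-line JSON string per record.

    Assumes `document` holds either a top-level JSON array of records
    or a single JSON object (treated as one record).
    """
    records = json.loads(document)
    if not isinstance(records, list):
        records = [records]
    for record in records:
        # Compact separators drop optional whitespace; json.dumps also
        # escapes newlines inside strings, so each record stays on one line.
        yield json.dumps(record, separators=(",", ":"))


sample = """
[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [123, 456]
    }
  }
]
"""

for line in to_json_lines(sample):
    print(line)
```

Writing the yielded lines to a new object, one per line, gives a file in the layout that the usual JSON SerDes expect.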

Related

MemSQL get keys from array objects as CSV

I have a JSON field with value in format like below.
[
{"key1": 100},
{"key2": 101},
{"key3": 102},
{"key4": 103}
]
I want to convert it to a value like key1,key2,key3,key4
SingleStore (formerly MemSQL) is ANSI SQL compliant and MySQL wire protocol compatible, making it easy to connect with existing tools.
Depending on your JSON use case, SingleStore supports Persisted Computed Columns for JSON blobs.
The JSON Standard is fully supported by SingleStore.
SingleStore also supports built-in JSON functions to streamline JSON blob extraction.
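The answer above does not show a concrete query. As a hedged illustration of the transformation being asked for, here is a client-side Python sketch. It is not a SingleStore feature, and the function name keys_as_csv is invented for this example.

```python
import json


def keys_as_csv(json_text: str) -> str:
    """Join the keys of every object in a JSON array with commas.

    Assumes `json_text` is a JSON array of objects, as in the question.
    """
    objects = json.loads(json_text)
    # Iterating a dict yields its keys in document order.
    keys = [key for obj in objects for key in obj]
    return ",".join(keys)


value = '[{"key1": 100}, {"key2": 101}, {"key3": 102}, {"key4": 103}]'
print(keys_as_csv(value))
```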

Transform Avro file with Wrangler into JSON in cloud Datafusion

I am trying to read an Avro file, apply a basic transformation (remove records with name = Ben) using Wrangler, and write the result as a JSON file to Google Cloud Storage.
The Avro file has the following schema:
{
"type": "record",
"name": "etlSchemaBody",
"fields": [
{
"type": "string",
"name": "name"
}
]
}
Transformation in wrangler is the following:
transformation
The following is the output schema for JSON file:
output schema
When I run the pipeline it runs successfully and the JSON file is created in cloud storage. But the JSON output is empty.
When trying a preview run I get the following message:
warning message
Why is the JSON output file in gcloud storage empty?
When using the Wrangler to make transformations, the default values for the GCS source are format: text and body: string (data type). To work properly with an Avro file in the Wrangler, you need to change both: set the format to blob and the body data type to bytes.
After that, the preview for your pipeline should produce output records. You can see my working example next:
Sample data
Transformations
Input records preview for GCS sink (final output)
Edit:
You need to set the format: blob and the output schema as body: bytes if you want to parse the file to Avro within the Wrangler, as described before, because it needs the content of the file in a binary format.
On the other hand if you only want to apply filters (within the Wrangler), you could do the following:
Open the file using format: avro, see img.
Set the output schema according to the fields that your Avro file has, in this case name with string data type, see img.
Use only filters on the Wrangler (no parsing to Avro here), see img.
And this way you can also get the desired result.

In greenplum pxf external table get empty string while fetching element from json array of object

I am accessing JSON data through an external table created with the PXF JSON plugin, following the multi-line JSON table example.
When I use the following column definition
"coordinates.values[0]" INTEGER,
I can easily fetch 8 from the JSON below:
"coordinates":{
"type":"Point",
"values":[
8,
52
]
}
But if I change the JSON to something like this
"coordinates": {
"type": "geoloc",
"values":[
{
"latitude" : 72,
"longtitue" : 80
}
]
}
and change the column definition to
"coordinates.values[0].latitude" INTEGER,
the query returns an empty string.
Unfortunately the JSON profile in PXF does not support accessing JSON objects inside arrays. However, Greenplum has very good support for JSON and you can achieve the same result by doing the following:
CREATE EXTERNAL TABLE pxf_read_json (j1 json)
LOCATION ('pxf://tmp/file.json?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
The pxf_read_json table will access JSON files on the external system. Each file is read as multi-line text, and each file becomes a single table row in Greenplum. You can then query the external data as follows:
SELECT values->>'latitude' as latitude, values->>'longtitue' as longitude
FROM pxf_read_json
JOIN LATERAL json_array_elements(j1->'coordinates'->'values') values
ON true;
With this approach, you can still take advantage of PXF's support for accessing external systems while also leveraging the powerful JSON support in Greenplum.
Additional information about reading a multi-line text file into a single table row can be found in the PXF documentation, and information about JSON support can be found in the Greenplum documentation.
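To make the logic of that query easier to follow, here is a small Python sketch of what it computes for one file: parse the whole file as a single JSON value, then emit one row per element of coordinates.values. This only mirrors the semantics and is not how Greenplum or PXF execute it. The function name coordinate_rows is hypothetical, and the key spelling "longtitue" is copied from the sample data.

```python
import json
from typing import List, Optional, Tuple


def coordinate_rows(file_text: str) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return one (latitude, longitude) tuple per element of coordinates.values.

    Unlike the SQL ->> operator, which returns text, this keeps the
    JSON numbers as Python numbers.
    """
    document = json.loads(file_text)  # the whole file is one JSON value
    rows = []
    for element in document["coordinates"]["values"]:
        # .get returns None for a missing key, similar to SQL NULL.
        rows.append((element.get("latitude"), element.get("longtitue")))
    return rows


sample = """
{
  "coordinates": {
    "type": "geoloc",
    "values": [
      {"latitude": 72, "longtitue": 80}
    ]
  }
}
"""
print(coordinate_rows(sample))
```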

How to query json file located at s3 using presto

I have a JSON file stored in Amazon S3 and I want to query it using Presto. How can I achieve this?
Option 1 - Presto on EMR with json_extract built-in function
I am assuming that you have already launched Presto using EMR.
The easiest way to do this would be to use the json_extract function that comes by default with Presto.
So imagine you have a json file on s3 like this:
{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}
{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}
{"a": "a_value3", "b": { "bb": "bb_value3" }, "c": "c_value3"}
...
...
Each row represents a json tree object.
So you can simply define in presto a table with one field which is of string type, and then easily query the table with json_extract.
SELECT json_extract(json_field, '$.b.bb') as extract
FROM my_table
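The result would be something like the following. Note that json_extract returns a JSON value, so strings are displayed with their quotes; json_extract_scalar returns the plain string instead.
| extract     |
|-------------|
| "bb_value1" |
| "bb_value2" |
| "bb_value3" |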
This can be a fast and easy way to read a json file using presto, but unfortunately this doesn't scale well on big json files.
Some presto docs on json_extract: https://prestodb.github.io/docs/current/functions/json.html#json_extract
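For readers who want to see what a path such as '$.b.bb' selects from each row, here is a hedged Python sketch. It is not Presto's implementation and handles only simple dotted paths (no array indexes or wildcards). The function name extract_path is made up for illustration.

```python
import json
from typing import Any


def extract_path(line: str, path: str) -> Any:
    """Return the value at a simple '$.a.b' path, or None if it is absent."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    value = json.loads(line)
    # "$.b.bb" -> ["", "b", "bb"]; the empty segment is skipped.
    for key in path[1:].split("."):
        if not key:
            continue
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


rows = [
    '{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}',
    '{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}',
]
for row in rows:
    print(extract_path(row, "$.b.bb"))
```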
Option 2 - Presto on EMR with a specific Serde for json files
You can also customize Presto in the bootstrap phase of your EMR cluster by adding custom plugins or SerDe libraries.
So you just have to choose one of the available JSON SerDe libraries (e.g. org.openx.data.jsonserde.JsonSerDe) and follow its guide to define a table that matches the structure of the JSON file.
You will be able to access the fields of the JSON file in a way similar to json_extract (using dotted notation), and it should be faster and scale well on big files. Unfortunately, this method has 2 main problems:
1) Defining a table for complex files is like being in hell.
2) You may hit internal Java cast exceptions, because the data in the JSON cannot always be cast easily by the SerDe library.
Option 3 - Athena Built-In JSON Serde
https://docs.aws.amazon.com/athena/latest/ug/json.html
It seems that Athena also has some JSON SerDes built in. I have never tried these personally, but they are managed by AWS, so everything should be easier to set up.
Rather than installing and running your own Presto service, there are some other options you can try:
Amazon Athena is a fully-managed Presto service. You can use it to query large datastores in Amazon S3, including compressed and partitioned data.
Amazon S3 Select allows you to run a query on a single object stored in Amazon S3. This is possibly simpler for your particular use-case.

Efficient Portable Database for Hierarchical Dataset - Json, Sqlite or?

I need to make a file that contains a hierarchical dataset. The dataset in question is a file-system listing (directory names, file name/sizes in each directory, sub-directories, ...).
My first instinct was to use Json and flatten the hierarchy using paths so the parser doesn't have to recurse so much. As seen in the example below, each entry is a path ("/", "/child01", "/child01/gchild01",...) and its files.
{
"entries":
[
{
"path":"/",
"files":
[
{"name":"File1", "size":1024},
{"name":"File2", "size":1024}
]
},
{
"path":"/child01",
"files":
[
{"name":"File1", "size":1024},
{"name":"File2", "size":1024}
]
},
{
"path":"/child01/gchild01",
"files":
[
{"name":"File1", "size":1024},
{"name":"File2", "size":1024}
]
},
{
"path":"/child02",
"files":
[
{"name":"File1", "size":1024},
{"name":"File2", "size":1024}
]
}
]
}
Then I thought that repeating the keys over and over ("name", "size") for each file kind of sucks. So I found this article about how to use Json as if it were a database - http://peter.michaux.ca/articles/json-db-a-compressed-json-format
Using that technique I'd have a Json table like "Entry" with columns "Id", "ParentId", "EntryType", "Name", "FileSize" where "EntryType" would be 0 for Directory and 1 for File.
So, at this point, I'm wondering if sqlite would be a better choice. I'm thinking that the file size would be a LOT smaller than a Json file, but it might only be negligible if I use Json-DB-compressed format from the article. Besides size, are there any other advantages that you can think of?
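One advantage besides size is that the "Entry" layout can be queried directly once it lives in SQLite. Below is a minimal sketch using Python's built-in sqlite3 module. The column names come from the question; the helper names (build_listing_db, files_in, subtree_size) and the sample rows are assumptions made for this example.

```python
import sqlite3


def build_listing_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the Entry table and load a small sample listing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE Entry (
            Id        INTEGER PRIMARY KEY,
            ParentId  INTEGER REFERENCES Entry(Id),
            EntryType INTEGER NOT NULL,  -- 0 = directory, 1 = file
            Name      TEXT NOT NULL,
            FileSize  INTEGER            -- NULL for directories
        )
        """
    )
    rows = [
        (1, None, 0, "/", None),
        (2, 1, 1, "File1", 1024),
        (3, 1, 1, "File2", 1024),
        (4, 1, 0, "child01", None),
        (5, 4, 1, "File1", 1024),
        (6, 4, 0, "gchild01", None),
        (7, 6, 1, "File1", 1024),
    ]
    conn.executemany("INSERT INTO Entry VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def files_in(conn: sqlite3.Connection, directory_id: int):
    """List (name, size) for the files directly inside one directory."""
    cur = conn.execute(
        "SELECT Name, FileSize FROM Entry "
        "WHERE ParentId = ? AND EntryType = 1 ORDER BY Id",
        (directory_id,),
    )
    return cur.fetchall()


def subtree_size(conn: sqlite3.Connection, directory_id: int) -> int:
    """Sum file sizes under a directory using a recursive CTE."""
    row = conn.execute(
        """
        WITH RECURSIVE subtree(Id) AS (
            SELECT ?
            UNION ALL
            SELECT e.Id FROM Entry e JOIN subtree s ON e.ParentId = s.Id
        )
        SELECT COALESCE(SUM(e.FileSize), 0)
        FROM Entry e JOIN subtree s ON e.Id = s.Id
        WHERE e.EntryType = 1
        """,
        (directory_id,),
    ).fetchone()
    return row[0]


conn = build_listing_db()
print(files_in(conn, 1))
print(subtree_size(conn, 4))
```

With a JSON file, the same questions (files in one directory, total size of a subtree) require loading and walking the whole document in application code.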
I think a JavaScript object used as the data source, loaded as a file into the browser and then consumed by JavaScript logic there, would take the least time to build and perform well, but only up to a limited hierarchy size.
Also, keeping the hierarchy only as a JSON file, and not storing it anywhere else, badly limits your data source to client-side technologies, or forces conversions for other technologies.
If you are building a pure JavaScript-based application (an HTML, JS, and CSS only app), then you could keep it as a JSON object alone and limit your hierarchy sizes. You could split bigger hierarchies into multiple files that link JSON objects.
If your project will have server-side code such as PHP, then considering manageability of code and scaling, you should ideally store the data in a SQLite DB and, at runtime, create your JSON hierarchies for a limited number of levels, loaded via AJAX from your page.
If this is the only data your application stores then you can do something really simple like just store the data in an easy to parse/read text file like this:
File1:1024
File2:1024
child01
    File1:1024
    File2:1024
    gchild01
        File1:1024
        File2:1024
child02
    File1:1024
    File2:1024
Files get File:Size and directories get just their name. Indentation gives structure. For something slightly more standard but just as easy to read, use yaml.
http://www.yaml.org/
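To show how little code the indented text format needs, here is a hedged Python sketch of a reader for it. It assumes four spaces per nesting level and that directory names never contain a colon. The function name parse_listing is invented for this example.

```python
from typing import Dict, List, Tuple


def parse_listing(text: str, indent: int = 4) -> Dict[str, List[Tuple[str, int]]]:
    """Map each directory path to its list of (file name, size) pairs."""
    listing: Dict[str, List[Tuple[str, int]]] = {"/": []}
    # stack[d] is the directory that owns the lines found at depth d.
    stack = ["/"]
    for raw in text.splitlines():
        content = raw.strip()
        if not content:
            continue
        depth = (len(raw) - len(raw.lstrip(" "))) // indent
        del stack[depth + 1:]  # leave any directories deeper than this line
        parent = stack[-1]
        if ":" in content:
            # A "Name:Size" line is a file in the current directory.
            name, size = content.rsplit(":", 1)
            listing[parent].append((name, int(size)))
        else:
            # A bare name is a sub-directory of the current directory.
            path = parent.rstrip("/") + "/" + content
            listing[path] = []
            stack.append(path)
    return listing


sample = (
    "File1:1024\n"
    "child01\n"
    "    File1:1024\n"
    "    gchild01\n"
    "        File1:1024\n"
    "child02\n"
    "    File1:1024\n"
)
for path, files in parse_listing(sample).items():
    print(path, files)
```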
Both can benefit from decreased file size (but decreased user readability) by gzipping the file.
And if you have more data to store, then use SQLite. SQLite is great.
Don't use JSON for data persistence. It's wasteful.