How to query a JSON file located at S3 using Presto

I have a JSON file stored in an Amazon S3 location and I want to query it using Presto. How can I achieve this?

Option 1 - Presto on EMR with the built-in json_extract function
I am assuming that you have already launched Presto on EMR.
The easiest way to do this would be to use the json_extract function that comes with Presto by default.
So imagine you have a JSON file on S3 like this:
{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}
{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}
{"a": "a_value3", "b": { "bb": "bb_value3" }, "c": "c_value3"}
...
...
Each row represents a JSON object.
So you can simply define a table in Presto with a single field of string type, and then query the table with json_extract.
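For example, a minimal external table over the S3 location could be defined through the Hive connector along these lines (a sketch only; the catalog, schema, table and column names and the S3 path are placeholders, and the external_location property assumes the Hive connector is configured to allow external tables):
CREATE TABLE hive.default.my_table (
    json_field varchar  -- each JSON line of the file ends up as the value of this single column
)
WITH (
    format = 'TEXTFILE',
    external_location = 's3://my-bucket/my-json-files/'
);
With a table like that in place, the query is simply: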
SELECT json_extract(json_field, '$.b.bb') as extract
FROM my_table
The result would be something like:
| extract   |
|-----------|
| bb_value1 |
| bb_value2 |
| bb_value3 |
This can be a fast and easy way to read a JSON file with Presto, but unfortunately it doesn't scale well for big JSON files.
Presto docs on json_extract: https://prestodb.github.io/docs/current/functions/json.html#json_extract
Option 2 - Presto on EMR with a specific SerDe for JSON files
You can also customize Presto in the bootstrap phase of your EMR cluster, by adding custom plugins or SerDe libraries.
So you just have to choose one of the available JSON SerDe libraries (e.g. org.openx.data.jsonserde.JsonSerDe) and follow its guide to define a table that matches the structure of the JSON file, as sketched below.
You will be able to access the fields of the JSON file in a similar way to json_extract (using dotted notation), and it should be faster and scale better on big files. Unfortunately, this method has 2 main problems:
1) Defining a table for complex files is like being in hell.
2) You may get internal Java cast exceptions, because the data in the JSON cannot always be cast cleanly by the SerDe library.
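For reference, a Hive table definition for the sample file above using the OpenX SerDe might look roughly like this (a sketch; the table name and S3 path are placeholders, and the DDL would typically be run in Hive on the EMR cluster so that Presto can then query the table through the Hive connector):
CREATE EXTERNAL TABLE my_json_table (
    a string,
    b struct<bb:string>,   -- nested object mapped to a struct
    c string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/my-json-files/';
From Presto the nested field is then reachable with dotted notation, e.g. SELECT a, b.bb FROM my_json_table.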
Option 3 - Athena built-in JSON SerDe
https://docs.aws.amazon.com/athena/latest/ug/json.html
It seems that some JSON SerDes are also built into Athena. I have personally never tried them, but since they are managed by AWS it should be easier to set everything up.

Rather than installing and running your own Presto service, there are some other options you can try:
Amazon Athena is a fully-managed Presto service. You can use it to query large datastores in Amazon S3, including compressed and partitioned data.
Amazon S3 Select allows you to run a query on a single object stored in Amazon S3. This is possibly simpler for your particular use-case.
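For the sample file above, an S3 Select expression could look something like this (a sketch; the expression is passed to the SelectObjectContent API or entered in the S3 console, with the input serialization set to JSON lines):
SELECT s.a, s.b.bb
FROM S3Object s
WHERE s.c = 'c_value1'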

Related

MemSQL get keys from array objects as CSV

I have a JSON field with a value in the format below.
[
{"key1": 100},
{"key2": 101},
{"key3": 102},
{"key4": 103}
]
I want to convert it to a value like key1,key2,key3,key4
SingleStore (formerly MemSQL) is ANSI SQL compliant and MySQL wire protocol compatible, making it easy to connect with existing tools.
Depending on your JSON use case, SingleStore supports Persisted Computed Columns for JSON blobs.
The JSON Standard is fully supported by SingleStore.
SingleStore also supports built-in JSON functions to streamline JSON blob extraction.
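As an illustration of the persisted computed column feature mentioned above (a sketch only; it does not by itself produce the comma-separated key list, and the table, column and extracted field names are made up):
CREATE TABLE json_docs (
    id BIGINT NOT NULL,
    doc JSON NOT NULL,
    -- value of "key1" in the first array element, computed and stored at write time
    key1_value AS JSON_EXTRACT_DOUBLE(doc, 0, 'key1') PERSISTED DOUBLE,
    PRIMARY KEY (id)
);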

Multi-line JSON file querying in Hive

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.
I have an S3 bucket with multi-line indented .json files (don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
Is there a SerDe format out there that is able to parse multi-line indented .json files?
If there isn't a SerDe format to do this:
Is there a best practice for dealing with files like this?
Should I plan on flattening these records out using a different tool like python?
Is there a standard way of writing custom SerDe formats, so I can write one myself?
Example file body:
[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]
There is unfortunately no SerDe that supports multi-line JSON content. There is the specialized CloudTrail SerDe, which supports a format similar to yours but is hard-coded for the CloudTrail JSON format only; still, it shows that it is at least theoretically possible. Currently there is no way to write your own SerDes for use with Athena, though.
You won't be able to consume these files with Athena; you will have to use EMR, Glue, or some other tool to reformat them into JSON stream files (one record per line) first.

How to bind dynamic JSON objects to PostgreSQL, using mongodb_fdw?

The Foreign Data Wrapper for MongoDB is pretty awesome! I've gotten it to work using these instructions, apart from:
an object with dynamic fields inside it - which PostgreSQL type should I use for that?
{
"key1": some,
...
}
an array of objects - which PostgreSQL type should I use for that? The length of the array may vary, but the objects are uniform in their inner structure.
[ { "a": 1 }, { "a": 2 }, { "a": 3 } ]
I found these slides on JSON capabilities in recent PostgreSQL versions. Neat. But BSON, JSON or JSONB don't seem to be recognized by the FDW as SQL data types.
If I use:
CREATE FOREIGN TABLE t6
(
"aaa.bbb" JSON -- 'bbb' is an array of JSON objects
)
SERVER mongo_server OPTIONS(...);
SELECT "aaa.bbb" AS bbb FROM t6;
I get:
psql:6.sql:152: ERROR: cannot convert bson type to column type
HINT: Column type: 114
The normal types TEXT, FLOAT etc. work.
The EnterpriseDB fork does it, as #pozs pointed out. Just mark your data as the JSON type.
However, the build system is rather bizarre for my taste, and does not really give you the right errors for missing build components (it's obviously Linux-based and simply expects you to have a bunch of tools without properly checking for them).
Here's how I managed to build it on OS X + Homebrew:
$ brew install libtool libbson autoconf automake
$ ./autogen.sh --with-legacy
Note that the --with-meta variant does not provide JSON support, which was the reason I went for this fork anyway.
ref. https://github.com/EnterpriseDB/mongo_fdw/issues/20
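Once it is built and the column is declared as JSON, as in the CREATE FOREIGN TABLE above, the usual PostgreSQL json operators should work on it, along these lines:
SELECT ("aaa.bbb" -> 0) ->> 'a' AS first_a
FROM t6;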

Load different JSON schemas in Pig

I would like to know how to read different JSON schemas from one file in Pig.
In Hadoop I would use a JSON parser and find out with if statements what kind of JSON element it is.
The JSON elements inside one document are:
{"a": "bla", "e": 123, "f": 333}
{ "a": "bla", "c": "aa"}
I tried to load the first JSON schema with the following command:
A = load '/usr/local/hadoop/stuff.net' USING JsonLoader('a:chararray, e:int, f:int');
DUMP A;
It throws the error: ERROR 2088: Fetch failed. Couldn't retrieve result
The second query works:
B = load '/home/hadoop/Desktop/aaa' USING JsonLoader('a:chararray, c:chararray');
DUMP B;
But it also shows me results from the first statement.
So I wanted to ask how to load different JSON schemas from the same file, or is that not possible?
I think that you can use Twitter's elephant-bird project. You can find some examples here.
Usage is quite easy: you just register the jar file and then you can use the elephant-bird loader to load nested JSON:
REGISTER 'elephant-bird.jar';
json_file_00 = LOAD 'json_file.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
json_file_01 = FOREACH json_file_00 GENERATE json_file_00#'fieldName' AS field_name;
I am also using the akela project from Mozilla, which is great but outdated.

How to store JSON efficiently?

I'm working on a project which needs to store its configuration in a JSON database.
The problem is how to store that database efficiently, meaning:
don't rewrite the whole JSON tree to a file on each modification
handle multiple concurrent read/write accesses
all of this without using a server external to the project (which is itself a server)
Take a peek at MongoDB, which uses BSON (binary JSON) to store data. http://www.mongodb.org/display/DOCS/BSON
http://www.mongodb.org/display/DOCS/Inserting
Edit 2021:
Today I would rather recommend using PostgreSQL to store JSON:
https://info.crunchydata.com/blog/using-postgresql-for-json-storage
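A minimal sketch of that approach (table and column names are made up; a jsonb column plus a GIN index gives you indexed containment queries):
CREATE TABLE config (
    id  serial PRIMARY KEY,
    doc jsonb NOT NULL
);
CREATE INDEX config_doc_idx ON config USING GIN (doc);
-- store and query a configuration document
INSERT INTO config (doc)
VALUES ('{"test": {"option": true, "options": [1, 2]}}');
SELECT doc -> 'test' -> 'options' AS options
FROM config
WHERE doc @> '{"test": {"option": true}}';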
I had an idea which fits my needs:
For the in-memory configuration, I use a JSON tree (with the jansson library)
When I need to save the configuration, I retrieve an XPath-style path for each element in the JSON tree, use it as a key, and store the key/value pair in a BerkeleyDB database.
For example:
{
  "test": {
    "option": true,
    "options": [ 1, 2 ]
  }
}
will give the following key/value pairs:
Key              | Value
-----------------+-------
/test/option     | true
/test/options[1] | 1
/test/options[2] | 2