Is Athena a viable/sensible choice for infrequently searching *unstructured* JSON? - json

I'm recording request and response headers and bodies for all traffic to our API and from our API to 3rd party services into S3 as into tiny objects.
I want to be able to query this data infrequently. For example (pseudo-code):
select $.cars[0].color from "objects" where object_path in (....);
Other info:
Many "objects" in S3 won't have a valid path to $.cars[0].color (it's just one example).
I hope to not use Glue.
Cost is important - this is something that will be queried very infrequently. Configuring some ElasticSearch/similar solution is terribly out of budget for the use case.
I hope to not define my own set of schemas (this is simply not feasible).
Athena says it can search unstructured JSON. I'm having trouble creating a proof-of-concept to show this is true.
Is Athena right fr me? Am I missing a better solution?

I think Athena will work for your case.
Athena handles missing properties in JSON objects. For example, if you define the cars column as array<struct<color:string>>:
the property can be missing ⇒ SELECT cars … will be NULL
it can be an empty list ⇒ SELECT cars[1] … (Athena arrays start at 1) will result in an error, but element_at(cars, 1) and try(cars[1]) will return NULL
the object may be missing the color property ⇒ SELECT cars[1].color … will be NULL
for completely free-form JSON define the column as string and use the JSON functions to query it.
Glue is not necessary. Create the table manually, from your application, or with CloudFormation, and configure it to use partition projection and you will not have to think about using Glue crawlers at all.
Athena doesn't cost anything when you don't use it, and if you will query only infrequently this is key. Make sure to compress your data, and partition it in a way that supports your query patterns (e.g. by date or month if you most often will query recent data).
Not sure what you mean by having to define your "own set of schemas", so perhaps you can clarify that part?

Related

Loading Raw JSON Into Delta Lake (Like in Snowflake)

I am testing Delta Lake for a simple use case that is very easy in Snowflake, but I'm having a heck of a time understanding if it can be done, much less actually doing it.
I want to be able to load a JSON file "raw," without specifying a schema, and I want to be able to query and flatten it later. In Snowflake, I can create a column of type VARIANT and load the JSON text there, and later I can ask for the different parts by using :: and lateral flatten, etc.
The examples I've seen so far about Delta Lake have had "schema inference" or "autoloading" stipulations, and with those it seems that even if I don't specify a schema, one is created for me and then I still have to guess (or look up) what columns Delta Lake created for me so I can query those parts of the JSON. It seems a little too complicated.
This page has the following comment:
When ingesting data, you may need to keep it in a JSON string, and some data may not be in the correct data type.
... but it provides no example of how to do that. To me this suggests that you can somehow store the raw JSON and query it later, but I don't know how. Just make a STRING column and insert the JSON as string? Can someone post an example?
Am I trialing the wrong tool for what I need, or am I missing something? Thank you for your help.
As far as I'm aware, there is no direct equivalent to the VARIANT column in Snowflake. What that page is suggesting is storing the data as a string, and then using the semi-structured access operators to parse it as JSON on the fly.
e.g. given a table named devices with a column named specifications of type string with value
"""{
"device": "potato phone",
"sku": "POTATO0001",
}"""
Then you can query it like this:
SELECT specifications:device, specifications:sku from devices
edit: to address some of your other questions
This doesn't do schema enforcement. It's possible to create a Struct column in delta lake that can store structured data, but all the data in that column need to be compatible with the Struct schema. If you are querying a JSON string column, you are on your own for schema management.

What's the difference between json data being encoded or not

What's the purpose (not what it becomes) of doing json_encode on this before I am putting into the database
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point of doing this cause I need to json_decode it again before serving it back... can anybody clear me out?
Do not store json encoded data in the database. You mitigate the whole point of a relational database this way and make searching for values an expensive task. I see in your sample the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
It works with the JSON column type but it is more expensive than using normal columns.
Edit: Check the use-case for your database entry. If you are sure you never need to search in or order by the encoded attributes you can store data encoded as json string. However, if your database supports the JSON column type, you should use that one because it allows searching in the stored JSON (but is more expensive than searching in normal columns). </Edit>
Second point: The second code snipped (with the quotation marks) looks like invalid syntax for json.

How do I identify this JSON-like data structure?

I just came across a JSON wannabe that decides to "improve" it by adding datatypes... of course, the syntax makes it nearly impossible to google.
a:4:{
s:3:"cmd";
s:4:"save";
s:5:"token";
s:22:"5a7be6ad267d1599347886";
}
Full data is... much larger...
The first letter seems to be a for array, s for string, then the quantity of data (# of array items or length of string), then the actual piece of data.
With this type of syntax, I currently can't Google meaningful results. Does anyone recognize what god-forsaken language or framework this is from?
Note: some genius decided to stuff this data as a single field inside a database, and it included critical fields that I need to perform aggregate functions on. The rest I can handle if I can get a way to parse this data without resorting to ugly serial processing.
If this can be parsed using MSSQL 2008 that results in a view, I'll throw in a bounty...
I would parse it with a UDF written in .NET - https://learn.microsoft.com/en-us/sql/relational-databases/clr-integration-database-objects-user-defined-functions/clr-user-defined-functions
You can either write a custom aggregate function to parse and calculate these nutty fields, or a scalar value function that returns the field as JSON.
I'd probably opt for the latter in the name of separation of concerns.

Change resultset structure elasticsearch

is it possible to change the resultset structure when retrieving data from elastic search?
the problem ist, the timeseries data are sometimes from 3000-8000 records which are a json array with json objects in it ... parsing it in this case not really efficient or necessary so i thought - could a resultset be transformed to just lets say a simple json object with an array of time and array of values? nothing more?
i could do this in java or php but since we want to have an efficent way of dealing with large datasets we are currently evaluating our options.
You can control what elasticsearch returns using source filtering:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-source-filtering.html
It can let you pick which part of the indexed document it will return, which, depending on your index structure could be an array of times and values, or at least, very easily mapped to it using the language of your choice.
Another possibility is to use scripting to control the results. If you map the result in this way, you should be able to get the hits object to be a JSON array of key: value.

If JSON represents the 'object', what represents the 'class'?

JSON appears to be a nice way to represent a complex data structure in plain text. If we think of this complex data structure as analogous to an OOP object - an instance of a class - then is there a commonly used JSON-like format that represents the class itself (just the data part - forget methods)? Can JSON itself be used for this?
To put it another way, if JSON encodes name-value pairs, what should I use if I want to encode only the names?
The reason I want this is that I am designing a protocol to use with jQuery (to which I am a complete novice by the way). The client will communicate to the server the structure of the JSON object it wants back, and the server will return a JSON object of that structure with the values added.
The key point is that it is the client that is in full control of what data fields (name-value pairs) the server returns. It's a bit different from all the examples of jQuery that I've found so far on the web where the client makes a request (which usually includes a very limited set of parameters, if any) and the server makes the decision as to what fields to return in the JSON reply.
(Obviously, what the client asks for must be congruent with the server's data model; if the server has an array of widgets each with its own price, the client can't ask for an array of prices each with its own widget.)
This must be a common problem, and I don't want to reinvent the wheel. I want to adopt a solution that is already in common use across the web.
Edit
I just found JSON Schema. This is not what I am looking for. It contains way more than I need.
Edit
I'm looking more for a 'this is how it is usually done' answer, rather than a 'you could try…' answer. (I can invent dozens of possible answers myself.)
To encode only names within JSON, you could use a key/value pair where the key is either the class name or just a key named 'values' - with the value being an array of strings that are the names to be returned by the server. For example:
{ 'class_name' : [ "name1", "name2", "name3" ] }
The server can then either detect the class name from the key used and return the supplied values for the names in the array if the class supports it or ignore if it does not.
I'm looking more for a 'this is how it is usually done' answer
There is no single "correct" way to do what you want. Many people have their implementation. It depends on various factors -- what you want to do, where you want to do, how efficiently you want it to do?
For simple structures I would prefer and suggest the answer given by #dbr9979.
For nested structures, you can have nested arrays. Something like:
{
"nestedfield1": {
"nestedfield11":["nestedfield111", "nestedfield112"],
"nestedfield12":["nestedfield121", "nestedfield122"],
"__SIMPLE_FIELDS__": ["simplefield13", "simplefield14"]
}
}
The point is, if the key is __SIMPLE_FIELDS__, the value is an array of simple fields (string, numbers etc..), else the key stands for the key in the object.
For something more complex, what I would suggest is you have predefined structures, that both the server and the client know of. This is particularly useful when you have to make multiple identical requests. Assign some unique number for each of them. Something like:
1 => <the structure above>
2 => ["simplefield1", "simplefield2" ..]
3 => etc .. etc
The server stores the above structure and the relevant number in the database or something. And now, as it may be obvious by now, client sends across the id of the required structure, and the server responds in the appropriate fashion.
I think what you meant by this:
the client that is in full control of what data fields (name-value pairs) the server returns.
is like the difference between SELECT * FROM Bags and SELECT color, price FROM Bag in SQL. Am I interpreting you correctly?
You could query with:
{
'resource': 'Bag',
'field_names': ['color', 'price']
}
which will return the response:
{
'status': 'success',
'result': [
{'color': 'red', 'price': 50},
{'color': 'blue', 'price': 45},
]
}
most likely though, you may not actually need your request to be a JSON object; I've seen implementations where the field names is taken from the query string, like http://foo.com/bag?fields=color,price
I was looking for Partial Response.
RESTful API Design: can your API give developers just the information they need? explains it all and gives examples from LinkedIn, Facebook, and Google. Google and Facebook both have similar approaches. Here's how Lie Ryan's example would look using Google's approach:
url?fields=status,result(color,price)
Since Google and Facebook are behind this, I would not be surprised to see this become a de facto standard.
In my case I am likely to run into a length limitation on the URL and so have to use POST instead, but this is an excellent starting point for me.