What's the difference between JSON data being encoded or not?

What's the purpose (not what it becomes) of doing json_encode on this before putting it into the database:
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point in doing this because I need to json_decode it again before serving it back... can anybody clear this up for me?

Do not store JSON-encoded data in the database. You defeat the whole point of a relational database that way and make searching for values an expensive task. I see in your sample the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
Searching works with the JSON column type too, but it is more expensive than using normal columns.
Edit: Check the use case for your database entry. If you are sure you will never need to search in or order by the encoded attributes, you can store the data encoded as a JSON string. However, if your database supports the JSON column type, you should use that one, because it allows searching in the stored JSON (though it is more expensive than searching in normal columns).
Second point: the second code snippet (with the quotation marks) looks like invalid syntax for JSON.
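To make the search cost concrete, here is a minimal Python sketch using SQLite; the table and column names are made up for illustration:

import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: real columns - the database can filter for you
conn.execute("CREATE TABLE ratings (id INTEGER PRIMARY KEY, cleanliness INT, publicFacility INT, roomFacility INT, security INT)")
conn.execute("INSERT INTO ratings (cleanliness, publicFacility, roomFacility, security) VALUES (3, 1, 2, 2)")
high = conn.execute("SELECT id FROM ratings WHERE cleanliness > 2").fetchall()

# Option 2: one encoded string - every row must be fetched and decoded in application code
conn.execute("CREATE TABLE ratings_blob (id INTEGER PRIMARY KEY, rating TEXT)")
encoded = json.dumps({"cleanliness": 3, "publicFacility": 1, "roomFacility": 2, "security": 2})
conn.execute("INSERT INTO ratings_blob (rating) VALUES (?)", (encoded,))
high_blob = [r for (r,) in conn.execute("SELECT rating FROM ratings_blob").fetchall()
             if json.loads(r)["cleanliness"] > 2]

The json.dumps/json.loads pair plays the same role here as PHP's json_encode/json_decode: the only purpose of encoding is to turn the structure into a single string that can be stored in a text column.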

Related

Loading Raw JSON Into Delta Lake (Like in Snowflake)

I am testing Delta Lake for a simple use case that is very easy in Snowflake, but I'm having a heck of a time understanding if it can be done, much less actually doing it.
I want to be able to load a JSON file "raw," without specifying a schema, and I want to be able to query and flatten it later. In Snowflake, I can create a column of type VARIANT and load the JSON text there, and later I can ask for the different parts by using :: and lateral flatten, etc.
The examples I've seen so far for Delta Lake involve "schema inference" or "autoloading", and with those it seems that even if I don't specify a schema, one is created for me, and then I still have to guess (or look up) what columns Delta Lake created so I can query those parts of the JSON. It seems a little too complicated.
This page has the following comment:
When ingesting data, you may need to keep it in a JSON string, and some data may not be in the correct data type.
... but it provides no example of how to do that. To me this suggests that you can somehow store the raw JSON and query it later, but I don't know how. Just make a STRING column and insert the JSON as string? Can someone post an example?
Am I trialing the wrong tool for what I need, or am I missing something? Thank you for your help.
As far as I'm aware, Delta Lake has no direct equivalent to Snowflake's VARIANT column. What that page is suggesting is storing the data as a string, and then using the semi-structured access operators to parse it as JSON on the fly.
e.g. given a table named devices with a column named specifications of type string with value
"""{
"device": "potato phone",
"sku": "POTATO0001",
}"""
Then you can query it like this:
SELECT specifications:device, specifications:sku FROM devices
Edit: to address some of your other questions.
This doesn't do schema enforcement. It's possible to create a Struct column in Delta Lake that can store structured data, but all the data in that column needs to be compatible with the Struct schema. If you are querying a JSON string column, you are on your own for schema management.
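Here is a minimal PySpark sketch of that load-raw-then-parse approach, assuming a Spark session with Delta Lake available; the file path and table name are hypothetical, and get_json_object is just one of several ways to pull fields out of a JSON string at query time:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read each line of the file as one raw string (assumes one JSON document
# per line); no schema inference happens here
raw = spark.read.text("/data/devices.json").withColumnRenamed("value", "specifications")

# Persist the raw strings to a Delta table
raw.write.format("delta").mode("append").saveAsTable("devices")

# Parse fields out of the string only when querying
spark.table("devices").select(
    F.get_json_object("specifications", "$.device").alias("device"),
    F.get_json_object("specifications", "$.sku").alias("sku"),
).show()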

How do you represent a list in a CSV?

What is the standard way to represent a list/array value in CSV? For example, given this source data in JSON:
[
  {
    "name": "Harry",
    "subjects": ["math", "english", "history"]
  }
]
My guess as to a CSV representation would be:
name,subjects
Harry,["math","english","history"]
However that doesn't get parsed correctly (with the standard Python CSV parser).
One option, though this is almost always a hack and should be avoided unless truly necessary, is to choose a delimiter that you know will never show up in your data. For example:
name,subject
Harry,math|english|history
Of course you will have to manually handle splitting this string and turning it back into a list (see the sketch below). Existing CSV parsers won't support this, because the concept fundamentally does not exist in CSV.
And of course, this does not generalize well - what happens in the future when you need to store a 2D list, or a dict, or you realize you do need that delimiter character after all?
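A minimal Python sketch of that manual handling, assuming "|" never occurs in the data:

import csv
import io

rows = [{"name": "Harry", "subjects": ["math", "english", "history"]}]

# Write: join the list with the reserved delimiter
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "subjects"])
for row in rows:
    writer.writerow([row["name"], "|".join(row["subjects"])])

# Read: split the field back into a list by hand
buf.seek(0)
reader = csv.DictReader(buf)
parsed = [{"name": r["name"], "subjects": r["subjects"].split("|")} for r in reader]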
The root problem here is that CSV is a tabular format, whereas JSON is a hierarchical format. Rather than trying to "squeeze" one format to fit into a fundamentally incompatible format, you should instead normalize your data into a tabular representation. One example of how this could look:
name,subject
Harry,math
Harry,english
Harry,history
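That normalized form is straightforward to produce in Python; a sketch:

import csv
import io

data = [{"name": "Harry", "subjects": ["math", "english", "history"]}]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "subject"])
for record in data:
    for subject in record["subjects"]:
        writer.writerow([record["name"], subject])

print(buf.getvalue())

Reassembling the lists later is then a matter of grouping rows by name (e.g., itertools.groupby over sorted rows, or GROUP BY in SQL).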

Remove JSON keys with wildcards from a MySQL field

I have a MySQL 8.0.20 database with a table that describes metadata about uploaded image files. One column contains a JSON object with a whole bunch of auto-generated data that I'm trying to clean up.
This JSON object sometimes contains one or more variable key names that match a specific pattern. Something like
{
"image_name": "P10043983",
"image_size": "60138",
"image_original_exifdata": "{
'FileName':'P10043983.jpg',
'MimeType':'image/jpeg',
'UndefinedTag:0xA435':'\u0000\u0000\u0000\u0000\u0000\u0000'
}"
}
That UndefinedTag:0xA435 (with many permutations) is the problem. It's referring to various image Exif details like lens type, GPS data, etc. It's stuff that I'm not interested in and that these cameras mostly don't provide, so I've ended up with a table full of long strings of useless characters just taking up space. I want those JSON fields gone for performance and cleanliness.
Is there a way to run a SQL query that would use wildcards or regular expressions to find (and, ideally, remove) all of these pesky variable keys? I'd like to avoid manually making a list of all of the possible "UndefinedTag" keys to search against, and I also didn't like the results when I just treated the whole thing as a string and did REGEXP_REPLACE calls (it sometimes left trailing commas that broke my JSON and were difficult for me to avoid/resolve).
I know some of the JSON functions like JSON_SEARCH() accept wildcards, but the documentation explicitly says the search path can't end in a wildcard (so no UndefinedTag:0x** allowed). Many of the functions I'm after (e.g., JSON_REMOVE()) don't accept wildcards at all. Hell, I've even had trouble finding known keys, and I suspect that silly colon in the key name might have something to do with it.
So, how can I clean up my table and remove the many forms of this UndefinedTag problem? Maybe it's easier to just go back to the REGEXP_REPLACE plan and deal with the trailing commas instead?
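For what it's worth, that regex-plus-comma-cleanup plan is easy to prototype outside MySQL. Here is a hedged Python sketch; the patterns are illustrative, and it assumes the outer column value parses as JSON while the exif blob stays an opaque string:

import json
import re

# Matches entries like 'UndefinedTag:0xA435':'...' inside the exif string,
# together with a trailing comma if one follows (illustrative pattern; tune to your data)
UNDEFINED_TAG = re.compile(r"'UndefinedTag:0x[0-9A-Fa-f]{4}'\s*:\s*'[^']*'\s*,?")

def clean_exif(raw):
    cleaned = UNDEFINED_TAG.sub("", raw)
    # Removing the last entry can leave a dangling comma before the closing brace
    return re.sub(r",\s*([}\]])", r"\1", cleaned)

def clean_metadata(value):
    doc = json.loads(value)  # assumes the outer column value is valid JSON
    exif = doc.get("image_original_exifdata")
    if isinstance(exif, str):
        doc["image_original_exifdata"] = clean_exif(exif)
    return json.dumps(doc)

Each cleaned value can then be written back with an ordinary UPDATE ... WHERE id = ... per row.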

How do I identify this JSON-like data structure?

I just came across a JSON wannabe that decides to "improve" it by adding datatypes... of course, the syntax makes it nearly impossible to google.
a:4:{
s:3:"cmd";
s:4:"save";
s:5:"token";
s:22:"5a7be6ad267d1599347886";
}
Full data is... much larger...
The first letter seems to be a for array, s for string, then the quantity of data (# of array items or length of string), then the actual piece of data.
With this type of syntax, I currently can't Google meaningful results. Does anyone recognize what god-forsaken language or framework this is from?
Note: some genius decided to stuff this data as a single field inside a database, and it included critical fields that I need to perform aggregate functions on. The rest I can handle if I can get a way to parse this data without resorting to ugly serial processing.
If this can be parsed using MSSQL 2008 that results in a view, I'll throw in a bounty...
This is the output of PHP's serialize() function (a for array, s for string, exactly as you deduced). To query it from SQL Server, I would parse it with a UDF written in .NET - https://learn.microsoft.com/en-us/sql/relational-databases/clr-integration-database-objects-user-defined-functions/clr-user-defined-functions
You can either write a custom aggregate function to parse and calculate these nutty fields, or a scalar-valued function that returns the field as JSON.
I'd probably opt for the latter in the name of separation of concerns.
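If you just need to confirm the format or prototype the parsing outside SQL Server, the third-party phpserialize package for Python handles it; the blob below is a two-pair trim of the sample:

import phpserialize

# a:<n> is an array of <n> key/value pairs; s:<len>:"..." is a string
blob = b'a:2:{s:3:"cmd";s:4:"save";s:5:"token";s:22:"5a7be6ad267d1599347886";}'

data = phpserialize.loads(blob, decode_strings=True)
print(data)  # {'cmd': 'save', 'token': '5a7be6ad267d1599347886'}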

Using regex to extract data from structured data

The problem I'm facing here is that I have a blob of text which contains structured data (in the form of a JSON payload) and I'm interested in extracting the value of one of the keys for a specific JSON instance, picture the structured data inside as the following:
"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", "key3":"item1_key3_value"}, "Item 2": {"key1":"item2_key1_value", "key2":"item2_key2_value", "key3":"item2_key3_value"}
What I would like is to use regex to grab item1_key2_value, for instance. The keys all have the same names, but the items are different, so I know which key for which item I need but am not quite sure of the regex to retrieve that value. I've tried a few approaches with some basic matching, but I was wondering if any more experienced regex users could direct me a bit here and explain what I'm doing wrong.
1(.*)(?=item1_key2_value.*) will match a chunk of data from here, but I'm not sure of the best way to reduce it to the value that I need.
The syntax for JSON is clearly specified at http://www.json.org. If you scroll down a little to where it says "A string is a sequence of", you will find the proper string structure.
Assuming the string follows the correct JSON structure, you could use
"key2"\s*:\s*"((\\.|[^\\"])*)"
where \s means whitespace and * means 0 or more times. \\ means a slosh (backslash) character and can be followed by . (any character); if the engine does not encounter a slosh, it instead looks for [^\\"], which matches anything that is neither a slosh nor a quote.
If you want to be a little more strict to the exact JSON form, you could try
"key2"\s*:\s*"((\\["\\/bfnrtu]|[^\\"])*)"
which you can see follows the string form on the webpage more closely.
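A minimal Python sketch of that extraction, first narrowing to the "Item 1" object so the right key2 is captured (the sample blob is taken from the question):

import re

blob = ('"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", '
        '"key3":"item1_key3_value"}, "Item 2": {"key1":"item2_key1_value", '
        '"key2":"item2_key2_value", "key3":"item2_key3_value"}')

# Grab the object that follows "Item 1" (no nested braces in this data)
item1 = re.search(r'"Item 1"\s*:\s*\{[^{}]*\}', blob).group(0)

# Then apply the string pattern from above to pull out key2's value
match = re.search(r'"key2"\s*:\s*"((?:\\.|[^\\"])*)"', item1)
print(match.group(1))  # item1_key2_value

If the blob is, or can be wrapped into, valid JSON, parsing it with json.loads and indexing into the result is the more robust route.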