Remove JSON keys with wildcards from a MySQL field - mysql

I have a MySQL 8.0.20 database with a table that describes metadata about uploaded image files. One column contains a JSON object with a whole bunch of auto-generated data that I'm trying to clean up.
This JSON object sometimes contains one or more variable key names that match a specific pattern. Something like
{
"image_name": "P10043983",
"image_size": "60138",
"image_original_exifdata": "{
'FileName':'P10043983.jpg',
'MimeType':'image/jpeg',
'UndefinedTag:0xA435':'\u0000\u0000\u0000\u0000\u0000\u0000'
}"
}
That UndefinedTag:0xA435 (with many permutations) is the problem. It's referring to various image Exif details like lens type, GPS data, etc. It's stuff that I'm not interested in and that these cameras mostly don't provide, so I've ended up with a table full of long strings of useless characters just taking up space. I want those JSON fields gone for performance and cleanliness.
Is there a way to run a SQL query that would use wildcards or regular expressions to find (and, ideally, remove) all of these pesky variable keys? I'd like to avoid manually making a list of all of the possible "UndefinedTag" keys to search against, and I also didn't like the results when I just treated the whole thing as a string and did REGEXP_REPLACE calls (it sometimes left trailing commas that broke my JSON and were difficult for me to avoid/resolve).
I know some of the JSON functions like JSON_SEARCH() accept wildcards, but it explicitly says the search path can't end in a wildcard (so no UndefinedTag:0x** allowed). Many of the functions I'm after (e.g., JSON_REMOVE()) don't accept wildcards at all. Hell, I've even had trouble finding known keys, and I suspect that silly colon in the key name might have something to do with it.
So, how can I clean up my table and remove the many forms of this UndefinedTag problem? Maybe it's easier to just go back to the regex_replace plan and deal instead with the trailing commas?

Related

Best Practices Ingesting JSON into BigQuery

bq --location=US load --source_format=NEWLINE_DELIMITED_JSON --autodetect ERIC_KOLOTYLUK_BQ_POC_DATASET.Test2 small_data_clean.jsonl
seems to work well if the JSON is very clean, but very fragile with abstruse diagnostics when the JSON is not very clean. Well, the feature is still experimental, so no point in complaining. For example, if JSON property names contain - characters, while these are valid JSON, they are not valid BigQuery Column Names.
My questions is, are there some existing tools/utilities for ingesting generic JSON into BigQuery that works better than --autodetect?
Presumably Google will improve --autodetect over time, but for now I am looking for any advice/experience people may have. I have already written some code to replace - with _ in property names, so I was wondering if other people have similarly creates tools/utilities...
As you've mentioned for a sample scenario, you have a JSON property that has "-" for a column name, however per this Specifying a Schema Documentation,
A column name must contain only letters (a-z, A-Z), numbers (0-9), or
underscores (_), and it must start with a letter or underscore.
Any characters non-compliant with the above will result to error on creation of columns during the definition of schema.
On the other hand, when using --autodetect, BigQuery's official documentation on Auto-Detection already has a disclaimer saying,
When BigQuery detects schemas, it might, on rare occasions, change a
field name to make it compatible with BigQuery SQL syntax.
Since there is no other tools yet available to auto-correct/format JSON data to fit the BigQuery requirements for schema definition, the best approach for this kind of scenario is to write a code to replace unwanted characters on JSON data's column names, in which you already did.

What's the difference between json data being encoded or not

What's the purpose (not what it becomes) of doing json_encode on this before I am putting into the database
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point of doing this cause I need to json_decode it again before serving it back... can anybody clear me out?
Do not store json encoded data in the database. You mitigate the whole point of a relational database this way and make searching for values an expensive task. I see in your sample the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
It works with the JSON column type but it is more expensive than using normal columns.
Edit: Check the use-case for your database entry. If you are sure you never need to search in or order by the encoded attributes you can store data encoded as json string. However, if your database supports the JSON column type, you should use that one because it allows searching in the stored JSON (but is more expensive than searching in normal columns). </Edit>
Second point: The second code snipped (with the quotation marks) looks like invalid syntax for json.

Finding common phrases in rows that have dynamic content

I'm using MySQL, and I am trying to find common strings over a given character length within a series of messages that are highly dynamic, Each message may have a common phrase, but they will be appended with reference codes or names that don't match a specific format on either side of the string. for example, this is an example of the types of common phrases I'm trying to scan for, but has dynamic content embedded as well, and in different formats (https://screencast.com/t/rlABTWitQ)
The end result I am looking for is something akin to this (https://screencast.com/t/qXzrGNFuf)
Because of the highly variable nature of the formats of these messages, uses of substring_index and regexp (as much as my amateur familiarity with REGEXP has taken me), I can't seem to get anything going
SELECT LEFT("first_middle_last", CHAR_LENGTH("first_middle_last") - LOCATE('_', REVERSE("first_middle_last")));
I can't use something like this, as it would just strip out on a specific type of character. As you can see, the types of strings are too variant in format

Regex for matching with and without quotes for dynamic JSON

I have the following text strings:
"Name":"John"}]
"Age":36
"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"
"Gender":null
I need to get two groups (key value pair) from this such that the output would be only:
Key|Value
Name|John
Age|36
Address|ABC,PQR234[]/.,#ANYCHARACTERS
The requirement is to have a single regex to grab everything in the double quotes if the double quotes are present. If not, take the value without the quotes.
In our example above, 36 and null are the one without the quotes and they need to be captured as well.
I have tried a lot but have failed to do so.
UPDATE:
I don't know why I am getting down votes for this question. Yes this is JSON that I am trying to parse but there is a reason behind why I am doing this and not using any document parser.
I am supposed to use Talend for getting a dynamic JSON converted into Key Value Pair. What I mean by dynamic is the fields of the JSON can vary and hence I do not have a fixed schema and hence cannot use a document parser (which demands a fixed structure of JSON). I am devising a solution to get around this using Normalizer (on comma) and then extracting the key value pair which will be in double quotes using Regular Expressions. I tried many things on my own and since I am not an expert in Regular expressions, I have come here to get inputs.
If you know any better solution to this, I would be very happy to get your inputs.
How about this?
/"?([^\n"]*)"?:"?([^\n"]*)"?/
Explained in detail at:
https://regex101.com/r/UM0rl2/1/

Using regex to extract data from structured data

The problem I'm facing here is that I have a blob of text which contains structured data (in the form of a JSON payload) and I'm interested in extracting the value of one of the keys for a specific JSON instance, picture the structured data inside as the following:
"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", "key3":"item1_key3_value"}, "Item 2": {"key1":"item2_key1_value", "key2":"item2_key2_value", "key3":"item2_key3_value"}
What I would like to use is use regex to grab item1_key2_value for instance. The keys all have the same name but the items are different. So I know which key for which Item I need but am not quite sure of the regex to retrieve that value. I've tried a few approaches to some basic matching but was wondering if any other more experienced regex users could direct me a bit here and explain what I'm doing wrong
1(.)(?=item1_key2_value.) will match a chunk of data from here but I'm not sure of the best way to reduce it to the value that I need.
The regex syntax for JSON is clearly specified at http://www.json.org. If you scroll down a little to where it says "A string is a sequence of", you will find the proper string structure.
Assuming the string follows the correct JSON structure, you could use
"key2"\s*:\s*"((\\.|[^\\"])*)"
where \s means whitespace and * means 0 or more times. \\ means a slosh (backslash) character and can be followed by . (any character). If it does not encounter a slosh, then it instead looks for [^\\"], which means not slosh nor quote.
If you want to be a little more strict to the exact JSON form, you could try
"key2"\s*:\s*"((\\["\\/bfnrtu]|[^\\"])*)"
which you can see follows the string form on the webpage more closely.