Using regex to extract data from structured data

Using regex to extract data from structured data - json

The problem I'm facing is that I have a blob of text containing structured data (in the form of a JSON payload), and I want to extract the value of one of the keys for a specific JSON instance. Picture the structured data as the following:
"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", "key3":"item1_key3_value"}, "Item 2": {"key1":"item2_key1_value", "key2":"item2_key2_value", "key3":"item2_key3_value"}
What I would like to do is use regex to grab item1_key2_value, for instance. The keys all have the same name, but the items are different. So I know which key of which Item I need, but I'm not quite sure of the regex to retrieve that value. I've tried a few approaches with some basic matching, and was wondering if more experienced regex users could point me in the right direction and explain what I'm doing wrong.
1(.)(?=item1_key2_value.) will match a chunk of data from here but I'm not sure of the best way to reduce it to the value that I need.

The syntax of JSON is clearly specified at http://www.json.org. If you scroll down a little to where it says "A string is a sequence of", you will find the proper string structure.
Assuming the string follows the correct JSON structure, you could use
"key2"\s*:\s*"((\\.|[^\\"])*)"
where \s means whitespace and * means 0 or more times. \\ means a slosh (backslash) character, which can be followed by . (any character). If it does not encounter a slosh, it instead looks for [^\\"], which means neither a slosh nor a quote. The outer parentheses capture the value itself.
If you want to be a little more strict to the exact JSON form, you could try
"key2"\s*:\s*"((\\["\\/bfnrtu]|[^\\"])*)"
which you can see follows the string form on the webpage more closely.
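To pick the value for one specific item rather than the first "key2" in the blob, the pattern can be anchored on the item name first. The following is a minimal Python sketch of that idea, not a definitive solution: the helper name `extract` is made up for illustration, and it assumes the item's object contains no nested braces and no braces inside its string values.

```python
import re

# Sample blob taken from the question.
blob = (
    '"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", "key3":"item1_key3_value"}, '
    '"Item 2": {"key1":"item2_key1_value", "key2":"item2_key2_value", "key3":"item2_key3_value"}'
)

def extract(blob, item, key):
    """Return the string value of `key` inside the object named `item`, or None.

    Hypothetical helper: assumes flat objects without braces in values.
    """
    pattern = "".join([
        '"', re.escape(item), r'"\s*:\s*\{',   # the item name and its opening brace
        r'[^{}]*?',                            # skip earlier pairs, never leaving the object
        '"', re.escape(key), r'"\s*:\s*',      # the key we are after
        r'"((?:\\.|[^\\"])*)"',                # the JSON string pattern from the answer
    ])
    match = re.search(pattern, blob)
    return match.group(1) if match else None

print(extract(blob, "Item 1", "key2"))
```

If the blob can be isolated and wrapped in braces so that it is valid JSON, a real JSON parser is likely to be more robust than any regex.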

Related

Is there a JOLT documentation? What's the meaning of the &, # etc. operators? (NiFi, JoltTransformJSON)

Yeah there is! I made this question to share my knowledge, Q&A style since I had a hard time finding it myself :)
Thanks to https://stackoverflow.com/a/67821482/1561441 (Barbaros Özhan, see comments) for pointing me in the right direction

The answer is: look here and here
Correct me if I'm wrong, but to my knowledge a single .java file on GitHub, last commit in 2017, currently holds the relevant parts of the official documentation of the JOLT syntax. I had to learn the syntax because I'm working with NiFi and its JoltTransformJSON processor (hence the SEO abuse in my question, so that more people find the answer).
Here are some of the most relevant parts copied from https://github.com/bazaarvoice/jolt/blob/master/jolt-core/src/main/java/com/bazaarvoice/jolt/Shiftr.java and slightly edited. The documentation itself is more extensive and also shows examples.
'*' Wildcard
Valid only on the LHS ( input JSON keys ) side of a Shiftr Spec
The '*' wildcard can be used by itself or to match part of a key.
'&' Wildcard
Valid on the LHS (left hand side - input JSON keys) and RHS (output data path)
Means: dereference against a "path" to get a value, and use that value as if it were a literal key.
The canonical form of the wildcard is "&(0,0)".
The first parameter is where in the input path to look for a value, and the second parameter is which part of the key to use (used with * key).
There are syntactic sugar versions of the wildcard; all of the following mean the same thing: '&' = '&0' = '&(0)' = '&(0,0)'
The syntactic sugar versions are nice, as there are a set of data transforms that do not need to use the canonical form, eg if your input data does not have any "prefixed" keys.
'$' Wildcard
Valid only on the LHS of the spec.
The existence of this wildcard is a reflection of the fact that the "data" of the input JSON, can be both in the "values" and the "keys" of the input JSON
The base case operation of Shiftr is to copy input JSON "values", thus we need a way to specify that we want to copy the input JSON "key" instead.
Thus '$' specifies that we want to use an input key, or input key derived value, as the data to be placed in the output JSON.
'$' has the same syntax as the '&' wildcard, and can be read as, dereference to get a value, and then use that value as the data to be output.
There are two cases where this is useful
when a "key" in the input JSON needs to be an "id" value in the output JSON, see the ' "$": "SecondaryRatings.&1.Id" ' example in the linked documentation.
you want to make a list of all the input keys.
'#' Wildcard
Valid both on the LHS and RHS, but has different behavior / format on either side.
The way to think of it is that it allows you to specify a "synthetic" value, aka a value not found in the input data.
On the RHS of the spec, # is only valid in the context of an array, like "[#2]".
What "[#2]" means is, go up the three levels and ask that node how many matches it has had, and then use that as an index in the arrays.
This means that, while Shiftr is doing its parallel tree walk of the input data and the spec, it tracks how many matches it has processed at each level of the spec tree.
This is useful if you want to take a JSON map and turn it into a JSON array, and you do not care about the order of the array.
On the LHS of the spec, # allows you to specify a hard-coded String to be placed as a value in the output.
The initial use-case for this feature was to be able to process a Boolean input value, and if the value is boolean true write out the string "enabled". Note, this was possible before, but it required two Shiftr steps.
'@' Wildcard
Valid on both sides of the spec.
The basic '@' on the LHS.
This wildcard is necessary if you want to put both the input value and the input key somewhere in the output JSON.
Thus the '@' wildcard means "copy the value of the data at this level in the tree, to the output".
Advanced '@' sign wildcard
The format looks like "@(3,title)", where "3" means go up the tree 3 levels and then look up the key "title" and use the value at that key.

I would love to know if there is an alternative to JoltTransformJSON simply because I'm struggling a lot with understanding it (not coming from a programming background myself). When it works (thanks to all the help here) it does simplify things a lot!
Here are a few other sites that help:
https://intercom.help/godigibee/en/articles/4044359-transformer-getting-to-know-jolt
https://erbalvindersingh.medium.com/applying-jolttransform-on-json-object-array-and-fetching-specific-fields-48946870b4fc
https://cool-cheng.blogspot.com/2019/12/json-jolt-tutorial.html

Remove JSON keys with wildcards from a MySQL field

I have a MySQL 8.0.20 database with a table that describes metadata about uploaded image files. One column contains a JSON object with a whole bunch of auto-generated data that I'm trying to clean up.
This JSON object sometimes contains one or more variable key names that match a specific pattern. Something like
{
"image_name": "P10043983",
"image_size": "60138",
"image_original_exifdata": "{
'FileName':'P10043983.jpg',
'MimeType':'image/jpeg',
'UndefinedTag:0xA435':'\u0000\u0000\u0000\u0000\u0000\u0000'
}"
}
That UndefinedTag:0xA435 (with many permutations) is the problem. It's referring to various image Exif details like lens type, GPS data, etc. It's stuff that I'm not interested in and that these cameras mostly don't provide, so I've ended up with a table full of long strings of useless characters just taking up space. I want those JSON fields gone for performance and cleanliness.
Is there a way to run a SQL query that would use wildcards or regular expressions to find (and, ideally, remove) all of these pesky variable keys? I'd like to avoid manually making a list of all of the possible "UndefinedTag" keys to search against, and I also didn't like the results when I just treated the whole thing as a string and did REGEXP_REPLACE calls (it sometimes left trailing commas that broke my JSON and were difficult for me to avoid/resolve).
I know some of the JSON functions like JSON_SEARCH() accept wildcards, but it explicitly says the search path can't end in a wildcard (so no UndefinedTag:0x** allowed). Many of the functions I'm after (e.g., JSON_REMOVE()) don't accept wildcards at all. Hell, I've even had trouble finding known keys, and I suspect that silly colon in the key name might have something to do with it.
So, how can I clean up my table and remove the many forms of this UndefinedTag problem? Maybe it's easier to just go back to the regex_replace plan and deal instead with the trailing commas?

What's the difference between json data being encoded or not

What's the purpose (not what it becomes) of doing json_encode on this before putting it into the database?
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point in doing this, because I need to json_decode it again before serving it back... can anybody clear this up for me?

Do not store JSON-encoded data in the database. You defeat the whole point of a relational database this way and make searching for values an expensive task. I see in your sample the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
It works with the JSON column type, but it is more expensive than using normal columns.
Edit: Check the use case for your database entry. If you are sure you never need to search in or order by the encoded attributes, you can store the data encoded as a JSON string. However, if your database supports the JSON column type, you should use that one because it allows searching in the stored JSON (but is more expensive than searching in normal columns).
Second point: The second code snippet (with the quotation marks) looks like invalid JSON syntax.
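To illustrate the round trip and the quoting issue, here is a small sketch in Python, where `json.dumps` and `json.loads` stand in for PHP's json_encode and json_decode. The `row` dict is only a stand-in for a database record and is an assumption, not part of the question.

```python
import json

rating = {"cleanliness": 3, "publicFacility": 1, "roomFacility": 2, "security": 2}

# Encoding turns the structure into a plain string that fits a text column.
encoded = json.dumps(rating, separators=(",", ":"))
print(encoded)

# If that string is embedded in another JSON document, the inner quotes
# must be escaped; the unescaped form shown in the question is not valid JSON.
row = {"rating": encoded}          # hypothetical record holding the encoded string
row_json = json.dumps(row)
print(row_json)

# Reading it back therefore takes two decoding steps.
decoded = json.loads(json.loads(row_json)["rating"])
print(decoded == rating)
```

The double decode is the cost the question is pointing at: the encoded form is opaque to the database, so it only pays off when the value is always read and written as a whole.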

Regex for matching with and without quotes for dynamic JSON

I have the following text strings:
"Name":"John"}]
"Age":36
"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"
"Gender":null
I need to get two groups (key value pair) from this such that the output would be only:
Key|Value
Name|John
Age|36
Address|ABC,PQR234[]/.,#ANYCHARACTERS
The requirement is to have a single regex that grabs everything inside the double quotes if double quotes are present. If not, it should take the value without the quotes.
In the example above, 36 and null are the ones without quotes, and they need to be captured as well.
I have tried a lot but have failed to do so.
UPDATE:
I don't know why I am getting down votes for this question. Yes this is JSON that I am trying to parse but there is a reason behind why I am doing this and not using any document parser.
I am supposed to use Talend for getting a dynamic JSON converted into Key Value Pair. What I mean by dynamic is the fields of the JSON can vary and hence I do not have a fixed schema and hence cannot use a document parser (which demands a fixed structure of JSON). I am devising a solution to get around this using Normalizer (on comma) and then extracting the key value pair which will be in double quotes using Regular Expressions. I tried many things on my own and since I am not an expert in Regular expressions, I have come here to get inputs.
If you know any better solution to this, I would be very happy to get your inputs.

How about this?
/"?([^\n"]*)"?:"?([^\n"]*)"?/
Explained in detail at:
https://regex101.com/r/UM0rl2/1/
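As a quick check, the pattern can be run against the sample lines from the question. This is a minimal Python sketch; the surrounding `/` delimiters are dropped because Python's `re` module does not use them, and the variable names are made up for illustration.

```python
import re

# The answer's pattern without the /.../ delimiters.
PAIR = re.compile(r'"?([^\n"]*)"?:"?([^\n"]*)"?')

lines = [
    '"Name":"John"}]',
    '"Age":36',
    '"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"',
    '"Gender":null',
]

pairs = []
for line in lines:
    match = PAIR.search(line)
    if match:
        # group(1) is the key, group(2) the value, both without quotes
        pairs.append((match.group(1), match.group(2)))

for key, value in pairs:
    print(f"{key}|{value}")
```

Note that the pattern treats null as ordinary text, so a row such as Gender|null has to be filtered out afterwards if it is not wanted. It also stops at the first double quote inside a value, so escaped quotes in values are not handled.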

What is the Solr field type for storing JSON object?

I want to store JSON data as the value of a field. Currently I have tried setting the field type to string or text, which works, but the JSON is stored as a string and hence the double quotes (") inside it are escaped. What field type should be used for a JSON object?
E.g., suppose my field is Thumbnail and it should be stored and displayed as a JSON object instead of a string, like below:
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": "100"
}

You can store it as String. The issue you are facing - I assume - is when you want to return that JSON as structure within the Solr JSON response. By default Solr does not know that you want to embed the content rather than present it.
To tell it otherwise, try using a [json] document transformer in your fl parameter.
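As a rough sketch of what such a request could look like, the following Python snippet builds a select URL whose fl parameter applies the [json] transformer to the Thumbnail field. The host, port, and collection name are placeholders, and `build_select_url` is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import urlencode

def build_select_url(base_url, collection, query, json_field):
    """Build a Solr select URL that returns `json_field` as embedded JSON.

    Hypothetical helper; base_url and collection are assumptions.
    """
    params = {
        "q": query,
        "wt": "json",                         # the transformer applies to the JSON writer
        "fl": f"id,{json_field}:[json]",      # field:[json] asks for the raw JSON structure
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"

url = build_select_url("http://localhost:8983/solr", "mycollection", "*:*", "Thumbnail")
print(url)
```

The stored value still has to be a valid JSON string for the transformer to embed it as a structure.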

The String type stores a word or sentence as an exact string without performing tokenization etc. It is commonly useful for storing exact matches, e.g., for faceting.
The Text type typically performs tokenization and secondary processing (such as lower-casing). It is more useful when you want to match a word that is part of a sentence.
With text you can use the tokenizers and filters made available by Solr to create more useful tokens for proper search.
I would suggest storing it as text.
Tokenizing the data can be helpful for search as well.
Also, as it's JSON, it will be large and may grow larger, which is another reason to store it as text instead of string.
