Regex for matching with and without quotes for dynamic JSON - json

I have the following text strings:
"Name":"John"}]
"Age":36
"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"
"Gender":null
I need to get two groups (key value pair) from this such that the output would be only:
Key|Value
Name|John
Age|36
Address|ABC,PQR234[]/.,#ANYCHARACTERS
The requirement is to have a single regex to grab everything in the double quotes if the double quotes are present. If not, take the value without the quotes.
In our example above, 36 and null are the one without the quotes and they need to be captured as well.
I have tried a lot but have failed to do so.
UPDATE:
I don't know why I am getting down votes for this question. Yes this is JSON that I am trying to parse but there is a reason behind why I am doing this and not using any document parser.
I am supposed to use Talend for getting a dynamic JSON converted into Key Value Pair. What I mean by dynamic is the fields of the JSON can vary and hence I do not have a fixed schema and hence cannot use a document parser (which demands a fixed structure of JSON). I am devising a solution to get around this using Normalizer (on comma) and then extracting the key value pair which will be in double quotes using Regular Expressions. I tried many things on my own and since I am not an expert in Regular expressions, I have come here to get inputs.
If you know any better solution to this, I would be very happy to get your inputs.

How about this?
/"?([^\n"]*)"?:"?([^\n"]*)"?/
Explained in detail at:
https://regex101.com/r/UM0rl2/1/

Related

Regex for replacing unnecessary quotation marks within a JSON object containing an array

I am currently trying to format a JSON object using LabVIEW and have ran into the issue where it adds additional quotation marks invalidating my JSON formatting. I have not found a way around this so I thought just formatting the string manually would be enough.
Here is the JSON object that I have:
{
"contentType":"application/json",
"content":{
"msgType":2,
"objects":"["cat","dog","bird"]",
"count":3
}
}
Here is the JSON object I want with the quotation marks removed.
{
"contentType":"application/json",
"content":{
"msgType":2,
"objects":["cat","dog","bird"],
"count":3
}
}
I am still not an expert with regex and using a regex tester I was only able to grab the "objects" and "count" fields but I would still feel I would have to utilize substrings to remove the quotation marks.
Example I am using (would use a "count" to find the start of the next field and work backwards from there)
"([objects]*)"
Additionally, all the other Regex I have been looking at removes all instances of quotation marks whereas I only need a specific area trimmed. Thus, I feel that a specific regex replace would be a much more elegant solution.
If there is a better way to go about this I am happy to hear any suggestions!
Your question suggests that the built-in LabVIEW JSON tools are insufficient for your use case.
The built-in library converts LabVIEW clusters to JSON in a one-shot approach. Bundle all your data into a cluster and then convert it to JSON.
When it comes to parsing JSON, you use the path input terminal and the default type terminals to control what data is parsed from a JSON string.
If you need to handle JSON in a manner similar to say JavaScript, I would recommend something like the JSONText Toolkit which is free to use (and distribute) under the BSD licence. This allows more complex and iterative building of JSON strings from LabVIEW types and has text-path style element access along with many more features.
The Output controls from both my examples are identical - although JSONText provides a handy Pretty Print vi.
After using a regex from one of the comments, I ended up with this regex which allowed me to match the array itself.
(\[(?:"[^"]*"|[^"])+\])
I was able to split the the JSON string into before match, match and after match and removed the quotation marks from the end of 'before match' and start of 'after match' and concatenated the strings again to form a new output.

Remove JSON keys with wildcards from a MySQL field

I have a MySQL 8.0.20 database with a table that describes metadata about uploaded image files. One column contains a JSON object with a whole bunch of auto-generated data that I'm trying to clean up.
This JSON object sometimes contains one or more variable key names that match a specific pattern. Something like
{
"image_name": "P10043983",
"image_size": "60138",
"image_original_exifdata": "{
'FileName':'P10043983.jpg',
'MimeType':'image/jpeg',
'UndefinedTag:0xA435':'\u0000\u0000\u0000\u0000\u0000\u0000'
}"
}
That UndefinedTag:0xA435 (with many permutations) is the problem. It's referring to various image Exif details like lens type, GPS data, etc. It's stuff that I'm not interested in and that these cameras mostly don't provide, so I've ended up with a table full of long strings of useless characters just taking up space. I want those JSON fields gone for performance and cleanliness.
Is there a way to run a SQL query that would use wildcards or regular expressions to find (and, ideally, remove) all of these pesky variable keys? I'd like to avoid manually making a list of all of the possible "UndefinedTag" keys to search against, and I also didn't like the results when I just treated the whole thing as a string and did REGEXP_REPLACE calls (it sometimes left trailing commas that broke my JSON and were difficult for me to avoid/resolve).
I know some of the JSON functions like JSON_SEARCH() accept wildcards, but it explicitly says the search path can't end in a wildcard (so no UndefinedTag:0x** allowed). Many of the functions I'm after (e.g., JSON_REMOVE()) don't accept wildcards at all. Hell, I've even had trouble finding known keys, and I suspect that silly colon in the key name might have something to do with it.
So, how can I clean up my table and remove the many forms of this UndefinedTag problem? Maybe it's easier to just go back to the regex_replace plan and deal instead with the trailing commas?

Deal with long numbers in scientific notation in json string - Freemarker

I have a json string which contains a long number but in scientific notation (like 1.559101974041E12 instead of 1559101974041). Due to this, I am not able to parse it using ?eval as this value must be in double quotes in order to get parsed.
I thought of one solution like putting double quotes around them using regex and get them evaluated. After that, use some free marker method to convert value into long. But this solution is very risky and can alter other values as well.
I'm not sure how your template looks, but if you have variable s that contains the string "1.559101974041E12" (the quotation marks aren't part of the string value itself), then you can parse it like s?number. s?eval doesn't work because scientific notation is not part of the FreeMarker syntax (but ?number can parse more formats).
If you will re-print the number in the template, note that depending on locale and configuration settings, it might will look like 1,559,101,974,041. You can prevent that with ?c (for example like ${s?number?c}), in which case it will always look like 1559101974041.

Regex: Is it possible to do a substitution within a capture group?

I have this one line JSON text:
{"schemaText":{"fields":[{"name":"AX_SND_TYPE","type":"string"},{"name":"BWORK","type":"int"}],"name":"XXXSchema","type":"record"},"description":"Autogenerated by NiFi"}
As can be seen there is a property called "schemaText" that contains an object, I want to convert it to a string, so the 'only' thing I need to do is add quotes at the beginning and end of the property and escape the quotes inside.
Using the regular expression bellow (not that my regex knowledge is really low), I am able to do the first step:
({"schemaText":)(\{"fields":\[.*)(,"description.*)
Using the substitution
$1"$2"$3
gives the result:
{"schemaText":"{"fields":[{"name":"AX_SND_TYPE","type":"string"},{"name":"BWORK","type":"int"}],"name":"XXXSchema","type":"record"}","description":"Autogenerated by NiFi"}
But still remains to escape the quotes to get this:
{"schemaText":"{\"fields\":[{\"name\":\"AX_SND_TYPE\",\"type\":\"string\"},{\"name\":\"BWORK\",\"type\":\"int\"}],"name":"XXXSchema","type":"record"}","description":"Autogenerated by NiFi"}
That is have valid JSON format.
The question is: is there a way to escape the quotes inside $2 capture group in the same regular expression?
Thanks in advance.
The answer to your question is no, it's not possible. You're really trying to do two different, unrelated substitutions in a single regular expression. This is a feature that no regular expression engine supports.
Think about it: Your first requirement is for the engine to perform a substitution on the whole text (the quotes), and then, for your second requirement, the engine has to somehow backtrack and perform more substitutions on text which may or may not have already changed: e.g.: It would need to perform a new match on the already substituted text, which, depending on what the first substitution did, may not even exist anymore!
If, as you say, you already have an aproach that works, keep that. A single regular expression is simply not a good fit for what you are trying to do.
I'd recommend tackling this problem using code e.g. with vanilla JavaScript:
let json = '{"schemaText":{"fields":[{"name":"AX_SND_TYPE","type":"string"},{"name":"BWORK","type":"int"}],"name":"XXXSchema","type":"record"},"description":"Autogenerated by NiFi"}';
let obj = JSON.parse(json);
let schemaTextAsString = JSON.stringify(obj.schemaText)
obj.schemaText = schemaTextAsString
var result = JSON.stringify(obj)
You can see this working here.
Note that in your desired output you were not escaping the quotes in schemaText's name field, but this code does.
Finally whenever I use regular expressions I always think of this classic article "Regular Expressions: Now You Have Two Problems"!
Just for your information, you can actually match at every position where a substitution should occur, using an expression such as the following:
/({"schemaText":)|}(,"description")(.*)|([^"]*)"/g
The only issue, as others have mentioned, is that you want to do more than match; you want to perform a "conditional replacement" because there does not exist a single catch-all substitution that will cover all 3 cases you're dealing with (insert starting ", insert \ before quotes, and insert ending ").
You can in fact accomplish this with a single replace() call:
var test = "{\"schemaText\":{\"fields\":[{\"name\":\"AX_SND_TYPE\",\"type\":\"string\"},{\"name\":\"BWORK\",\"type\":\"int\"}],\"name\":\"XXXSchema\",\"type\":\"record\"},\"description\":\"Autogenerated by NiFi\"}";
window.alert(test.replace(/({"schemaText":)|}(,"description")(.*)|([^"]*)"/g, function(a,b,c,d,e){ return (b=="{\"schemaText\":"?b+"\"":(c==",\"description\""?"}\""+c+d:e+"\\\"")) })));
So it's technically "the same regex", but the substitution parameter uses an inline function as replacement rather than a static string.

Using regex to extract data from structured data

The problem I'm facing here is that I have a blob of text which contains structured data (in the form of a JSON payload) and I'm interested in extracting the value of one of the keys for a specific JSON instance, picture the structured data inside as the following:
"Item 1": {"key1":"item1_key1_value", "key2":"item1_key2_value", "key3":"item1_key3_value"}, "Item 2": {"key1":"item2_key1_value", "key2":"item2_key2_value", "key3":"item2_key3_value"}
What I would like to use is use regex to grab item1_key2_value for instance. The keys all have the same name but the items are different. So I know which key for which Item I need but am not quite sure of the regex to retrieve that value. I've tried a few approaches to some basic matching but was wondering if any other more experienced regex users could direct me a bit here and explain what I'm doing wrong
1(.)(?=item1_key2_value.) will match a chunk of data from here but I'm not sure of the best way to reduce it to the value that I need.
The regex syntax for JSON is clearly specified at http://www.json.org. If you scroll down a little to where it says "A string is a sequence of", you will find the proper string structure.
Assuming the string follows the correct JSON structure, you could use
"key2"\s*:\s*"((\\.|[^\\"])*)"
where \s means whitespace and * means 0 or more times. \\ means a slosh (backslash) character and can be followed by . (any character). If it does not encounter a slosh, then it instead looks for [^\\"], which means not slosh nor quote.
If you want to be a little more strict to the exact JSON form, you could try
"key2"\s*:\s*"((\\["\\/bfnrtu]|[^\\"])*)"
which you can see follows the string form on the webpage more closely.