Regex: Remove Commas within quotes - json

I'm using NiFi and I have a series of JSONs that look like this:
{
"url": "RETURNED URL",
"repository_url": "RETURNED URL",
"labels_url": "RETURNED URL",
"comments_url": "RETURNED URL",
"events_url": "RETURNED URL",
"html_url": "RETURNED URL",
"id": "RETURNED_ID",
"node_id": "RETURNED id",
"number": 10,
...
"author_association": "xxxx",
"active_lock_reason": null,
"body": "text text text, text text, text text text, text, text text",
"performed_via_github_app": null
}
My focus is on the "body" attribute. Because I'm merging them into one giant JSON to convert into a CSV, I need the commas within the "body" text to go away (which should also help with possible NLP later down the road). I know I can use ReplaceText for the replacement itself, but matching only those commas is the part I'm struggling with. So far I have the following:
((?<="body"\s:\s").*(?=",))
None of the guides I've looked at, though, match the commas within the quotes. Any suggestions?

You can use
(\G(?!^)|\"body\"\s*:\s*\")([^\",]*),
In case there are escape sequences in the string use
(\G(?!^)|\"body\"\s*:\s*\")([^\",\\]*(?:\\.[^\",\\]*)*),
See the regex demo (and regex demo #2), replace with $1$2.
Details:
(\G(?!^)|\"body\"\s*:\s*\") - Group 1 ($1): the end of the previous match, or "body", zero or more whitespaces, :, zero or more whitespaces, and the opening "
([^\",]*) - Group 2 ($2): any zero or more chars other than " and ,
, - a comma (to be removed/replaced).
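For anyone testing this outside NiFi: Python's built-in re module does not support \G, so the pattern above cannot be used there as written. Below is a minimal sketch of a different technique (a replacement callback instead of \G chaining) that should give the same result, assuming the value of "body" is a single JSON string. The name strip_body_commas and the sample data are made up for illustration.

```python
import re

# Group 1: the key and the opening quote.
# Group 2: the string contents (escape sequences such as \" are consumed as pairs).
# Group 3: the closing quote.
BODY_VALUE = re.compile(r'("body"\s*:\s*")((?:[^"\\]|\\.)*)(")')


def strip_body_commas(json_text):
    """Remove commas only inside the string value of the "body" key."""
    def _clean(match):
        return match.group(1) + match.group(2).replace(",", "") + match.group(3)

    return BODY_VALUE.sub(_clean, json_text)


# Hypothetical sample: commas outside "body" must survive.
sample = '{"id": "a, b", "body": "text, text \\"quoted, too\\", end", "x": null}'
print(strip_body_commas(sample))
```

Commas in other values and the commas separating the key/value pairs are left alone, because only group 2 is modified.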

Related

How to prettify the json text in the Excel cell?

In the Excel file, there are a lot of JSON texts like the one below:
{"component":"{{$labels.component}}","container":"{{$labels.container}}","daemonset":"{{$labels.daemonset}}","directory":"{{$labels.directory}}","figure":"{{$value}}","instance":"{{$labels.instance}}","job":"{{$labels.job}}","name":"{{$labels.name}}","namespace":"{{$labels.namespace}}","pod":"{{$labels.pod}}","reason":"{{$labels.reason}}"}
I want to prettify the json text like this
{
"component": "{{$labels.component}}",
"container": "{{$labels.container}}",
"daemonset": "{{$labels.daemonset}}",
"directory": "{{$labels.directory}}",
"figure": "{{$value}}",
"instance": "{{$labels.instance}}",
"job": "{{$labels.job}}",
"name": "{{$labels.name}}",
"namespace": "{{$labels.namespace}}",
"pod": "{{$labels.pod}}",
"reason": "{{$labels.reason}}"
}
Is there any way to do this for any cells in my excel file?
Thanks!
If you have Excel 365 current channel then you can use this formula - where your json is in cell A1
=LET(json,A1,
r,SUBSTITUTE(json,",","," & CHAR(10)),
"{" & CHAR(10) & MID(r,2,LEN(r)-2) & CHAR(10) & "}")
First, it replaces every comma with a comma plus a line break, CHAR(10). Then it strips the outer braces and puts each back on its own line. Note that a comma inside a value is also followed by a line break.
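To see what the two steps do without opening Excel, here is a rough Python sketch of the same substitute-then-rewrap idea. It is not a translation of the formula character by character, and prettify_flat_json is a made-up name.

```python
def prettify_flat_json(cell_text):
    """Put each key/value pair of a flat JSON object on its own line."""
    # Step 1: add a line break after every comma (the SUBSTITUTE step).
    with_breaks = cell_text.replace(",", ",\n")
    # Step 2: drop the outer braces and put them back on their own lines.
    return "{\n" + with_breaks[1:-1] + "\n}"


print(prettify_flat_json('{"job":"{{$labels.job}}","name":"{{$labels.name}}"}'))
```

Like the formula, this only suits flat objects whose values contain no commas, and it does not add a space after the colons.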

Regex Operation to Extract html Strings

I have the following html code:
'"height": { "#type": "QuantitativeValue", "value": "6-1" },\n
"weight": {"#type": "QuantitativeValue", "value": "195 lbs" }\n}\n'
I want to create a Regex that'll extract the height and weight values (6-1 and 195 lbs). What re expression can do this?
If you don't have anything else with the pattern "value": "", then just use:
value":\s"?(.*)"
https://regex101.com/r/clWVkg/1
If you do, then you can specify that you only want the values from height and weight caught:
(height|weight).*"value":\s"?(.*)"
https://regex101.com/r/5ShdKO/1
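This checks for the word height or weight first, then skips everything up to "value" before a greedy catch-all, (.*), which backtracks to the last double quote on the line to capture the value. Because . does not match a newline by default, each match stays on one line. The value is in the second capture group (the first holds height or weight).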
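Since the question mentions re, here is a small Python sketch that applies the second pattern to the string from the question. The function name extract_values is only for illustration.

```python
import re

# The snippet from the question, as a Python string.
text = (
    '"height": { "#type": "QuantitativeValue", "value": "6-1" },\n'
    '"weight": {"#type": "QuantitativeValue", "value": "195 lbs" }\n}\n'
)

# Group 1 is the key (height or weight), group 2 is the value.
PATTERN = re.compile(r'(height|weight).*"value":\s"?(.*)"')


def extract_values(source):
    """Return a dict mapping each matched key to its value."""
    # findall returns one (key, value) tuple per match.
    return dict(PATTERN.findall(source))


print(extract_values(text))
```

This relies on each key and its value sitting on the same line, as in the question; if the JSON were on a single line, the greedy .* would behave differently.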

How to concatenate json result elements with JsonPath (JayWay)

I have this JSON response:
{
"agreedToTermsOfUse": true,
"firstName": "Admin",
"lastName": "iConsulto",
"middleName": "",
"status": 0,
"timeZoneId": "UTC",
}
and I am trying to concatenate the first name and the last name.
I have tried to do this:
$..concat($..firstName," ",$..lastName)
but it returns me an empty value.
I have also tried this:
$..concat("+",$..lastName)
And it has returned me this:
+["lastNameUser"]
Any ideas why the second option returns something (like a list) and the first one doesn't return anything?
I have also tried this:
$..concat("+",$..lastName[0])
but it doesn't return just the last name with the plus symbol.
So, how can I concatenate both names? Thanks in advance!
Do not use the deep scan (..) operator while concatenating. A deep scan always returns a list of matches rather than a single value.
Try the below jsonpath-expression
$.concat($.firstName," ",$.lastName)
Online Tool : https://jsonpath.herokuapp.com/

Regex Return First Match

I have a weather file where I would like to extract the first value for "air_temp" recorded in a JSON file. The format this HTTP retriever uses is regex (I know it is not the best method).
I've shortened the JSON file to 2 data entries for simplicity - there are usually 100.
{
"observations": {
"notice": [
{
"copyright": "Copyright Commonwealth of Australia 2017, Bureau of Meteorology. For more information see: http://www.bom.gov.au/other/copyright.shtml http://www.bom.gov.au/other/disclaimer.shtml",
"copyright_url": "http://www.bom.gov.au/other/copyright.shtml",
"disclaimer_url": "http://www.bom.gov.au/other/disclaimer.shtml",
"feedback_url": "http://www.bom.gov.au/other/feedback"
}
],
"header": [
{
"refresh_message": "Issued at 12:11 pm EST Tuesday 11 July 2017",
"ID": "IDN60901",
"main_ID": "IDN60902",
"name": "Canberra",
"state_time_zone": "NSW",
"time_zone": "EST",
"product_name": "Capital City Observations",
"state": "Aust Capital Territory"
}
],
"data": [
{
"sort_order": 0,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/12:00pm",
"local_date_time_full": "20170711120000",
"aifstime_utc": "20170711020000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 5.7,
"cloud": "Mostly clear",
"cloud_base_m": 1050,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 3.6,
"gust_kmh": 11,
"gust_kt": 6,
"air_temp": 9.0,
"dewpt": 0.2,
"press": 1032.7,
"press_qnh": 1031.3,
"press_msl": 1032.7,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 54,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "WNW",
"wind_spd_kmh": 7,
"wind_spd_kt": 4
},
{
"sort_order": 1,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/11:30am",
"local_date_time_full": "20170711113000",
"aifstime_utc": "20170711013000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 4.6,
"cloud": "Mostly clear",
"cloud_base_m": 900,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 2.9,
"gust_kmh": 9,
"gust_kt": 5,
"air_temp": 7.3,
"dewpt": 0.1,
"press": 1033.1,
"press_qnh": 1031.7,
"press_msl": 1033.1,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 60,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "NW",
"wind_spd_kmh": 4,
"wind_spd_kt": 2
}
]
}
}
The regex expression I am currently using is: .*air_temp": (\d+).* but this is returning 9 and 7.3 (entries 1 and 2). Could someone suggest a way to only return the first value?
I have tried using lazy quantifier group, but have had no luck.
This regex will help you. But I think you should capture and extract the first match with features of the programming language you are using.
.*air_temp": (\d{1,3}\.\d{0,3})[\s\S]*?},
To understand the regex better: take a look at this.
Update
The above solution works only if you have two data entries. For more than two entries, use this one:
header[\s\S]*?"air_temp": (\d{1,3}\.\d{0,3})
Here we match the word header first and then match anything in a non-greedy way. After that, we match our expected pattern, so we get the first match. Play with it here in regex101.
To capture negative numbers, allow an optional - character before the digits. The ? quantifier means zero or one occurrence of the preceding element.
So the regex becomes:
header[\s\S]*?"air_temp": (-?\d{1,3}\.\d{0,3})
However, using \K without the global flag (as in another answer, given by mickmackusa) is more efficient. To detect negative numbers, the modified version of that regex is:
air_temp": \K-?\d{1,2}\.\d{1,2}
Here {1,2} means one or two occurrences of the preceding element. The general form is {min_occurrences,max_occurrences}.
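The header-anchored pattern can be tried in Python as sketched below (Python's re has no \K, so the capture-group version is used). The sample is a cut-down, hypothetical version of the feed with a negative first reading, and first_air_temp is a made-up name.

```python
import re

# Lazy [\s\S]*? stops at the first "air_temp" after the word header.
FIRST_TEMP = re.compile(r'header[\s\S]*?"air_temp": (-?\d{1,3}\.\d{0,3})')


def first_air_temp(json_text):
    """Return the first air_temp after the header as a string, or None."""
    match = FIRST_TEMP.search(json_text)
    return match.group(1) if match else None


# Hypothetical, shortened sample data.
sample = '''"header": [{"name": "Canberra"}],
"data": [
{"sort_order": 0, "air_temp": -1.5, "dewpt": 0.2},
{"sort_order": 1, "air_temp": 7.3, "dewpt": 0.1}
]'''

print(first_air_temp(sample))
```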
I do not know which language you are using, but it seems like a difference between the global flag and not using the global flag.
If the global flag is not set, only the first result will be returned. If the global flag is set on your regex, it will iterate through returning all possible results. You can test it easily using Regex101, https://regex101.com/r/x1bwg2/1
The lazy/greediness should not have any impact in regards to using/not using the global flag
If \K is allowed in your coding language, use this: Demo
/air_temp": \K[\d.]+/ (117steps) this will be highly efficient in searching your very large JSON text.
If no \K is allowed, you can use a capture group: (Demo)
/air_temp": ([\d.]+)/ this will still move with decent speed through your JSON text
Notice that there is no global flag at the end of the pattern, so after one match, the regex engine stops searching.
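In Python there is no global flag; the closest equivalent is the choice of function, where re.search stops after the first match and re.findall keeps going. A short sketch of the capture-group pattern under that assumption, with a shortened sample:

```python
import re

PATTERN = re.compile(r'air_temp": ([\d.]+)')

# Shortened sample with two entries, as in the question.
sample = '"air_temp": 9.0,\n"dewpt": 0.2,\n"air_temp": 7.3,\n'

# Like a pattern without the global flag: only the first match.
first = PATTERN.search(sample).group(1)

# Like a pattern with the global flag: every match.
every = PATTERN.findall(sample)

print(first, every)
```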
Update:
For "less literal" matches (but it shouldn't matter if your source is reliable), you could use:
Extended character class to include -:
/air_temp": \K[\d.-]+/ #still 117 steps
or change to negated character class and match everything that isn't a , (because the value always terminates with a comma):
/air_temp": \K[^,]+/ #still 117 steps
For a very strict match (if you are looking for a pattern that means you have ZERO confidence in the input data)...
It appears that your data doesn't go beyond one decimal place, temps between 0 and 1 prepend a 0 before the decimal, and I don't think you need to worry with temps in the hundreds (right?), so you could use:
/air_temp": \K-?[1-9]?\d(?:\.\d)?/ #200steps
Explanation:
Optional negative sign
Optional tens digit
Required ones digit
Optional decimal which must be followed by a digit
Accuracy Test Demo
Real Data Demo
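The strict pattern can be checked against a few values in Python. This is only a sketch: \K is replaced by a capture group because Python's re does not support it, and the test lines and the name strict_air_temp are made up.

```python
import re

# Optional sign, optional tens digit, required ones digit, optional ".d".
STRICT = re.compile(r'air_temp": (-?[1-9]?\d(?:\.\d)?)')


def strict_air_temp(line):
    """Return the matched temperature as a string, or None if nothing matches."""
    match = STRICT.search(line)
    return match.group(1) if match else None


for line in ('"air_temp": 9.0,', '"air_temp": -12.5,', '"air_temp": 0.2,', '"air_temp": null,'):
    print(line, "->", strict_air_temp(line))
```

As the answer assumes temperatures never reach the hundreds, a value such as 123.4 would only be matched partially (as 12).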

Notepad++: What is the "opposite" format of JSFormat?

I'm looking for the "opposite" of JSFormat from JSTools. Here is an example:
JSON code example:
title = Automatic at 07.02.17 & appId = ID_1 & data = {
"base": "+:background1,background2",
"content": [{
"appTitle": "Soil",
"service": {
"serviceType": "AG",
"Url": "http://test.de/xxx"
},
"opacity": "1"]
}
],
"center": "4544320.372869264,5469450.086030475,31468"
}
& context = PARAMETERS
and I need to convert it to the following format:
title=Automatic at 07.02.17 &appId=ID_1&data={"base":"+:background1,background2","content":[{"appTitle":"Soil","service":{"serviceType":"AG","Url":"http://test.de/xxx"},"opacity":"1"]}],"center":"4544320.372869264,5469450.086030475,31468"}&context=PARAMETERS
which is a decoded URL (with MIME Tools) from this html POST:
title%3DAutomatic%20at%2007.02.17%20%26appId%3DID_1%26data%3D%7B%22base%22%3A%22+%3Abackground1,background2%22,%22content%22%3A%5B%7B%22appTitle%22%3A%22Soil%22,%22service%22%3A%7B%22serviceType%22%3A%22AG%22,%22Url%22%3A%22http%3A%2F%2Ftest.de%2Fxxx%22%7D,%22opacity%22%3A%221%22%5D%7D%5D,%22center%22%3A%224544320.372869264,5469450.086030475,31468%22%7D%26context%3DPARAMETERS%0D%0A
which I have to restore after making changes to the JSON code. To get from the second format to the third I can use URL encode (MIME Tools), but how do I get from the first format to the second?
My question: do you have any ideas how to turn the first (JSON) format into the second (decoded URL) in Notepad++? Something like the "opposite" of JSFormat?
If I understand correctly you basically need to put your JSON on a single line removing new lines and spaces.
This should be achieved with these steps:
CTRL + H to replace occurrences of more than one space with empty string using this regex: [ ]{2,} (remember to select "Regular expression" radiobutton). If this is not exactly what you want you can adjust the regular expression to achieve desired output
select all your JSON CTRL + A
put everything on a single line with join CTRL + J
You can also record a macro to automate this process and run it with a keyboard shortcut.
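If scripting is an option, the same two steps can be sketched in Python. This is an illustration of the idea, not something Notepad++ runs for you; collapse_to_one_line is a made-up name, and joining the lines with an empty string is an assumption about the desired result.

```python
import re


def collapse_to_one_line(text):
    """Remove indentation runs, then join all lines into one."""
    # Step 1: remove runs of two or more spaces (the [ ]{2,} replacement).
    without_indent = re.sub(r"[ ]{2,}", "", text)
    # Step 2: join every line; an empty separator is assumed here.
    return "".join(without_indent.splitlines())


formatted = '{\n    "a": "1",\n    "b": {\n        "c": "2"\n    }\n}'
print(collapse_to_one_line(formatted))
```

Single spaces, such as the one after each colon, are left in place, which is where adjusting the regular expression comes in.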