How can I use RegEx to extract data within a JSON document

I am no RegEx expert. I am trying to understand if I can use RegEx to find a block of data in a JSON file.
My Scenario:
I am using an AWS RDS instance with enhanced monitoring. The monitoring data is sent to a CloudWatch log stream, and I am trying to make the data posted in CloudWatch visible in the log management solution Loggly.
The ingestion is no problem; I can see the data in Loggly. However, the whole message is contained in one big blob field. The field content is a JSON document, and I am trying to figure out if I can use RegEx to extract only certain parts of it.
Here is a sample extract from the JSON payload I am using:
{
"engine": "MySQL",
"instanceID": "rds-mysql-test",
"instanceResourceID": "db-XXXXXXXXXXXXXXXXXXXXXXXXX",
"timestamp": "2017-02-13T09:49:50Z",
"version": 1,
"uptime": "0:05:36",
"numVCPUs": 1,
"cpuUtilization": {
"guest": 0,
"irq": 0.02,
"system": 1.02,
"wait": 7.52,
"idle": 87.04,
"user": 1.91,
"total": 12.96,
"steal": 2.42,
"nice": 0.07
},
"loadAverageMinute": {
"fifteen": 0.12,
"five": 0.26,
"one": 0.27
},
"memory": {
"writeback": 0,
"hugePagesFree": 0,
"hugePagesRsvd": 0,
"hugePagesSurp": 0,
"cached": 505160,
"hugePagesSize": 2048,
"free": 2830972,
"hugePagesTotal": 0,
"inactive": 363904,
"pageTables": 3652,
"dirty": 64,
"mapped": 26572,
"active": 539432,
"total": 3842628,
"slab": 34020,
"buffers": 16512
},
My Question
My question is: can I use RegEx to extract, say, a subset of the document, for example CPU utilization or memory? If that is possible, how do I write the RegEx? If it works, I could also use it to drill down into the extracted document to get individual data elements.
Many thanks for your help.

First I agree with Sebastian: A proper JSON parser is better.
Anyway, sometimes the dirty approach must be used. If your text layout will not change, then a regexp is simple:
E.g. "total": (\d+\.\d+) gets the CPU usage and "total": (\d\d\d+) the total memory usage (match at least three digits so it does not match the first total, the CPU value; memory will probably never be less than 100 :-).
If changes are to be expected make it a bit more stable: ["']total["']\s*:\s*(\d+\.\d+).
It may also be possible to match against newline characters like this: "cpuUtilization"\s*:\s*\{\s*\n.*\n\s*"irq"\s*:\s*(\d+\.\d+) which makes it a bit more stable (this time for the irq value).
And so on and so on.
You can see how quickly you end up with very complex expressions. This approach is very fragile!
P.S. Depending on the exact details of Loggly's regex support, the specifics may change. The examples above are based on Perl.
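For comparison, here is a minimal sketch in JavaScript (outside Loggly, and assuming the whole monitoring message is available as a string named blob; that variable name is just for illustration) showing the parser approach next to the regex approach:
// Parser approach: robust to whitespace and layout changes.
const doc = JSON.parse(blob);
console.log(doc.cpuUtilization.total);   // 12.96
console.log(doc.memory.total);           // 3842628

// Regex approach: only works while the layout stays as shown above.
const cpuTotal = blob.match(/["']total["']\s*:\s*(\d+\.\d+)/);
console.log(cpuTotal && cpuTotal[1]);    // "12.96"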

Related

Read Dynamic JSON Property

So I'm currently using JSON.NET in Visual Studio to parse my JSON since using deserialization is too slow for what I'm trying to do. I'm pulling stock information from TD Ameritrade and can request multiple stocks at the same time. The JSON result below is from pulling only 1. As you can see, the first line is "TQQQ". If I were to pull more than one stock, I'd have "TQQQ", then "CEI" in separate blocks representing different objects.
Under normal deserialization, I could just say to deserialize a dictionary and it would put them into the dictionary accordingly with whatever class I had written for it to populate. However, since I need to parse line by line, is there a clean way of being able to tell when I've arrived to the next object?
I could say to keep track of the very last field and then add the next line (the next ticker's name) to the dictionary, but that seems a little hacky.
I don't think any VB code is necessary besides the initial startup of creating a new JSONReader.
{
"TQQQ": {
"assetType": "ETF",
"symbol": "TQQQ",
"description": "ProShares UltraPro QQQ",
"bidPrice": 54.59,
"bidSize": 200,
"bidId": "Q",
"askPrice": 54.6,
"askSize": 8000,
"askId": "Q",
"lastPrice": 54.6,
"lastSize": 100,
"lastId": "P",
"openPrice": 51.09,
"highPrice": 54.6,
"lowPrice": 50.43,
"bidTick": " ",
"closePrice": 48.92,
"netChange": 5.68,
"totalVolume": 14996599,
"quoteTimeInLong": 1540493136946,
"tradeTimeInLong": 1540493136946,
"mark": 54.6,
"exchange": "q",
"exchangeName": "NASDAQ",
"marginable": true,
"shortable": true,
"volatility": 0.02960943,
"digits": 4,
"52WkHigh": 73.355,
"52WkLow": 38.6568,
"nAV": 0,
"peRatio": 0,
"divAmount": 0,
"divYield": 0,
"divDate": "2016-12-21 00:00:00.0",
"securityStatus": "Normal",
"regularMarketLastPrice": 54.6,
"regularMarketLastSize": 1,
"regularMarketNetChange": 5.68,
"regularMarketTradeTimeInLong": 1540493136946,
"delayed": true
}
}
Is there a clean way of being able to tell when I've arrived at the next object?
Yes, assuming you are using a JsonTextReader, you can look at the TokenType property and check whether it is JsonToken.StartObject. This corresponds to the opening brace { in the JSON. There is also an EndObject token type corresponding to the closing brace }, which will probably also be useful depending on how your code is written.
The typical usage pattern is something like this:
If reader.TokenType = JsonToken.StartObject Then
    While reader.Read() AndAlso reader.TokenType <> JsonToken.EndObject
        ' process properties of the JSON object
    End While
End If

Edit Parameter in JSON

I want to deploy an Azure ARM Template.
In the parameters section I defined an IP range for the subnet.
"SubnetIP": {
"defaultValue": "10.0.0.0",
"type": "string"
},
"SubnetMask": {
"type": "int",
"defaultValue": 16,
"allowedValues": [
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27
]
}
When creating the private IP, I used:
"privateIPAddress": "[concat(parameters('SubnetIP'),copyindex(20))]",
This does not give me the expected output, because SubnetIP is 10.0.0.0 and not 10.0.0. Is there a way to edit the parameter in that function?
Regards Stefan
You should do a bit of calculation if you want this to be robust:
"ipAddress32Bit": "[add(add(add(mul(int(split(parameters('ipAddress'),'.')[0]),16777216),mul(int(split(parameters('ipAddress'),'.')[1]),65536)),mul(int(split(parameters('ipAddress'),'.')[2]),256)),int(split(parameters('ipAddress'),'.')[3]))]",
"modifiedIp": "[add(variables('ipAddress32Bit'),1)]",
"ipAddressOut": "[concat(string(div(variables('modifiedIP'),16777216)), '.', string(div(mod(variables('modifiedIP'),16777216),65536)), '.', string(div(mod(variables('modifiedIP'),65536),256)), '.', string(mod(variables('modifiedIP'),256)))]"
Not going to take credit for that: source. The addition happens in the modifiedIp variable in this example. You could also combine this with the copy function.
Edit: OK, I thought this was somewhat obvious, but I'll explain how I understand what's going on (I might be wrong).
He takes the individual IP address pieces (10.1.2.3 > 10, 1, 2, 3)
He multiplies each piece by a specific number to get its decimal representation
He sums the pieces
He adds 1 (to get the next IP address in the decimal representation)
He converts the decimal number back to an IP address
To illustrate the idea, use these links:
https://www.browserling.com/tools/dec-to-ip
https://www.ipaddressguide.com/ip
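To make the arithmetic concrete, here is the same calculation written out in plain JavaScript (illustration only, not ARM syntax), using the question's SubnetIP of 10.0.0.0 and an offset of 21 as an example:
// Same steps as the ARM expressions above, spelled out.
const ip = '10.0.0.0';
const [a, b, c, d] = ip.split('.').map(Number);

// 1) Convert the four octets into a single number.
const asNumber = a * 16777216 + b * 65536 + c * 256 + d;   // 167772160

// 2) Add the desired offset (what copyindex supplies in the template).
const shifted = asNumber + 21;                             // 167772181

// 3) Convert back to dotted notation.
const out = [
    Math.floor(shifted / 16777216),
    Math.floor((shifted % 16777216) / 65536),
    Math.floor((shifted % 65536) / 256),
    shifted % 256
].join('.');
console.log(out);                                          // "10.0.0.21"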
So you want only the first part of the specified subnet?
maybe try something like this?
"variables":{
"SubnetPrefix": "[substring(parameters('SubnetIP'), 0, lastIndexOf(parameters('SubnetIP'), '.'))]"
"privateIPAddress": "[concat(variables('SubnetPrefix'),copyindex(20))]"
}
It would not be pretty for subnets larger than /24, but in the example it could work. Have a look at the ARM template string functions.
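Spelled out in plain JavaScript (again illustration only, not ARM syntax), those two expressions do roughly this:
const subnetIP = '10.0.0.0';
// substring(SubnetIP, 0, lastIndexOf(SubnetIP, '.'))  ->  "10.0.0"
const subnetPrefix = subnetIP.substring(0, subnetIP.lastIndexOf('.'));
// concat(SubnetPrefix, '.', copyindex(20))  ->  "10.0.0.20", "10.0.0.21", ...
const privateIPAddress = subnetPrefix + '.' + 20;
console.log(privateIPAddress);   // "10.0.0.20"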

Is it more efficient to organize data into multiple rows, columns, or tables in an RDS schema, or multiple DynamoDB tables?

Using Node.js (DynamoDB and Sequelize)
Hey there,
I'm trying to work out the best way to schema the following statistical report data, which needs to be tracked with long-term scalability in mind. The original architect used a DynamoDB solution because it was NoSQL and we needed the flexibility to add data without being restricted by the schema. However, he nested properties into column objects, requiring us to query the entire table and then loop through the results, checking each row individually to build the response based on the requested search query.
Here is an example report Object that was originally stored in dynamoDB
EXAMPLE 1
{
"boolData": null,
"datetime": 1490391013471,
"eventData": [
{
"eventType": 0,
"location": "-16.3, 2.1, -70.8",
"timestamp": 1490391033260
}
],
"floatData": {
"averageAltitude": 1.79624987,
"averageSpeed": 0,
"maxAltitude": 3.55,
"scorePercent": 0,
"topSpeed": 0
},
"intData": {
"altitudeViolations": 0,
"closeCalls": 0,
"crashCount": 1,
"distanceToNoFlyViolations": 0,
"lostLineOfSightViolations": 0,
"moduleId": 2010,
"moduleStatus": 0,
"resetCount": 0,
"sceneId": 1007,
"score": 0,
"tooCloseOrAbovePersonViolations": 0
},
"longData": {
"moduleCompleted": 1490391033260,
"moduleStartTime": 1490391023584,
"moduleTotalTime": 9676
},
"objecttype": 1,
"stringData": {
"name": "test",
"grade": "F",
"moduleName": "HorizontalFlight1",
"sceneName": "BasicTraining"
},
"userid": 1
}
This is obviously not the best way to store the data, for the reasons stated above: if I wanted to know the averageAltitude for all flights that took place in a specific module or scene, I'd have to query ALL the data, then loop through ALL the results, check the properties nested in floatData and intData, and compare against the information requested in the query to build the response.
So my next thought was: why not store each piece of data in its own column, like in EXAMPLE 2? The only downside I could see with this schema in DynamoDB is that the maximum amount of data allowed in a row/item is 400 KB, and since we don't know how much data we'll want to add in the future, this might have issues with scaling. The solution would just be to limit the items returned and paginate the response based on what we currently know an item's size to be, dividing that by the 1 MB scan cap.
EXAMPLE 2
{
"boolData": null,
"datetime": 1490391013471,
"eventData": [
{
"eventType": 0,
"location": "-16.3, 2.1, -70.8",
"timestamp": 1490391033260
}
],
"averageAltitude": 1.79624987,
"averageSpeed": 0,
"maxAltitude": 3.55,
"scorePercent": 0,
"topSpeed": 0
"altitudeViolations": 0,
"closeCalls": 0,
"crashCount": 1,
"distanceToNoFlyViolations": 0,
"droneId": 0,
"lostLineOfSightViolations": 0,
"moduleId": 2010,
"moduleStatus": 0,
"resetCount": 0,
"sceneId": 1007,
"score": 0,
"tooCloseOrAbovePersonViolations": 0
"moduleCompleted": 1490391033260,
"moduleStartTime": 1490391023584,
"moduleTotalTime": 9676
"objecttype": 1,
"name": "test",
"grade": "F",
"moduleName": "HorizontalFlight1",
"sceneName": "BasicTraining"
"userid": 1
}
The OTHER idea I was mulling over was to separate intData, floatData, stringData, and eventData into their own DynamoDB tables, with an index on reportId to associate them accordingly, and then construct the response. However, I'm not sure DynamoDB was designed for this association/relationship purpose, and I'm pretty sure that sort of association would be faster with an RDS, which leads to my second proposal.
If I stored EXAMPLE 1 in an Aurora/MySQL RDS and stringified intData, floatData, stringData, and eventData to store them in their own respective TEXT or BLOB columns of a Reports table, I'm pretty sure scalability would be drastically less efficient. Stringifying the data adds all those extra bytes, and even though it'd allow us to flexibly add and remove properties to track in those columns, we still couldn't do a query like SELECT * FROM REPORTS WHERE averageAltitude >= 1.5, since that would require me to do exactly what I'm already doing, with the extra step of parsing the stringified JSON: I'd query ALL the reports, iterate through them checking the floatData property averageAltitude, and then build my result. So to circumvent that, I considered creating RDS tables for intData, floatData, and stringData with the following schema (only showing intData for example purposes):
intData
id: {
type: DataTypes.INTEGER.UNSIGNED,
allowNull: false,
primaryKey: true,
autoIncrement: true
},
name: {
type: DataTypes.STRING(191),
allowNull: false
},
value: {
type: DataTypes.INTEGER,
allowNull: false
}
and then doing a Report.hasMany association:
db.Report.hasMany(db.IntData, {
as: 'intData',
foreignKey: 'reportId',
constraints: false
});
This seems like a pretty practical method that might work well, since I could easily include the data on the query, and it would be agnostic to the amount of intData, floatData, etc. inserted per report. Is this the most optimal method? This method would remove the DynamoDB tables altogether, but it certainly seems like it would be more optimal than storing intData, floatData, etc. as JSON strings in TEXT columns. I'm just not sure if, in the long term, this method is more scalable and cost-effective than using DynamoDB. We would like to postpone upgrading to a large RDS instance for as long as possible, and querying reports is definitely going to be the most expensive call.
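For what it's worth, with that association in place I imagine the lookup could be expressed roughly like this (a sketch only; the db.FloatData model, its 'floatData' alias, and the async wrapper are assumptions mirroring the intData example above, not code we have written yet):
const { Op } = require('sequelize');

// Hypothetical query: every report whose associated floatData row named
// "averageAltitude" has a value >= 1.5, fetched in a single round trip.
async function reportsWithHighAltitude(db) {
    return db.Report.findAll({
        include: [{
            model: db.FloatData,
            as: 'floatData',
            where: {
                name: 'averageAltitude',
                value: { [Op.gte]: 1.5 }
            }
        }]
    });
}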
I appreciate your recommendations and input. And please let me know of an alternative solution if I'm completely missing one that could be even better than the ones I propose. Thank you!

Regex Return First Match

I have a weather file where I would like to extract the first value for "air_temp" recorded in a JSON file. The format this HTTP retriever uses is regex (I know it is not the best method).
I've shortened the JSON file to 2 data entries for simplicity - there are usually 100.
{
"observations": {
"notice": [
{
"copyright": "Copyright Commonwealth of Australia 2017, Bureau of Meteorology. For more information see: http://www.bom.gov.au/other/copyright.shtml http://www.bom.gov.au/other/disclaimer.shtml",
"copyright_url": "http://www.bom.gov.au/other/copyright.shtml",
"disclaimer_url": "http://www.bom.gov.au/other/disclaimer.shtml",
"feedback_url": "http://www.bom.gov.au/other/feedback"
}
],
"header": [
{
"refresh_message": "Issued at 12:11 pm EST Tuesday 11 July 2017",
"ID": "IDN60901",
"main_ID": "IDN60902",
"name": "Canberra",
"state_time_zone": "NSW",
"time_zone": "EST",
"product_name": "Capital City Observations",
"state": "Aust Capital Territory"
}
],
"data": [
{
"sort_order": 0,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/12:00pm",
"local_date_time_full": "20170711120000",
"aifstime_utc": "20170711020000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 5.7,
"cloud": "Mostly clear",
"cloud_base_m": 1050,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 3.6,
"gust_kmh": 11,
"gust_kt": 6,
"air_temp": 9.0,
"dewpt": 0.2,
"press": 1032.7,
"press_qnh": 1031.3,
"press_msl": 1032.7,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 54,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "WNW",
"wind_spd_kmh": 7,
"wind_spd_kt": 4
},
{
"sort_order": 1,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/11:30am",
"local_date_time_full": "20170711113000",
"aifstime_utc": "20170711013000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 4.6,
"cloud": "Mostly clear",
"cloud_base_m": 900,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 2.9,
"gust_kmh": 9,
"gust_kt": 5,
"air_temp": 7.3,
"dewpt": 0.1,
"press": 1033.1,
"press_qnh": 1031.7,
"press_msl": 1033.1,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 60,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "NW",
"wind_spd_kmh": 4,
"wind_spd_kt": 2
}
]
}
}
The regex expression I am currently using is: .*air_temp": (\d+).* but this is returning 9 and 7.3 (entries 1 and 2). Could someone suggest a way to only return the first value?
I have tried using a lazy quantifier group, but have had no luck.
This regex will help you. But I think you should capture and extract the first match with features of the programming language you are using.
.*air_temp": (\d{1,3}\.\d{0,3})[\s\S]*?},
To understand the regex better: take a look at this.
Update
The above solution works if you have only two data entries. For more than two entries, we should use this one:
header[\s\S]*?"air_temp": (\d{1,3}\.\d{0,3})
Here we match the word header first and then match anything in a non-greedy way. After that, we match our expected pattern; thus we get the first match. Play with it here on regex101.
To capture negative numbers, we need to check whether a - character exists or not. We do this with ?, which indicates zero or one occurrence of the preceding element.
So the regex becomes,
header[\s\S]*?"air_temp": (-?\d{1,3}\.\d{0,3}) Demo
But the use of \K without the global flag (in another answer, given by mickmackusa) is more efficient. To detect negative numbers, the modified version of that regex is
air_temp": \K-?\d{1,2}\.\d{1,2} demo.
Here {1,2} means one to two occurrences of the previous character. We use this as {min_occurrences,max_occurrences}.
I do not know which language you are using, but this looks like the difference between using and not using the global flag.
If the global flag is not set, only the first result will be returned. If the global flag is set on your regex, it will iterate through and return all possible results. You can test it easily using Regex101: https://regex101.com/r/x1bwg2/1
The lazy/greediness should not have any impact with regard to using/not using the global flag.
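For example, in JavaScript (a minimal sketch, assuming the JSON text is held in a string variable called json; the name is only for illustration):
// Without the global flag, match() returns only the first occurrence.
const first = json.match(/"air_temp": (-?[\d.]+)/);
console.log(first && first[1]);   // "9.0"

// With the global flag, every occurrence is returned.
const all = [...json.matchAll(/"air_temp": (-?[\d.]+)/g)].map(m => m[1]);
console.log(all);                 // ["9.0", "7.3"]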
If \K is allowed in your coding language, use this: Demo
/air_temp": \K[\d.]+/ (117steps) this will be highly efficient in searching your very large JSON text.
If no \K is allowed, you can use a capture group: (Demo)
/air_temp": ([\d.]+)/ this will still move with decent speed through your JSON text
Notice that there is no global flag at the end of the pattern, so after one match, the regex engine stops searching.
Update:
For "less literal" matches (but it shouldn't matter if your source is reliable), you could use:
Extended character class to include -:
/air_temp": \K[\d.-]+/ #still 117 steps
or change to negated character class and match everything that isn't a , (because the value always terminates with a comma):
/air_temp": \K[^,]+/ #still 117 steps
For a very strict match (if you are looking for a pattern that means you have ZERO confidence in the input data)...
It appears that your data doesn't go beyond one decimal place, temps between 0 and 1 prepend a 0 before the decimal, and I don't think you need to worry about temps in the hundreds (right?), so you could use:
/air_temp": \K-?[1-9]?\d(?:\.\d)? #200steps
Explanation:
Optional negative sign
Optional tens digit
Required ones digit
Optional decimal which must be followed by a digit
Accuracy Test Demo
Real Data Demo

Using addToSet inside an array with MongoDB

I'm trying to track daily stats for an individual.
I'm having a hard time adding a new day inside "history", and I could also use a pointer on updating "walkingSteps" as new data comes in.
My schema looks like:
{
"_id": {
"$oid": "50db246ce4b0fe4923f08e48"
},
"history": [
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e12"
},
"date": {
"$date": "2012-12-24T15:26:15.321Z"
},
"walkingSteps": 10,
"goalStatus": 1
},
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e13"
},
"date": {
"$date": "2012-12-25T15:26:15.321Z"
},
"walkingSteps": 5,
"goalStatus": 0
},
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e14"
},
"date": {
"$date": "2012-12-26T15:26:15.321Z"
},
"walkingSteps": 8,
"goalStatus": 0
}
]
}
db.history.update( ? )
I've been browsing (and experimenting with) the MongoDB documentation, but it doesn't quite break things all the way down for dummies like myself... I couldn't quite translate the examples to my setup.
Thanks for any help.
E = noob trying to learn programming
Adding a day:
user = {_id: ObjectId("50db246ce4b0fe4923f08e48")}
day = {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps:0, goalStatus: 0}
db.users.update(user, {$addToSet: {history:day}})
Updating walkingSteps:
user = ObjectId("50db246ce4b0fe4923f08e48")
day = ObjectId("50db2316e4b0fe4923f08e13") // second day in your example
query = {_id: user, 'history._id': day}
db.users.update(query, {$set: {"history.$.walkingSteps": 6}})
This uses the $ positional operator.
It might be easier to have a separate history collection though.
[Edit] On the separate collections:
Adding days grows the document in size and it might need to be relocated on the disk. This can lead to performance issues and fragmentation.
Deleting days won't shrink the document size on disk.
It makes querying easier/straightforward (e.g. searching for a period of time)
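As a rough sketch of that alternative (the history collection, its userId field, and the example dates are assumptions for illustration, not part of the question's schema):
// One document per user per day, keyed by the owning user.
db.history.insert({
    userId: ObjectId("50db246ce4b0fe4923f08e48"),
    date: ISODate("2013-01-07"),
    walkingSteps: 0,
    goalStatus: 0
})

// Searching for a period of time becomes a plain range query.
db.history.find({
    userId: ObjectId("50db246ce4b0fe4923f08e48"),
    date: { $gte: ISODate("2013-01-01"), $lt: ISODate("2013-02-01") }
})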
Even though @Justin Case gives the right answer, he doesn't explain a few things in it very well.
You will notice, first of all, that he drops the time resolution on the dates and stores merely the date instead of the date and time, like so:
day = {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps:0, goalStatus: 0}
This means that all your dates will have 00:00:00 for their time instead of the exact time you are using at the moment. This makes querying per day easier, so you can do something like:
db.col.update(
{"_id": ObjectId("50db246ce4b0fe4923f08e48"),
"history.date": ISODate("2013-01-07")},
{$inc: {"history.$.walkingSteps":0}}
)
and other similar queries.
This also makes $addToSet actually enforce its rules; however, since the data in this subdocument could change (i.e. walkingSteps will increment), $addToSet will not work well here anyway.
This is something I would change from the ticked answer. I would probably use $push or something else instead since $addToSet is heavier and won't really do anything useful here.
The reason for a separate history collection in my view would be what you said earlier with:
Yes, the amount of history items for that day.
So this array contains a set of days, which is fine, but it sounds like the figure you wish to derive walkingSteps from, a set of history items, should be in another collection, and you would set walkingSteps according to the count of items in that other collection for today:
db.history_items.find({date: ISODate("2013-01-07")}).count();
Referring to the MongoDB manual, $ is the positional operator, which identifies an element in an array field to update without explicitly specifying the position of the element in the array. The positional $ operator, when used with the update() method, acts as a placeholder for the first match of the update query selector.
So, if you issue a command to update your collection like this:
db.history.update(
    { someCriterion: someValue },
    { $push:
        { "history":
            { "_id": { "$oid": "50db2316e4b0fe4923f08e12" },
              "date": { "$date": "2012-12-24T15:26:15.321Z" },
              "walkingSteps": 10,
              "goalStatus": 1
            }
        }
    }
)
MongoDB might try to interpret $oid and $date as operators. $ is also part of the update operators like $set and $push. So, it is better to avoid using this special character in field names in MongoDB.