I have a weather file in JSON format, and I would like to extract the first recorded value of "air_temp". The HTTP retriever I am using only supports regex for extraction (I know it is not the best method).
I've shortened the JSON file to 2 data entries for simplicity - there are usually 100.
{
"observations": {
"notice": [
{
"copyright": "Copyright Commonwealth of Australia 2017, Bureau of Meteorology. For more information see: http://www.bom.gov.au/other/copyright.shtml http://www.bom.gov.au/other/disclaimer.shtml",
"copyright_url": "http://www.bom.gov.au/other/copyright.shtml",
"disclaimer_url": "http://www.bom.gov.au/other/disclaimer.shtml",
"feedback_url": "http://www.bom.gov.au/other/feedback"
}
],
"header": [
{
"refresh_message": "Issued at 12:11 pm EST Tuesday 11 July 2017",
"ID": "IDN60901",
"main_ID": "IDN60902",
"name": "Canberra",
"state_time_zone": "NSW",
"time_zone": "EST",
"product_name": "Capital City Observations",
"state": "Aust Capital Territory"
}
],
"data": [
{
"sort_order": 0,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/12:00pm",
"local_date_time_full": "20170711120000",
"aifstime_utc": "20170711020000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 5.7,
"cloud": "Mostly clear",
"cloud_base_m": 1050,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 3.6,
"gust_kmh": 11,
"gust_kt": 6,
"air_temp": 9.0,
"dewpt": 0.2,
"press": 1032.7,
"press_qnh": 1031.3,
"press_msl": 1032.7,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 54,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "WNW",
"wind_spd_kmh": 7,
"wind_spd_kt": 4
},
{
"sort_order": 1,
"wmo": 94926,
"name": "Canberra",
"history_product": "IDN60903",
"local_date_time": "11/11:30am",
"local_date_time_full": "20170711113000",
"aifstime_utc": "20170711013000",
"lat": -35.3,
"lon": 149.2,
"apparent_t": 4.6,
"cloud": "Mostly clear",
"cloud_base_m": 900,
"cloud_oktas": 1,
"cloud_type_id": 8,
"cloud_type": "Cumulus",
"delta_t": 2.9,
"gust_kmh": 9,
"gust_kt": 5,
"air_temp": 7.3,
"dewpt": 0.1,
"press": 1033.1,
"press_qnh": 1031.7,
"press_msl": 1033.1,
"press_tend": "-",
"rain_trace": "0.0",
"rel_hum": 60,
"sea_state": "-",
"swell_dir_worded": "-",
"swell_height": null,
"swell_period": null,
"vis_km": "10",
"weather": "-",
"wind_dir": "NW",
"wind_spd_kmh": 4,
"wind_spd_kt": 2
}
]
}
}
The regex I am currently using is: .*air_temp": (\d+).* but this is returning 9 and 7.3 (entries 1 and 2). Could someone suggest a way to return only the first value?
I have tried using a lazy quantifier, but have had no luck.
This regex will help you. But I think you should capture and extract the first match with features of the programming language you are using.
.*air_temp": (\d{1,3}\.\d{0,3})[\s\S]*?},
To understand the regex better: take a look at this.
Update
The above solution works if you have only two data entries. For more than two entries, we should have used this one:
header[\s\S]*?"air_temp": (\d{1,3}\.\d{0,3})
Here we match the word header first and then match anything in a non-greedy way. After that, we match our expected pattern, so we get the first match. Play with it here in regex101.
To capture negative numbers, we need to allow for an optional - character. We do this with ?, which matches zero or one occurrence of the preceding element.
So the regex becomes:
header[\s\S]*?"air_temp": (-?\d{1,3}\.\d{0,3}) Demo
But the use of \K without the global flag (in another answer, given by mickmackusa) is more efficient. To detect negative numbers, the modified version of that regex is
air_temp": \K-?\d{1,2}\.\d{1,2} demo.
Here {1,2} means one or two occurrences of the preceding element. The general form is {min_occurrences,max_occurrences}.
I do not know which language you are using, but it seems like a difference between the global flag and not using the global flag.
If the global flag is not set, only the first result will be returned. If the global flag is set on your regex, it will iterate through returning all possible results. You can test it easily using Regex101, https://regex101.com/r/x1bwg2/1
The lazy/greediness should not have any impact in regards to using/not using the global flag
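As a rough illustration of the global versus non-global behaviour described above, here is a minimal Python sketch. Python is only an assumption here (the question does not name a language), and the sample text is a shortened stand-in for the weather JSON. In Python's re module, search() plays the role of a non-global match and findall() the role of a global one.

```python
import re

# Shortened stand-in for the weather JSON; only the two air_temp values matter here.
text = '''
"data": [
  { "sort_order": 0, "air_temp": 9.0, "dewpt": 0.2 },
  { "sort_order": 1, "air_temp": 7.3, "dewpt": 0.1 }
]
'''

pattern = re.compile(r'air_temp": (-?[\d.]+)')

# Non-global behaviour: search() stops at the first match.
first = pattern.search(text).group(1)

# Global behaviour: findall() keeps scanning and returns every match.
all_values = pattern.findall(text)

print(first)       # 9.0
print(all_values)  # ['9.0', '7.3']
```

The pattern itself is identical in both calls; only the way it is applied decides whether one value or all values come back.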
If \K is allowed in your coding language, use this: Demo
/air_temp": \K[\d.]+/ (117steps) this will be highly efficient in searching your very large JSON text.
If no \K is allowed, you can use a capture group: (Demo)
/air_temp": ([\d.]+)/ this will still move with decent speed through your JSON text
Notice that there is no global flag at the end of the pattern, so after one match, the regex engine stops searching.
Update:
For "less literal" matches (but it shouldn't matter if your source is reliable), you could use:
Extended character class to include -:
/air_temp": \K[\d.-]+/ #still 117 steps
or change to negated character class and match everything that isn't a , (because the value always terminates with a comma):
/air_temp": \K[^,]+/ #still 117 steps
For a very strict match (if you are looking for a pattern that means you have ZERO confidence in the input data)...
It appears that your data doesn't go beyond one decimal place, temps between 0 and 1 prepend a 0 before the decimal, and I don't think you need to worry with temps in the hundreds (right?), so you could use:
/air_temp": \K-?[1-9]?\d(?:\.\d)?/ #200 steps
Explanation:
Optional negative sign
Optional tens digit
Required ones digit
Optional decimal which must be followed by a digit
Accuracy Test Demo
Real Data Demo
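For engines without \K, the strict pattern can be tried with a capture group instead. The following is a sketch in Python, whose built-in re module does not support \K; the helper name first_air_temp and the sample strings are made up for illustration.

```python
import re

# Python's re has no \K, so the value is captured in a group instead.
STRICT = re.compile(r'air_temp": (-?[1-9]?\d(?:\.\d)?)')

def first_air_temp(text):
    """Return the first air_temp value as a string, or None if absent."""
    match = STRICT.search(text)
    return match.group(1) if match else None

# Hypothetical inputs covering the cases listed in the explanation.
samples = ['"air_temp": 9.0,', '"air_temp": -3.5,', '"air_temp": 0.4,', '"air_temp": 21,']
results = [first_air_temp(s) for s in samples]
print(results)  # ['9.0', '-3.5', '0.4', '21']
```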
I'm new to MySQL and have received a task that requires a complex (for me) query. I read the documentation and a few other sources, but I still cannot write it myself.
I'm selecting rows from a table where one of the cells contains JSON like this one:
{
[
{
"interval" : 2,
"start": 03,
"end": 07,
"day_of_week": 3
}, {
"interval" : 8,
"start": 22,
"end": 23,
"day_of_week": 6
}
]
}
I want to check whether any of the "day_of_week" values is equal to the current day of the week and, if so, store that value and the "start", "end" and "day_of_week" values associated with it in variables to use in the query.
That's not valid JSON format, so none of the MySQL JSON functions will work on it regardless. Better just fetch the whole blob of not-JSON into a client application that knows how to parse it, and deal with it there.
Even if it were valid JSON, I would ask this: why would you store data in a format you don't know how to query?
The proper solution is the following:
SELECT start, end, day_of_week
FROM mytable
WHERE day_of_week = DAYOFWEEK(CURDATE());
See how easy that is when you store data in normal rows and columns? You get to use ordinary SQL expressions, instead of wondering how you can trick MySQL into giving up the data buried in your non-JSON blob.
JSON is the worst thing to happen to relational databases.
Re your comment:
If you need to query by day of week, then you could reorganize your JSON to support that type of query:
{
"3":{
"interval" : 2,
"start": 03,
"end": 07,
"day_of_week": 3
},
"6": {
"interval" : 8,
"start": 22,
"end": 23,
"day_of_week": 6
}
}
Then it's possible to get results for the current weekday this way:
SELECT data->>'$.start' AS `start`,
data->>'$.end' AS `end`,
data->>'$.day_of_week' AS `day_of_week`
FROM (
SELECT JSON_EXTRACT(data, CONCAT('$."', DAYOFWEEK(CURDATE()), '"')) AS data
FROM mytable
) AS d;
In general, when you store data in a non-relational manner, the way to optimize it is to organize the data to support a specific query.
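To show why the keyed layout suits this particular query, here is a small client-side sketch in Python. It assumes the blob has been made valid JSON (03 and 07 written as 3 and 7), and the helper names mysql_dayofweek and entry_for are invented for the example. With the day of week as the key, the lookup is a single dictionary access rather than a scan of an array.

```python
import json
from datetime import date

# Assumed valid-JSON version of the reorganized layout (03/07 written as 3/7).
blob = '''
{
  "3": {"interval": 2, "start": 3, "end": 7, "day_of_week": 3},
  "6": {"interval": 8, "start": 22, "end": 23, "day_of_week": 6}
}
'''

def mysql_dayofweek(d):
    """Mimic MySQL DAYOFWEEK(): 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1

def entry_for(d, raw):
    """Look up the schedule entry for date d, or None if that day has none."""
    schedule = json.loads(raw)
    return schedule.get(str(mysql_dayofweek(d)))

# 2017-07-11 was a Tuesday, which DAYOFWEEK() numbers as 3.
print(entry_for(date(2017, 7, 11), blob))
```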
So I'm currently using JSON.NET in Visual Studio to parse my JSON since using deserialization is too slow for what I'm trying to do. I'm pulling stock information from TD Ameritrade and can request multiple stocks at the same time. The JSON result below is from pulling only 1. As you can see, the first line is "TQQQ". If I were to pull more than one stock, I'd have "TQQQ", then "CEI" in separate blocks representing different objects.
Under normal deserialization, I could just say to deserialize a dictionary and it would put them into the dictionary accordingly with whatever class I had written for it to populate. However, since I need to parse line by line, is there a clean way of being able to tell when I've arrived to the next object?
I could say to keep track of the very last field and then add the next line (the next ticker's name) to dictionary, but that seems a little hacky.
I don't think any VB code is necessary besides the initial startup of creating a new JSONReader.
{
"TQQQ": {
"assetType": "ETF",
"symbol": "TQQQ",
"description": "ProShares UltraPro QQQ",
"bidPrice": 54.59,
"bidSize": 200,
"bidId": "Q",
"askPrice": 54.6,
"askSize": 8000,
"askId": "Q",
"lastPrice": 54.6,
"lastSize": 100,
"lastId": "P",
"openPrice": 51.09,
"highPrice": 54.6,
"lowPrice": 50.43,
"bidTick": " ",
"closePrice": 48.92,
"netChange": 5.68,
"totalVolume": 14996599,
"quoteTimeInLong": 1540493136946,
"tradeTimeInLong": 1540493136946,
"mark": 54.6,
"exchange": "q",
"exchangeName": "NASDAQ",
"marginable": true,
"shortable": true,
"volatility": 0.02960943,
"digits": 4,
"52WkHigh": 73.355,
"52WkLow": 38.6568,
"nAV": 0,
"peRatio": 0,
"divAmount": 0,
"divYield": 0,
"divDate": "2016-12-21 00:00:00.0",
"securityStatus": "Normal",
"regularMarketLastPrice": 54.6,
"regularMarketLastSize": 1,
"regularMarketNetChange": 5.68,
"regularMarketTradeTimeInLong": 1540493136946,
"delayed": true
}
}
Is there a clean way of being able to tell when I've arrived at the next object?
Yes, assuming you are using a JsonTextReader you can look at the TokenType property and check whether it is StartObject. This corresponds to the opening braces { in the JSON. There is also an EndObject token type corresponding to the closing braces }, which will probably also be useful depending on how your code is written.
Typical usage pattern is something like this:
If reader.TokenType = JsonToken.StartObject Then
While reader.Read() AndAlso reader.TokenType <> JsonToken.EndObject
' process properties of the JSON object
End While
End If
I want to deploy an Azure ARM Template.
In the parameter section I defined an IP range for the subnet.
"SubnetIP": {
"defaultValue": "10.0.0.0",
"type": "string"
},
"SubnetMask": {
"type": "int",
"defaultValue": 16,
"allowedValues": [
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27
]
}
When creating the private IP, I used
"privateIPAddress": "[concat(parameters('SubnetIP'),copyindex(20))]",
This does not give me the expected output, because SubnetIP is 10.0.0.0 and not 10.0.0. Is there a way to edit the parameter inside that function?
Regards Stefan
You should do a bit of arithmetic if you want this to be robust:
"ipAddress32Bit": "[add(add(add(mul(int(split(parameters('ipAddress'),'.')[0]),16777216),mul(int(split(parameters('ipAddress'),'.')[1]),65536)),mul(int(split(parameters('ipAddress'),'.')[2]),256)),int(split(parameters('ipAddress'),'.')[3]))]",
"modifiedIp": "[add(variables('ipAddress32Bit'),1)]",
"ipAddressOut": "[concat(string(div(variables('modifiedIP'),16777216)), '.', string(div(mod(variables('modifiedIP'),16777216),65536)), '.', string(div(mod(variables('modifiedIP'),65536),256)), '.', string(mod(variables('modifiedIP'),256)))]"
not going to take credit for that. source. addition happens in the modifiedIp variable in this example. you could also combine this with copy function.
Edit: OK, I thought this was somewhat obvious, but I'll explain how I understand what's going on (I might be wrong).
he takes individual ip address pieces (10.1.2.3 > 10, 1, 2, 3)
he multiplies each piece by a specific number to get its decimal representation
he sums the pieces
he adds 1 (to get next ip address in decimal representation)
he casts decimal number back to ip address
To illustrate the idea, use these links:
https://www.browserling.com/tools/dec-to-ip
https://www.ipaddressguide.com/ip
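The steps listed in this answer can also be tried outside a template. Below is a minimal Python sketch of the same arithmetic, using the same multipliers as the template expression; the function names are made up for illustration, and the last line cross-checks the result against Python's standard ipaddress module.

```python
import ipaddress

def ip_to_int(ip):
    """Combine the four octets into one 32-bit number, as the mul/add chain does."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return a * 16777216 + b * 65536 + c * 256 + d

def int_to_ip(n):
    """Split the 32-bit number back into dotted form using div and mod."""
    octets = (n // 16777216, n % 16777216 // 65536, n % 65536 // 256, n % 256)
    return ".".join(str(x) for x in octets)

def offset_ip(ip, offset):
    """Return the address that lies offset positions after ip."""
    return int_to_ip(ip_to_int(ip) + offset)

print(offset_ip("10.0.0.0", 20))   # 10.0.0.20
print(offset_ip("10.0.0.255", 1))  # 10.0.1.0

# Cross-check against the standard library.
assert offset_ip("10.0.0.255", 1) == str(ipaddress.ip_address("10.0.0.255") + 1)
```

The second call shows why the arithmetic route is more robust than string concatenation: the carry into the third octet is handled automatically.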
So you want only the first part of the specified subnet?
Maybe try something like this?
"variables":{
"SubnetPrefix": "[substring(parameters('SubnetIP'), 0, lastIndexOf(parameters('SubnetIP'), '.'))]",
"privateIPAddress": "[concat(variables('SubnetPrefix'), '.', copyindex(20))]"
}
It would not be pretty for larger subnets than /24, but in the example it could work. Have a look at ARM template string functions
I am no RegEx expert. I am trying to understand if can use RegEx to find a block of data from a JSON file.
My Scenario:
I am using an AWS RDS instance with enhanced monitoring. The monitoring data is being sent to a CloudWatch log stream. I am trying to use the data posted in CloudWatch to be visible in log management solution Loggly.
The ingestion is no problem, I can see the data in Loggly. However, the whole message is contained in one big blob field. The field content is a JSON document. I am trying to figure out if I can use RegEx to extract only certain parts of the JSON document.
Here is a sample extract from the JSON payload I am using:
{
"engine": "MySQL",
"instanceID": "rds-mysql-test",
"instanceResourceID": "db-XXXXXXXXXXXXXXXXXXXXXXXXX",
"timestamp": "2017-02-13T09:49:50Z",
"version": 1,
"uptime": "0:05:36",
"numVCPUs": 1,
"cpuUtilization": {
"guest": 0,
"irq": 0.02,
"system": 1.02,
"wait": 7.52,
"idle": 87.04,
"user": 1.91,
"total": 12.96,
"steal": 2.42,
"nice": 0.07
},
"loadAverageMinute": {
"fifteen": 0.12,
"five": 0.26,
"one": 0.27
},
"memory": {
"writeback": 0,
"hugePagesFree": 0,
"hugePagesRsvd": 0,
"hugePagesSurp": 0,
"cached": 505160,
"hugePagesSize": 2048,
"free": 2830972,
"hugePagesTotal": 0,
"inactive": 363904,
"pageTables": 3652,
"dirty": 64,
"mapped": 26572,
"active": 539432,
"total": 3842628,
"slab": 34020,
"buffers": 16512
},
My Question
My question is, can I use RegEx to extract, say a subset of the document? For example, CPU Utilization, or Memory etc.? If that is possible, how do I write the RegEx? If possible, I can use it to drill down into the extracted document to get individual data elements as well.
Many thanks for your help.
First I agree with Sebastian: A proper JSON parser is better.
Anyway sometimes the dirty approach must be used. If your text layout will not change, then a regexp is simple:
E.g. "total": (\d+\.\d+) gets the CPU usage and "total": (\d\d\d+) the total memory usage (match at least 3 digits not to match the first total text, memory will probably never be less than 100 :-).
If changes are to be expected make it a bit more stable: ["']total["']\s*:\s*(\d+\.\d+).
It may also be possible to match against newline characters like this: "cpuUtilization"\s*:\s*\{\s*\n.*\n\s*"irq"\s*:\s*(\d+\.\d+) making it a bit more stable (this time for the irq value).
And so on and so on.
You see that you can get fast into very complex expressions. That approach is very fragile!
P.S. Depending on the exact regex flavour Loggly uses, details may change. The above examples are based on Perl.
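As one possible way to pull out a whole sub-document such as cpuUtilization or memory, here is a Python sketch. It assumes the target block contains no nested braces, and it uses a shortened payload with closing braces added, since the extract in the question is truncated. The helper name extract_block is made up. Whether Loggly's own regex engine accepts the same pattern is not something this sketch can confirm.

```python
import json
import re

# Shortened, assumed-complete version of the payload (the original extract is truncated).
payload = '''{
  "engine": "MySQL",
  "cpuUtilization": {
    "guest": 0,
    "irq": 0.02,
    "total": 12.96
  },
  "memory": {
    "free": 2830972,
    "total": 3842628
  }
}'''

def extract_block(name, text):
    """Grab the flat {...} object that follows "name": and parse just that part."""
    match = re.search(r'"%s"\s*:\s*(\{[^{}]*\})' % re.escape(name), text)
    return json.loads(match.group(1)) if match else None

cpu = extract_block("cpuUtilization", payload)
memory = extract_block("memory", payload)
print(cpu["total"], memory["total"])  # 12.96 3842628
```

Capturing the block first and then reading keys from it avoids the ambiguity between the two "total" fields, which the digit-count trick above only works around.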
I'm trying to track daily stats for an individual.
I'm having a hard time adding a new day inside "history" and can also use a pointer on updating "walkingSteps" as new data comes in.
My schema looks like:
{
"_id": {
"$oid": "50db246ce4b0fe4923f08e48"
},
"history": [
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e12"
},
"date": {
"$date": "2012-12-24T15:26:15.321Z"
},
"walkingSteps": 10,
"goalStatus": 1
},
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e13"
},
"date": {
"$date": "2012-12-25T15:26:15.321Z"
},
"walkingSteps": 5,
"goalStatus": 0
},
{
"_id": {
"$oid": "50db2316e4b0fe4923f08e14"
},
"date": {
"$date": "2012-12-26T15:26:15.321Z"
},
"walkingSteps": 8,
"goalStatus": 0
}
]
}
db.history.update( ? )
I've been browsing (and attempting) the mongodb documentation but they don't quite break it all the way down to dummies like myself... I couldn't quite translate their examples to my setup.
Thanks for any help.
E = noob trying to learn programming
Adding a day:
user = {_id: ObjectId("50db246ce4b0fe4923f08e48")}
day = {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps:0, goalStatus: 0}
db.users.update(user, {$addToSet: {history:day}})
Updating walkingSteps:
user = ObjectId("50db246ce4b0fe4923f08e48")
day = ObjectId("50db2316e4b0fe4923f08e13") // second day in your example
query = {_id: user, 'history._id': day}
db.users.update(query, {$set: {"history.$.walkingSteps": 6}})
This uses the $ positional operator.
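To make the effect of the positional operator concrete, here is a plain-Python model of what the update above does to the document. It is not MongoDB or driver code: the function set_positional is invented for illustration and ids are shown as plain strings.

```python
def set_positional(doc, array_field, match, updates):
    """Apply updates to the FIRST element of doc[array_field] matching all of match.

    A plain-Python model of {$set: {"<array>.$.<field>": value}}; not MongoDB code.
    Returns True if an element was updated.
    """
    for element in doc[array_field]:
        if all(element.get(k) == v for k, v in match.items()):
            element.update(updates)
            return True
    return False

user = {
    "_id": "50db246ce4b0fe4923f08e48",
    "history": [
        {"_id": "50db2316e4b0fe4923f08e12", "walkingSteps": 10, "goalStatus": 1},
        {"_id": "50db2316e4b0fe4923f08e13", "walkingSteps": 5, "goalStatus": 0},
    ],
}

set_positional(user, "history", {"_id": "50db2316e4b0fe4923f08e13"}, {"walkingSteps": 6})
print([day["walkingSteps"] for day in user["history"]])  # [10, 6]
```

Only the first matching array element is touched, which mirrors the "first match" behaviour of $.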
It might be easier to have a separate history collection though.
[Edit] On the separate collections:
Adding days grows the document in size and it might need to be relocated on the disk. This can lead to performance issues and fragmentation.
Deleting days won't shrink the document size on disk.
It makes querying easier/straightforward (e.g. searching for a period of time)
Even though @Justin Case gives the right answer, he doesn't explain a few things in it very well.
You will notice first of all that he gets rid of the resolution on dates and moves their format to merely the date instead of date and time like so:
day = {_id: ObjectId(), date: ISODate("2013-01-07"), walkingSteps:0, goalStatus: 0}
This means that all your dates will have 00:00:00 for their time instead of the exact time you are using atm. This increases the ease of querying per day so you can do something like:
db.col.update(
{"_id": ObjectId("50db246ce4b0fe4923f08e48"),
"history.date": ISODate("2013-01-07")},
{$inc: {"history.$.walkingSteps":0}}
)
and other similar queries.
This also makes $addToSet actually enforce its rules. However, since the data in this subdocument can change (walkingSteps will be incremented), $addToSet will not work well here anyway.
This is something I would change from the ticked answer. I would probably use $push or something else instead since $addToSet is heavier and won't really do anything useful here.
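The point about $addToSet can be sketched with a small plain-Python model (again not MongoDB code; add_to_set and the string ids are made up). $addToSet only skips an element that is identical to one already in the array, so two entries for the same date with different _id values are both stored.

```python
def add_to_set(array, item):
    """Model of $addToSet: append only if an identical element is not already present."""
    if item not in array:
        array.append(item)

history = []
# Two entries for the same date; the ids differ, as they would with a fresh ObjectId() each time.
add_to_set(history, {"_id": "id-1", "date": "2013-01-07", "walkingSteps": 0})
add_to_set(history, {"_id": "id-2", "date": "2013-01-07", "walkingSteps": 0})
# An exact duplicate is the only thing that gets rejected.
add_to_set(history, {"_id": "id-2", "date": "2013-01-07", "walkingSteps": 0})

print(len(history))  # 2
```

One difference from the real operator: MongoDB also treats documents with the same fields in a different order as distinct, while Python dictionaries compare equal regardless of key order.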
The reason for a separate history collection in my view would be what you said earlier with:
Yes, the amount of history items for that day.
So this array contains a set of days, which is fine but it sounds like the figure that you wish to get walkingSteps from, a set of history items, should be in another collection and you set walkingSteps according to the count of the amount of items in that other collection for today:
db.history_items.find({date: ISODate("2013-01-07")}).count();
According to the MongoDB manual, $ is the positional operator, which identifies an element in an array field to update without explicitly specifying the position of the element in the array. When used with the update() method, the positional $ operator acts as a placeholder for the first match of the update query selector.
So, if you issue a command to update your collection like this:
db.history.update(
{someCriterion: someValue },
{ $push: { "history":
{"_id": {
"$oid": "50db2316e4b0fe4923f08e12"
},
"date": {
"$date": "2012-12-24T15:26:15.321Z"
},
"walkingSteps": 10,
"goalStatus": 1
}
}
}
)
MongoDB might try to interpret $oid and $date as positional parameters. $ is also the prefix of the atomic operators like $set and $push. So it is better to avoid using this special character in field names in MongoDB.