Alter log file date with the command sed?

I have the following line multiple times in a log file, along with other data.
I'd like to analyze this data by importing the JSON part into MongoDB first and then running selected queries over it.
DEBUG 2015-04-18 23:13:23,374 [TEXT] (Class.java:19) - {"a":"1", "b":"2", ...}
To reduce the data to just the JSON part, I use:
cat mylog.log | sed "s/DEBUG.*19) - //g" > mylog.json
The main problem is that I'd like to keep the date and time part as well, as additional JSON values, to get something like this:
{"date": "2015-04-18", "time":"23:13:26,374", "a":"1", "b":"2", ...}
Here is the main question: how can I do this on the Linux console using sed, or with an alternative console command?
Thanks in advance.

Since this appears to be a very rigid format, you could probably use sed like so:
sed 's/DEBUG \([^ ]*\) \([^ ]*\).*19) - {/{ "date": "\1", "time": "\2", /' mylog.log
Where [^ ]* matches a sequence of non-space characters and \(regex\) is a capturing group that makes a matched string available for use in the replacement as \1, \2, and so forth depending on its position. You can see these used in the replacement part.
If it were me, though, I'd use Perl for its ability to split a line into fields and match non-greedily:
perl -ape 's/.*?{/{ "date": "$F[1]", "time": "$F[2]", /' mylog.log
The latter replaces everything up to the first { (because .*? matches non-greedily) with the string you want. $F[1] and $F[2] are the second and third whitespace-delimited fields on the line; -a makes Perl split each line into the @F array this way.
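If neither sed nor Perl is at hand, the same transformation can be sketched with Python's re module (the sample line is the one from the question, with the payload shortened):

```python
import re

line = 'DEBUG 2015-04-18 23:13:23,374 [TEXT] (Class.java:19) - {"a":"1", "b":"2"}'

# Capture the date and time fields, then splice them into the JSON object,
# replacing everything up to (and including) the opening brace.
result = re.sub(r'^DEBUG (\S+) (\S+) .*?\{',
                r'{"date": "\1", "time": "\2", ', line, count=1)
print(result)
```

Applied line by line over the whole log, this produces the same output as the sed command above.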

Related

How can I extract specific lines from a json document?

I have a json file with thousands of lines and 90 json objects.
Each object comes with the following structure:
{
"country_codes": [
"GB"
],
"institution_id": "ins_118309",
"name": "Barclaycard (UK) - Online Banking: Personal", // I want to extract this line only
"oauth": true,
"products": [
"assets",
"auth",
"balance",
"transactions",
"identity",
"standing_orders"
],
"routing_numbers": []
},
For the ninety objects, I would like to delete all the lines and keep only the one with the name of the institution.
I guess that I will have to use a regex here?
I'm happy to use vim, sublime, vscode or any other code editor that will allow me to do so.
How can I extract these lines so that I end up with the following 90 lines?
"name": "Barclaycard (UK) - Online Banking: Personal",
"name": "Metro Bank - Commercial and Business Online Plus",
...
...
"name": "HSBC (UK) - Business",
If you must use a code editor, then in Vim you can delete all lines not
matching a pattern with: :v/^\s*"name":/d
The above pattern says:
^ line begins with
\s* zero or more whitespace characters
"name": (pretty self-explanatory)
Although it's better to use a dedicated tool for parsing json files
rather than regex as json is not a 'regular
language'.
Bonus
If you do end up doing it in Vim, you can finish up by left-aligning all the lines with :%left, or even just :%le.
That doesn't sound like the job for a text editor or even for regular expressions. How about using the right tool for the job™?
# print only the desired fields to stdout
$ jq '.[] | .name' < in.json
# write only the desired fields to file
$ jq '.[] | .name' < in.json > out.json
See https://stedolan.github.io/jq/.
If you really want to do it from a text editor, the simplest is still to filter the current buffer through a specialized external tool. In Vim, it would look like this:
:%!jq '.[] | .name'
See :help filter.
FWIW, here it is with another right tool for the job™:
:%!jj \\#.name -l
See https://github.com/tidwall/jj.
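If neither jq nor jj is available, the same selection can be sketched with Python's json module. The two-object array below is a hypothetical sample shaped like the institutions file (ins_999999 is made up):

```python
import json

doc = '''[
  {"institution_id": "ins_118309", "name": "Barclaycard (UK) - Online Banking: Personal"},
  {"institution_id": "ins_999999", "name": "HSBC (UK) - Business"}
]'''

# The same selection jq's '.[] | .name' performs, via a real JSON parser.
names = [obj["name"] for obj in json.loads(doc)]
for name in names:
    print(name)
```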
You can also simply use grep:
grep '^\s*"name":' your_file.json
In VSC
select institution_id
execute Selection > Select All Occurrences
Arrow Left
Ctrl+X
Esc
Ctrl+V
In vscode (although I would think it is the same for any regex-handling editor), use this Find:
^(?!\s*"name":.*).*\n?|^\s*
and replace with nothing. See regex101 demo.
^(?!\s*"name":.*).*\n? : matches all lines that do not begin with "name":..., including the newline, so those lines are completely discarded.
^\s* matches the whitespace before "name":..., also discarded, since we are replacing all matches with nothing.
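You can verify the same two-branch pattern outside the editor, for example with Python's re module (the text below is a hypothetical single object from the file; re.MULTILINE makes ^ anchor at every line start, as it does in the editor):

```python
import re

text = '''{
  "institution_id": "ins_118309",
  "name": "Barclaycard (UK) - Online Banking: Personal",
  "oauth": true
},'''

# First branch drops every line that does not begin with "name":
# (newline included); second branch strips the indentation that is left.
pattern = re.compile(r'^(?!\s*"name":.*).*\n?|^\s*', re.MULTILINE)
print(pattern.sub('', text))
```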
Parsing JSON in Vim natively:
call getline(1, '$')
\ ->join("\n")
\ ->json_decode()
\ ->map({_, v -> printf('"name": "%s",', v.name)})
\ ->append('$')
NB. Line continuation is only available when sourcing script from file. If run interactively then type command on a single line.

Break JSON in pager "less"

I have been using the pager less for 20 years.
Times change, and I now often look at files containing json.
A json dict that is all on one line is not easy for me to read.
Is there a way to break the json into key-value pairs when looking at a log file?
Example:
How to display a line in a log file which looks like this:
{"timestamp": "2019-05-13 14:40:51", "name": "foo.views.error", "log_intent": "line1\nline2" ...}
roughly like this:
"timestamp": "2019-05-13 14:40:51"
"name": "foo.views.error"
"log_intent": "line1
line2"
....
I am not married to the pager less; if there is a better tool, please leave a comment.
Since your log file seems to consist of one json document per line, you can use jq to pretty-print the logfile before piping it to less:
jq -s . file.log | less
With colors:
jq -Cs . file.log | less -r
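What jq -s does here can also be sketched in Python, if you want to see the mechanics (the two-line log below is hypothetical, shaped like the example in the question):

```python
import json

log = ('{"timestamp": "2019-05-13 14:40:51", "name": "foo.views.error", "log_intent": "line1\\nline2"}\n'
       '{"timestamp": "2019-05-13 14:40:52", "name": "foo.views.ok"}\n')

# -s ("slurp") collects each line's document into one array,
# which is then pretty-printed.
records = [json.loads(line) for line in log.splitlines()]
print(json.dumps(records, indent=2))
```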

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: the current PM2.5 and PM10 levels (the first numbers in the two v arrays above, 59 and 15).
CITY="beijing"
AQIDATA=$(wget -q http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm10","v":[ (or its pm25 counterpart) or a comma and some digits.
For pm25 this works, because it is the first, and there are no occurrences of ,21 or something similar before it.
However, for pm10, there are some that are associated with pm25 ahead of it, so the second field contains the empty string between ,21 and ,112.
@karakfa has a hack that seems to work, but he doesn't explain very well why it works.
What he does is use awk's record separator (which is usually a newline) and sets it to either of :, ,, or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and succeeded by a comma, also a separator.
Once it hits the matching content ("pm25"), it sets a counter to 4. Then, for this and the following records, it counts that counter down: "pm25" itself, "v", the empty string between : and [, until it finally reaches one at the record with the number you want to output. 4 && !3 is false, 3 && !2 is false, 2 && !1 is false, but 1 && !0 is true. Since there is no action block, awk simply prints this record, which is the value you want.
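That counter logic can be mirrored in Python to watch it work; the data string below is a hypothetical, abbreviated version of the embedded page data:

```python
import re

data = '"iaqi":[{"p":"pm25","v":[59,21,112],"i":"..."},{"p":"pm10","v":[15,5,69],"i":"..."}]'

# awk's RS='[:,[]' in Python terms: records are the chunks between
# any ':', ',' or '[' separator.
records = re.split(r'[:,\[]', data)

# The countdown amounts to: the value sits three records after the
# matching label ("v", then the empty record before '[', then the number).
values = [records[i + 3] for i, rec in enumerate(records)
          if rec in ('"pm25"', '"pm10"')]
print(values)
```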
A more robust approach would probably be to use XPath to find the script, then use a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
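The same two-capture idea carries over to Python's re module if sed isn't convenient; the string below is a hypothetical, heavily shortened stand-in for $AQIDATA:

```python
import re

aqidata = '"iaqi":[{"p":"pm25","v":[59,21,112]},{"p":"pm10","v":[15,5,69]}]'

# As in the sed -E command: anchor on the label, skip to the next '['
# ([^[]+), and capture the first run of digits after it.
pm25 = re.search(r'"pm25"[^[]+\[(\d+)', aqidata).group(1)
pm10 = re.search(r'"pm10"[^[]+\[(\d+)', aqidata).group(1)
print(pm25, pm10)
```

Because each label is matched independently, this variant, like the two separate sed commands, does not rely on pm25 preceding pm10 in the source.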
awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

Finding a string between two strings in a file

This is a bit of a .json file I need to find information in:
"title":
"Spring bank holiday","date":"2012-06-04","notes":"Substitute day","bunting":true},
{"title":"Queen\u2019s Diamond Jubilee","date":"2012-06-05","notes":"Extra bank holiday","bunting":true},
{"title":"Summer bank holiday","date":"2012-08-27","notes":"","bunting":true},
{"title":"Christmas Day","date":"2012-12-25","notes":"","bunting":true},
{"title":"Boxing Day","date":"2012-12-26","notes":"","bunting":true},
{"title":"New Year\u2019s Day","date":"2013-01-01","notes":"","bunting":true},
{"title":"Good Friday","date":"2013-03-29","notes":"","bunting":false},
{"title":"
The file is much longer, but it is one long line of text.
I would like to display what bank holiday it is after a certain date, and also if it involves bunting.
I've tried grep and sed but I can't figure it out.
I'd like something like this:
[command] between [date] and [}] display [title] and [bunting]/[no bunting]
[title] should be just "Christmas Day" or something else
Forgot to mention:
I would like to achieve this in bash shell, either from the prompt or from a short bit of code.
You should use a proper JSON parser in a decent programming language, then you can do a lot of work in a safe way without too much code. How about this little Python code:
#!/usr/bin/env python3
import json

with open('my.json') as jsonFile:
    holidays = json.load(jsonFile)

for holiday in holidays:
    if holiday['date'] > '2012-05-06':
        print(holiday['date'], ':', holiday['title'],
              "bunting" if holiday['bunting'] else "no bunting")
        break  # in case you only want one line of output
I could not figure out what exactly the output should be; if you can be more specific, I can adjust my example.
You can try this with awk:
awk -F"}," '{for(i=1;i<=NF;i++){print $i}}' file.json | awk -F"\"[:,]\"?" '$4>"2013-01-01"{printf "%s:%s:%s\n" ,$2,$4,$8}'
Seeing that the json file is one long string, we first split this line into multiple json records on },. Then each individual record is split on a combination of ":, characters with an optional closing ". We then only output the record if it's after a certain date.
This will find all records after Jan 1, 2013.
EDIT:
The 2nd awk splits each individual json record into key-value pairs, using as delimiter a sub-string starting with ", followed by either a : or a ,, and an optional closing ".
So in your example it will split on ",", on ":", or on ": (without a closing quote).
All odd fields are keys, and all even fields are values (hence $4 being the date in your example). We then check whether $4 (the date) is after 2013-01-01.
I noticed I made a mistake with the optional " (it should be followed by ? instead of *) in the split, which I have now corrected; I also used the printf function to display the values.
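The two awk passes can be mirrored in Python to make the field numbering visible; the two records below are hypothetical entries shaped like the bank-holiday file:

```python
import re

data = ('{"title":"Christmas Day","date":"2012-12-25","notes":"","bunting":true},'
        '{"title":"Good Friday","date":"2013-03-29","notes":"","bunting":false}')

out = []
# Pass 1: split the single long line into records on '},'.
for rec in data.split('},'):
    # Pass 2: split each record on ":  ,"  ":"  ","  so that (in awk's
    # 1-based numbering) $2 is the title, $4 the date, $8 the bunting flag.
    fields = re.split(r'"[:,]"?', rec.rstrip('}'))
    if fields[3] > '2013-01-01':
        out.append(f'{fields[1]}:{fields[3]}:{fields[7]}')
print('\n'.join(out))
```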

Extract dates from a specific json format with sed

I have a json file including the sample lines of code below:
[{"tarih":"20130824","tarihView":"24-08-2013"},{"tarih":"20130817","tarihView":"17-08-2013"},{"tarih":"20130810","tarihView":"10-08-2013"},{"tarih":"20130803","tarihView":"03-08-2013"},{"tarih":"20130727","tarihView":"27-07-2013"},{"tarih":"20130720","tarihView":"20-07-2013"},{"tarih":"20130713","tarihView":"13-07-2013"},{"tarih":"20130706","tarihView":"06-07-2013"}]
I need to extract all the dates, in yyyymmdd format, into a text file with proper line endings:
20130824
20130817
20130810
20130803
...
20130706
How can I do this by using sed or similar console utility?
Many thanks for your help.
This line works for your example:
grep -Po '\d{8}' file
or with BRE:
grep -o '[0-9]\{8\}' file
it outputs:
20130824
20130817
20130810
20130803
20130727
20130720
20130713
20130706
If you want to extract only the string after "tarih":", you could use:
grep -Po '"tarih":"\K\d{8}' file
it gives same output.
Note that regex won't do date string validation.
This is VERY easy in python:
#!/bin/bash
python3 -c "vals = $(cat jsonfile)
for curVal in vals: print(curVal['tarih'])"
If I paste your example to jsonfile I get this output
20130824
20130817
20130810
20130803
20130727
20130720
20130713
20130706
Which is exactly what you need, right?
This works because in python [] is a list and {} is a dictionary, so it is very easy to get any data out of that structure. This method is quite safe as well, because it won't fail if some field in your data contains {, ", or any other character that sed would probably look for. It also does not depend on the field position or the number of fields.
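If you'd rather not splice the file into the script as a Python literal, the json module does the same extraction more safely; the array below is the question's sample cut down to two entries:

```python
import json

doc = ('[{"tarih":"20130824","tarihView":"24-08-2013"},'
       '{"tarih":"20130817","tarihView":"17-08-2013"}]')

# json.loads also copes with true/false/null and escape sequences,
# which a pasted-in Python literal would not always survive.
dates = [item["tarih"] for item in json.loads(doc)]
print('\n'.join(dates))
```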