How to group grep results by seconds - language-agnostic

I have the following in a log file,
01:31:01,222 Received event
01:31:01,435 Received event
01:31:01,441 Received event
01:31:01,587 Received event
01:31:02,110 Received event
01:31:02,650 Received event
01:31:02,869 Received event
01:31:03,034 Received event
01:31:03,222 Received event
I would like to group this by seconds and count the number of lines in each group to output the following,
01:31:01 4
01:31:02 3
01:31:03 2
Ideally I'd like to do this in a simple awk script without having to resort to Perl or Python. Any ideas? Thanks.

Sounds like a job for awk:
awk -F, '{a[$1]++}END{for(i in a){print i, a[i]}}' file.txt
Output:
01:31:01 4
01:31:02 3
01:31:03 2
Explanation:
I'm using the -F option (field separator) and setting it to ,. This makes it easy to obtain the time, at seconds precision, as field 1 ($1).
Explanation of the script itself (in a multiline form):
# Runs on every line and increments a count tied to the first field (the time).
# (The associative array a is created on first access.)
{ a[$1]++ }

# Runs after all lines have been processed. Iterates through the array 'a' and prints
# each key (time) and its associated value (count).
END {
    for (i in a) {
        print i, a[i]
    }
}

If you don't care about the output order, you can just do:
cut -d, -f1 file|uniq -c
(with a |sort before |uniq if the data isn't always sorted initially).
Produces:
4 01:31:01
3 01:31:02
2 01:31:03
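If you want the timestamp first and the count second, as in the requested output, a final awk stage can swap the columns (a small sketch using only the tools already shown; the sort is harmless if the data is already ordered):
cut -d, -f1 file | sort | uniq -c | awk '{print $2, $1}'
01:31:01 4
01:31:02 3
01:31:03 2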

Related

Trying to get all lines from a CSV file where column 8 equals a value using awk, but it prints all lines and matching lines twice

I have a CSV file and I want to get a subset of the lines where column 8 is equal to a specific value. I'm doing this:
awk 'FS=","; $8 == 0' infile.csv
This command prints every line in the file at least once, and the lines where the 8th column is 0 are printed twice. Why is it doing this? How do I get it to print only the matching lines, once each?
You have two pattern-action statements, FS="," and $8 == 0, each with the implied action {print} when its pattern is true. The first is an assignment whose value (the string ",") is never false, so it prints every record. The second is only true when the condition is satisfied; that's why you see those records printed twice.
If you don't want the assignment to be treated as a pattern, wrap it in curly braces:
$ awk '{FS=","} $8==0{print}'
However, setting FS again and again for each record is unnecessary:
$ awk 'BEGIN{FS=","} $8==0'
will do the same. However, the easiest option is to use -F:
$ awk -F, '$8==0'
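To see why the bare assignment prints every line: a pattern with no action gets the default action {print}, and the value of the assignment FS="," is the string ",", which is never false. A minimal illustration on a hypothetical two-line input:
printf 'x\ny\n' | awk 'FS=","'
x
y
Both lines are printed even though no action (and no real condition) was written.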

Extract numeric values from awk return

For a supervision system, I need to return 2 values about latency to my supervision server through NRPE.
Here are the values I'm working on (I put this in a file: test.txt):
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"project_site":"AUB"},"value":[1575277537.052,"0.3889104875437488"]},{"metric":{"project_site":"VDR"},"value":[1575277537.052,"0.2267407994117705"]}]}}
I need to extract 0.3889104875437488 and 0.2267407994117705
I'm using this :
for i in $(""cat test.txt | awk -F ',' '{print $5 $NF}' | grep -o '[0.0001-9999.9]\+'""); do echo $i; done
I'm not sure that's the best method, especially since I have to prepend "AUB" to row 1 and "VDR" to row 2, like:
AUB : 0.3889104875437488 seconds
VDR : 0.2267407994117705 seconds
Use jq for parsing JSON, for example:
$ jq -r '.data.result[] | "\(.metric.project_site) : \(.value[1]) seconds"' file
AUB : 0.3889104875437488 seconds
VDR : 0.2267407994117705 seconds
I have upvoted the answer by @oguzismail, and will repeat their suggestion to use jq instead if at all feasible.
If your input is not valid JSON (or jq is simply not available), there are several things wrong with your approach, several of them related more to efficiency and common practice than to outright correctness.
Your regex is wrong. See below.
Avoid the useless cat.
If you are using Awk anyway, you don't need grep. See useless use of grep.
Quote your variable.
In this case, you want to remove the useless echo entirely. Capturing standard output just so you can echo it back to standard output is simply a waste of processes (unless you specifically wanted to break the quoting, as a special case of the previous item; but that is not the case here).
It is unclear what you hope the empty string "" will accomplish. After the shell is done with quote removal, ""cat is simply cat.
In some more detail, [0.0001-9999.9] is a bracket expression that matches a single character which is 0, or ., or 0 (oh, we mentioned that already, didn't we?), or 0 (ditto), or anything between 1 and 9, or 9 (etc. etc.); in other words, it just matches a digit or a dot. In short, grep is not at all the right tool for searching for number ranges; fortunately, Awk can do that easily too.
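To make that concrete, here is a quick demonstration (GNU grep, since \+ is a GNU extension) showing that the pattern happily matches numbers far outside the intended range:
$ echo '123456.789' | grep -o '[0.0001-9999.9]\+'
123456.789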
Here, then, is an attempt to refactor to remove these problems.
awk -F ',' '{ split("5:" NF, a, ":"); split("AUB:VDR", l, ":")
    for (i=1; i<=2; i++) {
        n = $a[i]; gsub(/[]}"]+/, "", n)
        if (n >= 0.0001 && n <= 9999.9)
            print l[i] ": " n " seconds"
    }
}' test.txt
This is extremely brittle because it hard-codes the locations of the strings within the surface structure of the (not?) JSON data, which could change without warning.
The split is a hack to get the numbers 5 and NF into an array a. We create a second array with the same length for the corresponding labels. We then loop over the first array and use the numbers as indices into the current record's fields. We trim off any quoting and brackets, and then perform the numeric comparison on the thus extracted field. At the end, we add the corresponding label from the other array in front of the printed text.

Can't understand this bash script: $ awk -F

Can someone help me understand what this script does?
$ awk -F',' '{ data[$4]+=$29;}END{c=0; for (i in data) { if (data[i]+0<1000000) {c++}} ;print c}' file.csv
Thanks.
This script iterates line by line over the input file file.csv. The file is apparently comma-separated, so the field separator is set accordingly with -F','. For every line, the value of the 29th field ($29) is added to an entry in the data array keyed by the content of the 4th field ($4). After all lines have been processed, the END block is invoked: it iterates over the data array, checks for each key whether the accumulated value is below 1,000,000 (the +0 forces a numeric comparison), and finally prints c, the number of keys for which that comparison was true.
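The same script spread over multiple lines, with comments (a sketch; nothing is changed except the layout):
awk -F',' '
# For every line: add the 29th field to a running total keyed by the 4th field
{ data[$4] += $29 }

# After the last line: count how many keys have a total below 1,000,000
END {
    c = 0
    for (i in data) {
        if (data[i] + 0 < 1000000) {
            c++
        }
    }
    print c
}' file.csv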

Get difference between two csv files based on column using bash

I have two CSV files, a.csv and b.csv; both come with no headers, and the values in each row are separated by \t.
a.csv:
1 apple
2 banana
3 orange
4 pear
b.csv:
apple 0.89
banana 0.57
cherry 0.34
I want to take the difference between the second column of a.csv and the first column of b.csv, something like a.csv[1] - b.csv[0], which would give me another file c.csv that looks like:
orange
pear
Instead of using Python or other programming languages, I want to use a bash command to complete this task, and I found that awk would be helpful, but I'm not sure how to write the correct command. Here is another similar question, but the second answer there uses awk '{print $2,$6-$13}' to get the difference between values rather than the set difference I need.
Thanks, and I appreciate any help.
You can easily do this with Steve's answer from the link you are referring to, with a bit of tweaking. I'm not sure the other answer using paste will solve this problem.
Create a hash-map from b.csv and compare it against the 2nd column of a.csv:
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t, since you are reading a tab-delimited file.
FNR == NR { action } { action } f1 f2 is a general construct you find in many awk commands whenever you need to act on more than one file. The block right after FNR == NR executes while the first file argument is being read, and the following block within {..} runs for the second file argument.
The part unique[$1]; next creates a hash-map unique keyed by the values of the first column of b.csv; it runs for every line of that file, and next skips the rest of the script for those lines.
Once b.csv is completely processed, on the next file a.csv we use !($2 in unique), which selects those lines whose second column is not a key in the unique hash-map built from the first file.
For those lines, only the second column is printed: { print $2 }.
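Putting it together on the samples (as noted above, appending > c.csv writes the result to a file; this assumes the data from the question is saved tab-separated as a.csv and b.csv):
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv > c.csv
cat c.csv
orange
pear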
Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
It gives you
orange
pear
It's a join that does not emit paired records (--np) and emits unpaired records from the left file (--ul).

Can awk deal with CSV file that contains comma inside a quoted field?

I am using awk to compute the sum of one column in a CSV file. The data format is something like:
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
I was using this awk script to compute the sum:
awk -F, '{sum+=$3} END {print sum}'
Some of the values in the name field contain commas, and this breaks my awk script.
My question is: can awk solve this problem? If yes, how can I do that?
Thank you.
One way, using GNU awk and FPAT:
awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt
Result:
192
I am using
FPAT="([^,]+)|(\"[^\"]+\")"
to define the fields with gawk. I found that when a field is null, this doesn't recognize the correct number of fields, because "+" requires at least one character in the field.
I changed it to:
FPAT="([^,]*)|(\"[^\"]*\")"
replacing "+" with "*", and now it works correctly.
I also found that the GNU Awk User's Guide has this same problem:
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
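A quick way to see the difference on a line with an empty field (treat this as a sketch to run against your own gawk, since the handling of empty matches has varied between gawk releases):
echo 'a,,c' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF }'   # the "+" version drops the empty field (prints 2)
echo 'a,,c' | gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" } { print NF }'   # the "*" version keeps it (prints 3)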
You're probably better off doing it in perl with Text::CSV, since that's a fast and robust solution.
You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.
Here is the command:
csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'
see https://github.com/dbro/csvquote for the code
For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:
$ cat decsv.awk
BEGIN { fs=FS; FS=RS }
{
    decsv()
    for (i=1; i<=NF; i++) {
        printf "Record %d, Field %d is <%s>\n", NR, i, $i
    }
    print ""
}
function decsv( curr,head,tail)
{
    tail = $0
    while ( match(tail, /"[^"]+"/) ) {
        head = substr(tail, 1, RSTART-1)
        gsub(fs, RS, head)
        curr = curr head substr(tail, RSTART, RLENGTH)
        tail = substr(tail, RSTART + RLENGTH)
    }
    gsub(fs, RS, tail)
    $0 = curr tail
}
$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>
Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>
Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>
Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>
It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes and even then it's not too hard and it's all been done before...
See What's the most robust way to efficiently parse CSV using awk? for more information.
You can always tackle the problem from the source. Put quotes around the name field, just like the field of "I am the, question". This is much easier than spending your time coding workarounds for that.
Update (as Dennis requested): a simple example.
$ s='id, "name1,name2", value
1, foo, 17
2, bar, 76
3, "I am the, question", 99'
$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99
$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question
As you can see, by setting the delimiter to the double quote, the fields that belong inside quotes always end up at even-numbered positions. Since the OP doesn't have the luxury of modifying the source data, though, this method will not be appropriate for them.
This article did help me solve the same data-field issue. Most CSVs put quotes around fields that contain spaces or commas, which messes up the field count for awk unless you filter them out.
If you need the data within the fields that contain the garbage, this is not for you. ghostdog74 provided the answer, which empties that field but maintains the total field count in the end, which is key to keeping the data output consistent. I did not like how that solution introduced new lines, so this is the version of it I used. The first three fields never had this problem in the data. The fourth field, containing the customer name, often did, but I needed that data. The remaining fields that exhibit the problem I could throw away without issue, because they were not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74 gave to empty the remaining fields that have commas within them - this also removes the quotes, but I use printf to maintain the data in a single record. I start off with 85 fields and end up with 85 fields in all cases from my 8000+ lines of messy data. A perfect score!
grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile
The solution that empties the fields with commas within them but also maintains the record is, of course:
awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}'
Megs of thanks to ghostdog74 for the great solution!
FPAT is the elegant solution because it can handle the dreaded commas-within-quotes problem, but to sum the numbers in the last column regardless of the number of preceding separators, $NF works well:
awk -F"," '{sum+=$NF} END {print sum}'
To access the second to last column, you would use this:
awk -F"," '{sum+=$(NF-1)} END {print sum}'
If you know for sure that the 'value' column is always the last column:
awk -F, '{sum+=$NF} END {print sum}'
NF represents the number of fields, so $NF is the last column
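A one-line check of what NF and $NF refer to:
echo 'a,b,c' | awk -F, '{print NF, $NF, $(NF-1)}'
3 c b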
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){@f=$csv->fields();$sum+=$f[2]} END{print $sum}' file
allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk
You could try piping the file through a Perl regex to convert the quoted , into something else, like a |.
cat test.csv | perl -p -e "s/(\".+?)(,)(.+?\")/\1\|\3/g" | awk -F, '{...
The above regex assumes there is always a comma within the double quotes, so more work would be needed to make the comma optional.
You can write a function in awk like the one below:
$ awk 'func isnum(x){return(x==x+0)}BEGIN{print isnum("hello"),isnum("-42")}'
0 1
You can incorporate this function into your script and check whether the third field is numeric or not. If it isn't, go to the 4th field, and if that in turn isn't numeric, go to the 5th, and so on until you reach a numeric value; a loop will help here. Then add that value to the sum.
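Here is a sketch of that loop, assuming the numeric value is unquoted and appears in or after the third field (the isnum function is the one shown above):
awk -F, 'function isnum(x) { return (x == x+0) }
{
    # scan from field 3 onward for the first numeric-looking field and add it to the sum
    for (i = 3; i <= NF; i++) {
        if (isnum($i)) { sum += $i; break }
    }
}
END { print sum }' file.csv
On the sample data from the question this prints 192 (17 + 76 + 99), matching the FPAT answer above.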
