Miller - Ignore valid field names when using -N - csv

I'm using miller to process some CSV files like so:
mlr --mmap --csv --skip-comments -N cut -f 2 my.csv
It works well, but some of the CSV files contain field names and some do not, which is why I'm using -N. In the files that have field names, the names still get printed in the output. You would think that, since --headerless-csv-output is bundled into the -N flag, they wouldn't be, but they are. Maybe it's a bug? Anyway, how do I prevent the field names from being printed? If the input needs to be altered somehow and piped in, that's fine, but the output is already being processed in its own way.
Here's the documentation I've been referencing:
https://manpages.ubuntu.com/manpages/focal/man1/mlr.1.html#options
https://miller.readthedocs.io/en/latest/reference.html
my.csv
################################################################
#                                                              #
#                                                              #
#                    BIG OL' COMMENT BLOCK                     #
#                                                              #
#                                                              #
################################################################
#
"first_seen_utc","dst_ip","dst_port","c2_status","last_online"
"2021-01-17 07:30:05","67.213.75.205","443","online","2021-06-24"
"2021-01-17 07:44:46","192.73.238.101","443","online","2021-06-24"
Expected output
67.213.75.205
192.73.238.101
Actual output
dst_ip
67.213.75.205
192.73.238.101

If your first field always starts with a date, you can filter on it:
mlr --csv --skip-comments -N filter -S '$1=~"^[0-9]{4}-"' then cut -f 2 input.txt
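With the sample my.csv above, the header line does not start with a four-digit year, so the filter drops it and only the data rows remain:
67.213.75.205
192.73.238.101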

If you use -N on a CSV that has a header, Miller adds an automatic numeric header (1, 2, 3, ...) and the original header line becomes a data row. -N bundles --implicit-csv-header together with --headerless-csv-output.
+---------------------+----------------+----------+-----------+-------------+
| 1                   | 2              | 3        | 4         | 5           |
+---------------------+----------------+----------+-----------+-------------+
| first_seen_utc      | dst_ip         | dst_port | c2_status | last_online |
| 2021-01-17 07:30:05 | 67.213.75.205  | 443      | online    | 2021-06-24  |
| 2021-01-17 07:44:46 | 192.73.238.101 | 443      | online    | 2021-06-24  |
+---------------------+----------------+----------+-----------+-------------+
If you want headerless output for a file that already has a header, use --headerless-csv-output on its own. If you run
mlr --csv --skip-comments --headerless-csv-output cut -f dst_ip input.txt
you will have
67.213.75.205
192.73.238.101
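If some of your inputs have a header and some do not, one workaround (a sketch, assuming the header line always begins with "first_seen_utc" as in the sample) is to strip it before piping into mlr, so every file can be treated as headerless:
grep -v '^"first_seen_utc"' my.csv | mlr --csv --skip-comments -N cut -f 2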

Related

shell: treatment of the multi-line format according to its column patterns

I am dealing with a multi-column CSV-like file and am looking for a Bash shell workflow to process it. Here is the format of the file:
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1000.dlg: 6 | -4.86 | 2 | -4.79 | 4 |####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1001.dlg: 2 | -5.25 | 10 | -5.22 | 8 |########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1002.dlg: 5 | -5.76 | 6 | -5.48 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1003.dlg: 4 | -3.88 | 17 | -3.50 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1009.dlg: 5 | -4.51 | 5 | -4.39 | 4 |####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_100.dlg: 3 | -4.40 | 11 | -4.38 | 9 |#########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_1010.dlg: 1 | -5.07 | 15 | -4.51 | 5 |#####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_150.dlg: 4 | -5.01 | 5 | -4.82 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_156.dlg: 2 | -5.38 | 11 | -4.70 | 3 |###
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_157.dlg: 1 | -4.22 | 10 | -4.16 | 7 |#######
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_167.dlg: 2 | -3.85 | 3 | -3.69 | 9 |#########
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_168.dlg: 2 | -4.42 | 12 | -4.01 | 6 |######
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_169.dlg: 2 | -4.94 | 17 | -4.80 | 5 |#####
/scratch_p/johnycash/results_test_docking/7000/7000_01_lig_cne_16.dlg: 1 | -6.23 | 4 | -5.77 | 4 |###
According to the format, all the columns with useful information are separated by |, with the exception of the first column (the name of the line), which is separated from the rest by :. The script should perform the following post-processing:
Sort all lines according to the value in the third column (from the most negative to the most positive values);
Apply a filter to the last column (according to the number of #), discarding all of the lines containing only #, ## or ###. Alternatively, this filter can be applied to the penultimate column, which expresses the number of # characters as a number.
While I can do the first task using sort
sort -t '|' -k 3 filename.csv
and the second may be achieved using AWK
awk '(NR>1) && ($8 > 2) ' filename.csv > filename_processed.txt
how could I combine both commands in an efficient fashion, taking into account the format of my file?
Could you please try the following, written and tested on the shown samples with GNU awk.
awk '
BEGIN{
  FS=OFS="|"                 ## Fields are separated by "|".
}
gsub(/#/,"&",$6)>3           ## gsub returns the number of "#" in the last field;
                             ## keep lines with 4 or more of them.
' Input_file | sort -t'|' -nk 3 > output_file
EDIT: As per OP's comment, to get the first 10% of lines from the start of the output, save the above command's output into a file (output_file) and run the following.
awk -v lines="$(wc -l < output_file)" '
BEGIN{
  tenPer=int(lines/10)       ## 10% of the total line count.
}
FNR>(tenPer){exit}           ## Stop once the first 10% of lines have been printed.
1
' output_file
For getting the last 10% of lines of output_file, try:
tac output_file |
awk -v lines="$(wc -l < output_file)" 'BEGIN{tenPer=int(lines/10)} FNR>tenPer{exit} 1' |
tac
OR
awk -v lines="$(wc -l < output_file)" 'BEGIN{tenPer=int(lines/10)} FNR>=(lines-tenPer)' output_file
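If GNU awk is not a requirement, a simpler sketch with head/tail gives the same first/last 10% (integer division, as above):
lines=$(wc -l < output_file)
head -n "$(( lines / 10 ))" output_file    # first 10%
tail -n "$(( lines / 10 ))" output_file    # last 10%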
You can try:
sort -nr -k 4 scratch.csv | grep -v -E "[^#]#{1,3}$"
Sort based on the column value and eject the lines with 1-3 # characters.
It is better to sort at the end, when there are fewer lines:
grep -E "#{4}$" file | sort -t"|" -nk3
If you need to filter for a different number of # characters, modify the number in the grep expression. If you need reversed sorting, add the r parameter to the sort command. If you need sorting by a different column, modify the k argument.
If your commands are really all you need, then trivially:
awk '(NR>1) && ($8 > 2) ' filename.csv |
sort -t '|' -k 3 > filename_processed.txt

CSV output file using command line for wireshark IO graph statistics

I save the IO graph statistics as a CSV file containing the bits per second using the Wireshark GUI. Is there a way to generate this CSV file with command-line tshark? I can generate the statistics on the command line as bytes per second as follows:
tshark -nr test.pcap -q -z io,stat,1,BYTES
How do I generate bits/second and save it to a CSV file?
Any help is appreciated.
I don't know a way to do that using only tshark, but you can easily parse the output from tshark into a CSV file:
tshark -nr tmp.pcap -q -z io,stat,1,BYTES | grep -P "\d+\s+<>\s+\d+\s*\|\s+\d+" | awk -F '[ |]+' '{print $2","($5*8)}'
Explanations
grep -P "\d+\s+<>\s+\d+\s*\|\s+\d+" selects only the rows from the tshark output with the actual data (i.e., second <> second | transmitted bytes).
awk -F '[ |]+' '{print $2","($5*8)}' splits that data into fields with [ |]+ as the separator and displays field 2 (the second at which the interval starts) and field 5 (the transmitted bytes, multiplied by 8 to convert to bits), with a comma between them.
Another thing that may be good to know:
If you change the interval from 1 second to 0.5 seconds, then you have to allow a . in the grep pattern by adding \. between the \d digit groups.
Otherwise the result will be an empty *.csv file.
grep -P "\d{1,2}\.{1}\d{1,2}\s+<>\s+\d{1,2}\.{1}\d{1,2}\s*\|\s+\d+"
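For example, the full pipeline for a 0.5-second interval would be (a sketch along the same lines, with the pattern generalized to \d+\.\d+ for the fractional timestamps):
tshark -nr tmp.pcap -q -z io,stat,0.5,BYTES | grep -P "\d+\.\d+\s+<>\s+\d+\.\d+\s*\|\s+\d+" | awk -F '[ |]+' '{print $2","($5*8)}'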
The answers in this thread gave me the keys to solving a similar problem with tshark io stats and I wanted to share the results and how it works. In my case, the task was to convert multiple columns of tshark io stat records with potential decimals in the data. This answer converts multiple data columns to csv, adds rudimentary headers, accounts for decimals in fields and variable numbers of spaces.
Complete command string
tshark -r capture.pcapng -q -z io,stat,30,,FRAMES,BYTES,"FRAMES()ip.src == 10.10.10.10","BYTES()ip.src == 10.10.10.10","FRAMES()ip.dst == 10.10.10.10","BYTES()ip.dst == 10.10.10.10" \
| grep -P "\d+\.?\d*\s+<>\s+|Interval +\|" \
| tr -d " " | tr "|" "," | sed -E 's/<>/,/; s/(^,|,$)//g; s/Interval/Start,Stop/g' > somefile.csv
Explanation
The command string has 3 major parts.
tshark creates the report with the data in columns
Extract the desired lines with grep
Use tr and sed to convert the records grep matched into a csv delimited file.
Part 1: tshark creates the report with the data in columns
tshark is run with -z io,stat at a 30 second interval, counting frames and bytes with various filters.
tshark -r capture.pcapng -q -z io,stat,30,,FRAMES,BYTES,"FRAMES()ip.src == 10.10.10.10","BYTES()ip.src == 10.10.10.10","FRAMES()ip.dst == 10.10.10.10","BYTES()ip.dst == 10.10.10.10"
Here is the output when run against my test pcap file:
=================================================================================================
| IO Statistics |
| |
| Duration: 179.179180 secs |
| Interval: 30 secs |
| |
| Col 1: Frames and bytes |
| 2: FRAMES |
| 3: BYTES |
| 4: FRAMES()ip.src == 10.10.10.10 |
| 5: BYTES()ip.src == 10.10.10.10 |
| 6: FRAMES()ip.dst == 10.10.10.10 |
| 7: BYTES()ip.dst == 10.10.10.10 |
|-----------------------------------------------------------------------------------------------|
| |1 |2 |3 |4 |5 |6 |7 |
| Interval | Frames | Bytes | FRAMES | BYTES | FRAMES | BYTES | FRAMES | BYTES |
|-----------------------------------------------------------------------------------------------|
| 0 <> 30 | 107813 | 120111352 | 107813 | 120111352 | 26682 | 15294257 | 80994 | 104808983 |
| 30 <> 60 | 122437 | 124508575 | 122437 | 124508575 | 49331 | 17080888 | 73017 | 107422509 |
| 60 <> 90 | 138999 | 135488315 | 138999 | 135488315 | 54829 | 22130920 | 84029 | 113348686 |
| 90 <> 120 | 158241 | 217781653 | 158241 | 217781653 | 42103 | 15870237 | 115971 | 201901201 |
| 120 <> 150 | 111708 | 131890800 | 111708 | 131890800 | 43709 | 18800647 | 67871 | 113082296 |
| 150 <> Dur | 123736 | 142639416 | 123736 | 142639416 | 50754 | 22053280 | 72786 | 120574520 |
=================================================================================================
Considerations
Looking at this output, we can see several items to consider:
Rows with data have a unique sequence in the Interval column of "space<>space", which we can use for matching.
We want the header line, so we will use the word "Interval" followed by spaces and then a "|" character.
The number of spaces in a column is variable, depending on the number of digits per measurement.
The Interval column gives both the start and the end of each measurement interval. Either can be used, so we will keep both and let the user decide.
When using milliseconds there will be decimals in the Interval field
Depending on the statistic requested, there may be decimals in the data columns
The use of "|" as delimiters will require escaping in any regex statement that covers them.
Part 2: Extract the desired lines with grep
Once tshark produces output, we use grep with regex to extract the lines we want to save.
grep -P "\d+\.?\d*\s+<>\s+|Interval +\|"
grep will use the "Digit(s)Space(s)<>Space(s)" character sequence in the Interval column to match the lines with data. It also uses an OR to grab the header by matching the characters "Interval |".
grep -P # The "-P" flag turns on PCRE regex matching, which is not the same as egrep. With egrep, you will need to change the escaping.
"\d+ # Match on 1 or more Digits. This is the 1st set of numbers in the Interval column.
\.? # 0 or 1 Periods. We need this to handle possible fractional seconds.
\d* # 0 or more Digits. To handle possible fractional seconds.
\s+<>\s+ # 1 or more Spaces followed by the Characters "<>", then 1 or more Spaces.
| # Since this is not escaped, it is a regex OR
Interval\s+\|" # Match the String "Interval" followed by 1 or more Spaces and a literal "|".
From the tshark output, grep matched these lines:
| Interval | Frames | Bytes | FRAMES | BYTES | FRAMES | BYTES | FRAMES | BYTES |
| 0 <> 30 | 107813 | 120111352 | 107813 | 120111352 | 26682 | 15294257 | 80994 | 104808983 |
| 30 <> 60 | 122437 | 124508575 | 122437 | 124508575 | 49331 | 17080888 | 73017 | 107422509 |
| 60 <> 90 | 138999 | 135488315 | 138999 | 135488315 | 54829 | 22130920 | 84029 | 113348686 |
| 90 <> 120 | 158241 | 217781653 | 158241 | 217781653 | 42103 | 15870237 | 115971 | 201901201 |
| 120 <> 150 | 111708 | 131890800 | 111708 | 131890800 | 43709 | 18800647 | 67871 | 113082296 |
| 150 <> Dur | 123736 | 142639416 | 123736 | 142639416 | 50754 | 22053280 | 72786 | 120574520 |
Part 3: Use tr and sed to convert the records grep matched into a csv delimited file.
tr and sed are used for converting the lines grep matched into csv. tr does the bulk work of removing spaces and changing the "|" to ",". This is simpler and faster than using sed. However, sed is used for some cleanup work.
tr -d " " | tr "|" "," | sed -E 's/<>/,/; s/(^,|,$)//g; s/Interval/Start,Stop/g'
Here is how these commands perform the conversion. The first trick is to get rid of all of the spaces. This means we don't have to account for them in any regex sequences, making the rest of the work simpler.
| tr -d " " # Spaces are in the way, so delete them.
| tr "|" "," # Change all "|" Characters to ",".
| sed -E 's/<>/,/; # Change "<>" to "," splitting the Interval column.
s/(^,|,$)//g; # Delete leading and/or trailing "," on each line.
s/Interval/Start,Stop/g' # Each of the "Interval" columns needs a header, so change the text "Interval" into two words with a , separating them.
> somefile.csv # Redirect the output into somefile.csv
Final result
Once through this process, we have a csv output that can now be imported into your favorite csv tool, spreadsheet, or fed to a graphing program like gnuplot.
$ cat somefile.csv
Start,Stop,Frames,Bytes,FRAMES,BYTES,FRAMES,BYTES,FRAMES,BYTES
0,30,107813,120111352,107813,120111352,26682,15294257,80994,104808983
30,60,122437,124508575,122437,124508575,49331,17080888,73017,107422509
60,90,138999,135488315,138999,135488315,54829,22130920,84029,113348686
90,120,158241,217781653,158241,217781653,42103,15870237,115971,201901201
120,150,111708,131890800,111708,131890800,43709,18800647,67871,113082296
150,Dur,123736,142639416,123736,142639416,50754,22053280,72786,120574520
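As a minimal sketch of the gnuplot step mentioned above (it assumes somefile.csv as produced here, where column 1 is the interval start and column 4 the total Bytes):
gnuplot -e "set datafile separator ','; set key autotitle columnheader; plot 'somefile.csv' using 1:4 with lines"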

jq print character inside output

I want to print a "/" separator inside the output, under the gpu title.
curl -s http://cd0a4a.ethosdistro.com/?json=yes \
| jq -c '.rigs|."0d6b27",."50dc35"|[.version,.driver,.miner,"\(.gpus)\(.miner_instance)"]|@csv' \
| sed 's/\\//g;s/\"//g' \
| gawk 'BEGIN{print "version" "," "GPU_driver" "," "miner" "," "gpu"} {print $0}' \
| csvlook -I
The output is like this :
| version | GPU_driver | miner    | gpu |
| ------- | ---------- | -------- | --- |
| 1.2.3   | nvidia     | ethminer | 22  |
| 1.2.4   | amdgpu     | ethminer | 11  |
But I want separator in between the numbers inside gpu title like this :
| version | GPU_driver | miner    | gpu  |
| ------- | ---------- | -------- | ---- |
| 1.2.3   | nvidia     | ethminer | 2/2  |
| 1.2.4   | amdgpu     | ethminer | 1/1  |
You're doing a lot of unnecessary calls just to process the data. Your commands could be drastically simplified.
You don't need to explicitly key into the .rigs object to get their values, you could just access them using [].
You don't need the sed call to strip the quotes, just use the raw output -r.
You don't need the awk call to add the header, you could just output an additional row from jq.
So your command turns into this instead:
$ curl -s http://cd0a4a.ethosdistro.com/?json=yes \
| jq -r '["version", "GPU_driver", "miner", "gpu"],
(.rigs[] | [.version, .driver, .miner, "\(.gpus)/\(.miner_instance)"])
| @csv' \
| csvlook -I
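For reference, the intermediate CSV that jq emits here (before csvlook renders it) should look roughly like this with the sample data, since @csv quotes string fields:
"version","GPU_driver","miner","gpu"
"1.2.3","nvidia","ethminer","2/2"
"1.2.4","amdgpu","ethminer","1/1"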
Since you already use string interpolation for that specific field, simply include the character you need (slash /) inside the string, like this:
curl ... | jq -c '... [.version,.driver,.miner,"\(.gpus)/\(.miner_instance)"] ...'
In your case (the complete line):
curl -s http://cd0a4a.ethosdistro.com/?json=yes | jq -c '.rigs|."0d6b27",."50dc35"|[.version,.driver,.miner,"\(.gpus)/\(.miner_instance)"]|@csv' | sed 's/\\//g;s/\"//g' | gawk 'BEGIN{print "version" "," "GPU_driver" "," "miner" "," "gpu"} {print $0}' | csvlook -I
Here are some suggestions for simplification:
use the --raw-output option to jq to remove extraneous back-slashes
there is no need to remove the quotes, csvlook does it for you
no need for awk to add a title line, use a sub-shell
no need to specify the rig keys explicitly, use .[]
Here is an example:
(
  echo version,GPU_driver,miner,gpu
  curl -s 'http://cd0a4a.ethosdistro.com/?json=yes' |
  jq -r '
    .rigs | .[] |
    [ .version, .driver, .miner, "\(.gpus)/\(.miner_instance)" ] |
    @csv
  '
) |
csvlook
Output:
|----------+------------+----------+------|
|  version | GPU_driver | miner    | gpu  |
|----------+------------+----------+------|
|  1.2.3   | nvidia     | ethminer | 2/2  |
|  1.2.4   | amdgpu     | ethminer | 1/1  |
|----------+------------+----------+------|

Print column names only once in shell script

Here is my shell script. It is working fine without any errors, but I want the output formatted differently.
Script:
#!/bin/bash
DB_USER='root'
DB_PASSWD='123456'
DB_NAME='job'
Table_Name='status_table'
#sql=select job_name, date,status from $DB_NAME.$Table_Name where job_name='$f1' and date=CURDATE()
file="/root/jobs.txt"
while read -r f1
do
mysql -N -u$DB_USER -p$DB_PASSWD <<EOF
select job_name, date,status from $DB_NAME.$Table_Name where job_name='$f1' and date=CURDATE()
EOF
done <"$file"
Source Table:
mysql> select * from job.status_table
+---------+----------+------------+-----------+
| Job_id  | Job Name | date       | status    |
+---------+----------+------------+-----------+
| 111     | AA       | 2016-12-01 | completed |
| 112     | BB       | 2016-12-01 | completed |
| 113     | CC       | 2016-12-02 | completed |
| 112     | BB       | 2016-12-01 | completed |
| 114     | DD       | 2016-12-02 | completed |
| 201     | X        | 2016-12-03 | completed |
| 202     | y        | 2016-12-04 | completed |
| 203     | z        | 2016-12-03 | completed |
| 111     | A        | 2016-12-04 | completed |
+---------+----------+------------+-----------+
Input text file
[rteja@server0 ~]# more jobs.txt
AA
BB
CC
DD
X
Y
Z
A
ABC
XYZ
Output - suppressed column names
(mysql -N -u$DB_USER -p$DB_PASSWD <<EOF)
[rteja@server0 ~]# ./script.sh
AA 2016-12-01 completed
BB 2016-12-01 completed
Output - without suppressed column names; the column names are printed for every loop iteration.
(mysql -u$DB_USER -p$DB_PASSWD <<EOF)
[rteja@server0 ~]# ./script.sh
job_name date status
AA 2016-12-01 completed
job_name date status
BB 2016-12-01 completed
Challenges:
1. I want to print the column names only once in the output, and I want to store the result in a CSV file.
2. I don't want to expose the password and username in the code to everyone. Is there a way to hide them? I heard we can put them in an environment file and source it in the script, then set permissions on that file so that only we can access it.
Rather than executing a select query multiple times, you can run a single query using:
job_name in ('AA','BB','CC'...)
To do that, first read the complete file into an array using mapfile:
mapfile -t arr < jobs.txt
Then format the array values into a list of values suited for the IN operator:
printf -v cols "'%s'," "${arr[@]}"
cols="(${cols%,})"
Display your values:
echo "$cols"
('AA','BB','CC','DD','X','Y','Z','A','ABC','XYZ')
Finally run your SQL query as:
mysql -N -u$DB_USER -p$DB_PASSWD <<EOF
select job_name, date,status from $DB_NAME.$Table_Name
where job_name IN $cols and date=CURDATE();
EOF
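To also address the first challenge (storing the result as CSV), one rough sketch is to convert mysql's tab-separated batch output to commas; run it once without -N so the header row is emitted a single time (this assumes no field contains tabs or commas):
mysql -u$DB_USER -p$DB_PASSWD <<EOF | tr '\t' ',' > results.csv
select job_name, date, status from $DB_NAME.$Table_Name
where job_name IN $cols and date=CURDATE();
EOF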
To connect to MySQL securely, use login paths (.mylogin.cnf).
As per the MySQL manual:
The best way to specify server connection information is with your .mylogin.cnf file. Not only is this file encrypted, but any logging of the utility execution does not expose the connection information.
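A rough sketch of that approach (the login-path name jobstatus is only an example):
# Store the credentials once; ~/.mylogin.cnf is created in obfuscated form and mysql_config_editor prompts for the password.
mysql_config_editor set --login-path=jobstatus --host=localhost --user=root --password
# Then the script can connect without embedding credentials:
mysql --login-path=jobstatus -N <<EOF
select job_name, date, status from $DB_NAME.$Table_Name where job_name IN $cols and date=CURDATE();
EOF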