difference between lines in the same column using AWK - csv

I want to compare consecutive lines in a csv file and keep only the lines that meet both of the following conditions:
1. the first field is the same as the one in the previous line, and
2. the absolute difference between the values in the second column is 1.
For example, if I have these lines:
aaaa;12
aaaa;13
bbbb;11
bbbb;9
cccc;9
cccc;8
I will keep only
aaaa;12
aaaa;13
cccc;9
cccc;8

The logic would work this way:
If the previous pattern is not equal to this pattern, then remember this pattern and this value as the new "previous", and move to the next line.
Otherwise, if the difference between the previous value and this value equals 1 or -1 (awk does not have a built-in abs() function), then print the previous pattern and value and print this line.
Take a stab at translating that into code, and come back when you have questions.
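For reference, the steps above could be translated roughly as follows. This is only a sketch: keep_pairs is a hypothetical helper name, and it assumes the "previous" line is updated on every record, so in a run of three consecutive values the middle line would be printed twice.

```shell
# Sketch: keep adjacent pairs with the same key whose values differ by exactly 1.
# keep_pairs is a hypothetical helper; it reads ';'-separated lines on stdin.
keep_pairs() {
  awk -F';' '
    NR > 1 && $1 == prev_key {
      diff = $2 - prev_val
      if (diff == 1 || diff == -1) {   # no built-in abs(), so test both signs
        print prev_key FS prev_val     # the remembered previous line
        print $0                       # the current line
      }
    }
    { prev_key = $1; prev_val = $2 }   # remember this line for the next record
  '
}

printf 'aaaa;12\naaaa;13\nbbbb;11\nbbbb;9\ncccc;9\ncccc;8\n' | keep_pairs
```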

Given:
$ echo "$test"
aaaa;12
aaaa;13
bbbb;11
bbbb;9
cccc;9
cccc;8
You can do something like:
$ echo "$test" | awk -F ";" 'function abs(v) {return v < 0 ? -v : v} $1==l1 && abs($2-l2)==1 {print l1 FS l2 RS $0} {l1=$1;l2=$2}'
aaaa;12
aaaa;13
cccc;9
cccc;8

Related

CSV Column Insertion via awk

I am trying to insert a column in front of the first column in a comma separated value (CSV) file. At first blush, awk seems to be the way to go, but I'm struggling with how to fill the new column with a different value on each row.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution available in bash. For other shells use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file
With awk only you could try following solution, written and tested with shown samples.
awk -v value="$(echo -e "col\n0\nA\n2\nC")" '
BEGIN{
FS=OFS=","
num=split(value,arr,ORS)
for(i=1;i<=num;i++){
newVal[i]=arr[i]
}
}
{
$1=arr[FNR] OFS $1
}
1
' Input_file
Explanation:
First, create an awk variable named value whose value is the output of the shell's echo command. NOTE: the -e option makes echo interpret \n as a newline rather than as literal characters.
In the BEGIN section of the awk program, set FS and OFS to , for all lines of Input_file.
Use the split function to break value into an array named arr, with ORS (a newline) as the delimiter.
Loop from 1 to num (the total number of values supplied by the echo command).
Inside the loop, copy each arr value into an array named newVal under the same index i (1, 2, 3 and so on). The main block reads arr directly, so this copy is not strictly needed.
In the main awk block, set the first field to the arr value for the current line number (FNR), followed by OFS and the original $1, then print the line.
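If the new values live in a file rather than in a shell variable, the same idea can be sketched with awk's two-file idiom. The helper name prepend_column and the scratch file names below are assumptions made for illustration, and the sketch assumes the values file is not empty and has one value per CSV row.

```shell
# prepend_column is a hypothetical helper: VALUES_FILE holds one new value per line.
prepend_column() {
  awk 'BEGIN { FS = OFS = "," }
       NR == FNR { col[FNR] = $0; next }   # first file: store the value for line FNR
       { print col[FNR], $0 }              # second file: new value, then the row
  ' "$1" "$2"
}

# Scratch files for the demo (names are arbitrary).
dir=$(mktemp -d)
printf 'col\n0\nA\n2\nC\n' > "$dir/values.txt"
printf 'A,B,C,D,E,F\n1,2,3,4,5,6\n2,3,4,5,6,7\n3,4,5,6,7,8\n4,5,6,7,8,9\n' > "$dir/file.csv"

prepend_column "$dir/values.txt" "$dir/file.csv"
```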

transform multiline text into csv with awk sed and grep

I run a shell command that returns a list of repeated values like this (note the indentation):
Name: vm346
cpu 1 (12%) 6150m (76%)
memory 1130Mi (7%) 1130Mi (7%)
Name: vm847
cpu 6 (75%) 30150m (376%)
memory 12980Mi (87%) 12980Mi (87%)
Name: vm848
cpu 3500m (43%) 17150m (214%)
memory 6216Mi (41%) 6216Mi (41%)
I am trying to transform that data like this (in csv):
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The problem is that any given dataset like the one above is always on more than one line.
When I pipe that into awk, it drives me mad because even if I use:
BEGIN{ FS="\n" }
to try and stitch the data together in one line, it doesn't work. No matter what I do, awk keeps the name value as a separated line above everything else.
I am sorry I haven't much code to share but I have been spinning my wheels with this for a few hours now and I am running out of ideas...
I can solve this in Perl:
perl -ane 'print join ",", @F[1 .. $#F]; print $F[0] eq "memory" ? "\n" : ","'
It should be easy to translate it to awk if you need it.
How does it work?
-a splits each line on whitespace into the @F array
-n reads the input line by line and runs the code specified after -e for each line
We print all the elements but the first one separated by commas (see join)
We then look at the first column, if it's memory, we are at the last line of the block, so we print a newline, otherwise we print a comma
With AWK, one option is to set RS to "Name: ", and ignore the first record with NR > 1, e.g.
awk -v RS="Name: " 'BEGIN{OFS=","} NR > 1 {print $1, $3, $4, $5, $6, $8, $9, $10, $11}' file
#> vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
#> vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
#> vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
awk '{$1=""}1' | paste -sd'  \n' - | awk '{$1=$1}1' OFS=,
Get rid of the first column. Join every three rows (the paste delimiter list is two spaces followed by a newline). Same idea with sed:
sed 's/^ *[^ ]* *//' | paste -sd'  \n' - | sed 's/  */,/g'
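The "join every three rows" step relies on paste cycling through its delimiter list. A small sketch of just that step, assuming a delimiter list of two spaces followed by a newline (join_threes is a hypothetical name, and the input is a shortened version of the sample data):

```shell
# Delimiter list: space, space, newline (reused cyclically), so lines join in threes.
join_threes() {
  paste -sd'  \n' -
}

printf '%s\n' 'vm346' '1 (12%)' '1130Mi (7%)' 'vm847' '6 (75%)' '12980Mi (87%)' |
  join_threes
```

Each group of three input lines should come out as one space-separated line, which the final awk or sed stage then turns into commas.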
Something else:
awk '
$1=="Name:" {
sep=ors
ors=ORS
} {
for (i=2;i<=NF;++i) {
printf "%s%s",sep,$i
sep=OFS
}
} END {printf "%s",ors}'
Or if you want to print an ORS based on the first field being "memory" (note that this program may end without printing a terminating ORS):
awk '{for (i=2;i<=NF;++i) printf "%s%s",$i,(i==NF && $1=="memory" ? ORS : OFS)}'
something else else:
awk -v OFS=, '
index($0,$1)==1 {
OFS=ors
ors=ORS
} {
$1=""
printf "%s",$0
OFS=ofs
} END {printf "%s",ors} BEGIN {ofs=OFS}'
This might work for you (GNU sed):
sed -nE '/^ +\S+ +/{s///;H;$!d};x;/./s/\s+/,/gp;x;s/^\S+ +//;h' file
In overview the sed program processes indented lines, already gathered lines (except in the case that the current line is the first line of the file) and non-indented lines.
Turn off implicit printing and enable extended regexp's. (-nE).
If the current line is indented, remove the indent, the first field and any following spaces, append the result to the hold space and if it is not the last line, delete it.
Otherwise, check the hold space for gathered lines and if found, replace one or more whitespaces by commas and print the result. Then prep the current line by removing the first field and any following spaces and replace the hold space with the result.
The solution seems logically back-to-front, but programming in this style avoids having to check for end-of-file multiple times and invoking labels and gotos.
N.B. This solution will work for any number of indented lines.
Here is a Ruby script to do that:
ruby -e '
s=$<.read
s.scan(/^([^ \t]+:)([\s\S]+?)(?=^\1|\z)/m). # parse blocks
map(&:last). # get data part
# parse and join the data fields:
map{|block| block.split(/\n[ \t]+[^ \t]+[ \t]+/)}.
map{|lines| lines.map(&:strip).join(" ").split().join(",")}.
each{|l| puts "#{l}"}
' file
vm346,1,(12%),6150m,(76%),1130Mi,(7%),1130Mi,(7%)
vm847,6,(75%),30150m,(376%),12980Mi,(87%),12980Mi,(87%)
vm848,3500m,(43%),17150m,(214%),6216Mi,(41%),6216Mi,(41%)
The advantage is that this is not dependent on the number of lines or the number of fields. It is parsing data that is in blocks of the form:
START: ([ \t]+[data_with_no_space])*\n
l1 ([ \t]+[data_with_no_space])*\n
...
START:
...
Works this way:
Parse the blocks with the regex passed to scan;
Save an array of the data elements;
Join the sub arrays and then split into data fields;
Join(',') to make a csv.

Insert newline character at index in .bash

I'm taking an introductory course to bash at my university and am working on a little MotD script that uses a json-object grabbed from an API using curl.
I want to make absolutely certain that you understand that this is NOT an assignment, but something I'm playing around with to learn more about how to script with bash.
I've found myself stuck with what could possibly be a very simple issue: I want to insert a new line ('\n') at a specific index if the 'quote' value of my json-object is too long (in this case at index 80).
I've been following a bunch of SO threads and this is my current solution:
#!/bin/bash
json_object=$(curl -s 'http://quotes.stormconsultancy.co.uk/random.json')
quote=$(echo ${json_object} | jq .quote | sed -e 's/^"//' -e 's/"$//')
author=$(echo ${json_object} | jq .author)
count=${#quote}
echo $quote
echo $author
echo "wc: $count"
if((count > 80));
then
quote=${quote:0:80}\n${quote:80:(count - 80)}
else
echo "lower"
fi
printf "$quote"
The current output I receive from the printf is the first word of the quote, whereas if I have an echo before trying to do the string-manipulation I get the entire quote.
I'm sorry if it's not following best practice or anything, but I'm an absolute beginner using both vi and bash.
I'd be very happy with any sort of advice. :)
EDIT:
Sample output:
$ ./json.bash
You should name a variable using the same care with which you name a first-born child.
"James O. Coplien"
86
higher
You should name a variable using the same care with which you name a first-born nchild.
You can just use a single line bash command to achieve this,
string="You should name a variable using the same care with which you name a first-born child."
(( "${#string}" > 80 )) && printf "%s\n" "${string:0:80}"$'\n'"${string:80}" || printf "%s\n" "$string"
You should name a variable using the same care with which you name a first-born
child.
and for an input line shorter than 80 characters:
string="You should name a variable using the same care"
(( "${#string}" > 80 )) && printf "%s\n" "${string:0:80}"$'\n'"${string:80}" || printf "%s\n" "$string"
You should name a variable using the same care
An explanation,
(( "${#string}" > 80 )) && printf "%s\n" "${string:0:80}"$'\n'"${string:80}" || printf "%s\n" "$string"
# The syntax is an indirect implementation of a ternary operator, as bash
# doesn't directly support one.
#
# (( "${#string}" > 80 )) will return a success/fail depending upon the length
# of the string variable and if it is greater than 80, the command after && is
# executed and if it fails the command after || is executed
#
# "${string:0:80}"$'\n'"${string:80}"
# A parameter expansion syntax for sub-string extraction.
#
# ${PARAMETER:OFFSET}
#
# ${PARAMETER:OFFSET:LENGTH}
#
# This one can expand only a part of a parameter's value, given a position
# to start and maybe a length. If LENGTH is omitted, the parameter will be
# expanded up to the end of the string. If LENGTH is negative, it's taken as
# a second offset into the string, counting from the end of the string.
#
# So in our example we basically extract the characters from position 0 to 80
# insert a new-line and append the rest of the string
#
# The $'\n' syntax (ANSI-C quoting) lets escape sequences be written inside a
# string; in this case it produces just the newline character.
Not really in the original question, but here is some extra code added to @Inian's great answer so that the line is not broken in the middle of a word, but rather at the last white space in ${string:0:80}:
#!/usr/bin/env bash
string="You should really name a variable using the same care with which you name a first-born child."
if (( "${#string}" > 80 )); then
maxstring="${string:0:80}"
lastspace="${maxstring##*\ }"
breakat="$((${#maxstring} - ${#lastspace}))"
printf "%s\n" "${string:0:${breakat}}"$'\n'"${string:${breakat}}"
else
printf "%s\n" "$string"
fi
maxstring=${string:0:80}:
Let's get the first 80 characters of the quote.
lastspace=${maxstring##*\ }:
Deletes longest match of *\ (white space is escaped) from front of $maxstring, ${lastspace} will be the remaining string from last white space until end of the string.
breakat="$((${#maxstring} - ${#lastspace}))":
Subtract the length of ${lastspace} from the length of ${maxstring} to get the index just after the last white space in ${maxstring}. This is the index where \n will be inserted.
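To see the intermediate values for the example string, here is a small sketch. It uses cut as a POSIX stand-in for ${string:0:80}, which is an assumption made only so the snippet also runs in shells without bash's substring expansion; the ## and # expansions are the same ones explained above.

```shell
string="You should really name a variable using the same care with which you name a first-born child."

# First 80 characters (stand-in for the bash-only ${string:0:80}).
maxstring=$(printf '%s\n' "$string" | cut -c1-80)

lastspace=${maxstring##* }                     # text after the last space: "firs"
breakat=$(( ${#maxstring} - ${#lastspace} ))   # 80 - 4 = 76

printf 'lastspace=%s breakat=%s\n' "$lastspace" "$breakat"
```

Here the first 76 characters end with the space after "a", so the second line starts at "first-born".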
Example output with "hard" break at character 80:
You should really name a variable using the same care with which you name a firs
t-born child.
Example output with a "soft" break at the closest white space from character 80:
You should really name a variable using the same care with which you name a
first-born child.

Extract column data from csv file based on row values

I am trying to use awk/sed to extract specific column data based on row values. My actual files have 15 columns and over 1,000 rows (From a .csv file.)
Simple EXAMPLE: Input: a csv file with a total of 5 columns and 100 rows. Output: data from columns 2 through 5, selected by specific row values in column 2. (I have a specific list of the row values I want to filter on. The values are numbers.)
File looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The operator response:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output Example: I want full row lines based if column 2 equals 7439 & 7500.
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain, other than convenience variable q defined for double quotes helps to eliminate escaping.
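Since the question mentions a whole list of IDs, the same quoting trick can be extended to a lookup table. This is a sketch only: filter_ids is a hypothetical helper name, the IDs are passed as a space-separated string, and the "17439" row is an invented extra row that shows the match on column 2 is exact rather than partial.

```shell
# filter_ids is a hypothetical helper: usage  filter_ids "ID1 ID2 ..." FILE
filter_ids() {
  awk -F, -v ids="$1" '
    BEGIN {
      n = split(ids, list, " ")
      for (i = 1; i <= n; i++)
        want["\"" list[i] "\""] = 1   # fields still carry their double quotes
    }
    NR == 1 || ($2 in want)           # keep the header and exact ID matches
  ' "$2"
}

# Scratch copy of the sample data plus one invented row ("17439").
csv=$(mktemp)
printf '%s\n' \
  '"Date","IdNo","Color","Height","Education"' \
  '"06/02/16","7438","Red","54","4"' \
  '"06/02/16","7439","Yellow","57","3"' \
  '"06/03/16","7500","Red","55","3"' \
  '"06/04/16","17439","Blue","60","2"' > "$csv"

filter_ids "7439 7500" "$csv"
```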
awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"

I just want the last 3 characters of a column returned to the original file

First 2 lines of my data:
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
I only want the last 3 characters of column 2 and column 3, and I don't want the column header affected.
I'm happy with a solution that does column 2 first and then column 3.
I am fiddling with sed and awk at the minute but have had no joy yet.
this is what I want:
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
edit1 this gives me the last 3 digits(+ "), just need to write this back to the orig file?
$ awk -F"," 'NR>1{ print $2}' head_test_real.csv | sed 's/.*\(....\)/\1/'
427"
592"
007"
592"
409"
742"
387"
731"
556"
edit2 this works but I lose the double quotes: "123427" goes to 427, and I would like to keep the double quotes.
* NR>1 works on the rows after the 1st row.
$ awk -F, 'NR>1{$2=substr($2,length($2)-3,3)}1' OFS=, head_test_real.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06",427,"456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
edit3 @Mark thanks for the correct answer; here, just for my reference, are the results of the different quoting options.
$ ####csv.QUOTE_ALL
$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
$ ####csv.QUOTE_MINIMAL
$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan
$ ###csv.QUOTE_NONNUMERIC
$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
$ ###csv.QUOTE_NONE
$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan
While awk seems like a natural fit for comma-separated data, it doesn't deal well with the quoted-fields version. I would recommend using a dedicated CSV-processing library like the one that ships with Python (both 2 and 3):
import csv
with open('in.csv','r') as infile:
    reader = csv.reader(infile)
    with open('out.csv','w') as outfile:
        writer = csv.writer(outfile,delimiter=',',quotechar='"',quoting=csv.QUOTE_ALL)
        writer.writerow(next(reader))
        for row in reader:
            row[1] = row[1][-3:]
            row[2] = row[2][-3:]
            writer.writerow(row)
Put the above code into a file named e.g. fixcsv.py and make the filenames match what you have and want, then just run it with python fixcsv.py (or python3 fixcsv.py).
I set it to quote everything in the output (QUOTE_ALL); if you don't want it to do that, you can set it to QUOTE_MINIMAL, QUOTE_NONNUMERIC or QUOTE_NONE.
The row assignments replace the second and third fields (row[1] and row[2], since the first field is row[0]) with their last three characters ([-3:]). You could also do it arithmetically with e.g. row[1] = int(row[1]) % 1000, but note that this drops leading zeros (456060 would become 60 rather than 060).
$ awk 'BEGIN{FS=OFS="\",\""} NR>1{for (i=2;i<=3;i++) $i=substr($i,length($i)-2)} 1' file
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
As with any command, to write back to the original file is just:
command file > tmp && mv tmp file
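As a slightly more defensive sketch of that pattern, the temporary file can be created with mktemp and the move made conditional on the command succeeding. The file name and the cut-down four-column data below are assumptions for the demo; the awk command is the one shown above.

```shell
# Scratch directory with a cut-down demo file (hypothetical name and data).
dir=$(mktemp -d)
file="$dir/head_test.csv"
printf '%s\n' \
  '"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes"' \
  '"2015-10-06","123427","456060","137765"' > "$file"

# Write to a temp file first; replace the original only if awk succeeded.
tmp=$(mktemp "$dir/tmp.XXXXXX") &&
awk 'BEGIN{FS=OFS="\",\""} NR>1{for (i=2;i<=3;i++) $i=substr($i,length($i)-2)} 1' "$file" > "$tmp" &&
mv "$tmp" "$file"

cat "$file"
```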
Perl to the rescue!
perl -pe 's/",".*?(...",")/","$1/ if $. > 1' < input > output
-p reads the input line by line and prints the result
s/regex/replacement/ is a substitution
.*? matches anything (like .*), but the question mark makes it "frugal", i.e. it matches the shortest string possible
(...",") creates a capture group starting three characters before ",", it can be referenced as $1.
$. is the line number, no replacement happens on line 1.
Make sure the first two columns are always quoted and the second column is never shorter than 3 characters.
To modify the third column, you can modify the regex to
perl -pe 's/^("(?:.*?","){2}).*?(...",")/$1$2/ if $. > 1'
#                         ~
Modify the indicated number (the 2 in {2}) to handle any column you like: use N-1 for column N.