I am trying to insert a column in front of the first column in a comma-separated value (CSV) file. At first blush, awk seems to be the way to go, but I'm struggling with how to populate the new column with a different value on each row.
CSV File
A,B,C,D,E,F
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8
4,5,6,7,8,9
Attempted Code
awk 'BEGIN{FS=OFS=","}{$1=$1 OFS (FNR<1 ? $1 "0\nA\n2\nC" : "col")}1'
Result
A,col,B,C,D,E,F
1,col,2,3,4,5,6
2,col,3,4,5,6,7
3,col,4,5,6,7,8
4,col,5,6,7,8,9
Expected Result
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
This can be easily done using paste + printf:
paste -d, <(printf "col\n0\nA\n2\nC\n") file
col,A,B,C,D,E,F
0,1,2,3,4,5,6
A,2,3,4,5,6,7
2,3,4,5,6,7,8
C,4,5,6,7,8,9
<(...) is process substitution, available in bash. For other shells, use a pipeline like this:
printf "col\n0\nA\n2\nC\n" | paste -d, - file
With awk alone, you could try the following solution, written and tested against the samples shown.
awk -v value="$(echo -e "col\n0\nA\n2\nC")" '
BEGIN{
FS=OFS=","
split(value,arr,ORS)
}
{
$1=arr[FNR] OFS $1
}
1
' Input_file
Explanation:
First, create an awk variable named value holding the output of the shell's echo command. NOTE: the -e option makes echo interpret \n as newlines rather than literal characters.
Then, in the BEGIN section of the awk program, set FS and OFS to , for every line of Input_file.
Use the split function to break value into an array named arr, with ORS (a newline) as the delimiter.
In the main body, prepend arr[FNR] (the value for the current record) and OFS to the first field, then print the line.
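As an aside, a variant that avoids echo -e altogether (a minimal sketch, assuming an awk that expands escape sequences such as \n inside -v assignments, which POSIX requires and gawk, mawk, and BSD awk all do):
awk -v value='col\n0\nA\n2\nC' '
BEGIN{ FS=OFS=","; split(value,arr,"\n") }
{ $1=arr[FNR] OFS $1 }
1' Input_file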
An input file is given in which every column is quoted and each line ends with a carriage return/newline character (^M). I need to:
Join any line that is broken by an embedded newline inside quotes back into a single line (see line 1 of the example below).
Remove the double quotes around a column when the delimiter (,) is not present inside it.
Remove the carriage return characters (^M).
To exemplify, given the following input file
"name","address","age"^M
"ram","abcd,^M
def","10"^M
"abhi","xyz","25"^M
"ad","ram,John","35"^M
I would like to obtain the following output by means of a sed/perl/awk script/oneliner.
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
Solutions I have tried so far:
For appending a broken line to the previous line:
sed '/^[^"]*"[^"]*$/{N;s/\n//}' sample.txt
For removing control-M characters:
perl -pe 's/\r//g' sample.txt
But I still didn't achieve the final output I need, shown above.
Use a library to parse CSV files. Apart from the general advice to always use a library for this, here you have very specific reasons: embedded newlines and embedded delimiters.
In Perl a good library is Text::CSV (which wraps Text::CSV_XS if installed). A basic example:
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\n+//g for @$row;
$csv->say(\*STDOUT, $row);
}
Comments
The binary option in the constructor is what handles newlines embedded in data
Once a line is read into the array reference $row I remove newlines in each field with a simplistic regex. By all means please improve this as/if needed
The pruning of $row works as follows. In a foreach loop each element is aliased by the loop variable, so changing the variable changes the array. I used the default, where elements are aliased by $_; the regex changes $_, so $row changes.
I like this compact shortcut because it has such a distinct look that I can tell from across the room that an array is being changed in place; I consider it a sort-of-an-idiom. But if it is confusing, by all means write out a full and proper loop, as sketched below.
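For reference, a spelled-out equivalent of that one-liner (same effect, just more explicit):
for my $field (@$row) {
    $field =~ s/\n+//g; # $field aliases the array element, so $row is edited in place
}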
The processed output is printed to STDOUT. Or, open an output file and pass that filehandle to say (or to print in older module versions) so the output goes directly to that file, for example:
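A minimal sketch, with output.csv as a placeholder name:
open my $out, '>', 'output.csv' or die "Can't open output.csv: $!";
# then, inside the while loop, print to the file instead of STDOUT:
$csv->say($out, $row);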
The above prints, for the sample input provided in the question
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
This might work for you (GNU sed):
sed ':a;/[^"]$/{N;s/\n//;ba};s/"\([^",]*\)"/\1/g' file
The solution is in two parts:
Join broken lines to make whole ones.
Remove double quotes surrounding fields that do not contain commas.
If the current line does not end with a double quote, append the next line, remove the newline, and repeat. Otherwise, remove the double quotes surrounding fields that contain neither double quotes nor commas.
N.B. This supposes that fields do not contain escaped double quotes. If they do, the condition for the first step would need to be amended and the double quotes within fields would need to be catered for.
FPAT is the way to go with GNU awk; it handles comma-separated data with embedded commas. The plan:
Remove the ^M characters.
Join the broken lines.
Remove the quotes.
dos2unix sample.txt
awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt > tmp && mv tmp sample.txt
"name","address","age"
"ram","abcd,def","10"
"abhi","xyz","25"
"ad","ram,John","35"
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
All in one go:
dos2unix sample.txt && awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1'
Normally you set the field separator (FS, or -F on the command line) to describe how fields are separated. FPAT="([^,]+)|(\"[^\"]+\")" instead describes what a field looks like, using a regex. The regex looks complicated but is commonly used with CSV.
for (i=1;i<=NF;i++) loops through the fields on the line one by one.
if($i!~",") — if the field does not contain a comma, then
$i=substr($i,2,length($i)-2) removes the first and last characters, i.e. the quotes.
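To see what FPAT actually matches, you can print the fields one at a time; a quick check against one of the sample lines:
$ echo '"ram","abcd,def","10"' | awk -v FPAT="([^,]+)|(\"[^\"]+\")" '{for (i=1;i<=NF;i++) print i" -> "$i}'
1 -> "ram"
2 -> "abcd,def"
3 -> "10"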
If a field for some reason does not contain ", this is more robust:
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1' file
It will not do anything to a field that does not contain a double quote.
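For example, a line whose unquoted fields contain no quotes at all passes through unchanged (a hypothetical sample line):
$ echo 'ram,"abcd,def",10' | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1'
ram,"abcd,def",10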
With perl, please try the following:
perl -e '
while (<>) {
s/\r$//; # remove trailing CR code
$str .= $_;
}
while ($str =~ /("(("")|[^"])*"\n?)|((^|(?<=,))[^,]*((?=,)|\n))/g) {
$_ = $&;
if (/,/) { # the element contains ","
s/\n//g; # then remove newline(s) if any
} else { # otherwise remove surrounding double quotes
s/^"//s; s/"$//s;
}
push(@ary, $_);
if (/\n$/) { # newline terminates the element
print join(",", #ary);
#ary = ();
}
}' sample.txt
Output:
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
Before you start reading my long post: I figured out why it doesn't work. It's about base64: Ruby's Base64.encode64 inserts a newline after every 60 characters by default. This link: base64 encode length parameter
explains it a bit. Is there a Base64 module that adds a newline after every 76 characters instead?
So the solution is just this:
@checkout_signature = Base64.strict_encode64(@signature + "|" + checkout_request)
@checkout_signature = @checkout_signature.insert(76, "\n") # once
or
@checkout_signature.gsub!(/.{76}(?=.)/, '\0'+"\n") # every 76 chars
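Putting it together, a minimal self-contained sketch (using the same test key as the bash script below; the final gsub re-wraps the output at 76 columns the way coreutils base64 does):
require 'openssl'
require 'base64'
checkout_request = '{"charge":{"amount":499,"currency":"EUR"}}'
secret_key = 'pr_test_tXHm9qV9qV9bjIRHcQr9PLPa'
# hex HMAC-SHA256, equivalent to: openssl dgst -sha256 -hmac "$secret_key"
signature = OpenSSL::HMAC.hexdigest('sha256', secret_key, checkout_request)
# unwrapped Base64, then a newline after every 76 chars, like coreutils base64
checkout_signature = Base64.strict_encode64(signature + "|" + checkout_request)
puts checkout_signature.gsub(/.{76}(?=.)/, "\\0\n")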
The original question: please help me translate this bash script to Ruby the proper way. Actually, only the base64 encoding is wrong.
UPDATE
bash way
#!/bin/bash
echo 'Signature test'
export checkout_request='{"charge":{"amount":499,"currency":"EUR"}}'
echo $checkout_request
export signature=`echo -n "$checkout_request" | openssl dgst -sha256 -hmac 'pr_test_tXHm9qV9qV9bjIRHcQr9PLPa' | sed 's/^.* //'`
echo $signature
echo '--------'
echo -n "$signature|$checkout_request" | base64
RESULT
{"charge":{"amount":499,"currency":"EUR"}}
cf9ce2d8331c531f8389a616a18f9578c134b784dab5cb7e4b5964e7790f173c
--------
Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3
OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
ruby way
require 'openssl'
require 'base64'
checkout_request = '{"charge":{"amount":499,"currency":"EUR"}}'
secret_key = 'pr_test_tXHm9qV9qV9bjIRHcQr9PLPa'
@signature = OpenSSL::HMAC.hexdigest('sha256', secret_key, checkout_request)
puts @signature
puts "-----"
@checkout_signature = Base64.urlsafe_encode64(@signature + "|" + checkout_request)
puts @checkout_signature
RESULT
cf9ce2d8331c531f8389a616a18f9578c134b784dab5cb7e4b5964e7790f173c
-----
ODY4YzY4YTg4NmFmOTg2MGY5OGVjMmUyODM5OTBhYmViNmQyZjUzYWI5ZjgxMzlhYzFlODllNThhZTVhZTFkMnx7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
UPDATE
OK, the signature is fine, but I was using the wrong base64 encoding...
SOMEHOW
echo -n "$signature|$checkout_request" | base64
adds a newline after the 76th character of the output,
and it's the only '\n' in the bash result.
In Ruby, when I use .encode64, I get a newline after every 60 characters:
"Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVj\nYjdlNGI1OTY0ZTc3OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwi\nY3VycmVuY3kiOiJFVVIifX0=\n"
When I use .strict_encode64 or .urlsafe_encode64 there is no \n:
"Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0="
and the result from bash looks like this:
Y2Y5Y2UyZDgzMzFjNTMxZjgzODlhNjE2YTE4Zjk1NzhjMTM0Yjc4NGRhYjVjYjdlNGI1OTY0ZTc3
OTBmMTczY3x7ImNoYXJnZSI6eyJhbW91bnQiOjQ5OSwiY3VycmVuY3kiOiJFVVIifX0=
There must be a newline after "c3"; that's just how printenv displays the variable, and I can't inspect it the way I can in Ruby.
WHY IS THERE SUCH A DIFFERENCE?
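The difference is only the wrap width: coreutils base64 wraps its output at 76 columns by default, Ruby's Base64.encode64 wraps at 60, and strict_encode64/urlsafe_encode64 do not wrap at all. Assuming GNU coreutils (the -w flag is a GNU extension), you can see both behaviours side by side:
printf '%s' "$signature|$checkout_request" | base64        # wrapped at 76 columns
printf '%s' "$signature|$checkout_request" | base64 -w 0   # one line, matches strict_encode64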
I'm taking an introductory course to bash at my university and am working on a little MotD script that uses a json-object grabbed from an API using curl.
I want to make absolutely certain that you understand that this is NOT an assignment, but something I'm playing around with to learn more about how to script with bash.
I've found myself stuck with what is possibly a very simple issue; I want to insert a newline ('\n') at a specific index if the 'quote' value of my JSON object is too long (in this case at index 80).
I've been following a bunch of SO threads and this is my current solution:
#!/bin/bash
json_object=$(curl -s 'http://quotes.stormconsultancy.co.uk/random.json')
quote=$(echo ${json_object} | jq .quote | sed -e 's/^"//' -e 's/"$//')
author=$(echo ${json_object} | jq .author)
count=${#quote}
echo $quote
echo $author
echo "wc: $count"
if((count > 80));
then
quote=${quote:0:80}\n${quote:80:(count - 80)}
else
echo "lower"
fi
printf "$quote"
The current output I receive from the printf is the first word of the quote, whereas if I have an echo before trying to do the string-manipulation I get the entire quote.
I'm sorry if it's not following best practice or anything, but I'm an absolute beginner using both vi and bash.
I'd be very happy with any sort of advice. :)
EDIT:
Sample output:
$ ./json.bash
You should name a variable using the same care with which you name a first-born child.
"James O. Coplien"
86
higher
You should name a variable using the same care with which you name a first-born nchild.
You can use a single-line bash command to achieve this:
string="You should name a variable using the same care with which you name a first-born child."
(( "${#string}" > 80 )) && printf "%s\n" "${string:0:80}"$'\n'"${string:80}" || printf "%s\n" "$string"
You should name a variable using the same care with which you name a first-born
child.
And for an input line of fewer than 80 characters:
string="You should name a variable using the same care"
(( "${#string}" > 80 )) && printf "%s\n" "${string:0:80}"$'\n'"${string:80}" || printf "%s\n" "$string"
You should name a variable using the same care
An explanation:
(( "${#string}" > 80 )) && printf "%s\n" "${string:0:80}"$'\n'"${string:80}" || printf "%s\n" "$string"
# The syntax is an indirect implementation of the ternary operator, as bash
# doesn't support one directly.
#
# (( "${#string}" > 80 )) will return a success/fail depending upon the length
# of the string variable and if it is greater than 80, the command after && is
# executed and if it fails the command after || is executed
#
# "${string:0:80}"$'\n'"${string:80}"
# A parameter expansion syntax for sub-string extraction.
#
# ${PARAMETER:OFFSET}
#
# ${PARAMETER:OFFSET:LENGTH}
#
# This one can expand only a part of a parameter's value, given a position
# to start and maybe a length. If LENGTH is omitted, the parameter will be
# expanded up to the end of the string. If LENGTH is negative, it's taken as
# a second offset into the string, counting from the end of the string.
#
# So in our example we basically extract the characters from position 0 to 80
# insert a new-line and append the rest of the string
#
# The $'\n' syntax makes the shell interpret escape sequences,
# in this case just the new line character.
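If the compact && / || chain is hard to read, here is the same logic as a plain if/else; printf '%s\n' prints each of its arguments on its own line, which inserts the break:
if (( ${#string} > 80 )); then
  printf '%s\n' "${string:0:80}" "${string:80}"
else
  printf '%s\n' "$string"
fi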
Not really in the original question, but adding some extra code to @Inian's great answer so it does not break in the middle of a word, but rather at the last whitespace in ${string:0:80}:
#!/usr/bin/env bash
string="You should really name a variable using the same care with which you name a first-born child."
if (( "${#string}" > 80 )); then
maxstring="${string:0:80}"
lastspace="${maxstring##*\ }"
breakat="$((${#maxstring} - ${#lastspace}))"
printf "%s\n" $"${string:0:${breakat}}"$'\n'"${string:${breakat}}"
else
printf "%s\n" "$string"
fi
maxstring=${string:0:80}:
Let's get the first 80 characters of the quote.
lastspace=${maxstring##*\ }:
Deletes the longest match of *\  (the whitespace is escaped) from the front of $maxstring; ${lastspace} is then the remaining string from the last whitespace to the end of the string.
breakat="$((${#maxstring} - ${#lastspace}))":
Subtract the length of ${lastspace} from the length of ${maxstring} to get the index of the last whitespace in ${maxstring}. This is the index where the \n will be inserted.
Example output with "hard" break at character 80:
You should really name a variable using the same care with which you name a firs
t-born child.
Example output with a "soft" break at the closest white space from character 80:
You should really name a variable using the same care with which you name a
first-born child.
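As an aside, if an external tool is acceptable, coreutils fold produces the same soft wrap; with -s it breaks at the last blank within the width:
fold -s -w 80 <<< "$string"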
I have to extract data from JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pairs, I can extract them by processing each line of the JSON file, checking the type, and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5 GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting.
Sample JSON file would look like as below:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
line = ""
gsub("[{}\x22]", "", $0)
f=split($0, a, "[:,]")
for (i=1;i<=f;i++)
if (a[i] == "Type")
file = a[++i]
else
line = line sprintf("%-15s",a[i])
print line > file ".fixed.out"
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
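For example, saved as split.awk (a hypothetical name) and run against the sample data from the question:
$ awk -f split.awk input.json
$ ls -1 *.fixed.out
Chat.fixed.out
Mail.fixed.out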
Let me know how close this comes to what you're looking for and I can make some adjustments.
perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs'; # for FileCache
use FileCache; # avoid exceeding system's maximum number of file descriptors
use JSON;
my $type;
my $json = JSON->new->utf8(1); #NOTE: expect utf-8 strings
while(my $line = <>) { # for each input line
# extract type
eval { $type = $json->decode($line)->{Type} };
$type = 'json_decode_error' if $@;
$type ||= 'missing_type';
# print to the appropriate file
my $fh = cacheout '>>', "$type.out";
print $fh $line; #NOTE: use cache if there are too many hdd seeks
}
corresponding shell script
#!/bin/bash
#NOTE: bash is used to create non-ascii filenames correctly
__extract_type()
{
perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}
__process_input()
{
local IFS=$'\n'
while read -r line; do # for each input line
# extract type
local type="$(__extract_type "$line" 2>/dev/null ||
echo json_decode_error)"
[ -z "$type" ] && local type=missing_type
# print to the appropriate file
echo "$line" >> "$type.out"
done
}
__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out