I have an input file in which every column is wrapped in double quotes and every line ends with a carriage return/newline pair. I need to do the following:
If a quoted field contains a newline, join the continuation onto the same line (see the "ram" record in the sample below).
Remove the double quotes around a column if the column does not contain the delimiter (,).
Remove the carriage return characters (^M).
To exemplify, given the following input file
"name","address","age"^M
"ram","abcd,^M
def","10"^M
"abhi","xyz","25"^M
"ad","ram,John","35"^M
I would like to obtain the following output by means of a sed/perl/awk script/oneliner.
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
Solutions I have tried so far:
To join a line with the previous one:
sed '/^[^"]*"[^"]*$/{N;s/\n//}' sample.txt
To remove the Control-M characters:
perl -pne 's/\\r//g' sample.txt
But neither of these gives me the final output I need.
Use a library to parse CSV files. That is always advisable, and here you also have very specific reasons for it: embedded newlines and embedded delimiters.
In Perl a good library is Text::CSV (which wraps Text::CSV_XS if installed). A basic example:
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\n+//g for @$row;
$csv->say(\*STDOUT, $row);
}
Comments
The binary option in the constructor is what handles newlines embedded in data.
Once a line is read into the array reference $row, I remove newlines in each field with a simplistic regex. Please improve this as needed.
The pruning of $row works as follows. In a foreach loop each element is aliased by the loop variable, so changing the variable changes the array. I used the default, where elements are aliased by $_, which the regex changes, so $row changes.
I like this compact shortcut because it has such a distinct look that I can tell from across the room that an array is being changed in place, so I consider it a sort-of idiom. But if it is confusing, by all means write out a full and proper loop.
The processed output is printed to STDOUT. Alternatively, open an output file and pass that filehandle to say (or to print in older module versions) so the output goes directly to that file.
The above prints, for the sample input provided in the question
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
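For comparison, the same parse-then-clean approach can be sketched in Python with the standard csv module. This is only an illustration under assumptions: the function name clean_csv and the inline sample string are made up for the example, and unlike the Perl regex above it strips carriage returns as well as newlines from each field.

```python
import csv
import io


def clean_csv(text: str) -> str:
    """Parse CSV text, drop CR/LF inside fields, re-emit with minimal quoting."""
    # newline="" keeps embedded line breaks intact so the csv module can
    # recognise them as part of a quoted field.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO()
    # QUOTE_MINIMAL only quotes fields that need it (here: fields with a comma).
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        writer.writerow([f.replace("\r", "").replace("\n", "") for f in row])
    return out.getvalue()


# Sample input from the question, with CR LF line endings written out.
sample = (
    '"name","address","age"\r\n'
    '"ram","abcd,\r\n'
    'def","10"\r\n'
    '"abhi","xyz","25"\r\n'
    '"ad","ram,John","35"\r\n'
)
print(clean_csv(sample), end="")
```

As with Text::CSV, the parser takes care of the embedded line break and the embedded comma, so the only custom logic left is the per-field cleanup.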
This might work for you (GNU sed):
sed ':a;/[^"]$/{N;s/\n//;ba};s/"\([^",]*\)"/\1/g' file
The solution is in two parts:
Join broken lines to make whole ones.
Remove double quotes surrounding fields that do not contain commas.
If the current line does not end with double quotes, append the next line, remove the newline and repeat. Otherwise: remove double quotes surrounding fields that do not contain double quotes or commas.
N.B. This assumes that fields do not contain quoted double quotes. If they can, the condition for the first step would need to be amended and double quotes within fields would need to be catered for.
With GNU awk, FPAT is the way to go, since it handles comma-separated files. The steps are:
remove ^M
join the broken lines
remove quotes
dos2unix sample.txt
awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt > tmp && mv tmp sample.txt
"name","address","age"
"ram","abcd,def","10"
"abhi","xyz","25"
"ad","ram,John","35"
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
All in one go:
dos2unix sample.txt && awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1'
Normally you set the field separator (FS, or -F on the command line) to tell awk how fields are separated. FPAT="([^,]+)|(\"[^\"]+\")" instead tells awk what a field looks like, using a regex: either a run of non-comma characters or a double-quoted string. FPAT is a GNU awk feature and is often used for CSV.
for (i=1;i<=NF;i++) loops through the fields of the line one by one.
if($i!~",") if the field does not contain a comma, then
$i=substr($i,2,length($i)-2) remove its first and last character, the ".
If a field for some reason does not contain ", this is more robust:
awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1' file
It does nothing to a field that contains no double quote.
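The "describe the field, not the separator" idea behind FPAT can be sketched in Python with re.findall. This is a rough analogue, not a port: the names FIELD and unquote_simple_fields are made up for the example. One difference worth noting is that awk picks the longest match among alternatives, while Python's re tries alternatives left to right, so the quoted alternative is listed first here.

```python
import re

# A field is either a double-quoted string or a run of non-comma characters.
# The quoted alternative comes first because Python tries alternatives in order.
FIELD = re.compile(r'"[^"]+"|[^,]+')


def unquote_simple_fields(line: str) -> str:
    """Strip surrounding quotes from fields that contain no comma."""
    cleaned = []
    for field in FIELD.findall(line):
        quoted = len(field) >= 2 and field.startswith('"') and field.endswith('"')
        if quoted and "," not in field:
            field = field[1:-1]  # drop the first and last character, the quotes
        cleaned.append(field)
    return ",".join(cleaned)


for line in ['"name","address","age"', '"ram","abcd,def","10"', 'abhi,"xyz",25']:
    print(unquote_simple_fields(line))
```

Like the more robust awk variant, the function leaves a field untouched when it is not wrapped in double quotes. Empty fields are skipped by this pattern, which the sample data does not contain.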
With perl, please try the following:
perl -e '
while (<>) {
s/\r$//; # remove trailing CR code
$str .= $_;
}
while ($str =~ /("(("")|[^"])*"\n?)|((^|(?<=,))[^,]*((?=,)|\n))/g) {
$_ = $&;
if (/,/) { # the element contains ","
s/\n//g; # then remove newline(s) if any
} else { # otherwise remove surrounding double quotes
s/^"//s; s/"$//s;
}
push(@ary, $_);
if (/\n$/) { # newline terminates the element
print join(",", @ary);
@ary = ();
}
}' sample.txt
Output:
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
I'm developing a Perl script that is supposed to generate an HTML file from numerical values stored in another file. The idea is to read the file that holds these values and list them in a separate HTML file. The file with the numerical values is updated periodically, and those changes should show up in the HTML.
Even though the values are read correctly (I've tested it), they are not printed in the HTML. What's more, not even the HTML tags are printed. This is the code I've written:
#!/usr/bin/perl
use IO::Handle;
use CGI qw(:standard);
print "Status: 200 OK", "\n";
print "Content-type: text/plain", "\n\n";
for(;;) {
open (my $input_file, "<", "/path/to/input/file/input_file.txt") || die "Unable to open the file: $!";
open (my $html_file, ">", "/path/to/html/file/index.html") || die "Unable to open the HTML file: $!";
print $html_file "<html><head><title>title</title><META HTTP-QUIV='refresh' CONTENT='10'></head><body>";
@lines = <$input_file>;
foreach my $line (@lines) {
print $html_file "<p>$line</p>";
}
print $html_file "</body></html>";
sleep 1;
close $input_file || die;
close $html_file || die;
}
The script only works in the first for iteration. What I mean is that the HTML tags and the numerical values are correctly printed in the output file. Then, from iteration 2 to N, the file remains literally empty. I can not see what I'm missing here. Why does it work in the first iteration but not in the following ones?
Thanks in advance
You need to close the file before the sleep. As it stands, the close flushes the data to the file, and the next open (in ">" mode) immediately truncates it again, so the file is left empty for the one second the script sleeps.
You also need to write
close $html_file or die $!
as the code you have is equivalent to
close($html_file || die)
so your program will never die as long as $html_file is true, which an open filehandle always is, even when the close itself fails.
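The ordering described above (write, close, then sleep) can be sketched as follows. This is an illustrative Python version, not the original script: write_html, run, the iterations parameter and the temporary file names are assumptions made for the example, and the field values are HTML-escaped, which the Perl script does not do.

```python
import html
import tempfile
import time
from pathlib import Path


def write_html(input_path, html_path):
    """Regenerate the HTML page from the current contents of the input file."""
    lines = Path(input_path).read_text().splitlines()
    body = "".join(f"<p>{html.escape(line)}</p>" for line in lines)
    page = (
        "<html><head><title>title</title>"
        "<meta http-equiv='refresh' content='10'></head>"
        f"<body>{body}</body></html>"
    )
    with open(html_path, "w") as out:
        out.write(page)
    # Leaving the with-block closes (and flushes) the file before any sleep.


def run(input_path, html_path, interval=1.0, iterations=None):
    """Rewrite the page every `interval` seconds; iterations=None loops forever."""
    done = 0
    while iterations is None or done < iterations:
        write_html(input_path, html_path)
        time.sleep(interval)  # the page on disk is complete while we sleep
        done += 1


# Small self-contained demo with temporary files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input_file.txt"
    dst = Path(tmp) / "index.html"
    src.write_text("1\n2\n")
    run(src, dst, interval=0, iterations=2)
    result = dst.read_text()
print(result)
```

Because the file is closed before the sleep, a reader that looks at the page during the pause sees the full content rather than a freshly truncated file.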
I'm new to LWP, URI, Base64. I'm using LWP to post a json string containing an array from a perl script to another perl script. One of the values in the array is a base64 encoded jpg.
I encode the image
open (IMAGE, "./flower.jpg") or die "$!";
$raw_string = do{ local $/ = undef; <IMAGE>; };
$encoded = encode_base64( $raw_string );
$encoded = uri_escape($encoded);
In the other script I decode the image and save it to a directory. The file is slightly larger after saving it than it was originally (a couple kb larger).
$decoded = decode_base64($item->{'FILE'});
open my $fh, '>', "$path/flower.jpg" or die $!;
binmode $fh;
print $fh $decoded;
close $fh;
Also in the second script I pass the json string back and in the first script essentially print what was returned. Everything seems to be returned/prints as expected. When I try to open the file, I just get a standard OS message stating cannot open file. I tried now with a pdf and a jpg. I know I'm missing something somewhere. Thanks for the help!
I am able to convert a hard-coded JSON string into Perl hashes. However, when I try to convert a complete JSON file into Perl data structures that can be processed later, I get the following error:
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at json_vellai.pl line 9
use JSON::PP;
$json = JSON::PP->new();
$json = $json->allow_singlequote([$enable]);
open (FH, "jsonsample.doc") or die "could not open the file\n";
#$fileContents = do { local $/;<FH>};
@fileContents = <FH>;
#print @fileContents;
$str = $json->allow_barekey->decode(@filecontents);
foreach $t (keys %$str)
{
print "\n $t -- $str->{$t}";
}
This is how my code looks. Please help me out.
It looks to me like decode doesn't want a list, it wants a scalar string.
You could slurp the file:
undef $/;
$fileContents = <FH>;
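The same point, that a JSON decoder wants the whole document as one string rather than a list of lines, can be sketched in Python. The helper name load_json_file and the temporary sample file are made up for the example, and Python's json module is strict, so it has no equivalent of the allow_singlequote and allow_barekey options used in the question.

```python
import json
import tempfile
from pathlib import Path


def load_json_file(path):
    """Read the whole file into one string, then decode it in a single call."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()  # slurp: one scalar string, not a list of lines
    return json.loads(text)


# Demo with a small multi-line JSON document (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "jsonsample.json"
    sample.write_text('{\n  "name": "ram",\n  "age": 10\n}\n', encoding="utf-8")
    data = load_json_file(sample)

for key, value in data.items():
    print(f"{key} -- {value}")
```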
I have to extract data from JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pair I can extract them by processing each line in the JSON file, checking the type and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting.
A sample JSON file looks like this:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
line = ""
gsub("[{}\x22]", "", $0)
f=split($0, a, "[:,]")
for (i=1;i<=f;i++)
if (a[i] == "Type")
file = a[++i]
else
line = line sprintf("%-15s",a[i])
print line > (file ".fixed.out")
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
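For reference, here is a sketch of the same fixed-width split done with a real JSON parser, as recommended above. It is written in Python under several assumptions: the names WIDTH, format_record and split_stream are invented for the example, undecodable lines are routed to a json_decode_error file, and values longer than 15 characters are not truncated. The input is read one line at a time, so a 5GB file does not need to fit in memory.

```python
import json
import tempfile
from pathlib import Path

WIDTH = 15  # same fixed width as the awk script


def format_record(record):
    """Return (type, fixed-width line) for one decoded JSON object."""
    rec_type = record.get("Type", "missing_type")
    cells = [
        f"{key:<{WIDTH}}{value!s:<{WIDTH}}"
        for key, value in record.items()
        if key != "Type"  # the Type field only selects the output file
    ]
    return rec_type, "".join(cells)


def split_stream(lines, out_dir):
    """Write each input line to <Type>.fixed.out, one line at a time."""
    handles = {}
    try:
        for raw in lines:
            raw = raw.strip()
            if not raw:
                continue
            try:
                record = json.loads(raw)
                if not isinstance(record, dict):
                    raise ValueError("not a JSON object")
                rec_type, text = format_record(record)
            except ValueError:  # json.JSONDecodeError is a ValueError
                rec_type, text = "json_decode_error", raw
            if rec_type not in handles:
                path = Path(out_dir) / f"{rec_type}.fixed.out"
                handles[rec_type] = open(path, "w", encoding="utf-8")
            handles[rec_type].write(text + "\n")
    finally:
        for fh in handles.values():
            fh.close()


# Demo on the two sample lines from the question.
sample = [
    '{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}\n',
    '{"Type":"Chat","id":"12ABD","Mode:Online"}\n',
]
with tempfile.TemporaryDirectory() as tmp:
    split_stream(sample, tmp)
    outputs = {p.name: p.read_text(encoding="utf-8") for p in Path(tmp).iterdir()}
print(sorted(outputs))
```

The second sample line is not valid JSON ("Mode:Online" has no value), so a strict parser sends it to the error file instead of guessing. To process a real file, pass an open file object as the lines argument.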
perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs'; # for FileCache
use FileCache; # avoid exceeding system's maximum number of file descriptors
use JSON;
my $type;
my $json = JSON->new->utf8(1); #NOTE: expect utf-8 strings
while(my $line = <>) { # for each input line
# extract type
eval { $type = $json->decode($line)->{Type} };
$type = 'json_decode_error' if $@;
$type ||= 'missing_type';
# print to the appropriate file
my $fh = cacheout '>>', "$type.out";
print $fh $line; #NOTE: use cache if there are too many hdd seeks
}
corresponding shell script
#!/bin/bash
#NOTE: bash is used to create non-ascii filenames correctly
__extract_type()
{
perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}
__process_input()
{
local IFS=$'\n'
while read -r line; do # for each input line (-r keeps JSON backslashes intact)
# extract type
local type="$(__extract_type "$line" 2>/dev/null ||
echo json_decode_error)"
[ -z "$type" ] && local type=missing_type
# print to the appropriate file
echo "$line" >> "$type.out"
done
}
__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out