From the command line we pass multiple comma-separated values, such as sydney,delhi,NY,Russia, as an option. These values are stored in $runTest in the Perl script. Now I want to create, from the script, a new file with the contents of $runTest, one value per line. For example:
INPUT (passed values from command line):
sydney,delhi,NY,Russia
OUTPUT (under new file: myfile):
sydney
delhi
NY
Russia
In a simple case like this it is better to split on the delimiter than to use tr. A few minor points: use snake_case for names instead of CamelCase, and use autodie to make open, close, etc. fatal, without the need to clutter the code with or die "...":
use autodie;
my $run_test = 'sydney,delhi,NY,Russia';
open my $out, '>', 'myfile';
print {$out} map { "$_\n" } split /,/, $run_test;
close $out;
For more robust parsing in general, beyond this simple example, prefer specialized modules such as Text::CSV or Text::CSV_XS for CSV parsing. Compared to the overly simplistic split, Text::CSV_XS correctly handles quoted fields, fields containing the delimiter (comma), and binary characters, and it provides error messages, among other things. Example:
use Text::CSV_XS;
use autodie;
open my $out, q{>}, q{myfile};
# The first two strings parse correctly, unlike with "split"; the third,
# with its missing closing quote, demonstrates the error reporting:
# my $run_test = q{sydney,delhi,NY,Russia};
# my $run_test = q{sydney,delhi,NY,Russia,"field,with,commas"};
my $run_test = q{sydney,delhi,NY,Russia,"field,with,commas","field,with,missing,quote};
# binary => 1 : enable parsing binary characters in quoted fields.
# auto_diag => 1 : print the internal error code and the associated error message to STDERR.
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
if ( $csv->parse($run_test) ) {
    print {$out} map { "$_\n" } $csv->fields;
}
else {
    print STDERR q{parse() failed on: }, $csv->error_input, qq{\n};
}
Related
I am working with Perl's Text::CSV library to import data from a CSV file, using the functional interface. The data is stored in an array of hashes, and the problem is that when the script tries to access those elements/keys, they are uninitialized (or undefined).
Using Data::Dumper, it is possible to see that the array and the hashes are not empty; in fact, they are correctly filled with the data from the CSV file.
With this small piece of code, I get the following output:
my $array = csv(
    in      => $csv_file,
    headers => 'auto');

foreach my $mem (@{$array}) {
    print Dumper $mem;
    foreach (keys $mem) {
        print $mem{$_};
    }
}
Last part of the output:
$VAR1 = {
'Column' => '16',
'Width' => '13',
'Type' => 'RAM',
'Depth' => '4096'
};
Use of uninitialized value in print at ** line 81.
Use of uninitialized value in print at ** line 81.
Use of uninitialized value in print at ** line 81.
Use of uninitialized value in print at ** line 81.
This happens with all the elements of the array. Is this problem related to the encoding, or am I simply accessing the elements in an incorrect way?
$mem is a reference to a hash, but you keep trying to use it directly as a hash. Change your code to:
foreach (keys %$mem) {
    print $mem->{$_};
}
There is a slight complication, in that in some versions of Perl, 'keys $mem' was allowed directly as an experimental feature, which was later removed. In any case, adding
use warnings;
use strict;
would likely have given you some helpful clues as to what was happening; under strict, for example, $mem{$_} fails to compile with "Global symbol "%mem" requires explicit package name".
When I run your code on my version of Perl (5.24), I get this error:
Experimental keys on scalar is now forbidden at ... line ...
This points to the line:
foreach (keys $mem) {
You should dereference the hash ref:
use warnings;
use strict;
use Data::Dumper;
use Text::CSV qw( csv );
my $csv_file = "data.csv";

my $array = csv(
    in      => $csv_file,
    headers => 'auto');

foreach my $mem (@{$array}) {
    print Dumper($mem);
    foreach (keys %{ $mem }) {
        print $mem->{$_}, "\n";
    }
}
I have pored over this site (and others) trying to glean the answer for this but have been unsuccessful.
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, auto_diag => 1 } );
my $line = q(data="a=1,b=2",c=3);
my $csvParse = $csv->parse($line);
my @fields = $csv->fields();

for my $field (@fields) {
    print "FIELD ==> $field\n";
}
Here's the output:
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ rec 0 pos 6 field 1
FIELD ==>
I am expecting 2 array elements:
data="a=1,b=2"
c=3
What am I missing?
You may get away with using Text::ParseWords. Since you are not dealing with real CSV, it may be fine. Example:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $line = q(data="a=1,b=2",c=3);
my @fields = quotewords(',', 1, $line);
print Dumper \@fields;
This will print
$VAR1 = [
'data="a=1,b=2"',
'c=3'
];
which is what you requested. You may want to test it further on your data.
Your input data isn't "standard" CSV, at least not the kind that Text::CSV expects and not the kind that programs like Excel produce. A field has to be quoted in its entirety or not at all. The "standard" encoding of your data would be "data=""a=1,b=2""",c=3 (which you can see by asking Text::CSV to print your expected fields, e.g. with say).
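For instance, a quick sketch that round-trips the two expected fields through Text::CSV's combine and string methods makes that encoding visible:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# encode the two fields you expected, the "standard" CSV way
$csv->combine('data="a=1,b=2"', 'c=3') or die $csv->error_diag;
print $csv->string, "\n";   # prints: "data=""a=1,b=2""",c=3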
If you pass the allow_loose_quotes option to the Text::CSV constructor, it won't error on your input, but it won't consider the quotes to be "protecting" the comma either, so you will get three fields, namely data="a=1, then b=2", and c=3, as the sketch below shows.
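A minimal sketch of that behavior (per the module's documentation, allow_loose_quotes wants an escape character different from the quote character, hence escape_char => undef here):

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ allow_loose_quotes => 1, escape_char => undef });
$csv->parse(q(data="a=1,b=2",c=3)) or die $csv->error_diag;
print "FIELD ==> $_\n" for $csv->fields;
# FIELD ==> data="a=1
# FIELD ==> b=2"
# FIELD ==> c=3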
I have a JSON file like this:
"tool_name": {
    "command": "$ENV{TOOL_BIN_DIR}/some_file_name",
    "args": "some args"
}
I am using the JSON module on Perl 5.14 and its decode_json function to read the file and get the data into a Perl hash.
But when I refer to the data that was read, with code like this:
my $cmd = "$data->{tool_name}->{command}";
print $cmd;
I get
$ENV{TOOL_BIN_DIR}/some_file_name
How can I make Perl resolve the value of this variable?
This example uses an environment variable, but in general, if I want to use variables from JSON, how can I do that?
Using eval opens you up to malicious or accidental damage: the string you are executing could contain any Perl code that may do anything at all to your system.
It is preferable to use interpolate from the String::Interpolate module, which uses Perl's own interpolation engine, the one that expands ordinary double-quoted strings at run time.
This program sets up a value for the environment variable TOOL_BIN_DIR and expands all the values in the tool_name hash that contain a dollar $ or at @ sigil.
I've used Data::Dump to display the contents of the data after the interpolation.
You may want to write a recursive subroutine that will process the values of all nested hashes and arrays if you don't know which values are likely to contain something that needs to be expanded.
use strict;
use warnings 'all';
use JSON 'decode_json';
use String::Interpolate 'interpolate';
use Data::Dump 'dd';
my $data = decode_json <<'__END_JSON__';
{
    "tool_name": {
        "command": "$ENV{TOOL_BIN_DIR}/some_file_name",
        "args": "some args"
    }
}
__END_JSON__

$ENV{TOOL_BIN_DIR} = 'tool_dir_test';

for ( values %{ $data->{tool_name} } ) {
    $_ = interpolate($_) if /[\$\@]/;
}

dd $data;
output
{
tool_name => { args => "some args", command => "tool_dir_test/some_file_name" },
}
You can use eval, as in:
my $cmd = '$ENV{HOME}/toto';
print eval('"' . $cmd . '"'), "\n";
But remember, this is very unsafe from a security point of view. You should probably avoid having to do this.
I have two JSON files which come from different OSes.
Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.
One file comes from OS X, and the filename is in NFD form (od -bc):
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
The second contains the same filename, but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called 'different normalization', and there is a CPAN module, Unicode::Normalize, for handling it.
I'm reading both files with the following:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
read_file is from File::Slurp and decode_json from JSON::XS.
After reading the JSON into Perl structures, the filename ends up in key position in one hash and among the values in the other. I need to find where a hash key from the first hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.
I tried the following:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and they produce the same output for me.
Now the questions:
How can I simply read both JSON files so that the same normalization ends up in both $hashrefs?
Or do I need to run something like the following on both hashes after decode_json?
while (my ($k, $v) = each(%$json1)) {
    $copy->{ NFD($k) } = NFD($v);
}
In short:
How can I read different JSON files so that they end up with the same normalization 'inside' the Perl $href? Is it possible to achieve this more nicely than by explicitly applying NFD to each key and value, creating another NFD-normalized (big) copy of the hashes?
Some hints and suggestions, please...
Because my English is very bad, here is a simulation of the problem:
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;
# create two files that contain different "normalizations"
my ($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';

# save as NFC - this one comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;

# save as NFD - this one comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;

# now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file";
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file";

say $jd->{ $jc->{key} } // "NOT FOUND";   # wanted to print "something"

my $jc2;
# is there a better way to do this?
while (my ($k, $v) = each(%$jc)) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NOT FOUND";  # OK
While searching for the right solution to your question, I discovered that the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407.
Anyway, I found a solution for your particular question - how to read JSON containing filenames regardless of their normalization:
Instead of your:
# now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file";
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file";
use the following:
# now read them
my $jc = get_json_from_utf8_file('nfc.json');
my $jd = get_json_from_utf8_file('nfd.json');
...
sub get_json_from_utf8_file {
    my $file = shift;
    return
        decode_json      # parse the JSON into a Perl structure
        encode 'utf8',   # decode_json wants a UTF-8 encoded binary string, so encode it
        NFC              # convert to the precomposed normalization, regardless of the source
        read_file        # the file contains UTF-8 encoded text, so read it accordingly
        $file, { binmode => ':utf8' };
}
This should (at least I hope) ensure that, regardless of which decomposition the JSON content uses, NFC will convert it to the precomposed version and JSON::XS will parse it correctly into the same internal Perl structure.
So your example prints:
something
without traversing the $json.
The idea comes from Joseph Myers and Nemo ;)
Maybe some more skilled programmers will give more hints.
Even though right now it may only be important to convert a few file names to the same normalization for comparison, other unexpected problems could arise from almost anywhere if the JSON data has a different normalization.
So my suggestion is to normalize the entire input from both sources as your first step, before doing any parsing (i.e., at the same time you read the file and before decode_json). This should not corrupt any of your JSON structures, since those are delimited using ASCII characters. Your existing Perl code can then blindly assume all UTF-8 characters have the same normalization.
my $rawdata1 = read_file($file1, { binmode => ':raw' }) or die "...";
my $rawdata2 = read_file($file2, { binmode => ':raw' }) or die "...";

my $json1 = decode_json NFD($rawdata1);
my $json2 = decode_json NFD($rawdata2);
To make this process slightly faster (it should be plenty fast already, since the module uses fast XS procedures), you can find out whether one of the two data files is already in a certain normalization form, and then leave that file unchanged, and convert the other file into that form.
For example:
my $rawdata1 = read_file($file1, { binmode => ':raw' }) or die "...";
my $rawdata2 = read_file($file2, { binmode => ':raw' }) or die "...";

if (checkNFD($rawdata1)) {
    # then you know $file1 is already in Normalization Form D
    # (i.e., it was formed by canonical decomposition),
    # so you only need to convert $file2 into NFD
    $rawdata2 = NFD($rawdata2);
}

my $json1 = decode_json $rawdata1;
my $json2 = decode_json $rawdata2;
Of course, you would have to experiment during development to see whether one of the input files is already in a normalized form; in the final version of the code you would no longer need the conditional, and would simply convert the other input file into the same normalized form.
Also note that it is suggested to produce output in NFC form (if your program produces any output that would be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html
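As a minimal sketch of that last point, recomposing to NFC just before writing results out (the decomposed string below matches the OS X sample from the question):

use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $name = "vetla\x{0301}";   # 'a' plus combining acute: the decomposed (NFD) form
open my $out, '>:utf8', 'result.txt' or die $!;
print {$out} NFC($name);      # stored recomposed, as "vetlá"
close $out;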
Hm. I can't advise a better "programming" solution, but why not simply run
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < freebsd.json >bsdok.json
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json >osxok.json
Now your script can read and use both files, because they are in the same normalization. So instead of searching for a programming solution inside your script, solve the problem before the data enters the script. (The second command is probably unnecessary, since the OS X file should already be in NFD.) A simple conversion at the file level is surely easier than traversing data structures...
Instead of traversing the data structure manually, let a module handle this for you.
Data::Visitor
Data::Rmap
Data::Dmap
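If hash keys need normalizing too (as in this question, where the filename sits in the keys of one hash), a short recursive subroutine may still be the simplest route. A minimal sketch, with a hypothetical helper name nfd_deep, that returns an NFD-normalized deep copy of a decoded JSON structure:

use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# hypothetical helper: NFD-normalize every string in a nested structure,
# hash keys included, returning a copy
sub nfd_deep {
    my ($node) = @_;
    return { map { NFD($_) => nfd_deep($node->{$_}) } keys %$node }
        if ref $node eq 'HASH';
    return [ map { nfd_deep($_) } @$node ]
        if ref $node eq 'ARRAY';
    return defined $node ? NFD($node) : undef;
}

# usage: my $jc2 = nfd_deep($jc);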
I have to extract data from a JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed-width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pairs, I can extract it by processing each line of the JSON file, checking the type, and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5 GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting.
Sample JSON file would look like as below:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
    line = ""
    gsub("[{}\x22]", "", $0)
    f = split($0, a, "[:,]")
    for (i = 1; i <= f; i++)
        if (a[i] == "Type")
            file = a[++i]
        else
            line = line sprintf("%-15s", a[i])
    print line > (file ".fixed.out")
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
Perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs';   # for FileCache
use FileCache;      # avoid exceeding the system's maximum number of file descriptors
use JSON;

my $type;
my $json = JSON->new->utf8(1);   # NOTE: expects UTF-8 strings

while (my $line = <>) {          # for each input line
    # extract the type
    eval { $type = $json->decode($line)->{Type} };
    $type = 'json_decode_error' if $@;
    $type ||= 'missing_type';

    # print to the appropriate file
    my $fh = cacheout '>>', "$type.out";
    print $fh $line;   # NOTE: the cache helps if there are too many hdd seeks
}
Corresponding shell script
#!/bin/bash
# NOTE: bash is used to create non-ascii filenames correctly

__extract_type()
{
    perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}

__process_input()
{
    local IFS=$'\n'
    while read line; do   # for each input line
        # extract the type
        local type="$(__extract_type "$line" 2>/dev/null ||
                      echo json_decode_error)"
        [ -z "$type" ] && local type=missing_type
        # print to the appropriate file
        echo "$line" >> "$type.out"
    done
}

__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out