I am trying to parse a given CSV File, stream on regular base.
My requirement is to Access the Data via ColumName (Header).
The ColumNames are not given in row 1. The ColumNames are given in row 2.
The CSV does have 100 rows but I only need 2 data rows to import.
The separator is a tab.
The following script works for header at row 1 and for all rows in the file
I failed to modify it to header at row 2 and to use only 2 rows or a number of rows.
script:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::Handle::CSV;
use Data::Dumper;
my $file = "data.csv";
my $fh = Tie::Handle::CSV->new ($file, header => 1, sep_char => "\t");
my $hfh = Tie::Handle::CSV->new ($file, header => 0, sep_char => "\t");
my $line = <$hfh>;
my $myheader;
while (my $csv_line = <$fh>)
{
foreach(#{$line})
{
if ( $_ ne "" )
{
print $_ . "=" . $csv_line->{$_} . "\n" ;
}
}
}
The Data.csv could look like:
This is a silly sentence on the first line
Name GivenName Birthdate Number
Meier hans 18.03.1999 1
Frank Thomas 27.1.1974 2
Karl Franz 1.1.2000 3
Here could be something silly again
Thanks for any hint.
best regards
Use Text::CSV_XS instead of Tie::Handle::CSV (Which depends on the module so you have it installed already), read and throw away the first line, use the second line to set column names, and then read the rest of the data:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new({ sep => ",", # Using CSV because TSV doesn't play well with SO formatting
binary => 1});
# Read and discard the first line
$_ = <DATA>;
# Use the next line as the header and set column names
$csv->column_names($csv->getline(*DATA));
# Read some rows and access columns by name instead of position
my $nr = 0;
while (my $record = $csv->getline_hr(*DATA)) {
last if ++$nr == 4;
say "Row $nr: $record->{GivenName} was born on $record->{Birthdate}";
}
__DATA__
This is a silly sentence on the first line
Name,GivenName,Birthdate,Number
Meier,hans,18.03.1999,1
Frank,Thomas,27.1.1974,2
Karl,Franz,1.1.2000,3
Here could be something silly again
Tie::Handle::CSV accepts a filehandle instead of a filename. You can skip the first line by reading one line from it before you pass the filehandle to Tie::Handle::CSV:
use strict;
use warnings;
use Tie::Handle::CSV;
use Data::Dumper;
my $file = "data.csv";
open (my $infile, '<',$file) or die "can't open file $file: $!\n";
<$infile>; # skip first line
my $hfh = Tie::Handle::CSV->new ($infile, header => 1, sep_char => "\t");
my #csv;
my $num_lines = 3;
while ($num_lines--){
my $line = <$hfh>;
push #csv, $line;
}
print Dumper \#csv;
thanks to you both.
To clarify more detail my requirements.
The original Data File does have maybe 100 Colums with dynamic unknown Names for me.
I will create a list of Colums/Attribute from a other Service for which this script should provide the data content of some rows.
Request is in Terms of the data example:
Please provide all Names and all Birthdates of the first 25 Rows.
The next Request could be all Names and Givennames of the first 10 rows.
That means from the content of 100 Columns I have to provide the content for two, four, five Columns only.
The output I use (foreach), is only to test the Access by ColumName to the content of rows.
I mixed up your solution and stayed with Tie::Handle::CSV.
At the moment I have to use the two filehandles- Maybe you have a hint to be more effective.
#!/usr/bin/perl
use strict;
use warnings;
use Tie::Handle::CSV;
use Data::Dumper;
my $file = "data.csv";
open (my $infile, '<',$file) or die "can't open file $file: $!\n";
open (my $secfile, '<',$file) or die "can't open file $file: $!\n";
<$infile>; # skip first line
<$secfile>;
my $fh = Tie::Handle::CSV->new ($secfile, header => 1, sep_char => "\t");
my $hfh = Tie::Handle::CSV->new ($infile, header => 0, sep_char => "\t");
my $line = <$hfh>;
my $numberoflines = 2 ;
while ($numberoflines-- )
{
my $csv_line = <$fh> ;
foreach(#{$line})
{
if ( $_ ne "" )
{
print $_ . "=" . $csv_line->{$_} . "\n" ;
}
}
}
thanks got it running with "keys %$csv_line". I was not using because of missing knowlegde. ;-)
#!/usr/bin/perl
use strict;
use warnings;
use Tie::Handle::CSV;
my $file = "data.csv";
open (my $secfile, '<',$file) or die "can't open file $file: $!\n";
<$secfile>;
my $fh = Tie::Handle::CSV->new ($secfile, header => 1, sep_char => "\t");
my $numberoflines = 3 ;
while ($numberoflines-- )
{
my $csv_line = <$fh> ;
my #Columns = keys %{ $csv_line } ;
foreach (#Columns )
{
if ( $_ ne "" )
{
print $_ . "=" . $csv_line->{$_} . "\n" ;
}
}
print "-----------\n"
}
On last question:
The File I Read will be filled and modified by an other program.
What can I do to detect the File violation in case it makes a problem.
And I dont what the my script dies.
Thanks
regards
I have 2 csv files of multiple fields(approx 30 fields), and huge size ( approx 4GB ).
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
I want to compare these 2 files and report the differences into another csv file. In the above example, Employee Vivek and Karthick details are present in different row numbers but still the record data is same, so it should be considered as match. Employee Vinoth record should be considered as a mismatch since there is a mismatch in the address.
Output diff.csv file can contain the mismatched record from the File1 and File 2 as below.
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
I've written the code so far as below. After this I'm confused which option to choose whether a Binary Search or any other efficient way to do this. Could you please help me?
My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );
# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
or die("Error in command line arguments\n");
if ( !defined $opts{'file1'} ) {
die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
die "CSV file --file2 not specified.\n";
}
my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';
print $file2 . "\n";
my $csv1 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
open( my $fh1, '<:encoding(utf8)', $file1 )
or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );
my $f1line = undef;
my $f2line = undef;
my $header1 = undef;
my $f1empty = 'false';
my $f2empty = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
while (1) {
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
if ( !defined $f1line ) {
$f1empty = 'true';
}
if ( $f1empty eq 'true' ) {
last;
}
else {
## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
}
}
print "End of Program" . "\n";
Storing data of such magnitude in database is most correct approach to tasks of this kind. At minimum SQLlite is recommended but other databases MariaDB, MySQL, PostgreSQL will work quite well.
Following code demonstrates how desired output can be achieved without special modules, but it does not take in account possibly messed up input data. This script will report data records as different even if difference can be just one extra space.
Default output is into console window unless you specify option output.
NOTE: Whole file #1 is read into memory, please be patient processing big files can take a while.
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
my %opt;
my #args = (
'file1|f1=s',
'file2|f2=s',
'output|o=s',
'debug|d',
'help|?',
'man|m'
);
GetOptions( \%opt, #args ) or pod2usage(2);
print Dumper(\%opt) if $opt{debug};
pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};
pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};
unlink $opt{output} if defined $opt{output} and -f $opt{output};
compare($opt{file1},$opt{file2});
sub compare {
my $fname1 = shift;
my $fname2 = shift;
my $hfile1 = file2hash($fname1);
open my $fh, '<:encoding(utf8)', $fname2
or die "Couldn't open $fname2";
while(<$fh>) {
chomp;
next unless /^(.*?),(.*)$/;
my($key,$data) = ($1, $2);
if( !defined $hfile1->{$key} ) {
my $msg = "$fname1 $key is missing";
say_msg($msg);
} elsif( $data ne $hfile1->{$key} ) {
my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
say_msg($msg);
}
}
}
sub say_msg {
my $msg = shift;
if( $opt{output} ) {
open my $fh, '>>:encoding(utf8)', $opt{output}
or die "Couldn't to open $opt{output}";
say $fh $msg;
close $fh;
} else {
say $msg;
}
}
sub file2hash {
my $fname = shift;
my %hash;
open my $fh, '<:encoding(utf8)', $fname
or die "Couldn't open $fname";
while(<$fh>) {
chomp;
next unless /^(.*?),(.*)$/;
$hash{$1} = $2;
}
close $fh;
return \%hash;
}
__END__
=head1 NAME
comp_cvs - compares two CVS files and stores differense
=head1 SYNOPSIS
comp_cvs.pl -f1 file1.cvs -f2 file2.cvs -o diff.txt
Options:
-f1,--file1 input CVS filename #1
-f2,--file2 input CVS filename #2
-o,--output output filename
-d,--debug output debug information
-?,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-f1,--file1>
Input CVS filename #1
=item B<-f2,--file2>
Input CVS filename #2
=item B<-o,--output>
Output filename
=item B<-d,--debug>
Print debug information.
=item B<-?,--help>
Print a brief help message and exits.
=item B<--man>
Prints the manual page and exits.
=back
=head1 DESCRIPTION
B<This program> accepts B<input> and processes to B<output> with purpose of achiving some goal.
=head1 EXIT STATUS
The section describes B<EXIT STATUS> codes of the program
=head1 ENVIRONMENT
The section describes B<ENVIRONMENT VARIABLES> utilized in the program
=head1 FILES
The section describes B<FILES> which used for program's configuration
=head1 EXAMPLES
The section demonstrates some B<EXAMPLES> of the code
=head1 REPORTING BUGS
The section provides information how to report bugs
=head1 AUTHOR
The section describing author and his contanct information
=head1 ACKNOWLEDGMENT
The section to give credits people in some way related to the code
=head1 SEE ALSO
The section describing related information - reference to other programs, blogs, website, ...
=head1 HISTORY
The section gives historical information related to the code of the program
=head1 COPYRIGHT
Copyright information related to the code
=cut
Output for test files
file1.cvs Vinoth,12,2548.245,"140,North Street,India"
file2.cvs Vinoth,12,2548.245,"140,North Street,USA"
#!/usr/bin/env perl
use Data::Dumper;
use Digest::MD5;
use 5.01800;
use warnings;
my %POS;
my %chars;
open my $FILEA,'<',q{FileA.txt}
or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
or die "Can't open 'OnlyInB.txt' for writing! $!";
<$FILEA>,
$POS{FILEA}=tell $FILEA;
<$FILEB>,
$POS{FILEB}=tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';
{ # Scan for first character of the records involved
while (<$FILEA>) {
$chars{substr($_,0,1)}++;
};
while (<$FILEB>) {
$chars{substr($_,0,1)}--;
};
# So what characters do we need to deal with?
warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
};
my #chars=sort keys %chars;
{
my %_h;
# For each of the characters in our character set
for my $char (#chars) {
warn Data::Dumper->Dump([\$char],[qw(*char)]),' ';
# Beginning of data sections
seek $FILEA,$POS{FILEA},0;
seek $FILEB,$POS{FILEB},0;
%_h=();
my $pos=tell $FILEA;
while (<$FILEA>) {
next
unless (substr($_,0,1) eq $char);
# for each record save the lengthAndMD5 as the key and its start as the value
$_h{lengthAndMD5(\$_)}=$pos;
$pos=tell $FILEA;
};
my $_s;
while (<$FILEB>) {
next
unless (substr($_,0,1) eq $char);
if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
print {$InBoth} $_;
delete $_h{$_s};
}
else { # (Not in FILEA) It's only in FILEB
print {$OnlyInB} $_;
}
};
# only in FILEA
warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
for my $key (keys %_h) { # Only in FILEA
seek $FILEA,delete $_h{$key},0;
print {$OnlyInA} scalar <$FILEA>;
};
# Should be empty
warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
};
};
close $OnlyInB
or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
or die "Could NOT close 'FileA.txt' after reading! $!";
exit;
sub lengthAndMD5 {
return sprintf("%8.8lx-%32.32s",length(${$_[0]}),Digest::MD5::md5_hex(${$_[0]}));
};
__END__
My custom code (on Perl) give next wrong JSON, missing comma between blocks:
{
"data": [{
"{#LOGFILEPATH}": "/tmp/QRZ2007.tcserverlogs",
"{#LOGFILE}": "QRZ2007"
} **missing comma** {
"{#LOGFILE}": "ARZ2007",
"{#LOGFILEPATH}": "/tmp/ARZ2007.tcserverlogs"
}]
}
My terrible code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use utf8;
use JSON;
binmode STDOUT, ":utf8";
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
opendir(DIR, $dir) or die $!;
print '{"data": [';
while (my $file = readdir(DIR)) {
next unless (-f "$dir/$file");
next unless ($file =~ m/\.tcserverlogs$/);
my $fullPath = "$dir/$file";
my $filenameshort = basename($file, ".tcserverlogs");
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
print $json->encode($data_to_json);
}
print ']}'."\n";
closedir(DIR);
exit 0;
Dear Team i am not a programmer, please any idea how fix it, thank you!
If you do not print a comma, you will not get a comma.
You are trying to build your own JSON string from pre-encoded pieces of smaller data structures. That will not work unless you tell Perl when to put commas. You could do that, but it's easier to just collect all the data into a Perl data structure that is equivalent to the JSON string you want to produce, and encode the whole thing in one go when you're done.
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
my #data;
opendir( DIR, $dir ) or die $!;
while ( my $file = readdir(DIR) ) {
next unless ( -f "$dir/$file" );
next unless ( $file =~ m/\.tcserverlogs$/ );
my $fullPath = "$dir/$file";
my $filenameshort = basename( $file, ".tcserverlogs" );
my $data_to_json = { "{#LOGFILEPATH}" => $fullPath, "{#LOGFILE}" => $filenameshort };
push #data, $data_to_json;
}
closedir(DIR);
print $json->encode( { data => \#data } );
Why do I get this error, when I use the pretty print version?
'"' expected, at character offset 2 (before "(end of string)") at ./perl.pl line 29.
#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';
use Data::Dumper;
use JSON;
my $json = JSON->new->utf8;
my $hashref = {
'muster, hanß' => {
'hello' => {
year => 2000,
color => 'green'
}
}
};
my $utf8_encoded_json_text = $json->pretty->encode( $hashref ); # leads to a die
#my $utf8_encoded_json_text = $json->encode( $hashref ); # works
open my $fh, '>', 'testfile.json' or die $!;
print $fh $utf8_encoded_json_text;
close $fh;
open $fh, '<', 'testfile.json' or die $!;
$utf8_encoded_json_text = readline $fh;
close $fh;
$hashref = decode_json( $utf8_encoded_json_text );
say Dumper $hashref;
Because when you read the file back in, you're using readline, and only reading the first line of the file. When pretty is off, the entire output is on one line. When pretty is on, the JSON is spread out over multiple lines, so you're passing invalid truncated JSON to decode_json.
Read the entire content by using local $/ = undef; or slurp or whatever else you want.
I have an html page that has particular text that I want to parse into a databse using a Perl Script.
I want to be able to strip off all the stuff I don't want, an exmple of the html is-
<div class="postbody">
<h3><a href "foo">Re: John Smith <span class="posthilit">England</span></a></h3>
<div class="content">Is C# better than Visula Basic?</div>
</div>
Therefore I would want to import into the database
Name: John Smith.
Lives in: England.
Commented: Is C# better than Visula Basic?
I have started to create a Perl script but it needs to be changed to work for what I want;
use DBI;
open (FILE, "list") || die "couldn't open the file!";
open (F1, ">list.csv") || die "couldn't open the file!";
print F1 "Name\|Lives In\|Commented\n";
while ($line=<FILE>)
{
chop($line);
$text = "";
$add = 0;
open (DATA, $line) || die "couldn't open the data!";
while ($data=<DATA>)
{
if ($data =~ /ds\-div/)
{
$data =~ s/\,//g;
$data =~ s/\"//g;
$data =~ s/\'//g;
$text = $text . $data;
}
}
#p = split(/\\/, $line);
print F1 $p[2];
print F1 ",";
print F1 $p[1];
print F1 ",";
print F1 $p[1];
print F1 ",";
print F1 "\n";
$a = $a + 1;
Any input would be greatly appreciated.
Please do not use regular expressions to parse HTML as HTML is not a regular language. Regular expressions describe regular languages.
It is easy to parse HTML with HTML::TreeBuilder (and its family of modules):
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content(
do { local $/; <DATA> }
);
for ( $tree->look_down( 'class' => 'postbody' ) ) {
my $location = $_->look_down( 'class' => 'posthilit' )->as_trimmed_text;
my $comment = $_->look_down( 'class' => 'content' )->as_trimmed_text;
my $name = $_->look_down( '_tag' => 'h3' )->as_trimmed_text;
$name =~ s/^Re:\s*//;
$name =~ s/\s*$location\s*$//;
print "Name: $name\nLives in: $location\nCommented: $comment\n";
}
__DATA__
<div class="postbody">
<h3>Re: John Smith <span class="posthilit">England</span></h3>
<div class="content">Is C# better than Visual Basic?</div>
</div>
Output
Name: John Smith
Lives in: England
Commented: Is C# better than Visual Basic?
However, if you require much more control, have a look at HTML::Parser as has already been answered by ADW.
Use an HTML parser, like HTML::TreeBuilder to parse the HTML--don't do it yourself.
Also, don't use two-arg open with global handles, don't use chop--use chomp (read the perldoc to understand why). Find yourself a newer tutorial. You are using a ton of OLD OLD OLD Perl. And damnit, USE STRICT and USE WARNINGS. I know you've been told to do this. Do it. Leaving it out will do nothing but buy you pain.
Go. Read. Modern Perl. It is free.
my $page = HTML::TreeBuilder->new_from_file( $file_name );
$page->elementify;
my #posts;
for my $post ( $page->look_down( class => 'postbody' ) ) {
my %post = (
name => get_name($post),
loc => get_loc($post),
comment => get_comment($post),
);
push #posts, \%post;
}
# Persist #posts however you want to.
sub get_name {
my $post = shift;
my $name = $post->look_down( _tag => 'h3' );
return unless defined $name;
$name->look_down->(_tag=>'a');
return unless defined $name;
$name = ($name->content_list)[0];
return unless defined $name;
$name =~ s/^Re:\s*//;
$name =~ /\s*$//;
return $name;
}
sub get_loc {
my $post = shift;
my $loc = $post->look_down( _tag => 'span', class => 'posthilit' );
return unless defined $loc;
return $loc->as_text;
}
sub get_comment {
my $post = shift;
my $com = $post->look_down( _tag => 'div', class => 'content' );
return unless defined $com;
return $com->as_text;
}
Now you have a nice data structure with all your post data. You can write it to CSV or a database or whatever it is you really want to do. You seem to be trying to do both.
You'd be much better using the HTML::Parser module from the CPAN.