Efficient code to do a countif and ranking on a huge CSV file
I need help accelerating the Perl code below. A runtime of 4 to 6 hours would be fine; faster is better :)
The CSV file has about 14M to 14.5M rows and around 1100 to 1500 columns; it is 62 GB.
What it does:
- count each value per column (like COUNTIF in Excel)
- get the percent (based on 14M rows)
- get the rank based on the count
My current code:
use List::MoreUtils qw(uniq);

$x   = "Room_reserve";
$in  = "D:\\package properties\\${x}.csv";
$out = "D:\\package properties\\output\\${x}_output.csv";

open($fh, '<', $in) or die "Could not open file '$in' $!";
@data = <$fh>;
close($fh);

%counts;
@columns;
$first = 1;

#counter
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',', $dat);
    if ($first == 1) {
        $first = 0;
        next;
    }
    else {
        $count = 1;
        foreach $i (0..$#rows) {
            if ( exists($columns[$i]{$rows[$i]}) ) {
                $columns[$i]{$rows[$i]}++;
            }
            else {
                $columns[$i]{$rows[$i]} = int($count);
            }
        }
    }
}

#output
$first = 1;
open($fh, '>', $out) or die "Could not open file '$out' $!";
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',', $dat);
    foreach $i (0..$#rows) {
        if ($i > 6) {
            #for modifying name
            if ( $first == 1 ) {
                $line = join( ",", "Rank_$rows[$i]", "Percent_$rows[$i]",
                              "Count_$rows[$i]", $rows[$i]);
                print $fh "$line,";
                if ( $i == $#rows ) {
                    $first = 0;
                }
            }
            else {
                @dat_val = reverse sort { $a <=> $b } values %{$columns[$i]};
                %ranks = ();
                $rank_cnt = 0;
                foreach $val (@dat_val) {
                    if ( ! exists($ranks{$val}) ) {
                        $rank_cnt++;
                    }
                    $ranks{$val} = $rank_cnt;
                }
                $rank = $ranks{$columns[$i]{$rows[$i]}};
                $cnt  = $columns[$i]{$rows[$i]};
                $ave  = ($cnt / 14000000) * 100;
                $line = join( ",", $rank, $ave, $cnt, $rows[$i]);
                print $fh "$line,";
            }
        }
        else {
            print $fh "$rows[$i],";
        }
    }
    print $fh "\n";
}
close($fh);
thanks in advance.
My table:

Col,_1,_2,_3,_4,_5,_6,Col2,Col3,Col,Col5
FALSE,1,2,3,4,5,6,6,6,1,4
FALSE,1,2,3,4,5,6,6,6,1,4
FALSE,1,2,3,4,5,7,6,6,1,3
Edited to show a sample table and to correct $x.
Sample output:

Col,_1,_2,_3,_4,_5,_6,Col2,rank_Col2,percent_rank_Col2,count_Col2,Col3,rank_Col3,percent_rank_Col3,count_Col3,Col,rank_Col,percent_rank_Col,count_Col,Col5,rank_Col5,percent_rank_Col5,count_Col5
FALSE,1,2,3,4,5,6,9,2,0.33,1,6,1,0.67,2,1,1,0.67,2,11,1,0.33,1
FALSE,1,2,3,4,5,6,6,1,0.67,2,6,1,0.67,2,2,2,0.33,1,4,1,0.33,1
FALSE,1,2,3,4,5,7,6,1,0.67,2,4,2,0.33,1,1,1,0.67,2,3,1,0.33,1
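To make the arithmetic concrete, here is a small Python sketch (not the poster's Perl) that reproduces the Col2-related columns of the sample output above, whose Col2 values across the three data rows are 9, 6, 6. Note that the sample output shows percent as a fraction of the rows, while the posted Perl code multiplies by 100; the sketch follows the sample output.

```python
from collections import Counter

col2 = ["9", "6", "6"]   # Col2 values from the sample output above
total_rows = len(col2)   # 14_000_000 for the real file

counts = Counter(col2)   # the "countif": {"6": 2, "9": 1}

# Dense rank: the highest count gets rank 1, the next distinct count rank 2, ...
rank = {c: i + 1 for i, c in enumerate(sorted(set(counts.values()), reverse=True))}

# (value, rank, percent, count) per row, matching the sample output's Col2 columns
rows = [(v, rank[counts[v]], round(counts[v] / total_rows, 2), counts[v]) for v in col2]
# rows == [("9", 2, 0.33, 1), ("6", 1, 0.67, 2), ("6", 1, 0.67, 2)]
```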
Presume you have this file:
% ls -lh file
-rw-r--r-- 1 dawg wheel 57G Jul 24 13:15 file
% time wc -l file
14000000 file
wc -l file 29.24s user 7.27s system 99% cpu 36.508 total
% awk -F, 'FNR==1{print $NF; exit}' file
1099
So we have a 57 GB CSV file with 14,000,000 lines and 1099 columns of random numbers.
It takes only 20 to 30 SECONDS to read the entire file line by line.
How long in Perl?
% time perl -lnE '' file
perl -lnE '' file 5.70s user 9.67s system 99% cpu 15.381 total
So only 15 seconds in Perl line by line. How long to 'gulp' it?
% time perl -0777 -lnE '' file
perl -0777 -lnE '' file 12.13s user 23.86s system 98% cpu 36.688 total
But that is on THIS computer which has 255GB of RAM...
It took this Python script approximately 23 minutes to write that file:
total = 0
cols = 1100
with open('/tmp/file', 'w') as f_out:
    for cnt in range(14_000_000):
        line = ','.join(f'{x}' for x in range(cols))
        total += cols
        f_out.write(f'{cnt},{total},{line}\n')
The issue with your Perl script is that you are gulping the whole file (with @data = <$fh>;) and the OS may be running out of RAM. It is likely swapping, which is very slow. (How much RAM do you have?)
Rewrite the script to process the file line by line. You should be able to do your entire analysis in less than 1 hour.
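The line-by-line, two-pass idea can be sketched as follows (a Python illustration with a tiny hypothetical in-memory CSV, not the actual 62 GB file): only the per-column value-to-count hashes live in memory, never the rows themselves.

```python
import csv
import io
from collections import Counter

# A tiny in-memory CSV standing in for the huge file (hypothetical data)
raw = "id,val\n1,x\n2,y\n3,x\n"

# Pass 1: stream the file row by row, counting values per column
with io.StringIO(raw) as fh:
    reader = csv.reader(fh)
    header = next(reader)
    counters = [Counter() for _ in header]
    for row in reader:
        for i, val in enumerate(row):
            counters[i][val] += 1

# Pass 2 would re-open the file and stream it the same way, emitting
# rank/percent/count for each cell from the precomputed counters,
# so the whole file is never held in memory at once.
```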
Can you try the following script? It precomputes the ranks instead of recalculating them for each row. It also avoids saving the first 7 columns in @columns, since they do not appear to be used for anything.
use feature qw(say);
use strict;
use warnings;

{
    my $x   = "Room_reserve";
    my $in  = "D:\\package properties\\${x}.csv";
    my $out = "D:\\package properties\\output\\${x}_output.csv";
    my $min_col = 7;
    my ($data1, $data2) = read_input_file($in, $min_col);
    my $columns = create_counter($data2);
    my $ranks = create_ranks($columns);
    write_output($data1, $data2, $columns, $ranks, $min_col, $out);
}

sub create_counter {
    my ($data) = @_;
    print "Creating counter..";
    my @columns;
    my $first = 1;
    for my $row (@$data) {
        if ($first) {
            $first = 0; next;  # skip header
        }
        for my $i (0..$#$row) {
            $columns[$i]{$row->[$i]}++;
        }
    }
    say "done.";
    return \@columns;
}

sub create_ranks {
    my ($columns) = @_;
    print "Creating ranks..";
    my @ranks;
    for my $col (@$columns) {
        # Sort the column's counts by decreasing frequency and assign each
        # distinct count a dense rank, so tied counts share the same rank
        # (matching the original script's behavior).
        my %ranks;
        my $idx = 0;
        for my $freq (sort { $b <=> $a } values %$col) {
            $ranks{$freq} = ++$idx unless exists $ranks{$freq};
        }
        push @ranks, \%ranks;
    }
    say "done.";
    return \@ranks;
}

sub read_input_file {
    my ($in, $min_col) = @_;
    print "Reading input file $in ..";
    open(my $fh, '<', $in) or die "Could not open file '$in': $!";
    my @data1;
    my @data2;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split(',', $line);
        next if @fields == 0;
        die "Unexpected column count at line $.\n" if @fields < $min_col;
        push @data1, [@fields[0..($min_col-1)]];
        push @data2, [@fields[$min_col..$#fields]];
    }
    say " $. lines.";
    close($fh);
    return \@data1, \@data2;
}

sub write_output {
    my ($data1, $data2, $columns, $ranks, $min_col, $out) = @_;
    say "Saving output file $out..";
    open(my $fh, '>', $out) or die "Could not open file '$out': $!";
    for my $i (0..$#$data1) {
        my $row1 = $data1->[$i];
        my $row2 = $data2->[$i];
        if ($i == 0) {
            print $fh join ",", @$row1;
            print $fh ",";
            for my $j (0..$#$row2) {
                print $fh join ",", $row2->[$j],
                                    "Rank_"    . $row2->[$j],
                                    "Percent_" . $row2->[$j],
                                    "Count_"   . $row2->[$j];
                print $fh "," if $j < $#$row2;
            }
            print $fh "\n";
            next;
        }
        print $fh join ",", @$row1;
        print $fh ",";
        for my $j (0..$#$row2) {
            my $cnt  = $columns->[$j]{$row2->[$j]};
            my $rank = $ranks->[$j]{$cnt};
            my $ave  = ($cnt / 14000000) * 100;
            print $fh join ",", $row2->[$j], $rank, $ave, $cnt;
            print $fh "," if $j < $#$row2;
        }
        print $fh "\n";
    }
    close $fh;
    say "Done.";
}
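One detail worth unit-testing in any rewrite is tie handling: distinct values whose counts are equal must share a rank, as in the sample output where both values with a count of 2 get rank 1. A minimal Python check of that dense-rank property, using made-up data:

```python
from collections import Counter

# Hypothetical column where two values tie on frequency
col = ["a", "a", "b", "b", "c"]
counts = Counter(col)  # a: 2, b: 2, c: 1

# Dense rank over *distinct* counts, highest first
rank = {f: i + 1 for i, f in enumerate(sorted(set(counts.values()), reverse=True))}

assert rank[counts["a"]] == rank[counts["b"]] == 1  # tied values share rank 1
assert rank[counts["c"]] == 2                       # next distinct count is rank 2
```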