Efficient code to do a countif, ranking on huge csv file

I need help speeding up the Perl code below. A run time of 4 to 6 hours would be fine; faster is better :)
The CSV file has about 14 to 14.5 million rows and around 1,100 to 1,500 columns; it is about 62 GB.
What it does:
count the occurrences of each value per column (like COUNTIF in Excel)
get the percentage (based on 14 million rows)
get the rank based on the count
My current code:
use List::MoreUtils qw(uniq);
$x="Room_reserve";
$in = "D:\\package properties\\${x}.csv";
$out = "D:\\package properties\\output\\${x}_output.csv";
open($fh, '<', $in) or die "Could not open file '$file' $!";
@data = <$fh>;
close($fh);
%counts;
@columns;
$first = 1;
#counter
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',',$dat);
    if ($first == 1) {
        $first = 0;
        next;
    }
    else {
        $count = 1;
        foreach $i (0..$#rows) {
            if ( exists($columns[$i]{$rows[$i]}) ) {
                $columns[$i]{$rows[$i]}++;
            }
            else {
                $columns[$i]{$rows[$i]} = int($count);
            }
        }
    }
}
#output
$first = 1;
open($fh, '>', $out) or die "Could not open file '$file' $!";
foreach $dat (@data) {
    chomp($dat);
    @rows = split(',',$dat);
    foreach $i (0..$#rows) {
        if ($i > 6) {
            #for modifying name
            if ( $first == 1 ) {
                $line = join( ",", "Rank_$rows[$i]", "Percent_$rows[$i]",
                              "Count_$rows[$i]", $rows[$i]);
                print $fh "$line,";
                if ( $i == $#rows ) {
                    $first = 0;
                }
            }
            else {
                @dat_val = reverse sort { $a <=> $b } values %{$columns[$i]};
                %ranks = {};
                $rank_cnt = 0;
                foreach $val (@dat_val) {
                    if ( ! exists($ranks{$val}) ) {
                        $rank_cnt++;
                    }
                    $ranks{$val} = $rank_cnt;
                }
                $rank = $ranks{$columns[$i]{$rows[$i]}};
                $cnt = $columns[$i]{$rows[$i]};
                $ave = ($cnt / 14000000) * 100;
                $line = join( ",", $rank, $ave, $cnt, $rows[$i]);
                print $fh "$line,";
            }
        }
        else {
            print $fh "$rows[$i],";
        }
    }
    print $fh "\n";
}
close($fh);
thanks in advance.
My table:
Col,_1,_2,_3,_4,_5,_6,Col2,Col3,Col,Col5
FALSE,1,2,3,4,5,6,6,6,1,4
FALSE,1,2,3,4,5,6,6,6,1,4
FALSE,1,2,3,4,5,7,6,6,1,3
edited to show sample table and correct $x
Sample output:
Col,_1,_2,_3,_4,_5,_6,Col2,rank_Col2,percent_rank_Col2,count_Col2,Col3,rank_Col3,percent_rank_Col3,count_Col3,Col,rank_Col,percent_rank_Col,count_Col,Col5,rank_Col5,percent_rank_Col5,count_Col5
FALSE,1,2,3,4,5,6,9,2,0.33,1,6,1,0.67,2,1,1,0.67,2,11,1,0.33,1
FALSE,1,2,3,4,5,6,6,1,0.67,2,6,1,0.67,2,2,2,0.33,1,4,1,0.33,1
FALSE,1,2,3,4,5,7,6,1,0.67,2,4,2,0.33,1,1,1,0.67,2,3,1,0.33,1

Presume you have this file:
% ls -lh file
-rw-r--r-- 1 dawg wheel 57G Jul 24 13:15 file
% time wc -l file
14000000 file
wc -l file 29.24s user 7.27s system 99% cpu 36.508 total
% awk -F, 'FNR==1{print $NF; exit}' file
1099
So we have a 57 GB CSV file with 14,000,000 lines and roughly 1,100 columns of numbers.
It takes only about half a minute to read the entire file line by line.
How long in Perl?
% time perl -lnE '' file
perl -lnE '' file 5.70s user 9.67s system 99% cpu 15.381 total
So only 15 seconds in Perl line by line. How long to 'gulp' it?
% time perl -0777 -lnE '' file
perl -0777 -lnE '' file 12.13s user 23.86s system 98% cpu 36.688 total
But that is on THIS computer which has 255GB of RAM...
It took this Python script approximately 23 minutes to write that file:
total=0
cols=1100
with open('/tmp/file', 'w') as f_out:
    for cnt in range(14_000_000):
        line=','.join(f'{x}' for x in range(cols))
        total+=cols
        f_out.write(f'{cnt},{total},{line}\n')
The issue with your Perl script is that it gulps the whole file (with @data = <$fh>;), and the OS is probably running out of RAM. It is likely swapping, which is very slow. (How much RAM do you have?)
Rewrite the script to process the file line by line. You should be able to do your entire analysis in less than 1 hour.
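As a rough illustration of the line-by-line approach, here is a minimal two-pass sketch in Python (the language of the generator script above). It is not the asker's code: the function name `annotate_csv` and the `min_col` parameter are made up for this example, it assumes fields contain no quoted commas or embedded newlines, and it divides by the actual row count instead of a hard-coded 14,000,000. Only the per-column value counts are kept in memory, so it is suitable only if each column has a manageable number of distinct values.

```python
import csv
from collections import Counter

def annotate_csv(in_path, out_path, min_col=7):
    """Two streaming passes; only per-column value counts are held in memory."""
    # Pass 1: count how often each value occurs in every column >= min_col.
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        counts = [Counter() for _ in header[min_col:]]
        total = 0
        for row in reader:
            total += 1
            for counter, value in zip(counts, row[min_col:]):
                counter[value] += 1

    # Dense rank per column: the highest count gets rank 1, ties share a rank.
    ranks = []
    for counter in counts:
        distinct = sorted(set(counter.values()), reverse=True)
        ranks.append({c: r for r, c in enumerate(distinct, start=1)})

    # Pass 2: re-read the input and write each row with rank, percent, count.
    with open(in_path, newline="") as f, open(out_path, "w", newline="") as g:
        reader = csv.reader(f)
        writer = csv.writer(g)
        next(reader)  # skip the input header
        out_header = header[:min_col]
        for name in header[min_col:]:
            out_header += [name, f"Rank_{name}", f"Percent_{name}", f"Count_{name}"]
        writer.writerow(out_header)
        for row in reader:
            out = row[:min_col]
            for counter, rank, value in zip(counts, ranks, row[min_col:]):
                cnt = counter[value]
                out += [value, rank[cnt], round(100 * cnt / total, 2), cnt]
            writer.writerow(out)
    return total
```

Reading the file twice costs extra I/O, but given the read timings above that is likely far cheaper than holding 57 GB of rows in memory.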

Can you try the following script? It precomputes the ranks instead of recalculating them for each row. It also avoids saving the first 7 columns in @columns, since they appear not to be used for anything.
use feature qw(say);
use strict;
use warnings;
{
my $x="Room_reserve";
my $in = "D:\\package properties\\${x}.csv";
my $out = "D:\\package properties\\output\\${x}_output.csv";
my $min_col = 7;
my ($data1, $data2) = read_input_file($in, $min_col);
my $columns = create_counter($data2);
my $ranks = create_ranks( $columns);
write_output($data1, $data2, $columns, $ranks, $min_col, $out);
}
sub create_counter {
    my ($data) = @_;
    print "Creating counter..";
    my @columns;
    my $first = 1;
    for my $row (@$data) {
        if ( $first ) {
            $first = 0; next; # skip header
        }
        for my $i (0..$#$row) {
            $columns[$i]{$row->[$i]}++;
        }
    }
    say "done.";
    return \@columns;
}
sub create_ranks {
    my ( $columns ) = @_;
    print "Creating ranks..";
    my @ranks;
    for my $col (@$columns) {
        # Sort the frequencies of the column values, highest first.
        my @freqs = sort { $b <=> $a } values %$col;
        # Dense rank: equal frequencies share one rank, as in the original script.
        my $idx = 0;
        my %ranks;
        $ranks{$_} //= ++$idx for @freqs;
        push @ranks, \%ranks;
    }
    say "done.";
    return \@ranks;
}
sub read_input_file {
    my ($in, $min_col) = @_;
    print "Reading input file $in ..";
    open(my $fh, '<', $in) or die "Could not open file '$in' $!";
    my @data1;
    my @data2;
    while( my $line = <$fh> ) {
        chomp $line;
        my @fields = split(',', $line);
        next if @fields == 0;
        die "Unexpected column count at line $.\n" if @fields < $min_col;
        push @data1, [@fields[0..($min_col-1)]];
        push @data2, [@fields[$min_col..$#fields]];
    }
    say " $. lines.";
    close($fh);
    return \@data1, \@data2;
}
sub write_output {
    my ($data1, $data2, $columns, $ranks, $min_col, $out) = @_;
    say "Saving output file $out..";
    open(my $fh, '>', $out) or die "Could not open file '$out': $!";
    for my $i (0..$#$data1) {
        my $row1 = $data1->[$i];
        my $row2 = $data2->[$i];
        if ($i == 0) {
            print $fh join ",", @$row1;
            print $fh ",";
            for my $j (0..$#$row2) {
                print $fh join ",", $row2->[$j],
                    "Rank_" . $row2->[$j],
                    "Percent_". $row2->[$j],
                    "Count_" . $row2->[$j];
                print $fh "," if $j < $#$row2;
            }
            print $fh "\n";
            next;
        }
        print $fh join ",", @$row1;
        print $fh ",";
        for my $j (0..$#$row2) {
            my $cnt = $columns->[$j]{$row2->[$j]};
            my $rank = $ranks->[$j]{$cnt};
            my $ave = ($cnt / 14000000) * 100;
            print $fh join ",", $row2->[$j], $rank, $ave, $cnt;
            print $fh "," if $j < $#$row2;
        }
        print $fh "\n";
    }
    close $fh;
    say "Done.";
}
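The idea of precomputing ranks once per column, rather than re-sorting for every row, can be sketched in a few lines of Python. This is only an illustration: `dense_ranks` is a hypothetical helper name, and it assumes the ranking the question's script aims for, where equal counts share a rank and the next distinct count gets the next rank.

```python
def dense_ranks(value_counts):
    """Map each distinct count to its dense rank (1 = most frequent)."""
    distinct = sorted(set(value_counts.values()), reverse=True)
    return {count: rank for rank, count in enumerate(distinct, start=1)}

# Counts for one column: the values "6" and "4" each occur twice, "9" once.
counts = {"6": 2, "9": 1, "4": 2}
ranks = dense_ranks(counts)

# Per-row lookup is then two dictionary accesses instead of a sort.
rank_of_nine = ranks[counts["9"]]
```

Because the lookup table is keyed by count, its size is bounded by the number of distinct counts in the column, not by the number of rows.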

Related

extract variables from proc using upvar

I want to return a value from a proc, but it didn't work because of a pointer issue.
So I tried to use upvar to return $sofar, but that didn't work either.
Could you help me extract the $sofar lists so that I can return them?
I need to get $test_output as a list of lists.
Thanks :)
proc combinationSum {sum sofar want numbers output } {
if {$sum == $want} {
puts $sofar
#upvar $output tmp
#set tmp $output
}
if {($sum < $want) && ([lindex $numbers 0] > 0) && ([llength $numbers] > 0)} {
combinationSum [expr $sum + [lindex $numbers 0]] [concat $sofar [lindex $numbers 0]] $want $numbers $output]
combinationSum $sum $sofar $want [lrange $numbers 1 end] $output]
}
}
set test_input [list 2 3 4 7 8 ]
set test_target 15
set test_output [ combinationSum 0 [] $test_target $test_input [] ]
puts $test_output
I expect test_output to be a list of lists: ( {2 2 2 2 2 2 3} {2 2 2 2 3 4} {2 2 2 2 7} ..... )
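To make the desired result concrete, here is a small Python sketch of the same recursion that returns the combinations instead of printing them. It is not a Tcl answer and says nothing about upvar; the name `combination_sum` and the choice to build the result from return values are assumptions made for illustration.

```python
def combination_sum(total, sofar, want, numbers):
    """Return every combination (with repetition) of `numbers` that sums to `want`."""
    if total == want:
        return [sofar]
    if total > want or not numbers:
        return []
    first = numbers[0]
    if first <= 0:
        return []  # mirrors the "> 0" guard in the Tcl code
    # Either use the first number again, or drop it and continue with the rest.
    with_first = combination_sum(total + first, sofar + [first], want, numbers)
    without_first = combination_sum(total, sofar, want, numbers[1:])
    return with_first + without_first

test_output = combination_sum(0, [], 15, [2, 3, 4, 7, 8])
```

Each call returns its own list and the caller concatenates the two branches, so no shared variable is needed to collect the results.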

Converting Multiple CSV Rows to Individual Columns [closed]

I have a CSV file in this format:
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
I want to transform it into this
#Time,CPU 0 Data,CPU 1 Data
x,a,b
y,c,d
But I don't know the number of CPU cores there will be in a system (represented by the CPU column). I also have multiple columns of data (not just the singular data column).
How would I go about doing this?
Example input
# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:10:01 UTC,2,3.87,0,1.38,0.07,0.12,94.55
hostname,600,2018-07-24 00:10:01 UTC,3,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
hostname,600,2018-07-24 00:20:01 UTC,2,5.26,0,1.53,0.07,0.12,93.03
hostname,600,2018-07-24 00:20:01 UTC,3,5.64,0,1.52,0.04,0.12,92.68
This would be the output for this file (CPU -1 turns into CPU ALL; also, the key value is just the timestamp, since the hostname and interval stay constant):
# hostname,interval,timestamp,CPU ALL %user,CPU ALL %nice,CPU ALL %system,CPU ALL %iowait,CPU ALL %steal,CPU ALL %idle,CPU 0 %user,CPU 0 %nice,CPU 0 %system,CPU 0 %iowait,CPU 0 %steal,CPU 0 %idle,CPU 1 %user,CPU 1 %nice,CPU 1 %system,CPU 1 %iowait,CPU 1 %steal,CPU 1 %idle,CPU 2 %user,CPU 2 %nice,CPU 2 %system,CPU 2 %iowait,CPU 2 %steal,CPU 2 %idle,CPU 3 %user,CPU 3 %nice,CPU 3 %system,CPU 3 %iowait,CPU 3 %steal,CPU 3 %idle
hostname,600,2018-07-24 00:10:01 UTC,5.19,0,1.52,0.09,0.13,93.07,5.37,0,1.58,0.15,0.15,92.76,8.36,0,1.75,0.08,0.1,89.7,3.87,0,1.38,0.07,0.12,94.55,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,5.13,0,1.52,0.08,0.13,93.15,4.38,0,1.54,0.13,0.15,93.8,5.23,0,1.49,0.07,0.11,93.09,5.26,0,1.53,0.07,0.12,93.03,5.64,0,1.52,0.04,0.12,92.68

Your question isn't clear and doesn't contain the expected output for your larger, presumably more realistic sample CSV, so I don't know what output you were hoping for, but this will at least show you the right approach:
$ cat tst.awk
BEGIN{
FS = OFS = ","
}
NR==1 {
for (i=1; i<=NF; i++) {
fldName2nmbr[$i] = i
}
tsFldNmbr = fldName2nmbr["timestamp"]
cpuFldNmbr = fldName2nmbr["CPU"]
next
}
{
tsVal = $tsFldNmbr
cpuVal = $cpuFldNmbr
if ( !(seenTs[tsVal]++) ) {
tsVal2nmbr[tsVal] = ++numTss
tsNmbr2val[numTss] = tsVal
}
if ( !(seenCpu[cpuVal]++) ) {
cpuVal2nmbr[cpuVal] = ++numCpus
cpuNmbr2val[numCpus] = cpuVal
}
tsNmbr = tsVal2nmbr[tsVal]
cpuNmbr = cpuVal2nmbr[cpuVal]
cpuData = ""
for (i=1; i<=NF; i++) {
if ( (i != tsFldNmbr) && (i != cpuFldNmbr) ) {
cpuData = (cpuData == "" ? "" : cpuData OFS) $i
}
}
data[tsNmbr,cpuNmbr] = cpuData
}
END {
printf "%s", "timestamp"
for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
printf "%sCPU %s Data", OFS, cpuNmbr2val[cpuNmbr]
}
print ""
for (tsNmbr=1; tsNmbr<=numTss; tsNmbr++) {
printf "%s", tsNmbr2val[tsNmbr]
for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
printf "%s\"%s\"", OFS, data[tsNmbr,cpuNmbr]
}
print ""
}
}
$ awk -f tst.awk file
timestamp,CPU -1 Data,CPU 0 Data,CPU 1 Data,CPU 2 Data,CPU 3 Data
2018-07-24 00:10:01 UTC,"hostname,600,5.19,0,1.52,0.09,0.13,93.07","hostname,600,5.37,0,1.58,0.15,0.15,92.76","hostname,600,8.36,0,1.75,0.08,0.1,89.7","hostname,600,3.87,0,1.38,0.07,0.12,94.55","hostname,600,3.16,0,1.36,0.05,0.14,95.29"
2018-07-24 00:20:01 UTC,"hostname,600,5.13,0,1.52,0.08,0.13,93.15","hostname,600,4.38,0,1.54,0.13,0.15,93.8","hostname,600,5.23,0,1.49,0.07,0.11,93.09","hostname,600,5.26,0,1.53,0.07,0.12,93.03","hostname,600,5.64,0,1.52,0.04,0.12,92.68"
I put the per-CPU data within double quotes so you could import it to Excel or similar without worrying about the commas between the sub-fields.
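For comparison, a Python sketch that produces the flat layout shown in the question (one column per CPU and metric, with CPU -1 renamed to ALL) might look like the following. The function name `pivot_by_cpu` and its parameters are hypothetical, the leading "# " of the header is dropped rather than preserved, and all rows for one timestamp are assumed to fit in memory.

```python
import csv
import io

def pivot_by_cpu(text, fixed=("hostname", "interval", "timestamp"), cpu_col="CPU"):
    """Turn one-row-per-CPU CSV text into one row per timestamp."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    header[0] = header[0].lstrip("# ")  # the sample header starts with "# "
    fixed_idx = [header.index(name) for name in fixed]
    cpu_idx = header.index(cpu_col)
    metric_idx = [i for i in range(len(header))
                  if i not in fixed_idx and i != cpu_idx]

    groups = {}  # fixed-column tuple -> {cpu label: metric values}
    cpus = []    # CPU labels in first-seen order
    for row in reader:
        if not row:
            continue
        cpu = "ALL" if row[cpu_idx] == "-1" else row[cpu_idx]
        if cpu not in cpus:
            cpus.append(cpu)
        key = tuple(row[i] for i in fixed_idx)
        groups.setdefault(key, {})[cpu] = [row[i] for i in metric_idx]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(list(fixed) +
                    [f"CPU {c} {header[i]}" for c in cpus for i in metric_idx])
    for key, per_cpu in groups.items():
        line = list(key)
        for c in cpus:
            # Leave cells empty if a CPU is missing for this timestamp.
            line += per_cpu.get(c, [""] * len(metric_idx))
        writer.writerow(line)
    return out.getvalue()
```

The number of CPUs does not need to be known in advance, because the column list is built from the CPU labels actually seen in the input.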

If we assume that the CSV input file is sorted according to increasing timestamps, you could try something like this:
use feature qw(say);
use strict;
use warnings;
my $fn = 'log.csv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;
my %info;
my @times;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $time, $cpu, $data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, $data;
}
close $fh;
for my $time (@times) {
    say join ",", $time, @{ $info{$time} };
}
Output:
x,a,b
y,c,d

Print HTML Table from an array with condition with Perl CGI

I have made a script that returns an array with several lines like:
DATA:VALUE:VALUE_MAX
I need to fill a table with those values like:
NAME | Status
--------------------------
DATA | OK/minor/warning...
.... | .........
.... | .........
With VALUE and VALUE_MAX I calculate the percentage, which gives me the status.
Here is my code for printing the table:
my @i = my_status();
print <<END;
<div class="container">
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Status</th>
</tr>
</thead>
<tbody>
END
my $inc = 0;
while (@i) {
my @temp = split /:/, @i[$inc];
my $name = $temp[0];
my $percent = ($temp[1] * $temp[2] / 100);
my $status = undef;
if ($percent <= 24 ) {
print "<tr class='info'>";
$status = "Critical !";
}
elsif ($percent <= 49 ) {
print "<tr class='danger'>";
$status = "Danger !";
}
elsif ($percent <= 74 ) {
print "<tr class='warning'>";
$status = "Warning";
}
elsif ($percent <= 99 ) {
print "<tr class='active'>";
$status = "Minor";
}
elsif ($percent == 100 ) {
print "<tr class='success'>";
$status = "OK";
}
print "<td>$name</td>";
print "<td>$status</td>";
print "</tr>";
$inc++;
}
print <<END;
</tbody>
</table>
</div>
END
My script "my_status" takes a while to execute because it makes many server requests.
But the thing is, on the HTML page everything is a mess: I get wrong values, and an endless loop that prints only "Critical !" in the status column.
What is wrong with my script?

You are not iterating over @i in your while loop. Your line
while (@i) {
means that it will stay in the loop as long as @i is true. Because that's an array, that means that as long as there are items in @i, it will stay in the loop.
You do not remove anything from @i inside of the loop. There are no shift or pop commands, and you also do not overwrite @i. So it will stay indefinitely. You've got yourself an infinite loop.
What you want instead is probably a foreach loop. Then you also don't need $inc. It will put each element inside of @i into $elem and run the loop.
foreach my $elem (@i) {
my @temp = split /:/, $elem;
my $name = $temp[0];
my $percent = ( $temp[1] * $temp[2] / 100 );
my $status = undef;
if ( $percent <= 24 ) {
print "<tr class='info'>";
$status = "Critical !";
}
elsif ( $percent <= 49 ) {
print "<tr class='danger'>";
$status = "Danger !";
}
elsif ( $percent <= 74 ) {
print "<tr class='warning'>";
$status = "Warning";
}
elsif ( $percent <= 99 ) {
print "<tr class='active'>";
$status = "Minor";
}
elsif ( $percent == 100 ) {
print "<tr class='success'>";
$status = "OK";
}
print "<td>$name</td>";
print "<td>$status</td>";
print "</tr>";
}
You can read up on loops in perlsyn starting from for loops.
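The same "visit each element exactly once" pattern, sketched in Python for comparison. Two things here are assumptions rather than part of the answer: the percentage is computed as VALUE / VALUE_MAX * 100 (the posted code multiplies the two values, which may be the source of the wrong values the asker mentions), and anything above 99 percent is treated as OK, whereas the posted code leaves the status undefined between 99 and 100. The names `status_for` and `table_rows` are made up for this sketch.

```python
def status_for(value, value_max):
    """Return (css_class, label) for a VALUE / VALUE_MAX pair."""
    percent = value / value_max * 100  # assumed formula, see the note above
    thresholds = [
        (24, "info", "Critical !"),
        (49, "danger", "Danger !"),
        (74, "warning", "Warning"),
        (99, "active", "Minor"),
    ]
    for limit, css, label in thresholds:
        if percent <= limit:
            return css, label
    return "success", "OK"

def table_rows(lines):
    """Build one <tr> per 'DATA:VALUE:VALUE_MAX' line."""
    rows = []
    for line in lines:  # each element is visited once, so the loop terminates
        name, value, value_max = line.split(":")
        css, label = status_for(float(value), float(value_max))
        rows.append(f"<tr class='{css}'><td>{name}</td><td>{label}</td></tr>")
    return "\n".join(rows)
```

Keeping the thresholds in a list also removes the repeated if/elsif branches, so adding a new status level is a one-line change.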

Compare two csv files and deduct matches from original

Given two csv files:
File1.csv
SKU,Description,UPC
101,Saw,101010103
102,Drill,101010102
103,Screw,101010101
104,Nail,101010104
File2.csv
SKU,Description,UPC
100,Light,101010105
101,Saw,101010103
104,Nail,101010104
106,Battery,101010106
108,Bucket,101010114
I'd like to create a new csv file, we'll call UpdatedList.csv, that has every entry from File1.csv minus any rows where the SKU is in both File1.csv and File2.csv. In this case UpdatedList.csv will look like
UpdatedList.csv
"SKU","Description","UPC"
"102","Drill","101010102"
"103","Screw","101010101"
The following code will do what I want but I believe there is a more efficient way. How can I do this without loops? My code is as follows.
#### Create a third file that has all elements of file 1 minus those in file 2 ###
$FileName1 = Get-FileName "C:\LowInventory"
$FileName2 = Get-FileName "C:\LowInventory"
$f1 = ipcsv $FileName1
$f2 = ipcsv $FileName2
$f3 = ipcsv $FileName1
For($i=0; $i -lt $f1.length; $i++){
For($j=0; $j -lt $f2.length; $j++){
if ($f1[$i].SKU -eq $f2[$j].SKU){$f3[$i].SKU = 0}
}
}
$f3 | Where-Object {$_.SKU -ne "0"} | epcsv "C:\LowInventory\UpdatedList.csv" -NoTypeInformation
Invoke-Item "C:\LowInventory\UpdatedList.csv"
################################

You can do this without loops by taking advantage of the Group-Object cmdlet:
$f1 = ipcsv File1.csv;
$f2 = ipcsv File2.csv;
$f1.ForEach({Add-Member -InputObject $_ 'X' 0}) # So we can select these after
$f1 + $f2 | # merge our lists
group SKU | # group by SKU
where {$_.Count -eq 1} | # select ones with unique SKU
select -expand Group | # ungroup
where {$_.X -eq 0} # where from file1
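The underlying operation is an anti-join: keep the rows of File1 whose SKU does not appear in File2. As a language-neutral illustration, here is a minimal Python sketch that uses a set for the lookup, so each file is scanned once instead of comparing every pair of rows. The function name `rows_not_in` and the in-memory string inputs are assumptions for this example.

```python
import csv
import io

def rows_not_in(file1_text, file2_text, key="SKU"):
    """Return rows of file1 whose key value does not occur in file2."""
    # Collect the keys of file2 once; set membership tests are constant time on average.
    seen = {row[key] for row in csv.DictReader(io.StringIO(file2_text))}
    return [row for row in csv.DictReader(io.StringIO(file1_text))
            if row[key] not in seen]

file1 = "SKU,Description,UPC\n101,Saw,101010103\n102,Drill,101010102\n103,Screw,101010101\n104,Nail,101010104\n"
file2 = "SKU,Description,UPC\n100,Light,101010105\n101,Saw,101010103\n104,Nail,101010104\n106,Battery,101010106\n108,Bucket,101010114\n"
updated = rows_not_in(file1, file2)
```

With the sample files from the question, `updated` holds the Drill and Screw rows.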

Using AWK to get text and looping in csv

I'm new to awk and I have a question.
I have a CSV file like this:
IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012
STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211
and I want to change it to this:
IVALSTART IVALEND
23:00:00 23:30:00
STATUS LSN LOC IVALDATE
K lskpg 1201 4/9/2012
K lntrjkt 1201 4/9/2012
K lbkkstp 1211 4/9/2012
How can I do this in awk?
Thanks and best regards!

Try this:
awk '
NR == 1 { name = $3; print $1, $2 }
NR == 2 { date = $3; print $1, $2 }
NR == 3 { print "" }
NR == 4 { $4 = name; print }
NR > 4 { $4 = date; print }
' FILE
If you need formatting, change print to printf with the appropriate format specifiers.
