Delete one character at end of file in Perl

I have run into a problem while programming in Perl. I use a foreach loop to read some data out of a hash.
The Code:
foreach $title (keys %FilterSPRINTHASH) {
$openSP = $FilterSPRINTHASH{$title}{openSP};
$estSP = $FilterSPRINTHASH{$title}{estSP};
$line = "'$title':{'openSP' : $openSP, 'estSP' : $estSP}\n";
print $outfile "$line\n";
}
The thing is that I am writing the output to a separate file with Perl's print-to-filehandle expression, and that file will hold JSONP text (later used from HTML).
Back to the problem:
JSONP requires a comma "," after every line except the last one, so I had to put a comma at the end of each line; when the last line comes, however, I have to remove the comma.
I have tried the chop function, but I am not sure where to put it: if I put it at the end of the foreach, it just chops the comma in $line, but it won't chop it in the new file I created.
I have also tried a while (<>) statement, with no success.
Any ideas appreciated.
BR

Using the JSON module is far less error-prone; there is no need to reinvent the wheel:
use JSON;
print $outfile encode_json(\%FilterSPRINTHASH), "\n";
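Since the question asks for JSONP rather than plain JSON, here is a minimal sketch of wrapping the encoded hash in a callback call. It uses the core JSON::PP module; the callback name handleSprints, the helper to_jsonp and the sample data are assumptions for illustration, not something from the question.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14

# Hypothetical sample data shaped like %FilterSPRINTHASH in the question.
my %FilterSPRINTHASH = (
    'Sprint 1' => { openSP => 5, estSP => 8 },
    'Sprint 2' => { openSP => 3, estSP => 13 },
);

# Wrap the encoded data in a callback call to turn JSON into JSONP.
# The callback name is an assumption; use whatever the HTML page expects.
sub to_jsonp {
    my ($callback, $data) = @_;
    # canonical() sorts the keys so the output is stable between runs
    my $json = JSON::PP->new->canonical->encode($data);
    return "$callback($json);";
}

print to_jsonp('handleSprints', \%FilterSPRINTHASH), "\n";
```

The encoder takes care of commas, quoting and escaping, so there is no trailing comma to remove.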

You can check if it is the last iteration of the loop, then remove the comma from line.
So something like
my $count = keys %FilterSPRINTHASH; # number of keys (scalar context)
my $loop_count = 1; # counts the current iteration
foreach my $title (keys %FilterSPRINTHASH) {
my $openSP = $FilterSPRINTHASH{$title}{openSP};
my $estSP = $FilterSPRINTHASH{$title}{estSP};
# every line is built with a trailing comma
my $line = "'$title':{'openSP' : $openSP, 'estSP' : $estSP},";
if ($loop_count == $count) {
# this is the last iteration, so remove the comma from the line
$line =~ s/,+$//;
}
print $outfile "$line\n";
$loop_count++;
}

I would approach this by storing your output in an array and then joining that with the line separators you wish:
my @output; # storage for output
foreach my $title (keys %FilterSPRINTHASH) {
# create each line
my $line = sprintf "'%s':{'openSP' : %s, 'estSP' : %s}", $title, $FilterSPRINTHASH{$title}{openSP}, $FilterSPRINTHASH{$title}{estSP};
# and put it in the output container
push @output, $line;
}
# join all output lines with comma and newline and then output
print $outfile (join ",\n", @output);

Related

Tcl: How to append multiple variables to a single line with a space between them?

In the code below, append adds all file paths to a single line, but there is no space between them.
How can I add a space between each path?
set all_path ""
foreach line $lines {
set filepath [proc_get_file_path $line]
...
#some commands
...
append ::all_path $filepath
}
Expected output:
../path/a ../path/b ../path/c ...
How do you want to use all_path later on?
From a distance, this looks like a case where you would want to use a Tcl list:
set all_path [list]
foreach line $lines {
set filepath [proc_get_file_path $line]
# ...
lappend all_path $filepath
}
The string representation of Tcl lists would also match your expectation re a whitespace delimiter. You can also assemble such a string manually, with append introducing a whitespace explicitly: append all_path " " $filepath. But maybe, this is not what you want to begin with ...

How to extract row data from a csv file using perl?

I have a csv file like
Genome Name,Resistance_phenotype,Amikacin,Gentamycin,Aztreonam
AB1,,Susceptible,Resistant,Resistant
AB2,,Susceptible,Susceptible,Susceptible
AB3,,Resistant,Resistant,NA
I need to fill the 2nd column, i.e. Resistance_phenotype, with MDR, XDR or Susceptible by matching the antibiotic resistance profile. For example, in the AB1 row Gentamycin and Aztreonam are both Resistant, so the 2nd column should be filled with MDR; in the AB2 row all three are Susceptible, so the 2nd column should be filled with Susceptible.
I have written the code below, which only displays one column of the CSV file. I am stuck on what to do next.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'text.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
chomp $line;
my @fields = split(/,/, $line);
print $fields[0], "\n";
#print $fields[1], "\n";
}
close $fh;
Expected output:
Genome Name,Resistance_phenotype,Amikacin,Gentamycin,Aztreonam
AB1,MDR,Susceptible,Resistant,Resistant
AB2,Susceptible,Susceptible,Susceptible,Susceptible
AB3,MDR,Resistant,Resistant,NA
Use the Text::CSV_XS module. Read a line, assign the right value to that column, then print it again. In your sample code, you were only writing one column instead of all of them; the module will handle all of that for you:
use Text::CSV_XS;
my $csv = Text::CSV_XS->new;
# replace *DATA and *STDOUT with whatever filehandles you want
# to read then write.
while( my $row = $csv->getline(*DATA) ) {
$row->[1] = 'Some value';
$csv->say( *STDOUT, $row );
}
__DATA__
Genome Name,Resistance_phenotype,Amikacin,Gentamycin,Aztreonam
AB1,,Susceptible,Resistant,Resistant
AB2,,Susceptible,Susceptible,Susceptible
AB3,,Resistant,Resistant,NA
The output is:
"Genome Name","Some value",Amikacin,Gentamycin,Aztreonam
AB1,"Some value",Susceptible,Resistant,Resistant
AB2,"Some value",Susceptible,Susceptible,Susceptible
AB3,"Some value",Resistant,Resistant,NA
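The answer above writes a placeholder into the column. As a rough sketch of computing the value instead, the code below classifies each row from its antibiotic columns. The rule is an assumption inferred from the expected output in the question (two or more Resistant results give MDR, all Susceptible gives Susceptible); the question does not say when XDR applies, so it is left out. The helper names phenotype and fill_line are made up for illustration, and the plain split is only safe for simple data without quoted commas.

```perl
use strict;
use warnings;

# Classify one row from its antibiotic results.
# Assumed rule: >= 2 "Resistant" => MDR; all "Susceptible" => Susceptible;
# anything else is left blank. The XDR rule is not given in the question.
sub phenotype {
    my @results = @_;
    my $resistant   = grep { $_ eq 'Resistant' }   @results;
    my $susceptible = grep { $_ eq 'Susceptible' } @results;
    return 'MDR'         if $resistant >= 2;
    return 'Susceptible' if @results && $susceptible == @results;
    return '';
}

# Fill column 2 of one CSV data line and return the rebuilt line.
sub fill_line {
    my ($line) = @_;
    my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
    $fields[1] = phenotype(@fields[2 .. $#fields]);
    return join ',', @fields;
}

my @lines = (
    'AB1,,Susceptible,Resistant,Resistant',
    'AB2,,Susceptible,Susceptible,Susceptible',
    'AB3,,Resistant,Resistant,NA',
);
print fill_line($_), "\n" for @lines;
```

With Text::CSV_XS the same phenotype call could replace 'Some value' in the loop, skipping the header row.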

Regex to parse html for sentences?

I know that HTML::Parser is a thing, and from reading around I've realized that trying to parse HTML with regex is usually a suboptimal way of doing things, but for a Perl class I'm currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, words per sentence and hopefully the average length of words on the page.
For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!
#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;
open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;
print "<pre>";
###Main Program###
&sentences;
###sentence identifier sub###
sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
push @sentences, $1;
}
#for debugging, comment out when running
print join("\n",@sentences);
}
print "</pre>";
Your regex should be />[^<]*?\./gis
The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
There may be other problems.
Now read this
A first improvement would be to write $html =~ />([^<.]+)\. /gs: you need to capture the match with parentheses, and to allow more than one letter per sentence ;--)
This does not get all the sentences though, just the first one in each element.
A better way would be to capture all the text, then extract sentences from each fragment
while( $html=~ m{>([^<]*)<}g) { push @text_content, $1};
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }
(untested because it's early in the morning and coffee is calling)
All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.
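The two-step idea above (grab the text between tags, then split each fragment into sentences) can be sketched as a small runnable script. The sub name sentences_from_html and the sample HTML are assumptions for illustration, and the same regex caveats apply: a '<' or '>' inside the text, comments or scripts will confuse it.

```perl
use strict;
use warnings;

# Step 1: capture each run of text between '>' and the next '<'.
# Step 2: split that fragment into sentences ending in '.', '!' or '?'.
sub sentences_from_html {
    my ($html) = @_;
    my @sentences;
    while ($html =~ m{>([^<]+)<}g) {
        my $fragment = $1;    # copy before the inner match overwrites $1
        while ($fragment =~ m{([^.!?]+[.!?])}g) {
            (my $sentence = $1) =~ s/^\s+|\s+$//g;    # trim whitespace
            push @sentences, $sentence if length $sentence;
        }
    }
    return @sentences;
}

my $html = '<p>First one. Second one!</p><div>Third?</div>';
print "$_\n" for sentences_from_html($html);
```

Text that does not end in a terminator is dropped by this sketch, which may or may not be what a sentence count should do.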
I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Grabber;
my $file_location = shift;
print "\n\nfile: $file_location";
my $totalWordCount = 0;
my $sentenceCount = 0;
my $wordsInSentenceCount = 0;
my $averageWordsPerSentence = 0;
my $char_count = 0;
my $contents;
my $rounded;
my $rounded2;
open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
while( my $line = <$file>){
$contents .= $line;
}
close( $file );
my $dom = HTML::Grabber->new( html => $contents );
$dom->find('p')->each( sub{
my $p_tag = $_->text;
++$totalWordCount while $p_tag =~ /\S+/g;
++$sentenceCount while $p_tag =~ /[.!?]+/g;
# count non-whitespace characters once per paragraph, on a copy
(my $no_space = $p_tag) =~ s/\s//g;
$char_count += length($no_space);
});
print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
$rounded = sprintf("%.2f", $totalWordCount / $sentenceCount);
print " Average words per sentence: $rounded.\n\n";
print " Total Characters: $char_count.\n";
my $averageCharsPerWord = $char_count / $totalWordCount ;
$rounded2 = sprintf("%.2f", $averageCharsPerWord );
print " Average characters per word: $rounded2.\n\n";

HTML parser using perl

I'm trying to parse an HTML file using a Perl script. I'm trying to grep all the text inside the HTML p tag. If I view the source, the data is written in this format:
<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>
Here is my code:
use HTML::TagParser();
use URI::Fetch;
//my @list = $html->getElementsByTagName( "p" );
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");
foreach $_ (@array) {
# print "$_\n";
print $html_fh "$_\n";
chomp ($_);
push (@array1, "$_");
}
}
}
$end = $#array1+1;
print "Elements in the array: $end\n";
close $html_fh;
The problem I'm facing is that the generated output is 4.60 MB, and a lot of the array elements are just repeated sentences. How can I avoid this repetition? Is there a more efficient way to grep the lines I'm interested in? Can anybody help me with this issue?
The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text"); # this array is printed below
foreach $_ (@array) { # This is inside the other loop
# print "$_\n";
print $html_fh "$_\n"; # here comes the print
chomp ($_);
push (@array1, "$_");
}
}
So for example, if you have an array "foo", "bar", "baz", it would print:
foo # first iteration
foo # second
bar
foo # third
bar
baz
So, to fix your duplication errors, move the second loop outside the first one.
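A minimal sketch of that fix, with the collecting loop and the printing loop kept separate. The @texts array is a hypothetical stand-in for the innerText of each p element, and output goes to STDOUT rather than $html_fh to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the innerText of each <p> element.
my @texts = ("foo\n", "bar\n", "baz\n");

# First loop: collect only.
my @array;
for my $text (@texts) {
    chomp(my $clean = $text);    # chomp a copy so @texts is left untouched
    push @array, $clean;
}

# Second loop, now outside the first: each element is printed exactly once.
print "$_\n" for @array;
print "Elements in the array: " . @array . "\n";
```

Because every element is printed once, the second array from the original code is no longer needed.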
Some other notes:
You should always use these two pragmas:
use strict;
use warnings;
They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors that appear more than makes up for the massively reduced time spent debugging.
//my @list = $html->getElementsByTagName( "p" );
Comments in perl start with #. Not sure if this is a typo, because you use this array below.
foreach my $elem ( @list ) {
You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):
for my $elem ($html->getElementsByTagName("p")) {
These variables are also intermediate, and two of them unused.
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");
Also note that you never have to quote a variable this way. You can simply do this:
push @array, $elem->innerText;
foreach $_ (@array) {
The $_ variable is used by default, no need to specify it explicitly.
print $html_fh "$_\n";
chomp ($_);
push (@array1, "$_");
I'm not sure why you are chomping the variable after you print it but before you store it in this other array; it doesn't seem to make sense to me. Also, this other array will contain the exact same elements as the first array, only duplicated.
$end = $#array1+1;
This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:
$end = @array1; # size = last index + 1
But you can do this in one go:
print "Elements in the array: " . @array1 . "\n";
Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , it would have list context, and the array would have been expanded into a list of its elements. This is a typical way to manipulate by context.
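To make the context difference concrete, here is a small sketch with a sample array (the values are made up for illustration):

```perl
use strict;
use warnings;

my @array1 = ('foo', 'bar', 'baz');

my $size       = @array1;     # scalar context: number of elements
my $last_index = $#array1;    # index of the last element
my ($first)    = @array1;     # list context: first element is assigned

print "size: " . @array1 . "\n";    # concatenation forces scalar context
print "items: ", @array1, "\n";     # comma keeps list context

# prints:
# size: 3
# items: foobarbaz
```

The parentheses around $first are what switch the assignment from scalar to list context.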
close $html_fh;
Explicitly closing a file handle is not required, as it will be closed automatically when the script ends.
If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):
#!/usr/bin/env perl
use strict;
use warnings qw(all);
use URI;
use Web::Scraper;
my $result = scraper {
process 'p',
'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));
for my $test (@{$result->{paragraph}}) {
print "$test\n";
}
print "Elements in the array: " . (scalar @{$result->{paragraph}});
Here is another way to get all the content from between <p> tags, this time using Mojo::DOM part of the Mojolicious project.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10; # say
use Mojo::DOM;
my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END
my $dom = Mojo::DOM->new($html);
# note: newer Mojolicious releases use ->map('text') in place of ->pluck('text')
my @paragraphs = $dom->find('p')->pluck('text')->each;
say for @paragraphs;

Remove trailing commas at the end of the string using Perl

I'm parsing a CSV file in which each line looks something like the one below.
10998,4499,SLC27A5,Q9Y2P5,GO:0000166,GO:0032403,GO:0005524,GO:0016874,GO:0047747,GO:0004467,GO:0015245,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
There seems to be trailing commas at the end of each line.
I want to get the first term, in this case "10998" and get the number of GO terms related to it.
So my output in this case should be,
Output:
10998,7
But instead it shows 299. I realized overall there are 303 commas in each line. And I'm not able to figure out an easy way to remove trailing commas. Can anyone help me solve this issue?
Thanks!
My Code:
use strict;
use warnings;
open my $IN, '<', 'test.csv' or die "can't find file: $!";
open(CSV, ">GO_MF_counts_Genes.csv") or die "Error!! Cannot create the file: $!\n";
my @genes = ();
my $mf;
foreach my $line (<$IN>) {
chomp $line;
my @array = split(/,/, $line);
my @GO = splice(@array, 4);
my $GO = join(',', @GO);
$mf = count($GO);
print CSV "$array[0],$mf\n";
}
sub count {
my $go = shift @_;
my $count = my @go = split(/,/, $go);
return $count;
}
I'd use juanrpozo's solution for counting but if you still want to go your way, then remove the commas with regex substitution.
$line =~ s/,+$//;
I suggest this more concise way of coding your program.
Note that the line my @data = split /,/, $line discards trailing empty fields (@data has only 11 fields with your sample data), so it will produce the same result whether or not trailing commas are removed beforehand.
use strict;
use warnings;
open my $in, '<', 'test.csv' or die "Cannot open file for input: $!";
open my $out, '>', 'GO_MF_counts_Genes.csv' or die "Cannot open file for output: $!";
foreach my $line (<$in>) {
chomp $line;
my @data = split /,/, $line;
printf $out "%s,%d\n", $data[0], scalar grep /^GO:/, @data;
}
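The remark about split dropping trailing empty fields can be checked with a short sketch. The line here is a shortened, made-up version of the sample row, and count_go is a hypothetical helper name.

```perl
use strict;
use warnings;

# Shortened sample row: six real fields followed by four trailing commas.
my $line = '10998,4499,SLC27A5,Q9Y2P5,GO:0000166,GO:0032403,,,,';

my @default = split /,/, $line;        # trailing empty fields are dropped
my @all     = split /,/, $line, -1;    # a negative limit keeps them

# Count the GO terms on one line; trailing commas do not matter.
sub count_go {
    my ($row) = @_;
    my @data = split /,/, $row;
    return scalar grep { /^GO:/ } @data;
}

printf "default: %d fields, with -1: %d fields, GO terms: %d\n",
    scalar @default, scalar @all, count_go($line);
```

Filtering on the GO: prefix also means the count stays right even if the number of leading non-GO columns changes.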
You can apply grep to @array
my $mf = grep { /^GO:/ } @array;
assuming $array[0] never matches /^GO:/
For each line:
foreach my $line (<$IN>) {
my ($first_term) = ($line =~ /(\d+),/);
my @tmp = split('GO', " $line ");
my $nr_of_GOs = @tmp - 1;
print CSV "$first_term,$nr_of_GOs\n";
}