Encoding Special Characters in Perl - html

I have this string for example:
This is an example text ã&"><£
When I run this Perl code on the string:
my ($string) = @_;
$string =~ s/>//g;
$string =~ s/<//g;
$string =~ s/&/and/g;
$string =~ s/\"//g;
$string =~ s/-//;
$string =~ s/ó//;
$string =~ s/;//g;
$string =~ s/&/&/g;
$string = encode_entities($string, '<>&"');
$string = encode_utf8($string);
return $string;
I receive this result:
This is an example text ã£ã£
Instead of the expected one:
This is an example text ã&amp;&quot;&gt;&lt;£
How can I solve it?

Can you try the following script:
use feature qw(say);
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(utf-8));
use HTML::Entities;
my $string = 'This is an example text ã&"><£';
$string = encode_entities($string, '<>&"');
say $string;
Output:
This is an example text ã&amp;&quot;&gt;&lt;£
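If HTML::Entities is not installed, the same four characters can be escaped with core Perl only. This is a minimal sketch, not the module's implementation; the names escape_markup and %ENTITY are made up for illustration:

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal non-ASCII characters
binmode STDOUT, ':encoding(UTF-8)';    # encode exactly once, on output

# Hypothetical lookup table: only the four characters named in the answer above.
my %ENTITY = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );

# Hypothetical helper: escape those characters in a single pass,
# so an '&' produced by one replacement is never escaped a second time.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/([&<>"])/$ENTITY{$1}/g;
    return $text;
}

print escape_markup('This is an example text ã&"><£'), "\n";
```

Note that the output layer already does the UTF-8 encoding here. Calling encode_utf8 on the string as well would encode it twice, which is one likely source of garbled non-ASCII characters.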

Related

Perl Regex for Not HTML

I am looking to substitute anything that is not an HTML tag from an HTML document. So, basically trying to get rid of all the text within the document.
I have the below regex to remove all HTML from a string, but need help with the opposite scenario.
$string =~ s/<[^>]+>//g;
Thanks.
If this is regex s///ubstitution to remove all html from document
$string =~ s/<[^>]+>//g;
Then you can use the same regex in a m//atch operator to keep all html from document
$string = join '', $string =~ m/<[^>]+>/g;
If the above regex satisfies your requirements, then you're done :) But maybe you want to consider this ol' regex pattern, slightly longer :D http://perlmonks.org/?node_id=161281
Mind the caveats like Ethan Browne mentions :)
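The m//g idea can be wrapped in a small function. A rough sketch, assuming the simple <[^>]+> pattern is good enough for your input (keep_tags is a made-up name):

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the pieces of the string that look like tags.
sub keep_tags {
    my ($html) = @_;
    # In list context, m//g without capture groups returns every full match.
    return join '', $html =~ m/<[^>]+>/g;
}

print keep_tags('<p>Hello <b>world</b>!</p>'), "\n";   # prints <p><b></b></p>
```

The pattern is still fooled by a ">" inside an attribute value or a comment, so the caveats mentioned above apply unchanged.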
Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.
HTML::Parser is more ubiquitous, but it's not hard to Google for more.
http://metacpan.org/pod/HTML::Parser
A solution using HTML::Parser is (tested once):
use HTML::Parser ();
my $p = HTML::Parser->new(api_version => 3);
$p->handler( text => sub { }, "");
$p->handler( default => sub { print shift }, "text");
$p->parse_file('content.html') || die $!;
Are you looking for this?
$string =~ s/>[^<]*</></mg;
Or this?
$string =~ s/(?<=>)[^<]*(?=<)//mg;
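The two substitutions differ only in whether the surrounding ">" and "<" are consumed and put back, or merely asserted with lookbehind and lookahead. A small sketch for comparing them side by side (the function names are made up):

```perl
use strict;
use warnings;

# Variant 1: consume the delimiters and write them back in the replacement.
sub strip_text_consuming {
    my ($html) = @_;
    $html =~ s/>[^<]*</></g;
    return $html;
}

# Variant 2: only assert the delimiters, so the match covers the text alone.
sub strip_text_lookaround {
    my ($html) = @_;
    $html =~ s/(?<=>)[^<]*(?=<)//g;
    return $html;
}

my $sample = '<p>Hello <b>world</b>!</p>';
print strip_text_consuming($sample),  "\n";
print strip_text_lookaround($sample), "\n";
```

Both only touch text that sits between two tags. Text before the first tag or after the last one is left alone.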
LibXML makes it easy to select stuff that isn't tags/comments/processing-instruction and remove it
#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;
Main( @ARGV );
exit( 0 );
sub Main {
binmode STDOUT;
my $loc = shift or die "
Usage:
$0 ko00010.html
$0 http://example.com/ko00010.html\n\n";
my $dom = XML::LibXML->new(
qw/
recover 2
no_blanks 1
/
)->load_html( location => $loc, );
## http://www.w3.org/TR/xpath/#node-tests
## http://www.w3.org/TR/xpath/#NT-NodeType
## http://www.w3.org/TR/xpath/#section-Text-Nodes
for my $text ( $dom->findnodes(q{ //text() }) ){
node_detach( $text );
}
local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
local $XML::LibXML::setTagCompression = 0; ## <p />
#~ print "$dom";
my $pp = XML::LibXML::PrettyPrint->new_for_html;
$pp->{indent_string}=' ';
print $pp->pretty_print( $dom );
}
sub node_detach {
my( $self ) = @_;
$self->parentNode->removeChild( $self );
}

Remove trailing commas at the end of the string using Perl

I'm parsing a CSV file in which each line looks something like the one below.
10998,4499,SLC27A5,Q9Y2P5,GO:0000166,GO:0032403,GO:0005524,GO:0016874,GO:0047747,GO:0004467,GO:0015245,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
There seem to be trailing commas at the end of each line.
I want to get the first term, in this case "10998" and get the number of GO terms related to it.
So my output in this case should be,
Output:
10998,7
But instead it shows 299. I realized overall there are 303 commas in each line. And I'm not able to figure out an easy way to remove trailing commas. Can anyone help me solve this issue?
Thanks!
My Code:
use strict;
use warnings;
open my $IN, '<', 'test.csv' or die "can't find file: $!";
open(CSV, ">GO_MF_counts_Genes.csv") or die "Error!! Cannot create the file: $!\n";
my @genes = ();
my $mf;
foreach my $line (<$IN>) {
chomp $line;
my @array = split(/,/, $line);
my @GO = splice(@array, 4);
my $GO = join(',', @GO);
$mf = count($GO);
print CSV "$array[0],$mf\n";
}
sub count {
my $go = shift @_;
my $count = my @go = split(/,/, $go);
return $count;
}
I'd use juanrpozo's solution for counting but if you still want to go your way, then remove the commas with regex substitution.
$line =~ s/,+$//;
I suggest this more concise way of coding your program.
Note that the line my @data = split /,/, $line discards trailing empty fields (@data has only 11 fields with your sample data), so it produces the same result whether or not trailing commas are removed beforehand.
use strict;
use warnings;
open my $in, '<', 'test.csv' or die "Cannot open file for input: $!";
open my $out, '>', 'GO_MF_counts_Genes.csv' or die "Cannot open file for output: $!";
foreach my $line (<$in>) {
chomp $line;
my @data = split /,/, $line;
printf $out "%s,%d\n", $data[0], scalar grep /^GO:/, @data;
}
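To see the trailing-field behaviour of split for yourself, compare the default with a negative LIMIT. A quick sketch (field_count is a made-up helper, and the sample line is shortened):

```perl
use strict;
use warnings;

# Hypothetical helper: how many fields split() returns for a given LIMIT.
# A LIMIT of 0 behaves like leaving it out: trailing empty fields are dropped.
sub field_count {
    my ($line, $limit) = @_;
    my @fields = split(/,/, $line, $limit // 0);
    return scalar @fields;
}

my $line = 'id,GO:1,GO:2,,,';
printf "default: %d, limit -1: %d\n", field_count($line), field_count($line, -1);
# prints: default: 3, limit -1: 6
```

Only a negative LIMIT keeps the empty fields at the end.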
You can apply grep to @array
my $mf = grep { /^GO:/ } @array;
assuming $array[0] never matches /^GO:/
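Put together as a function, that could look like the sketch below (go_summary is a made-up name, and the sample line is cut down from the one in the question):

```perl
use strict;
use warnings;

# Hypothetical helper: return the first field and the number of GO terms on one CSV line.
sub go_summary {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line;           # trailing empty fields are dropped here
    my $count  = grep { /^GO:/ } @fields;    # grep in scalar context returns a count
    return ($fields[0], $count);
}

my ($id, $count) = go_summary('10998,4499,SLC27A5,Q9Y2P5,GO:0000166,GO:0032403,,,,');
print "$id,$count\n";   # prints 10998,2
```

Anchoring the pattern with ^ counts only fields that start with GO:, so a gene symbol that merely contains "GO" is not counted.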
For each line:
foreach my $line (<$IN>) {
my ($first_term) = ($line =~ /(\d+),/);
my @tmp = split('GO', " $line ");
my $nr_of_GOs = @tmp - 1;
print CSV "$first_term,$nr_of_GOs\n";
}

Perl - How to decode or replace &#39; from database to single quote for browser display

I have found lots of questions close to this but, nothing that helped me solve it. Prob due to my lack of expertise.
PHP has html_entity_decode which could have helped but, Perl does not, I believe.
In my MySQL database I have &#39; and &quot; following lengths in a description like: 12&#39; 6&quot;
I would like it to display as 12' 6".
I have tried:
$string =~ s/:[']:/'/g;
$string =~ s/:["]:/"/g;
$string =~ s/'/'/g;
$string =~ s/"/"/g;
$string =~ s/\'/'/g;
$string =~ s/\"/"/g;
$string =~ s/\'/\'/g;
$string =~ s/\"/\"/g;
perl -pi -e 's:':':g' $_; #crashes.
perl -pi -e 's:":":g' $string #also crashes.
system -pi -e 's:':':g' $_; #crashes.
system -pi -e 's:":":g' $string #also crashes.
I am at a loss. Can someone help?
Have a look at HTML::Entities:
use warnings;
use strict;
use HTML::Entities;
my $str = '12&#39; 6&quot;';
print decode_entities($str);
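If installing a module is not an option, a tiny decoder for numeric references plus a handful of named entities can be written with core Perl. This is only a sketch and covers far less than HTML::Entities does; decode_basic_entities and %NAMED are made-up names:

```perl
use strict;
use warnings;

# Hypothetical table: just a few named entities, nowhere near the full HTML list.
my %NAMED = ( quot => '"', apos => "'", amp => '&', lt => '<', gt => '>' );

# Hypothetical helper: decode &#NN;, &#xHH; and the names above in a single pass.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z]+));}{
          defined $1          ? chr(hex $1)
        : defined $2          ? chr($2)
        : exists $NAMED{$3}   ? $NAMED{$3}
        :                       "&$3;"      # leave unknown names untouched
    }ge;
    return $text;
}

print decode_basic_entities('12&#39; 6&quot;'), "\n";   # prints 12' 6"
```

Because everything happens in one pass, text such as &amp;quot; becomes &quot; and is not decoded a second time.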

Trouble Getting Regular Expression To Work

I'm trying to use regular expressions to remove certain blocks of coding from a text file. So far, most of my regular expression lines have worked to remove the codes. However, I have two questions:
1) Whenever I remove a chunk of text, where the text should have been is substituted with blank space, rather than simply being removed.
An example of my regex code is:
$file =~ s/<ul(.*)>//gi;
Which removes all lines with the basic format <ul...>, which is what I want it to do. However, as mentioned prior, it replaces the tag and all contained data with blank spaces, and I was wondering how to stop this particular substitution.
2) Certain regular expression codes that should work, don't seem to. For instance, I want to remove
<script type="text/javascript">
function getCookies() { return ""; }
</script>
I have tried using various regex codes, but nothing seems to remove these lines. For instance:
$file =~ s/<script type(.*)<\/script>//gi;
Which removes the <script type...> and </script> tags respectively, but leaves the
function getCookies() { return ""; }
...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!
Edit: Sorry all, I'm using Perl!
Also: I just tried using
$file =~ /<script type(.*)<\/script>/sgi
...as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the
function getCookies() { return ""; }
...section stayed. Here is my entire code, including all regex:
use strict;
use warnings;
my $firstarg;
if ($ARGV[0]){
$firstarg = $ARGV[0];
}
open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};
$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;
Ok, now that I deleted the <script> regex that was causing one problem, another has been created - using:
$file =~ s/<script type(.*)<\/script>//gi;
removes everything in between the first instance of <script ...>, but not the tag itself, not the repetitions of the tag throughout. Using:
$file =~ s/<script type(.*)<\/script>//mgi;
results in the exact same thing. Using:
$file =~ s/<script type(.*)<\/script>//sgi;
results in the printing of several new line characters, but no other text, same for /msgi.
Urgh, the problems never end... :(
NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempt at practice, since this seems to fail more often than not). However, I am unfortunately forced to use regex to parse selected HTML, ones that it will be possible to remove the majority, if not all, of the HTML tags. I am not allowed to use a module, despite this being the most obvious and simplest of answers.
If you are not allowed to use anything but Perl regular expressions then you could adapt the code to strip HTML tags from a text:
#!/usr/bin/perl -w
use strict;
use warnings;
$_ = do { local $/; <DATA> };
# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
# find < ,
# comment <!-- ... -->,
# or comment <? ... ?> ,
# or one of the start tags which require correspond
# end tag plus all to end tag
# or if \s or ="
# then skip to next "
# else [^>]
# >
s{
< # open tag
(?: # open group (A)
(!--) | # comment (1) or
(\?) | # another comment (2) or
(?i: # open group (B) for /i
( # one of start tags
SCRIPT | # for which
APPLET | # must be skipped
OBJECT | # all content
STYLE # to correspond
) # end tag (3)
) | # close group (B), or
([!/A-Za-z]) # one of these chars, remember in (4)
) # close group (A)
(?(4) # if previous case is (4)
(?: # open group (C)
(?! # and next is not : (D)
[\s=] # \s or "="
["`'] # with open quotes
) # close (D)
[^>] | # and not close tag or
[\s=] # \s or "=" with
`[^`]*` | # something in quotes ` or
[\s=] # \s or "=" with
'[^']*' | # something in quotes ' or
[\s=] # \s or "=" with
"[^"]*" # something in quotes "
)* # repeat (C) 0 or more times
| # else (if previous case is not (4))
.*? # minimum of any chars
) # end if previous char is (4)
(?(1) # if comment (1)
(?<=--) # wait for "--"
) # end if comment (1)
(?(2) # if another comment (2)
(?<=\?) # wait for "?"
) # end if another comment (2)
(?(3) # if one of tags-containers (3)
</ # wait for end
(?i:\3) # of this tag
(?:\s[^>]*)? # skip junk to ">"
) # end if (3)
> # tag closed
}{}gsx; # STRIP THIS TAG
print;
__END__
<html><title>remove script, ul</title>
<script type="text/javascript">
function getCookies() { return ""; }
</script>
<body>
<ul><li>1
<li>2
<p>paragraph
Output
remove script, ul
1
2
paragraph
NOTE: This regex doesn't work for nested tag-containers e.g.:
<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested <object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
<object data="uri:another">uri:another fallback
</object>!!!this text should be striped too!!!
</object>
Output
Nested <object> example
!!!this text should be striped too!!!
Don't parse HTML with regexes. Use an HTML parser or a tool built on top of one, e.g. HTML::Parser:
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::Parser ();
HTML::Parser->new(
ignore_elements => ["script"],
ignore_tags => ["ul"],
default_h => [ sub { print shift }, 'text'],
)->parse_file(\*DATA) or die "error: $!\n";
__END__
<html><title>remove script, ul</title>
<script type="text/javascript">
function getCookies() { return ""; }
</script>
<body>
<ul><li>1
<li>2
<p>paragraph
Output
<html><title>remove script, ul</title>
<body>
<li>1
<li>2
<p>paragraph
To reply to your last comment:
perl -e'$file="<script etc>\nfoo\n</script>bar"; $file =~ s/<script.*script>//gis; print $file'
this does seem to do what you want, as suggested by others. I don't see how that is different from what you're trying, though.
....
Can you add this:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper($file);
before the regexp and give us the result?
.....
Bingo:
line 5 and 6 of your $file =~ list already filter them out:
$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
## Here they come:
$file =~ s/<script(.*)>//gi;
$file =~ s/<\/script>//gi;
$file =~ s/<head>//gi;
I'm not sure what programming language you're using, but assuming that you're in perl, try putting the s modifier at the end of the regex:
$file =~ /<script type(.*)<\/script>/sgi
The /s modifier makes the . match any character, including newlines (normally it doesn't include newlines)
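A short sketch of the difference, using a made-up strip_scripts helper and a shortened page. It uses the non-greedy .*? so that each script block is matched on its own:

```perl
use strict;
use warnings;

# Hypothetical helper: remove <script>...</script> blocks, with or without /s.
sub strip_scripts {
    my ($html, $dot_matches_newline) = @_;
    if ($dot_matches_newline) {
        $html =~ s/<script\b.*?<\/script>//gis;   # /s lets '.' cross line breaks
    }
    else {
        $html =~ s/<script\b.*?<\/script>//gi;    # '.' stops at "\n"
    }
    return $html;
}

my $page = qq{<p>a</p><script type="text/javascript">\nfunction f() {}\n</script><p>b</p>};

print strip_scripts($page, 0) eq $page ? "without /s: nothing removed\n" : "without /s: changed\n";
print strip_scripts($page, 1), "\n";   # prints <p>a</p><p>b</p>
```

Without /s the pattern cannot get from the opening tag to the closing tag when they are on different lines, so nothing is removed.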
Edit: I apologize, I'm not good at Perl, but I did some looking around and I finally realized that the s/ in front is for substitutions. In this case, your regex should be:
$file =~ s/<script type(.*)<\/script>//sgi;
to remove everything, including the script tags. However, if you want to remove only the content between the tags and keep the tags themselves, it is:
$file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/$1$2/sgi;
Notice the $1$2 between the slashes. This text is the replacement text. In this case we are using the text from the capturing groups in place of the original. In your question you were using two slashes in a row (s/<ul(.*)>//gi), which means you're replacing the whole match with an empty string. It seems to me that you're actually looking to replace everything with a blank space (ASCII 0x20) like s/<ul(.*)>/ /gi.
Since your last edit - You'll want to use one regex for the scripts since you don't want the contents:
$file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/ /sgi;
and another generic regex for all the other tags:
$file =~ s/<\/?\s*[^>]+>//sgi
I'm assuming here that you don't want to limit to just the tags you displayed above, you just want to kill all HTML. There is a *nix utility called html2text that does this. You might want to look into using that.
You’re going to have to be a lot more careful than that. See both approaches in this answer.
This:
$file =~ s/<div(.*)>//gi;
won't do what you expect. The '*' operator is greedy. If you have a line like:
hello<div id="foo"><b>bar!</b>baz
it'll substitute as much as it can, leaving only:
hellobaz
You want:
$file =~ s/<div[^>]*>//gi;
or
$file =~ s/<div.*?>//gi;
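The three patterns can be compared on the example line. A quick sketch (the function names are made up):

```perl
use strict;
use warnings;

# Greedy: (.*) runs to the last '>' on the line.
sub strip_div_greedy { my ($s) = @_; $s =~ s/<div(.*)>//gi;  return $s; }

# Negated class: stops at the first '>' and can never run past it.
sub strip_div_class  { my ($s) = @_; $s =~ s/<div[^>]*>//gi; return $s; }

# Non-greedy: takes as little as possible before the next '>'.
sub strip_div_lazy   { my ($s) = @_; $s =~ s/<div.*?>//gi;   return $s; }

my $line = 'hello<div id="foo"><b>bar!</b>baz';
print strip_div_greedy($line), "\n";   # prints hellobaz
print strip_div_class($line),  "\n";   # prints hello<b>bar!</b>baz
print strip_div_lazy($line),   "\n";   # prints hello<b>bar!</b>baz
```

The [^>]* form also works when the tag spans several lines, because a negated character class matches newlines. The .*? form needs /s for that.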

How can I extract data from HTML tables in Perl?

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.
I've tried many regular expressions such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like so:
#!perl -w
open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;
my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";
if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
if ($lines =~ / <td>(.*)< / ) {
print "this is the 2nd 1: \n";
print $1;
print "the word was: $1.\n";
$Time = $1;
print $Time;
print OUTPUT_FILE $Time;
} else {
print "2ND IF FAILED\n";
}
} else {
print "THIS FAILED\n";
}
close(INPUT_FILE);
close(OUTPUT_FILE);
Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
HTML::Parser
HTML::TreeBuilder
HTML::TableExtract
Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $file = 'Table3.htm';
unless ( -e $file ) {
my $rc = getstore(
'http://www.ntsb.gov/aviation/Table3.htm',
$file);
die "Failed to download document\n" unless $rc == 200;
}
my @headers = qw( Year Fatalities );
my $te = HTML::TableExtract->new(
headers => \@headers,
attribs => { id => 'myTable' },
);
$te->parse_file($file);
my ($table) = $te->tables;
print join("\t", @headers), "\n";
for my $row ($te->rows ) {
print join("\t", @$row), "\n";
}
This is what I meant in another post by "task-specific" HTML parsers.
You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
That's an easy one:
my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";
See http://codepad.org/qz9d5Bro if you want to try running it.