Perl Encoding - Saving File to UTF8 - html

I have a script that downloads web pages, and I want to extract the text and store it in a uniform encoding (UTF-8 would be fine). The downloading (UserAgent), parsing (TreeBuilder) and text extraction seem fine, but I'm not sure I'm saving the results correctly.
They don't display correctly when I open the output file in, for example, Notepad++; the original HTML displays fine in a text editor.
The HTML files typically have
charset=windows-1256 or
charset=UTF-8
So I figured if I could get the UTF-8 one to work, then it was just a recoding problem. Here is some of what I have tried, assuming I have an HTML file saved to disk.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file("$inhtml");
$tree->dump;
The output from dump, captured from STDOUT, displays correctly in a .txt file only after switching the encoding to UTF-8 in the text editor…
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
if (utf8::is_utf8($formatter->format($tree))) {
    print " Is UTF8\n";
}
else {
    print " Not UTF8\n";
}
The result shows "Is UTF8" when the content says it is, and "Not UTF8" otherwise.
I have tried
opening the file with ">" and ">:utf8"
binmode(MYFILE, ":utf8");
encode("utf8", $string); (where $string is the output of $formatter->format($tree))
But nothing seems to work correctly.
Any experts out there know what I'm missing?
Thanks in advance!

This example can help you to find what you need:
use strict;
use warnings;
use feature qw(say);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );
# Decode the input (assumed here to be cp1252) and encode the output as UTF-8
open(my $fh_in, "<:encoding(cp1252)", $ARGV[0]) or die $!;
open(my $fh_out, ">:encoding(UTF-8)", $ARGV[1]) or die $!;

# Object::Destroyer deletes the tree automatically when it goes out of scope
my $tree = Object::Destroyer->new(HTML::TreeBuilder->new(), 'delete');
$tree->parse_file($fh_in);

my $h1Element = $tree->look_down("_tag", "h1");
my $h1TrimmedText = $h1Element->as_trimmed_text();
say $fh_out $h1TrimmedText;    # note: no comma between the filehandle and the argument

I really like the module utf8::all (unfortunately not in core).
Just use utf8::all and you have no worries about IO when you work only with UTF-8 files.
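As a minimal sketch (assuming a UTF-8 encoded file named page.html; the file name is a placeholder), the pragma sets the default encoding for the standard handles, @ARGV, and every filehandle you open:

```perl
use strict;
use warnings;
use utf8::all;    # STDIN/STDOUT/STDERR, @ARGV and open() now default to UTF-8

# No ':encoding(UTF-8)' layer needed; utf8::all applies it implicitly
open my $fh, '<', 'page.html' or die $!;
print while <$fh>;
```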

Related

JSON encoding in Perl output

Context:
I have to migrate a Perl script to Python. The problem is that the configuration files this Perl script uses are actually valid Perl code. My Python version of it uses .yaml files as config.
Therefore, I basically had to write a converter between Perl and YAML. Given that, from what I found, Perl does not play well with YAML, but there are libs that allow dumping Perl hashes into JSON, and that Python works with JSON almost natively, I used this format as an intermediate: Perl -> JSON -> YAML. The first conversion is done in Perl code, and the second one in Python code (which also does some mangling on the data).
Using the library mentioned by @simbabque, I can output YAML natively, which I must afterwards modify and play with. As I know next to nothing of Perl, I prefer to do so in Python.
Problem:
The source config files look something like this:
$sites = {
    "0100101001" => {
        mail    => 1,
        from    => 'mail@mail.com',
        to      => 'mail@mail.com',
        subject => 'á é í ó ú',
        msg     => 'á é í ó ú',
        ftp     => 0,
        sftp    => 0,
    },
    "22222222" => {
    [...]
And many more of those.
My "parsing" code is the following:
use strict;
use warnings;
# use JSON;
use YAML;
use utf8;
use Encode;
use Getopt::Long;
my $conf;
GetOptions('conf=s' => \$conf) or die;
our (
$sites
);
do $conf;
# my $json = encode_json($sites);
my $yaml = Dump($sites);
binmode(STDOUT, ':encoding(utf8)');
# print($json);
print($yaml);
Nothing out of the ordinary. I simply need the YAML version of the Perl data. In fact, it mostly works. My problem is with the encoding.
The output of the above code is this:
[...snip...]
mail: 1
msg: Ã¡ Ã© Ã­ Ã³ Ãº
sftp: 0
subject: Ã¡ Ã© Ã­ Ã³ Ãº
[...snip...]
The encoding goes to hell and back. As far as I read, UTF-8 is the default, and just in case, I force it with binmode, but to no avail.
What am I missing here? Any workaround?
Note: I thought it may have been my shell, but locale outputs this:
❯ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Which seems ok.
Note 2: I know next to nothing of Perl, and it is not my intent to become an expert in it, so any enhancements/tips are greatly appreciated too.
Note 3: I read this answer, and my code is loosely based on it. The main difference is that I'm not sure how to encode a file, instead of a simple string.
The sites config file is UTF-8 encoded. Here are three workarounds:
Put the use utf8 pragma inside the site configuration file. A use utf8 pragma in the main script is not sufficient to make files included with do/require be treated as UTF-8 encoded.
If that is not feasible, decode the input before you pass it to the JSON encoder. Something like
open CFG, "<:encoding(utf-8)", $conf;
do { local $/; eval <CFG> };
close CFG;
instead of
do $conf
Use JSON::to_json instead of JSON::encode_json. encode_json expects decoded input (Unicode code points) and the output is UTF-8 encoded. The output of to_json is not encoded, or rather, it will have the same encoding as the input, which is what you want.
There is no need to encode the final output as UTF-8. Using any of the three workarounds will already produce UTF-8 encoded output.
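For the first workaround, the configuration file itself carries the pragma; a sketch (sites.conf is a hypothetical file name, loaded via do $conf):

```perl
# sites.conf -- hypothetical configuration file, loaded with `do $conf`
use utf8;    # tell perl this source file is UTF-8 encoded

$sites = {
    "0100101001" => {
        subject => 'á é í ó ú',
        msg     => 'á é í ó ú',
    },
};
```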

How to generate csv file with default UTF-8 encoding using php

I am generating a CSV file by fetching data from a MySQL table and uploading it to an FTP folder. But another person using this CSV file on the FTP side says that it is in ANSI encoding. How can I change that to UTF-8 encoding? I am using the code below.
header('Content-Encoding: utf-8');
header('Content-Type: text/csv; charset=utf-8');
$fh1 = fopen($current_csv_name, 'w+');
foreach ($csv_data as $curl_response)
{
    fputs($fh1, implode(';', $curl_response) . "\n");
}
fclose($fh1);
When I download the file, open it in Notepad, and click Save As, it always shows the encoding as ANSI. Where am I going wrong? Any help would be greatly appreciated.
I believe you need to add a BOM (byte order mark) at the beginning of the file; otherwise editors such as Notepad will assume the charset is ANSI by default. Try doing this:
header('Content-Encoding: utf-8');
header('Content-Type: text/csv; charset=utf-8');
$fh1 = fopen($current_csv_name, 'w+');
// Write a UTF-8 BOM first so editors detect the encoding
$bom = chr(0xEF) . chr(0xBB) . chr(0xBF);
fputs($fh1, $bom);
foreach ($csv_data as $curl_response)
{
    fputs($fh1, implode(';', $curl_response) . "\n");
}
fclose($fh1);

decoding JSON issue in Perl code: , or ] expected while parsing array, at character offset

I have written Perl code that was working until recently, when I tried to run it again. The problem seems to originate from the JSON::XS "decode_json" method.
Code Snippet:
use warnings;
use strict;
use MooseX::Singleton;
use Array::Utils qw(:all);
use Data::Dumper;
use JSON::XS qw(encode_json decode_json);
use Storable;
use Tie::IxHash;
open (my $observations_fh, '<', 'observations.json') or die "Could not open observations.json: $!\n";
my $observations_json = <$observations_fh>;
my @decoded_observations = @{ decode_json($observations_json) };
Usually, after this code I was able to go through each JSON component in a for loop and take specific information, but now I get the error:
, or ] expected while parsing array, at character offset 5144816
(before "(end of string)")
I saw a similar question here, but it didn't resolve my problem.
I also have similar JSON decoding going on that doesn't utilize @{decode_json($variable)}, but when I tried that with this observations.json file, the same error was output.
I also tried just using the JSON module, but same error occurred.
Any insight would be greatly appreciated!
-cookersjs
That probably indicates you have incomplete JSON in $observations_json. Your assumption that the entire file consists of just one line is probably incorrect. Use
my $observations;
{
    open (my $observations_fh, '<', 'observations.json')
        or die("Can't open observations.json: $!\n");
    local $/;    # slurp mode: read the whole file, not just one line
    my $observations_json = <$observations_fh>;
    $observations = decode_json($observations_json);
}
If that doesn't help, observations.json doesn't contain valid JSON.

Data Type of Module Output

I have a script that I run on various texts to convert XHTML entities (e.g., &uuml;) to ASCII. For example, my script is written in the following manner:
open (INPUT, '+<file') || die "File doesn't exist! $!";
open (OUTPUT, '>file') || die "Can't find file! $!";
while (<INPUT>) {
    s/&uuml;/ü/g;
    print OUTPUT $_;
}
This works as expected and substitutes the XHTML with the ASCII equivalent. However, since this is often run, I've attempted to convert it into a module. But Perl doesn't return "ü"; it returns the decomposition. How can I get Perl to return the data with the ASCII equivalent (as run and printed in my regular .pl file)?
There is no ASCII. Not in practice anyway, and certainly not outside the US. I suggest you specify an encoding that will have all characters you might encounter (ASCII does not contain ü, it is only a 7-bit encoding!). Latin-1 is possible, but still suboptimal, so you should use Unicode, preferably UTF-8.
If you don't want to output in Unicode, at least your Perl script should be encoded with UTF-8. To signal this to the perl interpreter, use utf8 at the top of your script.
Then open the input file with an encoding layer like this:
open my $fh, "<:encoding(UTF-8)", $filename
The same goes for the output file. Just make sure to specify an encoding when you want to use one.
You can change the encoding of a file with binmode, just see the documentation.
You can also use the Encode module to translate a byte string to unicode and vice versa. See this excellent question for further information about using Unicode with Perl.
If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus on the I/O.
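A minimal sketch combining both suggestions (the file names are placeholders); decode_entities from HTML::Entities does the entity-to-character conversion:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Read and write through UTF-8 encoding layers
open my $in,  '<:encoding(UTF-8)', 'input.html' or die $!;
open my $out, '>:encoding(UTF-8)', 'output.txt' or die $!;

while (my $line = <$in>) {
    print $out decode_entities($line);    # &uuml; becomes ü, etc.
}
```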

Strip HTML from files in a directory with Perl

I asked a question a couple of days ago about stripping HTML from files with Perl. I am a n00b and I've searched the site for answers to my question...but unfortunately I couldn't find anything...this is probably because I'm a n00b and I didn't see the answer when I was looking at it.
So, here is the situation. I have a directory with around 20 GB of text files. I want to strip the HTML from each file and output each file to a unique text file. I've written the program below, which seems to do the trick for the first 12 text files in the directory (there are about 12,000 text files in total)...however...I run into a couple of snags. The first snag is that after the 12th text file has been parsed, I start getting warnings about deep recursion...and shortly after this the program quits because I've run out of memory. I imagine that my programming is extremely inefficient. So, I'm wondering if any of you see any obvious errors with my code below that would cause me to run out of memory. ...once I figure things out then hopefully I'll be able to contribute.
#!/usr/bin/perl -w
#use strict;
use Benchmark;
#get the HTML-Format package from the package manager.
use HTML::Formatter;
#get the HTML-TREE from the package manager
use HTML::TreeBuilder;
use HTML::FormatText;
$startTime = new Benchmark;
my $direct="C:\\Directory";
my $slash='\\';
opendir(DIR1,"$direct")||die "Can't open directory";
my @New1=readdir(DIR1);
foreach $file (@New1)
{
if ($file=~/^\./){next;}
#Initialize the variable names.
my $HTML=0;
my $tree="Empty";
my $data="";
#Open the file and put the file in variable called $data
{
local $/;
open (SLURP, "$direct$slash"."$file") or die "can't open $file: $!";
#read the contents into data
$data = <SLURP>;
#close the filehandle called SLURP
close SLURP or die "cannot close $file: $!";
if($data=~m/<HTML>/i){$HTML=1;}
if($HTML==1)
{
#the following steps strip out any HTML tags, etc.
$tree=HTML::TreeBuilder->new->parse($data);
$formatter=HTML::FormatText->new(leftmargin=> 0, rightmargin=>60);
$Alldata=$formatter->format($tree);
}
}
#print
my $outfile = "out_".$file;
open (FOUT, "> $direct\\$outfile");
print FOUT "file: $file\nHTML: $HTML\n$Alldata\n","*" x 40, "\n" ;
close(FOUT);
}
$endTime = new Benchmark;
$runTime = timediff($endTime, $startTime);
print ("Processing files took ", timestr($runTime));
You are using up a lot of space with the list of files in @New1.
In addition, if you are using an older version of HTML::TreeBuilder then your objects of this class may need explicitly deleting, as they used to be immune to automatic Perl garbage collection.
Here is a program that avoids both of these problems, by reading the directory incrementally, and by using HTML::FormatText->format_string to format the text, which implicitly deletes any HTML::TreeBuilder objects that it creates.
In addition, File::Spec makes a tidier job of building absolute file paths, and it is a core module so it will not need installing on your system.
use strict;
use warnings;
use File::Spec;
use HTML::FormatText;
my $direct = 'C:\Directory';
opendir my $dh, $direct or die "Can't open directory";
while ( readdir $dh ) {

    next if /^\./;

    my $file    = File::Spec->catfile($direct, $_);
    my $outfile = File::Spec->catfile($direct, "out_$_");

    next unless -f $file;

    my $html = do {
        open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
        local $/;
        <$fh>;
    };

    next unless $html =~ /<html/i;

    my $formatted = HTML::FormatText->format_string(
        $html, leftmargin => 0, rightmargin => 60);

    open my $fh, '>', $outfile or die qq(Unable to open "$outfile" for writing: $!);
    print $fh "File: $file\n\n";
    print $fh "$formatted\n";
    print $fh "*" x 40, "\n";
    close $fh or die qq(Unable to close "$outfile" after writing: $!);
}
What was wrong with the answer to your previous question?
You're opening files for writing without checking the return code. Are you sure they succeed? And in which directory do you think the files are created?
A better approach would be to:
read files 1 by 1
strip the HTML
write out the new file in the correct directory and checking the return code
something like:
while ( my $file = readdir DIR ) {
    # ... process file ...
    open my $newfile, '>', "$direct/out_$file" or die "cannot open out_$file: $!\n";
    # ... etc
}
How to reduce the memory footprint of this application:
Does the problem persist when you add $tree = $tree->delete to the end of your loop?
The perl garbage collector cannot resolve circular references; so you have to destroy the tree manually so you don't run out of memory.
(See the first example in the module documentation at http://metacpan.org/pod/HTML::TreeBuilder)
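A minimal sketch of that manual cleanup (the file name is a placeholder):

```perl
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file('page.html');
# ... extract whatever you need from $tree ...
$tree = $tree->delete;    # break the circular references so memory is reclaimed
```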
You should put the readdir inside the loop. The way you coded it, you first read in this gigantic list of files. When you say
my $file;
while (defined($file = readdir DIR1)) {..}
only one entry is actually read at a time. Should save some extra memory.
A few other comments on style:
default values
You give $tree the default value of "Empty". That is completely unnecessary. If you want to show how undefined a variable is, set it to undef, which it is by default. Perl guarantees this initialization.
backslashes
You use backslashes as a directory separator? Stop worrying and just use normal slashes. Unless you are on DOS you can use normal slashes as well, Windows isn't that dumb.
statement modifiers
This line
if ($file=~/^\./){next;}
can be written far more readable as
next if $file =~ /^\./;
consistent use of parens
Your use of parens for function argument lists is inconsistent. You can omit the parens for all built-in functions unless there is ambiguity. I prefer avoiding them; others may find them easier to read. But please stick to one style!
better regex
You test for the existence of /<HTML>/i. What if I told you the html tag can have attributes? You should rather consider testing for /<html/i.
simplification (removes another bug)
Your test
if($data=~m/<HTML>/i){$HTML=1;}
if($HTML==1) {...}
can be written as
$HTML = $data =~ /<html/i;
if ($HTML == 1) {...}
can be written as
$HTML = $data =~ /<html/i;
if ($HTML) {...}
can be folded into
if ($data =~ /<html/i) {...}
The way you implemented it, the $HTML variable was never reset to a false value. So once a file contained html, all subsequent files would have been treated as html as well. You can counteract such problems by defining your vars in the innermost sensible scope.
use HTML::FormatText, tribute to @pavel
Use the modules you use to the fullest. Look what I found in the example for HTML::FormatText:
my $string = HTML::FormatText->format_file(
'test.html',
leftmargin => 0, rightmargin => 50
);
You can easily adapt that to circumvent building the tree manually. Why hadn't you tried this approach, as @pavel told you to in your other post? It would have saved you the memory problem...
use strict
Why did you comment out use strict? Getting as much fatal warnings as possible is important when learning a language. Or when writing solid code. That would force you to declare all your variables like $file sensibly. And rather use warnings than the -w switch, which is a bit outdated.
well done
But a very big "well done" on checking the return value of close ;-) That is very un-n00bish!