I am generating a CSV file from data fetched from a MySQL table and uploading it to an FTP folder. But the person consuming this CSV file on the FTP side says it is in ANSI encoding. How can I change that to UTF-8 encoding? I am using the code below.
header('Content-Encoding: utf-8');
header('Content-Type: text/csv; charset=utf-8');
$fh1 = fopen($current_csv_name, 'w+');
foreach($csv_data as $curl_response)
{
    fputs($fh1, implode(';', $curl_response)."\n");
}
fclose($fh1);
When I download the file, open it in Notepad, and click Save As, it always shows the encoding as ANSI. Where am I going wrong? Any help would be greatly appreciated.
I believe you need to add a BOM (Byte Order Mark) at the beginning of the file; otherwise, editors like Notepad will assume the charset is ANSI by default. Try doing this:
// These headers only matter if the CSV is also served over HTTP;
// they have no effect on the file written below.
header('Content-Encoding: utf-8');
header('Content-Type: text/csv; charset=utf-8');
$fh1 = fopen($current_csv_name, 'w+');
// Write the UTF-8 BOM (EF BB BF) so editors such as Notepad detect UTF-8.
$bom = chr(0xEF) . chr(0xBB) . chr(0xBF);
fputs($fh1, $bom);
foreach($csv_data as $curl_response)
{
    fputs($fh1, implode(';', $curl_response)."\n");
}
fclose($fh1);
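Note that the BOM only labels the file; the bytes you write must actually be UTF-8. If your MySQL connection hands back latin1, set the connection charset before fetching. A minimal sketch, assuming the mysqli extension (the connection values, table and columns are placeholders):

// Hypothetical connection values; substitute your own.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8mb4');   // make the driver return UTF-8 strings

$result   = $db->query('SELECT col1, col2 FROM my_table');
$csv_data = $result->fetch_all(MYSQLI_NUM);   // numeric rows, ready for implode()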
I want to read data from CSV files that can have one of two encodings (UTF-8 or ISO-8859-15). I mean different files with different encodings, not the same file in two encodings.
Right now I can only read data correctly from a UTF-8 encoded file. Can I implement this just by adding an extra option, for example encoding: 'ISO-8859-15'?
What I have:
def csv
  # Open by path; the original `file = File.open(file.tempfile)` shadows
  # the `file` reader with a local variable before it is assigned.
  CSV.open(file.tempfile.path, csv_options)
end
private

def csv_options
  {
    col_sep: ";",
    headers: true,
    return_headers: false,
    skip_blanks: true
  }
end
Once you know which encoding your file has, you can pass it inside the CSV options, e.g.
external_encoding: Encoding::ISO_8859_15,
internal_encoding: Encoding::UTF_8
(This declares that the file is ISO-8859-15, but that you want the strings internally as UTF-8.)
So the strategy is: decide first (before opening the file) which encoding you have, and then use the appropriate option Hash.
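Putting that together with the options from the question, a minimal sketch (it assumes you already know each file's encoding; the file name is a placeholder):

require 'csv'

def csv_options(external)
  {
    col_sep: ";",
    headers: true,
    return_headers: false,
    skip_blanks: true,
    external_encoding: external,          # how the bytes on disk are encoded
    internal_encoding: Encoding::UTF_8    # what you want the strings to be
  }
end

# Decide per file, then read with the matching options.
rows = CSV.read("legacy.csv", **csv_options(Encoding::ISO_8859_15))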
Context:
I have to migrate a Perl script to Python. The problem is that the configuration files this Perl script uses are actually valid Perl code. My Python version of it uses .yaml files as config.
Therefore, I basically had to write a converter between Perl and YAML. Given that, from what I found, Perl does not play well with YAML, but there are libs that allow dumping Perl hashes into JSON, and that Python works with JSON almost natively, I used this format as an intermediate: Perl -> JSON -> YAML. The first conversion is done in Perl code, and the second one in Python code (which also does some mangling of the data).
Using the library mentioned by @simbabque, I can output YAML natively, which I must afterwards modify and play with. As I know next to nothing of Perl, I prefer to do so in Python.
Problem:
The source config files look something like this:
$sites = {
    "0100101001" => {
        mail    => 1,
        from    => 'mail@mail.com',
        to      => 'mail@mail.com',
        subject => 'á é í ó ú',
        msg     => 'á é í ó ú',
        ftp     => 0,
        sftp    => 0,
    },
    "22222222" => {
    [...]
And many more of those.
My "parsing" code is the following:
use strict;
use warnings;
# use JSON;
use YAML;
use utf8;
use Encode;
use Getopt::Long;
my $conf;
GetOptions('conf=s' => \$conf) or die;
our (
$sites
);
do $conf;
# my $json = encode_json($sites);
my $yaml = Dump($sites);
binmode(STDOUT, ':encoding(utf8)');
# print($json);
print($yaml);
Nothing out of the ordinary. I simply need the YAML version of the Perl data. In fact, it mostly works. My problem is with the encoding.
The output of the above code is this:
[...snip...]
mail: 1
msg: á é à ó ú
sftp: 0
subject: á é à ó ú
[...snip...]
The encoding goes to hell and back. As far as I read, UTF-8 is the default, and just in case, I force it with binmode, but to no avail.
What am I missing here? Any workaround?
Note: I thought it may have been my shell, but locale outputs this:
❯ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Which seems ok.
Note 2: I know next to nothing of Perl, and it is not my intent to become an expert on it, so any enhancements/tips are greatly appreciated too.
Note 3: I read this answer, and my code is loosely based on it. The main difference is that I'm not sure how to encode a file, instead of a simple string.
The sites config file is UTF-8 encoded. Here are three workarounds:
Put the use utf8 pragma inside the site configuration file. The use utf8 pragma in the main script is not sufficient to make files included with do/require be treated as UTF-8 encoded.
If that is not feasible, decode the input before you pass it to the JSON encoder. Something like
open my $cfg, '<:encoding(UTF-8)', $conf or die "Cannot open $conf: $!";
do { local $/; eval <$cfg> };   # slurp the file and eval it as Perl code
close $cfg;
instead of
do $conf
Use JSON::to_json instead of JSON::encode_json. encode_json expects decoded input (Unicode code points) and the output is UTF-8 encoded. The output of to_json is not encoded, or rather, it will have the same encoding as the input, which is what you want.
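A quick illustration of the difference, relevant if you keep the JSON intermediate step:

use JSON qw(to_json encode_json);

my $data = { subject => "\x{E1} \x{E9}" };   # decoded characters: "á é"

my $text   = to_json($data);       # character string, not yet encoded
my $octets = encode_json($data);   # UTF-8 octets, ready for a raw filehandle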
There is no need to encode the final output as UTF-8. Using any of the three workarounds will already produce UTF-8 encoded output.
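Putting workaround 2 into the original script, a minimal sketch (it assumes, as in the question, that the config file assigns to $sites):

use strict;
use warnings;
use YAML;
use Getopt::Long;

my $conf;
GetOptions('conf=s' => \$conf) or die "usage: $0 --conf FILE\n";

our $sites;

# Read the config through a UTF-8 layer and eval it as Perl code,
# so its string literals end up as decoded characters.
open my $cfg, '<:encoding(UTF-8)', $conf or die "Cannot open $conf: $!";
my $code = do { local $/; <$cfg> };
close $cfg;
eval $code;
die $@ if $@;

# Dump() returns a character string; encode it on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print Dump($sites);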
I have a CSV file upload, and the file can be UTF-8 encoded or something else like KOI8-R. My question is: when I say
File.new @path, "r:#{encoding}"
result = CSV.read(@uploaded_file, { :headers => true, :encoding => encoding })
where encoding is KOI8-R,
and I write the result to MySQL, does Rails 3 automatically convert the values from KOI8-R to UTF-8?
Thanks in advance :)
I think it is not Rails that converts the encoding, but your database. My advice is to check the database settings for the encodings.
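If you would rather not rely on database settings, you can also transcode while reading the CSV, so only UTF-8 strings ever reach MySQL. A sketch, assuming @uploaded_file is a readable path:

require 'csv'

# Read the bytes as KOI8-R, hand back UTF-8 strings.
result = CSV.read(@uploaded_file,
                  :headers => true,
                  :external_encoding => Encoding::KOI8_R,
                  :internal_encoding => Encoding::UTF_8)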
I have a script that I run on various texts to convert XHTML entities (e.g., &uuml;) to ASCII. For example, my script is written in the following manner:
open (INPUT, '+<file') || die "File doesn't exist! $!";
open (OUTPUT, '>file') || die "Can't find file! $!";  # NB: should be a different file than INPUT
while (<INPUT>) {
    s/&uuml;/ü/g;     # replace the XHTML entity with the literal character
    print OUTPUT $_;  # print each line inside the loop
}
This works as expected and substitutes the XHTML with the ASCII equivalent. However, since this is run often, I've attempted to convert it into a module. But Perl doesn't return "ü"; it returns the decomposition. How can I get Perl to return the data with the ASCII equivalent (as it is run and printed in my regular .pl file)?
There is no ASCII. Not in practice, anyway, and certainly not outside the US. I suggest you specify an encoding that has all the characters you might encounter (ASCII does not contain ü; it is only a 7-bit encoding!). Latin-1 is possible but still suboptimal, so you should use Unicode, preferably UTF-8.
If you don't want to output Unicode, your Perl script should at least be encoded in UTF-8. To signal this to the perl interpreter, put use utf8 at the top of your script.
Then open the input file with an encoding layer like this:
open my $fh, "<:encoding(UTF-8)", $filename or die "Can't open $filename: $!";
The same goes for the output file. Just make sure to specify an encoding whenever you want to use one.
You can change the encoding of a file with binmode, just see the documentation.
You can also use the Encode module to translate a byte string to unicode and vice versa. See this excellent question for further information about using Unicode with Perl.
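For example (the byte values shown are the UTF-8 encoding of "ü"):

use Encode qw(decode encode);

my $bytes = "\xC3\xBC";              # UTF-8 octets for "ü"
my $text  = decode('UTF-8', $bytes); # byte string -> Unicode character string
my $again = encode('UTF-8', $text);  # character string -> back to octets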
If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus on the I/O.
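A sketch of that approach (the file names are placeholders): decode_entities turns &uuml; and friends back into characters, so no hand-written substitutions are needed.

use strict;
use warnings;
use HTML::Entities qw(decode_entities);

open my $in,  '<:encoding(UTF-8)', 'input.html' or die $!;
open my $out, '>:encoding(UTF-8)', 'output.txt' or die $!;

while (my $line = <$in>) {
    print $out decode_entities($line);   # &uuml; becomes ü, &amp; becomes &, etc.
}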
I have a script that downloads web pages, and I want to extract the text and store it in a uniform encoding (UTF-8 would be fine). The downloading (UserAgent), parsing (TreeBuilder) and text extraction seem fine, but I'm not sure I'm saving them correctly.
They don't display correctly when I open the output file in, for example, Notepad++; the original HTML displays fine in a text editor.
The HTML files typically have
charset=windows-1256 or
charset=UTF-8
So I figured that if I could get the UTF-8 one to work, it was just a re-encoding problem. Here is some of what I have tried, assuming I have an HTML file saved to disk.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file("$inhtml");
$tree->dump;
The output from dump, captured from STDOUT, displays correctly in a .txt file only after switching the encoding to UTF-8 in the text editor.
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
if (utf8::is_utf8($formatter->format($tree))) {
print " Is UTF8\n";
}
else {
print " Not UTF8\n";
}
The result shows "Is UTF8" when the content says it is, and "Not UTF8" otherwise.
I have tried:
opening a file with ">" and ">:utf8"
binmode(MYFILE, ":utf8");
encode("utf8", $string); (where $string is the output of $formatter->format($tree))
But nothing seems to work correctly.
Do any experts out there know what I'm missing?
Thanks in advance!
This example can help you to find what you need:
use strict;
use warnings;
use feature qw(say);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );
# The question's non-UTF-8 pages declare windows-1256, so decode that on
# input; encode UTF-8 on output and work with decoded text in between.
open(my $fh_in, "<:encoding(cp1256)", $ARGV[0]) or die $!;
open(my $fh_out, ">:encoding(UTF-8)", $ARGV[1]) or die $!;
# Object::Destroyer makes sure the tree's delete() runs when it goes out of scope.
my $tree = Object::Destroyer->new(HTML::TreeBuilder->new(), 'delete');
$tree->parse_file($fh_in);
my $h1Element = $tree->look_down("_tag", "h1");
my $h1TrimmedText = $h1Element->as_trimmed_text();
say {$fh_out} $h1TrimmedText;
I really like the module utf8::all (unfortunately not in core).
Just use utf8::all and you have no worries about I/O when you work only with UTF-8 files.
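A minimal sketch of what that buys you (the file name is a placeholder):

use utf8::all;   # source, STDIN/STDOUT/STDERR, @ARGV and open() all default to UTF-8

open my $fh, '<', 'page.html' or die $!;   # reads UTF-8 without an explicit layer
print while <$fh>;                         # writes UTF-8 to STDOUT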