Special characters with Net::Twitter::Lite

I am trying to send characters like ü, ä, ß, à and so on to Twitter. If I use Unicode characters in my scripts, they come out wrong on Twitter. If I use HTML entities (which is possible in Twitter's web interface and which used to work previously), I now see the literal &uuml; rather than "ü" in the post. Is there a parameter or something that I have to set? Some call to encode/decode? I am using:
use Net::Twitter::Lite::WithAPIv1_1;

I find myself checking out the test suites of Perl modules quite often, as they are a good source of examples.
Net::Twitter expects decoded characters, not encoded bytes. So, sending
encoded UTF-8 to Net::Twitter will result in double-encoded data.
Source: https://metacpan.org/source/MMIMS/Net-Twitter-Lite-0.12006/t/unicode.t
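In practice that means decoding any byte-oriented input (a file, the terminal, a web form) into Perl characters before calling the API. A minimal sketch under that assumption (credentials and the input string are placeholders):

use strict;
use warnings;
use Encode qw(decode);
use Net::Twitter::Lite::WithAPIv1_1;

my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
    consumer_key        => 'KEY',
    consumer_secret     => 'SECRET',
    access_token        => 'TOKEN',
    access_token_secret => 'TOKEN_SECRET',
    ssl                 => 1,
);

my $raw_bytes = "G\xc3\xbcnther";            # UTF-8 bytes, e.g. read from a file
my $status    = decode('UTF-8', $raw_bytes); # decode to Perl characters first
$nt->update($status);                        # the module expects characters, not bytes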

Try
use encoding 'utf8';
at the beginning of your script. Sometimes this is the solution to many UTF-8 problems.
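Note that the encoding pragma has been deprecated since Perl 5.18. On a modern Perl, the equivalent effect for UTF-8 source and output is the utf8 pragma plus an explicit output layer, sketched here:

use utf8;                           # this script itself is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)'; # encode characters when printing
my $word = "über";
print "$word\n";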

Related

Perl JSON encode in UTF-8 strange behaviour

Based on the Perl JSON 2.90 documentation, to encode a JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious, and it is what I did. After a while, I got an issue report on GitHub from one of my users, which really surprised me, as it shouldn't have been happening!
I spent hours trying to figure out what was going on, but the solution turned out to be very weird and, from my point of view, wrong.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all sorts of different characters, including Russian and Chinese, and it just worked.
Can anyone please explain why encoding works correctly with latin1 and not with utf8, when it actually should be vice versa?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3.A9, you are suffering from this bug.
If you are suffering from this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, you are suffering from this bug.
If you are suffering from this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has a :utf8 or :encoding layer enabled (which is incorrect when writing already-encoded data). As unintuitive as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).
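To make the correct pattern concrete, here is a minimal sketch of the first fix (the data is illustrative): decoded characters in, UTF-8 bytes out, and no second encoding layer on the handle.

use strict;
use warnings;
use JSON;

my $perl_scalar = { title => "install\x{E9}" };           # decoded Perl characters (é)
my $json_text   = JSON->new->utf8->encode($perl_scalar);  # UTF-8 encoded bytes

binmode STDOUT, ':raw'; # do not add a second encoding layer
print $json_text, "\n";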

How to Encode and Decode "Acute accented characters" using Perl

I am working on a web-based educational website where we are using Perl, MySQL 5, Apache and Template Toolkit. We are planning to introduce support for multiple languages on our website.
What we have done:
If we have a tab name like <h1>Courses Main Page</h1> in our template file, we have converted that to
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<h1>[% glossary.$language.courses_main_page %]</h1>
where $language gets the value the user selects when logging in.
We have a table to maintain this data in our MySQL DB:
CREATE TABLE translation (
  english varchar(255) NOT NULL,
  language varchar(255) NOT NULL,
  translation varchar(2000) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Translation of element text to a foreign language';
In the connect function of MySQL, I am providing 'SET character_set_results=NULL'. I tried it with utf8, but the issue, which had been limited to some tabs, then spread to many sections.
As soon as the user logs into the system, we fetch all the translations, store them in a Perl hash, and cache it. We pass this hash to the template file, which substitutes the values.
Problem: acute accented characters like á and é are getting replaced with different character-set symbols.
For example: in the front end we are seeing "Cursos PÃ¡gina Principal" for Cursos Página Principal.
It is very similar to the problem addressed in htmlentities and é (e acute).
Can anyone tell me how to achieve the same in Perl?
Denoting the charset
For example: in the front end we are seeing "Cursos PÃ¡gina Principal" for Cursos Página Principal.
This mojibake happens when the characters are transferred as UTF-8 but interpreted as ISO-8859-1 or similar. So the easiest way to fix this is to make sure that your HTML page gets shipped to the client with a proper MIME type, i.e.
Content-Type: text/html; charset=utf-8
If that information is present in the HTTP header, the value there will override any setting in the HTML document itself. So make sure that either you set the HTTP header correctly, or that your HTTP header specifies no charset at all, so that the browser will have a look at the meta setting.
In some browsers (Firefox for example) you can manually change the character set using View / Character Encoding. You can use that to check whether a wrong character encoding while rendering really is the cause of the problem.
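For example, with the classic CGI module (assuming a plain CGI setup, which the question does not specify), the header can be emitted like this:

use CGI;
my $q = CGI->new;
# sends: Content-Type: text/html; charset=utf-8
print $q->header(-type => 'text/html', -charset => 'utf-8');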
Actually encoding and decoding
There are some situations where fixing the charset won't help. It might be that you simply don't control that part of your framework. Or that something translates your characters from ISO-8859-1 to UTF-8 twice, so that the unreadable symbols are in fact represented as UTF-8 already. In these cases, you can use the Encode module to encode the characters in Perl directly, using HTML character references as output:
use Encode qw(decode encode FB_HTMLCREF);
# maybe: $unicodeString = decode("utf-8", $byteString);
$htmlString = encode("ascii", $unicodeString, FB_HTMLCREF);
Whether or not the decode step is necessary depends on how you talk to your database. If your database connection is capable of supporting Unicode, then you'll already have Unicode strings, and you can simply encode these to HTML. For DBD::mysql there is a parameter mysql_enable_utf8 => 1 which achieves this. Using it is preferable to decoding things in your own code. This answer has details on the syntax.
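A minimal sketch of such a connection (DSN and credentials are placeholders):

use DBI;
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password',
    { mysql_enable_utf8 => 1, RaiseError => 1 }); # strings come back decoded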
One example of what these functions do:
$byteString = "Cursos P\xc3\xa1gina Principal."; # two bytes
$unicodeString = "Cursos P\N{U+00E1}gina Principal."; # one unicode character
$htmlString = "Cursos Página Principal."; # html character reference

Extended ASCII characters show up as junk in MySQL db when inserted through Perl

I have a MySQL 'articles' table and I am trying to make the following insert using SQLyog.
insert into articles (id,title) values (2356606,'Jérôme_Lejeune');
This works fine and the data shows fine when I do a select query.
The problem is that when I do the same insert query through my Perl script, the name shows up with junk characters in place of é and ô in the database. I need to know how to properly store the name through my script. The part of the code that does the insert is like this:
$sql_insert = "insert into articles (id,title) values (?,?)";
$sth_insert = $dbh->prepare($sql_insert);
$sth_insert->execute($id,$title);
$id and $title have the correct required data, which I have checked by printing them before the insert. Please assist.
You have opened up the character encoding can of worms, and you have a lot to learn before you will solve this problem and have it stay solved.
You are probably already used to thinking of how a character of text can be encoded as a string of bits. Under the ASCII encoding, for example, the 8-bit string 01000001 (65) is used to indicate the A character. When you start to think about how many different languages there are and how many different kinds of characters there are, you quickly realize that an 8-bit encoding is not going to get you very far. So a number of other character encodings have proliferated. Some of the most popular are latin1 (ISO-8859-1) and UTF-8. Both of these encodings can render the é and ô characters, but they use quite different bit strings to represent them.
As you write to a file (or to the terminal) or add a row to a database, Perl and MySQL have a notion of what the character encoding of the output stream is. An encoding is also used when you read data. If you don't know what this encoding is, then it doesn't make any sense to say that the data looks good/looks bad when you store it and retrieve it.
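To make "quite different bit strings" concrete, here is a small sketch printing the bytes each encoding produces for the same character (expected output in the comments):

use Encode qw(encode);
my $char = "\x{E9}";                                  # the character é
printf "latin1: %vX\n", encode('ISO-8859-1', $char);  # latin1: E9
printf "utf8:   %vX\n", encode('UTF-8', $char);       # utf8:   C3.A9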
Perl and MySQL can, with the right settings, handle both of these encodings and several more. Which encoding you choose to use is not as important as making sure that all the pieces of your application are using the same encoding. But you should choose an encoding that
can encode all of the characters you will need (for this problem, you mention é and ô, but will there be others? what about in the future?)
is supported by all the pieces of your application (front-end, database, back-end)
Here's some suggested reading to get you headed in the right direction:
The Encode module for Perl
character sets in MySQL
(others should feel free to recommend additional links)
I can't speak to MySQL so much, but character encoding support in Perl is rapidly evolving (which isn't to say that it ain't damn good). The latest versions of Perl will have the best support (for the most obscure character sets) and the best features (for example, regular expressions and character classes) for characters beyond ASCII.
There are a few things to follow.
First you have to make sure that Perl understands the data moving between your program and the DB is encoded as UTF-8 (I expect your databases and tables are set up properly). For this you need to say it out loud when connecting to the database, like this:
my($dbh) = DBI->connect(
    'dbi:mysql:test',
    'user',
    'password',
    {
        mysql_enable_utf8 => 1,
    }
);
Next, you need to send data to output, and you must set the output layer to encode the data as UTF-8. For this I like this pretty good module:
use utf8::all;
But this module is not in core, so you may want to set it up with binmode yourself instead:
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
And if you deal with webpages, you have to make sure the browser understands that you are sending your data encoded as UTF-8. For that your HTTP headers should include the encoding:
Content-Type: text/html; charset=utf-8
and set it with HTML META-tag too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Now you should be covered.

Pinyin tone marked symbols and MySQL

I'm creating a MySQL database storing Chinese characters with their associated pīnyīn pronunciations. I've set everything up to work in the UTF-8 charset, so I'm having no trouble with most of the symbols I'm using, except, strangely, certain Latin characters with tone marks, and only when I write them into the database from $_POST, using PHP.
Those are: all characters with an acute accent (á, é, í, ó, ú), except ǘ (?!); and all characters with a grave accent (à, è, ì, ò, ù), again except ǜ. When they are typed into a form and that form is submitted to the DB, those characters are just cut off, as if they never existed. E.g., cháng is submitted as chng. Any other characters (with a caron, like ǎ, or a macron, like ā) are written in fine, and so are actual Chinese characters.
Again, I'm using UTF-8 everywhere possible, and so far this sort of problem has only appeared when submitting data from a form. Earlier, I ran a script to manually insert an array containing those characters into the database, and everything went fine.
Any ideas?
I think you could post the pinyin in a numbered format,
e.g. cháng as cha2ng,
and deal with the posted information in the PHP script with some mapping method.
Here's a method to deal with it:
Convert numbered to accentuated Pinyin?
Hopefully, it helps you.
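As an illustration of that mapping idea, here is a simplified sketch, in Perl to match the rest of this page (it assumes the tone digit directly follows the vowel it marks, as in the question's cha2ng; real pinyin has more edge cases):

use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# combining diacritics for tones 1-4; tone 5 is neutral (no mark)
my @tone_marks = ("\x{304}", "\x{301}", "\x{30C}", "\x{300}"); # macron, acute, caron, grave

sub numbered_to_accented {
    my ($syllable) = @_;
    my ($pre, $tone, $post) = $syllable =~ /^([a-zü]+)([1-5])([a-zü]*)$/i
        or return $syllable;                      # not numbered pinyin, pass through
    return $pre . $post if $tone == 5;            # neutral tone, no mark
    return $pre . $tone_marks[$tone - 1] . $post; # attach the mark to the preceding vowel
}

print numbered_to_accented('cha2ng'), "\n"; # prints cháng (decomposed form)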
I got a solution!
Before:
SELECT 'liàng' = 'liǎng';
Change to:
SELECT CONVERT('liàng' USING BINARY) = CONVERT('liǎng' USING BINARY) AS equal;

UTF-8 Character Encoding in SQL

I am trying out Bullzip's Access to MySQL app on an Access DB full of special chars like é and ä. The app allows you to specify UTF-8 encoding, but in the resulting SQL file I get "Vieux CarrÃ©" instead of "Vieux Carré".
I tried opening the SQL file in UltraEdit and doing a UTF-8 conversion, but it does not resolve the issue, as I guess it is converting the "Ã©" and never sees the "é"?
What is a Good™ solution for this?
The problem is in the UTF-8 to Unicode conversion into or out of Access. Access, like SQL Server, can only natively store data in ASCII format or as Unicode (UTF-16, with Unicode compression off). To ensure a given value is stored properly, you would need to convert it to Unicode on storage and convert it back to UTF-8 on retrieval. You may be able to use the StrConv function for such a purpose.
I have the same problem with the Bullzip converter now, so this could still help someone.
It doesn't show the special characters right if I have my PC language set to English. I have to switch it back to Czech (the language of the special characters), and then it works and the SQL looks correct.