How to Encode and Decode "Acute accented characters" using Perl - html

I am working in a web based educational website, where we are using Perl, MySQL 5, Apache and Template Toolkit. we are planning to introduce the support for multiple\ Language in our website.
What we have done in
IF we have a Tab name like Courses Main Page<\h1> in our template file, we have converted that to
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<h1>[% glossary.$language.courses_main_page %]<\h1>
where $language is getting the value which user selects when he logs in.
We have a table to maintain this data in our Mysql DB:
CREATE TABLE translation ( english varchar(255) NOT NULL,
language varchar(255) NOT NULL, translation varchar(2000) NOT
NULL, ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Translation of
Element text to a foreign language'
IN the connect function of MySQL, I am providing 'SET character_set_results=NULL'.
I tried with utf8, but the issue which is limited to some tabs got increased to many sections.
So as soon as the user logins into the system, we fetch all the translation and store it in a PERL hash and Cache it. we pass this hash to template file which will replace the value.
Problem: Acute accented characters like á and é etc are getting replaced with some different character set symbols.
For ex: in Front end we are seeing "Cursos Página Principal" for Cursos Página Principal.
It is very similar to the solution given in htmlentities and é (e acute)
Can any one tell me how to achieve the same in Perl.

Denoting the charset
For ex: in Front end we are seeing "Cursos Página Principal" for Cursos Página Principal.
This mojibake happens when the characters are transferred as UTF-8 but interpreted as ISO-8859-1 or similar. So I suggest the easiest way to fix this is making sure that your HTML page gets shipped to the client with a proper mime type, i.e.
Content-Type: text/html; charset=utf-8
If that information is present in the HTML header, the value there will override any setting in the HTML document itself. So make sure that either you set the HTML header, or that your HTML header specifies no charset at all, so that the browser will have a look at the meta setting.
In some browsers (Firefox for example) you can manually change the character set using View / Character Encoding. You can use that to check whether a wrong character encoding while rendering really is the cause of the problem.
Actually encoding and decoding
There are some situations where fixing the charset won't help. It might be that you simply don't control that part of your framework. Or that something translates your characters from ISO-8859-1 to UTF-8 twice, so that the unreadable symbols are in fact represented as UTF-8 already. In these cases, you can use the Encode module to encode the characters in Perl directly, using HTML character references as output:
use Encode qw(decode encode FB_HTMLCREF);
# maybe: $unicodeString = decode("utf-8", $byteString);
$htmlString = encode("ascii", $unicodeString, FB_HTMLCREF);
Whether or not the decode step is neccessary depends on how you talk to your database. If your database connection is capable of supporting unicode, then you'll already have unicode strings, and you can simply encode these to HTML. For DBD::mysql there is a parameter mysql_enable_utf8 => 1 which achieves this. Using it is preferable to decoding things in your own code. This answer has details on the syntax.
One example on what these functions do:
$byteString = "Cursos P\xc3\xa1gina Principal."; # two bytes
$unicodeString = "Cursos P\N{U+00E1}gina Principal."; # one unicode character
$htmlString = "Cursos Página Principal."; # html character reference

Related

Perl Remove invalid characters, invalid latin1 characters from string

I have a perl script that reads from a web service and saves in a mysql table. this table uses latin1. from the web service there are coming some wrong characters and need to remove them before saving them in the database, otherwise they get saved as '?'
wanted to do something similar as:
$desc=~s///gsi;
but is not removing them.
the webservice that has the wrong characters is: https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478
using a user agent to get the data, seems coming in utf8 but the characters need to be removed:
my $ua = LWP::UserAgent->new ();
$ua->default_headers->push_header ('Accept' =>
"text/html,application/xhtml" .
"+xml,application/xml");
$ua->default_headers->push_header ('Accept-Charset' => "utf-8");
my $doc = $ua->get ("https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478")
If you just want to remove the characters outside the 7-bit ascii set (which are sufficient to display messages in english), you can you do this:
$desc=~s/[^\x00-\x7f]//g
Edit: If you want something more elaborate that supports the entire latin-1 set, you can do this:
use Encode;
$desc=encode('latin-1',$desc,sub {''});
This will remove exactly the characters that cannot be represented by latin-1. Note that this line expects that the utf-8 flag is on for the string $desc and that the resulting string will have the utf-8 flag is off.
Finally, if you want to preserve the euro sign (€), please note that you cannot do that with latin-1 because it is not part of that encoding. You will have to use a different encoding, such as ISO-8859-15.
The content sent by the web service is XML that contains HTML in the Description tag. If this is that content that worries you, another option than deleting non-Latin-1 character is to encode characters using HTML encoding:
$desc =~ s/([^\x00-\x7f])/sprintf("&%d;", ord $1)/ge
Here is an example:
$ echo 'é' | perl -C -pE 's/([^\x00-\x7f])/sprintf("&%d;", ord $1)/ge'
&233;
Change your column definition to CHARACTER SET utf8mb4 so that the naughty character does not need to be removed, and can actually be stored.

HTML special character

I need some help, i entered a special character in form like "ä" and it was save in a database, but when I call it using a javascript , a suggestion type, it does not display correctly, it display like this "ä".
Is it possible from the form "ä" it will be save as "&auml" in the database? how is it?
This is caused by misinterpreting a UTF-8 string as ISO-8859-1. The character "ä" is represented in UTF-8 by the two bytes 0xC3 0xA4, but in ISO-8859-1, the byte 0xC3 represents the character "Ã", and the byte 0xA4 represents the character "¤".
Most likely, your program is sending the page as UTF-8 but not telling the browser what encoding was used, so the browser is assuming ISO-8859-1 (the default). You need to send a Content-Type HTTP header that specifies the encoding, such as text/html; charset=UTF-8.
If you haven't already, you should define the character set used by your page.
Add this inside your <head> tags:
<meta charset="utf-8">
There are other character sets as well, but UTF-8, as stated by Google back in 2008, has become one of the most popular today.
This is due to your character set, but there are two character sets that you need to keep in mind. The first is the character set in the database, and the second is the character set in your HTML. They need to match.
Even if you do something like $mysqli->set_charset("utf8"); in your connection script, the database needs to have utf8 also for the data to display properly.

special characters with Net::Twitter::Lite

I try to send characters like ü, ä, ß, à and so on to twitter. If I use unicode characters in my scripts they come out wrong in twitter. If I use HTML (which is possible in twitter's web-interface and which used to work previously) I see now ü rather than "ü" in the post. Is there a parameter or something that I have to set? Some call to encode/decode? I am using:
use Net::Twitter::Lite::WithAPIv1_1;
I find myself checking out the test suite of perl modules quite often as it is a good source for examples.
Net::Twitter expects decoded characters, not encoded bytes So, sending
encoded utf8 to Net::Twitter will result in double encoded data.
Source: https://metacpan.org/source/MMIMS/Net-Twitter-Lite-0.12006/t/unicode.t
Try
use encoding 'utf8';
at the begining of your script. sometimes this is the solution of many utf8 problems.

Extended ASCII characters show up as junk in MySQL db is inserted through perl

I have a MySQL 'articles' table and I am trying to make the following insert using SQLyog.
insert into articles (id,title) values (2356606,'Jérôme_Lejeune');
This works fine and the data shows fine when I do a select query.
The problem is that when I do the same insert query through my perl script, the name shows up with some junk characters in place of é and ô in the database. I need to know how to properly store the name through my script. The part of code that does the insert is like this.
$sql_insert = "insert into articles (id,title) values (?,?)";
$sth_insert = $dbh->prepare($sql_insert);
$sth_insert->execute($id,$title);
$id and $title have the correct required data which I have checked by print before I am inserting them. Please assist.
You have opened up the character encoding can of worms, and you have a lot to learn before you will solve this problem and have it stay solved.
You are probably already used to thinking of how a character of text can be encoded as a string of bits. Under the ASCII encoding, for example, the 8-bit string 01000001 (65) is used to indicate the A character. When you start to think about how many different languages there are and how many different kinds of characters there are, you quickly realize that an 8-bit encoding is not going to get you very far. So a number of other character encodings have proliferated. Some of the most popular are latin1 (ISO-8859-1) and UTF-8. Both of these encodings can render the é and ô characters, but they use quite different bit strings to represent them. As you write to a file (or to the terminal) or add a row to a database, Perl and MySQL have a notion of what the character encoding of the output stream is. An encoding is also used when you read data. If you don't know what this encoding is, then it doesn't make any sense to say that the data looks good/looks bad when you store it and retrieve it.
Perl and MySQL can, with the right settings, handle both of these encodings and several more. Which encoding you choose to use is not as important as making sure that all the pieces of your application are using the same encoding. But you should choose an encoding that
can encode all of the characters you will need (for this problem, you mention é and ô, but will there be others? what about in the future?)
is supported by all the pieces of your application (front-end, database, back-end)
Here's some suggested reading to get you headed in the right direction:
The Encode module for Perl
character sets in MySQL
(others should feel free to recommend additional links)
I can't speak to MySQL so much, but character encoding support in Perl is rapidly evolving (which isn't to say that it ain't damn good). The latest versions of Perl will have the best support (for the most obscure character sets) and the best features (for example, regular expressions and character classes) for characters beyond ASCII.
There are few things to follow.
First you have to make sure, that Perl understands that data which is moving between your program and DB is encoded as UTF-8 (i expect your databases and tables are set properly). For this you need to say it loud out on connecting to database, like this:
my($dbh) = DBI->connect(
'dbi:mysql:test',
'user',
'password',
{
mysql_enable_utf8 => 1,
}
);
Next, you need send data to output and you must set it to decaode data as UTF-8. For this i like pretty good module:
use utf8::all;
But this module is not in core, so you may want to set it with binmode yourself too:
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
And if you deal with webpages, you have to make sure, that browser understoods that you are sending your data encoded as UTF-8. For that you should make sure your HTTP-headers include encoding:
Content-Type: text/html; charset=utf-8;
and set it with HTML META-tag too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Now you should get your road covered.

Messy code when I open a lua file contain Chinese with MacVim

How to set MacVim display code.
Here is the mess when I open the lua file which create in Windows XP.
gControlMode = 0; -- 1£º¿ªÆôÖØÁ¦¸ÐÓ¦£¬ 0:¿ª´¥ÆÁģʽ
gState = GS_GAME;
sTotalTime = 0; --µ±Ç°¹Ø¿¨»¨µÄ×Üʱ¼ä
The text you posted seems like a Latin-1 (or ISO-8859-1, CP819) decoding of the CP936 encoding (or EUC‑CN, or GB18030 encodings1) of this text2:
gControlMode = 0; -- 1:开启重力感应, 0:开触屏模式
gState = GS_GAME;
sTotalTime = 0; --当前关卡花的总时间
When opening a file, Vim tries the list of encodings specified in the fileencodings option. Usually, latin1 is the last value in this list; reading as Latin-1 will always be successful since it is an 8-bit encoding that maps all 256 values. Thus, Vim is opening your CP936 encoded file as Latin-1.
You have several choices for getting Vim to use another encoding:
You can specify an encoding with the ++enc= option to Vim’s :edit command (this will cause Vim to ignore the fileencodings list for the buffer):
:e ++enc=cp936 /path/to/file
You can apply this to an already-loaded file by leaving off the path:
:e ++enc=cp936
You can add your preferred encoding to fileencodings just before latin1 (e.g. in your ~/.vimrc):
let &fileencodings = substitute(&fileencodings, 'latin1', 'cp936,\0', '')
You can set the encoding option to your desired encoding. This is usually discouraged because it has wide-ranging impacts (see :help encoding).
It might make sense, if possible, to switch your files to UTF-8 since many editors will properly auto-detect UTF-8. Once you have the file loaded properly (see above), Vim can do the conversion like this (set fileencoding, then :write):
:set fenc=utf-8 | w
Vim should pretty much automatically handle reading and writing UTF-8 files (encoding defaults to UTF-8, and utf-8 is in the default fileencodings), but if you are using other editors (i.e. whatever Windows editor edited/created the CP936 file(s)), you may need to configure them to use UTF-8 instead of (e.g.) CP936.
1 I am not familiar with the encodings used for Chinese text, these encodings seem to be identical for the “expected” text.
2 I do not read Chinese, but the presence and locations of the FULLWIDTH COLON and FULLWIDTH COMMA (and Google's translation of this text) make me think this is the text you expected.