Perl Remove invalid characters, invalid latin1 characters from string - mysql

I have a perl script that reads from a web service and saves in a mysql table. this table uses latin1. from the web service there are coming some wrong characters and need to remove them before saving them in the database, otherwise they get saved as '?'
wanted to do something similar as:
$desc=~s///gsi;
but is not removing them.
the webservice that has the wrong characters is: https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478
using a user agent to get the data, seems coming in utf8 but the characters need to be removed:
my $ua = LWP::UserAgent->new ();
$ua->default_headers->push_header ('Accept' =>
"text/html,application/xhtml" .
"+xml,application/xml");
$ua->default_headers->push_header ('Accept-Charset' => "utf-8");
my $doc = $ua->get ("https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478")

If you just want to remove the characters outside the 7-bit ascii set (which are sufficient to display messages in english), you can you do this:
$desc=~s/[^\x00-\x7f]//g
Edit: If you want something more elaborate that supports the entire latin-1 set, you can do this:
use Encode;
$desc=encode('latin-1',$desc,sub {''});
This will remove exactly the characters that cannot be represented by latin-1. Note that this line expects that the utf-8 flag is on for the string $desc and that the resulting string will have the utf-8 flag is off.
Finally, if you want to preserve the euro sign (€), please note that you cannot do that with latin-1 because it is not part of that encoding. You will have to use a different encoding, such as ISO-8859-15.

The content sent by the web service is XML that contains HTML in the Description tag. If this is that content that worries you, another option than deleting non-Latin-1 character is to encode characters using HTML encoding:
$desc =~ s/([^\x00-\x7f])/sprintf("&%d;", ord $1)/ge
Here is an example:
$ echo 'é' | perl -C -pE 's/([^\x00-\x7f])/sprintf("&%d;", ord $1)/ge'
&233;

Change your column definition to CHARACTER SET utf8mb4 so that the naughty character does not need to be removed, and can actually be stored.

Related

CGI script having trouble sending emojis characters from database

I am storing emojis in a MySQL database expressed in UTF8 Bytes, like "\xf0\x9f\x98\x80", which is the Unicode character U+1F600 GRINNING FACE
It is fine if I copy and paste it in and test it like this
print MAIL "Subject: \xf0\x9f\x98\x80\n";
It works and sends me the emoji.
But if I tell the script to get it from the database and plug it in like this:
print MAIL "Subject: $subject\n";
It will give me the subject: \xf0\x9f\x98\x80
What do I need to do? I thought if I was storing it in bytes it would see it as plain text and it would work.
It seems most likely that you have added the value to the database wrongly.
If you use Perl code and write the string '\xf0\x9f\x98\x80' to the database (note the single quotes) then you will get exactly the symptoms you describe. Your database will contain the sixteen-character ASCII string \xf0\x9f\x98\x80 and it will be displayed as such.
You shouldn't be involved with the UTF-8 encoded bytes; it is far better to specify the Unicode code point either by name or number
All of these produce the same Perl UTF-8-encoded string
$s = "\N{U+1F600}";
$s = "\N{GRINNING FACE}";
$s = "\x{1F600}";
The corresponding encoded bytes are irrelevant to the programmer, but if you must you can use the Encode module like this
use Encode 'decode_utf8';
$s = decode_utf8 "\xf0\x9f\x98\x80";
Another way is to enter the character directly into your code. You will need use utf8 to indicate to the compiler that the source contains non-ASCII UTF-8-encoded characters, like this
use utf8;
$s = "😀";
All of these assignments to $s will produce exactly the same result, and the values will compare as being equal using eq
On the database side you need a MySQL column with a four-byte UTF-8 character set, for instance
column VARCHAR(50) CHARACTER SET utf8mb4
Note that the character set must be utf8mb4 as if you use the earlier utf8 then you would be restricted to three-byte encoding, whereas emoji characters are all four bytes

Perl: Why do i need to set the latin1 flag explicitly since JSON 2.xx?

Since JSON 2.xx i need to set the latin1 flag in order to get umlauts safe to the html document:
my $obj_with_umlauts = {
title => 'geändert',
}
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);
This was not necessary using JSON 1.xx :
my $json = JSON->new()->objToJson($obj_with_umlauts);
The html document is in iso-8559-1 (meta-tag).
Can anybody explain to me why?
This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".
What are you talking about?
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->objToJson(["\xE4"]);
say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D # Unicode code points for ["ä"]
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->latin1()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definetly is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and it's Unicode code point U+00E4, two different notation for the same number.
If the string is encoded using cp1252, though, you'll have problems with characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but it's Unicode code point U+20AC. 0x80 != 0x20AC.
The html document is in iso-8559-1 (meta-tag).
Then at some point, you'll have to encode the output into iso-8859-1. You can do it using an :encoding layer, or using Encode's encode or using JSON's ->latin1 directive. The advantage of using this final option is that it will cause JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
Can anybody explain to me why?
You have a code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. There is a bug is in that module.

Scrambled umlauts since upgrade from JSON1 to JSON2 in Perl

I wondered why some german umlauts were scrambled on our page.
Then i found out that the recent version of JSON (i use 2.07) does convert strings in an other manner than JSON 1.5.
Problem here is that i have a hash with strings like
use Data::Dumper;
my $test = {
'fields' => 'überrascht'
};
print Dumper(to_json($test)); gives me
$VAR1 = "{ \"fields\" : \"\x{fc}berrascht\" } ";
Using the old module using
$json = JSON->new();
print Dumper ($json->to_json($test));
gives me (the correct result)
$VAR1 = '{"fields":[{"title":"überrascht"}]}';
So umlauts are scrammbled using the new JSON 2 module.
What do i need to get them correct?
Update: It might be bad to use Data::Dumper to show output, because Dumper uses its own encoding. Well, a difference in the result from Dumper shows that anything is treated differently here. It might be better to describe the backend as Brad mentioned:
The json string gets printed using Template-Toolkit and then gets assigned to a javascript variable for further use. The correct javascript shows something like this
{
"title" : "Geändert",
},
using the new module i get
{
"title" : "Geändert",
},
The target page is in 8859-1 (latin1).
Any suggestions?
\x{fc} is ü, at least in Latin-1, Latin-9 etc. Also, ü is codepoint U+00FC in Unicode. However, we want UTF-8 (I suppose). The easiest solution to get UTF-8 string literals is to save your Perl source code with this encoding, and put a use utf8; at the top of your script.
Then, encoding the string as JSON yields correct output:
use strict; use warnings; use utf8;
use Data::Dumper; use JSON;
print Dumper encode_json {fields => "nicht überrascht"};
The encode_json assumes UTF-8. Read the documentation for more info.
Output:
$VAR1 = '{"fields":"nicht überrascht"}';
(JSON module version: 2.53)
my $json_text = to_json($data);
is short for
my $json_text = JSON->new->encode($data);
This returns a string of Unicode Code Points. U+00FC is indeed the correct Unicode code point for "ü", so the output is correct. (As proof, the HTML source for that is actually "ü".)
It's hard to tell what your original output actually contained (since you showed non-ASCII characters), so it's hard to determine what your problem is actually.
But one thing you must do before outputing the string is to convert it from a string of code points into bytes, say, by using Encode's encode or encode_utf8.
my $json_cp1252 = encode('cp1252', to_json($data));
my $json_utf8 = encode_utf8(to_json($data));
If the appropriate encoding is UTF-8, you can also use any of the following:
my $json_utf8 = to_json($data, { utf8 => 1 });
my $json_utf8 = encode_json($data);
my $json_utf8 = JSON->new->utf8->encode($data);
Use encode_json instead. According to the manual it converts the given Perl data structure to a UTF-8 encoded, binary string.
Regarding your update: If you actually want to produce JSON in Latin1 (ISO-8859-1), you can try:
to_json($test, { latin1 => 1 })
Or
JSON->new->latin1->encode($test)
Note that if you dump the result, getting \x{fc} for ü is correct in this case. I guess that the root of your problem is that you receive text in Perl's UTF-8 format from somewhere. In this case, the latin1 option of the JSON module is needed.
You can also try to use ascii instead of latin1 as the safest option.
Another solution might be to specify an output encoding for Template-Toolkit. I don't know if that's possible. Or, you could encode your result as Latin1 in the final step before sending it to the client.
Strictly-speaking, Latin-1-encoded JSON is not valid JSON. The JSON spec allows UTF-8, UTF-16 or UTF-32 encodings.
If you want to be standards-compliant or you want to ensure your JSON will be compatible with both your current pages and future UTF-8-based pages, you need to use JSON->new->utf8->encode($str). Being strict about generated valid JSON could save you lots of headaches in the future.
You can translate UTF-8 JSON to Latin-1 using client-side Javascript if you need to, using this trick.
The ascii option also produces valid JSON, by escaping any non-ASCII characters using valid JSON unicode escapes. But the latin1 option does not, and therefore should be avoided IMHO. The utf8(0) option should be avoided too unless you specify an encoding when writing the data out to clients: utf8(0) is subtly different from the utf8 option in that it generates Perl character strings instead of byte strings. If you do any I/O using character strings without specifying an encoding, Perl will translate it on-the-fly back to Latin-1. The utf8 option generates raw UTF-8 bytes, which are perfect for doing raw I/O.

perl decode string in cyrillic

I've got a problem with the following string:
$str="this is \321\213\321\213\321\213\321\213\321\213 \321\201\320\277\320\260\321\200\321\202\320\260\321\200";
This string is located in an ascii text file and I want to store in a Mysql db (utf8). \321\231 ... are cyrillic symbols.
What can I do to make \321\213 look like cyrillic characters in Mysql db
This should be described in RFC2047, end look like it was utf7 to utf8 conversion.. dont know excatly.
its "unicode escape"
working variant:
use Encode::Escape;
$var1='\321\213';
print decode 'unicode-escape', $var1;
#correct mysql view in phpmyadmin
$dbh = DBI->connect('DBI:mysql:database=test', 'testuser', 'testpass', { mysql_enable_utf8 => 1});
This is not quoted-printable at all. This is Perl quoted string representation, also know as PERLQQ, of a series of octets. The numbers are octal.
These bytes encode UTF-8 for the most part, but the data contain two errors. Looks like one half of a character each somehow fell off. I have marked it with arrows just below.
my $octets = "this is \321\213\321\213\321\213\321\213\321 \321\201\320\277\320\260\321\200\321\202\320\260\321";
# ↑↑↑↑ ↑↑↑↑
This invalid in UTF-8, but can be repaired. We put the Unicode replacement character.
use Encode qw(decode);
my $characters = decode 'UTF-8', $octets, Encode::FB_DEFAULT | Encode::LEAVE_SRC;
# this is ыыыы� спарта�
This character string can now be simply inserted into the database as usual. The DSN in the connect call for DBI or DBIx::Class must include the attribute mysql_enable_utf8.
connect('DBI:mysql:foobar;mysql_enable_utf8=1', …, …);
You need to convert explicitly the codes to characters. For that you need to know what's the input encoding. I suppose it's iso-8859-5, but it could be windows-1252 or something else.
use Encode qw( decode );
my $str="this is \321\213\321\213\321\213\321\213\321 \321\201\320\277\320\260\321\200\321\202\320\260\321";
my $out .= from_to( "iso-8859-5","utf-8", $str );
I've just seen that your source string is indeed QP, so you need to convert from QP to bytes; that's easy, simply use MIME::QuotedPrint:
use MIME::QuotedPrint ();
my $out = MIME::QuotedPrint::decode($str);
Problem is: perl does not know that the string is UTF-8, so you must turn flag explicitly on.
Encode::_utf8_on($str);

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strstr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
What do you mean by "link-friendly"? Only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii library that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, nevermind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...