HTML::Entities encoding and single ampersand

I'm attempting to use the following line of Perl, as described in "Does anyone know of a vim plugin or script to convert special characters to their corresponding HTML entities?", to encode HTML entities in Vim.
%!perl -p -i -e 'BEGIN { use HTML::Entities; use Encode; } $_=Encode::decode_utf8($_) unless Encode::is_utf8($_); $_=Encode::encode("ascii", $_, sub{HTML::Entities::encode_entities(chr shift)});'
It works fine (£ to &pound;, curly quotes, etc.) except for an ampersand on its own - & - which is left as it is.
I've tried removing the utf8 decoding, and I've looked at the CPAN documentation for HTML::Entities.
Answer:
@ZyX has answered the original question, but as others have pointed out in the comments, this is redundant: it's not actually necessary to use HTML entities if you are serving pages with a UTF-8 character set, which I am, both with the meta tag
<meta charset="utf-8">
and also in the Apache configuration:
AddDefaultCharset utf-8
Indeed, it's arguably a bad thing to add them in such cases: the file size is bigger, and the text is obfuscated should anyone want to make use of the source code.
It's essential to ensure that whatever editor(s) you use to create the files write them as UTF-8 as well.

My answer was only encoding characters above the ASCII range. If you want to encode something as HTML, you should use
$text=HTML::Entities::encode_entities($text);
i.e., as a Vim filter command:
%!perl -MHTML::Entities -MEncode -p -e '$_=Encode::decode_utf8($_) unless Encode::is_utf8($_); $_=HTML::Entities::encode_entities($_);'
I was not using this in that answer because the OP only asked to encode Unicode characters, not <, >, and & as well.
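For comparison, here is a minimal sketch (the sample string is made up) of what the default encode_entities does; note that a bare & is encoded as well:
use HTML::Entities;

my $s = "fish & chips for \xA3 5";   # \xA3 is the pound sign
print HTML::Entities::encode_entities($s), "\n";
# prints: fish &amp; chips for &pound; 5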
By the way, you may use $text=HTML::Entities::encode_entities($text, '<>&"'); to encode only the really unsafe characters, though I guess this is as easily expressed in Vimscript:
:let entities={'<': 'lt', '>': 'gt', '&': 'amp', '"': 'quot'}
:execute '%s/['.escape(join(keys(entities), ''), '\-]^').']/\="&".entities[submatch(0)].";"/g'

perl -MHTML::Entities -e 'print encode_entities shift'
should work, shouldn't it?

Related

Perl Remove invalid characters, invalid latin1 characters from string

I have a Perl script that reads from a web service and saves to a MySQL table. This table uses latin1. Some wrong characters are coming in from the web service, and I need to remove them before saving them to the database; otherwise they get saved as '?'.
I wanted to do something like:
$desc=~s///gsi;
but it is not removing them.
the webservice that has the wrong characters is: https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478
I'm using a user agent to get the data; it seems to come in as utf8, but the characters need to be removed:
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->default_headers->push_header('Accept' =>
    "text/html,application/xhtml" .
    "+xml,application/xml");
$ua->default_headers->push_header('Accept-Charset' => "utf-8");
my $doc = $ua->get("https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478");
If you just want to remove the characters outside the 7-bit ASCII set (which are sufficient to display messages in English), you can do this:
$desc =~ s/[^\x00-\x7f]//g;
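For example (assuming a UTF-8 terminal; the sample text is made up):
$ echo 'naïve café' | perl -CS -pe 's/[^\x00-\x7f]//g'
nave caf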
Edit: If you want something more elaborate that supports the entire latin-1 set, you can do this:
use Encode;
$desc=encode('latin-1',$desc,sub {''});
This will remove exactly the characters that cannot be represented in latin-1. Note that this line expects the utf-8 flag to be on for the string $desc, and that the resulting string will have the utf-8 flag off.
Finally, if you want to preserve the euro sign (€), please note that you cannot do that with latin-1 because it is not part of that encoding. You will have to use a different encoding, such as ISO-8859-15.
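Putting the above together, a small sketch (the sample string is made up):
use Encode;

my $desc = "d\x{e9}j\x{e0} vu for 10\x{20ac}";  # "déjà vu for 10€" as a decoded string
$desc = encode('latin-1', $desc, sub { '' });   # é and à survive, € is dropped
print $desc, "\n";                              # prints the latin-1 bytes of "déjà vu for 10"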
The content sent by the web service is XML that contains HTML in the Description tag. If it is that content that worries you, another option besides deleting the non-Latin-1 characters is to encode them as HTML numeric entities:
$desc =~ s/([^\x00-\x7f])/sprintf("&#%d;", ord $1)/ge;
Here is an example:
$ echo 'é' | perl -C -pE 's/([^\x00-\x7f])/sprintf("&#%d;", ord $1)/ge'
&#233;
Change your column definition to CHARACTER SET utf8mb4 so that the naughty character does not need to be removed, and can actually be stored.
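For example (the table and column names here are hypothetical):
ALTER TABLE vacancies MODIFY description TEXT CHARACTER SET utf8mb4;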

mysqldump, avoid escape double quotes

When I have a field with double quotes, mysqldump put the escape character before. For example:
'My "test"' -> mysqldump generates a file with that field like 'My \"test\"'
The problem is that I'm using that file to import some data into an SQLite database, and SQLite doesn't remove the escape character. So I need mysqldump to not write the escape characters; can I do that?
If you are on a *nix (Unix, Linux, macOS) machine, I can suggest using sed to replace \" with ":
mysqldump [options] DB [table] | sed -e 's/\\"/"/g'
or, for an existing mydump.sql file:
sed -i -e 's/\\"/"/g' mydump.sql
EDIT: Searching for a formal MySQL answer, I only found this somewhat related open bug about single-quote escaping: https://bugs.mysql.com/bug.php?id=65941
It's also worth mentioning that there are other tools that can dump DBs without escaping quotes with backslash in the first place, such as JetBrains products (DataGrip, IntelliJ etc).
I am unsure what language you are using, but the concept would be the same in any programming language.
First declare a variable equal to your SQL dump:
var yourVariable = sqldump;
Then do a string replace of \" with ", e.g. yourVariable.replace('\\"', '"').
Then use that 'cleaned' version to import your data.
From a Google search I found this, which reworks the output of mysqldump so SQLite can use it. It is untested and 5 years old, but I think:
you may find useful information there;
it may mean other people didn't find a more elegant solution.
EDIT:
As per the comments, to answer your question: it looks like you have no choice but to replace \" with ".
Here is a repo with a script which translates mysqldump files into something sqlite can digest.
In particular, for your question, you can find this line:
gsub( /\\"/, "\"" )
Which SQL mode do you use?
Look at the mysqldump manual: http://dev.mysql.com/doc/refman/5.7/en/mysqldump.html
--quote-names, -Q
Quote identifiers (such as database, table, and column names) within
“`” characters. If the ANSI_QUOTES SQL mode is enabled, identifiers
are quoted within “"” characters. This option is enabled by default.
It can be disabled with --skip-quote-names, but this option should be
given after any option such as --compatible that may enable
--quote-names.
--quote-names is enabled by default.
ANSI_QUOTES
Treat “"” as an identifier quote character (like the “`” quote
character) and not as a string quote character. You can still use “`”
to quote identifiers with this mode enabled. With ANSI_QUOTES enabled,
you cannot use double quotation marks to quote literal strings,
because it is interpreted as an identifier.
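For example, a dump produced in ANSI mode (the database name is a placeholder) quotes identifiers with " instead of `:
mysqldump --compatible=ansi mydb > mydump.sql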
You can convert the escapes in the mysqldump output file. Check the esperlu MySQL-to-SQLite converter script:
...
/INSERT/ {
    gsub( /\\\047/, "\047\047" )  # \'  ->  ''  (SQL-standard quote escaping)
    gsub( /\\n/, "\n" )           # literal \n  ->  newline
    gsub( /\\r/, "\r" )           # literal \r  ->  carriage return
    gsub( /\\"/, "\"" )           # \"  ->  "
    gsub( /\\\\/, "\\" )          # \\  ->  \
    gsub( /\\\032/, "\032" )      # escaped Ctrl-Z  ->  Ctrl-Z
    print
    next
}
...
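Assuming the script is saved as mysql2sqlite.sh, it wraps mysqldump itself, so a typical invocation would look something like (connection options and names are placeholders):
./mysql2sqlite.sh -u root -p mydb | sqlite3 mydatabase.sqlite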

Perl: Why do I need to set the latin1 flag explicitly since JSON 2.xx?

Since JSON 2.xx, I need to set the latin1 flag in order to get umlauts safely into the HTML document:
my $obj_with_umlauts = {
    title => 'geändert',
};
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);
This was not necessary using JSON 1.xx :
my $json = JSON->new()->objToJson($obj_with_umlauts);
The HTML document is in iso-8859-1 (meta tag).
Can anybody explain to me why?
This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".
What are you talking about?
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->objToJson(["\xE4"]);
say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D # Unicode code points for ["ä"]
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->latin1()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definitely is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and its Unicode code point is U+00E4 - two different notations for the same number.
If the string is encoded using cp1252, though, you'll have problems with the characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters that are in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but its Unicode code point is U+20AC, and 0x80 != 0x20AC.
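A one-liner sketch of that pitfall:
$ perl -MEncode -E'say sprintf "U+%04X", ord decode("cp1252", "\x80")'
U+20AC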
The HTML document is in iso-8859-1 (meta tag).
Then at some point you'll have to encode the output into iso-8859-1. You can do it using an :encoding layer, using Encode's encode, or using JSON's ->latin1 directive. The advantage of this final option is that it will cause JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
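A quick sketch in the same style as the examples above; the euro sign is outside iso-8859-1, so ->latin1 escapes it:
$ perl -MJSON -E'say sprintf "%v02X", JSON->new()->latin1()->encode(["\x{20AC}"])'
5B.22.5C.75.32.30.61.63.22.5D # iso-8859-1 encoding of ["\u20ac"]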
Can anybody explain to me why?
You have code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. There is a bug in that module.

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strstr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace calls cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and GNU libiconv) support the //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
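If so, the suffix goes straight into the target-encoding argument of the call from the question (an untested sketch):
$value = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $value);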
What do you mean by "link-friendly"? The only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii function that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, nevermind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...
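A sketch of that suggestion, replacing the three UTF-8 bytes of a right single quote (226, 128, 153) with a plain apostrophe (the table and column names are hypothetical):
UPDATE products SET name = REPLACE(name, CHAR(226, 128, 153 USING utf8), '''');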

How can I decode HTML entities?

Here's a quick Perl question:
How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and could write it for all HTML characters, but some function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.
Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy &amp; Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want, and you don't want all the Unicode characters, you can look at the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:
use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);
my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));
# That prints: Bei Jing
Note that there are hex-specified characters too. They look like this: &#xE9; (é).
Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (Perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need it reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities doc.
There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard-code.
However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
I use this script. Save it as html2utf.py, and use it like echo $some_html | html2utf.py.
#!/usr/bin/env python3
"""
An alternative for `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which you can use by `cpanm HTML::Entities`) and `recode html..`.
"""
import fileinput
import html
for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
I have created a one-liner for bash, using Perl to decode the HTML entities that are passed to it. My solution is a blend of this answer (see above) and something I found on commandlinefu.com last week.
Most of us who code in Bash aren't in the habit of using echo -n to strip out the \n newline character, since it doesn't usually affect Bash text parsing. With Perl, and with this particular method, it's important to use echo -n, or else perl will interpret the newline \n character as a literal part of the response, adding an unwanted %0A to your results.
Here's my bash-perl one-liner hybrid:
encodedURL="$(echo -n "$entityURL" | perl -MHTML::Entities -MURI::Escape -ne 'print uri_escape(decode_entities($_))')"
Example:
Input: Seals &amp; Croft - Summer Breeze
Output: Seals%20%26%20Croft%20-%20Summer%20Breeze