Cannot properly decode html entities in perl - html

I am having an issue which I am unable to solve after spending the last 10 hours searching around the internet for an answer.
I have some data in this format
??E??0??<?20120529184453+0200?20120529184453+0200???G0E?5?=20111213T103134000-136.225.6.103-30365316-1448169323, ver: 12??W??tP?2??
??|?????
??:o?????tP???B#?????B#??????)0????
49471010550??? ???tP???3??<????????????????
I have a PHP code, not written by me, which is just running html_entity_decode on that and it returns the correct results.
When I try running Perl's decode_entities I get a completely different result. After some debugging it seems to me that PHP is "properly" replacing what seems to be invalid entities, such as,  or  into their ascii counterparts, namely NULL and backspace for the 2 cases mentioned.
Perl on the other hand does not seem to decode those "invalid" entities and leaves them alone which later one screws up the result (Which goes through unpack or, in phph's case, bin2hex, which fails because rather than unpacking null to 00 it will unpack each individual character of ).
I have tried everything I can think of include running the following substitution in perl after running decode_entities
$var =~ s/&#(\d+);/chr($1)/g
however that does not work at all.
This is driving me mad and I would like to have this done in perl rather than phpI really hope I don't have to write 1000 pattern matching lines in perl to cover all possible entities and numbers.
Anybody that has an idea how to go about this problem without resorting to having to parse PHPs entire html_entity_decode function into perl or writing endless lines of pattern matching?

You're almost there. Instead of
$var =~ s/&#(\d+);/chr($1)/g
say
$var =~ s/&#(\d+);/chr($1)/ge
The /e modifier instructs Perl to 'e'valuate the replacement pattern.

Related

Trying to get `sed `to fix a phpBB board - but I cannot get my regexp to work

I have an ancient phpBB3 board which has gone through several updates over its 15+ years of existence. Sometimes, in the distant past, such updates would partially fail, leaving all sorts of 'garbage' in the BBCode. I'm now trying to do a 'simple' regexp to match a particular issue and fix it.
What happened was the following... during a database update, long long ago, BBCode tags were, for some reason, 'tagged' with a pseudo-attribute — allegedly for the database updating script to figure out each token that required updating, I guess. This attribute was always a 8-char-long alphanumeric string, 'appended' to the actual BBCode with a semicolon, like this:
[I]something in italic[/I]
...
[I:i9o7y3ew]something in italic[/I:i9o7y3ew]
Naturally, phpBB doesn't recognise this as valid BBCode, and just prints the whole text out.
The replacement regexp is actually very basic:
s/\[(\/?)(.+):[[:alnum:]]{0,8}\]/[\1\2]/gim
You can see a working example on regex110.com (where capture groups use $1 instead of \1). The example given there includes a few examples from the actual database itself. [i] is actually the simplest case; there are plenty of others which are perfectly valid but a bit more complex, thus requiring a (.+) matcher, such as [quote=\"Gwyneth Llewelyn\":2m80kuso].
As you can see from the example on regex110.com, this works :-)
Why doesn't it work under (GNU) sed? I'm using version 4.8 under Linux:
$ sed -i.bak -E "s/\[(\/?)(.+):[[:alnum:]]+\]/[\1\2]/gim" table.sql
Just for the sake of the argument, I tried using [A-Za-z0-9]+ instead of [[:alnum:]]+; I've even tried (.+) (to capture the group and then just discard it)
None produced an error; none did any replacements whatsoever.
I understand that there are many different regexp engines out there (PCRE, PCRE2, Boost, etc. and so forth) so perhaps sed is using a syntax that is inconsistent with what I'm expecting...?
Rationale: well, I could have done this differently; after all, MySQL has built-in regexp replacements, too. However, since this particular table is so big, it takes eternities. I thought I'd be far better off by dumping everything to a text file, doing the replacements there, and importing the table again. There is a catch, though: the file is 95 MBytes in size, which means that most tools I've got (e.g. editors with built-in regexp search & replace) will fail with such a huge exception. One notable exception is good old emacs, which has no trouble with such large files. Alas, emacs cannot match anything, so I thought I'd give sed a try (it should be faster, too). sed takes also close to a minute or so to process the whole file — about the same as emacs, in fact — and has the same result, i.e. no replacements are being made. It seems to me that, although the underlying technology is so different (pure C vs. Emacs-LISP), both these tools somehow rely on similar algorithms... both of which fail.
My understanding is that some libraries use different conventions to signal literal vs. metacharacters and quantifiers. Here is an example from an instruction manual for vim: http://www.vimregex.com/#compare
Indeed, contemporary versions of sed seem to be able to handle two different kinds of conventions (thus the -E flag). The issue I have with my regexp is that I find it very difficult to figure out which convention to apply. Let's start with what I'm used to from PHP, Go, JavaScript and a plethora of other regexp implementations, which use the convention that metacharacters & quantifiers do not get backslashed (while literals do).
Thus, \[(\/?)(.+):[[:alnum:]]+\] presumes that there are a few literal matches for [, ], /, and only these few cases require backslashes.
Using the reverse convention — i.e. literals do not get backslashed, while metacharacters and some quantifies do — this would be written as:
[\(/\?\)\(\.\+\):\[\[:alnum:\]\]\+]
Or so I would think.
Sadly, sed also rejects this with an error — and so do vim and emacs, BTW (they seem to use a similar regexp library, or perhaps even the same one).
So what is the correct way to write my regexp so that sed accepts it (and does what I intend it to do)?
UPDATE
I have since learned that, in the database, phpBB, unlike I assumed, does not store BBCode (!) but rather a variant of HTML (some tags are the same, some are invented on the spot). What happens is that BBCode gets translated into that pseudo-HTML, and back again when displaying; that, at least, explains why phpBB extensions such as Markdown for phpBB — but also BBCode add-ons! — can so easily replace, partially or even totally, whatever is in the database, which will continue to work (to a degree!) even if those extensions get deactivated: the parsed BBCode/Markdown is just converted to this 'special' styling in the database, and, as such, will always be rendered correctly by phpBB3, no matter what.
On other words, fixing those 'broken' phpBB tags requires a bit more processing, and not merely search & replace with a single regexp.
Nevertheless, my question is still pertinent to me. I'm not really an expert with regexps but I know the basics — enough to make my life so much easier! — and it's always good to understand the different 'dialects' used by different platforms.
Notably, instead of using egrep and/or grep -E, I'm fond of using ugrep instead. It uses PCRE2 expressions (with the Boost library), and maybe that's the issue I'm having with the sed engine(s) — the different engines speak different regular expressions dialect, and converting from one grep variant to a different one might not be useful at all (because some options will not 'translate' well enough)...
Using sed
(\[[^:]*) - Retain everything up to but not including the next semi colon after a opening bracket within the parenthesis which can later be returned with back reference \1
[^]]* - Exclude everything else up to but not including the next closing bracket
$ sed -E 's/(\[[^:]*)[^]]*/\1/g' table.sql
[I]something in italic[/I]
...
[I]something in italic[/I]

Python - iterate and replace (IPv4) with (IPv4 plus GeoIP)

I am looking for an elegant way to parse a text file (i.e. a log file containing source and destination IPs and lots of other data) keeping each line intact, and replacing all IPv4 addresses with the same IP followed by a comma and the GeoIP country code of that IP.
I have tried doing this in bash, sed, perl, and python. I tried a hundred perl one-liners and never quite got it because substitution like s/original/replacement/g doesn't want to execute GeoIP lookup in the substitution field. For example:
perl -pe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/($1,system(geoiplookup $1))/g' < log.csv
results in:
"srcip=(110.110.110.110,system(geoiplookup 110.110.110.110))"
instead of the executing geoiplookup.
I've tried this with backticks as well as exec, lots of different punctuation, with the same result.
In Python I tried some code that looks like:
rexp_ip = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
repl = { rexp_ip: rexp_ip+".test" }
---
while line:
line = i.readline()
print(re.sub(rexp_ip, lambda m: str(repl.get(m.group())), line))
It seems pretty close but I'm not sure whether I'm on the right track here.
I would be open to bash, sed, awk, perl, python, or any other solution.
This seems fairly simple to me and I may be over-thinking it!
I am guessing I'm not the first person who has tried this and maybe I'm 'reinventing the wheel' here.
Any insight would be appreciated.
I may have solved my own problem using perl with /e switch--
$ perl -lpe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/(`printf $1;geoiplookup $1`)/eg' < log.csv

Perl JSON encode in UTF-8 strange behaviour

Based on Perl JSON 2.90 documentation, to encode JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious and this what I did. After a while, I got an issue report on GitHub from one of users, which made me really surprised, as it shouldn't be happening!
I was beating for hours to figure out what was happening but the solution happened to be very weird and wrong from my point of view.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all different characters, including Russian and Chinese - it just worked?
Can anyone please explain, why encoding is working correctly with latin1 and not with utf8, when it's actually has to be visa versa?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3 A9, are suffering from this bug.
If you are suffering from the this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, are suffering from this bug.
If you are suffering from the this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for writing encoded data to). As unintuitively as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).

Using spéciâl characters in URLs

I'm developing a website which lets people create their own translator. They can choose the name of the URL, and it is sent to a database and I use .htaccess to redirect website.com/nameoftheirtranslator
to:
website.com/translator.php?name=nameoftheirtranslator
Here's my problem:
Recently, I've noticed that someone has created a translator with special characters in the name -> "LAEFÊVËŠI".
But when it is processed (posted to a php file, then mysqli_real_escape_string) and added to the database it appears as simply "LAEFVI" - so you can see the special characters have been lost somewhere.
I'm not quite sure what to do here, but I think there are two paths:
Try to keep the characters and do some encoding (no idea where to start)
Ditch them and tell users to only use 'normal' characters in the names of their translators (not ideal)
I'm wondering whether it's even possible to have a url like website.com/LAEFÊVËŠI - can that be interpreted by the server?
EDIT1: I notice that stack overflow, on this very question, translates the special characters in my title to .../using-special-characters-in-urls! This seems like a great solution, I guess I could make a function that translates special characters like â to their normal equivalent (like â)? And I suppose I would just ignore other characters like /##"',&? Now that I think of it, there must be some fairly standard/good-practice strategies for getting around problems like this.
EDIT2: Actually, now that I think about it (more) - I really want this thing to be usable by people of any language (not just English), so I would really love to be able to have special characters in the urls. Having said this, I've just found that Google doesn't interpret â as a, so people may have a hard time finding the LAEFÊVËŠI translator if I don't translate the letters to normal characters. Ahh!
Okay, after that crazy episode, here's what happened:
Found out that I was removing all the non alpha-numeric characters with PHP preg_replace().
Altered preg_replace so it only removes spaces and used rawurlencode():
$name = mysqli_real_escape_string($con, rawurlencode( preg_replace("/\s/", '', $name) ));
Now everything is in the database encoded, safe and sound.
Used this rewrite rule RewriteRule ^([^/.]+)$ process.php?name=$1 [B]
Run around in circles for 2 hours thingking my rewrite was wrong because I was getting "page not found"
Realise that process.php didn't have a rawurlencode() to read in the name
$name = rawurlencode($_GET['name']);
Now it works.
WOO!
Sleep time.

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strstr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
What do you mean by "link-friendly"? Only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii library that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, nevermind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...