I am reading some code that parses the JSON returned from a web site (a Dynamic DNS provider, if it matters).
$reply =~ qr/{(?:[^{}]*|(?R))*}/mp;
What will this match?
I am reading this https://perldoc.perl.org/perlretut to try and understand.
So far I understand the following
It will match something that starts with '{' and ends with '}'.
(?: means the group is not captured. I assume this is for speed?
Then it will match something that is not '{' or '}' any number of times (e.g. fjksfksdh), which is not necessarily proper JSON syntax as far as I can tell.
Then I have no idea what ?R means, and I struggle to put it all together in the end.
Can someone help me out?
Sorry, Perl and JSON noob here.
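For what it's worth, (?R) recurses into the whole pattern, so the regex matches the first brace-balanced {...} span, nested braces included. Python's standard re module has no (?R), so the rough sketch below swaps in a plain depth counter instead of a recursive regex; the function name match_balanced_braces is made up for illustration.

```python
from typing import Optional


def match_balanced_braces(text: str) -> Optional[str]:
    """Return the first brace-balanced {...} span in text, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    # The brace opened at `start` has just been closed.
                    return text[start:i + 1]
        # The brace at `start` is never closed; retry from the next "{".
        start = text.find("{", start + 1)
    return None


print(match_balanced_braces('HTTP 200 {"a": {"b": 1}} trailing'))
```

Like the regex, this knows nothing about JSON string literals, so a brace inside a quoted string is still counted.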
TL;DR I have input that looks like this:
इस परीक्षण के लिए है
Something
Zürich
This data is then piped through a few programs and is ultimately inserted into a mongodb database.
But by the time I query it out and try to display it on a web page it's all garbage.
I've found a lot of questions on how to encode these things but all the answers assume you want everything encoded and do not discuss how to decode it for display.
I only want the "weird" stuff encoded, so for the above I'd like to get some output like this
0x1234;0x8737;0x838784; ...
Something
Z0x8387;rich
which would store fine in a database, and would survive a vim edit or whatever else, but then when I pull it out I want it to render correctly.
So how do I do that, encode in Perl and decode in Javascript?
PS: I don't know what that string of symbols means, just found it somewhere. Sorry if it's offensive or something. Thanks!
Edit:
choroba's answer is a very good start, let's see with an example of what the algorithm produces:
input: 株式会社イノ設計
output: 0x230;0x160;0x170;0x229;0x188;0x143;0x228;0x188;0x154;0x231;0x164;0x190;0x227;0x130;0x164;0x227;0x131;0x142;0x232;0x168;0x173;0x232;0x168;0x136;
Now how do I render that in Javascript? 0xNN was just an example of what I imagine the answer would be but if there's a better way by all means!
Thanks!
Here's an example that produces something similar to what you want:
#! /usr/bin/perl
use warnings;
use strict;
sub escape {
my ($in) = @_;
$in =~ s/([\x{80}-\x{ffff}])/sprintf '0x%d;', ord $1/ger
}
my $in = "Z\N{LATIN SMALL LETTER U WITH DIAERESIS}rich";
my $out = 'Z0x252;rich';
$out eq escape($in) or die escape($in) . "\n$out\n";
You seem to want decimal digits after 0x. That's confusing as 0x usually means hexadecimal. To get hexadecimal codes, change the sprintf template to 0x%x;.
Also note that once someone enters 0x123; into your data directly, the data will become corrupted.
If you use &# instead of 0x at the beginning of each replaced character, the browser will render the characters correctly: Z&#252;rich renders as "Zürich".
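The same idea can be sketched in Python, purely as an illustration. The helper name escape_non_ascii is hypothetical, and unlike the Perl range above it escapes every character above U+007F, not just up to U+FFFF. html.unescape stands in here for what the browser does when it renders the references.

```python
import html
import re


def escape_non_ascii(text: str) -> str:
    """Replace each character above U+007F with a decimal character reference."""
    return re.sub(r"[^\x00-\x7f]", lambda m: f"&#{ord(m.group(0))};", text)


escaped = escape_non_ascii("Zürich")
print(escaped)                 # Z&#252;rich
print(html.unescape(escaped))  # Zürich
```

Plain ASCII input such as "Something" passes through unchanged.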
Based on Perl JSON 2.90 documentation, to encode JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious, and this is what I did. After a while, I got an issue report on GitHub from one of the users, which really surprised me, as it shouldn't be happening!
I spent hours trying to figure out what was happening, but the solution turned out to be very weird and, from my point of view, wrong.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all different characters, including Russian and Chinese - it just worked?
Can anyone please explain why encoding works correctly with latin1 and not with utf8, when it should be the other way around?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3.A9, you are suffering from this bug.
If you are suffering from this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, you are suffering from this bug.
If you are suffering from this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for a handle you write already-encoded data to). As unintuitive as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).
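The double-encoding hypothesis can be reproduced byte for byte. The following is only a Python sketch of the effect, not the Perl code from the question; decoding as Latin-1 and re-encoding as UTF-8 stands in for what an :utf8 layer does to an already-encoded byte string.

```python
text = "installé"

# Correct: encode exactly once.
once = text.encode("utf-8")
print(once.hex(" "))   # 69 6e 73 74 61 6c 6c c3 a9

# Bug: the encoded bytes are treated as Latin-1 characters and encoded again.
twice = once.decode("latin-1").encode("utf-8")
print(twice.hex(" "))  # 69 6e 73 74 61 6c 6c c3 83 c2 a9

# Why latin1 appeared to work: Latin-1 bytes pushed through the extra
# UTF-8 layer come out as the correct UTF-8 bytes.
via_latin1 = text.encode("latin-1").decode("latin-1").encode("utf-8")
print(via_latin1 == once)  # True
```

This also fits the report that Russian and Chinese text worked: the JSON module's latin1 option escapes characters outside Latin-1 as \uXXXX, so those never reach the output layer as raw bytes.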
I'm developing a website which lets people create their own translator. They can choose the name of the URL, and it is sent to a database and I use .htaccess to redirect website.com/nameoftheirtranslator
to:
website.com/translator.php?name=nameoftheirtranslator
Here's my problem:
Recently, I've noticed that someone has created a translator with special characters in the name -> "LAEFÊVËŠI".
But when it is processed (posted to a php file, then mysqli_real_escape_string) and added to the database it appears as simply "LAEFVI" - so you can see the special characters have been lost somewhere.
I'm not quite sure what to do here, but I think there are two paths:
Try to keep the characters and do some encoding (no idea where to start)
Ditch them and tell users to only use 'normal' characters in the names of their translators (not ideal)
I'm wondering whether it's even possible to have a url like website.com/LAEFÊVËŠI - can that be interpreted by the server?
EDIT1: I notice that Stack Overflow, on this very question, translates the special characters in my title to .../using-special-characters-in-urls! This seems like a great solution. I guess I could make a function that translates special characters like â to their normal equivalent (like a)? And I suppose I would just ignore other characters like /@#"',&? Now that I think of it, there must be some fairly standard/good-practice strategies for getting around problems like this.
EDIT2: Actually, now that I think about it (more) - I really want this thing to be usable by people of any language (not just English), so I would really love to be able to have special characters in the urls. Having said this, I've just found that Google doesn't interpret â as a, so people may have a hard time finding the LAEFÊVËŠI translator if I don't translate the letters to normal characters. Ahh!
Okay, after that crazy episode, here's what happened:
Found out that I was removing all the non alpha-numeric characters with PHP preg_replace().
Altered preg_replace so it only removes spaces and used rawurlencode():
$name = mysqli_real_escape_string($con, rawurlencode( preg_replace("/\s/", '', $name) ));
Now everything is in the database encoded, safe and sound.
Used this rewrite rule RewriteRule ^([^/.]+)$ process.php?name=$1 [B]
Ran around in circles for 2 hours thinking my rewrite was wrong because I was getting "page not found"
Realised that process.php didn't have a rawurlencode() to read in the name
$name = rawurlencode($_GET['name']);
Now it works.
WOO!
Sleep time.
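For anyone following along, the round trip described above looks roughly like this in Python. This is only an illustration of percent-encoding, not the PHP from the post: quote with safe="" is used as a stand-in for rawurlencode(), and the variable names are made up.

```python
from urllib.parse import quote, unquote

name = "LAEFÊVËŠI"

# Percent-encode the UTF-8 bytes of the name before storing it.
encoded = quote(name, safe="")
print(encoded)  # LAEF%C3%8AV%C3%8B%C5%A0I

# The web server hands the script the decoded name...
decoded = unquote(encoded)

# ...so it has to be encoded again before comparing with the stored value.
print(quote(decoded, safe="") == encoded)  # True
```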
Currently I have a PowerShell process that scans a SQL Server table and reads a column containing text. We have characters in extended-ASCII land that are causing our downstream processes to break. I was originally identifying these differences in SQL Server, but it is terrible at text parsing, so I decided to write a PowerShell script that uses regular expressions instead. I will post the code for that as well to help other lost souls looking for such a regex.
$x = [regex]::Escape("\``~!@#$%^&*()_|{}=+:;`"'<,>.?/-")
$y = "([^A-Za-z0-9 \u005D\u005B\t\n"+$x+"])"
$a = [regex]::match( $($Row[1]), $y)
The problem comes when I want to display some of the ASCII values back in an email saying that I'm scrubbing the data. The numbers don't come out the same as in SQL Server. Caution: I'm not sure your results will be the same if you copy from your browser, because these are extended ASCII characters.
In powershell
[int]"–"[-0]; #result 8211 that appears to be wrong
[int]" "[-0]; #result 160 this appears to be right
In SQL Server
select ASCII('–') --result 150
select ASCII(' ') --result 160
Is there anything in PowerShell that gives the same results as SQL Server for an ASCII lookup?
TL;DR: Is the above the correct method to look up ASCII values in PowerShell? It works for most values but doesn't work for the ASCII value 150 (this is the long dash that comes from Word).
In SQL Server,
select UNICODE('–')
will return 8211.
I don't think PowerShell supports ANSI, except for I/O; it works in Unicode internally.
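The two numbers are the same character seen through two different tables. Here is a small Python sketch of the difference, assuming (this is a guess, not something stated in the question) that the SQL Server column uses a Windows-1252 code page:

```python
dash = "\u2013"  # EN DASH, the "long dash" that Word inserts
nbsp = "\u00a0"  # NO-BREAK SPACE

# Unicode code points, which is what PowerShell and UNICODE() report.
print(ord(dash), ord(nbsp))  # 8211 160

# Single-byte values in Windows-1252, which is what ASCII() would
# report under the assumed code page.
print(dash.encode("cp1252")[0], nbsp.encode("cp1252")[0])  # 150 160
```

The no-break space is 160 in both tables, which is why only the dash looked wrong.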
I am having an issue which I am unable to solve after spending the last 10 hours searching around the internet for an answer.
I have some data in this format
??E??0??<?20120529184453+0200?20120529184453+0200???G0E?5?=20111213T103134000-136.225.6.103-30365316-1448169323, ver: 12??W??tP?2??
??|?????
??:o?????tP???B#?????B#??????)0????
49471010550??? ???tP???3??<????????????????
I have a PHP code, not written by me, which is just running html_entity_decode on that and it returns the correct results.
When I try running Perl's decode_entities I get a completely different result. After some debugging it seems to me that PHP is "properly" replacing what seem to be invalid entities, such as the numeric entities for NULL and backspace, with their ASCII counterparts.
Perl, on the other hand, does not seem to decode those "invalid" entities and leaves them alone, which later on screws up the result (which goes through unpack or, in PHP's case, bin2hex; that fails because rather than unpacking NULL to 00 it unpacks each individual character of the undecoded entity).
I have tried everything I can think of, including running the following substitution in Perl after running decode_entities
$var =~ s/&#(\d+);/chr($1)/g
however that does not work at all.
This is driving me mad and I would like to have this done in Perl rather than PHP. I really hope I don't have to write 1000 pattern matching lines in Perl to cover all possible entities and numbers.
Anybody that has an idea how to go about this problem without resorting to having to parse PHPs entire html_entity_decode function into perl or writing endless lines of pattern matching?
You're almost there. Instead of
$var =~ s/&#(\d+);/chr($1)/g
say
$var =~ s/&#(\d+);/chr($1)/ge
The /e modifier instructs Perl to 'e'valuate the replacement as code instead of treating it as a plain string, so chr($1) is actually called for each match.
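For comparison, here is the same substitution sketched in Python, where passing a function as the replacement plays the role of /e. The helper name decode_numeric_entities and the sample input are made up for illustration.

```python
import re


def decode_numeric_entities(text: str) -> str:
    """Replace decimal references such as &#8; with the character they name."""
    # The lambda runs once per match, much like the /e replacement in Perl.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


decoded = decode_numeric_entities("a&#0;b&#8;c")
print(decoded.encode("latin-1").hex())  # 6100620863
```

Note that chr raises ValueError for numbers above 0x10FFFF, so untrusted input would need a range check first.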