What is the encoding scheme for this Arabic web page? - html

I am trying to find the encoding scheme for this page (and others) which are surely Arabic, using lower ASCII range Latin characters to encode the contents.
http://www.saintcyrille.com/2011a.htm
http://www.saintcyrille.com/2011b.htm (English version/translation of that same page)
I have seen several sites and even PDF documents with this encoding, but I can't find the name or method of it.
This specific page is from 2011 and I think this is a pre-Unicode method of encoding Arabic that has fallen out of fashion.
Some sample text:
'D1J'6) 'D1H-J) 'DA5-J)
*#ED'* AJ 3A1 'D*CHJF
JDBJG'
'D#( / 3'EJ -D'B 'DJ3H9J
'D0J J#*J .5J5'K EF -D( # 3H1J'

An extraordinary mojibake case. It looks like the high byte of each Unicode code point in the Arabic text is missing. For instance, ا (U+0627, Arabic Letter Alef) appears as ' (U+0027, Apostrophe).
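That hypothesis is easy to check (a minimal sketch in Python; the sample word السلام is my own test string, not text taken from the page):

```python
# If each Arabic letter U+06xx lost its high byte, the remaining low byte
# was rendered as a cp1252 character. Adding 0x0600 back should undo it.
def restore(mojibake: str) -> str:
    # Assume every non-space character belongs to the U+0600-U+06FF block.
    return ''.join(c if c == ' ' else chr(0x0600 + ord(c)) for c in mojibake)

print(restore("'D3D'E"))  # السلام
```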
Let's suppose, in the following PowerShell script, that the missing high byte is always 0x06 (I added some more strings from the very end of the page http://www.saintcyrille.com/2011a.htm to your sample text):
$mojibakes = @'
E3'!K
'D1J'6) 'D1H-J) 'DA5-J)
*#ED'* AJ 3A1 'D*CHJF
JDBJG'
'D#( / 3'EJ -D'B 'DJ3H9J
'D0J J#*J .5J5'K EF -D( # 3H1J'
ED'-8'* :
'D#CD 'D5J'EJ 7H'D 'D#3(H9 'D98JE E-(0 ,/'K H'D5HE JF*GJ (9/ B/'3 'D9J/
J-(0 'D*B/E DD'9*1'A (9J/'K 9F JHE 'D9J/ (B/1 'D%EC'F -*I *3*7J9H' 'DE4'1C) AJ 'D5DH'* HB/'3 'D9J/ HFF5- D0DC 'D'3*A'/) EF -AD) 'D*H() 'D,E'9J) JHE 'D,E9) 15 '(1JD 2011 -J+ JGJ# 'D,EJ9 E9'K DFH'D 31 'DE5'D-) ( 9// EF 'D#('! 'DCGF) 3JCHF -'61'K )
(5F/HB 'D5HE) 9F/ E/.D 'DCFJ3) AAJ A*1) 'D#9J'/ *8G1 AJF' #9E'D 'D1-E) H'D5/B'* HE' JB'(DG' H0DC 9ED EB(HD HEE/H-
HDF' H7J/ 'D#ED #F *4'1CH' 'D'-*A'D'* AJ 19J*CE HCD 9'E H#F*E (.J1
'DE3J- B#'E ... -#B'K B#'E
'@ -split [System.Environment]::NewLine
Function highByte ([byte]$lowByte, [switch]$moreInfo) {
    if ( $moreInfo.IsPresent -and (
         $lowByte -lt 0x20 -or $lowByte -gt 0x7f )) {
        Write-Host $lowByte -ForegroundColor Cyan
    }
    if ( $lowByte -eq 0x20 ) { 0, $lowByte } else { 6, $lowByte }
}
foreach ( $mojibake in $mojibakes ) {
    $aux = [System.Text.Encoding]::
        GetEncoding( 1252 ).GetBytes( [char[]]$mojibake )
    [System.Text.Encoding]::BigEndianUnicode.GetString(
        $aux.ForEach({ highByte -lowByte $_ })
    )
    ''    # empty line separator for better readability
}
The output (run through Google Translate) seems to convey roughly the same sense as the English version of the page, after a fashion…
Output: .\SO\70062779.ps1
مساءً
الرياضة الروحية الفصحية
تأملات في سفر التكوين
يلقيها
الأب د سامي حلاق اليسوعي
الذي يأتي خصيصاً من حلب ـ سوريا
ملاحظات غ
الأكل الصيامي طوال الأسبوع العظيم محبذ جداً والصوم ينتهي بعد قداس
العيد
يحبذ التقدم للاعتراف بعيداً عن يوم العيد بقدر الإمكان حتى تستطيعوا
المشاركة في الصلوات وقداس العيد ، وننصح لذلك ، الاستفادة من حفلة
التوبة الجماعية يوم الجمعة رص ابريل زذرر حيث يهيأ الجميع معاً لنوال سر
المصالحة ب عدد من الأباء الكهنة سيكون حاضراً ة
بصندوق الصومة عند مدخل الكنيسة ففي فترة الأعياد تظهر فينا أعمال الرحمة
والصدقات وما يقابلها وذلك عمل مقبول وممدوح
ولنا وطيد الأمل أن تشاركوا الاحتفالات في رعيتكم وكل عام وأنتم بخير
المسيح قـام خخخ حـقاً قـام
Please keep in mind that I do not understand Arabic.
The script does not handle numbers: the year 2011 in note #2, for instance, is incorrectly transformed to زذرر;
handling of spaces is unclear: is 0x20 always a space, or should it sometimes be transformed to ؠ (U+0620, Arabic Letter Kashmiri Yeh)?
Moreover, there is the problematic presumption about the Unicode range U+0600-U+067F (what about U+0680-U+06FF and the other Arabic blocks?).
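One possible workaround for the digits problem (a heuristic sketch in Python, not part of the original script: treat all-digit tokens as literal Western numerals and shift everything else back into the Arabic block; it would still misfire on tokens mixing letters and digits):

```python
def restore_token(tok: str) -> str:
    if tok.isdigit():                     # standalone numbers like "2011"
        return tok                        # stay as Western digits
    return ''.join(chr(0x0600 + ord(c)) if ' ' < c <= '\x7f' else c
                   for c in tok)

def restore_line(line: str) -> str:
    return ' '.join(restore_token(t) for t in line.split(' '))

print(restore_line("'(1JD 2011"))  # ابريل 2011 instead of ابريل زذرر
```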

Related

LISP: how to properly encode a slash ("/") with cl-json?

I have code that uses the cl-json library to add a line, {"main" : "build/electron.js"} to a package.json file:
(let ((package-json-pathname (merge-pathnames *app-pathname* "package.json")))
  (let ((new-json
          (with-open-file (package-json package-json-pathname
                                        :direction :input
                                        :if-does-not-exist :error)
            (let ((decoded-package (json:decode-json package-json)))
              (let ((main-entry (assoc :main decoded-package)))
                (if (null main-entry)
                    (push '(:main . "build/electron.js") decoded-package)
                    (setf (cdr main-entry) "build/electron.js"))
                decoded-package)))))
    (with-open-file (package-json package-json-pathname
                                  :direction :output
                                  :if-exists :supersede)
      (json:encode-json new-json package-json))))
The code works, but the result has an escaped slash:
"main":"build\/electron.js"
I'm sure this is a simple thing, but no matter which inputs I try -- "//", "/", "#//" -- I still get the escaped slash.
How do I just get a normal slash in my output?
Also, I'm not sure if there's a trivial way for me to get pretty-printed output, or if I need to write a function that does this; right now the output prints the entire package.json file to a single line.
Special characters
The JSON spec indicates that "Any character may be escaped.", but some of them MUST be escaped: "quotation mark, reverse solidus, and the control characters". The linked section is followed by a grammar that shows "solidus" (/) in the list of escapable characters. I don't think it is really important in practice (it typically need not be escaped), but that may explain why the library escapes this character.
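For comparison, most encoders take the same liberty in the other direction and leave the solidus alone; Python's json module, for instance, emits it unescaped (a quick illustration, unrelated to cl-json itself):

```python
import json

# The solidus is legal unescaped inside a JSON string, and json.dumps
# leaves it that way.
print(json.dumps({"main": "build/electron.js"}))
# {"main": "build/electron.js"}
```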
How to avoid escaping
cl-json relies on an internal list of escaped characters named +json-lisp-escaped-chars+, namely:
(defparameter +json-lisp-escaped-chars+
  '((#\" . #\")
    (#\\ . #\\)
    (#\/ . #\/)
    (#\b . #\Backspace)
    (#\f . #\Page)
    (#\n . #\Newline)
    (#\r . #\Return)
    (#\t . #\Tab)
    (#\u . (4 . 16)))
  "Mapping between JSON String escape sequences and Lisp chars.")
The symbol is not exported, but you can still refer to it externally with ::. You can dynamically rebind the parameter around the code that needs to use a different list of escaped characters; for example, you can do as follows:
(let ((cl-json::+json-lisp-escaped-chars+
        (remove #\/ cl-json::+json-lisp-escaped-chars+ :key #'car)))
  (cl-json:encode-json-plist '("x" "1/5")))
This prints:
{"x":"1/5"}

How can I remove characters that are not supported by MySQL's utf8 character set?

How can I remove characters from a string that are not supported by MySQL's utf8 character set? In other words, characters such as "𝜀" that need four bytes in UTF-8 and are therefore only supported by MySQL's utf8mb4 character set.
For example,
𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
should become
C = -2.4‰ ± 0.3‰; H = -57‰
I want to load a data file into a MySQL table that has CHARSET=utf8.
MySQL's utf8mb4 encoding is what the world calls UTF-8.
MySQL's utf8 encoding is a subset of UTF-8 that only supports characters in the BMP (meaning characters U+0000 to U+FFFF inclusive).
Reference
So, the following will match the unsupported characters in question:
/[^\N{U+0000}-\N{U+FFFF}]/
Here are three different techniques you can use to clean your input:
1: Remove unsupported characters:
s/[^\N{U+0000}-\N{U+FFFF}]//g;
2: Replace unsupported characters with U+FFFD:
s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;
3: Replace unsupported characters using a translation map:
my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
For example,
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal and files use UTF-8.
use strict;
use warnings;
use 5.010; # say, //
use charnames ':full'; # Not needed in 5.16+
my %translations = (
    "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
    # ...
);
$_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
say;
s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
say;
Output:
𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
εC = -2.4‰ ± 0.3‰; εH = -57‰
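The same three techniques translate almost mechanically to other languages; here is an illustrative Python sketch (Python 3 strings are sequences of code points, so the BMP test carries over directly):

```python
import re

s = "\U0001d700C = -2.4\u2030 \u00b1 0.3\u2030; \U0001d700H = -57\u2030"
non_bmp = re.compile(r'[^\u0000-\uffff]')          # anything outside the BMP

print(non_bmp.sub('', s))                          # 1: remove
print(non_bmp.sub('\ufffd', s))                    # 2: replace with U+FFFD
translations = {'\U0001d700': '\u03b5'}            # 3: map 𝜀 to plain ε
print(non_bmp.sub(lambda m: translations.get(m.group(), '\ufffd'), s))
```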

How to properly display HTML entities in perl

I was writing a web crawler in Perl, and I noticed some weird behavior when I tried to display strings using HTML::Entities::decode_entities.
I was handling strings that contain Chinese characters alongside strings like Jìngyè.
I used HTML::Entities::decode_entities to decode the Chinese characters, which works well. However, when a string contains no Chinese characters, it displays weirdly (J�ngy�).
I wrote a small script to test the different behaviors on two strings.
String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 is "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".
Below is my code:
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."&#34399");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
These are my results:
before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466
decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)
chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)
before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號
decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)
chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)
Can someone please explain why this is happening, and how to solve it so that my strings display properly? Thank you very much.
Sorry, I did not make my question clear, below is the code I wrote, where URL is http://maps.google.com/maps/place?cid=10931902633578573013:
sub getInfoURLs {
    my ($url) = @_;
    unless (defined $url) {
        print "URL was not defined when extracting info\n";
        return 0;
    }
    my $contain_request = LWP::UserAgent->new->get($url);
    if ($contain_request->is_success) {
        my $contain_content = $contain_request->decoded_content;
        # store address
        if ($contain_content =~ m/$address_pattern/i) {
            print "before: $1\n";
            my $decoded = HTML::Entities::decode_entities($1 . "&#34399;");
            print "decoded $decoded\n";
            my $chopped = substr($decoded, 0, -1);
            print "chopped: $chopped\n";
            # unicode conversion
            # store in database
        }
    }
}
First, always use use strict; use warnings;!!!
The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.
Perl will output UTF-8 (-ish) when you pass something that's obviously wrong. chr(0x865F) is obviously not a byte, so:
$ perl -we'print "\xE8\x{865F}\n"'
Wide character in print at -e line 1.
è號
But it's not always obvious that something is wrong. chr(0xE8) could be a byte, so:
$ perl -we'print "\xE8\n"'
�
The process of converting a value into a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.
Encode's encode is used to provide character encoding. You can also have encode called automatically using the open module.
$ perl -we'use open ":std", ":locale"; print "\xE8\x{865F}\n"'
è號
$ perl -we'use open ":std", ":locale"; print "\xE8\n"'
è
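The same decoded-text versus bytes distinction exists in most languages; as an illustration outside Perl, here is the explicit serialization step in Python (the string "è號" mirrors the example above):

```python
import sys

text = "\xe8\u865f"                    # decoded text: è followed by 號
data = text.encode("utf-8")            # serialization: characters -> bytes
sys.stdout.buffer.write(data + b"\n")  # file handles transmit bytes
```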

default language in web sites form [duplicate]

I am trying to validate form data on the server side.
I want to make sure the user fills the form using only Persian characters.
I am using this code:
$name=trim($_POST['name']);
$name= mysql_real_escape_string($name);
if (preg_match('/^[\u0600-\u06FF]+$/',str_replace("\\\\","",$name))){$err.= "Please use Persian characters!";}
but it is not working!
here is a warning:
Warning: preg_match() [function.preg-match]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 3 in C:\xampp\htdocs\site\form.php on line 38
What can I do?
This 'should' work...
I added a ^ after the opening [ so that the class matches non-Arabic/Farsi characters:
if (preg_match('/^[^\x{600}-\x{6FF}]+$/u', str_replace("\\\\","",$name)))
http://utf8-chartable.de/unicode-utf8-table.pl?start=1536&number=1024&utf8=0x&addlinks=1&htmlent=1
پژگچ are in the 600-6FF range
fa only:
preg_match('/^[پچجحخهعغفقثصضشسیبلاتنمکگوئدذرزطظژؤإأءًٌٍَُِّ\s]+$/u', $input);
en , en-num and fa character:
preg_match('/^([a-zA-Z0-9 پچجحخهعغفقثصضشسیبلاتنمکگوئدذرزطظژؤإأءًٌٍَُِّ])+$/u', $input);
you can also allow Persian digits or the Arabic letters ي and ك
You should use this:
if(preg_match("/^[آ ا ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی]/", $_POST['name']))
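For what it's worth, the \uXXXX syntax from the question does work in some other regex engines; here is an illustrative Python version of the whitelist check (Python's re accepts \u escapes inside character classes, unlike the PCRE build that produced the warning):

```python
import re

# Accept only characters from the Arabic block U+0600-U+06FF plus whitespace.
persian_only = re.compile(r'^[\u0600-\u06FF\s]+$')

def is_persian(name: str) -> bool:
    return bool(persian_only.match(name))

print(is_persian("\u067e\u0698\u06af\u0686"))  # True  (پژگچ)
print(is_persian("abc"))                       # False
```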

What characters have to be escaped to prevent (My)SQL injections?

I'm using MySQL API's function
mysql_real_escape_string()
Based on the documentation, it escapes the following characters:
\0
\n
\r
\
'
"
\Z
Now, I looked into OWASP.org's ESAPI security library and in the Python port it had the following code (http://code.google.com/p/owasp-esapi-python/source/browse/esapi/codecs/mysql.py):
"""
Encodes a character for MySQL.
"""
lookup = {
0x00 : "\\0",
0x08 : "\\b",
0x09 : "\\t",
0x0a : "\\n",
0x0d : "\\r",
0x1a : "\\Z",
0x22 : '\\"',
0x25 : "\\%",
0x27 : "\\'",
0x5c : "\\\\",
0x5f : "\\_",
}
Now, I'm wondering whether all those characters really need to be escaped. I understand why % and _ are there (they are metacharacters in the LIKE operator), but I simply can't understand why they added the backspace and tab characters (\b \t). Is there a security issue if you run a query:
SELECT a FROM b WHERE c = '...user input ...';
Where user input contains tabulators or backspace characters?
My question is here: Why did they include \b \t in the ESAPI security library? Are there any situations where you might need to escape those characters?
A guess concerning the backspace character: Imagine I send you an email "Hi, here's the query to update your DB as you wanted" and an attached textfile with
INSERT INTO students VALUES ("Bobby Tables",12,"abc",3.6);
You cat the file, see it's okay, and just pipe the file to MySQL. What you didn't know, however, was that I put
DROP TABLE students;\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b
before the INSERT STATEMENT which you didn't see because on console output the backspaces overwrote it. Bamm!
Just a guess, though.
Edit (couldn't resist):
The MySQL manual page for strings says:
\0   An ASCII NUL (0x00) character.
\'   A single quote (“'”) character.
\"   A double quote (“"”) character.
\b   A backspace character.
\n   A newline (linefeed) character.
\r   A carriage return character.
\t   A tab character.
\Z   ASCII 26 (Control-Z). See note following the table.
\\   A backslash (“\”) character.
\%   A “%” character. See note following the table.
\_   A “_” character. See note following the table.
Blacklisting (identifying bad characters) is never the way to go if you have any other options.
You need to use a combination of whitelisting and, more importantly, bound-parameter approaches.
While this particular answer has a PHP focus, it still helps plenty and explains why just running a string through a character filter doesn't work in many cases. Please, please see Do htmlspecialchars and mysql_real_escape_string keep my PHP code safe from injection?
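To make the bound-parameter recommendation concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (MySQL drivers spell the placeholder %s rather than ?, but the principle is identical: the value travels separately from the SQL text, so nothing needs escaping):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (c TEXT, a TEXT)")

user_input = "x'; DROP TABLE b; --"     # hostile-looking value
conn.execute("INSERT INTO b VALUES (?, ?)", (user_input, "safe"))

# The driver sends the value out of band; quotes, tabs and backspaces in
# user_input are just data, never SQL.
rows = conn.execute("SELECT a FROM b WHERE c = ?", (user_input,)).fetchall()
print(rows)  # [('safe',)]
```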
Where user input contains tabulators or backspace characters?
It's quite remarkable that, to this day, most users believe it is user input that has to be escaped, and that such escaping "prevents injections".
Java solution:
public static String filter( String s ) {
    StringBuffer buffer = new StringBuffer();
    for ( byte b : s.getBytes() ) {
        int i = (int) b;
        switch ( i ) {
            case 9  : buffer.append( " " );    break;
            case 10 : buffer.append( "\\n" );  break;
            case 13 : buffer.append( "\\r" );  break;
            case 34 : buffer.append( "\\\"" ); break;
            case 39 : buffer.append( "\\'" );  break;
            case 92 : buffer.append( "\\" );   // falls through: emits \ twice
            default : // keep printable ASCII only; everything else is dropped
                if ( i > 31 && i < 127 ) buffer.append( new String( new byte[] { b } ) );
        }
    }
    return buffer.toString();
}
couldn't one just delete the single and double quotes from user input?
e.g. $input =~ s/\'|\"//g;