How to properly display HTML entities in Perl

I was writing a web crawler in Perl, and I noticed strange behavior when I tried to display strings using HTML::Entities::decode_entities.
I was handling strings that contain Chinese characters as well as strings like Jìngyè.
I used HTML::Entities::decode_entities to decode the Chinese characters, which works well. However, when a string contains no Chinese characters, it displays garbled (J�ngy�).
I wrote a small piece of code to test the different behaviors on 2 strings.
String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 is "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".
Below is my code:
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."&#34399");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
These are my results:
before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466
decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)
chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)
before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號
decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)
chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)
Can someone please explain why this is happening, and how I can solve it so that my strings display properly?
Thank you very much.
Sorry, I did not make my question clear. Below is the code I wrote, where the URL is http://maps.google.com/maps/place?cid=10931902633578573013:
sub getInfoURLs {
my ($url) = @_;
unless (defined $url){
print "URL was not defined when extracting info\n";
return 0;
}
my $contain_request = LWP::UserAgent->new->get($url);
if($contain_request -> is_success){
my $contain_content = $contain_request -> decoded_content;
#store address
if ($contain_content =~ m/$address_pattern/i){
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."&#34399");
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
#unicode conversion
#store in database
}
}
}

First, always use use strict; use warnings;!!!
The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.
Perl falls back to outputting UTF-8 (-ish) when you pass something that obviously can't be a byte. chr(0x865F) is obviously not a byte, so:
$ perl -we'print "\xE8\x{865F}\n"'
Wide character in print at -e line 1.
è號
But it's not always obvious that something is wrong. chr(0xE8) could be a byte, so:
$ perl -we'print "\xE8\n"'
�
The process of converting a value into a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.
Encode's encode is used to perform character encoding. You can also have encode called automatically using the open pragma.
$ perl -we'use open ":std", ":locale"; print "\xE8\x{865F}\n"'
è號
$ perl -we'use open ":std", ":locale"; print "\xE8\n"'
è

Related

delete CSV file row based on the value of a column in command line

Here is what my dataset looks like; I am trying to filter out the countries where the 4th column is >= 1000.
Marshall Islands,53127,77,41
Vanuatu,276244,25,70
Solomon Islands,611343,23,142
Sao Tome and Principe,204327,72,147
Belize,374681,46,171
Maldives,436330,39,172
Guyana,777859,27,206
Eswatini,1367254,24,323
Timor-Leste,1296311,30,392
Lesotho,2233339,28,619
Guinea-Bissau,1861283,43,799
Namibia,2533794,49,1242
Gambia,2100568,61,1273
.
.
.
Zimbabwe,16529904,32,5329
(total 77 lines of data)
I have tried to run the following command in my terminal, but it only outputs 1 line of the dataset to the new file.
awk -F, '$4 > 999' original.csv > new.csv
*update: all lines except the Zimbabwe line end with ^M$.
Here is the desired output:
Namibia,2533794,49,1242
Gambia,2100568,61,1273
Burundi,10864245,13,1380
Armenia,2930450,63,1849
Rwanda,12208407,17,2091
Mongolia,3075647,68,2103
Kyrgyzstan,6045117,36,2184
Mauritania,4420184,53,2335
Lao People's Democratic Republic,6858160,34,2357
Liberia,4731906,51,2399
Tajikistan,8921343,27,2407
Sierra Leone,7557212,42,3147
Togo,7797694,41,3210
Chad,14899994,23,3406
Congo,5260750,66,3496
Cambodia,16005373,23,3678
Paraguay,6811297,61,4175
El Salvador,6377853,71,4546
Guinea,12717176,36,4552
Benin,11175692,47,5227
Zimbabwe,16529904,32,5329
Azerbaijan,9827589,55,5439
Burkina Faso,19193383,29,5517
Nepal,29304998,19,5666
Haiti,10981229,54,5968
Somalia,14742523,44,6544
Zambia,17094131,43,7346
Senegal,15850567,47,7409
Bolivia (Plurinational State of),11051600,69,7634
Mali,18541980,42,7708
Tunisia,11532127,69,7916
Guatemala,16913504,51,8572
Dominican Republic,10766998,80,8643
Cuba,11484636,77,8841
Afghanistan,35530082,25,8971
Syrian Arab Republic,18269867,54,9774
Uganda,42862957,23,9942
Yemen,28250420,36,10175
Kazakhstan,18204498,57,10438
Ecuador,16624857,64,10585
Côte d'Ivoire,24294750,50,12227
Kenya,49699863,27,13201
Cameroon,24053727,56,13416
Sudan,40533328,34,13931
Ghana,28833629,55,15976
Myanmar,53370609,30,16183
United Republic of Tanzania,57310020,33,18943
Angola,29784193,65,19312
Ethiopia,104957438,20,21317
Peru,32165484,78,24999
Iraq,38274617,70,26899
Algeria,41318141,72,29771
Viet Nam,95540797,35,33643
Thailand,69037516,49,33966
Democratic Republic of the Congo,81339984,44,35692
South Africa,56717156,66,37348
Colombia,49065613,80,39471
Egypt,97553148,43,41660
Philippines,104918094,47,48978
Bangladesh,164669750,36,59047
Pakistan,197015953,36,71797
Nigeria,190886313,50,94525
Mexico,129163273,80,103159
Indonesia,263991375,55,144295
India,1339180125,34,449965
Does anyone have suggestions on how to fix this issue?
Your Input_file's fields may have trailing characters such as spaces. You can check by running cat -e Input_file, which shows where each line ends, including hidden spaces (and, as your update shows, the ^M of a carriage return). If that is the case, try the following command, which forces $4 into numeric context before comparing:
awk 'BEGIN{FS=","} $4+0 > 999' Input_file
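The ^M$ endings mean the file has Windows (CRLF) line endings, so the last field is really "1242\r"; $4+0 coerces that string to the number 1242. Stripping the carriage return explicitly also works; a minimal sketch on a hypothetical two-line sample:

```shell
# Hypothetical sample with CRLF line endings (the ^M the question mentions).
printf 'Namibia,2533794,49,1242\r\nVanuatu,276244,25,70\r\n' > original.csv
# Remove the trailing \r from each record, then filter numerically on $4.
awk -F, '{sub(/\r$/,"")} $4+0 > 999' original.csv > new.csv
cat new.csv
# prints: Namibia,2533794,49,1242
```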

What is the encoding scheme for this Arabic web page?

I am trying to find the encoding scheme for this page (and others) which are surely Arabic, using lower ASCII range Latin characters to encode the contents.
http://www.saintcyrille.com/2011a.htm
http://www.saintcyrille.com/2011b.htm (English version/translation of that same page)
I have seen several sites and even PDF documents with this encoding, but I can't find the name or method of it.
This specific page is from 2011 and I think this is a pre-Unicode method of encoding Arabic that has fallen out of fashion.
Some sample text:
'D1J'6) 'D1H-J) 'DA5-J)
*#ED'* AJ 3A1 'D*CHJF
JDBJG'
'D#( / 3'EJ -D'B 'DJ3H9J
'D0J J#*J .5J5'K EF -D( # 3H1J'
An extraordinary mojibake case. It looks like the high byte of each Unicode code point in the Arabic text is missing. For instance, ا (U+0627, Arabic Letter Alef) appears as ' (U+0027, Apostrophe).
Let's suppose the missing high byte is always 0x06 in the following PowerShell script (I added some more strings from the very end of the page http://www.saintcyrille.com/2011a.htm to your sample text):
$mojibakes = @'
E3'!K
'D1J'6) 'D1H-J) 'DA5-J)
*#ED'* AJ 3A1 'D*CHJF
JDBJG'
'D#( / 3'EJ -D'B 'DJ3H9J
'D0J J#*J .5J5'K EF -D( # 3H1J'
ED'-8'* :
'D#CD 'D5J'EJ 7H'D 'D#3(H9 'D98JE E-(0 ,/'K H'D5HE JF*GJ (9/ B/'3 'D9J/
J-(0 'D*B/E DD'9*1'A (9J/'K 9F JHE 'D9J/ (B/1 'D%EC'F -*I *3*7J9H' 'DE4'1C) AJ 'D5DH'* HB/'3 'D9J/ HFF5- D0DC 'D'3*A'/) EF -AD) 'D*H() 'D,E'9J) JHE 'D,E9) 15 '(1JD 2011 -J+ JGJ# 'D,EJ9 E9'K DFH'D 31 'DE5'D-) ( 9// EF 'D#('! 'DCGF) 3JCHF -'61'K )
(5F/HB 'D5HE) 9F/ E/.D 'DCFJ3) AAJ A*1) 'D#9J'/ *8G1 AJF' #9E'D 'D1-E) H'D5/B'* HE' JB'(DG' H0DC 9ED EB(HD HEE/H-
HDF' H7J/ 'D#ED #F *4'1CH' 'D'-*A'D'* AJ 19J*CE HCD 9'E H#F*E (.J1
'DE3J- B#'E ... -#B'K B#'E
'@ -split [System.Environment]::NewLine
Function highByte ([byte]$lowByte, [switch]$moreInfo) {
if ( $moreInfo.IsPresent -and (
$lowByte -lt 0x20 -or $lowByte -gt 0x7f )) {
Write-Host $lowByte -ForegroundColor Cyan
}
if ( $lowByte -eq 0x20 ) { 0,$lowByte } else { 6,$lowByte }
}
foreach ( $mojibake in $mojibakes ) {
$aux = [System.Text.Encoding]::
GetEncoding( 1252).GetBytes( [char[]]$mojibake )
[System.Text.Encoding]::BigEndianUnicode.GetString(
$aux.ForEach({(highByte -lowByte $_)})
)
'' # new line separator for better readability
}
The output (run through Google Translate) seems to give a sense roughly similar to the English version of the page, after a fashion…
Output of .\SO\70062779.ps1:
مساءً
الرياضة الروحية الفصحية
تأملات في سفر التكوين
يلقيها
الأب د سامي حلاق اليسوعي
الذي يأتي خصيصاً من حلب ـ سوريا
ملاحظات غ
الأكل الصيامي طوال الأسبوع العظيم محبذ جداً والصوم ينتهي بعد قداس
العيد
يحبذ التقدم للاعتراف بعيداً عن يوم العيد بقدر الإمكان حتى تستطيعوا
المشاركة في الصلوات وقداس العيد ، وننصح لذلك ، الاستفادة من حفلة
التوبة الجماعية يوم الجمعة رص ابريل زذرر حيث يهيأ الجميع معاً لنوال سر
المصالحة ب عدد من الأباء الكهنة سيكون حاضراً ة
بصندوق الصومة عند مدخل الكنيسة ففي فترة الأعياد تظهر فينا أعمال الرحمة
والصدقات وما يقابلها وذلك عمل مقبول وممدوح
ولنا وطيد الأمل أن تشاركوا الاحتفالات في رعيتكم وكل عام وأنتم بخير
المسيح قـام خخخ حـقاً قـام
Please keep in mind that I do not understand Arabic. Some caveats:
The script does not handle numbers: year 2011 in note #2 is incorrectly transformed to زذرر, for instance.
Handling of spaces is unclear: is 0x20 always a space, or should it be transformed to ؠ (U+0620, Arabic Letter Kashmiri Yeh)?
Moreover, there is the problematic presumption about the Unicode range U+0600-U+067F (where are U+0680-U+06FF and the others?).

Trouble converting a fixed-width file into a CSV

Sorry if this is a newbie question, but I didn't find the answer to this particular question on Stack Overflow.
I have a (very large) fixed-width data file that looks like this:
simplefile.txt
ratno fdate ratname typecode country
12346 31/12/2010 HARTZ 4 UNITED STATES
12444 31/12/2010 CHRISTIE 5 UNITED STATES
12527 31/12/2010 HILL AIR 4 UNITED STATES
15000 31/12/2010 TOKUGAVA INC. 5 JAPAN
37700 31/12/2010 HARTLAND 1 UNITED KINGDOM
37700 31/12/2010 WILDER 1 UNITED STATES
18935 31/12/2010 FLOWERS FINAL SERVICES INC 5 UNITED STATES
37700 31/12/2010 MAPLE CORPORATION 1 CANADA
48614 31/12/2010 SERIAL MGMT L.P. 5 UNITED STATES
1373 31/12/2010 AMORE MGMT GROUP N A 1 UNITED STATES
I am trying to convert it into a csv file using the terminal (the file is too big for Excel) that would look like this:
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT L.P.,5,UNITED STATES
1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES
I dug a bit around on this site and found a possible solution that relies on the awk shell command:
awk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' "simplefile.txt"
However, when I execute the above command in the terminal, it also inserts commas at every run of whitespace, splitting apart the words of what is supposed to remain a single field. The result of the above execution is as follows:
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED,STATES
12444,31/12/2010,CHRISTIE,5,UNITED,STATES
12527,31/12/2010,HILL,AIR,4,UNITED,STATES
15000,31/12/2010,TOKUGAVA,INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED,KINGDOM
37700,31/12/2010,WILDER,1,UNITED,STATES
18935,31/12/2010,FLOWERS,FINAL,SERVICES,INC,5,UNITED,STATES
37700,31/12/2010,MAPLE,CORPORATION,1,CANADA
48614,31/12/2010,SERIAL,MGMT,L.P.,5,UNITED,STATES
1373,31/12/2010,AMORE,MGMT,GROUP,N,A,1,UNITED,STATES
How can I avoid inserting commas in white spaces outside of delineated fieldwidths? Thank you!
Your attempt was good, but it requires gawk (GNU awk) for the FIELDWIDTHS built-in variable. With gawk:
$ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{$1=$1;print}' file
ratno, fdate, ratname , typecode, country
12346, 31/12/2010, HARTZ , 4 , UNITED STATES
12444, 31/12/2010, CHRISTIE , 5 , UNITED STATES
12527, 31/12/2010, HILL AIR , 4 , UNITED STATES
Assuming you don't want the extra spaces, you can do instead:
$ gawk -v FIELDWIDTHS="5 11 31 9 16" -v OFS=',' '{for (i=1; i<=NF; ++i) gsub(/^ *| *$/, "", $i)}1' file
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
If you don't have GNU awk, you can achieve the same results with:
$ awk -v fieldwidths="5 11 31 9 16" '
BEGIN { OFS=","; split(fieldwidths, widths) }
{
rec = $0
$0 = ""
start = 1;
for (i=1; i<=length(widths); ++i) {
$i = substr(rec, start, widths[i])
gsub(/^ *| *$/, "", $i)
start += widths[i]
}
}1' file
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
perl is handy here:
perl -nE ' # read this bottom to top
say join ",",
map {s/^\s+|\s+$//g; $_} # trim leading/trailing whitespace
/^(.{5}) (.{10}) (.{30}) (.{8}) (.*)/ # extract the fields
' simplefile.txt
ratno,fdate,ratname,typecode,country
12346,31/12/2010,HARTZ,4,UNITED STATES
12444,31/12/2010,CHRISTIE,5,UNITED STATES
12527,31/12/2010,HILL AIR,4,UNITED STATES
15000,31/12/2010,TOKUGAVA INC.,5,JAPAN
37700,31/12/2010,HARTLAND,1,UNITED KINGDOM
37700,31/12/2010,WILDER,1,UNITED STATES
18935,31/12/2010,FLOWERS FINAL SERVICES INC,5,UNITED STATES
37700,31/12/2010,MAPLE CORPORATION,1,CANADA
48614,31/12/2010,SERIAL MGMT L.P.,5,UNITED STATES
1373,31/12/2010,AMORE MGMT GROUP N A,1,UNITED STATES
Although, for proper CSV, we need to be a bit cautious about fields containing commas or quotes. If I were feeling less secure about the contents of the file, I'd use this map block:
map {s/^\s+|\s+$//g; s/"/""/g; qq("$_")}
which outputs
"ratno","fdate","ratname","typecode","country"
"12346","31/12/2010","HARTZ","4","UNITED STATES"
"12444","31/12/2010","CHRISTIE","5","UNITED STATES"
"12527","31/12/2010","HILL AIR","4","UNITED STATES"
"15000","31/12/2010","TOKUGAVA INC.","5","JAPAN"
"37700","31/12/2010","HARTLAND","1","UNITED KINGDOM"
"37700","31/12/2010","WILDER","1","UNITED STATES"
"18935","31/12/2010","FLOWERS FINAL SERVICES INC","5","UNITED STATES"
"37700","31/12/2010","MAPLE CORPORATION","1","CANADA"
"48614","31/12/2010","SERIAL MGMT L.P.","5","UNITED STATES"
"1373","31/12/2010","AMORE MGMT GROUP N A","1","UNITED STATES"

Detecting encoding of string and converting it

I have string:
string <- "{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKR\\u0118G nr 1: Obejmuje obszar wojew\\xf3dztwa pomorskiego z siedzib\\u0105 ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
As you can see, I have escape codes instead of some letters. As far as I know, these are Unicode escapes for Polish characters like ą, ć, ź, ó and so on. How can I convert this string to obtain the output
"{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKRĘG nr 1: Obejmuje obszar województwa pomorskiego z siedzibą ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
Here's a regular expression to find all escaped characters of the form \udddd and \xdd. We then take those values and re-parse them to turn them into characters. Finally, we replace the original matched values with the true characters:
m <- gregexpr("\\\\u\\d{4}|\\\\x[0-9A-Fa-f]{2}", string)
a <- enc2utf8(sapply(parse(text=paste0('"', regmatches(string,m)[[1]], '"')), eval))
regmatches(string,m)[[1]] <- a
This will do them all. If you only want to do a subset, you could filter the vector of possible replacements.

R: Extracting elements between characters in a web page

I have two lines of info from a web page that I want to parse into a data.frame.
[104] " $1775 / 2br - 1112ft² - Wonderful two bedroom two bathroom with balcony! (14001 NE 183rd Street )"
[269] " var pID = \"4619136687\";"
I'd like it to look like this.
postID |rent|type|size|description |location
4619136687|1775|2br |1112|Wonderful two bedroom...|14001 NE 183rd Street
I was able to use the sub() command to get the ID, but I'm not familiar enough with the regex in the sub() command to parse out what I need when there are spaces, such as in line [104].
sub(".*pID = \"(.*)\";.*","\\1", " var pID = \"4619136687\";")
Any help would be wonderful, Thanks!