Reading and Writing XML files with unknown encoding in Perl?

I am picking up pieces of someone else's large project and trying to right the wrongs. The problem is, I'm just not sure what the correct ways are.
So, I am cURLing a bunch of HTML pages, then writing it to files with simple commands like:
$src = `curl http://google.com`;
open FILE, ">output.html";
print FILE $src;
close FILE;
I want those saved as UTF-8, but what encoding are they actually saved in? I then read each HTML file back in using the same basic 'open' command, parse the HTML with regex calls, build one big string by concatenation, and write it to an XML file (using the same code as above). I have already started using XML::Writer instead, but now I must go through and fix the files that have inaccurate encoding.
So I don't have the HTML anymore, but I still have the XML files that have to display proper characters. Here is an example: http://filevo.com/wkkixmebxlmh.html
The main problem is detecting the character in question and replacing it with "\x{2019}" so that it displays properly in editors. But I can't figure out a regex that actually captures the character in the wild.
UPDATE:
I still cannot detect the ALT-0146 character that's in the XML file I uploaded to Filevo above. I've tried opening it as UTF-8 and searching for /\x{2019}/, /chr(0x2019)/, and just a literal /’/; nothing matches.

Discovering the encoding of an HTML document is hard. See http://blog.whatwg.org/the-road-to-html-5-character-encoding and especially the observation that it requires a "7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while."
This is what I used for my own limited needs when parsing HTML files.
my $CHARACTER_SET_CLASS = '\w:.()-';

sub sniff_charset {
    local $_ = shift;    # the raw document text
    # X(HT)?ML: http://www.w3.org/International/O-charset
    return $1 if /<\?xml [^>]*(?<= )encoding=['"]?([$CHARACTER_SET_CLASS]+)/;
    # X?HTML: http://blog.whatwg.org/the-road-to-html-5-character-encoding
    return $1 if /<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i;
    # CSS: http://www.w3.org/International/questions/qa-css-charset
    return $1 if /\@charset "([^"]*)"/;
    return undef;
}

To make sure you are producing output in UTF-8, apply an encoding layer to the output stream using binmode:
open FILE, '>', 'output.html' or die "output.html: $!";
binmode FILE, ':encoding(UTF-8)';
or in the 3-argument open call:
open FILE, '>:encoding(UTF-8)', 'output.html' or die "output.html: $!";
(The ':utf8' layer also works, but ':encoding(UTF-8)' checks that the data is valid as it is written.)
Arbitrary input is trickier. If you are lucky, HTML input will tell you its encoding early on:
wget http://www.google.com/ -O foo ; head -1 foo
<!doctype html><html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><title>Google</title><script>window.google=
{kEI:"xgngTYnYIoPbgQevid3cCg",kEXPI:"23933,28505,29134,29229,29658,
29695,29795,29822,29892,30111,30174,30215,30275,30562",kCSI:
{e:"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562",ei:"xgngTYnYIoPbgQevid3cCg",expi:
"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562"},authuser:0,ml:function(){},kHL:"en",
time:function(){return(new Date).getTime()},
Ah, there it is: <meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">. Now you can continue to read the input as raw bytes and decode them with the known encoding; the core Encode module (decode('ISO-8859-1', $bytes)) and CPAN can help with this.

I am referring to the updated part of your question (next time, open a new question for a separate topic). This is a hex dump of your file (please refrain in the future from making helpers jump through burning hoops to get at your example data):
0000 3c 78 6d 6c 3e 0d 0a 3c 70 65 72 73 6f 6e 4e 61 <xml>␍␤< personNa
0010 6d 65 3e 47 2e 20 50 65 74 65 72 20 44 61 80 41 me>G. Pe ter Da�A
0020 6c 6f 69 61 3c 2f 70 65 72 73 6f 6e 4e 61 6d 65 loia</pe rsonName
0030 3e 0d 0a 3c 2f 78 6d 6c 3e 0d 0a >␍␤</xml >␍␤
You said you know the character should be ’, but it got totally mangled: no common encoding represents U+2019 as the single byte 0x80 (CP-1252, which is what ALT-0146 types into, uses 0x92). This looks like a paste accident where you transferred data between editors/clipboards instead of dealing with just files. If that's not the case, then your cow orker produced a wrong you are not able to right algorithmically.
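For reference, ALT-0146 enters the CP-1252 byte 0x92, which maps to U+2019; plain Latin-1 treats 0x92 as a control character instead. If the file had contained the expected 0x92, a small lookup table could repair it. A minimal sketch, written here in Node for brevity (the function and the partial mapping table are mine, not from the thread):

```javascript
// CP-1252 places typographic punctuation in 0x80-0x9F, where Latin-1 has
// control characters. ALT-0146 types byte 0x92, which CP-1252 maps to U+2019.
// (Partial table; extend as needed.)
const cp1252Extras = {
  0x91: '\u2018', 0x92: '\u2019', 0x93: '\u201C', 0x94: '\u201D',
};

function decodeCp1252(buf) {
  let out = '';
  for (const b of buf) {
    // Fall back to Latin-1, which agrees with CP-1252 outside 0x80-0x9F.
    out += cp1252Extras[b] || String.fromCharCode(b);
  }
  return out;
}

// The bytes "D a 0x92 A" decode to "Da’A", which is what the dump above
// presumably should have contained instead of the 0x80 byte.
```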

How to override Content-Type/charset specified in HTTP header using HTML/CSS/JS

Test Case
I have a live test case available here: https://lonelearner.github.io/charset-issue/index.html
Since the HTML has non-ASCII characters, here is how you can reproduce this test case reliably on your system, using either of these methods:
Fetch the page from the above URL:
curl https://lonelearner.github.io/charset-issue/index.html -O
Run this command:
echo "
3c21444f43545950452068746d6c3e0a3c68746d6c3e0a20203c68656164
3e0a202020203c7469746c653e636861727365742069737375653c2f7469
746c653e0a202020203c6d65746120687474702d65717569763d22436f6e
74656e742d547970652220636f6e74656e743d22746578742f68746d6c3b
20636861727365743d69736f2d383835392d31223e0a20203c2f68656164
3e0a20203c626f64793e0a202020203c703ea93c2f703e0a20203c2f626f
64793e0a3c2f68746d6c3e0a
" | xxd -p -r > index.html
Interesting Byte
Let us look at the ISO-8859-1 encoded character that we are concerned about in this question.
$ curl -s https://lonelearner.github.io/charset-issue/index.html | xxd -g1
00000000: 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a <!DOCTYPE html>.
00000010: 3c 68 74 6d 6c 3e 0a 20 20 3c 68 65 61 64 3e 0a <html>. <head>.
00000020: 20 20 20 20 3c 74 69 74 6c 65 3e 63 68 61 72 73 <title>chars
00000030: 65 74 20 69 73 73 75 65 3c 2f 74 69 74 6c 65 3e et issue</title>
00000040: 0a 20 20 20 20 3c 6d 65 74 61 20 68 74 74 70 2d . <meta http-
00000050: 65 71 75 69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 equiv="Content-T
00000060: 79 70 65 22 20 63 6f 6e 74 65 6e 74 3d 22 74 65 ype" content="te
00000070: 78 74 2f 68 74 6d 6c 3b 20 63 68 61 72 73 65 74 xt/html; charset
00000080: 3d 69 73 6f 2d 38 38 35 39 2d 31 22 3e 0a 20 20 =iso-8859-1">.
00000090: 3c 2f 68 65 61 64 3e 0a 20 20 3c 62 6f 64 79 3e </head>. <body>
000000a0: 0a 20 20 20 20 3c 70 3e a9 3c 2f 70 3e 0a 20 20 . <p>.</p>.
000000b0: 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e 0a </body>.</html>.
In the second-to-last row (the line at offset 000000a0), the 9th byte is a9. That is our interesting byte: the ISO-8859-1 representation of the copyright sign. Note that this is the ISO-8859-1 encoded symbol, not UTF-8. If it had been UTF-8 encoded, the bytes would be c2 a9.
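If Node is handy, the two byte sequences are easy to confirm with a quick sketch (independent of the test case itself):

```javascript
// '\u00a9' is the copyright sign. One byte in ISO-8859-1 (which Node
// calls 'latin1'), two bytes in UTF-8.
const latin1 = Buffer.from('\u00a9', 'latin1');
const utf8 = Buffer.from('\u00a9', 'utf8');

console.log(latin1); // <Buffer a9>
console.log(utf8);   // <Buffer c2 a9>
```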
META Tag
To ensure that the content of this HTML file is interpreted as ISO-8859-1 encoded data, there is this <meta> tag in the HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Local Behavior
If you open this file locally on your system with a browser, the copyright sign renders correctly.
This is expected, because when the file is opened locally there is no HTTP server sending HTTP headers, so the iso-8859-1 encoding specified in the <meta> tag is honored.
GitHub Behaviour
If you access the URL https://lonelearner.github.io/charset-issue/index.html with a browser, you would most likely see a garbled character, because a lone a9 byte is not valid UTF-8.
This is also expected: the page is served by GitHub Pages, and the GitHub Pages server always returns an HTTP header that specifies UTF-8 encoding.
$ curl -sI https://lonelearner.github.io/charset-issue/index.html | grep -i content-type
content-type: text/html; charset=utf-8
Since HTTP header specifies the character encoding, the character encoding in <meta> tag is no longer honored.
Question
Is there anyway I can override the character encoding specified in the HTTP header using HTML, JavaScript or CSS to tell the browser that this content should be interpreted as ISO-8859-1 encoding even if the HTTP header says otherwise?
I know I can always write the copyright symbol as © or encode the symbol in UTF-8 in the file, but let us consider such solutions to be outside the scope of this question because here are the constraints I am dealing with:
The content of the <body> is made available to me as ISO-8859-1 encoded text.
I cannot modify the content of the <body>. I must use the ISO-8859-1 encoded text in my HTML.
I can modify anything within the <head> tag. So I can add JavaScript, CSS or any other tricks that can solve this problem.
Is there anyway I can override the character encoding specified in the HTTP header using HTML, JavaScript or CSS to tell the browser that this content should be interpreted as ISO-8859-1 encoding even if the HTTP header says otherwise?
No. The HTTP header is authoritative; from the W3C:
"...The HTTP header has a higher precedence than the in-document meta
declarations, content authors should always take into account whether
the character encoding is already declared in the HTTP header. If it
is, the meta element must be set to declare the same encoding."

Looking for a decoding algorithm for datetime in MySQL. See examples; reward for solution

I have tried some of the online references, as well as Unix time formats, etc., but none of these seem to work. See the examples below.
Running MySQL 5.5.5 on Ubuntu, InnoDB engine.
Nothing is custom; this is the built-in datetime type.
Here are some examples with the 6-byte hex string and the decoded value below. We are looking for the decoding algorithm, i.e. how to turn the 6-byte hex string into the correct date/time. The algorithm must work correctly on the examples below. The rightmost byte seems to indicate differences in seconds correctly for small differences in time between records; e.g. we show an example with a 14-second difference.
Full records in a nicely highlighted and formatted Word doc here:
https://www.dropbox.com/s/zsqy9o2rw1h0e09/mysql%20datetime%20examples%20.docx?dl=0
contact frank%simrex.com re. reward.
replace % with #
hex strings and decoded date/time pairs are below.
pulled from healthy file running mysql
12 51 72 78 B9 46 ... 2014-10-22 16:53:18
12 51 72 78 B9 54 ... 2014-10-22 16:53:32
12 51 72 78 BA 13 ... 2014-10-22 16:55:23
12 51 72 78 CC 27 ... 2014-10-22 17:01:51
here you go.
select str_to_date(conv(replace('12 51 72 78 CC 27',' ', ''), 16, 10), '%Y%m%d%H%i%s')
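To spell out what that query exploits: conv() turns the hex string into the decimal number 20141022170151, i.e. the six bytes are simply the digits YYYYMMDDHHMMSS stored as a hex-encoded integer. A standalone sketch of the same decoding in JavaScript (the function name is mine):

```javascript
// The six bytes are just the decimal number YYYYMMDDHHMMSS stored as hex:
// 0x12517278CC27 === 20141022170151 -> 2014-10-22 17:01:51.
function decodeDatetime(hex) {
  const n = BigInt('0x' + hex.replace(/\s+/g, '')).toString().padStart(14, '0');
  return `${n.slice(0, 4)}-${n.slice(4, 6)}-${n.slice(6, 8)} ` +
         `${n.slice(8, 10)}:${n.slice(10, 12)}:${n.slice(12, 14)}`;
}

console.log(decodeDatetime('12 51 72 78 B9 46')); // 2014-10-22 16:53:18
```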

Node streams: remote stream is returning JSON, but on('data') shows buffer

I'm just starting out with Node streams.
My library's demo code uses:
stream.pipe(process.stdout, {end: true});
Which works fine, printing chunks of JSON to standard output.
I'd like to use:
stream.on('data', function(chunk) {
  console.log(chunk);
});
But I get a binary buffer instead:
chunk! <Buffer 7b 22 73 74 72 65 61 6d 22 3a 22 20 2d 2d 2d 5c 75 30 30 33 65 20 35 35 32 38 38 36 39 62 30 30 33 37 5c 6e 22 7d>
Is there a way I can use on('data') and see the JSON?
I believe you should run stream.setEncoding('utf8') on your stream, so Node.js core will decode the UTF-8 automatically.
You should probably not use chunk.toString('utf8') as suggested in the other answer, because it can garble multi-byte characters that fall on chunk boundaries, unless you're sure the data arrives in one block.
Use chunk.toString('utf8'). Also, the Buffer class supports other encodings too!

Inserting binary into MySQL from node.js

I'm using the crypto module to create salts and hashes for storage in my database. I'm using SHA-512, if that's relevant.
What I have is a 64-byte salt, presently in the form of a "SlowBuffer", created by crypto.randomBytes(64, ...). Example:
<SlowBuffer 91 0d e9 23 c0 8e 8c 32 5c d6 0b 1e 2c f0 30 0c 17 95 5c c3 95 41 fd 1e e7 6f 6e f0 19 b6 34 1a d0 51 d3 b2 8d 32 2d b8 cd be c8 92 e3 e5 48 93 f6 a7 81 ...>
I also have a 64-byte hash that is currently a string. Example:
'de4c2ff99fb34242646a324885db79ca9ef82a5f4b36c657b83ecf6931c008de87b6daf99a1c46336f36687d0ab1fc9b91f5bc07e7c3913bec3844993fd2fbad'
In my database, I have two fields, called passhash and passsalt, which are binary(64)s.
I'm using the mysql module (require('mysql')) to add the row. How can I include the binaries for insertion?
First of all, I'm no longer using the mysql module, but the mysql2 module (because it supports prepared statements). This changes roughly nothing about the question, but I'd like to mention that those reading this who are using 'mysql' should probably use 'mysql2'.
Second, both of these modules can take Buffers as parameters for these fields. This works perfectly. To do what I was originally attempting to do, I have it like this:
var hash; // pretend this was initialized as a 64-byte Buffer
var salt; // likewise
connection.query('insert into users set ?', {..., passhash: hash, passsalt: salt, ...}, callback);
Additionally, I didn't realize that crypto's digest() method, when called with no encoding argument, returns a Buffer.
This is not the best worded answer, but that's because no one seems to be paying much attention to this question, and it was my question. :) If you would like more clarification, feel free to comment.
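One detail worth a sketch: if the hash is still the hex string from the question rather than a Buffer, Buffer.from with the 'hex' encoding recovers the 64 raw bytes before you bind it to the binary(64) column:

```javascript
// A 64-byte SHA-512 digest prints as 128 hex characters; Buffer.from
// with the 'hex' encoding turns it back into raw bytes for binary(64).
const hashHex =
  'de4c2ff99fb34242646a324885db79ca9ef82a5f4b36c657b83ecf6931c008de' +
  '87b6daf99a1c46336f36687d0ab1fc9b91f5bc07e7c3913bec3844993fd2fbad';
const hashBuf = Buffer.from(hashHex, 'hex');

console.log(hashBuf.length); // 64
```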

Extract JSON-response into parameter

I need to extract this specific JSON field into a parameter for my performance test in Visual Studio:
"ExamAnswerId": "757a3735-e626-412b-934c-e577c6963d51"
the problem occurs when I try to do this manually by right clicking the response and click "add extraction rule". The text is split up into 3 different rows with lots of unreadable numbers next to it like this:
0x00000000 7B 22 45 78 61 6D 41 6E 73 77 65 72 49 64 22 3A {"ExamAnswerId":
0x00000010 22 37 35 37 61 33 37 33 35 2D 65 36 32 36 2D 34 "757a3735-e626-4
This will sound dumb, but I somehow need to extract 3 different parameters, only because I can't copy/paste it, and this is also where I think I fail.
The ExamAnswerId is important for me to fulfill another web request later on, but I can't seem to pass it on properly.
All input greatly appreciated!
Did you see this response that got posted? http://social.msdn.microsoft.com/Forums/en-US/vstest/thread/b26114a2-7a24-45eb-b5d1-01e9165045b0/
Just use the Extract Text rule like they suggest and you should be fine. Your Starts With could be "ExamAnswerId": " and your Ends With could be ". HTH.
I had a similar problem with the Extraction rule. Had to escape the quotes to get the condition to work. Like this:
Starts With: \"ExamAnswerId\":\"
Ends With: \"
Do the following steps:
Add all available ExamAnswerId values in a CSV file.
Now add the CSV as a data source.
Let's suppose the CSV file name is testdata, so tableName=testdata#csv and columnName=ExamAnswerId.
Please note that when you add the data source you will see the table name.
Replace this:
["ExamAnswerId": "757a3735-e626-412b-934c-e577c6963d51"]
by this:
["ExamAnswerId": "{{testdata.testdata#csv.ExamAnswerId}}"]
I needed to manage a session ID in JSON. The following link worked for me:
https://social.msdn.microsoft.com/Forums/vstudio/en-US/b26114a2-7a24-45eb-b5d1-01e9165045b0/cant-fetch-json-value-and-extract-to-parameter?forum=vstest
Example:
0x00000000 7B 22 53 65 73 73 69 6F 6E 22 3A 22 63 66 39 37 {"Session":"cf97
0x00000010 64 33 65 61 2D 36 39 38 33 2D 34 31 37 30 2D 38
I created an "Extract Text" rule with the variable MySessionID:
Left Boundary "Session":
Right Boundary ",
Then I passed {{MySessionID}} in subsequent request in place of Session.
Or we can use a regex (positive lookahead and positive lookbehind).
For example, I want to get the access_token property in a JSON result that looks like this:
{"token_type":"Bearer","expires_in":"3600","ext_expires_in":"0","expires_on":"1474420129","not_before":"1474416229","resource":"5fe3f443","access_token":"eyJ0eXAiOiJKV1QiLCJhbGci"}
I can use this regex:
(?<=\"access_token\"\:\").*(?=\")
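The same lookaround approach works in JavaScript (lookbehind needs a reasonably recent engine, ES2018 or later). A lazy .*? is safer than the greedy .* in case access_token is not the last property; a quick sketch with a made-up response body:

```javascript
// Sample response body; the token value is the truncated example from
// above, not a real token, and here access_token is not the last property.
const body = '{"token_type":"Bearer","expires_in":"3600",' +
  '"access_token":"eyJ0eXAiOiJKV1QiLCJhbGci","resource":"5fe3f443"}';

// Lazy .*? stops at the first closing quote, so properties that follow
// access_token are not swallowed by a greedy match.
const token = body.match(/(?<="access_token":").*?(?=")/)[0];
console.log(token); // eyJ0eXAiOiJKV1QiLCJhbGci
```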