Is there a way to specify that a certain part of an HTML file is in another encoding?
The default encoding for the (generated) HTML is UTF-8. However, some of the data to be inserted into the HTML is in another encoding. It's something like:
<div>
the normal html in utf-8
</div>
<div>
<%= raw_data_in_another_encoding %>
</div>
Is there a way to tell the browser to render the second <div> in another encoding? Thanks.
No, the entire file must have a single encoding. If you're saving a plain .html file, you'll have to convert the entire file to one encoding.
If you're using a server-side scripting language, however, you can always convert text from one encoding to another. You might designate UTF-8 as the encoding for the page, and then when you encounter bits of content currently encoded in, say, latin1, you can simply convert it to UTF-8 before outputting it.
How you do that, of course, would depend on the particular server-side language you're using.
In PHP, you could do:
echo iconv('ISO-8859-1', 'UTF-8', $someLatin1Text);
You can send any arbitrary encoding at any point in your HTTP response stream, but generally your client won't be able to deal with it. In HTML, multiple encodings in the same document simply aren't permitted, nor are they gracefully handled by any modern client, except perhaps by accident.
If you are using Ruby (guessing based only on your naming conventions), you can convert a string from one encoding to another using the iconv library (or String#encode in Ruby 1.9 and later). If you're using something else, there's most likely a similar alternative. PHP offers encoding translation based on the iconv library, and Python ships its own built-in codecs. In the .NET Framework, you can use the Encoding class to get the suitable source encoding, and call GetString with your source byte array as the parameter to get a string suitable for further manipulation.
Numerical character references are another option, if you are primarily using another encoding and only occasionally using characters outside of that encoding's supported range. However, you're generally going to stay saner by converting to and from UTF-8 from legacy encodings.
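As a minimal sketch of both options in Python (the sample word is only an illustration, and Latin-1 is assumed as the legacy encoding): the first conversion turns legacy bytes into UTF-8 before they are written into the page, and the second replaces characters outside the target encoding with numeric character references.

```python
# Bytes as they might arrive from a legacy Latin-1 source (assumed for illustration).
latin1_bytes = "caf\u00e9".encode("latin-1")

# Option 1: decode from the legacy encoding, then encode as UTF-8 for the page.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

# Option 2: keep an ASCII-only page and emit numeric character references instead.
ncr_bytes = "caf\u00e9".encode("ascii", "xmlcharrefreplace")

print(utf8_bytes)  # b'caf\xc3\xa9'
print(ncr_bytes)   # b'caf&#233;'
```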
I don't think you can, but if you need some text to be shown in a different encoding, you can write a "translating function". I had a similar problem with an English page where I had to add some Spanish messages, so I did something like this:
function spanishEncoding (string) {
var res = string;
res = res.replace( /á/g, "\u00e1" );
res = res.replace( /Á/g, "\u00c1" );
res = res.replace( /é/g, "\u00e9" );
res = res.replace( /É/g, "\u00c9" );
res = res.replace( /í/g, "\u00ed" );
res = res.replace( /Í/g, "\u00cd" );
res = res.replace( /ó/g, "\u00f3" );
res = res.replace( /Ó/g, "\u00d3" );
res = res.replace( /ú/g, "\u00fa" );
res = res.replace( /Ú/g, "\u00da" );
res = res.replace( /ñ/g, "\u00f1" );
res = res.replace( /Ñ/g, "\u00d1" );
return res; };
var newDiv = window.content.document.createElement("div");
newDiv.appendChild(window.content.document.createTextNode("Esta página")); //This shows "Esta p*Â!*gina"
var anotherDiv = window.content.document.createElement("div");
anotherDiv.appendChild(window.content.document.createTextNode(spanishEncoding("Esta página"))); //This shows "Esta página"
Hope it helps!
I have extracted the FIX message below from a Unix server and now need to convert it into JSON. How can I do this?
8=FIXT.1.1|9=449|11=ABCD1|35=AE|34=1734|49=REPOFIXUAT|52=20140402-11:38:34|56=TR_UAT_VENDOR|1128=8|15=GBP|31=1.7666|32=50000000.00|55=GBP/USD|60=20140402-11:07:33|63=B|64=20140415|65=OR|75=20140402|150=F|167=FOR|194=1.7654|195=0.0012|460=4|571=7852455|1003=2 USD|1056=88330000.00|1057=N|552=1|54=2|37=20140402-12:36:48|11=NOREF|453=4|448=ZERO|447=D|452=3|448=MBY2|447=D|452=1|448=LMEB|447=D|452=16|448=DOR|447=D|452=11|826=0|78=1|79=default|80=50000000.00|5967=88330000.00|10=111
Note: I tried to make this a comment on the answer provided by @selbie, but the text was too long for a comment, so I am making it an answer.
@selbie's answer will work most of the time, but there are two edge cases in which it could fail.
First, in a tag=value field where the value is of type STRING, it is legal for value to contain the = character. To correctly cope with this possibility, the Java statement:
pair = item.split("=");
should be changed to:
pair = item.split("=", 2);
The second edge case is when there is a pair of fields, the first of which is of type LENGTH and the second of type DATA. In this case, the value of the LENGTH field specifies the length of the DATA field (without the delimiter), and it is legal for the value of the DATA field to contain the delimiter character (ASCII character 1, but denoted as | in both the question and Selbie's answer). Selbie's code cannot be modified in a trivial manner to deal with this edge case. Instead, you will need a more complex algorithm that consults a FIX data dictionary to determine the type of each field.
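A rough sketch of such an algorithm in Python follows. The `LENGTH_TO_DATA` table is a tiny hand-written stand-in for a real FIX data dictionary (it lists only the RawDataLength/RawData and XmlDataLen/XmlData pairs), `parse_fix` is a hypothetical helper name, and the sketch assumes one character per byte so that string offsets equal byte lengths.

```python
SOH = "\x01"

# Hypothetical excerpt of a data dictionary: LENGTH tag -> the DATA tag it sizes.
LENGTH_TO_DATA = {"95": "96", "212": "213"}

def parse_fix(msg, delim=SOH):
    """Return a list of (tag, value) pairs, honouring LENGTH/DATA pairs."""
    fields = []
    pending = None  # (data_tag, length) announced by the previous LENGTH field
    pos = 0
    while pos < len(msg):
        eq = msg.index("=", pos)          # only the first '=' separates tag and value
        tag = msg[pos:eq]
        start = eq + 1
        if pending and pending[0] == tag:
            end = start + pending[1]      # DATA field: trust the announced length
        else:
            end = msg.find(delim, start)  # ordinary field: read up to the delimiter
            if end == -1:
                end = len(msg)
        value = msg[start:end]
        fields.append((tag, value))
        pending = (LENGTH_TO_DATA[tag], int(value)) if tag in LENGTH_TO_DATA else None
        pos = end + len(delim)
    return fields

# '|' stands in for SOH, as in the question; tag 96 contains both '|' and '='.
sample = "8=FIX.4.4|95=5|96=a|b=c|10=000|"
print(parse_fix(sample, delim="|"))
```

Because the value of an ordinary field runs from the first = to the next delimiter, this also covers the first edge case without a separate split call.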
Since you didn't tag your question for any particular programming language, I'll give you a few sample solutions:
In javascript:
let s = "8=FIXT.1.1|9=449|11=ABCD1|35=AE|34=1734|49=REPOFIXUAT|52=20140402-11:38:34|56=TR_UAT_VENDOR|1128=8|15=GBP|31=1.7666|32=50000000.00|55=GBP/USD|60=20140402-11:07:33|63=B|64=20140415|65=OR|75=20140402|150=F|167=FOR|194=1.7654|195=0.0012|460=4|571=7852455|1003=2 USD|1056=88330000.00|1057=N|552=1|54=2|37=20140402-12:36:48|11=NOREF|453=4|448=ZERO|447=D|452=3|448=MBY2|447=D|452=1|448=LMEB|447=D|452=16|448=DOR|447=D|452=11|826=0|78=1|79=default|80=50000000.00|5967=88330000.00|10=111"
let obj = {};
let items = s.split("|");
items.forEach(item => {
    let pair = item.split("=");
    obj[pair[0]] = pair[1];
});
let jsonString = JSON.stringify(obj);
Python:
import json
s = "8=FIXT.1.1|9=449|11=ABCD1|35=AE|34=1734|49=REPOFIXUAT|52=20140402-11:38:34|56=TR_UAT_VENDOR|1128=8|15=GBP|31=1.7666|32=50000000.00|55=GBP/USD|60=20140402-11:07:33|63=B|64=20140415|65=OR|75=20140402|150=F|167=FOR|194=1.7654|195=0.0012|460=4|571=7852455|1003=2 USD|1056=88330000.00|1057=N|552=1|54=2|37=20140402-12:36:48|11=NOREF|453=4|448=ZERO|447=D|452=3|448=MBY2|447=D|452=1|448=LMEB|447=D|452=16|448=DOR|447=D|452=11|826=0|78=1|79=default|80=50000000.00|5967=88330000.00|10=111"
obj = {}
for item in s.split("|"):
    pair = item.split("=")
    obj[pair[0]] = pair[1]
jsonString = json.dumps(obj)
Porting the above solutions to other languages is left as an exercise. There are comments below about semantic ordering and about handling cases where the = or | characters are part of the content. It's up to you to explore those if you need to support such scenarios.
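One thing to be aware of: the sample message repeats some tags (11, 448, 447 and 452 each appear more than once), and a plain object or dict keeps only the last value for each. If order and repeats matter, a possible variation, sketched here in Python with a shortened message and a hypothetical helper name, is to emit a list of tag/value pairs instead:

```python
import json

def fix_to_pairs(message, delim="|"):
    """Split a FIX message into [tag, value] pairs, keeping order and duplicates."""
    pairs = []
    for item in message.split(delim):
        if not item:
            continue                        # skip the empty piece after a trailing delimiter
        tag, _, value = item.partition("=")  # split on the first '=' only
        pairs.append([tag, value])
    return pairs

short_message = "11=ABCD1|448=ZERO|447=D|448=MBY2|447=D|58=a=b"
json_string = json.dumps(fix_to_pairs(short_message))
print(json_string)
```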
I have been struggling with the question below for days. I posted the same question earlier and didn't get any helpful feedback.
I'm using MySQL's built-in AES_ENCRYPT function to encrypt new and existing data.
https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html
SET @@SESSION.block_encryption_mode = 'aes-256-ecb';
INSERT INTO test_aes_ecb ( column_one, column_two )
values ( aes_encrypt('text','key'), aes_encrypt('text', 'key'));
I used the ECB cipher mode, so no IV is needed. The issue is that I can't decrypt the data on the Node.js side.
I'm using Sequelize and tried to fetch the data through the model and then decrypt it on the Node side.
I tried the following libraries:
"aes-ecb": "^1.3.15",
"aes256": "^1.1.0",
"crypto-js": "^4.1.1",
"mysql-aes": "0.0.1",
Below is the code snippet with the Sequelize call:
async function testmysqlAESModel () {
const users = await test.findAll();
console.log('users', users[0].column_one);
var decrypt = AES.decrypt( users[0].column_one, 'key' );
}
It returns buffer data that I couldn't decrypt on the Node side. Can someone provide a proper example? I've been struggling for days.
EDIT
I inserted a record into MySQL with the query below.
SET @@SESSION.block_encryption_mode = 'aes-256-ecb';
INSERT INTO test_aes_ecb ( id, column_one, column_two )
VALUES (1, 2,AES_ENCRYPT('test',UNHEX('gVkYp3s6v9y$B&E)H#McQeThWmZq4t7w')));
In Node.js I call it like this:
testmysqlAESModel();
async function testmysqlAESModel () {
const users = await test.findAll();
console.log('users', users[0].column_one);
var decipher = crypto.createDecipheriv(algorithm, Buffer.from("gVkYp3s6v9y$B&E)H#McQeThWmZq4t7w", "hex"), "");
var encrypted = Buffer.from(users[0].column_one); // Note that this is what is stored inside your database, so that corresponds to users[0].column_one
var decrypted = decipher.update(encrypted, 'binary', 'utf8');
decrypted += decipher.final('utf8');
console.log(decrypted);
}
I'm getting an error.
I used the link below to create a 256-bit key.
https://www.allkeysgenerator.com/Random/Security-Encryption-Key-Generator.aspx
I still couldn't fix it. Can you provide a sample project or a supporting code snippet?
There are multiple issues here:
Ensure that your key has the correct length. AES is specified for certain key lengths (i.e. 128, 192 and 256 bits). If you use any other key length, your key will be padded (zero extended) or truncated by the crypto library. This is a non-standard process, and different implementations will do it differently. To avoid this, use a key of the correct length and store it as hex instead of ASCII (to avoid charset issues).
Potential issues regarding password-to-key derivation. Some AES implementations use methods to derive keys from passwords/passphrases. Since you are using raw keys in MySQL, you do not want to derive anything and should use raw keys in Node.js as well. This means that if you are using the native crypto module, you want to use createDecipheriv instead of createDecipher.
Caution: The AES mode you are using (ECB) is inherently insecure, because equal input leads to equal output. There are ways around that using other AES modes, such as CBC or GCM. You have been warned.
Example:
MySQL SELECT AES_ENCRYPT('text',UNHEX('F3229A0B371ED2D9441B830D21A390C3')) as test; returns the buffer [145,108,16,83,247,49,165,147,71,115,72,63,152,29,218,246];
Decoding this in Node could look like this:
var crypto = require('crypto');
var algorithm = 'aes-128-ecb';
var decipher = crypto.createDecipheriv(algorithm, Buffer.from("F3229A0B371ED2D9441B830D21A390C3", "hex"), "");
var encrypted = Buffer.from([145,108,16,83,247,49,165,147,71,115,72,63,152,29,218,246]); // Note that this is what is stored inside your database, so that corresponds to users[0].column_one
var decrypted = decipher.update(encrypted, 'binary', 'utf8');
decrypted += decipher.final('utf8');
console.log(decrypted);
This prints text again.
Note that F3229A0B371ED2D9441B830D21A390C3 is the key in this example, you would obviously have to create your own. Just ensure that your key has the same length as the example, and is a valid hex string.
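To check a candidate key before using it, a small sanity check along these lines may help. It is only a sketch in Python with a hypothetical helper name, and it does no encryption itself: it just confirms that the string is valid hex and decodes to 16, 24 or 32 bytes.

```python
def aes_key_from_hex(hex_key):
    """Decode a hex string and verify it is a valid AES key length."""
    key = bytes.fromhex(hex_key)  # raises ValueError if the string is not hex
    if len(key) not in (16, 24, 32):
        raise ValueError("expected 16, 24 or 32 key bytes, got %d" % len(key))
    return key

example_key = aes_key_from_hex("F3229A0B371ED2D9441B830D21A390C3")
print(len(example_key) * 8)  # key size in bits

# The key from the question is 32 ASCII characters, not hex, so it is rejected.
try:
    aes_key_from_hex("gVkYp3s6v9y$B&E)H#McQeThWmZq4t7w")
    question_key_is_hex = True
except ValueError:
    question_key_is_hex = False
print(question_key_is_hex)
```

This matches the symptom in the edited question: the generated key there is not a hex string, so treating it as hex on either side cannot yield a usable key.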
I've used the Facebook feature to download all my data. The resulting zip file contains meta information in JSON files. The problem is that unicode characters in strings in these JSON files are escaped in a weird way.
Here's an example of such a string:
"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
When I try parse the string for example with javascript's JSON.parse() and print it out I get:
"nejnižšà bod: 0 mnm Benátky\n"
While it should be
"nejnižší bod: 0 mnm Benátky\n"
I can see that \u00c5\u00be should somehow correspond to ž but I can't figure out the general pattern.
I've been able to figure out these characters so far:
'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',
So what is this weird encoding? Is there any known tool that can correctly decode it?
The underlying bytes are valid UTF-8, but each byte has been written out as its own \u00XX escape, so JSON.parse() turns every byte into a separate UTF-16 code unit. To recover the text, collect those code units back into bytes and decode them as UTF-8:
function decode(s) {
let d = new TextDecoder;
let a = s.split('').map(r => r.charCodeAt());
return d.decode(new Uint8Array(a));
}
let s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n";
s = decode(s);
console.log(s);
https://developer.mozilla.org/docs/Web/API/TextDecoder
You can use a regular expression to find groups of these almost-Unicode characters, encode them as Latin-1, and then decode the resulting bytes as UTF-8.
The following code should work in python3.x:
import re
re.sub(r'[\xc2-\xf4][\x80-\xbf]+',lambda m: m.group(0).encode('latin1').decode('utf8'), s)
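If the goal is to repair a whole export rather than one string, the same Latin-1 round trip can be applied to every string after parsing. The following is a sketch only: `fix_text` is a hypothetical helper, the sample JSON is made up from the string in the question, and strings that do not survive the round trip are left untouched.

```python
import json

def fix_text(value):
    """Recursively repair strings whose UTF-8 bytes were stored as code points."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not damaged in this way, keep as-is
    if isinstance(value, list):
        return [fix_text(v) for v in value]
    if isinstance(value, dict):
        return {fix_text(k): fix_text(v) for k, v in value.items()}
    return value

raw = r'{"title": "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod", "count": 3}'
fixed = fix_text(json.loads(raw))
print(fixed["title"])
```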
The JSON file itself is UTF-8, but the strings inside it were first encoded to UTF-8 byte sequences, and each of those bytes was then written out as a separate \u00XX escape sequence.
This command fixes a file like this in Emacs:
(defun k/format-facebook-backup ()
"Normalize a Facebook backup JSON file."
(interactive)
(save-excursion
(goto-char (point-min))
(let ((inhibit-read-only t)
(size (point-max))
bounds str)
(while (search-forward "\"\\u" nil t)
(message "%.f%%" (* 100 (/ (point) size 1.0)))
(setq bounds (bounds-of-thing-at-point 'string))
(when bounds
(setq str (--> (json-parse-string (buffer-substring (car bounds)
(cdr bounds)))
(string-to-list it)
(apply #'unibyte-string it)
(decode-coding-string it 'utf-8)))
(setf (buffer-substring (car bounds) (cdr bounds))
(json-serialize str))))))
(save-buffer))
Thanks to Jen's excellent question and Shawn's comment.
Basically, Facebook seems to take each individual byte of the UTF-8 representation of the string and then export it to JSON as if each byte were an individual Unicode code point.
What we need to do is take last two characters of each sextet (e.g. c3 from \u00c3), concatenate them together and read as a Unicode string.
This is how I do it in Ruby (see gist):
require 'json'
require 'uri'
bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/
txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
$1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
end
good_data = JSON.load(txt)
With bytes_re we catch all sequences of bad Unicode characters.
Then for each sequence replace '\u00' with '\x' (e.g. \xc3), put quotes around it " and use Ruby's built-in string parsing so that the \xc3\xbe... strings are converted to actual bytes, that will later remain as Unicode characters in the JSON or properly quoted by the #to_json method.
The [1...-1] is to remove quotes inserted by #to_json
I wanted to explain the code because the question is not Ruby-specific and readers may use another language.
I guess somebody could do it with a sufficiently ugly sed command.
Just adding the general rule how to get from something like '\u00c5\u0098' to 'Ř'. Putting together the last two letters from the \u parts gets you c5 and 98 which are the two bytes of the utf-8 representation. UTF-8 encodes the code point in two bytes like this: 110xxxxx 10xxxxxx, where x are the actual bits of the character code. You can take the two bytes, use & to get the x parts, put them one after the next and read that as a number and you get the 0x158, which is the code for 'Ř'.
My javascript implementation:
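function fixEncoding(s) {
    // Matches two consecutive \u00XX escapes, i.e. one two-byte UTF-8 sequence.
    var reg = /\\u00([a-f0-9]{2})\\u00([a-f0-9]{2})/gi;
    return s.replace(reg, function(a, m1, m2) {
        var b1 = parseInt(m1, 16);
        var b2 = parseInt(m2, 16);
        var maskedb1 = b1 & 0x1F; // payload bits of the 110xxxxx lead byte
        var maskedb2 = b2 & 0x3F; // payload bits of the 10xxxxxx continuation byte
        var result = (maskedb1 << 6) | maskedb2;
        return String.fromCharCode(result);
    });
}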
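The bit arithmetic in the rule above can be checked in a few lines of Python. This is a worked example for the two-byte case only, using the c5/98 pair from the explanation; the helper name is made up, and the result is compared against Python's own UTF-8 decoder.

```python
def decode_two_byte(b1, b2):
    """Combine a 110xxxxx lead byte and a 10xxxxxx continuation byte."""
    code_point = ((b1 & 0x1F) << 6) | (b2 & 0x3F)
    return code_point

code_point = decode_two_byte(0xC5, 0x98)
print(hex(code_point))   # 0x158
print(chr(code_point))   # the character at U+0158

# Cross-check against the built-in decoder.
assert chr(code_point) == bytes([0xC5, 0x98]).decode("utf-8")
```

Sequences of three or four bytes use different masks and shifts, so handing the collected bytes to a real UTF-8 decoder is usually simpler than extending the arithmetic by hand.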
I have an HTML file which is read as a string. I want to parse it and get the values that follow <TD colSpan=2>Value :
There are around 10 values I should get from the HTML file. How can I do that? I am trying to use something like
startindex endindex getsubstring
sMainBeginKeyword = "<td>Value : ";
sBeginKeyword = "<td>Value : ";
sEndKeyword = "</td>";
main_begin_index = result.indexOf(sMainBeginKeyword);
while (main_begin_index != -1) {
begin_index = main_begin_index;
end_index = result.indexOf(sEndKeyword, begin_index);
String deloc= result.substring(begin_index + sBeginKeyword.length(), end_index);
But this looks complicated, and I cannot retrieve more values, as I have a lot of values with different keywords.
This sort of thing really does need to be done using an XML or DOM parser: Trying to do it with string searches is setting yourself up for failure.
If you loaded the HTML into an XML or DOM parser, the task you're trying to do would be trivial to achieve using XPath notation to find the relevant elements.
You haven't specified which language or platform you're working on (and the code sample you've given is insufficient to be sure either), so it's hard to be any more specific.
Hope that helps.
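Since the platform isn't stated, here is a rough illustration of the parser-based approach in Python, using only the standard library's html.parser. The class and function names and the sample markup are made up for the example; the "Value :" prefix is taken from the question.

```python
from html.parser import HTMLParser

class ValueCollector(HTMLParser):
    """Collect the text of every <td> cell that starts with a given prefix."""

    def __init__(self, prefix="Value :"):
        super().__init__()
        self.prefix = prefix
        self.in_td = False
        self.parts = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":          # tag names arrive lower-cased
            self.in_td = True
            self.parts = []

    def handle_data(self, data):
        if self.in_td:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            self.in_td = False
            text = "".join(self.parts).strip()
            if text.startswith(self.prefix):
                self.values.append(text[len(self.prefix):].strip())

def extract_values(html, prefix="Value :"):
    parser = ValueCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.values

sample = ("<table><tr><TD colSpan=2>Value : 42</TD><td>Other : x</td></tr>"
          "<tr><td>Value : abc</td></tr></table>")
print(extract_values(sample))
```

Supporting further keywords then means calling `extract_values` with a different prefix rather than adding more index arithmetic.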
In my iOS application I need to export some data into CSV or HTML format. How can I do this?
RegexKitLite comes with an example of how to read a csv file into an NSArray of NSArrays, and to go in the reverse direction is pretty trivial.
It'd be something like this (warning: code typed in browser):
NSArray * data = ...; //An NSArray of NSArrays of NSStrings
NSMutableString * csv = [NSMutableString string];
for (NSArray * line in data) {
    NSMutableArray * formattedLine = [NSMutableArray array];
    for (NSString * field in line) {
        BOOL shouldQuote = NO;
        NSRange r = [field rangeOfString:@","];
        //fields that contain a , must be quoted
        if (r.location != NSNotFound) {
            shouldQuote = YES;
        }
        r = [field rangeOfString:@"\""];
        //fields that contain a " must have them escaped to "" and be quoted
        if (r.location != NSNotFound) {
            field = [field stringByReplacingOccurrencesOfString:@"\"" withString:@"\"\""];
            shouldQuote = YES;
        }
        if (shouldQuote == YES) {
            [formattedLine addObject:[NSString stringWithFormat:@"\"%@\"", field]];
        } else {
            [formattedLine addObject:field];
        }
    }
    NSString * combinedLine = [formattedLine componentsJoinedByString:@","];
    [csv appendFormat:@"%@\n", combinedLine];
}
[csv writeToFile:@"/path/to/file.csv" atomically:NO encoding:NSUTF8StringEncoding error:NULL];
The general solution is to use stringWithFormat: to format each row. Presumably, you're writing this to a file or socket, in which case you would write a data representation of each string (see dataUsingEncoding:) to the file handle as you create it.
If you're formatting a lot of rows, you may want to use initWithFormat: and explicit release messages, in order to avoid running out of memory by piling up too many string objects in the autorelease pool.
And always, always, always remember to escape the values correctly before passing them to the formatting method.
Escaping (along with unescaping) is a really good thing to write unit tests for. Write a function to CSV-format a single row, and have test cases that compare its result to correct output. If you have a CSV parser on hand, or you're going to need one, or you just want to be really sure your escaping is correct, write unit tests for the parsing and unescaping as well as the escaping and formatting.
If you can start with a single record containing any combination of CSV-special and/or SQL-special characters, format it, parse the formatted string, and end up with a record equal to the one you started with, you know your code is good.
(All of the above applies equally to CSV and to HTML. If possible, you might consider using XHTML, so that you can use XML validation tools and parsers, including NSXMLParser.)
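The round-trip test described above might look like the following. It is sketched in Python with the standard csv module rather than in Objective-C, purely to show the shape of the test; the helper names and the sample record are invented for the example.

```python
import csv
import io

def format_row(fields):
    """Format one record as a single CSV line, quoting and escaping as needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()

def parse_row(line):
    """Parse one CSV record back into a list of fields."""
    return next(csv.reader(io.StringIO(line)))

# A record mixing CSV-special and SQL-special characters.
record = ["plain", "has,comma", 'has "quote"', "Robert'); DROP TABLE x;--", "two\nlines"]

formatted = format_row(record)
assert parse_row(formatted) == record  # the round trip must give back the input
print(formatted)
```

The same test structure applies to a hand-written formatter: only `format_row` and `parse_row` would be swapped for the functions under test.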
CSV - comma separated values.
I usually just iterate over the data structures in my application and output one set of values per line, values within set separated with comma.
struct person
{
string first_name;
string second_name;
};
person tony = {"tony", "momo"};
person john = {"john", "smith"};
would look like
tony, momo
john, smith