Convert UTF-8 to ANSI in tcl - tcl

proc pub:write { nick host handle channel arg } {
set fid [open /var/www/test.txt w]
puts $fid "█████████████████████████████████████████████████████████████████"
puts $fid "██"
close $fid
}
when i open i Webbrowser its Result so :
█████████████████████████████████████████████████████████████████
but it should :
█████████████████████████████████████████████████████████████████

Welcome to the yawning pit of complexity that is string encodings. You've got to get two things right to make what you're trying to do work. READ EVERYTHING BELOW BEFORE MAKING CHANGES as it all interacts horribly.
The character needs to be written to the file using the right encoding. This is done by configuring the encoding on the channel, which defaults to a system-specific value that is usually but not always right.
I'm taking a very wild guess that an encoding like “cp437 DOSLatinUS” is the right one.
fconfigure $fid -encoding cp437
However, Tcl's usually pretty good at picking the right thing to do by default.
Also, there's a huge number of different encodings. Some are very similar to each other and picking which one to use is a bit of a black art. The usual best bet is to stick with utf8 when possible, and otherwise to use the correct encoding (defined by protocol or by the system) and take a vast amount of care. This is really complicated!
You've also got to get the character into Tcl correctly in the first place. This means that the character has to be encoded in the source file, and Tcl has to read that file with the right encoding. Since the file is being written by another program (your editor usually) there's all sorts of potential for trouble. If you can discover what encoding is being used there (usually a matter of complete guesswork) then you can use the -encoding option to tclsh or source to allow Tcl to figure out what is going on.
Alternatively, stick with the ASCII subset in your source as that's pretty reliably handled the same whatever encoding is in use. You do this by converting each █ to the Tcl escape sequence \u2588. At least like that, you can be sure that you're only hunting down problems with the output encoding.
When debugging this thing, only change one thing at a time before retesting as there's a lot of bits that can go wrong and poison what is going on in ways that produce weird results downstream. I advise trying the escape sequence first as that at least means that you know that the input data is correct; once you know that you're not pushing garbage in, you can try hunting down whether you're actually getting problems with getting garbage out and what to do about it.
Finally, be aware that mixing in networking in this makes the problems about ten times harder…

Related

Reading the wrong number of bytes from a binary file

I have the following code:
set myfile "the path to my file"
set fsize [file size $myfile]
set fp [open $myfile r]
fconfigure $fp -translation binary
set data [read $fp $fsize]
close $fp
puts $fsize
puts [string bytelength $data]
And it shows that the bytes read are different from the bytes requested. The bytes requested match what the filesystem shows; the actual bytes read are 22% more (requested 29300, got 35832). I tested this on Windows, with Tcl 8.6.
Use string length. Don't use string bytelength. It gives the “wrong” answers, or rather it answers a question you probably don't want to ask.
More Depth
The string bytelength command returns the length in bytes of the data in Tcl's internal almost-UTF-8 encoding. If you're not working with Tcl's C API directly, you really have no sensible use for that value, and C code is actually pretty able to get the value without that command. For ASCII text, the length and the byte-length are the same, but for binary data or text with NULs or characters greater than U+00007F (the Unicode character that is equivalent to ASCII DEL), the values will differ. By contrast, the string length command knows how to handle binary data correctly, and will report the number of bytes in the byte-string that you read in. We plan to deprecate the string bytelength command, as it turns out to be a bug in someone's code almost every time they use it.
(I'm guessing that your input data actually has 6532 bytes outside the range 1–127 in it; the other bytes internally use a two-byte representation in almost-UTF-8. Fortunately, Tcl doesn't actually convert into that format until it needs to, and instead uses a compact array of bytes in this case; you're forcing it by asking for the string bytelength.)
Background Information
The question of “how much memory is actually being used by Tcl to read this data” is quite hard to answer, because Tcl will internally mutate data to hold it in the form that is most efficient for the operations you are applying to it. Because Tcl's internal types are all precisely transparent (i.e., conversions to and from them don't lose information) we deliberately don't talk about them much except from an optimisation perspective; as a programmer, you're supposed to pretend that Tcl has no types other than string of unicode characters.
You can peel the veil back a bit with the tcl::unsupported::representation command (introduced in 8.6). Don't use the types for decisions on what to do in your code, as that is really not something guaranteed by the language, but it does let you see a lot more about what is really going on under the covers. Just remember, the values that you see are not the same as the values that Tcl's implementation thinks about. Thinking about the values that you see (without that magic command) will keep you thinking about things that it is correct to write.

In relative terms, how fast should TCL on Windows 10 be?

I have the latest TCL build from Active State installed on a desktop and laptop both running Windows 10. I'm new to TCL and a novice developer and my reason for learning TCL is to enhance my value on the F5 platform. I figured a good first step would be to stop the occasional work I do in VBScript and port that to TCL. Learning the language itself is coming along alright, but I'm worried my project isn't viable due to performance. My VBScripts absolutely destroy my TCL scripts in performance. I didn't expect that outcome as my understanding was TCL was so "fast" and that's why it was chosen by F5 for iRules etc.
So the question is, am I doing something wrong? Is the port for Windows just not quite there? Perhaps I misunderstood the way in which TCL is fast and it's not fast for file parsing applications?
My test application is a firewall log parser. Take a log with 6 million hits and find the unique src/dst/port/policy entries and count them; split up into accept and deny. Opening the file and reading the lines is fine, TCL processes 18k lines/second while VBScript does 11k. As soon as I do anything with the data, the tide turns. I need to break the four pieces of data noted above from the line read and put in array. I've "split" the line, done a for-next to read and match each part of the line, that's the slowest. I've done a regexp with subvariables that extracts all four elements in a single line, and that's much faster, but it's twice as slow as doing four regexps with a single variable and then cleaning the excess data from the match away with trims. But even this method is four times slower than VBScript with ad-hoc splits/for-next matching and trims. On my desktop, i get 7k lines/second with TCL and 25k with VBscript.
Then there's the array, I assume because my 3-dimensional array isn't a real array that searching through 3x as many lines is slowing it down. I may try to break up the array so it's looking through a third of the data currently. But the truth is, by the time the script gets to the point where there's a couple hundred entries in the array, it's dropped from processing 7k lines/second to less than 2k. My VBscript drops from about 25k lines to 22k lines. And so I don't see much hope.
I guess what I'm looking for in an answer, for those with TCL experience and general programming experience, is TCL natively slower than VB and other scripts for what I'm doing? Is it the port for Windows that's slowing it down? What kind of applications is TCL "fast" at or good at? If I need to try a different kind of project than reading and manipulating data from files I'm open to that.
edited to add code examples as requested:
while { [gets $infile line] >= 0 } {
some other commands I'm cutting out for the sake of space, they don't contribute to slowness
regexp {srcip=(.*)srcport.*dstip=(.*)dstport=(.*)dstint.*policyid=(.*)dstcount} $line -> srcip dstip dstport policyid
the above was unexpectedly slow. the fasted way to extract data I've found so far
regexp {srcip=(.*)srcport} $line srcip
set srcip [string trim $srcip "cdiloprsty="]
regexp {dstip=(.*)dstport} $line dstip
set dstip [string trim $dstip "cdiloprsty="]
regexp {dstport=(.*)dstint} $line dstport
set dstport [string trim $dstport "cdiloprsty="]
regexp {policyid=(.*)dstcount} $line a policyid
set policyid [string trim $policyid "cdiloprsty="]
Here is the array search that really bogs down after a while:
set start [array startsearch uList]
while {[array anymore uList $start]} {
incr f
#"key" returns the NAME of the association and uList(key) the VALUE associated with name
set key [array nextelement uList $start]
if {$uCheck == $uList($key)} {
##puts "$key CONDITOIN MET"
set flag true
adduList $uCheck $key $flag2
set flag2 false
break
}
}
Your question is still a bit broad in scope.
F5 has published some comment why they choose Tcl and how it is fast for their specific usecases. This is actually a bit different to a log parsing usecase, as they do all the heavy lifting in C-code (via custom commands) and use Tcl mostly as a fast dispatcher and for a bit of flow control. And Tcl is really good at that compared to various other languages.
For things like log parsing, Tcl is often beaten in performance by languages like Python and Perl in simple benchmarks. There are a variety of reasons for that, here are some of them:
Tcl uses a different regexp style (DFA), which are more robust for nasty patterns, but slower for simple patterns.
Tcl has a more abstract I/O layer than for example Python, and usually converts the input to unicode, which has some overhead if you do not disable it (via fconfigure)
Tcl has proper multithreading, instead of a global lock which costs around 10-20% performance for single threaded usecases.
So how to get your code fast(er)?
Try a more specific regular expression, those greedy .* patterns are bad for performance.
Try to use string commands instead of regexp, some string first commands followed by string range could be faster than a regexp for these simple patterns.
Use a different structure for that array, you probably want either a dict or some form of nested list.
Put your code inside a proc, do not put it all in a toplevel script and use local variables instead of globals to make the bytecode faster.
If you want, use one thread for reading lines from file and multiple threads for extracting data, like a typical producer-consumer pattern.

Difference tcl script tkconsole to load gro file in VMD

My problem is simple: I'm trying to write a tcl script to use $grofile instead writing every time I need this file name.
So, what I did in TkConsole was:
% set grofile "file.gro"
% mol load gro ${grofile}
and, indeed, I succeeded uploading the file.
In the script I have the same lines, but still have this error:
wrong # args: should be "set varName ?newValue?"
can't read "grofile": no such variable
I tried to solve my problem with
% set grofile [./file.gro]
and I have this error,
invalid command name "./file.gro"
can't read "grofile": no such variable
I tried also with
% set grofile [file ./file.gro r]
and I got the first error, again.
I haven't found any simple way to avoid using the explicit name of the file I want to upload. It seems like you only can use the most trivial, but tedious way:
mol load file.gro
mol addfile file.xtc
and so on and so on...
Can you help me with a brief explanation about why in the TkConsole I can upload the file and use it as a variable while I can not in the tcl script?
Also, if you have where is my mistake, I will appreciate it.
I apologize if it is basic, but I could not find any answer. Thanks.
I add the head of my script:
set grofile "sim.part0001_protein_lipid.gro"
set xtcfile "protein_lipid.xtc"
set intime "0-5ms"
set system "lower"
source view_change_render.tcl
source cg_bonds.tcl
mol load gro $grofile xtc ${system}_${intime}_${xtcfile}
It was solved, thanks for your help.
You may think you've typed the same thing, but you haven't. I'm guessing that your real filename has spaces in it, and that you've not put double-quotes around it. That will confuse set as Tcl's general parser will end up giving set more arguments than it expects. (Tcl's general parser does not know that set only takes one or two arguments, by very long standing policy of the language.)
So you should really do:
set grofile "file.gro"
Don't leave the double quotes out if you have a complicated name.
Also, this won't work:
set grofile [./file.gro]
because […] is used to indicate running something as a command and using the result of that. While ./file.gro is actually a legal command name in Tcl, it's… highly unlikely.
And this won't work:
set grofile [file ./file.gro r]
Because the file command requires a subcommand as a first argument. The word you give is not one of the standard file subcommands, and none of them accept those arguments anyway, which look suitable for open (though that returns a channel handle suitable for use with commands like gets and read).
The TkConsole is actually pretty reasonable as quick-and-dirty terminal emulations go (given that it omits a lot of the complicated cases). The real problem is that you're not being consistently accurate about what you're really typing; that matters hugely in most programming languages, not just Tcl. You need to learn to be really exacting; cut-n-paste when creating a question helps a lot.

Perl JSON encode in UTF-8 strange behaviour

Based on Perl JSON 2.90 documentation, to encode JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious and this what I did. After a while, I got an issue report on GitHub from one of users, which made me really surprised, as it shouldn't be happening!
I was beating for hours to figure out what was happening but the solution happened to be very weird and wrong from my point of view.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all different characters, including Russian and Chinese - it just worked?
Can anyone please explain, why encoding is working correctly with latin1 and not with utf8, when it's actually has to be visa versa?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3 A9, are suffering from this bug.
If you are suffering from the this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, are suffering from this bug.
If you are suffering from the this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for writing encoded data to). As unintuitively as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).

Excel does not display currency symbol(for example ¥) generated in my tcl code

I actually am generating an MS Excel file with the currencies and if you see the file I generated (tinyurl.com/currencytestxls), opening it in the text editor shows the correct symbol but somehow, MS Excel does not display the symbol. I am guessing there is some issue with the encoding. Any thoughts?
Here is my tcl code to generate the symbol:
set yen_val [format %c 165]
Firstly, this does produce a Yen symbol (I put format string in double quotes here just for clarity with the formatting):
format "%c" 165
You can then pass it around just fine. The problem is likely to come when you try to output it; when Tcl writes a string to the outside world (with the possible exception of the terminal on Windows, as that's tricky) it encodes that string into a definite byte sequence. The default encoding is the one reported by:
encoding system
But you can see what it is and change it for any channel (if you pass in the new name):
fconfigure $theChannel -encoding $theEncoding
For example, on my system (which uses UTF-8, which can handle any character):
% fconfigure stdout -encoding
utf-8
% puts [format %c 165]
¥
If you use an encoding that cannot represent a particular character, the replacement character for that encoding is used instead. For many encodings, that's a “?”. When you are sending data to another program (including to a web server or to a browser over the internet) it is vital that both sides agree on what the encoding of the data is. Sometimes this agreement is by convention (e.g., the system encoding), sometimes it is defined by the protocol (HTTP headers have this clearly defined), and sometimes this is done by explicitly transferred metadata (HTTP content).
If you're writing a CSV file to be ingested by Excel, use either the “unicode” or the “utf-8” encoding and make sure you put the byte-order mark in correctly. Tcl doesn't write BOMs automatically (because it's the wrong thing to do in some cases). To write a BOM, do this as the first thing when you start writing the file:
puts -nonewline $channel "\ufeff"