Tcl command with spaces parsing as "Â"

I'm trying to send a command which includes spaces, and I got the following error:
"system1 -loadChange "+10" -portAddress "10.10.X.X/1/15â€"
Each space is replaced by Â.
The full command is:
av::perform ChangeLoadForPort system1 -loadChange "+10" -portAddress "10.10.X.X/1/15”
(X.X is a placeholder for the real IP.)

The sequence "Â " (an Â followed by a space) is the ISO 8859-1 interpretation of the sequence of bytes that make up a non-breaking space in UTF-8. (It's also the same with ISO 8859-15 and a few other encodings.) The "â€" at the end of your quoted error, a mangled curly quote, is another tell-tale. That you're seeing this points to two problems:
Various bits and pieces of the systems you're dealing with are disagreeing on what encodings are in use. That's Bad News, but it would be only a theoretical problem if your code stuck to ASCII. (Almost all encodings have the majority of ASCII as a subset.)
Your code has unexpected non-ASCII characters in it, and that's just not going to work. You're probably being sabotaged by whatever program you're using to edit the text; a programmer's editor is strongly recommended, not a word processor.
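For the curious, the mojibake is easy to reproduce. Here's a minimal Tcl sketch of the round trip (nothing assumed beyond Tcl's standard encoding command):

# Encode a non-breaking space (U+00A0) as UTF-8: the two bytes \xC2 \xA0.
set bytes [encoding convertto utf-8 "\u00a0"]
# Misread those bytes as ISO 8859-1: \xC2 is "Â" and \xA0 is a non-breaking
# space, so the output renders as "Â" followed by what looks like a space.
puts [encoding convertfrom iso8859-1 $bytes]

That is exactly the "Â" substitution reported above: one encoding writes, a different one reads.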

What's the difference between typing the encoding of a Unicode character and just copying the character?

For example, if I want the bullet point character in my HTML page, I could either type out &#8226; or just copy-paste •. What's the real difference?
&#8226; is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, &#8226; uses only ASCII characters, and so the file it's in can be encoded using pure ASCII, or any compatible encoding. Since ASCII is the de facto basis of virtually all common encodings, this means you don't need to worry much about the file encoding; you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, &#8226; is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
A character entity reference such as &#8226; works independently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared for the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
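The octet counts are easy to check. A minimal Tcl sketch (only Tcl's built-in encoding and string commands are used):

# The entity reference is 7 plain ASCII characters, hence 7 octets.
puts [string length [encoding convertto ascii "&#8226;"]]   ;# prints 7
# The literal bullet (U+2022) occupies 3 octets when encoded as UTF-8.
puts [string length [encoding convertto utf-8 "\u2022"]]    ;# prints 3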

OCR Tesseract: trained data creation issue for a special type of font (using jTessBoxEditor)

We are unable to create proper trained data for Windows non-native fonts, i.e., for CATIA drafting fonts.
Even when some of the alphanumerics are recognized, letters with broken shapes like "i" and "j", and special symbols like Ø (Phi), ° (degree), and ± (plus-minus), are not recognized properly. Their box file values are improper.
jTessBoxEditor is the tool we used to train and create trained data for Tesseract.
We'd appreciate your assistance with this. Thanks.
I also need these 3 characters - though it might be too late to answer this.
It may not be of much help in all situations, but the Norwegian .traineddata file does include the Ø (Phi) character; that trained data file has helped me with this character.
The ° (degree) character may be a bit trickier, as it normally isn't recognized because it's too small; if the inside of the character is clear, Tesseract might be able to decipher it.
Now the most difficult one, the ± (plus-minus). I haven't cracked this one yet, and this may be a very woolly approach, but I was thinking: the plus-minus is always recognized as a plain + (plus) only.
I can use this to my advantage.
I could use Tesseract's engine, which exposes PageSegMode.SingleChar, to detect each individual character, and Tesseract's GetSegmentedRegions() to get the area of the bitmap/image where each character is; you can later reassemble all the characters into a string.
Then I could run ImageMagick to calculate how similar the detected plus character is to a reference image of either plus or plus-minus. The one with the most similarity tells you which character it is.
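As a rough sketch of that comparison step, here is how it might be driven from Tcl with ImageMagick's compare tool. The file names glyph.png, plus.png, and plusminus.png are hypothetical, and the choice of the RMSE metric is an assumption:

# "compare -metric RMSE" prints a distance and exits non-zero when the
# images differ, so the output is collected via catch. Smaller = closer.
foreach ref {plus.png plusminus.png} {
    catch {exec compare -metric RMSE glyph.png $ref null:} out
    puts "$ref -> $out"
}

Whichever reference reports the smaller distance is the better candidate for the detected character.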
With my approach, I still have to parse the recognised text and transform it into something usable.
The Ø (Phi) character, for example, may be detected as lower-case, but I will want it upper-case.
Or the degree is detected as an apostrophe, but the expected result is the degree.
Another transformation is needed when I detect a dimension: a decimal may be incorrectly recognized with a comma, but I will want the decimal separator to be a dot (1,99 → 1.99).

Is there a way to encode spaces inside a Tcl string?

There is a third-party program, linked with the Tcl library, that parses a given Tcl script.
One of its commands takes a file name or a list of file names. When there are spaces in a file name/path it chokes: it treats the value as a list of names, even if I enclose it in curly braces or double quotes.
Can a string/path with spaces inside be encoded in a way that other Tcl commands will still interpret correctly?
The following are the standard escape sequences for a space in Tcl:
“\ ” — a backslash followed by a space.
“\040” — a backslash followed by the three-digit octal for 32.
“\x20” — a backslash followed by a x and the two-digit hexadecimal for 32. (Only really suitable if the character following is not a hex digit because of a bug in many versions of Tcl.)
“\u0020” — a backslash followed by a u and the four-digit hexadecimal for 32.
“\U000020” — a backslash followed by a U and the six-digit hexadecimal for 32. (Introduced in Tcl 8.6 as part of migration path to supporting newer Unicode characters.)
One of those might work in your situation. Or might, if you quote the backslash with another backslash enough times; you're effectively playing “guess how badly wrong someone's software is” at this point. (It sounds like the value is going through an ill-advised eval somewhere. Maybe several times. Maybe different amounts on different code paths. That's awful if it is true…)
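For what it's worth, here's a quick sketch showing that these escapes all denote the same character (plain Tcl; the \U form needs 8.6 or later, and the file path is made up):

# Each escape below spells a single space character (code point 32).
foreach s [list "\ " "\040" "\x20" "\u0020" "\U000020"] {
    puts [string equal $s " "]    ;# prints 1 each time
}
# So a path with spaces can be written with backslash-escaped spaces:
set path "C:/Program\ Files/input.txt"    ;# -> C:/Program Files/input.txt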

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except for one thing: some of the characters are represented as question marks. For example, "He wasn't ready" shows up as "He wasn?t ready", but only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them into curly left-quote and right-quote characters, and translates a straight apostrophe into a curly apostrophe character. The characters in question belong to an MS code page (Windows-1252) which is roughly compatible with ISO 8859-1 except for extras like these silly quote characters.
There is a Perl script called demoroniser.pl which undoes all this malarkey and converts the quotes back to plain ASCII.
It's most likely due to the fact that the data in the Access file is Unicode, and MDB Tools is trying to convert it to ASCII/Latin-1/ISO 8859-1 or some other encoding. Since these encodings don't map all the Unicode characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.
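That failure mode is simple to demonstrate. A minimal Tcl sketch (Tcl, like many converters, substitutes "?" for characters the target encoding cannot represent, which matches the symptom exactly):

# U+2019 is Word's curly apostrophe; ISO 8859-1 has no slot for it, so a
# lossy conversion falls back to "?".
puts [encoding convertto iso8859-1 "He wasn\u2019t ready"]   ;# He wasn?t ready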

What is the difference between plaintext and binary data?

Many languages have functions which only process "plaintext", not binary. Does this mean that only characters within the ASCII range will be allowed?
Binary is just a series of bytes; isn't that similar to plaintext, which is just a series of bytes interpreted as characters? So, can plaintext store the same data formats/protocols as binary?
A plain text file is human-readable; a binary file is usually unreadable by a human, since it is composed of both printable and non-printable characters.
Try to open a JPEG file with a text editor (e.g. Notepad or Vim) and you'll understand what I mean.
A binary file is usually constructed in a way that optimizes speed, since no parsing is needed.
A plain text file is editable by hand; a binary file is not.
"Plaintext" can have several meanings.
The one most useful in this context is that it is merely a binary file which is organized in byte sequences that a particular computer system can translate into a finite set of what it considers "text" characters.
A second meaning, somewhat connected, is a restriction that said system should display these "text characters" as symbols readable by a human as members of a recognizable alphabet. Often, the unwritten implication is that the translation mechanism is ASCII.
A third, even more restrictive meaning is that the system must be a "simple" text editor/viewer, usually implying ASCII encoding. But really, there is VERY little difference between you, the human, reading text encoded in some funky format and displayed by a proprietary program, versus the vi text editor reading an ASCII-encoded file.
Within a programming context, your programming environment (comprising the OS + system APIs + your language's capabilities) defines both a set of "text" characters and a set of encodings it is able to read to convert into those "text" characters. Please note that this does not necessarily imply ASCII, English, or 8 bits; as an example, Perl can natively read and use the full Unicode set of "characters".
To answer your specific question, you can definitely use "character" strings to transmit arbitrary byte sequences, with the caveat that string termination conventions must apply.
The problem is that the functions that already exist to "process character data" would probably not have any useful functionality to deal with your binary data.
One thing it often means is that the language might feel free to interpret certain control characters, such as the values 10 or 13, as logical line terminators. In other words, an output operation might automagically append these characters at the end, and an input operation might strip them from the input (and/or terminate reading there).
In contrast, language I/O operations that advertise working on "binary" data will usually include an input parameter for the length of data to operate on, since there is no other way (short of reading past end of file) to know when it is done.
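Tcl's I/O layer shows both behaviours directly; a short sketch (the file names are made up):

# Text mode: line endings are translated and a character encoding applied;
# [gets] consumes one logical line and strips the terminator.
set ch [open notes.txt r]
set line [gets $ch]
close $ch
# Binary mode: bytes pass through untouched, and the read takes an explicit
# byte count because nothing else marks where to stop.
set ch [open data.bin rb]
set raw [read $ch 16]
close $ch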
Generally, it depends on the language/environment/functionality.
Binary data is always that: binary. It is transferred without modification.
"Plain text" mode may mean one or more of the following things:
the stream of bytes is split into lines. The line delimiters are \r, \n, \r\n, or \n\r. Sometimes this is OS-dependent (*nix likes \n, while Windows likes \r\n). The line ending may be adjusted for the reading application
character encoding may be adjusted. The environment might detect and/or convert the source encoding into the encoding the application expects (both adjustments are sketched in Tcl after this list)
probably some other conversions should be added to this list, but I can't think of any more at this moment
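In Tcl, for instance, both of the adjustments above are per-channel options, sketched here with a hypothetical file name:

# -translation controls line-ending conversion, -encoding the character
# set conversion applied while reading.
set ch [open notes.txt r]
fconfigure $ch -translation crlf -encoding utf-8
set line [gets $ch]
close $ch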
Technically, nothing. Plain text is a form of binary data. However, a major difference is how values are stored. Think of how an integer might be stored. In binary data it would use a two's-complement format, probably taking 32 bits of space. In text format a number would instead be stored as a series of Unicode digits. So the number 50 would be stored as 0x32 (padded to take up 32 bits) in binary, but would be stored as '5' '0' in plain text.
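A small Tcl sketch of that last point:

# 50 as a 32-bit little-endian integer: the byte 0x32 plus three padding bytes.
binary scan [binary format i 50] H* hex
puts $hex                                    ;# 32000000
# 50 as text: the digit characters "5" (0x35) and "0" (0x30).
binary scan [encoding convertto ascii "50"] H* hex
puts $hex                                    ;# 3530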