TCL: what is the difference between format, binary format, scan and binary scan commands?

Can anyone explain the differences between scan and binary scan,
and between format and binary format?
I am getting confused by the binary commands.

To understand the difference between the command sets that manipulate binary and string data, you have to understand the distinction between these two kinds of data.
In Tcl, as in many (most?) high-level languages, strings are rather abstract — that is, they are described in pretty high-level terms. Particularly in Tcl, strings are defined to have the following properties:
They contain characters from the Unicode repertoire.
The Tcl runtime provides a set of standard commands to operate on strings, such as indexing, searching, appending, extracting a substring, etc.
Note that many things are left out from this definition:
The encoding in which these Unicode characters are stored.
How exactly they are stored (NUL-terminated arrays? linked lists of unsigned longs? something else?).
(To put it into a more interesting perspective, Tcl is able to transparently change the underlying representations of strings it manages — between UTF-8 and UTF-16 encoded sequences. But here we're talking about the reference Tcl implementation, and other implementations (such as Jacl for instance) are free to do something else completely.)
The same approach is used to manipulate all the other kinds of data in the Tcl interpreter. Say, integer numbers are stored using native platform "integers" (roughly "as in C") but they are transparently upgraded into arbitrary sized integers if an arithmetic operation is about to overflow the platform-sized result.
So long as you don't leave the comfortable world of the Tcl interpreter, this is all you should know about the data types it manages. But now there's the outside world. In it, abstract concepts which are Tcl strings do not exist. Say, if you need to communicate to some other program over a network socket or by means of using a file or whatever other kind of media, you have to get down to the level of exact layouts of raw bytes which are described by "wire protocols" and file formats or whatever applies to your case. This is where "binaries" come into play: they allow you to precisely specify how the data is laid out so that it's ready to be transferred to the outside world or be consumed from it — binary format makes these "binaries" and binary scan reads them.
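To make this concrete, here is a minimal sketch (my own example, not part of the answer) that packs a made-up wire record with binary format and reads it back with binary scan:
# Hypothetical record: one 32-bit big-endian id followed by two 8-bit flags
set packet [binary format Icc 1024 3 7]   ;# I = 32-bit big-endian int, c = 8-bit byte
puts [string length $packet]              ;# -> 6 (six raw bytes)
binary scan $packet Icc id flagA flagB    ;# the reverse operation
puts "$id $flagA $flagB"                  ;# -> 1024 3 7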
Note that certain Tcl commands for working with the outside world are "smart by default" — for instance, the open command, which opens files, by default assumes they are textual and encoded in the default system encoding (which is deduced, broadly speaking, from the environment). You can then use the chan configure command (or fconfigure in older versions of Tcl) to either change this encoding or completely inhibit conversions by specifying that the channel is in "binary mode". The same applies to EOL conversions.
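As a small illustration (assuming a file named image.png exists), reading raw bytes rather than text might look like this:
# Open in binary mode so no encoding or EOL conversion is applied
set f [open image.png rb]   ;# rb is shorthand for r plus: chan configure $f -translation binary
set bytes [read $f]
close $f
puts "read [string length $bytes] bytes"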
Note also that there are specialized packages for Tcl that effectively hide the complexities of working with a particular wire/file format. To present one example, the tdom package works with XML; when you manipulate XML using this package, you're not concerned with how exactly XML must be represented when, say, saved to a file — you just work with tdom's objects, native Tcl strings etc.

The docs are pretty good and contain examples:
scan: http://www.tcl.tk/man/tcl8.6/TclCmd/scan.htm
format: http://www.tcl.tk/man/tcl8.6/TclCmd/format.htm
binary scan: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M42
binary format: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M16
Maybe you could ask a more specific question?

The format command assembles strings of characters, while the binary format command assembles strings of bytes. The scan and binary scan commands do the reverse, extracting information from character strings and byte strings respectively.
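A tiny sketch of the contrast (my own example; the cu flag for unsigned bytes needs Tcl 8.5 or later):
set s [format %d 255]          ;# s is the three-character string "255"
scan $s %d n                   ;# n now holds the integer 255
set b [binary format c 255]    ;# b is a single byte with value 0xFF
binary scan $b cu m            ;# m is 255 again (cu = unsigned 8-bit byte)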
Note that Tcl happens to map byte strings neatly onto character strings where the characters are in the range \u0000–\u00FF, and there are other operations for getting information into and out of binary strings that are sometimes relevant. Most notably, encoding convertto and encoding convertfrom: encoding convertto formats a string as a sequence of bytes that represent that string in a given encoding (an operation which can lose information) and encoding convertfrom goes in the opposite direction.
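For example, a small sketch of that (possibly lossy) round trip:
set bytes [encoding convertto utf-8 "héllo"]   ;# a byte string: h \xC3 \xA9 l l o
puts [string length $bytes]                    ;# -> 6 bytes for 5 characters
puts [encoding convertfrom utf-8 $bytes]       ;# -> héllo again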
So what encoding are Tcl's strings really in? Well, none really. Or many. The logical level works with character sequences exclusively, and the implementation will actually move things back and forth (mostly between a variant of UTF-8 and UCS-2, though with optimisations for handling byte strings via arrays of unsigned char) as necessary. While this is not always perfectly efficient, most code never notices what's going on due to the type-caching used.
If you have Tcl 8.6, you can peek behind the covers to observe the types with an unsupported command:
# Output is human-readable; experiment to see what it says for you
puts [tcl::unsupported::representation $MyString]
Don't use this to base functional decisions on; Tcl is very happy to mutate types out from under your feet. But it can help when finding out why your code is unexpectedly slow. (Note also that types attach to values, and not to variables.)
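For example, merely using a value in arithmetic typically shimmers it to an integer representation (the exact output wording varies by version):
set s "1234"
puts [tcl::unsupported::representation $s]   ;# typically reports a pure string
expr {$s + 0}                                ;# arithmetic attaches an integer rep to the value
puts [tcl::unsupported::representation $s]   ;# now reports an int internal representation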

Related

Matching MS Access accented characters - collation in MS Access

My immediate need is to do an accent-insensitive comparison in MS Access. I am using Office 365 Access.
This is not strictly speaking a Unicode question as the European accented characters are present in all of Windows-1252 (sometimes misleadingly called "ANSI" in Microsoft products and documentation), "modern" Unicode and UCS-2.
The Access "Data Types" page I found mentioned "two bytes per character", which makes it sound like UCS-2, but with no details. Similarly, the "sort order" drop-downs list a number of values that are also undocumented.
Actual example: compare "Dvorak" to "Dvořák". These are not equal in MS Access.
It is NOT my goal today to find a work-around (I can do that myself) - it is to better understand MS Access capabilities in 2023.
Having gone through the incremental support improvements for SQL Server and .NET strings, my first thought was "surely MS Access can handle collations by now (2023)".
My bottom line questions are: exactly what encodings ("sort orders") does Office 365 Access support in its most recent releases, and does VBA use the same character set, or will accented characters be translated or cause issues when used within MS Access?
You're not giving me a whole lot to go on, so I'll just go over the basics. It's important to note that new features rarely make it to VBA and Access, and breaking changes are extremely rare, in contrast to new versions of SQL Server or C#.
Regarding charsets and encodings (how strings are stored):
Strings in tables, queries and application objects are stored in UTF-16. They may be compressed (unicode compression option for text fields). This is independent of sort orders.
The VBA code itself is stored in the local charset (which may not support certain characters). It's generally recommended to avoid non-ASCII characters in VBA code, as this may cause issues on different computers and different charsets. See this post for some trickery if you need non-ASCII characters in VBA literals.
VBA strings are always a BSTR which uses UTF-16 characters.
Regarding sort orders/collations (how strings are compared):
Access has no full support for collations: there are no specific case-sensitive/case-insensitive or accent-sensitive/accent-insensitive collations.
It does support different sort orders, which determines how strings should be sorted and which characters are equal. An outdated list can be found here. Using the object browser in Access, you can navigate to LanguageConstants and check the list. In recent builds of Office 365, there are some new options that appear to use codepage 65001 (= UTF-8) but I haven't seen docs or experimented with it.
In VBA, string comparisons and sorts are determined by an Option Compare statement at the top of the module. Nearly all VBA applications support only two: Option Compare Binary (any byte difference makes strings unequal, and sorts are case sensitive) and Option Compare Text (the local language settings are used to compare strings). Access adds a third, Option Compare Database, which uses the database sort order to compare strings.
Note that not all functions support all unicode characters. Functions with limited support include MsgBox and Debug.Print. This can make it especially hard to debug code when working with characters not in the system code page.
Further notes
VBA does allow (relatively) easy access to the Windows API. Instead of rolling your own string comparison function, you could use CompareStringEx which has options to do case-insensitive diacritic-insensitive comparisons.
Note that for external functions, you need to pass string pointers using StrPtr; passing strings as a plain String will automatically convert them from a BSTR to a pointer to a null-terminated string in the system codepage. See this answer for a basic example of how to call a WinAPI function with a Unicode string. You will also have to look up and declare all the constants, e.g. Public Const NORM_IGNORECASE As Long = &H1, etc.

Convert comma to dot in Python or MySQL

I have a Python script which collects data and sends it to my MySQL table.
I noticed that the "Cost" is sometimes 0,95, which results in 0 in my table since my table uses "0.95" instead of "0,95".
I assume the best solution is to convert the , to . in my Python script by using:
variable.replace(",", ".")
However, couldn't one solution be to change the format in my MySQL table? So that I store numbers in this format:
1100
0,95
0,1
150000
My Django Model
cost = models.DecimalField(max_digits=10, decimal_places=4, default=None)
Any feedback on how to best solve this issue?
Thanks
Your first instinct is correct: convert the "unusual" (comma-decimal) input into the standard format that MySQL uses by default (dot-decimal) at the first point where you receive it.
There are lots of ways to write numbers
Be careful, though, that you don't get stung by people using commas as thousands separators like "3,203,907.23", or the European form "3.203.907,23", the Swiss "3'203'907,23", or even this form, which is widely used in India: "32,03,907.71" (yes, I did mean to type only two digits there!).
To make your life easier, the rule for currencies is relatively simple:
where a dot or comma is followed by only two digits at the end of the string, that character is acting as the decimal separator.
Once you know which is the decimal separator, you can safely remove all other non-digits from the string, change the decimal separator you found to . then use any standard library string-to-number conversion.
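A rough Python sketch of that rule (my own illustration, not part of the answer; it relaxes the rule to one or two trailing digits so inputs like "0,1" from the question also parse, and the function name is made up):
import re

def to_decimal_string(raw: str) -> str:
    """Normalise inputs like '0,95' or '3.203.907,23' to dot-decimal form."""
    raw = raw.strip()
    match = re.search(r"[.,](\d{1,2})$", raw)                    # separator followed by trailing digits
    if match:
        integer_part = re.sub(r"\D", "", raw[:match.start()])    # drop grouping characters
        return f"{integer_part}.{match.group(1)}"
    return re.sub(r"\D", "", raw)                                # no decimal part at all

print(to_decimal_string("0,95"))          # 0.95
print(to_decimal_string("3.203.907,23"))  # 3203907.23
print(to_decimal_string("32,03,907.71"))  # 3203907.71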
Storage format isn't presentation format
Yes, you can tell MySQL to use comma as its decimal separator, but doing that will break so much of your code - including the parts of the framework that read from the database and expect dot-decimal numbers - that you'll regret doing it that way very quickly...
There's a general principle at work here: you should do your data storage and processing using a format that is easy to process, interchangeable with other systems, and understood by other software developers.
Consider what happens if you need to allow a different framework to access your MySQL database to generate reports... whoever develops that software (and it may be you) will be glad that the numbers are all stored the way numbers are "always" stored in databases.
Convert on the way in, re-convert on the way out
Where you need to accept input in a different format, convert that input into your standardised format as early as possible.
When you need to use an output format, do the conversion to that format as late as possible.
The idea is to keep as much of your system "unexceptional" as possible. A programmer who has to remember what numeric format will be in force at the time a given method is called is not a happy programmer.
P.S.
The option you're talking about in MySQL is an example of this pattern: it doesn't change how numeric data is stored. All that changes is how you pass numbers to MySQL and how it presents them back to you.

Viewstate: 2 different formats?

Trying to scrape a webpage, I hit the necessity to work with ASP.NET's __VIEWSTATE variables. So, ever the optimist, I decided to read up on those variables, and their formats. Even though classified as Open Source by Microsoft, I couldn't find any formal definition:
Everybody agrees the first step to do is decode the string, using a Base64 decoder. Great - that works...
Next - and this is where the confusion sets in:
Roughly 3/4 of the decoders seem to use binary values (characters whose values indicate the type of the field which follows). Here's an example of such a specification. This format also seems to expect a 'signature' of 0xFF 0x01 as the first two bytes.
The rest of the articles (such as this one) describe a format where the fields in the format are separated (or marked) by t< ... >, p< ... >, etc. (this seems to be the case of the page I'm interested in).
Even after looking at over a hundred pages, I didn't find any mention about the existence of two formats.
My questions are: Are there two different formats of __VIEWSTATE variables in use, or am I missing something basic? Is there any formal description of the __VIEWSTATE contents somewhere?
The view state is serialized and deserialized by the System.Web.UI.LosFormatter class—the LOS stands for limited object serialization—and is designed to efficiently serialize certain types of objects into a base-64 encoded string. The LosFormatter can serialize any type of object that can be serialized by the BinaryFormatter class, but is built to efficiently serialize objects of the following types:
Strings
Integers
Booleans
Arrays
ArrayLists
Hashtables
Pairs
Triplets
Everything you need to know about ViewState: Understanding View State

MySQL database only collecting English but not other languages

I have a comment section and submission form that any of my members can submit to.
If a member posts in English, I receive an email update and the comment is posted with no problem. But if they use a language other than English, such as Thai, then all the words, for example สวัสดี, appear as ??????.
I don't know why. I checked my php.ini file and the encoding is set to UTF-8, the MySQL collation is set to UTF-8 as well, and I made sure the meta tag is set to UTF-8 in the .html/.php files, but the same problem still occurs.
Any suggestions on what else I might have missed configuring?
Make sure you are using multibyte safe string functions or you might be losing your UTF-8 encoding.
From the PHP mbstring manual:
While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (A byte is made up of eight bits. Each bit can contain only two distinct values, one or zero. Because of this, a byte can only represent 256 unique values (two to the power of eight)). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.
When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience.
I just found out what was causing the problem.
In php.ini, the mbstring.internal_encoding line was set to something else, so I set it to UTF-8 and, like magic, now everything works!

compact binary representation of json [closed]

Are there any compact binary representations of JSON out there? I know there is BSON, but even that webpage says "in many cases is not much more efficient than JSON. In some cases BSON uses even more space than JSON".
I'm looking for a format that's as compact as possible, preferably some kind of open standard?
You could take a look at the Universal Binary JSON specification. It won't be as compact as Smile because it doesn't do name references, but it is 100% compatible with JSON (whereas BSON and BJSON define data structures that don't exist in JSON, so there is no standard conversion to/from them).
It is also (intentionally) criminally simple to read and write with a standard format of:
[type, 1-byte char]([length, 4-byte int32])([data])
So simple data types begin with an ASCII marker code like 'I' for a 32-bit int, 'T' for true, 'Z' for null, 'S' for string and so on.
The format is engineered by design to be fast to read, as all data structures are prefixed with their size, so there is no scanning for null-terminated sequences.
For example, a string might be demarcated like this (the [] characters are just for illustration; they are not written in the format):
[S][512][this is a really long 512-byte UTF-8 string....]
You would see the 'S', switch on it to process a string, see the 4-byte integer "512" that follows it, and know that you can just grab the next 512 bytes in one chunk and decode them back to a string.
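A rough Python sketch of reading that string record exactly as described above (the 'S' marker followed directly by a 4-byte big-endian length; the function name is my own):
import struct

def read_ubjson_string(buf: bytes, offset: int = 0) -> str:
    marker = buf[offset:offset + 1]
    if marker != b"S":
        raise ValueError("expected a string marker")
    (length,) = struct.unpack_from(">i", buf, offset + 1)   # the 4-byte big-endian length
    start = offset + 5
    return buf[start:start + length].decode("utf-8")

packet = b"S" + struct.pack(">i", 5) + b"hello"
print(read_ubjson_string(packet))   # -> hello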
Similarly, numeric values are written out without a length value to be more compact, because their type (byte, int32, int64, double) already defines their length in bytes (1, 4, 8 and 8 respectively). There is also support for arbitrarily long numbers, which is extremely portable, even on platforms that don't natively support them.
On average you should see a size reduction of roughly 30% with a well balanced JSON object (lots of mixed types). If you want to know exactly how certain structures compress or don't compress you can check the Size Requirements section to get an idea.
On the bright side, regardless of compression, the data will be written in a more optimized format and be faster to work with.
I checked the core Input/OutputStream implementations for reading/writing the format into GitHub today. I'll check in general reflection-based object mapping later this week.
You can just look at those two classes to see how to read and write the format, I think the core logic is something like 20 lines of code. The classes are longer because of abstractions to the methods and some structuring around checking the marker bytes to make sure the data file is a valid format; things like that.
If you have really specific questions like the endianness (Big) of the spec or numeric format for doubles (IEEE 754) all of that is covered in the spec doc or just ask me.
Hope that helps!
Yes: the Smile data format (see the Wikipedia entry). It has a public Java implementation, with a C version in the works at GitHub (libsmile). It has the benefit of being (reliably) more compact than JSON while keeping a 100% compatible logical data model, so it is easy to convert back and forth with textual JSON.
For performance, you can see the jvm-serializers benchmark, where Smile competes well with other binary formats (Thrift, Avro, Protobuf); size-wise it is not the most compact (since it does retain field names), but it does much better with data streams where names are repeated.
It is being used by projects like Elasticsearch and Solr (optionally), and Protostuff-rpc supports it, although it is not as widely used as, say, Thrift or Protobuf.
EDIT (Dec 2011) -- there are now also libsmile bindings for PHP, Ruby and Python, so language support is improving. In addition there are measurements on data size; and although for single-record data alternatives (Avro, Protobuf) are more compact, for data streams Smile is often more compact due to the key and String value back-reference option.
gzipping JSON data is going to get you good compression ratios with very little effort because of its universal support. Also, if you're in a browser environment, you may end up paying a greater byte cost in the size of the dependency from a new library than you would in actual payload savings.
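For instance, a quick Python sketch of the gzip approach, using only the standard library (the sample data is made up):
import gzip, json

doc = [{"name": "Dvořák", "active": True, "count": i} for i in range(100)]
raw = json.dumps(doc).encode("utf-8")                # repetitive JSON text
packed = gzip.compress(raw)
print(len(raw), len(packed))                         # gzip wins easily on repetitive data
print(json.loads(gzip.decompress(packed)) == doc)    # True: lossless round trip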
If your data has additional constraints (such as lots of redundant field values), you may be able to optimize by looking at a different serialization protocol rather than sticking to JSON. Example: a column-based serialization such as Avro's upcoming columnar store may get you better ratios (for on-disk storage). If your payloads contain lots of constant values (such as columns that represent enums), a dictionary compression approach may be useful too.
Another alternative that should be considered these days is CBOR (RFC 7049), which has an explicitly JSON-compatible model with a lot of flexibility. It is both stable and meets your open-standard qualification, and has obviously had a lot of thought put into it.
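If you want to try it, a small sketch using the third-party cbor2 package (pip install cbor2; my own example, not from the answer):
import cbor2, json

doc = {"name": "Dvořák", "works": [8, 9], "active": False}
blob = cbor2.dumps(doc)                 # one compact binary blob, same logical data model
print(len(blob), len(json.dumps(doc)))  # the CBOR blob is usually smaller than the JSON text
print(cbor2.loads(blob) == doc)         # True: round-trips to the same structure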
Have you tried BJSON?
Try using js-inflate to make and unmake blobs:
https://github.com/augustl/js-inflate
It is perfect and I use it a lot.
You might also want to take a look at a library I wrote. It's called minijson, and it was designed for this very purpose.
It's Python:
https://github.com/Dronehub/minijson
Have you tried AVRO? Apache Avro
https://avro.apache.org/