How do computers differentiate between letters and numbers in binary?

I was just curious because 65 is the same as the letter A.
If this is the wrong stack, sorry.

"65 is the same as the letter A": It is true if you say it is. But not saying more than that isn't very useful.
There is no text but encoded text. There are no numbers but encoded numbers. To the CPU, some number encodings are native, everything else is just undifferentiated data.
(Some data is just data for programs, other data is the CPU instructions of programs. It's a security problem if a CPU executes data as instructions inappropriately. Some architectures keep program data and instructions separate.)
Common native number encodings are signed and unsigned integers of 1, 2, 4, and 8 bytes and IEEE-754 single and double precision floating point numbers. Signed integers are usually two's-complement. Multi-byte integers have a byte ordering (or endianness) because on typical machines each byte is individually addressable. If a number encoding is not native, a program library is needed to process such data.
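As a small sketch of what those native encodings look like at the byte level (Python's struct module is used here purely for illustration):
#!/usr/bin/env python3
import struct

n = 1025  # 0x0401

# The same 4-byte integer in little-endian and big-endian byte order.
print(struct.pack('<i', n))   # b'\x01\x04\x00\x00'
print(struct.pack('>i', n))   # b'\x00\x00\x04\x01'

# Two's-complement: -1 stored as a 32-bit integer is all one-bits.
print(struct.pack('<i', -1).hex())  # ffffffff

# IEEE-754 double precision, 8 bytes, big-endian here.
print(struct.pack('>d', 1.5).hex())  # 3ff8000000000000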
Text is a sequence of encoded characters from a character set. There are hundreds of character sets. A character set is an assignment of a conceptual character to a number called a codepoint. Sometimes the conceptual characters are categorized as lowercase letter, digit, symbol, etc. A codepoint value is mapped to bytes using a character encoding. Most character sets have one encoding, but Unicode has several. Some character sets are subsets of other character sets—such relationships are not generally useful because exactly one character set is used in any one context.
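A brief illustration of the distinction between codepoint and encoding (Python codecs stand in for the character sets mentioned above):
#!/usr/bin/env python3
# A character set assigns a codepoint (a number) to a conceptual character...
print(ord('A'))   # 65
print(chr(65))    # A

# ...and a character encoding maps that codepoint to bytes.
# The same character can be represented by different byte sequences:
print('é'.encode('utf-8'))       # b'\xc3\xa9'  (two bytes)
print('é'.encode('iso-8859-1'))  # b'\xe9'      (one byte)
print('é'.encode('utf-16-le'))   # b'\xe9\x00'  (two different bytes)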
A program is a set of instructions that operate on data. It must apply the correct operations to the right data. So, it is the program that differentiates between text and number, usually by where the data is located or which code path handles it.
Stored data must be in a known layout of encoded text and numbers. Sometimes the layout is stored also. The layout is called metadata. Without the metadata accompanying the data, or being agreed upon, the data cannot be used.
It's all quite simple with appropriate bookkeeping. But there are several methods of bookkeeping so there is no general solution to how to handle data without metadata. Methods include: Well-known and/or registered file extensions, HTTP headers, MIME types, HTML meta charset tag, XML encoding declaration. Some methods only work in a certain context, such as audio/video codecs having a four-character code (FourCC), and unix shell scripts with a shebang. Some methods only help narrow guessing, such as file signatures. Needless to say, guessing should be avoided; it leads to security issues and data loss.
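For instance, a file signature only narrows the guess. The sketch below checks for the well-known 8-byte PNG signature; the function name and the idea of calling it on some path are just illustrative, and a match still tells you nothing about, say, the encoding of text stored inside the file.
#!/usr/bin/env python3
# Check a file signature ("magic bytes"); this narrows guessing, nothing more.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def looks_like_png(path):
    with open(path, 'rb') as f:
        return f.read(8) == PNG_SIGNATURE

# looks_like_png('some_image.png')  # hypothetical usage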
Unfortunately, text files often come without metadata, so it is particularly important to agree upon it or communicate it separately.
Data without metadata is just "binary". So the writer of text must agree with the reader on which character encoding is used, and the same goes for all other types of data. Here, reader and writer can be either humans or programs.

Short answer: they don't. Longer answer: every binary combination between 00000000 and 01111111 has a character assigned to it in the ASCII character set, and the remaining byte values are covered by various other character sets and code pages. 01000001 just happens to be the first capital letter of the Latin alphabet, as designated over 30 years ago. There are other character sets and code pages that assign different letters, numbers, non-printable characters and accented letters. It's entirely possible that the byte 01000001 could be a lowercase z with a tilde over the top in a different character set. Computers don't know (or care) what a particular binary representation means to humans.
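One quick way to see that 01000001 only means 'A' by convention is to decode the same byte with a different character set; EBCDIC (code page 037) is used below purely as an illustration:
#!/usr/bin/env python3
b = bytes([0b01000001])          # the byte 0x41
print(b.decode('ascii'))         # 'A' under ASCII
print('A'.encode('cp037'))       # b'\xc1' -- EBCDIC puts 'A' at a different byte value
print(repr(b.decode('cp037')))   # 0x41 is not 'A' in EBCDIC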

Related

Could a file encoded in Latin-1 but read as UTF-8 cause any problems? [duplicate]

What is the difference between UTF-8 and ISO-8859-1?
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding, the latter a single-byte fixed-length encoding.
Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0 - 127 are encoded identically; code points 128 - 255 differ by becoming a 2-byte sequence in UTF-8, whereas they are single bytes in Latin-1.
UTF
UTF is a family of multi-byte encoding schemes for Unicode code points; the original design allowed for up to 2^31 (roughly 2 billion) values. UTF-8 is a flexible encoding that uses between 1 and 4 bytes to represent the first 2^21 (roughly 2 million) code points, which is more than enough for the roughly 1.1 million code points Unicode actually defines.
Long story short: any character whose code point/ordinal value is 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
ISO-8859
ISO-8859 is a family of single-byte encoding schemes used to represent alphabets whose additional characters fit within the byte range 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1, aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of which part is used.
The drawback to this encoding scheme is its inability to accommodate languages comprising more than 128 additional symbols, or to safely display more than one family of symbols at a time. ISO-8859 encodings have also fallen out of favor with the rise of UTF; the ISO working group in charge of them disbanded in 2004, leaving maintenance up to its parent subcommittee.
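To make the "parts" concrete, here is the same byte decoded under two different ISO-8859 parts (the byte value is just a convenient example):
#!/usr/bin/env python3
b = bytes([0xFD])
print(b.decode('iso-8859-1'))  # ý in Latin-1
print(b.decode('iso-8859-9'))  # ı (dotless i) in the Turkish part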
Windows Code Pages
It's worth mentioning that Microsoft also maintains a set of character encodings with limited compatibility with ISO-8859, usually denoted as "cp####". Microsoft has been pushing its more recent product releases toward Unicode in one form or another, but for legacy and/or interoperability reasons you're still likely to run into these code pages.
For example, cp1252 is a superset of ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol € and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up.
Aside from cp1252, the Turkish cp1254 is a similar superset of ISO-8859-9, but all other Windows code pages have at least some fundamental conflicts with, if not differ entirely from, their 8859 equivalents.
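A short sketch of the mismatch described above, using Python's codecs and byte values from the 0x80-0x9F range:
#!/usr/bin/env python3
data = b'\x80\x93smart\x94'              # Euro sign and "smart quotes" in cp1252
print(data.decode('cp1252'))             # €“smart”
print(repr(data.decode('iso-8859-1')))   # the same bytes become C1 control characters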
ASCII: 7 bits. 128 code points.
ISO-8859-1: 8 bits. 256 code points.
UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.
Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:
#!/usr/bin/env python3
c = chr(0xa9)  # U+00A9 COPYRIGHT SIGN
print(c)
print(c.encode('utf-8'))       # two bytes in UTF-8
print(c.encode('iso-8859-1'))  # one byte in ISO-8859-1
Output:
©
b'\xc2\xa9'
b'\xa9'
ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some languages of the Western world, and even for many of its supported languages some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters, you will see weird results. In other words: don't use it. Unicode has taken over the world, and UTF-8 is pretty much the standard these days unless you have some legacy reason (like HTTP headers, which need to be compatible with everything).
One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead.
For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085 NEXT LINE, which has no visible glyph), while Windows-1252 has a horizontal ellipsis there (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …).
The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way: the HTML spec says that all encodings in the Encoding spec must be supported, and no more.
Also of interest, HTML numeric character references essentially use Windows-1252 for 8-bit values rather than Unicode code points; per https://html.spec.whatwg.org/#numeric-character-reference-end-state, &#x85; will produce U+2026 rather than U+0085.
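You can see that behaviour from Python's html module, which follows the WHATWG mapping for numeric character references (at least in recent CPython versions):
#!/usr/bin/env python3
import html

s = html.unescape('&#x85;')
print(hex(ord(s)))   # 0x2026 -- HORIZONTAL ELLIPSIS, not the C1 control U+0085
print(s)             # …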
From another perspective: a file that both the UTF-8 and ASCII codecs fail to read because it contains a byte such as 0xC0 can still be read as ISO-8859-1, since every byte value maps to some character there. The caveat, of course, is that the result is only meaningful if the file doesn't actually contain multi-byte Unicode text.
My reason for researching this question was to find out in what way they are compatible. The Latin-1 charset (ISO-8859-1) is 100% compatible with being stored in a UTF-8 datastore: ASCII characters stay single-byte, and the remaining Latin-1 characters (code points 128 - 255) are stored as two-byte sequences.
Going the other way, from UTF-8 to a Latin-1 charset, may or may not work: any character whose code point is above 255 has no Latin-1 representation and will not store in a Latin-1 datastore.
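A minimal sketch of both directions, assuming Python's codecs:
#!/usr/bin/env python3
latin1_text = 'café'                      # every character fits in Latin-1
print(latin1_text.encode('utf-8'))        # b'caf\xc3\xa9' -- always works; é becomes 2 bytes
print(latin1_text.encode('iso-8859-1'))   # b'caf\xe9'

utf8_only = 'price: €5 and 中文'           # € and CJK are beyond code point 255
try:
    utf8_only.encode('iso-8859-1')
except UnicodeEncodeError as e:
    print('cannot store in a Latin-1 datastore:', e.reason)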

What's the exact meaning of the statement "Since ASCII used 7 bits for the character, it could only represent 128 different characters"?

I came across the statement below while studying HTML character sets and character encoding:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
When we convert any decimal value from the ASCII character set to its binary equivalent, it comes down to a 7-bit-long binary number.
E.g. the capital English letter 'E' has the decimal value 69 in the ASCII table. If we convert 69 to its binary equivalent, it comes down to the 7-bit binary number 1000101.
Then why does the ASCII table show it as the 8-bit binary number 01000101 instead of the 7-bit binary number 1000101?
This seems to contradict the statement
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
which says that ASCII used 7 bits for each character.
Please clear up my confusion about the binary equivalent of a decimal value: should I consider the 7-bit binary equivalent or the 8-bit binary equivalent of a decimal value from the ASCII table? Please explain in easy-to-understand language.
Again, consider the below statement :
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
According to the above statement, how does the number of characters (128) that ASCII supports relate to the fact that ASCII uses 7 bits to represent any character?
Please clear up the confusion.
Thank You.
In most processors, memory is byte-addressable and not bit-addressable. That is, a memory address gives the location of an 8-bit value. So, almost all data is manipulated in multiples of 8 bits at a time.
If we were to store a value that has by its nature only 7 bits, we would very often use one byte per value. If the data is a sequence of such values, as text might be, we would still use one byte per value to make counting, sizing, indexing and iterating easier.
When we describe the value of a byte, we often show all of its bits, either in binary or hexadecimal. If a value is some sort of integer (say of 1, 2, 4, or 8 bytes) and its decimal representation would be more understandable, we would write the decimal digits for the whole integer. But in those cases, we might lose the concept of how many bytes it is.
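A small sketch of that bookkeeping for the 'E' example from the question (Python is used only for illustration):
#!/usr/bin/env python3
print(2 ** 7)                    # 128 -- the number of distinct 7-bit values
print(ord('E'))                  # 69
print(format(69, 'b'))           # 1000101   (the 7 significant bits)
print(format(69, '08b'))         # 01000101  (the same value padded to a full 8-bit byte)
print(len('E'.encode('ascii')))  # 1 -- it still occupies one whole byte in memory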
BTW, HTML doesn't have anything to do with ASCII. And "Extended ASCII" isn't one encoding. The fundamental rule of character encodings is to read (decode) with the encoding the text was written (encoded) with. So, a communication consists of transferring bytes plus a shared understanding of the character encoding. (That is what makes saying "Extended ASCII" so inadequate as to be nearly useless.)
An HTML document represents a sequence of Unicode characters, so one of the Unicode character encodings (UTF-8) is the most common encoding for an HTML document. Regardless, after it is read, the result is Unicode. An HTML document could be encoded in ASCII, but why do that? Since ASCII is a subset of UTF-8, if you knew it was ASCII you could just as easily treat it as UTF-8.
Outside of HTML, ASCII is used billions, if not trillions, of times per second. But unless you know exactly how it pertains to your work, forget about it; you probably aren't using ASCII.

Representation of numbers in the computer

When input is represented in the computer, are numbers treated as characters and encoded with ASCII, or are they converted directly to binary? Put another way: when is my input considered an integer rather than a character?
Both are possible, and it depends on the application; in other words, the software programmer decides. In general, binary representation is more efficient in terms of storage requirements and processing speed, so binary is more usual, but there are good cases where it is better to keep numbers as strings (see the sketch after this list):
to avoid problems with conversions
phone numbers
when no adequate binary representation is available (e.g. 100 digits of pi)
numbers on which no processing takes place
to be continued ...
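A minimal sketch of the difference (Python, purely illustrative): the same user input can stay as text or be converted to a native binary integer, and the stored bytes differ accordingly.
#!/usr/bin/env python3
import struct

user_input = "1234"                  # what the program receives: a string of digit characters
as_text = user_input.encode('ascii')
as_int = int(user_input)             # the conversion the programmer decides to do (or not)
as_binary = struct.pack('<i', as_int)

print(as_text)     # b'1234'              -- 4 bytes of character codes
print(as_binary)   # b'\xd2\x04\x00\x00'  -- the same number as a 32-bit integer
print(as_int + 1)  # 1235 -- arithmetic only makes sense after the conversion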
The most basic building block of electronic data is a bit. It can have only 2 values, 0 and 1. Other data structures are built from collections of bits, such as an 8-bit byte or a 32-bit float.
When a collection of bits needs to represent a character, a certain encoding such as ASCII or UTF-8 is used to give those bits their lexical meaning.
When you want to display character information on the screen, you use a graphical layer to draw pixels representing the "character" (a collection of bits with a matching encoding).

Why is it useful to know how to convert between numeric bases?

We are learning about converting Binary to Decimal (and vice-versa) as well as other base-conversion methods, but I don't understand the necessity of this knowledge.
Are there any real-world uses for converting numbers between different bases?
When dealing with Unicode escape codes: '\u2014' in JavaScript is the same em dash that &#x2014; produces in HTML.
When debugging: many debuggers show all numbers in hex.
When writing bitmasks: it's more convenient to specify powers of two in hex (or by writing 1 << 4). See the sketch after this list.
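A few of those conversions in Python, just to make the points above concrete:
#!/usr/bin/env python3
print(int('2014', 16))     # 8212 -- the code point behind '\u2014'
print(hex(255), bin(255))  # 0xff 0b11111111 -- how a debugger might show the same value
print(hex(1 << 4))         # 0x10 -- a single-bit mask written as a power of two
print(0b1010 & 0b0110)     # 2 -- bitmask operations are easiest to read in base 2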
In this article I describe a concrete use case. In short, suppose you have a series of bytes you want to transfer using some transport mechanism, but you cannot simply pass the payload as bytes, because you are not able to send binary content. Let's say you can only use 64 characters for encoding the payload. A solution to this problem is to convert the 8-bit bytes into 6-bit characters, and this is where number conversion comes into play: consider the series of bytes as a big number whose base is 256, then convert it into a number with base 64, and you are done. Each digit of the new base-64 number denotes a character of your encoded payload...
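A rough sketch of that idea in Python: treat the payload bytes as one big base-256 number and re-express it in base 64. (This is the number-conversion view only; real Base64 works on fixed 3-byte groups and adds padding, partly because this naive view loses leading zero bytes.)
#!/usr/bin/env python3
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def to_base64_digits(payload: bytes) -> str:
    # Interpret the bytes as a single big number in base 256...
    n = int.from_bytes(payload, 'big')
    # ...then repeatedly divide by 64 to read off its base-64 digits.
    digits = []
    while n:
        n, remainder = divmod(n, 64)
        digits.append(ALPHABET[remainder])
    return ''.join(reversed(digits)) or ALPHABET[0]

print(to_base64_digits(b'hello'))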
If you have a device, such as a hard drive, that can only store a set number of states per element, you can only count in a number system with that many states.
Because a computer's bits have only on and off, each one can only represent 0 and 1, so a base-2 system is used.
If you had a device with 3 states, you could represent 0, 1 and 2, and therefore count in a base-3 system.

How can you reverse engineer a binary thrift file?

I've been asked to process some files serialized as binary (not text/JSON unfortunately) Thrift objects, but I don't have access to the program or programmer that created the files, so I have no idea of their structure, field order, etc. Is there a way using the Thrift libraries to open a binary file and analyze it, getting a list of the field types, values, nesting, etc.?
Unfortunately it appears that Thrift's binary protocol does not do very much tagging of data at all; to decode it, you appear to need the .thrift file in hand so you know that, say, the next 4 bytes are supposed to be an integer and aren't actually the first half of a float. So it appears you are stuck with, basically, looking at the files in a hex editor (or equivalent) and trying to deduce fields based on the exact patterns you see.
There are a very few helpful bits:
Each file begins with a version, protocol identifier string, and sequence number.
Maps will begin with 6 bytes that identify the key and value types (the first two bytes, as integer codes) plus the number of elements as a 4-byte integer.
The type codes appear to be standard (the canonical location of their definitions seems to be TProtocol.h in the Thrift sources; for instance, a boolean value is specified by type code 2, a UTF-8 string by type code 16, and so on).
Strings are prefixed by a 4-byte integer length field, and lists are prefixed by the element type (1 byte) and a 4-byte length.
It looks like all integer fields are saved big-endian, and floating-point values are saved in IEEE format (which should make doubles relatively easy to find, at least).
The TBinaryProtocol* files in Thrift have a few more helpful details; on the plus side, there are a number of different implementations so you can read the ones implemented in the language you are most comfortable with.
Sorry, I know this probably isn't that helpful, but it really does appear this is all the information the Thrift binary format provides; clearly the format was designed with the intent that you would always know the exact protocol spec already, and that the goal was to minimize wire space rather than to make blind decoding easy.
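If you do end up poking at the bytes by hand, a small helper like the sketch below may save some time. It only relies on the big-endian integers and length-prefixed strings described above; the example blob and any offsets you feed it are assumptions you would have to verify against your own files.
#!/usr/bin/env python3
import struct

def try_read_i32(data: bytes, offset: int) -> int:
    # Thrift's binary protocol stores integers big-endian.
    return struct.unpack_from('>i', data, offset)[0]

def try_read_string(data: bytes, offset: int) -> str:
    # Strings are a 4-byte big-endian length followed by that many bytes.
    length = struct.unpack_from('>i', data, offset)[0]
    raw = data[offset + 4: offset + 4 + length]
    return raw.decode('utf-8', errors='replace')

# Hypothetical usage against a hand-crafted blob:
blob = b'\x00\x00\x00\x05hello'
print(try_read_i32(blob, 0))     # 5
print(try_read_string(blob, 0))  # 'hello'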