Marshalling a 32-bit int to a 16-bit machine - language-agnostic

I want to implement and understand the concept of marshalling over my own (toy, really) RPC mechanism. While I get the idea behind endianness, I am not sure how to handle 32-bit and 16-bit ints. The problem: machine A represents int as 32 bits, and it wants to call a function int foo(int x) over an RPC call; however, the server represents int as 16 bits. Sending just the lower 16 bits would lose information and is not desirable.
I know IDLs exist to solve this problem. But let's say I use an IDL that "defines" int to be 32 bits. While that works for my scenario, a machine whose native int is 16 bits will always waste 2 bytes when transmitting over the network.
If we flip the IDL to 16 bits, then the user has to manually split their local int and do something fancy, completely breaking the transparency of the RPC.
So what is the right way used in actual implementations?
thanks.

Usually, IDLs define several platform-independent types (UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64) and a few platform-dependent ones, such as int and uint. The platform-dependent types have only limited use, such as the size/index of arrays; it is recommended to use the platform-independent types for everything else.
If a parameter is declared in the IDL as Int32, then on any platform it MUST be Int32. If it's declared as Int, then it depends on the platform.
For example, see COM's VARENUM and VARIANT: there are platform-independent types (such as SHORT (VT_I2), LONG (VT_I4), LONGLONG (VT_I8)) and also machine types (such as INT (VT_INT)).
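To make that concrete, here is a minimal sketch in C of what the wire side might look like, assuming the IDL pins the parameter to Int32 sent in network (big-endian) byte order. The function names and the RPC_RANGE_ERROR code are made up for illustration, not taken from any real RPC library:

#include <stdint.h>

/* Marshal a 32-bit int into a 4-byte network-order buffer. */
void marshal_int32(int32_t value, unsigned char buf[4])
{
    uint32_t u = (uint32_t)value;
    buf[0] = (unsigned char)(u >> 24);
    buf[1] = (unsigned char)(u >> 16);
    buf[2] = (unsigned char)(u >> 8);
    buf[3] = (unsigned char)u;
}

/* Unmarshal on the receiving side, whatever its native int width. */
int32_t unmarshal_int32(const unsigned char buf[4])
{
    return (int32_t)((uint32_t)buf[0] << 24 | (uint32_t)buf[1] << 16
                   | (uint32_t)buf[2] << 8  | (uint32_t)buf[3]);
}

/* A 16-bit server then range-checks before narrowing, e.g.:
       int32_t v = unmarshal_int32(buf);
       if (v < INT_MIN || v > INT_MAX)
           return RPC_RANGE_ERROR;   (hypothetical error code)
*/

The point is that the wire format is fixed by the IDL; the native int width only matters at the boundary, where a 16-bit receiver must range-check before narrowing. The two "wasted" bytes on a 16-bit client are the price of a fixed-width wire format; some protocols use variable-length integer encodings to reclaim them, at the cost of extra complexity.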

Related

Is there a way to encode integer values that are larger than 32 bits using bs-json?

I've been using strings to represent decoded JSON integers larger than 32 bits. It seems string_of_int is capable of dealing with large integer inputs. So a decoder, written like this (in the Json.Decode namespace):
id: json |> field("id", int) |> string_of_int, /* 'id' is string */
successfully deals with integers of at least 37 bits.
Encoding, on the other hand, is proving troublesome for me. The remote server won't accept a string representation, and is expecting an int64. Is it possible to make bs-json support the int64 type? I was hoping something like this could be made to work:
type myData = { id: int64 };
let encodeMyData = (data: myData) => Json.Encode.(object_([("id", int64(data.id))]));
Having to roll my own encoder is not nearly as formidable as a decoder, but ... I'd rather not.
You don't say exactly what problem you have with encoding. The int encoder does literally nothing except change the type, trusting that the int value is actually valid. So I would assume it's the int_of_string operation that causes problems. But that raises the question: if you can successfully decode it as an int, why then convert it to a string?
The underlying problem here is that JavaScript doesn't have 64-bit integers. The max safe integer is 2^53 - 1. JavaScript doesn't actually have integers at all, only floats, which can represent a certain range of integers but can't efficiently do integer arithmetic unless they're converted to either 32-bit or 64-bit ints. And so, for whatever reason, probably consistent overflow handling, it was decided in the EcmaScript specification that binary bitwise operations should operate on 32-bit integers. That opened the possibility for an internal 32-bit representation, a notation for creating 32-bit integers, and the possibility of optimized integer arithmetic on those.
So to your question:
Would it be "safe" to just add external int64 : int64 -> Js.Json.t = "%identity" to the encoder files?
No, because there's no 64-bit integer representation in JavaScript. int64 values are represented as an array of two numbers, I believe, but that is an internal implementation detail that's subject to change. Just casting to Js.Json.t will not yield the result you expect.
So what can you do then?
I would recommend using float. In most respects it will behave exactly like JavaScript numbers, giving you access to their full range.
Alternatively you can use nativeint, which should behave like floats except for division, where the result is truncated to a 32-bit integer.
Lastly, you could also implement your own int_of_string to create an int that is technically out of range by using a couple of lightweight JavaScript functions directly, though I wouldn't really recommend doing this:
let bad_int_of_string = str =>
  str |> Js.Float.fromString |> Js.Math.floor_int;

Why is the int class lowercase but the String class is not?

I've noticed this across a few languages, but I'll make my question specific to AS3. Why is an int class lowercase but a String or Number is not?
var myInt:int = 0;
var myString:String = "";
var myNum:Number = 0;
That's simply primitive values vs. Objects. You can do something like String.substring() because String is an Object, but you can't do anything with an int; it's just a number.
==== EDIT ====
According to the comment below, int in AS3 is a class, so you can call some methods on it. However, it is still a primitive type. The difference is explained here: http://www.adobe.com/devnet/actionscript/learning/as3-fundamentals/data-types.html
"Primitive values are usually faster than complex values because ActionScript 3 stores primitive values in a special way that makes memory and speed optimizations possible.
Note: For readers interested in the technical details, ActionScript 3 stores primitive values internally as immutable objects. The fact that they are stored as immutable objects means that passing by reference is effectively the same as passing by value. This cuts down on memory usage and increases execution speed, because references are usually significantly smaller than the values themselves."
This is most likely because they followed the ECMAScript 3 standard, where on page 14 of the PDF you find the definition of the primitive values of type Number, String, and Boolean.
Page 26 of the PDF lists future reserved keywords, where int appears. Naturally if they want to support unsigned integers, it should be called uint.
Personally I think it would make more sense to name them Int and Uint (or do like Java)

How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less of sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers by doing bit-wise operations, for use in a big switch statement where all the operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 28-bit value. If I wanted to store it in the next instruction, I'm limited to a 32-bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of making conditional jumps special instructions, we might want just one jump instruction, plus a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes (a concrete bit layout is sketched after the list):
optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
absolute (address specified in the 64 bits following the instruction)
relative (offset specified in or following the instruction)
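Just to make the layout concrete, here is one possible packing of such a 32-bit instruction word in C. This is purely illustrative: the widths follow the choices above (6-bit op-code, 2-bit data type), while the 3-bit condition and mode fields and the 18-bit small immediate are assumptions of mine, not a canonical design:

#include <stdint.h>

/* Field layout (bit 31 .. bit 0):
   [31:26] op-code  [25:24] data type  [23:21] condition
   [20:18] addressing mode  [17:0] small immediate          */
#define OPCODE(w)    (((w) >> 26) & 0x3Fu)
#define DTYPE(w)     (((w) >> 24) & 0x03u)
#define COND(w)      (((w) >> 21) & 0x07u)
#define MODE(w)      (((w) >> 18) & 0x07u)
#define SMALL_IMM(w) ((w) & 0x3FFFFu)

static uint32_t encode(unsigned op, unsigned type, unsigned cond,
                       unsigned mode, uint32_t imm)
{
    return (uint32_t)op << 26 | (uint32_t)type << 24
         | (uint32_t)cond << 21 | (uint32_t)mode << 18
         | (imm & 0x3FFFFu);
}

Decoding in the interpreter's dispatch loop is then just these shifts and masks, which is exactly the "bit-wise operations" the question mentions.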
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
Anyway, in a model like this, an object's value is just some set of locations in memory. Likewise, a string is just a sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
    int x;
    int y;
};

class B {
    int a;
    int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean the values in the executable file are loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
    // Define a constant for the literal.
    int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
        compiler->parser->currentString, compiler->parser->currentStringLength));

    // Compile the code to load the constant.
    emit(compiler, CODE_CONSTANT);
    emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
    PUSH(frame->fn->constants[READ_ARG()]);
    DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.
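For instance, in the same style as the snippet above, dedicated instructions for true and null might look like this (CODE_TRUE, TRUE_VAL, and NULL_VAL are my guesses at names, not necessarily the real ones):

CASE_CODE(TRUE):
    PUSH(TRUE_VAL);    // no constant table lookup, no argument to decode
    DISPATCH();

CASE_CODE(NULL):
    PUSH(NULL_VAL);
    DISPATCH();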

Why does 'fputc' use an int as its parameter instead of a char?

standard C lib:
int fputc(int c, FILE *stream);
And similar signatures occur in several places, e.g.:
int putc(int c, FILE *stream);
int putchar(int c);
Why not use char, as it really is?
If int is necessary, when should I use int instead of char?
Most likely (in my opinion, since much of the rationale behind early C is lost in the depths of time), it was simply to mirror the types used in the fgetc family of functions, which must be able to return any real character plus the special EOF value. The fgetc function gets the next character converted to an int, and uses the special marker value EOF to indicate the end of the stream.
To do that, they needed the wider int type since a char isn't quite large enough to hold all possible characters plus one more thing.
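To see the failure mode, here is a deliberately broken sketch: if plain char is unsigned on your platform, c can never compare equal to EOF and the loop never terminates; if it is signed, a legitimate 0xFF byte in the stream is mistaken for EOF and the copy stops early.

#include <stdio.h>

void filecopy_broken(FILE *ifp, FILE *ofp)
{
    char c;                           /* WRONG: must be int */

    while ((c = fgetc(ifp)) != EOF)   /* comparison is unreliable */
        fputc(c, ofp);
}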
And, since the developers of C seemed to prefer a rather minimalist approach to code, it makes sense that they would use the same type, to allow for code such as:
filecopy(ifp, ofp)
FILE *ifp;
FILE *ofp;
{
    int c;

    while ((c = fgetc(ifp)) != EOF)
        fputc(c, ofp);
}
No char parameters in K&R C
One reason is that in early versions[1] of C there were no char parameters.
Yes, you could declare a parameter as char or float, but it was treated as int or double. Therefore it would have been somewhat misleading to document an interface as taking a char argument.
I believe this is still true today for functions declared without prototypes, in order to make it possible to interoperate with older code.
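For illustration (the function itself is made up), an old-style definition can declare the parameter as char, but the argument is still passed as an int under the default argument promotions and converted back on entry:

/* K&R-style definition: no prototype is in scope, so the caller
   promotes the char argument to int before passing it. */
int to_upper_old(c)
    char c;
{
    return (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c;
}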
[1] Early, but still widespread. C was a quick success and became the first (and still, mostly, the only) widely successful systems programming language.

What is boxing and unboxing and what are the trade offs?

I'm looking for a clear, concise and accurate answer.
Ideally as the actual answer, although links to good explanations welcome.
Boxed values are data structures that are minimal wrappers around primitive types*. Boxed values are typically stored as pointers to objects on the heap.
Thus, boxed values use more memory and take at minimum two memory lookups to access: once to get the pointer, and another to follow that pointer to the primitive. Obviously this isn't the kind of thing you want in your inner loops. On the other hand, boxed values typically play better with other types in the system. Since they are first-class data structures in the language, they have the expected metadata and structure that other data structures have.
In Java and Haskell, generic collections can't contain unboxed values. Generic collections in .NET can hold unboxed values with no penalties: where Java's generics are only used for compile-time type checking, .NET will generate specific classes for each generic type instantiated at run time.
Java and Haskell have unboxed arrays, but they're distinctly less convenient than the other collections. However, when peak performance is needed it's worth a little inconvenience to avoid the overhead of boxing and unboxing.
* For this discussion, a primitive value is any that can be stored on the call stack, rather than stored as a pointer to a value on the heap. Frequently that's just the machine types (ints, floats, etc), structs, and sometimes static sized arrays. .NET-land calls them value types (as opposed to reference types). Java folks call them primitive types. Haskellions just call them unboxed.
** I'm also focusing on Java, Haskell, and C# in this answer, because that's what I know. For what it's worth, Python, Ruby, and JavaScript all have exclusively boxed values. This is also known as the "everything is an object" approach***.
*** Caveat: A sufficiently advanced compiler / JIT can in some cases actually detect that a value which is semantically boxed when looking at the source, can safely be an unboxed value at runtime. In essence, thanks to brilliant language implementors your boxes are sometimes free.
from C# 3.0 In a Nutshell:
Boxing is the act of casting a value type into a reference type:
int x = 9;
object o = x; // boxing the int
unboxing is... the reverse:
// unboxing o
object o = 9;
int x = (int)o;
Boxing and unboxing are the processes of converting a primitive value into an object-oriented wrapper class (boxing), and of converting a value of an object-oriented wrapper class back to the primitive value (unboxing).
For example, in Java, you may need to convert an int value into an Integer (boxing) if you want to store it in a Collection, because primitives can't be stored in a Collection, only objects. But when you want it back out of the Collection, you may want the value as an int, not an Integer, so you would unbox it.
Boxing and unboxing are not inherently bad, but they are a tradeoff. Depending on the language implementation, they can be slower and more memory-intensive than just using primitives. However, they may also allow you to use higher-level data structures and achieve greater flexibility in your code.
These days, the topic is most commonly discussed in the context of Java's (and other languages') "autoboxing/autounboxing" feature. Here is a Java-centric explanation of autoboxing.
In .Net:
Often you can't rely on the type of variable a function will consume, so you need to use an object variable which extends from the lowest common denominator - in .NET this is object.
However, object is a class, and it stores its contents as a reference.
List<int> notBoxed = new List<int> { 1, 2, 3 };
int i = notBoxed[1]; // this is the actual value
List<object> boxed = new List<object> { 1, 2, 3 };
int j = (int) boxed[1]; // this is an object that can be 'unboxed' to an int
While both of these lists hold the same information, the second one is larger and slower: each value in it is actually a reference to an object that holds the int.
This is called boxing because the int is wrapped by the object. When it's cast back, the int is unboxed - converted back to its value.
For value types (i.e. all structs) this is slow, and potentially uses a lot more space.
For reference types (i.e. all classes) this is far less of a problem, as they are stored as a reference anyway.
A further problem with a boxed value type is that it's not obvious that you're dealing with the box, rather than the value. When you compare two structs then you're comparing values, but when you compare two classes then (by default) you're comparing the reference - i.e. are these the same instance?
This can be confusing when dealing with boxed value types:
int a = 7;
int b = 7;
if(a == b) // Evaluates to true, because a and b have the same value
object c = (object) 7;
object d = (object) 7;
if(c == d) // Evaluates to false, because c and d are different instances
It's easy to work around:
if(c.Equals(d)) // Evaluates to true because it calls the underlying int's equals
if(((int) c) == ((int) d)) // Evaluates to true once the values are cast
However it is another thing to be careful of when dealing with boxed values.
Boxing is the process of converting a value type into a reference type, whereas unboxing is the conversion of a reference type back into a value type.
Example:
int i = 123;
object o = i;    // boxing
int j = (int)o;  // unboxing
Value types: int, char, structures, enumerations.
Reference types: classes, interfaces, arrays, strings, objects.
The language-agnostic meaning of a box is just "an object that contains some other value".
Literally, boxing is an operation that puts some value into a box; more specifically, it creates a new box containing the value. After boxing, the boxed value can be accessed from the box object by unboxing.
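As a minimal, language-agnostic sketch in C (all the names here are made up): a box is just an allocated cell whose identity is its address.

#include <stdlib.h>

typedef struct { int value; } IntBox;  /* the box object */

IntBox *box(int v)           /* boxing: create a new box holding v */
{
    IntBox *b = malloc(sizeof *b);
    b->value = v;
    return b;
}

int unbox(const IntBox *b)   /* unboxing: read the value back out */
{
    return b->value;
}

Note that box(7) and box(7) yield two distinct objects (different addresses) holding equal values; that is exactly the identity-versus-value distinction drawn next.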
Note that objects (not in any OOP-specific sense) in many programming languages are about identity, while values are not: two objects are the same iff their identities are indistinguishable in the program semantics. Values can also be equal (usually under some equality operator), but we do not distinguish them as "one" or "two" unique values.
Providing boxes is mainly about the effort to distinguish side effects (typically, mutation) on the states of objects that would otherwise probably be invisible to the users.
A language may limit the allowed ways to access an object and hide the identity of the object by default. For example, typical Lisp dialects have no explicit distinction between objects and values. As a result, the implementation is free to share the underlying storage of objects until some mutation operation occurs on one of them (the object must then be "detached" from the shared instance to make the effect visible, i.e. the mutated value stored in the object may differ from that of other objects still holding the old value). This technique is sometimes called object interning.
Interning makes the program more memory-efficient at runtime if the objects are shared without frequent need for mutation, at the cost that:
The users cannot distinguish the identity of the objects.
There is no way to identify an object and ensure its state is explicitly independent of other objects in the program before some side effect has actually occurred (unless the implementation aggressively does the interning concurrently, which should be a rare case).
There may be further problems with interoperation that requires identifying different objects for different operations.
There is a risk that such assumptions turn out to be false, so applying the interning actually makes performance worse.
This depends on the programming paradigm: imperative programs that mutate objects frequently certainly do not work well with interning.
Implementations depending on COW (copy-on-write) to ensure interning can incur serious performance degradation in concurrent environments.
Even local sharing, specifically for a few internal data structures, can be bad. For example, ISO C++11 disallowed sharing of the internal elements of std::basic_string for exactly this reason, even at the cost of breaking the ABI of at least one mainstream implementation (libstdc++).
Boxing and unboxing incur performance penalties. This is especially obvious when the operations could be naively avoided by hand but are not easy for the optimizer to remove. The concrete cost depends on the implementation (or even on the individual program), though.
Mutable cells, i.e. boxes, are well-established facilities designed exactly to resolve the problems in the first and second bullets above. Additionally, there can be immutable boxes, used to implement assignment in a functional language; see SRFI-111 for a practical instance.
Using mutable cells as function arguments under a call-by-value strategy makes the visible effects of mutation shared between the caller and the callee. The object contained in a box is effectively "called by sharing" in this sense.
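Continuing the sketch above: the box pointer is itself passed by value, yet a mutation through it is visible to the caller, so the contained int is effectively called by sharing.

void increment(IntBox *b)    /* the pointer is copied; the cell is shared */
{
    b->value += 1;
}

/* Usage:
       IntBox *b = box(41);
       increment(b);          unbox(b) now yields 42 in the caller too
*/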
Sometimes boxes are referred to as references (which is technically false), so the shared semantics get named "reference semantics". This is not correct, because not all references can propagate visible side effects (e.g. immutable references). References are more about exposing access by indirection, while boxes are an effort to expose only minimal details of the access, such as whether there is indirection at all (which is uninteresting and better left to the implementation).
Moreover, "value semantics" is irrelevant here. Values are not opposed to references, nor to boxes. All the discussion above assumes a call-by-value strategy; for others (like call-by-name or call-by-need), no boxes are needed to share object contents in this way.
Java is probably the first programming language to have made these features popular in the industry. Unfortunately, there seem to be many bad consequences around this topic:
The overall programming paradigm does not fit the design.
Practically, the interning is limited to specific objects like immutable strings, and the cost of (auto-)boxing and unboxing is often blamed.
Fundamental PL knowledge, like the definition of the term "object" (as "instance of a class") in the language specification, as well as the descriptions of parameter passing, became biased compared with the original, well-known meanings during the adoption of Java by programmers.
At least the CLR languages follow a similar parlance.
Some more tips on implementations (and comments to this answer):
Whether to put objects on the call stack or the heap is an implementation detail, and it is irrelevant to the implementation of boxes.
Some language implementations do not maintain a contiguous storage as the call stack.
Some language implementations do not even make the (per thread) activation records a linear stack.
Some language implementations do allocate stacks on the free store ("the heap") and transfer slices of frames between the stacks and the heap back and forth.
These strategies have nothing to do with boxes. For instance, many Scheme implementations have boxes, with different activation-record layouts, including all the ways listed above.
Besides the technical inaccuracy, the statement "everything is an object" is irrelevant to boxing.
Python, Ruby, and JavaScript all use latent typing (by default), so all identifiers referring to some objects will evaluate to values having the same static type. So does Scheme.
Some JavaScript and Ruby implementations use so-called NaN-boxing to allow inline allocation of some objects. Some others (including CPython) do not. With NaN-boxing, a normal double object needs no unboxing to access its value, while a value of some other type can be boxed in a host double object, and there is no separate reference for the double or the boxed value. With the naive pointer approach, a host object pointer like PyObject* is an object reference holding a box whose boxed value is stored in dynamically allocated space.
At least in Python, objects are not "everything". They are also not known as "boxed values" unless you are talking about interoperability with specific implementations.
The .NET FCL generic collections:
List<T>
Dictionary<TKey, TValue>
SortedDictionary<TKey, TValue>
Stack<T>
Queue<T>
LinkedList<T>
were all designed to overcome the performance issues of boxing and unboxing in previous collection implementations.
For more, see chapter 16, CLR via C# (2nd Edition).
Boxing and unboxing allow value types to be treated as objects. Boxing means converting a value type to an instance of a reference type; unboxing, the reverse, converts a reference type back to a value type. For example, in Java, Integer is a class and int is a primitive data type: converting an int to an Integer is boxing, whereas converting an Integer to an int is unboxing. Boxed values live on the heap, so they are managed by the garbage collector.
int i = 123;
object o = (object)i; // boxing
o = 123;
i = (int)o;           // unboxing
Like anything else, autoboxing can be problematic if not used carefully. The classic case is ending up with a NullPointerException that you can't track down, even with a debugger. Try this:
public class TestAutoboxNPE
{
    public static void main(String[] args)
    {
        Integer i = null;

        // .. do some other stuff and forget to initialise i

        i = addOne(i); // Whoa! NPE!
    }

    public static int addOne(int i)
    {
        return i + 1;
    }
}
}