What is the preferred way to store different versions of data?

When you're writing an application that needs to read and work with two versions of data in the same way, what is the best way to structure your classes to represent that data? I have come up with three scenarios:
Common Base/Specific Children
Data Union
Distinct Structures
Version 1 Car Example
byte DoorCount
int Color
byte HasMoonroof
byte HasSpoiler
float EngineSize
byte CylinderCount
Version 2 Car
byte DoorCount
int Color
enum:int MoonRoofType
enum:int TrunkAccessories
enum:int EngineType
Common Base/Specific Children
With this method, there is a base class of common fields between the two versions of data and a child class for each version of the data.
class Car {
byte DoorCount;
int Color;
}
class CarVersion1 : Car {
byte HasMoonroof;
byte HasSpoiler;
float EngineSize;
byte CylinderCount;
}
class CarVersion2 : Car {
int MoonRoofType;
int TrunkAccessories;
int EngineType;
}
Strengths
OOP Paradigm
Weaknesses
Existing child classes will have to change if a new version is released that removes a common field
Data for one conceptual unit is split across two definitions, and the split reflects versioning rather than any division meaningful to the data itself.
Data Union
Here, a Car is defined as the union of the fields of Car across all versions of the data.
class Car {
CarVersion version;
byte DoorCount;
int Color;
int MoonRoofType; //boolean if Version 1
int TrunkAccessories; //boolean if Version 1
int EngineType; //CylinderCount if Version 1
float EngineSize; //Not used if Version2
}
Strengths
Um... Everything is in one place.
Weaknesses
Forced case driven code.
Difficult to maintain when another version is released or legacy is removed.
Difficult to conceptualize. The meanings of the fields changed based on the version.
Distinct Structures
Here the structures have no OOP relationship to each other. However, interfaces may be implemented by both classes if/when the code expects to treat them in the same fashion.
class CarVersion1 {
byte DoorCount;
int Color;
byte HasMoonroof;
byte HasSpoiler;
float EngineSize;
byte CylinderCount;
}
class CarVersion2 {
byte DoorCount;
int Color;
int MoonRoofType;
int TrunkAccessories;
int EngineType;
}
Strengths
Straightforward approach
Easy to maintain if a new version is added or legacy is removed.
Weaknesses
It's an anti-pattern.
Is there a better way that I didn't think of? It's probably obvious that I favor the last methodology, but is the first one better?

Why is the third option, distinct structures for each version, a bad idea or anti-pattern?
If the two versions of data structures are used in a common application/module - they will have to implement the same interface. Period. It is definitely untenable to write two different application modules to handle two different versions of data structure. The fact that the underlying data model is extremely different should be irrelevant. After all, the goal of writing objects is to achieve a practical level of encapsulation.
As you continue writing code in this way, you should eventually find places where the code in both classes is similar or redundant. If you move these common pieces of code out of the various version classes, you may eventually end up with version classes that not only implement the same interface, but can also extend the same base/abstract class. Voila, you've found your way to your "first" option.
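As a minimal sketch of that starting point, here are two unrelated version classes that satisfy one shared interface. The interface name Car, the method has_moonroof, and the convention that a MoonRoofType of 0 means "no moonroof" are assumptions made for illustration; the question does not define them.

```python
from dataclasses import dataclass
from typing import Protocol


class Car(Protocol):
    """Hypothetical common interface that both versions satisfy."""

    door_count: int
    color: int

    def has_moonroof(self) -> bool: ...


@dataclass
class CarVersion1:
    door_count: int
    color: int
    has_moonroof_flag: int  # stored as a byte in version 1: 0 or 1
    has_spoiler_flag: int
    engine_size: float
    cylinder_count: int

    def has_moonroof(self) -> bool:
        return self.has_moonroof_flag != 0


@dataclass
class CarVersion2:
    door_count: int
    color: int
    moonroof_type: int  # assumption: 0 means "no moonroof"
    trunk_accessories: int
    engine_type: int

    def has_moonroof(self) -> bool:
        return self.moonroof_type != 0


def describe(car: Car) -> str:
    """Works on either version because it only touches the shared interface."""
    roof = "with" if car.has_moonroof() else "without"
    return f"{car.door_count}-door car {roof} moonroof"
```

Code such as describe never branches on the version. If the two classes later grow duplicated logic, that logic is the natural candidate for a shared base class.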
I think this is the best path in an environment with constantly evolving data. It requires some diligence and "looking behind" on older code, but worth the benefits of code clarity and reusable components.
Another thought: in your example, the base class is "Car". In my opinion, it hardly ever turns out that the base class is so "near" to its inheritors. A more realistic set of base classes or interfaces might be "Versionable", "Upgradeable", "OptionContainer", etc. Just speaking from my experience, YMMV.

Use the second approach and enhance it with interfaces. Remember that one class can implement multiple interface "versions", which gives you the power of backward compatibility! I hope you get what I meant to say ;)

Going on the following requirement:
an application that needs to read and work with two versions of data in the same way
I would say that the most important thing is that you funnel all logic through a data abstraction layer, so that none of your logic will have to care about whether you're using version 1, 2 or n of the data.
One way to do this is to have just one data class, that is the most "buffed up" version of the data. Basically, it would have MoonRoofType, but not HasMoonRoof since that can be inferred. This class should not have any obsolete properties either, since it's up to the data abstraction layer to decide what the default values should be.
In the end, you'll have an application that doesn't care about the data versions at all.
As for the data abstraction layer, you may or may not want to have data classes for every version. Most likely, all you'll need is one class for every version of the data structure with Save and Load methods for storing/creating the data instances used by your application logic.
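A rough sketch of such a layer, assuming records arrive as dictionaries keyed by the field names from the question. The MoonRoofType values, the loader names, and the rule that a version 1 HasMoonroof flag maps to STANDARD are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class MoonRoofType(IntEnum):
    # Hypothetical enum values; the question only names the field.
    NONE = 0
    STANDARD = 1


@dataclass
class Car:
    """The single, most complete model that the application logic sees."""

    door_count: int
    color: int
    moonroof_type: MoonRoofType


def load_v1(record: dict) -> Car:
    """Map a version 1 record: the boolean flag becomes an enum value."""
    roof = MoonRoofType.STANDARD if record["HasMoonroof"] else MoonRoofType.NONE
    return Car(record["DoorCount"], record["Color"], roof)


def load_v2(record: dict) -> Car:
    """Map a version 2 record, which already stores the enum value."""
    return Car(record["DoorCount"], record["Color"],
               MoonRoofType(record["MoonRoofType"]))


# One loader per data version; adding version 3 means adding one entry.
LOADERS = {1: load_v1, 2: load_v2}


def load_car(version: int, record: dict) -> Car:
    """Entry point of the abstraction layer: pick the loader by version."""
    return LOADERS[version](record)
```

Everything after load_car works with Car only, so the version number never leaks into the application logic.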


How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 28-bit integer (32 bits minus the 4 opcode bits). If I wanted to store it in the next instruction, I'm limited to a 32-bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
Optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
Absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
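To make the field layout concrete, here is a small sketch that packs a 2-bit data type and a 6-bit op-code into a single instruction byte. Placing the type in the top two bits and numbering the four data types 0 to 3 are assumptions; the text above only fixes the field widths.

```python
from typing import Tuple

TYPE_BITS = 2
OPCODE_BITS = 6
OPCODE_MASK = (1 << OPCODE_BITS) - 1  # 0b00111111

# Hypothetical numbering for the four data types named above.
INT8, INT64, FLOAT32, FLOAT64 = range(4)


def encode(data_type: int, opcode: int) -> int:
    """Pack a 2-bit data type and a 6-bit op-code into one byte."""
    if not 0 <= data_type < (1 << TYPE_BITS):
        raise ValueError("data type does not fit in 2 bits")
    if not 0 <= opcode <= OPCODE_MASK:
        raise ValueError("op-code does not fit in 6 bits")
    return (data_type << OPCODE_BITS) | opcode


def decode(instruction: int) -> Tuple[int, int]:
    """Split one instruction byte back into (data_type, opcode)."""
    return instruction >> OPCODE_BITS, instruction & OPCODE_MASK
```

Addressing modes and condition bits would need further fields, which is one reason real instruction sets tend to spend more than one byte per instruction.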
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
int x;
int y;
};
class B {
int a;
int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean values in the executable file loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
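The instruction sequence for a.x += b.a can be traced with a toy stack machine. This is only a sketch: memory is modelled as a flat Python list, the addresses 0 to 3 for a.x, a.y, b.a and b.b are made up, and instructions are (name, operand) pairs rather than encoded bytes.

```python
# Hypothetical addresses for the four fields of the objects a and b.
A_X, A_Y, B_A, B_B = 0, 1, 2, 3

# The sequence from the answer: a.x += b.a
PROGRAM = [
    ("push", A_X),    # put &a.x on top of stack
    ("dupe", None),   # keep a copy of the address for the final store
    ("load", None),   # replace the address on top with the value at a.x
    ("push", B_A),    # put &b.a on top of stack
    ("load", None),   # replace it with the value at b.a
    ("add", None),    # add the two values
    ("store", None),  # write the sum to the address saved by dupe
]


def run(program, memory):
    """Execute (opcode, operand) pairs against a flat memory list."""
    stack = []
    for op, operand in program:
        if op == "push":
            stack.append(operand)
        elif op == "dupe":
            stack.append(stack[-1])
        elif op == "load":
            stack.append(memory[stack.pop()])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "store":
            value = stack.pop()    # value is on top
            address = stack.pop()  # destination address is below it
            memory[address] = value
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack
```

Starting from memory [1, 2, 3, 4] (that is, A a {1, 2} and B b {3, 4}), running PROGRAM leaves the stack empty and changes memory to [4, 2, 3, 4].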
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
// Define a constant for the literal.
int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
compiler->parser->currentString, compiler->parser->currentStringLength));
// Compile the code to load the constant.
emit(compiler, CODE_CONSTANT);
emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
PUSH(frame->fn->constants[READ_ARG()]);
DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.
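The same constant-table idea in a few lines of Python, as a sketch rather than a port of the C code above. The Function class and the names add_constant, compile_literal and run are hypothetical.

```python
LOAD_CONSTANT = 0  # hypothetical opcode number


class Function:
    """A compiled chunk: bytecode plus the constant table that rides along."""

    def __init__(self):
        self.code = []
        self.constants = []


def add_constant(fn, value):
    """Store a runtime value in the table and return its index."""
    fn.constants.append(value)
    return len(fn.constants) - 1


def compile_literal(fn, value):
    """Compiler side: emit LOAD_CONSTANT followed by the table index."""
    index = add_constant(fn, value)
    fn.code.append(LOAD_CONSTANT)
    fn.code.append(index)


def run(fn):
    """Interpreter side: decode the argument and push the constant."""
    stack = []
    pc = 0
    while pc < len(fn.code):
        op = fn.code[pc]
        pc += 1
        if op == LOAD_CONSTANT:
            index = fn.code[pc]
            pc += 1
            stack.append(fn.constants[index])
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack
```

The bytecode itself only ever holds small integers; a string of any length costs one table slot and one index in the instruction stream.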

What is ADT? (Abstract Data Type)

I am currently studying abstract data types (ADTs), but I don't get the concept at all. Can someone please explain what an ADT actually is? Also, what are the collection, bag, and list ADTs, in simple terms?
An abstract data type (ADT) is a data type for which only the behavior is defined, not the implementation.
The opposite of an ADT is a concrete data type (CDT), which contains an implementation of an ADT.
Examples:
Array, List, Map, Queue, Set, Stack, Table, Tree, and Vector are ADTs. Each of these ADTs has many implementations, i.e. CDTs. Container is a high-level ADT that generalizes all of the above.
Real life example:
A book is abstract (a telephone book is an implementation).
The Abstract data type Wikipedia article has a lot to say.
In computer science, an abstract data type (ADT) is a mathematical model for a certain class of data structures that have similar behavior; or for certain data types of one or more programming languages that have similar semantics. An abstract data type is defined indirectly, only by the operations that may be performed on it and by mathematical constraints on the effects (and possibly cost) of those operations.
In slightly more concrete terms, you can take Java's List interface as an example. The interface doesn't explicitly define any behavior at all because there is no concrete List class. The interface only defines a set of methods that other classes (e.g. ArrayList and LinkedList) must implement in order to be considered a List.
A collection is another abstract data type. In the case of Java's Collection interface, it's even more abstract than List, since
The List interface places additional stipulations, beyond those specified in the Collection interface, on the contracts of the iterator, add, remove, equals, and hashCode methods.
A bag is also known as a multiset.
In mathematics, the notion of multiset (or bag) is a generalization of the notion of set in which members are allowed to appear more than once. For example, there is a unique set that contains the elements a and b and no others, but there are many multisets with this property, such as the multiset that contains two copies of a and one of b or the multiset that contains three copies of both a and b.
In Java, a Bag would be a collection that implements a very simple interface. You only need to be able to add items to a bag, check its size, and iterate over the items it contains. See Bag.java for an example implementation (from Sedgewick & Wayne's Algorithms 4th edition).
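For comparison, a minimal sketch of the same three operations in Python. It is backed by a plain list, which is an implementation choice of this sketch and not taken from Bag.java.

```python
from typing import Generic, Iterator, List, TypeVar

T = TypeVar("T")


class Bag(Generic[T]):
    """A multiset that supports only add, size, and iteration."""

    def __init__(self) -> None:
        self._items: List[T] = []

    def add(self, item: T) -> None:
        """Add an item; duplicates are allowed."""
        self._items.append(item)

    def __len__(self) -> int:
        """Number of items, counting duplicates."""
        return len(self._items)

    def __iter__(self) -> Iterator[T]:
        """Iterate over the items; callers should not rely on the order."""
        return iter(self._items)
```

There is deliberately no remove method: the ADT is defined by what it offers, and a bag offers very little.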
A truly abstract data type describes the properties of its instances without commitment to their representation or particular operations. For example the abstract (mathematical) type Integer is a discrete, unlimited, linearly ordered set of instances. A concrete type gives a specific representation for instances and implements a specific set of operations.
Notation of Abstract Data Type(ADT)
An abstract data type could be defined as a mathematical model with a
collection of operations defined on it. A simple example is sets of
integers together with the operations of union and intersection defined
on those sets.
ADTs are generalizations of primitive data types (integer, char,
etc.), and they encapsulate a data type in the sense that the definition
of the type and all operations on that type can be localized to one section
of the program. They are treated as a primitive data type outside the
section in which the ADT and its operations are defined.
An implementation of an ADT is the translation into statements of
a programming language of the declaration that defines a variable to
be of that ADT, plus a procedure in that language for each
operation of that ADT. The implementation of the ADT chooses a
data structure to represent the ADT.
A useful tool for specifying the logical properties of data type is
the abstract data type. Fundamentally, a data type is a collection of
values and a set of operations on those values. That collection and
those operations form a mathematical construct that may be implemented
using a particular hardware and software data structure. The term
"abstract data type" refers to the basic mathematical concept that defines the data type.
In defining an abstract data type as a mathematical concept, we are not
concerned with space or time efficiency. Those are implementation
issues. In fact, the definition of an ADT is not concerned with
implementation details at all. It may not even be possible to implement
a particular ADT on a particular piece of hardware or using a
particular software system. For example, we have already seen that an
ADT integer is not universally implementable.
To illustrate the concept of an ADT and my specification method,
consider the ADT RATIONAL which corresponds to the mathematical
concept of a rational number. A rational number is a number that can
be expressed as the quotient of two integers. The operations on
rational numbers that we define are the creation of a rational number
from two integers, addition, multiplication and testing for equality.
The following is an initial specification of this ADT.
/* Value definition */
abstract typedef <integer, integer> RATIONAL;
condition RATIONAL[1] != 0;
/* Operator definition */
abstract RATIONAL makerational(a,b)
int a,b;
precondition b != 0;
postcondition makerational[0] == a;
              makerational[1] == b;
abstract RATIONAL add(a,b)
RATIONAL a,b;
postcondition add[1] == a[1]*b[1];
              add[0] == a[0]*b[1] + b[0]*a[1];
abstract RATIONAL mult(a,b)
RATIONAL a,b;
postcondition mult[0] == a[0]*b[0];
              mult[1] == a[1]*b[1];
abstract equal(a,b)
RATIONAL a,b;
postcondition equal == (a[0]*b[1] == b[0]*a[1]);
An ADT consists of two parts:-
1) Value definition
2) Operation definition
1) Value Definition:-
The value definition defines the collection of values for the ADT and
consists of two parts:
1) Definition Clause
2) Condition Clause
For example, the value definition for the ADT RATIONAL states that
a RATIONAL value consists of two integers, the second of which does
not equal 0.
The keyword abstract typedef introduces a value definition and the
keyword condition is used to specify any conditions on the newly
defined data type. In this definition the condition specifies that the
denominator may not be 0. The definition clause is required, but the
condition may not be necessary for every ADT.
2) Operator Definition:-
Each operator is defined as an abstract function with three parts.
1)Header
2)Optional Preconditions
3)Optional Postconditions
For example the operator definition of the ADT RATIONAL includes the
operations of creation (makerational), addition (add) and
multiplication (mult) as well as a test for equality (equal). Let us
consider the specification for multiplication first, since, it is the
simplest. It contains a header and post-conditions, but no
pre-conditions.
abstract RATIONAL mult(a,b)
RATIONAL a,b;
postcondition mult[0] == a[0]*b[0];
              mult[1] == a[1]*b[1];
The header of this definition is the first two lines, which are just
like a C function header. The keyword abstract indicates that it is
not a C function but an ADT operator definition.
The post-condition specifies what the operation does. In a
post-condition, the name of the function (in this case, mult) is used
to denote the result of the operation. Thus, mult[0] represents the
numerator of the result and mult[1] represents the denominator of the
result. That is, it specifies what conditions become true after the
operation is executed. In this example, the post-condition specifies
that the numerator of the result of a rational multiplication equals the
integer product of the numerators of the two inputs, and the denominator
equals the integer product of the two denominators.
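One possible implementation of the RATIONAL specification, sketched in Python. Representing a rational as a (numerator, denominator) tuple is a choice made for this sketch; the ADT itself does not prescribe it. Each function body mirrors the corresponding post-condition, and the precondition on makerational becomes a runtime check.

```python
from typing import Tuple

# Representation chosen for this sketch: (numerator, denominator).
Rational = Tuple[int, int]


def makerational(a: int, b: int) -> Rational:
    """precondition: b != 0; postcondition: result == (a, b)."""
    if b == 0:
        raise ValueError("denominator must not be 0")
    return (a, b)


def add(a: Rational, b: Rational) -> Rational:
    """add[0] == a[0]*b[1] + b[0]*a[1]; add[1] == a[1]*b[1]."""
    return (a[0] * b[1] + b[0] * a[1], a[1] * b[1])


def mult(a: Rational, b: Rational) -> Rational:
    """mult[0] == a[0]*b[0]; mult[1] == a[1]*b[1]."""
    return (a[0] * b[0], a[1] * b[1])


def equal(a: Rational, b: Rational) -> bool:
    """equal == (a[0]*b[1] == b[0]*a[1]), i.e. cross-multiplication."""
    return a[0] * b[1] == b[0] * a[1]
```

For instance, add(makerational(1, 2), makerational(1, 3)) gives (5, 6). Results are not reduced to lowest terms because the specification does not ask for that, which is why equal cross-multiplies instead of comparing the tuples directly.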
List
In computer science, a list or sequence is an abstract data type that
represents a countable number of ordered values, where the same value
may occur more than once. An instance of a list is a computer
representation of the mathematical concept of a finite sequence; the
(potentially) infinite analog of a list is a stream. Lists are a basic
example of containers, as they contain other values. If the same value
occurs multiple times, each occurrence is considered a distinct item
The name list is also used for several concrete data structures that
can be used to implement abstract lists, especially linked lists.
Bag
A bag is a collection of objects, where you can keep adding objects to
the bag, but you cannot remove them once added to the bag. So with a
bag data structure, you can collect all the objects, and then iterate
through them. You will normally use bags when you program in Java.
Collection
A collection in the Java sense refers to any class that implements the
Collection interface. A collection in a generic sense is just a group
of objects.
An abstract data type:
Is a concept or theoretical model that defines a data type logically
Specifies a set of data and the set of operations that can be performed on that data
Does not mention anything about how the operations will be implemented
Is abstract in the dictionary sense: "existing as an idea but not having a physical existence"
For example, let's look at the specifications of some abstract data types:
List Abstract Data Type: initialize(), get(), insert(), remove(), etc.
Stack Abstract Data Type: push(), pop(), peek(), isEmpty(), isNull(), etc.
Queue Abstract Data Type: enqueue(), dequeue(), size(), peek(), etc.
One of the simplest explanations is given on Brilliant's wiki:
Abstract data types, commonly abbreviated ADTs, are a way of
classifying data structures based on how they are used and the
behaviors they provide. They do not specify how the data structure
must be implemented or laid out in memory, but simply provide a
minimal expected interface and set of behaviors. For example, a stack
is an abstract data type that specifies a linear data structure with
LIFO (last in, first out) behavior. Stacks are commonly implemented
using arrays or linked lists, but a needlessly complicated
implementation using a binary search tree is still a valid
implementation. To be clear, it is incorrect to say that stacks are
arrays or vice versa. An array can be used as a stack. Likewise, a
stack can be implemented using an array.
Since abstract data types don't specify an implementation, this means
it's also incorrect to talk about the time complexity of a given
abstract data type. An associative array may or may not have O(1)
average search times. An associative array that is implemented by a
hash table does have O(1) average search times.
Examples of ADTs: List (can be implemented using an array or a linked list), Queue, Deque, Stack, Associative array, Set.
https://brilliant.org/wiki/abstract-data-types/?subtopic=types-and-data-structures&chapter=abstract-data-types
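The point that a stack is defined by its behavior and not by its storage can be sketched as one abstract interface with two interchangeable implementations. The class and method names here are illustrative only.

```python
from abc import ABC, abstractmethod


class Stack(ABC):
    """The ADT: LIFO behavior, no commitment to a representation."""

    @abstractmethod
    def push(self, item): ...

    @abstractmethod
    def pop(self): ...

    @abstractmethod
    def is_empty(self) -> bool: ...


class ArrayStack(Stack):
    """Backed by a dynamic array (a Python list)."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def is_empty(self) -> bool:
        return not self._items


class LinkedStack(Stack):
    """Backed by a singly linked list of (item, next) pairs."""

    def __init__(self):
        self._head = None

    def push(self, item):
        self._head = (item, self._head)

    def pop(self):
        if self._head is None:
            raise IndexError("pop from empty stack")
        item, self._head = self._head
        return item

    def is_empty(self) -> bool:
        return self._head is None
```

Code written against Stack behaves the same with either class. Only the cost of the operations may differ, and cost belongs to the implementation, not to the ADT.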
An ADT is a set of data values and associated operations that are precisely specified independent of any particular implementation. The strength of an ADT is that the implementation is hidden from the user; only the interface is declared. This means that the ADT can be implemented in various ways.
An abstract data type is a mathematical model that includes data with various operations. Implementation details are hidden, and that's why it is called abstract. Abstraction allows you to organise the complexity of the task by focusing on the logical properties of data and actions.
In programming languages, a type is some data and the associated operations. An ADT is a user-defined data aggregate together with the operations over that data. It is characterized by encapsulation (the data and operations are represented, or at least declared, in a single syntactic unit) and by information hiding (only the relevant operations, the ADT interface, are visible to the user of the ADT, in the same way as for a normal data type in the programming language). It's an abstraction because the internal representation of the data and the implementation of the operations are of no concern to the ADT user.
Before defining abstract data types, let us consider a different
view of system-defined data types. We all know that by default all
primitive data types (int, float, etc.) support basic operations such
as addition and subtraction. The system provides the implementations
for the primitive data types. For user-defined data types, we also
need to define operations. The implementation for these operations can
be done when we want to actually use them. That means in general,
user-defined data types are defined along with their operations.
To simplify the process of solving problems, we combine the data
structures with their operations and we call this an "Abstract Data
Type" (ADT).
Commonly used ADTs include: Linked List, Stacks, Queues, Binary Tree,
Dictionaries, Disjoint Sets (Union and find), Hash Tables and many
others.
An ADT consists of two parts:
1. Declaration of data.
2. Declaration of operation.
Simply put, an abstract data type is nothing but a set of operations and a set of data, used for storing some other data efficiently in the machine.
There is no need for any particular type declaration.
It just requires an implementation of the ADT.
To solve problems we combine the data structure with their operations. An ADT consists of two parts:
Declaration of Data.
Declaration of Operation.
Commonly used ADTs are Linked Lists, Stacks, Queues, Priority Queues, Trees, etc. While defining ADTs we don't need to worry about implementation details. They come into the picture only when we want to use them.
An abstract data type is like a user-defined data type on which we can perform operations without knowing what is inside the data type or how the operations are performed. Because that information is not exposed, it is abstracted. Examples: List, Array, Stack, Queue. On a Stack we can perform operations like push and pop, but we are not told how they are implemented behind the curtains.
An ADT is a set of objects and operations; nowhere in an ADT’s definition is there any mention of how the set of operations is implemented. Programmers who use collections only need to know how to instantiate them and access data in some predetermined manner, without concern for the details of the collections' implementations. In other words, from a user’s perspective, a collection is an abstraction, and for this reason, in computer science, some collections are referred to as abstract data types (ADTs). The user is only concerned with learning its interface, or the set of operations it performs.
In simple words: an abstract data type is a collection of data and operations that work on that data. The operations both describe the data to the rest of the program and allow the rest of the program to change the data. The word “data” in “abstract data type” is used loosely. An ADT might be a graphics window with all the operations that affect it, a file and file operations, an insurance-rates table and the operations on it, or something else.
(From the book Code Complete, 2nd edition.)
An abstract data type is a collection of values together with the operations on those values. For example, since String is not a primitive data type, we can include it among the abstract data types.
An ADT is a data type defined by a collection of data and the operations that work on that data. It focuses more on the concept than on the implementation.
It's up to you which language you use to implement it.
Example:
A stack is an ADT, while an array is not.
A stack is an ADT because we can implement it in many languages
(Python, C, C++, Java, and many more), while an array is a built-in data type.
An abstract data type, sometimes abbreviated ADT, is a logical description of how we view the data and the operations that are allowed without regard to how they will be implemented. This means that we are concerned only with what the data is representing and not with how it will eventually be constructed.
https://runestone.academy/runestone/books/published/pythonds/Introduction/WhyStudyDataStructuresandAbstractDataTypes.html
Abstractions give you only information (service information), not the implementation.
For example: when you go to withdraw money from an ATM, you only know one procedure: put your ATM card into the machine, select the withdraw option, enter the amount, and your money comes out if there is money.
This is all you know about ATMs. But do you know how you are receiving the money? What business logic is running behind the scenes? Which database is being called? Which server at which location is being invoked? No, all you know is the service information, i.e. that you can withdraw money. This is an abstraction.
Similarly, an ADT gives you an overview of a data type: what can be stored in it and what operations you can perform on it. But it doesn’t say how to implement that. This is an ADT. It only defines the logical form of your data type.
Another analogy:
In a car or on a bike, you only know that when you press the brake your vehicle will stop. But do you know how the bike stops when you press the brake? No, which means the implementation detail is hidden. You only know what the brake does when you press it, not how it does it.
An abstract data type(ADT) is an abstraction of a data structure that provides only the interface to which the data structure must adhere. The interface does not give any specific details about how something should be implemented or in what programming language.
The term data type refers to the type of data that a particular variable can hold: it may be an integer, a character, a float, or any of a range of simple data storage representations. However, when we build an object-oriented system, we use other data types, known as abstract data types, which represent more realistic entities.
E.g.: we might be interested in representing a 'bank account' data type, which describes how all bank accounts are handled in a program. Abstraction is about reducing complexity by ignoring unnecessary details.

Is there any advantage in using Vector.<Object> in place of a standard Array?

Because of the inability to create Vectors dynamically, I'm forced to create one with a very primitive type, i.e. Object:
var list:Vector.<Object> = new Vector.<Object>();
I'm assuming that Vector gains its power from being typed as closely as possible, rather than the above, but I may be wrong and there are in-fact still gains when using the above in place of a normal Array or Object:
var list:Array = [];
var list:Object = {};
Does anyone have any insight on this?
You will not gain any benefits from Vector.< Object > compared to Array or vice versa. Also the underlying data structure will be the same even if you have a tighter coupled Vector such as Vector.< Foo >. The only optimization gains will be if you use value types. The reason for this is that Ecmascript will still be late binding and all reference objects share the same referencing byte structure.
However, in Ecmascript 4 (of which Actionscript is an implementation) the Vector generic datatype adds bounds checking to element access (the non-vector will simply grow the array), so the functionality varies slightly and consequently the number of CPU clock cycles will vary a little bit. This is negligible however.
One advantage I've seen is that coding is a bit easier with vectors, because FlashDevelop (and most coding tools for AS3) can do code hinting better. I can type myVector. and see my methods and functions; an array won't let you do that without casting myArr[2] as myObject (though this kind of casting is rumoured to make it faster, not slower).
Array's sort functions are faster, however. But if it is speed you're after, you might be better served by linked lists (depending on the application).
I think using vectors is the proper way to be coding, but not necessarily better.
Excellent question: Vectors have tremendous value! Vector.<Object> vs Array is a bad example of the differences, though, and benchmarks may be similar. However, a Vector of a primitive numeric type (e.g. Vector.<int>) vs Array is DEFINITELY better in both memory and processing. The speed improvement comes from Flash not needing to "box" and "unbox" the values (multiple mathematical operations are required for this). Also, Array cannot allocate memory as effectively as a typed Vector. Strictly typed collections are almost always better.
Benchmarks:
http://jacksondunstan.com/articles/636
http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/
Even .NET suffers from boxing collections (Array):
http://msdn.microsoft.com/en-us/library/ms173196.aspx
UPDATE:
I've been corrected! Only primitive numeric types get a performance enhancement from Vectors. You won't see any improvement with Array vs Vector.<Object>.

Should Tuples Subclass Each Other?

Given a set of tuple classes in an OOP language: Pair, Triple and Quad, should Triple subclass Pair, and Quad subclass Triple?
The issue, as I see it, is whether a Triple should be substitutable as a Pair, and likewise Quad for Triple or Pair. Whether Triple is also a Pair and Quad is also a Triple and a Pair.
In one context, such a relationship might be valuable for extensibility - today this thing returns a Pair of things, tomorrow I need it to return a Triple without breaking existing callers, who are only using the first two of the three.
On the other hand, should they each be distinct types? I can see benefit in stronger type checking - where you can't pass a Triple to a method that expects a Pair.
I am leaning towards using inheritance, but would really appreciate input from others.
PS: In case it matters, the classes will (of course) be generic.
PPS: On a way more subjective side, should the names be Tuple2, Tuple3 and Tuple4?
Edit: I am thinking of these more as loosely coupled groups, not specifically for things like x/y or x/y/z coordinates, though they may be used for such. It would be things like needing a general solution for multiple return values from a method, but in a form with very simple semantics.
That said, I am interested in all the ways others have actually used tuples.
A tuple of a different length is a different type. (Well, in many type systems anyways.) In a strongly typed language, I wouldn't think that they should be a collection.
This is a good thing as it ensures more safety. Places where you return tuples usually have somewhat coupled information along with it, the implicit knowledge of what each component is. It's worse if you pass in more values in a tuple than expected -- what's that supposed to mean? It doesn't fit inheritance.
Another potential issue is if you decide to use overloading. If tuples inherit from each other, then overload resolution will fail where it should not. But this is probably a better argument against overloading.
Of course, none of this matters if you have a specific use case and find that certain behaviours will help you.
Edit: If you want general information, try perusing a bit of Haskell or ML family (OCaml/F#) to see how they're used and then form your own decisions.
It seems to me that you should make a generic Tuple interface (or use something like the Collection mentioned above), and have your pair and 3-tuple classes implement that interface. That way, you can take advantage of polymorphism but also allow a pair to use a simpler implementation than an arbitrary-sized tuple. You'd probably want to make your Tuple interface include .x and .y accessors as shorthand for the first two elements, and larger tuples can implement their own shorthands as appropriate for items with higher indices.
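A minimal Java sketch of that idea, under stated assumptions: the interface name Tuple, its size() and get() methods, and the classes Pair and Triple are all hypothetical names chosen for illustration. Instead of .x and .y on the interface, each class here exposes typed fields, and the interface only offers untyped positional access.

```java
/** Hypothetical common interface; the name and methods are assumptions. */
interface Tuple {
    int size();
    Object get(int index); // untyped positional access shared by every arity
}

final class Pair<A, B> implements Tuple {
    final A first;
    final B second;

    Pair(A first, B second) {
        this.first = first;
        this.second = second;
    }

    public int size() { return 2; }

    public Object get(int index) {
        switch (index) {
            case 0: return first;
            case 1: return second;
            default: throw new IndexOutOfBoundsException("index " + index);
        }
    }
}

final class Triple<A, B, C> implements Tuple {
    final A first;
    final B second;
    final C third;

    Triple(A first, B second, C third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }

    public int size() { return 3; }

    public Object get(int index) {
        switch (index) {
            case 0: return first;
            case 1: return second;
            case 2: return third;
            default: throw new IndexOutOfBoundsException("index " + index);
        }
    }
}

class TupleDemo {
    /** Works on any Tuple, whatever its arity. */
    static String describe(Tuple t) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(t.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Pair<>("x", 1)));
        System.out.println(describe(new Triple<>("x", 1, true)));
    }
}
```

Pair and Triple stay unrelated types here, so a method that declares a Pair parameter still rejects a Triple at compile time, while code written against Tuple accepts both.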
Like most design related questions, the answer is - It depends.
If you are looking for a conventional Tuple design, Tuple2, Tuple3 etc. is the way to go. The problem with inheritance is that, first of all, a Triplet is not a kind of Pair. How would you implement the equals method for it? Is a Triplet equal to a Pair with the same first two items? If you have a collection of Pairs, can you add a Triplet to it, or vice versa? If this is fine in your domain, you can go with inheritance.
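The equals question can be made concrete with a small Java sketch. The class names InheritingPair and InheritingTriple are hypothetical, and the equals implementations are just one plausible way someone might write them, not a recommendation.

```java
import java.util.Objects;

class InheritingPair {
    final Object first;
    final Object second;

    InheritingPair(Object first, Object second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof InheritingPair)) return false;
        InheritingPair p = (InheritingPair) o;
        return Objects.equals(first, p.first) && Objects.equals(second, p.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }
}

class InheritingTriple extends InheritingPair {
    final Object third;

    InheritingTriple(Object first, Object second, Object third) {
        super(first, second);
        this.third = third;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof InheritingTriple)) return false;
        return super.equals(o) && Objects.equals(third, ((InheritingTriple) o).third);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second, third);
    }
}

class EqualsDemo {
    public static void main(String[] args) {
        InheritingPair pair = new InheritingPair(1, 2);
        InheritingPair triple = new InheritingTriple(1, 2, 3);
        System.out.println(pair.equals(triple));  // true: the pair only compares two fields
        System.out.println(triple.equals(pair));  // false: the triple requires a third field
    }
}
```

With these definitions equals is no longer symmetric, which is the kind of trouble the questions above point at; whether that matters depends on the domain.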
In any case, it pays to have an interface/abstract class (maybe Tuple) which all these implement.
It depends on the semantics that you need:
a pair of opposites is not semantically compatible with a 3-tuple of similar objects
a pair of coordinates in polar space is not semantically compatible with a 3-tuple of coordinates in Euclidean space
if your semantics are simple compositions, then a generic class Tuple<N> would make more sense
I'd go with 0, 1, 2 or infinity, e.g. null, 1 object, your Pair class, or then a collection of some sort.
Your Pair could even implement a Collection interface.
If there's a specific relationship between Three or Four items, it should probably be named.
[Perhaps I'm missing the problem, but I just can't think of a case where I want to specifically link 3 things in a generic way]
Gilad Bracha blogged about tuples, which I found interesting reading.
One point he made (whether correctly or not I can't yet judge) was:
Literal tuples are best defined as read only. One reason for this is that readonly tuples are more polymorphic. Long tuples are subtypes of short ones:
{S. T. U. V } <= {S. T. U} <= {S. T} <= {S}
[and] read only tuples are covariant:
T1 <= T2, S1 <= S2 ==> {S1. T1} <= {S2. T2}
That would seem to suggest my inclination toward using inheritance may be correct, and would contradict amit.dev when he says that a Triple is not a Pair.
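As a rough illustration of the two subtyping rules quoted above, here is a Java sketch with read-only tuple interfaces. All names (ReadPair, ReadTriple, SimpleTriple, CovarianceDemo) are hypothetical. Java has no declaration-site covariance, so the covariant rule is approximated with use-site wildcards.

```java
/** Read-only views: no setters, so treating a long tuple as a short one is safe. */
interface ReadPair<S, T> {
    S first();
    T second();
}

/** Width subtyping: every ReadTriple can be used where a ReadPair is expected. */
interface ReadTriple<S, T, U> extends ReadPair<S, T> {
    U third();
}

final class SimpleTriple<S, T, U> implements ReadTriple<S, T, U> {
    private final S s;
    private final T t;
    private final U u;

    SimpleTriple(S s, T t, U u) {
        this.s = s;
        this.t = t;
        this.u = u;
    }

    public S first() { return s; }
    public T second() { return t; }
    public U third() { return u; }
}

class CovarianceDemo {
    /** Depth subtyping via wildcards: accepts any pair whose parts are subtypes. */
    static String show(ReadPair<? extends Number, ? extends CharSequence> p) {
        return p.first().intValue() + ":" + p.second().length();
    }

    public static void main(String[] args) {
        ReadTriple<Integer, String, Boolean> t = new SimpleTriple<>(7, "seven", true);
        System.out.println(show(t)); // a triple of (Integer, String, Boolean) passed as a pair
    }
}
```

The sketch only works because the interfaces are read-only; adding setters would make the wildcard parameter unusable for writes, which matches the point about literal tuples being best defined as read only.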

What is boxing and unboxing and what are the trade offs?

I'm looking for a clear, concise and accurate answer.
Ideally as the actual answer, although links to good explanations welcome.
Boxed values are data structures that are minimal wrappers around primitive types*. Boxed values are typically stored as pointers to objects on the heap.
Thus, boxed values use more memory and take at minimum two memory lookups to access: once to get the pointer, and another to follow that pointer to the primitive. Obviously this isn't the kind of thing you want in your inner loops. On the other hand, boxed values typically play better with other types in the system. Since they are first-class data structures in the language, they have the expected metadata and structure that other data structures have.
In Java and Haskell generic collections can't contain unboxed values. Generic collections in .NET can hold unboxed values with no penalties. Where Java's generics are only used for compile-time type checking, .NET will generate specific classes for each generic type instantiated at run time.
Java and Haskell have unboxed arrays, but they're distinctly less convenient than the other collections. However, when peak performance is needed it's worth a little inconvenience to avoid the overhead of boxing and unboxing.
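To make the contrast concrete, here is a small Java sketch that sums the same numbers from an unboxed int[] and from a boxed List<Integer>. The class and method names are made up for illustration, and no timing is claimed; the point is only where boxing and unboxing happen.

```java
import java.util.ArrayList;
import java.util.List;

class BoxedVsUnboxed {
    /** Sums a primitive array: each element is read directly, no wrapper objects involved. */
    static long sumUnboxed(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Sums a generic list: each element is an Integer object that is unboxed on read. */
    static long sumBoxed(List<Integer> values) {
        long total = 0;
        for (Integer v : values) {
            total += v; // implicit v.intValue()
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] unboxed = new int[n];
        List<Integer> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            unboxed[i] = i;
            boxed.add(i); // autoboxing: the int is wrapped in an Integer
        }
        System.out.println(sumUnboxed(unboxed));
        System.out.println(sumBoxed(boxed));
    }
}
```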
* For this discussion, a primitive value is any that can be stored on the call stack, rather than stored as a pointer to a value on the heap. Frequently that's just the machine types (ints, floats, etc), structs, and sometimes static sized arrays. .NET-land calls them value types (as opposed to reference types). Java folks call them primitive types. Haskellions just call them unboxed.
** I'm also focusing on Java, Haskell, and C# in this answer, because that's what I know. For what it's worth, Python, Ruby, and JavaScript all have exclusively boxed values. This is also known as the "Everything is an object" approach***.
*** Caveat: A sufficiently advanced compiler / JIT can in some cases actually detect that a value which is semantically boxed when looking at the source, can safely be an unboxed value at runtime. In essence, thanks to brilliant language implementors your boxes are sometimes free.
From C# 3.0 in a Nutshell:
Boxing is the act of casting a value type into a reference type:
int x = 9;
object o = x; // boxing the int
unboxing is... the reverse:
// unboxing o
object o = 9;
int x = (int)o;
Boxing & unboxing is the process of converting a primitive value into an object oriented wrapper class (boxing), or converting a value from an object oriented wrapper class back to the primitive value (unboxing).
For example, in Java, you may need to convert an int value into an Integer (boxing) if you want to store it in a Collection, because primitives can't be stored in a Collection, only objects. But when you want to get it back out of the Collection you may want the value as an int and not an Integer, so you would unbox it.
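A short Java sketch of that round trip, first with the conversions written out by hand and then with the compiler inserting them. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ExplicitBoxing {
    /** Boxing and unboxing written out by hand. */
    static int roundTrip(int primitive) {
        List<Integer> list = new ArrayList<>();
        Integer boxed = Integer.valueOf(primitive); // boxing
        list.add(boxed);
        int unboxed = list.get(0).intValue();       // unboxing
        return unboxed;
    }

    /** The same round trip, relying on autoboxing and auto-unboxing. */
    static int roundTripAuto(int primitive) {
        List<Integer> list = new ArrayList<>();
        list.add(primitive);  // the compiler inserts the boxing conversion
        return list.get(0);   // the compiler inserts the unboxing conversion
    }
}
```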
Boxing and unboxing is not inherently bad, but it is a tradeoff. Depending on the language implementation, it can be slower and more memory intensive than just using primitives. However, it may also allow you to use higher level data structures and achieve greater flexibility in your code.
These days, it is most commonly discussed in the context of Java's (and other languages') "autoboxing/autounboxing" feature. Here is a Java-centric explanation of autoboxing.
In .NET:
Often you can't know in advance what type of variable a function will consume, so you need a variable typed as the lowest common denominator; in .NET this is object.
However, object is a class and stores its contents as a reference.
List<int> notBoxed = new List<int> { 1, 2, 3 };
int i = notBoxed[1]; // this is the actual value
List<object> boxed = new List<object> { 1, 2, 3 };
int j = (int) boxed[1]; // this is an object that can be 'unboxed' to an int
While both these hold the same information the second list is larger and slower. Each value in the second list is actually a reference to an object that holds the int.
This is called boxing because the int is wrapped by the object. When it's cast back, the int is unboxed: converted back to its value.
For value types (i.e. all structs) this is slow, and potentially uses a lot more space.
For reference types (i.e. all classes) this is far less of a problem, as they are stored as a reference anyway.
A further problem with a boxed value type is that it's not obvious that you're dealing with the box, rather than the value. When you compare two structs then you're comparing values, but when you compare two classes then (by default) you're comparing the reference - i.e. are these the same instance?
This can be confusing when dealing with boxed value types:
int a = 7;
int b = 7;
bool sameInts = (a == b); // true, because a and b have the same value
object c = (object) 7;
object d = (object) 7;
bool sameBoxes = (c == d); // false, because c and d are different instances
It's easy to work around:
bool equalByCall = c.Equals(d); // true, because it calls the underlying int's Equals
bool equalByCast = ((int) c) == ((int) d); // true once the values are cast
However it is another thing to be careful of when dealing with boxed values.
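Java has the same trap, with one extra twist. The sketch below is an illustration with hypothetical names: == on two Integer variables compares references, but the Java language specification requires small constants (-128 to 127) to share a box, so the result also depends on the value. The comparison of the two 1000 boxes is normally false on a default JVM, though that is not guaranteed.

```java
class BoxedEquality {
    /** Compares the wrapped ints through equals. */
    static boolean sameValue(Integer a, Integer b) {
        return a.equals(b);
    }

    /** Compares the wrapped ints after unboxing both sides. */
    static boolean sameValueUnboxed(Integer a, Integer b) {
        return a.intValue() == b.intValue();
    }

    public static void main(String[] args) {
        Integer c = 1000;
        Integer d = 1000;
        // Reference comparison: normally false, because values outside
        // -128..127 are usually boxed into separate objects.
        System.out.println(c == d);
        System.out.println(sameValue(c, d));        // true
        System.out.println(sameValueUnboxed(c, d)); // true

        Integer e = 7;
        Integer f = 7;
        // Small constants must share a box, so this reference comparison is true.
        System.out.println(e == f);
    }
}
```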
Boxing is the conversion of a value type into a reference type, whereas unboxing is the conversion of a reference type into a value type.
Example:
int i = 123;
object o = i; // Boxing
int j = (int)o; // Unboxing
Value types are: int, char, structures, and enumerations.
Reference types are: classes, interfaces, arrays, strings, and objects.
The language-agnostic meaning of a box is just "an object that contains some other value".
Literally, boxing is an operation to put some value into the box. More specifically, it is an operation to create a new box containing the value. After boxing, the boxed value can be accessed from the box object, by unboxing.
Note that objects (not OOP-specific) in many programming languages are about identities, but values are not. Two objects are the same iff they have identities that are not distinguishable in the program semantics. Values can also be the same (usually under some equality operator), but we do not distinguish them as "one" or "two" unique values.
Providing boxes is mainly an effort to distinguish side effects (typically mutation) from states of the objects that would otherwise probably be invisible to users.
A language may limit the allowed ways to access an object and hide the identity of the object by default. For example, typical Lisp dialects have no explicit distinction between objects and values. As a result, the implementation has the freedom to share the underlying storage of the objects until some mutation operation occurs on the object (so the object must be "detached" from the shared instance after the operation to make the effect visible, i.e. the mutated value stored in the object can differ from that of the other objects still holding the old value). This technique is sometimes called object interning.
Interning makes the program more memory efficient at runtime if the objects are shared without frequent need for mutation, at the cost that:
The users cannot distinguish the identity of the objects.
There is no way to identify an object and to ensure it has state explicitly independent of other objects in the program before some side effect has actually occurred (and the implementation does not aggressively do the interning concurrently; this should be the rare case, though).
There may be more problems with interoperation that requires identifying different objects for different operations.
There are risks that such assumptions can be false, so the performance is actually made worse by applying the interning.
This depends on the programming paradigm. Imperative programming which mutates objects frequently certainly would not work well with interning.
Implementations depending on COW (copy-on-write) to ensure interning can incur serious performance degradation in concurrent environments.
Even local sharing specifically for a few internal data structures can be bad. For example, ISO C++ 11 did not allow sharing of the internal elements of std::basic_string for this reason exactly, even at the cost of breaking the ABI on at least one mainstream implementation (libstdc++).
Boxing and unboxing incur performance penalties. This is especially obvious when these operations can be naively avoided by hand but are not actually easy for the optimizer to avoid. The concrete cost depends on the implementation, or even on the individual program, though.
Mutable cells, i.e. boxes, are well-established facilities that resolve exactly the problems in the 1st and 2nd bullets listed above. Additionally, there can be immutable boxes for implementing assignment in a functional language. See SRFI-111 for a practical instance.
Using mutable cells as function arguments with a call-by-value strategy makes the visible effects of mutation shared between the caller and the callee. The object contained in a box is effectively "called by sharing" in this sense.
Sometimes boxes are referred to as references (which is technically false), so the shared semantics are named "reference semantics". This is not correct, because not all references can propagate visible side effects (e.g. immutable references). References are more about exposing access by indirection, while boxes are an effort to expose minimal details of the access, such as whether there is indirection or not (which is uninteresting and better avoided by the implementation).
Moreover, "value semantics" is irrelevant here. Values are not opposed to references, nor to boxes. All the discussion above is based on the call-by-value strategy. For others (like call-by-name or call-by-need), no boxes are needed to share object contents in this way.
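The mutable-cell idea described above can be sketched in Java, which passes every argument by value. The class name Box and the helper methods are hypothetical, loosely modelled on the kind of box SRFI-111 describes; this is an illustration, not a standard library type.

```java
/** A minimal mutable cell; the name Box is an assumption for this sketch. */
final class Box<T> {
    T value;

    Box(T value) {
        this.value = value;
    }
}

class BoxDemo {
    /** Reassigns only the callee's local copy; the caller sees no change. */
    static void incrementCopy(int n) {
        n = n + 1;
    }

    /** Mutates the cell's contents; the caller holds the same cell and sees the change. */
    static void incrementInBox(Box<Integer> cell) {
        cell.value = cell.value + 1;
    }

    public static void main(String[] args) {
        int plain = 1;
        incrementCopy(plain);
        System.out.println(plain);      // 1: the mutation stayed in the callee

        Box<Integer> cell = new Box<>(1);
        incrementInBox(cell);
        System.out.println(cell.value); // 2: the mutation is shared through the box
    }
}
```

The cell itself is still passed by value (the callee gets a copy of the reference to it); only its contents are shared.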
Java is probably the first programming language to make these features popular in the industry. Unfortunately, there seem to be many bad consequences related to this topic:
The overall programming paradigm does not fit the design.
Practically, interning is limited to specific objects like immutable strings, and the cost of (auto-)boxing and unboxing is often blamed.
Fundamental PL knowledge, like the definition of the term "object" (as "instance of a class") in the language specification, as well as the descriptions of parameter passing, became biased compared to the original, well-known meaning during the adoption of Java by programmers.
At least the CLR languages follow similar parlance.
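The interning of immutable strings mentioned in the list above can be observed directly in Java. This is a small sketch with a hypothetical class name; it relies only on String literals being interned and on String.intern() returning the canonical instance.

```java
class InternDemo {
    /** Reference comparison: true only if both arguments are the same object. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "box";
        String copy = new String("box"); // forces a distinct object with equal contents

        System.out.println(sameObject(literal, copy));          // false: two objects
        System.out.println(sameObject(literal, copy.intern())); // true: one shared instance
        System.out.println(literal.equals(copy));               // true: same characters
    }
}
```

Because strings are immutable, sharing one instance is invisible to the program except through reference comparison, which is the identity cost discussed earlier.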
Some more tips on implementations (and comments to this answer):
Whether to put objects on the call stack or the heap is an implementation detail, and irrelevant to the implementation of boxes.
Some language implementations do not maintain a contiguous storage as the call stack.
Some language implementations do not even make the (per thread) activation records a linear stack.
Some language implementations do allocate stacks on the free store ("the heap") and transfer slices of frames between the stacks and the heap back and forth.
These strategies have nothing to do with boxes. For instance, many Scheme implementations have boxes, with different activation record layouts, including all the ways listed above.
Besides the technical inaccuracy, the statement "everything is an object" is irrelevant to boxing.
Python, Ruby, and JavaScript all use latent typing (by default), so all identifiers referring to some objects will evaluate to values having the same static type. So does Scheme.
Some JavaScript and Ruby implementations use so-called NaN-boxing to allow inline allocation of some objects. Some others (including CPython) do not. With NaN-boxing, a normal double object needs no unboxing to access its value, while a value of some other type can be boxed in a host double object, and there is no reference for the double or the boxed value. With the naive pointer approach, a value of a host object pointer type like PyObject* is an object reference holding a box whose boxed value is stored in dynamically allocated space.
At least in Python, objects are not "everything". They are also not known as "boxed values" unless you are talking about interoperability with specific implementations.
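To give a feel for NaN-boxing, here is a toy Java sketch that packs either a double or a 32-bit int into one 64-bit word. The tag value and all names are assumptions made for this illustration; real engines use their own layouts and also store pointers and other types this way.

```java
class NanBox {
    // Top 16 bits used as a tag: exponent all ones plus two mantissa bits.
    // This is a NaN bit pattern that differs from the canonical NaN (0x7FF8...).
    private static final long INT_TAG = 0x7FFCL << 48;
    private static final long TAG_MASK = 0xFFFFL << 48;

    /** Stores a double as its own bit pattern, so no separate box is needed. */
    static long fromDouble(double d) {
        // doubleToLongBits collapses every NaN to 0x7FF8000000000000,
        // so a genuine NaN can never collide with INT_TAG.
        return Double.doubleToLongBits(d);
    }

    /** Hides an int in the low 32 bits of a tagged NaN pattern. */
    static long fromInt(int i) {
        return INT_TAG | (i & 0xFFFFFFFFL);
    }

    static boolean isInt(long word) {
        return (word & TAG_MASK) == INT_TAG;
    }

    static int toInt(long word) {
        return (int) word; // keeps the low 32 bits, restoring the sign
    }

    static double toDouble(long word) {
        return Double.longBitsToDouble(word);
    }
}
```

A value is first tested with isInt and then decoded with toInt or toDouble; neither path allocates an object, which is the attraction of the scheme.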
The .NET FCL generic collections:
List<T>
Dictionary<TKey, TValue>
SortedDictionary<TKey, TValue>
Stack<T>
Queue<T>
LinkedList<T>
were all designed to overcome the performance issues of boxing and unboxing in previous collection implementations.
For more, see chapter 16, CLR via C# (2nd Edition).
Boxing and unboxing allow value types to be treated as objects. Boxing means converting a value to an instance of the object reference type. For example, in Java, Integer is a class and int is a primitive data type: converting an int to an Integer is boxing, whereas converting an Integer to an int is unboxing. The concept helps in garbage collection. Unboxing, on the other hand, converts an object type to a value type.
int i=123;
object o=(object)i; //Boxing
o=123;
i=(int)o; //Unboxing.
Like anything else, autoboxing can be problematic if not used carefully. The classic is to end up with a NullPointerException and not be able to track it down, even with a debugger. Try this:
public class TestAutoboxNPE
{
    public static void main(String[] args)
    {
        Integer i = null;
        // .. do some other stuff and forget to initialise i
        i = addOne(i); // Whoa! NPE!
    }
    public static int addOne(int i)
    {
        return i + 1;
    }
}
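The exception comes from the hidden unboxing step: passing an Integer where an int is expected makes the compiler call intValue() on it, and that call fails on null. The sketch below, with hypothetical names, shows the failure being observed and one possible guard; it is an illustration rather than the only way to handle it.

```java
import java.util.Objects;

class AutoboxNullGuard {
    static int addOne(int i) {
        return i + 1;
    }

    /** Returns true if unboxing the argument throws, as in the example above. */
    static boolean unboxingThrows(Integer boxed) {
        try {
            addOne(boxed); // compiled as addOne(boxed.intValue())
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** One possible guard: fail early with a message that names the problem. */
    static int addOneChecked(Integer boxed) {
        Objects.requireNonNull(boxed, "boxed value must not be null");
        return addOne(boxed);
    }
}
```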