std::string cmd = get_itcl_obj_name() + " set_ude_jobname " + name;
Tcl_Obj* cmdobj = Tcl_NewStringObj(cmd.c_str(),-1 );
if(Tcl_EvalObjEx(interp, cmdobj, TCL_EVAL_GLOBAL) == TCL_OK)
{
return true;
}
else
{
return false;
}
I have this piece of code, which is called thousands of times and consumes a lot of memory, and I can't decide whether this Tcl object pointer needs to be deleted.
A Tcl_Obj points to a piece of memory that is 24 bytes long on a 32-bit system (I've not measured on 64-bit systems, but it could be up to 48 bytes, depending on how the compiler packs the structure). It can also point to further memory that can be variable in size; gigabytes is most certainly possible on 64-bit systems. In particular, the string representation of the value can be hanging off it, as can other things like list or dictionary representations. (The details are very variable under the covers.)
So yes, they should be released properly once no longer needed! While you never need to do that at the Tcl language level (other than by using unset on global variables occasionally) it's vital that you get the reference count management correct at the Tcl C API level.
In the specific case of the code you're working with, you need to both Tcl_IncrRefCount(cmdobj) before the call to Tcl_EvalObjEx and Tcl_DecrRefCount(cmdobj) after it. Since you're using C++, a helper RAII class might make this easier.
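As a rough illustration of what such a RAII helper could look like, here is a minimal sketch. To keep it compilable without tcl.h it uses a stand-in `FakeObj` type and stand-in `incrRef`/`decrRef` functions; all three names are hypothetical. In real code the constructor would call Tcl_IncrRefCount on the Tcl_Obj and the destructor would call Tcl_DecrRefCount.

```cpp
// Stand-in for Tcl_Obj so the sketch compiles without tcl.h (hypothetical type).
struct FakeObj {
    int refCount = 0;
    bool freed = false;
};

// Stand-ins for Tcl_IncrRefCount / Tcl_DecrRefCount (hypothetical functions).
inline void incrRef(FakeObj* obj) { ++obj->refCount; }
inline void decrRef(FakeObj* obj) {
    if (--obj->refCount <= 0) obj->freed = true;  // the real call would free the object
}

// Holds one reference for as long as the guard is in scope.
class ObjRef {
public:
    explicit ObjRef(FakeObj* obj) : obj_(obj) { incrRef(obj_); }
    ~ObjRef() { decrRef(obj_); }

    // Copying would need its own increment, so it is simply disabled here.
    ObjRef(const ObjRef&) = delete;
    ObjRef& operator=(const ObjRef&) = delete;

    FakeObj* get() const { return obj_; }

private:
    FakeObj* obj_;
};
```

With a guard like this declared right after the object is created, the reference is dropped on every return path, including early returns and exceptions.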
Or you could use Tcl_GlobalEval, which these days is just a wrapper around the same underlying basic function but does the Tcl_Obj memory management for you.
std::string cmd = get_itcl_obj_name() + " set_ude_jobname " + name;
if (Tcl_GlobalEval(interp, cmd.c_str()) == TCL_OK) {
return true;
} else {
return false;
}
(There are some minor performance differences, but they really don't matter. Or if they did, we'd advise a more substantial rewrite to use Tcl_EvalObjv, but that's a much larger change. Either that or maybe to do some caching of the Tcl_Obj so that Tcl can hang a bytecode compilation off the back-end of it usefully. All of which are quite a lot larger changes than you've got in mind!)
Recently I've been doing string comparison jobs on CUDA, and I wonder how a __global__ function can return a value when it finds the exact string that I'm looking for.
I mean, I need the __global__ function, which runs a great number of threads, to search for a certain string in a big string pool simultaneously, and I hope that once the exact string is found, the __global__ function can stop all the threads, return to the main function, and tell me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have the kernel exit immediately as soon as a result is found; it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global__ void kernel(volatile bool *found, ...)
{
while (!(*found) && workLeftToDo()) {
bool iFoundIt = do_some_work(...); // see notes below
if (iFoundIt) *found = true;
}
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global__ void kernel(volatile bool *found, ...)
{
volatile __shared__ bool someoneFoundIt;
// initialize shared status
if (threadIdx.x == 0) someoneFoundIt = *found;
__syncthreads();
while(!someoneFoundIt && workLeftToDo()) {
bool iFoundIt = do_some_work(...);
// if I found it, tell everyone they can exit
if (iFoundIt) { someoneFoundIt = true; *found = true; }
// if someone in another block found it, tell
// everyone in my block they can exit
if (threadIdx.x == 0 && *found) someoneFoundIt = true;
__syncthreads();
}
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
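The polling idea itself can be tried out on the CPU. The following is only a host-side C++ analogue, not CUDA code: `worker`, its parameters, and the strided task allocation are assumptions made for illustration, and `std::atomic` plays the role that the volatile device flag plays in the kernels above. Each worker re-checks the shared flag before every unit of work, so it stops at the next check after someone else has found a match.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CPU analogue of the polling kernel. Worker `workerId` out of
// `numWorkers` strides through the pool and polls the shared flag before each
// comparison. Returns how many strings this worker actually compared.
inline std::size_t worker(const std::vector<std::string>& pool,
                          const std::string& target,
                          std::size_t workerId, std::size_t numWorkers,
                          std::atomic<bool>& found,
                          std::atomic<long>& foundIndex)
{
    std::size_t compared = 0;
    for (std::size_t i = workerId; i < pool.size(); i += numWorkers) {
        // Poll: exit as soon as possible once any worker has reported a match.
        if (found.load(std::memory_order_acquire)) break;
        ++compared;
        if (pool[i] == target) {
            // Publish the result first, then raise the flag.
            foundIndex.store(static_cast<long>(i), std::memory_order_relaxed);
            found.store(true, std::memory_order_release);
            break;
        }
    }
    return compared;
}
```

In a real program each `worker` call would run on its own `std::thread`; the amount of work done between two polls is the same trade-off described in the notes on do_some_work.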
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows you to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The global function doesn't really contain a great number of threads like you think it does. It is simply a kernel, a function that runs on the device, that is called by passing parameters that specify the thread model. The model that CUDA employs is a 2D grid model and then a 3D thread model inside of each block on the grid.
With the type of problem you have, it is not really necessary to use anything besides a 1D grid with 1D threads in each block, because the string pool doesn't really make sense to split into 2D like other problems (e.g. matrix multiplication).
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudaMalloc and cudaMemcpy the data to the device before this code
dim3 dimGrid(10, 1); // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); //1D Blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(strings, stringToMatch, answerIdx); // device pointers, named as in the kernel signature
//cudaMemcpy answerIdx back to integer on host
//kernel (Not positive on these types as my CUDA is very rusty)
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
int idx = blockIdx.x * 10 + threadIdx.x;
//Obviously use whatever function you've been using for string comparison
//I'm just using == for example's sake
if(strings[idx] == stringToMatch)
{
*answerIdx = idx;
}
}
This is obviously not the most efficient and is most likely not the exact way to pass parameters and work with memory in CUDA, but I hope it gets the point across of splitting the workload, and that the 'global' functions get executed on many different cores so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speedup you will get by just dividing the workload onto the device (in a sensible fashion, of course) will already give you tremendous performance improvements. To get a sense of the thread model, I highly recommend reading the CUDA documents on Nvidia's site. They will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.
How do you go about implementing FSM (EDIT: Finite State Machine) states?
I usually think about an FSM as a set of functions,
a dispatcher,
and a thread to indicate the currently running state.
Meaning, I do blocking calls to functions/functors representing
states.
Just now I have implemented one in a different style,
where I still represent states with function(object)s, but the thread
just calls a state->step() method, which tries to return
as quickly as possible. In case the state has finished and a
transition should take place, it indicates that accordingly.
I would call this the 'polling' style since the functions mostly look
like:
void step()
{
if(!HaveReachedGoal)
{
doWhateverNecessary();
return; // get out as fast as possible
}
// ... test perhaps some more subgoals
indicateTransition();
}
I am aware that it is an FSM within an FSM.
It feels rather simplistic, but it has certain advantages.
While a thread being blocked, or held in some kind of
while (!CanGoForward)checkGoForward();
loop can be cumbersome and unwieldy,
the polling felt much easier to debug.
That's because the FSM object regains control after
every step, and putting out some debug info is a breeze.
Well I am deviating from my question:
How do you implement states of FSMs?
The state Design Pattern is an interesting way of implementing a FSM:
http://en.wikipedia.org/wiki/State_pattern
It's a very clean way of implementing the FSM, but it can be messy depending on the complexity of your FSM (but not the number of states). However, the advantages are that:
You eliminate code duplication (especially if/else statements)
It is easier to extend with new states
Your classes have better cohesion, so all related logic is in one place - this should also make your code easier to write tests for.
There is a Java and C++ implementation at this site:
http://www.vincehuston.org/dp/state.html
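To make the State pattern a bit more concrete, here is a minimal C++ sketch of my own (it is not taken from either of the linked pages). The turnstile example, the `State`, `Locked`, `Unlocked` and `Turnstile` names, and the single-character events ('c' for coin, 'p' for push) are all assumptions chosen for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>

// Each state is a class; handling an event may yield a successor state.
struct State {
    virtual ~State() = default;
    virtual std::string name() const = 0;
    // Returns the next state, or nullptr to stay in the current one.
    virtual std::unique_ptr<State> on(char event) = 0;
};

struct Locked : State {
    std::string name() const override { return "Locked"; }
    std::unique_ptr<State> on(char event) override;
};

struct Unlocked : State {
    std::string name() const override { return "Unlocked"; }
    std::unique_ptr<State> on(char event) override;
};

// A coin ('c') unlocks the turnstile; everything else is ignored.
std::unique_ptr<State> Locked::on(char event) {
    if (event == 'c') return std::make_unique<Unlocked>();
    return nullptr;
}

// A push ('p') locks it again; everything else is ignored.
std::unique_ptr<State> Unlocked::on(char event) {
    if (event == 'p') return std::make_unique<Locked>();
    return nullptr;
}

// The context owns the current state and delegates every event to it.
class Turnstile {
public:
    Turnstile() : state_(std::make_unique<Locked>()) {}
    void handle(char event) {
        if (auto next = state_->on(event)) state_ = std::move(next);
    }
    std::string state() const { return state_->name(); }

private:
    std::unique_ptr<State> state_;
};
```

Adding a state means adding one class and touching only the states that transition into it, which is where the "easier to extend" advantage comes from.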
There’s always what I call the Flying Spaghetti Monster’s style of implementing FSMs (FSM-style FSMs): using lotsa gotos. For example:
state1:
do_something();
goto state2;
state2:
if (condition) goto state1;
else goto state3;
state3:
accept;
Very nice spaghetti code :-)
I did it as a table: a flat array in memory where each cell is a state. Please have a look at the CVS source of the abandoned DFA project. For example:
class DFA {
DFA();
DFA(int mychar_groups,int mycharmap[256],int myi_state);
~DFA();
void add_trans(unsigned int from,char sym,unsigned int to);
void add_trans(unsigned int from,unsigned int symn,unsigned int to);
/*adds a transition between state from to state to*/
int add_state(bool accepting=false);
int to(int state, int symn);
int to(int state, char sym);
void set_char(char used_chars[],int);
void set_char(set<char> char_set);
vector<int > table; /*contains the table of the dfa itself*/
void normalize();
vector<unsigned int> char_map;
unsigned int char_groups; /*number of characters the DFA uses,
char_groups=0 means 1 character group is used*/
unsigned int i_state; /*initial state of the DFA*/
void switch_table_state(int first,int sec);
unsigned int num_states;
set<int > accepting_states;
};
But this was for a very specific need (matching regular expressions)
I remember my first FSM program. I wrote it in C with a very simple switch statement. Switching from one state to another or following through to the next state seemed natural.
Then I progressed to a table lookup approach. I was able to write some very generic code using this approach. However, I was caught out a couple of times when the requirements changed and I had to support some extra events.
I have not written any FSMs lately. The last one I wrote was for a comms module in C++ where I used a "state design pattern" in conjunction with a "command pattern" (action).
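For reference, the table lookup approach mentioned above can be sketched roughly as follows. The states, the events, and the `nextState` helper are hypothetical names picked for the example; the point is only that the whole machine is one 2D array indexed by state and event.

```cpp
#include <array>
#include <cstddef>

// Hypothetical states and events; NumStates/NumEvents are just the counts.
enum State : std::size_t { Idle, Running, Done, NumStates };
enum Event : std::size_t { Start, Finish, Reset, NumEvents };

// kTable[state][event] is the state to move to.
constexpr std::array<std::array<State, NumEvents>, NumStates> kTable = {{
    /* Idle    */ {{Running, Idle, Idle}},
    /* Running */ {{Running, Done, Idle}},
    /* Done    */ {{Done,    Done, Idle}},
}};

// One transition is a single table lookup.
inline State nextState(State current, Event event) {
    return kTable[current][event];
}
```

The weakness described above shows up here too: supporting an extra event means adding a column to every row of the table.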
If you are creating a complex state machine then you may want to check out SMC - the State Machine Compiler. This takes a textual representation of a state machine and compiles it into the language of your choice - it supports Java, C, C++, C#, Python, Ruby, Scala and many others.
What do you mean by Atomic instructions?
How does the following become Atomic?
TestAndSet
int TestAndSet(int *x){
register int temp = *x;
*x = 1;
return temp;
}
From a software perspective, if one does not want to use non-blocking synchronization primitives, how can one ensure the atomicity of an instruction? Is it possible only at the hardware level, or can some assembly-level directive or optimization be used?
Some machine instructions are intrinsically atomic - for example, reading and writing properly aligned values of the native processor word size is atomic on many architectures.
This means that hardware interrupts, other processors and hyper-threads cannot interrupt the read or store and read or write a partial value to the same location.
More complicated things such as reading and writing together atomically can be achieved by explicit atomic machine instructions e.g. LOCK CMPXCHG on x86.
Locking and other high-level constructs are built on these atomic primitives, which typically only guard a single processor word.
Some clever concurrent algorithms can be built using just the reading and writing of pointers e.g. in linked lists shared between a single reader and writer, or with effort, multiple readers and writers.
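In C++11 and later, the TestAndSet from the question can be written on top of std::atomic, which lets the compiler pick a suitable atomic instruction for the target. This is a minimal sketch: `test_and_set` and `SpinLock` are names I made up, and the spin lock is only there to show how a lock can be built from the primitive.

```cpp
#include <atomic>

// Atomic counterpart of TestAndSet: store 1 and return the previous value
// as one indivisible read-modify-write operation.
inline int test_and_set(std::atomic<int>& x) {
    return x.exchange(1);
}

// A very small spin lock built on top of test_and_set (illustrative only).
class SpinLock {
public:
    void lock() {
        // Keep trying until the previous value was 0, i.e. we took the lock.
        while (test_and_set(flag_) == 1) { /* spin */ }
    }
    bool try_lock() { return test_and_set(flag_) == 0; }
    void unlock() { flag_.store(0); }

private:
    std::atomic<int> flag_{0};
};
```

The plain C version in the question is not atomic because the read of *x and the write of 1 are two separate steps that another thread can get in between.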
Below are some of my notes on Atomicity that may help you understand the meaning. The notes are from the sources listed at the end and I recommend reading some of them if you need a more thorough explanation rather than point-form bullets as I have. Please point out any errors so that I may correct them.
Definition :
From the Greek meaning "not divisible into smaller parts"
An "atomic" operation is always observed to be done or not done, but
never halfway done.
An atomic operation must be performed entirely or not performed at
all.
In multi-threaded scenarios, a variable goes from unmutated to
mutated directly, with no "halfway mutated" values
Example 1 : Atomic Operations
Consider the following integers used by different threads :
int X = 2;
int Y = 1;
int Z = 0;
Z = X; //Thread 1
X = Y; //Thread 2
In the above example, two threads make use of X, Y, and Z
Each read and write are atomic
The threads will race :
If thread 1 wins, then Z = 2
If thread 2 wins, then Z=1
Z will definitely be one of those two values
Example 2 : Non-Atomic Operations : ++/-- Operations
Consider the increment/decrement expressions :
i++; //increment
i--; //decrement
The operations translate to :
Read i
Increment/decrement the read value
Write the new value back to i
The operations are each composed of 3 atomic operations, and are not atomic themselves
Two attempts to increment i on separate threads could interleave such that one of the increments is lost
Example 3 - Non-Atomic Operations : Values greater than 4-Bytes
Consider the following immutable struct :
struct MyLong
{
public readonly int low;
public readonly int high;
public MyLong(int low, int high)
{
this.low = low;
this.high = high;
}
}
We create fields with specific values of type MyLong :
MyLong X = new MyLong(0xAAAA, 0xAAAA);
MyLong Y = new MyLong(0xBBBB, 0xBBBB);
MyLong Z = new MyLong(0xCCCC, 0xCCCC);
We modify our fields in separate threads without thread safety :
X = Y; //Thread 1
Y = Z; //Thread 2
In .NET, when copying a value type, the CLR doesn't call a constructor - it moves the bytes one atomic operation at a time
Because of this, the operations in the two threads are now four atomic operations
If there is no thread safety enforced, the data can be corrupted
Consider the following execution order of operations :
X.low = Y.low; //Thread 1 - X = 0xAAAABBBB
Y.low = Z.low; //Thread 2 - Y = 0xBBBBCCCC
Y.high = Z.high; //Thread 2 - Y = 0xCCCCCCCC
X.high = Y.high; //Thread 1 - X = 0xCCCCBBBB <-- corrupt value for X
Reading and writing values greater than 32-bits on multiple threads on a 32-bit operating system without adding some sort of locking to make the operation atomic is likely to result in corrupt data as above
Processor Operations
On all modern processors, you can assume that reads and writes of naturally aligned native types are atomic as long as :
1 : The memory bus is at least as wide as the type being read or written
2 : The CPU reads and writes these types in a single bus transaction, making it impossible for other threads to see them in a half-completed state
On x86 and x64 there is no guarantee that reads and writes larger than eight bytes are atomic
Processor vendors define the atomic operations for each processor in a Software Developer's Manual
In single processors / single core systems it is possible to use standard locking techniques to prevent CPU instructions from being interrupted, but this can be inefficient
Disabling interrupts is another more efficient solution, if possible
In multiprocessor / multicore systems it is still possible to use locks but merely using a single instruction or disabling interrupts does not guarantee atomic access
Atomicity can be achieved by ensuring that the instructions used assert the 'LOCK' signal on the bus to prevent other processors in the system from accessing the memory at the same time
Language Differences
C#
C# guarantees that reads and writes of any built-in value type that takes up to 4 bytes are atomic
Reads and writes of value types that take more than four bytes (double, long, etc.) are not guaranteed to be atomic
The CLI guarantees that reads and writes of variables of value type that are the size (or smaller) of the processor's natural pointer size are atomic
Ex - running C# on a 64-bit OS in a 64-bit version of the CLR performs reads and writes of 64-bit doubles and long integers atomically
Creating atomic operations :
.NET provides the Interlocked class as part of the System.Threading namespace
The Interlocked Class provides atomic operations such as increment, compare, exchange, etc.
using System.Threading;
int unsafeCount;
int safeCount;
unsafeCount++;
Interlocked.Increment(ref safeCount);
C++
C++ standard does not guarantee atomic behavior
All C / C++ operations are presumed non-atomic unless otherwise specified by the compiler or hardware vendor - including 32-bit integer assignment
Creating atomic operations :
The C++11 concurrency library includes the Atomic Operations Library (<atomic>)
The Atomic library provides atomic types as a template class to use with any type you want
Operations on atomic types are atomic and thus thread-safe
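#include <atomic>

struct AtomicCounter
{
    std::atomic<int> value{0}; // start from a defined value
    void increment(){
        ++value; // atomic read-modify-write
    }
    void decrement(){
        --value; // atomic read-modify-write
    }
    int get(){
        return value.load();
    }
};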
Java
Java guarantees that reads and writes of any built-in primitive type that takes up to 4 bytes are atomic
Assignments to volatile longs and doubles are also guaranteed to be atomic
Java provides a small toolkit of classes that support lock-free thread-safe programming on single variables through java.util.concurrent.atomic
This provides atomic lock-free operations based on low-level atomic hardware primitives such as compare-and-swap (CAS) - also called compare and set :
CAS form - boolean compareAndSet(expectedValue, updateValue );
This method atomically sets a variable to the updateValue if it currently holds the expectedValue - reporting true on success
import java.util.concurrent.atomic.AtomicInteger;
public class Counter
{
private AtomicInteger value= new AtomicInteger();
public int increment(){
return value.incrementAndGet();
}
public int getValue(){
return value.get();
}
}
Sources
http://www.evernote.com/shard/s10/sh/c2735e95-85ae-4d8c-a615-52aadc305335/99de177ac05dc8635fb42e4e6121f1d2
Atomic comes from the Greek ἄτομος (atomos) which means "indivisible". (Caveat: I don't speak Greek, so maybe it's really something else, but most English speakers citing etymologies interpret it this way. :-)
In computing, this means that the operation, well, happens. There isn't any intermediate state that's visible before it completes. So if your CPU gets interrupted to service hardware (IRQ), or if another CPU is reading the same memory, it doesn't affect the result, and these other operations will observe it as either completed or not started.
As an example... let's say you wanted to set a variable to something, but only if it has not been set before. You might be inclined to do this:
if (foo == 0)
{
foo = some_function();
}
But what if this is run in parallel? It could be that the program will fetch foo, see it as zero, meanwhile thread 2 comes along and does the same thing and sets the value to something. Back in the original thread, the code still thinks foo is zero, and the variable gets assigned twice.
For cases like this, the CPU provides some instructions that can do the comparison and the conditional assignment as an atomic entity. Hence, test-and-set, compare-and-swap, and load-linked/store-conditional. You can use these to implement locks (your OS and your C library has done this.) Or you can write one-off algorithms that rely on the primitives to do something. (There's cool stuff to be done here, but most mere mortals avoid this for fear of getting it wrong.)
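As a sketch of how the "set it only if it has not been set before" example looks with compare-and-swap, here is a C++ version using std::atomic. The `set_once` name is made up, and note one difference from the snippet above: the new value has to be computed before the attempt, so some_function() may end up being called by a thread whose result is then discarded.

```cpp
#include <atomic>

// Atomically set `foo` to `value`, but only if it is still 0.
// Returns true if this call performed the assignment.
inline bool set_once(std::atomic<int>& foo, int value) {
    int expected = 0;
    // The comparison and the store happen as one indivisible operation.
    // On failure, `expected` is overwritten with the value actually found.
    return foo.compare_exchange_strong(expected, value);
}
```

If two threads race on `set_once`, exactly one of them gets true back, and the variable is assigned only once.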
Atomicity is a key concept when you have any form of parallel processing (including different applications cooperating or sharing data) that includes shared resources.
The problem is well illustrated with an example. Let's say you have two programs that want to create a file, but only if the file doesn't already exist. Either of the two programs can create the file at any point in time.
If you do (I'll use C since it's what's in your example):
...
f = fopen ("SYNCFILE","r");
if (f == NULL) {
f = fopen ("SYNCFILE","w");
}
...
you can't be sure that the other program hasn't created the file between your open for read and your open for write.
There's no way you can do this on your own. You need help from the operating system, which usually provides synchronization primitives for this purpose, or from another mechanism that is guaranteed to be atomic (for example a relational database where the lock operation is atomic, or a lower-level mechanism like a processor's "test and set" instruction).
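For this particular file example, one way to get that help is the exclusive-create mode of fopen: the "x" flag (C11, and therefore also part of C++17's <cstdio>) asks the system to create the file only if it does not exist, as a single operation. A minimal sketch, where `create_exclusive` is a name I made up:

```cpp
#include <cstdio>

// Try to create `path`. Returns true only if this call created the file.
// Note that false can also mean another error, e.g. missing permissions.
inline bool create_exclusive(const char* path) {
    // "wx": open for writing, but fail if the file already exists.
    std::FILE* f = std::fopen(path, "wx");
    if (f == nullptr) return false;
    std::fclose(f);
    return true;
}
```

If both programs call this with the same path, the existence check and the creation can no longer be separated, so at most one of them sees true.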
Atomicity can only be guaranteed by the OS. The OS uses the underlying processor features to achieve this.
So creating your own testandset function is impossible. (Although I'm not sure whether one could use an inline asm snippet and use the testandset mnemonic directly; it could be that this instruction can only be executed with OS privileges.)
EDIT:
According to the comments below this post, making your own 'bittestandset' function using an ASM directive directly is possible (on Intel x86). However, whether these tricks also work on other processors is not clear.
I stand by my point: if you want to do atomic things, use the OS functions and don't do it yourself.
In several modern programming languages (including C++, Java, and C#), the language allows integer overflow to occur at runtime without raising any kind of error condition.
For example, consider this (contrived) C# method, which does not account for the possibility of overflow/underflow. (For brevity, the method also doesn't handle the case where the specified list is a null reference.)
//Returns the sum of the values in the specified list.
private static int sumList(List<int> list)
{
int sum = 0;
foreach (int listItem in list)
{
sum += listItem;
}
return sum;
}
If this method is called as follows:
List<int> list = new List<int>();
list.Add(2000000000);
list.Add(2000000000);
int sum = sumList(list);
An overflow will occur in the sumList() method (because the int type in C# is a 32-bit signed integer, and the sum of the values in the list exceeds the value of the maximum 32-bit signed integer). The sum variable will have a value of -294967296 (not a value of 4000000000); this most likely is not what the (hypothetical) developer of the sumList method intended.
Obviously, there are various techniques that can be used by developers to avoid the possibility of integer overflow, such as using a type like Java's BigInteger, or the checked keyword and /checked compiler switch in C#.
However, the question that I'm interested in is why these languages were designed to by default allow integer overflows to happen in the first place, instead of, for example, raising an exception when an operation is performed at runtime that would result in an overflow. It seems like such behavior would help avoid bugs in cases where a developer neglects to account for the possibility of overflow when writing code that performs an arithmetic operation that could result in overflow. (These languages could have included something like an "unchecked" keyword that could designate a block where integer overflow is permitted to occur without an exception being raised, in those cases where that behavior is explicitly intended by the developer; C# actually does have this.)
Does the answer simply boil down to performance -- the language designers didn't want their respective languages to default to having "slow" arithmetic integer operations where the runtime would need to do extra work to check whether an overflow occurred, on every applicable arithmetic operation -- and this performance consideration outweighed the value of avoiding "silent" failures in the case that an inadvertent overflow occurs?
Are there other reasons for this language design decision as well, other than performance considerations?
In C#, it was a question of performance. Specifically, out-of-box benchmarking.
When C# was new, Microsoft was hoping a lot of C++ developers would switch to it. They knew that many C++ folks thought of C++ as being fast, especially faster than languages that "wasted" time on automatic memory management and the like.
Both potential adopters and magazine reviewers are likely to get a copy of the new C#, install it, build a trivial app that no one would ever write in the real world, run it in a tight loop, and measure how long it took. Then they'd make a decision for their company or publish an article based on that result.
The fact that their test showed C# to be slower than natively compiled C++ is the kind of thing that would turn people off C# quickly. The fact that your C# app is going to catch overflow/underflow automatically is the kind of thing that they might miss. So, it's off by default.
I think it's obvious that 99% of the time we want /checked to be on. It's an unfortunate compromise.
I think performance is a pretty good reason. If you consider every instruction in a typical program that increments an integer, and if instead of the simple op to add 1, it had to check every time if adding 1 would overflow the type, then the cost in extra cycles would be pretty severe.
You work under the assumption that integer overflow is always undesired behavior.
Sometimes integer overflow is desired behavior. One example I've seen is representation of an absolute heading value as a fixed point number. Given an unsigned int, 0 is 0 or 360 degrees and the max 32 bit unsigned integer (0xffffffff) is the biggest value just below 360 degrees.
#include <cstdint>
#include <iostream>

int main()
{
    std::uint32_t shipsHeadingInDegrees = 0;
    // Rotate by a bunch of degrees
    shipsHeadingInDegrees += 0x80000000; // 180 degrees
    shipsHeadingInDegrees += 0x80000000; // another 180 degrees, overflows
    shipsHeadingInDegrees += 0x80000000; // another 180 degrees
    // Ships heading now will be 180 degrees
    // Divide by 2^32 so that 0xffffffff maps to just below 360 degrees
    std::cout << "Ships Heading Is " << (double(shipsHeadingInDegrees) / 4294967296.0) * 360.0 << std::endl;
}
There are probably other situations where overflow is acceptable, similar to this example.
C/C++ never mandate trap behaviour. Even the obvious division by 0 is undefined behaviour in C++, not a specified kind of trap.
The C language doesn't have any concept of trapping, unless you count signals.
C++ has a design principle that it doesn't introduce overhead not present in C unless you ask for it. So Stroustrup would not have wanted to mandate that integers behave in a way which requires any explicit checking.
Some early compilers, and lightweight implementations for restricted hardware, don't support exceptions at all, and exceptions can often be disabled with compiler options. Mandating exceptions for language built-ins would be problematic.
Even if C++ had made integers checked, 99% of programmers in the early days would have turned it off for the performance boost...
Because checking for overflow takes time. Each primitive mathematical operation, which normally translates into a single assembly instruction would have to include a check for overflow, resulting in multiple assembly instructions, potentially resulting in a program that is several times slower.
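To give an idea of what such a check involves when it is written by hand, here is a sketch of a checked 32-bit addition in C++. The `checked_add` name and the choice to throw std::overflow_error are my own assumptions; a compiler or runtime could do this more cheaply, for example by testing the CPU's overflow flag.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Adds two 32-bit signed integers, throwing instead of overflowing.
inline std::int32_t checked_add(std::int32_t a, std::int32_t b) {
    constexpr std::int32_t kMax = std::numeric_limits<std::int32_t>::max();
    constexpr std::int32_t kMin = std::numeric_limits<std::int32_t>::min();
    // The tests are arranged so that they cannot overflow themselves.
    if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) {
        throw std::overflow_error("integer overflow in checked_add");
    }
    return a + b;
}
```

Every addition now costs two comparisons and a branch on top of the add itself, which is the overhead being talked about.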
It is likely 99% performance. On x86 you would have to check the overflow flag on every operation, which would be a huge performance hit.
The other 1% would cover those cases where people are doing fancy bit manipulations or being 'imprecise' in mixing signed and unsigned operations and want the overflow semantics.
Backwards compatibility is a big one. With C, it was assumed that you were paying enough attention to the size of your datatypes that if an over/underflow occurred, that was what you wanted. Then with C++, C# and Java, very little changed with how the "built-in" data types worked.
If integer overflow is defined as immediately raising a signal, throwing an exception, or otherwise deflecting program execution, then any computations which might overflow will need to be performed in the specified sequence. Even on platforms where integer overflow checking wouldn't cost anything directly, the requirement that integer overflow be trapped at exactly the right point in a program's execution sequence would severely impede many useful optimizations.
If a language were to specify that integer overflows would instead set a latching error flag, were to limit how actions on that flag within a function could affect its value within calling code, and were to provide that the flag need not be set in circumstances where an overflow could not result in erroneous output or behavior, then compilers could generate more efficient code than any kind of manual overflow-checking programmers could use. As a simple example, if one had a function in C that would multiply two numbers and return a result, setting an error flag in case of overflow, a compiler would be required to perform the multiplication whether or not the caller would ever use the result. In a language with looser rules like I described, however, a compiler that determined that nothing ever uses the result of the multiply could infer that overflow could not affect a program's output, and skip the multiply altogether.
From a practical standpoint, most programs don't care about precisely when overflows occur, so much as they need to guarantee that they don't produce erroneous results as a consequence of overflow. Unfortunately, programming languages' integer-overflow-detection semantics have not caught up with what would be necessary to let compilers produce efficient code.
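The latching error flag described above can be approximated in a library, though without the optimization freedom a language-level rule would give the compiler. The following is only a sketch of the idea: `Sticky` and `sticky_add` are hypothetical names, and the value is simply left unchanged once an overflow has been detected.

```cpp
#include <cstdint>
#include <limits>

// A 32-bit value paired with an overflow flag that latches:
// once set, later operations never clear it.
struct Sticky {
    std::int32_t value = 0;
    bool overflowed = false;
};

// Adds `b` to `a`; on overflow the flag is set and the value is not updated.
inline Sticky sticky_add(Sticky a, std::int32_t b) {
    // Do the arithmetic in 64 bits so the range check itself cannot overflow.
    const std::int64_t wide = static_cast<std::int64_t>(a.value) + b;
    if (wide > std::numeric_limits<std::int32_t>::max() ||
        wide < std::numeric_limits<std::int32_t>::min()) {
        a.overflowed = true;
        return a;
    }
    a.value = static_cast<std::int32_t>(wide);
    return a;
}
```

The caller checks `overflowed` once at the end of a computation instead of after every operation, which matches the point that most programs care whether the result is trustworthy, not exactly when the overflow happened.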
My understanding of why errors would not be raised by default at runtime boils down to the legacy of desiring to create programming languages with ACID-like behavior. Specifically, the tenet that anything that you code it to do (or don't code), it will do (or not do). If you didn't code some error handler, then the machine will "assume" by virtue of no error handler, that you really want to do the ridiculous, crash-prone thing you're telling it to do.
(ACID reference: http://en.wikipedia.org/wiki/ACID)