How would I write a MIPS behavioral simulator for the machine code created using the assembler code provided?

This MIPS simulator will read a text file of LC3100 machine code instructions (represented as decimal values), execute the program, and display the contents of the register file and memory after each instruction completes.
I do not understand how this can be done; I just need an outline of the steps to take to create the simulator. Do I write it in C++ or in MIPS assembly? If it is in MIPS, how do I read files? Honestly, I'm just confused.
I do not know where to start, and that is what I am asking for help with.

I'd imagine you'd want to create some global variables that represent your registers and memory:
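#include <stdint.h> // uint32_t, uint64_t
int memory[0x80000000/4]; // 0x80000000 bytes (2 GiB) of memory, as 4-byte words
int reg_v0;
int reg_t0;
uint32_t reg_hi, reg_lo; // HI/LO registers, written by multu
int* reg_pc;
// etc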
Then define functions that mimic the way each MIPS instruction behaves. You'll need to read up on how the CPU operates (which is why this example function may seem arbitrary, though it really isn't):
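void MIPS_multu(int regA, int regB)
{
    // void because we're writing to global variables.
    // multu treats both operands as unsigned and produces a 64-bit product,
    // so widen to 64 bits *before* multiplying (int * int would overflow).
    uint64_t temp = (uint64_t)(uint32_t)regA * (uint32_t)regB;
    reg_hi = (uint32_t)(temp >> 32);          // upper 32 bits
    reg_lo = (uint32_t)(temp & 0xFFFFFFFFu);  // lower 32 bits
}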
Finally, you'll need to understand how MIPS instructions are encoded and create a routine that can unpack them and select the correct function.
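As a sketch of that unpacking step, the code below splits a 32-bit word into the standard MIPS R-type and I-type fields and switches on opcode/funct to pick a handler. Only `addu`, `multu`, and `addiu` are handled, and the names `Decoded`, `decode`, `Cpu`, and `dispatch` are illustrative, not part of any library. If the LC3100 format differs from MIPS, the shift amounts, masks, and opcode numbers are what you would change.

```cpp
#include <cassert>
#include <cstdint>

// Fields of one MIPS instruction word. R-type instructions use rs/rt/rd/shamt/funct;
// I-type instructions use rs/rt/imm. Which fields matter depends on the opcode.
struct Decoded {
    uint32_t opcode, rs, rt, rd, shamt, funct;
    int32_t imm; // low 16 bits, sign-extended
};

Decoded decode(uint32_t word) {
    Decoded d;
    d.opcode = word >> 26;            // bits 31-26
    d.rs     = (word >> 21) & 0x1F;   // bits 25-21
    d.rt     = (word >> 16) & 0x1F;   // bits 20-16
    d.rd     = (word >> 11) & 0x1F;   // bits 15-11
    d.shamt  = (word >> 6) & 0x1F;    // bits 10-6
    d.funct  = word & 0x3F;           // bits 5-0
    d.imm    = static_cast<int16_t>(word & 0xFFFF);
    return d;
}

// Register file plus the HI/LO pair written by multu (illustrative).
struct Cpu {
    uint32_t reg[32] = {};
    uint32_t hi = 0, lo = 0;
};

// Returns false for instructions this sketch does not implement.
bool dispatch(Cpu& cpu, uint32_t word) {
    Decoded d = decode(word);
    switch (d.opcode) {
    case 0x00: // SPECIAL: the actual operation is selected by funct
        switch (d.funct) {
        case 0x21: // addu rd, rs, rt
            cpu.reg[d.rd] = cpu.reg[d.rs] + cpu.reg[d.rt];
            break;
        case 0x19: { // multu rs, rt: 64-bit unsigned product into HI:LO
            uint64_t p = static_cast<uint64_t>(cpu.reg[d.rs]) * cpu.reg[d.rt];
            cpu.hi = static_cast<uint32_t>(p >> 32);
            cpu.lo = static_cast<uint32_t>(p);
            break;
        }
        default:
            return false;
        }
        break;
    case 0x09: // addiu rt, rs, imm
        cpu.reg[d.rt] = cpu.reg[d.rs] + static_cast<uint32_t>(d.imm);
        break;
    default:
        return false;
    }
    cpu.reg[0] = 0; // $zero always reads as 0, whatever was written to it
    return true;
}
```

A full simulator adds one case per supported instruction; the fetch loop then calls `dispatch` on every word it reads.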
int memory[0x80000000/4];
int reg_v0;
int reg_t0;
int* reg_pc;
// etc
int main()
{
    reg_pc = &memory[0];
    // Comparing against one past the last element is valid C, so this loop
    // runs until the PC walks off the end of memory.
    while (reg_pc < &memory[0x80000000/4])
    {
        int temp = *reg_pc; // fetch the current instruction word
        // use bitwise operators etc. to figure out what the instruction represents,
        // and switch cases to pick the functions.
        reg_pc++; // branch and jump handlers would set reg_pc themselves instead
    }
    return 0;
}
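The question also asks how the program gets read in. Because the simulator is an ordinary C or C++ program, reading the file is plain stream I/O. Below is a minimal sketch that loads decimal words, runs a fetch-decode-execute loop, and prints the state after each instruction. It assumes (this is not from the question) an LC-2K-style encoding: eight registers, opcode in bits 24-22, regA in 21-19, regB in 18-16, destReg in 2-0, and a signed 16-bit offset in 15-0. Check your LC3100 specification and adjust the shifts and opcode numbers. `Machine`, `loadProgram`, `printState`, `step`, and `run` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>  // std::istringstream, handy for testing without a file
#include <vector>

// ASSUMED encoding (LC-2K style), not taken from the question:
//   bits 24-22 opcode, 21-19 regA, 18-16 regB, 2-0 destReg, 15-0 signed offset.
constexpr int kNumRegs = 8;
constexpr int kMemWords = 65536;

struct Machine {
    int32_t reg[kNumRegs] = {};                               // register file, all zero
    std::vector<int32_t> mem = std::vector<int32_t>(kMemWords, 0);
    int32_t pc = 0;
    int numWords = 0;                                         // words loaded from the file
};

// One decimal machine-code word per whitespace-separated token.
// Returns false if reading stopped at something that was not an integer.
bool loadProgram(std::istream& in, Machine& m) {
    int32_t word;
    while (m.numWords < kMemWords && in >> word)
        m.mem[m.numWords++] = word;
    return in.eof();
}

void printState(std::ostream& out, const Machine& m) {
    out << "pc " << m.pc << '\n';
    for (int i = 0; i < kNumRegs; ++i)
        out << "  reg[" << i << "] " << m.reg[i] << '\n';
    for (int i = 0; i < m.numWords; ++i)
        out << "  mem[" << i << "] " << m.mem[i] << '\n';
}

// Executes one instruction; returns false once halt has executed.
bool step(Machine& m) {
    uint32_t instr = static_cast<uint32_t>(m.mem.at(m.pc));
    int opcode = (instr >> 22) & 0x7;
    int regA = (instr >> 19) & 0x7;
    int regB = (instr >> 16) & 0x7;
    int dest = instr & 0x7;
    int32_t offset = static_cast<int16_t>(instr & 0xFFFF); // sign-extend
    int32_t next = m.pc + 1;
    switch (opcode) {
    case 0: // add (unsigned math so overflow wraps instead of being UB)
        m.reg[dest] = static_cast<int32_t>(static_cast<uint32_t>(m.reg[regA]) +
                                           static_cast<uint32_t>(m.reg[regB]));
        break;
    case 1: m.reg[dest] = ~(m.reg[regA] & m.reg[regB]); break;               // nand
    case 2: m.reg[regB] = m.mem.at(m.reg[regA] + offset); break;             // lw
    case 3: m.mem.at(m.reg[regA] + offset) = m.reg[regB]; break;             // sw
    case 4: if (m.reg[regA] == m.reg[regB]) next = m.pc + 1 + offset; break; // beq
    case 5: m.reg[regB] = m.pc + 1; next = m.reg[regA]; break;               // jalr
    case 6: m.pc = next; return false;                                       // halt
    default: break;                                                          // noop
    }
    m.pc = next;
    return true;
}

// Fetch-decode-execute, printing the state after every instruction.
int run(Machine& m, std::ostream& out, int maxSteps = 1000000) {
    int executed = 0;
    bool running = true;
    while (running && executed < maxSteps) {
        running = step(m);
        ++executed;
        printState(out, m);
    }
    return executed;
}
```

In `main()`, you would open the file with `std::ifstream`, pass it to `loadProgram`, and then call `run(m, std::cout)`. The `maxSteps` cap is only a guard against a program that never reaches halt.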


OpenMP parallelize for loop inside a function

I am trying to parallelize this for loop inside a function using OpenMP, but when I compile the code I still get an error =(
Error 1 error C3010: 'return' : jump out of OpenMP structured block not allowed.
I am using the Visual Studio 2010 C++ compiler. Can anyone help me? I appreciate any advice.
int match(char* pattern, int patternSize, char* string, int startFrom, unsigned int &comparisons) {
comparisons = 0;
#pragma omp for
for (int i = 0; i < patternSize; i++){
comparisons++;
if (pattern[i] != string[i + startFrom])
return 0;
}
return 1;
}
As @Hristo has already mentioned, you are not allowed to branch out of a parallel region in OpenMP. Among other reasons, this is not allowed because the compiler cannot know a priori how many iterations each thread should work on when it splits a for loop like the one you have written among the different threads.
Furthermore, even if you could branch out of your loop, comparisons would be computed incorrectly: every thread would increment the shared counter without synchronization (a data race), and iterations past the first mismatch could still run on other threads. As it stands, you have an inherently serial algorithm that stops at the first differing character. How could you split up this work so that throwing more threads at it could make it faster?
Finally, note that there is very little work being done in this loop anyway. You would be very unlikely to see any benefit from OpenMP even if you could rewrite this algorithm into a parallel algorithm. My suggestion: drop OpenMP from this loop and look to implement it somewhere else (either at a higher level - maybe you call this method on different strings? - or in a section of your code that does more work).
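A minimal sketch of that suggestion, assuming (the question does not say) that the caller wants every position where the pattern occurs: `match` stays serial, and the OpenMP loop moves up to the independent start positions. `findAll` and the `text` parameter name are hypothetical, and `match` is adapted from the question to take `const char*`. It only uses `#pragma omp parallel for` with a signed `int` index, which OpenMP 2.0 (the version Visual Studio 2010 supports) accepts.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// The question's check, unchanged in logic, with no OpenMP inside it.
int match(const char* pattern, int patternSize, const char* text, int startFrom,
          unsigned int& comparisons) {
    comparisons = 0;
    for (int i = 0; i < patternSize; i++) {
        comparisons++;
        if (pattern[i] != text[i + startFrom])
            return 0;
    }
    return 1;
}

// Parallelize over candidate start positions instead: every iteration is
// independent, so no thread ever needs to leave the loop early.
// Build with /openmp (MSVC) or -fopenmp (GCC); without it the pragma is
// ignored and the loop simply runs serially.
std::vector<int> findAll(const char* pattern, const char* text) {
    const int patternSize = static_cast<int>(std::strlen(pattern));
    const int textSize = static_cast<int>(std::strlen(text));
    const int positions = textSize - patternSize + 1;
    if (positions <= 0)
        return std::vector<int>();
    std::vector<char> hit(positions, 0);  // one slot per position: no shared writes
    #pragma omp parallel for
    for (int pos = 0; pos < positions; pos++) {
        unsigned int comparisons;         // private to this iteration
        hit[pos] = static_cast<char>(match(pattern, patternSize, text, pos, comparisons));
    }
    std::vector<int> result;              // collected serially, in order
    for (int pos = 0; pos < positions; pos++)
        if (hit[pos])
            result.push_back(pos);
    return result;
}
```

Whether this pays off still depends on the text being long enough; for short inputs the thread start-up cost can dominate, which is the answer's point about doing more work per parallel region.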

cuda memory alignment

In my code I am using structures to make passing arguments to functions easier (in general I don't use arrays of structures, but structures of arrays instead).
When I am in cuda-gdb and I examine the point in a kernel where I give values to a simple structure like
struct pt{
int i;
int j;
int k;
};
even though I am not doing something complicated and it's obvious that the members should have the values appointed, I get...
Asked for position 0 of stack, stack only has 0 elements on it.
So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to
struct __align__(16) pt{
int i;
int j;
int k;
};
but then, when the compiler compiles the host-code files that use the same definitions, it gives the following errors:
error: expected unqualified-id before numeric constant
error: expected ‘)’ before numeric constant
error: expected constructor, destructor, or type conversion before ‘;’ token
So, am I supposed to have two different definitions for host and device structures?
Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.
For example, how should the following two be aligned? Or, how should a structure with 6 floats be aligned? Or 4 integers? Again, I'm not using arrays of those, but I still define lots of variables with these structures within kernels or __device__ functions.
struct {
int a;
int b;
int c;
int d;
float* el;
} ;
struct {
int a;
int b;
int c;
int d;
float* i;
float* j;
float* k;
} ;
Thank you in advance for any advice or hints
There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.
First, the reason your host compiler gives you errors is that it doesn't know what __align__(n) means, so it reports a syntax error. What you need is to put something like the following in a header for your project.
#if defined(__CUDACC__) // NVCC
#define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
#define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
#define MY_ALIGN(n) __declspec(align(n))
#else
#error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif
So, am I supposed to have two different definitions for host and device structures?
No, just use MY_ALIGN(n), like this
struct MY_ALIGN(16) pt { int i, j, k; };
For example, how should the following two be aligned?
First, __align__(n) (or any of the host-compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, padding is added to round it up, so that each struct in an array of them stays properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires that each thread read words of 1, 2, 4, 8, or 16 bytes that are naturally aligned (aligned to their own size). So...
struct MY_ALIGN(16) {
int a;
int b;
int c;
int d;
float* el;
};
In this case, let's say we choose 16-byte alignment. On a 32-bit machine the pointer takes 4 bytes, so the struct takes 20 bytes. 16-byte alignment rounds that up to 32 bytes, wasting 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine the 8-byte pointer makes the struct 24 bytes, so it wastes only 8 bytes per struct. We can reduce the waste by using MY_ALIGN(8) instead (4 bytes of waste on 32-bit, none on 64-bit). The tradeoff is that the hardware will have to use three 8-byte loads instead of two 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.
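The padding arithmetic can be checked on the host with C++11 `alignas` standing in for `MY_ALIGN`; the struct names `Params16` and `Params8` are illustrative. `sizeof` reports the padded size, so subtracting the members' payload gives the waste per struct. Assuming a 4-byte `int`, a typical 64-bit host should give 8 and 0 bytes, and a 32-bit host 12 and 4, matching the numbers above.

```cpp
#include <cassert>
#include <cstddef>

// The struct from the answer, at two candidate alignments.
struct alignas(16) Params16 {
    int a, b, c, d;
    float* el;
};
struct alignas(8) Params8 {
    int a, b, c, d;
    float* el;
};

// Bytes actually used by the members: four ints plus one pointer.
constexpr std::size_t kPayload = 4 * sizeof(int) + sizeof(float*);

// sizeof includes the padding the alignment forces, so the difference is the waste.
constexpr std::size_t kWaste16 = sizeof(Params16) - kPayload;
constexpr std::size_t kWaste8 = sizeof(Params8) - kPayload;

static_assert(sizeof(Params16) % 16 == 0, "size is rounded up to the alignment");
static_assert(sizeof(Params8) % 8 == 0, "size is rounded up to the alignment");
```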
struct MY_ALIGN(16) {
int a;
int b;
int c;
int d;
float* i;
float* j;
float* k;
};
In this case, with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines (28 bytes padded to 32), or 8 on 64-bit machines (40 bytes padded to 48). It would require two 16-byte loads (or three on a 64-bit machine). We could eliminate the waste entirely with 4-byte alignment (8-byte on 64-bit machines), but that would result in more, smaller loads. Again, tradeoffs.
or, how should a structure with 6 floats be aligned?
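Again, tradeoffs: six floats take 24 bytes, so 16-byte alignment wastes 8 bytes per struct and needs two 16-byte loads, while 8-byte alignment wastes nothing but needs three 8-byte loads.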
or 4 integers?
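No tradeoff here: MY_ALIGN(16). Four ints take exactly 16 bytes, so there is no padding and the struct loads in a single 16-byte access.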
again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or __device__ functions.
Hmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is important to worry about; it's another good reason to favor structures of arrays over arrays of structures.
These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and IIANM by nvcc as well. That should save you the need to resort to macros.
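Following that suggestion, here is a sketch of the `pt` struct with `alignas` instead of `MY_ALIGN`. The `static_assert`s document the expected layout at compile time (assuming a 4-byte `int`), and `isAligned16` is a hypothetical helper for checking an address at run time. Given the hedge above, it is worth confirming that your nvcc version accepts `alignas` before dropping the macro.

```cpp
#include <cassert>
#include <cstdint>

// One definition shared by host and device code, assuming every compiler
// involved supports C++11: no compiler-specific #if ladder is needed.
struct alignas(16) pt {
    int i, j, k;
};

static_assert(alignof(pt) == 16, "pt starts on a 16-byte boundary");
static_assert(sizeof(pt) == 16, "12 bytes of ints are padded to 16");

// True if p sits on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```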

CUDA memory allocation - is it efficient

This is my code. I have a lot of threads, and each of them calls this function many times.
Inside this function I am creating an array. Is this an efficient implementation? If not, please suggest a more efficient one.
__device__ float calculate_minimum(float *arr)
{
    float vals[9]; // for each call to this function I am creating this array
    // Is it efficient?? Or how can I implement this efficiently?
    // Do I need to deallocate the memory after using this array?
    for (int i = 0; i < 9; i++)
        vals[i] = someFunction(arr, i); // someFunction: hypothetical placeholder for "call some function"
    float min = findMin(vals);
    return min;
}
There is no "array creation" in that code. There is a fixed-size local array whose size is known at compile time. Further, the standard CUDA compilation model inline-expands __device__ functions, meaning that vals will be placed in local memory or, if possible, even in registers.
All of this happens at compile time, not run time.
Perhaps I am missing something, but from the code you have posted, you don't need the temporary array at all. Your code will be (a little) faster if you do something like this:
#include <cfloat> // for FLT_MAX
__device__ float calculate_minimum(float *arr)
{
    float minVal = FLT_MAX;
    for (int i = 0; i < 9; i++)
    {
        // someFunction is a hypothetical placeholder for "call some function"
        float thisVal = someFunction(arr, i);
        minVal = fminf(thisVal, minVal); // both statements belong inside the loop
    }
    return minVal;
}
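To convince yourself that dropping the array does not change the result, the two shapes can be compiled as ordinary host C++ and compared. `someFunction` is a hypothetical stand-in for the question's "call some function", and `std::fmin` plays the role of the device-side `fminf`.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Hypothetical stand-in for "call some function and assign the values".
float someFunction(const float* arr, int i) {
    return arr[i] * arr[i] - 4.0f * arr[i];
}

// The question's shape: fill a 9-element array, then take its minimum.
float minWithArray(const float* arr) {
    float vals[9];
    for (int i = 0; i < 9; i++)
        vals[i] = someFunction(arr, i);
    float m = vals[0];
    for (int i = 1; i < 9; i++)
        m = std::fmin(m, vals[i]);
    return m;
}

// The answer's shape: keep only a running minimum, no array.
float minRunning(const float* arr) {
    float minVal = FLT_MAX;
    for (int i = 0; i < 9; i++) {
        float thisVal = someFunction(arr, i);
        minVal = std::fmin(thisVal, minVal);
    }
    return minVal;
}
```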
Where an array is actually required, there is nothing wrong with declaring it in this way (as many others have said).
Regarding "float vals[9]", this will be efficient in CUDA. For small arrays, the compiler will almost surely place all the elements directly in registers, as long as every index is known at compile time (for example, once the loop is fully unrolled). So "vals[0]" will be a register, "vals[1]" will be a register, etc.
If the compiler starts to run out of registers, or the array size is larger than around 16, then local memory is used. You don't have to worry about allocating/deallocating local memory, the compiler/driver do all that for you.
Devices of compute capability 2.0 and greater do have a call stack to allow things like recursion. For example, you can set the stack size to 6 KB per thread with:
cudaStatus = cudaDeviceSetLimit(cudaLimitStackSize, 1024*6); // cudaThreadSetLimit is the older, deprecated name
Normally you won't need to touch the stack yourself. Even if you put big static arrays in your device functions, the compiler and driver will see what's there and make space for you.

CUDA: Using realloc inside kernel

I know that it is possible to use malloc inside the kernel to allocate memory on GPU's global memory. Is it also possible to use realloc?
You could write your own realloc device function for your data type.
Just allocate space for a new array, copy the old values into it, free the old array, and return the new, larger one.
Approximately like the following code fragment:
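__device__ MY_TYPE* myrealloc(int oldsize, int newsize, MY_TYPE* old)
{
    MY_TYPE* newT = (MY_TYPE*) malloc(newsize * sizeof(MY_TYPE));
    if (newT == NULL)   // device heap exhausted: 'old' is left untouched
        return NULL;
    int n = oldsize < newsize ? oldsize : newsize; // never overrun a smaller block
    for (int i = 0; i < n; i++)
    {
        newT[i] = old[i];
    }
    free(old);
    return newT;
}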
But be sure to call it only if you really need it, since every call allocates and copies. Also add proper error checking.
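The copy logic can be exercised off the GPU with a host-side analogue. This sketch is a template over the element type and uses `new (std::nothrow)`/`delete[]` in place of the device `malloc`/`free`, so it is not a drop-in replacement for the `__device__` version; the name `myrealloc_host` is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <new>

// Host-side analogue of myrealloc: allocate, copy what fits, free the old block.
template <typename T>
T* myrealloc_host(int oldsize, int newsize, T* old) {
    T* fresh = new (std::nothrow) T[newsize];
    if (fresh == nullptr)
        return nullptr;                                   // failure: 'old' is still valid
    std::copy_n(old, std::min(oldsize, newsize), fresh);  // shrinking copies only newsize
    delete[] old;
    return fresh;
}
```

As with the standard `realloc`, a caller that wants to keep the old block on failure should assign the result to a temporary before overwriting its pointer.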
In the CUDA Programming Guide, where malloc and free are introduced, there is no mention of realloc. I would assume that it does not exist.
If you want to know it for sure, why don't you write a simple kernel and try using it?

int issue in g++/mysql/redhat

I haven't written C in quite some time and am writing an app using the MySQL C API, compiling with g++ on Red Hat.
So I started outputting some fields with printf. With the Oracle API and Pro*C, which I used years ago (on SUSE), I could select an int and output it as:
int some_int;
printf("%i",some_int);
I tried to do that with MySQL ints and got 8-digit random numbers displayed. I thought this was a MySQL API issue or some config issue with my server, and I wasted a few hours trying to fix it, but couldn't. Then I found that I could do:
int some_int;
printf("%s",some_int);
and it would print the integer properly. Because I'm not doing computations on the values I am extracting, I thought this was an okay solution.
UNTIL I TRIED TO COUNT SOME THINGS...
I did a simple:
int rowcount;
for([stmt]){
rowcount++;
}
printf("%i",rowcount);
I am getting an 8-digit random number again... I couldn't figure out what the deal is with ints on this machine.
Then I realized that if I initialize the int to zero, I get the proper number.
Can someone please explain under what conditions you need to initialize int variables to zero? I don't recall doing this every time in my old codebase, and I didn't see it in the example I was modeling my mysql_stmt code on...
Is there something I'm missing? It's also entirely possible I've forgotten that this is required each time.
Thanks...
If you don't initialize your variables, there's no guarantee of a default 0/NULL/whatever value. Some compilers MIGHT initialize it to 0 for you (IIRC, MSVC++ 6.0 would be kind enough to do so), and others might not. So don't rely on it. Never use a variable without first giving it some sort of sane value.
Only global and static variables are initialized to zero. Variables on the stack contain indeterminate (garbage) values if they are not initialized.
int g_var; // This is a global variable, so it is initialized to zero
int main()
{
    int s_var = 0; // This is on the stack, so you need to initialize it explicitly
    static int stat_var; // This is a static variable, so it is initialized to zero
}
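The same rule in a compilable sketch (all names are illustrative): a global and a function-local static start at zero with no initializer, while a local counter like the question's `rowcount` starts at zero only because it is explicitly initialized. `countPositive` stands in for the question's row-counting loop.

```cpp
#include <cassert>

int g_counter;              // static storage duration: zero-initialized

int nextId() {
    static int calls;       // also zero-initialized, once, before first use
    return ++calls;
}

int countPositive(const int* v, int n) {
    int count = 0;          // automatic (stack) variable: must be initialized
    for (int i = 0; i < n; i++)
        if (v[i] > 0)
            count++;
    return count;
}
```

Leaving out the `= 0` on `count` would make the result depend on whatever happened to be on the stack, which is exactly the 8-digit number the question describes.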
You always need to initialize your variables. To catch this sort of error, you should probably compile with -Wall, which turns on g++'s common warnings, including uninitialized-variable warnings (despite the name, it does not enable every warning; add -Wextra for more). I also prefer to use -Werror to turn all warnings into errors, since a warning almost always indicates an error or a potential error, and cleaning up the code is better than leaving it as is.
Also, in your second printf, you used %s which is for printing strings, not integers.
int i = 0;
printf("%d\n", i);
// or
printf("%i\n", i);
Is what you want.
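For printf, %d and %i are interchangeable; they differ only in the scanf family, where %i also accepts C-style prefixes (a leading 0 for octal, 0x for hex) and %d is always decimal. A small sketch that checks both points, using snprintf/sscanf on buffers so the results can be inspected; `sameOutput` and `parseWith` are illustrative names.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// printf family: %d and %i produce identical text for an int.
bool sameOutput(int value) {
    char d[32], i[32];
    std::snprintf(d, sizeof d, "%d", value);
    std::snprintf(i, sizeof i, "%i", value);
    return std::strcmp(d, i) == 0;
}

// scanf family: %i honours 0 / 0x prefixes, %d always reads decimal.
int parseWith(const char* text, const char* fmt) {
    int value = 0;  // initialized, so a failed parse still returns something defined
    std::sscanf(text, fmt, &value);
    return value;
}
```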
Local variables are not automatically initialized in C.
You have indeed forgotten. In C and C++, local variables don't get any automatic initialization; the contents of c after int c; are whatever happens to be at the address referred to by c at the time.
Best practice: initialize at the definition: int c = 0;.
PS: also take some care that the MySQL int type matches the C int type; I think it does, but I'm not positive. It is, however, both architecture- and compiler-sensitive, since sizeof(int) isn't the same in all C environments.
Uninitialized variable.
int some_int = 0;