find execution time of a program, given cycle count and GHz - mips

I have to find the execution time (in microseconds) of a small block of MIPS code, given that:
it will take a total of 30 cycles
total of 10 MIPS instructions
2.0 GHz CPU
That's all the information I am given to solve this question with (I already added up the total number of cycles, given the assumptions I am supposed to make about how many cycles different kinds of instructions are supposed to take). I have been playing around with the formulas from the book trying to find the execution time, but I can't get an answer that seems right. Whats the process for solving a problem like this? Thanks.

My best guess at interpreting your problem is that on average each instruction takes 3 cycles to complete. Because you were given the total number of cycles I'm not sure that the instruction count even matters.
You have a 2Ghz machine so that is 2 * 10^9 cycles per second. This equates to each cycle taking 5 * 10^(-10) seconds. (twice as fast as a 1Ghz machine which is 1*10^(-9)).
We have 30 cycles to complete to run the program so...
30 * (5 * 10^(-10)) = 1.5 * 10^(-8) or 15 nano seconds to execute all 10 instructions in 30 cycles.

Related

Clock cycle in pipelining and single-clock cycle implementation

There are two mechanisms to execute instructions.
Single clock cycle implementation
pipelining.
In MIPS architecture(from the book Computer organization and design), instruction has 5 stages.
So, in single clock cycle implementation, which means during one clock cycle, 5 stages are executed for one instruction.
For example, load instruction(it has 5 stages) is executed in one clock cycle. So other instructions can be executed after this one clock cycle.
Let's assume that one clock cycle is 10 secs.
And now, in pipelining, multiple instructions can be overlapped. I'm confused from this concept comparing to one clock cycle's time in above example.
In here to Execute 5 instructions, it needs 9 clock cycles. It means to execute 5 instructions, it needs 90 secs. But in single clock cycle implementation, it just needs 50 secs to execute 5 instructions. Pipelining needs more clock cycles.(Not good) Am I thinking wrong?? or Am I missing something??
And here, So, to execute first instruction lw $10, 20($1), it needs 50 secs??
I think the major misconception you are having is that you consider a duration of a clock cycle in both designs to be the same, which is not.
Lets denote a clock cycle in single cycle design as X and a clock cycle in pipeline design as Y.
In a single cycle design 5 instructions will take 5X cycles and in a pipeline design this will take 9Y cycles.
Now we need to find a relationship between X and Y.
Now think of a case where you have only single instruction to execute.
In a single cycle design this will take X cycles and in a pipeline design this will take 5Y. If both are clocked at the same rate, X should be equal to 5Y.
Now lets do a bit of substitution maths :-)
Single cycle - 5X
Pipeline - 9Y
Substituting X = 5Y
Single cycle - 25Y
Pipeline - 9Y
There you go. Single cycle design is 2.7X slower than a multi cycle design.

how to get the right gperf samples report

I have some gperf tool files:
the first one was running about 2 minites,file is 18M;
others running about 2 hours and the files are about 800M
when I try to use :pprof --text to get the report, found the the first one has 1300 samples but these 2 hours running just 5500 samples.
I excepted the larger files have about 2*3600*100 samples(because "by default the gperf tools take 100 samples a second").
The same procedures and the same operating environment, why the samples too few?
sorry for my poor english.
I looks like it's I/O bound. In the 120-second job, you're getting 13 seconds of samples. In the 120-minute job, you're getting about 1 minute of samples. The actual fraction of time spent computing vs. I/O can vary pretty widely, especially if there is some constant startup overhead.
If the time ought to be roughly linear in file size, that 120-minute job should really only be about 40 minutes, so I would do some manual sampling on the big job, to see what's happening.

code ping time meter - is this really true?

I am using a sort of code_ping for the time it took to process the whole page, to all my pages in my webportal.
I figured if I do a $count_start in the header initialised with current timestamp and a $count_end in the footer, the same, the difference is a meter to roughly let me know how well optimised the page is (queries, loading time of all things in that particular page).
Say for one page i get 0.0075 seconds, for others I get 0.045 etc...i'm working on optimising the queries better this way.
My question is. If one page says by this meter "rough loading time" that has 0.007 seconds,
will 1000 users querying the same page at the same time get each the result in 0.007 * 1000 = 7 seconds ? meaning they will each get the page after 7 seconds ?
thanks
Luckily, it doesn't usually mean that.
The missing variable in your equation is how your database and your application server and anything else in your stack handles concurrency.
To illustrate this strictly from the MySQL perspective, I wrote a test client program that establishes a fixed number of connections to the MySQL server, each in its own thread (and so, able to issue a query to the server at approximately the same time).
Once all of the threads have signaled back that they are connected, a message is sent to all of them at the same time, to send their query.
When each thread gets the "go" signal, it looks at the current system time, then sends the query to the server. When it gets the response, it looks at the system time again, and then sends all of the information back to the main thread, which compares the timings and generates the output below.
The program is written in such a way that it does not count the time required to establish the connections to the server, since in a well-behaved application the connections would be reusable.
The query was SELECT SQL_NO_CACHE COUNT(1) FROM ... (an InnoDB table with about 500 rows in it).
threads 1 min 0.001089 max 0.001089 avg 0.001089 total runtime 0.001089
threads 2 min 0.001200 max 0.002951 avg 0.002076 total runtime 0.003106
threads 4 min 0.000987 max 0.001432 avg 0.001176 total runtime 0.001677
threads 8 min 0.001110 max 0.002789 avg 0.001894 total runtime 0.003796
threads 16 min 0.001222 max 0.005142 avg 0.002707 total runtime 0.005591
threads 32 min 0.001187 max 0.010924 avg 0.003786 total runtime 0.014812
threads 64 min 0.001209 max 0.014941 avg 0.005586 total runtime 0.019841
Times are in seconds. The min/max/avg are the best/worst/average times observed running the same query. At a concurrency of 64, you notice the best case wasn't all that different than the best case with only 1 query. But biggest take-away here is the total runtime column. That value is the difference in time from when the first thread sent its query (they all send their query at essentially the same time, but "precisely" the same time is impossible since I don't have a 64-core machine to run the test script on) to when the last thread received its response.
Observations: the good news is that the 64 queries taking an average of 0.005586 seconds definitely did not require 64 * 0.005586 seconds = 0.357504 seconds to execute... it didn't even require 64 * 0.001089 (the best case time) = 0.069696 All of those queries were started and finished within 0.019841 seconds... or only about 28.5% of the time it would have theoretically taken for them to run one-after-another.
The bad news, of course, is that the average execution time on this query at a concurrency of 64 is over 5 times as high as the time when it's only run once... and the worst case is almost 14 times as high. But that's still far better than a linear extrapolation from the single-query execution time would suggest.
Things don't scale indefinitely, though. As you can see, the performance does deteriorate with concurrency and at some point it would go downhill -- probably fairly rapidly -- as we reached whichever bottleneck occurred first. The number of tables, the nature of the queries, any locking that is encountered, all contribute to how the server performs under concurrent loads, as do the performance of your storage, the size, performance, and architecture, of the system's memory, and the internals of MySQL -- some of which can be tuned and some of which can't.
But of course, the database isn't the only factor. The way the application server handles concurrent requests can be another big part of your performance under load, sometimes to a larger extent than the database, and sometimes less.
One big unknown from your benchmarks is how much of that time is spent by the database answering the queries, how much of the time is spent by the application server executing the logic business, and how much of the time is spent by the code that is rendering the page results into HTML.

Increasing block size decreases performance

In my cuda code if I increase the blocksizeX ,blocksizeY it actually is taking more time .[Therefore I run it at 1x1]Also a chunk of my execution time ( for eg 7 out of 9 s ) is taken by just the call to the kernel .Infact I am quite amazed that even if I comment out the entire kernel the time is almost same.Any suggestions where and how to optimize?
P.S. I have edited this post with my actual code .I am downsampling an image so every 4 neighoring pixels (so for eg 1,2 from row 1 and 1,2 from row 2) give an output pixel.I get a effective bw of .5GB/s compared to theoretical maximum of 86.4 GB/s.The time I use is the difference in calling the kernel with instructions and calling an empty kernel.
It looks pretty bad to me right now but I cant figure out what am I doing wrong.
__global__ void streamkernel(int *r_d,int *g_d,int *b_d,int height ,int width,int *f_r,int *f_g,int *f_b){
int id=blockIdx.x * blockDim.x*blockDim.y+ threadIdx.y*blockDim.x+threadIdx.x+blockIdx.y*gridDim.x*blockDim.x*blockDim.y;
int number=2*(id%(width/2))+(id/(width/2))*width*2;
if (id<height*width/4)
{
f_r[id]=(r_d[number]+r_d[number+1];+r_d[number+width];+r_d[number+width+1];)/4;
f_g[id]=(g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1])/4;
f_b[id]=(g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1];)/4;
}
}
Try looking up the matrix multiplication example in CUDA SDK examples for how to use shared memory.
The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for each 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of time doing memory access (what GPUs are bad at) rather than compute (what GPUs are good at).
Shared memory in effect allows you to cache this so that amortized, you get roughly 1 read and 1 write at each pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).
Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.
You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously and the maximum performance is reached exactly then. However there are many factors that limit the number of blocks that can exist in parallel (incomplete list):
Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
...
You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance even if the occupancy is =/= 1.0! From this point on you can try to iteratively make changes that reduce the number of executed blocks per MP, but therefore increase the occupancy of your kernel and see if the performance increases.
The nvidia occupancy calculator(excel sheet) will be of great help.

Computing estimated times of file copies / movements?

Inspired by this xckd cartoon I wondered exactly what is the best mechanism to provide an estimate to the user of a file copy / movement?
The alt tag on xkcd reads as follows:
They could say "the connection is probably lost," but it's more fun to do naive time-averaging to give you hope that if you wait around for 1,163 hours, it will finally finish.
Ignoring the funny, is that really how it's done in Windows? How about other OS? Is there a better way?
Have a look at my answer to a similar question (and the other answers there) on how the remaining time is estimated in Windows Explorer.
In my opinion, there is only one way to get good estimates:
Calculate the exact number of bytes to be copied before you begin the copy process
Recalculate you estimate regularly (every 1, 5 or 10 seconds, YMMV) based on the current transfer speed
The current transfer speed can fluctuate heavily when you are copying on a network, so use an average, for example based on the amount of bytes transfered since your last estimate.
Note that the first point may require quite some work, if you are copying many files. That is probably why the guys from Microsoft decided to go without it. You need to decide yourself if the additional overhead created by that calculation is worth giving your user a better estimate.
I've done something similar to estimate when a queue will be empty, given that items are being dequeued faster than they are being enqueued. I used linear regression over the most recent N readings of (time,queue size).
This gives better results than a naive
(bytes_copied_so_far / elapsed_time) * bytes_left_to_copy
Start a global timer that fires say, every 1000 milliseconds and update a total elpased time counter. Let's call this variable "elapsedTime"
While the file is being copied, update some local variable with the amount already copied. Let's call this variable "totalCopied"
In the timer event that is periodically raised, divide totalCopied by totalElapsed to give the number of bytes copied per timer interval (in this case, 1000ms). Let's call this variable "bytesPerSec"
Divide the total file size by bytesPerSec and obtain the total number of seconds theoretically required to copy this file. Let's call this variable remainingTime
Subtract elapsedTime from remainingTime and you a somewhat accurate calculation for file copy time.
I think dialogs should just admit their limitations. It's not annoying because it's failing to give a useful time estimate, it's annoying because it's authoritatively offering an estimate that's obvious nonsense.
So, estimate however you like, based on current rate or average rate so far, rolling averages discarding outliers, or whatever. Depends on the operation and the typical durations of events which delay it, so you might have different algorithms when you know the file copy involves a network drive. But until your estimate has been fairly consistent for a period of time equal to the lesser of 30 seconds or 10% of the estimated time, display "oh dear, there seems to be some kind of holdup" when it's massively slowed, or just ignore it if it's massively sped up.
For example, dialog messages taken at 1-second intervals when a connection briefly stalls:
remaining: 60 seconds // estimate is 60 seconds
remaining: 59 seconds // estimate is 59 seconds
remaining: delayed [was 59 seconds] // estimate is 12 hours
remaining: delayed [was 59 seconds] // estimate is infinity
remaining: delayed [was 59 seconds] // got data: estimate is 59 seconds
// six seconds later
remaining: 53 seconds // estimate is 53 seconds
Most of all I would never display seconds (only hours and minutes). I think it's really frustrating when you sit there and wait for a minute while the timer jumps between 10 and 20 seconds. And always display real information like: xxx/yyyy MB copied.
I would also include something like this:
if timeLeft > 5h --> Inform user that this might not work properly
if timeLeft > 10h --> Inform user that there might be better ways to move the file
if timeLeft > 24h --> Abort and check for problems
I would also inform the user if the estimated time varies too much
And if it's not too complicated, there should be an auto-check function that checks if the process is still alive and working properly every 1-10 minutes (depending on the application).
speaking about network file copy, the best thing is to calculate file size to be transfered, network response and etc. An approach that i used once was:
Connection speed = Ping and calculate the round trip time for packages with 15 Kbytes.
Get my file size and see, theorically, how many time it would take if i would break it in
15 kb packages using my connection speed.
Recalculate my connection speed after transfer is started and ajust the time that will be spended.
I've been pondering on this one myself. I have a copy routine - via a Windows Explorer style interface - which allows the transfer of selected files from an Android Device, to a PC.
At the start, I know the total size of the file(s) that are to be copied, and as I am using C#.NET, I am using a Stopwatch, to get the elapsed time, and while the copy is in progress, I am keeping a total of what is copied so far, in terms of bytes.
I haven't actually tested it yet, but the best way seems to be this -
estimated = elapsed * ((totalSize - copiedSoFar) / copiedSoFar)
I never saw it the way you guys are explaining it-by trasfeed bytes & total bytes.
The "experience" always made a lot more sense (not good/accurate) if you instead use bytes of each file, and file count. This is how the estimate swings wildly.
If you are transferring large files first, the estimate goes long-even with the connection static. It is like it naively thinks that all files are the average size of those thus transferred, and then makes a guess assuming that the average file size will remain accurate for the entire time.
This, and the other ways, all get worse when the connection 'speed' varies...