I am trying to create applications that demonstrate how coding choices and particular techniques can improve (or worsen) energy consumption. How would you go about this, and which libraries can be used in a Windows Store application to measure the current power usage / a key figure for energy consumption?
I know about external applications that measure how much power apps consume, but I would like to have this inside the application itself. Being able to present the figures, perhaps in more detail, would be really helpful here.
(I assume that I must address a specific technique to avoid being too broad.)
As you didn't specify your language, I'll assume that you code in C#, because it's the most used language on WinRT. However, the following class also exists in VB.NET and in C++ under the same namespace.
This is pretty much everything you can do with this class:
using Windows.Phone.Devices.Power;

protected void Foo() {
    // Get the battery instance for this device
    var b = Battery.GetDefault();

    // Remaining charge in percent and the estimated remaining discharge time
    var chargeLeft = b.RemainingChargePercent;
    var timeLeft = b.RemainingDischargeTime;

    // Raised whenever the remaining charge percentage changes
    b.RemainingChargePercentChanged += B_RemainingChargePercentChanged;
}

private void B_RemainingChargePercentChanged(object sender, object e) {
    // To do...
}
Determining the speed of power consumption is straightforward from there, since it is the derivative of the charge with respect to time: sample the remaining charge percentage at two points in time and divide the difference by the elapsed time.
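As an illustration of that computation (a minimal, language-neutral sketch written in C++ rather than C#; the sample values are made up), you would feed successive (percentage, timestamp) samples, for example taken in the RemainingChargePercentChanged handler above, into something like this:

#include <chrono>
#include <cstdio>

// A (charge percentage, timestamp) sample; the values used below are made up.
struct ChargeSample {
    double percent;                                  // remaining charge in %
    std::chrono::steady_clock::time_point time;      // when it was read
};

// Estimated discharge rate in percent per hour between two samples
double dischargeRatePerHour(const ChargeSample& earlier, const ChargeSample& later) {
    const std::chrono::duration<double, std::ratio<3600>> hours = later.time - earlier.time;
    return (earlier.percent - later.percent) / hours.count();
}

int main() {
    using namespace std::chrono;
    const auto t0 = steady_clock::now();
    const ChargeSample a{80.0, t0};
    const ChargeSample b{78.0, t0 + minutes(10)};    // 2% drop over 10 minutes
    std::printf("%.1f %%/hour\n", dischargeRatePerHour(a, b));   // prints 12.0 %/hour
    return 0;
}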
I am trying to integrate a vector-valued function f[] (49 components) which, as an example, may look like:
f[0] = 1
f[1] = cos(x)
f[2] = cos(2x)
f[3] = cos(3x)
... and so on.
I was wondering if there was a way to integrate such a vector function in GSL using a single command. I can currently do this only by having n=49 different cquad integration handles/procedures, which seems inefficient, as I wish to use the same integration "mesh" for all the function components.
Thank you for your attention.
As far as I know, at the moment this cannot be done in GSL with a single call.
However, your task can be parallelized with the help of OpenMP (with #pragma omp parallel for), although there may be considerable overhead when the GSL DLL is used by several threads at the same time.
It is possible (I am not certain about this) that you will have to rebuild the GSL library itself with the OpenMP compiler flags. But GSL is quite thread-safe on its own, as long as each thread works with its own integration workspace. A sketch of this approach is shown below.
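For illustration, here is a minimal sketch of that approach (the integrand, integration limits and tolerances are placeholders): each loop iteration allocates its own cquad workspace, so no GSL state is shared between threads.

#include <cmath>
#include <cstdio>
#include <gsl/gsl_integration.h>

// Hypothetical integrand: component k of the vector function, f[k](x) = cos(k*x)
static double component(double x, void* params) {
    int k = *static_cast<int*>(params);
    return std::cos(k * x);
}

int main() {
    const int n = 49;
    double results[n];

    // Each iteration uses its own workspace, so no GSL state is shared between threads
    #pragma omp parallel for
    for (int k = 0; k < n; ++k) {
        gsl_integration_cquad_workspace* ws =
            gsl_integration_cquad_workspace_alloc(100);

        int index = k;
        gsl_function f;
        f.function = &component;
        f.params = &index;

        double abserr;
        size_t nevals;
        gsl_integration_cquad(&f, 0.0, 1.0, 1e-10, 1e-10,
                              ws, &results[k], &abserr, &nevals);

        gsl_integration_cquad_workspace_free(ws);
    }

    for (int k = 0; k < n; ++k) {
        std::printf("integral of f[%d] over [0,1]: %g\n", k, results[k]);
    }
    return 0;
}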
Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime- and Driver API
The CUDA runtime libraries (like CUBLAS or CUFFT) generally use the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:
// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);
// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);
// When done, destroy the handle
cublasDestroy(handle);
However, there are many subtle details about how these handles interoperate with Driver- and Runtime contexts and multiple threads and devices. The documentation lists several scattered details about context handling:
The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
However, some of this information does not seem to be entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the simplest usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.
My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.
For the case that the library can internally be implemented solely using the Runtime API, things may be clear: context management is solely the responsibility of the user. If they create their own driver context, the rules stated in the documentation about Runtime and Driver context management apply. Otherwise, the Runtime API functions will take care of handling the primary contexts.
However, there may be the case that a library internally has to use the Driver API, for example in order to load PTX files as CUmodule objects and obtain the CUfunction objects from them. And when the library should, for the user, behave like a Runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".
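For concreteness, this is the kind of Driver API usage that I mean (a minimal sketch; the PTX file name, kernel name and device index are placeholders, and error checks are omitted):

#include <cuda.h>

void loadKernelFromPtx() {
    cuInit(0);

    // A context must be current before a module can be loaded; here the
    // primary context of device 0 is retained and made current.
    CUdevice device;
    cuDeviceGet(&device, 0);
    CUcontext context;
    cuDevicePrimaryCtxRetain(&context, device);
    cuCtxSetCurrent(context);

    CUmodule module;
    cuModuleLoad(&module, "kernels.ptx");            // load a PTX file as a CUmodule

    CUfunction function;
    cuModuleGetFunction(&function, module, "myKernel");

    // ... launch "function" with cuLaunchKernel, then clean up:
    cuModuleUnload(module);
    cuDevicePrimaryCtxRelease(device);
}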
What I have figured out so far is sketched here.
(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)
1. The "Handle" is basically a class/struct containing the following information:
class Handle
{
    CUcontext context;
    boolean usingPrimaryContext;
    CUdevice device;
}
2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:
Handle createHandle()
{
    cuInit(0);

    // Obtain the current context
    CUcontext context;
    cuCtxGetCurrent(&context);

    CUdevice device;

    // If there is no context, use the primary context
    boolean usingPrimaryContext = false;
    if (context == nullptr)
    {
        usingPrimaryContext = true;

        // Obtain the device that is currently selected via the runtime API
        int deviceIndex;
        cudaGetDevice(&deviceIndex);

        // Obtain the device and its primary context
        cuDeviceGet(&device, deviceIndex);
        cuDevicePrimaryCtxRetain(&context, device);
        cuCtxSetCurrent(context);
    }
    else
    {
        cuCtxGetDevice(&device);
    }

    // Create the actual handle. This might internally allocate
    // memory or do other things that are specific for the context
    // for which the handle is created
    Handle handle = new Handle(device, context, usingPrimaryContext);
    return handle;
}
3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:
void someLibraryFunction(Handle handle)
{
    cuCtxSetCurrent(handle.context);
    callMyKernel(...);
}
Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.
4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:
void destroyHandle(Handle handle)
{
    if (handle.usingPrimaryContext)
    {
        cuDevicePrimaryCtxRelease(handle.device);
    }
}
From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.
So my questions are:
Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations on how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
(I also had a look at the context handling in TensorFlow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)
(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)
I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.
I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.
The library is an expansion of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.
Now, to answer your concrete question:
Pattern for writing a class surrounding a raw "handle"
Are there any established patterns for implementing such a "Handle"?
Yes. If you read:
What is the difference between: Handle, Pointer and Reference
you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.
That means that your outwards-facing class is not a handle; it represents the object to which you have a handle.
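To illustrate the pattern (a minimal sketch with a made-up class name, using CUBLAS only because your question already uses it as an example; error checks are omitted):

#include <cublas_v2.h>

// The class represents the library-side object and forwards calls through
// the opaque cublasHandle_t it owns.
class ScalingContext {
public:
    ScalingContext()  { cublasCreate(&handle_); }     // acquire the opaque handle
    ~ScalingContext() { cublasDestroy(handle_); }     // release it (RAII)

    ScalingContext(const ScalingContext&) = delete;   // owning, so non-copyable
    ScalingContext& operator=(const ScalingContext&) = delete;

    // Forward a method call to the object behind the handle (x is a device pointer)
    void scale(int n, float alpha, float* x, int incx) {
        cublasSscal(handle_, n, &alpha, x, incx);
    }

private:
    cublasHandle_t handle_;   // the opaque reference to the library-side state
};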
Generality and flexibility
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS, to be honest), but if it were well-designed and implemented, it would
create its own context, and try not to impinge on the rest of your code, i.e. it would always do the following (see the sketch after this list):
Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.
Your class doesn't do this.
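Here is a minimal sketch of that push/work/pop discipline, wrapped in a small RAII guard (the names are made up for illustration, "Handle" stands in for the handle struct from your question, and error checks are omitted):

#include <cuda.h>

struct Handle {
    CUcontext context;
};

class ScopedContext {
public:
    explicit ScopedContext(CUcontext context) {
        cuCtxPushCurrent(context);        // make the library's context current
    }
    ~ScopedContext() {
        CUcontext previous;
        cuCtxPopCurrent(&previous);       // restore whatever was current before
    }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
};

void someLibraryFunction(const Handle& handle) {
    ScopedContext guard(handle.context);  // pushed on entry, popped on every exit path
    // callMyKernel(...);                 // do the actual work here
}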
More generally: Are there any recommendations on how to improve the current "Handle" implementation?
Yes:
Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver), the destructor for the object you return should safely release those resources (a sketch follows this list).
Allow for both reference-type and value-type use of Handles, i.e. it may be the handle I created, but it may also be a handle I got from somewhere else and that isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility.
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish wrappers for CUDA contexts, streams, modules, devices etc. without even knowing about raw handles for anything.
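As a sketch of what RAII means for the primary-context part of your handle (the class name is made up, error checks are omitted): the resource retained in the constructor is released in the destructor, so a separate destroyHandle() call cannot be forgotten.

#include <cuda.h>

class PrimaryContextHolder {
public:
    explicit PrimaryContextHolder(CUdevice device) : device_(device) {
        cuDevicePrimaryCtxRetain(&context_, device_);   // acquire
    }
    ~PrimaryContextHolder() {
        cuDevicePrimaryCtxRelease(device_);             // release, on every exit path
    }
    PrimaryContextHolder(const PrimaryContextHolder&) = delete;
    PrimaryContextHolder& operator=(const PrimaryContextHolder&) = delete;

    CUcontext context() const { return context_; }

private:
    CUdevice device_;
    CUcontext context_;
};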
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
To the best of my knowledge, NVIDIA hasn't released it.
I am designing a library that has a large contingent of CUDA kernels to perform parallel computations. All the kernels will be acting on a common object, say a computational grid, which is defined using C++ style objects. The computational domain doesn't necessarily need to be accessed from the host side, so creating it on the device side and keeping it there makes sense for now. I'm wondering if the following is considered "good practice":
Suppose my computational grid class is called Domain. First I define a global device-side variable to store the computational domain:
__device__ Domain* D;
Then I initialize the computational domain using a CUDA kernel:
__global__ void initDomain(paramType P){
    D = new Domain(P);
}
Then, I perform computations using this domain with other kernels:
__global__ void doComputation(double *x,double *y){
    D->doThing(x,y);
    //...
}
If my domain remains fixed (i.e. kernels don't modify the domain once it's created), is this OK? Is there a better way? I initially tried creating the Domain object on the host side and copying it over to the device, but this turned out to be a hassle because Domain is a relatively complex type that makes it a pain to copy over using e.g. cudaMemcpy or even thrust::device_new (at least, I couldn't get it to work nicely).
Yes, it's OK.
Maybe you can improve performance by using
__constant__
With this qualifier, your object will be available in all your kernels in a very fast, cached memory.
In order to copy your object, you must use cudaMemcpyToSymbol. Please note there are some restrictions: your object will be read-only in your device code, and its type must not have a non-trivial constructor (dynamic initialization of __constant__ variables is not supported).
You can find more information here.
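A minimal sketch of that suggestion (the Domain members shown are stand-ins for your real class, which must be trivially constructible and copyable for this to work; the kernel name is made up, and bounds and error checks are omitted):

#include <cuda_runtime.h>

// Stand-in for the question's Domain class
struct Domain {
    int    nx, ny;
    double dx, dy;
};

__constant__ Domain dDomain;   // lives in constant memory, visible to all kernels

__global__ void scaleByDx(double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The domain is read-only in device code
    x[i] = dDomain.dx * y[i];
}

void uploadDomain(const Domain& hostDomain) {
    // Copy the host-side object into the __constant__ variable
    cudaMemcpyToSymbol(dDomain, &hostDomain, sizeof(Domain));
}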
If your object is complex and hard to copy, maybe you can look at unified (managed) memory instead, and simply pass a pointer to your object to your kernel.
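A minimal sketch of that alternative (the Domain and paramType definitions below are stand-ins for your types, and Domain's constructor is assumed to be callable on the host; error checks are omitted):

#include <cuda_runtime.h>
#include <new>

struct paramType { double dx; };

struct Domain {
    double dx;
    __host__ __device__ Domain(paramType p) : dx(p.dx) {}
    __device__ void doThing(double* x, double* y) { x[0] = dx * y[0]; }
};

__global__ void useDomain(Domain* d, double* x, double* y) {
    d->doThing(x, y);
}

Domain* createManagedDomain(paramType p) {
    void* buffer = nullptr;
    cudaMallocManaged(&buffer, sizeof(Domain));   // reachable from host and device
    return new (buffer) Domain(p);                // construct in place on the host
}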
So the problem I am having has not actually happened yet. I am planning out some code for a game I am currently working on, and I know that I am going to need to conserve memory usage as much as possible from step one. My question is: if I have, for example, 500k objects that will need to constantly be constructed and destructed, would it save me any memory to have the functions those classes are going to use as function pointers?

E.g. without function pointers:

class MyClass {
public:
    void Update();
    void Draw();
    ...
};

E.g. with function pointers:

class MyClass {
public:
    void (*Update)();
    void (*Draw)();
    ...
};

Would this save me any memory, or would any new creation of MyClass just access the same functions that were defined for the rest of them? If it does save me any memory, would it be enough to be worthwhile?
Assuming those are not virtual functions, you'd use more memory with function pointers.
The first example
There is no allocation (beyond the base amount required to make new return unique pointers, or your additional implementation that you ... ellipsized).
Non-virtual member functions are basically static functions taking a this pointer.
The advantage is that your objects are really simple, and you'll have only one place to look to find the corresponding code.
The disadvantage is that you lose some flexibility, and have to cram all your code into single update/draw functions. This will be hard to manage for anything but a tiny game.
The second example
You allocate two extra pointers per object instance. Pointers are usually 4 to 8 bytes each (depending on your target platform and your compiler).
The advantage is that you gain a lot of flexibility: you can change the function you're pointing to at runtime, and you can have a multitude of functions that implement it, thus supporting (but not guaranteeing) better code organization.
The disadvantage is that it will be harder to tell which function each instance will point to when you're debugging your application, or when you're simply reading through the code.
Other options
Using function pointers this specific way (instance data members) usually makes more sense in plain C, and less sense in C++.
If you want those functions to be bound at runtime, you may want to make them virtual instead. The cost is compiler/implementation dependent, but I believe it is (usually) going to be one v-table pointer per object instance.
Plus you can take full advantage of virtual function syntax to make your functions based on type, rather than whatever you bound them to. It will be easier to debug than the function pointer option, since you can simply look at the derived type to figure out what function a particular instance points to.
You also won't have to initialize those function pointers - the C++ type system would do the equivalent initialization automatically (by building the v-table, and initializing each instance's v-table pointer).
See: http://www.parashift.com/c++-faq-lite/virtual-functions.html
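For a concrete comparison (a minimal sketch; the exact sizes are implementation-dependent, but these are typical for a 64-bit compiler):

#include <cstdio>

struct Plain {              // non-virtual member functions: no per-object cost
    void Update() {}
    void Draw() {}
};

struct WithPointers {       // two function pointers stored in every instance
    void (*Update)();
    void (*Draw)();
};

struct WithVirtuals {       // one hidden v-table pointer per instance
    virtual void Update() {}
    virtual void Draw() {}
    virtual ~WithVirtuals() {}
};

int main() {
    std::printf("Plain:        %zu bytes\n", sizeof(Plain));        // typically 1
    std::printf("WithPointers: %zu bytes\n", sizeof(WithPointers)); // typically 16
    std::printf("WithVirtuals: %zu bytes\n", sizeof(WithVirtuals)); // typically 8
    return 0;
}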
Apologies if this is a dupe; I couldn't find it.
I've read and understood Grant Skinner's blog on the AS3 garbage collector -
http://www.adobe.ca/devnet/flashplayer/articles/garbage_collection.html -
but my question isn't covered there.
Here's my question.
Suppose I've written some AS3 code like:
statementOne;
statementTwo;
Is there any possibility that the garbage collector will run during or between my two statements, or does it only run after my "user" code has finished and returned control up to Flash?
We have an A-Star code block that's sometimes slow,
and I'd like to eliminate the GC as a potential culprit.
The code block is obviously more complex than my example above,
but it doesn't involve any events or other asynchronous stuff.
tia,
Orion
The garbage collector isn't threaded, so generally speaking it won't run while your code is running. There is one exceptional case where this isn't true, however.
The new keyword will invoke the garbage collector if it runs out of memory.
You can run this code to observe this behaviour:
var vec:Vector.<*> = new Vector.<*>(9001);
for (;;) {
    vec[0] = new Vector.<*>(9001);
    vec = vec[0];
}
Memory usage will quickly jump up to the maximum (1GB for me) then hold until the time cutoff.
There are many good answers here but I think a few subtleties have not been addressed.
The Flash player implements two kinds of garbage collection. One is reference counting, where Flash keeps a count of the incoming references to each object, and knows that when the count reaches zero the object can be freed. The second method is mark-sweep, where Flash occasionally searches for and deletes isolated groups of objects (where the objects refer to one another, but the group as a whole has no incoming references).
As a matter of specification either kind of garbage collection may occur at any time. The player makes no guarantee of when objects will be freed, and mark-sweeps may even be spread out over several frames. Further, there's no guarantee that the timing of GCs will stay the same between Flash player versions, or platforms, or even browsers. So in a way, it's not very useful to know anything about how GCs are triggered, as you have no way of knowing how widely applicable your knowledge is.
Previous point aside, generally speaking mark-sweeps are infrequent, and only triggered in certain circumstances. I know of certain player/OS configurations where a mark/sweep could be triggered every N seconds, or whenever the player exceeded P% of the total available memory, but this was a long time ago and the details will have changed. Reference-counting GCs, in contrast, can happen frequently - between frames, at least. I don't think I've seen them demonstrated to happen more frequently, but that doesn't mean it can't occur (at least in certain situations).
Finally, a word about your specific issue: in my experience it's very unlikely that the GC has anything to do with your performance problem. Reference-counted GC processing may well occur during your loops, but generally this is a good thing - this kind of collection is extremely fast, and keeps your memory usage manageable. The whole point of managing your references carefully is to ensure that this kind of GC happens in a timely manner.
On the other hand, if a mark/sweep occurs during your loops, that would certainly affect your performance. But this is relatively hard to mistake and easy to test. If you're testing on a PC and your application doesn't regularly climb into the hundreds of MBs of memory usage (test with System.totalMemory), you're probably not getting any mark/sweeps at all. If you do have them, it should be easy to verify that the associated pauses are large, infrequent, and always accompanied by a big drop in memory usage. If those things aren't true, you can disregard the idea that GC is part of your problems.
To my knowledge, this is not documented. I have a feeling the GC won't run while your code is being executed (that is, while your code is on the execution stack; each frame, the player creates a new stack for user code). Obviously, this comes from observation and my own experience with Flash, so I wouldn't say this is 100% accurate. Hopefully, it's an educated guess.
Here's a simple test that seems to show that the above is true:
package {

    import flash.display.Sprite;
    import flash.net.FileReference;
    import flash.system.System;
    import flash.utils.Dictionary;
    import flash.utils.setTimeout;

    public class test extends Sprite
    {
        private var _dict:Dictionary = new Dictionary(true);

        public function test()
        {
            testGC();
            setTimeout(function():void {
                traceCount();
            }, 2000);
        }

        private function testGC():void {
            var fileRef:FileReference;
            for(var i:int = 0; i < 100; i++) {
                fileRef = new FileReference();
                _dict[fileRef] = true;
                traceCount();
                System.gc();
            }
        }

        private function traceCount():void {
            var count:int = 0;
            for(var i:* in _dict) {
                count++;
            }
            trace(count);
        }
    }
}
The GC seems to be particularly greedy when there are FileReference objects involved (again, this is from my experience; this isn't documented as far as I know).
Now, if you run the above code, even when calling System.gc() explicitly, the objects are not collected while your function is on the stack: you can see they're still alive by looking at the count of the dictionary (which is set to use weak references for obvious reasons).
When this count is traced again, in a different execution stack (caused by the asynchronous call to setTimeout), all objects have been freed.
So, I'd say it seems the GC is not the culprit of the poor performance in your case. Again, this is a mere observation and the fact that the GC didn't run while executing user code in this test doesn't mean it never will. Likely, it won't, but since this isn't documented, there's no way to know for sure, I'm afraid. I hope this helps.
GC will not run between two statements that are on the same stack / same frame like you have in your example. Memory will be freed prior to the execution of the next frame. This is how most modern VM environments work.
You can't determine when GC will run. If you are trying to stop something from being GC'd, keep a reference to it somewhere.
According to this post by Tom from Gabob, large enough orphaned hierarchies do not get garbage collected.