Run tests in a single Spec in parallel in Kotest

I want to run tests within the same spec in parallel in Kotest. I read the section of the documentation below, but it says you can only run specs in parallel; tests within a single spec will always run sequentially.
https://kotest.io/docs/framework/project-config.html#parallelism
Is there a way to achieve parallelism at the test level? I'm using Kotest for my e2e API testing. All tests are independent, so there should be no problem running them in parallel, but with Kotest I can't. Please advise.

You can enable concurrent tests on a per-spec basis or a global basis.
For example:
class MySpec : FunSpec({
    concurrency = 10
    test("1") { }
    test("2") { }
})
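For global configuration, a project-level config object along these lines should work. This is a sketch based on the linked docs for Kotest 4.x; the test-level concurrency settings were experimental and have changed between releases, so verify the names against your version:

import io.kotest.core.config.AbstractProjectConfig

object ProjectConfig : AbstractProjectConfig() {
    // Number of threads used to run specs in parallel (see the linked docs).
    override val parallelism = 3
    // Experimental in some Kotest 4.x releases; check the name for your version:
    // override val concurrentTests = 10
}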

Related

What is the use of task graphs in CUDA 10?

CUDA 10 added runtime API calls for putting streams (= queues) into "capture mode", so that instead of executing, the enqueued work is recorded into a "graph". These graphs can then be made to actually execute, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case).

Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think the call 'cudaGraphGetNodes' on a stream-captured graph would return the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template, so that instead of enqueueing the individual nodes before every execution, one changes the parameters of every node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
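To make the template pattern concrete, here is a minimal sketch. It assumes CUDA 10.2 or newer for cudaGraphExecHostNodeSetParams, which updates the instantiated executable directly and so avoids the re-instantiation cost described in the comment above; the host function and data are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

// Host callback run by the graph; userData carries the current parameter.
void CUDART_CB hostFn(void *userData) {
    std::printf("value = %d\n", *static_cast<int *>(userData));
}

int main() {
    int values[3] = {10, 20, 30};

    // Build a one-node graph manually, so we hold the node handle.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaHostNodeParams params = {};
    params.fn = hostFn;
    params.userData = &values[0];
    cudaGraphNode_t node;
    cudaGraphAddHostNode(&node, graph, nullptr, 0, &params);

    // Instantiate once; this is the expensive call (CUDA 10/11 signature).
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < 3; ++i) {
        // Re-point the node at new data without re-instantiating
        // (cudaGraphExecHostNodeSetParams requires CUDA 10.2+).
        cudaHostNodeParams updated = {};
        updated.fn = hostFn;
        updated.userData = &values[i];
        cudaGraphExecHostNodeSetParams(exec, node, &updated);
        cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}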
Another useful feature is concurrent kernel execution. In manual mode, one can add nodes to the graph with dependencies, and the runtime will exploit the available concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model, it often happens that the same set of kernels is re-run in the same order but with updated data. I would also expect CUDA to optimize by knowing statically which kernels come next: one can imagine it prefetching more instructions or adapting its scheduling strategy when it knows the whole graph.
CUDA Graphs is trying to solve the problem that, in the presence of many small kernel invocations, quite some time is spent on the CPU dispatching work for the GPU (launch overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can then launch with a single invocation from the CPU instead of many. If you don't have enough invocations, or your algorithm is different each time, then it won't be worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once, or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it is given in Getting Started with CUDA Graphs.
Where task-graph-based paradigms shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason why we have been spending years exploring the ideas of dataflow programming in HPC.
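As a concrete illustration of the overhead argument, here is a sketch of capturing an iterative stencil loop into a graph and replaying it. The kernel is a stand-in, and cudaStreamBeginCapture with an explicit mode argument requires CUDA 10.1+:

#include <cuda_runtime.h>

// Toy Jacobi-style relaxation step; stands in for "many small kernels".
__global__ void jacobiStep(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the per-iteration launch sequence once instead of re-issuing it.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    jacobiStep<<<n / 256, 256, 0, stream>>>(b, a, n);
    jacobiStep<<<n / 256, 256, 0, stream>>>(a, b, n);
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // One CPU-side launch per iteration replays both kernels, so the
    // per-kernel dispatch overhead is paid only once per graph launch.
    for (int iter = 0; iter < 1000; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(a);
    cudaFree(b);
    return 0;
}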

How do I detect the presence of nvprof inside a CUDA program?

I have a small CUDA program that I want to profile with nvprof. The problem is that I want to write the program in such a way that
When I run nvprof my_prog, it will invoke cudaProfilerStart and cudaProfilerStop.
When I run my_prog on its own, it will not invoke either of those APIs, avoiding the profiling overhead.
The problem hence becomes: how do I make my code aware of the presence of nvprof at runtime, without an additional command-line argument?
Have you measured and verified that the cudaProfilerStart/Stop calls introduce measurable overhead when nvprof is not attached? I highly doubt that this is the case.
If this is a problem, you can use #ifdef directives to exclude these calls from your release builds.
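A sketch of that #ifdef approach; ENABLE_PROFILING is a name chosen here, and the profiling build would be compiled with -DENABLE_PROFILING:

#include <cuda_profiler_api.h>

int main() {
#ifdef ENABLE_PROFILING
    cudaProfilerStart();   // focus nvprof on the region below
#endif
    // ... launch and synchronize kernels here ...
#ifdef ENABLE_PROFILING
    cudaProfilerStop();
#endif
    return 0;
}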
There is no way of detecting whether nvprof is running, since that would rather defeat the purpose of profiling: if the profiled application "senses" the profiler, it changes its behavior.

How to parallelize JUnit tests?

I currently have a couple of tests which take very long to run. Inside each test I always do the same thing:
there is a loop which creates a new object (each iteration with different parameters), does some time-consuming calculations with the object, and at the end of each iteration compares the result to the expected result.
Every iteration in this loop is completely isolated. I could easily run all those 200 very time-consuming iterations in parallel, but how best to do this?
Cheers,
AvH
JUnit 4 has built-in parallel processing; check the documentation.
Apart from that, you may want to consider moving the duplicated setup into a static setup method annotated with @BeforeClass. That will make sure the code runs only once in the entire lifecycle.
@BeforeClass
public static void setup() {
    // Move anything that needs to run only once in here.
}
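For the built-in parallel processing mentioned above, JUnit 4 ships an experimental ParallelComputer. A self-contained sketch; the sleeping test classes are placeholders for your slow tests:

import org.junit.Test;
import org.junit.experimental.ParallelComputer;
import org.junit.runner.JUnitCore;

public class ParallelRunner {

    public static class SlowTestA {
        @Test public void t() throws InterruptedException { Thread.sleep(100); }
    }

    public static class SlowTestB {
        @Test public void t() throws InterruptedException { Thread.sleep(100); }
    }

    public static void main(String[] args) {
        // methods() runs the @Test methods of each class in parallel;
        // classes() would run the classes themselves in parallel.
        JUnitCore.runClasses(ParallelComputer.methods(),
                             SlowTestA.class, SlowTestB.class);
    }
}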
You have to create your own modification of the Parameterized runner. See http://jankesterblog.blogspot.de/2011/10/junit4-running-parallel-junit-classes.html
The library JUnit Toolbox provides a ParallelParameterized runner.
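With JUnit Toolbox, the loop iterations map naturally onto parameters, and each parameter set can then run in parallel. A sketch; the parameter values and the calculation are placeholders:

import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized.Parameters;
import com.googlecode.junittoolbox.ParallelParameterized;

@RunWith(ParallelParameterized.class)
public class SlowCalculationTest {

    private final int input;

    public SlowCalculationTest(int input) {
        this.input = input;
    }

    @Parameters
    public static Collection<Object[]> parameters() {
        // One entry per iteration of the original loop (200 in practice).
        return Arrays.asList(new Object[][] { {1}, {2}, {3} });
    }

    @Test
    public void computesExpectedResult() {
        // Time-consuming calculation for this parameter set goes here,
        // followed by the comparison against the expected result.
    }
}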

How do I use multi-threading in Tcl?

I'm trying to run two procedures in parallel. As Tcl is an interpreter, it processes procedures one by one. Can someone explain, with an example, how I can use multi-threading in Tcl?
These days, the usual way to do multi-threading in Tcl is to use its Thread extension; it is developed alongside the Tcl core, but on certain platforms (such as various Linux-based OSes) you might need to install a separate package to get it.
The threading model the Thread extension implements is "one thread per interpreter". This means each thread can "host" just one Tcl interpreter (plus an unlimited number of its child interpreters), but no code executed by any thread may access interpreters hosted in other threads. This, in turn, means that when you work with threads in Tcl, you have to master the idea of multiple interpreters.
The classical approach to exchanging data between interpreters running in different threads is message passing: you post scripts to the input queue of the target interpreter running in the other thread and then wait for a reply. On the other hand, thread-shared variables (implementing shared memory with locking) are also available. Another available feature is support for thread pools.
Read the "Tcl and threads" wiki page and the Thread extension's manual pages; the code examples are on the wiki.
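A minimal sketch in the same message-passing spirit, assuming the Thread package is installed; the computation is a placeholder:

package require Thread

# Create a worker thread that sits in its event loop waiting for scripts.
set worker [thread::create {thread::wait}]

# Post a (placeholder) computation to the worker asynchronously; its result
# is delivered into ::result in this interpreter when it completes.
thread::send -async $worker {expr {6 * 7}} ::result

vwait ::result
puts "worker computed: $::result"
thread::release $worker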
Please note that if the procedures which, you think, have to run in parallel are mostly I/O-bound (that is, they read something from the network and/or send something there) rather than CPU-bound (doing heavy computations), you might get better results with the event-based approach to processing: Tcl has built-in support for the event loop, and you can have it execute your code when the next chunk of data can be read from or written to a channel (such as a network socket).
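For that I/O-bound case, a sketch of the event-based style using only the built-in event loop; example.com is a placeholder host:

# Open a non-blocking socket and register a callback for readable data.
set sock [socket example.com 80]
fconfigure $sock -blocking 0
puts -nonewline $sock "GET / HTTP/1.0\r\n\r\n"
flush $sock

fileevent $sock readable {
    if {[eof $sock]} {
        close $sock
        set ::done 1
    } else {
        puts [gets $sock]
    }
}

# The event loop dispatches the callback whenever data arrives; other
# channels and timers can be serviced in the meantime, with no threads.
vwait ::done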

How to prepare state for several JUnit tests only once

I need to test a program that first preprocesses some data and then computes several different results using this preprocessed data; it makes sense to write a separate test for each computation.
Official JUnit policy seems to be that I should run preprocessing before each computation test.
How can I set up my tests so that the preparation runs only once (it's quite slow) before all the remaining tests?
Use the annotation @BeforeClass to annotate the method that will be run once before all test methods.
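A sketch of how this looks; the List stands in for whatever your slow preprocessing produces:

import java.util.Arrays;
import java.util.List;
import org.junit.BeforeClass;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class ComputationTest {

    private static List<Integer> preprocessed;

    @BeforeClass
    public static void prepareOnce() {
        // Runs exactly once, before any @Test in this class; must be static.
        preprocessed = Arrays.asList(1, 2, 3);  // placeholder for the slow step
    }

    @Test
    public void sumComputation() {
        assertEquals(6, preprocessed.stream().mapToInt(Integer::intValue).sum());
    }

    @Test
    public void sizeComputation() {
        assertEquals(3, preprocessed.size());
    }
}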