How can I make NeuralProphet use the GPU? - deep-learning

I have been searching online for a long time, but with no luck. Please help, or give me some ideas on how to achieve this.
I want to use NeuralProphet with a GPU.
This is my code:
trainer_config = {"accelerator":"gpu"}
m = NeuralProphet(trainer_config=trainer_config)
new_dt = df_ercot
metrics = m.fit(new_dt, freq="W")
But I got the following message and exception:
GPU available: True (cuda), used: True.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu
I have installed PyTorch with GPU support and have been searching online for a long time, but with no luck. Please help, or give me some ideas on how to achieve this.
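For reference, this RuntimeError is the generic PyTorch symptom of the model's parameters and the input tensors living on different devices. A minimal sketch with a toy torch.nn.Linear model (not NeuralProphet's internals, which you don't control directly) shows what "same device" means:

```python
import torch

# Minimal sketch of what the RuntimeError means (toy model, not NeuralProphet):
# the model's weights and the input tensors must live on the same device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 1).to(device)  # move the parameters to the device
x = torch.randn(8, 4).to(device)          # move the inputs the same way

y = model(x)     # no device-mismatch error, since both sides agree
print(y.shape)   # torch.Size([8, 1])
```

With NeuralProphet the data movement happens inside the library, so an error like this suggests some internal tensor is not being moved along with the model when the GPU accelerator is enabled.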

Related

I want to resolve the “Error in diag(vcov(object)) : long vectors not supported yet: array.c:2192” error

I have done a multiple regression analysis in lmer.
I wanted to calculate confidence intervals for the partial regression coefficients using confint(), but when I enter the following code, I get an error and cannot run it:
> confint(H2_FULL, method="Wald")
Error in diag(vcov(object)) :
  long vectors not supported yet: array.c:2192
Does anyone know how to resolve this error? Please help.
I am a beginner in R. I would appreciate it if you could help me understand this clearly.
I only need to be able to calculate 95% confidence intervals for the partial regression coefficients of a multiple regression analysis (multi-model).
I assume four explanatory variables, which is why I think this error occurred.
This error did not occur in the single regression analysis.

Pytorch DirectML computational inconsistency

I am trying to train a DQN on the OpenAI LunarLander environment. I included an argument parser to control which device I use in different runs (CPU or GPU computing, with PyTorch's to("cpu") or to("dml") calls).
Here is my code:
# Putting networks to either CPU or DML e.g. .to("cpu") for CPU .to("dml") for Microsoft DirectML GPU computing.
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
However, pytorch-directml does not yet support some methods, such as .gather(), .max(), mse_loss(), etc. That is why I need to move the data from the GPU to the CPU, do the computations, calculate the loss, and then move everything back to the GPU for further actions. See below.
Q_targets_next = self.Q_target(next_states.to("cpu")).detach().max(1)[0].unsqueeze(1).to("cpu") # Calculate target value from bellman equation
Q_targets = (rewards.to("cpu") + self.args.gamma * Q_targets_next.to("cpu") * (1-dones.to("cpu"))) # Calculate expected value from local network
Q_expected = self.Q(states).contiguous().to("cpu").gather(1, actions.to("cpu"))
# Calculate loss (on CPU)
loss = F.mse_loss(Q_expected, Q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Put the networks back to DML
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
The strange thing is this:
The code is bug-free: when I run it with args.device = "cpu" it works perfectly; however, when I run the exact same code with args.device = "dml", the network does not learn anything.
I noticed that in every iteration the results on CPU and GPU differ only slightly (about 1e-5), but over many iterations this compounds into a huge difference, and the CPU and GPU results end up almost completely different.
What am I missing here? Is there something I need to pay attention to when moving tensors between CPU and GPU? Should I make them contiguous()? Or is this simply a bug in the pytorch-directml library?
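This kind of divergence is not unique to DirectML; any iterated computation can amplify tiny per-step numerical differences. A plain-Python sketch (the chaotic logistic map standing in for the repeated Q-network updates) shows a 1e-5 perturbation growing to order one:

```python
# Sketch: two trajectories of the chaotic logistic map, started 1e-5 apart,
# standing in for CPU and GPU runs whose results differ by ~1e-5 per step.
def trajectory(x0, steps, r=3.9):
    xs, x = [], x0
    for _ in range(steps):
        x = r * x * (1 - x)  # one update step of the map
        xs.append(x)
    return xs

a = trajectory(0.5, 60)
b = trajectory(0.5 + 1e-5, 60)
gaps = [abs(p - q) for p, q in zip(a, b)]
print(gaps[0], max(gaps))  # tiny at first, order one after enough steps
```

Whether that sensitivity alone explains a network that learns nothing at all is a separate question; it would also be worth checking that gradients actually flow through the CPU/GPU round trips.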

convergence code 1 (glmer model, lme4 package)

I'm running a glmer model with a three-way interaction, which causes me to receive the following warning:
Warning:
In optwrap(optimizer, devfun, start, rho$lower, control = control, :
convergence code 1 from nlminbwrap
The warning is not there when the 3-way interaction is omitted, so I suspect it has to do with model complexity.
Unfortunately, the warning gives no further information about the nature of the convergence issue (and neither does the model summary), which makes it hard to resolve. [I've already tried different optimizers and increasing the number of function evaluations.]
Is there any way of finding out what precisely convergence code 1 means? Also, I'm wondering whether it is as serious as when it says Model failed to converge? I've been looking for an answer in the R help pages and in the GLMM FAQs, but can't seem to find any. Any help is much appreciated!
Okay, so I've done some reading here with the hope of being able to help out a fellow linguist. Let's start with the model you specified in the comments:
model=glmer(Correct_or_incorrect~ (condition|CASE) + condition + sound + syll + condition:sound + condition:syll + syll:sound + condition:sound:syll, dataMelt, control=glmerControl(optimizer="nlminbwrap"), family = binomial)
The warning message code didn't say anything useful, but convergence code 1 from bobyqa at least used to mean that the maximum number of function evaluations was exceeded. How high did you go with the limit? I would set it really high and see if the warning goes away; all you'd be losing is computer time, and I personally think that's a small price to pay for a model that doesn't throw warnings.
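As a hedged sketch (the exact control names depend on the optimizer: bobyqa's evaluation cap is maxfun, while nlminb uses eval.max/iter.max), raising the cap looks something like this:

```r
# Sketch: refit with bobyqa and a very high evaluation cap.
# condition * sound * syll expands to all the main effects and
# interactions from the original formula.
model <- glmer(
  Correct_or_incorrect ~ condition * sound * syll + (condition | CASE),
  data = dataMelt, family = binomial,
  control = glmerControl(optimizer = "bobyqa",
                         optCtrl = list(maxfun = 1e6))
)
```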
You also mentioned that the warning was not there when the 3-way interaction is omitted, and I would be inclined to think that you are right about model complexity. If you don't have any specific hypotheses about that interaction, I would leave it out and be done; but if you do, there are a few options you haven't mentioned trying yet.
There is a function called allFit() that will fit the model with all available optimizers. This would be a quick and easy way to see if your estimates are roughly the same among all the different optimizers. You run it on an already fitted model, like this:
allFit(model)
There is a good walkthrough of using allFit() and its different arguments here: https://joshua-nugent.github.io/allFit/ This page also has a lot of other potential solutions to your problem.
If you can, I would take advantage of a machine with multiple cores and run allFit with as many iterations as you can swing, and see if any of the optimizers don't give this warning, which is presumably about not minimizing the loss function before the iterations run out.

CUDA samples matrixMul error

I am very new to CUDA and started reading about parallel programming and CUDA just a few weeks ago. After I installed the CUDA toolkit, I was browsing the SDK samples (which come with the toolkit installation) and wanted to try some of them out. I started with matrixMul from the 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try a bigger one (for example 960x960 or 1024x1024). In this case, something crashes: I get a black screen, and then the message "display driver stopped responding and has recovered".
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong, and should I alter something else in this example for it to work properly? Thanks!
Edit: as some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60). Now my driver does not crash, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with the allocation of memory? Should I make changes there, and if so, what should they be?
Your new problem is actually just the strict tolerance set in the NVIDIA example. Your kernel is running correctly; it's just complaining that the accumulated error is greater than the limit they set for this example, because you're doing a lot more math operations, all of which accumulate error. If you look at the numbers it's giving you, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now, and not with the default matrix sizes, is that the original matrices were smaller and thus required far fewer operations to multiply. Matrix multiplication of N x N matrices requires on the order of N^3 operations, so the number of operations increases much faster than the size of the matrix, and the accumulated error grows in proportion to the number of operations.
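Not from the SDK sample, but you can see the same effect in a few lines of NumPy: a single-precision accumulation drifts further from the exact reference value as the number of operations grows.

```python
import numpy as np

# Sum n copies of 0.1 in float32 and compare against the exact n * 0.1.
# The absolute error grows with the number of accumulated operations.
for n in (1_000, 1_000_000):
    total = np.full(n, 0.1, dtype=np.float32).sum(dtype=np.float32)
    print(n, abs(float(total) - n * 0.1))
```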
If you look near the end of the runTest() function, there's a call to computeGold(), which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results; the last parameter to it is a tolerance. If you increase this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5), you should eliminate these error messages. Note that there may be a couple of these calls: the version of the SDK examples that I have includes an optional CUBLAS implementation, so it compares that against the gold result too. The call right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit closer - it sounds like you're writing to unallocated memory.
More specifically, is the only thing you changed the dimsA and dimsB variables? Inside the kernel, the thread and block indices are used to access the data; did you also increase the data size accordingly? There is no bounds checking in the kernel, so if you only change the kernel launch configuration but not the data, odds are you're writing past your data into other memory.
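To illustrate (a hypothetical kernel, not the SDK's matrixMulCUDA), the usual guard looks like this; without it, threads launched past the edge of the data write out of bounds:

```cuda
// Hypothetical kernel sketch: guard global-memory accesses so that threads
// whose indices fall outside the matrix do nothing, instead of writing
// past the end of the allocation.
__global__ void zeroMatrixGuarded(float *C, int rows, int cols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)   // bounds check
        C[row * cols + col] = 0.0f;
}
```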
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine, but the larger matrices caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable TDR before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.

Generalized Hough Transform in CUDA - How can I speed up the binning process?

Like the title says, I'm working on a little personal research project into parallel computer vision techniques. Using CUDA, I am trying to implement a GPGPU version of the Hough transform. The only problem I've encountered is during the voting process: I'm calling atomicAdd() to prevent multiple simultaneous write operations, and I don't seem to be gaining much performance. I've searched the web, but haven't found any way to noticeably enhance the performance of the voting process.
Any help you could provide regarding the voting process would be greatly appreciated.
I'm not familiar with the Hough transform, so posting some pseudocode could help here. But if you are interested in voting, you might consider using the CUDA vote intrinsic instructions to accelerate this.
Note this requires 2.0 or later compute capability (Fermi or later).
If you are looking to count the number of threads in a block for which a specific condition is true, you can just use __syncthreads_count().
bool condition = ...; // compute the condition
int blockCount = __syncthreads_count(condition); // must be in non-divergent code
If you are looking to count the number of threads in a grid for which the condition is true, you can then do an atomicAdd from a single thread per block (otherwise every thread in the block would add blockCount once):
bool condition = ...; // compute the condition
int blockCount = __syncthreads_count(condition); // must be in non-divergent code
if (threadIdx.x == 0) // one add per block
    atomicAdd(totalCount, blockCount); // totalCount is an int* in global memory
If you need to count the number of threads in a group smaller than a block for which the condition is true, you can use __ballot() and __popc() (population count).
// get the count of threads within each warp for which the condition is true
bool condition = ...; // compute the condition in each thread
int warpCount = __popc(__ballot(condition)); // see the CUDA programming guide for details
Hope this helps.
A while back I used these voting approaches as well; in the end, plain atomicAdd turned out to be even faster, in both scenarios.
This link is very useful:
warp-filtering
And this one was my solved problem: Write data only from selected lanes in a Warp using Shuffle + ballot + popc
Aren't you looking for a critical section?