I have a large equation system to solve. The coefficients are stored in a sparse matrix CM of the dimension 320001 x 320001 elements, of which 18536032 are non-zero. The result vector B is 320001 elements long.
When executing
I=CM\B
Octave Error: SparseMatrix::solve numeric factorization failed
I get the above error message. A brief look into the source code did not give me a clue.
Does anyone know what is causing that error?
BTW: when solving the same problem with a smaller matrix (e.g. 180001x180001) the program runs fine.
Johannes
Octave uses the UMFPACK library to solve sparse linear systems. The source shows that this error message is issued when UMFPACK returns a status code with a negative value. The list of error codes can be found in the UMFPACK user's guide. One of them indicates that there is not enough memory:
UMFPACK_ERROR_out_of_memory (-1): Not enough memory. The ANSI C malloc or realloc routine failed.
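To get a feel for the memory involved, here is a rough sketch that estimates the storage of the matrix alone in compressed-column form, which is how Octave keeps sparse matrices. The helper name csc_storage_bytes is made up for this illustration, and the index width (4 or 8 bytes) is an assumption about how Octave was built.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold a sparse matrix with n_cols columns and nnz stored
 * entries in compressed-column form: one double and one row index per
 * entry, plus n_cols + 1 column pointers.
 * index_size is an assumption about the build (4 for 32-bit indices,
 * 8 for 64-bit indices). */
uint64_t csc_storage_bytes(uint64_t n_cols, uint64_t nnz, size_t index_size)
{
    uint64_t values   = nnz * sizeof(double);
    uint64_t row_idx  = nnz * index_size;
    uint64_t col_ptrs = (n_cols + 1) * index_size;
    return values + row_idx + col_ptrs;
}
```

For the matrix in the question, csc_storage_bytes(320001, 18536032, 4) comes to roughly 213 MiB. That is only the input: the LU factors computed during the factorization are usually considerably denser than CM because of fill-in, so the memory actually requested can be many times larger.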
I am trying to generate a run-time error such as a divide by zero on an ARM Cortex-M3. I don't know why, but when I trigger a divide by zero the system keeps running normally, and the resulting value shows up as "Infinity".
Do the ARM GCC compilers handle this kind of UsageFault error? I have not implemented any hardware exception handlers yet, such as UsageFault, BusFault or MemManage.
The behaviour depends on the architecture. ARMv6-M has no divide instruction, so it is up to software to handle this situation (or the compiler; from the C/C++ point of view it is UB).
On Cortex-M3 (ARMv7-M) things are different: there is a UsageFault exception for handling divide-by-zero (DIVBYZERO) situations.
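Note that on ARMv7-M this trap is disabled after reset, so an integer SDIV or UDIV by zero quietly returns 0 until the DIV_0_TRP bit is set in the Configuration and Control Register. A minimal sketch of enabling it, assuming the ARMv7-M layout (CCR at 0xE000ED14, DIV_0_TRP in bit 4). The helper name enable_div0_trap is hypothetical, and it takes the register address as a parameter so the logic can also be exercised off-target.

```c
#include <stdint.h>

/* DIV_0_TRP is bit 4 of the Configuration and Control Register
 * (SCB->CCR, address 0xE000ED14 on ARMv7-M). */
#define CCR_DIV_0_TRP (1u << 4)

/* Sets DIV_0_TRP in the register that ccr points to, leaving the other
 * bits unchanged. On a real Cortex-M3, ccr would be
 * (volatile uint32_t *)0xE000ED14, or &SCB->CCR when using CMSIS. */
void enable_div0_trap(volatile uint32_t *ccr)
{
    *ccr |= CCR_DIV_0_TRP;
}
```

With the bit set, an integer division by zero raises a UsageFault. If no UsageFault handler is enabled, the fault escalates to HardFault, so some handler should be in place before testing this.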
In contrast to x86, no exception is raised on ARM when an integer division by zero takes place. The result is simply 0.
Edit: This only applies to the Cortex-A series. As Jose noted, there is a control register for integer division in the Cortex-M series, just as there is for the floating-point division described below. See the link in his answer.
For floating-point operations, the floating-point control register (FPSCR for aarch32 or FPCR for aarch64) decides whether an exception is raised. If the corresponding bit is set there, an exception is raised; otherwise only a flag is set in the floating-point status register (FPSCR in aarch32 or FPSR in aarch64), which then indicates the error. These registers can be written via msr and read via mrs.
If no exception is thrown, there are the following rules:
infinity divided by infinity is NaN
zero divided by zero is NaN
anything else divided by infinity is ±zero
anything else divided by zero is ±infinity (sign according to the dividend; this is the case you got in your screenshot)
infinity divided by anything else is ±infinity
zero divided by anything else is ±zero
See the pseudocode of FDIV in ARM a64 instruction set architecture.
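The rules above are the standard IEEE 754 results, so they can be checked from plain C on any conforming target. This is a small sketch, assuming the default floating-point environment in which the divide-by-zero and invalid-operation exceptions are not trapped. The function names fdiv and ieee_division_rules_hold are made up for this illustration.

```c
#include <math.h>
#include <stdbool.h>

/* Divides at run time; the volatile copies keep the compiler from
 * folding the division away at compile time. */
double fdiv(double a, double b)
{
    volatile double x = a, y = b;
    return x / y;
}

/* Returns true if every rule from the list holds on this platform. */
bool ieee_division_rules_hold(void)
{
    double inf = INFINITY;
    return isnan(fdiv(inf, inf))                              /* inf / inf  */
        && isnan(fdiv(0.0, 0.0))                              /* 0 / 0      */
        && fdiv(1.0, inf) == 0.0                              /* x / inf    */
        && isinf(fdiv(1.0, 0.0))  && fdiv(1.0, 0.0)  > 0.0    /* +x / 0     */
        && isinf(fdiv(-1.0, 0.0)) && fdiv(-1.0, 0.0) < 0.0    /* -x / 0     */
        && isinf(fdiv(inf, 2.0))                              /* inf / x    */
        && fdiv(0.0, 2.0) == 0.0;                             /* 0 / x      */
}
```

The "Infinity" seen in the question matches the fourth rule: a non-zero floating-point dividend divided by zero, with no trap enabled.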
References:
FPCR and FPSR in aarch64
FPSCR in aarch32
ARM a64 instruction set architecture
I am using the arima function in R's forecast package and I get the following error:
Error in solve.default(res$hessian*n.used,A): 'a' must be a complex matrix
The time series that I am fitting has very large values, e.g. greater than 10 million. Can anyone suggest a way to avoid this error? Would scaling the time series help?
I am trying to compile a basic memory transfer code using PGI's Fortran compiler (Workstation/PGI Visual Fortran). The compiler throws an error on the line where I have a cudaMemcpy call. The exact error message is "Could not resolve generic procedure cudamemcpy" for the line
istat=cudaMemcpy(arr(1),arr(2),800,cudaMemcpyDevicetoDevice)
I am also using the CUDA Fortran module ("use cudafor"). What's the solution to this compiler error? Thanks!
The arrays arr(1) and arr(2) are of type
type subgrid
integer, device, dimension(:,:,:), allocatable :: field
end type subgrid
The problem was resolved by not using the 4th argument and by specifying the actual field data that needed to be transferred. 800 is the number of integers I needed to be transferred from one slice to the other.
istat=cudaMemcpy(arr(1)%field(:,:,:), arr(2)%field(:,:,:), 800)
Also, the cudaMemcpyDevicetoDevice doesn't affect the function call. It works fine with/without it.
I am very new to cuda and started reading about parallel programming and cuda just a few weeks ago. After I installed the cuda toolkit, I was browsing the sdk samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try with a bigger one (for example 960x960 or 1024x1024). In this case, something crashes (I get a black screen, and then the message: display driver stopped responding and has recovered).
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong, and whether I should alter something else in this example for it to work properly? Thx!
Edit: as some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60). My driver no longer crashes, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with the allocation of the memory? Should I make changes there, and what could they be?
Your new problem is actually just the strict tolerance used in the NVidia example. Your kernel is running correctly; the check is complaining that the accumulated error is greater than the limit set for this example. You are doing many more math operations, and each one contributes rounding error. If you look at the numbers it's giving you, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math.
The reason you're getting these errors now and not with the default matrix sizes is that the original matrices were smaller and thus required far fewer operations to multiply. Multiplying N x N matrices requires on the order of N^3 operations in total, and each output element is a sum of N products, so the rounding error accumulated in each element grows as the matrices get larger.
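The effect is easy to reproduce on the CPU. The sketch below adds 0.01 repeatedly in single and in double precision. The choice of 640 terms of 0.01 is an assumption made to match the ref=6.4 value and the index 409599 = 640*640 - 1 in the question's output; the function names are made up for this illustration.

```c
/* Adds 0.01 n times in single precision, mimicking one output element
 * of a matrix product whose inner dimension is n. */
float sum_float(int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += 0.01f;
    return acc;
}

/* The same sum in double precision, as a more accurate reference. */
double sum_double(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += 0.01;
    return acc;
}

/* Absolute difference between the single-precision sum and n * 0.01. */
double float_sum_error(int n)
{
    double diff = (double)sum_float(n) - n * 0.01;
    return diff < 0.0 ? -diff : diff;
}
```

Calling float_sum_error(640) gives an error that is small but well above what the double-precision sum produces, and it tends to grow with n. The exact figure depends on the order of the additions, so the GPU result need not match this CPU loop digit for digit.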
If you look near the end of the runTest() function, there's a call to computeGold() which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase the size of this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5,) you should eliminate these error messages. Note that there may be a couple of these calls. The version of the SDK examples that I have has an optional CUBLAS implementation, so it has a comparison for that against the gold, too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
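The message in the question ("error term is > 1e-5") suggests that the version of the sample being run checks each element against an absolute tolerance. A minimal sketch of such a check follows; count_mismatches is a hypothetical helper written for this illustration, not the SDK's shrCompareL2fe or any other SDK routine.

```c
#include <stddef.h>

/* Counts the elements whose absolute difference from the reference
 * exceeds tol. */
size_t count_mismatches(const float *ref, const float *data, size_t len,
                        double tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < len; ++i) {
        double diff = (double)data[i] - (double)ref[i];
        if (diff < 0.0)
            diff = -diff;
        if (diff > tol)
            ++bad;
    }
    return bad;
}
```

With the values from the question (6.40005159 against a reference of 6.39999986), every element fails at a tolerance of 1e-5 and none fails at 1e-4, which is why loosening the tolerance makes the messages go away.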
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit closer - it sounds like you're writing to unallocated memory.
More specifically, is the only thing that you changed the dimsA and dimsB variables? Inside the kernel they use the thread and block index to access the data - did you also increase the data size accordingly? There is no bounds checking going on in the kernel, so if you just change the kernel launch configuration, but not the data, then odds are you're writing past your data into some other memory.
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine but that the larger matrices caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable that before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.
I have been trying to understand why I get "SEGMENTATION FAULT" while running a program written in C.
I tried gdb.
This is the message I got:
Program received signal SIGSEGV, Segmentation fault.
0x0016e5e0 in mysql_slave_send_query () from /usr/lib/libmysqlclient.so.16
(gdb) step
Single stepping until exit from function mysql_slave_send_query,
which has no line number information.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
The fact is that I get no compiler error/warning messages, but the program doesn't work properly.
Can anyone help me?
My query is:
char query[512];
sprintf(query, "SELECT t1.Art_Acquisto as 'Cod',t2.Des_Articolo as 'Descrizione',t4.Cod_Categoria as 'Cat.',t1.Data_Acquisto as" "'data',t1.Netto_Acquisto as'Importo',t3.Des_Fornitore as 'Fornitore' from "
"Aquisti as t1, Articoli as t2, Fornitori as t3, Categorie as t4 where t1.Art_Acquisto = t2.Cod_Articolo and "
"t1.fornitoreM = t3.codiceF and t4.codiceC = t2.categoriaA and Art_Acquisto ='%s'order by Data_Acquisto;",Cod_Articolo);
Check your arguments to mysql_slave_send_query, especially the length and the proper initialisation/allocation of the other arguments.
"Segmentation fault" means that your program is accessing an invalid memory location during the execution of the function named mysql_slave_send_query.
Therefore this means there is a bug in your C program.
How long is the string in Cod_Articolo? Your query is nearly 400 characters, and the buffer you're storing it in is only 512 characters - if Cod_Articolo is over 110 characters, sprintf will write past the end of the buffer and your code will break.
If this is the problem, you can prevent it by checking the size of Cod_Articolo before writing it into the query string. Even better, you could calculate the size the query string needs to be, and allocate exactly the right amount.
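The "calculate the size" suggestion can be sketched with the size-query form of C99 snprintf, which returns the length the formatted string would have without writing anything. The function name build_query and the shortened query text are placeholders for this illustration, not the original query.

```c
#include <stdio.h>
#include <stdlib.h>

/* Formats the query into a heap buffer sized to fit exactly.
 * Returns a malloc'd string the caller must free, or NULL on failure.
 * The query text is a shortened placeholder. */
char *build_query(const char *cod_articolo)
{
    static const char fmt[] =
        "SELECT * FROM Aquisti WHERE Art_Acquisto ='%s' "
        "ORDER BY Data_Acquisto;";

    /* First pass: a NULL buffer of size 0 only measures the length. */
    int needed = snprintf(NULL, 0, fmt, cod_articolo);
    if (needed < 0)
        return NULL;

    char *query = malloc((size_t)needed + 1);   /* +1 for the terminator */
    if (query == NULL)
        return NULL;

    /* Second pass: the buffer is now guaranteed to be large enough. */
    snprintf(query, (size_t)needed + 1, fmt, cod_articolo);
    return query;
}
```

The caller would pass the returned string to the MySQL call and free it afterwards. A NULL return means either a formatting error or a failed allocation.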
It's also good practice to use snprintf instead of sprintf, as then you can ensure you don't copy too many characters into the destination string.
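A minimal sketch of the snprintf variant with a fixed buffer, including the truncation check that makes it useful. The function name format_query and the shortened query text are placeholders for this illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Writes the query into dst without ever exceeding dst_size bytes.
 * Returns 0 on success, -1 if formatting failed or the result did not
 * fit (in which case dst holds a truncated, still terminated string).
 * The query text is a shortened placeholder. */
int format_query(char *dst, size_t dst_size, const char *cod_articolo)
{
    int n = snprintf(dst, dst_size,
                     "SELECT * FROM Aquisti WHERE Art_Acquisto ='%s';",
                     cod_articolo);
    if (n < 0 || (size_t)n >= dst_size)
        return -1;   /* error, or output was truncated */
    return 0;
}
```

Checking the return value matters: snprintf prevents the buffer overflow, but a silently truncated query would still be wrong, so the caller should refuse to send it when format_query returns -1.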