How can I discover whether my CPU is 32 or 64 bits? - language-agnostic

How do I find out if my processor is 32 bit or 64 bit (in your language of choice)? I want to know this for both Intel and AMD processors.

Windows, C/C++:
#include <windows.h>

SYSTEM_INFO sysInfo;
::GetSystemInfo(&sysInfo);
switch (sysInfo.wProcessorArchitecture) {
case PROCESSOR_ARCHITECTURE_AMD64:
case PROCESSOR_ARCHITECTURE_IA64:
    // 64-bit
    break;
case PROCESSOR_ARCHITECTURE_INTEL:
    // 32-bit
    break;
case PROCESSOR_ARCHITECTURE_UNKNOWN:
default:
    // something else
    break;
}

C#, OS agnostic
IntPtr.Size == 4 ? "32-bit" : "64-bit"
This is somewhat crude but basically tells you whether the CLR is running as 32-bit or 64-bit, which is more likely what you would need to know. The CLR can run as 32-bit on a 64-bit processor, for example.
For more information, see here: How to detect Windows 64-bit platform with .NET?

The tricky bit here is that you might have a 64-bit CPU but a 32-bit OS. If you care about that case, it is going to require an asm stub that interrogates the CPU (via CPUID). If not, you can ask the OS easily.

In .NET you can differentiate x86 from x64 by looking at the Size property of the IntPtr structure. IntPtr.Size is reported in bytes (8 bits per byte), so it is 4 in a 32-bit process and 8 in a 64-bit process. Since we talk about 32-bit and 64-bit processors rather than 4-byte or 8-byte processors, I like to do the comparison in bits, which makes it clearer what is going on.
C#
if (IntPtr.Size * 8 == 64)
{
    // x64 code
}
PowerShell
if ([IntPtr]::Size * 8 -eq 64)
{
    # x64 code
}

In Python:
In [10]: import platform
In [11]: platform.architecture()
Out[11]: ('32bit', 'ELF')
As usual, pretty neat. But I'm pretty sure these functions report how the Python executable itself was built, not the platform it is running on. There is still a small chance that someone is running a 32-bit build on a 64-bit machine.
You can get some more info like:
In [13]: platform.system()
Out[13]: 'Linux'
In [19]: platform.uname()
Out[19]:
('Linux',
'asus-u6',
'2.6.28-11-generic',
'#42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009',
'i686',
'')
etc.
This looks more like live data :-)
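If you want to compare how the interpreter was built with what the machine actually is, platform.machine() reports the hardware name the OS sees. A minimal sketch (the exact strings such as 'x86_64' or 'AMD64' vary by OS):
import platform

# Bitness of the running Python build (how the executable was compiled)
print(platform.architecture()[0])   # '32bit' or '64bit'

# Hardware architecture reported by the OS (live data), e.g. 'i686', 'x86_64', 'AMD64'
print(platform.machine())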

VBScript, Windows:
Const PROCESSOR_ARCHITECTURE_X86  = 0
Const PROCESSOR_ARCHITECTURE_IA64 = 6
Const PROCESSOR_ARCHITECTURE_X64  = 9

strComputer = "."
Set oWMIService = GetObject("winmgmts:" & _
    "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
Set colProcessors = oWMIService.ExecQuery("SELECT * FROM Win32_Processor")

For Each oProcessor In colProcessors
    Select Case oProcessor.Architecture
        Case PROCESSOR_ARCHITECTURE_X86
            ' 32-bit
        Case PROCESSOR_ARCHITECTURE_X64, PROCESSOR_ARCHITECTURE_IA64
            ' 64-bit
        Case Else
            ' other
    End Select
Next
Another possible solution for Windows Script Host, this time in JScript and using the PROCESSOR_ARCHITECTURE environment variable:
var oShell = WScript.CreateObject("WScript.Shell");
var oEnv = oShell.Environment("System");
switch (oEnv("PROCESSOR_ARCHITECTURE").toLowerCase())
{
    case "x86":
        // 32-bit
        break;
    case "amd64":
        // 64-bit
        break;
    default:
        // other
        break;
}
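The same environment-variable idea works from anything that can read the environment. A sketch in Python, with the caveat that PROCESSOR_ARCHITEW6432 is only set for 32-bit processes running on 64-bit Windows:
import os

# For a 32-bit process on 64-bit Windows, PROCESSOR_ARCHITECTURE reports "x86",
# so check PROCESSOR_ARCHITEW6432 first.
arch = os.environ.get("PROCESSOR_ARCHITEW6432",
                      os.environ.get("PROCESSOR_ARCHITECTURE", "")).lower()
if arch == "amd64":
    print("64-bit")
elif arch == "x86":
    print("32-bit")
else:
    print("other: " + arch)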

I was thinking: on a 64-bit processor, pointers are 64-bit. So, instead of checking processor features, it may be possible to use pointers to 'test' it programmatically. It could be as simple as creating a structure with two contiguous pointers and then checking their 'stride'.
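The same "how big is a pointer" test, sketched with Python's ctypes; note that it reports how the current process was built, not what the CPU itself is capable of:
import ctypes

ptr_bits = ctypes.sizeof(ctypes.c_void_p) * 8
print("%d-bit process" % ptr_bits)   # 32 in a 32-bit build, 64 in a 64-bit build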

C# Code:
int size = Marshal.SizeOf(typeof(IntPtr));
if (size == 8)
{
    Text = "64 bit";
}
else if (size == 4)
{
    Text = "32 bit";
}

In Linux you can determine the "bitness" of the CPU by reading
/proc/cpuinfo
e.g.
cat /proc/cpuinfo | grep flags
If it contains the
lm
("long mode") flag, it's an x86 64-bit CPU (even if you have 32-bit Linux installed).
Not sure if this works for non-x86 CPUs as well, such as PPC or ARM.
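A small sketch of that check in Python, assuming a Linux system where /proc/cpuinfo exists (the lm flag only carries this meaning on x86):
def cpu_supports_64bit(path="/proc/cpuinfo"):
    # Look for the 'lm' (long mode) flag on any "flags" line.
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return "lm" in line.split(":", 1)[1].split()
    return False

print("64-bit capable CPU" if cpu_supports_64bit() else "32-bit (or non-x86) CPU")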

Related

What misconfiguration might cause Boost Filesystem to fail with an access violation when compiled in Debug mode under Visual Studio 2019?

I am struggling to understand why some of my code using boost, which was working fine under Visual Studio 2017, is now resulting in an access violation under Visual Studio 2019. However, I only encounter this failure under Debug build. Release build works fine, with no issues.
What could I have set up incorrectly in my build, environment, or code, that could cause such a failure?
My environment:
Windows 10
Boost 1.74 (dynamic link)
Visual Studio 2019 v16.7.6
Compiling for C++ x64
The failing line of my code is this:
boost::filesystem::path dir = (boost::filesystem::temp_directory_path() / boost::filesystem::unique_path("%%%%-%%%%-%%%%-%%%%"));
The failing line in Boost filesystem is this here in boost/filesystem/path.hpp:
namespace path_traits
{   // without codecvt
    inline
    void convert(const char* from,
                 const char* from_end,    // 0 for null terminated MBCS
                 std::wstring & to)
    {
        convert(from, from_end, to, path::codecvt());
    }
The failure message reported by Visual Studio is as follows:
Exception thrown at 0x00007FF9164F1399 (vcruntime140d.dll) in ezv8.tests.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
The call stack looks like this:
vcruntime140d.dll!00007ff9164f1550() Unknown
> boost_filesystem-vc142-mt-gd-x64-1_74.dll!wmemmove(wchar_t * _S1, const wchar_t * _S2, unsigned __int64 _N) Line 248 C++
boost_filesystem-vc142-mt-gd-x64-1_74.dll!std::_WChar_traits<wchar_t>::move(wchar_t * const _First1, const wchar_t * const _First2, const unsigned __int64 _Count) Line 204 C++
boost_filesystem-vc142-mt-gd-x64-1_74.dll!std::wstring::append(const wchar_t * const _Ptr, const unsigned __int64 _Count) Line 2864 C++
boost_filesystem-vc142-mt-gd-x64-1_74.dll!std::wstring::append<wchar_t *,0>(wchar_t * const _First, wchar_t * const _Last) Line 2916 C++
boost_filesystem-vc142-mt-gd-x64-1_74.dll!`anonymous namespace'::convert_aux(const char * from, const char * from_end, wchar_t * to, wchar_t * to_end, std::wstring & target, const std::codecvt<wchar_t,char,_Mbstatet> & cvt) Line 77 C++
boost_filesystem-vc142-mt-gd-x64-1_74.dll!boost::filesystem::path_traits::convert(const char * from, const char * from_end, std::wstring & to, const std::codecvt<wchar_t,char,_Mbstatet> & cvt) Line 153 C++
appsvcs.dll!boost::filesystem::path_traits::convert(const char * from, const char * from_end, std::wstring & to) Line 1006 C++
appsvcs.dll!boost::filesystem::path_traits::dispatch<std::wstring>(const std::string & c, std::wstring & to) Line 257 C++
appsvcs.dll!boost::filesystem::path::path<char [20]>(const char[20] & source, void * __formal) Line 168 C++
I use UTF-8 strings throughout my code, so I have configured boost::filesystem to expect UTF-8 strings as follows:
boost::nowide::nowide_filesystem();
The cause of this issue turned out to be inconsistent use of _ITERATOR_DEBUG_LEVEL. This setting does affect ABI compatibility. I was setting this flag (to 0) in my own code, but it was not set in the Boost build. The solution is to either remove the flag from one's own code, or add the flag to the Boost build by adding define=_ITERATOR_DEBUG_LEVEL=0 to the b2 arguments (from another stack overflow answer).

How to losslessly convert a double to a string and back in Octave

When saving a double to a string there is some loss of precision. Even if you use a very large number of digits, the conversion may not be reversible, i.e. if you convert a double x to a string sx and then convert back, you will get a number x' which may not be bitwise equal to x. This may cause problems, for instance when checking for differences in a battery of tests. One possibility is to use a binary form (for instance the native binary format, or HDF5), but I want to store the number in a text file, so I need a conversion to a string. I have a working solution, but I ask whether there is some standard for this, or a better solution.
In C/C++ you could reinterpret the double's bytes through a char* and then convert each byte to a two-digit hex value with printf("%02x", c[j]). Then, for instance, pi would be converted to a string of length 16: 54442d18400921fb. The problem with this is that if you read the hex you don't get any idea of which number it is. So I would be interested in some mix, for instance pi -> 3.14{54442d18400921fb}. The first part is a (probably low-precision) decimal representation of the number (typically I would use a "%g" output conversion), and the string in braces is the lossless hexadecimal representation.
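As a side note, Python's standard library already offers this kind of lossless text round trip; a short sketch for comparison (struct and float.hex are standard, only the choice of example value is mine):
import struct

x = 3.141592653589793

# Raw bytes as 16 hex digits, same spirit as the printf("%02x", ...) loop
raw_hex = struct.pack("<d", x).hex()

# Built-in lossless round trip
s = float.hex(x)                  # '0x1.921fb54442d18p+1'
assert float.fromhex(s) == x
print(raw_hex, s)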
EDIT: I post the code as an answer below.
Following the ideas already suggested in the post, I wrote the
following functions, which seem to work.
function s = dbl2str(d);
    z = typecast(d,"uint32");
    s = sprintf("%.3g{%08x%08x}\n",d,z);
endfunction

function d = str2dbl(s);
    k1 = index(s,"{");
    k2 = index(s,"}");
    ## Check that the braces are balanced, or absent altogether
    assert((k1==0) == (k2==0));
    if k1>0; assert(k2>k1); endif
    if (k1==0);
        ## If there is no {hexa} part, convert with loss
        d = str2double(s);
    else
        ## Convert losslessly
        ss = substr(s,k1+1,k2-k1-1);
        z = uint32(sscanf(ss,"%8x",2));
        d = typecast(z,"double");
    endif
endfunction
Then I have
>> spi=dbl2str(pi)
spi = 3.14{54442d18400921fb}
>> pi2 = str2dbl(spi)
pi2 = 3.1416
>> pi2-pi
ans = 0
>> snan = dbl2str(NaN)
snan = NaN{000000007ff80000}
>> nan1 = str2dbl(snan)
nan1 = NaN
A further improvement would be to use another type of encoding, for
instance Base64 (as suggested by @CrisLuengo in a comment), which would
reduce the length of the binary part from 16 to 11 characters.
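A rough sketch of that Base64 variant in Python, just to illustrate the length (8 bytes of a double encode to 11 Base64 characters once the trailing '=' padding is stripped); the decimal{payload} layout simply mirrors the format used above:
import base64
import math
import struct

def dbl2str_b64(d):
    payload = base64.b64encode(struct.pack("<d", d)).decode().rstrip("=")
    return "%.3g{%s}" % (d, payload)

def str2dbl_b64(s):
    payload = s[s.index("{") + 1 : s.index("}")]
    payload += "=" * (-len(payload) % 4)          # restore Base64 padding
    return struct.unpack("<d", base64.b64decode(payload))[0]

s = dbl2str_b64(math.pi)          # e.g. '3.14{GC1EVPshCUA}', an 11-character payload
assert str2dbl_b64(s) == math.pi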

Negative Speed Gain Using Numba Vectorize target='cuda'

I am trying to test out the effectiveness of using the Python Numba module's @vectorize decorator for speeding up a code snippet relevant to my actual code. I'm utilizing the code snippet provided in CUDAcast #10, available here and shown below:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()
In the CUDAcast demo, the demonstrator gets a 100x speedup by sending the large array computation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:
@vectorize(["float32(float32, float32)"], target='cuda')
... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' it is 0.22 seconds. I'm using a Dell Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (the NVIDIA profiler tool) indicates that the majority of the time is spent in memory handling (expected), but even the function evaluation takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is this reasonable given my hardware and code?
This question is also interesting to me.
I tried your code and got similar results.
To investigate this a bit, I wrote a CUDA kernel using cuda.jit and added it to your code:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000  # 32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a, b):
    return a + b

A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
In this 'benchmark' I also take into account the time needed to copy the arrays from host to device and from device to host. In this case the GPU function is slower than the CPU one.
For the case above:
CPU - 0.0033;
GPU - 0.0096;
Vectorize (target='cuda') - 0.15 (for my PC).
If the copying time is not counted:
GPU - 0.000245
So, what I have learned: (1) Copying from host to device and from device to host is time-consuming. That is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down the calculations on the GPU. (3) It is better to use self-written kernels (and, of course, to minimize the memory copying).
By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in this case the Python program's execution time is comparable to that of a C program and provides about a 100x speedup. That is because, fortunately, in this case you can do many iterations without data exchange between host and device.
UPD. Used Software & Hardware: Win7 64bit, CPU: Intel Core2 Quad 3GHz, GPU: NVIDIA GeForce GTX 580.
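To separate kernel time from transfer time more cleanly, the arrays can be left on the device and the timer stopped only after a synchronize, since kernel launches are asynchronous. A sketch reusing the VectorAdd_GPU kernel and the A, B, griddim and blockdim values from the listing above:
from timeit import default_timer as timer
from numba import cuda

d_A = cuda.to_device(A)          # host-to-device copies happen outside the timer
d_B = cuda.to_device(B)
cuda.synchronize()

start = timer()
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
cuda.synchronize()               # wait for the kernel to actually finish
print("Kernel only: %f seconds" % (timer() - start))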

Why is the result different between CPU and GPU?

This is my code running on the GPU:
tid  = threadidx%x
bid  = blockidx%x
bdim = blockdim%x
isec = mesh_sec_1(lev)+bid-1
if (isec .le. mesh_sec_0(lev)) then
    if(.not. sec_is_int(isec)) return
    do iele = tid, sec_n_ele(isec), bdim
        idx    = n_ele_idx(isec)+iele
        u(1:5) = fv_u(1:5,idx)
        u(6  ) = fv_t(      idx)
        g      = 0.0d0
        do j = sec_iA_ls(idx), sec_iA_ls(idx+1)-1
            ss  = sec_jA_ls(1,j)
            ee  = sec_jA_ls(2,j)
            tem = n_ele_idx(ss)+ee
            du(1:5)  = fv_u(1:5, n_ele_idx(ss)+ee)-u(1:5)
            du(6  )  = fv_t(     n_ele_idx(ss)+ee)-u(6  )
            coe(1:3) = sec_coe_ls(1:3,j)
            do k = 1,6
                g(1:3,k) = g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
            end do
        end do
        do j = 1,6
            do i = 1,3
                fv_gra(i+(j-1)*3,idx) = g(i,j)
            end do
        end do
    end do
end if
and this is my code running on the CPU:
do isec = h_mesh_sec_1(lev), h_mesh_sec_0(lev)
    if(.not. h_sec_is_int(isec)) cycle
    do iele = 1, h_sec_n_ele(isec)
        idx    = h_n_ele_idx(isec)+iele
        u(1:5) = h_fv_u(1:5,idx)
        u(6  ) = h_fv_t(      idx)
        g      = 0.0d0
        do j = h_sec_iA_ls(idx), h_sec_iA_ls(idx+1)-1
            ss = h_sec_jA_ls(1,j)
            ee = h_sec_jA_ls(2,j)
            du(1:5) = h_fv_u(1:5,h_n_ele_idx(ss)+ee)-u(1:5)
            du(6  ) = h_fv_t(     h_n_ele_idx(ss)+ee)-u(6  )
            do k = 1,6
                g(1:3,k) = g(1:3,k) + du(k)*h_sec_coe_ls(1:3,j)
            end do
        end do
        do j = 1,6
            do i = 1,3
                h_fv_gra(i+(j-1)*3,idx) = g(i,j)
            end do
        end do
    end do
end do
Variables prefixed with h_* belong to the CPU; the others belong to the GPU.
The results are the same at most points, but at some points they differ slightly. I added check code like this:
do i = 1, size(h_fv_gra,1)
    do j = 1, size(h_fv_gra,2)
        if(hd_fv_gra(i,j)-h_fv_gra(i,j) .ge. 1.0d-9) then
            print *, hd_fv_gra(i,j)-h_fv_gra(i,j), i, j
        end if
    end do
end do
hd_* is a copy of the GPU result. We can see the differences:
1.8626451492309570E-009 13 14306
1.8626451492309570E-009 13 14465
1.8626451492309570E-009 13 14472
1.8626451492309570E-009 14 14128
1.8626451492309570E-009 14 14146
1.8626451492309570E-009 14 14150
1.8626451492309570E-009 14 14153
1.8626451492309570E-009 14 14155
1.8626451492309570E-009 14 14156
So I am confused about that. The error from CUDA should not be as large as this. Any reply is welcome.
In addition, I don't know how to print variables in GPU code, which would help me debug.
In your code, the calculation of the g value most probably benefits from fused multiply-add (FMA) optimization in CUDA:
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
On the CPU side this is not impossible, but it strongly depends on compiler choices (and on whether the actual CPU running the code implements FMA).
To enforce the use of separate multiply and add on the GPU, you want to use intrinsics from CUDA, as defined here, such as:
__device__ double __dadd_rn(double x, double y): adds two floating-point values in round-to-nearest-even mode,
and
__device__ double __dmul_rn(double x, double y): multiplies two floating-point values in round-to-nearest-even mode,
with a rounding mode identical to the one used on the CPU (which depends on the CPU architecture, whether it be Power, Intel x86, or other).
An alternate approach is to pass the --fmad=false option to ptxas when compiling CUDA code, using nvcc's --ptxas-options option, detailed here.
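To see the single-versus-double rounding at work, here is a small sketch in Python (it assumes Python 3.13+, where math.fma is available); the fused version rounds once, so the two results differ in the last bits, which is exactly the kind of discrepancy described above:
import math

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30

separate = a * b - 1.0           # a*b rounds to 1.0 first, so this gives 0.0
fused = math.fma(a, b, -1.0)     # one rounding: exactly -2**-60

print(separate)                  # 0.0
print(fused)                     # -8.673617379884035e-19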

"dimension too large" error when broadcasting to sparse matrix in octave

32-bit Octave has a limit on the maximum number of elements in an array. I have recompiled from source (following the script at https://github.com/calaba/octave-3.8.2-enable-64-ubuntu-14.04 ), and now have 64-bit indexing.
Nevertheless, when I attempt to perform elementwise multiplication using a broadcast function, I get the error: out of memory or dimension too large for Octave's index type
Is this a bug, or an undocumented feature? If it's a bug, does anyone have a reasonably efficient workaround?
Minimal code to reproduce the problem:
function indexerror();
    % both of these are formed without error
    % a = zeros (2^32, 1, 'int8');
    % b = zeros (1024*1024*1024*3, 1, 'int8');
    % sizemax % returns 9223372036854775806

    nnz = 1000    % number of non-zero elements
    rowmax = 250000
    colmax = 100000
    irow = zeros(1,nnz);
    icol = zeros(1,nnz);
    for ind = 1:nnz
        irow(ind) = round(rowmax/nnz*ind);
        icol(ind) = round(colmax/nnz*ind);
    end
    sparseMat = sparse(irow,icol,1,rowmax,colmax);

    % column vector to be broadcast
    broad = 1:rowmax;
    broad = broad(:);

    % this gives "dimension too large" error
    toobig = bsxfun(@times,sparseMat,broad);

    % so does this
    toobig2 = sparse(repmat(broad,1,size(sparseMat,2)));
    mult = sparse( sparseMat .* toobig2 ); % never made it this far
end
EDIT:
Well, I have an inefficient workaround. It's slower than using bsxfun by a factor of 3 or so (depending on the details), but it's better than having to sort through the error in the libraries. Hope someone finds this useful some day.
% loop over rows, instead of using bsxfun
mult_loop = sparse([],[],[],rowmax,colmax);
for ind = 1:length(broad);
    mult_loop(ind,:) = broad(ind) * sparseMat(ind,:);
end
The unfortunate answer is that yes, this is a bug. Apparently bsxfun and repmat are returning full matrices rather than sparse ones. A bug has been filed here:
http://savannah.gnu.org/bugs/index.php?47175
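Until that is fixed, the row-scaling can also be phrased as multiplication by a sparse diagonal matrix, which never materializes a dense intermediate. For illustration, a sketch of that formulation in Python/SciPy (the same idea can be expressed in Octave with a sparse diagonal matrix built from broad):
import numpy as np
import scipy.sparse as sp

rowmax, colmax, nnz = 250000, 100000, 1000
ind = np.arange(1, nnz + 1)
irow = np.round(rowmax / nnz * ind).astype(int) - 1   # 0-based row indices
icol = np.round(colmax / nnz * ind).astype(int) - 1
sparse_mat = sp.csr_matrix((np.ones(nnz), (irow, icol)), shape=(rowmax, colmax))

broad = np.arange(1, rowmax + 1, dtype=float)

# Scale row i by broad[i]; the result stays sparse.
scaled = sp.diags(broad).tocsr() @ sparse_mat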