One physical USB hard disk provides two ~2TB block devices, but I want one large 4TB device - partitioning

I have a vendor-preformatted hard disk (Verbatim Store'n'Save USB 3.0 4TB). When I connect it to my CentOS 6 server I see two block devices, /dev/sdd and /dev/sde, with corresponding partitions, each about 2TB in size. dmesg gives:
usb 4-4: new SuperSpeed USB device number 3 using xhci_hcd
usb 4-4: LPM exit latency is zeroed, disabling LPM.
usb 4-4: New USB device found, idVendor=18a5, idProduct=0400
usb 4-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 4-4: Product: USB 3.0 Desktop HD
usb 4-4: Manufacturer: Verbatim
usb 4-4: SerialNumber: 30624C151155
usb 4-4: configuration #1 chosen from 1 choice
scsi7 : SCSI emulation for USB Mass Storage devices
usb-storage: device found at 3
usb-storage: waiting for device to settle before scanning
usb-storage: device scan complete
scsi 7:0:0:0: Direct-Access TOSHIBA MD04ACA400 FP2A PQ: 0 ANSI: 6
scsi 7:0:0:1: Direct-Access TOSHIBA MD04ACA400 FP2A PQ: 0 ANSI: 6
sd 7:0:0:0: Attached scsi generic sg4 type 0
sd 7:0:0:1: Attached scsi generic sg5 type 0
sd 7:0:0:0: [sdd] 4294965248 512-byte logical blocks: (2.19 TB/1.99 TiB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 1f 00 00 08
sd 7:0:0:0: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sde] 3519071920 512-byte logical blocks: (1.80 TB/1.63 TiB)
sd 7:0:0:1: [sde] Write Protect is off
sd 7:0:0:1: [sde] Mode Sense: 1f 00 00 08
fdisk gives:
fdisk -l /dev/sdd
Disk /dev/sdd: 2199.0 GB, 2199022206976 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x1428089d
Device Boot Start End Blocks Id System
/dev/sdd1 1 267350 2147480576 c W95 FAT32 (LBA)
I want to have a single device of 4TB size instead.
I tried to use parted to set the disk label type to GPT, but without success.

As the sector size is 512 bytes, you cannot have a single 4TB partition with an MBR label: MBR addresses sectors with 32 bits, so it tops out at 2^32 * 512 B = 2 TiB, which is exactly why sdd is capped at 4294965248 (~2^32) sectors. A GPT label would lift that limit, but only on a single block device, and the USB bridge in this enclosure splits the one physical Toshiba disk into two separate ~2TB LUNs (scsi 7:0:0:0 and 7:0:0:1 above), so no partition table on either device can span the full 4TB...

Related

A CUDA context was created on a GPU that is not currently debuggable

When I start CUDA debugging, Nsight returns this error:
A CUDA context was created on a GPU that is not currently debuggable.
Breakpoints will be disabled.
Adapter: GeForce GT 720M
My system and CUDA information is below.
Please note that the latest versions of CUDA and Nsight are installed.
I searched for this issue but could not find an answer.
Thank you so much.
Report Information
UnixTime Generated 1490538033
OS Information
Computer Name DESKTOP-OLFM6NT
NetBIOS Name DESKTOP-OLFM6NT
OS Name Windows 10 Pro
GetVersionEx
dwMajorVersion 10
dwMinorVersion 0
dwBuildNumber 14393
dwPlatformId 2
wServicePackMajor 0
wServicePackMinor 0
wSuiteMask 256
wProductType Workstation
GetProductInfo 48
GetNativeSystemInfo
wProcessorArchitecture x64
dwPageSize 4096
lpMinimumApplicationAddress 65536
lpMaximumApplicationAddress 140737488289791
dwActiveProcessorMask 15
dwNumberOfProcessors 4
dwAllocationGranularity 65536
wProcessorLevel 6
wProcessorRevision 17665
EnumDisplayDevices
Display Device
DeviceName \\.\DISPLAY1
DeviceString Intel(R) HD Graphics Family
StateFlags 5
DeviceID PCI\VEN_8086&DEV_0A16&SUBSYS_397817AA&REV_09
DeviceKey \Registry\Machine\System\CurrentControlSet\Control\Video\{A9611CC2-95E1-4DAE-9937-60210AFEDCE0}\0000
Monitor
DeviceName \\.\DISPLAY1\Monitor0
DeviceString Generic PnP Monitor
StateFlags 3
DeviceID MONITOR\CMN15B6\{4d36e96e-e325-11ce-bfc1-08002be10318}\0003
DeviceKey \Registry\Machine\System\CurrentControlSet\Control\Class\{4d36e96e-e325-11ce-bfc1-08002be10318}\0003
Display Device
DeviceName \\.\DISPLAY2
DeviceString Intel(R) HD Graphics Family
StateFlags 1
DeviceID PCI\VEN_8086&DEV_0A16&SUBSYS_397817AA&REV_09
DeviceKey \Registry\Machine\System\CurrentControlSet\Control\Video\{A9611CC2-95E1-4DAE-9937-60210AFEDCE0}\0001
Monitor
DeviceName \\.\DISPLAY2\Monitor0
DeviceString Generic PnP Monitor
StateFlags 3
DeviceID MONITOR\SAM04FD\{4d36e96e-e325-11ce-bfc1-08002be10318}\0004
DeviceKey \Registry\Machine\System\CurrentControlSet\Control\Class\{4d36e96e-e325-11ce-bfc1-08002be10318}\0004
Display Device
DeviceName \\.\DISPLAY3
DeviceString Intel(R) HD Graphics Family
StateFlags 0
DeviceID PCI\VEN_8086&DEV_0A16&SUBSYS_397817AA&REV_09
DeviceKey \Registry\Machine\System\CurrentControlSet\Control\Video\{A9611CC2-95E1-4DAE-9937-60210AFEDCE0}\0002
GlobalMemoryStatusEx
dwMemoryLoad 34
ullTotalPhys 8486227968
ullAvailPhys 5588660224
ullTotalPageFile 13854937088
ullAvailPageFile 10756182016
ullTotalVirtual 140737488224256
ullAvailVirtual 140737442308096
Processor Information
0
Name Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Clock speed (MHz) 2394
1
Name Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Clock speed (MHz) 2394
2
Name Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Clock speed (MHz) 2394
3
Name Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Clock speed (MHz) 2394
NvAPI
IsMSHybridGraphics True
DisplayDriverVersion
Driver Version 37609
Changelist 0
BuildBranchString r376_06
Default AdapterString GeForce GT 720M
DisplayDriverCompileType Release
NvDebugApi
WDDM Devices
GPU
Name GeForce GT 720M
Architecture Fermi
Architecture Number 208
Architecture Implementation 7
Architecture Revision 162
Number of GPCs 1
Number of TPCs 2
Number of SMs 2
Warps per SM 48
Lanes per warp 32
Register file size 32768
Max CTAs per SM 8
Max size of shared memory per CTA (bytes) 49152
SM Revision 131073
Number of FB PAs 6
Number of LTs per LTC 2
RmGpuId 1024
RM Devices
CUDA
CUDA Device
Name GeForce GT 720M
Driver WDDM
DeviceIndex 0
GPU Family GF117
RmGpuId 1024
Compute Major 2
Compute Minor 1
MAX_THREADS_PER_BLOCK 1024
MAX_BLOCK_DIM_X 1024
MAX_BLOCK_DIM_Y 1024
MAX_BLOCK_DIM_Z 64
MAX_GRID_DIM_X 65535
MAX_GRID_DIM_Y 65535
MAX_GRID_DIM_Z 65535
MAX_SHARED_MEMORY_PER_BLOCK 49152
TOTAL_CONSTANT_MEMORY 65536
WARP_SIZE 32
MAX_PITCH 2147483647
MAX_REGISTERS_PER_BLOCK 32768
CLOCK_RATE 1550000
TEXTURE_ALIGNMENT 512
GPU_OVERLAP 1
MULTIPROCESSOR_COUNT 2
KERNEL_EXEC_TIMEOUT 0
INTEGRATED 0
CAN_MAP_HOST_MEMORY 1
COMPUTE_MODE 0
MAXIMUM_TEXTURE1D_WIDTH 65536
MAXIMUM_TEXTURE2D_WIDTH 65536
MAXIMUM_TEXTURE2D_HEIGHT 65535
MAXIMUM_TEXTURE3D_WIDTH 2048
MAXIMUM_TEXTURE3D_HEIGHT 2048
MAXIMUM_TEXTURE3D_DEPTH 2048
MAXIMUM_TEXTURE2D_LAYERED_WIDTH 16384
MAXIMUM_TEXTURE2D_LAYERED_HEIGHT 16384
MAXIMUM_TEXTURE2D_LAYERED_LAYERS 2048
SURFACE_ALIGNMENT 512
CONCURRENT_KERNELS 1
ECC_ENABLED 0
PCI_BUS_ID 4
PCI_DEVICE_ID 0
TCC_DRIVER 0
MEMORY_CLOCK_RATE 900000
GLOBAL_MEMORY_BUS_WIDTH 64
L2_CACHE_SIZE 131072
MAX_THREADS_PER_MULTIPROCESSOR 1536
ASYNC_ENGINE_COUNT 1
UNIFIED_ADDRESSING 1
MAXIMUM_TEXTURE1D_LAYERED_WIDTH 16384
MAXIMUM_TEXTURE1D_LAYERED_LAYERS 2048
CAN_TEX2D_GATHER 1
MAXIMUM_TEXTURE2D_GATHER_WIDTH 16384
MAXIMUM_TEXTURE2D_GATHER_HEIGHT 16384
MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE 0
MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE 0
MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE 0
PCI_DOMAIN_ID 0
TEXTURE_PITCH_ALIGNMENT 32
MAXIMUM_TEXTURECUBEMAP_WIDTH 16384
MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH 16384
MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS 2046
MAXIMUM_SURFACE1D_WIDTH 65536
MAXIMUM_SURFACE2D_WIDTH 65536
MAXIMUM_SURFACE2D_HEIGHT 32768
MAXIMUM_SURFACE3D_WIDTH 65536
MAXIMUM_SURFACE3D_HEIGHT 32768
MAXIMUM_SURFACE3D_DEPTH 2048
MAXIMUM_SURFACE1D_LAYERED_WIDTH 65536
MAXIMUM_SURFACE1D_LAYERED_LAYERS 2048
MAXIMUM_SURFACE2D_LAYERED_WIDTH 65536
MAXIMUM_SURFACE2D_LAYERED_HEIGHT 32768
MAXIMUM_SURFACE2D_LAYERED_LAYERS 2048
MAXIMUM_SURFACECUBEMAP_WIDTH 32768
MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH 32768
MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS 2046
MAXIMUM_TEXTURE1D_LINEAR_WIDTH 134217728
MAXIMUM_TEXTURE2D_LINEAR_WIDTH 65000
MAXIMUM_TEXTURE2D_LINEAR_HEIGHT 65000
MAXIMUM_TEXTURE2D_LINEAR_PITCH 1048544
MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH 16384
MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT 16384
MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH 16384
STREAM_PRIORITIES_SUPPORTED 0
GLOBAL_L1_CACHE_SUPPORTED 1
LOCAL_L1_CACHE_SUPPORTED 1
MAX_SHARED_MEMORY_PER_MULTIPROCESSOR 49152
MAX_REGISTERS_PER_MULTIPROCESSOR 32768
MANAGED_MEMORY 0
MULTI_GPU_BOARD 0
MULTI_GPU_BOARD_GROUP_ID 0
HOST_NATIVE_ATOMIC_SUPPORTED 0
SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO 12
PAGEABLE_MEMORY_ACCESS 0
CONCURRENT_MANAGED_ACCESS 0
COMPUTE_PREEMPTION_SUPPORTED 0
CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM 0
DISPLAY_NAME GeForce GT 720M
COMPUTE_CAPABILITY_MAJOR 2
COMPUTE_CAPABILITY_MINOR 1
TOTAL_MEMORY 2147483648
RAM_TYPE 7
RAM_LOCATION 1
GPU_PCI_DEVICE_ID 289411294
GPU_PCI_SUB_SYSTEM_ID 939530154
GPU_PCI_REVISION_ID 161
GPU_PCI_EXT_DEVICE_ID 4416
GPU_PCI_EXT_GEN 1
GPU_PCI_EXT_GPU_GEN 1
GPU_PCI_EXT_GPU_LINK_RATE 5000
GPU_PCI_EXT_GPU_LINK_WIDTH 8
GPU_PCI_EXT_DOWNSTREAM_LINK_RATE 5000
GPU_PCI_EXT_DOWNSTREAM_LINK_WIDTH 4
Your GT 720M is a compute capability 2.1 device (see here).
Debugging CUDA code (e.g. setting breakpoints) on a GPU that is also driving (hosting) a display requires a device of compute capability 3.5 or higher, to support preemption.
Your device does not meet that requirement; since it is hosting your laptop display, it cannot be used to set breakpoints in CUDA code.
Also note that the latest version of Nsight VSE (5.2 at this time) has officially dropped support for Fermi GPUs (yours is a Fermi GPU):
Note: Fermi family GPUs, and older families, are no longer supported with Nsight™ Visual Studio Edition 5.2 or better.
Maybe try falling back to an older version of Nsight.
I recently ran into this problem and installing an older version helped.
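If you want to confirm what your own GPU reports, here is a minimal sketch using the standard CUDA runtime API (device 0 assumed); the kernel-execution-timeout flag is a hint that the GPU is driving a display:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Nsight needs compute capability >= 3.5 to debug on a GPU
    // that is also hosting a display (preemption support).
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("kernel exec timeout (display watchdog): %s\n",
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}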

why the difference in cuda cores between nvidia control panel and device query?

Q1: Why is the information I get from the NVIDIA Control Panel -> System Information different from the output of the deviceQuery example in the CUDA SDK?
system information:
cuda cores 384 cores
memory data rate 1800MHz
device query output:
cuda cores = 3 MP x 192 SP/MP = 576 cuda cores
memory clock rate 900MHz
Q2: How can I calculate the GFLOPS of my GPU using the deviceQuery data?
The most commonly used formula I found is the one mentioned here, which requires the number of mul-add units and the number of mul units, which I don't know:
Max GFLOPS = Cores x SIMDs x ([mul-add] x 2 + [mul] x 1) x clock speed
Q1: The deviceQuery output tells you, right there just above that line:
MapSMtoCores for SM 5.0 is undefined. Default to use 192 Cores/SM
Maxwell, the architecture behind the GeForce 840M, actually uses 128 "cores" per "SMM":
3 * 128 = 384
(The memory clock figures differ for a similar reason: the control panel reports the effective DDR data rate, which is double the 900MHz memory clock that deviceQuery shows.)
Q2: "Cores" * frequency * 2 (because each core can do a multiply+add per cycle)

How to get 100% GPU usage using CUDA

I wonder how I can generate high load on a GPU, step by step.
What I'm trying to do is write a program that puts maximum load on one MP, then on another, until reaching the total number of MPs.
It would be similar to running a "while true" on every single core of a CPU, but I'm not sure whether the same paradigm works on a GPU with CUDA.
Can you help me?
If you want to do a stress test / power consumption test, you'll need to pick the workload. The highest power consumption with compute-only code will most likely come from some synthetic benchmark that feeds the GPU the optimal mix and sequence of operations. Otherwise, BLAS level 3 is probably quite close to optimal.
Putting load only on a certain number of multiprocessors will require that you tweak the workload to limit the block-level parallelism.
Briefly, this is what I'd do (a minimal spin-kernel sketch follows after this list):
Pick a code that is well optimized and known to utilize the GPU to a great extent (high IPC, high power consumption, etc.). Have a look around on the CUDA developer forums; you should be able to find hand-tuned BLAS code or something alike.
Change the code to force it to run on a given number of multiprocessors. This will require that you tune the number of blocks and threads to produce exactly the right amount of load for the number of processors you want to utilize.
Profile: the profiler counters can show you the number of instructions per multiprocessor, which lets you check that you are indeed running only on the desired number of processors, as well as other counters that can indicate how efficiently the code is running.
Measure. If you have a Tesla or Quadro you get power consumption out of the box. Otherwise, try the nvml fix. Without a power measurement it will be hard for you to know how far you are from the TDP, and especially whether the GPU is throttling.
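As a rough sketch of the block-per-multiprocessor idea (kernel and parameter names are made up for illustration, and block placement across SMs is typical hardware behavior rather than a guarantee; on a display GPU the watchdog may kill long-running kernels, so keep the iteration count modest):

#include <cuda_runtime.h>

// Spin kernel: a long chain of dependent FMAs keeps an SM busy.
// 'iters' controls the run time; the final store stops the compiler
// from optimizing the loop away.
__global__ void burn(float *out, long long iters)
{
    float x = threadIdx.x * 0.001f + 1.0f;
    for (long long i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.0000001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int smToLoad = prop.multiProcessorCount;  // reduce this to load fewer SMs
    float *out;
    cudaMalloc(&out, smToLoad * 256 * sizeof(float));

    // One block per SM to load; with a single resident block per SM the
    // hardware typically spreads the blocks across distinct multiprocessors.
    burn<<<smToLoad, 256>>>(out, 100000000LL);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}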
Some of my benchmarks carry out the same calculations via CUDA, OpenMP, and programmed multithreading. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8, or 32 adds or subtracts and multiplies on each data element. A range of data sizes is also used. [Free] benchmarks, source code, and results for Linux are available via my page:
http://www.roylongbottom.org.uk/linux%20benchmarks.htm
I also provide Windows versions. Following are some CUDA results, showing a maximum speed of 412 GFLOPS using a GeForce GTX 650. On the quad core / 8 thread Core i7, OpenMP produced up to 91 GFLOPS, and multithreading up to 93 GFLOPS using SSE instructions and 178 GFLOPS with AVX 1. See also the section on burn-in and reliability apps, where the most demanding CUDA test is run for a period to show temperature gains, at the same time as CPU stress tests. (A minimal CUDA sketch of the 2-operation variant appears after the results below.)
Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014
CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test 4 Byte Ops Repeat Seconds MFLOPS First All
Words /Wd Passes Results Same
Data in & out 100000 2 2500 0.837552 597 0.9295383095741 Yes
Data out only 100000 2 2500 0.389646 1283 0.9295383095741 Yes
Calculate only 100000 2 2500 0.085709 5834 0.9295383095741 Yes
Data in & out 1000000 2 250 0.441478 1133 0.9925497770309 Yes
Data out only 1000000 2 250 0.229017 2183 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051727 9666 0.9925497770309 Yes
Data in & out 10000000 2 25 0.369060 1355 0.9992496371269 Yes
Data out only 10000000 2 25 0.201172 2485 0.9992496371269 Yes
Calculate only 10000000 2 25 0.048027 10411 0.9992496371269 Yes
Data in & out 100000 8 2500 0.708377 2823 0.9571172595024 Yes
Data out only 100000 8 2500 0.388206 5152 0.9571172595024 Yes
Calculate only 100000 8 2500 0.092254 21679 0.9571172595024 Yes
Data in & out 1000000 8 250 0.478644 4178 0.9955183267593 Yes
Data out only 1000000 8 250 0.231182 8651 0.9955183267593 Yes
Calculate only 1000000 8 250 0.053854 37138 0.9955183267593 Yes
Data in & out 10000000 8 25 0.370669 5396 0.9995489120483 Yes
Data out only 10000000 8 25 0.202392 9882 0.9995489120483 Yes
Calculate only 10000000 8 25 0.049263 40599 0.9995489120483 Yes
Data in & out 100000 32 2500 0.725027 11034 0.8902152180672 Yes
Data out only 100000 32 2500 0.407579 19628 0.8902152180672 Yes
Calculate only 100000 32 2500 0.113188 70679 0.8902152180672 Yes
Data in & out 1000000 32 250 0.497855 16069 0.9880878329277 Yes
Data out only 1000000 32 250 0.261461 30597 0.9880878329277 Yes
Calculate only 1000000 32 250 0.060132 133042 0.9880878329277 Yes
Data in & out 10000000 32 25 0.375882 21283 0.9987964630127 Yes
Data out only 10000000 32 25 0.207640 38528 0.9987964630127 Yes
Calculate only 10000000 32 25 0.054718 146204 0.9987964630127 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.018107 27613 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.007775 64308 0.9992496371269 Yes
Calculate 10000000 8 25 0.025103 79671 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.008724 229241 0.9995489120483 Yes
Calculate 10000000 32 25 0.036397 219797 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.019414 412070 0.9987964630127 Yes
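As a hedged illustration (not the benchmark's actual source, which is linked above), the 2-operation variant of that arithmetic maps onto a CUDA kernel roughly like this; the array size, grid shape, and coefficient values are made up:

#include <cuda_runtime.h>

// Sketch of the 2-op variant: one add and one multiply per element pass.
// Coefficients a and b are placeholders, not the benchmark's values.
__global__ void ops2(float *x, int n, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = (x[i] + a) * b;
}

int main()
{
    const int n = 1000000;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    ops2<<<(n + 255) / 256, 256>>>(x, n, 0.5f, 0.999999f);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}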

Cuda: Chasing an insufficient resource issue

The kernel uses (--ptxas-options=-v):
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launch with kernelA<<<20,512>>>(parmA, paramB); and it will run fine.
Launch with kernelA<<<20,513>>>(parmA, paramB); and it gets the out-of-resources error (too many resources requested for launch).
The Fermi device properties: 48KB of shared mem per SM, constant mem 64KB, 32K registers per SM, 1024 maximum threads per block, compute capability 2.1 (sm_21)
I'm using all my shared mem space.
I'll run out of block register space around 700 threads/block. The kernel will not launch if I ask for more than half the number of MAX_threads/block. It may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the register count, I had commented out the printf's: reg count = 45.
When it was actually running, it had the printf's compiled in: reg count = 63, with plenty of spill "regs".
I suspect each thread really has 64 regs, with only 63 available to the program.
64 regs * 512 threads = 32K, the maximum available to a single block.
So I suggest the number of available "code" regs for a block = cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of regs per thread to 63 (or they spill over to lmem). I suspect this 63 is a HW addressing limitation.
So it looks like I'm running out of register space.
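As a hedged sketch of how to verify this, cudaFuncGetAttributes reports the per-thread register count the compiler actually used, from which a rough upper bound on the block size follows (ignoring the hardware's per-warp allocation granularity, which lowers the real limit a bit further; the kernel below is an empty stand-in, not the kernel in question):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in question.
__global__ void kernelA(float parmA, int paramB) { /* body omitted */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernelA);

    printf("registers per thread: %d\n", attr.numRegs);
    printf("registers per block:  %d\n", prop.regsPerBlock);
    // e.g. 32768 regs / 63 regs per thread ~= 520 threads, before per-warp
    // rounding -- which is why 512 threads launch and 513 fail.
    printf("rough max threads/block: %d\n", prop.regsPerBlock / attr.numRegs);
    return 0;
}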

Why does the Cuda runtime reserve 80 GiB virtual memory upon initialization?

I was profiling my CUDA 4 program and it turned out that at some stage the running process used over 80 GiB of virtual memory. That was a lot more than I would have expected.
After examining the evolution of the memory map over time and correlating it with the line of code being executed, I found that the virtual memory usage jumped to over 80 GiB after these simple instructions:
#include <stdio.h>         /* perror */
#include <cuda_runtime.h>  /* cudaGetDeviceCount */

int deviceCount;
cudaGetDeviceCount(&deviceCount);  /* first CUDA call: initializes the runtime */
if (deviceCount == 0) {
    perror("No devices supporting CUDA");
}
Clearly, this is the first Cuda call, thus the runtime got initialized. After this the memory map looks like (truncated):
Address Kbytes RSS Dirty Mode Mapping
0000000000400000 89796 14716 0 r-x-- prg
0000000005db1000 12 12 8 rw--- prg
0000000005db4000 80 76 76 rw--- [ anon ]
0000000007343000 39192 37492 37492 rw--- [ anon ]
0000000200000000 4608 0 0 ----- [ anon ]
0000000200480000 1536 1536 1536 rw--- [ anon ]
0000000200600000 83879936 0 0 ----- [ anon ]
Note the huge memory area now mapped into the virtual memory space.
Okay, it's maybe not a big problem, since reserving/allocating memory in Linux doesn't do much unless you actually write to it. But it's really annoying since, for example, MPI jobs have to be specified with the maximum amount of vmem usable by the job. And 80 GiB is then just a lower bound for CUDA jobs; one has to add everything else on top.
I can imagine that it has to do with the so-called scratch space that CUDA maintains: a kind of memory pool for kernel code that can dynamically grow and shrink. But that's speculation. Also, that is allocated in device memory.
Any insights?
Nothing to do with scratch space; it is the result of the addressing system that allows unified addressing and peer-to-peer access between the host and multiple GPUs. The CUDA driver registers all of the GPUs' memory plus the host memory in a single virtual address space, using the kernel's virtual memory system. It isn't actually memory consumption per se; it is just a "trick" to map all the available address spaces into one linear virtual space for unified addressing.
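For reference, here is a minimal sketch (standard runtime API) that checks whether each device participates in unified addressing, which is the mechanism behind the large reservation; on a UVA-capable 64-bit Linux setup the mapping appears as soon as the runtime initializes:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // unifiedAddressing == 1 means this device shares the single
        // virtual address space responsible for the large vmem mapping.
        printf("device %d (%s): unified addressing %s\n",
               d, prop.name, prop.unifiedAddressing ? "yes" : "no");
    }
    return 0;
}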