Is it possible in an AGAL fragment shader to read the current fragment's depth, if that can be done at all?
No, I'm afraid there is no way to read from the depth buffer in AGAL.
You can, however, work around this by first rendering a depth map into a texture and then using that texture (which may be enough, depending on the effect you are trying to implement).
In fact, even rendering a depth map with good precision can be (a little) tricky, because there are no float32 textures in Flash, so the depth has to be stored in an R8G8B8A8 texture (by packing and unpacking values on the GPU).
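For reference, the packing/unpacking arithmetic usually used for this is sketched below, written as CUDA-style C purely for readability (in an actual AGAL fragment shader you would express the same math with instructions such as mul, frc and dp4); the function names are made up:

#include <cuda_runtime.h>

// Sketch: encode a depth value in [0, 1) into four 8-bit channels and back.
// Assumes the depth has already been normalized to [0, 1).
__device__ float4 packDepthToRGBA8(float depth)
{
    float4 enc = make_float4(depth * 1.0f,
                             depth * 255.0f,
                             depth * 65025.0f,      // 255^2
                             depth * 16581375.0f);  // 255^3
    // Keep only the fractional part of each channel.
    enc.x -= floorf(enc.x); enc.y -= floorf(enc.y);
    enc.z -= floorf(enc.z); enc.w -= floorf(enc.w);
    // Remove the bits already carried by the next channel.
    enc.x -= enc.y * (1.0f / 255.0f);
    enc.y -= enc.z * (1.0f / 255.0f);
    enc.z -= enc.w * (1.0f / 255.0f);
    return enc;  // written out to the R8G8B8A8 render target
}

__device__ float unpackDepthFromRGBA8(float4 enc)
{
    // Weighted sum (a dp4 in shader terms) reverses the packing above.
    return enc.x
         + enc.y * (1.0f / 255.0f)
         + enc.z * (1.0f / 65025.0f)
         + enc.w * (1.0f / 16581375.0f);
}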
I have a D3D11 Texture2D with the format DXGI_FORMAT_R10G10B10A2_UNORM and want to convert it into a D3D11 Texture2D with a DXGI_FORMAT_R32G32B32A32_FLOAT or DXGI_FORMAT_R8G8B8A8_UINT format, as only textures of those formats can be imported into CUDA.
For performance reasons I want this to happen fully on the GPU. I have read some threads suggesting that I should set the second texture as a render target and render the first texture onto it, or convert the texture via a pixel shader.
But as I don't know a lot about D3D I wasn't able to do it like that.
In an ideal world I would be able to do this stuff without setting up a whole rendering pipeline including IA, VS, etc...
Does anyone have an example of this, or any hints?
Thanks in advance!
On the GPU, the way you do this conversion is a render-to-texture which requires at least a minimal 'offscreen' rendering setup.
Create a render target view (DXGI_FORMAT_R32G32B32A32_FLOAT, DXGI_FORMAT_R8G8B8A8_UINT, etc.). The restriction here is it needs to be a format supported as a render target view on your Direct3D Hardware Feature level. See Microsoft Docs.
Create an SRV for your source texture. Again, it needs to be a format supported as a texture by your Direct3D Hardware Feature Level.
Render the source texture to the RTV as a 'full-screen quad'. With Direct3D Hardware Feature Level 10.0 or greater, you can have the quad self-generated in the Vertex Shader, so you don't really need a Vertex Buffer for this. See this code.
Given you are starting with DXGI_FORMAT_R10G10B10A2_UNORM, you pretty much require Direct3D Hardware Feature Level 10.0 or better. That actually makes it pretty easy. You still need to get a full rendering pipeline going, although you don't need a 'swapchain'.
You may find this tutorial helpful.
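To make the steps above concrete, here is a rough host-side sketch (not a drop-in implementation): it assumes device and context already exist, that fullscreenVS and copyPS are shaders compiled elsewhere (the VS generates a full-screen triangle from SV_VertexID, the PS simply samples and returns the source color), and it omits all error handling. Names such as ConvertTexture and pointSampler are placeholders.

#include <d3d11.h>

void ConvertTexture(ID3D11Device* device, ID3D11DeviceContext* context,
                    ID3D11Texture2D* srcTex, UINT width, UINT height,
                    ID3D11VertexShader* fullscreenVS, ID3D11PixelShader* copyPS,
                    ID3D11SamplerState* pointSampler,
                    ID3D11Texture2D** dstTexOut)
{
    // 1) Destination texture in the target format, bindable as a render target.
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = width;  desc.Height = height;
    desc.MipLevels = 1;  desc.ArraySize = 1;
    desc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
    device->CreateTexture2D(&desc, nullptr, dstTexOut);

    ID3D11RenderTargetView* rtv = nullptr;
    device->CreateRenderTargetView(*dstTexOut, nullptr, &rtv);

    // 2) SRV for the source texture.
    ID3D11ShaderResourceView* srv = nullptr;
    device->CreateShaderResourceView(srcTex, nullptr, &srv);

    // 3) Draw a full-screen triangle into the destination.
    D3D11_VIEWPORT vp = { 0.0f, 0.0f, (float)width, (float)height, 0.0f, 1.0f };
    context->RSSetViewports(1, &vp);
    context->OMSetRenderTargets(1, &rtv, nullptr);
    context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    context->VSSetShader(fullscreenVS, nullptr, 0);
    context->PSSetShader(copyPS, nullptr, 0);
    context->PSSetShaderResources(0, 1, &srv);
    context->PSSetSamplers(0, 1, &pointSampler);
    context->Draw(3, 0);   // vertices generated in the VS from SV_VertexID

    srv->Release();
    rtv->Release();
}

If you instead target DXGI_FORMAT_R8G8B8A8_UINT, the pixel shader would need to output uint4 rather than float4, so the FLOAT target is the simpler of the two.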
On the CPU, I often use 'sub-images' of 2-D (pitch-linear) images, which simply point to a certain ROI of the 'master' image, so all modifications to the sub-image in fact also change the 'master' image.
Are there any problems in CUDA with such sub-images of 2-D (pitch-linear) images in device memory? E.g., can I bind a texture or a texture object to them? Do the NPP routines work properly? I ask because certain routines could require a particular alignment of the buffer's 'start address'.
Note that I am mainly interested in stability issues. I suppose there might be minor performance penalties for these sub-images, but that is not my main concern.
In particular, I would be interested in whether the alignment restriction for the buffer base address mentioned in the 'cudaBindTexture2D' documentation here:
"Since the hardware enforces an alignment requirement on texture base addresses, cudaBindTexture2D() returns in *offset a byte offset that must be applied to texture fetches in order to read from the desired memory."
also applies to 'texture objects' (on GPUs with CC >= 3.0).
Any bound texture (whether via Texture Reference or Texture Object API) should satisfy the alignment requirement(s) provided by cudaGetDeviceProperties, in order to have a direct mapping between data coordinates and texture coordinates:
Any bound texture should satisfy the alignment returned via textureAlignment (in bytes). Allocations provided by cudaMalloc and similar will satisfy this (for the starting address of the allocation).
A 2D bound texture should (for each row in the texture) satisfy the alignment returned via texturePitchAlignment. Allocations provided by (for example) cudaMallocPitch will satisfy this.
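As an illustration, a texture-object sketch over an ROI of a pitch-linear 'master' allocation could look like the following; the function and variable names are made up, and the alignment check is only hinted at in comments:

#include <cuda_runtime.h>

// Sketch: query the alignment limits and create a texture object over a
// sub-image (ROI) of an 8-bit pitch-linear 'master' allocation.
cudaTextureObject_t makeRoiTexture(unsigned char* master, size_t masterPitch,
                                   int roiX, int roiY, int roiW, int roiH)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // prop.textureAlignment      : required alignment of the base address
    // prop.texturePitchAlignment : required alignment of the row pitch

    unsigned char* roiPtr = master + roiY * masterPitch + roiX;
    // If roiPtr does not meet prop.textureAlignment (or masterPitch does not
    // meet prop.texturePitchAlignment), creation may fail or an extra fetch
    // offset may be needed, so check/adjust the ROI before this point.

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = roiPtr;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<unsigned char>();
    resDesc.res.pitch2D.width = roiW;
    resDesc.res.pitch2D.height = roiH;
    resDesc.res.pitch2D.pitchInBytes = masterPitch;  // pitch of the master image

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}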
NPP should work properly with any properly specified ROI.
Note that your document link is quite old. Current docs can be found here.
This question/answer may be of interest as well.
I have a specialised rendering app that needs to load any number of JPEGs from a PDF and then write the images out onto a rendered page inside a kernel. This is oversimplified, but the point is that I want to find a way to collectively send up 'n' images as textures and then, within the kernel, index into this collection of textures for tex2D() calls. Any ideas for doing this gracefully are welcome.
As a side question, I haven't yet found a way to decode the JPEG images in the kernel, forcing me to decode on the CPU and then send up (slowly) a large bitmap. Can I improve this?
First: if texture upload performance is not a bottleneck, consider not bulk uploading. Here are some suggestions, each with different trade-offs.
For varying-sized textures, consider creating a texture atlas. This is a technique popular in game development that packs many textures into a single 2D image. This requires offsetting texture coordinates to the corner of the image in question, and it precludes the use of texture coordinate clamping and wrapping. So you would need to store the offset of the corner of each sub-texture instead of its ID. There are various tools available for creating texture atlases.
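As a rough sketch of the addressing involved (assuming a single float texture object for the whole atlas with unnormalized coordinates, and a made-up AtlasRect table holding each sub-texture's pixel rectangle):

#include <cuda_runtime.h>

struct AtlasRect { float x, y, w, h; };   // pixel-space top-left corner + size

__device__ float sampleAtlas(cudaTextureObject_t atlas, const AtlasRect* rects,
                             int imageIdx, float u, float v)
{
    AtlasRect r = rects[imageIdx];
    // Map the per-image coordinate (u, v) in [0,1) into atlas pixel space.
    float x = r.x + u * r.w;
    float y = r.y + v * r.h;
    // Clamping/wrapping must be done manually here, since hardware address
    // modes would bleed into neighbouring sub-textures.
    return tex2D<float>(atlas, x + 0.5f, y + 0.5f);
}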
For constant-sized textures, or for the case where you don't mind the waste of varying-sized textures, you could consider using a layered texture. This is a texture with a number of independent layers that can be indexed at texture fetch time using a separate layer index. Quote from the link above:
A one-dimensional or two-dimensional layered texture (also known as texture array in Direct3D and array texture in OpenGL) is a texture made up of a sequence of layers, all of which are regular textures of same dimensionality, size, and data type.
A one-dimensional layered texture is addressed using an integer index and a floating-point texture coordinate; the index denotes a layer within the sequence and the coordinate addresses a texel within that layer. A two-dimensional layered texture is addressed using an integer index and two floating-point texture coordinates; the index denotes a layer within the sequence and the coordinates address a texel within that layer.
A layered texture can only be a CUDA array, created by calling cudaMalloc3DArray() with the cudaArrayLayered flag (and a height of zero for a one-dimensional layered texture).
Layered textures are fetched using the device functions described in tex1DLayered() and tex2DLayered(). Texture filtering (see Texture Fetching) is done only within a layer, not across layers.
Layered textures are only supported on devices of compute capability 2.0 and higher.
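A minimal sketch of that approach, using the texture object API (CC 3.0+; with texture references you would bind the array via cudaBindTextureToArray instead), with made-up names and the cudaMemcpy3D upload of the image data omitted:

#include <cuda_runtime.h>

// Allocate a 2D layered CUDA array for N same-sized 8-bit images and create
// a texture object over it.
cudaTextureObject_t makeLayeredTexture(int width, int height, int numLayers,
                                       cudaArray_t* arrayOut)
{
    cudaChannelFormatDesc ch = cudaCreateChannelDesc<unsigned char>();
    cudaExtent extent = make_cudaExtent(width, height, numLayers);
    cudaMalloc3DArray(arrayOut, &ch, extent, cudaArrayLayered);
    // ... fill the array with cudaMemcpy3D (one layer per image) ...

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = *arrayOut;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

// In a kernel, the integer layer index selects which image to fetch from:
__global__ void readPixel(cudaTextureObject_t tex, int layer, float x, float y,
                          unsigned char* out)
{
    *out = tex2DLayered<unsigned char>(tex, x, y, layer);
}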
You could consider a hybrid approach: sort the textures into same-sized groups and use a layered texture for each group. Or use a layered texture atlas, where the groups are packed such that each layer contains one or a few textures from each group to minimize waste.
Regarding your side question: a Google search for "cuda jpeg decode" turns up a lot of results, including at least one open-source project.
I am dealing with a set of (largish, 2k x 2k) images.
I need to do per-pixel operations down a stack of a few sequential images.
Are there any opinions on using a single 2D large texture + calculating offsets vs using 3D arrays?
It seems that 3D arrays are a bit 'out of the mainstream' in the CUDA API; the allocation and transfer functions are very different from their 2D counterparts.
There doesn't seem to be any good documentation on the higher-level "how and why" of CUDA rather than the specific calls.
There is the Best Practices Guide, but it doesn't address this.
I would recommend reading the book "CUDA by Example". It goes through all these things that aren't documented as well, and it explains the "how and why".
If you're rendering the result of the CUDA kernel, you should use OpenGL interop. That way, your code processes the image on the GPU and leaves the processed data there, making it much faster to render. There's a good example of doing this in the book.
If each CUDA thread needs to read only one pixel from the first frame and one pixel from the next frame, you don't need to use textures. Textures only benefit you if each thread is reading in a bunch of consecutive pixels. So you're best off using a 3D array.
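For the "same pixel down the stack" access pattern, a plain pitched 3D allocation is straightforward; here is a minimal sketch assuming float pixels (kernel and variable names are made up):

#include <cuda_runtime.h>

// Per-pixel operation down a stack of images stored in one pitched 3D allocation.
__global__ void sumDownStack(cudaPitchedPtr stack, int width, int height,
                             int depth, float* out /* width*height results */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    char* base = (char*)stack.ptr;
    size_t slicePitch = stack.pitch * height;   // bytes per image slice

    float acc = 0.0f;
    for (int z = 0; z < depth; ++z) {
        float* row = (float*)(base + z * slicePitch + y * stack.pitch);
        acc += row[x];                          // same pixel in each image
    }
    out[y * width + x] = acc;
}

// Host side (sketch):
//   cudaPitchedPtr stack;
//   cudaMalloc3D(&stack, make_cudaExtent(width * sizeof(float), height, depth));
//   upload the images with cudaMemcpy3D, then launch sumDownStack.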
Here is an example of using CUDA and 3D cuda arrays:
https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st
I'm using texture memory for image filtering in CUDA as:
texture<unsigned char> texMem; // declaration
cudaBindTexture(NULL, texMem, d_inputImage, imageSize); // binding
However, I'm not satisfied with the results at the boundary. Are there any other considerations or settings for texture memory tailored for 2D filtering?
I've seen people declare textures this way:
texture<float> texMem(0,cudaFilterModeLinear);
// what does this do?
Moreover, if anyone can suggest an online guide explaining how to properly set up the texture memory abstraction in CUDA, that would be helpful. Thanks.
You can specify what kind of sampling you want via the filter mode (cudaFilterModePoint or cudaFilterModeLinear).
You could look at Appendix G of the CUDA_C_Programming_Guide.pdf provided in path/to/cudatoolkit/doc to see this explained in detail.
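For what it's worth, here is a sketch of a proper 2D setup with the legacy texture reference API that the snippets above use (it has since been superseded by texture objects and removed in recent CUDA releases); the address modes are what control the boundary behaviour, and the function names are placeholders:

#include <cuda_runtime.h>

texture<unsigned char, 2, cudaReadModeElementType> texMem2D;

void bind2D(unsigned char* d_inputImage, size_t pitch, int width, int height)
{
    texMem2D.addressMode[0] = cudaAddressModeClamp;  // clamp at left/right edges
    texMem2D.addressMode[1] = cudaAddressModeClamp;  // clamp at top/bottom edges
    texMem2D.filterMode     = cudaFilterModePoint;   // linear filtering would need
                                                     // cudaReadModeNormalizedFloat
                                                     // for 8-bit data
    texMem2D.normalized     = false;                 // integer pixel coordinates

    cudaChannelFormatDesc ch = cudaCreateChannelDesc<unsigned char>();
    size_t offset = 0;
    // d_inputImage should come from cudaMallocPitch so each row is aligned;
    // offset should come back as 0 in that case.
    cudaBindTexture2D(&offset, texMem2D, d_inputImage, ch, width, height, pitch);
}

// In the filtering kernel, fetch with tex2D(texMem2D, x + 0.5f, y + 0.5f);
// out-of-range coordinates are then clamped to the edge instead of wrapping.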