I am trying to port these GLSL shaders to Stage3D and AGAL, but I cannot make them work:
Vertex Shader (GLSL)
attribute vec3 aVertexPosition;
attribute vec4 aVertexColor;
uniform mat4 uMVMatrix;
uniform mat4 uPMatrix;
varying vec4 vColor;
void main(void) {
gl_Position = uPMatrix * uMVMatrix * vec4(aVertexPosition, 1.0);
vColor = aVertexColor;
}
Fragment Shader (GLSL)
precision mediump float;
varying vec4 vColor;
void main(void) {
gl_FragColor = vColor;
}
This is the AGAL I ended up with:
Vertex Shader (AGAL)
mov vt0.w, vc0.x
mov vt0.xyz, va0.xyz
mov vt1.xyzw, vc1
mul vt5.xyzw, vt1, vc5
m44 op.xyzw, vt0.xyzw, vt5
mov v0.xyzw, va1.xyzw
Fragment Shader (AGAL)
mov oc.xyzw, v0.xyzw
Here is a trace of the calls I make:
(setVertexBufferAt) va0 3 0
(setVertexBufferAt) va1 4 3
(identity setProgramConstantsFromMatrix): vc0 1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
(setProgramConstantsFromMatrix) vc1 1,0,0,0,0,1,0,0,0,0,1,0,0,0,-2,1
(setProgramConstantsFromMatrix) vc5 1.8106601238250732,0,0,0,0,2.4142136573791504,0,0,0,0,-1.0020020008087158,-1,0,0,-0.20020020008087158,0
render drawTriangles + present
But I still can't see anything.
Update:
Apparently the problem is that the triangle is being clipped. The values in the multiplied matrix are larger than 1.0, so the vertices end up outside the clip volume. I need to find a way to normalize the matrix or to extend the clipping range.
Any suggestions?
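One way to test the clipping theory is to reproduce the GLSL transform on the CPU. The following C++ sketch is only an illustration under stated assumptions: it treats the two matrices from the trace as column-major, applies them in the same order as uPMatrix * uMVMatrix * vec4(aVertexPosition, 1.0), and checks whether x and y fall within [-w, w]. The vertex positions and the helper names (transform, toClipSpace, insideXY) are made up for this example.

```cpp
#include <array>
#include <cmath>

using Vec4 = std::array<float, 4>;
// Column-major storage: element (row r, column c) lives at index c * 4 + r.
using Mat4 = std::array<float, 16>;

// Multiply a column-major 4x4 matrix by a column vector.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{0.0f, 0.0f, 0.0f, 0.0f};
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            out[r] += m[c * 4 + r] * v[c];
        }
    }
    return out;
}

// Values copied from the trace; the column-major reading is an assumption.
const Mat4 kModelView = {1.0f, 0.0f, 0.0f, 0.0f,
                         0.0f, 1.0f, 0.0f, 0.0f,
                         0.0f, 0.0f, 1.0f, 0.0f,
                         0.0f, 0.0f, -2.0f, 1.0f};
const Mat4 kProjection = {1.8106601f, 0.0f, 0.0f, 0.0f,
                          0.0f, 2.4142137f, 0.0f, 0.0f,
                          0.0f, 0.0f, -1.002002f, -1.0f,
                          0.0f, 0.0f, -0.2002002f, 0.0f};

// Same order as the GLSL: projection * (modelView * position).
Vec4 toClipSpace(const Vec4& position) {
    return transform(kProjection, transform(kModelView, position));
}

// A vertex survives clipping in x and y only if both lie within [-w, w].
bool insideXY(const Vec4& clip) {
    return std::fabs(clip[0]) <= clip[3] && std::fabs(clip[1]) <= clip[3];
}
```

With these matrices, a hypothetical vertex at (0, 0, 0) passes the check, while one at (0, 1, 0) does not, because its clip-space y exceeds its w.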
Digging up an old post here, but for the sake of any future searches you should really check out:
https://github.com/adobe/glsl2agal
It works pretty much as you would hope - takes valid glsl vertex/fragment shader programs as input, spits out valid agal as output.
As an example:
http://www.cmodule.org/glsl2agal/
Even with a 20-point Laplacian operator, there are still coordinate-system artifacts when starting from a circularly symmetric seed.
That is one reason to try a spectral solver.
The main code for the simplest Laplacian mentioned above, with a forward Euler solver, is:
#define A(U) texture(iChannel0,(U)/iResolution.xy)
void mainImage( out vec4 Q, in vec2 U )
{
// Lookup Field
Q = A(U);
// Mean Field
// Two way: horizontal, vertical
vec4 sum2 = A(U+vec2(0,1))+A(U+vec2(1,0))+A(U-vec2(0,1))+A(U-vec2(1,0));
vec4 mean2 = 1./4.*(sum2);
// Laplacian
vec4 laplacian2 = (mean2 - Q);
// Diffuse each variable differently :
Q += laplacian2 * vec4(1, .4, 1, 1);
// Compute reactions:
Q.x = Q.x * .99 + 0.01 * Q.y;
Q.y = Q.y + .05 * Q.y * (1. - Q.y) - .03 * Q.x - 1e-3;
// Prevent Negative Values (depends on system):
Q = max(Q, 0.);
}
How can this be rewritten to a spectral solver on Shadertoy?
Sometimes, one wants to write a (small) CUDA device-side function which returns two values. In C style, you would have that function take two out-parameters, e.g.:
__device__ void pair_maker(float x, float &out1, float& out2);
but in C++, the idiomatic way to write this is to return an std::pair (well, maybe an std::tuple, or a struct, but C++ tuples are clunky and a struct is not generic enough):
__device__ std::pair<float, float> pair_maker(float x);
My question: Can I trust NVCC (with --expt-relaxed-constexpr) to optimize away the construction of the pair, and just assign directly to the variables which I later assign to from the .first and .second elements of the pair?
I don't have a complete answer, but in my limited experience it seems that NVCC can optimize the std::pair away. Illustration (also on GodBolt):
#include <utility>
__device__ std::pair<float, float> pair_maker(float x) {
float sin, cos;
__sincosf(x, &sin, &cos);
return {sin, cos};
}
__device__ float foo(float x) {
auto p = pair_maker(x);
auto sin = p.first;
auto cos = p.second;
return sin + cos;
}
__global__ void bar(float x, float *out) { *out = foo(x); }
__global__ void baz(float x, float *out) {
float sin, cos;
__sincosf(x, &sin, &cos);
*out = sin + cos;
}
The kernels bar() and baz() compile to the same PTX code:
ld.param.f32 %f1, [param_0];
ld.param.u64 %rd1, [param_1];
cvta.to.global.u64 %rd2, %rd1;
sin.approx.f32 %f2, %f1;
cos.approx.f32 %f3, %f1;
add.f32 %f4, %f2, %f3;
st.global.f32 [%rd2], %f4;
ret;
No extra copies or construction-related operations.
I am working with a buffer that passes in a few different elements. Below is a crude diagram of where each element appears in the buffer (offsets in floats):
 pos col amb dif spe nor uv t a s
+---+---+---+---+---+---+--+-+-+-+
0   3   6   9   1   1   1  2 2 2 2
                2   5   8  0 1 2 3
Where
pos - the vertex (3 floats)
col - the color at that vertex (3 floats; note that this is an unused legacy variable)
amb - the ambient RGB reflection of the model (3 floats)
dif - the diffuse RGB reflection of the model (3 floats)
spe - the specular RGB reflection of the model (3 floats)
nor - the normals of the model (3 floats)
uv - the uv coordinates to the mapped texture (2 floats)
t - an index selecting which texture to load (a float)
a - the transparency (alpha) of the model (a float)
s - the specular exponent (a float)
My buffer looks something like this:
// stride = size of one vertex in bytes (23 floats * 4 bytes per float)
stride = 23 * 4;
// Last parameter = byte offset where this attribute starts within a vertex
GL.vertexAttribPointer(_position, 3, GL.FLOAT, false, stride, 0 * 4);
GL.vertexAttribPointer(_color, 3, GL.FLOAT, false, stride, 3 * 4);
GL.vertexAttribPointer(_ambient, 3, GL.FLOAT, false, stride, 6 * 4);
GL.vertexAttribPointer(_diffuse, 3, GL.FLOAT, false, stride, 9 * 4);
GL.vertexAttribPointer(_specular, 3, GL.FLOAT, false, stride, 12 * 4);
GL.vertexAttribPointer(_normals, 3, GL.FLOAT, false, stride, 15 * 4);
GL.vertexAttribPointer(_uvs, 2, GL.FLOAT, false, stride, 18 * 4);
GL.vertexAttribPointer(_tex, 1, GL.FLOAT, false, stride, 20 * 4);
GL.vertexAttribPointer(_a, 1, GL.FLOAT, false, stride, 21 * 4);
GL.vertexAttribPointer(_shine, 1, GL.FLOAT, false, stride, 22 * 4);
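The offsets and stride in these calls follow mechanically from the component counts in the list above. As a cross-check, here is a small C++ sketch that derives them; the Attribute struct and the computeLayout and makeVertexLayout helpers are hypothetical names used only for this illustration, and a 4-byte float is assumed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Attribute {
    std::string name;
    int components;      // number of floats this attribute occupies per vertex
    std::size_t offset;  // byte offset within one vertex, filled in by computeLayout
};

// Assigns byte offsets in declaration order and returns the stride in bytes.
std::size_t computeLayout(std::vector<Attribute>& attributes) {
    std::size_t offset = 0;
    for (Attribute& attribute : attributes) {
        attribute.offset = offset;
        offset += static_cast<std::size_t>(attribute.components) * sizeof(float);
    }
    return offset;
}

// The interleaved layout described in the question, in buffer order.
std::vector<Attribute> makeVertexLayout() {
    return {
        {"pos", 3, 0}, {"col", 3, 0}, {"amb", 3, 0}, {"dif", 3, 0},
        {"spe", 3, 0}, {"nor", 3, 0}, {"uv", 2, 0},  {"t", 1, 0},
        {"a", 1, 0},   {"s", 1, 0},
    };
}
```

Calling computeLayout on this layout gives a stride of 92 bytes (23 * 4) and byte offsets of 80, 84 and 88 for t, a and s, which matches the 20 * 4, 21 * 4 and 22 * 4 passed above.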
All three floats are being passed the same way in the vertex shader:
attribute float tex;
attribute float a;
attribute float shine;
...
varying float vTex;
varying float vA;
varying float vShine;
void main(void) {
...
vTex = tex;
vA = a;
vShine = shine;
As far as I can tell everything is passed correctly; I literally copied and pasted the _tex code for _a and _shine. No errors pop up, and if I print the array containing all these values, everything is stored properly. Likewise, vTex is used inside the fragment shader without error.
void main(void) {
vec4 texColor;
//Ambient
vec4 Ia = La * Ka;
// Diffuse
vec4 Id = Kd;
vec3 lightDirection = normalize(world_light - vertex);
vec3 L = normalize(lightDirection - world_pos);
vec3 N = normalize(world_norm);
float lambert = max(0.0, dot(N, -L));
Id = Kd*Ld*lambert;
// Specular
vec4 Is = Ks;
vec3 V = normalize(vertex - world_pos);
vec3 H = normalize(L + V);
float NdotH = dot(N, H);
NdotH = max(NdotH, 0.0);
NdotH = pow(NdotH, 10.0);
// NdotH = pow(NdotH, vShine); <-------------------------------- ERRORS
Is = Ks*Ls*NdotH;
if (vTex < 0.1) {
vec4 texColor = texture2D(texture01, vUV);
gl_FragColor = vec4(texColor.rgb, texColor.a);
} else if (vTex < 1.1) {
vec4 texColor = texture2D(texture02, vUV);
gl_FragColor = vec4(texColor.rgb, texColor.a);
} else if (vTex < 2.1) {
vec4 texColor = texture2D(texture03, vUV);
gl_FragColor = vec4(texColor.rgb, texColor.a);
} else {
vec4 texColor = texture2D(texture04, vUV);
gl_FragColor = vec4(texColor.rgb, texColor.a);
}
gl_FragColor = gl_FragColor * (Ia*A) + (Id*D) + (Is*S);
As soon as I switch to NdotH = pow(NdotH, vShine);, Chrome's WebGL fails with the following error messages:
VM258:1958 WebGL: INVALID_OPERATION: getUniformLocation: program not linked(anonymous function) # VM258:1958
gl.getUniformLocation # VM258:4629
main # texturize.js:569
onload # (index):26
VM258:1958 WebGL: INVALID_OPERATION: getUniformLocation: program not linked(anonymous function) # VM258:1958
gl.getUniformLocation # VM258:4629
main # texturize.js:570
onload # (index):26
This is obviously the confusing part, as the floats are attributes, not uniforms. Again, loading in Firefox is fine, but I am trying to understand what is causing problems on the Chrome front and what the resolution is without having to refactor.
I'm hesitant to post full code, as this is a class assignment.
Thanks!
I found the issue: it is Chrome's limit on Max Varying Vectors, which I discovered via https://www.browserleaks.com/webgl
Chrome cannot handle the shader because I am passing too many varying vectors: Firefox can handle 30, while Chrome can only handle 9. My shader sits right at that limit, and that is where the error comes from.
The Maxwell architecture introduced a new PTX instruction called LOP3 which, according to the NVIDIA blog:
"Can save instructions when performing complex logic operations on multiple inputs."
At GTC 2016, some CUDA developers managed to accelerate the atan2f function for the Tegra X1 processor (Maxwell) with such instructions.
However, the function below, defined within a .cu file, fails to compile because __SET_LT, __LOP3_0x2e and __LOP3_0xe2 are undefined.
Do I have to define them in a .ptx file instead? If so, how?
float atan2f(const float dy, const float dx)
{
float flag, z = 0.0f;
__SET_LT(flag, fabsf(dy), fabsf(dx));
uint32_t m, t1 = 0x80000000;
float t2 = float(M_PI) / 2.0f;
__LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
float w = flag * __int_as_float(m) + float(M_PI)/2.0f;
float Offset = copysignf(w, dy);
float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
uint32_t r, b = __float_as_int(flag) << 2;
uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
__LOP3_0xe2(r, mask, t1, __float_as_int(t));
const float p = fabsf(__int_as_float(r)) - 1.0f;
return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}
Edit:
In the end, the macro definitions are:
#define __SET_LT(D, A, B) asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __SET_GT(D, A, B) asm("set.gt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __LOP3_0x2e(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
#define __LOP3_0xe2(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
The lop3.b32 PTX instruction can perform a more-or-less arbitrary boolean (logical) operation on 3 variables A,B, and C.
In order to set the actual operation to be performed, we must provide a "lookup-table" immediate argument (immLut -- an 8-bit quantity). As indicated in the documentation, a method to compute the necessary immLut argument for a given operation F(A,B,C) is to substitute the values of 0xF0 for A, 0xCC for B, and 0xAA for C in the actual desired equation. For example suppose we want to compute:
F = (A || B) && (!C), i.e. (A or B) and (not C)
Then we would compute immLut argument by:
immLut = (0xF0 | 0xCC) & (~0xAA)
Note that the specified equation for F is a boolean equation, treating the arguments A,B, and C as boolean values, and producing a true/false result (F). However, the equation to compute immLut is a bitwise logical operation.
For the above example, immLut has a computed value of 0x54 (taking only the low 8 bits of the result).
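To make the lookup-table idea concrete, here is a host-side C++ sketch that computes the same immLut at compile time and emulates the per-bit table lookup in plain code. It is only a reference model based on the description above, not NVIDIA's implementation; the names kA, kB, kC, kImmLut and lop3Reference are invented for this illustration.

```cpp
#include <cstdint>

// Constants substituted for A, B and C when deriving immLut.
constexpr std::uint8_t kA = 0xF0;
constexpr std::uint8_t kB = 0xCC;
constexpr std::uint8_t kC = 0xAA;

// immLut for F(A, B, C) = (A | B) & ~C, truncated to 8 bits.
constexpr std::uint8_t kImmLut = static_cast<std::uint8_t>((kA | kB) & ~kC);
static_assert(kImmLut == 0x54, "immLut for (A | B) & ~C should be 0x54");

// Reference model of a 3-input lookup-table operation on 32-bit words:
// at every bit position, the bits of a, b and c form a 3-bit index
// (a is the most significant bit) that selects one bit of immLut.
inline std::uint32_t lop3Reference(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint8_t immLut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const std::uint32_t index = (((a >> bit) & 1u) << 2) |
                                    (((b >> bit) & 1u) << 1) |
                                    ((c >> bit) & 1u);
        const std::uint32_t lutBit = (static_cast<std::uint32_t>(immLut) >> index) & 1u;
        result |= lutBit << bit;
    }
    return result;
}
```

For any inputs, lop3Reference(a, b, c, kImmLut) should equal (a | b) & ~c, which is a convenient way to sanity-check a hand-computed immLut before putting it into inline PTX.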
If it's desired to use a PTX instruction in ordinary CUDA C/C++ code, probably the most common (and arguably easiest) method would be to use inline PTX. Inline PTX is documented, and there are other questions discussing how to use it (such as this one), so I'll not repeat that here.
Here is a worked example of the above example case. Note that this particular PTX instruction is only available on cc5.0 and higher architectures, so be sure to compile for at least that level of target.
$ cat t1149.cu
#include <stdio.h>
const unsigned char A_or_B_and_notC=((0xF0|0xCC)&(~0xAA));
__device__ int my_LOP_0x54(int A, int B, int C){
int temp;
asm("lop3.b32 %0, %1, %2, %3, 0x54;" : "=r"(temp) : "r"(A), "r"(B), "r"(C));
return temp;
}
__global__ void testkernel(){
printf("A=true, B=false, C=true, F=%d\n", my_LOP_0x54(true, false, true));
printf("A=true, B=false, C=false, F=%d\n", my_LOP_0x54(true, false, false));
printf("A=false, B=false, C=false, F=%d\n", my_LOP_0x54(false, false, false));
}
int main(){
printf("0x%x\n", A_or_B_and_notC);
testkernel<<<1,1>>>();
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t1149 t1149.cu
$ ./t1149
0x54
A=true, B=false, C=true, F=0
A=true, B=false, C=false, F=1
A=false, B=false, C=false, F=0
$
Since immLut is an immediate constant in PTX code, I know of no way using inline PTX to pass this as a function parameter - even if templating is used. Based on your provided link, it seems that the authors of that presentation also used a separately defined function for the specific desired immediate value -- presumably 0xE2 and 0x2E in their case. Also, note that I have chosen to write my function so that it returns the result of the operation as the function return value. The authors of the presentation you linked appear to be passing the return value back via a function parameter. Either method should be workable. (In fact, it appears they have written their __LOP3... codes as functional macros rather than ordinary functions.)
Also see here for a method of understanding how the 8 bit truthtable (immLut) works for LOP3 at the source code level.
I managed to get alpha working, but I want to change alpha on only one texture. Currently it only works on both at once. I set the depth test and blend factors:
context3D.setDepthTest(false, Context3DCompareMode.LESS);
context3D.setBlendFactors(Context3DBlendFactor.SOURCE_ALPHA, Context3DBlendFactor.ONE_MINUS_SOURCE_ALPHA);
I still haven't figured out what exactly source and destination are; any clarification would be nice.
Vertex data contains
[x, y, z, u, v, 0|1]
Set vertex registers
context3D.setVertexBufferAt(0, m_vertexBuffer, 0, Context3DVertexBufferFormat.FLOAT_3);
context3D.setVertexBufferAt(1, m_vertexBuffer, 3, Context3DVertexBufferFormat.FLOAT_2);
context3D.setVertexBufferAt(2, m_vertexBuffer, 5, Context3DVertexBufferFormat.FLOAT_1);
Set fragment constant
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 0, Vector.<Number>([textureMul, 0, 0, 0]));
Vertex shader
m44 op, va0, vc0
mov v0, va1
mov v1, va2
Fragment shader
tex ft0, v0, fs0 <2d,clamp,linear>
sub ft0.w, ft0.w, fc0.x
add ft0.w, ft0.w, v1
mov oc, ft0
Here fc0.x is a number between 0 and 1, and v1 is either 0 or 1.
Thank god for the PIX program on Windows (it comes with the Microsoft DirectX SDK). With it I understood how to use FLOAT_1 properly. Since I pass v1 as FLOAT_1, I must use v1.x, so the correct fragment shader is:
tex ft0, v0, fs0 <2d,clamp,linear>
sub ft0.w, ft0.w, fc0.x
add ft0.w, ft0.w, v1.x
mov oc, ft0