Error writing/reading to I2C EEPROM with STM32F0 Discovery (HAL)

I am struggling to write to or read from an AT24C256 I2C EEPROM. I am using an STM32F0 Discovery board to read/write to the EEPROM.
I am using the HAL library and CubeMX for the basic structure. I have written a small piece of code to test the read and write functions. On debugging, the value of Test is always '2', whereas it should be '1' if the write to memory succeeds. Here it is:
#define ADDR_24LCxx_Write 0x50
#define ADDR_24LCxx_Read 0x50
#define BufferSize 5

uint8_t WriteBuffer[BufferSize], ReadBuffer[BufferSize], Test;
uint16_t i;
I2C_HandleTypeDef hi2c1;

int main(void)
{
    HAL_Init();
    /* Configure the system clock */
    SystemClock_Config();
    /* Initialize all configured peripherals */
    MX_GPIO_Init();
    MX_I2C1_Init();

    for(i = 0; i < 5; i++)
    {
        WriteBuffer[i] = i;
    }

    if(HAL_I2C_Mem_Write(&hi2c1, ADDR_24LCxx_Write, 0, I2C_MEMADD_SIZE_8BIT, WriteBuffer, BufferSize, 0x10) == HAL_OK)
    {
        Test = 1;
    }
    else
    {
        Test = 2;
    }

    HAL_I2C_Mem_Read(&hi2c1, ADDR_24LCxx_Read, 0, I2C_MEMADD_SIZE_8BIT, ReadBuffer, BufferSize, 0x10);
    if(memcmp(WriteBuffer, ReadBuffer, BufferSize) == 0) /* check data */
    {
        Test = 3;
    }
    else
    {
        Test = 4;
    }
}

You should step into the function HAL_I2C_Mem_Write to understand why it does not return HAL_OK. In particular, check what exactly it returns; that will help you.
Looking at your code, I am confident that the issue is with the I2C address. In the AT24C256 datasheet, they say that the I2C address is:
1 0 1 0 0 A1 A2 R/W
Assuming that you connected the pins A1 and A2 to GND, the I2C address is:
1 0 1 0 0 0 0 R/W
In hex, that address byte is 0xA0 (the 7-bit address 0x50 shifted left by one, which is how the HAL expects device addresses). So, change your address definition as follows:
#define ADDR_24LCxx 0xA0
And in the HAL functions:
HAL_I2C_Mem_Write(&hi2c1, ADDR_24LCxx, 0, I2C_MEMADD_SIZE_8BIT,WriteBuffer,BufferSize, 100)
HAL_I2C_Mem_Read(&hi2c1, ADDR_24LCxx, 0, I2C_MEMADD_SIZE_8BIT,ReadBuffer,BufferSize, 100)
Please note that I have also increased the timeout to 100 ms. For testing, you really don't want to run into timeout issues.
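If you want to double-check the address and avoid reading back stale data, the HAL also offers HAL_I2C_IsDeviceReady(), which simply probes the device for an ACK. Here is a minimal sketch (using the ADDR_24LCxx definition above; the ACK polling covers the EEPROM's internal write cycle of a few milliseconds — not tested on your exact setup):

/* Probe the EEPROM: returns HAL_OK if something ACKs at this address */
if (HAL_I2C_IsDeviceReady(&hi2c1, ADDR_24LCxx, 3, 100) != HAL_OK)
{
    /* wrong address or wiring problem */
}

HAL_I2C_Mem_Write(&hi2c1, ADDR_24LCxx, 0, I2C_MEMADD_SIZE_8BIT, WriteBuffer, BufferSize, 100);

/* ACK-poll until the internal write cycle has finished, then read back */
while (HAL_I2C_IsDeviceReady(&hi2c1, ADDR_24LCxx, 1, 100) != HAL_OK)
{
    /* the device NACKs while busy writing; a real application should bound this loop */
}
HAL_I2C_Mem_Read(&hi2c1, ADDR_24LCxx, 0, I2C_MEMADD_SIZE_8BIT, ReadBuffer, BufferSize, 100);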


Static const array in CUDA kernel

I need to have the following in a CUDA kernel:
static const float PREDEFINED_CONSTS[16] = {...}; // 16 constants.
float c = PREDEFINED_CONSTS[threadIdx.x % 16];
/// Use c in computations.
What's the best way to provide PREDEFINED_CONSTS?
Constant memory doesn't seem good, because different threads will access different locations.
If I define them as above, will PREDEFINED_CONSTS be stored in global memory?
What about this:
float c;
if      (threadIdx.x % 16 == 0)  c = VAL0;
else if (threadIdx.x % 16 == 1)  c = VAL1;
...
else if (threadIdx.x % 16 == 15) c = VAL15;
Although the last example has thread divergence, the literal VAL* values are part of the instruction opcode, so there will be no reading from memory.
What's the best way to provide PREDEFINED_CONSTS?
If it were me, I would simply put what you have in your first example in your CUDA kernel and go with that. That is very likely the best way to do it. Later on, if you feel you have a performance problem with your code, you can use a profiler to steer you in the direction of what needs to be addressed. I doubt it would be this. For constants, there really are only two possibilities:
Load them from some kind of memory.
Load them as part of the instruction stream.
You've already indicated that you are aware of this; you can simply benchmark both if you're really worried. Benchmarking would require more than what you have shown here, might be inconclusive, and may also depend on other factors, such as how many times and in what way you are loading these constants.
As you have indicated already, __constant__ doesn't seem to be a sensible choice because the load pattern is clearly non-uniform across the warp.
If I define them as above, will PREDEFINED_CONSTS be stored in global memory?
Yes, your first method will be stored in global memory. This can be confirmed with careful study and compilation using -Xptxas -v. Your second method has the potential (at least) to load the constants via the instruction stream. Since the second method is quite ugly from a coding perspective, and also very inflexible compared to the first method (what if I needed different constants per thread in different places in my code?), it's not what I would choose.
This strikes me as premature optimization. The first method is clearly preferred from a code flexibility and conciseness standpoint, and there is no real reason to think that simply because you are loading from memory that it is a problem. The second method is ugly, inflexible, and may not be any better from a performance perspective. Even if the data is part of the instruction stream, it still has to be loaded from memory.
Here's an example test case suggesting to me that the first case is preferred; in the run below, k1 (the lookup-table kernel) is roughly ten times faster than k2. If you come up with a different kind of test case, you may end up with a different observation:
$ cat t97.cu
#include <cstdio>
const float VAL0 = 1.1;
const float VAL1 = 2.2;
const float VAL2 = 3;
const float VAL3 = 4;
const float VAL4 = 5;
const float VAL5 = 6;
const float VAL6 = 7;
const float VAL7 = 8;
const float VAL8 = 9;
const float VAL9 = 10;
const float VAL10 = 11;
const float VAL11 = 12;
const float VAL12 = 13;
const float VAL13 = 14;
const float VAL14 = 15;
const float VAL15 = 16;
__global__ void k1(int l){
    static const float PREDEFINED_CONSTS[16] = {VAL0, VAL1, VAL2,  VAL3,  VAL4,  VAL5,  VAL6,  VAL7,
                                                VAL8, VAL9, VAL10, VAL11, VAL12, VAL13, VAL14, VAL15};
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += PREDEFINED_CONSTS[(threadIdx.x+i) & 15];
    if (sum == 0.0) printf("%f\n", sum);
}

__device__ float get_const(int i){
    float c = VAL15;
    unsigned t = (threadIdx.x+i) & 15;
    if (t == 0) c = VAL0;
    else if (t == 1) c = VAL1;
    else if (t == 2) c = VAL2;
    else if (t == 3) c = VAL3;
    else if (t == 4) c = VAL4;
    else if (t == 5) c = VAL5;
    else if (t == 6) c = VAL6;
    else if (t == 7) c = VAL7;
    else if (t == 8) c = VAL8;
    else if (t == 9) c = VAL9;
    else if (t == 10) c = VAL10;
    else if (t == 11) c = VAL11;
    else if (t == 12) c = VAL12;
    else if (t == 13) c = VAL13;
    else if (t == 14) c = VAL14;
    return c;
}

__global__ void k2(int l){
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += get_const(i);
    if (sum == 0.0) printf("%f\n", sum);
}

int main(){
    int l = 1048576;
    k1<<<1,16>>>(l);
    k2<<<1,16>>>(l);
    cudaDeviceSynchronize();
}
$ nvcc -o t97 t97.cu -Xptxas -v
ptxas info : 68 bytes gmem
ptxas info : Compiling entry function '_Z2k2i' for 'sm_52'
ptxas info : Function properties for _Z2k2i
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 324 bytes cmem[0], 8 bytes cmem[2]
ptxas info : Compiling entry function '_Z2k1i' for 'sm_52'
ptxas info : Function properties for _Z2k1i
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 324 bytes cmem[0]
$ nvprof ./t97
==22848== NVPROF is profiling process 22848, command: ./t97
==22848== Profiling application: ./t97
==22848== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 91.76% 239.39ms 1 239.39ms 239.39ms 239.39ms k2(int)
8.24% 21.508ms 1 21.508ms 21.508ms 21.508ms k1(int)
API calls: 62.34% 260.89ms 1 260.89ms 260.89ms 260.89ms cudaDeviceSynchronize
37.48% 156.85ms 2 78.427ms 10.319us 156.84ms cudaLaunchKernel
0.13% 542.39us 202 2.6850us 192ns 117.71us cuDeviceGetAttribute
0.04% 156.19us 2 78.094us 58.411us 97.777us cuDeviceTotalMem
0.01% 59.150us 2 29.575us 26.891us 32.259us cuDeviceGetName
0.00% 10.845us 2 5.4220us 1.7280us 9.1170us cuDeviceGetPCIBusId
0.00% 1.6860us 4 421ns 216ns 957ns cuDeviceGet
0.00% 1.5850us 3 528ns 283ns 904ns cuDeviceGetCount
0.00% 667ns 2 333ns 296ns 371ns cuDeviceGetUuid
$

What's the alternative for __match_any_sync on compute capability 6?

In the CUDA examples, e.g. here, __match_any_sync is used (originally __match_all_sync; see the EDIT note in the answer below).
Here is an example where a warp is split into multiple (one or more) groups that each keep track of their own atomic counter.
// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
    int pred;
    //const auto mask = __match_all_sync(__activemask(), ptr, &pred); // error: should be match_any, not match_all
    const auto mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
    const auto leader = __ffs(mask) - 1; // select a leader
    int res;
    const auto lane_id = threadIdx.x % warpSize;
    if (lane_id == leader) { // leader does the update
        res = atomicAdd(ptr, __popc(mask));
    }
    res = __shfl_sync(mask, res, leader); // get leader's old value
    return res + __popc(mask & ((1 << lane_id) - 1)); // compute old value
}
The __match_any_sync here splits up the threads in the warp into groups that have the same ptr value, so that each group can update its own ptr atomically without getting in the way of other threads.
I know the nvcc compiler (since CUDA 9) does this sort of optimization under the hood automatically, but this question is just about the mechanics of __match_any_sync.
Is there a way to do this on devices before compute capability 7.0?
EDIT: The blog article has now been modified to reflect __match_any_sync() rather than __match_all_sync(), so any commentary to that effect below should be disregarded. The answer below is edited to reflect this.
Based on your statement:
this is just about the mechanics of __match_any_sync
we will focus on a replacement for __match_any_sync itself, not any other form of rewriting the atomicAggInc function. Therefore, we must provide a mask that has the same value as would be returned by __match_any_sync() on cc7.0 or higher architectures.
I believe this will require a loop, which broadcasts the ptr value, in the worst case one iteration for each thread in the warp (since each thread could have a unique ptr value) and testing which threads have the same value. There are various ways we could "optimize" this loop for this function, so as to possibly reduce the trip count from 32 to some lesser value, based on the actual ptr values in each thread, but such optimization in my view introduces considerable complexity, which makes the worst-case processing time longer (as is typical of early-exit optimizations). So I will demonstrate a fairly simple method without this optimization.
The other consideration is what to do when the warp is not converged. For that, we can employ __activemask() to identify that case.
Here is a worked example:
$ cat t1646.cu
#include <iostream>
#include <stdio.h>

// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
    int mask;
#if __CUDA_ARCH__ >= 700
    mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
#else
    unsigned tmask = __activemask();
    for (int i = 0; i < warpSize; i++){
#ifdef USE_OPT
        if ((1U<<i) & tmask){
#endif
            unsigned long long tptr = __shfl_sync(tmask, (unsigned long long)ptr, i);
            unsigned my_mask = __ballot_sync(tmask, (tptr == (unsigned long long)ptr));
            if (i == (threadIdx.x & (warpSize-1))) mask = my_mask;}
#ifdef USE_OPT
    }
#endif
#endif
    int leader = __ffs(mask) - 1; // select a leader
    int res;
    unsigned lane_id = threadIdx.x % warpSize;
    if (lane_id == leader) { // leader does the update
        res = atomicAdd(ptr, __popc(mask));
    }
    res = __shfl_sync(mask, res, leader); // get leader's old value
    return res + __popc(mask & ((1 << lane_id) - 1)); // compute old value
}

__global__ void k(int *d){
    int *ptr = d + threadIdx.x/4;
    if ((threadIdx.x >= 16) && (threadIdx.x < 32))
        atomicAggInc(ptr);
}

const int ds = 32;

int main(){
    int *d_d, *h_d;
    h_d = new int[ds];
    cudaMalloc(&d_d, ds*sizeof(d_d[0]));
    cudaMemset(d_d, 0, ds*sizeof(d_d[0]));
    k<<<1,ds>>>(d_d);
    cudaMemcpy(h_d, d_d, ds*sizeof(d_d[0]), cudaMemcpyDeviceToHost);
    for (int i = 0; i < ds; i++)
        std::cout << h_d[i] << " ";
    std::cout << std::endl;
}
$ nvcc -o t1646 t1646.cu -DUSE_OPT
$ cuda-memcheck ./t1646
========= CUDA-MEMCHECK
0 0 0 0 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
========= ERROR SUMMARY: 0 errors
$
(CentOS 7, CUDA 10.1.243, with device 0 being Tesla V100, device 1 being a cc3.5 device).
I've added an optional optimization for the case where the warp is diverged (i.e. tmask is not 0xFFFFFFFF); it can be selected by defining USE_OPT. In the output above, threads 16 through 31 each aggregate on d[threadIdx.x/4], so the four groups of four threads account for the four 4s at d[4] through d[7].

How to use CUDA tex1DFetch with cudaTextureObject_t?

I was working with texture references when I noticed they were deprecated; I tried to update my test function to work with the 'new' bindless texture objects and tex1Dfetch, but was not able to produce the same results.
I'm currently exploring the use of texture memory to speed up my Aho-Corasick implementation; I was able to get tex1D() working with texture references, but then noticed they were deprecated and decided to use texture objects instead.
I'm getting some immensely weird behaviour in the kernel when I try to use the fetched value in any way; I can do results[tidx] = tidx; without any issues, but results[tidx] = temp + 1; only ever returns the value of temp, and neither temp * 3 nor any other numerical expression involving temp works.
I can see no logical reason for this behaviour, and the documentation examples look similar enough that I can't see where I've gone wrong.
I've already read CUDA tex1Dfetch() wrong behaviour and New CUDA Texture Object — getting wrong data in 2D case, but neither seems related to the issue I am having.
Just in case it makes a difference: I am using CUDA release 10.0, V10.0.130, with an NVIDIA GTX 980 Ti.
#include <iostream>

__global__ void test(cudaTextureObject_t tex, int* results){
    int tidx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned temp = tex1Dfetch<unsigned>(tex, threadIdx.x);
    results[tidx] = temp * 3;
}

int main(){
    int *host_arr;
    const int host_arr_size = 8;

    // Create and populate host array
    std::cout << "Host:" << std::endl;
    cudaMallocHost(&host_arr, host_arr_size*sizeof(int));
    for (int i = 0; i < host_arr_size; ++i){
        host_arr[i] = i * 2;
        std::cout << host_arr[i] << std::endl;
    }

    // Create resource description
    struct cudaResourceDesc resDesc;
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = &host_arr;
    resDesc.res.linear.sizeInBytes = host_arr_size*sizeof(unsigned);
    resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned>();

    // Create texture description
    struct cudaTextureDesc texDesc;
    texDesc.readMode = cudaReadModeElementType;

    // Create texture
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    // Allocate results array
    int * result_arr;
    cudaMalloc(&result_arr, host_arr_size*sizeof(unsigned));

    // launch test kernel
    test<<<1, host_arr_size>>>(tex, result_arr);

    // fetch results
    std::cout << "Device:" << std::endl;
    cudaMemcpy(host_arr, result_arr, host_arr_size*sizeof(unsigned), cudaMemcpyDeviceToHost);

    // print results
    for (int i = 0; i < host_arr_size; ++i){
        std::cout << host_arr[i] << std::endl;
    }

    // Tidy Up
    cudaDestroyTextureObject(tex);
    cudaFreeHost(host_arr);
    cudaFree(result_arr);
}
I expected the above to work similarly to the below (which does work):
texture<int, 1, cudaReadModeElementType> tex_ref;
cudaArray* cuda_array;

__global__ void test(int* results){
    const int tidx = threadIdx.x;
    results[tidx] = tex1D(tex_ref, tidx) * 3;
}

int main(){
    int *host_arr;
    int host_arr_size = 8;

    // Create and populate host array
    cudaMallocHost((void**)&host_arr, host_arr_size * sizeof(int));
    for (int i = 0; i < host_arr_size; ++i){
        host_arr[i] = i * 2;
        std::cout << host_arr[i] << std::endl;
    }

    // bind to texture
    cudaChannelFormatDesc cuDesc = cudaCreateChannelDesc<int>();
    cudaMallocArray(&cuda_array, &cuDesc, host_arr_size);
    cudaMemcpyToArray(cuda_array, 0, 0, host_arr, host_arr_size * sizeof(int), cudaMemcpyHostToDevice);
    cudaBindTextureToArray(tex_ref, cuda_array);

    // Allocate results array
    int * result_arr;
    cudaMalloc((void**)&result_arr, host_arr_size * sizeof(int));

    // launch kernel
    test<<<1, host_arr_size>>>(result_arr);

    // fetch results
    cudaMemcpy(host_arr, result_arr, host_arr_size * sizeof(int), cudaMemcpyDeviceToHost);

    // print results
    for (int i = 0; i < host_arr_size; ++i){
        std::cout << host_arr[i] << std::endl;
    }

    // Tidy Up
    cudaUnbindTexture(tex_ref);
    cudaFreeHost(host_arr);
    cudaFreeArray(cuda_array);
    cudaFree(result_arr);
}
Expected results:
Host:
0
2
4
6
8
10
12
14
Device:
0
6
12
18
24
30
36
42
Actual results:
Host:
0
2
4
6
8
10
12
14
Device:
0
2
4
6
8
10
12
14
Does anyone know what on earth is going wrong?
CUDA API function calls return error codes. You want to check these error codes. Especially when something is clearly going wrong somewhere…
You use the same array to store the initial data and to receive the results from the device. Your kernel launch fails with an illegal address error because you do not have a valid texture object. You do not have a valid texture object because the creation of your texture object failed. The first API call right after the kernel launch is the cudaMemcpy() to get the results back. Since there was an error during the kernel launch, cudaMemcpy() will fail, returning the most recent error instead of performing the copy. As a result, the contents of your host_arr buffer are unchanged and you just end up displaying the original input data again.
The reason why creation of your texture object failed is explained in the documentation (emphasis mine):
If cudaResourceDesc::resType is set to cudaResourceTypeLinear, cudaResourceDesc::res::linear::devPtr must be set to a valid device pointer, that is aligned to cudaDeviceProp::textureAlignment. […]
A texture object cannot reference host memory. The problem in your code lies here:
resDesc.res.linear.devPtr = &host_arr;
You need to allocate a buffer in device memory, e.g. using cudaMalloc(), copy your data there, and create a texture object that refers to that device buffer.
Furthermore, your texDesc is not initialized properly. In your case, it should be sufficient to just zero-initialize it:
struct cudaTextureDesc texDesc = {};
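Putting these fixes together, a minimal sketch of the corrected setup might look like this (dev_arr is a hypothetical name for the device-side copy of the data; error handling shortened):

// Allocate a device buffer and copy the host data into it
int *dev_arr = NULL;
cudaMalloc(&dev_arr, host_arr_size * sizeof(int));
cudaMemcpy(dev_arr, host_arr, host_arr_size * sizeof(int), cudaMemcpyHostToDevice);

// Zero-initialize both descriptors, then fill in only the needed fields
struct cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = dev_arr;                  // a device pointer, not &host_arr
resDesc.res.linear.sizeInBytes = host_arr_size * sizeof(int);
resDesc.res.linear.desc = cudaCreateChannelDesc<int>();

struct cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

// Create the texture object and check the returned error code
cudaTextureObject_t tex = 0;
cudaError_t err = cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
if (err != cudaSuccess)
    std::cerr << cudaGetErrorString(err) << std::endl;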
Four steps (note that this is the deprecated texture-reference API that the question is trying to move away from):
declare
texture<unsigned char, 1, cudaReadModeElementType> tex1;
bind
cudaBindTexture(0, tex1, dev_A);
fetch/read via index
tex1Dfetch(tex1, 2);
unbind
cudaUnbindTexture(tex1);

How do I unescape an HTML attribute value in Prolog?

I found the predicate xml_quote_attribute/2 in library(sgml)
of SWI-Prolog. This predicate works with the first argument
as input and the second argument as output:
?- xml_quote_attribute('<abc>', X).
X = '&lt;abc&gt;'.
But I couldn't figure out how I can do the reverse conversion.
For example the following query doesn't work:
?- xml_quote_attribute(X, '&lt;abc&gt;').
ERROR: Arguments are not sufficiently instantiated
Is there another predicate that does the job?
Bye
This is what Ruud's solution looks like with DCG notation + pushback lists / semicontext notation.
:- use_module(library(dcg/basics)).
html_unescape --> sgml_entity, !, html_unescape.
html_unescape, [C] --> [C], !, html_unescape.
html_unescape --> [].

sgml_entity, [C] --> "&#", integer(C), ";".
sgml_entity, "<" --> "&lt;".
sgml_entity, ">" --> "&gt;".
sgml_entity, "&" --> "&amp;".
Using DCGs makes the code a bit more readable. It also does away with some of the superfluous backtracking that Cookie Monster noted is the result of using append/3 for this.
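For reference, the DCG version should be callable along these lines (a sketch, assuming double-quoted strings denote code lists, e.g. after :- set_prolog_flag(double_quotes, codes). in recent SWI-Prolog); the "rest" argument of phrase/3 carries the output because of the pushback:

?- phrase(html_unescape, "&lt;a&gt; &#26361;&#25805;", U), format('~s', [U]).
<a> 曹操
U = [60, 97, 62, 32, 26361, 25805].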
Here's the naive solution, using lists of character codes. Most likely it will not give you the best performance possible, but for strings that are not extremely long, it might just be alright.
html_unescape("", "") :- !.
html_unescape(Escaped, Unescaped) :-
append("&", _, Escaped),
!,
append(E1, E2, Escaped),
sgml_entity(E1, U1),
!,
html_unescape(E2, U2),
append(U1, U2, Unescaped).
html_unescape(Escaped, Unescaped) :-
append([C], E2, Escaped),
html_unescape(E2, U2),
append([C], U2, Unescaped).
sgml_entity(Escaped, [C]) :-
append(["&#", L, ";"], Escaped),
catch(number_codes(C, L), error(syntax_error(_), _), fail),
!.
sgml_entity("<", "<").
sgml_entity(">", ">").
sgml_entity("&", "&").
You will have to complete the list of SGML entities yourself.
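For example, two more entries might look like this (in the same code-list notation):

sgml_entity("&quot;", "\"").
sgml_entity("&apos;", "'").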
Sample output:
?- html_unescape("<a> 曹操", L), format('~s', [L]).
<a> 曹操
L = [60, 97, 62, 32, 26361, 25805].
If you don't mind linking a foreign module, then you can make a very efficient implementation in C.
html_unescape.pl:
:- module(html_unescape, [ html_unescape/2 ]).
:- use_foreign_library(foreign('./html_unescape.so')).
html_unescape.c:
#include <stdio.h>
#include <string.h>
#include <SWI-Prolog.h>
static int to_utf8(char **unesc, unsigned ccode)
{
    int ok = 1;
    if (ccode < 0x80)
    {
        *(*unesc)++ = ccode;
    }
    else if (ccode < 0x800)
    {
        *(*unesc)++ = 192 + ccode / 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode - 0xd800u < 0x800)
    {
        ok = 0;
    }
    else if (ccode < 0x10000)
    {
        *(*unesc)++ = 224 + ccode / 4096;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else if (ccode < 0x110000)
    {
        *(*unesc)++ = 240 + ccode / 262144;
        *(*unesc)++ = 128 + ccode / 4096 % 64;
        *(*unesc)++ = 128 + ccode / 64 % 64;
        *(*unesc)++ = 128 + ccode % 64;
    }
    else
    {
        ok = 0;
    }
    return ok;
}

static int numeric_entity(char **esc, char **unesc)
{
    int consumed;
    unsigned ccode;
    int ok = (sscanf(*esc, "&#%u;%n", &ccode, &consumed) > 0 ||
              sscanf(*esc, "&#x%x;%n", &ccode, &consumed) > 0) &&
             consumed > 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += consumed;
    }
    return ok;
}

static int symbolic_entity(char **esc, char **unesc, char *name, int ccode)
{
    int ok = strncmp(*esc, name, strlen(name)) == 0 &&
             to_utf8(unesc, ccode);
    if (ok)
    {
        *esc += strlen(name);
    }
    return ok;
}

static foreign_t pl_html_unescape(term_t escaped, term_t unescaped)
{
    char *esc;
    if (!PL_get_chars(escaped, &esc, CVT_ATOM | REP_UTF8))
    {
        PL_fail;
    }
    else if (strchr(esc, '&') == NULL)
    {
        return PL_unify(escaped, unescaped);
    }
    else
    {
        char buffer[strlen(esc) + 1];
        char *unesc = buffer;
        while (*esc != '\0')
        {
            if (*esc != '&' || !(numeric_entity(&esc, &unesc) ||
                                 symbolic_entity(&esc, &unesc, "&lt;", '<') ||
                                 symbolic_entity(&esc, &unesc, "&gt;", '>') ||
                                 symbolic_entity(&esc, &unesc, "&amp;", '&')))
                // TODO: more entities...
            {
                *unesc++ = *esc++;
            }
        }
        return PL_unify_chars(unescaped, PL_ATOM | REP_UTF8, unesc - buffer, buffer);
    }
}

install_t install_html_unescape()
{
    PL_register_foreign("html_unescape", 2, pl_html_unescape, 0);
}
The following statement will build a shared library html_unescape.so from html_unescape.c. Tested on Ubuntu 14.04; may be different on Windows.
swipl-ld -shared -o html_unescape html_unescape.c
Start up SWI-Prolog:
swipl html_unescape.pl
Sample output:
?- html_unescape('&lt;a&gt; &#26361;&#25805;', S).
S = '<a> 曹操'.
With special thanks to the SWI-Prolog documentation and source code, and to the question "C library to convert unicode code points to UTF8?".
This is not aspiring to be the ultimate answer, since it doesn't give
a solution for SWI-Prolog. For a Java-based interpreter, the problem
is that XML escaping is not part of J2SE, at least not in a simple
form (I didn't figure out how to use Xerces or the like).
A possible route would be to interface to StringEscapeUtils ( * ) from
Apache Commons. But then again, this would not be necessary on
Android, since there is a class TextUtil. So we rolled our own ( * * )
little conversion. It works as follows:
?- text_escape('<abc>', X).
X = '&lt;abc&gt;'
?- text_escape(X, '&lt;abc&gt;').
X = '<abc>'
Note the use of the Java methods codePointAt() and charCount(),
respectively appendCodePoint(), in the Java source code. With these it
could also escape and unescape code points above the basic
multilingual plane, i.e. in the range >0xFFFF (currently not
implemented, left as an exercise).
On the other hand, the Apache libraries, at least version 2.6, are
NOT surrogate-pair aware and will emit two decimal entities per
code point instead of one.
Bye
( * ) Java: Class StringEscapeUtils Source
http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/Entities.java#Entities.escape%28java.io.Writer,java.lang.String%29
( * * ) Jekejeke Prolog: Module xml
http://www.jekejeke.ch/idatab/doclet/prod/en/docs/05_run/10_docu/05_frequent/07_theories/20_system/03_xml.html

cudaMemcpyAsync and streams behaviour understanding

I have this simple code, shown below, which does nothing but copy some data to the device from the host using streams. But after running nvprof, I am confused as to whether cudaMemcpyAsync is really asynchronous, and about my understanding of streams.
#include <stdio.h>

#define NUM_STREAMS 4

cudaError_t memcpyUsingStreams (float *fDest,
                                float *fSrc,
                                int iBytes,
                                cudaMemcpyKind eDirection,
                                cudaStream_t *pCuStream)
{
    int iIndex = 0;
    cudaError_t cuError = cudaSuccess;
    int iOffset = 0;
    int iChunk = iBytes / NUM_STREAMS;  /* bytes per stream */

    /* Creating streams if not present */
    if (NULL == pCuStream)
    {
        pCuStream = (cudaStream_t *) malloc(NUM_STREAMS * sizeof(cudaStream_t));
        for (iIndex = 0; iIndex < NUM_STREAMS; iIndex++)
        {
            cuError = cudaStreamCreate(&pCuStream[iIndex]);
        }
    }

    if (cuError != cudaSuccess)
    {
        cuError = cudaMemcpy(fDest, fSrc, iBytes, eDirection);
    }
    else
    {
        for (iIndex = 0; iIndex < NUM_STREAMS; iIndex++)
        {
            /* offset in elements, since fDest/fSrc are float* */
            iOffset = iIndex * (iChunk / sizeof(float));
            cuError = cudaMemcpyAsync(fDest + iOffset, fSrc + iOffset, iChunk, eDirection, pCuStream[iIndex]);
        }
    }

    if (NULL != pCuStream)
    {
        for (iIndex = 0; iIndex < NUM_STREAMS; iIndex++)
        {
            cuError = cudaStreamDestroy(pCuStream[iIndex]);
        }
        free(pCuStream);
    }
    return cuError;
}

int main()
{
    float *hdata = NULL;
    float *ddata = NULL;
    int i, j, k, index;
    cudaStream_t *abc = NULL;

    hdata = (float *) malloc(sizeof(float) * 256 * 256 * 256);
    cudaMalloc((void **) &ddata, sizeof(float) * 256 * 256 * 256);

    for (i = 0; i < 256; i++)
    {
        for (j = 0; j < 256; j++)
        {
            for (k = 0; k < 256; k++)
            {
                index = (((i * 256) + j) * 256) + k;
                hdata[index] = index;
            }
        }
    }

    memcpyUsingStreams(ddata, hdata, sizeof(float) * 256 * 256 * 256, cudaMemcpyHostToDevice, abc);

    cudaFree(ddata);
    free(hdata);
    return 0;
}
The nvprof results are as below.
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
104.35ms 10.38ms - - - - - 16.78MB 1.62GB/s 0 1 7 [CUDA memcpy HtoD]
114.73ms 10.41ms - - - - - 16.78MB 1.61GB/s 0 1 8 [CUDA memcpy HtoD]
125.14ms 10.46ms - - - - - 16.78MB 1.60GB/s 0 1 9 [CUDA memcpy HtoD]
135.61ms 10.39ms - - - - - 16.78MB 1.61GB/s 0 1 10 [CUDA memcpy HtoD]
So I didn't understand the point of using streams here, given the start times: it looks sequential to me. Please help me understand what I am doing wrong. I am using a Tesla K20c card.
The PCI Express link that connects your GPU to the system only has one channel going to the card and one channel coming from the card. That means at most, you can have a single cudaMemcpy(Async) operation that is actually executing at any given time, per direction (i.e. one DtoH and one HtoD, at the most). All other cudaMemcpy(Async) operations will get queued up, waiting for those ahead to complete.
You cannot have two operations going in the same direction at the same time. One at a time, per direction.
As @JackOLantern states, the principal benefit of streams is to overlap memcopies and compute, or else to allow multiple kernels to execute concurrently. It also allows one DtoH copy to run concurrently with one HtoD copy.
Since your program does all HtoD copies, they all get executed serially. Each copy has to wait for the copy ahead of it to complete.
Even getting an HtoD and DtoH memcopy to execute concurrently requires a device with multiple copy engines; you can discover this about your device using deviceQuery.
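Besides deviceQuery, a quick way to check is to read the asyncEngineCount device property directly; a minimal sketch:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* device 0 */
    /* 1 = one copy engine; 2 = concurrent HtoD + DtoH is possible */
    printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
    return 0;
}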
I should also point out that, to enable concurrent behavior, you should use cudaHostAlloc, not malloc, for your host-side buffers.
EDIT: The answer above has GPUs in view that have at most 2 copy engines (one per direction) and is still correct for such GPUs. However there exist some newer Pascal and Volta family member GPUs that have more than 2 copy engines. In that case, with 2 (or more) copy engines per direction, it is theoretically possible to have 2 (or more) transfers "in-flight" in that direction. However this doesn't change the characteristics of the PCIE (or NVLink) bus itself. You are still limited to the available bandwidth, and the exact low level behavior (whether such transfers appear to be "serialized" or else appear to run concurrently, but take longer due to sharing of bandwidth) should not matter much in most cases.
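As a minimal sketch of the pinned-allocation change suggested above (only the host buffer handling differs from your program):

float *hdata = NULL;
/* pinned (page-locked) host memory instead of malloc() */
cudaHostAlloc((void **)&hdata, sizeof(float) * 256 * 256 * 256, cudaHostAllocDefault);

/* ... fill hdata and call memcpyUsingStreams() as before ... */

cudaFreeHost(hdata);  /* instead of free() */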