Arduino (ESP8266) not receiving all characters - arduino-ide

I have a file of 18164 bytes that I am trying to download from an ESP8266 (4M), but I'm only receiving part of the file (about 17000 bytes). Sometimes it receives everything, but more often than not it doesn't. The file is a minified JavaScript file. Smaller files of 5000 bytes work fine. If buffer size is the problem, how do I increase the buffer size for WiFiClient?
void setup()
{
    String line = "";
    ...
    while (net.connected() || net.available())
    {
        if (net.available())
        {
            char c = net.read();
            line += c;
        }
    }
    net.stop();
    Serial.println("Line:-" + line + "-");
}

void loop() {
}

You're probably exhausting the available memory in your ESP8266. It doesn't have a lot to start with.
Your 4m ESP8266 has 4Mbits of flash storage (512Kbytes), but you're accumulating the file you download in RAM, which is much more scarce. The ESP8266 has 80Kbytes of RAM. Some of this RAM will already be used by the Arduino Core and network stack.
I loaded a minimal program on an ESP8266 with a connected WiFi client and saw 51Kbytes of RAM available.
You're also using String heavily. String stores the actual characters in memory allocated from the ESP8266's free heap. Each time you add a character, String has to grow its storage: it allocates a new block one byte bigger, copies the contents across, and frees the old block. The Arduino Core doesn't have particularly sophisticated memory management, so this can lead to heap fragmentation to the point where you just can't allocate a big enough piece of memory, and your program fails.
You can learn about the state of free memory with these calls:
ESP.getFreeHeap() - returns total amount of available memory
ESP.getHeapFragmentation() - returns an indication of how badly the heap is fragmented. The bigger, the worse - over 50 is very bad.
ESP.getMaxFreeBlockSize() - the biggest block you can allocate.
To verify, I'd try something like this:
while (net.connected() || net.available())
{
    if (net.available())
    {
        char c = net.read();
        line += c;
        Serial.print("line length: ");
        Serial.println(line.length());
        Serial.print("free heap: ");
        Serial.println(ESP.getFreeHeap());
        Serial.print("heap fragmentation: ");
        Serial.println(ESP.getHeapFragmentation());
        Serial.print("max block size: ");
        Serial.println(ESP.getMaxFreeBlockSize());
    }
}
net.stop();
If the max block size gets close to or less than the length of line, you can no longer allocate a big enough chunk of memory to store line + c and your program will fail the way you described.
One thing you could try is to use the reserve() method on String after you create line. Something like:
String line;
line.reserve(18500);
line = "";
Ultimately you're much better off reading the file into a char array, or saving it to SPIFFS a chunk at a time (512 bytes would be a good size), than using String.
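For the SPIFFS route, a minimal sketch of the chunked approach could look like the following (assuming net is the connected WiFiClient from the question, SPIFFS is enabled in the IDE's flash-size settings, and the file name is just an example):
#include <ESP8266WiFi.h>
#include <FS.h>   // SPIFFS

// Sketch only: stream the download into SPIFFS 512 bytes at a time so RAM
// usage stays constant no matter how large the file is.
void downloadToSpiffs(WiFiClient &net)
{
    SPIFFS.begin();
    File f = SPIFFS.open("/download.js", "w");   // example file name

    uint8_t buf[512];                            // fixed-size chunk buffer
    while (net.connected() || net.available())
    {
        int avail = net.available();
        if (avail > 0)
        {
            size_t want = (size_t)avail < sizeof(buf) ? (size_t)avail : sizeof(buf);
            int n = net.read(buf, want);         // read at most one chunk
            if (n > 0)
                f.write(buf, n);                 // append it to the file
        }
        yield();                                 // keep the WiFi stack serviced
    }

    f.close();
    net.stop();
}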

Related

Efficiently read some bytes from DataReader?

I have a stream containing an ANSI string, prefixed with its length in bytes. How can I read it into a std::string?
Something like:
short len = reader.readInt16();
char[] result = reader.readBytes(len); // ???
std::string str = std::copy(result, result + len);
but there is no method readBytes(int).
Side question: is it slow to read with readByte() from DataReader one byte at a time?
According to MSDN, DataReader::ReadBytes exists and is what you are looking for: http://msdn.microsoft.com/en-us/library/windows/apps/windows.storage.streams.datareader.readbytes
It takes a Platform::Array<unsigned char> as an argument, which you'll presumably initialize using the prefixed length, and which will contain your bytes on return. From there it's a tedious-but-straightforward process to construct the desired std::string.
The basic usage will look something like this (apologies, on a Mac at the moment, so precise syntax might be a little off):
auto len = reader->ReadInt16();
auto data = ref new Platform::Array<uint8>(len);
reader->ReadBytes(data);
// now data has the bytes you need, and you can make a string with it
Note that the above code is not production-ready - it's definitely possible that reader does not have enough data buffered, and so you'll need to reader->LoadAsync(len) and create a continuation to process the data when it is available. Despite that, hopefully this is enough to get you going.
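For the final construction step, here is a hedged sketch (it simply repeats the snippet above and assumes reader has already been loaded with at least len bytes; the helper name is made up):
#include <string>

// Sketch only: turn the length-prefixed ANSI bytes into a std::string.
std::string ReadAnsiString(Windows::Storage::Streams::DataReader^ reader)
{
    auto len = reader->ReadInt16();
    auto data = ref new Platform::Array<uint8>(len);
    reader->ReadBytes(data);
    // Platform::Array exposes its raw storage via Data and its size via Length.
    return std::string(reinterpret_cast<const char*>(data->Data), data->Length);
}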
EDIT:
Just noticed your side question. The short answer is: yes, it is much slower to read a byte at a time, because it is much more work.
The long answer: consider what goes into reading each byte:
A function call happens - stack frame allocation
Some logic of reading a byte from the buffer happens
The function returns - stack frame is popped, result is pushed, control returns
You take the byte, and push it into a std::string, occasionally causing dynamic re-allocation (unless you've already str.resize(len), that is)
Of all the things that happen, the dynamic reallocation is the real performance killer. That being said, even without reallocation, when you have lots of bytes the function-call overhead will dominate the work of actually reading each byte.
Now, consider what happens when you read all the bytes at once:
A function call happens - stack frame, push the result array
(in the happy path where all requested data is there) memcpy from the internal buffer to your pre-allocated array
return
memcpy into the string
This is of course quite a bit faster - your allocations are constant with respect to the number of bytes read, as is the number of function calls.

How best to transfer a large number of arrays of chars to the GPU?

I am new to CUDA and am trying to do some processing of a large number of arrays. Each array is an array of about 1000 chars (not a string, just stored as chars) and there can be up to 1 million of them, so about 1 GB of data to be transferred. This data is already all loaded into memory and I have a pointer to each array, but I don't think I can rely on all the data being sequential in memory, so I can't just transfer it all with one call.
I made a first attempt at it with Thrust, basing my solution loosely on this message ... I made a struct with a static call that allocates all the memory, each individual constructor copies that array, and I have a transform call which takes the struct with the pointer to the device array.
My problem is that this is obviously extremely slow, since each array is copied individually. I'm wondering how to transfer this data faster.
In this question (the question is mostly unrelated, but I think the user is trying to do something similar) talonmies suggests that they try and use a zip iterator but I don't see how that would help transfer a large number of arrays.
I also just found out about cudaMemcpy2DToArray and cudaMemcpy2D while writing this question, so maybe those are the answer, but I don't immediately see how these would work, since neither seems to take pointers to pointers as input...
Any suggestions are welcome...
One way to do this is, as marina.k suggested, to batch your transfers only as you need them. Since you said each array only contains about 1000 chars, you could assign each char to a thread (since on Fermi we can allocate 1024 threads per block) and have each array handled by one block. In this case you may be able to transfer all the arrays for one "round" in a single call. Can you use a FORTRAN-style layout, where you make one gigantic array, and to get the 5th element of the "third" 1000-char array you would write:
third_array[5] = big_array[5 + 2*1000]
so that the first 1000-char array makes up the first 1000 elements of big_array, the second 1000-char array makes up the second 1000 elements of big_array, and so on? In this case your chars would be contiguous in memory and you could move the set you were going to process with one kernel launch in only one memcpy. Then, as soon as you launch one kernel, you refill big_array on the CPU side and copy it asynchronously to the GPU.
Within each kernel, you could simply handle each array within 1 block, so that block N handles the (N-1)-thousandth element up to the N-thousandth of d_big_array (where you copied all those chars to).
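As a rough host-side sketch of that packed layout (h_arrays, numArrays and arrayLen are placeholders for whatever the existing code uses, and in practice you would do this per "round" rather than for the whole 1 GB at once):
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Pack the scattered arrays into one contiguous host buffer and move the
// whole batch to the GPU with a single cudaMemcpy.
char* packAndUpload(char** h_arrays, int numArrays, int arrayLen)
{
    std::vector<char> big_array((size_t)numArrays * arrayLen);
    for (int i = 0; i < numArrays; ++i)
        std::memcpy(&big_array[(size_t)i * arrayLen], h_arrays[i], arrayLen);

    char* d_big_array = nullptr;
    cudaMalloc((void**)&d_big_array, big_array.size());
    cudaMemcpy(d_big_array, big_array.data(), big_array.size(), cudaMemcpyHostToDevice);
    return d_big_array;   // block N then works on d_big_array + N * arrayLen
}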
Did you try pinned memory? This may provide a considerable speed-up on some hardware configurations.
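For reference, a small sketch of allocating a pinned staging buffer with the CUDA runtime (the size is a placeholder; data staged in this buffer is then copied to the device as usual):
#include <cuda_runtime.h>

// Allocate the host staging buffer as page-locked (pinned) memory; transfers
// from pinned memory are usually faster and are required for truly
// asynchronous copies.
char* allocStaging(size_t stagingBytes)
{
    char* h_staging = nullptr;
    cudaMallocHost((void**)&h_staging, stagingBytes);   // instead of malloc/new
    return h_staging;                                   // release with cudaFreeHost later
}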
Try asynchronous transfers: you can assign the same job to different streams, with each stream processing a small part of the data, so that transfer and computation overlap.
Here is the code (it assumes nStreams streams created with cudaStreamCreate, and hostPtr allocated as pinned memory):
for (int i = 0; i < nStreams; ++i) {
    // Copy this stream's slice of the input to the device asynchronously
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);
    // Process the slice; the launch is queued on the same stream
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
    // Copy the result back, still on the same stream
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}

CUDA - device memory,searching a string in a text

I have a string of at most 500 characters and a text file of 200 MB. I want to write a program in CUDA to search for the string in the text file. The text file is too large, so I think I have to put it in global memory on the device, but what about the string? Which is best among shared, constant and texture memory, and why?
Also, I have an array of at most 2500 elements. Which type of device memory is suitable for storing it?
For a naive implementation on Fermi:
Store the text file in global memory and the search string in constant memory. Set up a result buffer of the same size as the text file. Fill the result buffer with zeroes.
Let the number of threads per block, t, be the same as the length of the search string. To determine the grid dimensions, consider the size of your text file and the per-dimension grid limit of 64K. To cover the whole file, select the x dimension to be, for instance, 10K. Then find the y dimension by dividing the size of your text file by x and rounding up: 200M / 10K = 20K (which is within 64K). Launch the kernel with t threads per block and an (x, y) grid of blocks.
In the kernel:
Calculate the block's offset into the text file as d = blockIdx.x + gridDim.x * blockIdx.y.
Since the y dimension was rounded up above, some blocks at the end of the grid must exit early. Abort the thread if d + t is greater than the size of the text file.
Otherwise, have each thread load the character at its own index (threadIdx.x) from the search string and compare it with the character at index d + threadIdx.x in the text file. If the characters don't match, store a "1" in the result buffer at index d; otherwise do nothing.
When the kernel completes, scan through the result buffer with Thrust. Each location that is 0 marks the starting point of a match.
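A host-side sketch of that launch geometry (the sizes are assumptions taken from the question, and the kernel itself is omitted):
#include <cuda_runtime.h>   // for dim3

void configureSearchGrid()
{
    const size_t fileSize  = 200u * 1024 * 1024;  // ~200 MB of text
    const int    searchLen = 500;                 // maximum pattern length
    const int    gridX     = 10 * 1024;           // e.g. 10K, well under the 64K per-dimension limit
    const int    gridY     = (int)((fileSize + gridX - 1) / gridX);  // round up to cover the whole file

    dim3 grid(gridX, gridY);
    dim3 block(searchLen);                        // one thread per character of the search string

    // Inside the kernel, each block computes its file offset as
    //   d = blockIdx.x + gridDim.x * blockIdx.y
    // and exits early if d + searchLen exceeds fileSize.
    (void)grid; (void)block;                      // the actual kernel launch is omitted in this sketch
}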
Assuming you write the kernel so that all threads in a half-warp are accessing the same element of the search string simultaneously, constant memory is likely to yield good results. It's optimized for that case.
Here's some pseudo-code for a simple baseline implementation:
...load blocksize+strlen bytes of the file into shared memory...
__syncthreads();

bool found = true;
for (int i = 0; i < strlen; i++) {
    if (file_chunk_in_sharedmem[threadIdx.x + i] !=
        search_str_in_constantmem[i])
    {
        found = false;
        break;
    }
}
if (found) {
    ...output the result...
}
If the loop is structured such that each thread is accessing a different element of the search string, 1d texture memory might be faster.
The profiler and/or cuda timing functions are your friends.
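For example, a minimal timing harness with CUDA events might look like this (wrap it around whichever kernel launch you want to measure):
#include <cuda_runtime.h>
#include <cstdio>

// Time a region of GPU work with CUDA events; the kernel launch is omitted.
void timeKernel()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... launch the search kernel here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}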

NPP library function argument *pDeviceBuffer

I notice that some of the NPP functions have an argument *pDeviceBuffer. I am wondering what this argument is for and how I should set it when using these functions. Also, the results of functions such as nppsMax_32f are written back through a pointer. Is that memory on the host or on the device? Thank you.
pDeviceBuffer is used as scratch space inside NPP. Scratch space is usually allocated internally (as in CUFFT), but some of these operations (sum, min, max) are so quick that allocating the scratch space itself could become the bottleneck. Querying for the required scratch space, allocating it once and then reusing it across multiple calls is a good idea.
Example:
Let's say you have a very large array from which you want to get the min, the max and the sum; you would need to do something like the following.
// in, max_val, min_val and sum_val are device pointers (Npp32f*)
int n = 1e6, bytes = 0;
nppsReductionGetBufferSize_32f(n, &bytes);
Npp8u *scratch = nppsMalloc_8u(bytes);

nppsMax_32f(in, n, max_val, nppAlgHintNone, scratch);
// Reusing scratch space for input of the same size
nppsMin_32f(in, n, min_val, nppAlgHintNone, scratch);
// Reusing scratch space for input of a smaller size
nppsSum_32f(in, 1e4, sum_val, nppAlgHintNone, scratch);

// Larger inputs may require more scratch space,
// so you may need to check and allocate appropriate space
int newBytes = 0;
nppsReductionGetBufferSize_32f(5e6, &newBytes);
if (newBytes > bytes) {
    nppsFree(scratch);
    scratch = nppsMalloc_8u(newBytes);
    bytes = newBytes;
}
nppsSum_32f(in, 5e6, sum_val, nppAlgHintNone, scratch);

C# Data structure Algorithm

I recently interviewed with one of the top software companies. I was completely stuck on one question the interviewer asked me, which was:
Q. I have a machine with 512 MB / 1 GB of RAM and I have to sort a file (XML, or any format) that is 4 GB in size. How would I proceed? What data structure would I use, which sorting algorithm, and how?
Do you think it is achievable? If yes then can you please explain?
Thanks in advance!
The answer the interviewer is probably looking for is how you manage to efficiently sort a data set that exceeds system memory. The following section is taken from Wikipedia:
Memory usage patterns and index sorting

When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.

For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.

One way to work around this problem, which works well when complex records (such as in a relational database) are being sorted by a relatively small key field, is to create an index into the array and then sort the index, rather than the entire array. (A sorted version of the entire array can then be produced with one pass, reading from the index, but often even that is unnecessary, as having the sorted index is adequate.) Because the index is much smaller than the entire array, it may fit easily in memory where the entire array would not, effectively eliminating the disk-swapping problem. This procedure is sometimes called "tag sort".[5]

Another technique for overcoming the memory-size problem is to combine two algorithms in a way that takes advantages of the strength of each to improve overall performance. For instance, the array might be subdivided into chunks of a size that will fit easily in RAM (say, a few thousand elements), the chunks sorted using an efficient algorithm (such as quicksort or heapsort), and the results merged as per mergesort. This is less efficient than just doing mergesort in the first place, but it requires less physical RAM (to be practical) than a full quicksort on the whole array.

Techniques can also be combined. For sorting very large sets of data that vastly exceed system memory, even the index may need to be sorted using an algorithm or combination of algorithms designed to perform reasonably with virtual memory, i.e., to reduce the amount of swapping required.
Use Divide and Conquer.
Here's the pseudocode:
function sortFile(file)
    if fileTooBigForMemory(file)
        pair<firstHalfOfFile, secondHalfOfFile> = breakIntoTwoHalves(file)
        sortFile(firstHalfOfFile)
        sortFile(secondHalfOfFile)
        MergeTwoHalvesInOrder(firstHalfOfFile, secondHalfOfFile)
    else
        sortCharactersInFile(file)
    endif
end
Two well-known algorithms that fall into the divide-and-conquer category are merge sort and quicksort, so you could use either of them for the implementation.
As for the data structure, a char array containing the characters of the file could do. If you want to be more object oriented, wrap it in a class called File:
class File {
    private char[] characters;
    // methods to access and mutate 'characters'
}
There is a nice post on the Guido van Rossum blog which has something to suggest. Beware that the code is in Python.
Split your file into chunks that fit into memory.
Sort each chunk using quicksort and save it to a separate file.
Then merge the result files and you get your result.
I would use a multiway merge. There is an excellent book called Managing Gigabytes that shows several different ways of doing it. It also covers sort-based inversion for files that are larger than physical memory. Look around page 240 for a pretty detailed algorithm for sorting chunks on disk.
The post above is correct in that you split the file and sort each portion.
Say you have the 4 GB file and only want to load a maximum of 512 MB. That means you need to split the file into at least 8 chunks. If you are not sure how much extra overhead your sort is going to use, you might even double that number to 16 chunks to be safe.
The 16 files are then sorted one at a time, so each is in a guaranteed order. Now you have chunks 0-15 as sorted files.
Now you open 16 file handles to those files and read one entry at a time, writing the lowest one to the final output. Since you know each of the files is already sorted, taking the lowest from each means you are then writing them in the correct order to the final output.
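As a rough sketch of that k-way merge step (written in C++ here to keep it short; the same structure maps directly onto C#, and the chunk file names are just placeholders):
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge 16 already-sorted chunk files line by line into one sorted output.
int main()
{
    const int kChunks = 16;                        // hypothetical chunk count
    std::vector<std::ifstream> chunks(kChunks);

    // (line, source index) pairs, ordered so the smallest line is on top
    using Entry = std::pair<std::string, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (int i = 0; i < kChunks; ++i) {
        chunks[i].open("chunk" + std::to_string(i) + ".txt");
        std::string line;
        if (std::getline(chunks[i], line)) heap.push({line, i});
    }

    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        auto [line, i] = heap.top();               // smallest remaining entry
        heap.pop();
        out << line << '\n';
        std::string next;
        if (std::getline(chunks[i], next)) heap.push({next, i});
    }
}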
I have used such a system in C# for sorting large collections of spam words from emails. The original system required all of them to be loaded into RAM in order to sort them and build a dictionary of spam counts. Once the file grew past 2 GB, the in-memory structures required 6+ GB of RAM and took over 24 hours to sort because of paging and VM thrashing. The new system, using the chunking described above, sorted the entire file in under 40 minutes. That was an impressive speedup for such a simple change.
I played with various load options (1/4 of system memory per chunk, etc.). It turned out that for our situation the best option was about 1/10 of system memory. Then Windows had enough memory left over for decent file I/O buffering to offset the increased file traffic, and the machine stayed very responsive to other processes running on it.
And yes, I do frequently like to ask these types of questions in interviews as well. Just to see if people can think outside the box. What do you do when you can't just use .Sort() on a list?
Just simulate virtual memory: overload the array index operator [].
Find a quicksort implementation that sorts an array in C++ or C#, then overload the indexer operator [] so that it reads from and writes to a file. That way you can plug in existing sort algorithms; you just change what happens behind the scenes on those [] accesses.
Here's one example of simulating virtual memory in C#:
source: http://msdn.microsoft.com/en-us/library/aa288465(VS.71).aspx
// indexer.cs
// arguments: indexer.txt
using System;
using System.IO;

// Class to provide access to a large file
// as if it were a byte array.
public class FileByteArray
{
    Stream stream; // Holds the underlying stream
                   // used to access the file.

    // Create a new FileByteArray encapsulating a particular file.
    public FileByteArray(string fileName)
    {
        stream = new FileStream(fileName, FileMode.Open);
    }

    // Close the stream. This should be the last thing done
    // when you are finished.
    public void Close()
    {
        stream.Close();
        stream = null;
    }

    // Indexer to provide read/write access to the file.
    public byte this[long index] // long is a 64-bit integer
    {
        // Read one byte at offset index and return it.
        get
        {
            byte[] buffer = new byte[1];
            stream.Seek(index, SeekOrigin.Begin);
            stream.Read(buffer, 0, 1);
            return buffer[0];
        }
        // Write one byte at offset index.
        set
        {
            byte[] buffer = new byte[1] { value };
            stream.Seek(index, SeekOrigin.Begin);
            stream.Write(buffer, 0, 1);
        }
    }

    // Get the total length of the file.
    public long Length
    {
        get
        {
            return stream.Seek(0, SeekOrigin.End);
        }
    }
}

// Demonstrate the FileByteArray class.
// Reverses the bytes in a file.
public class Reverse
{
    public static void Main(String[] args)
    {
        // Check for arguments.
        if (args.Length == 0)
        {
            Console.WriteLine("indexer <filename>");
            return;
        }

        FileByteArray file = new FileByteArray(args[0]);
        long len = file.Length;

        // Swap bytes in the file to reverse it.
        for (long i = 0; i < len / 2; ++i)
        {
            byte t;
            // Note that indexing the "file" variable invokes the
            // indexer on the FileByteArray class, which reads
            // and writes the bytes in the file.
            t = file[i];
            file[i] = file[len - i - 1];
            file[len - i - 1] = t;
        }

        file.Close();
    }
}
Use the above code to roll your own array class, then plug in any array sorting algorithm.