I'm writing a decoder-handler for a binary message format that can be decoded up to the last complete element of the message in the available bytes. So if there's an array of int64 values, you read the largest multiple of 8 bytes that does not exceed ByteBuf.readableBytes(). I'm using Netty 5.0.0.Alpha2.
My problem is that it looks as though Netty is discarding the unread bytes left in the ByteBuf rather than appending the newly received network bytes to them; this means that when I try to resume decoding, it fails because bytes are missing and the stream is corrupt.
Are there ChannelHandlerContext or ByteBuf or Channel methods which I should be invoking to preserve those unread bytes? Or is the current/only solution to save them in a scratch space within the handler myself? I suspect that buffer-pooling is the reason why a different buffer is being used for the subsequent read.
Thanks
Michael
PS: I'm not keen on using the ReplayingDecoder or ByteToMessageDecoder classes as fitting my decoder library around them would be too intrusive (IMHO).
Preserving those unread bytes is exactly what ByteToMessageDecoder does: it cumulates leftover input across reads. If I understood correctly, you want your buffer to always contain a multiple of 8 bytes, so something like this should work:
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

public class Int64ListDecoder extends ByteToMessageDecoder {
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        // Forward the largest whole number of int64s; the decoder keeps the
        // remainder and prepends it to the next network read.
        final int outLen = in.readableBytes() / 8 * 8;
        if (outLen > 0) {
            // retain() because the decoder releases the cumulation buffer
            // after decode() returns, and the slice shares its ref count.
            out.add(in.readSlice(outLen).retain());
        }
    }
}
I have the following snippet of code:
#include <stdio.h>

struct Nonsense {
    float3 group;
    float other;
};

__global__ void coalesced(float4* float4Array, Nonsense* nonsenseArray) {
    float4 someCoordinate = float4Array[threadIdx.x];
    someCoordinate.x = 5;
    float4Array[threadIdx.x] = someCoordinate;

    Nonsense nonsenseValue = nonsenseArray[threadIdx.x];
    nonsenseValue.other = 3;
    nonsenseArray[threadIdx.x] = nonsenseValue;
}

int main() {
    float4* float4Array;
    cudaMalloc(&float4Array, 32 * sizeof(float4));
    cudaMemset(float4Array, 0, 32 * sizeof(float4));  // note: the fill value comes before the count

    Nonsense* nonsenseArray;
    cudaMalloc(&nonsenseArray, 32 * sizeof(Nonsense));
    cudaMemset(nonsenseArray, 0, 32 * sizeof(Nonsense));

    coalesced<<<1, 32>>>(float4Array, nonsenseArray);
    cudaDeviceSynchronize();
    return 0;
}
When I run this through the Nvidia profiler in Nsight and look at the Global Memory Access Pattern, the float4Array has perfectly coalesced reads and writes. Meanwhile, the Nonsense array has a poor access pattern (because it is an array of structs).
Does NVCC automatically convert a float4 array, which is conceptually an array of structs, into a struct of arrays for better memory access patterns?
No, it does not convert it into a struct of arrays. If you think about this carefully, you will conclude that it is nearly impossible for the compiler to reorganize data this way. After all, the thing that is being passed is a pointer.
There is only one array, and the elements of that array still have the struct elements in the same order:
float address (i.e. index): 0 1 2 3 4 5 ...
array element : a[0].x a[0].y a[0].z a[0].w a[1].x a[1].y ...
However, the float4 array gives a better pattern because the compiler generates a single 16-byte load per thread. This is sometimes referred to as a "vector load" because we are loading a vector (a float4 in this case) per thread. Adjacent threads are still reading adjacent data, so you have ideal coalescing behavior. In the above example, thread 0 would read a[0].x, a[0].y, a[0].z, and a[0].w; thread 1 would read a[1].x, a[1].y, and so on. All of this takes place in a single request (i.e., one SASS instruction), which may be split across multiple transactions; that splitting does not result in any loss of efficiency (in this case).
In the case of the Nonsense struct, the compiler does not recognize that the struct could also be loaded in a similar fashion, so under the hood it must generate 3 or 4 loads per thread:
one 8-byte load (or two 4-byte loads) to load the first two words of the float3 group
one 4-byte load to load the last word of the float3 group
one 4-byte load to load the float other
If you map out the above loads per thread, perhaps using the above diagram, you will see that each load involves a stride (unused elements between the items loaded per thread) and so results in lower efficiency.
By using careful typecasting or a union definition in your struct, you can get the compiler to load your Nonsense struct in a single load.
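As a sketch of that idea (my own illustration, not from the original question; it relies on Nonsense being exactly 16 bytes as defined above, and on union type punning, which CUDA C++ tolerates in practice), a union can overlay the struct with a float4 so the compiler is free to emit one 16-byte vector load and store per thread:

// Reuses the question's Nonsense struct: float3 group occupies bytes 0-11,
// float other occupies bytes 12-15.
union NonsenseVec {
    Nonsense n;
    float4 v;   // x,y,z alias 'group'; w aliases 'other'; forces 16-byte alignment
};

__global__ void coalescedNonsense(NonsenseVec* arr) {
    float4 tmp = arr[threadIdx.x].v;   // one 16-byte vector load per thread
    tmp.w = 3.0f;                      // same slot as nonsenseValue.other = 3
    arr[threadIdx.x].v = tmp;          // one 16-byte vector store
}

After this change, the profiler should report the same coalesced pattern for the Nonsense data as for float4Array, though that is worth verifying as with the original example.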
This answer also covers some ideas related to AoS -> SoA conversion and the related efficiency gains.
This answer covers vector load details.
I have a stream containing an ANSI string, prefixed with its length in bytes. How can I read it into a std::string?
Something like:
short len = reader.readInt16();
char[] result = reader.readBytes(len); // ???
std::string str(result, result + len);
but there is no method readBytes(int).
Side question: is it slow to read from a DataReader one byte at a time with ReadByte()?
According to MSDN, DataReader::ReadBytes exists and is what you are looking for: http://msdn.microsoft.com/en-us/library/windows/apps/windows.storage.streams.datareader.readbytes
It takes a Platform::Array<unsigned char> as an argument, which presumably you'll initialize using the prefixed length, and which on return will contain your bytes. From there it's a tedious-but-straightforward process to construct the desired std::string.
The basic usage will look something like this (apologies, on a Mac at the moment, so precise syntax might be a little off):
auto len = reader->ReadInt16();
auto data = ref new Platform::Array<uint8>(len);
reader->ReadBytes(data);
// now data has the bytes you need, and you can make a string with it
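// One possible way to build the string (my assumption, untested here:
// Data and Length are the Platform::Array's pointer and element count):
std::string str(reinterpret_cast<const char*>(data->Data), data->Length);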
Note that the above code is not production-ready - it's entirely possible that reader does not have enough data buffered, in which case you'll need to call reader->LoadAsync(len) and create a continuation to process the data when it is available. Despite that, hopefully this is enough to get you going.
EDIT:
Just noticed your side question. The short answer is: yes, it is much slower to read one byte at a time, because it is much more work.
The long answer: consider what goes into reading each byte:
A function call happens - stack frame allocation
Some logic of reading a byte from the buffer happens
The function returns - stack frame is popped, result is pushed, control returns
You take the byte and push it into a std::string, occasionally causing a dynamic reallocation (unless you've already called str.resize(len), that is)
Of all the things that happen, the dynamic reallocation is the real performance killer. That said, if you have lots of bytes, the overhead of function calls will dominate the cost of actually reading each byte.
Now, consider what happens when you read all the bytes at once:
A function call happens - stack frame, push the result array
(in the happy path where all requested data is there) memcpy from the internal buffer to your pre-allocated array
return
memcpy into the string
This is of course quite a bit faster - your number of allocations is constant with respect to the number of bytes read, as is the number of function calls.
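To make the difference concrete, here is a minimal plain-C++ sketch (not WinRT-specific; the buffer and function names are my own) contrasting the two paths:

#include <cstring>
#include <string>

// Per-byte path: one call per byte, plus occasional string reallocations.
std::string readPerByte(const unsigned char* buf, size_t len) {
    std::string s;
    for (size_t i = 0; i < len; ++i)
        s.push_back(static_cast<char>(buf[i]));  // may reallocate as s grows
    return s;
}

// Bulk path: one allocation and one copy, independent of len.
std::string readBulk(const unsigned char* buf, size_t len) {
    std::string s(len, '\0');      // single allocation
    std::memcpy(&s[0], buf, len);  // single copy
    return s;
}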
I am implementing a compiler for a proprietary language.
The language has one built-in integer type, with unlimited range. Sometimes variables are represented using smaller types, for example if a and b are integer variables but b is only ever assigned the value of the expression a % 100000 or a & 0xFFFFFF, then b can be represented as an Int32 instead.
I am considering implementing the following optimization. Suppose it sees the equivalent of this C# method:
public static void Main(string[] args)
{
    BigInt i = 0;
    while (true)
    {
        DoStuff(i++);
    }
}
Mathematically speaking, transforming into the following is not valid:
public static void Main(string[] args)
{
    Int64 i = 0;
    while (true)
    {
        DoStuff(i++);
    }
}
Because I have replaced a BigInt with an Int64, which will eventually overflow if the loop runs forever. However, I suspect I can ignore this possibility because:
i is initialized to 0 and is modified only by repeatedly adding 1 to it, which means it will take 2^63 iterations of the loop to make it overflow
If DoStuff does any useful work, it will take centuries (extrapolated from my very crude tests) for i to overflow. The machine the program runs on will not last that long. Not only that but its architecture probably won't last that long either, so I also don't need to worry about it running on a VM that is migrated to new hardware.
If DoStuff does not do any useful work, an operator will eventually notice that it is wasting CPU cycles and kill the process
So what scenarios do I need to worry about?
Do any compilers already use this hack?
Well... it seems to me you have already answered your own question.
But I doubt the optimization really has any useful payoff.
If the language's only built-in integer type has unlimited range by default, it should not be inefficient for typical usage such as a loop counter.
Expanding the value range (and allocating more memory to the variable) only after an actual overflow occurs shouldn't be that hard to implement for such a language.
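For what it's worth, here is a minimal C++ sketch of that "promote only on actual overflow" idea (my own illustration; __builtin_add_overflow assumes GCC/Clang, and the BigInt promotion is left to whatever arbitrary-precision type the runtime provides):

#include <cstdint>

// Returns false when the increment would overflow, signalling the caller
// to switch this variable over to the big-integer representation and retry.
bool checkedIncrement(int64_t& value) {
    int64_t result;
    if (__builtin_add_overflow(value, int64_t{1}, &result)) {
        return false;  // overflow: promote to BigInt before continuing
    }
    value = result;
    return true;
}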
Please see the class I have created at http://textsnip.com/see/WAVinAS3 for parsing a WAVE file in ActionScript 3.0.
This class is correctly pulling apart info from the file header & fmt chunks, isolating the data chunk, and creating a new ByteArray to store the data chunk. It takes in an uncompressed WAVE file with a format tag of 1. The WAVE file is embedded into my SWF with the following Flex embed tag:
[Embed(source="some_sound.wav", mimeType="application/octet-stream")]
public var sound_class:Class;
public var wave:WaveFile = new WaveFile(new sound_class());
After the data chunk is separated, the class attempts to make a Sound object that can stream the samples from the data chunk. I'm having issues with the streaming process, probably because I'm not good at math and don't really know what's happening with the bits/bytes, etc.
Here are the two documents I'm using as a reference for the WAVE file format:
http://www.lightlink.com/tjweber/StripWav/Canon.html
https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
Right now, the file IS playing back! In real time, even! But...the sound is really distorted. What's going on?
The problem is in the onSampleData handler.
In your wav file, the amplitudes are stored as signed shorts, that is, 16-bit integers. You are reading them as 32-bit floats. Integers and floats are represented differently in binary, so that will never work right.
Now, the player expects floats. Why did they use floats? Don't know for sure, but one good reason is that it allows the player to accept a normalized value for each sample. That way you don't have to care or know what bit depth the player is using: the max value is 1, the min value is -1, and that's it.
So, your problem is you have to convert your signed short to a normalized signed float. A short takes 16 bits, so it can store 2^16 (or 65,536) different values. Since it's signed, one bit effectively goes to the sign, so the maximum magnitude is 2^15. You therefore know your input is in the range -32,768 ... 32,767.
The sample value the player expects, on the other hand, is normalized and must be in the range -1 ... 1.
So, you have to normalize your input. It's quite easy: just take the read value and divide it by the max value, and you have your input amplitude converted to the range -1 ... 1.
Something like this:
private function onSampleData(evt:SampleDataEvent):void
{
    var amplitude:int = 0;
    var maxAmplitude:int = 1 << (bitsPerSample - 1); // or Math.pow(2, bitsPerSample - 1);
    var sample:Number = 0;
    var actualSamples:int = 8192;
    var samplesPerChannel:int = actualSamples / channels;
    for (var c:int = 0; c < samplesPerChannel; c++) {
        var i:int = 0;
        while (i < channels && data.bytesAvailable >= 2) {
            amplitude = data.readShort();
            sample = amplitude / maxAmplitude;
            evt.data.writeFloat(sample);
            i++;
        }
    }
}
A couple of things to note:
maxAmplitude could (and probably should) be calculated when you read the bit depth. I'm doing it in the method just so you can see it in the pasted code.
Although maxAmplitude is calculated based on the read bit depth and thus will be correct for any bit depth, I'm reading shorts in the loop, so if your wav file happens to use a different bit depth, this function will not work correctly. You could add a switch and read the necessary amount of data (i.e., readInt if the bit depth is 32). However, 16 bits is such a widely used standard that I doubt this is practically needed.
This function will work for stereo wavs. If you want it to work for mono, rewrite it to write the same sample twice. That is, for each read, you do two writes (your input is mono, but the player expects 2 samples).
I removed the EOF catch, as you can know whether you have enough data to read from your buffer by checking bytesAvailable. Reaching the end of the stream is not exceptional in any way, IMO, so I'd rather handle that case without an exception handler, but this is just a personal preference.
I recently interviewed with one of the top software companies. I was completely stuck on one question the interviewer asked me, which was:
Q. You have a machine with 512 MB / 1 GB of RAM and you have to sort a file (XML, or any other) of 4 GB in size. How would you proceed? What data structure would you use, which sorting algorithm, and how?
Do you think it is achievable? If yes, then can you please explain?
Thanks in advance!
The answer the interviewer might be looking for is how you efficiently sort a data set that exceeds system memory. The following section is taken from Wikipedia:
Memory usage patterns and index sorting
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.
One way to work around this problem, which works well when complex records (such as in a relational database) are being sorted by a relatively small key field, is to create an index into the array and then sort the index, rather than the entire array. (A sorted version of the entire array can then be produced with one pass, reading from the index, but often even that is unnecessary, as having the sorted index is adequate.) Because the index is much smaller than the entire array, it may fit easily in memory where the entire array would not, effectively eliminating the disk-swapping problem. This procedure is sometimes called "tag sort".[5]
Another technique for overcoming the memory-size problem is to combine two algorithms in a way that takes advantage of the strength of each to improve overall performance. For instance, the array might be subdivided into chunks of a size that will fit easily in RAM (say, a few thousand elements), the chunks sorted using an efficient algorithm (such as quicksort or heapsort), and the results merged as per mergesort. This is less efficient than just doing mergesort in the first place, but it requires less physical RAM (to be practical) than a full quicksort on the whole array.
Techniques can also be combined. For sorting very large sets of data that vastly exceed system memory, even the index may need to be sorted using an algorithm or combination of algorithms designed to perform reasonably with virtual memory, i.e., to reduce the amount of swapping required.
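Since the quoted passage is abstract, here is a brief C++ sketch of the "tag sort" idea it describes (my own illustration; the Record type is a stand-in): sort an index of record positions by key instead of moving the large records themselves.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct Record { int key; std::string payload; };  // stand-in record type

std::vector<std::size_t> tagSort(const std::vector<Record>& records) {
    std::vector<std::size_t> index(records.size());
    std::iota(index.begin(), index.end(), 0);  // 0, 1, 2, ...
    std::sort(index.begin(), index.end(),
              [&](std::size_t a, std::size_t b) { return records[a].key < records[b].key; });
    return index;  // records[index[0]] has the smallest key, and so on
}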
Use divide and conquer.
Here's the pseudocode (note the merge belongs inside the "too big" branch, since the in-memory case has no halves to merge):
function sortFile(file)
    if fileTooBigForMemory(file)
        pair<firstHalfOfFile, secondHalfOfFile> = breakIntoTwoHalves(file)
        sortFile(firstHalfOfFile)
        sortFile(secondHalfOfFile)
        mergeTwoHalvesInOrder(firstHalfOfFile, secondHalfOfFile)
    else
        sortCharactersInFile(file)
    endif
end
Two well-known algorithms that fall into the divide-and-conquer category are merge sort and quicksort. So you could use them for the implementation.
As for the data structure, a char array containing characters in the file could do. If you want to be more object oriented, wrap it in a class called File:
class File {
    private char[] characters;
    // methods to access and mutate 'characters'
}
There is a nice post on Guido van Rossum's blog which has something to suggest. Be aware that the code is in Python.
Split your file into chunks which fit into memory.
Sort each chunk using quicksort and save it to a separate file.
Then merge the result files and you get your result.
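Here is a minimal C++ sketch of the split-and-sort phase described above (my own illustration; the line-oriented record format and the "runN.txt" file names are assumptions):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads the input in chunks of at most maxLines lines, sorts each chunk in
// memory, writes it to its own run file, and returns the number of runs.
int splitIntoSortedRuns(const std::string& inputPath, std::size_t maxLines) {
    std::ifstream in(inputPath);
    std::vector<std::string> chunk;
    std::string line;
    int runs = 0;

    auto flush = [&] {
        if (chunk.empty()) return;
        std::sort(chunk.begin(), chunk.end());
        std::ofstream out("run" + std::to_string(runs++) + ".txt");
        for (const std::string& l : chunk) out << l << '\n';
        chunk.clear();
    };

    while (std::getline(in, line)) {
        chunk.push_back(line);
        if (chunk.size() == maxLines) flush();
    }
    flush();  // final partial chunk
    return runs;
}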
I would use a multiway merge. There is an excellent book called Managing Gigabytes that shows several different ways of doing it. They also go into sort-based inversion for files that are larger than physical memory. Look around page 240 for a pretty detailed algorithm for sorting chunks on disk.
The post above is correct in that you split the file and sort each portion.
Say you have the 4 GB file and only want to load a max of 512 MB. That means you need to split the file into at least 8 chunks. If you are not sure how much extra overhead your sort is going to use, you might even double that number to be safe, to 16 chunks.
The 16 chunks are then sorted one at a time, so each is in a guaranteed order. Now you have chunks 0-15 as sorted files.
Now you open 16 file handles to those files and read one entry at a time, writing the lowest one to the final output. Since you know each of the files is already sorted, taking the lowest from each means you are then writing them in the correct order to the final output.
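Below is a hedged C++ sketch of that merge step (again my own illustration, continuing the "runN.txt" files from the earlier sketch): a min-heap keeps the smallest current line from each run at the top, so each pop writes the globally smallest remaining line.

#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merges runCount sorted run files into one sorted output file.
void mergeRuns(int runCount, const std::string& outputPath) {
    using Entry = std::pair<std::string, int>;  // (current line, run index)
    auto greater = [](const Entry& a, const Entry& b) { return a.first > b.first; };
    std::priority_queue<Entry, std::vector<Entry>, decltype(greater)> heap(greater);

    std::vector<std::ifstream> runs;
    for (int i = 0; i < runCount; ++i) {
        runs.emplace_back("run" + std::to_string(i) + ".txt");
        std::string line;
        if (std::getline(runs.back(), line)) heap.emplace(line, i);
    }

    std::ofstream out(outputPath);
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        out << e.first << '\n';
        std::string next;
        if (std::getline(runs[e.second], next)) heap.emplace(next, e.second);  // refill from that run
    }
}

Real code would add error handling and tune the I/O buffering, but this is the shape of the 16-way merge described above.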
I have used such a system in C# for sorting large collections of spam words from emails. The original system required all of them to be loaded into RAM in order to sort them and build a dictionary of spam counts. Once the file grew past 2 GB, the in-memory structures required 6+ GB of RAM and took over 24 hours to sort due to paging and virtual memory. The new system, using the chunking described above, sorted the entire file in under 40 minutes. That was an impressive speedup for such a simple change.
I played with various load options (1/4 system memory per chunk, etc). It turned out that for our situation the best option was about 1/10 system memory. Then Windows had enough memory left over for decent File I/O buffering to offset the increased file traffic. And the machine was left very responsive to other processes running on it.
And yes, I do frequently like to ask these types of questions in interviews as well. Just to see if people can think outside the box. What do you do when you can't just use .Sort() on a list?
Just simulate virtual memory: overload the array index operator, [].
Find a quicksort implementation that sorts an array in C++ or C#. Overload the indexer operator [] so that it reads from and saves to a file. That way, you can plug in existing sort algorithms; you just change what happens behind the scenes with those [].
Here's one example of simulating virtual memory in C#.
Source: http://msdn.microsoft.com/en-us/library/aa288465(VS.71).aspx
// indexer.cs
// arguments: indexer.txt
using System;
using System.IO;

// Class to provide access to a large file
// as if it were a byte array.
public class FileByteArray
{
    Stream stream; // Holds the underlying stream
                   // used to access the file.

    // Create a new FileByteArray encapsulating a particular file.
    public FileByteArray(string fileName)
    {
        stream = new FileStream(fileName, FileMode.Open);
    }

    // Close the stream. This should be the last thing done
    // when you are finished.
    public void Close()
    {
        stream.Close();
        stream = null;
    }

    // Indexer to provide read/write access to the file.
    public byte this[long index] // long is a 64-bit integer
    {
        // Read one byte at offset index and return it.
        get
        {
            byte[] buffer = new byte[1];
            stream.Seek(index, SeekOrigin.Begin);
            stream.Read(buffer, 0, 1);
            return buffer[0];
        }
        // Write one byte at offset index.
        set
        {
            byte[] buffer = new byte[1] { value };
            stream.Seek(index, SeekOrigin.Begin);
            stream.Write(buffer, 0, 1);
        }
    }

    // Get the total length of the file.
    public long Length
    {
        get
        {
            return stream.Seek(0, SeekOrigin.End);
        }
    }
}

// Demonstrate the FileByteArray class.
// Reverses the bytes in a file.
public class Reverse
{
    public static void Main(String[] args)
    {
        // Check for arguments.
        if (args.Length == 0)
        {
            Console.WriteLine("indexer <filename>");
            return;
        }

        FileByteArray file = new FileByteArray(args[0]);
        long len = file.Length;

        // Swap bytes in the file to reverse it.
        for (long i = 0; i < len / 2; ++i)
        {
            byte t;
            // Note that indexing the "file" variable invokes the
            // indexer on the FileByteArray class, which reads
            // and writes the bytes in the file.
            t = file[i];
            file[i] = file[len - i - 1];
            file[len - i - 1] = t;
        }

        file.Close();
    }
}
Use the above code to roll your own array class. Then just plug in any array sorting algorithm.