Efficient mapping of byte buffers - okio

I'm looking into the source code for Okio in order to better understand efficient byte transfer, and as a toy example I made a little ForwardingSource that inverts individual bytes as they come along. For example, it transforms (unsigned) 0b1011 to (unsigned) 0b0100.
class ByteInvertingSource(source: Source) : ForwardingSource(source) {
    // temporarily stores incoming bytes
    private val sourceBuffer: Buffer = Buffer()

    override fun read(sink: Buffer, byteCount: Long): Long {
        // read incoming bytes
        val count = delegate.read(sourceBuffer, byteCount)
        // write inverted bytes to sink
        sink.write(
            sourceBuffer.readByteArray().apply {
                println("Converting: ${joinToString(",") { it.toString(2) }}")
                forEachIndexed { index, byte -> this[index] = byte.inv() }
                println("Converted : ${joinToString(",") { it.toString(2) }}")
            }
        )
        return count
    }
}
Is this optimal code?
Specifically:
Do I really need the sourceBuffer field, or could I use another trick to transform the bytes directly?
Is it more efficient to read the individual bytes from sourceBuffer and write the individual bytes into sink? (I can't find a write(Byte) method, so maybe that is a clue that it's not.)

It looks pretty close to this testing sample from OkHttp:
https://github.com/square/okhttp/blob/f8fd4d08decf697013008b05ad7d2be10a648358/okhttp-testing-support/src/main/kotlin/okhttp3/UppercaseResponseInterceptor.kt
override fun read(
    sink: Buffer,
    byteCount: Long
): Long {
    val buffer = Buffer()
    val read = delegate.read(buffer, byteCount)
    if (read != -1L) {
        sink.write(buffer.readByteString().toAsciiUppercase())
    }
    return read
}
It is definitely not more efficient to read individual bytes. I don't think you can improve the invert loop itself, since inverting is inherently a per-byte operation. But you generally don't want to be making per-byte calls through the stream API, so definitely do the bulk reads.
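For what it's worth, the same bulk-transform idea can be sketched outside Okio. In Go (shown only as an illustration; it is not Okio code), Read fills a caller-supplied byte slice, so the inversion can happen in place without a temporary buffer, whereas Okio's Source reads into a Buffer, which is part of why a temporary sourceBuffer is a natural pattern there.

package main

import (
    "bytes"
    "fmt"
    "io"
)

// invertingReader wraps another reader and inverts every byte that
// passes through it, in bulk, directly in the caller's buffer.
type invertingReader struct {
    src io.Reader
}

func (r invertingReader) Read(p []byte) (int, error) {
    n, err := r.src.Read(p)
    for i := 0; i < n; i++ {
        p[i] = ^p[i]
    }
    return n, err
}

func main() {
    src := invertingReader{src: bytes.NewReader([]byte{0x0B})}
    out, _ := io.ReadAll(src)
    fmt.Printf("%08b\n", out[0]) // prints 11110100
}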

Related

Is it possible to read a portion of a blob without copying it through blob.slice()?

I'm trying to pass an image stored as binary to the client through a web socket and display it. However, the first byte of the binary payload is an integer denoting a request id.
Is this the correct way to read that byte and convert to an unsigned integer?
Is this inefficient, since the rest of the blob, excluding the first byte, is copied over using evt.data.slice(1)? The MDN documentation on blob.slice() seems to indicate that it returns a new blob. Does this mean that this method will require twice as much memory, since it makes a new copy of the image data less the first byte?
Is there a more efficient method that perhaps just reads from evt.data starting at the first byte rather than copying it first?
Thank you.
WEBS.socks[name].onmessage = function(evt) {
    if (evt.data instanceof Blob) {
        evt.data.slice(0, 1).arrayBuffer()
            .then((b) => {
                let v = new DataView(b);
                if (v.getUint8(0) === 123) {
                    let objectURL = URL.createObjectURL(evt.data.slice(1));
                    imgscan.onload = () => { URL.revokeObjectURL(objectURL); };
                    imgscan.src = objectURL;
                }
            });
    }
};

Streaming PipeReader content to Json in ASP.NET MVC

We have an ASP.NET MVC application in which we need to send back a JSON response. The content for this JSON response comes from a PipeReader.
The approach we have taken is to read all the content from the PipeReader using ReadAsync, convert the bytes to a base64 string, and write that.
Here is the code sample:
List<byte> bytes = new List<byte>();
try
{
    while (true)
    {
        ReadResult result = await reader.ReadAsync();
        ReadOnlySequence<byte> buffer = result.Buffer;
        bytes.AddRange(buffer.ToArray());
        reader.AdvanceTo(buffer.End);
        if (result.IsCompleted)
        {
            break;
        }
    }
}
finally
{
    await reader.CompleteAsync();
}

byte[] byteArray = bytes.ToArray();
var base64str = Convert.ToBase64String(byteArray);
We have written a JsonConverter which does the conversion to json. The JsonConverter has a reference to the Utf8JsonWriter instance and we write using the WriteString method on the Utf8JsonWriter.
The above approach requires us to read the entire content in memory from the pipereader and then write to the Utf8JsonWriter.
Instead we want to read a sequence of bytes from the pipereader, convert to utf8 and write it immediately. We do not want to convert the entire content in memory before writing.
Is that even feasible? I don't know if we can do the UTF-8 conversion in chunks instead of doing it all in one go.
The main reason for this is that the content coming from PipeReader can be large and so we want to do some kind of streaming instead of converting to string in memory and then write to the Json output.

Golang reading csv consuming more than 2x space in memory than on disk

I am loading a lot of CSV files into a struct using Golang.
The struct is
type csvData struct {
    Index   []time.Time
    Columns map[string][]float64
}
I have a parser that uses:
csv.NewReader(file).ReadAll()
Then I iterate over the rows, and convert the values into their types: time.Time or float64.
The problem is that on disk these files consume 5GB space.
Once I load them into memory they consume 12GB!
I used ioutil.ReadFile(path) and found that this was, as expected, almost exactly the on-disk size.
Here is the code for my parser, with error handling omitted for readability, in case you can help me troubleshoot:
var inMemoryRepo = make([]csvData, 0)

func LoadCSVIntoMemory(path string) {
    parsedData := csvData{make([]time.Time, 0), make(map[string][]float64)}
    file, _ := os.Open(path)
    reader := csv.NewReader(file)
    columnNames, _ := reader.Read()
    columnData, _ := reader.ReadAll()
    for _, row := range columnData {
        parsedData.Index = append(parsedData.Index, parseTime(row[0])) // parseTime is a simple wrapper for time.Parse
        for i := range row[1:] { // parse non-index numeric columns
            parsedData.Columns[columnNames[i]] = append(parsedData.Columns[columnNames[i]], parseFloat(row[i+1])) // parseFloat is a wrapper for strconv.ParseFloat
        }
    }
    inMemoryRepo = append(inMemoryRepo, parsedData)
}
I tried troubleshooting by setting columnData and reader to nil at the end of the function, but there was no change.
There is nothing surprising in this. On your disk there are just the characters (bytes) of your CSV text. When you load them into memory, you create data structures from your text.
For example, a float64 value requires 64 bits in memory, that is: 8 bytes. If your input text is "1", that is a single byte. Yet if you create a float64 value equal to 1, it will still consume 8 bytes.
Further, strings are stored having a string header (reflect.StringHeader) which is 2 integer values (16 bytes on 64-bit architectures), and this header points to the actual string data. See String memory usage in Golang for details.
Also slices are similar data structures: reflect.SliceHeader. The header consists of 3 integer values, which again is 24 bytes on 64-bit architectures even if there are no elements in the slice.
Structs on top of this may have padding (fields must be aligned to certain values), which again adds overhead. For details, see Spec: Size and alignment guarantees.
Go maps are hashmaps, which again carry quite some overhead; for details see why slice values can sometimes go stale but never map values?, and for memory usage see How much memory do golang maps reserve?
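These header sizes are easy to check with unsafe.Sizeof. A minimal sketch (mine, not from the question; the numbers in the comments are for a typical 64-bit platform, and Sizeof reports only the direct value, i.e. the header, never the backing data):

package main

import (
    "fmt"
    "time"
    "unsafe"
)

func main() {
    // Sizes of the value types used by the csvData struct above.
    fmt.Println(unsafe.Sizeof(float64(0)))     // 8: even if the CSV text was just "1"
    fmt.Println(unsafe.Sizeof(""))             // 16: string header (pointer + length)
    fmt.Println(unsafe.Sizeof([]float64(nil))) // 24: slice header (pointer + len + cap)
    fmt.Println(unsafe.Sizeof(time.Time{}))    // 24: each parsed timestamp
}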
Reading an entire file into memory is rarely a good idea.
What if your CSV is 100 GiB?
If your transformation does not need several records at once, you could apply the following algorithm:
open csv_reader (source file)
open csv_writer (destination file)
for row in csv_reader
    transform row
    write row into csv_writer
close csv_reader and csv_writer
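In Go that loop could look roughly like the sketch below (transformRow and the file names are placeholders, and error handling is reduced to the bare minimum):

package main

import (
    "encoding/csv"
    "io"
    "log"
    "os"
)

// transformRow stands in for whatever per-record work you need to do.
func transformRow(row []string) []string { return row }

func main() {
    in, err := os.Open("source.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    out, err := os.Create("destination.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    r := csv.NewReader(in)
    w := csv.NewWriter(out)
    defer w.Flush()

    for {
        row, err := r.Read() // one record at a time, never ReadAll
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        if err := w.Write(transformRow(row)); err != nil {
            log.Fatal(err)
        }
    }
}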

Quicker way to deepcopy objects in golang, JSON vs gob

I am using Go 1.9 and want to deep-copy the value of one object into another. I tried doing it with encoding/gob and encoding/json, but gob encoding takes more time than JSON encoding. Other questions I've seen suggest that gob encoding should be quicker, yet I observe the exact opposite. Am I doing something wrong? Is there a better and quicker way to deep-copy than these two? My object's struct is complex and nested.
The test code:
package main

import (
    "bytes"
    "encoding/gob"
    "encoding/json"
    "log"
    "strconv"
    "time"
)

// Test ...
type Test struct {
    Prop1 int
    Prop2 string
}

// Clone deep-copies a to b
func Clone(a, b interface{}) {
    buff := new(bytes.Buffer)
    enc := gob.NewEncoder(buff)
    dec := gob.NewDecoder(buff)
    enc.Encode(a)
    dec.Decode(b)
}

// DeepCopy deep-copies a to b using json marshaling
func DeepCopy(a, b interface{}) {
    byt, _ := json.Marshal(a)
    json.Unmarshal(byt, b)
}

func main() {
    i := 0
    tClone := time.Duration(0)
    tCopy := time.Duration(0)
    end := 3000
    for {
        if i == end {
            break
        }
        r := Test{Prop1: i, Prop2: strconv.Itoa(i)}
        var rNew Test
        t0 := time.Now()
        Clone(r, &rNew)
        t2 := time.Now().Sub(t0)
        tClone += t2

        r2 := Test{Prop1: i, Prop2: strconv.Itoa(i)}
        var rNew2 Test
        t0 = time.Now()
        DeepCopy(&r2, &rNew2)
        t2 = time.Now().Sub(t0)
        tCopy += t2
        i++
    }
    log.Printf("Total items %+v, Clone avg. %+v, DeepCopy avg. %+v, Total Difference %+v\n", i, tClone/3000, tCopy/3000, (tClone - tCopy))
}
I get following output:
Total items 3000, Clone avg. 30.883µs, DeepCopy avg. 6.747µs, Total Difference 72.409084ms
JSON vs gob difference
The encoding/gob package needs to transmit type definitions:
The implementation compiles a custom codec for each data type in the stream and is most efficient when a single Encoder is used to transmit a stream of values, amortizing the cost of compilation.
When you "first" serialize a value of a type, the definition of the type also has to be included / transmitted, so the decoder can properly interpret and decode the stream:
A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types.
This is explained in great detail here: Efficient Go serialization of struct to disk
In your case a new gob encoder and decoder is created for each clone, so the type definition is transmitted every time; that is the bottleneck, the part that makes it slow. When encoding to / decoding from JSON, no type description is included in the representation.
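To make the amortization point concrete, here is a small sketch (mine, not part of the question's test) that keeps one encoder / decoder pair alive; only the first Encode pays for the type definition:

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
)

type Test struct {
    Prop1 int
    Prop2 string
}

func main() {
    var buff bytes.Buffer
    enc := gob.NewEncoder(&buff) // one encoder for the whole stream
    dec := gob.NewDecoder(&buff)
    for i := 0; i < 3; i++ {
        enc.Encode(Test{Prop1: i, Prop2: "x"}) // error handling omitted
        fmt.Println("bytes written for this value:", buff.Len())
        var out Test
        dec.Decode(&out) // drains the buffer again
    }
    // The first value is much larger because it carries the type
    // definition; subsequent values carry only the data.
}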
To prove it, make this simple change:
type Test struct {
    Prop1 [1000]int
    Prop2 [1000]string
}
What we did here is make the field types arrays, "multiplying" the values a thousand times, while the type information effectively remains the same (all elements in the arrays have the same type). Creating values of them like this:
r := Test{Prop1: [1000]int{}, Prop2: [1000]string{}}
Now running your test program, the output on my machine:
Original:
2017/10/17 14:55:53 Total items 3000, Clone avg. 33.63µs, DeepCopy avg. 2.326µs, Total Difference 93.910918ms
Modified version:
2017/10/17 14:56:38 Total items 3000, Clone avg. 119.899µs, DeepCopy avg. 462.608µs, Total Difference -1.02812648s
As you can see, in the original version JSON is faster, but in the modified version gob became faster, as the cost of transmitting the type info was amortized.
Testing / benching method
Now on to your testing method. This way of measuring performance is bad and can yield quite inaccurate results. Instead you should use Go's built-in testing and benchmark tools. For details, read Order of the code and performance.
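For illustration, a rough benchmark sketch using the standard testing package (file layout and names are mine, not from the question); run it with go test -bench=. :

// clone_test.go
package clone

import (
    "bytes"
    "encoding/gob"
    "encoding/json"
    "testing"
)

type Test struct {
    Prop1 int
    Prop2 string
}

func BenchmarkGobClone(b *testing.B) {
    src := Test{Prop1: 1, Prop2: "1"}
    for i := 0; i < b.N; i++ {
        var dst Test
        var buff bytes.Buffer
        gob.NewEncoder(&buff).Encode(src)
        gob.NewDecoder(&buff).Decode(&dst)
    }
}

func BenchmarkJSONClone(b *testing.B) {
    src := Test{Prop1: 1, Prop2: "1"}
    for i := 0; i < b.N; i++ {
        var dst Test
        byt, _ := json.Marshal(src)
        json.Unmarshal(byt, &dst)
    }
}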
Caveats of these cloning methods
These methods work with reflection, and thus can only "clone" fields that are accessible via reflection, that is: exported fields. They also do not preserve pointer equality: if you have 2 pointer fields in a struct, both pointing to the same object (the pointers being equal), then after marshaling and unmarshaling you will get 2 different pointers pointing to 2 different values. This may even cause problems in certain situations. They also don't handle self-referencing structures, which at best returns an error, or in the worst case causes an infinite loop or goroutine stack exhaustion.
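A small sketch of the pointer-equality caveat (the types are illustrative, not from the question):

package main

import (
    "encoding/json"
    "fmt"
)

type Node struct{ Name string }

type Pair struct {
    A *Node
    B *Node
}

func main() {
    shared := &Node{Name: "shared"}
    src := Pair{A: shared, B: shared} // both fields point to the same object
    byt, _ := json.Marshal(src)

    var dst Pair
    json.Unmarshal(byt, &dst)
    fmt.Println(src.A == src.B) // true
    fmt.Println(dst.A == dst.B) // false: two distinct copies after the roundtrip
}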
The "proper" way of cloning
Considering the caveats mentioned above, often the proper way of cloning needs help from the "inside". That is, cloning a specific type is often only possible if that type (or the package of that type) provides this functionality.
Yes, providing "manual" cloning functionality is not convenient, but on the other hand it will outperform the above methods (maybe even by orders of magnitude), and it requires the least amount of working memory for the cloning process.
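As a sketch of what that help from the "inside" can look like for the Test struct above (a hand-written method, nothing the encoding packages provide for you):

package main

import "fmt"

type Test struct {
    Prop1 int
    Prop2 string
}

// Clone returns a deep copy of t. For a struct holding only plain
// values (no pointers, slices or maps) an ordinary copy is already
// deep; any field that references shared data would have to be
// duplicated explicitly here.
func (t Test) Clone() Test {
    return Test{Prop1: t.Prop1, Prop2: t.Prop2}
}

func main() {
    a := Test{Prop1: 1, Prop2: "one"}
    b := a.Clone()
    b.Prop2 = "two"
    fmt.Println(a.Prop2, b.Prop2) // "one two": the copies are independent
}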

How to store a QPixmap in JSON, via QByteArray?

I have a QByteArray, which I want to save in a JSON file using Qt and also be able to read from it again. Since JSON natively can't store raw data, I think the best way would probably be a string? The goal is to save a QPixmap this way:
{
    "format": "jpg",
    "data": "...jibberish..."
}
How do I achieve this, and how do I read from this JSON object again (I am using Qt 5)? What I have right now looks like this:
QPixmap p;
...
QByteArray ba;
QBuffer buffer(&ba);
buffer.open(QIODevice::WriteOnly);
p.save(&buffer, "jpg");

QJsonObject json;
json["data"] = QString(buffer.data());
QJsonDocument doc(json);
file.write(doc.toJson());
But the resulting "jibberish" is way too short to contain the whole image.
A QString cannot be constructed from an arbitrary QByteArray; you need to encode the byte array so that it is convertible to a string first. From the C++ semantics point of view it is somewhat misleading that a QString is constructible from a QByteArray: whether it really is depends on what's in the QByteArray.
QByteArray::toBase64 and fromBase64 are one way of doing it.
Since you would want to save the pixmap without losing its contents, you should not save it in a lossy format like JPG. Use PNG instead. Only use JPG if you're not repeatedly loading and storing the same pixmap while doing the full json->pixmap->json circuit.
There's another gotcha: for a pixmap to store or load itself, it needs to internally convert to/from QImage. This involves potentially color format conversions. Such conversions may lose data. You have to be careful to ensure that any roundtrips are made with the same format.
Ideally, you should be using QImage instead of a QPixmap. In modern Qt, a QPixmap is just a thin wrapper around a QImage anyway.
// https://github.com/KubaO/stackoverflown/tree/master/questions/pixmap-to-json-32376119
#include <QtGui>

QJsonValue jsonValFromPixmap(const QPixmap &p) {
    QBuffer buffer;
    buffer.open(QIODevice::WriteOnly);
    p.save(&buffer, "PNG");
    auto const encoded = buffer.data().toBase64();
    return {QLatin1String(encoded)};
}

QPixmap pixmapFrom(const QJsonValue &val) {
    auto const encoded = val.toString().toLatin1();
    QPixmap p;
    p.loadFromData(QByteArray::fromBase64(encoded), "PNG");
    return p;
}

int main(int argc, char **argv) {
    QGuiApplication app{argc, argv};
    QImage img{32, 32, QImage::Format_RGB32};
    img.fill(Qt::red);
    auto pix = QPixmap::fromImage(img);
    auto val = jsonValFromPixmap(pix);
    auto pix2 = pixmapFrom(val);
    auto img2 = pix2.toImage();
    Q_ASSERT(img == img2);
}