I want my consumers to process large batches, so I aim to have the consumer listener wake up on, say, 1800 MB of data or every 5 minutes, whichever comes first.
Mine is a Kafka Spring Boot application; the topic has 28 partitions, and this is the configuration I explicitly change:
| Parameter | Value I set | Default Value | Why I set it this way |
| --- | --- | --- | --- |
| fetch.max.bytes | 1801mb | 50mb | fetch.min.bytes + 1mb |
| fetch.min.bytes | 1800mb | 1b | desired batch size |
| fetch.max.wait.ms | 5min | 500ms | desired cadence |
| max.partition.fetch.bytes | 1801mb | 1mb | unbalanced partitions |
| request.timeout.ms | 5min + 1sec | 30sec | fetch.max.wait.ms + 1sec |
| max.poll.records | 10000 | 500 | 1500 found too low |
| max.poll.interval.ms | 5min + 1sec | 5min | fetch.max.wait.ms + 1sec |
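For reference, this is roughly how those values translate into plain consumer properties (a sketch using the Java client's ConsumerConfig constants, with values in bytes and milliseconds; my actual application wires these through Spring Boot configuration):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class BatchConsumerProps {
    // Sketch only: the settings from the table above as raw consumer properties.
    public static Map<String, Object> consumerProps() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1800 * 1024 * 1024);           // desired batch size
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1801 * 1024 * 1024);           // fetch.min.bytes + 1mb
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1801 * 1024 * 1024); // unbalanced partitions
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 5 * 60 * 1000);              // desired cadence
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 5 * 60 * 1000 + 1000);      // fetch.max.wait.ms + 1sec
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10_000);
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 5 * 60 * 1000 + 1000);    // fetch.max.wait.ms + 1sec
        return props;
    }
}
```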
Nevertheless, when I produce ~2 GB of data to the topic, I see the consumer listener (a Batch Listener) being called many times per second -- far more often than the desired rate.
I logged the serialized size of the ConsumerRecords<?,?> argument and found that it is never more than 55 MB.
This suggests that I was not able to raise fetch.max.bytes above the default 50 MB.
Any idea how I can troubleshoot this?
Edit:
I found this question: Kafka MSK - a configuration of high fetch.max.wait.ms and fetch.min.bytes is behaving unexpectedly
Is it really impossible as stated?
Finally found the cause.
There is a broker-side fetch.max.bytes setting, and it defaults to 55 MB. I had only changed the consumer properties and was unaware of the broker-side limit.
See also the Kafka KIP and the actual commit.
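One way to verify the effective broker-side limit is to describe the broker config with the Java AdminClient. A minimal sketch, where the bootstrap address and the broker id "1" are placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckBrokerFetchMaxBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // "1" is a placeholder broker id; use one of your cluster's broker ids.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get()
                                 .get(broker);
            // The broker-side cap that silently limits the consumer's fetch.max.bytes.
            System.out.println(config.get("fetch.max.bytes"));
        }
    }
}
```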
I am targeting 10x load for an API. The API has 6 endpoints that should be under test, but each endpoint has its own throughput, which should be multiplied by 10.
Right now I have all endpoints in one script file, but it doesn't make sense for all endpoints to get the same throughput. I want to run k6 and have it stop automatically once the needed throughput has been reached for a specific group.
Example:
api/GetUser > current 1k RPM > target 10k RPM
api/GetManyUsers > current 500 RPM > target 5k RPM
The main problem is that when I put each endpoint in a separate group inside a single script, k6 iterates over both groups/endpoints with the same iteration count and the same virtual users, which pushes both endpoints to 10x, and that is not what is required at the moment.
One more thing: I already tried putting the endpoints in separate scripts, but that is difficult to manage and makes monitoring harder, because all 6 endpoints have to run in parallel.
What you need can currently be approximated roughly with the __ITER and/or __VU execution context variables. Have a single default function that has something like this:
export default function () {
    if (__ITER % 3 == 0) {
        CallGetManyUsers(); // ~33% of iterations
    } else {
        CallGetUser(); // ~67% of iterations
    }
}
In the very near future we plan to also add a more elegant way of supporting multi-scenario tests in a single script: https://github.com/loadimpact/k6/pull/1007
I have a big binary of IP data, about X MB. Processes use the binary to run a search algorithm that looks up IP addresses. I have three methods:
1. Put it in ETS, but I suppose every read access will copy the big binary into the process. :(
2. Put it in gen_server state, and have processes use gen_server:call to get an address. The shortcoming is concurrency.
3. Compile the binary into a BEAM module, but when I compile I get
eheap_alloc: Cannot allocate 1318267840 bytes of memory (of type "heap")
Which is the best practice for sharing big data in Erlang?
Binaries over 64 bytes in size are stored as reference counted binaries and their data is stored outside the heap of any process. If such a binary is sent to any process, the underlying data is not duplicated. So, if you store such a binary in an ETS table and then access it from various processes, the underlying data will not be copied, only its reference count will be incremented/decremented. I'd suggest going with the ETS table solution.
Here's a demonstration of the memory usage at boot, after inserting a 100 MB binary into an ETS table, and after fetching a copy of the binary into the shell process. The memory usage does not change after we have a copy of the binary stored in the shell process. The same would not be true if it were a million-character string (a list of integers) that we were copying in from ETS or from another process.
1> erlang:memory().
[{total,21912472},
{processes,5515456},
{processes_used,5510816},
{system,16397016},
{atom,223561},
{atom_used,219143},
{binary,844872},
{code,4808780},
{ets,301232}]
2> ets:new(foo, [named_table, set]).
foo
3> ets:insert(foo, {foo, binary:copy(<<".">>, 104857600)}).
true
4> erlang:memory().
[{total,127038632},
{processes,5600320},
{processes_used,5599952},
{system,121438312},
{atom,223561},
{atom_used,220445},
{binary,105770576},
{code,4908097},
{ets,308416}]
5> X = ets:lookup(foo, foo).
[{foo,<<"........................................................................................................"...>>}]
6> erlang:memory().
[{total,127511632},
{processes,6082360},
{processes_used,6081992},
{system,121429272},
{atom,223561},
{atom_used,220445},
{binary,105761504},
{code,4908097},
{ets,308416}]
You can find a lot more info about how to efficiently work with binaries in Erlang in the link above.
I am fairly new to both Kafka and Spark and trying to write a job (either Streaming or batch). I would like to read from Kafka a predefined number of messages (say x), process the collection through workers and then only start working on the next set of x messages. Basically each message in Kafka is 10 KB and I want to put 2 GB worth of messages in a single S3 file.
So is there any way of specifying the number of messages that the receiver fetches?
I have read that I can specify 'from offset' while creating DStream, but this use case is somewhat different. I need to be able to specify both 'from offset' and 'to offset'.
There's no way to set an ending offset as an initial parameter (as you can for the starting offset), but
you can use createDirectStream (the fourth overloaded version in the listing), which gives you the ability to get the offsets of the current micro-batch via HasOffsetRanges (which gives you back OffsetRange).
That means you'll have to compare the values you get from OffsetRange with your ending offset in every micro-batch to see where you are and when to stop consuming from Kafka.
I guess you also need to keep in mind that each partition has its own sequential offsets. I assume it would be easiest if you could go a bit over 2 GB, as much as it takes to finish the current micro-batch (it could be a couple of KB, depending on the density of your messages), in order to avoid splitting the last batch into a consumed and an unconsumed part, which might require you to fiddle with the offsets Spark keeps for tracking what has been consumed and what hasn't.
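As a rough illustration of that loop, here is a sketch using the spark-streaming-kafka-0-10 Java API; the topic name, group id, bootstrap servers, batch interval and the stopping logic are placeholders, and the older createDirectStream overload mentioned above exposes the same offset information:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class BoundedKafkaRead {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("bounded-kafka-read");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");   // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "bounded-reader");            // placeholder

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("my-topic"), kafkaParams));

        stream.foreachRDD(rdd -> {
            // Offsets covered by this micro-batch, one range per partition.
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange r : ranges) {
                System.out.println(r.topic() + "-" + r.partition()
                        + ": " + r.fromOffset() + " -> " + r.untilOffset());
            }
            // Process/write the batch (e.g. to S3) here, then compare the
            // untilOffset values with your target "to offset" and stop the
            // streaming context once it has been reached.
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```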
Hope this helps.
I'm writing a tool to parse and extract some data from a .cpuprofile file (the file produced when I save a profile report), and I'm having some trouble with the precision of the Self & Total time calculations. The time depends on the value of the hitCount field, but when hitCount is small (<300), the coefficient between hitCount and Self time is ~1.033, and as hitCount grows, the coefficient grows too.
For example, when hitCount=3585, k is 1.057; when hitCount=7265, k=1.066.
Currently I'm using 1.035 as the coefficient, chosen to minimize the error on my sample data, but I'm not happy with the approximation, and I'm not familiar enough with the Chromium code base to go and figure it out directly in the source.
So how do I get the Self time for a function call given its hitCount value?
Basically it's:
sampleDuration = totalRecordingDuration / totalHitCount
nodeSelfTime = nodeHitCount * sampleDuration
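For example, with purely illustrative numbers: if the recording lasted 10 seconds and contains 9,680 samples in total, then sampleDuration ≈ 10,000 ms / 9,680 ≈ 1.033 ms, so a node with hitCount = 300 gets a self time of roughly 300 × 1.033 ≈ 310 ms.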
You can find it here:
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/devtools/front_end/sdk/CPUProfileDataModel.js&sq=package:chromium&type=cs&l=31