Kafka SparkStreaming configuration specify offsets/message list size - configuration

I am fairly new to both Kafka and Spark and trying to write a job (either Streaming or batch). I would like to read from Kafka a predefined number of messages (say x), process the collection through workers and then only start working on the next set of x messages. Basically each message in Kafka is 10 KB and I want to put 2 GB worth of messages in a single S3 file.
So is there any way of specifying the number of messages that the receiver fetches?
I have read that I can specify 'from offset' while creating DStream, but this use case is somewhat different. I need to be able to specify both 'from offset' and 'to offset'.

There is no way to set an ending offset as an initial parameter (as you can for the starting offset), but you can use createDirectStream (the fourth overloaded version in the listing), which lets you read the offsets of the current micro-batch through HasOffsetRanges (which gives you back OffsetRange objects).
That means you will have to compare the values you get from OffsetRange with your ending offset in every micro-batch to see where you are and when to stop consuming from Kafka.
You also need to keep in mind that each partition has its own sequential offsets. It would probably be easiest if you could go a bit over 2 GB, by as much as it takes to finish the current micro-batch (possibly a few kB, depending on the density of your messages). That avoids splitting the last batch into a consumed and an unconsumed part, which might require you to fiddle with the offsets Spark keeps to track what has been consumed and what has not.
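The bookkeeping described above can be sketched in plain Python. This is only an illustration of the arithmetic, not Spark code: the tuples `(partition, from_offset, until_offset)` are a stand-in for Spark's OffsetRange objects, the function names are hypothetical, and the 2 GB and 10 KB figures come from the question.

```python
# Sketch only: plain tuples stand in for Spark's OffsetRange objects.
TARGET_BYTES = 2 * 1024**3      # 2 GB per output file (from the question)
AVG_MESSAGE_BYTES = 10 * 1024   # roughly 10 KB per message (from the question)
TARGET_MESSAGES = TARGET_BYTES // AVG_MESSAGE_BYTES


def messages_in_batch(offset_ranges):
    """Count the messages covered by one micro-batch.

    offset_ranges: iterable of (partition, from_offset, until_offset),
    where until_offset is exclusive. Offsets are per partition, so the
    counts are summed across partitions.
    """
    return sum(until - start for _partition, start, until in offset_ranges)


def batches_until_target(batches, target_messages=TARGET_MESSAGES):
    """Return (batches consumed, messages consumed) once the target is met.

    The last batch is always consumed whole, so the total may overshoot
    the target slightly. Returns None if the target is never reached.
    """
    consumed = 0
    for index, offset_ranges in enumerate(batches, start=1):
        consumed += messages_in_batch(offset_ranges)
        if consumed >= target_messages:
            return index, consumed
    return None
```

Letting the final micro-batch finish, rather than cutting it at exactly the target, is the "go a bit over 2 GB" simplification suggested above.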
Hope this helps.

Kafka Consumer - How to set fetch.max.bytes higher than the default 50mb?

I want my consumers to process large batches, so I aim to have the consumer listener "awake", say, on 1800mb of data or every 5min, whichever comes first.
Mine is a kafka-springboot application, the topic has 28 partitions, and this is the configuration I explicitly change:
Parameter                 | Value I set | Default value | Why I set it this way
fetch.max.bytes           | 1801mb      | 50mb          | fetch.min.bytes + 1mb
fetch.min.bytes           | 1800mb      | 1b            | desired batch size
fetch.max.wait.ms         | 5min        | 500ms         | desired cadence
max.partition.fetch.bytes | 1801mb      | 1mb           | unbalanced partitions
request.timeout.ms        | 5min+1sec   | 30sec         | fetch.max.wait.ms + 1sec
max.poll.records          | 10000       | 500           | 1500 found too low
max.poll.interval.ms      | 5min+1sec   | 5min          | fetch.max.wait.ms + 1sec
Nevertheless, I produce ~2gb of data to the topic, and I see the consumer-listener (a Batch Listener) is called many times per second -- way more than desired rate.
I logged the serialized-size of the ConsumerRecords<?,?> argument, and found that it is never more than 55mb.
This hints that I was not able to set fetch.max.bytes above the default 50mb.
Any idea how I can troubleshoot this?
Edit:
I found this question: Kafka MSK - a configuration of high fetch.max.wait.ms and fetch.min.bytes is behaving unexpectedly
Is it really impossible as stated?
Finally found the cause.
There is a broker-side fetch.max.bytes setting, and it defaults to 55mb. I had only changed the consumer settings, unaware of the broker-side limit.
See also: the Kafka KIP and the actual commit.

AS3 - Massive Numbers/Integers, Beyond MAX_VALUE

Can anyone help me write a class, e.g. BigNumber.as (or BigInt.as) which will:
Allow for really really big numbers/integers.
Include a method to express a number in format "1.54 Million", "1.98 Vigintillion" and so on...
Allow the maximum number to stop only at the last number word (e.g. Million, Vigintillion, etc) in the defined list. (e.g. list built from here: https://en.wikipedia.org/wiki/Names_of_large_numbers under Standard dictionary numbers [Short scale])
I had an idea to have a class which contains 2 Number values ("value" and "timesMaxedOut"). When "value" >= Number.MAX_VALUE, it would then increment "timesMaxedOut" by 1 and reset "value" back to the difference that the value went over by.
The problem? It seems if you hit or surpass "MAX_VALUE" then the Number will reset to 0. I'm also sure it would then be difficult to properly multiply or divide numbers with this approach, as it would need to take into account "timesMaxedOut" just for the calculations to work correctly.
My goal is to write a game which would allow players to reach really big numbers, and play indefinitely essentially, but AS3 lacks very large number support it seems.

JMeter: Capturing Throughput in Command Line Interface Mode

In Jmeter v2.13, is there a way to capture Throughput via non-GUI/Command Line mode?
I have the jmeter.properties file configured to output via the Summariser and I'm also outputting another [more detailed] .csv results file.
call ..\..\binaries\apache-jmeter-2.13\bin\jmeter -n -t "API Performance.jmx" -l "performanceDetailedResults.csv"
The performanceDetailedResults.csv file provides:
timeStamp
elapsed time
responseCode
responseMessage
threadName
success
failureMessage
bytes sent
grpThreads
allThreads
Latency
However, no amount of tweaking the .properties file or the test itself seems to provide Throughput results like the ones I get via the Save Table Data button of the GUI's Summary Report.
All articles, postings, and blogs seem to indicate it wasn't possible without manual manipulation in a spreadsheet. But I'm hoping someone out there has figured out a way to do this with no, or minimal, manual manipulation as the client doesn't want to have to manually calculate the Throughput value each time.
Throughput is calculated by JMeter Listeners, so it isn't something you can enable via properties files. The same applies to other calculated metrics, such as:
Average response time
50, 90, 95, and 99 percentiles
Standard Deviation
Basically, throughput is simply the total number of requests divided by the elapsed time.
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time)
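As a minimal sketch of that formula applied to a results file, the Python below reads the timeStamp and elapsed columns and divides the request count by the time from the start of the first sample to the end of the last. It assumes the CSV has a header row, that timeStamp is the sample start in epoch milliseconds (this depends on the sampleresult.timestamp.start property), and that elapsed is in milliseconds. The function name is hypothetical.

```python
import csv
import io


def throughput_per_second(csv_file):
    """Requests per second, from first sample start to last sample end."""
    count = 0
    first_start = None
    last_end = None
    for row in csv.DictReader(csv_file):
        start = int(row["timeStamp"])      # sample start, epoch ms (assumed)
        end = start + int(row["elapsed"])  # sample end, epoch ms
        first_start = start if first_start is None else min(first_start, start)
        last_end = end if last_end is None else max(last_end, end)
        count += 1
    if count == 0 or last_end == first_start:
        return 0.0                         # no samples, or zero duration
    return count / ((last_end - first_start) / 1000.0)


# Tiny in-memory example: 3 requests spanning 1000 ms .. 3000 ms.
sample = io.StringIO(
    "timeStamp,elapsed,success\n"
    "1000,200,true\n"
    "1500,300,true\n"
    "2500,500,true\n"
)
print(throughput_per_second(sample))  # 1.5
```

For a real run, pass an open file instead of the in-memory sample, for example `open("performanceDetailedResults.csv", newline="")`.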
Hopefully it won't be too hard for you.
References:
Glossary #1
Glossary #2
Did you take a look at JMeter-Plugins?
It can generate an aggregate report from the command line.

Windows Phone IsolatedStorageSettings: capacity and dynamic allocation

I am going to save fairly large amounts of data in my WP8 app using the handy IsolatedStorageSettings dictionary. However, the first question that arises is: how big is it?
Second, in the documentation for the IsolatedStorageSettings.Save method we can find this:
If more space is required, use the IsolatedStorageFile.IncreaseQuotaTo
method to request more storage space from the host.
Can we estimate the amount of required memory and increase the room for IsolatedStorageSettings accordingly? What if we need to do that dynamically, as the user is entering new portions of data to store persistently? Or, maybe, we need to use another technique for that (though I would like to stay with the handy IsolatedStorageSettings class)?
I have found the answer to the first part of my question in this article: How to find out the Space in isolated storage in Windows Phone?. Here is the code to get the required value on a particular device with some enhancements:
long availablespace, Quota;
using (var store = IsolatedStorageFile.GetUserStoreForApplication())
{
    // Free space and total quota of the app's isolated storage, in bytes
    availablespace = store.AvailableFreeSpace;
    Quota = store.Quota;
}
MessageBox.Show("Available : " + availablespace.ToString("##,#") + "\nQuota : " + Quota.ToString("##,#"));
The 512Mb WP8 emulator gave me the following values for a minimal app with few strings saved in IsolatedStorageSettings:
The Lumia 920 reports an even bigger value, about 20Gb, which gladdens my heart. Such a big value (which, I think, depends on the whole available memory in the device) will allow me to use the IsolatedStorageSettings object for huge amounts of data.
As for a method one can use to estimate the amount of data, I guess, this can be done only experimentally. For instance, when I added some strings to my IsolatedStorageSettings, the available space was reduced by 4Kb. However, adding the same portion of data again did not show any new memory allocation. As I can see, it is allocated by blocks of 4Kb.

How to detect local maxima and curve windows correctly in semi complex scenarios?

I have a series of data and need to detect peak values in the series within a certain number of readings (window size) and excluding a certain level of background "noise." I also need to capture the starting and stopping points of the appreciable curves (ie, when it starts ticking up and then when it stops ticking down).
The data are high precision floats.
Here's a quick sketch that captures the most common scenarios that I'm up against visually:
One method I attempted was to pass a window of size X along the curve going backwards to detect the peaks. It started off working well, but I missed a lot of conditions initially not anticipated. Another method I started to work out was a growing window that would discover the longer duration curves. Yet another approach used a more calculus based approach that watches for some velocity / gradient aspects. None seemed to hit the sweet spot, probably due to my lack of experience in statistical analysis.
Perhaps I need to use some kind of a statistical analysis package to cover my bases vs writing my own algorithm? Or would there be an efficient method for tackling this directly with SQL with some kind of local max techniques? I'm simply not sure how to approach this efficiently. Each method I try it seems that I keep missing various thresholds, detecting too many peak values or not capturing entire events (reporting a peak datapoint too early in the reading process).
Ultimately this is implemented in Ruby and so if you could advise as to the most efficient and correct way to approach this problem with Ruby that would be appreciated, however I'm open to a language agnostic algorithmic approach as well. Or is there a certain library that would address the various issues I'm up against in this scenario of detecting the maximum peaks?
My idea is simple: once you have your window of interest, find all the peaks in it by comparing each value with its previous and next neighbours. That tells you where the peaks occur, and you can then decide which one is the best peak.
I wrote a simple MATLAB script to show the idea.
My example uses a waveform from an audio file :-)
waveFile='Chick_eco.wav';
[y, fs, nbits]=wavread(waveFile);
subplot(2,2,1); plot(y); legend('Original signal');
startIndex=15000;
WindowSize=100;
endIndex=startIndex+WindowSize-1;
frame = y(startIndex:endIndex);
nframe=length(frame)
%find the peaks
peaks = zeros(nframe,1);
k=3;
while(k <= nframe - 1)
    y1 = frame(k - 1);
    y2 = frame(k);
    y3 = frame(k + 1);
    if (y2 > 0)
        if (y2 > y1 && y2 >= y3)
            peaks(k)=frame(k);
        end
    end
    k=k+1;
end
peaks2=peaks;
peaks2(peaks2<=0)=nan;
subplot(2,2,2); plot(frame); legend('Get Window Length = 100');
subplot(2,2,3); plot(peaks); legend('Where are the PEAKS');
subplot(2,2,4); plot(frame); legend('Peaks in the Window');
hold on; plot(peaks2, '*');
for j = 1 : nframe
    if (peaks(j) > 0)
        fprintf('Local=%i\n', j);
        fprintf('Value=%i\n', peaks(j));
    end
end
%Where the Local Maxima occur
[maxivalue, maxi]=max(peaks)
You can see all the peaks and where they occur:
Local=37
Value=3.266296e-001
Local=51
Value=4.333496e-002
Local=65
Value=5.049438e-001
Local=80
Value=4.286804e-001
Local=84
Value=3.110046e-001
I'll propose a couple of different ideas. One is to use discrete wavelets, the other is to use the geographer's concept of prominence.
Wavelets: Apply some sort of wavelet decomposition to your data. There are multiple choices, with Daubechies wavelets being the most widely used. You want the low frequency peaks. Zero out the high frequency wavelet elements, reconstruct your data, and look for local extrema.
Prominence: Those noisy peaks and valleys are of key interest to geographers. They want to know exactly which of a mountain's multiple little peaks is tallest, and the exact location of the lowest point in the valley. Find the local minima and maxima in your data set. You should have a sequence of min/max/min/max/.../min. (You might want to add arbitrary end points that are lower than your global minimum.) Consider a min/max/min sequence. Classify each of these triples by the difference between the max and the larger of the two minima. Make a reduced sequence that replaces the smallest of these triples with the smaller of the two minima. Iterate until you get down to a single min/max/min triple. In your example, you want the next layer down, the min/max/min/max/min sequence.
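A minimal Python sketch of the prominence idea follows. Note that it is not the iterative triple-reduction described above: it computes each peak's prominence directly, by walking outward on both sides until a higher sample (or the end of the data) is found and measuring the peak against the higher of the two valleys. The function names and the `min_prominence` threshold are hypothetical.

```python
def local_maxima(values):
    """Indices of interior samples higher than the left neighbour
    and at least as high as the right neighbour."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]


def prominence(values, peak):
    """Height of values[peak] above the higher of its two bounding valleys.

    On each side, walk outward until a strictly higher sample (or the end
    of the data) is reached, remembering the lowest sample seen on the way.
    """
    height = values[peak]

    left_min = height
    i = peak - 1
    while i >= 0 and values[i] <= height:
        left_min = min(left_min, values[i])
        i -= 1

    right_min = height
    j = peak + 1
    while j < len(values) and values[j] <= height:
        right_min = min(right_min, values[j])
        j += 1

    return height - max(left_min, right_min)


def prominent_peaks(values, min_prominence):
    """Local maxima whose prominence is at least min_prominence."""
    return [p for p in local_maxima(values)
            if prominence(values, p) >= min_prominence]


data = [0, 10, 8, 12, 0, 50, 45, 48, 10, 0]
print(prominent_peaks(data, 10))  # [3, 5]
```

In the sample data, the small bumps at indices 1 and 7 have prominences of only 2 and 3, so a threshold of 10 filters them out as noise.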
Note: I'm going to describe the algorithmic steps as if each pass were distinct. Obviously, in a specific implementation, you can combine steps where it makes sense for your application. For the purposes of my explanation, it makes the text a little more clear.
I'm going to make some assumptions about your problem:
The windows of interest (the signals that you are looking for) cover a fraction of the entire data space (i.e., it's not one long signal).
The windows have significant scope (i.e., they aren't one pixel wide on your picture).
The windows have a minimum peak of interest (i.e., even if the signal exceeds the background noise, the peak must exceed the background by an additional margin).
The windows will never overlap (i.e., each can be examined as a distinct sub-problem out of context of the rest of the signal).
Given those, you can first look through your data stream for a set of windows of interest. You can do this by making a first pass through the data: moving from left to right, look for noise threshold crossing points. If the signal was below the noise floor and exceeds it on the next sample, that's a candidate starting point for a window (vice versa for the candidate end point).
Now make a pass through your candidate windows: compare the scope and contents of each window with the values defined above. To use your picture as an example, the small peaks on the left of the image barely exceed the noise floor and do so for too short a time. However, the window in the center of the screen clearly has a wide time extent and a significant max value. Keep the windows that meet your minimum criteria, discard those that are trivial.
Now examine your remaining windows in detail (remember, they can be treated individually). The peak is easy to find: pass through the window and keep the local max. With respect to the leading and trailing edges of the signal, you can see in the picture that the actual signal is slightly wider than the window defined by the points at which it crosses the noise floor. In this case, you can use a finite difference approximation to calculate the first derivative of the signal. You know that the leading edge will be somewhat to the left of the window on the chart: look for a point at which the first derivative exceeds a positive noise floor of its own (the slope turns upwards sharply). Do the same for the trailing edge (which will always be to the right of the window).
Result: a set of time windows, the leading and trailing edges of the signals and the peak that occurred in each window.
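The passes above can be sketched in Python roughly as follows. This is a sketch under the stated assumptions (non-overlapping windows, a known noise floor), and the function names and the `noise_floor`, `min_width`, `min_peak` and `slope_floor` parameters are hypothetical values you would have to tune for your data.

```python
def find_windows(signal, noise_floor, min_width, min_peak):
    """Return (start, end, peak_index) for each qualifying window.

    start and end are inclusive indices of the run of samples that lie
    strictly above noise_floor.
    """
    candidates = []
    start = None
    for i, value in enumerate(signal):
        above = value > noise_floor
        if above and start is None:
            start = i                          # upward crossing: candidate start
        elif not above and start is not None:
            candidates.append((start, i - 1))  # downward crossing: candidate end
            start = None
    if start is not None:                      # still above the floor at the end
        candidates.append((start, len(signal) - 1))

    results = []
    for start, end in candidates:
        peak_index = max(range(start, end + 1), key=signal.__getitem__)
        # Keep only windows that are wide enough and tall enough.
        if end - start + 1 >= min_width and signal[peak_index] >= min_peak:
            results.append((start, end, peak_index))
    return results


def extend_edges(signal, start, end, slope_floor):
    """Widen a window while the finite-difference slope stays significant."""
    while start > 0 and signal[start] - signal[start - 1] > slope_floor:
        start -= 1                             # still rising into the window
    while end < len(signal) - 1 and signal[end] - signal[end + 1] > slope_floor:
        end += 1                               # still falling out of the window
    return start, end


signal = [0, 0, 1, 3, 1, 0, 0, 2, 6, 9, 7, 3, 1, 0, 0]
for start, end, peak in find_windows(signal, noise_floor=2, min_width=3, min_peak=5):
    print(extend_edges(signal, start, end, slope_floor=0.5), peak)  # (6, 13) 9
```

The short excursion at index 3 is discarded as too narrow, while the main event is kept and its edges are pushed outward from the threshold crossings (8 and 11) to where the slope flattens (6 and 13).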
It looks like the definition of a window is the range of x over which y is above the threshold. So use that to determine the size of the window. Within that, locate the largest value, thus finding the peak.
If that fails, then what additional criteria do you have for defining a region of interest? You may need to nail down your implicit assumptions to more than 'that looks like a peak to me'.