Spark streaming maxRate is violated sometimes - configuration

I have a simple Spark Streaming process (1.6.1) which receives data from Azure Event Hub. I am experimenting with back pressure and maxRate settings. This is my configuration:
spark.streaming.backpressure.enabled = true
spark.streaming.backpressure.pid.minRate = 900
spark.streaming.receiver.maxRate = 1000
I use two receivers, therefore per micro-batch I would expect 2000 messages in total. Most of the time this works fine (the total event count is below or equal to the maxRate value). However, sometimes I get spikes which violate the maxRate value.
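A minimal PySpark sketch of how such settings might be applied (this assumes a 1-second batch interval and that the values are set on the SparkConf; the actual submission setup isn't shown in the question):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("eventhub-backpressure")
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.pid.minRate", "900")
        .set("spark.streaming.receiver.maxRate", "1000"))
sc = SparkContext(conf=conf)

# maxRate is per receiver per second: with a 1-second batch interval and
# 2 receivers, the expected ceiling is 1000 * 1 * 2 = 2000 events per micro-batch.
ssc = StreamingContext(sc, 1)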
My test case is as follows:
- send 10k events to Azure Event Hub
- mock job/cluster downtime (no streaming job running) with a 60-second delay
- start the streaming job
- process events and assert that the event count is smaller than or equal to 2000
In that test I can observe that the total number of events is sometimes higher than 2000, for example: 2075, 2530, 2040. It is not significantly higher and the processing is not time consuming, but I would still expect the total number of events per micro-batch to obey the maxRate value. Furthermore, sometimes the total number of events is smaller than backpressure.pid.minRate, for example: 811, 631.
Am I doing something wrong?

Related

Kafka Consumer - How to set fetch.max.bytes higher than the default 50mb?

I want my consumers to process large batches, so I aim to have the consumer listener "wake up", say, on 1800mb of data or every 5 minutes, whichever comes first.
Mine is a Kafka Spring Boot application; the topic has 28 partitions, and this is the configuration I explicitly change:
Parameter | Value I set | Default value | Why I set it this way
fetch.max.bytes | 1801mb | 50mb | fetch.min.bytes + 1mb
fetch.min.bytes | 1800mb | 1b | desired batch size
fetch.max.wait.ms | 5min | 500ms | desired cadence
max.partition.fetch.bytes | 1801mb | 1mb | unbalanced partitions
request.timeout.ms | 5min + 1sec | 30sec | fetch.max.wait.ms + 1sec
max.poll.records | 10000 | 500 | 1500 found too low
max.poll.interval.ms | 5min + 1sec | 5min | fetch.max.wait.ms + 1sec
Nevertheless, I produce ~2gb of data to the topic, and I see the consumer listener (a Batch Listener) being called many times per second -- far more often than the desired rate.
I logged the serialized-size of the ConsumerRecords<?,?> argument, and found that it is never more than 55mb.
This hints that I was not able to set fetch.max.bytes above the default 50mb.
Any idea how I can troubleshoot this?
Edit:
I found this question: Kafka MSK - a configuration of high fetch.max.wait.ms and fetch.min.bytes is behaving unexpectedly
Is it really impossible as stated?
Finally found the cause.
There is a broker-side fetch.max.bytes setting, and it defaults to 55mb. I only changed the consumer-side settings, unaware of the broker-side limit.
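To illustrate the interplay, here is a rough sketch using kafka-python rather than the Spring Boot configuration above (the topic and broker address are placeholders): however large the consumer-side fetch request is, each fetch response is still capped by the broker-side fetch.max.bytes, so both sides have to be raised.

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="broker:9092",
    fetch_max_bytes=1888485376,            # ~1801mb requested by the consumer
    fetch_min_bytes=1887436800,            # 1800mb, the desired batch size
    fetch_max_wait_ms=300000,              # 5 min
    max_partition_fetch_bytes=1888485376,  # ~1801mb
    request_timeout_ms=301000,             # fetch.max.wait.ms + 1 sec
)
# The effective response size is min(consumer fetch.max.bytes, broker fetch.max.bytes).
# The broker-side default is ~55mb (KIP-541), so it has to be raised in the broker
# configuration as well, not only on the consumer.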
See also: the Kafka KIP and the actual commit.

TransactionError when using Brownie on Optimism - Tx dropped without known replacement

I have a Python script using Brownie that occasionally triggers a swap on Uniswap by sending a transaction to Optimism Network.
It worked well for a few days (did multiple transactions successfully), but now each time it triggers a transaction, I get an error message:
TransactionError: Tx dropped without known replacement
However, the transaction goes through and gets validated, but the script stops.
swap_router = interface.ISwapRouter(router_address)
params = (
    weth_address,         # tokenIn
    dai_address,          # tokenOut
    3000,                 # pool fee tier (0.3%)
    account.address,      # recipient
    time.time() + 86400,  # deadline
    amount * 10 ** 18,    # amountIn
    0,                    # amountOutMinimum
    0,                    # sqrtPriceLimitX96
)
amountOut = swap_router.exactInputSingle(params, {"from": account})
There is a possibility that one of your methods seeks data off-chain and is being called prematurely before the confirmation is received.
I had the same problem, and I managed to sort it out by adding
time.sleep(60)
at the end of the function that fetches data off-chain.
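A rough sketch of that workaround around the swap call above, assuming the exception Brownie raises here is brownie.exceptions.TransactionError (adjust to whatever your Brownie version actually raises):

import time
from brownie.exceptions import TransactionError  # assumed exception class

def trigger_swap():
    try:
        return swap_router.exactInputSingle(params, {"from": account})
    except TransactionError:
        # The tx usually still gets mined; pause before any off-chain follow-up
        # calls and verify the swap on-chain (explorer or balance check) before retrying.
        time.sleep(60)
        return None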
"Dropped and replaced" means the transaction is being replaced by a new one, Eth is being overloaded with a new gas fee. My guess is that you need to increase your gas costs in order to average the price.

Fargate scaling up works consistently but scaling down is not working consistently

We have a simple example of target-tracking autoscaling configured for an ECS containerized application, based on CPU and memory. The code below auto-configures 4 alarms (2 for CPU: 1 scale-up, 1 scale-down; and 2 for memory: 1 scale-up, 1 scale-down).
We see that when the CloudWatch alarms trigger for scaling up, our ECS service tasks scale up instantly (on the ECS side, events setting the desired count upwards are present straight away).
However, we are observing different behaviour when the CloudWatch alarms trigger for scaling down:
- Sometimes the ECS service tasks scale down straight away (the scale-down alarm goes off straight away and a set-desired-count-downwards event is present straight away on the ECS side)
- Sometimes the ECS service tasks scale down at a delayed time, e.g. 7-15 minutes later, or even a few hours later (the scale-down alarm goes off straight away but the set-desired-count-downwards event on the ECS side is delayed by 7-15 minutes, or a few hours)
- Sometimes the ECS service tasks do not scale down at all (we saw over the weekend that scale-down alarms were triggered but the ECS service tasks never scaled down over a 48-hour period, and a set-desired-count-downwards event never reached the ECS side)
On the CloudWatch alarm side we are observing that the alarm always goes off when expected, for both scaling up and down; it's on the ECS side that we think the issue resides.
The autoscaling code is as follows:
resource "aws_appautoscaling_target" "this" {
  max_capacity       = 5
  min_capacity       = 1
  resource_id        = "service/dev/service1"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "memory" {
  name               = "memory"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.this.resource_id
  scalable_dimension = aws_appautoscaling_target.this.scalable_dimension
  service_namespace  = aws_appautoscaling_target.this.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    scale_in_cooldown  = 60
    scale_out_cooldown = 60
    target_value       = 50
  }
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.this.resource_id
  scalable_dimension = aws_appautoscaling_target.this.scalable_dimension
  service_namespace  = aws_appautoscaling_target.this.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_in_cooldown  = 60
    scale_out_cooldown = 60
    target_value       = 60
  }
}
Has anyone seen this behaviour, i.e. the alarms in CloudWatch going off correctly and the ECS service always scaling up when expected, but not always scaling down when expected? Are we missing something obvious here? Help greatly appreciated.
Check your policy configuration. When you have multiple scaling policies, they must all be ready to scale down together.
If your goal is to scale down after inactivity, you can try playing with disabling scale down on certain policies to reduce the variables for scale down and/or raise the target utilization on certain policies. If there is activity that is intermittent, it might be a signal to a given policy that it shouldn't scale down yet. It needs sustained low activity to scale down.
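One way to check this (a read-only sketch, assuming boto3 credentials and the resource_id from the Terraform above) is to list the scaling policies and recent scaling activities for the service and look at their scale-in settings and status messages:

import boto3

client = boto3.client("application-autoscaling")

# Inspect both target-tracking policies: scale-in only happens when every
# policy agrees, so check target values and whether scale-in is disabled.
policies = client.describe_scaling_policies(
    ServiceNamespace="ecs",
    ResourceId="service/dev/service1",
)
for p in policies["ScalingPolicies"]:
    cfg = p.get("TargetTrackingScalingPolicyConfiguration", {})
    print(p["PolicyName"], cfg.get("TargetValue"), "DisableScaleIn:", cfg.get("DisableScaleIn"))

# Recent scaling activities show whether a scale-in was attempted, delayed or skipped.
activities = client.describe_scaling_activities(
    ServiceNamespace="ecs",
    ResourceId="service/dev/service1",
)
for a in activities["ScalingActivities"]:
    print(a["StartTime"], a["StatusCode"], a["Description"])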

Python 1Hz measurement temperature

I want to create a csv file in Python, storing data from the sensor plus a timestamp for each reading. But the sensor measures fast, and I need exactly one measurement from the sensor exactly every second. For example, the sensor value is 20 at 12:34:15, and I need the next value exactly at 12:34:16. I can't simply use time.sleep because it creates a delay of more than a second, which will affect the log file if I have to take more than a hundred readings.
Consumer PCs do not have real-time operating systems, so there is no guarantee that a particular process will execute at least once per second, and certainly no guarantee that it will be executing at each 1-second interval. If you want precision timed measurements with Python, you should look at MicroPython executing on a microcontroller board. It may be able to do what you want. Python on a Raspberry Pi board might also work better than a PC.
On a regular PC, I would start with something using perf_counter.
from time import perf_counter as timer
from somewhere import sensor, save  # placeholders: read temperature, save value

t0 = t1 = timer()
delta = .99999  # adjust by experiment to average 1-second reading intervals
while True:
    # busy-wait until roughly one second has elapsed since the last reading
    while t1 - t0 < delta:
        t1 = timer()
    value = sensor()
    save(value)
    t0 = t1
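If the goal is a CSV with one timestamped reading per second, a variation that schedules each reading against an absolute tick (so small per-iteration overheads don't accumulate into drift) might look like this; sensor is still a placeholder for whatever reads the temperature:

import csv
import time
from datetime import datetime

from somewhere import sensor  # placeholder: read temperature

with open("readings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "value"])
    next_tick = time.perf_counter() + 1.0
    while True:
        # sleep only the remaining time until the next absolute 1-second tick,
        # so the schedule does not drift even if an iteration runs a little long
        remaining = next_tick - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        writer.writerow([datetime.now().isoformat(), sensor()])
        next_tick += 1.0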

Kafka SparkStreaming configuration specify offsets/message list size

I am fairly new to both Kafka and Spark and am trying to write a job (either streaming or batch). I would like to read a predefined number of messages (say x) from Kafka, process the collection through workers, and only then start working on the next set of x messages. Basically each message in Kafka is 10 KB and I want to put 2 GB worth of messages in a single S3 file.
So is there any way of specifying the number of messages that the receiver fetches?
I have read that I can specify 'from offset' while creating DStream, but this use case is somewhat different. I need to be able to specify both 'from offset' and 'to offset'.
There's no way to set the ending offset as an initial parameter (as you can for the starting offset), but
you can use createDirectStream (the fourth overloaded version in the listing), which gives you the ability to get the offsets of the current micro-batch using HasOffsetRanges (which gives you back OffsetRange).
That means that you'll have to compare values that you get from OffsetRange with your ending offset in every micro batch in order to see where you are and when to stop consuming from Kafka.
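As a rough PySpark illustration of that pattern (the answer refers to the Scala overload; the topic, broker address and batch interval here are placeholders), using the offsetRanges() access that RDDs from the direct stream expose:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="bounded-kafka-read")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "broker:9092"})

def track_offsets(rdd):
    # offsetRanges() is available on RDDs coming straight from createDirectStream
    for o in rdd.offsetRanges():
        # compare o.untilOffset (per partition) with your target ending offset
        # to decide when enough data has been consumed
        print(o.topic, o.partition, o.fromOffset, o.untilOffset)
    return rdd

stream.transform(track_offsets).count().pprint()  # replace with the real processing / S3 write
ssc.start()
ssc.awaitTermination()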
I guess you also need to think about the fact that each partition has its own sequential offset. I assume it would be easiest if you could go a bit over 2GB, as much as it takes to finish the current micro-batch (could be a couple of kB, depending on the density of your messages), in order to avoid splitting the last batch into a consumed and an unconsumed part, which may require you to fiddle with the offsets that Spark keeps in order to track what's consumed and what isn't.
Hope this helps.