Kafka & Flink duplicate messages on restart

First of all, this is very similar to Kafka consuming the latest message again when I rerun the Flink consumer, but it's not the same. The answer to that question does NOT appear to solve my problem. If I missed something in that answer, then please rephrase the answer, as I clearly missed something.
The problem is exactly the same, though -- Flink (the Kafka connector) re-runs the last 3-9 messages it saw before it was shut down.
My Versions
Flink 1.1.2
Kafka 0.9.0.1
Scala 2.11.7
Java 1.8.0_91
My Code
import java.util.Properties
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.connectors.kafka._
import org.apache.flink.streaming.util.serialization._
import org.apache.flink.runtime.state.filesystem._
object Runner {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(500)
    env.setStateBackend(new FsStateBackend("file:///tmp/checkpoints"))
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "testing")

    val kafkaConsumer = new FlinkKafkaConsumer09[String]("testing-in", new SimpleStringSchema(), properties)
    val kafkaProducer = new FlinkKafkaProducer09[String]("localhost:9092", "testing-out", new SimpleStringSchema())

    env.addSource(kafkaConsumer)
      .addSink(kafkaProducer)

    env.execute()
  }
}
My SBT Dependencies
libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-scala" % "1.1.2",
  "org.apache.flink" %% "flink-streaming-scala" % "1.1.2",
  "org.apache.flink" %% "flink-clients" % "1.1.2",
  "org.apache.flink" %% "flink-connector-kafka-0.9" % "1.1.2",
  "org.apache.flink" %% "flink-connector-filesystem" % "1.1.2"
)
My Process
(3 terminals)
TERM-1 start sbt, run program
TERM-2 create kafka topics testing-in and testing-out
TERM-2 run kafka-console-producer on testing-in topic
TERM-3 run kafka-console-consumer on testing-out topic
TERM-2 send data to kafka producer.
Wait for a couple of seconds (buffers need to flush)
TERM-3 watch data appear in testing-out topic
Wait for at least 500 milliseconds for checkpointing to happen
TERM-1 stop sbt
TERM-1 run sbt
TERM-3 watch last few lines of data appear in testing-out topic
My Expectations
When there are no errors in the system, I expect to be able to turn flink on and off without reprocessing messages that successfully completed the stream in a prior run.
My Attempts to Fix
I've added the call to setStateBackend, thinking that perhaps the default memory backend just didn't remember correctly. That didn't seem to help.
I've removed the call to enableCheckpointing, hoping that perhaps there was a separate mechanism to track state in Flink vs Zookeeper. That didn't seem to help.
I've used different sinks (RollingFileSink, print()), hoping that maybe the bug was in Kafka. That didn't seem to help.
I've rolled back to flink (and all connectors) v1.1.0 and v1.1.1, hoping that maybe the bug was in the latest version. That didn't seem to help.
I've added the zookeeper.connect config to the properties object, hoping that the comment about it only being useful in 0.8 was wrong. That didn't seem to help.
I've explicitly set the checkpointing mode to EXACTLY_ONCE (good idea drfloob). That didn't seem to help.
My Plea
Help!

(I've posted the same reply on the JIRA issue; just cross-posting it here.)
From your description, I'm assuming you're manually shutting down the job, and then resubmitting it, correct?
Flink does not retain exactly-once guarantees across manual job restarts unless you use savepoints (https://ci.apache.org/projects/flink/flink-docs-master/setup/savepoints.html).
The exactly-once guarantee refers to the case where the job fails and then automatically restores itself from previous checkpoints (when checkpointing is enabled, as you did with env.enableCheckpointing(500)).
What is actually happening is that the Kafka consumer simply starts reading from the existing offsets committed in ZK / Kafka when you manually resubmit the job. These offsets were committed to ZK / Kafka the first time you executed the job. However, they are not used for Flink's exactly-once semantics; Flink uses internally checkpointed Kafka offsets for that. The Kafka consumer commits those offsets back to ZK simply to expose a measure of the job's consumption progress to the outside world (outside of Flink).

Update 2: I fixed the bug in the offset handling; it has been merged into the current MASTER.
Update: Not an issue; use manual savepoints before cancelling the job (thanks to Gordon).
I checked the logs and it seems like a bug in the offset handling. I filed a report under https://issues.apache.org/jira/browse/FLINK-4618.
I will update this answer when I get feedback.

Related

stable-baselines3, gym, train while also step/predict

With stable-baselines3, given an agent, we can call "action = agent.predict(obs)". And then with Gym, this would be "new_obs, reward, done, info = env.step(action)". (More or less; maybe I missed an input or an output.)
We also have "agent.learn(10_000)" as an example, yet here we're less involved in the process and don't call the environment ourselves.
I'm looking for a way to train the agent while still calling "env.step". If you wonder why: I'm just trying to implement self-play (the agent and a previous version of it) playing within one environment (for example, turn-based play such as chess).
WKR, Oren.
But why do you need it? If you take a look at the implementation of any learn method, you will see it is nothing more than an iteration over time steps that calls collect_rollouts and train, with some additional logging and setup at the beginning (e.g., for later saving the agent). Your env.step is called inside collect_rollouts.
I would instead suggest writing a callback based on CheckpointCallback, which saves your agent (model) every N training steps, and attaching this callback to your learn call. In your environment you could then, every N steps, instantiate a "new previous" version of your model by calling ModelClass.load(file) on the file saved by the callback, so that you can select the other player's actions via self-play inside your environment.
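Here is a minimal sketch of that callback-based approach, assuming PPO and the old-style gym API; the SelfPlayEnv class, the checkpoint directory, and the save frequency below are illustrative, not from the original post:
import gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

CHECKPOINT_DIR = "./selfplay_checkpoints"  # illustrative path

class SelfPlayEnv(gym.Env):
    """Toy env that asks a frozen, earlier copy of the agent for the opponent's move."""
    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(4)
        self.opponent = None  # loaded lazily from the latest checkpoint file

    def reload_opponent(self, checkpoint_file):
        # Load the "previous version" of the agent saved by CheckpointCallback.
        self.opponent = PPO.load(checkpoint_file)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        if self.opponent is not None:
            # The frozen opponent picks its move; apply it to your game state here.
            opponent_action, _ = self.opponent.predict(obs, deterministic=True)
        return obs, 0.0, False, {}

env = SelfPlayEnv()
agent = PPO("MlpPolicy", env, verbose=0)

# Save the model every 10_000 steps; the env can pick these files up as opponents.
checkpoint_cb = CheckpointCallback(save_freq=10_000, save_path=CHECKPOINT_DIR, name_prefix="selfplay")
agent.learn(total_timesteps=50_000, callback=checkpoint_cb)
When and how reload_opponent gets called (for example from a custom callback, or at every env reset) is up to your game loop.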

In a Foundry Code Repository, how do I log debug messages from within a Python transform?

Is there any way to introduce debugging statements in your Transforms code, so you can later see them in driver logs, for example? Or is raising exceptions the only way to do this?
Yes, this is possible. As of transforms-python 1.9.0, driver logs get written to the transaction. You can use Python's logging module to write logs.
For example:
import logging

log = logging.getLogger(__name__)

def my_transform(input):
    log.info("Testing logging")
    return input
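For completeness, a slightly fuller sketch showing where such a logger call could sit inside a decorated transform might look like the following; the decorator pattern comes from transforms.api, but the dataset paths and log messages are placeholders of my own:
import logging

from transforms.api import transform_df, Input, Output

log = logging.getLogger(__name__)

@transform_df(
    Output("/Some/Project/logging_output"),          # placeholder output path
    source_df=Input("/Some/Project/logging_input"),  # placeholder input path
)
def my_transform(source_df):
    # These messages show up in the driver logs attached to the output's transaction.
    log.info("Starting transform")
    log.info("Input row count: %d", source_df.count())
    return source_df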

A simple distributed training python program for deep learning models by Horovod on GPU cluster

I am trying to run some example python3 code
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on a Databricks GPU cluster (with 1 driver and 2 workers).
Databricks environment:
ML 6.6, scala 2.11, Spark 2.4.5, GPU
It is for distributed deep learning model training.
I just tried a very simple example at first:
from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)

def train():
    print('in train')
    import tensorflow as tf
    print('after import tf')
    hvd.init()
    print('done')

hr.run(train)
But the command just keeps running without making any progress:
HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you can adjust the log level in your train method. Or you can set driver_log_verbosity to 'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.

### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the timeline file to be created. You can then open the timeline file using the chrome://tracing facility of the Chrome browser.
Am I missing something, or do I need to set something up to make it work?
Thanks
Your code does no actual training within it. You might have better luck with the more complete example code:
https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/mnist-pytorch.html
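For illustration only, a minimal train function that actually does some training might look like the sketch below. It assumes the TF 1.x / horovod.tensorflow.keras stack bundled with that runtime; the toy data, model, and hyperparameters are made up:
from sparkdl import HorovodRunner

def train():
    # Imports must happen inside the function so they run on each Horovod worker.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each Horovod process to a single GPU.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    tf.keras.backend.set_session(tf.Session(config=config))

    # Toy data and model, just so there is real work to distribute.
    x = np.random.rand(1024, 10).astype("float32")
    y = np.random.rand(1024, 1).astype("float32")
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

    # Wrap the optimizer so gradients are averaged across workers.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")

    # Broadcast the initial weights from rank 0 so all workers start identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
              verbose=2 if hvd.rank() == 0 else 0)

hr = HorovodRunner(np=2)
hr.run(train)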

Warn (or fail) if a package is run without having overridden every pkg connectionstring with a config file entry

It seems like a very common issue with SSIS packages is releasing a package to Production that ends up running with the wrong connection string parameters. This can happen through any one of many mistakes or omissions. As a result, I find it helpful to dump all ConnectionString values to a log file. This helps me understand which connection strings were actually applied to the package at run time.
Now, I am considering having my packages check whether every connection object in the package had its connection string overridden by an entry in the config file and, if not, return a warning or even fail the package. The goal is to allow easier configuration by extracting all environment variables to a config file. If a connection string is never overridden, there is a risk that a package run in production may use development settings, or that a package run in a non-production setting during testing may accidentally be run against production.
I'd like to borrow from anyone who may have tried to do this. I'd also be interested in suggestions on how to accomplish this with minimal work.
Thx
Technical question 1 - what are my connection strings?
This is an easy question to answer. In your package, add a Script Task and enumerate through the Connections collection. I fire the OnInformation event, and if I had this scheduled, I'd be sure to include the /rep iew option in my dtexec call to ensure I record Information, Errors and Warnings.
namespace TurnDownForWhat
{
    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Runtime;
    using System.Windows.Forms;

    /// <summary>
    /// ScriptMain is the entry point class of the script. Do not change the name, attributes,
    /// or parent of this class.
    /// </summary>
    [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        public void Main()
        {
            bool fireAgain = false;
            foreach (var item in Dts.Connections)
            {
                Dts.Events.FireInformation(0, "SCR Enumerate Connections", string.Format("{0}->{1}", item.Name, item.ConnectionString), string.Empty, 0, ref fireAgain);
            }

            Dts.TaskResult = (int)ScriptResults.Success;
        }

        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
    }
}
Running that on my package, I can see I had two Connection managers, CM_FF and CM_OLE along with their connection strings.
Information: 0x0 at SCR Enum, SCR Enumerate Connections: CM_FF->C:\ssisdata\dba_72929.csv
Information: 0x0 at SCR Enum, SCR Enumerate Connections: CM_OLE->Data Source=localhost\dev2012;Initial Catalog=tempdb;Provider=SQLNCLI11;Integrated Security=SSPI;
Add that to ... your OnPreExecute event for all the packages; no one sees it, but everything reports back.
Technical question 2 - Missed configurations
I'm not aware of anything that will allow a package to know it's under configuration. I'm sure there's an event, as you will see in your Information/Warning messages, that a package attempted to apply a configuration, didn't find one, and is going to retain its design-time value. Information - I'm configuring X via Y. Warning - tried to configure X but didn't find Y. But how to have a package inspect itself to find that out, I have no idea.
That said, I've seen reference to a property that fails a package on a missed configuration. I'm not seeing it now, but I'm certain it exists in some crevice. You can supply the /w parameter to dtexec, which treats warnings as errors and really, warnings are just errors that haven't grown up yet.
Unspoken issue 1 - Permissions
I had a friend who botched an XML config file as part of their production deploy. Their production server started consuming data from a dev server. Bad things happened. It sounds like you have had a similar situation. The resolution is easy: insulate your environments. Are you using the same service account for your production class SQL Server boxes and dev/test/uat/qa/load/etc? STOP. Make a new one. Don't allow prod to talk to any boxes that aren't in their tier of service. Someone bones a package and doesn't set a configuration? First of all, you'll catch it when it goes from dev to something-before-production because that tier wouldn't be able to talk to anything else that's not at that level. But if you're in the ultra cheap shop and you've only got dev and prod, so be it. A non-configured package goes to prod. Prod SQL Agent fires off the package. The package uses its default connection manager and fails validation because it can't talk to the dev sales database.
Unspoken issue 2 - template
What's your process when you have a new package to build? Does your team really start from scratch? There are so many ways to solve this problem, but the core concept is to define your best practices for Configuration, Logging, Package Protection Level, Transaction levels, etc. in some easily consumable form. Maybe that's 3 starter packages: one for raw acquisition, maybe one that stages and conforms the data, and the last one that moves data from conformed into the final destination. Teammates then simply have to pick one to start from and fill in the spots that need it. If they choose to do their own thing, that's the stick you beat them with when their package fails to run in production because they didn't follow the standard path.
There are other approaches here. If you're a strong .NET crew, you can gen your template packages that way. At this point, I create my templates with Biml and use that to drive basic package creation.
If I am understanding you correctly, the solution below should work.
My suggestion is to turn on the Do not save sensitive option for the ProtectionLevel property at the top level of the package.
This will require you to use package configurations for every connection; otherwise the package will not have the credentials to make a connection.

Observable Command and Unit tests with Rx RTM

I have ported my code to the RTM versions of both WinRT and Rx. I use ReactiveUI in my ViewModels. Before porting the code, my unit tests ran without problems, but now I get strange behavior.
Here is the test:
var sut = new MyViewModel();
sut.MyCommand.Execute(null); // ReactiveAsyncCommand
Assert.AreEqual(0, sut.Collection.Count);
If I debug the test step by step, the assertion does not fail, but when I run it with the test runner it fails...
The collection being asserted on is modified by a method that subscribes to the command:
MyCommand.RegisterAsyncTask(_ => DoWork())
    .ObserveOn(SynchronizationContext.Current)
    .Subscribe(MethodModifyingCollection);
The code worked before moving it to the RTM. I also tried removing the ObserveOn and adding an await Task.Delay() before the Assert, without success.
Steven's got the rightish answer, but there are a few RxUI-specific things missing. This is definitely related to scheduling in a test runner, but the reason is that the WinRT version of ReactiveUI currently can't properly detect whether it's running in a test runner.
The dumb workaround for now is to set this at the top of all your tests:
RxApp.DeferredScheduler = Scheduler.CurrentThread;
Do not use the TestScheduler for every test; it's overkill and actually isn't compatible with certain kinds of testing. TestScheduler is good for tests where you're simulating the passage of time.
Your problem is that MSTest unit tests have a default SynchronizationContext. So ObserveOn and ReactiveAsyncCommand will marshal to the thread pool instead of to the WPF context. This causes a race condition.
Your first and best option is the Rx TestScheduler.
Another option is to await some completion signal (and ensure your test method is async Task, not async void).
Otherwise, if you just need a SynchronizationContext, you can use AsyncContext from my AsyncEx library to execute the tests within your own SynchronizationContext.
Finally, if you have any code that directly uses Dispatcher instead of SynchronizationContext, you can use WpfContext from the Async CTP download.