Spring Batch: dump a set of queries over a database in parallel to flat files (CSV)

So my scenario, drilled down to the essence, is as follows:
Essentially, I have a config file containing a set of SQL queries whose result sets need to be exported as CSV files.
Since some queries may return billions of rows, and because something may interrupt the process (bug, crash, ...), I want to use a framework such as Spring Batch, which gives me restartability and job monitoring.
I am using a file-based H2 database for persisting Spring Batch jobs.
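For illustration, a JDBC-backed JobRepository over a file-based H2 database can be wired roughly like this (a sketch; the JDBC URL and variable names are placeholders, and the batch metadata schema is assumed to already exist in that database):
import org.h2.jdbcx.JdbcDataSource;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.jdbc.datasource.DataSourceTransactionManager;

// File-based H2 database holding the Spring Batch metadata tables.
JdbcDataSource batchDataSource = new JdbcDataSource();
batchDataSource.setURL("jdbc:h2:file:./batch-metadata"); // placeholder location
batchDataSource.setUser("sa");
batchDataSource.setPassword("");

DataSourceTransactionManager transactionManager = new DataSourceTransactionManager(batchDataSource);

// JobRepositoryFactoryBean creates the JDBC-backed JobRepository used by jobs and steps.
JobRepositoryFactoryBean repositoryFactory = new JobRepositoryFactoryBean();
repositoryFactory.setDataSource(batchDataSource);
repositoryFactory.setTransactionManager(transactionManager);
repositoryFactory.setDatabaseType("H2");
repositoryFactory.afterPropertiesSet();
JobRepository jobRepository = (JobRepository) repositoryFactory.getObject();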
So, here are my questions:
Upon creating a Job, I need to provide my RowMapper some initial configuration. So what happens when a job needs to be restarted after, e.g., a crash? Concretely:
Is the state of the RowMapper automatically persisted, so that upon restart Spring Batch will try to restore the object from its database, or
will the RowMapper object that is part of the original Spring Batch XML config file be used, or
do I have to maintain the RowMapper's state myself, using the step's/job's ExecutionContext? (A sketch of this approach follows below.)
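If it is the third option, a minimal sketch of carrying custom state in the ExecutionContext could look like this (the class name and the "rows.processed" key are made up for illustration); anything registered with the step as an ItemStream gets these callbacks, and whatever is put into the context is persisted with each commit and handed back on restart:
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;

// Hypothetical stateful helper; the step persists whatever is put into the context.
public class StatefulMapperState implements ItemStream {

    private long rowsProcessed;

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // On restart, the previously committed value comes back from the job repository.
        rowsProcessed = executionContext.getLong("rows.processed", 0L);
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        // Called just before each commit; the repository persists this context.
        executionContext.putLong("rows.processed", rowsProcessed);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to release here
    }
}
Such a component only receives these callbacks if it is registered with the step, e.g. via TaskletStep.registerStream(...) or setStreams(...).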
The above question is related to whether there is magic going on when using the Spring Batch XML configuration, or whether I could just as well create all these beans programmatically:
Since I need to parse my own config format into a Spring Batch job config, I would rather just use Spring Batch's Java classes (beans) and fill them out appropriately than attempt to manually write out valid XML. However, if my Job crashes, I would create all the beans myself again. Does Spring Batch automagically restore the Job state from its database?
If I really need XML, is there a way to serialize a Spring Batch JobRepository (or one of these objects) as a Spring Batch XML config?
Right now, I have tried to configure my Step with the following code, but I am unsure whether this is the proper way to do it:
Is TaskletStep the way to go?
Is the way I create the chunked reader/writer correct, or is there some other object which I should use instead?
I would have assumed that opening of the reader and writer would occur automatically as part of the JobExecution, but if I don't open these resources prior to running the Job, I get an exception telling me that I need to open them first. Maybe I need to create some other object that manages the resources (JDBC connection and file handle)?
// Reader: streams rows from the source database using a cursor.
JdbcCursorItemReader<Foobar> itemReader = new JdbcCursorItemReader<Foobar>();
itemReader.setSql(sqlStr);
itemReader.setDataSource(dataSource);
itemReader.setRowMapper(rowMapper);
itemReader.afterPropertiesSet();
ExecutionContext executionContext = new ExecutionContext();
itemReader.open(executionContext);

// Writer: one line per item, written to the output CSV resource.
FlatFileItemWriter<String> itemWriter = new FlatFileItemWriter<String>();
itemWriter.setLineAggregator(new PassThroughLineAggregator<String>());
itemWriter.setResource(outResource);
itemWriter.afterPropertiesSet();
itemWriter.open(executionContext);

// Chunking: commit every 50,000 items.
int commitInterval = 50000;
CompletionPolicy completionPolicy = new SimpleCompletionPolicy(commitInterval);
RepeatTemplate repeatTemplate = new RepeatTemplate();
repeatTemplate.setCompletionPolicy(completionPolicy);
RepeatOperations repeatOperations = repeatTemplate;

ChunkProvider<Foobar> chunkProvider = new SimpleChunkProvider<Foobar>(itemReader, repeatOperations);
ItemProcessor<Foobar, String> itemProcessor = new ItemProcessor<Foobar, String>() {
    /* Custom implementation */ };
ChunkProcessor<Foobar> chunkProcessor = new SimpleChunkProcessor<Foobar, String>(itemProcessor, itemWriter);
Tasklet tasklet = new ChunkOrientedTasklet<Foobar>(chunkProvider, chunkProcessor); //new SplitFilesTasklet();

// Step: chunk-oriented tasklet wired to the job repository and transaction manager.
TaskletStep taskletStep = new TaskletStep();
taskletStep.setName(taskletName);
taskletStep.setJobRepository(jobRepository);
taskletStep.setTransactionManager(transactionManager);
taskletStep.setTasklet(tasklet);
taskletStep.afterPropertiesSet();
job.addStep(taskletStep);
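Regarding the manual open() calls above: one option (a sketch, not something I have verified here) is to register the reader and writer as streams on the TaskletStep, so the step itself opens, updates and closes them as part of the JobExecution and can restore their position on restart:
import org.springframework.batch.item.ItemStream;

// JdbcCursorItemReader and FlatFileItemWriter both implement ItemStream, so the
// step can manage their lifecycle instead of the manual open(executionContext) calls.
taskletStep.setStreams(new ItemStream[] { itemReader, itemWriter });
On newer Spring Batch versions, the step builders (StepBuilderFactory and its chunk(...) DSL) perform this registration automatically when the reader and writer implement ItemStream.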

Most of your questions are really complex, and it is difficult to give a good answer without writing a long paper.
I'm new to Spring Batch like you, and I found a lot of really useful info - and all the answers to your questions - by reading Spring Batch in Action: it's complete, well explained, full of examples, and covers all aspects of the framework (readers/writers/processors, job/tasklet/chunk lifecycle and persistence, transaction/resource management, job flow, integration with other services, partitioning, restarting/retry, failure management, and a lot of other interesting things).
Hope this helps

Related

How can I externalize IScheduledExecutorService to run tasks in an external Hazelcast cluster (Hazelcast 5.2) without using UserCodeDeployment?

I am working on externalizing our IScheduledExecutorService so I can run tasks externally on an external cluster. I am able to write a test and get the Runnable to actually run ONLY if I turn on user code deployment. If I want to change this task at all and run the tests again, I get the below in my external cluster member's logs:
java.lang.IllegalStateException: Class com.mycompany.task.ScheduledTask is already in local cache and has conflicting byte code representation
I want to be able to change the task if needed and redeploy it, and have Hazelcast just handle it. I do this kind of thing with our external maps now: they can handle different versions of our objects using compact serialization.
Am I stuck using user code deployment for these functional objects? If I need to make a change to one, I have to change the class name and redeploy to production. I'm hoping to get this task right the first time and never have to do that, but I have a way of handling it if I do.
The cluster is already running in production and I'll have to add the following to each member
HZ_USERCODEDEPLOYMENT_ENABLED=true
and the appropriate client code (listed below) to enable this.
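For what it's worth, the programmatic member-side equivalent of that environment variable should be roughly the following (a sketch, assuming members are configured from Java rather than hazelcast.yaml):
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Enable user code deployment on the member so classes sent by clients are accepted.
Config config = new Config();
config.getUserCodeDeploymentConfig().setEnabled(true);

HazelcastInstance member = Hazelcast.newHazelcastInstance(config);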
What I've done...
Added the following to my local Docker file
HZ_USERCODEDEPLOYMENT_ENABLED=true
and also added the following in the code that creates a Hazelcast client connecting to my external cluster:
ClientConfig clientConfig = new ClientConfig();
ClientUserCodeDeploymentConfig clientUserCodeDeploymentConfig = new ClientUserCodeDeploymentConfig();
clientUserCodeDeploymentConfig.addClass("com.mycompany.task.ScheduledTask");
clientUserCodeDeploymentConfig.setEnabled(true);
clientConfig.setUserCodeDeploymentConfig(clientUserCodeDeploymentConfig);
However, if I remove those two pieces I get the following exception and a failing test; it doesn't know about my class at all.
com.hazelcast.nio.serialization.HazelcastSerializationException: java.lang.ClassNotFoundException: com.mycompany.task.ScheduledTask
Side Note:
We are using compact serialization for several maps already, and when I try to configure this Runnable task via compact serialization I get the error below. I don't think that's the right approach either.
[Scheduler: myScheduledExecutorService][Partition: 121][Task: 7afe68d5-3185-475f-b375-5a82a7088de3] Exception occurred during run
java.lang.ClassCastException: class com.hazelcast.internal.serialization.impl.compact.DeserializedGenericRecord cannot be cast to class java.lang.Runnable (com.hazelcast.internal.serialization.impl.compact.DeserializedGenericRecord is in unnamed module of loader 'app'; java.lang.Runnable is in module java.base of loader 'bootstrap')
at com.hazelcast.scheduledexecutor.impl.ScheduledRunnableAdapter.call(ScheduledRunnableAdapter.java:49) ~[hazelcast-5.2.0.jar:5.2.0]
at com.hazelcast.scheduledexecutor.impl.TaskRunner.call(TaskRunner.java:78) ~[hazelcast-5.2.0.jar:5.2.0]
at com.hazelcast.internal.util.executor.CompletableFutureTask.run(CompletableFutureTask.java:64) ~[hazelcast-5.2.0.jar:5.2.0]

Spring Data JPA doesn't save entity in a new thread

I have a little problem when I want to save an entity in another thread, e.g.
Executors.newCachedThreadPool().submit(() -> userService.save(user))
and in userService I have a lot of logic; at the end I want to save the user into the database using userRepository.save(user), but the repo doesn't save it. What am I doing wrong? If someone wants more info, let me know.
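For context, a minimal sketch of the call site being described (userService and user are the beans/objects from the snippet above; everything else is illustrative). One thing worth checking is that exceptions thrown inside the submitted lambda are only surfaced when Future.get() is called, so a rollback or failure inside save() can go completely unnoticed:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

ExecutorService executor = Executors.newCachedThreadPool();

// userService must be the Spring-managed proxy (the injected bean); calling a
// plain "new" instance would bypass @Transactional entirely.
Future<?> result = executor.submit(() -> userService.save(user));

// Exceptions thrown in the new thread are captured by the Future and only
// re-thrown here; without this call a failing save() fails silently.
result.get();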

How do I retrieve the JSON representation of an Azure Data Factory pipeline?

I want to track pipeline changes in source control, and I'm looking for a way to programmatically retrieve the JSON representation from the ADF.
The .NET routines return the objects, but sadly ToString() does not return JSON (wouldn't THAT be convenient?), so right now I'm looking at copying the JSON down by hand (shoot me now!), or possibly trying to recreate the JSON from the .NET objects (shoot me later!).
Please tell me I'm being dense and there is an obvious way to do this.
You can serialize the object using Newtonsoft Json.
See (https://azure.microsoft.com/en-us/documentation/articles/data-factory-create-data-factories-programmatically/) for how to connect via the ADF SDK
var aadTokenCredentials = new TokenCloudCredentials(ConfigurationManager.AppSettings["SubscriptionId"], GetAuthorizationHeader());
var resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]);
var manager = new DataFactoryManagementClient(aadTokenCredentials, resourceManagerUri);
var pipeline = manager.Pipelines.Get(resourceGroupName, dataFactoryName, pipelineName);
var pipelineAsJson = JsonConvert.SerializeObject(pipeline.Pipeline, Formatting.Indented);
I was expecting something more complex, but looking at the SDK source on GitHub it is not doing anything special.
Our team has a deployment tool that takes Git changes and deploys them appropriately. Everything is done asynchronously and is controlled and versioned through Git.
In a nutshell our deployment has the following flow:
Any completed Git merge request triggers a VSO build. This is simply building the whole solution via MSBuild.
Every successful build is given a Git tag for tracking the Last Known Good.
Next (if the build succeeded), our .NET ADFPublisher starts by taking only the changed data factory files and asynchronously publishing them based on their Git operation (modified, add, delete, etc.).
For some failure cases our ADFPublisher will perform a retry.
This whole process (build + publish) takes ~65 seconds and has already saved us from having several bugs. It also allows us to move definitions from one environment to another very easily.
Let me know if you think this is something you would be interested in, and I will set up a way to share it with you.

Merging runtime-created section config with system config

I am using the EntLib in an environment where database connection strings are retrieved from a separate library call that decrypts a proprietary config file. I have no say over this practice or the format of the config file.
I want to do EntLib exception logging to the database in this setting. I therefore need to set up an EntLib database configuration instance with the name of the database and the connection string. Since I can't get the connection string until run time, but EntLib does allow run-time configuration, I use the following code, as described in this:
builder.ConfigureData()
.ForDatabaseNamed("Ann")
.ThatIs.ASqlDatabase()
.WithConnectionString(connectionString)
.AsDefault();
The parameter connectionString is the one I've retrieved from the separate library.
The sample code goes on to merge the created configuration info with an empty DictionaryConfigurationSource. I, however, need to merge it with the rest of the configuration code from the app.config. So I do this:
var configSource = new SystemConfigurationSource();
builder.UpdateConfigurationWithReplace(configSource);
EnterpriseLibraryContainer.Current
= EnterpriseLibraryContainer.CreateDefaultContainer(configSource);
... which is based very closely on the sample code.
But: I get an internal error in Microsoft.Practices.EnterpriseLibrary.Common.Configuration.SystemConfigurationSource.Save. The failing code is this:
var fileMap = new ExeConfigurationFileMap { ExeConfigFilename = ConfigurationFilePath };
var config = ConfigurationManager.OpenMappedExeConfiguration(fileMap, ConfigurationUserLevel.None);
config.Sections.Remove(section);
config.Sections.Add(section, configurationSection);
config.Save();
... where 'section' is "connectionStrings". The code fails on the Add method call, saying that you can't add a duplicate section. Inspection shows that the connectionStrings section is still there even after the Remove.
I know from experience that there's always a default entry under connectionStrings when the configuration files are actually read and interpreted, inherited from the machine.config. So perhaps you can never really remove the connectionStrings section.
That would appear to leave me out of luck, though, unless I want to modify the EntLib source, which I do not.
I could perhaps build all the configuration information for the EntLib at run time, using the fluent API. But I'd rather not. The users want their Operations staff to be able to make small changes to the logging without having to involve a developer.
So my question, in several parts: is there a nice simple workaround for this? Does it require a change to the EntLib source? Or have I missed something really simple that would do away with the problem?
I found a workaround, thanks to this post. Rather than taking the system configuration source and attempting to update it from the builder, I copy the sections I set up in app.config into the builder, and then do an UpdateConfigurationWithReplace on an empty dummy configuration source object in order to create a ConfigurationSource that can be used to create the default container.
var builder = new ConfigurationSourceBuilder();
var configSource = new SystemConfigurationSource();
CopyConfigSettings("loggingConfiguration", builder, configSource);
CopyConfigSettings("exceptionHandling", builder, configSource);
// Manually configure the database settings
builder.ConfigureData()
.ForDatabaseNamed("Ann")
.ThatIs.ASqlDatabase()
.WithConnectionString(connectionString)
.AsDefault();
// Update a dummy, empty ConfigSource object with the settings we have built up.
// Remember, this is a config settings object for the EntLib, not for the entire program.
// So it doesn't need all 24 sections or however many you can set in the app.config.
DictionaryConfigurationSource dummySource = new DictionaryConfigurationSource();
builder.UpdateConfigurationWithReplace(dummySource);
// Create the default container using our new ConfigurationSource object.
EnterpriseLibraryContainer.Current
= EnterpriseLibraryContainer.CreateDefaultContainer(dummySource);
The key is this subroutine:
/// <summary>
/// Copies a configuration section from the SystemConfigurationSource to the ConfigurationSourceBuilder.
/// </summary>
/// <param name="sectionName"></param>
/// <param name="builder"></param>
/// <param name="configSource"></param>
private static void CopyConfigSettings(string sectionName, ConfigurationSourceBuilder builder, SystemConfigurationSource configSource)
{
ConfigurationSection section = configSource.GetSection(sectionName);
builder.AddSection(sectionName, section);
}

Spring @Transactional Deadlock

I have the following setup.
Spring 3.0.5
Hibernate 3.5.6
MySQL 5.1
To save a record in the DB via Hibernate I have the following workflow:
send JSON {id:1,name:"test",children:[...]} to the Spring MVC app and use Jackson to transform it into an object graph (if it is an existing instance, the JSON has the proper ID of the record in the DB set)
save the object in DB via service layer call (details below)
the save function of the service layer interface SomeObjectService has the @Transactional annotation on it with readOnly=false and propagation REQUIRED
the implementation of this service layer, SomeObjectServiceImpl, calls the DAO save method
the DAO saves the new data via a call to Hibernate's merge, e.g. hibernateTemplate().merge(someObj)
Hibernate's merge first loads the object from the DB via a SELECT
I have an EntityListener that is wired to Spring (I used this technique: Spring + EntityManagerFactory + Hibernate Listeners + Injection) and listens to @PostLoad
the listener uses a LockingService to update one field of someObject to mark it as locked (this should actually only happen when someObject is loaded via Hibernate HQL, SQL or Criteria calls, but it also gets called on merge)
the LockingService has a function lock(someObj, userId) which is also annotated with @Transactional with readOnly=false and REQUIRED
the update happens via Query query = sess.createQuery("update someObj set lockedBy=:userId"); followed by query.executeUpdate();
after merge has loaded the data, it starts updating someObject and inserting the relevant children (<= exactly here is the point where the deadlock happens)
return the JSON result (this also includes the newly created object ID) back to the client.
The problem, as it seems to me, is that first
the record gets loaded in a transaction,
then it gets changed in another (inner) transaction,
and then it should get updated again with the data of the outer transaction, but it can't be updated because it is locked.
I can see via MySQL's
SHOW OPEN TABLES
that a child table (that is part of the object graph) is locked.
An interesting fact is that the deadlock doesn't occur on the someObj table but rather on a table that represents a child.
I am a bit lost here. Any help is more than welcome.
BTW, could the isolation level maybe get me out of this problem?
I ended up using @Bozho's HibernateExtendedJpaDialect,
which is explained here >>
Hibernate, spring, JPS & isolation - custom isolation not supported
to set the isolation to READ_UNCOMMITTED:
@Transactional(readOnly = false, propagation = Propagation.REQUIRED, isolation = Isolation.READ_UNCOMMITTED)
public Seizure merge(Seizure seizureObj);
Not a very nice solution, I know, but at least it solved my problem.
If somebody wants a detailed description, please ask...
I don't know the solution to the problem, but I would not have a transactional lock method. If you need to lock something manually at all, do it within another transactional service method.
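A rough sketch of that suggestion (the service and DAO names are hypothetical): lock() loses its own @Transactional annotation and is simply called from a service method that is itself transactional:
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class SeizureService {

    private final SeizureDao seizureDao;          // hypothetical DAO wrapping the Hibernate merge
    private final LockingService lockingService;  // the locking helper from the question

    public SeizureService(SeizureDao seizureDao, LockingService lockingService) {
        this.seizureDao = seizureDao;
        this.lockingService = lockingService;
    }

    // One service-level transaction covers both the merge and the lock update;
    // lock() itself is not annotated, it simply runs inside this method's transaction.
    @Transactional
    public Seizure mergeAndLock(Seizure seizure, long userId) {
        Seizure merged = seizureDao.merge(seizure);
        lockingService.lock(merged, userId);
        return merged;
    }
}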