I am getting an OOM exception (Java heap space) for the reduce child. I read in the documentation that increasing the value of mapred.reduce.child.java.opts to -Xmx512M or more would help. Since I am not the admin, I cannot change that value in mapred-site.xml. I would like to set that value for my job only, from the Java program. I tried setting it via the Configuration class as follows, but that didn't work:
Configuration config = new Configuration();
config.set("mapred.reduce.child.java.opts", "-Xmx512M");
JobConf conf1 = new JobConf(config, this.getClass());
The version of Hadoop is 1.0.3.
What is the proper way of setting the configuration values programmatically?
As @ThomasJungblut and @octo have pointed out, the procedure I mentioned in the question is the right way of doing it. The OOM exception still persists, so I will start a new thread instead of continuing here.
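For later readers: the same setting can also be supplied per run instead of being hard-coded, by implementing Tool so that the generic -D options are picked up from the command line. A minimal sketch against the old mapred API of Hadoop 1.x (the class name and job wiring here are illustrative, not from the original question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already contains anything passed with -D on the command line
        JobConf conf = new JobConf(getConf(), MyJob.class);
        conf.set("mapred.reduce.child.java.opts", "-Xmx512M"); // or leave it to -D
        // ... set mapper, reducer, input and output paths here ...
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar myjob.jar MyJob -D mapred.reduce.child.java.opts=-Xmx512M in out
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}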
I am working on externalizing our IScheduledExecutorService so I can run tasks on an external cluster. I am able to write a test and get the Runnable to actually run ONLY if I turn on user code deployment. If I change this task at all and run the tests again, I get the following in my external cluster member's logs:
java.lang.IllegalStateException: Class com.mycompany.task.ScheduledTask is already in local cache and has conflicting byte code representation
I want to be able to change the task and redeploy, and have Hazelcast just handle it. I already do this kind of thing with our external maps: they can handle different versions of our objects using compact serialization.
Am I stuck using user code deployment for these functional objects? If I need to make a change to one, I have to change the class name and redeploy to production. I'm hoping to get this task right the first time and never have to do that, but I have a way of handling it if I do.
The cluster is already running in production and I'll have to add the following to each member
HZ_USERCODEDEPLOYMENT_ENABLED=true
and the appropriate client code (listed below) to enable this.
What I've done...
Added the following to my local Docker file:
HZ_USERCODEDEPLOYMENT_ENABLED=true
and also, in the code that creates a Hazelcast client connecting to my external cluster:
ClientConfig clientConfig = new ClientConfig();
ClientUserCodeDeploymentConfig clientUserCodeDeploymentConfig = new ClientUserCodeDeploymentConfig();
clientUserCodeDeploymentConfig.addClass("com.mycompany.task.ScheduledTask");
clientUserCodeDeploymentConfig.setEnabled(true);
clientConfig.setUserCodeDeploymentConfig(clientUserCodeDeploymentConfig);
However, if I remove those two pieces, I get the following exception with a failing test. The cluster doesn't know about my class at all.
com.hazelcast.nio.serialization.HazelcastSerializationException: java.lang.ClassNotFoundException: com.mycompany.task.ScheduledTask
Side Note:
We are already using compact serialization for several maps, and when I try to configure this Runnable task via compact serialization I get the error below. I don't think that's the right approach either.
[Scheduler: myScheduledExecutorService][Partition: 121][Task: 7afe68d5-3185-475f-b375-5a82a7088de3] Exception occurred during run
java.lang.ClassCastException: class com.hazelcast.internal.serialization.impl.compact.DeserializedGenericRecord cannot be cast to class java.lang.Runnable (com.hazelcast.internal.serialization.impl.compact.DeserializedGenericRecord is in unnamed module of loader 'app'; java.lang.Runnable is in module java.base of loader 'bootstrap')
at com.hazelcast.scheduledexecutor.impl.ScheduledRunnableAdapter.call(ScheduledRunnableAdapter.java:49) ~[hazelcast-5.2.0.jar:5.2.0]
at com.hazelcast.scheduledexecutor.impl.TaskRunner.call(TaskRunner.java:78) ~[hazelcast-5.2.0.jar:5.2.0]
at com.hazelcast.internal.util.executor.CompletableFutureTask.run(CompletableFutureTask.java:64) ~[hazelcast-5.2.0.jar:5.2.0]
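For illustration, the task is essentially a plain Runnable along these lines (simplified; the real class does more). The cast failure above makes sense in this light: the member has to deserialize the submitted task back into an actual Runnable, whereas compact serialization hands it a GenericRecord.

import java.io.Serializable;

// Simplified illustration of the task shape; the member needs to reconstruct
// an actual Runnable, so the class bytes must be available on the member
// (via the member classpath or user code deployment).
public class ScheduledTask implements Runnable, Serializable {
    @Override
    public void run() {
        // task body
    }
}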
I am trying to figure out whether logback is potentially losing messages.
Quoting from the log4j2 page:
"Like Logback, Log4j 2 can automatically reload its configuration upon modification. Unlike Logback, it will do so without losing log events while reconfiguration is taking place"
So, can anyone comment on logback losing log events? Does it really happen?
(I have seen that event loss can occur with async appenders, but that can be avoided by setting discardingThreshold to 0; the log4j2 statement, however, is about configuration reload.)
I am trying to understand whether log4j2 is really more reliable, or whether we should just use logback...
Thanks.
When logback is reconfigured, it removes all the appender references and level settings from the loggers. It then reads the new configuration and applies it to the loggers. While this is happening, logging continues.
Log4j 2 separates the loggers from their configuration. Once a new configuration is created the loggers are pointed at the LoggerConfig of the new configuration. So for a brief time you will have some loggers pointing at the old configuration and some pointing at the new one, but they will never be unconfigured.
I'm currently investigating this very question and, looking at the code, it indeed seems that log messages can be lost during reconfiguration. The following code is called by ReconfigureOnChangeFilter's ReconfiguringThread:
private void performXMLConfiguration(LoggerContext lc) {
    JoranConfigurator jc = new JoranConfigurator();
    jc.setContext(context);
    StatusUtil statusUtil = new StatusUtil(context);
    // remember the current, known-good configuration for a possible fallback
    List<SaxEvent> eventList = jc.recallSafeConfiguration();
    URL mainURL = ConfigurationWatchListUtil.getMainWatchURL(context);
    // removes all appenders and level settings; events logged from here until
    // doConfigure() finishes have no appenders to go to
    lc.reset();
    long threshold = System.currentTimeMillis();
    try {
        jc.doConfigure(mainConfigurationURL);
        if (statusUtil.hasXMLParsingErrors(threshold)) {
            fallbackConfiguration(lc, eventList, mainURL);
        }
    } catch (JoranException e) {
        fallbackConfiguration(lc, eventList, mainURL);
    }
}
In lc.reset(), all appenders are removed (along with other configuration attributes), then the logger context is reconfigured. There is no apparent synchronization going on.
A quick test verified that messages are lost during reconfiguration.
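A sketch of such a test, assuming logback is configured with scan="true" and a short scanPeriod (the config path and iteration counts are illustrative): log in a tight loop while touching the config file to trigger reloads, then look for gaps in the sequence numbers that reached the appender.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ReloadLossTest {
    private static final Logger LOG = LoggerFactory.getLogger(ReloadLossTest.class);

    public static void main(String[] args) throws Exception {
        Path config = Paths.get("src/main/resources/logback.xml"); // assumed location
        for (int i = 0; i < 1_000_000; i++) {
            LOG.info("msg {}", i);
            if (i % 100_000 == 0) {
                // touch the config file so the reconfiguration check fires
                Files.setLastModifiedTime(config, FileTime.fromMillis(System.currentTimeMillis()));
            }
        }
        // afterwards, scan the log output for gaps in the msg numbers
    }
}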
Is there a way of outputting, from application code, the Java heap space a JRuby application was started with (like -J-Xms2048m)?
The value passed in on the JVM command line is available in the Runtime JMX bean provided by the JDK, as @stephanlindauer pointed out.
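In plain Java terms, reading that bean looks like this (a minimal sketch; JRuby can call the same API directly, as shown further down):

import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

public class JvmArgs {
    public static void main(String[] args) {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        // prints the flags the JVM was launched with, e.g. [-Xmx2048m]
        System.out.println(runtime.getInputArguments());
    }
}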
Another way that's less sensitive to how the JVM was launched is to use the Memory bean, like so:
membean = java.lang.management.ManagementFactory.memory_mx_bean
heap = membean.heap_memory_usage
puts heap.max
On my system with no special JRuby flags, this outputs "466092032" (roughly reflecting our default max of 500MB), and when specifying a 2GB maximum heap (jruby -J-Xmx2g), it outputs "1908932608".
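For completeness, roughly the same figure is available without going through JMX; a minimal sketch in plain Java:

public class MaxHeap {
    public static void main(String[] args) {
        // approximately the configured -Xmx value, in bytes
        System.out.println(Runtime.getRuntime().maxMemory());
    }
}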
Turns out you can puts values that were passed in via environment variables, like this:
puts java.lang.System.getenv()['JRUBY_OPTS']
or
puts ENV['JRUBY_OPTS']
or
puts java.lang.management.ManagementFactory.getRuntimeMXBean().getInputArguments().to_s
Remember to require "java" beforehand.
A very common issue with SSIS packages seems to be releasing a package to production that ends up running with the wrong connection string parameters. This can happen through any one of many mistakes or omissions. As a result, I find it helpful to dump all ConnectionString values to a log file. This helps me understand which connection strings were actually applied to the package at run time.
Now I am considering having my packages check whether every connection object in the package had its connection string overridden by an entry in the config file and, if not, return a warning or even fail the package. The goal is to allow easier configuration by extracting all environment-specific values to a config file. If a connection string is never overridden, there is a risk that a package run in production may use development settings, or that a package run in a non-production setting for testing may accidentally be run against production.
I'd like to borrow from anyone who may have tried to do this. I'd also be interested in suggestions on how to accomplish this with minimal work.
Thx
Technical question 1 - what are my connection strings?
This is an easy question to answer. In your package, add a Script Task and enumerate through the Connections collection. I fire the OnInformation event, and if I had this scheduled, I'd be sure to have the /rep iew options in my dtexec call to ensure I record Information, Errors and Warnings.
namespace TurnDownForWhat
{
    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Runtime;
    using System.Windows.Forms;

    /// <summary>
    /// ScriptMain is the entry point class of the script. Do not change the name, attributes,
    /// or parent of this class.
    /// </summary>
    [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        public void Main()
        {
            bool fireAgain = false;
            // Log every connection manager's name and its effective connection string
            foreach (var item in Dts.Connections)
            {
                Dts.Events.FireInformation(0, "SCR Enumerate Connections", string.Format("{0}->{1}", item.Name, item.ConnectionString), string.Empty, 0, ref fireAgain);
            }

            Dts.TaskResult = (int)ScriptResults.Success;
        }

        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
    }
}
Running that on my package, I can see I had two Connection managers, CM_FF and CM_OLE along with their connection strings.
Information: 0x0 at SCR Enum, SCR Enumerate Connections: CM_FF->C:\ssisdata\dba_72929.csv
Information: 0x0 at SCR Enum, SCR Enumerate Connections: CM_OLE->Data Source=localhost\dev2012;Initial Catalog=tempdb;Provider=SQLNCLI11;Integrated Security=SSPI;
Add that to ... your OnPreExecute event for all the packages, and no one sees it but everyone reports back.
Technical question 2 - Missed configurations
I'm not aware of anything that will allow a package to know it's under configuration. There is an event, as you will see in your Information/Warning messages, when a package attempts to apply a configuration: Information - I'm configuring X via Y; Warning - I tried to configure X but didn't find Y, so I'm retaining its design-time value. But how to have a package inspect itself to find that out, I have no idea.
That said, I've seen reference to a property that fails a package on a missed configuration. I'm not seeing it now, but I'm certain it exists in some crevice. You can also supply the /w parameter to dtexec, which treats warnings as errors; and really, warnings are just errors that haven't grown up yet.
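For example, a scheduled invocation combining both switches might look like this (the package name is illustrative):

dtexec /FILE "MyPackage.dtsx" /Rep IEW /WarnAsError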
Unspoken issue 1 - Permissions
I had a friend who botched an XML config file as part of their production deploy. Their production server started consuming data from a dev server. Bad things happened. It sounds like you have had a similar situation. The resolution is easy, insulate your environments. Are you using the same service account for your production class SQL Server boxes and dev/test/uat/qa/load/etc? STOP. Make a new one. Don't allow prod to talk to any boxes that aren't in their tier of service. Someone bones a package and doesn't set a configuration? First of all, you'll catch it when it goes from dev to something-before-production because that tier wouldn't be able to talk to anything else that's not that level. But if you're in the ultra cheap shop and you've only got dev and prod, so be it. Non-configured package goes to prod. Prod SQL Agent fires off the package. Package uses default connection manager and fails validation because it can't talk to the dev sales database.
Unspoken issue 2 - template
What's your process when you have a new package to build? Does your team really start from scratch? There are so many ways to solve this problem, but the core concept is to define your best practices for Configuration, Logging, Package Protection Level, Transaction levels, etc. in some easily consumable form. Maybe that's 3 starter packages: one for raw acquisition, one that stages and conforms the data, and a last one that moves data from conformed into the final destination. Teammates then simply have to pick one to start from and fill in the spots that need it. If they choose to do their own thing, that's the stick you beat them with when their package fails to run in production because they didn't follow the standard path.
There are other approaches here. If you're a strong .NET crew, you can generate your template packages that way. At this point, I create my templates with Biml and use that to drive basic package creation.
If I am understanding you correctly, the solution below should work.
My suggestion is to turn on the Do not save sensitive option (DontSaveSensitive) for the ProtectionLevel property at the top level of the package.
This will require you to use package configurations for every connection; otherwise the package will not have the credentials to make a connection.
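If you run packages with dtexec, the real connection string can then be supplied per environment at invocation time, along these lines (names and values are illustrative):

dtexec /FILE "MyPackage.dtsx" /Conn "CM_OLE";"Data Source=prodserver;Initial Catalog=Sales;Provider=SQLNCLI11;Integrated Security=SSPI;"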
I started up my local MySQL server with the shared memory protocol turned on.
How can I connect to my server with ZeosLib? Where do I specify that it should use shared memory?
I am using Lazarus (Free Pascal), although the directions would probably be the same for Delphi.
Even though TZConnection has no connection string property, you can set additional connection parameters in TZConnection.Properties.
I presume you run your MySQL server this way:
mysqld --skip-networking --shared_memory=1 --shared-memory-base-name='MyMemoryDB'
To enable your shared memory connection, you might try adding the following configuration lines to the TZConnection.Properties property at design time in the Object Inspector.
Note that the protocol must be set exactly as shown, and shared-memory-base-name must be set to the same value as you used in the command-line parameter. The default value is MYSQL, so if you omit the parameter on the command line, you should change the following MyMemoryDB values to MYSQL.
So in the TZConnection.Properties property, try adding these two lines:
protocol=memory
shared-memory-base-name=MyMemoryDB
or, at runtime, in the TZConnection.BeforeConnect event handler use:
procedure TForm1.ZConnection1BeforeConnect(Sender: TObject);
begin
ZConnection1.Properties.Add('protocol=memory');
ZConnection1.Properties.Add('shared-memory-base-name=MyMemoryDB');
end;
Hope this will help you somehow. I haven't tested it because I don't have the proper environment.
If ZeosLib supports it, it is probably a textual property that can be added to the (TZ)Connection options, just like other client-lib properties.