how to save apache spark schema output in mysql database - mysql

Can anyone please tell me if there is any way in apache spark to store a JavaRDD on mysql database? I am taking input from 2 csv files and then after doing join operations on their contents I need to save the output(the output JavaRDD) in the mysql database. I am already able to save the output successfully on hdfs but I am not finding any information related to apache Spark-MYSQL connection. Below I am posting the code for spark sql. This might serve as a reference to those who are looking for an example for spark-sql.
package attempt1;
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
public class Spark_Mysql {
#SuppressWarnings("serial")
public static class CompleteSample implements Serializable {
private String ASSETNUM;
private String ASSETTAG;
private String CALNUM;
public String getASSETNUM() {
return ASSETNUM;
}
public void setASSETNUM(String aSSETNUM) {
ASSETNUM = aSSETNUM;
}
public String getASSETTAG() {
return ASSETTAG;
}
public void setASSETTAG(String aSSETTAG) {
ASSETTAG = aSSETTAG;
}
public String getCALNUM() {
return CALNUM;
}
public void setCALNUM(String cALNUM) {
CALNUM = cALNUM;
}
}
#SuppressWarnings("serial")
public static class ExtendedSample implements Serializable {
private String ASSETNUM;
private String CHANGEBY;
private String CHANGEDATE;
public String getASSETNUM() {
return ASSETNUM;
}
public void setASSETNUM(String aSSETNUM) {
ASSETNUM = aSSETNUM;
}
public String getCHANGEBY() {
return CHANGEBY;
}
public void setCHANGEBY(String cHANGEBY) {
CHANGEBY = cHANGEBY;
}
public String getCHANGEDATE() {
return CHANGEDATE;
}
public void setCHANGEDATE(String cHANGEDATE) {
CHANGEDATE = cHANGEDATE;
}
}
#SuppressWarnings("serial")
public static void main(String[] args) throws Exception {
JavaSparkContext ctx = new JavaSparkContext("local[2]", "JavaSparkSQL");
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
JavaRDD<CompleteSample> cs = ctx.textFile("C:/Users/cyg_server/Documents/bigDataExample/AssetsImportCompleteSample.csv").map(
new Function<String, CompleteSample>() {
public CompleteSample call(String line) throws Exception {
String[] parts = line.split(",");
CompleteSample cs = new CompleteSample();
cs.setASSETNUM(parts[0]);
cs.setASSETTAG(parts[1]);
cs.setCALNUM(parts[2]);
return cs;
}
});
JavaRDD<ExtendedSample> es = ctx.textFile("C:/Users/cyg_server/Documents/bigDataExample/AssetsImportExtendedSample.csv").map(
new Function<String, ExtendedSample>() {
public ExtendedSample call(String line) throws Exception {
String[] parts = line.split(",");
ExtendedSample es = new ExtendedSample();
es.setASSETNUM(parts[0]);
es.setCHANGEBY(parts[1]);
es.setCHANGEDATE(parts[2]);
return es;
}
});
JavaSchemaRDD complete = sqlCtx.applySchema(cs, CompleteSample.class);
complete.registerAsTable("cs");
JavaSchemaRDD extended = sqlCtx.applySchema(es, ExtendedSample.class);
extended.registerAsTable("es");
JavaSchemaRDD fs= sqlCtx.sql("SELECT cs.ASSETTAG, cs.CALNUM, es.CHANGEBY, es.CHANGEDATE FROM cs INNER JOIN es ON cs.ASSETNUM=es.ASSETNUM;");
JavaRDD<String> result = fs.map(new Function<Row, String>() {
public String call(Row row) {
return row.getString(0);
}
});
result.saveAsTextFile("hdfs://path/to/hdfs/dir-name"); //instead of hdfs I need to save it on mysql database, but I am not able to find any Spark-MYSQL connection
}
}
Here at the end I am saving the result successfully in HDFS. But now I want to save into MYSQL database. Kindly help me out. Thanks

There are two approaches you can use for writing your results back to the database. One is to use something like DBOutputFormat and configure that, and the other is to use foreachPartition on the RDD you want to save and pass in a function which creates a connection to MySQL and writes the result back.

Here is an example using DBOutputFormat.
Create a class that represents your table row -
public class TableRow implements DBWritable
{
public String column1;
public String column2;
#Override
public void write(PreparedStatement statement) throws SQLException
{
statement.setString(1, column1);
statement.setString(2, column2);
}
#Override
public void readFields(ResultSet resultSet) throws SQLException
{
throw new RuntimeException("readFields not implemented");
}
}
Then configure your job and write a mapToPair function. The value doesn't appear to be used. If anyone knows, please post a comment.
String tableName = "YourTableName";
String[] fields = new String[] { "column1", "column2" };
JobConf job = new JobConf();
DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/DatabaseNameHere", "username", "password");
DBOutputFormat.setOutput(job, tableName, fields);
// map your rdd into a table row
JavaPairRDD<TableRow, Object> rows = rdd.mapToPair(...);
rows.saveAsHadoopDataset(job);

Related

Lifecycle of #After method

I am trying to gather some information after every test method, and would like to analyze the gathered information after the test class completes. So, I have a private member variable, a list which I would like to add to after every test method completes. However, at the end of the day, the member variable always remains null.
Note: My test class implements Callable interface.
Here is my code snippet:
{
private List<String statisticsCollector;
private JUnitCore core = null;
private int x = 0;
public MyLoadTest() {
this.core = new JUnitCore();
this.statisticsCollector = new ArrayList<String>();
}
#Override
public List<String> call() {
log.info("Starting a new thread of execution with Thread# -" + Thread.currentThread().getName());
core.run(this.getClass());
return getStatisticsCollector(); // this is always returing a list of size 0
}
#After
public void gatherSomeStatistics() {
x = x+1;
String sb = new String("Currently executing ----" + x);
log.info("Currently executing ----" + x);
addToStatisticsCollector(sb);
}
#Test
#FileParameters(value = "classpath:folder/testB.json", mapper = MyMapper.class)
public void testB(MarsTestDefinition testDefinition) {
runTests(testDefinition);
}
#Test
#FileParameters(value = "classpath:folder/testA.json", mapper = MyMapper.class)
public void testA(MyDefinition testDefinition) {
runTests(testDefinition);
}
public List<String> getStatisticsCollector() {
return this.statisticsCollector;
}
public void addToStatisticsCollector(String sb) {
this.statisticsCollector.add(sb);
}
}
So, why is it always getting reset, even though I am appending to the list in my #After annotated method?
Any help will be highly appreciated. Thanks
Try with following code, is it working ?
private static List<String> statisticsCollector = new ArrayList<String>();
private JUnitCore core = null;
private int x = 0;
public MyLoadTest() {
this.core = new JUnitCore();
}
public List<String> getStatisticsCollector() {
return statisticsCollector;
}

Storing Apache Hadoop Data Output to Mysql Database

I need to store output of map-reduce program into database, so is there any way?
If so, is it possible to store output into multiple columns & tables based on requirement??
please suggest me some solutions.
Thank you..
The great example is shown on this blog, I tried it and it goes really well. I quote the most important parts of the code.
At first, you must create a class representing data you would like to store. The class must implement DBWritable interface:
public class DBOutputWritable implements Writable, DBWritable
{
private String name;
private int count;
public DBOutputWritable(String name, int count) {
this.name = name;
this.count = count;
}
public void readFields(DataInput in) throws IOException { }
public void readFields(ResultSet rs) throws SQLException {
name = rs.getString(1);
count = rs.getInt(2);
}
public void write(DataOutput out) throws IOException { }
public void write(PreparedStatement ps) throws SQLException {
ps.setString(1, name);
ps.setInt(2, count);
}
}
Create objects of previously defined class in your Reducer:
public class Reduce extends Reducer<Text, IntWritable, DBOutputWritable, NullWritable> {
protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) {
int sum = 0;
for(IntWritable value : values) {
sum += value.get();
}
try {
ctx.write(new DBOutputWritable(key.toString(), sum), NullWritable.get());
} catch(IOException e) {
e.printStackTrace();
} catch(InterruptedException e) {
e.printStackTrace();
}
}
}
Finally you must configure a connection to your DB (do not forget to add your db connector on the classpath) and register your mapper's and reducer's input/output data types.
public class Main
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
DBConfiguration.configureDB(conf,
"com.mysql.jdbc.Driver", // driver class
"jdbc:mysql://localhost:3306/testDb", // db url
"user", // username
"password"); //password
Job job = new Job(conf);
job.setJarByClass(Main.class);
job.setMapperClass(Map.class); // your mapper - not shown in this example
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class); // your mapper - not shown in this example
job.setMapOutputValueClass(IntWritable.class); // your mapper - not shown in this example
job.setOutputKeyClass(DBOutputWritable.class); // reducer's KEYOUT
job.setOutputValueClass(NullWritable.class); // reducer's VALUEOUT
job.setInputFormatClass(...);
job.setOutputFormatClass(DBOutputFormat.class);
DBInputFormat.setInput(...);
DBOutputFormat.setOutput(
job,
"output", // output table name
new String[] { "name", "count" } //table columns
);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

How to prevent Gson serialize / deserialize the first character of a field (underscore)?

My class:
class ExampleBean {
private String _firstField;
private String _secondField;
// respective getters and setters
}
I want to appear as follows:
{
"FirstField":"value",
"SecondField":"value"
}
And not like this
{
"_FirstField":"value",
"_SecondField":"value"
}
I initialize the parser as follows:
GsonBuilder builder = new GsonBuilder();
builder.setDateFormat(DateFormat.LONG);
builder.setFieldNamingPolicy(FieldNamingPolicy.UPPER_CAMEL_CASE);
builder.setPrettyPrinting();
set_defaultParser(builder.create());
I could see the API and in the documentation of "FieldNamePolicy" but I am surprised that not give the option to skip "_"
I also know I can use the annotation...
# SerializedName (" custom_naming ")
...but do not want to have to write this for alllllll my fields ...
It's very useful for me to distinguish between local variables and fields of a class. :( Any Idea?
EDIT: There would be many obvious solutions, (inheritance, gson overwriting methods, regular expresions). My question is more focused on whether there is a native solution of gson or a less intrusive fix?
Maybe we could propose as new FieldNamePolicy?
GsonBuilder provides a method setFieldNamingStrategy() that allows you to pass your own FieldNamingStrategy implementation.
Note that this replaces the call to setFieldNamingPolicy() - if you look at the source for GsonBuilder these two methods are mutually exclusive as they set the same internal field (The FieldNamingPolicy enum is a FieldNamingStrategy).
public class App
{
public static void main(String[] args)
{
Gson gson = new GsonBuilder()
.setFieldNamingStrategy(new MyFieldNamingStrategy())
.setPrettyPrinting()
.create();
System.out.println(gson.toJson(new ExampleBean()));
}
}
class ExampleBean
{
private String _firstField = "first field value";
private String _secondField = "second field value";
// respective getters and setters
}
class MyFieldNamingStrategy implements FieldNamingStrategy
{
public String translateName(Field field)
{
String fieldName =
FieldNamingPolicy.UPPER_CAMEL_CASE.translateName(field);
if (fieldName.startsWith("_"))
{
fieldName = fieldName.substring(1);
}
return fieldName;
}
}
Output:
{
"FirstField": "first field value",
"SecondField": "second field value"
}
What you want is
import java.lang.reflect.Field;
import java.text.DateFormat;
import com.google.gson.FieldNamingStrategy;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
public class GsonExample {
public static void main(String... args) throws Exception {
final GsonBuilder builder = new GsonBuilder();
builder.setDateFormat(DateFormat.LONG);
builder.setPrettyPrinting();
builder.setFieldNamingStrategy(new FieldNamingStrategy() {
#Override
public String translateName(Field f) {
String fieldName = f.getName();
if(fieldName.startsWith("_") && fieldName.length() > 1) {
fieldName = fieldName.substring(1, 2).toUpperCase() + fieldName.substring(2);
}
return fieldName;
}
});
final Gson gson = builder.create();
System.out.println(gson.toJson(new ExampleBean("example", "bean")));
}
private static class ExampleBean {
private final String _firstField;
private final String _secondField;
private ExampleBean(String _firstField, String _secondField) {
this._firstField = _firstField;
this._secondField = _secondField;
}
}
}
which generates
{"FirstField":"example","SecondField":"bean"}

WcfFacility and Sequence contains no elements error?

I have wcf library with service contracts and implementations.
[ServiceContract]
public interface IServiceProtoType
{
[OperationContract]
Response GetMessage(Request request);
[OperationContract]
String SayHello();
}
[DataContract]
public class Request
{
private string name;
[DataMember]
public string Name
{
get { return name; }
set { name = value; }
}
}
[DataContract]
public class Response
{
private string message;
[DataMember]
public string Message
{
get { return message; }
set { message = value; }
}
}
public class MyDemoService : IServiceProtoType
{
public Response GetMessage(Request request)
{
var response = new Response();
if (null == request)
{
response.Message = "Error!";
}
else
{
response.Message = "Hello, " + request.Name;
}
return response;
}
public string SayHello()
{
return "Hello, World!";
}
}
I have windows service project that references this library, where MyService is just an empty shell that inherits ServiceBase. This service is installed and running under local system.
static void Main()
{
ServiceBase.Run(CreateContainer().Resolve());
}
private static IWindsorContainer CreateContainer()
{
IWindsorContainer container = new WindsorContainer();
container.Install(FromAssembly.This());
return container;
}
public class ServiceInstaller : IWindsorInstaller
{
#region IWindsorInstaller Members
public void Install(IWindsorContainer container, Castle.MicroKernel.SubSystems.Configuration.IConfigurationStore store)
{
string myDir;
if (string.IsNullOrEmpty(AppDomain.CurrentDomain.RelativeSearchPath))
{
myDir = AppDomain.CurrentDomain.BaseDirectory;
}
else
{
myDir = AppDomain.CurrentDomain.RelativeSearchPath;
}
var wcfLibPath = Path.Combine(myDir , "WcfDemo.dll");
string baseUrl = "http://localhost:8731/DemoService/{0}";
AssemblyName myAssembly = AssemblyName.GetAssemblyName(wcfLibPath);
container
.Register(
AllTypes
.FromAssemblyNamed(myAssembly.Name)
.InSameNamespaceAs<WcfDemo.MyDemoService>()
.WithServiceDefaultInterfaces()
.Configure(c =>
c.Named(c.Implementation.Name)
.AsWcfService(
new DefaultServiceModel()
.AddEndpoints(WcfEndpoint
.BoundTo(new WSHttpBinding())
.At(string.Format(baseUrl,
c.Implementation.Name)
)))), Component.For<ServiceBase>().ImplementedBy<MyService>());
}
#endregion
}
In Client Console app I have the following code and I am getting the following error:
{"Sequence contains no elements"}
static void Main(string[] args)
{
IWindsorContainer container = new WindsorContainer();
string baseUrl = "http://localhost:8731/DemoService/{0}";
container.AddFacility<WcfFacility>(f => f.CloseTimeout = TimeSpan.Zero);
container
.Register(
Types
.FromAssemblyContaining<IServiceProtoType>()
.InSameNamespaceAs<IServiceProtoType>()
.Configure(
c => c.Named(c.Implementation.Name)
.AsWcfClient(new DefaultClientModel
{
Endpoint = WcfEndpoint
.BoundTo(new WSHttpBinding())
.At(string.Format(baseUrl,
c.Name.Substring(1)))
})));
var service1 = container.Resolve<IServiceProtoType>();
Console.WriteLine(service1.SayHello());
Console.ReadLine();
}
I have an idea what this may be but you can stop reading this now (and I apologize for wasting your time in advance) if the answer to the following is no:
Is one (or more) of Request, Response, or MyDemoService in the same namespace as IServiceProtoType?
I suspect that Windsor is getting confused about those, since you are doing...
Types
.FromAssemblyContaining<IServiceProtoType>()
.InSameNamespaceAs<IServiceProtoType>()
... and then configuring everything which that returns as a WCF client proxy. This means that it will be trying to create proxies for things that should not be and hence a Sequence Contains no Elements exception (not the most useful message IMHO but crushing on).
The simple fix would be just to put your IServiceProtoType into its own namespace (I often have a namespace like XXXX.Services for my service contracts).
If that is not acceptable to you then you need to work out another way to identify just the service contracts - take a look at the If method for example or just a good ol' Component.For perhaps.

Ehcache hangs in test

I am in the process of rewriting a bottle neck in the code of the project I am on, and in doing so I am creating a top level item that contains a self populating Ehcache. I am attempting to write a test to make sure that the basic call chain is established, but when the test executes it hands when retrieving the item from the cache.
Here are the Setup and the test, for reference mocking is being done with Mockito:
#Before
public void SetUp()
{
testCache = new Cache(getTestCacheConfiguration());
recordingFactory = new EntryCreationRecordingCache();
service = new Service<Request, Response>(testCache, recordingFactory);
}
#Test
public void retrievesResultsFromSuppliedCache()
{
ResultType resultType = mock(ResultType.class);
Response expectedResponse = mock(Response.class);
addToExpectedResults(resultType, expectedResponse);
Request request = mock(Request.class);
when(request.getResultType()).thenReturn(resultType);
assertThat(service.getResponse(request), sameInstance(expectedResponse));
assertTrue(recordingFactory.requestList.contains(request));
}
private void addToExpectedResults(ResultType resultType,
Response response) {
recordingFactory.responseMap.put(resultType, response);
}
private CacheConfiguration getTestCacheConfiguration() {
CacheConfiguration cacheConfiguration = new CacheConfiguration("TEST_CACHE", 10);
cacheConfiguration.setLoggingEnabled(false);
return cacheConfiguration;
}
private class EntryCreationRecordingCache extends ResponseFactory{
public final Map<ResultType, Response> responseMap = new ConcurrentHashMap<ResultType, Response>();
public final List<Request> requestList = new ArrayList<Request>();
#Override
protected Map<ResultType, Response> generateResponse(Request request) {
requestList.add(request);
return responseMap;
}
}
Here is the ServiceClass
public class Service<K extends Request, V extends Response> {
private Ehcache cache;
public Service(Ehcache cache, ResponseFactory factory) {
this.cache = new SelfPopulatingCache(cache, factory);
}
#SuppressWarnings("unchecked")
public V getResponse(K request)
{
ResultType resultType = request.getResultType();
Element cacheEntry = cache.get(request);
V response = null;
if(cacheEntry != null){
Map<ResultType, Response> resultTypeMap = (Map<ResultType, Response>) cacheEntry.getValue();
try{
response = (V) resultTypeMap.get(resultType);
}catch(NullPointerException e){
throw new RuntimeException("Result type not found for Result Type: " + resultType);
}catch(ClassCastException e){
throw new RuntimeException("Incorrect Response Type for Result Type: " + resultType);
}
}
return response;
}
}
And here is the ResponseFactory:
public abstract class ResponseFactory implements CacheEntryFactory{
#Override
public final Object createEntry(Object request) throws Exception {
return generateResponse((Request)request);
}
protected abstract Map<ResultType,Response> generateResponse(Request request);
}
After wrestling with it for a while, I discovered that the cache wasn't being initialized. Creating a CacheManager and adding the cache to it resolved the problem.
I also had a problem with EHCache hanging, although only in a hello-world example. Adding this to the end fixed it (the application ends normally).
CacheManager.getInstance().removeAllCaches();
https://stackoverflow.com/a/20731502/2736496