Flume stream to MySQL

I have been trying to stream data into a MySQL database using Apache Kafka and Flume. (Here is my Flume configuration file.)
agent.sources=kafkaSrc
agent.channels=channel1
agent.sinks=jdbcSink
agent.channels.channel1.type=org.apache.flume.channel.kafka.KafkaChannel
agent.channels.channel1.brokerList=localhost:9092
agent.channels.channel1.topic=kafkachannel
agent.channels.channel1.zookeeperConnect=localhost:2181
agent.channels.channel1.capacity=10000
agent.channels.channel1.transactionCapacity=1000
agent.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSrc.channels = channel1
agent.sources.kafkaSrc.zookeeperConnect = localhost:2181
agent.sources.kafkaSrc.topic = kafka-mysql
***agent.sinks.jdbcSink.type = How to declare this?***
agent.sinks.jdbcSink.connectionString = jdbc:mysql://1.1.1.1:3306/test
agent.sinks.jdbcSink.username=user
agent.sinks.jdbcSink.password=password
agent.sinks.jdbcSink.batchSize = 10
agent.sinks.jdbcSink.channel =channel1
agent.sinks.jdbcSink.sqlDialect=MYSQL
agent.sinks.jdbcSink.driver=com.mysql.jdbc.Driver
agent.sinks.jdbcSink.sql=(${body:varchar})
I know how to stream data into Hadoop or HBase (the logger or hdfs sink types), but I can't find a sink type for streaming into a MySQL database. So my question is: how do I declare jdbcSink.type?

You could always create a custom sink for MySQL. This is what we did at FIWARE with the Cygnus tool.
Feel free to get inspired from it: https://github.com/telefonicaid/fiware-cygnus/blob/master/cygnus-ngsi/src/main/java/com/telefonica/iot/cygnus/sinks/NGSIMySQLSink.java
It extends this other custom base class for all our sinks: https://github.com/telefonicaid/fiware-cygnus/blob/master/cygnus-ngsi/src/main/java/com/telefonica/iot/cygnus/sinks/NGSISink.java
Basically, you have to extend AbstractSink and implement the Configurable interface. That means overriding at least the following methods:
public Status process() throws EventDeliveryException
and:
public void configure(Context context)
respectively.
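For reference, here is a minimal sketch of what such a custom MySQL sink could look like. The class name, the target table test_table and its single-column layout are made up for the example, and error handling is deliberately simplified; treat it as a starting point, not a finished sink. It reads the connectionString, username and password keys already declared in the agent file above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Hypothetical MySQL sink: takes events from the channel and inserts each body as a row.
public class MySQLSink extends AbstractSink implements Configurable {

    private String connectionString;
    private String username;
    private String password;
    private Connection connection;

    @Override
    public void configure(Context context) {
        // Read the properties declared for the sink in the agent configuration file.
        connectionString = context.getString("connectionString");
        username = context.getString("username");
        password = context.getString("password");
    }

    @Override
    public synchronized void start() {
        super.start();
        try {
            connection = DriverManager.getConnection(connectionString, username, password);
        } catch (Exception e) {
            throw new RuntimeException("Could not open JDBC connection", e);
        }
    }

    @Override
    public synchronized void stop() {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (Exception e) {
            // Ignore errors on shutdown.
        }
        super.stop();
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                txn.commit();
                return Status.BACKOFF;
            }
            // "test_table" and its single body column are assumptions for this example.
            try (PreparedStatement stmt =
                     connection.prepareStatement("INSERT INTO test_table (body) VALUES (?)")) {
                stmt.setString(1, new String(event.getBody()));
                stmt.executeUpdate();
            }
            txn.commit();
            return Status.READY;
        } catch (Exception e) {
            txn.rollback();
            throw new EventDeliveryException("Failed to insert event", e);
        } finally {
            txn.close();
        }
    }
}

Once the class is compiled and placed on Flume's classpath (for example under plugins.d), the sink type in the agent file is simply its fully qualified class name, e.g. agent.sinks.jdbcSink.type = com.example.flume.MySQLSink (the package name here is hypothetical).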

Related

MapReduce with multiple mappers and reducers

I am trying to implement a job with multiple mappers and reducers. Here is my main method:
public static void main(String[] args) {
    // Create a new configuration
    Configuration configuration = new Configuration();
    Path out = new Path(args[1]);
    try {
        // This is the first job, to get the number of providers per state
        Job numberOfProvidersPerStateJob = Job.getInstance(configuration, "Total number of Providers per state");
        // Set the jar file class, mapper and reducer class
        numberOfProvidersPerStateJob.setJarByClass(ProviderCount.class);
        numberOfProvidersPerStateJob.setMapperClass(MapForProviderCount.class);
        numberOfProvidersPerStateJob.setReducerClass(ReduceForProviderCount.class);
        numberOfProvidersPerStateJob.setOutputKeyClass(Text.class);
        numberOfProvidersPerStateJob.setOutputValueClass(IntWritable.class);
        // Provide the input and output arguments; these are needed when running the jar file in Hadoop
        FileInputFormat.addInputPath(numberOfProvidersPerStateJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(numberOfProvidersPerStateJob, new Path(out, "out1"));
        if (!numberOfProvidersPerStateJob.waitForCompletion(true)) {
            System.exit(1);
        }
        // Job 2, for getting the state with the maximum number of providers
        Job maxJobProviderState = Job.getInstance(configuration, "State With Max Job providers");
        // Set the jar file class, mapper and reducer class
        maxJobProviderState.setJarByClass(ProviderCount.class);
        maxJobProviderState.setMapperClass(MapForMaxProvider.class);
        maxJobProviderState.setReducerClass(ReducerForMaxProvider.class);
        maxJobProviderState.setOutputKeyClass(IntWritable.class);
        maxJobProviderState.setOutputValueClass(Text.class);
        // Provide the input and output arguments; these are needed when running the jar file in Hadoop
        FileInputFormat.addInputPath(maxJobProviderState, new Path(out, "out1"));
        FileOutputFormat.setOutputPath(maxJobProviderState, new Path(out, "out2"));
        // Exit when results are ready
        System.exit(maxJobProviderState.waitForCompletion(true) ? 0 : 1);
    } catch (IOException | InterruptedException | ClassNotFoundException e) {
        e.printStackTrace();
        System.exit(1);
    }
}
The problem is that whenever I run it, I get the final output from the 2nd mapper class and not from the reducer class. It is as if my 2nd reducer class is being ignored.
You can chain mappers using org.apache.hadoop.mapreduce.lib.chain.ChainMapper and chain reducers using org.apache.hadoop.mapreduce.lib.chain.ChainReducer; that will resolve your issue.
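Here is a rough sketch of that API, reusing the class names from the question; the key/value types passed to each addMapper/setReducer call are assumptions and must match the real mapper and reducer signatures. Note that a chain allows only a single reducer, with optional mappers before and after it, so the logic of the second reducer would either have to live in a mapper that runs after the reducer (shown here as MapForMaxProvider) or stay in a separate second job.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedProviderCount {
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Providers per state (chained)");
        job.setJarByClass(ChainedProviderCount.class);

        // First mapper in the chain: counts providers per state.
        ChainMapper.addMapper(job, MapForProviderCount.class,
                LongWritable.class, Text.class,   // input key/value (assumes TextInputFormat)
                Text.class, IntWritable.class,    // mapper output key/value
                new Configuration(false));

        // The single reducer allowed in the chain.
        ChainReducer.setReducer(job, ReduceForProviderCount.class,
                Text.class, IntWritable.class,    // input key/value from the mapper
                Text.class, IntWritable.class,    // reducer output key/value
                new Configuration(false));

        // A mapper that runs after the reducer, e.g. to pick the state with the maximum count.
        ChainReducer.addMapper(job, MapForMaxProvider.class,
                Text.class, IntWritable.class,    // input from the reducer
                IntWritable.class, Text.class,    // final output key/value
                new Configuration(false));

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}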

JSON to CSV conversion on HDFS

I am trying to convert a JSON file into CSV.
I have Java code which does this perfectly on the Unix file system and the local file system.
I have written the main class below to perform this conversion on HDFS.
public class ClassMain {
    public static void main(String[] args) throws IOException {
        String uri = args[1];
        String uri1 = args[2];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        FSDataOutputStream out = fs.create(new Path(uri1));
        try {
            in = fs.open(new Path(uri));
            JsonToCSV toCSV = new JsonToCSV(uri);
            toCSV.json2Sheet().write2csv(uri1);
            IOUtils.copyBytes(in, out, 4096, false);
        }
        finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
json2Sheet and write2csv are the methods that perform the conversion and the write operation.
I am running this jar using the command below:
hadoop jar json-csv-hdfs.jar com.nishant.ClassMain /nishant/large.json /nishant/output
The problem is that it does not write anything to /nishant/output; it creates a zero-sized /nishant/output file.
Maybe the usage of copyBytes is not a good idea here.
How can I achieve this on HDFS when it works fine on the Unix FS and the local FS?
Here I am trying to convert a JSON file to CSV, not trying to map JSON objects to their values.
FileSystem needs only one configuration key to successfully connect to HDFS.
conf.set(key, "hdfs://host:port"); // where key="fs.default.name"|"fs.defaultFS"
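As a minimal sketch of that idea (the NameNode host and port are placeholders; the output path is the one from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode explicitly; "namenode-host:8020" is a placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // With fs.defaultFS set, Paths resolve against HDFS rather than the local file system.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/nishant/output"))) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}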

Execute stored procedure in MVC through Enterprise Library

I am trying to develop a sample project in MVC. In it, I try to get the user list from the database (MySQL). I am using the Enterprise Library DLL to set up the database connectivity.
public IEnumerable<UserViewModel> GetUserList()
{
    DatabaseProviderFactory factory = new DatabaseProviderFactory();
    Database db = factory.Create("MySqlConnection");
    DbCommand dbc = db.GetStoredProcCommand("uspGetUserList");
    dbc.CommandType = CommandType.StoredProcedure;
    return db.ExecuteDataSet(dbc);
}
I know that ExecuteDataSet only returns a DataSet; I want a command that returns an IEnumerable type.
Thank you.
If you want to return an IEnumerable type without manually constructing it (either from a DataSet or an IDataReader), then you can use accessors.
In your case the code would look like this:
public IEnumerable<UserViewModel> GetUserList()
{
    DatabaseProviderFactory factory = new DatabaseProviderFactory();
    Database db = factory.Create("MySqlConnection");
    IEnumerable<UserViewModel> results = db.ExecuteSprocAccessor<UserViewModel>("uspGetUserList");
    return results;
}
This assumes that the UserViewModel can be mapped from the stored procedure result set (e.g. column names are the same name as the property names). If not, then you would need to create a custom mapper. See Retrieving Data as Objects from the Developer's Guide for more information about accessors.

How to serialize a JDBC connection for Spark node distribution in a foreach

My end goal is to get Apache Spark to use a JDBC connection to a MySQL database for transporting mapped RDD data in Scala. Going about this has led to an error explaining that the simple JDBC code I'm using could not be serialized. How do I allow the JDBC class to be serialized?
Typically, the DB session in a driver cannot be serialized because it involves threads and open TCP connections to the underlying DB.
As @aaronman mentions, the easiest way at the moment is to include the creation of the driver connection in the closure of a partition foreach. That way you won't have serialization issues with the Driver.
Here is skeleton code showing how this can be done:
rdd.foreachPartition {
  msgIterator => {
    val cluster = Cluster.builder.addContactPoint(host).build()
    val session = cluster.connect(db)
    msgIterator.foreach { msg =>
      ...
      session.execute(statement)
    }
    session.close
  }
}
As Spark SQL continues to evolve, I expect improved support for DB connectivity to come in the future. For example, DataStax created a Cassandra-Spark driver that abstracts away the per-worker connection creation in an efficient way, improving resource usage.
Also look at JdbcRDD, which takes the connection handling as a function (executed on the workers).
A JDBC connection object is associated with a specific TCP connection and socket port and hence cannot be serialized. So you should create the JDBC connection in the remote executor JVM process, not in the driver JVM process.
One way of achieving this is to hold the connection object as a field in a singleton object in Scala (or a static field in Java), as shown below. In the snippet below, the statement val session = ExecutorSingleton.session is not executed in the driver; the statement is shipped off to the executor, where it is executed.
case class ConnectionProfile(host: String, username: String, password: String)

object ExecutorSingleton {
  var profile: ConnectionProfile = _
  lazy val session = createJDBCSession(profile)
  def createJDBCSession(profile: ConnectionProfile) = { ... }
}

rdd.foreachPartition {
  msgIterator => {
    ExecutorSingleton.profile = ConnectionProfile("host", "username", "password")
    msgIterator.foreach { msg =>
      val session = ExecutorSingleton.session
      session.execute(msg)
    }
  }
}

Is there a way to store instances of own classes in the ApplicationSettings of a Windows Store app?

In a Windows Store app I can only store WinRT types in the ApplicationSettings, according to the documentation. For roamed settings that should be kept together I can use ApplicationDataCompositeValue. Trying to store an instance of one of my own classes or structs results in an exception with the message "WinRT information: Error trying to serialize the value to be written to the application data store. Additional information: Data of this type is not supported". The phrase "trying to serialize" indicates that there must be some way to serialize a type for the application data API.
Does anyone know how I could achieve that?
I tried DataContract serialization but it did not work.
I think custom/own types are not supported.
See http://msdn.microsoft.com/en-us/library/windows/apps/hh464917.aspx:
"The Windows Runtime data types are supported for app settings."
But you can serialize your objects to XML and save them as a string... (see the code below)
public static string Serialize(object obj)
{
    using (var sw = new StringWriter())
    {
        var serializer = new XmlSerializer(obj.GetType());
        serializer.Serialize(sw, obj);
        return sw.ToString();
    }
}

public static T Deserialize<T>(string xml)
{
    using (var sw = new StringReader(xml))
    {
        var serializer = new XmlSerializer(typeof(T));
        return (T)serializer.Deserialize(sw);
    }
}
https://github.com/MyToolkit/MyToolkit/blob/master/src/MyToolkit/Serialization/XmlSerialization.cs
Check out this class too:
https://github.com/MyToolkit/MyToolkit/wiki/XmlSerialization
Disclaimer: The above links are from my project