groupBy with Spark Java - CSV

I can read data from a CSV file with Spark, but I don't know how to groupBy on a specific field. I want to groupBy 'Name'. This is my code:
// Imports added for completeness (ObjectMapper is assumed to be Jackson databind)
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

public class readspark {
    public static void main(String[] args) {
        final ObjectMapper om = new ObjectMapper();
        System.setProperty("hadoop.home.dir", "D:\\Task\\winutils-master\\hadoop-3.0.0");
        SparkConf conf = new SparkConf()
                .setMaster("local[3]")
                .setAppName("Read Spark CSV")
                .set("spark.driver.host", "localhost");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("D:\\Task\\data.csv");
        JavaRDD<DataModel> rdd = lines.map(new Function<String, DataModel>() {
            @Override
            public DataModel call(String s) throws Exception {
                String[] dataArray = s.split(",");
                DataModel dataModel = new DataModel();
                dataModel.Name(dataArray[0]);
                dataModel.ID(dataArray[1]);
                dataModel.Addres(dataArray[2]);
                dataModel.Salary(dataArray[3]);
                return dataModel;
            }
        });
        rdd.foreach(new VoidFunction<DataModel>() {
            @Override
            public void call(DataModel dataModel) throws Exception {
                System.out.println(om.writeValueAsString(dataModel));
            }
        });
    }
}

Spark provides the group by functionality directly:
JavaPairRDD<String, Iterable<DataModel>> groupedRdd = rdd.groupBy(dataModel -> dataModel.getName());
This returns a pair RDD where the key is the name (determined by the lambda passed to groupBy) and the value is an Iterable of the data models with that name.
If you want to change the grouping logic, all you need to do is provide a corresponding lambda.
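For example, to print the size of each group you can iterate over the grouped RDD (a minimal sketch; assumes Java 8 lambdas and the getName() getter used above):
groupedRdd.foreach(group -> {
    // group._1() is the name, group._2() holds the matching DataModels
    int count = 0;
    for (DataModel dm : group._2()) {
        count++;
    }
    System.out.println(group._1() + ": " + count + " row(s)");
});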

Related

MapReduce Function with JSON Files and JSONParser

I have some problems writing my MapReduce functions.
I want to solve the following problem:
I have a JSON file with one million JSON objects like this:
{"_id":3951,"title":"Two Family House (2000)","genres":["Drama"],"ratings":[{"userId":173,"rating":5},{"userId":195,"rating":5},{"userId":411,"rating":4},{"userId":593,"rating":2},{"userId":629,"rating":3},{"userId":830,"rating":3},{"userId":838,"rating":5},{"userId":850,"rating":4},{"userId":856,"rating":4},{"userId":862,"rating":5},{"userId":889,"rating":1},{"userId":928,"rating":5},{"userId":986,"rating":4},{"userId":1001,"rating":5},{"userId":1069,"rating":3},{"userId":1168,"rating":3},{"userId":1173,"rating":2},{"userId":1242,"rating":3},{"userId":1266,"rating":5},{"userId":1331,"rating":5},{"userId":1417,"rating":5},{"userId":1470,"rating":4},{"userId":1474,"rating":5},{"userId":1615,"rating":3},{"userId":1625,"rating":4},{"userId":1733,"rating":4},{"userId":1799,"rating":4},{"userId":1865,"rating":5},{"userId":1877,"rating":5},{"userId":1897,"rating":5},{"userId":1946,"rating":4},{"userId":2031,"rating":4},{"userId":2129,"rating":2},{"userId":2353,"rating":4},{"userId":2986,"rating":4},{"userId":3940,"rating":4},{"userId":3985,"rating":3},{"userId":4025,"rating":5},{"userId":4727,"rating":3},{"userId":5333,"rating":3}]}
and more....
Each JSON object is a movie, which contains an array of ratings. I want to count all the ratings in the JSON file.
I created a Maven project in IntelliJ with the dependencies for Hadoop and the JSON parser. My MapReduce class is this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import java.io.IOException;
import java.util.Iterator;
public class RatingCounter {
    public static class RatingMapper extends Mapper<JSONObject, Text, Text, Text> {
        private Text id = new Text();
        private Text ratingAnzahl = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            JSONParser parser = new JSONParser();
            try {
                Object obj = parser.parse(value.toString());
                JSONObject jsonObject = (JSONObject) obj;
                String movieId = (String) jsonObject.get("_id");
                int count = 0;
                // loop array
                JSONArray ratings = (JSONArray) jsonObject.get("ratings");
                Iterator<String> iterator = ratings.iterator();
                while (iterator.hasNext()) {
                    count++;
                }
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

    public static class RatingReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Text resultValue = new Text();
            int allRatings = 0;
            while (values.hasNext()) {
                allRatings += Integer.parseInt(values.toString());
            }
            resultValue.set("" + allRatings);
            context.write(key, resultValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ratings count");
        job.setJarByClass(RatingCounter.class);
        job.setMapperClass(RatingMapper.class);
        job.setReducerClass(RatingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
I have no idea how to write the functions in the Mapper and Reducer. Can someone help me, please?
I've made a few changes to your mapper and reducer.
First, for your mapper: you are not writing the output anywhere, and your syntax for extending the Mapper class is also wrong (arguably). The first input to any mapper is the LongWritable (or Object) offset of the line. You can see the changes below:
public static class RatingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        JSONParser parser = new JSONParser();
        try {
            Object obj = parser.parse(value.toString());
            JSONObject jsonObject = (JSONObject) obj;
            // _id is numeric in the sample JSON, so use String.valueOf instead of a String cast
            String movieId = String.valueOf(jsonObject.get("_id"));
            JSONArray ratings = (JSONArray) jsonObject.get("ratings");
            context.write(new Text(movieId), new IntWritable(ratings.size()));
        } catch (ParseException e) {
            // The overridden map() signature does not allow ParseException,
            // so log and skip malformed lines instead of rethrowing
            e.printStackTrace();
        }
    }
}
Notice that the output of the map is written using context.write.
Now, coming to your reducer: some things change because of the changes I made in the mapper. Also, since the number of ratings is always an integer, you don't need to convert it to Text, parse it back with parseInt, and then convert it to Text again.
public static class RatingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int allRatings = 0;
        // Iterable has no hasNext(); iterate over the values directly
        for (IntWritable value : values) {
            allRatings += value.get();
        }
        context.write(key, new IntWritable(allRatings));
    }
}
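One more thing worth checking (an assumption based on the driver shown in the question, which still declares Text for the output value): the job configuration should be updated to match the new IntWritable value type, along these lines:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);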

.NET Core Configuration Serialization

Is there a way to serialize an object so that it could then be rehydrated by .Net Core Configuration Binder?
Basically, I'd like to get this Test to pass:
[Test]
public void Can_Serialize_And_Rehydrate()
{
    var foo = new Foo { Prop1 = 42, Prop2 = "Test" };
    Dictionary<string, string> serialized = Serialize(foo);
    var deserializedFoo = new Foo();
    new ConfigurationBuilder()
        .AddInMemoryCollection(serialized)
        .Build()
        .Bind(deserializedFoo);
    Assert.AreEqual(deserializedFoo.Prop1, 42);
    Assert.AreEqual(deserializedFoo.Prop2, "Test");
}
Is there a serializer for this out of the box, or am I going to need to write my own Serialize() method?
AddInMemoryCollection's signature is as below, so why are you trying to serialize your dictionary here? You could just use it as it is.
public static IConfigurationBuilder AddInMemoryCollection(
    this IConfigurationBuilder configurationBuilder,
    IEnumerable<KeyValuePair<string, string>> initialData)
If you'd like to know more about how to test your custom configurations, I would suggest looking here:
https://github.com/aspnet/Configuration/blob/1.0.0/test/Microsoft.Extensions.Configuration.Binder.Test/ConfigurationBinderTests.cs
I was able to get this working by "hijacking" a JsonConfigurationProvider and plugging serialized JSON directly into it. I'm not sure if this is the best way, but it does work:
public class ConfigurationSerializer
{
    private class CustomJsonProvider : JsonConfigurationProvider
    {
        public CustomJsonProvider() : base(new JsonConfigurationSource())
        {
        }

        public IDictionary<string, string> GetData(Stream s)
        {
            Load(s);
            // Return the configuration dictionary
            return Data;
        }
    }

    public Dictionary<string, string> Serialize(object o)
    {
        var serialized =
            JsonConvert.SerializeObject(
                o,
                new JsonSerializerSettings { NullValueHandling = NullValueHandling.Ignore });
        using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(serialized)))
        {
            var jsonProvider = new CustomJsonProvider();
            return jsonProvider
                .GetData(ms)
                .ToDictionary(key => key.Key, value => value.Value);
        }
    }
}

how to save apache spark schema output in mysql database

Can anyone please tell me whether there is any way in Apache Spark to store a JavaRDD in a MySQL database? I am taking input from two CSV files, and after doing join operations on their contents I need to save the output (the output JavaRDD) in a MySQL database. I am already able to save the output successfully to HDFS, but I cannot find any information about an Apache Spark-to-MySQL connection. Below I am posting my Spark SQL code; it might serve as a reference for those who are looking for a Spark SQL example.
package attempt1;

import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;

public class Spark_Mysql {
    @SuppressWarnings("serial")
    public static class CompleteSample implements Serializable {
        private String ASSETNUM;
        private String ASSETTAG;
        private String CALNUM;

        public String getASSETNUM() {
            return ASSETNUM;
        }
        public void setASSETNUM(String aSSETNUM) {
            ASSETNUM = aSSETNUM;
        }
        public String getASSETTAG() {
            return ASSETTAG;
        }
        public void setASSETTAG(String aSSETTAG) {
            ASSETTAG = aSSETTAG;
        }
        public String getCALNUM() {
            return CALNUM;
        }
        public void setCALNUM(String cALNUM) {
            CALNUM = cALNUM;
        }
    }

    @SuppressWarnings("serial")
    public static class ExtendedSample implements Serializable {
        private String ASSETNUM;
        private String CHANGEBY;
        private String CHANGEDATE;

        public String getASSETNUM() {
            return ASSETNUM;
        }
        public void setASSETNUM(String aSSETNUM) {
            ASSETNUM = aSSETNUM;
        }
        public String getCHANGEBY() {
            return CHANGEBY;
        }
        public void setCHANGEBY(String cHANGEBY) {
            CHANGEBY = cHANGEBY;
        }
        public String getCHANGEDATE() {
            return CHANGEDATE;
        }
        public void setCHANGEDATE(String cHANGEDATE) {
            CHANGEDATE = cHANGEDATE;
        }
    }

    @SuppressWarnings("serial")
    public static void main(String[] args) throws Exception {
        JavaSparkContext ctx = new JavaSparkContext("local[2]", "JavaSparkSQL");
        JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

        JavaRDD<CompleteSample> cs = ctx.textFile("C:/Users/cyg_server/Documents/bigDataExample/AssetsImportCompleteSample.csv").map(
                new Function<String, CompleteSample>() {
                    public CompleteSample call(String line) throws Exception {
                        String[] parts = line.split(",");
                        CompleteSample cs = new CompleteSample();
                        cs.setASSETNUM(parts[0]);
                        cs.setASSETTAG(parts[1]);
                        cs.setCALNUM(parts[2]);
                        return cs;
                    }
                });

        JavaRDD<ExtendedSample> es = ctx.textFile("C:/Users/cyg_server/Documents/bigDataExample/AssetsImportExtendedSample.csv").map(
                new Function<String, ExtendedSample>() {
                    public ExtendedSample call(String line) throws Exception {
                        String[] parts = line.split(",");
                        ExtendedSample es = new ExtendedSample();
                        es.setASSETNUM(parts[0]);
                        es.setCHANGEBY(parts[1]);
                        es.setCHANGEDATE(parts[2]);
                        return es;
                    }
                });

        JavaSchemaRDD complete = sqlCtx.applySchema(cs, CompleteSample.class);
        complete.registerAsTable("cs");
        JavaSchemaRDD extended = sqlCtx.applySchema(es, ExtendedSample.class);
        extended.registerAsTable("es");

        JavaSchemaRDD fs = sqlCtx.sql("SELECT cs.ASSETTAG, cs.CALNUM, es.CHANGEBY, es.CHANGEDATE FROM cs INNER JOIN es ON cs.ASSETNUM=es.ASSETNUM;");
        JavaRDD<String> result = fs.map(new Function<Row, String>() {
            public String call(Row row) {
                return row.getString(0);
            }
        });
        result.saveAsTextFile("hdfs://path/to/hdfs/dir-name"); // instead of HDFS I need to save it to a MySQL database, but I cannot find any Spark-MySQL connection
    }
}
Here at the end I am saving the result successfully to HDFS, but now I want to save it into a MySQL database. Kindly help me out. Thanks.
There are two approaches you can use for writing your results back to the database. One is to use something like DBOutputFormat and configure that, and the other is to use foreachPartition on the RDD you want to save and pass in a function which creates a connection to MySQL and writes the result back.
Here is an example using DBOutputFormat.
Create a class that represents your table row -
public class TableRow implements DBWritable
{
    public String column1;
    public String column2;

    @Override
    public void write(PreparedStatement statement) throws SQLException
    {
        statement.setString(1, column1);
        statement.setString(2, column2);
    }

    @Override
    public void readFields(ResultSet resultSet) throws SQLException
    {
        throw new RuntimeException("readFields not implemented");
    }
}
Then configure your job and write a mapToPair function. The value in the pair doesn't appear to be used; if anyone knows, please post a comment.
String tableName = "YourTableName";
String[] fields = new String[] { "column1", "column2" };
JobConf job = new JobConf();
DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/DatabaseNameHere", "username", "password");
DBOutputFormat.setOutput(job, tableName, fields);
// map your rdd into a table row
JavaPairRDD<TableRow, Object> rows = rdd.mapToPair(...);
rows.saveAsHadoopDataset(job);
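For the second approach, here is a minimal sketch of the foreachPartition variant (the RDD name results, its String[] element type, and the table/column names are hypothetical placeholders matching the example above; it uses java.sql.Connection, DriverManager, and PreparedStatement):
results.foreachPartition(new VoidFunction<Iterator<String[]>>() {
    @Override
    public void call(Iterator<String[]> rows) throws Exception {
        // Open one connection per partition rather than one per record
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/DatabaseNameHere", "username", "password");
        PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO YourTableName (column1, column2) VALUES (?, ?)");
        try {
            while (rows.hasNext()) {
                String[] row = rows.next();
                stmt.setString(1, row[0]);
                stmt.setString(2, row[1]);
                stmt.executeUpdate();
            }
        } finally {
            stmt.close();
            conn.close();
        }
    }
});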

GSON not handling initialized static list correctly

If I do this:
public static volatile ArrayList<Process> processes = new ArrayList<Process>() {
    {
        add(new Process("News Workflow", "This is the workflow for the news segment", "image"));
    }
};
and then this:
String jsonResponse = gson.toJson(processes);
jsonResponse is null.
But if I do this:
public static volatile ArrayList<Process> processes = new ArrayList<Process>();
processes.add(new Process("nam", "description", "image"));
String jsonResponse = gson.toJson(processes);
Json response is:
[{"name":"nam","description":"description","image":"image"}]
Why is that?
I do not know what the problem with Gson is, but do you know that you are creating a subclass of ArrayList here?
new ArrayList<Process>() {
    {
        add(new Process("News Workflow", "This is the workflow for the news segment", "image"));
    }
};
You can check that by
System.out.println( processes.getClass().getName() );
it won't print java.util.ArrayList.
I think you wanted to use a static initializer, like this:
public static volatile ArrayList<Process> processes = new ArrayList<Process>();

static {
    processes.add(new Process("News Workflow", "This is the workflow for the news segment", "image"));
}
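With that change, processes is a plain java.util.ArrayList, so serialization should work; a quick check (a sketch; the Process constructor and field names are taken from the question):
Gson gson = new Gson();
String jsonResponse = gson.toJson(processes);
// Expected output, with field names assumed from the question:
// [{"name":"News Workflow","description":"This is the workflow for the news segment","image":"image"}]
System.out.println(jsonResponse);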
It seems that there is a problem with anonymous classes; the same problem shows up here:
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class GSonAnonymTest {
    interface Holder {
        String get();
    }

    static Holder h = new Holder() {
        String s = "value";

        @Override
        public String get() {
            return s;
        }
    };

    public static void main(final String[] args) {
        final GsonBuilder gb = new GsonBuilder();
        final Gson gson = gb.create();
        System.out.println("h:" + gson.toJson(h));
        System.out.println(h.get());
    }
}
UPD: see the Gson User Guide, "Finer Points with Objects", last point: "...anonymous classes, and local classes are ignored and not included in serialization or deserialization..."

How to serialize such a custom type to json with google-gson?

First, I have a very simple java bean which can be easily serialized to json:
class Node {
    private String text;
    // getter and setter
}
Node node = new Node();
node.setText("Hello");
String json = new Gson().toJson(node);
// json is { text: "Hello" }
Then, in order to give such beans some dynamic values, I create a "WithData" base class:
class WithData {
    private Map<String, Object> map = new HashMap<String, Object>();
    public void setData(String key, Object value) { map.put(key, value); }
    public Object getData(String key) { return map.get(key); }
}
class Node extends WithData {
    private String text;
    // getter and setter
}
Now I can set more data to a node:
Node node = new Node();
node.setText("Hello");
node.setData("to", "The world");
But Gson will ignore the "to"; the result is still { text: "Hello" }. I expect it to be { text: "Hello", to: "The world" }.
Is there any way to write a serializer for the WithData type, so that every class extending it will serialize not only its own properties to JSON, but also the data in the map?
I tried to implement a custom serializer but failed, because I don't know how to let Gson serialize the properties first and then the data in the map.
What I do now is create a custom serializer:
public static class NodeSerializer implements JsonSerializer<Node> {
    public JsonElement serialize(Node src, Type typeOfSrc, JsonSerializationContext context) {
        JsonObject obj = new JsonObject();
        obj.addProperty("id", src.id);
        obj.addProperty("text", src.text);
        obj.addProperty("leaf", src.leaf);
        obj.addProperty("level", src.level);
        obj.addProperty("parentId", src.parentId);
        obj.addProperty("order", src.order);
        Set<String> keys = src.getDataKeys();
        if (keys != null) {
            for (String key : keys) {
                obj.add(key, context.serialize(src.getData(key)));
            }
        }
        return obj;
    }
}
Then use GsonBuilder to convert it:
Gson gson = new GsonBuilder()
        .registerTypeAdapter(Node.class, new NodeSerializer())
        .create();
Tree tree = new Tree();
tree.addNode(node1);
tree.addNode(node2);
gson.toJson(tree);
Then the nodes in the tree will be converted as I expected. The only boring thing is that I need to create a special Gson each time.
Actually, you should expect a Node extending WithData to serialize as
{
    "text": "Hello",
    "map": {
        "to": "the world"
    }
}
(that's with "pretty print" turned on)
I was able to get that serialization when I tried your example. Here is my exact code:
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;

public class Class1 {
    public static void main(String[] args) throws MalformedURLException {
        GsonBuilder gb = new GsonBuilder();
        Gson g = gb.setPrettyPrinting().create();
        Node n = new Node();
        n.setText("Hello");
        n.setData("to", "the world");
        System.out.println(g.toJson(n));
    }

    private static class WithData {
        private Map<String, Object> map = new HashMap<String, Object>();
        public void setData(String key, Object value) { map.put(key, value); }
        public Object getData(String key) { return map.get(key); }
    }

    private static class Node extends WithData {
        private String text;
        public Node() { }
        public String getText() { return text; }
        public void setText(String text) { this.text = text; }
    }
}
I was using the JDK (javac) to compile - that is important because other compilers (those included with some IDEs) may remove the information on which Gson relies as part of their optimization or obfuscation process.
Here are the compilation and execution commands I used:
"C:\Program Files\Java\jdk1.6.0_24\bin\javac.exe" -classpath gson-2.0.jar Class1.java
"C:\Program Files\Java\jdk1.6.0_24\bin\java.exe" -classpath .;gson-2.0.jar Class1
For the purposes of this test, I put the Gson jar file in the same folder as the test class file.
Note that I'm using Gson 2.0; 1.x may behave differently.
Your JDK may be installed in a different location than mine, so if you use those commands, be sure to adjust the path to your JDK as appropriate.