I am in the process of writing a new Flink job that ingests JSON data from a Kafka source. I am "stuck" with Flink 1.15.2, but apart from that I am very free. I would like to stick to the DataStream API, mostly as a learning experience.
My data looks like this:
{ "SomeField": 123, "someOtherField": 4554, "another_different_field": 34543}
As you can see, there are multiple different naming schemas present. Inside my Flink job and further downstream I would like to clean this up. But how do I do this efficiently? Create my own POJOSerializer?
Or just the type hints? E.g. something like this:
public class MyDataTypeInfoFactory extends TypeInfoFactory<MyData> {
    @Override
    public TypeInformation<MyData> createTypeInfo(
            Type t, Map<String, TypeInformation<?>> genericParameters) {
        return Types.POJO(MyData.class, new HashMap<>(...)); // some code here?
    }
}
Or go with JSONKeyValueDeserializationSchema and then map the objects?
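The most direct route I can think of is a Jackson-annotated POJO plus a small DeserializationSchema, roughly like this (untested sketch; the MyData field names are just my guess at the cleaned-up schema):

import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import java.io.IOException;

// Jackson maps the inconsistent JSON names onto uniform Java fields;
// public fields plus the implicit no-arg constructor keep this a valid Flink POJO.
public class MyData {
    @JsonProperty("SomeField")
    public long someField;

    @JsonProperty("someOtherField")
    public long someOtherField;

    @JsonProperty("another_different_field")
    public long anotherDifferentField;
}

class MyDataDeserializationSchema extends AbstractDeserializationSchema<MyData> {
    // created lazily per task instead of being shipped with the job graph
    private transient ObjectMapper mapper;

    @Override
    public MyData deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        return mapper.readValue(message, MyData.class);
    }
}

If I read the docs right, this should plug into KafkaSource via setValueOnlyDeserializer(new MyDataDeserializationSchema()), and everything downstream would work with the cleaned-up MyData fields. Is that the idiomatic route, or is the TypeInfoFactory still worth it here?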
Any pointers?
Related
I'm using Flink to process some JSON-format data coming from some Data Source.
For now, my process is quite simple: extract each element from the JSON-format data and print them into a log file.
Here is my piece of code:
// create a proper deserializer to deserialize the JSON-format data into ObjectNode
PravegaDeserializationSchema<ObjectNode> adapter = new PravegaDeserializationSchema<>(ObjectNode.class, new JavaSerializer<>());
// create connector to receive data from Pravega
FlinkPravegaReader<ObjectNode> source = FlinkPravegaReader.<ObjectNode>builder()
.withPravegaConfig(pravegaConfig)
.forStream(stream)
.withDeserializationSchema(adapter)
.build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> dataStream = env.addSource(source).name("Pravega Stream");
dataStream.???.print();
Say the data coming from Pravega is like this: {"name":"titi", "age":18}
As I said, for now I simply need to extract name and age and print them.
So how could I do this?
As I understand it, I need to add some custom code at ???. I might need to create a custom POJO class which contains ObjectNode, but I don't know how. I've read the official Flink docs and also googled how to create a custom POJO for Flink, but I still can't figure it out clearly.
Could you please show me an example?
Why don't you simply use something more meaningful instead of JavaSerializer? Perhaps something from here.
You could then create a POJO with the fields you want to use and simply deserialize the JSON data to your POJO instead of ObjectNode.
Also, if there is some specific reason that you need to have ObjectNode on deserialization, then you can simply do something like:
// I assume you have created the class named MyPojo
dataStream.map(new MapFunction<ObjectNode, MyPojo>() {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public MyPojo map(final ObjectNode value) throws Exception {
        // ObjectNode#asText() returns "" for object nodes, so convert the tree directly
        return mapper.treeToValue(value, MyPojo.class);
    }
})
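For the sample record above, a minimal MyPojo could be as small as this (a sketch; Jackson binds the two fields by name):

// Matches records like {"name":"titi", "age":18}
public class MyPojo {
    public String name;
    public int age;

    @Override
    public String toString() {
        return name + " is " + age; // so dataStream.map(...).print() logs something readable
    }
}

With that in place, the ??? in the question becomes the map operator above followed by print().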
I want to create two different JSON documents, each containing 5 fields. I have a POJO class with 10 attributes; I want to form json1 from 5 attributes and json2 from the other 5 using that POJO class. Is there any way to construct these objects?
Consider writing two separate wrapper classes which each expose the fields you want for the two cases, and pass the pojo as a constructor arg.
So, one of them exposes one set of properties and might look like this:
public class JsonObject1 {
    private MyPojo myPojo;

    public JsonObject1(MyPojo myPojo) {
        this.myPojo = myPojo;
    }

    public String getProperty1() {
        return myPojo.getProperty1();
    }

    // ... getters for the other properties in this view
}
and the other is similar, but exposes the other subset of properties.
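Serializing such a wrapper with a bean-aware mapper then emits only the exposed getters; for example with Jackson (an assumption here, any bean-based JSON library works the same way):

ObjectMapper mapper = new ObjectMapper();
// myPojo is the populated 10-field POJO; JsonObject2 is the second wrapper described above
String json1 = mapper.writeValueAsString(new JsonObject1(myPojo));
String json2 = mapper.writeValueAsString(new JsonObject2(myPojo));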
Alternatively, you could add two methods (possibly to your POJO, or possibly to a service class that exposes the POJO) that each return a Map (e.g. a HashMap) into which you've copied the specific properties you want for each view, and then convert those Maps to JSON. This is less "model-driven", but might be less work overall. Thanks to @fvu for this observation!
public Map<String, Object> getPojoAsMap1() {
    Map<String, Object> m = new HashMap<>();
    m.put("property1", pojo.getProperty1());
    // ... the other properties for this view
    return m;
}
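Converting such a Map to JSON is then a one-liner with whichever mapper you use; with Jackson (again an assumption) it might be:

// 'service' stands for whatever object hosts getPojoAsMap1() - a hypothetical name
String json1 = new ObjectMapper().writeValueAsString(service.getPojoAsMap1());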
It's also possible that the two different JSON representations are trying to tell you that your POJO should be split up into two POJOs - sometimes things like this are hints about how your code could be improved. But it depends on the circumstances, and it might not apply in this case.
I have been working on an RPC client to handle weather data.
I do two things: first, I scraped the JSON I will be using and put it into a Dart file. See: https://dartpad.dartlang.org/a9c1fe8ce34c608eaa28
My server.dart file imports the weather data and then carries out the following:
import "dart:io";
import "weather_data.dart";
import "dart:convert";
import "package:rpc/rpc.dart";
final ApiServer _apiServer = new ApiServer(prettyPrint: true);
main() async {
Weather w = new Weather(WeatherJson);
TestServer ts = new TestServer(w);
_apiServer.addApi(ts);
HttpServer server = await HttpServer.bind(InternetAddress.ANY_IP_V4, 12345);
server.listen(_apiServer.httpRequestHandler);
}
class Weather {
  Map weather;
  Weather(this.weather);

  Map get daily => weather["daily"];
}
@ApiClass(name:"test_server", version: 'v1', description: 'This is a test server api to ping for some quick sample data.')
class TestServer {
Weather myWeather;
TestServer(this.myWeather){
}
@ApiMethod(method:'GET', path: 'daily')
Map<String, Object> getDaily(){
return myWeather.daily;
}
}
So, the server starts correctly, and when I go to localhost:12345/test_server/v1/daily it returns this:
{
"summary": {},
"icon": {},
"data": {}
}
which is not correct. If you look at the JSON data, summary and icon are both strings and data is an array. They are also empty here, but should contain the data I wanted to return.
Why does this occur? Is it because I am returning a Map<String, Object>? I tried to set it up as Map<String, dynamic>, but the Dart compiler didn't like it.
How do I get this to return the correct dataset?
The Dart website for RPC is located at: https://github.com/dart-lang/rpc
and you can see that under methods, the return value of a method can be either an instance of a class or a future. That makes sense as usual, so I set it to be Map<String, Object>; being vague about it by just declaring Map was not sufficient.
Edit:
When doing this mostly in DartPad without RPC, it seems to work correctly; see this sample: https://dartpad.dartlang.org/3f6dc5779617ed427b75
This leads me to believe something is wrong with the parsing tool, as the return type in DartPad can be Map, Map<String, Object>, or Map<String, dynamic>.
Having had a quick look at the RPC package README here https://pub.dartlang.org/packages/rpc, it seems that methods marked as API methods (with @ApiMethod) should return an instance of a class with simple fields, such as:
class ResourceMessage {
int id;
String name;
int capacity;
}
The RPC package will take that instance and serialize it into JSON based on the field names.
From the README:
The MyResponse class must be a non-abstract class with an unnamed
constructor taking no required parameters. The RPC backend will
automatically serialize all public fields of the MyResponse
instance into JSON ...
You are returning a nested Map representation of the JSON you want the RPC operation to emit, and I would guess that the RPC package does not handle it as you are expecting it to.
Re: this from your question:
This leads me to believe something is wrong with the parsing tool as
it seems the return type in DartPad allows to return Map, Map<String, Object>, and Map<String, dynamic>.
There is no 'parsing' of JSON going on in your example. The data you have is a set of nested literal Dart Maps, Lists and Strings with the same structure as the JSON it was derived from. It just happens to look like JSON.
In your example you are just selecting and printing a sub-map of your data map (data['daily']), which prints out the String that results from calling toString() - and since toString() is recursive, you get the contents of all the nested maps and lists within it.
So it's not a 'deep copy' issue, but a difference in how toString() and the RPC code process a set of nested maps.
BTW: the return type of your getDaily() method is immaterial. What is returned is just a Map, whatever the declared return type of the method is. Remember, types in Dart are optional and are there for editors and compilers to spot potentially incorrect code. See https://www.dartlang.org/docs/dart-up-and-running/ch02.html#variables.
I am going to piggyback off of @Argenti Apparatus here, as there was a lot of information gained from him.
Long story short, the required return type of the method:
@ApiMethod(method:'GET', path: 'daily')
Map<String,Object> getDaily(){ // <-- Map<String,Object>
return myWeather.daily;
}
is the error.
I went through and updated the method signature to be Map<String,String> and it parsed everything correctly. It did not parse the object as a string, but actually parsed it as a fully recursed object.
For the sake of code cleanliness I also changed the signatures of the Weather properties to reflect what they actually were: Map<String,Object>.
All in all, when defining the value type as Object it was returning curly braces, but setting it as String parsed it correctly.
I ran it through JSLint to confirm it is correct as well.
I gave a +1 to the helper; I had to dig deeper into the code to see WHY it wasn't handling a Map correctly.
I also feel this is plausibly a bug in RPC Dart.
How do I change from "setLineTokenizer(new DelimitedLineTokenizer()...)" to "JsonLineMapper" in the first code below? Basically, it is working with CSV but I want to change it to read a simple JSON file. I found some threads here asking about complex JSON, but this is not my case. At first I thought that I should use a very different approach from the CSV way, but after I read SBiAch05sample.pdf (see the link and snippet at the bottom), I understood that FlatFileItemReader can be used to read JSON format.
In an almost similar question, I can guess that I am not in the wrong direction. Please, I am trying to find the simplest but elegant and recommended way of fixing this snippet. The wrapper below, unless I am really obligated to work this way, seems to go further than necessary. Additionally, the wrapper seems to me more Java 6 style than my attempt, which takes advantage of anonymous classes (as far as I can judge from my studies). Any advice is highly appreciated.
//My Code
@Bean
@StepScope
public FlatFileItemReader<Message> reader() {
    log.info("ItemReader >>");
    FlatFileItemReader<Message> reader = new FlatFileItemReader<Message>();
    reader.setResource(new ClassPathResource("test_json.js"));
    reader.setLineMapper(new DefaultLineMapper<Message>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(new String[] { "field1", "field2"...
//Sample using a wrapper
http://www.manning.com/templier/SBiAch05sample.pdf
import org.springframework.batch.item.file.LineMapper;
import org.springframework.batch.item.file.mapping.JsonLineMapper;
import com.manning.sbia.ch05.Product;
public class WrappedJsonLineMapper implements LineMapper<Product> {
private JsonLineMapper delegate;
public Product mapLine(String line, int lineNumber) throws Exception {
Map<String,Object> productAsMap
= delegate.mapLine(line, lineNumber);
Product product = new Product();
product.setId((String)productAsMap.get("id"));
product.setName((String)productAsMap.get("name"));
product.setDescription((String)productAsMap.get("description"));
product.setPrice(new Float((Double)productAsMap.get("price")));
return product;
}
public void setDelegate(JsonLineMapper delegate) {
this.delegate = delegate;
}
}
Really you have two options for parsing JSON within a Spring Batch job:
1. Don't create a LineMapper, create a LineTokenizer. Spring Batch's DefaultLineMapper breaks the parsing of a record into two phases: parsing the record and mapping the result to an object. The fact that the incoming data is JSON rather than CSV only impacts the parsing piece (which is handled by the LineTokenizer). That being said, you'd have to write your own LineTokenizer to parse the JSON into a FieldSet.
2. Use the provided JsonLineMapper. Spring Batch provides a LineMapper implementation that uses Jackson to deserialize JSON objects into Java objects (see the sketch after this list).
In either case, you can't map a LineMapper to a LineTokenizer, as they accomplish two different things.
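Applied to the reader in the question, option 2 might look roughly like this (a sketch, assuming each line of test_json.js holds one JSON object and a Map-based item is acceptable downstream):

import java.util.Map;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.JsonLineMapper;
import org.springframework.core.io.ClassPathResource;

@Bean
@StepScope
public FlatFileItemReader<Map<String, Object>> reader() {
    FlatFileItemReader<Map<String, Object>> reader = new FlatFileItemReader<>();
    reader.setResource(new ClassPathResource("test_json.js"));
    // JsonLineMapper parses each line of JSON into a Map<String, Object> via Jackson
    reader.setLineMapper(new JsonLineMapper());
    return reader;
}

If you need Message items rather than Maps, wrap JsonLineMapper in a delegate exactly like the WrappedJsonLineMapper above, with Message in place of Product, and keep the reader typed as FlatFileItemReader<Message>.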
This is a simplified version of the problem I am solving, but conceptually equivalent.
This project is using Castle Windsor, and I am trying to keep all factories in the container.
I have a single object that represents data parsed from a text file. After parsing this file I need to write a new text file with 2 lines based on data in the original object.
Let's say the text file is
Some Person, Work Phone, Mobil Phone
this gets parsed into
public class Person
{
    public string Name {get;set;}
    public string WorkPhone {get;set;}
    public string MobilPhone {get;set;}
}
Now, this is a simplified example, so keep that in mind please. The next step is to create new object instances that represent each line we will write to the text file:
public interface IFileEntry
{
string Name{get;set;}
string Number{get;set;}
}
public class PersonWorkPhoneEntry : IFileEntry
{
public string Name {get;set;}
public string Number{get;set;}
public override string ToString() { /* ... */ }
}
public class PersonMobilPhoneEntry: IFileEntry
{
public string Name{get;set;}
public string Number{get;set;}
public override string ToString() { /* ... */ }
}
So, since we are using Castle for this, let's make a factory:
public interface IFileEntryFactory
{
    IFileEntry Create(string entryType, string Name, string Number);
}
I have created my own implementation of DefaultTypedFactoryComponentSelector and installed it for this factory only.
public class FileEntryComponentSelector : DefaultTypedFactoryComponentSelector
{
protected override string GetComponentName(System.Reflection.MethodInfo method, object[] arguments)
{
if (method.Name == "Create" && arguments.Length == 3)
{
return (string)arguments[0];
}
return base.GetComponentName(method, arguments);
}
}
This works:
var workEntry = _factory.Create("PersonWorkPhoneEntry", person.Name, person.WorkPhone);
var mobilEntry = _factory.Create("PersonMobilPhoneEntry", person.Name, person.MobilPhone);
// then write the ToString() of each entry to a text file
Sorry for the long setup, but I think it's needed. What I am trying to do is:
public interface IFileEntryFactory
{
    IFileEntry Create(string entryType, string Name, string Number);
    IFileEntry[] Create(Person person);
}
var entries = _factory.Create(person);
foreach(var e in entries)
    // write each entry to the text file
I have been digging all over for a solution like this with no results.
A possible solution might follow the example shown here (Castle Windsor Typed Factory Facility with generics).
I'm currently working on implementing something like this now; I'm not sure it is the right way to solve this problem.
The questions:
- Are there any other ways to have the factory return the array of needed objects?
- What is the best practice for solving something like this?
- Any examples and reading for advanced factories?
It is possible to make a typed factory return an array of objects which are already registered in the container. Here is an example:
container.Register(Component.For<IMyStuffProvider>().AsFactory()); // registration
public interface IMyStuffProvider
{
    IEnumerable<IMyStuff> GetAllStuff();
    void Release(IMyStuff stuff);
}
This makes it possible for every registered implementation of IMyStuff to be returned by the factory.
But I think that your problem is different: you are using the factory for the wrong purpose. Typed factories are for getting instances of objects that are already registered in the container during app start, not for manipulating files. Their purpose is to solve problems regarding dependencies.
If you are parsing a csv/txt into objects and then writing some of the rows back into another csv/txt, you have to make:
- IFileEntryManager (with an implementation), with methods like DeserializeFileToObjects, WriteObjectsToFile, etc.
- IFileEntryManagerFactory to create and return IFileEntryManager. (Castle typed factory here :) )
Now inject your IFileEntryManagerFactory into the ctor of the class that needs to serialize/deserialize text files, and use it to get your FileEntryManager, which in turn will act upon your text files.
If you have different objects like Person, Company, Employee, etc. and you want to handle them with a generic manipulator, that is fine. The best way is to implement a generic repository, let's say ICsvRepository<T>. Just search for 'Generic Repository in C#' and ignore the fact that most of the implementation examples use EntityFramework as a persistence store. Behind the interface you can make it read/write to CSV rather than to a DB.
Let's generalize it. If you have to deal with resources - files, SQL, blobs, tables, a message bus, or whatever persistent/non-persistent resource comes into or goes out of your application - you have to manipulate it through an abstraction IMyResourceManager with its corresponding manipulation methods. If you have several implementations of IMyResourceManager and you want to decide at runtime which implementation you want, then you have to make an IMyResourceManagerFactory with a component selector or factory method and place your differentiation logic there.
That is why I think you do not need a typed factory for text file read/write, but a pure ITextFileManipulator which you register in the container and get through the constructor. You may need a typed factory if you go for ICsvRepository<T>, where T is your Person class. Inside the implementation of ICsvRepository<T> you will need ICsvFileManipulator.