I had a Kinesis stream containing binary ION records, and I needed to read that stream in Flink.
My solution, 2 years ago, was to write a Base64 SerDe (just about 20 lines of Java code) and use that for the KinesisConsumer.
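For reference, the core of such a SerDe is just a Base64 round trip over the raw record bytes. Below is a minimal Python sketch of that transformation only. The function names `encode_record` and `decode_record` are made up for illustration, and wiring them into a Flink or PyFlink Kinesis source is not shown, since that part depends on the connector API.

```python
import base64

# The four-byte version marker that starts every binary Ion 1.0 stream.
ION_BVM = b"\xe0\x01\x00\xea"


def encode_record(raw: bytes) -> str:
    """Turn raw (binary) record bytes into a Base64 text string."""
    return base64.b64encode(raw).decode("ascii")


def decode_record(text: str) -> bytes:
    """Recover the original record bytes from the Base64 string."""
    return base64.b64decode(text, validate=True)


if __name__ == "__main__":
    # Hypothetical payload: the Ion version marker followed by arbitrary bytes.
    record = ION_BVM + bytes([0x21, 0x2A])
    encoded = encode_record(record)
    assert decode_record(encoded) == record
    print(encoded)
```

The encoded string can travel through any string-based schema, and the consumer decodes it back to bytes before handing it to an Ion reader.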
Now I have the same requirements, but need to use PyFlink.
I guess I could create the same Java file, compile and package it in a jar, add that as a dependency for pyflink, and write a Python wrapper...
That sounds like a lot of effort for a task that seems so simple.
Is there a simpler way? For example, some config in the Kinesis Java SDK that Base64-encodes each record before handing it to the SDK user (Flink), which is what the AWS CLI does. Or, even simpler, a way to have the Java SDK convert an ION record from binary to text mode.
Thanks!
I'm working on a project that uses parallel methods to convert text from one form to another. We're going to implement a CSV to JSON converter to demonstrate the speedups that are possible using our parallel framework.
We want to benchmark our converter once it's finished. What are the fastest libraries, stand-alone programs, etc. that are capable of CSV-to-JSON conversion? I found a list of potential candidates here: Large CSV to JSON/Object in Node.js, but I'm not sure how fast the listed options are. In the worst case I'll benchmark them myself, but if someone already knows what the "best in class" converters are, it would save me some time.
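If it comes to benchmarking by hand, a single-threaded baseline is a useful reference point. The following is a minimal sketch using only the Python standard library. The names `csv_to_json` and `time_converter` are hypothetical, it assumes the first CSV row is a header, and it keeps every value as a string.

```python
import csv
import io
import json
import time


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text (first row = header) to a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader))


def time_converter(convert, csv_text: str, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        convert(csv_text)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Synthetic input: 100,000 rows with three columns.
    rows = "\n".join(f"{i},name{i},{i * 0.5}" for i in range(100_000))
    sample = "id,name,score\n" + rows + "\n"
    print(f"baseline: {time_converter(csv_to_json, sample):.3f} s")
```

Any other converter that takes CSV text and returns JSON text can be passed to `time_converter` in the same way, so the comparison runs on identical input.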
Looks like the maintainer of csvtojson has developed a benchmark application. I think I can add my csv to json converter to his benchmark project to test my converter.
If your project can consider in-browser apps, I suggest csvtojson, as it is by far the fastest converter on the market as of 2017.
I created it myself, so I may be a bit biased, but I developed it specifically for a larger project that required heavy CSV-to-JSON crunching.
Let me know if it helped.
I have the following setup (that I cannot change) and I'd like some advice from people who have been down that road. I'm not sure if this is the right place to ask, but here goes anyway.
Various JSON messages are placed on different channels of a JMS queue (Universal Messaging/webMethods).
Before the data can be stored in relational-style DBs, it has to be transformed: fields renamed, arrays flattened, and some structures extracted from nested objects.
Data has to be appended to MySQL (as a serving layer for a visualization tool) and Hive (for long-term storage).
We're stuck on Spark 1.4.1 and may move to 1.6.0 in a few months' time. So, structured streaming is not (yet) an option.
At some point the events will be streamed directly to real-time dashboards, so having something in place that is capable of doing that now would be ideal.
Ideally, coding is done in Scala (because we already have a considerable batch-based repo with Spark and Scala), so the minimal requirement is JVM-based.
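To make the transformation step concrete, here is a minimal sketch of the kind of reshaping meant above, written in Python for brevity even though the real job would be JVM-based. It assumes "flattening" means turning nested keys and array positions into column-like names. The function `flatten`, the `renames` mapping, and the sample message are all hypothetical.

```python
def flatten(message, renames=None, sep="_"):
    """Flatten nested dicts and lists into one flat dict of scalar values.

    Nested keys are joined with `sep`; list items use their index as the key.
    `renames` maps a flattened name to the column name wanted in the DB.
    """
    renames = renames or {}
    flat = {}

    def walk(value, prefix):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{prefix}{sep}{key}" if prefix else str(key))
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(child, f"{prefix}{sep}{index}" if prefix else str(index))
        else:
            # Scalar leaf: store it under the (possibly renamed) flat key.
            flat[renames.get(prefix, prefix)] = value

    walk(message, "")
    return flat


if __name__ == "__main__":
    event = {"id": 7, "user": {"name": "ann"}, "tags": ["a", "b"]}
    print(flatten(event, renames={"user_name": "username"}))
```

If arrays should instead become one row per element, the list branch would need to emit several flat records rather than indexed columns.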
I've looked at Spark Streaming but it does not have a JMS adapter and as far as I can tell operating on JSON would be done using a SQLContext instance on the DStream's RDDs. I understand that it's possible to write a custom adapter, but then I'm not sure if Spark is still the best/easiest solution. I've also looked at the doc for Samza and Flink but did not find much for JMS and/or JSON, at least not natively.
Apache Camel seems like it might have a substantial set of connectors, but I'm not too familiar with it, and I get the impression it does not do the streaming part, 'just' the bit where you connect to various systems. There's also Akka, although I get the impression it's more of a replacement for messaging systems, and JMS is a given here.
There is an almost bewildering number of available tools, and at this point I'm at a loss as to what to look at or what to look out for. Based on your experience, what do you recommend I use to pick up the messages, transform them, and insert them into Hive and MySQL?
The new Play 2.4 has added out-of-the-box support for JSON Writes and Reads for the new Java 8 time classes, but Play 2.3.x is still stuck with Joda time JSON support only. Is there a way to get Java 8 time JSON support on 2.3.x? What would the custom Reads and Writes for ZonedDateTime look like?
You can copy the Play 2.4 Writes and Reads code directly from their source code, or read it and adapt your own:
Writes:
https://github.com/playframework/playframework/blob/702e89841fc54f5603a0d981c3488ed9883561fe/framework/src/play-json/src/main/scala/play/api/libs/json/Writes.scala
Reads:
https://github.com/playframework/playframework/blob/cde65d987b6cf3c307dfab8269b87a65c5e84575/framework/src/play-json/src/main/scala/play/api/libs/json/Reads.scala
If you copy the files wholesale and remove the contravariant functor reads/writes, they will have no external dependencies beyond Java8 & Scala.
I'm obviously not advocating this kind of copy & paste in general, but I don't see that it would do any harm here, as it's just a stop-gap until your project migrates to Play 2.4, at which point the copied files can be deleted.
I'm writing data to a JSON file in Processing with the saveJSONObject command. I would like to access that JSON file with another program (MAX/MSP) while my sketch is still open. The problem is, MAX is unable to read from the file while my sketch is running. Only after I close the sketch is MAX able to import data from my file.
Is Processing keeping that file open somehow while the sketch is running? Is there any way I can get around this problem?
It might be easier to stream your data straight to MaxMSP using the OSC protocol. On the Processing side, have a look at the oscP5 library and on the Max side at the udpreceive object.
You could send your JSON object as a string and unpack that in Max (maybe using the JavaScript support already present in Max), but it might be simpler to mimic the structure of your JSON object as the arguments of the OSC message, which you can then unpack in Max directly.
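To show what "JSON fields as OSC arguments" looks like on the wire, here is a minimal sketch that builds an OSC 1.0 message by hand and sends it over UDP. It is written in Python with the standard library only, not with oscP5, so treat it as an illustration of the message layout rather than Processing code. The address `/sketch/data`, the port 7400, and the sample fields are assumptions; the port must match whatever udpreceive listens on.

```python
import socket
import struct


def osc_string(text: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_message(address: str, args) -> bytes:
    """Build an OSC 1.0 message from int, float and str arguments."""
    type_tags = ","
    payload = b""
    for value in args:
        if isinstance(value, bool):
            raise TypeError("bool arguments are not handled in this sketch")
        if isinstance(value, int):
            type_tags += "i"
            payload += struct.pack(">i", value)  # big-endian int32
        elif isinstance(value, float):
            type_tags += "f"
            payload += struct.pack(">f", value)  # big-endian float32
        elif isinstance(value, str):
            type_tags += "s"
            payload += osc_string(value)
        else:
            raise TypeError(f"unsupported argument type: {type(value).__name__}")
    return osc_string(address) + osc_string(type_tags) + payload


def send(packet: bytes, host: str = "127.0.0.1", port: int = 7400) -> None:
    """Send one OSC packet over UDP (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


if __name__ == "__main__":
    data = {"x": 12, "y": 0.5, "label": "ball"}  # stand-in for the JSON object
    send(osc_message("/sketch/data", [data["x"], data["y"], data["label"]]))
```

Because nothing is written to disk, the file-locking question does not arise with this approach.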
Probably, because I/O is usually buffered (notably for performance reasons, and also because the hardware does I/O in blocks).
Try to flush the output channel, perhaps using PrintWriter::flush or something similar.
Details are implementation specific (and might be operating system specific).
I'd like to process very large JSON files (about 400 MB each) in Scala.
My use-case is batch-processing. I can receive several very big files (up to 20 GB, then cut to be processed) at the same moment and I really want to process them quickly as a queue (but it's not the subject of this post!). So it's really about distributed architecture and performance issues.
My JSON file format is an array of objects, each JSON object contains at least 20 fields. My flow is composed of two major steps. The first one is the mapping of the JSON object into a Scala object. And the second step is some transformations I'm making on the Scala object data.
To avoid loading the whole file into memory, I'd like a parsing library that supports incremental parsing. There are so many libraries (Play-JSON, Jerkson, Lift-JSON, the built-in scala.util.parsing.json.JSON, Gson) and I cannot figure out which one to choose, given the requirement to minimize dependencies.
Do you have any ideas for a library I can use for high-volume parsing with good performance?
Also, I'm looking for a way to parallelize the mapping of the JSON file and the transformations made on the fields (across several nodes).
Do you think I can use Apache Spark to do it? Or are there alternative ways to accelerate/distribute the mapping/transformation?
Thanks for any help.
Best regards, Thomas
Considering a scenario without Spark, I would advise streaming the JSON with Jackson Streaming (Java), mapping each JSON object to a Scala case class, and sending them to an Akka router with several routees that do the transformation part in parallel.
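The streaming idea itself is independent of Jackson. As a rough illustration, the sketch below reads a top-level JSON array of objects in fixed-size chunks and yields one object at a time, so the whole file is never held in memory. It uses only the Python standard library as a stand-in and is not Jackson's API. The name `iter_json_objects` is hypothetical, it only handles arrays whose elements are objects, and it is lenient about separators.

```python
import io
import json


def iter_json_objects(stream, chunk_size=64 * 1024):
    """Yield the objects of a top-level JSON array, reading `stream` in chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    opened = False
    while True:
        buffer = buffer.lstrip()
        if buffer and not opened:
            if buffer[0] != "[":
                raise ValueError("expected a top-level JSON array")
            opened = True
            buffer = buffer[1:]
            continue
        if buffer and opened:
            if buffer[0] == "]":
                return  # end of the array
            if buffer[0] == ",":
                buffer = buffer[1:]
                continue
            if buffer[0] != "{":
                raise ValueError("this sketch only handles arrays of objects")
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                pass  # object not complete yet: fall through and read more
            else:
                yield obj
                buffer = buffer[end:]
                continue
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("unexpected end of input")
        buffer += chunk


if __name__ == "__main__":
    text = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
    for record in iter_json_objects(io.StringIO(text), chunk_size=8):
        print(record)  # each record could be handed to a worker here
```

Each yielded object plays the role of the per-object mapping step. Handing the objects to parallel workers, as with the Akka routees, would sit on top of this loop.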