I'd like to process very large JSON files (about 400 MB each) in Scala.
My use case is batch processing. I can receive several very big files (up to 20 GB, which are then cut into pieces before processing) at the same moment, and I really want to process them quickly as a queue (but that's not the subject of this post!). So it's really about distributed architecture and performance.
My JSON file format is an array of objects, and each JSON object contains at least 20 fields. My flow is composed of two major steps. The first is the mapping of each JSON object to a Scala object. The second is a set of transformations I make on the Scala object's data.
To avoid loading the whole file into memory, I'd like a parsing library that supports incremental parsing. There are so many libraries (Play-JSON, Jerkson, Lift-JSON, the built-in scala.util.parsing.json.JSON, Gson) that I cannot figure out which one to pick, given the requirement to minimize dependencies.
Can you suggest a library I can use for high-volume parsing with good performance?
I'm also looking for a way to parallelize the mapping of the JSON file and the transformations made on the fields (across several nodes).
Do you think I can use Apache Spark to do it? Or are there alternative ways to accelerate/distribute the mapping/transformation?
Thanks for any help.
Best regards, Thomas
Considering a scenario without Spark, I would advise streaming the JSON with Jackson Streaming (Java) (see for example there), mapping each JSON object to a Scala case class, and sending the instances to an Akka router with several routees that do the transformation part in parallel.
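A minimal sketch of that pipeline, assuming Jackson (with jackson-module-scala) and classic Akka actors on the classpath; the Record case class, the pool size, and the file name are placeholders for your own setup:

```scala
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool
import java.io.File

// Placeholder for your real ~20-field object.
case class Record(id: String, value: Double)

// Each routee applies the transformation step to one record at a time.
class Transformer extends Actor {
  def receive = { case r: Record => () /* transformation logic goes here */ }
}

object StreamJson extends App {
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  val system = ActorSystem("batch")
  val router = system.actorOf(RoundRobinPool(8).props(Props[Transformer]()), "workers")

  // Jackson streaming: walk the top-level array and bind one object at a time,
  // so only the current object is ever held in memory.
  val parser = mapper.getFactory.createParser(new File("big.json"))
  require(parser.nextToken() == JsonToken.START_ARRAY, "expected a top-level JSON array")
  while (parser.nextToken() == JsonToken.START_OBJECT) {
    val record = mapper.readValue(parser, classOf[Record])
    router ! record // fan out; routees transform in parallel
  }
  parser.close()
  // In a real job you would wait for the routees to drain before terminating the ActorSystem.
}
```

In practice you would also want some form of backpressure (a bounded mailbox, work-pulling, or Akka Streams) so the parser cannot outrun the transformation workers.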
We often work with scientific datasets distributed as small (<10G compressed), individual, but complex files (xml/json/parquet). UniProt is one example, and here is a schema for it.
We typically process data like this using Spark since it is supported well. I wanted to see, though, what might exist for doing work like this with the Dataframe or Bag APIs. A few specific questions I had are:
Does anything exist for this other than writing custom Python functions for Bag.map or Dataframe/Series.apply?
Given any dataset compatible with Parquet, are there any secondary ecosystems of more generic (possibly JIT-compiled) functions for at least doing simple things like querying individual fields along an XML/JSON path?
Has anybody done work to efficiently infer a nested schema from XML/JSON? Even if that schema was an object that Dask/Pandas can't use, simply knowing it would be helpful for figuring out how to write functions for something like Bag.map. I know there are a ton of Python JSON schema inference libraries, but none of them look to be compiled or otherwise built for performance when applied to thousands or millions of individual JSON objects.
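For context, the "custom functions" approach mentioned above currently looks roughly like this (a sketch; the path pattern, the one-record-per-line layout, and the field names are invented for illustration):

```python
import json
import dask.bag as db

# One JSON object per line is assumed here; a single deeply nested document
# per file would need a custom loader instead of read_text.
records = db.read_text("data/part-*.json").map(json.loads)

# "Query a field along a path" today means spelling the path out by hand:
def protein_names(rec):
    return [p.get("name") for p in rec.get("proteins", [])]

names = records.map(protein_names)
print(names.take(5))
```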
I have the following setup (that I cannot change) and I'd like some advice from people who have been down that road. I'm not sure if this is the right place to ask, but here goes anyway.
Various JSON messages are placed on different channels of a JMS queue (Universal Messaging/webMethods).
Before the data can be stored in relational-style DBs, it has to be transformed: fields renamed, arrays flattened, and some structures extracted from nested objects.
Data has to be appended to MySQL (as a serving layer for a visualization tool) and Hive (for long-term storage).
We're stuck on Spark 1.4.1 and may move to 1.6.0 in a few months' time. So, structured streaming is not (yet) an option.
At some point the events will be streamed directly to real-time dashboards, so having something in place that is capable of doing that now would be ideal.
Ideally, coding would be done in Scala (because we already have a considerable batch-based repo with Spark and Scala), so the minimum requirement is that the solution be JVM-based.
I've looked at Spark Streaming, but it does not have a JMS adapter, and as far as I can tell, operating on JSON would be done using a SQLContext instance on the DStream's RDDs. I understand that it's possible to write a custom adapter, but then I'm not sure whether Spark is still the best/easiest solution. I've also looked at the docs for Samza and Flink but did not find much for JMS and/or JSON, at least not natively.
Apache Camel seems like it might have a substantial set of connectors, but I'm not too familiar with it, and I get the impression it does not do the streaming part, 'just' the bit where you connect to various systems. There's also Akka, although I get the impression it's more of a replacement for messaging systems, and JMS is a given here.
There is an almost bewildering number of available tools, and at this point I'm at a loss as to what to look at or what to look out for. Based on your experience, what do you recommend I use to pick up the messages, transform them, and insert them into Hive and MySQL?
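For what it's worth, the "custom adapter" route mentioned above is not a lot of code. A rough, hedged sketch of a JMS receiver for Spark Streaming (the JNDI lookup name and queue name are placeholders, and error handling/reconnection is omitted):

```scala
import javax.jms.{Connection, ConnectionFactory, Message, MessageListener, Session, TextMessage}
import javax.naming.InitialContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Receives text messages (JSON strings) from a JMS queue and hands them to Spark.
class JmsReceiver(jndiName: String, queueName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  @transient private var connection: Connection = _

  def onStart(): Unit = {
    // The factory is looked up here, not in the constructor, so the receiver stays serializable.
    val factory  = new InitialContext().lookup(jndiName).asInstanceOf[ConnectionFactory]
    connection   = factory.createConnection()
    val session  = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createQueue(queueName))
    consumer.setMessageListener(new MessageListener {
      def onMessage(msg: Message): Unit = msg match {
        case text: TextMessage => store(text.getText) // JSON payload goes into the DStream
        case _                 => // ignore non-text messages
      }
    })
    connection.start()
  }

  def onStop(): Unit = if (connection != null) connection.close()
}

// Usage: ssc.receiverStream(new JmsReceiver("ConnectionFactory", "my.queue"))
// yields a DStream[String] of JSON that you can parse, transform, and write to MySQL/Hive.
```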
I have a large JSON file; its size is 5.09 GB. I want to convert it to an XML file. I tried online converters, but the file is too large for them. Does anyone know how to do that?
The typical way to process XML as well as JSON files is to load them completely into memory. You then have a so-called DOM, which allows various kinds of data processing. But neither XML nor JSON is really designed for storing as much data as you have here. In my experience, you will typically run into memory problems as soon as you exceed a limit of around 200 MB. This is because the DOM is composed of many individual objects, which results in a huge memory overhead that far exceeds the amount of data you actually want to process.
The only way for you to process files like that is basically to take a streaming approach. The basic idea: instead of parsing the whole file and loading it into memory, you parse and process it "on the fly". As data is read, it is parsed and events are triggered, to which your software can react and perform actions as needed. (For details, have a look at the SAX API to understand this concept in more depth.)
As you stated, you are processing JSON, not XML. Streaming APIs for JSON are available in the wild as well. In any case, you could implement one fairly easily yourself: JSON is a pretty simple data format.
Nevertheless, such an approach is not optimal: it typically results in very slow data processing because of the millions of method invocations involved. For every item encountered, you usually need to call a method in order to perform some data processing task. This, together with the additional checks about what kind of information you have currently encountered in the stream, slows processing down considerably.
You really should consider a different kind of approach: first split your file into many small ones, then process those. This might not seem very elegant, but it helps keep your task much simpler, and you gain a major advantage: it will be much easier for you to debug your software. Unfortunately you are not very specific about your problem, so I can only guess, but large files typically imply a pretty complex data model. Therefore you will probably be much better off having many small files instead of a single huge one. It also lets you later dig into individual aspects of your data and of the processing as needed. You will probably fail to get any detailed insight into that while working on a single 5 GB file, and on errors you will have trouble identifying which part of the huge file is causing the problem.
As I already stated, you are unfortunately very unspecific about your problem. Without more details about it (and about your data in particular), I can only give you these general recommendations about data processing; I cannot tell you which approach will work best in your case.
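To make the streaming-plus-splitting idea concrete, here is a rough sketch using Jackson's streaming API in Scala (the file names and chunk size are arbitrary placeholders). It copies objects out of one huge top-level JSON array into smaller array files without ever building a DOM; each chunk can then be converted to XML or processed independently:

```scala
import com.fasterxml.jackson.core.{JsonEncoding, JsonFactory, JsonToken}
import java.io.File

object SplitJsonArray extends App {
  val factory   = new JsonFactory()
  val chunkSize = 10000 // objects per output file; tune to taste

  val parser = factory.createParser(new File("huge.json"))
  require(parser.nextToken() == JsonToken.START_ARRAY, "expected a top-level JSON array")

  var chunk = 0
  var count = 0
  var gen = factory.createGenerator(new File(f"chunk-$chunk%05d.json"), JsonEncoding.UTF8)
  gen.writeStartArray()

  while (parser.nextToken() == JsonToken.START_OBJECT) {
    gen.copyCurrentStructure(parser) // copy one object verbatim, token by token
    count += 1
    if (count == chunkSize) {        // close the current chunk and open the next one
      gen.writeEndArray(); gen.close()
      chunk += 1; count = 0
      gen = factory.createGenerator(new File(f"chunk-$chunk%05d.json"), JsonEncoding.UTF8)
      gen.writeStartArray()
    }
  }
  gen.writeEndArray(); gen.close()
  parser.close()
}
```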
Scenario: I have arbitrary JSON documents, ranging in size from 300 KB to 6.5 MB, read from MongoDB. Since the data is very much arbitrary/dynamic, I cannot define a struct type in Go, so I am using the map[string]interface{} type, and the JSON string is parsed using encoding/json's Unmarshal method. This is somewhat similar to what is described in Generic JSON with interface{}.
Issue: The problem is that it takes a long time (around 30 ms to 180 ms) to parse the JSON string into a map[string]interface{} (compared to PHP parsing JSON using json_encode/decode, igbinary, or msgpack).
Question: Is there any way to pre-process it and store it in a cache?
I mean, parse the string into a map[string]interface{}, serialize it, and store it in some cache; then, when we retrieve it, deserialization should not take much time and we can proceed with execution.
Note: I am a newbie with Go; any suggestions are highly appreciated. Thanks.
Update: Serialization using Gob, the built-in binary package, and a MessagePack implementation for Go have already been tried. No luck; there was no improvement in deserialization time.
The standard library package for JSON is notoriously slow. There is a good reason for that: it uses reflection (RTTI) to provide an interface that is both really flexible and really simple. Hence the unmarshalling being slower than PHP's.
Fortunately, there is an alternative, which is to implement the json.Unmarshaler interface on the types you want to use. If this interface is implemented, the package will use it instead of its standard reflection-based method, so you can see huge performance boosts.
And to help you, a small group of tools has appeared, among them:
https://godoc.org/github.com/benbjohnson/megajson
https://godoc.org/github.com/pquerna/ffjson
(I'm listing the main players from memory; there must be others.)
These tools generate tailored implementations of the json.Unmarshaler interface for the types you request. And with go:generate, you can even integrate them seamlessly into your build step.
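A minimal sketch of the mechanism in plain Go (no generator involved; the Record type and its fields are invented for illustration). The hand-written method below just delegates back to encoding/json through a type alias, so it only shows where the hook lives; the generators listed above emit a real token-level decoder in its place:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is a stand-in for one of your own concrete types.
// With a generator such as ffjson installed, you would delete the hand-written
// method below and let a directive like `//go:generate ffjson $GOFILE` emit it.
type Record struct {
	ID    string  `json:"id"`
	Value float64 `json:"value"`
}

// UnmarshalJSON makes Record satisfy json.Unmarshaler, so encoding/json calls
// this method instead of its reflection-based decoder when decoding into *Record.
func (r *Record) UnmarshalJSON(data []byte) error {
	type plain Record // alias without methods, to avoid infinite recursion
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	*r = Record(p)
	return nil
}

func main() {
	var rec Record
	if err := json.Unmarshal([]byte(`{"id":"a","value":1.5}`), &rec); err != nil {
		panic(err)
	}
	fmt.Println(rec) // prints: {a 1.5}
}
```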
I am running into problems where some of our data stores are seeing a lot of throughput. We are using POJOs serialized to JSON using Jackson. What are some of the ways we can compress JSON data?
One initial thought suggested using BSON, but apparently it's not much smaller than JSON.
Check out CJSON.
You can see some comparisons here.
If you're not wedded to JSON you could try MessagePack:
MessagePack is a binary-based, efficient object serialization library. It enables the exchange of structured objects between many languages, like JSON. But unlike JSON, it is very fast and small.
There are implementations in many languages.
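Since the question already serializes POJOs with Jackson, one low-friction way to try MessagePack is its Jackson dataformat backend, so the data-binding code stays the same and only the factory changes. A sketch, assuming the org.msgpack:jackson-dataformat-msgpack dependency (plus jackson-module-scala here); the Payload class is a placeholder:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.msgpack.jackson.dataformat.MessagePackFactory

// Placeholder for one of your own serialized objects.
case class Payload(id: String, values: Seq[Double])

object MsgPackExample extends App {
  val jsonMapper    = new ObjectMapper().registerModule(DefaultScalaModule)
  val msgpackMapper = new ObjectMapper(new MessagePackFactory()).registerModule(DefaultScalaModule)

  val payload = Payload("abc", Seq(1.0, 2.0, 3.0))

  // Same data binding, different wire format.
  val jsonBytes    = jsonMapper.writeValueAsBytes(payload)
  val msgpackBytes = msgpackMapper.writeValueAsBytes(payload)
  println(s"JSON: ${jsonBytes.length} bytes, MessagePack: ${msgpackBytes.length} bytes")

  // Reading back uses the familiar ObjectMapper API as well.
  println(msgpackMapper.readValue(msgpackBytes, classOf[Payload]))
}
```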