parsing a huge json file without increasing heap size - json

I have a problem parsing a huge JSON file (200 MB). At first I tried to use Jackson to parse the JSON as a tree, but I ran into heap size problems. For some reason, increasing the heap size is not an option.
JSON format:
{
"a1":{ "b1":{"c1":"somevalue", "c2":"somevalue"}, ... },
"a2":{ "b1":{"c1":"somevalue"},"c3":"somevalue"}, ... },
....
}
What I want to do is produce strings like:
str1 = "{ "b1":{"c1":"somevalue", "c2":"somevalue"}, ... }"
str2 = "{ "b1":{"c3":"somevalue"},"c4":"somevalue", ... }"
Is there any way to do this without heap problems?
In Python, there is a simple way to do this with no heap problem (no JVM):
data = json.loads(xxx)
for key, val in data.iteritems():
    print val
Some thoughts:
I might not need to use the Jackson tree approach, since I only want strings.
Streaming Jackson might be an option, but I have difficulty writing it because our JSON format is quite flexible. Any suggestions will be appreciated!
Thanks

Using object-based data binding is a bit more memory-efficient, so if you can define Java classes to match the structure, that is a much better way: it is faster and uses less memory.
But sometimes the tree model is needed, for cases where the structure is not known in advance.
The Streaming API can help, and you can also mix approaches: iterate over JSON tokens, and then use JsonParser.readValueAs(MyType.class) or JsonParser.readValueAsTree().
This lets you build an in-memory tree or object for only a subset of the JSON input.
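For example, a minimal sketch of that mixed approach with Jackson 2.x might look like the code below; the file name, the class wrapper, and the println are my own assumptions, not part of the original question:

import java.io.File;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HugeJsonReader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // "huge.json" is a placeholder for the 200 MB input file.
        try (JsonParser parser = mapper.getFactory().createParser(new File("huge.json"))) {
            if (parser.nextToken() != JsonToken.START_OBJECT) {
                throw new IllegalStateException("Expected a top-level JSON object");
            }
            // Walk the top-level fields ("a1", "a2", ...) one at a time.
            while (parser.nextToken() == JsonToken.FIELD_NAME) {
                String key = parser.getCurrentName();
                parser.nextToken();                          // advance to the field's value
                JsonNode value = parser.readValueAsTree();   // builds a tree for this value only
                String str = value.toString();               // e.g. {"b1":{"c1":"somevalue","c2":"somevalue"}}
                System.out.println(key + " -> " + str);
                // value and str go out of scope each iteration, so memory stays bounded
            }
        }
    }
}

Only one top-level value is held in memory at a time, which keeps heap usage flat regardless of the total file size.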

In the end I used a streaming approach. I open a stream from HTTP and each time read a fixed number of bytes into a buffer. Once I identify that I have built a valid string in the buffer, I emit the string and truncate the buffer. This way I use very little memory. Thanks!
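For reference, a rough sketch of that chunked extraction might look like the following; the URL, the 8 KB chunk size, and the brace/string tracking are assumptions about what "a valid string in the buffer" means, not the poster's actual code. It also assumes every top-level value is a JSON object, as in the format above:

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChunkedValueExtractor {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint standing in for the real HTTP source.
        try (InputStream in = new URL("http://example.com/huge.json").openStream();
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {

            char[] chunk = new char[8192];        // fixed-size read buffer (assumed size)
            StringBuilder current = new StringBuilder();
            int depth = 0;                        // brace nesting depth
            boolean inString = false;             // currently inside a JSON string literal?
            boolean escaped = false;              // previous char was a backslash?

            int n;
            while ((n = reader.read(chunk)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = chunk[i];
                    if (depth >= 2) {
                        current.append(c);        // we are inside a top-level value
                    }
                    if (inString) {
                        if (escaped) {
                            escaped = false;
                        } else if (c == '\\') {
                            escaped = true;
                        } else if (c == '"') {
                            inString = false;
                        }
                        continue;                 // braces inside strings are not structural
                    }
                    if (c == '"') {
                        inString = true;
                    } else if (c == '{') {
                        depth++;
                        if (depth == 2) {         // a top-level value starts here
                            current.setLength(0);
                            current.append(c);
                        }
                    } else if (c == '}') {
                        depth--;
                        if (depth == 1) {         // the top-level value is complete
                            System.out.println(current);   // emit the string (str1, str2, ...)
                            current.setLength(0);          // truncate the buffer
                        }
                    }
                }
            }
        }
    }
}

Hand-rolling the brace counting works, but the Jackson streaming sketch above is less fragile, since the parser already knows where each value ends.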

Related

Swift unable to preserve order in String made from JSON for hash verification

We receive a JSON object from the network along with a hash value of the object. In order to verify the hash we need to turn that JSON into a string and then make a hash of it, preserving the order of the elements as they appear in the JSON.
Say we have:
[
{"site1":
{"url":"https://this.is.site.com/",
"logoutURL":"",
"loadStart":[],
"loadStop":[{"someMore":"smthelse"}],
"there's_more": ... }
},
{"site2":
....
}
]
The Android app is able to get the same hash value, and while debugging it we fed the same simple string into both algorithms and got the same hash out of them.
The difference happens because dictionaries are an unordered structure.
While debugging we see that just before a string is fed into the hash algorithm, it looks like the original JSON, just without the indentation, which means it preserves the order of items (on Android, that is):
[{"site1":{"url":"https://this.is.site.com/", ...
Despite trying many approaches by now, I'm not able to achieve the same: the string I get has a different order and therefore results in a different hash. Is there a way to achieve this?
UPDATE
It appears the problem is slightly different, thanks to @Rob Napier's answer below: I need a hash of only a part of the incoming string (the part that contains JSON), which means that to get that part I first need to parse it into JSON or a struct, and after that, while getting the string value of it, the order of items is lost.
Using JSONSerialization and JSONDecoder (which uses JSONSerialization), it's not possible to reproduce the input data. But this isn't needed. What you're receiving is a string in the first place (as an NSData). Just don't get rid of it. You can parse the data into JSON without throwing away the data.
It is possible to create JSON parsers from scratch in Swift that maintain round-trip support (I have a sketch of such a thing at RNJSON). JSON isn't really that hard to parse. But what you're describing is a hash of "the thing you received." Not a hash of "the re-serialized JSON."

Prevent parsing a JSON node with common-lisp YASON library

I am using the YASON library in Common Lisp. I want to parse a JSON string but would like the parser to keep one of its nodes unparsed.
Typically, with an example like this:
{
"metadata1" : "mydata1",
"metadata2" : "mydata2",
"payload" : {...my long payload object},
"otherNodesToParse" : {...}
}
How can I set up the YASON parser to parse my JSON but skip the payload node and keep it as a string in JSON format?
Use case: let's say I just want the envelope data (everything that's not the payload), and to forward the payload as-is (as a JSON string) to another system.
If I parse the whole JSON (including the payload) and then re-encode the payload to JSON, it is inefficient. The payload could also be pretty big.
How do you know where the end of the payload object is in the stream? You do so by parsing the stream: if you don't parse the stream you simply can't know where the end of the object is. That's the nature of JSON's syntax (as it is the nature of CL's default syntax). For instance, the only way you can know the difference between where to continue after
{"x": 1}
and after
{"x": 1.2}
is by parsing the two things.
So you must necessarily parse the whole thing.
So the answer to your question is: you can't do this.
You could (but not, I think, with YASON) decide that you did not want to build an object as a result of the parse. And perhaps, if the stream you are parsing corresponds to something with random access like a string or a file, you could note the start and end positions in the stream to later extract a string from it corresponding to the unparsed data (or you could perhaps build it up as you go).
It looks as if some or all of this might be possible with CL-JSON, but you'd have to work at it.
Unless the objects you are reading are vast the benefit of this seems questionable-to-none. If you really do want to do something like this efficiently you need a serialisation scheme which tells you how long things are.

Parsing a json column in tsv file to Spark RDD

I'm trying to port an existing Python (PySpark) script to Scala in an effort to improve performance.
I'm having trouble with something troublingly basic though: how do I parse a JSON column in Scala?
Here is the Python version:
# Each row in file is tab separated, example:
# 2015-10-10 149775392 {"url": "http://example.com", "id": 149775392, "segments": {"completed_segments": [321, 4322, 126]}}
action_files = sc.textFile("s3://my-s3-bucket/2015/10/10/")
actions = (action_files
.map(lambda row: json.loads(row.split('\t')[-1]))
.filter(lambda a: a.get('url') != None and a.get('segments') != None and a.get('segments').get('completed_segments') != None)
.map(lambda a: (a['url'], {"url": a['url'], "action_id": a["id"], "completed_segments": a["segments"]["completed_segments"]}))
.partitionBy(100)
.persist())
Basically, I'm just trying to parse the JSON column and then transform it into a simplified version that I can process further in SparkSQL.
As a new Scala user, I'm finding that there are dozens of JSON parsing libraries for this simple task. It doesn't look like there is one in the stdlib. From what I've read so far, it looks like the language's strong typing is what makes this simple task a bit of a chore.
I'd appreciate any push in the right direction!
PS. By the way, if I'm missing something obvious that is making the PySpark version crawl, I'd love to hear about it! I'm porting a Pig Script from Hadoop/MR, and performance dropped from 17min with MR to over 5 and a half hours on Spark! I'm guessing it is serialization overhead to and from Python....
If your goal is to pass data to SparkSQL anyway and you're sure that you don't have malformed fields (I don't see any exception handling in your code) I wouldn't bother with parsing manually at all:
import sqlContext.implicits._  // needed for the $"column" syntax below
val raw = sqlContext.read.json(action_files.flatMap(_.split("\t").takeRight(1)))
val df = raw
.withColumn("completed_segments", $"segments.completed_segments")
.where($"url".isNotNull && $"completed_segments".isNotNull)
.select($"url", $"id".alias("action_id"), $"completed_segments")
Regarding your Python code:
don't use != to compare to None. A correct way is to use is / is not. It is semantically correct (None is a singleton) and significantly faster. See also PEP8
don't duplicate data unless you have to. Emitting url twice means a higher memory usage and subsequent network traffic
if you plan to use SparkSQL, checking for missing values can be performed on a DataFrame, the same as in Scala. I would also persist a DataFrame, not an RDD.
On a side note, I am rather skeptical about serialization being the real problem here. There is an overhead, but a real impact shouldn't be anywhere near what you've described.

how do i create a huge json file

I want to create a large file containing a big list of records from a database.
This file is used by another process.
When using XML I don't have to load everything into memory and can just use XML::Writer.
When using JSON we normally create a Perl data structure and use the to_json function to dump the results.
This means that I have to load everything into memory.
Is there a way to avoid it?
Is JSON suitable for large files?
Just use JSON::Streaming::Writer
Description
Most JSON libraries work in terms of in-memory data structures. In Perl, JSON
serializers often expect to be provided with a HASH or ARRAY ref containing
all of the data you want to serialize.
This library allows you to generate syntactically-correct JSON without first
assembling your complete data structure in memory. This allows large structures
to be returned without requiring those structures to be memory-resident, and
also allows parts of the output to be made available to a streaming-capable
JSON parser while the rest of the output is being generated, which may
improve performance of JSON-based network protocols.
Synopsis
my $jsonw = JSON::Streaming::Writer->for_stream($fh);
$jsonw->start_object();
$jsonw->add_simple_property("someName" => "someValue");
$jsonw->add_simple_property("someNumber" => 5);
$jsonw->start_property("someObject");
$jsonw->start_object();
$jsonw->add_simple_property("someOtherName" => "someOtherValue");
$jsonw->add_simple_property("someOtherNumber" => 6);
$jsonw->end_object();
$jsonw->end_property();
$jsonw->start_property("someArray");
$jsonw->start_array();
$jsonw->add_simple_item("anotherStringValue");
$jsonw->add_simple_item(10);
$jsonw->start_object();
# No items; this object is empty
$jsonw->end_object();
$jsonw->end_array();
$jsonw->end_property();
$jsonw->end_object();
Furthermore there is the JSON::Streaming::Reader :)

In Perl, how do you pack a hash and unpack it back to its original state?

I want to save a hash as a packed string in a DB. I've got the pack part down OK, but I'm having a problem getting the hash back.
Test hash:
my $hash = {
    test_string   => 'apples,bananas,oranges',
    test_subhash  => { like => 'apples' },
    test_subarray => [ 'red', 'yellow', 'orange' ],
};
I thought maybe I could use JSON::XS as in this example to convert the hash to a JSON string, and then pack the JSON string...
Thoughts on this approach?
Storable is capable of storing Perl structures very precisely. If you need to remember that something is a weak reference, etc, you want Storable. Otherwise, I'd avoid it.
JSON (Cpanel::JSON::XS) and YAML are good choices.
You can have problems if you store something using one version of Storable and try to retrieve it using an earlier version. That means all machines that access the database must have the same version of Storable.
Cpanel::JSON::XS is faster than Storable.
A fast YAML module is probably faster than Storable.
JSON can't store objects, but YAML and Storable can.
JSON and YAML are human readable (well, for some humans).
JSON and YAML are easy to parse and generate in other languages.
Usage:
my $for_the_db = encode_json($hash);
my $hash = decode_json($from_the_db);
I don't know what you mean by "packing". The string produced by Cpanel::JSON::XS's encode_json can be stored as-is into a BLOB field, while the string produced by Cpanel::JSON::XS->new->encode can be stored as-is into a TEXT field.
You may want to give the Storable module a whirl.
It can:
store your hash(ref) as a string with freeze
thaw it out at the time of retrieval
There are a lot of different ways to store a data structure in a scalar and then "restore" it back to its original state. There are advantages and disadvantages to each.
Since you started with JSON, I'll show you an example using it.
use JSON;
my $hash = {
    test_string   => 'apples,bananas,oranges',
    test_subhash  => { like => 'apples' },
    test_subarray => [ 'red', 'yellow', 'orange' ],
};
my $stored = encode_json($hash);
my $restored = decode_json($stored);
Storable, as was already suggested, is also a good idea. But it can be rather quirky. It's great if you just want your own script/system to store and restore the data, but beyond that, it can be a pain in the butt. Even transferring data across different operating systems can cause problems. It was recommended that you use freeze, and for most local applications, that's the right call. If you decide to use Storable for sending data across multiple machines, look at using nfreeze instead.
That being said, there are a ton of encoding methods that can handle "storing" data structures. Look at YAML or XML.
I'm not quite sure what you mean by "convert the hash to a JSON string, and then packing the JSON string". What further "packing" is required? Or did you mean "storing"?
There's a number of alternative methods for storing hashes in a database.
As Zaid suggested, you can use Storable to freeze and thaw your hash. This is likely to be the fastest method (although you should benchmark with the data you're using if speed is critical). But Storable uses a binary format which is not human readable, which means that you will only be able to access this field using Perl.
As you suggested, you can store the hash as a JSON string. JSON has the advantage of being fairly human readable, and there are JSON libraries for most any language, making it easy to access your database field from something other than Perl.
You can also switch to a document-oriented database like CouchDB or MongoDB, but that's a much bigger step.