I want to create a large file containing a big list of records from a database.
This file is used by another process.
When using XML I don't have to load everything into memory and can just use XML::Writer.
With JSON we normally create a Perl data structure and use the to_json function to dump the result.
This means that I have to load everything into memory.
Is there a way to avoid it?
Is JSON suitable for large files?
Just use JSON::Streaming::Writer
Description
Most JSON libraries work in terms of in-memory data structures. In Perl, JSON
serializers often expect to be provided with a HASH or ARRAY ref containing
all of the data you want to serialize.
This library allows you to generate syntactically-correct JSON without first
assembling your complete data structure in memory. This allows large structures
to be returned without requiring those structures to be memory-resident, and
also allows parts of the output to be made available to a streaming-capable
JSON parser while the rest of the output is being generated, which may
improve performance of JSON-based network protocols.
Synopsis
my $jsonw = JSON::Streaming::Writer->for_stream($fh);
$jsonw->start_object();
$jsonw->add_simple_property("someName" => "someValue");
$jsonw->add_simple_property("someNumber" => 5);
$jsonw->start_property("someObject");
$jsonw->start_object();
$jsonw->add_simple_property("someOtherName" => "someOtherValue");
$jsonw->add_simple_property("someOtherNumber" => 6);
$jsonw->end_object();
$jsonw->end_property();
$jsonw->start_property("someArray");
$jsonw->start_array();
$jsonw->add_simple_item("anotherStringValue");
$jsonw->add_simple_item(10);
$jsonw->start_object();
# No items; this object is empty
$jsonw->end_object();
$jsonw->end_array();
$jsonw->end_property(); # closes "someArray"
$jsonw->end_object(); # closes the top-level object
Furthermore there is the JSON::Streaming::Reader :)
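The same idea, sketched in Python rather than Perl: write the array brackets and commas by hand and serialize one record at a time, so the full list never has to be in memory. This is only an illustration of the technique, not the JSON::Streaming::Writer API; write_json_array and fake_rows are hypothetical names, and fake_rows stands in for a real database cursor.

```python
import io
import json


def write_json_array(fh, records):
    """Write an iterable of records to fh as one JSON array, one record at a time."""
    fh.write("[")
    first = True
    for record in records:
        if not first:
            fh.write(",")  # separator between array items
        fh.write("\n")
        json.dump(record, fh)  # only this single record is serialized here
        first = False
    fh.write("\n]\n")


def fake_rows(n):
    # Hypothetical stand-in for a database cursor: yields rows lazily.
    for i in range(n):
        yield {"id": i, "name": f"record-{i}"}


buf = io.StringIO()  # any writable text file object would work here
write_json_array(buf, fake_rows(3))
output = buf.getvalue()
```

Because the records come from a generator, memory use is bounded by the largest single record, not by the size of the whole file.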
Related
Background: I want to store a dict object in json format that has say, 2 entries:
(1) Some object that describes the data in (2). This is small data, mostly definitions, control parameters, and other things (maybe called metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, which is a large chunk and should be machine readable (no need for a human to look over it on opening the file).
Problem: How can I specify a custom indent, say 4 for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4), where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data is indented too, which adds a lot of whitespace and makes the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, and to use your JSON formatter as appropriate for each of the contained objects.
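A minimal sketch of those steps, assuming the container has exactly the two keys from the question. The function name dump_mixed and the sample values are made up for illustration; io.StringIO stands in for the real output file.

```python
import io
import json


def dump_mixed(fh, small_description, big_chunk):
    """Write the container by hand; let json format each contained object."""
    fh.write('{"meta_data":\n')
    json.dump(small_description, fh, indent=4)  # human-readable part
    fh.write(',\n"actual_data":\n')
    json.dump(big_chunk, fh, separators=(",", ":"))  # compact part, no extra whitespace
    fh.write("\n}\n")


# Hypothetical sample data, only for demonstration.
small_description = {"units": "mV", "sample_rate_hz": 1000}
big_chunk = list(range(10))

buf = io.StringIO()
dump_mixed(buf, small_description, big_chunk)
text = buf.getvalue()
```

The result is still a single valid JSON document, so a plain json.load on the reading side works unchanged.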
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See "How do I use the 'json' module to read in one JSON object at a time?" for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
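For illustration, here is a sketch of writing and reading that interleaved format using only the standard library (json.JSONDecoder.raw_decode) instead of ijson. The helper names are hypothetical, and this version reads the whole text into memory, so for really large files a streaming parser such as ijson would be the better fit.

```python
import json


def write_sections(sections):
    """Serialize (name, value, indent) triples as concatenated JSON documents."""
    parts = []
    for name, value, indent in sections:
        parts.append(json.dumps({"next_item": name}))  # header document
        parts.append(json.dumps(value, indent=indent))  # payload with its own indent
    return "\n".join(parts) + "\n"


def iter_documents(text):
    """Yield each top-level JSON document found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        doc, pos = decoder.raw_decode(text, pos)
        yield doc


def read_sections(text):
    """Pair each header document with the payload document that follows it."""
    docs = iter_documents(text)
    return {header["next_item"]: value for header, value in zip(docs, docs)}


text = write_sections([
    ("meta_data", {"description": "human-readable content goes here"}, 4),
    ("actual_data", ["big", "machine-readable", "unformatted"], None),
])
sections = read_sections(text)
```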
Suppose we are developing an application that pulls Avro records from a source
stream (e.g. Kafka/Kinesis/etc.), parses them into JSON, then further processes that
JSON with additional transformations. Further assume these records can have a
varying schema (which we can look up and fetch from a registry).
We would like to use Spark's built-in from_avro function, but it is pretty clear that
Spark's from_avro wants you to hard-code a fixed schema into your code. It doesn't seem
to allow the schema to vary from one incoming row to the next.
That sort of makes sense if you are parsing the Avro into the internal row format: one would need
a consistent structure for the DataFrame. But what if we wanted something like
from_avro that grabbed the bytes from one column in the row, grabbed the string
representation of the Avro schema from another column in the row, and then parsed that Avro
into a JSON string?
Does such a built-in method exist? Or is such functionality available in a third-party library?
Thanks !
I'm new to Structured Streaming, and I'd like to know whether there is a way to specify the schema of the Kafka value, as we do in normal structured streaming jobs. The Kafka value is a syslog-like CSV with 50+ fields, and splitting it manually is painfully slow.
Here's the relevant part of my code (see full gist here):
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "myserver:9092")
.option("subscribe", "mytopic")
.load()
.select(split('value, """\^""") as "raw")
.select(ColumnExplode('raw, schema.size): _*) // flatten WrappedArray
.toDF(schema.fieldNames: _*) // apply column names
.select(fieldsWithTypeFix: _*) // cast column types from string
.select(schema.fieldNames.map(col): _*) // re-order columns, as defined in schema
.writeStream.format("console").start()
With no further operations, I can only achieve roughly 10 MB/s throughput on a 24-core server with 128 GB of memory. Would it help if I converted the syslog to JSON beforehand? In that case I could use from_json with a schema, and maybe it would be faster.
"Is there a way to specify Kafka value's schema like what we do in normal structured streaming jobs?"
No. The so-called output schema of the Kafka external data source is fixed and cannot be changed. See this line.
"Would it help if I convert the syslog to JSON in prior? In that case I can use from_json with schema, and maybe it will be faster."
I don't think so. I'd even say that CSV is a simpler text format than JSON (as there's usually just a single separator).
Using the standard split function is the way to go, and I think you can hardly get better performance, since all it has to do is split a row and take every element to build the final output.
I have a huge number (35k) of small (16 KB) JSON files stored in an S3 bucket. I need to load them into a DataFrame for further processing; here is my code for the extract:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
.where($"nod1.filter1"==="filterValue")
.where($"nod2.subNode1.subSubNode2.created"(0)==="filterValue2")
I'm storing this data in a temp table and using it for further operations (exploding nested structures into separate data frames):
jsonData.registerTempTable("jsonData")
So now I have an autogenerated schema for this deeply nested DataFrame.
With the above code I have terrible performance issues. I presume they are caused by not using sc.parallelize during the bucket load; moreover, I'm pretty sure that the schema auto-generation in the read.json() method takes a lot of time.
Questions:
What should my bucket load look like to be more efficient and faster?
Is there any way to declare this schema in advance (I need to work around the case class tuple problem though) to avoid auto-generation?
Does filtering data during the load make sense, or should I simply load everything and filter afterwards?
Found so far:
sqlContext.jsonRDD(rdd, schema)
It handled the auto-generated schema part, but IntelliJ complains that the method is deprecated. Is there any alternative to it?
As an alternative to a case class, use a custom class that implements the Product interface; the DataFrame will then use the schema exposed by your class members without the case class constraints. See the in-line comment here: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
If your JSON is composed of unrooted fragments, you could use s3distcp to group the files and concatenate them into fewer files. Also try the s3a protocol, as it performs better than s3n.
I want to save a hash as a packed string in a DB. I've got the pack part down OK, but I'm having a problem getting the hash back.
Test hash:
my $hash = {
test_string => 'apples,bananas,oranges',
test_subhash => { like => 'apples' },
test_subarray => [ 'red', 'yellow', 'orange' ],
};
I thought maybe I could use JSON::XS like in this example to convert the hash to a JSON string, and then pack the JSON string...
Thoughts on this approach?
Storable is capable of storing Perl structures very precisely. If you need to remember that something is a weak reference, etc, you want Storable. Otherwise, I'd avoid it.
JSON (Cpanel::JSON::XS) and YAML are good choices.
You can have problems if you store something using one version of Storable and try to retrieve it using an earlier version. That means all machines that access the database must have the same version of Storable.
Cpanel::JSON::XS is faster than Storable.
A fast YAML module is probably faster than Storable.
JSON can't store objects, but YAML and Storable can.
JSON and YAML are human readable (well, for some humans).
JSON and YAML are easy to parse and generate in other languages.
Usage:
use Cpanel::JSON::XS qw( encode_json decode_json );
my $for_the_db = encode_json($hash);
my $hash = decode_json($from_the_db);
I don't know what you mean by "packing". The string produced by Cpanel::JSON::XS's encode_json can be stored as-is into a BLOB field, while the string produced by Cpanel::JSON::XS->new->encode can be stored as-is into a TEXT field.
You may want to give the Storable module a whirl.
It can:
store your hash(ref) as a string with freeze
thaw it out at the time of retrieval
There are a lot of different ways to store a data structure in a scalar and then "restore" it back to its original state. There are advantages and disadvantages to each.
Since you started with JSON, I'll show you an example using it.
use JSON;
my $hash = {
test_string => 'apples,bananas,oranges',
test_subhash => { like => 'apples' },
test_subarray => [ 'red', 'yellow', 'orange' ],
};
my $stored = encode_json($hash);
my $restored = decode_json($stored);
Storable, as was already suggested, is also a good idea. But it can be rather quirky. It's great if you just want your own script/system to store and restore the data, but beyond that, it can be a pain in the butt. Even transferring data across different operating systems can cause problems. It was recommended that you use freeze, and for most local applications, that's the right call. If you decide to use Storable for sending data across multiple machines, look at using nfreeze instead.
That being said, there are a ton of encoding methods that can handle "storing" data structures. Look at YAML or XML.
I'm not quite sure what you mean by "convert the hash to a JSON string, and then packing the JSON string". What further "packing" is required? Or did you mean "storing"?
There are a number of alternative methods for storing hashes in a database.
As Zaid suggested, you can use Storable to freeze and thaw your hash. This is likely to be the fastest method (although you should benchmark with the data you're using if speed is critical). But Storable uses a binary format which is not human readable, which means that you will only be able to access this field using Perl.
As you suggested, you can store the hash as a JSON string. JSON has the advantage of being fairly human readable, and there are JSON libraries for most any language, making it easy to access your database field from something other than Perl.
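To illustrate the cross-language point, here is a rough sketch of a Python process storing and reading such a field. Everything specific here is an assumption for the demo: the in-memory SQLite database, the settings table, and the payload column are hypothetical, and the JSON text stands in for what the Perl side would have produced with encode_json.

```python
import json
import sqlite3

# Hypothetical schema: an in-memory SQLite table with one TEXT column for the JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (id INTEGER PRIMARY KEY, payload TEXT)")

# Same shape as the test hash from the question.
record = {
    "test_string": "apples,bananas,oranges",
    "test_subhash": {"like": "apples"},
    "test_subarray": ["red", "yellow", "orange"],
}

# Store: serialize to a JSON string and insert it as-is.
conn.execute(
    "INSERT INTO settings (id, payload) VALUES (?, ?)",
    (1, json.dumps(record)),
)

# Restore: fetch the string and parse it back into a nested structure.
(payload,) = conn.execute(
    "SELECT payload FROM settings WHERE id = ?", (1,)
).fetchone()
restored = json.loads(payload)
```

Nothing on the reading side depends on the writer being Perl or Python, which is the main practical advantage over a Storable blob.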
You can also switch to a document-oriented database like CouchDB or MongoDB, but that's a much bigger step.