What does dehydration mean in the context of data loading?

There are a few threads that discuss the meaning of data hydration, but I can't find any definition of data dehydration.
I can see three possible meanings:
1) Extracting the current state of an object
2) Clearing the state of an object
3) Both 1) and 2)
EDIT
This is not a duplicate of other threads where the meaning of hydration is discussed. This question is about dehydration.

Dehydration is simply the opposite of hydration. In the context of data loading, suppose you have an API that reads data from an ORM. The process of transforming the ORM object tree into JSON containing only primitive types is dehydration: the extraction of data from the hydrated objects.
As you already mentioned:
Extracting the current state of an object
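A minimal sketch of that idea in Scala (hypothetical classes, no real ORM or JSON library involved):

case class Address(city: String, zip: String)
case class User(id: Long, name: String, address: Address)

// Dehydration: extract the current state of the object graph into plain
// primitives, e.g. on the way to a JSON API response.
def dehydrate(u: User): Map[String, Any] = Map(
  "id"   -> u.id,
  "name" -> u.name,
  "city" -> u.address.city,
  "zip"  -> u.address.zip
)

val dehydrated = dehydrate(User(1L, "Ada", Address("London", "NW1")))
// Map(id -> 1, name -> Ada, city -> London, zip -> NW1)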

Data dehydration usually refers to the process of archiving database data into a compressed file. The data is then removed from the database (and as such is inaccessible).
Rehydration is the process of loading the stored data back into the original (or an ancillary) table so that it can be worked with.

Related

Why does Spark read data even though no actions are called?

I am confused about Spark's lazy loading when using spark.read.json.
I have the following code:
json_data_files = [
    f"hdfs://hdfs_cluster:8020/data/*/*"
]
df_json = spark.read.json(json_data_files)
The JSON data on HDFS is partitioned by year and month (year=yyyy, month=mm), and I want to retrieve all the data of that dataset.
With this code block I only read data from the defined location, and no action is executed. But on the Spark UI I found the following stage with a giant amount of input data.
As I understand it, Spark's lazy evaluation will not read data until an action is called, so this confuses me.
After that, I call the count() action; a new stage is created and Spark reads the data again.
My question is: why does Spark read data when no action has been called (in the first job/stage)? How can I optimize this?
It is doing a pass over the data to infer the schema, since one was not supplied (i.e. schema inference).
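One way to avoid that initial pass is to supply the schema yourself. A sketch in Scala with made-up field names (the question uses PySpark, but the same schema option exists there); an active SparkSession named spark is assumed:

import org.apache.spark.sql.types._

// Hypothetical fields; replace them with the real structure of the JSON.
val schema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("year", IntegerType),
  StructField("month", IntegerType)
))

// With an explicit schema, spark.read.json() no longer needs the extra
// pass over the files to infer types, so nothing is read until an
// action such as count() is called.
val df = spark.read
  .schema(schema)
  .json("hdfs://hdfs_cluster:8020/data/*/*")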

Explicitly providing a schema in the form of a JSON in Spark / MongoDB integration

When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, there is sample code showing how one can provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection with a constant document structure. I can provide a sample JSON document, but creating a sample object manually would be impossible (30k properties in a document, 1.5MB average size). Is there a way for Spark to infer the schema just from that very JSON and circumvent the MongoDB connector's initial sampling, which is quite exhaustive?
Spark is able to infer the schema, especially from sources that carry one, such as MongoDB. For instance, for an RDBMS it executes a simple query returning nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling it will read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others don't), the schema deduced by Spark may be incomplete, and in the end you may miss some data.
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
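If you already have one representative JSON document on disk, one possible approach (a sketch, not the connector's documented workflow; the path, format name and option keys are assumptions that depend on your connector version) is to infer the schema from that single file and pass it explicitly, so the connector does not need its own sampling pass:

import org.apache.spark.sql.types.StructType

// Infer the schema once from a single exported sample document.
val schema: StructType = sparkSession.read
  .json("/path/to/sample-document.json")
  .schema

// Apply that schema when loading from MongoDB; with a schema supplied,
// the connector should not need to sample the collection.
val mongoDF = sparkSession.read
  .format("mongo")                 // format name differs between connector versions
  .option("uri", "mongodb://host:27017/db.collection")
  .schema(schema)
  .load()

mongoDF.printSchema()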

Read small JSON files with a declared schema from an S3 bucket - Spark performance issue

I have a huge number (35k) of small (16kB) JSON files stored in an S3 bucket. I need to load them into a DataFrame for further processing; here is my extraction code:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
  .where($"nod1.filter1" === "filterValue")
  .where($"nod2.subNode1.subSubNode2.created"(0) === "filterValue2")
I'm storing this data in a temp table and using it for further operations (exploding nested structures into separate data frames):
jsonData.registerTempTable("jsonData")
So now I have an auto-generated schema for this deeply nested dataframe.
With the above code I have terrible performance issues. I presume they are caused by not using sc.parallelize during the bucket load; moreover, I'm pretty sure the schema auto-generation in the read.json() method is taking a lot of time.
Questions:
1) How should my bucket load look in order to be more efficient and faster?
2) Is there any way to declare this schema in advance (I need to work around the case class tuple problem though) to avoid auto-generation?
3) Does filtering the data during the load make sense, or should I simply load everything and filter afterwards?
Found so far:
sqlContext.jsonRDD(rdd, schema)
It did the part with the auto-generated schema, but IntelliJ complains about a deprecated method; is there any alternative for it?
As an alternative to a case class, use a custom class that implements the Product interface; the DataFrame will then use the schema exposed by your class members without the case class constraints. See the inline comment here: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
If your JSON is composed of unrooted fragments, you could use s3distcp to group the files and concatenate them into fewer files. Also try the s3a protocol, as it performs better than s3n.
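For declaring the schema in advance, a hedged sketch: the nested field types below are guesses based on the filters in the question, so adjust them to the real documents. Supplying a schema to read.json() skips the inference pass over all 35k files, and only the declared fields are materialized.

import org.apache.spark.sql.types._
import sqlContext.implicits._

// Schema fragment guessed from the filters above; extend it to match
// the real documents.
val schema = StructType(Seq(
  StructField("nod1", StructType(Seq(
    StructField("filter1", StringType)
  ))),
  StructField("nod2", StructType(Seq(
    StructField("subNode1", StructType(Seq(
      StructField("subSubNode2", StructType(Seq(
        StructField("created", ArrayType(StringType))
      )))
    )))
  )))
))

// Passing the schema up front avoids the costly schema inference;
// s3a is used instead of s3n, as suggested above.
val jsonData = sqlContext.read
  .schema(schema)
  .json("s3a://bucket/dir1/dir2")
  .where($"nod1.filter1" === "filterValue")
  .where($"nod2.subNode1.subSubNode2.created"(0) === "filterValue2")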

How do I create a huge JSON file?

I want to create a large file containing a big list of records from a database.
This file is used by another process.
When using XML, I don't have to load everything into memory and can just use XML::Writer.
When using JSON, we normally create a Perl data structure and use the to_json function to dump the result.
This means that I have to load everything into memory.
Is there a way to avoid it?
Is JSON suitable for large files?
Just use JSON::Streaming::Writer
Description
Most JSON libraries work in terms of in-memory data structures. In Perl, JSON
serializers often expect to be provided with a HASH or ARRAY ref containing
all of the data you want to serialize.
This library allows you to generate syntactically-correct JSON without first
assembling your complete data structure in memory. This allows large structures
to be returned without requiring those structures to be memory-resident, and
also allows parts of the output to be made available to a streaming-capable
JSON parser while the rest of the output is being generated, which may
improve performance of JSON-based network protocols.
Synopsis
use JSON::Streaming::Writer;

# Write to any open filehandle, e.g. a file opened for output.
open(my $fh, '>', 'output.json') or die "Cannot open output.json: $!";

my $jsonw = JSON::Streaming::Writer->for_stream($fh);
$jsonw->start_object();
$jsonw->add_simple_property("someName" => "someValue");
$jsonw->add_simple_property("someNumber" => 5);
$jsonw->start_property("someObject");
$jsonw->start_object();
$jsonw->add_simple_property("someOtherName" => "someOtherValue");
$jsonw->add_simple_property("someOtherNumber" => 6);
$jsonw->end_object();
$jsonw->end_property();
$jsonw->start_property("someArray");
$jsonw->start_array();
$jsonw->add_simple_item("anotherStringValue");
$jsonw->add_simple_item(10);
$jsonw->start_object();
# No items; this object is empty
$jsonw->end_object();
$jsonw->end_array();
$jsonw->end_property();
$jsonw->end_object();
close($fh);
Furthermore there is the JSON::Streaming::Reader :)

Better to filter a stream of data at its start or end?

I'm working on a project in which I need to process a huge amount (multiple gigabytes) of comma separated value (CSV) files.
What I basically do is as follows:
1) Create an object that knows how to read all related files
2) Register with this object a set of Listeners that are interested in the data
3) Read each line of each file, dispatching an object created from the line of data to each of the listeners
4) Each Listener decides whether this piece of data is useful / relevant
I'm wondering whether it would be better to filter at the source side instead, e.g. each listener has an associated Predicate object that determines whether a given piece of data should be dispatched to it, in which case the process would look more like:
1) Create an object that knows how to read all related files
2) Register with this object a set of Listener/Predicate pairs
3) Read each line of each file, dispatching an object created from the line of data to each of the listeners whose associated Predicate returns true for the data
The net effect is the same, it's just a matter of where the filtering takes place.
(Again, the only reason I have this 'stream' of data that I process one entry at a time is because I'm dealing with gigabytes of CSV files, and I can't create a collection, filter it, and then deal with it - I need to filter as I go)
Unless the cost of the call to the listener is huge (Remoting, WCF,...) I would stay with a really simple interface and let the listener decide what to do with the row.
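A small Scala sketch of the two wiring styles being compared (all names here are illustrative, not from the original project):

case class Record(fields: Map[String, String])

trait Listener { def onData(record: Record): Unit }

// Style 1: dispatch every record; each listener filters internally.
def dispatchAll(records: Iterator[Record], listeners: Seq[Listener]): Unit =
  records.foreach(r => listeners.foreach(_.onData(r)))

// Style 2: the reader filters, using a Predicate registered with each listener.
def dispatchFiltered(records: Iterator[Record],
                     registrations: Seq[(Record => Boolean, Listener)]): Unit =
  records.foreach { r =>
    registrations.foreach { case (pred, l) => if (pred(r)) l.onData(r) }
  }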