How to process values in CSV format in streaming queries over Kafka source?

I'm new to Structured Streaming, and I'd like to know whether there is a way to specify the schema of the Kafka value, like we do in regular Structured Streaming jobs. The Kafka value is a syslog-like CSV with 50+ fields, and splitting it manually is painfully slow.
Here's the relevant part of my code (see the full gist here):
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "myserver:9092")
  .option("subscribe", "mytopic")
  .load()
  .select(split('value.cast("string"), """\^""") as "raw") // Kafka value is binary; cast to string before splitting
  .select(ColumnExplode('raw, schema.size): _*)            // flatten WrappedArray (helper defined in the gist)
  .toDF(schema.fieldNames: _*)                             // apply column names
  .select(fieldsWithTypeFix: _*)                           // cast column types from string
  .select(schema.fieldNames.map(col): _*)                  // re-order columns, as defined in schema
  .writeStream.format("console").start()
With no further operations, I can only achieve roughly 10 MB/s throughput on a 24-core, 128 GB server. Would it help if I converted the syslog to JSON beforehand? In that case I could use from_json with a schema, and maybe it would be faster.

Is there a way to specify the schema of the Kafka value, like we do in regular Structured Streaming jobs?
No. The output schema of the kafka external data source is fixed and cannot be changed. See this line.
Would it help if I converted the syslog to JSON beforehand? In that case I could use from_json with a schema, and maybe it would be faster.
I don't think so. I'd even say that CSV is a simpler text format than JSON (there is usually just a single separator).
Using the split standard function is the way to go, and I think you can hardly get better performance, since all it has to do is split a row and take every element to build the final output.
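For what it's worth, if you are on Spark 3.0+ there is also a from_csv standard function that applies a StructType to a delimited string in one step. A minimal, untested sketch, reusing the schema value and topic from the question (it still splits every row internally, so don't expect it to be faster than split):

import org.apache.spark.sql.functions.{col, from_csv}

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "myserver:9092")
  .option("subscribe", "mytopic")
  .load()
  .select(from_csv(col("value").cast("string"), schema, Map("sep" -> "^")) as "row") // schema is the StructType from the gist
  .select(col("row.*"))                                                              // expand the struct into named, typed columns
  .writeStream.format("console").start()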

Related

Loading Raw JSON Into Delta Lake (Like in Snowflake)

I am testing Delta Lake for a simple use case that is very easy in Snowflake, but I'm having a heck of a time understanding if it can be done, much less actually doing it.
I want to be able to load a JSON file "raw," without specifying a schema, and I want to be able to query and flatten it later. In Snowflake, I can create a column of type VARIANT and load the JSON text there, and later I can ask for the different parts by using :: and lateral flatten, etc.
The Delta Lake examples I've seen so far involve "schema inference" or "autoloading", and with those it seems that even if I don't specify a schema, one is created for me, and then I still have to guess (or look up) which columns Delta Lake created so I can query those parts of the JSON. It seems a little too complicated.
This page has the following comment:
When ingesting data, you may need to keep it in a JSON string, and some data may not be in the correct data type.
... but it provides no example of how to do that. To me this suggests that you can somehow store the raw JSON and query it later, but I don't know how. Do I just make a STRING column and insert the JSON as a string? Can someone post an example?
Am I trialing the wrong tool for what I need, or am I missing something? Thank you for your help.
As far as I'm aware, Delta Lake has no direct equivalent of Snowflake's VARIANT column. What that page suggests is storing the data as a string and then using the semi-structured access operators to parse it as JSON on the fly.
e.g. given a table named devices with a column named specifications of type string holding the value
{
  "device": "potato phone",
  "sku": "POTATO0001"
}
Then you can query it like this:
SELECT specifications:device, specifications:sku FROM devices
Edit: to address some of your other questions.
This doesn't do schema enforcement. It's possible to create a Struct column in Delta Lake that can store structured data, but all the data in that column needs to be compatible with the Struct schema. If you are querying a JSON string column, you are on your own for schema management.
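A rough Scala sketch of the "store it as a string" approach, with placeholder paths and using get_json_object, which works even where the : operator from the example above isn't available:

import org.apache.spark.sql.functions.{col, get_json_object}

// Land each raw JSON document as a single STRING column in a Delta table.
spark.read.text("/landing/devices/")                  // one JSON document per line (placeholder path)
  .withColumnRenamed("value", "specifications")
  .write.format("delta").mode("append").saveAsTable("devices")

// Query it later, parsing the string on the fly.
spark.table("devices").select(
  get_json_object(col("specifications"), "$.device").as("device"),
  get_json_object(col("specifications"), "$.sku").as("sku"))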

Spark from_avro 2nd argument is a constant string, any way to obtain schema string from some column of each record?

Suppose we are developing an application that pulls Avro records from a source stream (e.g. Kafka/Kinesis/etc.), parses them into JSON, then further processes that JSON with additional transformations. Further assume these records can have a varying schema (which we can look up and fetch from a registry).
We would like to use Spark's built-in from_avro function, but it is pretty clear that from_avro wants you to hard-code a fixed schema into your code. It doesn't seem to allow the schema to vary from one incoming row to the next.
That sort of makes sense if you are parsing the Avro into Spark's internal row format, since one would need a consistent structure for the dataframe. But what if we wanted something like from_avro that grabbed the bytes from one column of the row, grabbed the string representation of the Avro schema from another column of the same row, and then parsed that Avro into a JSON string?
Does such a built-in method exist? Or is such functionality available in a third-party library?
Thanks!
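Since from_avro's schema argument is a plain string (as the title notes), one way to picture the behaviour described above is a UDF built on the Avro Java library. A hedged sketch; the column names avroBytes/avroSchema are made up, and it assumes the payload is raw Avro binary with no schema-registry framing:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.functions.udf

// Decode the Avro bytes with the schema carried in the same row and return the record as JSON.
// Parsing the schema on every call is wasteful; a per-executor schema cache would be the next step.
val avroToJson = udf { (payload: Array[Byte], schemaStr: String) =>
  val schema  = new Schema.Parser().parse(schemaStr)
  val reader  = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get().binaryDecoder(payload, null)
  reader.read(null, decoder).toString           // GenericRecord.toString renders JSON
}

// df.select(avroToJson($"avroBytes", $"avroSchema") as "json")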

Spark partition projection/pushdown and schema inference with partitioned JSON

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with read(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), Spark attempts to read the entire dataset in order to infer the schema.
When I try to figure out my partition paths in advance and pass them with read(paths: _*), Spark throws an error saying it cannot infer the schema and that I need to specify it manually. (Note that in this case, unless I specify basePath, Spark also loses the type and dt columns, but that's fine, I can live with that.)
What I'm looking for, I think, is some option that either tells Spark to infer the schema from only the relevant partitions, so the partitioning is pushed down, or tells it that it can infer the schema from just the JSON files in the paths I've given it. Note that I don't have the option of running MSCK REPAIR or Glue to maintain a Hive metastore. In addition, the schema changes over time, so it can't be specified in advance; taking advantage of Spark's JSON schema inference is an explicit goal.
Can anyone help?
Could you read each day you are interested in with schema inference, and then union the DataFrames using schema-merge code like this:
Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema
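A rough sketch of that suggestion; the bucket, type value and dates are placeholders, and allowMissingColumns needs Spark 3.1+ (the linked question shows how to do the merge by hand on older versions):

val days = Seq("2020-01-01", "2020-01-02")        // placeholder: the dates you care about
val daily = days.map(d => spark.read.json(s"s3a://bucket/path/type=x/dt=$d/"))
val merged = daily.reduce(_.unionByName(_, allowMissingColumns = true))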
One way that comes to mind is to extract the schema you need from a single file, and then force it when you read the others.
Since you know the first partition and the path, try first reading a single JSON file such as s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json, then extract its schema.
Run the full read and pass the schema you extracted as a parameter: read(json_root_path).schema(json_schema).where(...)
The schema should be converted into a StructType to be accepted.
I've found a question that may partially help you: Create dataframe with schema provided as JSON file
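A minimal sketch of that approach; the bucket, type value and dates are placeholders:

import spark.implicits._

// Infer the schema from one known partition only, then reuse it for the filtered full read,
// so there is no whole-dataset schema-inference pass.
val jsonSchema = spark.read.json("s3a://bucket/path/type=x/dt=2020-01-01/").schema

val df = spark.read
  .schema(jsonSchema)
  .option("basePath", "s3a://bucket/path/")       // keep type and dt as columns
  .json("s3a://bucket/path/")
  .where($"type" === "x" && $"dt" >= "2020-01-01" && $"dt" <= "2020-01-31")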

Is there a way to get columns names of dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, each around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take far too long. My data is JSON formatted and I'm reading it using the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
From the official doc:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, spark.read.json on its own cannot give you the column names without scanning the whole input.
Still, you can do an extra step first: read the data as plain text, extract a single line, and take the column names from it.
One answer could be the following:
Read the data using the spark.read.text('path') method
Limit the number of rows to 1 with limit(1), since a single record is enough to get the column names
Convert the result to an RDD and collect it as a list with collect()
Convert the first collected row from a string to a Python dict (since the data is JSON formatted)
The keys of that dict are exactly what we are looking for (the column names, as a Python list)
This code worked for me:
from ast import literal_eval

# Read the file as plain text, keep only the first line, and parse it into a dict; its keys
# are the column names. (json.loads is safer if the JSON contains true/false/null literals.)
literal_eval(spark.read.text('path').limit(1)
             .rdd.flatMap(lambda x: x)      # unwrap the single-field Row into the raw JSON string
             .collect()[0]).keys()
The reason this is faster is that PySpark doesn't infer any field structure when you read the data as text (everything is read as one big string per line), which is much lighter and more efficient for this specific case.

How do I create a huge JSON file

I want to create a large file containing a big list of records from a database.
This file is used by another process.
When using XML, I don't have to load everything into memory; I can just use XML::Writer.
When using JSON, we normally build a Perl data structure and use the to_json function to dump the result.
This means I have to load everything into memory.
Is there a way to avoid that?
Is JSON suitable for large files?
Just use JSON::Streaming::Writer
Description
Most JSON libraries work in terms of in-memory data structures. In Perl, JSON
serializers often expect to be provided with a HASH or ARRAY ref containing
all of the data you want to serialize.
This library allows you to generate syntactically-correct JSON without first
assembling your complete data structure in memory. This allows large structures
to be returned without requiring those structures to be memory-resident, and
also allows parts of the output to be made available to a streaming-capable
JSON parser while the rest of the output is being generated, which may
improve performance of JSON-based network protocols.
Synopsis
my $jsonw = JSON::Streaming::Writer->for_stream($fh);
$jsonw->start_object();
$jsonw->add_simple_property("someName" => "someValue");
$jsonw->add_simple_property("someNumber" => 5);
$jsonw->start_property("someObject");
$jsonw->start_object();
$jsonw->add_simple_property("someOtherName" => "someOtherValue");
$jsonw->add_simple_property("someOtherNumber" => 6);
$jsonw->end_object();
$jsonw->end_property();
$jsonw->start_property("someArray");
$jsonw->start_array();
$jsonw->add_simple_item("anotherStringValue");
$jsonw->add_simple_item(10);
$jsonw->start_object();
# No items; this object is empty
$jsonw->end_object();
$jsonw->end_array();
$jsonw->end_property();
$jsonw->end_object();
Furthermore, there is also JSON::Streaming::Reader :)