I'm reading a Kafka JSON file from an Azure ADLS Gen2 storage account on Azure Databricks. I don't seem to be able to convert the binary value payload to a string so that I can perform the from_json conversion. I've tried various flavours of the cast, and in every case the original binary value is shown in my final transformation.
I've tried ...
df.selectExpr("CAST(value as STRING)")
as well as ...
df.select(col("value").cast("string"))
I know I'm doing something stupid, because this is a trivial transformation, but I can't work out what I'm doing wrong.
I'm using Azure Databricks Runtime 11.3 LTS ML.
The sample data I'm using is a Databricks Academy dataset.
I'm expecting the above code to transform the value into a human-readable string, but the final 'value' column is identical to the original binary.
What you're doing is the correct way to cast it to a string, but perhaps the data is simply not a plain string. It may be an encoded (base64, maybe) or encrypted string, or a binary payload such as Avro or Protobuf, none of which are human readable.
Without knowing how it was produced, you cannot know how to deserialize it. And if you are reading a .json file, as you say, note that Spark doesn't care about file extensions...
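To narrow down which of those cases applies, it can help to pull a single value out and inspect the raw bytes. The following is a rough heuristic sketch in plain Python, not a definitive detector: the function name and the category labels are made up for illustration, and the checks (a leading zero byte as used by Confluent-style framing, UTF-8 validity, a base64 round trip) are only guesses about how the data might have been produced.

```python
import base64
import binascii


def describe_payload(raw: bytes) -> str:
    """Guess how a Kafka value was serialized (heuristic, hypothetical helper)."""
    # Confluent-style framing starts with a zero "magic byte" and a 4-byte schema id.
    if len(raw) > 5 and raw[0] == 0:
        return "confluent-framed binary"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "binary (not valid UTF-8)"
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        return "json text"
    if stripped:
        try:
            # validate=True rejects characters outside the base64 alphabet.
            base64.b64decode(stripped, validate=True)
            return "possibly base64 text"
        except (binascii.Error, ValueError):
            pass
    return "plain text"


print(describe_payload(b'{"a": 1}'))
print(describe_payload(base64.b64encode(b'{"a": 1}')))
```

If the value turns out to be base64 text, the cast to string is working as intended and the remaining step is a base64 decode before calling from_json.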
Related
I want to convert incoming JSON data from Kafka into a dataframe.
I am using structured streaming with Scala 2.12
Most people add a hard-coded schema, but if the JSON can have additional fields, that requires changing the code base every time, which is tedious.
One approach is to write it into a file and infer the schema from there, but I would rather avoid doing that.
Is there any other way to approach this problem?
Edit: I found a way to turn a JSON string into a dataframe, but I can't extract it from the stream source. Is it possible to extract it?
One way is to store the schema itself in the message headers (not in the key or value).
Although this increases the message size, it makes it easy to parse the JSON value without the need for any external resource like a file or a schema registry.
New messages can have new schemas while at the same time old messages can still be processed using their old schema itself, because the schema is within the message itself.
Alternatively, you can version the schemas and include an id for every schema in the message headers, or a magic byte in the key or value, and look up the schema from there.
This approach is followed by the Confluent Schema Registry. It basically lets you go through the different versions of the same schema and see how your schema has evolved over time.
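As a sketch of what the magic-byte variant looks like on the wire, here is how a value framed in the Confluent style (one zero byte, then a 4-byte big-endian schema id, then the serialized payload) could be split apart in plain Python. The function name is hypothetical, and looking the id up in a registry is left out on purpose.

```python
import struct


def split_confluent_frame(raw: bytes):
    """Return (schema_id, payload) from a Confluent-style framed message."""
    if len(raw) < 5 or raw[0] != 0:
        raise ValueError("not a Confluent-framed message")
    # Bytes 1..4 hold the schema id as an unsigned big-endian integer.
    (schema_id,) = struct.unpack(">I", raw[1:5])
    return schema_id, raw[5:]


# Build a frame by hand to show the layout: magic byte 0, schema id 42, payload.
frame = struct.pack(">bI", 0, 42) + b"payload"
print(split_confluent_frame(frame))
```

The schema id is then what you would use to fetch the matching schema version before deserializing the payload.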
Read the data as a string and then convert it to a Map[String, String]; this way you can process any JSON without even knowing its schema.
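Outside of Spark, the same idea can be sketched in a few lines of plain Python: parse the string, keep string values as they are, and re-serialize everything else so that every value ends up as a string. The function name is made up, and it assumes the top-level JSON value is an object. In Spark itself the rough equivalent would be from_json with a map-of-strings schema rather than this helper.

```python
import json


def json_to_string_map(text: str) -> dict:
    """Turn a JSON object into a flat dict of str -> str, whatever its fields are."""
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    # Non-string values (numbers, lists, nested objects) are kept as JSON text.
    return {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in parsed.items()
    }


print(json_to_string_map('{"id": 7, "name": "a", "tags": ["x"]}'))
```

The trade-off is that nested values stay as JSON text, so you still have to parse them later when you actually need them.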
Based on JavaTechnical's answer, the best approach would be to use a schema registry and
Avro data instead of JSON; there is no getting around hard-coding a schema (for now).
Include your schema name and id as a header and use them to read the schema from the schema registry.
Use the from_avro function to turn that data into a DataFrame!
I have some data in CSV format. However, it is already a string, since I got it from an HTTP request.
I would like to use DataFrames in order to view the data.
However, I don't know how to parse it, because the CSV package only accepts files, not Strings.
One solution would be to write the content of the String into a file and then read it back in again. But there has to be a better way!
Use IOBuffer(your_string):
CSV.read(IOBuffer(your_string), DataFrame)
I am trying to parse JSON files and insert them into a SQL DB. My parser worked perfectly fine as long as the files were small (less than 5 MB).
I am getting an "Out of memory exception" when trying to read large (> 5 MB) files.
if (System.IO.Directory.Exists(jsonFilePath))
{
    string[] files = System.IO.Directory.GetFiles(jsonFilePath);
    foreach (string s in files)
    {
        // Reads the whole file into memory as a single string
        var jsonString = File.ReadAllText(s);
        fileName = System.IO.Path.GetFileName(s);
        ParseJSON(jsonString, fileName);
    }
}
I tried the JSONReader approach, but had no luck getting the entire JSON into a string or variable. Please advise.
Use 64-bit; check RredCat's answer on a similar question:
Newtonsoft.Json - Out of memory exception while deserializing big object
Newtonsoft Json.NET Performance Tips
Read the article by David Cox about tokenizing:
"The basic approach is to use a JsonTextReader object, which is part of the Json.NET library. A JsonTextReader reads a JSON file one token at a time. It, therefore, avoids the overhead of reading the entire file into a string. As tokens are read from the file, objects are created and pushed onto and off of a stack. When the end of the file is reached, the top of the stack contains one object — the top of a very big tree of objects corresponding to the objects in the original JSON file"
Parsing Big Records with Json.NET
The json file is too large to fit in memory, in any form.
You must use a JSON reader that accepts a filename or stream as input. It's not clear from your question which JSON Reader you are using. From which library?
If your JSON reader builds the whole JSON tree, you will still run out of memory. As you read the JSON file, either cherry pick the data you are looking for, or write data structures to another on-disk format that can be easily queried, for example, an sqlite database.
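To illustrate the "read incrementally, cherry-pick, write to SQLite" idea independently of any particular .NET library, here is a rough Python sketch using only the standard library. It assumes the file is one top-level JSON array whose elements are objects, and the id and name fields, the table name, and both function names are hypothetical. It reads fixed-size chunks and decodes one element at a time, so only the current chunk plus one element is held in memory.

```python
import io
import json
import sqlite3


def iter_array_objects(fp, chunk_size=64 * 1024):
    """Yield the objects of a top-level JSON array without loading the whole file."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    while True:
        chunk = fp.read(chunk_size)
        eof = chunk == ""
        buf += chunk
        if not started:
            buf = buf.lstrip()
            if buf:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf = buf[1:]
                started = True
        while started:
            buf = buf.lstrip(" \t\r\n,")  # skip separators between elements
            if not buf:
                break
            if buf[0] == "]":
                return
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise
                break  # element is incomplete; read another chunk
            yield item
            buf = buf[end:]
        if eof:
            return  # input ended (possibly without a closing bracket)


def load_records(fp, conn):
    """Cherry-pick two (hypothetical) fields per object and store them in SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, name TEXT)")
    rows = ((obj.get("id"), obj.get("name")) for obj in iter_array_objects(fp))
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_records(io.StringIO('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'), conn)
print(conn.execute("SELECT COUNT(*) FROM records").fetchone()[0])
```

With Json.NET the same shape would be built around JsonTextReader, as the quoted article describes; the point is only that no step ever holds the full document.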
Hi, is it possible to import any random JSON file into Cassandra?
The JSON file was not exported by sstable2json. It comes from a different website and needs to be imported into Cassandra. Could anyone please advise whether this is possible?
JSON support won't be introduced until Cassandra 3.0 (see CASSANDRA-7970), and even then you still need to define a schema for your JSON data to map to. You do have some other options:
Use maps, which sort of map to JSON. Maps can be indexed as of Cassandra 2.1 (CASSANDRA-4511). There is also a good Stack Exchange post about this.
You mention 'any random json file'. You could just have a string column that contains the raw JSON, but then you lose any query-ability of that data.
Come up with some kind of schema for your JSON data and map it to a CQL table and write some code that parses the JSON and writes it to the CQL table mapping to that data. This doesn't sound like an option for you since you want to be able to import any random JSON file.
If you are only looking to do JSON document storage, you might want to look at more document-oriented solutions instead of a column-oriented solution like Cassandra.
I have a solution where I need to be able to log multiple JSON Objects to a file. Essentially doing one log file per day. What is the easiest way to write (and later read) these from a single file?
How does MongoDB handle this with BSON? What does it use as a separator between "records"?
Do Protocol Buffers, BSON, MessagePack, etc. offer compression and the record concept? Compression would be a nice benefit.
With Protocol Buffers you could define the messages as follows:
// One record: the raw JSON text of a single object
message JSONObject {
    required string JSON = 1;
}
// One day's log: a repeated list of records
message DailyJSONLog {
    repeated JSONObject JSON = 1;
}
This way you would just read the file into memory and deserialize it. Serializing works essentially the same way. Once you have the file (a serialized DailyJSONLog) on disk, you can easily append serialized JSONObjects to the end of that file (since the DailyJSONLog message is simply a repeated field).
The only issue with this arises if you have a LOT of messages each day, or if you want to start at a certain location during the day (you cannot easily jump to the middle, or to any arbitrary position, of the repeated list).
I've gotten around this by taking a JSONObject, serializing it and then base64 encoding it. I'd store these in a file, separated by newlines. This allows you to very easily see how many records are in each file, to access any arbitrary JSON object within the file, and to trivially keep expanding the file (you can expand the above 'repeated' message pretty trivially as well, but it is a one-way easy operation...)
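A minimal sketch of that one-record-per-line layout, in plain Python. Note that it base64-encodes the JSON text directly rather than a serialized protobuf JSONObject, to stay free of dependencies; the function names and the file name are made up for illustration.

```python
import base64
import json
import os
import tempfile


def append_record(path, obj):
    """Append one JSON object as a single base64-encoded line."""
    line = base64.b64encode(json.dumps(obj).encode("utf-8"))
    with open(path, "ab") as f:
        f.write(line + b"\n")  # base64 output never contains a newline


def read_records(path):
    """Yield the JSON objects back, one per line, in the order they were written."""
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(base64.b64decode(line))


log_path = os.path.join(tempfile.mkdtemp(), "daily.log")
append_record(log_path, {"event": "start", "n": 1})
append_record(log_path, {"event": "stop", "n": 2})
print(list(read_records(log_path)))
```

Because each record is exactly one line, counting records is a line count and appending never requires rewriting what is already on disk.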
Compression is a different topic. Protocol Buffers will not compress strings. If you were to define a pb message to match your JSON message, then you would get the benefit of pb possibly 'compressing' any integers into their varint-encoded format. You will also get 'less' compression if you take the base64 encoding route above.
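To make the two size effects concrete, here is a small Python sketch of varint encoding for non-negative integers (7 bits of the value per byte, with the high bit set on every byte except the last), next to the size overhead of base64. The function name is hypothetical and this is not the protobuf library's own code.

```python
import base64


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low-order group first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)  # last byte: high bit clear
            return bytes(out)


print(encode_varint(1).hex())    # 01
print(encode_varint(300).hex())  # ac02

# base64 turns every 3 input bytes into 4 output characters.
print(len(base64.b64encode(bytes(300))))  # 400
```

So small integers shrink to one or two bytes, while the base64 route adds roughly a third on top of whatever was encoded.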