Dataframes reading json files with changing schema - json

I am currently reading JSON files which have a variable schema in each file. We are using the following logic to read the JSON: first we read a base schema file which has all the fields, and then we read the actual data. We use this approach because the schema is understood based on the first file read, but we do not get all the fields in that first file itself. So we are just tricking the code into understanding the schema first and then starting to read the actual data.
rdd = sc.textFile("baseSchemaWithAllColumns.json").union(sc.textFile("pathToActualFile.json"))
df = sqlContext.read.json(rdd)
# Create the DataFrame, then save it as a temp table and query it
I know the above is just a workaround and we need a cleaner solution to accept JSON files with varying schemas.
I understand that there are two other ways to handle the schema, as mentioned here.
However, for that it looks like we need to parse the JSON and map each field to the data received.
There seems to be an option for Parquet schema merging, but that looks like it applies mostly when reading into the DataFrame - or am I missing something here?
What is the best way to read JSON files with a changing schema and work with Spark SQL for querying?
Can I just read the JSON file as is, save it as a temp table, and then use mergeSchema=true while querying?
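For reference, the "parse the JSON and map each field" option mentioned above would roughly amount to passing an explicit schema to the reader instead of the schema-file trick. A minimal sketch, assuming made-up field names and a superset schema covering every field any file can contain:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical superset schema listing every field that can appear in any file.
full_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("optional_field", StringType(), True),
])

# Files missing some fields simply get nulls in those columns.
df = sqlContext.read.schema(full_schema).json("pathToActualFile.json")
df.registerTempTable("events")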

Related

From Kafka json message to Snowflake table

I am trying to implement a Snowflake Sink connector so that I can load messages coming to a Kafka topic directly into an appropriate Snowflake table. So far, I could only get to the point of loading the raw JSON into a table with two columns (RECORD_METADATA and RECORD_CONTENT). My goal is to load the JSON messages directly into an appropriate table by flattening them. I have the structure of what the table should be, so I could create a table and load directly into that. But I need a way for the load process to flatten the messages.
I have been looking online and through the documentation, but haven't found a clear way to do this.
Is it possible or do I have to first load the raw json and then do transformations to get the table that I want?
Thanks
You have to load the raw JSON first, then you can do transformations.
Each Kafka message is passed to Snowflake in JSON format or Avro format. The Kafka connector stores that formatted information in a single column of type VARIANT. The data is not parsed, and the data is not split into multiple columns in the Snowflake table.
For more information you can read here
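To illustrate what those transformations look like: once the connector has landed the raw messages in the VARIANT column RECORD_CONTENT, the flattening is ordinary Snowflake SQL run afterwards. A rough sketch, where the raw table, target table, and JSON paths are all assumptions:

import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)

# Hypothetical raw table written by the Kafka connector, flattened into a typed table.
flatten_sql = """
INSERT INTO orders_flat (order_id, customer_name, amount)
SELECT
    record_content:orderId::number,
    record_content:customer.name::string,
    record_content:amount::float
FROM kafka_raw_orders
"""
conn.cursor().execute(flatten_sql)

To keep this running continuously rather than from a script, the usual pattern is a Snowflake stream on the raw table plus a scheduled task that runs the same INSERT ... SELECT.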

Azure Data Factory - copy task using Rest API is only returning first row upon execution

I have a copy task in ADF that is pulling data from a REST API into an Azure SQL Database. I've created the mappings, and pulled in a collection reference as follows:
[Screenshots: preview of the JSON data, source, sink, mappings, output]
You will notice it's only outputting 1 row (the first row) when running the copy task. I know this is usually because you are pulling from a nested JSON array, in which the collection reference should resolve this to pull from the array - but I can't for the life of me get it to pull multiple records even after setting the collection.
There's a trick to this: import schemas, then put the name of the array in the collection reference, then import schemas again. Then it works.
[Screenshot from Azure Data Factory]
Because of an Azure Data Factory design limitation, pulling JSON data and inserting it into Azure SQL Database isn't a good approach. Even after using the collection reference you might not get the desired results.
The recommended approach is to store the output of the REST API as a JSON file in Azure Blob Storage using a Copy Data activity. Then you can use that file as the source and do the transformation in a Data Flow. Alternatively, you can use a Lookup activity to get the JSON data and invoke a Stored Procedure to store the data in Azure SQL Database (this way is cheaper and its performance is better).
Use the flatten transformation to take array values inside hierarchical structures such as JSON and unroll them into individual rows. This process is known as denormalization.
Refer to this third-party tutorial for more details.
Hey, I had this issue too. I noticed that the default column names for the JSON branches were really long, and in my target CSV the header row got truncated after a bit. I was able to get ADF working just by renaming them in the mapping section.
For example, I had:
['hours']['monday']['openIntervals'][0]['endTime'] in the source and changed it to MondayCloseTime in the destination.
It just started working. You can also just turn off the header on the output for a quick test before rewriting all the column names, as that also got it working for me.
I assume it writes out the truncated header row at the same time as the first row of data and then tries to use that header row afterwards, but as it doesn't match what it's expecting, it just ends. It's a bit annoying that it doesn't give an error or anything, but anyway, this worked for me.

Redshift/S3 - Copy the contents of a Redshift table to S3 as JSON?

It's straightforward to copy JSON data on S3 into a Redshift table using the standard Redshift COPY command.
However, I'm also looking for the inverse operation: to copy the data contained within an existing Redshift table to JSON that is stored in S3, so that a subsequent Redshift COPY command can recreate the Redshift table exactly as it was originally.
I know about the Redshift UNLOAD command, but it doesn't seem to offer any option to store the data in S3 directly in JSON format.
I know that I can write per-table utilities to parse and reformat the output of UNLOAD for each table, but I'm looking for a generic solution which allows me to do this Redshift-to-S3-JSON extract on any specified Redshift table.
I couldn't find any existing utilities that will do this. Did I miss something?
Thank you in advance.
I think the only way is to unload to CSV and write a simple Lambda function that turns an input CSV into JSON, taking the CSV header as keys and the values of every row as values.
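A minimal sketch of that CSV-to-JSON conversion (the Lambda/S3 plumbing is omitted, the file names are placeholders, and it assumes the UNLOAD was run with a header row so DictReader can pick up the column names):

import csv
import json

# Turn an unloaded CSV into newline-delimited JSON, one object per row,
# using the header row as the keys.
with open("unloaded_part.csv", newline="") as src, open("table.json", "w") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps(row) + "\n")

Note that csv.DictReader yields every value as a string, so numbers end up quoted in the JSON output; depending on the consumer, that may need an extra conversion step.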
There is no built-in way to do this yet, so you might have to hack your query with some hardcoding:
https://sikandar89dubey.wordpress.com/2015/12/23/how-to-dump-data-from-redshift-to-json/
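For context, the "hardcoding" that post describes is building the JSON text inside the SELECT that you give to UNLOAD. A rough sketch of the idea with a made-up table and columns (when embedded in UNLOAD's quoted query, every single quote below has to be doubled, and any double quotes inside the data would need escaping too):

# Hypothetical table "events" with columns id (integer) and name (varchar).
select_as_json = """
SELECT '{"id": ' || id::varchar || ', "name": "' || name || '"}'
FROM events
"""
# This SELECT is then wrapped in UNLOAD ('...') TO 's3://bucket/prefix/' so that each
# output line is one JSON object; clearly brittle, which is why it counts as a hack.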

Performance overhead while using infer schema vs explicitly passing schema while loading CSV file data in spark dataframe

I am loading CSV data into a Spark DataFrame with the inferSchema option set to true, although the schema of my CSV file is always going to be the same and I know the exact schema.
Is it a good idea to manually provide the schema instead of inferring it? Does explicitly providing the schema improve performance?
Yes, it's a good idea. Schema inference causes the file to be read twice: once for the schema inference and a second time to read it into the Dataset.
From the Spark code for DataFrameReader (a similar note is in DataStreamReader):
This function will go through the input once to determine the input
schema if inferSchema is enabled. To avoid going through the
entire data once, disable inferSchema option or specify the
schema explicitly using schema.
Link to code
However, it may be difficult to maintain schemas for 100 Datasets with 200 columns each. You should also keep maintainability in mind, so the typical answer is: it depends :) For schemas that are not too big, or not too difficult to infer but come with large files, I recommend using a custom schema written in code.
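As a rough illustration of the "custom schema written in code" option (column names and types are made up):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

csv_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True),
])

# With an explicit schema there is no inference pass, so the file is read only once.
df = spark.read.csv("data.csv", header=True, schema=csv_schema)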

Create table structure in postgresql from json file

I would like to know if there is a way to create a table structure in postgresql using a JSON file. The story is that I exported JSON data that uses a schema from mongoDB, now I would like to create the same structure (the schema in mongo) in a table in postgresql so then I can import the data from that JSON file into it. Is there a way to do that base on the JSON file, or should I just create the table and its structure myself using the postgres JSON type and then import the data? I'm just looking for opinions, suggestions or articles that could be related to this, any help would be really appreciate it. Thanks.