PyArrow setting column types with Table.from_pydict (schema) - pyarrow

With a PyArrow table created as pyarrow.Table.from_pydict(d) all columns are string types.
Creating a schema object as below [1], and using it as pyarrow.Table.from_pydict(d, schema=s) results in errors such as:
pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int
Is there a means to set column types in tables created from dictionaries? Context is writing to Parquet files. A similar approach in Pandas is df.astype(schema).dtypes.
[1]
import pyarrow as pa

schema = pa.schema([
    ('id', pa.int32()),
    ('message_id', pa.string()),
    ('transaction_id', pa.string()),
])

The correct approach seems to be pyarrow.Table.from_pydict(d).cast(schema)
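A minimal sketch of that approach (the example dictionary values below are just illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# Values arrive as Python strings, e.g. straight out of a CSV or an API
d = {
    "id": ["1", "2"],
    "message_id": ["a", "b"],
    "transaction_id": ["x", "y"],
}

schema = pa.schema([
    ("id", pa.int32()),
    ("message_id", pa.string()),
    ("transaction_id", pa.string()),
])

# from_pydict infers string columns here; cast() converts them to the target types
table = pa.Table.from_pydict(d).cast(schema)

# Context from the question: writing the result to Parquet
pq.write_table(table, "example.parquet")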

Related

Efficient way to parse a file with different json schemas in spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (but the schema of each type is known and consistent) in Spark, in order to split it by "type" and store it in Parquet:
{"type":1, "data":{"data_of_type1" : 1}}
{"type":2, "data":{"data_of_type2" : "can be any type"}}
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}
I also want to reduce the IO because I am dealing with huge volumes, so I don't want to do a first split (by type) and then process each type independently...
Current idea (not working):
Load the JSON and parse only the type ("data" is loaded as a string)
Attach to each row the corresponding schema (a DDL string in a new column)
Try to parse "data" with the DDL from the previous column (method from_json)
=> This throws the error: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
Do you have any idea if it's possible?
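One possible PySpark workaround, sketched under the assumption that the per-type schemas are known (the DDL strings and paths below are made up): since from_json only accepts a literal schema, each type can be parsed with its own DDL instead of a schema stored in a column.

from pyspark.sql import functions as F

# Hypothetical per-type DDL schemas; in practice these come from the known,
# consistent schema of each type
schemas = {
    1: "data_of_type1 INT",
    2: "data_of_type2 STRING",
    3: "data_of_type3 STRING, anotherone INT",
}

# Declare "data" as STRING so the nested object is kept as raw JSON text,
# and cache so the input is not re-read once per type
raw = (spark.read
       .schema("type INT, data STRING")
       .json("s3a://bucket/path/input")
       .cache())

# Parse each type with its own literal DDL (from_json refuses a schema taken
# from a column, which is what the quoted error is about) and write one
# Parquet dataset per type
for t, ddl in schemas.items():
    (raw.where(F.col("type") == t)
        .withColumn("data", F.from_json("data", ddl))
        .write.mode("overwrite")
        .parquet(f"s3a://bucket/path/output/type={t}"))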

Spark: retrieve datatype of nested Struct column

I am currently working on a job that loads a nested JSON file as a dataframe, performs some transformations on it and then loads it into a delta table. The test data I work with has a lot of nested columns, but it's possible that the JSON files the job will load in the future don't always come with all the columns (or that they have different datatypes). Therefore, I want to first check whether a column is there and which datatype it has. Problem: I can't get it to work, because I don't know how to derive the datatype of a column from the nested schema of the dataframe.
Example: How can I get the datatype of ecuId?
My approach so far was:
df.withColumn("datatype", isinstance(col("reportData.ecus.element.ecuId"), (float, int, str, list, dict, tuple)))
or
df.withColumn("datatype", isinstance(jsonDF.reportData.ecus.element.ecuId, (float, int, str, list, dict, tuple)))
For both versions I get the error message: "col should be Column"
Even when I try a very basic
df.withColumn("datatype", type(jsonDF.reportData.ecus.element.ecuId))
I get the same error.
It appears as if I have a complete misconception of how to work with nested structures.
Can you please explain to me how I get the datatype? Thanks a lot in advance!
The reason you got the error "col should be Column" is that withColumn expects the second parameter to be a Column object, not a plain Python object.
The closest approach I got is a bit "hacky": parsing the schema of the dataframe manually.
from pyspark.sql import functions as F
(df
 # df.dtypes[0][1] is the full type string of the first column (reportData)
 .withColumn('schema', F.lit(df.dtypes[0][1]))
 # pull out the type that follows "ecuId:" inside that string
 .withColumn('datatype', F.regexp_extract('schema', 'ecus.*.ecuId:([^>]*)', 1))
 .show(10, False)
)
# Output
# +----------+----------------------------------------+--------+
# |reportData|schema |datatype|
# +----------+----------------------------------------+--------+
# |{[{1000}]}|struct<ecus:array<struct<ecuId:bigint>>>|bigint |
# +----------+----------------------------------------+--------+
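If parsing the dtypes string feels too brittle, a possible alternative (a sketch, not part of the original answer) is to walk df.schema directly with PySpark's type objects:

from pyspark.sql.types import ArrayType

def nested_dtype(schema, path):
    """Return the simple type name (e.g. 'bigint') of a nested field,
    given a dotted path such as 'reportData.ecus.ecuId'."""
    dtype = schema
    for name in path.split("."):
        if isinstance(dtype, ArrayType):   # step into array elements transparently
            dtype = dtype.elementType
        dtype = dtype[name].dataType       # StructType supports lookup by field name
    if isinstance(dtype, ArrayType):
        dtype = dtype.elementType
    return dtype.simpleString()

# nested_dtype(df.schema, "reportData.ecus.ecuId") would return 'bigint' for the
# schema shown above; a missing field raises KeyError, which also answers the
# "is the column there at all" part of the question.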

Spark partition projection/pushdown and schema inference with partitioned JSON

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with read(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), Spark attempts to read the entire dataset in order to infer the schema.
When I try to figure out my partition paths in advance and pass them with read(paths :_*), Spark throws an error that it cannot infer the schema and that I need to specify the schema manually. (Note that in this case, unless I specify basePath, Spark also loses the columns for type and dt, but that's fine, I can live with that.)
What I'm looking for, I think, is some option that tells Spark to either infer the schema from only the relevant partitions, so the partitioning is pushed down, or tells it that it can infer the schema from just the JSONs in the paths I've given it. Note that I do not have the option of running MSCK REPAIR TABLE or using Glue to maintain a Hive metastore. In addition, the schema changes over time, so it can't be specified in advance - taking advantage of Spark's JSON schema inference is an explicit goal.
Can anyone help?
Could you read each day you are interested in using schema inference, and then union the dataframes using schema-merge code like this:
Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema
One way that comes to my mind is to extract the schema you need from a single file, and then force it when you want to read the others.
Since you know the first partition and the path, try first to read a single JSON like s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json and then extract the schema.
Then run the full read and pass the schema that you extracted as a parameter: read(json_root_path).schema(json_schema).where(...
The schema should be converted into a StructType to be accepted.
I've found a question that may partially help you Create dataframe with schema provided as JSON file
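A rough PySpark sketch of that suggestion (paths and filter values are placeholders):

# Infer the schema from one known partition only, then reuse it for the
# filtered read over the whole dataset, so Spark does not scan everything
# just to infer types
sample = spark.read.json("s3a://bucket/path/type=something/dt=2020-01-01/")
json_schema = sample.schema  # already a StructType, as required above

df = (spark.read
      .schema(json_schema)
      .option("basePath", "s3a://bucket/path/")  # keep type/dt as partition columns
      .json("s3a://bucket/path/")
      .where("type = 'something' AND dt >= '2020-01-01' AND dt <= '2020-01-31'"))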

Spark Option: inferSchema vs header = true

Reference to pyspark: Difference performance for spark.read.format("csv") vs spark.read.csv
I thought I needed .option("inferSchema", "true") and .option("header", "true") to print my headers, but apparently I could still print my csv with headers.
What is the difference between header and schema? I don't really understand the meaning of "inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default".
The header and schema are separate things.
Header:
If the csv file has a header (column names in the first row), then set header=true. This will use the first row in the csv file as the dataframe's column names. Setting header=false (the default option) will result in a dataframe with default column names: _c0, _c1, _c2, etc.
Setting this to true or false should be based on your input file.
Schema:
The schema referred to here means the column types. A column can be of type String, Double, Long, etc. Using inferSchema=false (the default option) will give a dataframe where all columns are strings (StringType). Depending on what you want to do, strings may not work. For example, if you want to add numbers from different columns, then those columns should be of some numeric type (strings won't work).
By setting inferSchema=true, Spark will automatically go through the csv file and infer the schema of each column. This requires an extra pass over the file, which makes reading with inferSchema set to true slower. But in return the dataframe will most likely have a correct schema for its input.
As an alternative to reading a csv with inferSchema, you can provide the schema while reading. This has the advantage of being faster than inferring the schema, while still giving a dataframe with the correct column types. In addition, for csv files without a header row, column names can be given explicitly. To provide a schema see e.g.: Provide schema while reading csv file as a dataframe
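A quick PySpark illustration of what the two options change (the path is a placeholder):

# header=true only affects the column names; every column is still a string
df_plain = (spark.read
            .option("header", "true")
            .csv("/path/csv_filename.csv"))
df_plain.printSchema()   # all columns show up as string

# inferSchema=true adds an extra pass over the file to detect column types
df_typed = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/path/csv_filename.csv"))
df_typed.printSchema()   # numeric columns now show up as int, double, etc.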
There are two ways we can specify schema while reading the csv file.
Way1: Specify the inferSchema=true and header=true.
val myDataFrame = spark.read.options(Map("inferSchema"->"true", "header"->"true")).csv("/path/csv_filename.csv")
Note: using this approach while reading the data will create one additional stage.
Way2: Specify the schema explicitly.
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType()
  .add("Id", IntegerType, true)
  .add("Name", StringType, true)
  .add("Age", IntegerType, true)
val myDataFrame = spark.read.option("header", "true")
  .schema(schema)
  .csv("/path/csv_filename.csv")

Custom Formatting of JSON output using Spark

I have a dataset with a bunch of BigDecimal values. I would like to output these records to a JSON file, but when I do, the BigDecimal values will often be written with trailing zeros (123.4000000000000), and the spec we must conform to does not allow this (for reasons I don't understand).
I am trying to see if there is a way to override how the data is printed to JSON.
Currently, my best idea is to convert each record to a string using Jackson and then write the data using df.write().text(..) rather than JSON.
I suggest converting the Decimal type to String before writing to JSON.
The code below is in Scala, but you can adapt it to Java easily.
import org.apache.spark.sql.types.StringType
// COLUMN_NAME is your DataFrame column name.
val new_df = df.withColumn("COLUMN_NAME_TMP", df("COLUMN_NAME").cast(StringType))
  .drop("COLUMN_NAME")
  .withColumnRenamed("COLUMN_NAME_TMP", "COLUMN_NAME")
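For completeness, a PySpark sketch of the same cast-to-string idea (COLUMN_NAME is a placeholder, as above):

from pyspark.sql import functions as F

# Cast the decimal column to string so its textual form is written to JSON as-is
new_df = df.withColumn("COLUMN_NAME", F.col("COLUMN_NAME").cast("string"))
new_df.write.mode("overwrite").json("/path/output_json")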