Spark Option: inferSchema vs header = true - csv

Reference to pyspark: Difference performance for spark.read.format("csv") vs spark.read.csv
I thought I needed .option("inferSchema", "true") and .option("header", "true") to print my headers, but apparently I could still print my csv with headers.
What is the difference between header and schema? I don't really understand the meaning of "inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default".

The header and schema are separate things.
Header:
If the csv file has a header (column names in the first row), then set header=true. This will use the first row of the csv file as the dataframe's column names. Setting header=false (the default) will result in a dataframe with default column names: _c0, _c1, _c2, etc.
Setting this to true or false should be based on your input file.
Schema:
The schema referred to here is the set of column types. A column can be of type String, Double, Long, etc. Using inferSchema=false (the default) will give a dataframe where all columns are strings (StringType). Depending on what you want to do, strings may not work. For example, if you want to add numbers from different columns, then those columns should be of some numeric type (strings won't work).
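To see why the column types matter, here is a plain-Python analogy (not Spark itself, just an illustration with made-up data): Python's csv module also hands every field back as a string, and "adding" string columns concatenates instead of summing, exactly like an all-StringType dataframe would.

```python
import csv
import io

# A tiny inline csv file with a header row and two data rows
rows = list(csv.reader(io.StringIO("price,qty\n1.5,2\n3.0,4\n")))
header, data = rows[0], rows[1:]

# Every field comes back as a string, like inferSchema=false:
as_strings = data[0][0] + data[0][1]  # "1.5" + "2" gives "1.52" (concatenation)

# Cast explicitly to get numeric behaviour, like inferSchema=true:
total = sum(float(price) * int(qty) for price, qty in data)  # 1.5*2 + 3.0*4 = 15.0
```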
By setting inferSchema=true, Spark will go through the csv file and infer the schema of each column. This requires an extra pass over the file, which makes reading with inferSchema=true slower. In return, the dataframe will most likely have a correct schema for its input.
As an alternative to inferring the schema, you can provide it while reading. This has the advantage of being faster than schema inference while still giving a dataframe with the correct column types. In addition, for csv files without a header row, column names can be supplied at the same time. To provide a schema, see e.g.: Provide schema while reading csv file as a dataframe

There are two ways we can specify the schema while reading a csv file.
Way1: Specify inferSchema=true and header=true.
val myDataFrame = spark.read.options(Map("inferSchema"->"true", "header"->"true")).csv("/path/csv_filename.csv")
Note: with this approach, reading the data creates one additional stage (for the schema-inference pass).
Way2: Specify the schema explicitly.
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("Id", IntegerType, true)
  .add("Name", StringType, true)
  .add("Age", IntegerType, true)
val myDataFrame = spark.read.option("header", "true")
  .schema(schema)
  .csv("/path/csv_filename.csv")

Related

Spark partition projection/pushdown and schema inference with partitioned JSON

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with read(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), spark attempts to read the entire dataset in order to infer the schema.
When I try to figure out my partition paths in advance and pass them with read(paths :_*), spark throws an error that it cannot infer the schema and I need to specify the schema manually. (Note that in this case, unless I specify basePath, spark also loses the columns for type and dt, but that's fine, I can live with that.)
What I'm looking for, I think, is some option that tells spark either to infer the schema from only the relevant partitions (so the partition filter is pushed down), or that it can infer the schema from just the JSONs in the paths I've given it. Note that I do not have the option of running MSCK repair or using Glue to maintain a Hive metastore. In addition, the schema changes over time, so it can't be specified in advance; taking advantage of spark's JSON schema inference is an explicit goal.
Can anyone help?
Could you read each day you are interested in using schema inference and then union the dataframes using schema merge code like this:
Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema
One way that comes to my mind is to extract the schema you need from a single file, and then force it when you want to read the others.
Since you know the first partition and the path, try to read first a single JSON like s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json then extract the schema.
Run the full read and pass the schema that you extracted as a parameter: read(json_root_path).schema(json_schema).where(...)
The schema should be converted into a StructType to be accepted.
I've found a question that may partially help you Create dataframe with schema provided as JSON file

Is there a way to get columns names of dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, and all of them are around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data are json formatted and I'm reading them using the classic spark json reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
from the official doc :
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, with spark.read.json you cannot get the column names without a full pass over the input.
Still, you can do an extra step first that extracts one line, creates a dataframe from it, and then reads the column names from that.
One answer could be the following :
Read the data using the spark.read.text('path') method
Limit the number of rows to 1 with the method limit(1), since we just want one record to get the column names from
Convert the result to an rdd and collect it as a list with the method collect()
Convert the first collected row from a string to a python dict (since I'm working with json formatted data).
The keys of that dict are exactly what we are looking for (the column names, as a python list).
This code worked for me (using json.loads rather than ast.literal_eval, so that JSON true/false/null values parse correctly):
import json

json.loads(spark.read.text('path').limit(1)
    .rdd.flatMap(lambda x: x)
    .collect()[0]).keys()
The reason it is faster might be that pyspark won't parse the whole dataset with all its field structures if you read it in text format: everything is read as one big string per line, which is lighter and more efficient for this specific case.

Load CSV in Spark with types in non standard format

I've got a csv file that I want to read with Spark, specifying a schema to get the types I need. Something like that:
Dataset<Row> ds = sqlContext.read()
    .format("csv")
    .option("header", "false")
    .schema(customSchema)
    .load("myCsvFilePath.csv");
But in my csv file some columns are recorded in a non-standard way; for example, double values use a comma as the decimal separator, and datetime values are strings formatted as dd.MM.yyyy.
Is it possible to define such a schema? Or should I read these columns as strings and then parse them explicitly?
Converting the odd formats to standard ones is part of the data-prep pipeline you'd want to use spark for. So yes: read these columns as strings, and then, using either built-in functions or a udf, replace the columns with fixed ones (e.g. using withColumn):
import org.apache.spark.sql.functions._

df.withColumn("fixed_date", unix_timestamp(col("date_column"), "dd.MM.yyyy"))
  .withColumn("fixed_double",
    regexp_replace(col("double_column"), ",", ".").cast("double"))
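For intuition, here are the same two conversions applied to single values in plain Python (the sample values are made up):

```python
from datetime import datetime

raw_double = "1234,56"   # comma as decimal separator
raw_date = "31.12.2020"  # dd.MM.yyyy

# Replace the comma with a dot, then cast to a number
fixed_double = float(raw_double.replace(",", "."))            # 1234.56
# Parse the day-first date format
fixed_date = datetime.strptime(raw_date, "%d.%m.%Y").date()   # 2020-12-31
```

One caveat on the Spark side: in Spark's date patterns, lowercase yyyy is the calendar year, while uppercase YYYY is the week-based year and can silently give wrong dates around New Year (Python's %Y has no such trap).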

append data to an existing json file

Appreciate if someone can point me to the right direction in here, bit new to python :)
I have a json file that looks like this:
[
  {
    "user": "user5",
    "games": "game1"
  },
  {
    "user": "user6",
    "games": "game2"
  },
  {
    "user": "user5",
    "games": "game3"
  },
  {
    "user": "user6",
    "games": "game4"
  }
]
And i have a small csv file that looks like this:
module_a,module_b
10,20
15,16
1,11
2,6
I am trying to append the csv data into the above mentioned json so it looks like this, keeping the order as it is:
[
  {
    "user": "user5",
    "module_a": "10",
    "games": "game1",
    "module_b": "20"
  },
  {
    "user": "user6",
    "module_a": "15",
    "games": "game2",
    "module_b": "16"
  },
  {
    "user": "user5",
    "module_a": "1",
    "games": "game3",
    "module_b": "11"
  },
  {
    "user": "user6",
    "module_a": "2",
    "games": "game4",
    "module_b": "6"
  }
]
What would be the best approach to achieve this while keeping the output order as it is?
Appreciate any guidance.
The JSON specification doesn't prescribe any ordering, and no ordering will be enforced by a JSON parser (unless it's the default mode of operation of the underlying platform), so going a long way just to keep the order when processing JSON files is usually pointless. To quote:
An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array.
...
JSON parsing libraries have been observed to differ as to whether or
not they make the ordering of object members visible to calling
software. Implementations whose behavior does not depend on member
ordering will be interoperable in the sense that they will not be
affected by these differences.
That being said, if you really insist on order, you can parse your JSON into a collections.OrderedDict (and write it back from it) which will allow you to inject data at specific places while keeping the overall order. So, first load your JSON as:
import json
from collections import OrderedDict

with open("input_file.json", "r") as f:  # open the JSON file for reading
    json_data = json.load(f, object_pairs_hook=OrderedDict)  # read & parse it
Now that you have your JSON, you can go ahead and load up your CSV, and since there's not much else to do with the data you can immediately apply it to the json_data. One caveat, though: since there is no direct map between the CSV and the JSON, one has to assume the index is the map (i.e. the first CSV row being applied to the first JSON element etc.), so we'll use enumerate() to track the current index. There is also no info on where to insert individual values, so we'll assume that the first column goes after the first JSON object entry, the second goes after the second entry and so on, and since they can have different lengths we'll use itertools.izip_longest() to interleave them. So:
import csv
from itertools import izip_longest  # use zip_longest on Python 3.x

with open("input_file.csv", "rb") as f:  # open the CSV file for reading
    reader = csv.reader(f)  # build a CSV reader
    header = next(reader)  # store the header so we can get the key values later
    for index, row in enumerate(reader):  # enumerate and iterate over the rest
        if index >= len(json_data):  # more CSV rows than we have elements in JSON
            break
        row = [(header[i], v) for i, v in enumerate(row)]  # turn the row into element tuples
        # since collections.OrderedDict doesn't support random access by index we'll have to
        # rebuild it by mixing in the CSV elements with the existing JSON elements
        # use json_data[i].items() on Python 3.x
        data = (v for p in izip_longest(json_data[index].iteritems(), row) for v in p)
        # then finally overwrite the current element in json_data with a new OrderedDict
        json_data[index] = OrderedDict(data)
And with our CSV data nicely inserted into the json_data, all that's left is to write back the JSON (you may overwrite the original file if you wish):
with open("output_file.json", "w") as f:  # open the output JSON file for writing
    json.dump(json_data, f, indent=2)  # finally, write back the modified JSON
This will produce the result you're after. It even respects the names in the CSV header so you can replace them with bob and fred and it will insert those keys in your JSON. You can even add more of them if you need more elements added to your JSON.
Still, just because it's possible, you really shouldn't rely on JSON ordering. If it's user readability you're after, there are far more suitable formats with optional ordering, like YAML.
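For what it's worth, on Python 3.7+ plain dicts already preserve insertion order, so the same interleaving can be sketched without OrderedDict; inline sample data (a subset of the question's files) stands in for the actual files:

```python
import csv
import io
import json
from itertools import zip_longest

json_data = json.loads('[{"user": "user5", "games": "game1"},'
                       ' {"user": "user6", "games": "game2"}]')
reader = csv.reader(io.StringIO("module_a,module_b\n10,20\n15,16\n"))
header = next(reader)

for index, row in enumerate(reader):
    if index >= len(json_data):  # more CSV rows than JSON elements
        break
    csv_pairs = list(zip(header, row))
    # interleave the existing JSON pairs with the CSV pairs, keeping order;
    # zip_longest pads with None when the lengths differ, so filter those out
    merged = (pair
              for pairs in zip_longest(json_data[index].items(), csv_pairs)
              for pair in pairs if pair is not None)
    json_data[index] = dict(merged)

# json_data[0] is now {"user": "user5", "module_a": "10", "games": "game1", "module_b": "20"}
```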

How does spark infers numeric types from JSON?

Trying to create a DataFrame from a JSON file, but when I load the data, spark automatically infers that the numeric values are of type Long, although they are actually Integers, which is also how I parse the data in my code.
Since I'm loading the data in a test env, I don't mind using a few workarounds to fix the schema. I've tried more than a few, such as:
Changing the schema manually
Casting the data using a UDF
Define the entire schema manually
The issue is that the schema is quite complex, and the fields I'm after are nested, which makes all of the options above irrelevant or too complex to write from scratch.
My main question is: how does spark decide whether a numeric value is an Integer or a Long? And is there anything I can do to enforce that all/some numerics are of a specific type?
Thanks!
It's always LongType by default.
From the source code:
// For Integer values, use LongType by default.
case INT | LONG => LongType
So you cannot change this behaviour. You can iterate over the columns and cast them; note that withColumn returns a new DataFrame, so the result has to be reassigned:
var result = df
for (c <- schema.fields.filter(_.dataType.isInstanceOf[NumericType])) {
  result = result.withColumn(c.name, col(c.name).cast(IntegerType))
}
It's only a snippet, but something like this should help you :)