I have a JSON file on S3 (filename = a). I read it into a DataFrame (df) using sqlContext.read.json. When I check df.printSchema(), the schema is not what I want, so I specify my own schema with double and string types.
Then I reload the JSON data into a DataFrame (df3) with that schema, but when I call df3.head(1) I see None values for some of my variables.
See the code below:
import os
df = sqlContext.read.json(os.path.join('file:///data', 'a'))
print(df.count())
df.printSchema()
df = df.na.fill(0)  # na.fill returns a new DataFrame, so keep the result
Next I specify my own schema (sch). The full schema is long, so only an abbreviated version is shown here.
import json
from pyspark.sql.types import StructType, StructField, DoubleType
sch = StructType([StructField("x", DoubleType(), True), StructField("y", DoubleType(), True)])
f = sc.textFile(os.path.join('file:///data', 'a'))
f_json = f.map(lambda x: json.loads(x))
df3 = sqlContext.createDataFrame(f_json, sch)
df3.head(1)
[Row(x=85.7, y=None)]
I get None values for my DoubleType columns when I call df3.head(1). Am I doing something wrong when I reload the data into df3?
I was able to take care of "None" by doing df.na.fill(0)!
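One thing worth checking, offered as an assumption rather than a confirmed diagnosis: json.loads returns a Python int for a whole number such as 12 and a str for a quoted number, while DoubleType corresponds to a Python float. A minimal sketch of normalising those fields before calling createDataFrame follows; the helper name to_doubles and the field list are hypothetical.

```python
import json


def to_doubles(record, double_fields):
    """Return a copy of record with the named fields converted to float.

    Missing or non-numeric values become None.
    """
    out = dict(record)
    for name in double_fields:
        value = out.get(name)
        try:
            out[name] = float(value) if value is not None else None
        except (TypeError, ValueError):
            out[name] = None
    return out


line = '{"x": 85.7, "y": 12}'
row = to_doubles(json.loads(line), ["x", "y"])
print(row)
```

In the code above this could slot in as f.map(lambda x: to_doubles(json.loads(x), ["x", "y"])). Passing the schema straight to the reader with sqlContext.read.json(path, schema=sch) may be a simpler route, since it leaves the numeric conversion to Spark's JSON parser.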
I read a .csv file into a DataFrame and want to write the data to a Kafka topic. The code is the following:
df = spark.read.format("csv").option("header", "true").load(f'{file_location}')
kafka_df = df.selectExpr("to_json(struct(*)) AS value").selectExpr("CAST(value AS STRING)")
kafka_df.show(truncate=False)
And the data frame looks like this:
value
"{""id"":""d215e9f1-4d0c-42da-8f65-1f4ae72077b3"",""latitude"":""-63.571457254062715"",""longitude"":""-155.7055842710919""}"
"{""id"":""ca3d75b3-86e3-438f-b74f-c690e875ba52"",""latitude"":""-53.36506636464281"",""longitude"":""30.069167069917597""}"
"{""id"":""29e66862-9248-4af7-9126-6880ceb3b45f"",""latitude"":""-23.767505281795835"",""longitude"":""174.593140405442""}"
"{""id"":""451a7e21-6d5e-42c3-85a8-13c740a058a9"",""latitude"":""13.02054867061598"",""longitude"":""20.328402498420786""}"
"{""id"":""09d6c11d-7aae-4d17-8cd8-183157794893"",""latitude"":""-81.48976715040848"",""longitude"":""1.1995769642056189""}"
"{""id"":""393e8760-ef40-482a-a039-d263af3379ba"",""latitude"":""-71.73949722379649"",""longitude"":""112.59922770487054""}"
"{""id"":""d6db8fcf-ee83-41cf-9ec2-5c2909c18534"",""latitude"":""-4.034680969008576"",""longitude"":""60.59645511854336""}"
After writing it to Kafka, I want to read it back and convert the binary data in the "value" column to a JSON string, but the resulting value contains only the id, not the whole string. Any idea why?
from pyspark.sql import functions as F
df = consume_from_event_hub(topic, bootstrap_servers, config, consumer_group)
string_df = df.select(F.col("value").cast("string"))
string_df.display()
value
794541bc-30e6-4c16-9cd0-3c5c8995a3a4
20ea5b50-0baa-47e3-b921-f9a3ac8873e2
598d2fc1-c919-4498-9226-dd5749d92fc5
86cd5b2b-1c57-466a-a3c8-721811ab6959
807de968-c070-4b8b-86f6-00a865474c35
e708789c-e877-44b8-9504-86fd9a20ef91
9133a888-2e8d-4a5a-87ce-4a53e63b67fc
cd5e3e0d-8b02-45ee-8634-7e056d49bf3b
The CSV format is this:
id,latitude,longitude
bd6d98e1-d1da-4f41-94ba-8dbd8c8fce42,-86.06318155350924,-108.14300138138589
c39e84c6-8d7b-4cc5-b925-68a5ea406d52,74.20752175171859,-129.9453606091319
011e5fb8-6ab7-4ee9-97bb-acafc2c71e15,19.302250885973592,-103.2154291337162
You need to remove selectExpr("CAST(value AS STRING)"), since to_json already returns a string column:
from pyspark.sql.functions import col, to_json, struct
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(f'{file_location}')
kafka_df = df.select(to_json(struct(col("*"))).alias("value"))
kafka_df.show(truncate=False)
I'm not sure what's wrong with the consumer. That should have worked, unless consume_from_event_hub does something specifically to extract the ID column.
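To narrow down whether the producer or the consumer is dropping the other fields, it can help to inspect the raw bytes of a few messages outside Spark. A minimal sketch, assuming the message value is UTF-8 encoded; decode_value is a hypothetical helper and the sample payloads are modelled on the data shown in the question.

```python
import json


def decode_value(raw):
    """Decode a message value (bytes) to text and, if it is a JSON object, to a dict."""
    text = raw.decode("utf-8")
    try:
        parsed = json.loads(text)
    except ValueError:
        parsed = None
    # Only a JSON object counts as a full record; anything else is reported as None.
    return text, (parsed if isinstance(parsed, dict) else None)


full = b'{"id":"d215e9f1-4d0c-42da-8f65-1f4ae72077b3","latitude":"-63.571457254062715"}'
bare = b"794541bc-30e6-4c16-9cd0-3c5c8995a3a4"

print(decode_value(full)[1])
print(decode_value(bare)[1])
```

If the raw bytes on the topic already hold only the id, the problem is on the producing side; if they hold the full JSON object, the consumer helper is the place to look.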
When I read JSON through Spark (using Scala):
val rdd = spark.sqlContext.read.json("/Users/sanyam/Downloads/data/input.json")
val df = rdd.toDF()
df.show()
println(df.schema)
//val schema = df.schema.add("_corrupt_record",org.apache.spark.sql.types.StringType,true)
//val rdd1 = spark.sqlContext.read.schema(schema).json("/Users/sanyam/Downloads/data/input_1.json")
//rdd1.toDF().show()
This results in the following DataFrame:
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
| appId| appTimestamp|appVersion| bankCode|bankLocale| data|date| environment| event| id| logTime| logType| msid| muid| owner|recordType| uuid|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
StructType(StructField(appId,StringType,true), StructField(appTimestamp,StringType,true), StructField(appVersion,StringType,true), StructField(bankCode,StringType,true), StructField(bankLocale,StringType,true), StructField(data,StringType,true), StructField(date,StringType,true), StructField(environment,StringType,true), StructField(event,StringType,true), StructField(id,LongType,true), StructField(logTime,LongType,true), StructField(logType,StringType,true), StructField(msid,StringType,true), StructField(muid,StringType,true), StructField(owner,StringType,true), StructField(recordType,StringType,true), StructField(uuid,StringType,true))
If I want to validate further JSON files that I read, I store the schema in a variable and pass it to .schema as an argument (see the commented-out lines of code). But then the corrupt records do not go into the _corrupt_record column (which I expected to happen by default). Instead, the bad records are parsed as null in all columns, which results in data loss because no record of them remains.
When I add the _corrupt_record column to the schema explicitly, everything works and the corrupt records go into that column. Why is this so?
(Also, if you give Spark malformed JSON without a schema, it handles it automatically by creating a _corrupt_record column, so why does schema validation need the column to be added explicitly?)
Reading corrupt JSON data without a schema returns the schema [_corrupt_record: string]. But you are reading the corrupt data with a schema that has no _corrupt_record field, so there is nowhere to put the raw record and you get the whole row as null.
When you add _corrupt_record explicitly, you get the whole JSON record in that column and, I assume, null in all other columns.
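The behaviour described in this answer can be imitated in plain Python. This is only an illustration of the described semantics, not Spark's implementation; parse_permissive and the sample field names are hypothetical.

```python
import json


def parse_permissive(line, fields, corrupt_field="_corrupt_record"):
    """Imitate, for one input line, the behaviour described above."""
    row = {name: None for name in fields}
    try:
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError("not a JSON object")
    except ValueError:
        # Malformed input: the raw text survives only if the schema has a slot for it.
        if corrupt_field in row:
            row[corrupt_field] = line
        return row
    for name in fields:
        if name != corrupt_field:
            row[name] = obj.get(name)
    return row


bad = '{"id": 1, "event": '
print(parse_permissive(bad, ["id", "event"]))
print(parse_permissive(bad, ["id", "event", "_corrupt_record"]))
```

With the first schema the malformed line leaves no trace beyond a row of nulls; with the second, the raw text is kept in _corrupt_record.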
I am trying to import a big CSV file (50GB+) without any header into a pyarrow table, with the overall goal of exporting it to the Parquet format and processing it further in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes within pyarrow for the CSV file?
I already thought about prepending a header to the CSV file, but that forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example of doing so while converting a CSV file to an Arrow table.
As a simple example, imagine that this CSV file has just the two columns "A" and "B".
My current code looks like this:
import numpy as np
import pandas as pd
import pyarrow as pa
df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)
df_with_header.to_csv("data.csv", header=False, index=False)
df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)
opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
'B': 'int8'})
table = pa.csv.read_csv(input_file = "data.csv", convert_options = opts)
print(table)
If I print the final table, the column names have not changed:
pyarrow.Table
1: int64
3: int64
How can I change the loaded column names and dtypes? Is it also possible, for example, to pass in a dict containing the names and their dtypes?
You can specify type overrides for columns:
import io
import pyarrow as pa
from pyarrow import csv
fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    convert_options=csv.ConvertOptions(
        column_types={
            'one': pa.int8(),
            'two': pa.int8(),
            'three': pa.int8(),
        }
    ))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to this question: apache arrow - reading csv file
There should be a fix for it in the next version: https://github.com/apache/arrow/pull/4898
When I create a DataFrame from a JSON file, the fields from the JSON file are sorted by default in the DataFrame. How can I avoid this sorting?
The JSON file has one JSON message per line:
{"name":"john","age":10,"class":2}
{"name":"rambo","age":11,"class":3}
When I create a DataFrame from this file with:
val jDF = sqlContext.read.json("/user/inputfiles/sample.json")
a DataFrame is created as jDF: org.apache.spark.sql.DataFrame = [age: bigint, class: bigint, name: string]. In the DataFrame the fields are sorted by default.
How do we avoid this from happening?
I'm unable to understand what is going wrong here.
Appreciate any help in sorting out the problem.
For Question 1:
A simple way is to do select on the DataFrame:
val newDF = jDF.select("name","age","class")
The order of parameters is the order of the columns you want.
But this could be verbose if there are many columns and you have to define the order yourself.
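When there are many columns, the order does not have to be typed by hand: it can be read from the first line of the file. A minimal sketch in plain Python, assuming every line has the same keys as the first one; field_order is a hypothetical helper.

```python
import json
from collections import OrderedDict


def field_order(json_line):
    """Return the top-level keys of one JSON object in the order they appear in the text."""
    return list(json.loads(json_line, object_pairs_hook=OrderedDict).keys())


first_line = '{"name":"john","age":10,"class":2}'
print(field_order(first_line))
```

The resulting list can then be passed to select, for example jDF.select(*field_order(first_line)) in PySpark.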
I'm trying to write a DataFrame in Spark to an HDFS location. I expect that if I add partitionBy, Spark will create partition folders (as it does when writing in Parquet format) of the form partition_column_name=partition_value (e.g. partition_date=2016-05-03). To do so, I ran the following command:
(df.write
    .partitionBy('partition_date')
    .mode('overwrite')
    .format("com.databricks.spark.csv")
    .save('/tmp/af_organic'))
but the partition folders were not created.
Any idea what I should do so that Spark automatically creates those folders?
Thanks,
Spark 2.0.0+:
Built-in csv format supports partitioning out of the box so you should be able to simply use:
df.write.partitionBy('partition_date').mode(mode).format("csv").save(path)
without including any additional packages.
Spark < 2.0.0:
At this moment (v1.4.0) spark-csv doesn't support partitionBy (see databricks/spark-csv#123) but you can adjust built-in sources to achieve what you want.
You can try two different approaches. Assuming your data is relatively simple (no complex strings and need for character escaping) and looks more or less like this:
df = sc.parallelize([
    ("foo", 1, 2.0, 4.0), ("bar", -1, 3.5, -0.1)
]).toDF(["k", "x1", "x2", "x3"])
You can manually prepare values for writing:
from pyspark.sql.functions import col, concat_ws
key = col("k")
values = concat_ws(",", *[col(x) for x in df.columns[1:]])
kvs = df.select(key, values)
and write using text source
kvs.write.partitionBy("k").text("/tmp/foo")
df_foo = (sqlContext.read.format("com.databricks.spark.csv")
    .options(inferSchema="true")
    .load("/tmp/foo/k=foo"))
df_foo.printSchema()
## root
## |-- C0: integer (nullable = true)
## |-- C1: double (nullable = true)
## |-- C2: double (nullable = true)
In more complex cases you can try to use proper CSV parser to preprocess values in a similar way, either by using UDF or mapping over RDD, but it will be significantly more expensive.
If CSV format is not a hard requirement you can also use JSON writer which supports partitionBy out-of-the-box:
df.write.partitionBy("k").json("/tmp/bar")
as well as partition discovery on read.
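For reference, the directory layout that the approaches above aim for can be imitated in plain Python. This is an illustration of the column=value layout only, not of how Spark writes files; write_partitioned and the file name part-00000.csv are hypothetical.

```python
import csv
import os
import tempfile


def write_partitioned(rows, header, key, root):
    """Append each row to root/<key>=<value>/part-00000.csv, dropping the key column."""
    key_index = header.index(key)
    for row in rows:
        part_dir = os.path.join(root, "{}={}".format(key, row[key_index]))
        os.makedirs(part_dir, exist_ok=True)
        values = [v for i, v in enumerate(row) if i != key_index]
        with open(os.path.join(part_dir, "part-00000.csv"), "a", newline="") as fh:
            csv.writer(fh).writerow(values)


root = tempfile.mkdtemp()
rows = [("foo", 1, 2.0, 4.0), ("bar", -1, 3.5, -0.1)]
write_partitioned(rows, ["k", "x1", "x2", "x3"], "k", root)
print(sorted(os.listdir(root)))
```

As in the text-source approach, the partition value lives in the directory name and is left out of the file contents, which is why reading /tmp/foo/k=foo back yields only the remaining three columns.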