AWS Glue write_dynamic_frame_from_options encounters schema exception - csv

I'm new to PySpark and AWS Glue, and I'm having an issue when I try to write out a file with Glue.
When I try to write some output to S3 using Glue's write_dynamic_frame_from_options, it throws an exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 199.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 199.0 (TID 7991, 10.135.30.121, executor 9): java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
Header length: 7, schema size: 6
CSV file: s3://************************************cache.csv
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:180)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:176)
at scala.Option.foreach(Option.scala:257)
at .....
It seems to be saying that my DataFrame's schema has 6 fields but the CSV has 7. I don't understand which CSV it's talking about, because I'm actually trying to create a new CSV from the DataFrame...
Any insight into this specific issue, or into how the write_dynamic_frame_from_options method works in general, would be very helpful!
Here is the source code for the function in my job that causes this issue.
def update_geocache(glueContext, originalDf, newDf):
    logger.info("Got the two df's to union")
    logger.info("Schema of the original df")
    originalDf.printSchema()
    logger.info("Schema of the new df")
    newDf.printSchema()
    # add the two Dataframes together
    unioned_df = originalDf.unionByName(newDf).distinct()
    logger.info("Schema of the union")
    unioned_df.printSchema()
    ##root
    #|-- location_key: string (nullable = true)
    #|-- addr1: string (nullable = true)
    #|-- addr2: string (nullable = true)
    #|-- zip: string (nullable = true)
    #|-- lat: string (nullable = true)
    #|-- lon: string (nullable = true)
    # Create just 1 partition, because there is so little data
    unioned_df = unioned_df.repartition(1)
    logger.info("Unioned the geocache and the new addresses")
    # Convert back to dynamic frame
    dynamic_frame = DynamicFrame.fromDF(
        unioned_df, glueContext, "dynamic_frame")
    logger.info("Converted the unioned tables to a Dynamic Frame")
    # Write data back to S3
    # THIS IS THE LINE THAT THROWS THE EXCEPTION
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": "s3://" + S3_BUCKET + "/" + TEMP_FILE_LOCATION
        },
        format="csv"
    )
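For what it's worth, the stack trace points at a source CSV (the ...cache.csv path), not the file being written, so the mismatch is most likely raised while Spark reads an input whose header has 7 columns against a 6-field schema; because Spark evaluates lazily, that read only happens once the write forces the job to run. Since the question also asks how write_dynamic_frame.from_options works in general, here is a minimal, hedged sketch of a CSV write; the bucket/prefix and the format_options values are illustrative assumptions, not the job's real settings:

# Minimal sketch of a CSV write with write_dynamic_frame.from_options.
# The bucket/prefix and format_options values are illustrative assumptions.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

def write_csv(dynamic_frame):
    glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/example-prefix/"},
        format="csv",
        # Optional CSV writer settings; writeHeader controls whether a
        # header row is emitted (see the Glue CSV format options).
        format_options={"separator": ",", "writeHeader": True},
    )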

Related

How to incorporate projected columns in scanner into new dataset partitioning

Let's say I load a dataset
myds=ds.dataset('mypath', format='parquet', partitioning='hive')
myds.schema
# On/Off_Peak: string
# area: string
# price: decimal128(8, 4)
# date: date32[day]
# hourbegin: int32
# hourend: int32
# inflation: string rename to Inflation
# Price_Type: string
# Reference_Year: int32
# Case: string
# region: string rename to Region
My end goal is to resave the dataset with the following projection:
projection = {
    'Region': ds.field('region'),
    'Date': ds.field('date'),
    'isPeak': pc.equal(ds.field('On/Off_Peak'), ds.scalar('On')),
    'Hourbegin': ds.field('hourbegin'),
    'Hourend': ds.field('hourend'),
    'Inflation': ds.field('inflation'),
    'Price_Type': ds.field('Price_Type'),
    'Area': ds.field('area'),
    'Price': ds.field('price'),
    'Reference_Year': ds.field('Reference_Year'),
    'Case': ds.field('Case'),
}
I make a scanner
scanner=myds.scanner(columns=projection)
Now I try to save my new dataset with
ds.write_dataset(scanner, 'newpath',
partitioning=['Reference_Year', 'Case', 'Region'], partitioning_flavor='hive',
format='parquet')
but I get
KeyError: 'Column Region does not exist in schema'
I can work around this by changing my partitioning to ['Reference_Year', 'Case', 'region'] to match the non-projected column names (and then renaming all of those directories afterwards), but is there a way to do it directly?
Suppose my partitioning needed real compute, not just a column rename. Would I have to save a non-partitioned dataset in one step to get the new column, and then do a second save to create the partitioned dataset?
EDIT: this bug has been fixed in pyarrow 10.0.0
It looks like a bug to me. It's as if write_dataset is looking at the dataset_schema rather than the projected_schema
I think you can get around it by calling to_reader on the scanner.
table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', 'Other']
)
table_dataset = ds.dataset(table)
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)
ds.write_dataset(
    scanner.to_reader(),
    'newpath',
    partitioning=['Region'], partitioning_flavor='hive',
    format='parquet')
I've reported the issue here
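As for the follow-up about a partitioning that needs real compute rather than just a rename: one option is the two-step approach the question already suggests, writing the projected scanner out unpartitioned and then rewriting that intermediate dataset with hive partitioning. A hedged sketch, assuming a scanner that carries the projected Reference_Year, Case and Region columns (as in the original projection); 'tmp_path' is a made-up location:

import pyarrow.dataset as ds

# Step 1: materialize the projected columns without any partitioning.
ds.write_dataset(scanner.to_reader(), 'tmp_path', format='parquet')

# Step 2: reload the intermediate dataset and write it out partitioned
# on the already-renamed/computed columns.
tmp = ds.dataset('tmp_path', format='parquet')
ds.write_dataset(
    tmp,
    'newpath',
    partitioning=['Reference_Year', 'Case', 'Region'],
    partitioning_flavor='hive',
    format='parquet',
)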

Skip empty or faulty rows with Serde

I have a file with valid rows that I'm parsing to a struct using Serde and the csv crate.
#[derive(Debug, Deserialize)]
struct Circle {
    x: f32,
    y: f32,
    radius: f32,
}

fn read_csv(path: &str) -> Result<Vec<Circle>, csv::Error> {
    let mut rdr = csv::ReaderBuilder::new().delimiter(b';').from_path(path)?;
    let res: Vec<Circle> = rdr
        .deserialize()
        .map(|record: Result<Circle, csv::Error>| {
            record.unwrap_or_else(|err| panic!("There was a problem parsing a row: {}", err))
        })
        .collect();
    Ok(res)
}
This code works most of the time, but sometimes the files I get contain "empty" rows at the end:
x;y;radius
6398921.770;146523.553;0.13258
6398921.294;146522.452;0.13258
6398914.106;146526.867;0.13258
;;;
This makes the parsing fail with
thread 'main' panicked at 'There was a problem parsing a row: CSV
deserialize error: record 4 (line: 4, byte: 194): field 0: cannot
parse float from empty string', src/main.rs:90:41 note: run with
RUST_BACKTRACE=1 environment variable to display a backtrace
How can I handle faulty rows without manipulating the file contents beforehand?
Thanks!

Convert CSV with dynamic columns to parquet

I have CSV files for a table whose columns are dynamic and not in a fixed order:
csv file 1:
name, id, age, job
Amy, 001, 30, SDE
csv file 2:
id, job, name
002, PM, Brandon
I converted the CSV files to Parquet in PySpark,
spark.read.csv(input_path, header = True).write.parquet(output_path)
and when I read the Parquet back using Spark SQL, the data has been shifted.
name, id, age, job
Amy, 001, 30, SDE
002, PM, Brandon
What I want is:
name, id, age, job
Amy, 001, 30, SDE
Brandon, 002, null, PM
I know Parquet is a columnar format, so it should be possible to write by column name and keep the data from getting shifted. Or the problem could be read.csv itself, because CSV depends on column order and so won't work when the order is dynamic.
Is there any config I can add to the code to make it work? or any other ways?
You have to use the mergeSchema=true option, as below.
scala> spark.read.option("mergeSchema", "true").parquet("/user/hive/warehouse/test.db/csv_test/")
scala> res16.printSchema
root
|-- id: string (nullable = true)
|-- job: string (nullable = true)
|-- name: string (nullable = true)
|-- age: string (nullable = true)
scala> res16.show
+---+---+-------+----+
| id|job| name| age|
+---+---+-------+----+
|001|SDE| Amy| 30|
|002| PM|Brandon|null|
+---+---+-------+----+
Please beware: schema merging is an expensive operation.
https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#schema-merging
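Since the question uses PySpark rather than the Scala shell, here is roughly the same idea in PySpark, as a hedged sketch with placeholder paths: each CSV is read on its own (so its header defines its schema), appended to one Parquet location, and mergeSchema reconciles the columns by name when reading back.

# Hedged PySpark sketch of the same approach; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each CSV separately so each keeps its own header-based schema,
# then append to a single Parquet output location.
for input_path in ["s3://bucket/csv_file_1.csv", "s3://bucket/csv_file_2.csv"]:
    (spark.read.csv(input_path, header=True)
         .write.mode("append")
         .parquet("s3://bucket/parquet_output/"))

# mergeSchema lines the columns up by name across the mixed-schema files.
merged = spark.read.option("mergeSchema", "true").parquet("s3://bucket/parquet_output/")
merged.show()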

Parsing JSON as Array(String) in Kemal

I want to create an endpoint that receives JSON data and should parse it as an array of strings.
POST /
{
"keys": ["foo", "bar"]
}
I'm running into problems with the type system. This is what I tried (.as(Array(String))) but it does not compile:
require "kemal"
def print_keys(keys : Array(String))
  puts "Got keys: #{keys}"
end

post "/" do |env|
  keys = env.params.json["keys"].as(Array(String)) # <-- ERROR
  print_keys(keys)
end

Kemal.run
The error message is:
8 | keys = env.params.json["keys"].as(Array(String)) # <-- ERROR
^
Error: can't cast (Array(JSON::Any) | Bool | Float64 | Hash(String, JSON::Any) | Int64 | String | Nil) to Array(String)
If I change the code to cast to String instead of Array(String), it compiles without problems. Why does the type being Array(String) rather than String make a difference to the .as cast?
How can the code be changed to parse arrays of strings?
I found an example in the documentation, which uses JSON.mapping. In my concrete example, it could be written as follows:
require "kemal"
def print_keys(keys : Array(String))
puts "Got keys: #{keys}"
end
class KeyMappings
JSON.mapping({
keys: Array(String)
})
end
post "/" do |env|
json = KeyMappings.from_json env.request.body.not_nil!
print_keys(json.keys)
end
Kemal.run

Pig: parse bytearray as a string/json

I have some JSON data saved to S3 in SequenceFile format by Secor. I want to analyze it using Pig. Using elephant-bird I managed to load it from S3 as a bytearray, but I wasn't able to convert it to a chararray, which is apparently needed to parse JSON:
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare BYTES_CONVERTER 'com.twitter.elephantbird.pig.util.BytesWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
grunt> A = LOAD 's3n://...logs/raw_logs/...events/dt=2015-12-08/1_0_00000000000085594299'
USING $SEQFILE_LOADER ('-c $LONG_CONVERTER', '-c $BYTES_CONVERTER')
AS (key: long, value: bytearray);
grunt> B = LIMIT A 1;
grunt> DUMP B;
(85653965,{"key": "val1", other json data, ...})
grunt> DESCRIBE B;
B: {key: long,value: bytearray}
grunt> C = FOREACH B GENERATE (key, (chararray)value);
grunt> DUMP C;
2015-12-08 19:32:09,133 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders.
Cannot determine how to convert the bytearray to string.
Using TextConverter instead of the BytesWritableConverter just leaves me with empty values, like:
(85653965,)
It's apparent that Pig was able to cast the bytearray to a string in order to dump it, so it doesn't seem like it should be impossible. How do I do that?