I have a CSV file of data in the form
21.06.2016 23:00:00.349, 153.461, 153.427
21.06.2016 23:00:00.400, 153.460, 153.423
etc
The initial step of creating a frame accepts an optional 'schema' to rename column headers and specify types:
let df = Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/data/GBPJPY.csv", hasHeaders=true, inferTypes=false, schema="TS (DateTimeOffset), Bid (float(3)), Ask (float(3))")
I would like the first column of string values to be ParseExact'ed to DateTimeOffset using the format
"dd.mm.yyyy HH:mm:ss.fff"
(I'm assuming System.Globalization.CultureInfo.InvariantCulture.)
How do I express the schema so that it parses the datetime string in that first Frame.ReadCsv("file.csv", schema = ........ ) call? Or is this not possible to accomplish within the schema argument?
I am working on PySpark (3.x) and Delta Lake, and I am facing some challenges with data types.
We receive data as JSON, do some flattening on the JSON datasets, and save them as delta tables with the option "mergeSchema" set to true, as shown below. We are not imposing any schema on the table.
df.write \
  .format("delta") \
  .partitionBy("country","city") \
  .option("mergeSchema","true") \
  .mode("append") \
  .save(delta_path)
The problem we are facing is that the data type of the JSON fields changes very often. For example, in the delta table "field_1" is stored with data type StringType, but the data type of "field_1" in the new JSON arrives as LongType. Because of this we get a merge-incompatible exception:
ERROR : Failed to merge fields 'field_1' and 'field_1'. Failed to merge incompatible data types StringType and LongType
How can we handle such data type evolution in delta tables? I don't want to handle data type changes at the field level, because we have more than 300 fields coming in as part of the JSON.
According to the article Diving Into Delta Lake: Schema Enforcement & Evolution the option mergeSchema=true can handle the following scenarios:
Adding new columns (this is the most common scenario)
Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
The article also gives a hint on what can be done in your case (a minimal sketch follows the list below):
"Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). For example, in the case where the column “Foo” was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be re-written. Those changes include:"
Dropping a column
Changing an existing column’s data type (in place)
Renaming column names that differ only by case (e.g. “Foo” and “foo”)
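For completeness, here is a minimal sketch of the overwrite path the article describes, assuming the same df and delta_path as in your write snippet; note that overwriteSchema rewrites the table's data files, so it is only appropriate when you can afford to replace the existing contents:
# hedged sketch: rewrite the delta table with the new schema (and new data types);
# unlike mergeSchema, this replaces the existing Parquet files instead of merging
df.write \
    .format("delta") \
    .partitionBy("country", "city") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(delta_path)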
I have also taken an approach similar to nilesh1212's, that is, manually merging the schema.
In my case my script can handle nested types; it can be found here:
https://github.com/miguellobato84/spark-delta-schema-evolution
I also wrote an article about this issue:
https://medium.com/@miguellobato84/improving-delta-lake-schema-evolution-2cce8db2f0f5
To get my issue resolved, I wrote a new function that essentially merges the schema of the delta table (if the delta table exists) with the JSON schema.
At a high level, I build a new schema that combines the common columns from the delta lake table with the new columns from the JSON fields, and then recreate the data frame by applying this new schema.
This has solved my issue.
from pyspark.sql.types import StructField, StructType
import numpy as np

def get_merged_schema(delta_table_schema, json_data_schema):
    print('str(len(delta_table_schema.fields)) -> ' + str(len(delta_table_schema.fields)))
    print('str(len(json_data_schema.fields)) -> ' + str(len(json_data_schema.fields)))
    no_common_elements = False
    no_new_elements = False
    struct_field_array = []
    # keep the delta table's definition for every column present in both schemas
    if len(set(delta_table_schema.names).intersection(set(json_data_schema.names))) > 0:
        common_col = set(delta_table_schema.names).intersection(set(json_data_schema.names))
        print('common_col len: -> ' + str(len(common_col)))
        for name in common_col:
            for f in delta_table_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_common_elements = True
        print("no common elements")
    # add the columns that only exist in the incoming JSON, with their inferred types
    if len(np.setdiff1d(json_data_schema.names, delta_table_schema.names)) > 0:
        diff_list = np.setdiff1d(json_data_schema.names, delta_table_schema.names)
        print('diff_list len: -> ' + str(len(diff_list)))
        for name in diff_list:
            for f in json_data_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_new_elements = True
        print("no new elements")
    print('len(StructType(struct_field_array)) -> ' + str(len(StructType(struct_field_array))))
    df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType(struct_field_array))
    if no_common_elements and no_new_elements:
        return StructType([])
    else:
        # return the merged schema with the columns in sorted order
        return df.select(sorted(df.columns)).schema
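A hedged sketch of how the function above might be used; the names json_path and delta_path are illustrative, DeltaTable.forPath assumes the target table already exists, and the flattening step from the question is omitted for brevity:
from delta.tables import DeltaTable

# read the incoming JSON once to let Spark infer its schema
json_df = spark.read.json(json_path)
delta_schema = DeltaTable.forPath(spark, delta_path).toDF().schema

merged_schema = get_merged_schema(delta_schema, json_df.schema)

# re-read the JSON with the merged schema so common fields keep the table's types
# and brand-new fields keep the types inferred from the JSON
aligned_df = spark.read.schema(merged_schema).json(json_path)
aligned_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(delta_path)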
When I read JSON through Spark (using Scala):
val rdd = spark.sqlContext.read.json("/Users/sanyam/Downloads/data/input.json")
val df = rdd.toDF()
df.show()
println(df.schema)
//val schema = df.schema.add("_corrupt_record",org.apache.spark.sql.types.StringType,true)
//val rdd1 = spark.sqlContext.read.schema(schema).json("/Users/sanyam/Downloads/data/input_1.json")
//rdd1.toDF().show()
this results in the following DF:
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
| appId| appTimestamp|appVersion| bankCode|bankLocale| data|date| environment| event| id| logTime| logType| msid| muid| owner|recordType| uuid|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
StructType(StructField(appId,StringType,true), StructField(appTimestamp,StringType,true), StructField(appVersion,StringType,true), StructField(bankCode,StringType,true), StructField(bankLocale,StringType,true), StructField(data,StringType,true), StructField(date,StringType,true), StructField(environment,StringType,true), StructField(event,StringType,true), StructField(id,LongType,true), StructField(logTime,LongType,true), StructField(logType,StringType,true), StructField(msid,StringType,true), StructField(muid,StringType,true), StructField(owner,StringType,true), StructField(recordType,StringType,true), StructField(uuid,StringType,true))
If I want to apply validation to any further JSON I read, I take the schema as a variable and pass it as the .schema argument [refer to the commented lines of code]. But even then the corrupt records don't go into the _corrupt_record column (which should happen by default); instead those bad records are parsed as null in all columns, and this results in data loss because there is no trace of them.
Although everything works fine when you add the _corrupt_record column to the schema explicitly, and the corrupt record goes into that column, I want to know why this is so.
(Also, if you give Spark malformed JSON without a schema, it automatically handles it by adding a _corrupt_record column, so why does schema validation need the explicit column addition?)
Reading corrupt JSON data returns the schema as [_corrupt_record: string]. But here you are reading the corrupt data with a schema that does not include that column, and hence you get the whole row as null.
When you add _corrupt_record explicitly, you get the whole JSON record in that column, and I assume null in all the other columns.
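A minimal PySpark sketch of that workaround (the Scala read API behaves the same way); the file paths are illustrative and the clean file is only used to infer the base schema:
from pyspark.sql.types import StructField, StringType

# infer the schema from a clean file, then explicitly add _corrupt_record so that
# malformed rows are captured instead of being turned into all-null rows
base_schema = spark.read.json("/path/to/input.json").schema
schema_with_corrupt = base_schema.add(StructField("_corrupt_record", StringType(), True))

df = spark.read.schema(schema_with_corrupt).json("/path/to/input_1.json")
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)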
I'm very new to Hadoop,
I'm using Spark with Java.
I have dynamic JSON, for example:
{
  "sourceCode":"1234",
  "uuid":"df123-....",
  "title":"my title"
}{
  "myMetaDataEvent": {
    "date":"10/10/2010"
  },
  "myDataEvent": {
    "field1": {
      "field1Format":"fieldFormat",
      "type":"Text",
      "value":"field text"
    }
  }
}
Sometimes I see only field1, and sometimes I see field1...field50.
The user may also add or remove fields from this JSON.
I want to insert this dynamic JSON into Hadoop (into a Hive table) from Spark Java code.
How can I do it?
I want the user to be able to run Hive queries afterwards, e.g.: select * from MyTable where type="Text"
I have around 100B JSON records per day that I need to insert into Hadoop,
so what is the recommended way to do that?
I've looked at the following: SO Question, but that assumes a known JSON schema, which isn't my case.
Thanks
I had encountered a similar problem and was able to resolve it using this approach (so this might help if you create the schema before you parse the JSON).
For a field having a string data type you could create the schema :-
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.StringType, true);
For a field having an int data type you could create the schema :-
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.IntegerType, true);
After you have added all the fields in a List<StructField>,
Eg:-
List<StructField> innerField = new ArrayList<StructField>();
.... Field adding logic ....
Eg:-
innerField.add(field1);
innerField.add(field2);
// A single instance may come, or multiple instances may come in an array; in that case they need to be wrapped in an ArrayType.
ArrayType getArrayInnerType = DataTypes.createArrayType(DataTypes.createStructType(innerField));
StructField getArrayField = DataTypes.createStructField(<name of field>, getArrayInnerType,true);
You can then create the schema :-
StructType structuredSchema = DataTypes.createStructType(Arrays.asList(getArrayField)); // createStructType expects a List<StructField> or StructField[]
Then I read the JSON using the generated schema via the Dataset API:
Dataset<Row> dataRead = sqlContext.read().schema(structuredSchema).json(fileName);
I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table I get an error.
How can I map a JSON file source to a JSONB target column?
It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:
alter table postgres_table
alter column column_with_json set data type jsonb using column_with_json::jsonb;
Note the use of the cast for the existing text data. Without that, the alter column would fail.
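If the JSON value arrives in the Glue job as a nested struct rather than a plain string, one way to serialize it before the JDBC write is Spark's to_json, so the TEXT column receives valid JSON text; this is an assumption on my part (the column name column_with_json is illustrative), not part of the answer above:
from pyspark.sql.functions import to_json

# serialize the struct column to a JSON string so it can be loaded into the TEXT
# column and later cast to JSONB with the ALTER above (hypothetical column name)
df = df.withColumn("column_with_json", to_json("column_with_json"))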
The crawler will identify the JSONB column type as "string", but you can try to use the Unbox class in Glue to convert this column to JSON.
Let's check the following table in PostgreSQL:
create table persons (id integer, person_data jsonb, creation_date timestamp )
Here is an example of one record from the persons table:
ID = 1
PERSON_DATA = {
  "firstName": "Sergii",
  "age": 99,
  "email": "Test@test.com"
}
CREATION_DATE = 2021-04-15 00:18:06
The following code needs to be added in Glue:
# 1. create a dynamic frame from the catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "persons", transformation_ctx = "df_persons")
# 2. in 'path' give the jsonb column name that needs to be converted to json
df_persons_json = Unbox.apply(frame = df_persons, path = "person_data", format = "json")
# 3. convert from a dynamic frame to a data frame
datf_persons_json = df_persons_json.toDF()
# 4. after that you can process this column as a json data type, or create a dataframe with all the necessary columns; each json data element can be added as a separate column in the dataframe:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
You can also check the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
The table structure is as follows:
db.define_table('parent',
Field('name'),format='%(name)s')
db.define_table('children',
Field('name'),
Field('mother','reference parent'),
Field('father','reference parent'))
db.children.mother.requires = IS_IN_DB(db, db.parent.id,'%(name)s')
db.children.father.requires = IS_IN_DB(db, db.parent.id,'%(name)s')
Controller:
grid = SQLFORM.grid(db.children, orderby=[db.children.id],
csv=True,
fields=[db.children.id, db.children.name, db.children.mother, db.children.father])
return dict(grid=grid)
Here the grid shows the proper values, i.e. the names of the mother and father from the parent table.
But when I try to export it via the CSV link, the resulting spreadsheet shows the ids and not the names of the mother and father.
Please help!
The CSV download just gives you the raw database values without first applying each field's represent attribute. If you want the "represented" values of each field, you have two options. First, you can choose the TSV (tab-separated-values) download instead of CSV. Second, you can define a custom export class:
import cStringIO

class CSVExporter(object):
    file_ext = "csv"
    content_type = "text/csv"

    def __init__(self, rows):
        self.rows = rows

    def export(self):
        if self.rows:
            s = cStringIO.StringIO()
            self.rows.export_to_csv_file(s, represent=True)
            return s.getvalue()
        else:
            return ''

grid = SQLFORM.grid(db.mytable, exportclasses=dict(csv=(CSVExporter, 'CSV')))
The exportclasses argument is a dictionary of custom download types that can be used to override existing types or add new ones. Each item is a tuple including the exporter class and the label to be used for the download link in the UI.
We should probably add this as an option.