I'm a new Spark user currently playing around with Spark and some big data, and I have a question related to Spark SQL, or more formally the SchemaRDD. I'm reading a JSON file containing data about some weather forecasts, and I'm not really interested in all of the fields ... I only want 10 fields out of the 50+ fields returned for each record. Is there a way (similar to filter) that I can use to specify the names of the fields I want removed in Spark?
Just a small descriptive example: consider I have the schema "Person" with 3 fields, "Name", "Age", and "Gender", and I'm not interested in the "Age" field and would like to remove it. Can I use Spark somehow to do that? Thanks.
If you are using Spark 1.2, you can do the following (using Scala)...
If you already know what fields you want to use, you can construct the schema for these fields and apply this schema to the JSON dataset. Spark SQL will return a SchemaRDD. Then, you can register it and query it as a table. Here is a snippet...
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people" (every line of this file is a JSON object).
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Only values of name and gender fields will be in the results.
val results = sqlContext.sql("SELECT * FROM people")
When you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will only see the name and gender fields.
Or, if you want to explore the dataset and determine what fields you want after you see all fields, you can ask Spark SQL to infer the schema for you. Then, you can register the SchemaRDD as a table and use projection to remove unneeded fields. Here is a snippet...
// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project name and gender field.
sqlContext.sql("SELECT name, gender FROM people")
You can specify which fields you would like to have in the SchemaRDD. Below is an example. Create a case class with only the fields that you need. Read the data into an RDD, then map it to instances of the case class, keeping only the fields that you need (in the same order as you have specified them in the case class).
Sample Data: People.txt
foo,25,M
bar,24,F
Code:
case class Person(name: String, gender: String)
// Needed in Spark 1.2 so the RDD of case classes is implicitly converted to a SchemaRDD
// (uses the sqlContext defined above).
import sqlContext.createSchemaRDD
val people = sc.textFile("People.txt").map(_.split(",")).map(p => Person(p(0), p(2)))
people.registerTempTable("people")
Related
I am working with PySpark (3.x) and Delta Lake, and I am facing some challenges with respect to data types.
We receive data as JSON, do some flattening on the JSON datasets, and save them as delta tables with the option "mergeSchema" set to true, as shown below. We are not imposing any schema on the table.
df.write \
  .format("delta") \
  .partitionBy("country", "city") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .save(delta_path)
The problem we are facing is that the data type of the JSON fields changes very often. For example, in the delta table "field_1" is stored with data type StringType, but the data type of "field_1" in the new JSON comes in as LongType. Because of this we get a merge-incompatible exception:
ERROR : Failed to merge fields 'field_1' and 'field_1'. Failed to merge incompatible data types StringType and LongType
How can we handle such data type evolution in delta tables? I don't want to handle data type changes at the field level, because we have more than 300 fields coming in as part of the JSON.
According to the article Diving Into Delta Lake: Schema Enforcement & Evolution the option mergeSchema=true can handle the following scenarios:
Adding new columns (this is the most common scenario)
Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
The article also gives a hint on what can be done in your case (a sketch based on your write snippet follows the list below):
"Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). For example, in the case where the column “Foo” was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be re-written. Those changes include:"
Dropping a column
Changing an existing column’s data type (in place)
Renaming column names that differ only by case (e.g. “Foo” and “foo”)
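In the question's setup, that overwrite would look roughly like the following. This is a minimal sketch reusing the df and delta_path names from the snippet above; it rewrites the whole table, so it only fits when a full overwrite of the existing data is acceptable.
# Sketch: overwrite the table and its schema when an in-place type change
# (e.g. StringType -> LongType for field_1) cannot be merged.
df.write \
    .format("delta") \
    .partitionBy("country", "city") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(delta_path)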
I have also taken an approach similar to nilesh1212's, that is, manually merging the schema.
In my case my script can handle nested types; it can be found here:
https://github.com/miguellobato84/spark-delta-schema-evolution
Also, I wrote this article regarding this issue
https://medium.com/#miguellobato84/improving-delta-lake-schema-evolution-2cce8db2f0f5
In order to get my issue resolved, I have written a new function that essentially merges the schema of the delta table (if the delta table exists) with the JSON schema.
At a high level, I create a new schema: this new schema is essentially a combination of the common columns from the delta lake table and the new columns from the JSON fields. I then recreate the data frame by applying this new schema.
This has solved my issue.
from pyspark.sql.types import StructField, StructType
import numpy as np

def get_merged_schema(delta_table_schema, json_data_schema):
    print('str(len(delta_table_schema.fields)) -> ' + str(len(delta_table_schema.fields)))
    print('str(len(json_data_schema.fields)) -> ' + str(len(json_data_schema.fields)))
    no_common_elements = False
    no_new_elements = False
    struct_field_array = []
    # Keep the common columns with the data types already stored in the delta table.
    if len(set(delta_table_schema.names).intersection(set(json_data_schema.names))) > 0:
        common_col = set(delta_table_schema.names).intersection(set(json_data_schema.names))
        print('common_col len: -> ' + str(len(common_col)))
        for name in common_col:
            for f in delta_table_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_common_elements = True
        print("no common elements")
    # Add the columns that only exist in the incoming JSON, with their JSON data types.
    if len(np.setdiff1d(json_data_schema.names, delta_table_schema.names)) > 0:
        diff_list = np.setdiff1d(json_data_schema.names, delta_table_schema.names)
        print('diff_list len: -> ' + str(len(diff_list)))
        for name in diff_list:
            for f in json_data_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_new_elements = True
        print("no new elements")
    print('len(StructType(struct_field_array)) -> ' + str(len(StructType(struct_field_array))))
    df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType(struct_field_array))
    if no_common_elements and no_new_elements:
        return StructType(None)
    else:
        return df.select(sorted(df.columns)).schema
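A hypothetical usage sketch of that function (the json_path name, the casting step, and the write options are assumptions, not part of the original answer): read the incoming JSON, build the merged schema against the current delta table schema, cast the incoming columns to the merged types, and then append.
from pyspark.sql.functions import col

# Assumed inputs: delta_path points at the existing delta table, json_path at the new JSON files.
delta_table_schema = spark.read.format("delta").load(delta_path).schema
json_df = spark.read.json(json_path)

merged_schema = get_merged_schema(delta_table_schema, json_df.schema)

# Cast the incoming columns to the merged data types so the append no longer
# fails with "Failed to merge incompatible data types".
aligned_df = json_df.select(
    [col(f.name).cast(f.dataType) for f in merged_schema.fields if f.name in json_df.columns])

aligned_df.write \
    .format("delta") \
    .partitionBy("country", "city") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(delta_path)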
I'm very new to Hadoop, and I'm using Spark with Java.
I have dynamic JSON, for example:
{
  "sourceCode": "1234",
  "uuid": "df123-....",
  "title": "my title"
}{
  "myMetaDataEvent": {
    "date": "10/10/2010"
  },
  "myDataEvent": {
    "field1": {
      "field1Format": "fieldFormat",
      "type": "Text",
      "value": "field text"
    }
  }
}
Sometimes I can see only field1 and sometimes I can see field1...field50
And maybe the user can add fields/remove fields from this JSON.
I want to insert this dynamic JSON into Hadoop (into a Hive table) from Spark Java code.
How can I do it?
I want the user to be able to run a Hive query afterwards, e.g.: select * from MyTable where type="Text"
I have around 100B JSON records per day that I need to insert into Hadoop,
so what is the recommended way to do that?
*I looked at the following: SO Question, but there the JSON schema is known, which isn't my case.
Thanks
I had encountered a similar kind of problem, and I was able to resolve it using this approach. (So this might help if you create the schema before you parse the JSON.)
For a field having a String data type you could create the schema:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.StringType, true);
For a field having an int data type you could create the schema:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.IntegerType, true);
Next, add all the fields to a List<StructField>, e.g.:
List<StructField> innerField = new ArrayList<StructField>();
.... Field adding logic ....
For example:
innerField.add(field1);
innerField.add(field2);
// A single value may come in, or multiple values may come as an array; in the latter case they need to be wrapped in an ArrayType.
ArrayType getArrayInnerType = DataTypes.createArrayType(DataTypes.createStructType(innerField));
StructField getArrayField = DataTypes.createStructField(<name of field>, getArrayInnerType,true);
You can then create the schema; note that createStructType expects a list or array of fields, so wrap the single field (using java.util.Arrays):
StructType structuredSchema = DataTypes.createStructType(Arrays.asList(getArrayField));
Then read the JSON using the generated schema via the Dataset API:
Dataset<Row> dataRead = sqlContext.read().schema(structuredSchema).json(fileName);
I have a CSV file of data in the form
21.06.2016 23:00:00.349, 153.461, 153.427
21.06.2016 23:00:00.400, 153.460, 153.423
etc
The initial step of creating a frame involves the optional inclusion of a 'schema' to specify or rename column heads and specify types:
let df = Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/data/GBPJPY.csv", hasHeaders=true, inferTypes=false, schema="TS (DateTimeOffset), Bid (float(3)), Ask (float(3))")
I would like to specify the first column of string values to be ParseExact'ed to DateTimeOffset of the format
"dd.mm.yyyy HH:mm:ss.fff"
(I'm assuming the use of the setting System.Globalization.CultureInfo.InvariantCulture).
How do I express the schema such that it will parse the datetime string in that first column, i.e. Frame.ReadCsv("file.csv", schema = ........ )? Or is this not possible to accomplish within the schema argument?
I tried to fetch data from MongoDB using MongoEngine with Flask. The query works perfectly; the problem is that when I convert the query result into JSON, it shows only the field names.
here is my code
view.py
from model import Users
result = Users.objects()
print(dumps(result))
model.py
from mongoengine import DynamicDocument, StringField

class Users(DynamicDocument):
    meta = {'collection': 'users'}
    user_name = StringField()
    phone = StringField()
output
[["id", "user_name", "phone"], ["id", "user_name", "phone"]]
Why does it show only the field names?
Your query returns a queryset. Use the .to_json() method to convert it.
Depending on what you need from there, you may want to use something like json.loads() to get a python dictionary.
For example:
from model import Users
# This returns <class 'mongoengine.queryset.queryset.QuerySet'>
q_set = Users.objects()
json_data = q_set.to_json()
# You might also find it useful to create python dictionaries
import json
dicts = json.loads(json_data)
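If the goal is to return this from a Flask view, a hypothetical sketch could look like the following (the app setup and route name are assumptions, not something stated in the question):
from flask import Flask, Response
from model import Users

app = Flask(__name__)

@app.route('/users')
def list_users():
    # to_json() serializes the documents themselves, not just the field names.
    return Response(Users.objects().to_json(), mimetype='application/json')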
I would like to generate a simple JSON file from a database.
I am not an expert in parsing JSON files using Python, nor in the NDB database engine, nor GQL.
What is the right query to search the data? See https://developers.google.com/appengine/docs/python/ndb/queries
How should I write the code to generate the JSON using the same schema as the JSON described below?
Many thanks for your help
Model Class definition using NDB:
# coding=UTF-8
from google.appengine.ext import ndb
import logging
class Albums(ndb.Model):
    """Models an individual Event entry with content and date."""
    SingerName = ndb.StringProperty()
    albumName = ndb.StringProperty()
Expected output:
{
  "Madonna": ["Madonna Album", "Like a Virgin", "True Blue", "Like a Prayer"],
  "Lady Gaga": ["The Fame", "Born This Way"],
  "Bruce Dickinson": ["Iron Maiden", "Killers", "The Number of the Beast", "Piece of Mind"]
}
For consistency, model names should be singular (Album, not Albums), and property names should be lowercase_with_underscores:
class Album(ndb.Model):
    singer_name = ndb.StringProperty()
    album_name = ndb.StringProperty()
To generate the JSON as described in your question:
1) Query the Album entities from the datastore:
albums = Album.query().fetch(100)
2) Iterate over them to form a python data structure:
albums_dict = {}
for album in albums:
    if album.singer_name not in albums_dict:
        albums_dict[album.singer_name] = []
    albums_dict[album.singer_name].append(album.album_name)
3) Use the json.dumps() method to encode it to JSON:
import json

albums_json = json.dumps(albums_dict)
Alternatively, you could use the built-in to_dict() method. Note that this produces a list of per-entity dicts rather than the singer-to-albums mapping shown above:
albums = Album.query().fetch(100)
albums_json = json.dumps([a.to_dict() for a in albums])
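If the JSON needs to be served rather than just built in memory, a minimal sketch of a webapp2 handler (assuming the app uses webapp2, which the question does not state) could be:
import json
import webapp2

class AlbumsJsonHandler(webapp2.RequestHandler):
    def get(self):
        albums = Album.query().fetch(100)
        albums_dict = {}
        for album in albums:
            albums_dict.setdefault(album.singer_name, []).append(album.album_name)
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps(albums_dict))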