I am reading a CSV file that has a header, but I am providing my own custom schema when reading it. I wanted to understand whether explain shows any difference depending on whether I provide a schema or not. My curiosity was raised by this statement about read.csv in the docs:
Loads a CSV file and returns the result as a DataFrame.
This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.
I could see the time difference at my prompt when I provide the schema compared to using inferSchema, but I don't see any difference in the explain output. Below are my code and the output with the schema provided:
>> friends_header_df = spark.read.csv(path='resources/fakefriends-header.csv',schema=custom_schems, header='true', sep=',')
>> print(friends_header_df._jdf.queryExecution().toString())
== Parsed Logical Plan ==
Relation[id#8,name#9,age#10,numFriends#11] csv
== Analyzed Logical Plan ==
id: int, name: string, age: int, numFriends: int
Relation[id#8,name#9,age#10,numFriends#11] csv
== Optimized Logical Plan ==
Relation[id#8,name#9,age#10,numFriends#11] csv
== Physical Plan ==
FileScan csv [id#8,name#9,age#10,numFriends#11] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,age:int,numFriends:int>
And below is the output when reading with the inferSchema option:
>> friends_noschema_df = spark.read.csv(path='resources/fakefriends-header.csv',header='true',inferSchema='true',sep=',')
>> print(friends_noschema_df._jdf.queryExecution().toString())
== Parsed Logical Plan ==
Relation[userID#32,name#33,age#34,friends#35] csv
== Analyzed Logical Plan ==
userID: int, name: string, age: int, friends: int
Relation[userID#32,name#33,age#34,friends#35] csv
== Optimized Logical Plan ==
Relation[userID#32,name#33,age#34,friends#35] csv
== Physical Plan ==
FileScan csv [userID#32,name#33,age#34,friends#35] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<userID:int,name:string,age:int,friends:int>
Except for the expression IDs on the columns changing, I don't see anything in the plans that reflects Spark reading all the data once.
inferSchema=false is the default option. With it you get all columns as strings in the DataFrame, whereas if you provide a schema you get the types you specified.
Inferring a schema means Spark kicks off an extra job behind the scenes to scan the input and work out the types; you can in fact see that job run. It takes longer, but, as you state, it does not show up in the explained plan, because schema inference happens before the query plan is built.
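For reference, a minimal sketch of the two read paths (assuming the column names shown in your plans; custom_schema here mirrors your custom_schems):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema matching the columns in the plans above
custom_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("numFriends", IntegerType(), True),
])

# No inference job: Spark trusts the schema and only plans the FileScan
with_schema = spark.read.csv('resources/fakefriends-header.csv',
                             schema=custom_schema, header=True, sep=',')

# inferSchema=True: Spark runs an extra job up front to scan the file and
# infer the types; it is visible in the Spark UI, not in explain()
inferred = spark.read.csv('resources/fakefriends-header.csv',
                          header=True, inferSchema=True, sep=',')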
I am working on PySpark (3.x) and Delta Lake, and I am facing some challenges with respect to datatypes.
We receive the data as JSON, do some flattening on the JSON datasets, and save them as Delta tables with the "mergeSchema" option set to true, as shown below. We are not imposing any schema on the table.
df.write\
.format("delta")\
.partitionBy("country","city")\
.option("mergeSchema","true")\
.mode("append")\
.save(delta_path)
The problem we are facing is that the data type of the JSON fields changes very often. For example, in the Delta table "field_1" is stored with datatype StringType, but the datatype of "field_1" in the new JSON arrives as LongType. Because of this we get a merge-incompatible exception:
ERROR: Failed to merge fields 'field_1' and 'field_1'. Failed to merge incompatible data types StringType and LongType
How can I handle such datatype evolution in Delta tables? I don't want to handle datatype changes at the field level, because we have more than 300 fields coming in as part of the JSON.
According to the article Diving Into Delta Lake: Schema Enforcement & Evolution the option mergeSchema=true can handle the following scenarios:
Adding new columns (this is the most common scenario)
Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
The article also gives a hint on what can be done in your case:
"Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). For example, in the case where the column “Foo” was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be re-written. Those changes include:"
Dropping a column
Changing an existing column’s data type (in place)
Renaming column names that differ only by case (e.g. “Foo” and “foo”)
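For an in-place type change like StringType to LongType, that means rewriting the table. A minimal sketch following the article's overwriteSchema option, assuming the same df, partition columns, and delta_path as in the question (note that it requires overwrite mode, so the existing Parquet files are rewritten):

df.write \
  .format("delta") \
  .partitionBy("country", "city") \
  .option("overwriteSchema", "true") \
  .mode("overwrite") \
  .save(delta_path)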
I have also taken an approach similar to nilesh1212's, that is, manually merging the schemas.
In my case my script can handle nested types; it can be found here:
https://github.com/miguellobato84/spark-delta-schema-evolution
I also wrote this article about the issue:
https://medium.com/@miguellobato84/improving-delta-lake-schema-evolution-2cce8db2f0f5
In order to get my issue resolved, I wrote a new function that essentially merges the schema of the Delta table (if the Delta table exists) with the JSON schema.
At a high level, I create a new schema that is a combination of the common columns from the Delta table and the new columns from the JSON fields, and then I recreate the data frame by applying this new schema.
This solved my issue.
import numpy as np
from pyspark.sql.types import StructField, StructType

def get_merged_schema(delta_table_schema, json_data_schema):
    # assumes an active SparkSession named `spark`
    print('len(delta_table_schema.fields) -> ' + str(len(delta_table_schema.fields)))
    print('len(json_data_schema.fields) -> ' + str(len(json_data_schema.fields)))
    no_common_elements = False
    no_new_elements = False
    struct_field_array = []

    # Columns present in both schemas keep the type already stored in the Delta table.
    common_col = set(delta_table_schema.names).intersection(set(json_data_schema.names))
    if len(common_col) > 0:
        print('common_col len: -> ' + str(len(common_col)))
        for name in common_col:
            for f in delta_table_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_common_elements = True
        print("no common elements")

    # Columns that only exist in the incoming JSON are added with the JSON type.
    diff_list = np.setdiff1d(json_data_schema.names, delta_table_schema.names)
    if len(diff_list) > 0:
        print('diff_list len: -> ' + str(len(diff_list)))
        for name in diff_list:
            for f in json_data_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_new_elements = True
        print("no new elements")

    print('len(StructType(struct_field_array)) -> ' + str(len(StructType(struct_field_array))))
    df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType(struct_field_array))

    if no_common_elements and no_new_elements:
        return StructType(None)
    else:
        # Return the merged schema with the columns in sorted order.
        return df.select(sorted(df.columns)).schema
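A hypothetical usage sketch (delta_path and json_path are placeholder names, and it assumes the delta-spark package's DeltaTable API): read the existing table's schema, merge it with the incoming JSON's schema, then re-read the JSON with the merged schema before appending.

from delta.tables import DeltaTable

# delta_path points at the existing table, json_path at the new data (placeholders)
existing_schema = DeltaTable.forPath(spark, delta_path).toDF().schema
incoming_schema = spark.read.json(json_path).schema

merged_schema = get_merged_schema(existing_schema, incoming_schema)

# Re-read the JSON with the merged schema, then append with mergeSchema as before
json_df = spark.read.schema(merged_schema).json(json_path)
json_df.write.format("delta").option("mergeSchema", "true").mode("append").save(delta_path)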
I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach since I use the HCatalog loader to load data from Hive.)
But the problem is the structure of my data; each value of the map structure contains the value part of a complex JSON document. For example,
1. My table looks like this (WARNING: the type of 'complex_data' is not STRING but a MAP<STRING, STRING>!):
TABLE temp_table
(
user_id BIGINT COMMENT 'user ID.',
complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains the following (the values I want to get are marked with two *s, so basically #'d'#'f' from each PARSED_STRING(complex_data#'c')):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD 'temp_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data per each row
-- eg) ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- THIS ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value is not a complete JSON document. However, when I CONCAT as in "attempt2", additional \ marks are generated (so each line starts with {\"key\": ). I'm not sure whether these additional marks break the parsing or not. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a Jython UDF.
I know that I could solve it using Java or other languages.
But I think Jython with jyson is the simplest answer to this issue.
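For reference, a minimal sketch of what such a UDF can look like. The file name, UDF name, and output schema are hypothetical; it assumes jyson exposes a json-module-like loads() (as its docs describe) and that Pig's Jython engine provides the @outputSchema decorator.

# extract_f.py -- hypothetical Jython UDF; register in Pig with something like:
#   REGISTER 'extract_f.py' USING jython AS myudfs;
#   values = FOREACH data GENERATE myudfs.extract_f(complex_data#'c');
from com.xhaus.jyson import JysonCodec as json   # jyson's json-like codec (assumption)

@outputSchema("f_values:bag{t:(f:chararray)}")
def extract_f(json_array_string):
    """Parse the chararray behind complex_data#'c' and return a bag of the d.f values."""
    if json_array_string is None:
        return None
    items = json.loads(json_array_string)        # the value is a JSON array of objects
    return [(item["d"]["f"],) for item in items if "d" in item]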
After retrieving results from the Google Custom Search API and writing them to JSON, I want to parse that JSON into valid Elasticsearch documents. You can configure a parent-child relationship for nested results; however, this relationship does not seem to be inferred from the data structure itself. I've tried loading it automatically, but with no results.
Below is some example input that doesn't include things like id or index; I'm trying to focus on creating the correct data structure. I've tried modifying graph algorithms like depth-first search but am running into problems with the different data structures.
Here's some example input:
# mock data structure
google = {"content": "foo",
"results": {"result_one": {"persona": "phone",
"personb": "phone",
"personc": "phone"
},
"result_two": ["thing1",
"thing2",
"thing3"
],
"result_three": "none"
},
"query": ["Taylor Swift", "Bob Dole", "Rocketman"]
}
# correctly formatted documents for _source of elasticsearch entry
correct_documents = [
{"content":"foo"},
{"results": ["result_one", "result_two", "result_three"]},
{"result_one": ["persona", "personb", "personc"]},
{"persona": "phone"},
{"personb": "phone"},
{"personc": "phone"},
{"result_two":["thing1","thing2","thing3"]},
{"result_three": "none"},
{"query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
]
Here is my current approach; it is still a work in progress:
def recursive_dfs(graph, start, path=None):
    '''Recursive depth-first search from start.'''
    if path is None:           # avoid a shared mutable default argument
        path = []
    path = path + [start]
    for node in graph[start]:
        if node not in path:
            path = recursive_dfs(graph, node, path)
    return path

def branching(google):
    """Get branches as a starting point for dfs."""
    for branch, key in enumerate(google):
        if isinstance(google[key], dict):      # 'is dict' never matches; use isinstance
            # recursive_dfs(google, google[key])
            pass
        else:
            print("branch {}: result {}\n".format(branch, google[key]))

branching(google)
You can see that recursive_dfs() still needs to be modified to handle string and list data structures.
I'll keep going at this, but if you have any thoughts, suggestions, or solutions, I would very much appreciate them. Thanks for your time.
Here is a possible answer to your problem.
def myfunk(inHole, outHole):
    """Flatten a nested dict into a list of single-key dicts."""
    for key in inHole:
        value = inHole[key]
        if isinstance(value, list):
            outHole.append({key: value})
        elif isinstance(value, dict):
            # Record this dict's keys, then recurse into it.
            outHole.append({key: list(value.keys())})
            myfunk(value, outHole)
        else:
            outHole.append({key: value})
    return outHole
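For what it's worth, a quick check against the mock google dict from the question (using the fixed-up myfunk above) yields one small dict per key, matching the shape of correct_documents:

documents = myfunk(google, [])
# e.g. {"content": "foo"}, {"results": ["result_one", "result_two", "result_three"]},
#      {"result_one": ["persona", "personb", "personc"]}, {"persona": "phone"}, ...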
I'm dealing with a data column whose values are massive JSON strings.
Each row's value is ~50,000 characters.
After spending some time trying to fiddle with fromJSON to go from JSON to a data frame whose columns are the JSON keys, and getting numerous errors in doing so, I ran isValidJSON() across the column and found that about 75% of my JSON is "invalid".
Now, I'm fully confident, based on the source, that this data is in fact valid JSON straight from the DB, so I would love to be able to identify where in the 50,000 characters the fromJSON function is running into trouble.
I've tried debug(), but it just tells me at which function call the error occurs.
I'd share sample rows if they weren't all so cumbersome, but it's a healthy mix of values; imagine a df with a df$features column like:
{"names":["bob","alice"],"ages":{"bob":20,"alice":21}, "id":54, "isTrue":false}... ad infinitum
Code I'm trying to run:
iValid <- function(x){return(isValidJSON(I(x)))}
sapply(df$features,iValid)
[1] TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE...
> fromJSON(df$features[2])
debugging in: fromJSON(df$features[2])
debug: standardGeneric("fromJSON")
Browse[2]> n
debugging in: fromJSON(content, handler, default.size, depth, allowComments,
asText = FALSE, data, maxChar, simplify = simplify, ...,
nullValue = nullValue, simplifyWithNames = simplifyWithNames,
encoding = encoding, stringFun = stringFun)
debug: standardGeneric("fromJSON")
Browse[3]> n
Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input
>
BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the Python documentation (http://docs.python.org/2/library/csv.html)
on the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file holds the fieldnames; however, my CSV dictionary file simply starts with "key","value" and goes on for at least 500,000 lines.
My program will ask the user for a title (thus the key) they are looking for and present the value (which is the second column) on the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (the input), my output should be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv

def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
        print list_of_stuff

main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is an open file object (or any iterable of lines), not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data: the dict() constructor in Python can take any iterable of key/value pairs.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with both 2 and 3 by using io.open instead of open.
import csv
import io

with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))

print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
If your program is really only supposed to print the result, there's no reason to build a keyed dictionary.
import csv
import io

# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass

def main():
    # Case-insensitive & leading/trailing-whitespace-insensitive match
    user_city = input('Enter a city: ').strip().lower()

    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            # for/else: only runs if the loop finished without a break
            print("City not found.")

if __name__ == '__main__':
    main()
The advantage of this technique is that the CSV isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower() on both the user's input and the city names to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, the search returns almost immediately and stops looking through the rest of the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
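For example, a minimal Python 3 sketch using the standard-library sqlite3 module (the database file name and table name are made up; the CSV name comes from the question):

import csv
import sqlite3

# One-time load: put the key/value pairs into an indexed SQLite table.
conn = sqlite3.connect('anchors.db')                     # hypothetical file name
conn.execute('CREATE TABLE IF NOT EXISTS anchors (title TEXT PRIMARY KEY, value TEXT)')
with open('anchor_summary2.csv', newline='', encoding='utf-8') as f:
    conn.executemany('INSERT OR REPLACE INTO anchors VALUES (?, ?)', csv.reader(f))
conn.commit()

# Lookups are then a single indexed query instead of a file scan.
row = conn.execute('SELECT value FROM anchors WHERE title = ?', ('Station Palais',)).fetchone()
print(row[0] if row else 'City not found.')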