PySpark 2 CSV read: quote character is ignored

tx = 'a,b,c,"[""d"", ""e""]""'
file = open('temp.csv', 'wt')
file.writelines(tx)
file.close()
sparkSession.read.csv('temp.csv', quote='"').show()
+---+---+---+-------+---------+
|_c0|_c1|_c2|    _c3|      _c4|
+---+---+---+-------+---------+
|  a|  b|  c|"[""d""| ""e""]""|
+---+---+---+-------+---------+
Where the desired output is
+---+---+---+-------------------+
|_c0|_c1|_c2|                _c3|
+---+---+---+-------------------+
|  a|  b|  c|  "[""d"", ""e""]""|
+---+---+---+-------------------+

I am not familiar with PySpark, but there seems to be something wrong with the quotes (one too many) - it should probably be:
'a,b,c,"[""d"", ""e""]"'
and the output would then be:
+---+---+---+-------------------+
|_c0|_c1|_c2|                _c3|
+---+---+---+-------------------+
|  a|  b|  c|         ["d", "e"]|
+---+---+---+-------------------+
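The quoting can be sanity-checked outside Spark: Python's built-in csv module applies the same RFC 4180-style convention, where a quote inside a quoted field is written as two doubled quotes. This is only a quick check of the corrected line, not Spark itself:

```python
import csv
import io

# Corrected line: the 4th field is quoted, and inner quotes are doubled.
line = 'a,b,c,"[""d"", ""e""]"'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['a', 'b', 'c', '["d", "e"]']
```

With the stray trailing quote removed, the fourth field parses to the single value ["d", "e"], matching the corrected expected output above.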


Pyspark - how to group by and create a key value pair column

I have data similar to the below:
Col1,col2,col3
a,1,#
b,2,$
c,3,%
I need to create a new column with col2 as key and col3 as value, similar to below:
Col1,col2,col3,col4
a,1,#,{1:#}
b,2,$,{2:$}
c,3,%,{3:%}
How can I achieve this using pyspark?
Try format_string:
import pyspark.sql.functions as F
df2 = df.withColumn('col4', F.format_string('{%d:%s}', 'col2', 'col3'))
df2.show()
+----+----+----+-----+
|Col1|col2|col3| col4|
+----+----+----+-----+
|   a|   1|   #|{1:#}|
|   b|   2|   $|{2:$}|
|   c|   3|   %|{3:%}|
+----+----+----+-----+
If you want a key-value relationship, maps might be more appropriate:
df2 = df.withColumn('col4', F.create_map('col2', 'col3'))
df2.show()
+----+----+----+--------+
|Col1|col2|col3|    col4|
+----+----+----+--------+
|   a|   1|   #|[1 -> #]|
|   b|   2|   $|[2 -> $]|
|   c|   3|   %|[3 -> %]|
+----+----+----+--------+
You can also convert the map to a JSON string, similar to your expected output:
df2 = df.withColumn('col4', F.to_json(F.create_map('col2', 'col3')))
df2.show()
+----+----+----+---------+
|Col1|col2|col3|     col4|
+----+----+----+---------+
|   a|   1|   #|{"1":"#"}|
|   b|   2|   $|{"2":"$"}|
|   c|   3|   %|{"3":"%"}|
+----+----+----+---------+
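For reference, to_json over a single-entry map behaves like Python's json.dumps with compact separators, with keys rendered as JSON strings. A plain-Python sketch of that last transformation (illustrative only, not Spark):

```python
import json

rows = [("a", 1, "#"), ("b", 2, "$"), ("c", 3, "%")]
# Mimic F.to_json(F.create_map('col2', 'col3')): keys become JSON strings.
col4 = [json.dumps({str(c2): c3}, separators=(",", ":")) for _, c2, c3 in rows]
print(col4)  # ['{"1":"#"}', '{"2":"$"}', '{"3":"%"}']
```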

CSV columns removed from file while loading DataFrame

While loading a CSV via Databricks, the 4th column of the 2nd row below is not loaded.
The number of columns in the CSV varies per row.
In test_01.csv,
a,b,c
s,d,a,d
f,s
Loaded the above CSV file via Databricks as below:
>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+
Tried loading with textFile:
rdd = sc.textFile("sample_files/test_01.csv")
rdd.collect()
[u'a,b,c', u's,d,a,d', u'f,s']
But converting the above RDD to a DataFrame causes an error.
I was able to solve it by specifying the schema as below.
df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")
df2.show()
+---+---+----+----+----+
| e1| e2|  e3|  e4|  e5|
+---+---+----+----+----+
|  a|  b|   c|null|null|
|  s|  d|   a|   d|null|
|  f|  s|null|null|null|
+---+---+----+----+----+
Tried with inferSchema; still not working:
df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
df2.show()
+---+---+----+
| C0| C1|  C2|
+---+---+----+
|  a|  b|   c|
|  s|  d|   a|
|  f|  s|null|
+---+---+----+
But is there any other way without specifying the schema, since the number of columns varies?
Ensure you have fixed headers, i.e. rows can have data missing but the column names should be fixed.
If you don't specify column names, you can still create the schema while reading the CSV:
val schema = new StructType()
  .add(StructField("keyname", StringType, true))
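If a fixed schema is undesirable, one workaround is to pre-pad the ragged rows to a uniform width before handing them to Spark (e.g. via spark.createDataFrame on the padded rows). A hedged plain-Python sketch, not the Databricks reader itself:

```python
lines = ["a,b,c", "s,d,a,d", "f,s"]  # contents of test_01.csv
rows = [line.split(",") for line in lines]
width = max(len(r) for r in rows)  # the widest row decides the column count
padded = [r + [None] * (width - len(r)) for r in rows]
print(padded)
# [['a', 'b', 'c', None], ['s', 'd', 'a', 'd'], ['f', 's', None, None]]
```

This keeps the 4th value of the 2nd row, at the cost of a preprocessing pass over the file.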

Adding a unique consecutive row number to a DataFrame in PySpark

I want to add a unique row number to my DataFrame in PySpark and don't want to use the monotonically_increasing_id or partitionBy methods.
I think this question might be a duplicate of similar questions asked earlier, but I'm still looking for advice on whether I am doing it the right way or not.
Following is a snippet of my code:
I have a csv file with below set of input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this CSV file into a dataframe:
PATH_TO_FILE = "file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD = empRDD.map(lambda x: list(x[0]) + [x[1]])
newRDD.take(2)
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
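Conceptually, the zipWithIndex step above pairs each record with its position, the same as Python's enumerate over the rows. A plain-Python sketch of what the map produces (the sample names here are shortened, illustrative stand-ins for the real data):

```python
records = [(1, "VIKRANT SINGH RANA", "NOIDA", 10000),
           (3, "GOVIND NIMBHAL", "DWARKA", 92000)]
# Append each row's index as a new trailing field, like zipWithIndex + map.
indexed = [list(row) + [i] for i, row in enumerate(records)]
print(indexed[0])  # [1, 'VIKRANT SINGH RANA', 'NOIDA', 10000, 0]
```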
When I appended the int value to my list, I lost the dataframe schema.
newdf = newRDD.toDF(['emp_id', 'emp_name', 'emp_city', 'emp_salary', 'row_id'])
newdf.show()
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way? Or is there a better way to add the row number and preserve the schema of the dataframe in PySpark?
Is it feasible to use the zipWithIndex method to add a unique consecutive row number for a large dataframe as well? Can we use this row_id to re-partition the dataframe to uniformly distribute the data across the partitions?
I have found a solution and it's very simple.
Since I have no column in my dataframe which has the same value across all the rows, using row_number with a partitionBy clause does not generate unique consecutive row numbers on its own.
Let's add a new column to the existing dataframe with some default value in it:
from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

emp_df = emp_df.withColumn("new_column", lit("ABC"))
and create a window function with partitionBy using that column "new_column":
w = Window().partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
you will get the desired results:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
Using Spark SQL:
df = spark.sql("""
    SELECT
        row_number() OVER (
            PARTITION BY ''
            ORDER BY ''
        ) as id,
        *
    FROM VALUES
        ('Bob ', 20),
        ('Alice', 21),
        ('Gary ', 21),
        ('Kent ', 25),
        ('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- col1: string (nullable = false)
 |-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
|  1| Bob |  20|
|  2|Alice|  21|
|  3|Gary |  21|
|  4|Kent |  25|
|  5|Gary |  35|
+---+-----+----+

Split string into multiple rows with one character each

I want to split a word in a column into multiple rows, each with a single character. Small example below:
Id Name StartDate EndDate
1 raj 2017-07-05 2008-08-06
here the expected result is:
Id Name StartDate EndDate
1 r 2017-07-05 2008-08-06
1 a 2017-07-05 2008-08-06
1 j 2017-07-05 2008-08-06
First split the string into a list and then use explode. Note that filter needs to be used, as otherwise one row would contain an empty string.
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val df = spark.createDataFrame(Seq((1, "raj"), (2, "test"))).toDF("Id", "Name")
val df2 = df.withColumn("Name", explode(split($"Name", ""))).filter($"Name" =!= "")
This will give you:
+---+----+
| Id|Name|
+---+----+
|  1|   r|
|  1|   a|
|  1|   j|
|  2|   t|
|  2|   e|
|  2|   s|
|  2|   t|
+---+----+
Note: for older versions of Spark (before 2.0.0), use !== instead of =!= when checking for inequality.
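The split-then-explode step can be mimicked row by row in plain Python, which makes the behaviour easy to verify (an illustrative sketch, not Spark):

```python
rows = [(1, "raj", "2017-07-05", "2008-08-06")]
# One output row per character of Name; the other columns are repeated.
exploded = [(i, ch, start, end)
            for (i, name, start, end) in rows
            for ch in name]
# Three rows result: one each for 'r', 'a', 'j'.
print(exploded)
```

Iterating over the string directly never produces empty entries, which is why the Spark version needs the extra filter while this sketch does not.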

Join query issue in MySQL

I'm storing the records in a hierarchy.
Ex.
Account -> Hospital -> Department
Account -> Hospital -> Department -> Section
I'm storing the association of all the records in following manner.
+------+---------------+----------+---------------+-----------+
| Id | ParentType | ParentId | Child Type | ChildId |
+------+---------------+----------+---------------+-----------+
| 1| account| 1| hospital| 10|
| 2| account| 1| hospital| 20|
| 3| hospital| 10| department| 100|
| 4| hospital| 10| department| 101|
| 5| department| 100| device| 1000|
| 6| department| 101| device| 1001|
| 6| department| 101| device| 1002|
| 1| account| 2| hospital| 30|
| 2| account| 2| hospital| 40|
| 3| hospital| 30| department| 200|
| 4| hospital| 40| department| 201|
| 5| department| 200| section| 5000|
| 5| department| 200| section| 5001|
| 6| section| 5000| device| 2001|
| 6| section| 5001| device| 2002|
+------+---------------+----------+---------------+-----------+
So, account with id 1, follows first hierarchy; whereas account with id 2 follows second hierarchy.
I need to fetch the records for the given level.
Ex.
Get all the devices belonging to account with id = 1
Get all the devices belonging to department with id = 200 and account with id = 2
and so on.
I can retrieve these with queries like:
First query:
SELECT a3.ChildType, a3.ChildId FROM association_lookup a1 -- [got hosp level]
JOIN association_lookup a2 ON a2.parentId = a1.ChildId -- [got dept level]
JOIN association_lookup a3 ON a3.parentId = a2.ChildId AND a3.ParentType = a2.ChildType -- [got device level]
WHERE a1.ParentId = 1 AND a1.ParentType = 'account'
AND a3.ChildType = 'device'
I can make this a dynamic query, with the number of self joins equal to the level difference minus 1, i.e. account level = 0, device level = 3, hence 2 joins.
But now, if I want to associate a device at the hospital level instead of the department level, like:
| xx| hospital| 10| device| 1003|
then for the same query this device will be skipped and only the devices associated at the department level will be returned. How can I get all the devices (i.e. under both the hospital level and the department level)?
That is a horrible way to store data.
I suggest restructuring and creating separate tables for each entity,
i.e. create table account, create table hospital, ...
Then you can join properly. Everything else would require dynamic iterative selection, which is not built into MySQL and needs to be done with an external program or by hand.
You can write a script to dynamically generate a table for each parenttype and childtype, though.
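The "external program" traversal the answer mentions can be sketched in a few lines of Python: walk the adjacency rows from a starting node and collect devices at any depth, which also covers devices attached directly to a hospital. The table below is a small hypothetical subset of the question's data, with one extra hospital-level device:

```python
# (parent_type, parent_id, child_type, child_id) rows, as in the question.
associations = [
    ("account", 1, "hospital", 10),
    ("hospital", 10, "department", 100),
    ("department", 100, "device", 1000),
    ("hospital", 10, "device", 1003),  # device attached directly to a hospital
]

def devices_under(parent_type, parent_id):
    """Collect all device ids reachable from the given node, any depth."""
    found, stack = [], [(parent_type, parent_id)]
    while stack:
        ptype, pid = stack.pop()
        for p_t, p_i, c_t, c_i in associations:
            if (p_t, p_i) == (ptype, pid):
                if c_t == "device":
                    found.append(c_i)
                else:
                    stack.append((c_t, c_i))
    return sorted(found)

print(devices_under("account", 1))  # [1000, 1003]
```

A recursive query in SQL could express the same traversal, but with a plain association table like this, a small script is the simplest way to handle devices hanging off different levels.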