I want to split a word in a column into multiple rows, each with a single character. Small example below:
Id Name StartDate EndDate
1 raj 2017-07-05 2008-08-06
Here, the expected result is:
Id Name StartDate EndDate
1 r 2017-07-05 2008-08-06
1 a 2017-07-05 2008-08-06
1 j 2017-07-05 2008-08-06
First split the string into an array of single characters and then use explode. Note that a filter is needed, as otherwise one row per input will contain an empty string.
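import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._ // enables the $"col" syntax (already imported in spark-shell)
val df = spark.createDataFrame(Seq((1, "raj"), (2, "test"))).toDF("Id", "Name")
val df2 = df.withColumn("Name", explode(split($"Name", ""))).filter($"Name" =!= "")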
This will give you:
+---+----+
| Id|Name|
+---+----+
| 1| r|
| 1| a|
| 1| j|
| 2| t|
| 2| e|
| 2| s|
| 2| t|
+---+----+
Note, for older versions of Spark (older than 2.0.0), use !== instead of =!= when checking for inequality.
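To see why the filter is needed, here is a minimal plain-Python sketch (not Spark code) of the same split, explode, and filter steps. It assumes Python 3.7 or later, where `re.split` with an empty pattern returns empty strings at the edges of the result; the answer above reports a similar stray empty string per input row in Spark. The helper name `explode_chars` is hypothetical.

```python
import re

def explode_chars(rows):
    """Mimic split(Name, "") + explode + filter(Name != "") on (id, name) tuples."""
    out = []
    for row_id, name in rows:
        # Splitting on an empty pattern can yield empty strings at the edges.
        pieces = re.split("", name)
        # Keep one output row per non-empty piece, as the filter does.
        out.extend((row_id, ch) for ch in pieces if ch != "")
    return out

rows = [(1, "raj"), (2, "test")]
result = explode_chars(rows)
```

Without the `ch != ""` condition, the result would also contain rows such as `(1, "")`.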
I have the following google-sheet:
| | A | B | C | D |
| 1| | Item1 | Item2 | Item3 |
| 2| Value1 | yes | no | no |
| 3| Value2 | no | no | yes |
| 4| Value3 | no | yes | no |
I need to import specific data from this sheet into another range, so the result should contain the values from column A that have "yes" somewhere in B:D, together with the related item from the first row, like:
Value1 Item1
Value2 Item3
Value3 Item2
I can import using query with a condition for "yes", but I have no idea how to read the related cell (Item1, Item2, ...) from the first row:
=query({importrange("my_range_id", "Data!A1:Z999")}, "select Col1 where Col2='yes' or Col3='yes' or Col4='yes'")
Thanks for help in advance!
Maybe you can first import the data, then retrieve the index of the column containing 'yes' with this formula, which uses matrix multiplication:
={A2:A5,arrayformula(mmult((if(B2:D5="yes",1,0)),transpose(column(B1:D1)-1)))}
and then retrieve the header of that column (assuming the formula above sits in F2, so the column index lands in G2):
=offset($A$1,,G2)
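The matrix-multiplication trick may be easier to follow outside the spreadsheet. The plain-Python sketch below reproduces the idea: each row becomes a 0/1 indicator vector, and its dot product with the column positions 1, 2, 3 gives the position of the single "yes". The function name `yes_headers` is hypothetical, and the sketch assumes exactly one "yes" per row, as in the example sheet.

```python
headers = ["Item1", "Item2", "Item3"]
labels = ["Value1", "Value2", "Value3"]
grid = [
    ["yes", "no", "no"],
    ["no", "no", "yes"],
    ["no", "yes", "no"],
]

def yes_headers(labels, headers, grid):
    """Pair each label with the header of the column that holds its 'yes'."""
    positions = range(1, len(headers) + 1)  # plays the role of column(B1:D1)-1
    result = []
    for label, row in zip(labels, grid):
        indicator = [1 if cell == "yes" else 0 for cell in row]
        if sum(indicator) != 1:
            # The dot product only identifies a column when there is one 'yes'.
            result.append((label, None))
            continue
        position = sum(i * p for i, p in zip(indicator, positions))
        result.append((label, headers[position - 1]))
    return result

pairs = yes_headers(labels, headers, grid)
```

With two or more "yes" cells in a row the dot product would add their positions together, which is why the sketch returns `None` in that case.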
As an addition, in case you still want to use a query:
Try this in F2:
=ARRAYFORMULA(QUERY(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(
IF(A2:A<>"",
"♦"&A2:A&"😊"&B1:D1&"😊"&B2:D, ))
,,999^99)),,999^99), "♦")), "😊"),"Select Col1,Col2 where Col3 contains 'yes'"))
Here is an integrated formula
={A2:A,arrayformula(vlookup(mmult((if($B$2:$D="yes",1,0)),transpose(column($B$1:$D$1)-1)),{arrayformula(transpose(COLUMN(A1:D1)-1)),transpose(A1:D1)},2,0))}
I have data similar to the following:
Col1,col2,col3
a,1,#
b,2,$
c,3,%
I need to create a new column with col2 as key and col3 as value, similar to below:
Col1,col2,col3,col4
a,1,#,{1:#}
b,2,$,{2:$}
c,3,%,{3:%}
How can I achieve this using pyspark?
Try format_string:
import pyspark.sql.functions as F
df2 = df.withColumn('col4', F.format_string('{%d:%s}', 'col2', 'col3'))
df2.show()
+----+----+----+-----+
|Col1|col2|col3| col4|
+----+----+----+-----+
| a| 1| #|{1:#}|
| b| 2| $|{2:$}|
| c| 3| %|{3:%}|
+----+----+----+-----+
If you want a key-value relationship, maps might be more appropriate:
df2 = df.withColumn('col4', F.create_map('col2', 'col3'))
df2.show()
+----+----+----+--------+
|Col1|col2|col3| col4|
+----+----+----+--------+
| a| 1| #|[1 -> #]|
| b| 2| $|[2 -> $]|
| c| 3| %|[3 -> %]|
+----+----+----+--------+
You can also convert the map to a JSON string, similar to your expected output:
df2 = df.withColumn('col4', F.to_json(F.create_map('col2', 'col3')))
df2.show()
+----+----+----+---------+
|Col1|col2|col3| col4|
+----+----+----+---------+
| a| 1| #|{"1":"#"}|
| b| 2| $|{"2":"$"}|
| c| 3| %|{"3":"%"}|
+----+----+----+---------+
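For comparison, here is a small plain-Python sketch (not PySpark) of the three representations above: a formatted string, a real key-value map, and a JSON string. The function name `with_col4` is hypothetical. It also shows that JSON object keys are always strings, which is why the last output has "1" rather than 1.

```python
import json

rows = [("a", 1, "#"), ("b", 2, "$"), ("c", 3, "%")]

def with_col4(rows):
    """Return (Col1, formatted string, map, JSON string) for each input row."""
    out = []
    for col1, col2, col3 in rows:
        as_text = "{%d:%s}" % (col2, col3)  # analogue of format_string
        as_map = {col2: col3}               # analogue of create_map
        # Compact separators give output without spaces after ':' and ','.
        as_json = json.dumps(as_map, separators=(",", ":"))
        out.append((col1, as_text, as_map, as_json))
    return out

converted = with_col4(rows)
```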
I have a Spark data frame:
| item_id | attribute_key | attribute_value |
| id_1    | brand         | Samsung         |
| id_1    | ram           | 6GB             |
| id_2    | brand         | Apple           |
| id_2    | ram           | 4GB             |
I want to group this data frame by item_id and write it out as a file in which each line is a JSON object:
{id_1: "properties":[{"brand":['Samsung']},{"ram":['6GB']} ]}
{id_2: "properties":[{"brand":['Apple']},{"ram":['4GB']} ]}
This is a big distributed data frame, so converting to pandas is not an option.
Is this kind of transformation even possible in PySpark?
In Scala, but the Python version will be very similar (using pyspark.sql.functions):
val df = Seq((1,"brand","Samsung"),(1,"ram","6GB"),(1,"ram","8GB"),(2,"brand","Apple"),(2,"ram","6GB")).toDF("item_id","attribute_key","attribute_value")
+-------+-------------+---------------+
|item_id|attribute_key|attribute_value|
+-------+-------------+---------------+
| 1| brand| Samsung|
| 1| ram| 6GB|
| 1| ram| 8GB|
| 2| brand| Apple|
| 2| ram| 6GB|
+-------+-------------+---------------+
df.groupBy('item_id,'attribute_key)
.agg(collect_list('attribute_value).as("list2"))
.groupBy('item_id)
.agg(map(lit("properties"),collect_list(map('attribute_key,'list2))).as("prop"))
.select(to_json(map('item_id,'prop)).as("json"))
.show(false)
output:
+------------------------------------------------------------------+
|json |
+------------------------------------------------------------------+
|{"1":{"properties":[{"ram":["6GB","8GB"]},{"brand":["Samsung"]}]}}|
|{"2":{"properties":[{"brand":["Apple"]},{"ram":["6GB"]}]}} |
+------------------------------------------------------------------+
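The two-level grouping can be sketched in plain Python (not Spark) to make the shape of the result explicit: first collect values per (item_id, attribute_key), then collect those per-key maps per item_id. The function name `to_json_lines` is hypothetical. Unlike the Spark output above, where the order of the collected elements is not guaranteed, this sketch keeps first-seen order.

```python
import json

rows = [
    (1, "brand", "Samsung"), (1, "ram", "6GB"), (1, "ram", "8GB"),
    (2, "brand", "Apple"), (2, "ram", "6GB"),
]

def to_json_lines(rows):
    """Group rows by item_id, then by attribute_key, and emit one JSON line per item."""
    grouped = {}
    for item_id, key, value in rows:
        # Inner step: collect_list(attribute_value) per (item_id, attribute_key).
        grouped.setdefault(item_id, {}).setdefault(key, []).append(value)
    lines = []
    for item_id, attrs in grouped.items():
        # Outer step: one single-entry map per attribute key, collected into a list.
        properties = [{key: values} for key, values in attrs.items()]
        lines.append(json.dumps({item_id: {"properties": properties}},
                                separators=(",", ":")))
    return lines

json_lines = to_json_lines(rows)
```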
I want to add a unique row number to my dataframe in PySpark and don't want to use the monotonicallyIncreasingId and partitionBy methods.
I think this question might be a duplicate of similar questions asked earlier, but I am still looking for advice on whether I am doing it the right way.
The following is a snippet of my code.
I have a CSV file with the following input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
I have loaded this CSV file into a dataframe:
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD=empRDD.map(lambda x: (list(x[0]) + [x[1]]))
newRDD.take(2);
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the int value to my list, I lost the dataframe schema.
newdf=newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show();
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way? Or is there a better way to add the column while preserving the schema of the dataframe in PySpark?
Is it feasible to use the zipWithIndex method to add unique consecutive row numbers to a large dataframe as well? Can we use this row_id to repartition the dataframe so that the data is distributed uniformly across the partitions?
I have found a solution and it's very simple.
Since no column in my dataframe has the same value across all rows, using row_number with a partitionBy clause on an existing column does not generate unique row numbers.
Let's add a new column to the existing dataframe with some default value in it (lit and row_number come from pyspark.sql.functions, Window from pyspark.sql.window).
emp_df= emp_df.withColumn("new_column",lit("ABC"))
and create a window with partitionBy on that column "new_column":
w = Window().partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
you will get the desired results:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
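The reason a constant column helps can be shown with a small plain-Python model of row_number (not Spark code; the function name `row_number` and the shortened rows are illustrative assumptions). Numbering restarts at 1 inside every partition, so partitioning by a real column such as emp_city repeats numbers, while a constant key puts all rows into one partition and yields 1..n. The sketch orders by emp_id to stay deterministic, whereas the answer orders by a constant, which leaves the order up to Spark.

```python
from itertools import groupby

def row_number(rows, partition_key, order_key):
    """Append a 1-based number to each row, restarting in every partition."""
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    numbered = []
    for _, group in groupby(ordered, key=partition_key):
        for n, row in enumerate(group, start=1):
            numbered.append(row + (n,))
    return numbered

rows = [(1, "NOIDA"), (3, "DWARKA"), (2, "GURGAON"), (7, "NOIDA")]

# Partition by city: numbering restarts per city, so numbers repeat.
by_city = row_number(rows, lambda r: r[1], lambda r: r[0])
# Partition by a constant: a single partition, so numbers are unique.
constant = row_number(rows, lambda r: "ABC", lambda r: r[0])
```

The same property has a cost in Spark: a single window partition generally means all rows are processed together, which may become a bottleneck on a large dataframe.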
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+
I have got 3 tables:
+-----+----------+ +-----+----------+-------+ +-----+----------+-------+
| id | A_id | | A_id| B_id | value | | B_id| B_id_ | value |
+-----+----------+ +-----+----------+-------+ +-----+----------+-------+
| 1| 5| | 5| 1| aa| | 1| 2| zzxx|
+-----+----------+ +-----+----------+-------+ +-----+----------+-------+
| 2| 3| | 3| 3| bb| | 2| | vvyy|
+-----+----------+ +-----+----------+-------+ +-----+----------+-------+
| 3| 4| bbll|
+-----+----------+-------+
| 5| | oopp|
+-----+----------+-------+
| 4| 5| mmnn|
+-----+----------+-------+
What SELECT statement do I need to use so that the output looks like this (table3 can reference itself up to 4 levels deep):
+----+------------------------------+
| id | value |
+----+------------------------------+
| 1| aa\zzxx\vvyy|
+----+------------------------------+
| 2| bb\bbll\mmnn\oopp|
+----+------------------------------+
As I don't have much experience with databases and SQL, this is hard for me, and I have no idea how to approach it.
This has to be done in MySQL. From what I have read, the hardest part is the recursive query, since MySQL doesn't support it and people have to simulate it. I have read some SO topics about recursive queries, but I concluded that approach is not for me.
Any help is appreciated.
By hard and fast learning I managed to solve my problem. Code below.
SELECT DISTINCT
OTHER.DATA,
concat(
'/',ifnull(t4.value,''), CASE WHEN (t4.value is NULL) then '' else '/' END,
ifnull(t3.value,''), CASE WHEN (t3.value is NULL) then '' else '/' END,
ifnull(t2.value,''), CASE WHEN (t2.value is NULL) then '' else '/' END,
ifnull(t1.value,''), CASE WHEN (t1.value is NULL) then '' else '/' END,
table2.value
) as 'My Column name'
FROM
table1
LEFT JOIN table2 ON
(table1.A_id = table2.A_id)
LEFT JOIN table3 as t1 ON
(t1.B_id = table2.B_id)
LEFT JOIN table3 AS t2 ON
(t2.B_id = t1.B_id_)
LEFT JOIN table3 AS t3 ON
(t3.B_id = t2.B_id_)
LEFT JOIN table3 AS t4 ON
(t4.B_id = t3.B_id_)
Big thanks to @Damodaran and his solution for recursive queries:
How to create a MySQL hierarchical recursive query
Be careful when using this code: I have used it on a database that is only queried for data, so this approach might be slow under other usage patterns. If you use it, I suggest you consider indexing some of the fields.
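What the chain of self-joins computes can be sketched in plain Python (not SQL): start from table2's value, then follow table3's B_id_ pointer at most four times, collecting values along the way. The dictionaries below hold the sample rows from the question, and the function name `build_path` is hypothetical. The sketch builds the path in the order and with the backslash separator shown in the question's expected output, whereas the query above concatenates with '/' and puts the deepest level first.

```python
# Sample data from the question, keyed for direct lookup.
table1 = {1: 5, 2: 3}                  # id -> A_id
table2 = {5: (1, "aa"), 3: (3, "bb")}  # A_id -> (B_id, value)
table3 = {                             # B_id -> (B_id_, value)
    1: (2, "zzxx"),
    2: (None, "vvyy"),
    3: (4, "bbll"),
    5: (None, "oopp"),
    4: (5, "mmnn"),
}

def build_path(row_id, max_depth=4, sep="\\"):
    """Follow the table3 chain for one table1 row, at most max_depth levels."""
    b_id, value = table2[table1[row_id]]
    parts = [value]
    for _ in range(max_depth):  # one iteration per LEFT JOIN on table3
        if b_id not in table3:  # also stops when the pointer is None
            break
        b_id, value = table3[b_id]
        parts.append(value)
    return sep.join(parts)

paths = {row_id: build_path(row_id) for row_id in table1}
```

The fixed `max_depth` mirrors the fixed number of joins: a chain longer than four levels would be cut off in both versions.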