I have a source MySQL table, and I have to export the data to Hive for analytical purposes. Initially, when the data size in MySQL was small, a full export of the MySQL data to Hive using Sqoop was not an issue.
Now that my data size has grown, how can I take incremental updates of the MySQL data into Hive?
You can use Sqoop for incremental updates. The Sqoop documentation is good; here is the link:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
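For example, an append-mode incremental import might look like the sketch below (the angle-bracket values are placeholders for your own connection details and watermark):

sqoop import \
  --connect jdbc:mysql://<host>/<database> \
  --username <user> -P \
  --table <mysql_table> \
  --incremental append \
  --check-column id \
  --last-value <highest id already imported> \
  --target-dir /user/hive/warehouse/<table_dir>

For rows that are updated in place rather than only appended, --incremental lastmodified with a timestamp --check-column and a --merge-key is the usual alternative.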
Here is an example of an incremental update using Hive/Spark:
scala> spark.sql("select * from table1").show
+---+---+---------+
| id|sal|timestamp|
+---+---+---------+
| 1|100| 30-08|
| 2|200| 30-08|
| 3|300| 30-08|
| 4|400| 30-08|
+---+---+---------+
scala> spark.sql("select * from table2").show
+---+----+---------+
| id| sal|timestamp|
+---+----+---------+
| 2| 300| 31-08|
| 4|1000| 31-08|
| 5| 500| 31-08|
| 6| 600| 31-08|
+---+----+---------+
scala> spark.sql("select b.id,b.sal from table1 a full outer join table2 b on a.id = b.id where b.id is not null union select a.id,a.sal from table1 a full outer join table2 b on a.id = b.id where b.id is null").show
+---+----+
| id| sal|
+---+----+
| 4|1000|
| 6| 600|
| 2| 300|
| 5| 500|
| 1| 100|
| 3| 300|
+---+----+
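The same merge can also be expressed with the DataFrame API; below is a rough PySpark sketch of the query above (it assumes id is unique within each table, so the union needs no deduplication):

base = spark.table("table1")    # snapshot from the previous load
delta = spark.table("table2")   # new and updated rows

# keep every row from the delta, plus base rows whose id does not appear in the delta
merged = delta.select("id", "sal").union(
    base.join(delta, "id", "left_anti").select("id", "sal"))
merged.show()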
Here table2 holds the incremental (new and updated) rows, so the union keeps every updated row from table2 plus the table1 rows that did not change. Hope this logic works for you.
I have a Spark data frame:
+-------+-------------+---------------+
|item_id|attribute_key|attribute_value|
+-------+-------------+---------------+
|   id_1|        brand|        Samsung|
|   id_1|          ram|            6GB|
|   id_2|        brand|          Apple|
|   id_2|          ram|            4GB|
+-------+-------------+---------------+
I want to group this data frame by item_id and output it as a file with each line being a JSON object:
{id_1: "properties":[{"brand":['Samsung']},{"ram":['6GB']} ]}
{id_2: "properties":[{"brand":['Apple']},{"ram":['4GB']} ]}
This is a big distributed data frame, so converting it to pandas is not an option. Is this kind of transformation even possible in PySpark?
In Scala, but the Python version will be very similar (using sql.functions):
val df = Seq((1,"brand","Samsung"),(1,"ram","6GB"),(1,"ram","8GB"),(2,"brand","Apple"),(2,"ram","6GB")).toDF("item_id","attribute_key","attribute_value")
+-------+-------------+---------------+
|item_id|attribute_key|attribute_value|
+-------+-------------+---------------+
| 1| brand| Samsung|
| 1| ram| 6GB|
| 1| ram| 8GB|
| 2| brand| Apple|
| 2| ram| 6GB|
+-------+-------------+---------------+
df.groupBy('item_id,'attribute_key)
.agg(collect_list('attribute_value).as("list2"))
.groupBy('item_id)
.agg(map(lit("properties"),collect_list(map('attribute_key,'list2))).as("prop"))
.select(to_json(map('item_id,'prop)).as("json"))
.show(false)
Output:
+------------------------------------------------------------------+
|json |
+------------------------------------------------------------------+
|{"1":{"properties":[{"ram":["6GB","8GB"]},{"brand":["Samsung"]}]}}|
|{"2":{"properties":[{"brand":["Apple"]},{"ram":["6GB"]}]}} |
+------------------------------------------------------------------+
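For reference, here is a PySpark sketch of the same approach (create_map is the Python counterpart of Scala's map function; the names match the example above):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "brand", "Samsung"), (1, "ram", "6GB"), (1, "ram", "8GB"),
     (2, "brand", "Apple"), (2, "ram", "6GB")],
    ["item_id", "attribute_key", "attribute_value"])

# collect values per key, then wrap each group in a "properties" map and render as JSON
(df.groupBy("item_id", "attribute_key")
   .agg(F.collect_list("attribute_value").alias("list2"))
   .groupBy("item_id")
   .agg(F.create_map(F.lit("properties"),
                     F.collect_list(F.create_map("attribute_key", "list2"))).alias("prop"))
   .select(F.to_json(F.create_map("item_id", "prop")).alias("json"))
   .show(truncate=False))

Writing the single json column out with DataFrameWriter.text then gives one JSON object per line.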
I want to add a unique row number to my dataframe in PySpark, and I don't want to use the monotonically_increasing_id or partitionBy methods.
I think this question might be a duplicate of similar questions asked earlier; still, I am looking for some advice on whether I am doing it the right way or not.
Following is a snippet of my code:
I have a CSV file with the below set of input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this CSV file into a dataframe:
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD = empRDD.map(lambda x: list(x[0]) + [x[1]])
newRDD.take(2)
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the int value to my list, I lost the dataframe schema.
newdf = newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show()
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing this the right way? Or is there any better way to add the row number while preserving the schema of the dataframe in PySpark?
Is it feasible to use the zipWithIndex method to add unique consecutive row numbers to a large dataframe as well? Can we use this row_id to re-partition the dataframe so that the data is distributed uniformly across the partitions?
I have found a solution, and it's very simple.
Since my dataframe has no column holding the same value across all rows, row_number does not generate unique row numbers when used with a partitionBy clause.
Let's add a new column to the existing dataframe with some default value in it:
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

emp_df = emp_df.withColumn("new_column", lit("ABC"))
and create a window specification with partitionBy using that column "new_column":
w = Window.partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired result:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+
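As for preserving the schema with zipWithIndex: instead of passing a plain list of column names to toDF, you can extend the original schema with an index field so the column types survive the round trip through the RDD. A sketch using the emp_df from the question:

from pyspark.sql.types import LongType, StructField, StructType

# original fields plus a LongType field for the index
new_schema = StructType(emp_df.schema.fields + [StructField("row_id", LongType(), False)])
newdf = spark.createDataFrame(
    emp_df.rdd.zipWithIndex().map(lambda row_idx: row_idx[0] + (row_idx[1],)),
    new_schema)
newdf.show()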
I want to split a word in a column into multiple rows, each with a single character. Small example below:
Id Name StartDate EndDate
1 raj 2017-07-05 2008-08-06
here the expected result is:
Id Name StartDate EndDate
1 r 2017-07-05 2008-08-06
1 a 2017-07-05 2008-08-06
1 j 2017-07-05 2008-08-06
First split the string into an array and then use explode. Note that a filter needs to be applied, as otherwise the result will contain rows with an empty string.
val df = spark.createDataFrame(Seq((1, "raj"), (2, "test"))).toDF("Id", "Name")
val df2 = df.withColumn("Name", explode(split($"Name", ""))).filter($"Name" =!= "")
This will give you:
+---+----+
| Id|Name|
+---+----+
| 1| r|
| 1| a|
| 1| j|
| 2| t|
| 2| e|
| 2| s|
| 2| t|
+---+----+
Note, for older versions of Spark (older than 2.0.0), use !== instead of =!= when checking for inequality.
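For anyone doing this in PySpark, here is a sketch of the same approach that also carries the date columns from the question along:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "raj", "2017-07-05", "2008-08-06")],
    ["Id", "Name", "StartDate", "EndDate"])

# split into single characters, explode into rows, and drop the empty-string row
df.withColumn("Name", F.explode(F.split("Name", ""))) \
  .filter(F.col("Name") != "") \
  .show()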
I have got 3 tables:
table1:
+----+------+
| id | A_id |
+----+------+
|  1 |    5 |
|  2 |    3 |
+----+------+

table2:
+------+------+-------+
| A_id | B_id | value |
+------+------+-------+
|    5 |    1 | aa    |
|    3 |    3 | bb    |
+------+------+-------+

table3:
+------+-------+-------+
| B_id | B_id_ | value |
+------+-------+-------+
|    1 |     2 | zzxx  |
|    2 |       | vvyy  |
|    3 |     4 | bbll  |
|    5 |       | oopp  |
|    4 |     5 | mmnn  |
+------+-------+-------+
What SELECT statement do I need to use so that the output looks like this (table3 can be nested up to 4 levels deep into itself)?
+----+------------------------------+
| id | value |
+----+------------------------------+
| 1| aa\zzxx\vvyy|
+----+------------------------------+
| 2| bb\bbll\mmnn\oopp|
+----+------------------------------+
As I don't have much experience with databases and SQL, this is hard for me, and I have no clear idea of how to approach it.
This has to be done in MySQL. The hardest part, from what I have read, is the recursive query, since MySQL doesn't have one, so people have to simulate it. I have read some SO topics about recursive queries, but I understood that they're not for me.
Any help is appreciated.
Through some hard and fast learning I managed to solve my problem. Code below:
SELECT DISTINCT
OTHER.DATA,
concat(
'/',ifnull(t4.value,''), CASE WHEN (t4.value is NULL) then '' else '/' END,
ifnull(t3.value,''), CASE WHEN (t3.value is NULL) then '' else '/' END,
ifnull(t2.value,''), CASE WHEN (t2.value is NULL) then '' else '/' END,
ifnull(t1.value,''), CASE WHEN (t1.value is NULL) then '' else '/' END,
table2.value
) as 'My Column name'
FROM
table1
LEFT JOIN table2 ON
(table1.A_id = table2.A_id)
LEFT JOIN table3 as t1 ON
(t1.B_id = table2.B_id)
LEFT JOIN table3 AS t2 ON
(t2.B_id = t1.B_id_)
LEFT JOIN table3 AS t3 ON
(t3.B_id = t2.B_id_)
LEFT JOIN table3 AS t4 ON
(t4.B_id = t3.B_id_)
Big thanks to @Damodaran and his solution for recursive queries:
How to create a MySQL hierarchical recursive query
Be careful when using this code, as I have used it on a DB which is only queried for data, so this approach might be slow under different usage patterns. If you use this, I suggest you think about indexing some fields.
I hate to admit it, but my knowledge of MySQL is lacking when it comes to the more complex queries. Essentially I have five tables: two of them contain the data I want to return, and two are relational tables linking the data. Table A is present just to provide filler for Table D.aID.
+--------+   +--------+   +--------+   +-----------+   +-----------+
|Table A |   |Table B |   |Table C |   |  Table D  |   |  Table E  |
+---+----+   +---+----+   +---+----+   +---+---+---+   +---+---+---+
|aID|name|   |bID|name|   |cID|name|   |dID|aID|bID|   |eID|dID|cID|
+---+----+   +---+----+   +---+----+   +---+---+---+   +---+---+---+
| 1 | a_1|   | 1 | b_1|   | 1 | c_1|   | 1 | 1 | 1 |   | 1 | 1 | 1 |
+---+----+   | 2 | b_2|   | 2 | c_2|   | 2 | 1 | 2 |   | 1 | 1 | 2 |
             +---+----+   | 3 | c_3|   +---+---+---+   +---+---+---+
                          +---+----+
The relationship created with these tables is: Table A > Table B > Table C. The data I want belongs to the Table B > Table C relationship:
+--------+---------+--------+---------+
|tblB.bID|tblB.name|tblC.cID|tblC.name|
+--------+---------+--------+---------+
| 1 | a_1 | 1 | c_1 |
| 1 | a_1 | 2 | c_2 |
| 2 | a_2 | NULL | NULL |
+--------+---------+--------+---------+
However, to ensure I am following the correct path, I need to grab the Table B of the Table A > Table B relationship that Table C belongs to. I realize that I am making things much more difficult for myself by allowing duplicate name values, but I would rather have small tables and more complex queries than bloated tables and simpler queries. The query I am using is:
SELECT * FROM `Table E`
LEFT JOIN `Table D` ON (`Table B`.bID = `Table D`.bID)
RIGHT JOIN `Table E` ON (`Table D`.dID = `Table E`.dID))
RIGHT JOIN `Table C` ON (`Table E.cID = `Table C`.cID);
However, so far it has not worked. When the query is submitted, this error is returned:
ERROR 1066 (42000): Not unique table/alias: 'Table D'
Any ideas on how I can get this to work? Is this even possible?
The query you say you are submitting bears little resemblance to the table structure you have given us!
What is Table D.national_regionID? Or modx.coverage_state?
Generally, though, don't mix LEFT and RIGHT joins. Also, every table used in the query must either follow the FROM or follow a JOIN. You seem to be using Table B and Table C in join conditions without ever adding them to the query.
Thanks to Martin Smith, I was able to come up with the solution I am posting here. I hope it might help someone else.
SELECT tblB.bID,
       tblB.name,
       tblC.cID,
       tblC.name
FROM `Table E` AS tblE
RIGHT JOIN `Table D` AS tblD USING (dID)
RIGHT JOIN `Table B` AS tblB ON (tblB.bID = tblD.bID)
RIGHT JOIN `Table C` AS tblC USING (cID);