Spark ETL job: execute the MySQL query only once

I have an ETL job in Spark that also connects to MySQL in order to grab some data. Historically, I've been doing it as follows:
hiveContext.read().jdbc(
dbProperties.getProperty("myDbInfo"),
"(SELECT id, name FROM users) r",
new Properties()).registerTempTable("tmp_users");
Row[] res = hiveContext.sql("SELECT "
+ " u.name, "
+ " SUM(s.revenue) AS revenue "
+ "FROM "
+ " stats s "
+ " INNER JOIN tmp_users u "
+ " ON u.id = s.user_id "
+ "GROUP BY "
+ " u.name "
+ "ORDER BY "
+ " revenue DESC "
+ "LIMIT 10").collect();
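String ids = "";
// now grab me some info for users that are in tmp_user_stats
// (note: column 0 of res is u.name; the query above would need to select u.id for this list to hold ids)
for (int i = 0; i < res.length; i++) {
ids += (!ids.equals("") ? "," : "") + res[i].get(0);
}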
hiveContext.read().jdbc(
dbProperties.getProperty("myDbInfo"),
"(SELECT name, surname, home_address FROM users WHERE id IN ("+ids+")) r",
new Properties()).registerTempTable("tmp_users_prises");
However, when scaling this to multiple worker nodes, whenever I use the tmp_users table, it runs the query and it gets executed (at least) once per node, which boils down to our db admin running around offices with a knife.
What's the best way to handle this? Can I run the job on like 3 machines, limiting it to 3 queries and then write the data to Hadoop for other nodes to use it or what?
Essentially, as suggested in the comments, I could run a query outside of the ETL job that prepares the data on the MySQL side and imports it into Hadoop. However, there could be subsequent queries, which suggests a solution more in line with Spark and the JDBC connection setup.
I'll accept the Sqoop solution as it at least gives a more streamlined approach, although I'm not yet sure it will do the job. If I find something, I'll edit the question again.

You can cache data:
DataFrame initialDF = hiveContext.read().jdbc(
dbProperties.getProperty("myDbInfo"),
"(SELECT id, name FROM users) r",
new Properties());
initialDF.cache();
initialDF.registerTempTable("tmp_users");
After the first read, the data will be cached in memory.
An alternative (that doesn't hurt the DBA ;) ) is to use Sqoop with the parameter --num-mappers=3 and then import the resulting file into Spark.

Related

Optimizing "OR" operator to get around SQL Exception "maximum number of expressions in a list is 1000"

I have a query using the IN operator with a list of 1000+ values. I've searched for the error but couldn't find what I want: my SELECT is not optimized because I'm using the OR operator, and it takes quite a lot of time.
I'm chopping my query into different arrays at the moment:
query = "SELECT\n" +
" filename,\n" +
" status\n" +
"FROM\n" +
" table1\n" +
"WHERE\n" +
" filename IN " + allFileNames[0];
for(int m=1; m<allFileNames.length; m++) {
query +=
" OR\n" +
" filename IN " + allFileNames[m];
}
Essentially, I have the allFileNames array, and each element of the array contains a string with 1000 file names. I'm putting an OR operator between the IN lists built from each element.
How can I optimize all of this? Is it possible without creating a temporary table? Maybe using a substring, but I haven't quite found the solution yet.
Thanks in advance!
EDIT: I can't use a temporary table since I don't have write access to the database.
Assuming each allFileNames entry is a single file name, you can flatten everything into one IN list.
query = "SELECT\n" +
" filename,\n" +
" status\n" +
"FROM\n" +
" table1\n" +
"WHERE\n" +
// String.join accepts both a String[] and a collection of strings
" filename IN (" + String.join(",", allFileNames)
+ ")";
Because all of your conditions joined by OR test the same column, you should merge them.
But if your allFileNames collection is too large, consider executing the query in batches:
Split your allFileNames collection into batches of 100-1000 and execute the query for each batch.
Aggregate all the results.
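The batching steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class and method names (InClauseBatches, partition, buildQuery) are made up here, and it assumes each entry is one plain file name. It uses "?" placeholders so that each batch can be bound through a PreparedStatement instead of being concatenated into the SQL text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class InClauseBatches {

    // Split the names into consecutive batches of at most batchSize elements.
    static List<List<String>> partition(List<String> names, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < names.size(); i += batchSize) {
            int end = Math.min(i + batchSize, names.size());
            batches.add(new ArrayList<>(names.subList(i, end)));
        }
        return batches;
    }

    // Build the query text for one batch: one "?" placeholder per file name.
    // batchLength is expected to be at least 1 (partition never yields empty batches).
    static String buildQuery(int batchLength) {
        String placeholders = String.join(",", Collections.nCopies(batchLength, "?"));
        return "SELECT filename, status FROM table1 WHERE filename IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // Hypothetical input; in the question this would come from allFileNames.
        List<String> names = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        for (List<String> batch : partition(names, 2)) {
            // With a real connection: prepare buildQuery(batch.size()),
            // call setString(i + 1, batch.get(i)) for each name, run it,
            // and add the rows to one combined result list.
            System.out.println(buildQuery(batch.size()) + "  <- bind " + batch);
        }
    }
}
```

Keeping the batch size at or below 1000 stays under the limit named in the error message, at the cost of one round trip per batch.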

Multi-table update using JavaFX and MySQL

I have to update customer information that is spread over 4 MySQL tables. I created one Customer class. When I first add the information, it is added to an observable list that populates a table, and clicking on a selected row displays the information in text boxes for editing, but the updates are not being saved into the MySQL tables. Can you tell whether the problem is in this part of the code or somewhere else in the program? What is wrong with my code?
public void updateCustomer(Customer selectedCustomer, String user, LocalDateTime timePoint) throws Exception{
String query = "UPDATE customer, address, city, country"
+ " SET customer.customerName = '"+selectedCustomer.getCustomerName()+"', customer.lastUpdate = '" +timePoint+"', customer.lastUpdateBy = '"+user+ "', "
+ " address.address = '" +selectedCustomer.getAddress()+ "', address.address2 = '" +selectedCustomer.getAddress2()+ "', address.postalCode = '" +selectedCustomer.getPostalCode()+ "', address.phone = '" +selectedCustomer.getPhone()+ "', address.lastUpdate='" +timePoint+ "', address.lastUpdateBy = '" +user+ "', "
+ " city.city = '"+selectedCustomer.getCity()+"',city.lastUpdate='"+timePoint+"',city.lastUpdateBy = '"+user+ "', "
+ " country.country = '"+selectedCustomer.getCountry()+"',country.lastUpdate ='"+timePoint+"',country.lastUpdateBy = '"+user+ "' "
+ " WHERE customer.customerId = " +selectedCustomer.getCustomerId()+ " AND customer.addressId = address.addressId AND address.cityId = city.cityId AND city.countryId = country.countryId " ;
statement.executeUpdate(query);
}
What I usually do is:
Create my data class.
Create load and save methods on my DB class.
Set up the FXML so that the TableColumns show the right information from the data class
Get the data into an ObservableList from the database (I like to use Derby but that shouldn't make a difference) and put that into the table.
Add a listener to the selection model so that when the selected item in the table changes, the selected item is referenced by another variable (say "selectedCustomer") and that variable's data is shown in the editable textfields or comboboxes or whatever. Note that I don't use bindings when showing the selectedCustomer in the textboxes. I just use plain setTexts.
When the user clicks on Save or something, the data in the textfields are set into the selectedCustomer (for example, selectedCustomer.setName(nameText.getText());)
I call the database class' save method (for example, DB.save(selectedCustomer); )
That should do the trick! Never failed me yet.
However, I may be at fault here, since I couldn't be bothered to read your SQL statement. Please, for goodness sake, learn to use PreparedStatements! First, I don't really understand how your table is set up, so I can't really comment, but it's really hard to understand. However, if I were to take a wild guess, I think the problem may have something to do with this part here:
WHERE ... AND customer.addressId = address.addressId AND address.cityId = city.cityId AND city.countryId = country.countryId
I don't understand how that part works--either that SQL statement does not make sense or my SQL needs practice (probably the latter).
Since you use MySQL, how about using a manager (such as phpMyAdmin or, since you are using Java, SQuirreL) to execute your SQL statement manually and see what happens? If you enter the SQL statement and nothing changes (when something should), then your SQL statement is at fault.
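As a rough illustration of the PreparedStatement advice above, here is a cut-down sketch of the update. It is not a drop-in replacement: it covers only the customer and address tables (city and country would follow the same pattern), the class name and method signature are invented for the example, and the table and column names are simply copied from the question. Returning the row count from executeUpdate also gives a quick way to see whether the statement matched anything.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.LocalDateTime;

public class CustomerUpdateSketch {

    // Same multi-table UPDATE shape as in the question, with "?" placeholders
    // instead of concatenated values (customer and address only).
    static final String UPDATE_SQL =
          "UPDATE customer, address"
        + " SET customer.customerName = ?, customer.lastUpdate = ?, customer.lastUpdateBy = ?,"
        + " address.address = ?, address.phone = ?, address.lastUpdate = ?, address.lastUpdateBy = ?"
        + " WHERE customer.customerId = ? AND customer.addressId = address.addressId";

    // Binds the values in placeholder order and returns the number of affected rows.
    static int updateCustomer(Connection conn, int customerId, String name, String address,
                              String phone, String user, LocalDateTime timePoint) throws SQLException {
        Timestamp ts = Timestamp.valueOf(timePoint);
        try (PreparedStatement ps = conn.prepareStatement(UPDATE_SQL)) {
            ps.setString(1, name);
            ps.setTimestamp(2, ts);
            ps.setString(3, user);
            ps.setString(4, address);
            ps.setString(5, phone);
            ps.setTimestamp(6, ts);
            ps.setString(7, user);
            ps.setInt(8, customerId);
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // No database is opened here; this only prints the statement text.
        System.out.println(UPDATE_SQL);
    }
}
```

If the returned count is 0, the WHERE clause matched no rows, which narrows the problem down to the join conditions or the customerId being passed in.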

Import Foxpro tables into SQL Server

We have 800 different .dbf files and these need to load into SQL Server with their file name as the new table name, so file1.dbf has to be loaded into SQL Server into table file1.
Like this, we need to load all 800 Foxpro tables into SQL Server. Does anyone have an idea for this, or a script? Any help is highly appreciated.
There are multiple solutions to the problem. One is to use the upsizing wizard that ships with VFP. I only tried the original version and it was not good at all, so I haven't used it since. You may try uploading a test database with a single table that has, say, a million rows in it, just to see whether that approach is feasible (a million rows shouldn't take more than a minute).
What I did was to create a "generator" that would create the SQL Server tables with the mapping I wanted (i.e. memo to varchar or varbinary MAX, char to varchar, etc.). Then, using C#-based ActiveX code I wrote, I load the tables, multiple tables at a time (other ways of loading the tables were extremely slow). Since then that code has been used to create SQL Server tables and/or transfer existing customers' data to SQL Server.
Yet another effective way would be, create a linked server to VFP using VFPOLEDB and then use OpenQuery to get tables' structure and data:
select * into [TableName]
from OpenQuery(vfpserver, 'select * from TableName ...')
This one is fast too and allows you to use VFP specific functions inside the query, however resulting field types might not be as you like.
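For 800 files, the OpenQuery statement above would have to be repeated once per table. One way to avoid typing them by hand is to generate the script; the sketch below is only an illustration under a few assumptions: the linked server is called vfpserver as in the snippet above, each table is named after its .dbf file, and the file names contain nothing that needs escaping. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OpenQueryScript {

    // Turn "file1.dbf" into the table name "file1" (case-insensitive extension match).
    static String tableName(String dbfFileName) {
        if (dbfFileName.toLowerCase().endsWith(".dbf")) {
            return dbfFileName.substring(0, dbfFileName.length() - 4);
        }
        return dbfFileName;
    }

    // Build one SELECT ... INTO statement for a table behind the linked server.
    // No escaping is done, so names with quotes or brackets are not supported.
    static String selectInto(String linkedServer, String table) {
        return "select * into [" + table + "] from OpenQuery("
                + linkedServer + ", 'select * from " + table + "')";
    }

    // Build the statements for a whole list of .dbf file names.
    static List<String> script(String linkedServer, List<String> dbfFileNames) {
        List<String> statements = new ArrayList<>();
        for (String fileName : dbfFileNames) {
            statements.add(selectInto(linkedServer, tableName(fileName)));
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical file names; a real run would list the .dbf directory instead.
        for (String sql : script("vfpserver", Arrays.asList("file1.dbf", "file2.DBF"))) {
            System.out.println(sql);
        }
    }
}
```

The printed statements can then be reviewed and run in SQL Server; as noted above, the resulting field types may still need adjusting afterwards.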
Below is a solution that is written in FoxPro 9. You will probably need to modify a bit as I only handled 3 data types. You will also have to look out for SQL reserved words as field names.
SET SAFETY OFF
CLOSE ALL
CLEAR ALL
CLEAR
SET DIRE TO "C:\temp"
** house keeping
RUN del *.fxp
RUN del *.CDX
RUN del *.bak
RUN del *.err
RUN del *.txt
oPrgDir = SYS(5)+SYS(2003) && Program Directory
oPath = "C:\temp\pathtodbfs" && location of dbfs
CREATE TABLE dbfstruct (fldno N(7,0), fldnm c(16), fldtype c(20), fldlen N(5,0), fldpoint N(7,0)) && dbf structure table
STORE SQLSTRINGCONNECT("DRIVER={MySQL ODBC 3.51 Driver};SERVER=localhost;DATABASE=testdbf;UID=root;PWD=root; OPTION=3") TO oConn && SQL connection
SET DIRE TO (m.oPath)
STORE ADIR(aFL, "*.dbf") TO iFL && getting list of dbfs
SET DIRE TO (m.oPrgDir)
FOR i = 1 TO iFL
IF AT("dbfstruct.dbf", LOWER(aFL(i,1))) = 0 THEN
USE oPath + "\" + aFL(i,1)
LIST STRUCTURE TO FILE "struct.txt" && output dbf structure to text file
SET DIRE TO (m.oPrgDir)
USE dbfstruct
ZAP
APPEND FROM "struct.txt" TYPE SDF
DELETE FROM dbfstruct WHERE fldno = 0 && removing non-essential text
PACK
CLEAR
DELETE FILE "struct.txt"
SET DIRE TO (m.oPrgDir)
=SQLEXEC(oConn, "DROP TABLE IF EXISTS testdbf." + STRTRAN(LOWER(aFL(i,1)),".dbf", "")) && needed to remove tables already created when I was testing
sSQL = "CREATE TABLE testdbf." + STRTRAN(LOWER(aFL(i,1)),".dbf", "") + " ("
SELECT dbfstruct
GOTO TOP
DO WHILE NOT EOF()
@1,1 SAY "CREATING QUERY: " + aFL(i,1)
sSQL = sSQL + ALLTRIM(LOWER(dbfstruct.fldnm)) + " "
* You may have to add below depending on the field types of your DBFS
DO CASE
CASE ALLTRIM(dbfstruct.fldtype) == "Character"
sSQL = sSQL + "VARCHAR(" + ALLTRIM(STR(dbfstruct.fldlen)) + "),"
CASE ALLTRIM(dbfstruct.fldtype) == "Numeric" AND dbfstruct.fldpoint = 0
sSQL = sSQL + "INT(" + ALLTRIM(STR(dbfstruct.fldlen)) + "),"
CASE ALLTRIM(dbfstruct.fldtype) == "Numeric" AND dbfstruct.fldpoint > 0
sSQL = sSQL + "DECIMAL(" + ALLTRIM(STR(dbfstruct.fldlen)) + "," + ALLTRIM(STR(dbfstruct.fldpoint)) + "),"
OTHERWISE
=MESSAGEBOX("Unhandled Field Type: " + ALLTRIM(dbfstruct.fldtype) ,0,"ERROR")
CANCEL
ENDCASE
SELECT dbfstruct
SKIP
ENDDO
sSQL = SUBSTR(sSQL, 1, LEN(sSQL)-1) + ")"
STORE SQLEXEC(oConn, sSQL) TO iSQL
IF iSQL < 0 THEN
CLEAR
?sSQL
STORE FCREATE("sqlerror.txt") TO gnOut && SQL of query in case it errors
=FPUTS(gnOut, sSQL)
=FCLOSE(gnOut)
=MESSAGEBOX("Error creating table on MySQL",0,"ERROR")
CANCEL
ENDIF
CLOSE DATABASES
ENDIF
ENDFOR
=SQLDISCONNECT(oConn)
SET DIRE TO (m.oPrgDir)
SET SAFETY ON

MySQL SELECT with varying WHERE fields

At a high level this sounds trivial, but I've been scratching my head over it for a couple of hours.
Situation:
I have table T, with columns a,b,c,d,e. Column a holds a string, while b,c,d,e each hold a boolean value.
I am allowing a user to perform a kind of search, where I prompt the user to enter values for a,b,c,d,e, and return all the rows where those values all match.
In a perfect world, the user enters all values (let's say a="JavaScript", b="true", c="false", d="false", e="true") and the resulting query (in Scala, referencing a remote DB running MySQL) might look something like this:
connection.createStatement().executeQuery("SELECT * FROM T" +
" WHERE a = '" + a_input + "'" +
" and b = " + b_input +
" and c = " + c_input +
" and d = " + d_input +
" and e = " + e_input + ";")
Problem:
I give the user the option to 'loosen' the constraints, so it is possible that a_input="" and b_input="", etc. Potentially all fields a,b,c,d,e can be empty (""). If a field is omitted, it should not affect the resulting response. In other words, if c is not entered, the result can contain entries where c is TRUE or FALSE.
Question:
How do I write ONE query that covers the situation where potentially all fields can be empty, or just some, or none?
Just build the WHERE dynamically. Instead of always searching for c = 'whatever', only include c in the WHERE if the user supplied a value for it.
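A minimal sketch of that idea, written in Java against plain JDBC conventions (the question uses Scala, but the same calls apply). The class and method names are invented for illustration, and it assumes that column a is a string, that the other columns are booleans, and that the column names come from the program itself rather than from the user. Only the values are user input, and they are returned as a parameter list to bind to "?" placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicWhere {

    // Builds "SELECT * FROM T" plus one "column = ?" condition for every
    // non-empty input, and appends the matching values to params in the same order.
    static String buildQuery(Map<String, String> inputs, List<Object> params) {
        StringBuilder sql = new StringBuilder("SELECT * FROM T");
        String separator = " WHERE ";
        for (Map.Entry<String, String> entry : inputs.entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty()) {
                continue; // field left blank: it does not constrain the result
            }
            sql.append(separator).append(entry.getKey()).append(" = ?");
            // Column a holds a string; b, c, d, e hold booleans.
            params.add(entry.getKey().equals("a") ? value : Boolean.valueOf(value));
            separator = " AND ";
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        Map<String, String> inputs = new LinkedHashMap<>();
        inputs.put("a", "JavaScript");
        inputs.put("b", "");
        inputs.put("c", "false");
        inputs.put("d", "");
        inputs.put("e", "");
        List<Object> params = new ArrayList<>();
        System.out.println(buildQuery(inputs, params));
        System.out.println(params);
        // With a real connection: prepare the returned SQL, then call
        // setObject(i + 1, params.get(i)) for each parameter before executing.
    }
}
```

When every field is blank, the method returns the bare SELECT with no WHERE clause, which covers the case where all constraints are loosened.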
You could use:
Select *
from table
where a in ('','othervalue','othervalue')
and b in ('','othervalue','morevalues')
and so on. That is like using an OR for each field, and it will match even if the field is empty.
This is tricky because the DB contains booleans but the parms are strings. Are the parms always blank, 'true', or 'false'? If so, try:
(b_input = ''
OR (b_input = 'true' AND b)
OR (b_input = 'false' AND (NOT b)))

BEGIN AND COMMIT MYSQL EXECUTE JAVA

I have this simple doubt:
String sql ="BEGIN;" +
"DELETE FROM users WHERE username='gg';" +
"DELETE FROM comprofiler WHERE id=611;" +
"COMMIT;";
st.execute(sql);
Why doesn't it work? It works if it's just one instruction, how can I type this?