How to insert data into a MySQL database using the Spark Java framework

I am new to the Spark Java framework. How do I insert values into a MySQL database using it?

Assuming you have already read the data you want to insert into an RDD, you can use the following code to insert the records into the database.
rdd.foreach(new VoidFunction<String>() {
    @Override
    public void call(String s) throws Exception {
        // Your code to parse the String and insert the values into MySQL.
    }
});

To build on Shivanand's answer, you really want to use foreachPartition rather than foreach. With foreach, you open a DB connection for every element instead of once per partition. This matters for a couple of reasons. Most importantly, opening a connection takes time, and with foreach that overhead is paid for every element. You will also be trying to open a lot of connections, which will probably make one of the admins pretty mad when they see potentially millions of DB connection requests.

If you have already read the file, use the foreachPartition function to insert the data into MySQL.
For example:
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> rows) throws Exception {
        // Open one connection per partition. The URL, user and password are placeholders.
        try (Connection c = DriverManager.getConnection("jdbc:mysql://localhost:3306/dbname", "user", "password");
             // Prepare the statement once and reuse it for every row
             PreparedStatement ps = c.prepareStatement("INSERT INTO table_name (col_name) VALUES (?)")) {
            while (rows.hasNext()) {
                ps.setString(1, rows.next()); // bind the row to the placeholder
                ps.executeUpdate();
            }
        }
    }
});
This inserts every row of the partition into MySQL.
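On top of opening one connection per partition, the rows of a partition can be sent in JDBC batches rather than one statement at a time. The following is only a minimal sketch: `PartitionWriter`, `writeBatched` and the batch size are hypothetical names, and it assumes each row binds as a single string parameter.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;

/** Hypothetical helper, meant to be called from inside foreachPartition. */
final class PartitionWriter {

    /**
     * Binds each row to parameter 1, queues it with addBatch(), and sends the
     * queued rows with executeBatch() every batchSize rows.
     * Returns the number of rows queued.
     */
    static int writeBatched(PreparedStatement ps, Iterator<String> rows, int batchSize)
            throws SQLException {
        int total = 0;
        int pending = 0;
        while (rows.hasNext()) {
            ps.setString(1, rows.next());
            ps.addBatch();
            total++;
            pending++;
            if (pending == batchSize) {
                ps.executeBatch();
                pending = 0;
            }
        }
        if (pending > 0) {
            ps.executeBatch(); // flush the final, partially filled batch
        }
        return total;
    }
}
```

Inside `call(Iterator<String> rows)` this would be invoked once, with the statement prepared on the partition's single connection.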

Related

Difference between offset vs limit [duplicate]

I have a very large table that receives several million records every day, and at the end of each day I extract all the records of the previous day. I do it like this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
Statement.executeQuery(SQL);
The problem is that this program takes about 2GB of memory because it loads all the results into memory before processing them.
I tried setting Statement.setFetchSize(10), but the process takes exactly the same amount of memory from the OS; it makes no difference. I am using the Microsoft SQL Server 2005 JDBC Driver.
Is there any way to read the results in small chunks like the Oracle database driver does when the query is executed to show only a few rows and as you scroll down more results are shown?
In JDBC, the setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.
Inherently if setFetchSize(10) is being called and the driver is ignoring it, there are probably only two options:
Try a different JDBC driver that will honor the fetch-size hint.
Look at driver-specific properties on the Connection (URL and/or property map when creating the Connection instance).
The RESULT-SET is the number of rows marshalled on the DB in response to the query.
The ROW-SET is the chunk of rows that are fetched out of the RESULT-SET per call from the JVM to the DB.
The number of these calls and resulting RAM required for processing is dependent on the fetch-size setting.
So if the RESULT-SET has 100 rows and the fetch-size is 10,
there will be 10 network calls to retrieve all of the data, using roughly 10*{row-content-size} RAM at any given time.
The default fetch-size is driver-specific; Oracle's default of 10 is rather small.
In the case posted, it would appear the driver is ignoring the fetch-size setting, retrieving all data in one call (large RAM requirement, optimum minimal network calls).
What happens underneath ResultSet.next() is that it doesn't actually fetch one row at a time from the RESULT-SET. It fetches that from the (local) ROW-SET and fetches the next ROW-SET (invisibly) from the server as it becomes exhausted on the local client.
All of this depends on the driver as the setting is just a 'hint' but in practice I have found this is how it works for many drivers and databases (verified in many versions of Oracle, DB2 and MySQL).
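The arithmetic behind the 100-row example can be written down as a tiny helper. This is an idealized sketch (`FetchMath` is a made-up name); a real driver may add a call to detect the end of the data or ignore the hint entirely.

```java
/** Hypothetical helper for the round-trip arithmetic described above. */
final class FetchMath {

    /** Idealized number of fetch round trips: ceil(totalRows / fetchSize). */
    static long roundTrips(long totalRows, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        return (totalRows + fetchSize - 1) / fetchSize;
    }
}
```

With 100 rows and a fetch size of 10 this gives 10 calls, each holding roughly 10 rows in memory at a time.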
The fetchSize parameter is a hint to the JDBC driver as to how many rows to fetch in one go from the database. But the driver is free to ignore this and do what it sees fit. Some drivers, like the Oracle one, fetch rows in chunks, so you can read very large result sets without needing lots of memory. Other drivers just read in the whole result set in one go, and I'm guessing that's what your driver is doing.
You can try upgrading your driver to the SQL Server 2008 version (which might be better), or the open-source jTDS driver.
You need to ensure that auto-commit on the Connection is turned off, or setFetchSize will have no effect.
dbConnection.setAutoCommit(false);
Edit: Remembered that when I used this fix it was Postgres-specific, but hopefully it will still work for SQL Server.
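Putting the two settings together, a streaming read might look like the sketch below. `StreamingRead`, `forEachRow` and `RowHandler` are hypothetical names, and whether the hint has any effect still depends on the driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class StreamingRead {

    /** Hypothetical callback invoked once per row. */
    interface RowHandler {
        void handle(ResultSet rs) throws SQLException;
    }

    /** Reads a query with auto-commit off and a fetch-size hint; returns the row count. */
    static long forEachRow(Connection conn, String sql, int fetchSize, RowHandler handler)
            throws SQLException {
        conn.setAutoCommit(false); // some drivers (e.g. PostgreSQL) only honor the hint then
        long rows = 0;
        try (PreparedStatement stmt = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(fetchSize); // a hint; the driver may ignore it
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs); // process and discard; do not collect rows
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

The handler processes each row as it arrives, so memory use stays bounded only if the caller does not accumulate the rows itself.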
From the Statement interface documentation:
void setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed.
Read this ebook J2EE and beyond By Art Taylor
Sounds like the MSSQL JDBC driver is buffering the entire result set for you. You can add a connection string parameter such as selectMethod=cursor or responseBuffering=adaptive. If you are on version 2.0+ of the 2005 MSSQL JDBC driver, response buffering should default to adaptive.
http://msdn.microsoft.com/en-us/library/bb879937.aspx
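As a rough illustration, the property is appended to the connection URL as a `;key=value` pair. The helper name, host, port and database below are placeholders, not values from the question.

```java
/** Hypothetical helper that builds a SQL Server JDBC URL. */
final class SqlServerUrl {

    /** Builds a URL that asks the driver for adaptive response buffering. */
    static String adaptive(String host, int port, String database) {
        return "jdbc:sqlserver://" + host + ":" + port
                + ";databaseName=" + database
                + ";responseBuffering=adaptive";
    }
}
```

`SqlServerUrl.adaptive("localhost", 1433, "mydb")` would then be passed to `DriverManager.getConnection` together with the credentials.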
It sounds to me that you really want to limit the rows being returned in your query and page through the results. If so, you can do something like:
select * from (select rownum myrow, a.* from TEST1 a )
where myrow between 5 and 10 ;
You just have to determine your boundaries.
Try this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
connection.setAutoCommit(false);
PreparedStatement stmt = connection.prepareStatement(SQL, SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY, SQLServerResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(2000);
stmt.set....
stmt.execute();
ResultSet rset = stmt.getResultSet();
while (rset.next()) {
// ......
}
I had exactly the same problem in a project. The issue is that even though the fetch size might be small enough, JdbcTemplate reads the entire result of your query and maps it into a huge list, which might blow your memory. I ended up extending NamedParameterJdbcTemplate with a function that returns a Stream of objects. That Stream is based on the ResultSet normally returned by JDBC, but pulls data from the ResultSet only as the Stream requires it. This works as long as you don't keep a reference to every object the Stream emits. I drew heavily on the implementation of org.springframework.jdbc.core.JdbcTemplate#execute(org.springframework.jdbc.core.ConnectionCallback). The only real difference is what is done with the ResultSet. I ended up writing this function to wrap the ResultSet:
private <T> Stream<T> wrapIntoStream(ResultSet rs, RowMapper<T> mapper) {
    CustomSpliterator<T> spliterator = new CustomSpliterator<T>(rs, mapper, Long.MAX_VALUE,
            Spliterator.NONNULL | Spliterator.IMMUTABLE | Spliterator.ORDERED);
    Stream<T> stream = StreamSupport.stream(spliterator, false);
    return stream;
}
private static class CustomSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
    // won't put code for constructor or properties here
    // the idea is to pull from the ResultSet and feed the Stream
    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        try {
            // you can add some logic to close the Stream/ResultSet automatically
            if (rs.next()) {
                T mapped = mapper.mapRow(rs, rowNumber++);
                action.accept(mapped);
                return true;
            } else {
                return false;
            }
        } catch (SQLException e) {
            // do something with this exception; rethrown unchecked here so every path returns or throws
            throw new IllegalStateException(e);
        }
    }
}
You can add some logic to make that Stream "auto closable"; otherwise don't forget to close it when you are done.
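For readers who want something that compiles on its own, here is one possible self-contained variant of the same idea without Spring. It is a sketch, not the code from the answer: `ResultSetStreams` and `RowFn` are invented names, with `RowFn` standing in for Spring's RowMapper, and the "auto closable" part is done with `onClose`.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class ResultSetStreams {

    /** Stand-in for Spring's RowMapper so the sketch has no dependencies. */
    interface RowFn<T> {
        T map(ResultSet rs, int rowNum) throws SQLException;
    }

    /** Lazily maps rows; closing the returned stream closes the ResultSet. */
    static <T> Stream<T> stream(ResultSet rs, RowFn<T> mapper) {
        Spliterator<T> sp = new Spliterators.AbstractSpliterator<T>(
                Long.MAX_VALUE, Spliterator.ORDERED) {
            private int rowNum = 0;

            @Override
            public boolean tryAdvance(Consumer<? super T> action) {
                try {
                    if (!rs.next()) {
                        return false; // no more rows
                    }
                    action.accept(mapper.map(rs, rowNum++));
                    return true;
                } catch (SQLException e) {
                    throw new IllegalStateException("Reading the ResultSet failed", e);
                }
            }
        };
        return StreamSupport.stream(sp, false).onClose(() -> {
            try {
                rs.close();
            } catch (SQLException e) {
                throw new IllegalStateException("Closing the ResultSet failed", e);
            }
        });
    }
}
```

Using the stream in a try-with-resources block triggers the `onClose` handler, so the ResultSet is released even if processing stops early.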

Spring Batch: ItemProcessor query Database?

I have a scenario where I need to parse flat files and process those records into mysql database inserts (schema already exists).
I'm using the FlatFileItemReader to parse the files and a JdbcCursorItemWriter to insert in the database.
I'm also using an ItemProcessor to convert any column values or skip records that I don't want.
My problem is that some of those inserts need a foreign key to another table that already has data in it.
So I was thinking to do a select to retrieve the ID and update the pojo, inside the ItemProcessor logic.
Is this the best way to do it? I can consider alternatives as I'm just beginning to write all this.
Thanks!
The ItemProcessor in a Spring Batch step is commonly used for enrichment of data and querying a db for something like that is common.
For the record, another option would be to use a sub select in your insert statement to get the foreign key value as the record is being inserted. This may be a bit more performant, given that it removes the additional DB hit.
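To make the sub select option concrete, here is a sketch in plain JDBC. The schema is purely hypothetical: a `child` table whose `parent_id` references `parent(id)`, with parents looked up by a unique `code` column.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class ChildInsert {

    // Hypothetical schema: child(name, parent_id) references parent(id),
    // and parent rows are identified by a unique business key "code".
    static final String SQL =
            "INSERT INTO child (name, parent_id) "
          + "VALUES (?, (SELECT id FROM parent WHERE code = ?))";

    /**
     * Inserts one child row and resolves the foreign key in the same statement.
     * If no parent matches, the sub select yields NULL for parent_id.
     */
    static int insert(Connection conn, String name, String parentCode) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, name);
            ps.setString(2, parentCode);
            return ps.executeUpdate();
        }
    }
}
```

In a Spring Batch job the same SQL string could presumably be handed to the JDBC item writer, so the processor would not need its own lookup query.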
If you need to run a query anywhere in the batch process, for example from your batch listeners, you can use the method below.
The following piece of code, which I wrote, worked for me.
In your Main class, load your application context into a static variable, APP_CONTEXT.
If you are not using the XML-based approach, get the dataSource by autowiring it instead, and then use the code below:
DataSource dataSource = (DataSource) Main.APP_CONTEXT.getBean("dataSource");
// try-with-resources closes the statement and the connection even if the insert fails
try (Connection conn = dataSource.getConnection();
     PreparedStatement pstmt = conn.prepareStatement(" your SQL query to insert ")) {
    pstmt.executeUpdate(); // an INSERT needs executeUpdate(), not executeQuery()
} catch (SQLException e) {
    // don't swallow the exception silently
    throw new IllegalStateException("Insert failed", e);
}

jdbc batch different statements

I've been searching for this all over and keep getting the same answer. What I want is to batch different statements under one variable using JDBC in Java. So far, all I have found is batching statements that share the same pattern, e.g. INSERT INTO table VALUES(?, ?), which can be done with a PreparedStatement. I have also tried batching different types of statements using java.sql.Statement, and they executed well: for example, an update and an insert under one statement, committed once. The problem with java.sql.Statement is that it does not do what PreparedStatement does, which people call escaping. The problem with PreparedStatement, in turn, is that it only batches statements of the same pattern: you can't mix an update and an insert; it has to be one of the two.
So I thought I would use java.sql.Statement, but is there a library that does what PreparedStatement does, i.e. string escaping to avoid SQL injection? Also, if I am confusing batching with another term that I may not know, please correct me and tell me what the thing I want to do is called, namely executing multiple different statements under one java.sql.Statement.
One last thing: when batching, I realized there is no up-front validation of syntax, which I don't like; all errors are only detected during execution. This might also be something a library that validates SQL could handle.
What you have described is correct.
You can batch a set of similar statements and execute them at once. But as far as I know, there is no Java library that groups or batches different kinds of statements together and executes them.
One last point: when you use a PreparedStatement object, the SQL statement is compiled only once. If the statement contains errors, they are thrown at that point; otherwise the statement is executed. If the same statement is sent to the database again with different values, it is not recompiled but simply executed by the database server.
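If the goal is mainly "an update and an insert, committed once, with bound parameters", one plain-JDBC option is to run separate PreparedStatements inside a single transaction. Note that this is not a JDBC batch: each statement is still its own round trip. The sketch below uses made-up table names (`t`, `audit_log`) and a hypothetical helper class.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class MixedStatements {

    /** Runs an UPDATE and an INSERT with bound parameters and commits once. */
    static void updateAndInsert(Connection conn, int id, String newName, String logMessage)
            throws SQLException {
        conn.setAutoCommit(false); // both statements join one transaction
        try {
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE t SET name = ? WHERE id = ?")) {
                update.setString(1, newName);
                update.setInt(2, id);
                update.executeUpdate();
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO audit_log (message) VALUES (?)")) {
                insert.setString(1, logMessage);
                insert.executeUpdate();
            }
            conn.commit(); // single commit for both statements
        } catch (SQLException e) {
            conn.rollback(); // undo the partial work
            throw e;
        }
    }
}
```

This keeps the parameter binding of PreparedStatement and the single commit, at the cost of one network call per statement.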
Since you're looking for a library to do this kind of thing, yes, jOOQ can do it for you via its BatchedConnection, and you don't even have to use jOOQ's DSL to access this feature, though it works with the DSL as well. The following code snippet illustrates how this works.
Let's assume you have this logic:
// This is your original JDBC connection
try (Connection connection = ds.getConnection()) {
doSomethingWith(connection);
}
// And then:
void doSomethingWith(Connection c) {
try (PreparedStatement s = c.prepareStatement("INSERT INTO t (a, b) VALUES (?, ?)")) {
s.setInt(1, 1);
s.setInt(2, 2);
s.executeUpdate();
}
try (PreparedStatement s = c.prepareStatement("INSERT INTO t (a, b) VALUES (?, ?)")) {
s.setInt(1, 3);
s.setInt(2, 4);
s.executeUpdate();
}
try (PreparedStatement s = c.prepareStatement("INSERT INTO u (x) VALUES (?)")) {
s.setInt(1, 1);
s.executeUpdate();
}
try (PreparedStatement s = c.prepareStatement("INSERT INTO u (x) VALUES (?)")) {
s.setInt(1, 2);
s.executeUpdate();
}
}
Now, instead of re-writing your code, you can simply wrap it with jOOQ glue code:
// This is your original JDBC connection
try (Connection connection = ds.getConnection()) {
// Now wrap that with jOOQ and turn it into a "BatchedConnection":
DSL.using(connection).batched(c -> {
// Retrieve the augmented connection again from jOOQ and run your original logic:
c.dsl().connection(connection2 -> {
doSomethingWith(connection2);
});
});
}
Now, whatever your method doSomethingWith() does with a JDBC Connection is batched as well as possible, i.e. the first two inserts are batched together, and so are the third and fourth.

Is this database dump design ok?

I have written a Java program to do the following and would like opinions on my design:
Read data from a CSV file. The file is a database dump with 6 columns.
Write data into a MySQL database table.
The database table is as follows:
CREATE TABLE MYTABLE
(
ID int PRIMARY KEY not null auto_increment,
ARTICLEID int,
ATTRIBUTE varchar(20),
VALUE text,
LANGUAGE smallint,
TYPE smallint
);
1. I created an object to store each row.
2. I used OpenCSV to read each row into a list of the objects created in step 1.
3. I iterate over this list of objects and, using PreparedStatements, write each row to the database.
The solution should be highly amenable to the changes in requirements and demonstrate good approach, robustness and code quality.
Does that design look ok?
Another method I tried was to use the 'LOAD DATA LOCAL INFILE' sql statement. Would that be a better choice?
EDIT: I'm now using OpenCSV and it's handling the issue of having commas inside actual fields. The issue now is nothing is writing to the DB. Can anyone tell me why?
public static void exportDataToDb(List<Object> data) {
Connection conn = connect("jdbc:mysql://localhost:3306/datadb","myuser","password");
try{
PreparedStatement preparedStatement = null;
String query = "INSERT into mytable (ID, X, Y, Z) VALUES(?,?,?,?);";
preparedStatement = conn.prepareStatement(query);
for(Object o : data){
preparedStatement.setString(1, o.getId());
preparedStatement.setString(2, o.getX());
preparedStatement.setString(3, o.getY());
preparedStatement.setString(4, o.getZ());
}
preparedStatement.executeBatch();
}catch (SQLException s){
System.out.println("SQL statement is not executed!");
}
}
From a purely algorithmic perspective, and unless your source CSV file is small, it would be better to:
1. prepare your insert statement
2. start a transaction
3. load one (or a few) line(s) from the file
4. insert the small batch into your database
5. return to step 3 while there are lines remaining
6. commit
This way, you avoid loading the entire dump in memory.
But basically, you would probably be better off using LOAD DATA.
If the number of rows is huge, the code will fail at step 2 with an out-of-memory error. You need to find a way to read the rows in chunks and run a batch with a prepared statement for each chunk, continuing until all the rows are processed. This will work for any number of rows, and the batching will also improve performance. Other than this, I don't see any issue with the design.
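A possible shape for the chunked approach is sketched below. It is only an illustration under a few assumptions: the rows arrive as an `Iterator<String[]>` from whatever CSV parser is used, every field is bound as a string for brevity, and `ChunkedCsvLoader` and `chunkSize` are made-up names. Note the `addBatch()` call: `executeBatch()` only sends rows that were queued with it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;

final class ChunkedCsvLoader {

    /** Inserts the rows in chunks inside one transaction; returns the rows queued. */
    static long load(Connection conn, String insertSql, Iterator<String[]> rows, int chunkSize)
            throws SQLException {
        conn.setAutoCommit(false);                        // start a transaction
        long total = 0;
        try (PreparedStatement ps = conn.prepareStatement(insertSql)) { // prepare once
            int pending = 0;
            while (rows.hasNext()) {                      // pull one parsed row at a time
                String[] fields = rows.next();
                for (int i = 0; i < fields.length; i++) {
                    ps.setString(i + 1, fields[i]);       // JDBC parameters are 1-based
                }
                ps.addBatch();                            // queue the row
                total++;
                if (++pending == chunkSize) {
                    ps.executeBatch();                    // send the full chunk
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();                        // send the last partial chunk
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();                              // leave no half-loaded dump
            throw e;
        }
        return total;
    }
}
```

Because only one chunk is queued at a time, memory use no longer grows with the size of the file.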

Turning IDENTITY_INSERT ON on a table to load it with DB Unit

I am trying to load a table that has an identity column with DbUnit. I want to be able to set the id value myself (I don't want the database to generate it for me).
Here is a minimal definition of my table
create table X (
id numeric(10,0) IDENTITY PRIMARY KEY NOT NULL
)
To insert a line in X, I execute the following SQL
set IDENTITY_INSERT X ON
insert into X(id) VALUES(666)
No problem. But when I try to load this table with the following db unit XML dataset (RS_7_10_minimal_ini.xml)
<dataset>
<X id="666"/>
</dataset>
using the following minimal JUnit (DBTestCase) test case :
package lms.lp.functionnal_config;
import java.io.FileInputStream;
import org.dbunit.DBTestCase;
import org.dbunit.PropertiesBasedJdbcDatabaseTester;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import lms.DBUnitConfig;
import org.junit.Test;
public class SampleTest extends DBTestCase
{
public SampleTest(String name)
{
super( name );
System.setProperty( PropertiesBasedJdbcDatabaseTester.DBUNIT_DRIVER_CLASS, DBUnitConfig.DBUNIT_DRIVER_CLASS );
System.setProperty( PropertiesBasedJdbcDatabaseTester.DBUNIT_CONNECTION_URL, DBUnitConfig.DBUNIT_CONNECTION_URL );
System.setProperty( PropertiesBasedJdbcDatabaseTester.DBUNIT_USERNAME, DBUnitConfig.DBUNIT_USERNAME );
System.setProperty( PropertiesBasedJdbcDatabaseTester.DBUNIT_PASSWORD, DBUnitConfig.DBUNIT_PASSWORD );
}
protected IDataSet getDataSet() throws Exception
{
return new FlatXmlDataSetBuilder().build(new FileInputStream("src/test/resources/RS_7_10_minimal_ini.xml"));
}
@Test
public void testXXX() {
// ...
}
}
It fails with the following exception
com.sybase.jdbc3.jdbc.SybSQLException: Explicit value specified for identity field in table 'X' when 'SET IDENTITY_INSERT' is OFF.
It seems DbUnit does not turn IDENTITY_INSERT on before inserting a row for which the value of the identity column is specified.
I already tried to execute the statement myself on the connection retrieved from the JdbcDatabaseTester, but no luck. Probably a new connection, or at least not the same connection, is used to push the data into the DB.
Any idea?
Thanks a lot for your help all !
Octave
Yes, found the solution in the DBUnit FAQ actually
Can I use DbUnit with IDENTITY or auto-increment columns?
Many RDBMSes allow IDENTITY and auto-increment columns to be implicitly overwritten with client values. DbUnit can be used with these RDBMS natively. Some databases, like MS SQL Server and Sybase, need to explicitly activate client values writing. The way to activate this feature is vendor-specific.
DbUnit provides this functionality for MS SQL Server with the InsertIdentityOperation class.
Although it is written for MS SQL Server, it also works for Sybase. So I push my data set to the DB with
new InsertIdentityOperation(DatabaseOperation.CLEAN_INSERT).execute(connection, initialDataSet);
Et voilà.
Thanks for your answer rawheiser.
I'm not familiar enough with DbUnit to help you with the specifics, but I have used a table truncate and reseeded the identity value in similar situations.
dbcc checkident