Is this database dump design ok? - mysql

I have written a Java program to do the following and would like opinions on my design:
Read data from a CSV file. The file is a database dump with 6 columns.
Write data into a MySQL database table.
The database table is as follows:
CREATE TABLE MYTABLE
(
ID int PRIMARY KEY not null auto_increment,
ARTICLEID int,
ATTRIBUTE varchar(20),
VALUE text,
LANGUAGE smallint,
TYPE smallint
);
I created an object to store each row.
I used OpenCSV to read each row into a list of objects created in 1.
Iterate this list of objects and using PreparedStatements, I write each row to the database.
The solution should be highly amenable to the changes in requirements and demonstrate good approach, robustness and code quality.
Does that design look ok?
Another method I tried was to use the 'LOAD DATA LOCAL INFILE' sql statement. Would that be a better choice?
EDIT: I'm now using OpenCSV and it's handling the issue of having commas inside actual fields. The issue now is nothing is writing to the DB. Can anyone tell me why?
public static void exportDataToDb(List<Object> data) {
Connection conn = connect("jdbc:mysql://localhost:3306/datadb","myuser","password");
try{
PreparedStatement preparedStatement = null;
String query = "INSERT into mytable (ID, X, Y, Z) VALUES(?,?,?,?);";
preparedStatement = conn.prepareStatement(query);
for(Object o : data){
preparedStatement.setString(1, o.getId());
preparedStatement.setString(2, o.getX());
preparedStatement.setString(3, o.getY());
preparedStatement.setString(4, o.getZ());
}
preparedStatement.executeBatch();
}catch (SQLException s){
System.out.println("SQL statement is not executed!");
}
}

From a purely algorithmic perspective, and unless your source CSV file is small, it would be better to
prepare your insert statement
start a transaction
load one (or a few) line(s) from it
insert the small batch into your database
return to 3. while there are some lines remainig
commit
This way, you avoid loading the entire dump in memory.
But basically, you probably had better use LOAD DATA.

If the no. of rows is huge, then the code will fail at Step 2 with out of memory error. You need to figure out a way to get rows in chunks and perform a batch with prepared statement for that chunk, continue till all the rows are processed. This will work for any no. of rows and also the batching will improve performance. Other than this I don't see any issue with the design.

Related

Efficiently load large JSON file into SQL Server

I have been attempting to identify what the most effective way to load a large JSON file into SQL Server is.
I have a rather primitive API that most helpfully returns me a 150+ MB JSON string with ~450k rows and 12 columns. Ultimately I want to be able to use this in power-query, I figure the simplest way to store, index, then query it efficiently will be in SQL Server.
I have asked regarding filtering the data from the source (the most logical solution, but for the purposes of this question, count it out).
I have attempted code like the following to prove it will work
DECLARE #JSON VARCHAR(MAX)
select #JSON = BulkColumn
from OPENROWSET
(BULK 'C:\temp\test.txt', SINGLE_CLOB)
AS j
Select * FROM OPENJSON (#JSON)
The issue I have is that after 7 minutes query time on a laptop with fast SSD, 16gb ram and i7-7800hq I decided there must be a better way of going about this.
I'm happy to try any other language (say python, r, C# clr)
Security isn't my primary concern, I think I should be able to refresh the data in a few seconds rather than waiting several minutes.
I never used to encounter problems loading significant volumes (several hundreds of megabytes of json) in, simply a query variable from the C# app I wrote to load these files.. The load also included a query that only inserted records that didn't already exist.
INSERT dba.dbo.Table(columns,from,json)
FROM
OPENJSON (#json, '$.result' )
WITH (
column varchar(256) '$.a',
from varchar(256) '$.b',
json varchar(256) '$.c'
) j
LEFT JOIN dba.dbo.Table b
ON
j.column = b.column
WHERE
b.column IS NULL
Something like this would take around 30 seconds to load 100mb of json and insert only the new records. The c# looked like:
using (SqlCommand sc = new SqlCommand("...", new SqlConnection(_connstr)))
{
sc.Parameters.Add("#json", SqlDbType.NVarChar, -1).Value = json;
var prior = sc.Connection.State;
if (sc.Connection.State != ConnectionState.Open)
sc.Connection.Open();
int ins = sc.ExecuteNonQuery();
if (prior == ConnectionState.Closed)
sc.Connection.Close();
return ins;
}
Critically, I seldom needed to do what you're doing (retrieve all the rows back as a SSMS result set or whatever). I would do the load, then do the work I needed with them. An appreciable proportion of the time you're waiting might be down to SSMS retrieving millions of rows

.NET insert data from SQL Server to MySQL without looping through data

I'm working a .NET app where the user selects some filters like date, id, etc.
What I need to do is query a SQL Server database table with those filters, and dump them into a MySQL table. I don't need all fields in the table, only a few.
So far, I need to loop through all records in the SQL Server Dataset and insert them one by one on my MySQL table.
Is there anyway of achieving better performance? I've been playing with Dapper but cant figure out a way to do something like:
Insert into MySQLTable (a,b,c)
Select a,b,c from SQLServerTable
where a=X and b=C
Any ideas?
Linked server option is not possible because we have no access to the SQL server configuration, so looking for the most efficient way of bulk inserting data.
If I were to do this inside .NET with dapper I'd use c# and do the following;
assumptions:
a table in both tables with the same schema;
CREATE Events (
EventId int,
EventName varchar(10));
A .Net class
public class Event
{
public int EventId { get; set; }
public string EventName { get; set; }
}
The snippet below should give you something you can use as a base.
List<Event> Events = new List<Event>();
var sqlInsert = "Insert into events( EventId, EventName ) values (#EventId, #EventName)";
using (IDbConnection sqlconn = new SqlConnection(Sqlconstr))
{
sqlconn.Open();
Events = sqlconn.Query<Event>("Select * from events").ToList();
using (IDbConnection mySqlconn = new SqlConnection(Sqlconstr))
{
mySqlconn.Open();
mySqlconn.Execute(sqlInsert, Events);
}
}
The snippet above selects the rows from the events table in SQL Server and populates the Events list. normally Dapper will return an IEnumerable<>, but you are casting that ToList(). Now with the Events list, you connect to MySQL and execute the insert statements against the Events list.
This snippet is just a barebones example. Without a transaction on the Execute, each row will be autocommitted. If you add a Transaction, it will commit when all the items in the Events list are inserted.
Of course there are disadvantages doing it this way. One important thing to realize is that if you are trying to insert 1million rows from SQL to MySQL, that list will contain 1million entries which will increase the memory footprint. In those cases I'd use Dapper's Buffered = false option. This will return the 1million rows 1 row at a time. Your c# code can them enumerate over the results and add the row to a list and keep a counter. after 1000 rows have been inserted into list you can do the insert part into MySQL then clear the list and contine enumerating over the rows.
this will keep the memory footprint of your application small while processing a large number of rows.
With all that said, nothing beats bulk insert at the server level.
-HTH

Does Statement.RETURN_GENERATED_KEYS generate any extra round trip to fetch the newly created identifier?

JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps= connection.prepareStatement(
"INSERT INTO post (title) VALUES (?)",
Statement.RETURN_GENERATED_KEYS
);
while (resultSet.next()) {
LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested if the Oracle, SQL Server, postgresQL, or MySQL driver uses a separate round trip to fetch the identifier, or there is a single round trip which executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However sometimes batches that can otherwise be streamed to the server may be broken up into smaller chunks or run one by on if generated keys are requested. To avoid this, use the String[] array form where you name the columns you want returned and name only columns of fixed-width data types like integer. This only matters for batches, and it's a due to a design problem in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
In my opinion even for such a trivial thing a single approach working in all database systems will fail.
The only pragmatic solution is (in analogy to Hibernate) to find the best working solution for each target RDBMS (and
call it a dialect of your one for all solution:)
Here the information for Oracle
I'm using a sequence to generate key, same behavior is observed for IDENTITY column.
create table auto_pk
(id number,
pad varchar2(100));
This works and use only one roundtrip
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
def id = generatedKeys.getString(1);
But unfortunately you get ROWID as a result - not the generated key
How is it implemented internally? You can see it if you activate a 10046 trace (BTW this is also the best way to see
how many roundtrips were performed)
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see the JDBC Standard 3.0 is implemented, but you don't get a requested result. Under the cover is used the
RETURNING clause.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER);
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()
if (null != generatedKeys && generatedKeys.next()) {
def id = generatedKeys.getLong(1);
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace use
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a defined DBMS. (The Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you, but where frameworks are built on separate entities and doing all sorts of joins and filters in code or have custom-built transaction isolation logic things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps= connection.prepareStatement(
"INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.

Spring Batch: ItemProcessor query Database?

I have a scenario where I need to parse flat files and process those records into mysql database inserts (schema already exists).
I'm using the FlatFileItemReader to parse the files and a JdbcCursorItemWriter to insert in the database.
I'm also using an ItemProcessor to convert any column values or skip records that I don't want.
My problem is, some of those inserts need to have a foreign key to some other table that already has data into it.
So I was thinking to do a select to retrieve the ID and update the pojo, inside the ItemProcessor logic.
Is this the best way to do it? I can consider alternatives as I'm just beginning to write all this.
Thanks!
The ItemProcessor in a Spring Batch step is commonly used for enrichment of data and querying a db for something like that is common.
For the record, another option would be to use a sub select in your insert statement to get the foreign key value as the record is being inserted. This may be a bit more performant give it removes the additional db hit.
for the batch process - if you require any where you can call use below method anywhere in batch using your batch listeners
well the below piece of code which I wrote , worked for me --
In you Main class - load your application context in a static variable - APP_CONTEXT
If you are not using XML based approach - then get the dataSource by auto-wiring it and then you can use below code -
Connection conn = null;
PreparedStatement pstmt= null;
try {
DataSource dataSource = (DataSource) Main.APP_CONTEXT
.getBean("dataSource");
conn = dataSource.getConnection();
pstmt = conn.prepareStatement(" your SQL query to insert ");
pstmtMstr.executeQuery();
} catch (Exception e) {
}finally{
if(pstmt!=null){
pstmt.close();
}if(conn!=null){
conn.close();
}
}

JDBC batch insert performance

I need to insert a couple hundreds of millions of records into the mysql db. I'm batch inserting it 1 million at a time. Please see my code below. It seems to be slow. Is there any way to optimize it?
try {
// Disable auto-commit
connection.setAutoCommit(false);
// Create a prepared statement
String sql = "INSERT INTO mytable (xxx), VALUES(?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
Object[] vals=set.toArray();
for (int i=0; i<vals.length; i++) {
pstmt.setString(1, vals[i].toString());
pstmt.addBatch();
}
// Execute the batch
int [] updateCounts = pstmt.executeBatch();
System.out.append("inserted "+updateCounts.length);
I had a similar performance issue with mysql and solved it by setting the useServerPrepStmts and the rewriteBatchedStatements properties in the connection url.
Connection c = DriverManager.getConnection("jdbc:mysql://host:3306/db?useServerPrepStmts=false&rewriteBatchedStatements=true", "username", "password");
I'd like to expand on Bertil's answer, as I've been experimenting with the connection URL parameters.
rewriteBatchedStatements=true is the important parameter. useServerPrepStmts is already false by default, and even changing it to true doesn't make much difference in terms of batch insert performance.
Now I think is the time to write how rewriteBatchedStatements=true improves the performance so dramatically. It does so by rewriting of prepared statements for INSERT into multi-value inserts when executeBatch() (Source). That means that instead of sending the following n INSERT statements to the mysql server each time executeBatch() is called :
INSERT INTO X VALUES (A1,B1,C1)
INSERT INTO X VALUES (A2,B2,C2)
...
INSERT INTO X VALUES (An,Bn,Cn)
It would send a single INSERT statement :
INSERT INTO X VALUES (A1,B1,C1),(A2,B2,C2),...,(An,Bn,Cn)
You can observe it by toggling on the mysql logging (by SET global general_log = 1) which would log into a file each statement sent to the mysql server.
You can insert multiple rows with one insert statement, doing a few thousands at a time can greatly speed things up, that is, instead of doing e.g. 3 inserts of the form INSERT INTO tbl_name (a,b,c) VALUES(1,2,3); , you do INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(1,2,3),(1,2,3); (It might be JDBC .addBatch() does similar optimization now - though the mysql addBatch used to be entierly un-optimized and just issuing individual queries anyhow - I don't know if that's still the case with recent drivers)
If you really need speed, load your data from a comma separated file with LOAD DATA INFILE , we get around 7-8 times speedup doing that vs doing tens of millions of inserts.
If:
It's a new table, or the amount to be inserted is greater then the already inserted data
There are indexes on the table
You do not need other access to the table during the insert
Then ALTER TABLE tbl_name DISABLE KEYS can greatly improve the speed of your inserts. When you're done, run ALTER TABLE tbl_name ENABLE KEYS to start building the indexes, which can take a while, but not nearly as long as doing it for every insert.
You may try using DDBulkLoad object.
// Get a DDBulkLoad object
DDBulkLoad bulkLoad = DDBulkLoadFactory.getInstance(connection);
bulkLoad.setTableName(“mytable”);
bulkLoad.load(“data.csv”);
try {
// Disable auto-commit
connection.setAutoCommit(false);
int maxInsertBatch = 10000;
// Create a prepared statement
String sql = "INSERT INTO mytable (xxx), VALUES(?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
Object[] vals=set.toArray();
int count = 1;
for (int i=0; i<vals.length; i++) {
pstmt.setString(1, vals[i].toString());
pstmt.addBatch();
if(count%maxInsertBatch == 0){
pstmt.executeBatch();
}
count++;
}
// Execute the batch
pstmt.executeBatch();
System.out.append("inserted "+count);