Any best practice for doing table record inserts with a SQL CLR stored procedure? - sqlclr

Recently we turned a set of complicated C#-based scheduling logic into a SQL CLR stored procedure (running in SQL Server 2005). We believe our code is a good SQL CLR candidate because:
The logic involves tons of data from SQL Server.
The logic is complicated and hard to implement in T-SQL.
There is no threading or synchronization, and no access to resources outside of the sandbox.
The results of our stored procedure have been pretty good so far. However, since the output of our logic is in the form of several tables of data, we can't just return a single rowset as the result of the procedure. Instead, our code contains a lot of "INSERT INTO ..." statements in foreach loops in order to save each record from the C# generic collections into SQL tables. During code review, someone raised a concern about whether this inline SQL INSERT approach within SQL CLR could cause performance problems, and wondered whether there is a better way to get the data out of our C# generic collections.
So, any suggestions?

I ran across this while working on an SQLite project a few months back and found it enlightening. I think it might be what you're looking for.
...
Fastest universal way to insert data using standard ADO.NET constructs
Now that the slow stuff is out of the way, let's talk about some hardcore bulk loading. Aside from SqlBulkCopy and specialized constructs involving ISAM or custom bulk insert classes from other providers, there is simply no beating the raw power of ExecuteNonQuery() on a parameterized INSERT statement. I will demonstrate:
internal static void FastInsertMany(DbConnection cnn)
{
    using (DbTransaction dbTrans = cnn.BeginTransaction())
    {
        using (DbCommand cmd = cnn.CreateCommand())
        {
            cmd.CommandText = "INSERT INTO TestCase(MyValue) VALUES(?)";
            DbParameter Field1 = cmd.CreateParameter();
            cmd.Parameters.Add(Field1);
            for (int n = 0; n < 100000; n++)
            {
                Field1.Value = n + 100000;
                cmd.ExecuteNonQuery();
            }
        }
        dbTrans.Commit();
    }
}
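Bringing that back to the SQL CLR question: the same pattern applies on the in-process context connection, i.e. create one parameterized INSERT command and reuse it for every element of the collection rather than building SQL text per row. Below is a minimal sketch only; the table, column, and GetScheduledItems() names are placeholders, not anything from the question.

using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    [SqlProcedure]
    public static void SaveScheduleResults()
    {
        // "context connection=true" runs the commands in-process, on the
        // caller's connection and transaction, so there is no network hop.
        using (SqlConnection conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            using (SqlCommand cmd = conn.CreateCommand())
            {
                cmd.CommandText =
                    "INSERT INTO dbo.ScheduleResult (TaskId, StartTime) VALUES (@taskId, @startTime)";
                SqlParameter taskId = cmd.Parameters.Add("@taskId", SqlDbType.Int);
                SqlParameter startTime = cmd.Parameters.Add("@startTime", SqlDbType.DateTime);

                // GetScheduledItems() stands in for whatever generic collection
                // your scheduling logic produces.
                foreach (var item in GetScheduledItems())
                {
                    taskId.Value = item.TaskId;
                    startTime.Value = item.StartTime;
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}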

You could return a table with two columns (COLLECTION_NAME nvarchar(max), CONTENT xml), with one row per internal collection you have. CONTENT would be an XML representation of the data in that collection.
Then you can use the XML features of SQL Server 2005/2008 to shred each collection's XML into a table, and perform your INSERT or MERGE statements on the whole table at once.
That should be faster than individual INSERTs inside your C# code.
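As a rough sketch of the shape this answer describes (assuming each collection has already been serialized to an XML string by some helper of your own; the names here are illustrative only), the CLR procedure could stream one row per collection through SqlContext.Pipe:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlTypes;
using System.IO;
using System.Xml;
using Microsoft.SqlServer.Server;

// collectionsAsXml maps a collection name to its XML serialization.
public static void SendCollectionsAsXml(IDictionary<string, string> collectionsAsXml)
{
    SqlMetaData[] columns =
    {
        new SqlMetaData("COLLECTION_NAME", SqlDbType.NVarChar, SqlMetaData.Max),
        new SqlMetaData("CONTENT", SqlDbType.Xml)
    };
    SqlDataRecord row = new SqlDataRecord(columns);

    SqlContext.Pipe.SendResultsStart(row);
    foreach (KeyValuePair<string, string> entry in collectionsAsXml)
    {
        row.SetString(0, entry.Key);
        row.SetSqlXml(1, new SqlXml(XmlReader.Create(new StringReader(entry.Value))));
        SqlContext.Pipe.SendResultsRow(row);
    }
    SqlContext.Pipe.SendResultsEnd();
}

On the T-SQL side, each CONTENT value can then be shredded with the xml type's nodes() and value() methods and inserted or merged as a set.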

Related

Writing to MySQL with Python without using SQL strings

I am importing data into my Python 3 environment and then writing it to a MySQL database. However, there are a lot of different data tables, so writing out each INSERT statement isn't really pragmatic, plus some have 50+ columns.
Is there a good way to create a table in MySQL directly from a dataframe, and then send insert commands to that same table using a dataframe of the same format, without having to actually type out all the column names? I started trying to pull the column names, format them, and concatenate everything into a string, but it is extremely messy.
Ideally there is a function out there to directly handle this. For example:
apiconn.request("GET", url, headers=datheaders)
#pull in some JSON data from an API
eventres = apiconn.getresponse()
eventjson = json.loads(eventres.read().decode("utf-8"))
#create a dataframe from the data
eventtable = json_normalize(eventjson)
dbconn = pymysql.connect(host='hostval',
                         user='userval',
                         passwd='passval',
                         db='dbval')
cursor = dbconn.cursor()
sql = sqltranslate(table='eventtable', fun='append')
#where sqltranslate() is some magic function that takes a dataframe and
#creates SQL commands that pymysql can execute.
cursor.execute(sql)
What you want is a way to abstract the generation of the SQL statements.
A library like SQLAlchemy will do a good job, including a powerful way to construct DDL, DML, and DQL statements without needing to directly write any SQL.

Does Statement.RETURN_GENERATED_KEYS generate any extra round trip to fetch the newly created identifier?

JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    Statement.RETURN_GENERATED_KEYS
);
ps.setString(1, title);
ps.executeUpdate();

ResultSet resultSet = ps.getGeneratedKeys();
while (resultSet.next()) {
    LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested in whether the Oracle, SQL Server, PostgreSQL, or MySQL driver uses a separate round trip to fetch the identifier, or whether there is a single round trip that executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However, batches that could otherwise be streamed to the server may sometimes be broken up into smaller chunks or run one by one if generated keys are requested. To avoid this, use the String[] form where you name the columns you want returned, and name only columns of fixed-width data types like integer. This only matters for batches, and it's due to a design limitation in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
The MySQL driver receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no extra communication overhead when requesting generated keys.
In my opinion, even for such a trivial thing, a single approach that works across all database systems will fail.
The only pragmatic solution is (by analogy with Hibernate) to find the best working solution for each target RDBMS, and call it a dialect of your one-for-all solution. :)
Here is the information for Oracle.
I'm using a sequence to generate the key; the same behavior is observed for an IDENTITY column.
create table auto_pk
(id number,
pad varchar2(100));
This works and uses only one round trip:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
        Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getString(1);
}
But unfortunately you get the ROWID as the result, not the generated key.
How is it implemented internally? You can see it if you activate a 10046 trace (BTW, this is also the best way to see how many round trips were performed):
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see that the JDBC 3.0 standard is implemented, but you don't get the result you requested. Under the covers, the RETURNING clause is used.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER);
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getLong(1);
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace use
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a known DBMS. (Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you; but when frameworks are built on separate entities, do all sorts of joins and filters in code, or have custom-built transaction isolation logic, things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.

Is this database dump design ok?

I have written a Java program to do the following and would like opinions on my design:
Read data from a CSV file. The file is a database dump with 6 columns.
Write data into a MySQL database table.
The database table is as follows:
CREATE TABLE MYTABLE
(
ID int PRIMARY KEY not null auto_increment,
ARTICLEID int,
ATTRIBUTE varchar(20),
VALUE text,
LANGUAGE smallint,
TYPE smallint
);
I created an object to store each row.
I used OpenCSV to read each row into a list of the objects created in step 1.
I iterate over this list of objects and, using PreparedStatements, write each row to the database.
The solution should be highly amenable to changes in requirements and demonstrate a good approach, robustness, and code quality.
Does that design look ok?
Another method I tried was to use the 'LOAD DATA LOCAL INFILE' SQL statement. Would that be a better choice?
EDIT: I'm now using OpenCSV and it's handling the issue of having commas inside actual fields. The issue now is nothing is writing to the DB. Can anyone tell me why?
public static void exportDataToDb(List<Object> data) {
    Connection conn = connect("jdbc:mysql://localhost:3306/datadb", "myuser", "password");
    try {
        PreparedStatement preparedStatement = null;
        String query = "INSERT into mytable (ID, X, Y, Z) VALUES(?,?,?,?);";
        preparedStatement = conn.prepareStatement(query);
        for (Object o : data) {
            preparedStatement.setString(1, o.getId());
            preparedStatement.setString(2, o.getX());
            preparedStatement.setString(3, o.getY());
            preparedStatement.setString(4, o.getZ());
        }
        preparedStatement.executeBatch();
    } catch (SQLException s) {
        System.out.println("SQL statement is not executed!");
    }
}
From a purely algorithmic perspective, and unless your source CSV file is small, it would be better to:
1. prepare your insert statement
2. start a transaction
3. load one (or a few) line(s) from the file
4. insert the small batch into your database
5. return to step 3 while there are lines remaining
6. commit
This way, you avoid loading the entire dump into memory.
But basically, you would probably be better off using LOAD DATA.
If the number of rows is huge, the code will fail at step 2 with an out-of-memory error. You need to figure out a way to fetch rows in chunks and perform a batch with a prepared statement for each chunk, continuing until all the rows are processed. This will work for any number of rows, and the batching will also improve performance. Other than that, I don't see any issue with the design.
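Both answers above describe the same chunk-and-commit loop. Here is a rough sketch of that structure in C#/ADO.NET (chosen to match the other examples on this page; with JDBC the shape is identical, using addBatch()/executeBatch() and commit() per chunk). The connection string, column mapping, and naive comma splitting are placeholders; a real CSV parser such as OpenCSV should handle quoting.

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void ImportInChunks(string connectionString, string csvPath, int chunkSize)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    using (StreamReader reader = new StreamReader(csvPath))
    {
        conn.Open();
        string line = reader.ReadLine();
        while (line != null)
        {
            // One transaction (and one commit) per chunk, so the whole file is
            // never held in memory and a failed chunk rolls back cleanly.
            using (SqlTransaction tran = conn.BeginTransaction())
            using (SqlCommand cmd = conn.CreateCommand())
            {
                cmd.Transaction = tran;
                cmd.CommandText =
                    "INSERT INTO MYTABLE (ARTICLEID, ATTRIBUTE, VALUE, LANGUAGE, TYPE) " +
                    "VALUES (@articleId, @attribute, @value, @language, @type)";
                SqlParameter articleId = cmd.Parameters.Add("@articleId", SqlDbType.Int);
                SqlParameter attribute = cmd.Parameters.Add("@attribute", SqlDbType.VarChar, 20);
                SqlParameter value = cmd.Parameters.Add("@value", SqlDbType.NVarChar, -1);
                SqlParameter language = cmd.Parameters.Add("@language", SqlDbType.SmallInt);
                SqlParameter type = cmd.Parameters.Add("@type", SqlDbType.SmallInt);

                int rowsInChunk = 0;
                while (line != null && rowsInChunk < chunkSize)
                {
                    string[] fields = line.Split(',');   // naive split; does not handle quoted commas
                    articleId.Value = int.Parse(fields[0]);
                    attribute.Value = fields[1];
                    value.Value = fields[2];
                    language.Value = short.Parse(fields[3]);
                    type.Value = short.Parse(fields[4]);
                    cmd.ExecuteNonQuery();

                    rowsInChunk++;
                    line = reader.ReadLine();
                }
                tran.Commit();
            }
        }
    }
}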

Generic SQL for update / insert

I'm writing a DB layer which talks to MS SQL Server, MySQL, and Oracle. I need an operation which can update an existing row if it contains certain data, or otherwise insert a new row, all in one SQL operation.
Essentially I need to save over existing data if it exists, or add it if it doesn't
Conceptually this is the same as upsert except it only needs to work on a single table. I'm trying to make sure I don't need to delete then insert as this has a performance impact.
Is there generic SQL to do this or do I need vendor specific solutions?
Thanks.
You need vendor-specific SQL, as MySQL (unlike MS SQL Server and Oracle) doesn't support MERGE:
http://en.wikipedia.org/wiki/Merge_(SQL)
I suspect that sooner rather than later, you're going to need a vendor specific implementation of your DB layer - SQL portability is pretty much a myth as soon as you do anything even slightly advanced.
I am pretty sure this is going to be vendor specific. For SQL Server, you can accomplish this using the MERGE statement.
If you are using SQL Server 2008, use the MERGE statement. But keep in mind that if your insert part involves some condition, MERGE cannot be used, in which case you need to write this logic yourself. And in your case you will have to anyway, since you are also targeting MySQL, which does not have a MERGE statement.
Why are you not using an ORM layer (like Entity Framework) for this purpose?
Just some pseudo code (in C#):
public int SaveTask(tblTaskActivity task, bool isInsert)
{
    int result = 0;
    using (var tmsEntities = new TMSEntities())
    {
        if (isInsert) // for insert
        {
            tmsEntities.AddTotblTaskActivities(task);
            result = tmsEntities.SaveChanges();
        }
        else // for update
        {
            var taskActivity = tmsEntities.tblTaskActivities.Where(i => i.TaskID == task.TaskID).FirstOrDefault();
            taskActivity.Priority = task.Priority;
            taskActivity.ActualTime = task.ActualTime;
            result = tmsEntities.SaveChanges();
        }
    }
    return result;
}
In MySQL you have something similar to merge:
insert ... on duplicate key update ...
MySQL Reference - Insert on duplicate key update
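To make the "dialect" suggestion above concrete, here is a sketch of the kind of per-vendor statement selection a DB layer could do. The table and column names are placeholders, and each statement assumes a unique key on the name column.

using System;

// Returns the vendor-specific upsert statement for a simple name/value table.
static string UpsertSql(string vendor)
{
    switch (vendor)
    {
        case "sqlserver":
            return "MERGE dbo.Settings AS t " +
                   "USING (SELECT @name AS name, @val AS val) AS s ON t.name = s.name " +
                   "WHEN MATCHED THEN UPDATE SET val = s.val " +
                   "WHEN NOT MATCHED THEN INSERT (name, val) VALUES (s.name, s.val);";
        case "oracle":
            return "MERGE INTO settings t " +
                   "USING (SELECT :name AS name, :val AS val FROM dual) s ON (t.name = s.name) " +
                   "WHEN MATCHED THEN UPDATE SET t.val = s.val " +
                   "WHEN NOT MATCHED THEN INSERT (name, val) VALUES (s.name, s.val)";
        case "mysql":
            return "INSERT INTO settings (name, val) VALUES (@name, @val) " +
                   "ON DUPLICATE KEY UPDATE val = VALUES(val)";
        default:
            throw new NotSupportedException("Unknown vendor: " + vendor);
    }
}

The calling code stays the same for every vendor; only the statement text (and the parameter prefix for Oracle) changes per dialect.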

Table-Valued Parameters to CLR Procedures in SQL Server 2008 - possible?

This page from SQL Server 2008 BOL talks about CLR stored procedures and has a section labelled "Table-Valued Parameters", which talks about how they can be advantageous. That's great; I'd love to use TVPs in my CLR procs, but unfortunately this seems to be the only reference in the universe to such a possibility, and the section doesn't describe what the syntax would be (nor does the further information linked at the end of the paragraph).
Sure, I can easily find descriptions of how to use TVPs from T-SQL procs, or how to write CLR procs in general. But writing a CLR proc that takes a TVP? Nothing. This is all highly unusual, since passing multi-row data to a stored proc is a common problem.
This leads me to wonder whether the presence of that section on the page is an error. Somebody please tell me it's not and point me to more info/examples.
[EDIT]
I was about to post this to one of the MS forums too when I came across this, which seems to be the final nail in the coffin. Looks like it can't be done.
I can find a lot more references, but they are all about passing table-valued parameters to T-SQL procedures, so they're of little use.
However, I've come to the conclusion that it's impossible. First, there is the list of mappings between CLR and SQL types. For table types there is no mapping, so the following does not work, for example:
[SqlProcedure]
public static void StoredProcedure(DataTable tvp, out int sum)
{
    sum = 42;
}
and then
CREATE TYPE MyTableType AS TABLE
(
    Id INT NOT NULL PRIMARY KEY,
    [Count] INT NOT NULL
)
GO
CREATE ASSEMBLY ClrTest FROM '<somePath>'
GO
CREATE PROCEDURE ClrTest
    @tvp MyTableType READONLY
AS EXTERNAL NAME ClrTest.StoredProcedures.StoredProcedure
GO
Whatever type you try (DataTable, DbDataReader, IEnumerable), the CREATE PROCEDURE call keeps generating error 6552: CREATE PROCEDURE for "ClrTest" failed because T-SQL and CLR types for parameter "@tvp" do not match.
Second, the documentation on the page you linked to says: A user-defined table type cannot be passed as a table-valued parameter to, or be returned from, a managed stored procedure or function executing in the SQL Server process.
I cannot seem to find anywhere how to create a user-defined table type in C#, so this also seems to be a dead end.
Maybe you can ask your question somewhere on a Microsoft forum. It's still odd that they mention table-valued parameters on the CLR sproc page but never explain how to implement this. If you find any solution, I'd like to know.
You can use a temporary table, created and populated before you call the procedure, and read that table inside the CLR procedure.
The solution is to serialize your tabular data into a JSON-formatted string and then pass the string into your CLR proc. Within your CLR proc or function you would parse the JSON into an IEnumerable, list, or tabular object. You can then work with the data as you would any other type of tabular data.
I have written some utilities capable of serializing any SQL table into a JSON-formatted string. I would be happy to share them with anyone who provides their e-mail address. Phil Factor has written a nice T-SQL JSON parser he called parseJson. I have adapted his solution to the CLR, which performs much faster. Both accept a JSON-formatted string and produce a table from it. I also have a variety of JSON utilities that I use with both T-SQL and the CLR, capable of serializing, parsing, inserting, deleting, and updating JSON-formatted strings stored in SQL columns.
If you use C# (as opposed to VB, which lacks custom iterators) you can write ADO.NET code to invoke ExecuteNonQuery() and run a stored procedure with a SqlDbType.Structured parameter (i.e., a TVP).
The collection passed as the value of the TVP must implement IEnumerable<SqlDataRecord>. Each time this IEnumerable's yield return is executed, a SqlDataRecord “row” is pipelined to the "table" parameter.
See this article for details.
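A minimal sketch of that pattern follows. The table type, procedure name, and column are assumptions for illustration, e.g. CREATE TYPE dbo.IdList AS TABLE (Id INT NOT NULL) and a T-SQL procedure dbo.SaveIds taking @ids dbo.IdList READONLY.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

// Each yield return streams one row of the TVP to the server as it is produced.
static IEnumerable<SqlDataRecord> AsRecords(IEnumerable<int> ids)
{
    SqlMetaData meta = new SqlMetaData("Id", SqlDbType.Int);
    foreach (int id in ids)
    {
        SqlDataRecord record = new SqlDataRecord(meta);
        record.SetInt32(0, id);
        yield return record;
    }
}

static void SaveIds(SqlConnection conn, IEnumerable<int> ids)
{
    using (SqlCommand cmd = new SqlCommand("dbo.SaveIds", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        SqlParameter p = cmd.Parameters.AddWithValue("@ids", AsRecords(ids));
        p.SqlDbType = SqlDbType.Structured;      // table-valued parameter
        p.TypeName = "dbo.IdList";
        cmd.ExecuteNonQuery();
    }
}

Note that this streams the TVP to a T-SQL procedure; as the answers above conclude, the TVP cannot be declared as a parameter of the CLR procedure itself.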
While it looks like passing tables directly to CLR procedures is currently impossible, I got a result, albeit a suboptimal one, by:
defining a T-SQL table-valued UDT FooTable
defining a T-SQL function which takes FooTable as a parameter and returns XML using FOR XML EXPLICIT
passing the resultant XML to the CLR function/procedure instead of the table itself
Not ideal, but it gets a bit closer.