I have been trying to identify the most effective way to load a large JSON file into SQL Server.
I have a rather primitive API that most helpfully returns a 150+ MB JSON string with ~450k rows and 12 columns. Ultimately I want to use this in Power Query, and I figure the simplest way to store, index, and then query it efficiently is SQL Server.
I have asked about filtering the data at the source (the most logical solution, but for the purposes of this question, count it out).
I have tried code like the following to prove it will work:
DECLARE @JSON VARCHAR(MAX)
SELECT @JSON = BulkColumn
FROM OPENROWSET
(BULK 'C:\temp\test.txt', SINGLE_CLOB)
AS j
SELECT * FROM OPENJSON (@JSON)
The issue I have is that after 7 minutes of query time on a laptop with a fast SSD, 16 GB RAM and an i7-7800hq, I decided there must be a better way of going about this.
I'm happy to try any other language (say Python, R, or C# CLR).
Security isn't my primary concern. I think I should be able to refresh the data in a few seconds rather than waiting several minutes.
I never used to encounter problems loading significant volumes (several hundred megabytes of JSON), simply passed as a query parameter from the C# app I wrote to load these files. The load also included a query that only inserted records that didn't already exist:
INSERT dba.dbo.Table(columns,from,json)
SELECT j.*
FROM
OPENJSON (@json, '$.result' )
WITH (
column varchar(256) '$.a',
from varchar(256) '$.b',
json varchar(256) '$.c'
) j
LEFT JOIN dba.dbo.Table b
ON
j.column = b.column
WHERE
b.column IS NULL
Something like this would take around 30 seconds to load 100 MB of JSON and insert only the new records. The C# looked like:
using (SqlCommand sc = new SqlCommand("...", new SqlConnection(_connstr)))
{
    sc.Parameters.Add("@json", SqlDbType.NVarChar, -1).Value = json;
    var prior = sc.Connection.State;
    if (sc.Connection.State != ConnectionState.Open)
        sc.Connection.Open();
    int ins = sc.ExecuteNonQuery();
    if (prior == ConnectionState.Closed)
        sc.Connection.Close();
    return ins;
}
Critically, I seldom needed to do what you're doing (retrieving all the rows back as an SSMS result set or similar). I would do the load, then do the work I needed with the rows. An appreciable proportion of the time you're waiting might be down to SSMS retrieving and rendering millions of rows.
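Since the question is open to other languages, the load-then-insert-only-new pattern above can also be sketched in Python with the JSON parsed on the client. This is a minimal sketch under stated assumptions, not a SQL Server implementation: it uses the standard-library sqlite3 module as a stand-in database so that it runs anywhere, and the table name records, the staging table, the columns a, b, c and the top-level result key are all hypothetical names modelled on the query above.

```python
import json
import sqlite3


def load_new_rows(conn, json_text):
    """Insert only the rows whose key column 'a' is not already in the table."""
    # Assumed payload layout: {"result": [{"a": ..., "b": ..., "c": ...}, ...]}
    rows = json.loads(json_text)["result"]
    cur = conn.cursor()
    # Stage the parsed rows with a single batched call.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging (a TEXT, b TEXT, c TEXT)")
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging (a, b, c) VALUES (:a, :b, :c)", rows)
    # Anti-join: keep staged rows that have no match in the target table.
    cur.execute(
        "INSERT INTO records (a, b, c) "
        "SELECT s.a, s.b, s.c FROM staging AS s "
        "LEFT JOIN records AS r ON r.a = s.a "
        "WHERE r.a IS NULL"
    )
    inserted = cur.rowcount
    conn.commit()
    return inserted


# In-memory stand-in database with a hypothetical target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (a TEXT PRIMARY KEY, b TEXT, c TEXT)")
payload = json.dumps({"result": [
    {"a": "1", "b": "x", "c": "y"},
    {"a": "2", "b": "x", "c": "y"},
]})
first = load_new_rows(conn, payload)   # both rows are new
second = load_new_rows(conn, payload)  # same payload again, nothing new
```

Note that nothing is read back to the client except a row count, which is the point made above: the load itself stays cheap when no large result set has to be rendered.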
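JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    Statement.RETURN_GENERATED_KEYS
);
ps.setString(1, "example title"); // placeholder value
ps.executeUpdate();
// The generated keys are exposed as a ResultSet on the statement.
ResultSet resultSet = ps.getGeneratedKeys();
while (resultSet.next()) {
    LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}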
I'm interested in whether the Oracle, SQL Server, PostgreSQL, or MySQL driver uses a separate round trip to fetch the identifier, or whether there is a single round trip which executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However, batches that could otherwise be streamed to the server are sometimes broken up into smaller chunks or run one by one if generated keys are requested. To avoid this, use the String[] array form, where you name the columns you want returned, and name only columns of fixed-width data types like integer. This only matters for batches, and it's due to a design problem in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
In my opinion, even for such a trivial thing, a single approach that works in all database systems will fail.
The only pragmatic solution is (in analogy to Hibernate) to find the best working solution for each target RDBMS (and call it a dialect of your one-for-all solution :)
Here is the information for Oracle.
I'm using a sequence to generate the key; the same behavior is observed for an IDENTITY column.
create table auto_pk
(id number,
pad varchar2(100));
This works and uses only one roundtrip:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
    Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getString(1);
}
But unfortunately you get the ROWID as a result, not the generated key.
How is it implemented internally? You can see it if you activate a 10046 trace (BTW this is also the best way to see how many roundtrips were performed):
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see that the JDBC 3.0 standard is implemented, but you don't get the requested result. Under the covers, the RETURNING clause is used.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER);
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()
if (null != generatedKeys && generatedKeys.next()) {
def id = generatedKeys.getLong(1);
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace use
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a defined DBMS. (The Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you, but where frameworks are built on separate entities and doing all sorts of joins and filters in code or have custom-built transaction isolation logic things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps= connection.prepareStatement(
"INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.
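For illustration, the same "insert and get the key back in one statement" shape can be tried out in Python. This is only a sketch under stated assumptions: it uses the standard-library sqlite3 module as a stand-in, the helper name insert_post is hypothetical, and because SQLite runs in-process it shows the API shape rather than any network round-trip behaviour. SQLite supports RETURNING from version 3.35 onward, so the sketch falls back to the driver's lastrowid on older builds.

```python
import sqlite3


def insert_post(conn, title):
    """Insert one row into post and return its generated id."""
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        # One statement both inserts the row and hands back the generated key.
        rows = conn.execute(
            "INSERT INTO post (title) VALUES (?) RETURNING id", (title,)
        ).fetchall()
        return rows[0][0]
    # Older SQLite: ask the driver for the key of the last inserted row.
    cur = conn.execute("INSERT INTO post (title) VALUES (?)", (title,))
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)")
first_id = insert_post(conn, "first")
second_id = insert_post(conn, "second")
conn.commit()
```

The version check is the small-scale equivalent of the "dialect" idea mentioned earlier: the calling code stays the same while the statement used depends on what the target database supports.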
I am very much a SQL developer and am new to Redis, but its performance is very interesting. I have a problem that I think Redis could help me with a great deal. I have a SQL table similar to this:
| CONTAINER <String><NoUnq> | PROCESS <String><NoUnq> | PROCESS_DATA <String><NoUnq> | TimeCreated <TimeStamp><NoUnq>|
This table when populated to its max has roughly ~450,000,000 rows. I am running this on AWS. With these rows I select all the processes within a container (~1,000,000 containers), so I would have something like this in sql (of course container is indexed):
SELECT * FROM table WHERE container = '[CONTAINER_NAME]';
I then have a cronjob script which runs every hour and removes old processes from containers with something like this:
DELETE FROM table WHERE TimeCreated <= [SOME_TIME];
So essentially I want to keep only processes that are not older than ~4-5 hours. Looking at Redis, I feel like I can engineer something similar to improve my performance, but I am having trouble converting this SQL-like design to Redis.
My first thought was to use HSET, but I found out that HSET does not allow the EXPIRE command on fields, so I could not automatically remove old processes. I am most concerned about performance and efficiency.
It looks like you can (and probably should) use HSET, and it looks like you do not need to expire fields; you need to expire keys. Base the key name on the container name and set EXPIREAT on that key. If you are describing a relational table structure like the one above, the closest analogue is one key per table row:
MULTI
HMSET <container name:rowId> PROCESS <value> PROCESS_DATA <value>
EXPIREAT <container name:rowId> <TimeCreated + maximum age in seconds>
EXEC
You can also use a ZSET to store a time-ordered list of rows:
ZADD <container name> <TimeCreated> <rowId>
So you can use ZRANGE as the SELECT equivalent. You can also use Lua scripting to get the contents of a container in one request. Something like this (I may have made a mistake somewhere in the Lua syntax):
local result = {}
-- Row ids of the container, ordered by TimeCreated.
local ids = redis.call('zrange', KEYS[1], ARGV[1], ARGV[2])
for i, id in ipairs(ids) do
    -- Each row lives in the hash <container name>:<rowId>.
    result[i] = redis.call('hgetall', KEYS[1] .. ':' .. id)
end
return result
Where KEYS[1] is the container name, ARGV[1] is the start index and ARGV[2] is the end index.
P.S. You should also understand how Redis expires keys, so that you know what happens to memory on your instance.
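The key layout described above (one expiring hash per row plus a per-container sorted set ordered by creation time) could be driven from client code roughly as follows. This is a hedged sketch: the function names, the TTL_SECONDS value and the idea of trimming the sorted set on read are assumptions for illustration. The client argument is assumed to be a redis.Redis instance from the redis-py package, and only calls that package provides (pipeline, hset with mapping, expireat, zadd, zremrangebyscore, zrange, hgetall) are used.

```python
import time

TTL_SECONDS = 5 * 3600  # assumption: processes live for about five hours


def add_process(client, container, row_id, process, data, created=None):
    """Store one row as a self-expiring hash, indexed by creation time."""
    created = int(time.time() if created is None else created)
    key = f"{container}:{row_id}"
    pipe = client.pipeline(transaction=True)  # wraps the calls in MULTI/EXEC
    pipe.hset(key, mapping={"PROCESS": process, "PROCESS_DATA": data})
    pipe.expireat(key, created + TTL_SECONDS)
    pipe.zadd(container, {row_id: created})
    pipe.execute()


def get_processes(client, container, now=None):
    """Return the live rows of a container, oldest first."""
    now = int(time.time() if now is None else now)
    # Drop index entries whose hashes are due to have expired.
    client.zremrangebyscore(container, "-inf", now - TTL_SECONDS)
    rows = []
    for row_id in client.zrange(container, 0, -1):
        if isinstance(row_id, bytes):  # redis-py returns bytes by default
            row_id = row_id.decode()
        fields = client.hgetall(f"{container}:{row_id}")
        if fields:  # skip ids whose hash expired in the meantime
            rows.append(fields)
    return rows
```

The sorted set itself never expires, which is why the sketch trims it on every read; an hourly job calling ZREMRANGEBYSCORE per container would be the closer equivalent of the existing cron DELETE.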
I'm relatively new to Talend OSDI. I managed to run simple queries against MySQL with the tMySqlInput component. However, today I have a more ambitious query and am having some trouble making it work.
I need a query where the result depends on the previous row. I got it working in MySQL Workbench but not in Talend. Example: the delay between two dates.
Here is the query:
SET @var = NULL;
SELECT id, start_date, end_date, @var precedent, UNIX_TIMESTAMP(TIMEDIFF(start_date,@var)) AS diff, @var:=start_date AS temp
FROM ma_table
ORDER BY start_date;
and the error is:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT id, start_date, end_date, id_process_type, @var precedent, UNIX_TIMESTAMP' at line 2
...Not very useful. Is this syntax forbidden in Talend? Are there other ways to do such queries in Talend (for the delay between two dates, for example), or maybe another component? I am looking at tMysqlRow.
Thanks for any ideas!
As @Gabriele B mentions, you might want to consider doing this in a more "Talend" way.
I'd personally make use of the tMemorizeRows component to do this though.
To simplify this I've gone and made the start and end dates as integers but it should be trivial to handle this using proper dates.
If we have some data that shows the start and end date of a process and we want to work out the delay between finishing the last one and starting the next process we can read all of the data in and then use the tMemorizeRows component to remember the last 2 rows:
We then access the memorized data by array index. So here we go to a tJavaRow component that has an extra output column, startdelay. We calculate it by subtracting the last process's end date from the current process's start date:
output_row.id = input_row.id;
output_row.startdate = input_row.startdate;
output_row.enddate = input_row.enddate;
if (id_tMemorizeRows_1[0] != 1) {
output_row.startDelay = startdate_tMemorizeRows_1[0] - enddate_tMemorizeRows_1[1];
} else {
output_row.startDelay = 0;
}
The conditional statement is there to avoid null pointer errors on the first run of the data, as enddate_tMemorizeRows_1[1] will be null at that point. You could handle the null in other ways, of course.
This process is reasonably easy to understand and maintain (although there is that small bit of Java code in there) and has the benefit of only needing to load the data once and only keeping a small part of it in memory at any one time. It should also be very fast.
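Outside of Talend, the logic that tMemorizeRows plus tJavaRow implement here is a single pass that remembers only the previous row. A minimal Python sketch of that idea follows; the function name and the tuple layout (id, start, end) are assumptions, and dates are plain integers as in the simplified example above.

```python
def add_start_delays(rows):
    """rows: iterable of (id, start, end) tuples already ordered by start."""
    previous_end = None
    result = []
    for row_id, start, end in rows:
        # The first row has no predecessor, so its delay is defined as 0.
        delay = 0 if previous_end is None else start - previous_end
        result.append((row_id, start, end, delay))
        previous_end = end  # remember only the previous row
    return result


processes = [(1, 1, 3), (2, 5, 8), (3, 9, 10)]
delays = add_start_delays(processes)
# delays == [(1, 1, 3, 0), (2, 5, 8, 2), (3, 9, 10, 1)]
```

Tracking "no previous row yet" explicitly, rather than testing for a particular id, avoids depending on the first row always having id 1.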
You should consider refactoring the statement to do it in a "Talend" way, maybe a little slower but more portable and robust.
If your table is not huge, for example, I would recommend loading it into memory using tCacheOutput/tCacheInput (you can find them on Talend Exchange) with this design:
tMySqlLoad----->tCacheOutput_1
      |
      |
      |
  OnSubjobOk
      |
      |
      v
tCacheInput_1------->tMap_1--------+
                                   |
                                   |
                                 tJoin-------------->tMap_3------------>[output]
                                   |
                                   |
tCacheInput_2------->tMap_2--------'
First of all, you dump your table into a memory buffer.
Then you read this buffer twice. It's in memory, so it won't hurt performance.
In tMap_1 you add an auto-increment index using Numeric.sequence.
You do the same in tMap_2, but with a starting number of 2 (basically, you shift the index).
Then you self-join the table using these brand-new columns.
Finally, in tMap_3 you release your payload (i.e. compute the diff).
This is going to be a verbose but robust solution if your table is small. If it's not, and performance is not an issue, you can try an even more verbose solution such as prepared statements.
I have written a Java program to do the following and would like opinions on my design:
1. Read data from a CSV file. The file is a database dump with 6 columns.
2. Write data into a MySQL database table.
The database table is as follows:
CREATE TABLE MYTABLE
(
ID int PRIMARY KEY not null auto_increment,
ARTICLEID int,
ATTRIBUTE varchar(20),
VALUE text,
LANGUAGE smallint,
TYPE smallint
);
1. I created an object to store each row.
2. I used OpenCSV to read each row into a list of objects created in 1.
3. I iterate over this list of objects and, using PreparedStatements, write each row to the database.
The solution should be highly amenable to changes in requirements and demonstrate a good approach, robustness and code quality.
Does that design look OK?
Another method I tried was to use the 'LOAD DATA LOCAL INFILE' SQL statement. Would that be a better choice?
EDIT: I'm now using OpenCSV and it's handling the issue of having commas inside actual fields. The issue now is nothing is writing to the DB. Can anyone tell me why?
public static void exportDataToDb(List<Object> data) {
Connection conn = connect("jdbc:mysql://localhost:3306/datadb","myuser","password");
try{
PreparedStatement preparedStatement = null;
String query = "INSERT into mytable (ID, X, Y, Z) VALUES(?,?,?,?);";
preparedStatement = conn.prepareStatement(query);
for(Object o : data){
preparedStatement.setString(1, o.getId());
preparedStatement.setString(2, o.getX());
preparedStatement.setString(3, o.getY());
preparedStatement.setString(4, o.getZ());
}
preparedStatement.executeBatch();
}catch (SQLException s){
System.out.println("SQL statement is not executed!");
}
}
From a purely algorithmic perspective, and unless your source CSV file is small, it would be better to:
1. prepare your insert statement
2. start a transaction
3. load one (or a few) line(s) from the file
4. insert the small batch into your database
5. return to 3. while there are some lines remaining
6. commit
This way, you avoid loading the entire dump into memory.
But basically, you would probably be better off using LOAD DATA.
If the number of rows is huge, the code will fail at step 2 with an out-of-memory error. You need to find a way to read the rows in chunks, run a batch with the prepared statement for each chunk, and continue until all the rows are processed. This will work for any number of rows, and the batching will also improve performance. Other than this I don't see any issue with the design.
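The chunked approach both answers describe (read a few rows, insert them as a batch, repeat, commit once) can be sketched as follows. This is an illustration under stated assumptions rather than the Java/MySQL solution itself: it uses Python's standard csv and sqlite3 modules as stand-ins so that it runs without a server, the function name and chunk size are hypothetical, and the CSV is assumed to carry the six table columns in order.

```python
import csv
import io
import itertools
import sqlite3


def load_csv_in_chunks(conn, lines, chunk_size=1000):
    """Stream CSV rows into mytable, one batch of chunk_size rows at a time."""
    reader = csv.reader(lines)
    sql = ("INSERT INTO mytable (id, articleid, attribute, value, language, type) "
           "VALUES (?, ?, ?, ?, ?, ?)")
    total = 0
    with conn:  # one transaction: commit on success, roll back on error
        while True:
            # Only chunk_size rows are held in memory at any time.
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            conn.executemany(sql, chunk)
            total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, articleid INTEGER, "
    "attribute TEXT, value TEXT, language INTEGER, type INTEGER)"
)
# Tiny in-memory sample; a real run would pass an open file object instead.
sample = io.StringIO(
    '1,10,title,"Hello, world",1,0\n'
    '2,10,body,text,1,0\n'
    '3,11,title,Other,2,0\n'
)
loaded = load_csv_in_chunks(conn, sample, chunk_size=2)
```

The reader is consumed lazily, so memory use is bounded by the chunk size regardless of the file size, and the quoted field containing a comma is handled by the CSV parser rather than by string splitting.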
Recently we turned a set of complicated C#-based scheduling logic into a SQL CLR stored procedure (running in SQL Server 2005). We believed that our code was a great SQL CLR candidate because:
The logic involves tons of data from SQL Server.
The logic is complicated and hard to implement in T-SQL.
There is no threading, synchronization, or access to resources outside the sandbox.
The results of our SP are pretty good so far. However, since the output of our logic takes the form of several tables of data, we can't just return a single rowset as the result of the SP. Instead, our code has a lot of "INSERT INTO ...." statements in foreach loops in order to save each record from the C# generic collections into SQL tables. During code review, someone raised a concern about whether the inline SQL INSERT approach within SQL CLR could cause performance problems, and wondered if there is a better way to dump the data out (from our C# generic collections).
So, any suggestions?
I ran across this while working on an SQLite project a few months back and found it enlightening. I think it might be what you're looking for.
...
Fastest universal way to insert data using standard ADO.NET constructs
Now that the slow stuff is out of the way, lets talk about some hardcore bulk loading. Aside from SqlBulkCopy and specialized constructs involving ISAM or custom bulk insert classes from other providers, there is simply no beating the raw power of ExecuteNonQuery() on a parameterized INSERT statement. I will demonstrate:
internal static void FastInsertMany(DbConnection cnn)
{
    using (DbTransaction dbTrans = cnn.BeginTransaction())
    {
        using (DbCommand cmd = cnn.CreateCommand())
        {
            cmd.CommandText = "INSERT INTO TestCase(MyValue) VALUES(?)";
            DbParameter Field1 = cmd.CreateParameter();
            cmd.Parameters.Add(Field1);
            for (int n = 0; n < 100000; n++)
            {
                Field1.Value = n + 100000;
                cmd.ExecuteNonQuery();
            }
        }
        dbTrans.Commit();
    }
}
You could return a table with 2 columns (COLLECTION_NAME nvarchar(max), CONTENT xml) filled with as many rows as you have internal collections. CONTENT will be an XML representation of the data in the collection.
Then you can use the XML features of SQL Server 2005/2008 to parse each collection's XML into tables, and perform your INSERT INTO or MERGE statements on the whole table.
That should be faster than individual INSERTs inside your C# code.