Pentaho PDI (Spoon): MySQL table output very slow (~2000 rows/s) - mysql

My table output step is terribly slow (~2.000 rows/second), compared to the input (100.000-200.000 rows/second). The MySQL server is not the problem, using native MySQL, e.g. with the "Execute SQL script" step, I get something in the 100thousands/second. I already tried (without success) the common solution of extending the SQL options by:
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true
I also varied the commit size parameter (100, 1.000, 10.000) and Use batch updates for inserts is enabled, also without success. What else can I do? I have tables with ~10.000.000 rows and Pentaho runs on a very potent machine, so this is not acceptable.

For this I think the ideal step is MySQL Bulk Loader step which is listed under Bulk loading section. Along with that use the said
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true
in JDBC options in the connection.
These useCompression will compress the traffic between the client and the MySQL server
where as other two will form INSERT INTO tbl (a,b) VALUES (1,'x'),(2,'y'),(3,'z'); without using separate insert statements for each.

Follow these steps:
Increase the RAM Size for PDI a.k.a Spoon.
Using the Command line utility such as ( Kitchen or Pan) run your Job or Transformation.
Well Now compare the speed.
Cheers!

Related

No columns returned SSIS

I am implementing a SSIS package and currently trying to do the following.
Truncate the destination table
Fetch the data by executing the stored procedure and insert it into the destination table.
I have created an Execute SQL task to address step 1 and dataflow with oledb source and oledb destination to address the second point. It been working successfully so far but isn't working for one my stored procedure that uses temp tables.
When I edit the oledb source and click the preview button, I get the error no column returned
I know that SSIS has an issue with generating column while executing stored procedures that depend on temp tables. I have converted the stored proc to use temporary table variables and its now able to return columns in SSIS when I do a preview. The only downside is that the stored procedure is taking longer time to execute. Its taking 1 hour 15 mins as compared to 15 mins while using temp tables.
I did see a suggestion to use SET FMTONLY before executing the stored procedure as an alternate solution to changing to temp table variables but that didn't seem to work as I am getting syntax or permission denied error.
Could somebody tell me a solution to my problem which does not compromise on the performance.
Sounds like you've already read all the approaches to using Temp tables in SSIS, including the IF 1=0... trick? If you haven't seen that one yet, google it.
You say that using Table Variables causes your stored procedure to take about 5 times longer than using Temp Tables. The most likely reason for that is that you are indexing your temp tables but not your table variables. If you didn't know that table variables can be indexed, they can. You might try that.
Finally, a solution that you haven't mentioned is that you can replace your temporary table with a real table that gets truncated when you're done using it.
Short comment:
Try EXEC WITH RESULT SETS and specify the metadata yourself for a proc with temp tables; or use the Script Component as a source and specify the Output columns yourself.
Long comment:
Technically speaking, it is the driver/database you are using in SSIS that would decide the behavior when working with temp tables.
Metadata is an important factor when using SSIS's pipeline components. By metadata, I mean the names of the columns, their data types etc that a pipeline component uses. When designing a data flow, someone/something should provide this metadata to the components that require it.
In most cases, SSIS automatically retreives the metadata. Components that do not connect to a external data source, like Conditional Split etc, get their metadata from the other components they are connected to. For the pipeline components that connect to a external data source (like Oledb source, oledb destination, Lookup etc.), SSIS provides a mechanism to get this metadata without human involvement. This mechanism involves the driver connecting to the database and retrieving the metadata of the output. If the driver/database is capable of returning the metadata, then that metadata is used. If the driver/database is incapable, then you get the errors you are seeing. The rest of my comments are based on the assumption that you are using a SQL Server database in your question.
When working with a SQL Server database in SSIS, typically, we use the native client drivers provided by Microsoft. When trying to get the metadata, these drivers try to get the metadata without actually executing the SQL Statement (actual execution can have side effects; and also, might take more than a few seconds/minutes/hours; and you dont want side effects and long wait times during package design time.) So to get the metadata, the driver relies on the metadata of the actual objects used in the sql command. If the command uses a physical table or view, SQL Server already has the metadata available and can supply it to the driver. If it is a temp table, SQL Server does not have the metadata until it can create the temp table. If using FMT ONLY option, you can use it in such a way to create the temp tables, but avoid any heavy processing/side affects and thus be able to retrieve metadata without penalties. Post 2012, these native client drivers rely on some newer functionality to retrieve metadata than the drivers before 2012. In 2012 and after, the driver uses the sp_describe_first_result_set proc to retrieve metadata. So, whether you can get metadata or not is determined by the ability of the sp_describe_first_result_set proc.
So while SSIS can automatically get the metadata (because of the driver/database), it does not automatically get the metadata in some cases (again because of the driver/database). In cases involving the second scenario, some other process (typically a human) can help the driver infer metadata or provide the metadata to the component directly.
To help the driver, in case of SQL Server 2012 and after, you can use the WITH RESULTSETS clause to specify the output metadata. When this clause is present, the driver will use it and doesnt try to query the metadata from system objects; and thus avoid the error which you would otherwise get. If you are using the drivers that came with SQL Server 2008, you can use FMT ONLY. This option is at the driver/database level.
Another option could be to use a Script Component as the Source and in the Output columns, you can specify the columns/metadata. SSIS would not try to retrieve metadata from the datasource in this case, but would rely on the definitions you provided in the Output section of the Script Component.
As you can see, both options involve a human (or some other process) specifying the metadata instead of SSIS trying to retrieve the metadata in an automated fashion. I would prefer the first option if working with SQL Server and the second option if working with databases like MySql.

How to insert data to mysql directly (not using sql queries)

I have a MySQL database that I use only for logging. It consists of several simple look-alike MyISAM tables. There is always one local (i.e. located on the same machine) client that only writes data to db and several remote clients that only read data.
What I need is to insert bulks of data from local client as fast as possible.
I have already tried many approaches to make this faster such as reducing amount of inserts by increasing the length of values list, or using LOAD DATA .. INFILE and some others.
Now it seems to me that I've came to the limitation of parsing values from string to its target data type (doesn't matter if it is done when parsing queries or a text file).
So the question is:
does MySQL provide some means of manipulating data directly for local clients (i.e. not using SQL)? Maybe there is some API that allow inserting data by simply passing a pointer.
Once again. I don't want to optimize SQL code or invoke the same queries in a script as hd1 adviced. What I want is to pass a buffer of data directly to the database engine. This means I don't want to invoke SQL at all. Is it possible?
Use mysql's LOAD DATA command:
Write the data to file in CSV format then execute this OS command:
LOAD DATA INFILE 'somefile.csv' INTO TABLE mytable
For more info, see the documentation
Other than LOAD DATA INFILE, I'm not sure there is any other way to get data into MySQL without using SQL. If you want to avoid parsing multiple times, you should use a client library that supports parameter binding, the query can be parsed and prepared once and executed multiple times with different data.
However, I highly doubt that parsing the query is your bottleneck. Is this a dedicated database server? What kind of hard disks are being used? Are they fast? Does your RAID controller have battery backed RAM? If so, you can optimize disk writes. Why aren't you using InnoDB instead of MyISAM?
With MySQL you can insert multiple tuples with one insert statement. I don't have an example, because I did this several years ago and don't have the source anymore.
Consider as mentioned to use one INSERT with multiple values:
INSERT INTO table_name (col1, col2) VALUES (1, 'A'), (2, 'B'), (3, 'C'), ( ... )
This leads to you only having to connect to your database with one bigger query instead of several smaller. It's easier to take in the entire couch through the door once than running back and forth with all disassembled pieces of the couch, opening the door every time. :)
Apart from that, you can also run LOCK TABLES table_name WRITE before INSERT and UNLOCK TABLES afterwards. That will secure that nothing else is inserted during.
Lock tables
INSERT into foo (foocol1, foocol2) VALUES ('foocol1val1', 'foocol2val1'),('foocol1val2','foocol2val2') and so on should sort you. More information and sample code will be found here. If you have further problems, do leave a comment.
UPDATE
If you don't want to use SQL, then try this shell script to do as many inserts as you want, put it in a file, say insertToDb.sh, and get on with your day/evening:
#!/bin/sh
mysql --user=me --password=foo dbname -h foo.example.com -e "insert into tablename (col1, col2) values ($1, $2);"
Invoke as sh insertToDb.sh col1value col2value. If I've still misunderstood your question, leave another comment.
After making some investigation I found no way of passing data directly to mysql database engine (without parsing it).
My aim was to speed up communication between local client and db server as much as possible. The idea was if client is local then it could use some api functions to pass data to db engine thus not using (i.e. parsing) SQL and values in it. The only closest solution was proposed by bobwienholt (using prepared statement and binding parameters). But LOAD DATA .. INFILE appeared to be a bit faster in my case.
The best way to insert data on MS SQL without using insert into or update queries is just to access MS SQL Interface. Right click on the table name and select "Edit top 200 rows". Then you will be able to add data on the database directly by just typing per cell. For you to enable searching or using select or other sql commands just right click on any of the 200 rows you have selected. Go to pane then select SQL and you can add sql command. Check it out. :D
without using insert statement , use " Sqllite Studio " for inserting data in mysql. It's free and open source so u can download and check.

Updating records in MYSQL with SSIS

I am writing an SSIS package that has a conditional split from a SQL Server source that splits records to either be updated or inserted into a MYSQL database.
The SQL Server connection has provider .NET Provider for OldDB\SQL Server Native Client 10.0.
The MYSQL connection is a MYSQL ODBC 5.1 ADO.NET connection.
I was thinking about using the OLE DB Command branching off of the conditional split to update records but I connect use this and connect to the MYSQL database.
Does anyone know how to accomplish this task?
I would write to a staging table for updates including the PK and columns to be updated and then execute an UPDATE SQL statement using that table and the table to be updated. The alternative is to use the command for every row and that just doesn't seem to perform that well in my experience - at least compared to a nice fat batch insert and a single update command.
For that matter, I guess you could do without the conditional split altogether, write everything to a staging table and then use an UPDATE and INSERT in SQL back to back.
Probably, the following MSDN blog link might help you. I haven't tried this.
How do I UPDATE and DELETE if I don’t have an OLEDB provider?
The post suggests the following three options.
Script Component
Store the data in a Recordset
Use a custom component (like Merge destination component)
The author also had posted two other articles about MySQL prior to posting the above article.
Connecting to MySQL from SSIS
Writing to a MySQL database from SSIS
Hope that points you in the right direction.

SQL Server to MySQL data transfer

I am trying to transfer bulk data on a constant and continuous based from a SQL Server database to a MYSQL database. I wanted to use SQL Server's SSMS's replication but this apparently is only for SQL Server to Oracle or IBM DB2 connection. Currently we are using SSIS to transform data and push it to a temporary location at the MYSQL database where it is copied over. I would like the fastest way to transfer data and am complication several methods.
I have a new way I plan on transforming the data which I am sure will solve most time issues but I want to make sure we do not run into time problems in the future. I have set up a linked server that uses a MYSQL ODBC driver to talk between SQL Server and MYSQL. This seems VERY slow. I have some code that also uses Microsoft's ODBC driver but is used so little that I cannot gauge the performance. Does anyone know of lightening fast ways to communicate between these two databases? I have been researching MYSQL's data providers that seem to communicate with a OleDB layer. Im not too sure what to believe and which way to steer towards, any ideas?
I used the jdbc-odbc bridge in Java to do just this in the past, but performance through ODBC is not great. I would suggest looking at something like http://jtds.sourceforge.net/ which is a pure Java driver that you can drop into a simple Groovy script like the following:
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:jtds:sqlserver://serverName/dbName-CLASS;domain=domainName',
'username', 'password', 'net.sourceforge.jtds.jdbc.Driver' )
sql.eachRow( 'select * from tableName' ) {
println "$it.id -- ${it.firstName} --"
// probably write to mysql connection here or write to file, compress, transfer, load
}
The following performance numbers give you a feel for how it might perform:
http://jtds.sourceforge.net/benchTest.html
You may find some performance advantages to dumping data to a mysql dumpfile format and using mysql loaddata instead of writing row by row. MySQL has some significant performance improvements for large data sets if you load infile's and doing things like atomic table swaps.
We use something like this to quickly load large datafiles into mysql from one system to another e.g. This is the fastest mechanism to load data into mysql. But real time row by row might be a simple loop to do in groovy + some table to keep track of what row had been moved.
mysql> select * from table into outfile 'tablename.dat';
shell> myisamchk --keys-used=0 -rq '/data/mysql/schema_name/tablename'
mysql> load data infile 'tablename.dat' into table tablename;
shell> myisamchk -rq /data/mysql/schema_name/tablename
mysql> flush tables;
mysql> exit;
shell> rm tablename.dat
The best way I have found to transfer SQL data (if you have the space) is a SQL dump in one language and then to use a converting software tool (or perl script, both are prevalent) to convert the SQL dump from MSSQL to MySQL. See my answer to this question about what converter you may be interested in :) .
We've used the ado.net driver for mysql in ssis with quite a bit of success. Basically, install the driver on the machine with integration services installed, restart bids, and it should show up in the driver list when you create an ado.net connection manager.
As for replication, what exactly are you trying to accomplish?
If you are monitoring changes, treat it as a type 1 slowly changing dimension (data warehouse terminology, but same principal applies). Insert new records, update changed records.
If you are only interested in new records and have no plans to update previously loaded data, try an incremental load strategy. Insert records where source.id > max(destination.id).
After you've tested the package, schedule a job in the sql server agent to run the package every x minutes.
Cou can also try the following.
http://kofler.info/english/mssql2mysql/
I tried this a longer time before and it worked for me. But I woudn't recommend it to you.
What is the real problem, what you try to do?
Don´t you get a MSSQL DB Connection, for example from Linux?

MySQL to SQL Server migration

I have a mysql database full of data which I need to keep but migrate to SQL Server 2008.
I know end to end where the data should go, table to table but I have no idea how to go about moving the data. I've looked around the web but it seems there are 'solutions' which you have to download and run. I'd rather if possible do something myself in terms of writing scripts or code.
Can anyone recommend the best way to do this please?
You have several options here:
On the sql server side, you can set up a connection to your old mysql db using something called a linked server. This will allow you to write sql code for sql server that returns data from the mysql tables. You can use this to build INSERT or SELECT INTO statements.
You can write queries for mysql to export your data as csv, and then use the BULK INSERT features of sql server to efficiently import the csv data.
You can use Sql Server integration services to set move the data over from mysql.
Regardless of which you choose, non-data artifacts like indexes, foreign keys, triggers, stored procedures, and security will have to be moved manually.
Have you tried tool from MSFT called SQL Server Migration Assistance for MySQL ???
https://www.microsoft.com/download/en/details.aspx?id=1495
Try this tutorial it is very easy to perform migration to SQL Server from Mysql and is straightforward as mentioned
http://www.codeproject.com/Articles/29106/Migrate-MySQL-to-Microsoft-SQL-Server
Thanks
You can use the Import/Export Wizard that comes with SQL Server Standard Edition.
Select your 'data source' from MySQL using the ODBC data source. Note: You will need to first install the from ODBC driver for MySQL (ODBC Connector). Then, select your SQL Server destination. Select all tables, and fire it up. You will need to add your primary and foreign keys, and indexes manually.
A bit more automated means would be by using the SQL Server Migration Assistant for MySQL, also free. It has the benefit of recreating the relationships and indexes automatically for you. Probably your best bet.
I did it once, some time ago. First you could couple your mssql server to the mysql server using the odbc mysql connector
http://dev.mysql.com/downloads/connector/
After the connection is made you can write you database procedure as you would if it were two mssql db's. Probably easiest to write some sql batch scripts including a cursor where you run through every every row of a table an decide on a field basis where you will need the field in the future.
example of a cursor: http://www.mssqltips.com/tip.asp?tip=1599
If you decide to go with the cursor, you can play with the parameter to increase performance. I especially remember the FORWARD_ONLY parameter giving a big boost.