I want to write to a DB using PySpark, as below:
df.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()
I am not sure how this write works internally, i.e. is it atomic in nature? Sometimes, even if there is a failure, I see some records inserted, and sometimes, in case of errors, none of the records is inserted. What is the default behaviour of this write?
Also, is there any option that lets me write while ignoring DML errors, like column-length issues or date-format issues?
Spark executes the insert queries in parallel, in independent transactions, one per partition. That is why you sometimes find data after an error and sometimes you don't: whichever partitions committed before the failure keep their rows, and the insertion order need not be the same on every run. The number of partitions, and therefore of parallel transactions, depends on how the DataFrame is partitioned: after a shuffle it is the value of spark.sql.shuffle.partitions (default 200), or whatever you set with df.repartition(n). If any error is encountered, Spark stops the execution, but transactions that have already committed are not rolled back.
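To make the parallelism concrete, here is a minimal sketch (reusing jdbc_url, table, user and password from the question; the partition count 8 and batchsize 1000 are arbitrary illustration values, not recommendations):

# Each of the 8 partitions becomes an independent JDBC transaction; a failure
# can therefore leave behind rows from partitions that had already committed.
(df.repartition(8)
    .write.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .option("batchsize", 1000)  # rows per JDBC batch within each transaction
    .mode("append")
    .save())

Fewer partitions means fewer independent transactions (and less partial data after a failure), at the cost of less parallelism.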
To answer your other question: the queries are executed by MySQL, so Spark has no control over how they are handled. So no, there is no option to ignore such DML errors; Spark submits the inserts just as any other external client would. It is the developer's responsibility to ensure the data has the expected length and format, and you can always convert the data to the expected format before writing to MySQL.
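For example, you can defensively truncate strings and normalise dates before the write. A hedged sketch (the name and created_at columns, and their target types, are hypothetical):

from pyspark.sql import functions as F

# Hypothetical columns: trim to a VARCHAR(255) target length and parse dates
# into a proper DATE before handing the frame to the JDBC writer.
clean_df = (df
    .withColumn("name", F.substring("name", 1, 255))
    .withColumn("created_at", F.to_date("created_at", "yyyy-MM-dd")))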
My input table in MySQL has 20 million records and the target table in Oracle is empty. I need to load the entire table from MySQL into Oracle. I am simply using a Table Input and a Table Output step.
My intention is not to lock the source table for a long time whilst reading.
Is there a problem with the load (number of records) I am trying to achieve?
I can see a Use batch update for inserts option in the Table Output, but nothing similar in the Table Input. Is there a way to perform batch processing in Pentaho?
Don't worry: 20 million records is a small number for PDI, and you will not lock a table that is open for input. That is also why the Bulk load option exists for output tables, not input tables.
A common beginner trap, however, is the Truncate table option on the output step. If you run the output step twice (inadvertently, or for parallel processing), each one will lock the other. Forever.
To speed things up, you may use the Lazy conversion check box on the input step, so that the data remains in byte format until it is actually used. I am not sure you gain much on a simple input/output transformation, though, and if something goes wrong with Dates or Blobs while writing the output, the error message will be quite cryptic.
You can also increase the speed of the output by increasing the commit size (worth a few trials against Oracle), and by increasing the Number of rows in row set, which increases the number of rows read ahead by the Table Input. To do so, right-click anywhere, then Properties/Miscellaneous.
Something I really advise is to increase the JVM memory size. Use an editor (Notepad or better) to edit the file named spoon.bat. Around line 94-96 you'll find a line containing something like "-Xmx256K". Change it to "-Xmx4096M" (where 4096 is half the size of your machine's RAM).
To perform "batch processing" has many meaning. One of them beeing Make the transformation database transactional. Which you can do with the check box just below the above mentionned Number of row in rowset (and buggily spelled as Make the transformation database in PDI latest version). With that box checked, if something goes wrong the state of the databases is rolled back as if the transformation was never executed. But I don't advise to do this in your case.
In addition to @AlainD's solution, there are a couple of options:
- Tune MySQL for better performance on Inserts
- Use the MySQL Bulk loader step in PDI
- Write SQL statements to file with PDI and read them with mysql-binary
Speed can also be boosted by using some simple JDBC connection settings:
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true
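In PDI these go into the Options tab of the database connection; expressed as a plain JDBC URL, the same settings would look roughly like this (host and database are placeholders):

# Placeholder host/db; the three MySQL Connector/J options above appended
# as URL parameters.
jdbc_url = ("jdbc:mysql://db-host:3306/mydb"
            "?useServerPrepStmts=false"
            "&rewriteBatchedStatements=true"
            "&useCompression=true")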
I have a big table which stores records with an ID taken from an external API. The ID is stored in an INT field. When I developed the system, I encountered no problems, because the IDs of records in the external API were always below 2147483647.
The system has been fetching data from the API for the last few months, and apparently the IDs crossed the 2147483647 mark. I now have a database with thousands of unusable records whose ID is 2147483647.
It is not possible to fetch this information again (basically, the API only allows us to look up data from at most x days ago).
I am pretty sure that I am doomed. But might there be any log, or any other way, to retrieve the original input queries, or the numbers that were truncated by MySQL to fit in the INT field?
As already discussed in the comments, there is no way to retrieve the information from the table. It was silently(?!!!) truncated to 32 bits.
First, call the API provider, explain your situation, and see if you can redo the queries. Best that happens is they say yes and you don't have to try to reconstruct things from logs. Worst that happens is they say no and you're back where you are now.
Then there are some logs I would check.
First is the MySQL General Query Log. If you had it turned on, it may contain the queries that were run. Another possibility is the Slow Query Log, which is more often enabled, if your queries happened to be slow.
In MySQL, data truncation is a warning by default. It's possible those warnings went into a log and included the original data. The MySQL Error Log is one possibility. On Windows it may have gone into the Windows Event Log. On a Mac, it might be in a log visible to the Console. In Unix, it might have gone to syslog.
Then it's possible the API queries themselves are logged somewhere. If you used a proxy it might contain them in its log. The program fetching from the API and adding to the database may also have its own logs. It's a long shot.
As a last resort, try grepping all of /var/log and /var/local/log and anywhere else you might think could contain a log.
In the future there are some things you can do to prevent this sort of thing from happening again. The most important is to turn on strict SQL mode. This will turn warnings, like that data has been truncated, into errors.
Set UNIQUE constraints on unique columns. Had your API ID column been declared UNIQUE the error would have been detected.
Use UNSIGNED BIGINT for numeric IDs. 2 billion is a number easily exceeded these days. It will mean 4 extra bytes per row or about 8 gigabytes extra to store 2 billion rows. Disk is cheap.
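Taken together, a hedged sketch of those three fixes using mysql-connector-python (the credentials, the api_records table, and its api_id column are all hypothetical):

import mysql.connector

# Placeholder credentials and table/column names.
conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()

# Strict mode: truncation becomes an error instead of a silent warning.
cur.execute("SET SESSION sql_mode = 'STRICT_ALL_TABLES'")

# Room for IDs beyond 2**31 - 1, and duplicates rejected outright.
cur.execute("ALTER TABLE api_records MODIFY api_id BIGINT UNSIGNED NOT NULL")
cur.execute("ALTER TABLE api_records ADD CONSTRAINT uq_api_id UNIQUE (api_id)")
conn.commit()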
Consider turning on ANSI SQL mode. This will disable a lot of MySQL extensions and make your SQL more portable.
Finally, consider switching to PostgreSQL. Over the years MySQL has accumulated a lot of bad ideas, mish-mashes of functions, and bad default behaviors. You just got bit by one. PostgreSQL is far better designed, more powerful and flexible, and usually as fast or faster.
In Postgres, you would have gotten an error.
test=# CREATE TABLE foo ( id INTEGER );
CREATE TABLE
test=# INSERT INTO foo (id) VALUES (2147483648);
ERROR: integer out of range
If you have binary logging enabled, you still have backups of the binlogs, and your binlog_format is not set to ROW, then your original INSERT and/or UPDATE statements should be preserved there. You could extract them and replay them into another server with a more appropriate table definition.
If you don't have the binlog enabled and/or you aren't archiving the binlogs in perpetuity... this is one of the reasons why you should consider doing it.
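If they are archived, a hedged sketch of the extraction step (the binlog path and table name are placeholders; mysqlbinlog is the stock tool that ships with the MySQL server):

import subprocess

# Decode a statement-format binlog to SQL text, then keep the lines that
# touched the affected (hypothetical) table for inspection or replay.
out = subprocess.run(["mysqlbinlog", "/var/backups/binlog.000123"],
                     capture_output=True, text=True, check=True).stdout
statements = [line for line in out.splitlines() if "api_records" in line]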
I have to read approximately 5000 rows of at most 6 columns from an XLS file. I'm using PHP >= 5.3 and will be using PHPExcel for the task. I haven't tried it yet, but I think it can handle the job (if you have other options, they are welcome).
The issue is that every time I read a row, I need to query the database to verify whether that particular row exists; if it does, I overwrite it, and if not, I add it.
I think that's going to take a lot of time, and PHP will simply time out (I can't modify the timeout variable, since it's a shared server).
Could you give me a hand with this?
Appreciate your help
Since you're using MySQL, all you have to do is insert data and not worry about a row being there at all.
Here's why and how:
If you query the database from PHP to verify that a row exists before inserting, that's bad: you are prone to false results. There is a lag between the PHP check and the MySQL insert, during which another process can insert the same row, so PHP can't be used to verify data integrity. That's the job of the database.
To ensure there are no duplicate rows, we use UNIQUE constraints on our columns.
MySQL extends the SQL standard with the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax. That lets you just insert the data, and if there is a duplicate row, update it with the new data instead.
Reading 5000 rows is quick, and inserting 5000 rows is also quick if you wrap the inserts in transactions. I would suggest reading about 100 rows from the Excel file at a time, starting a transaction, and just inserting the data (using ON DUPLICATE KEY UPDATE to handle duplicates). That lets you spend one disk I/O to save 100 records. Done this way, the whole process finishes in a few seconds, so you don't need to worry about performance or timeouts.
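A sketch of that batch-upsert pattern, shown in Python with mysql-connector-python since the idea translates directly to PHP/PDO (the products table, its columns, and the read_rows helper are hypothetical):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()

sql = ("INSERT INTO products (sku, name, price) VALUES (%s, %s, %s) "
       "ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)")

rows = read_rows()  # hypothetical helper: ~5000 (sku, name, price) tuples
for i in range(0, len(rows), 100):
    for row in rows[i:i + 100]:
        cur.execute(sql, row)
    conn.commit()   # one transaction per 100-row chunk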
First, run this process via exec, so the timeout does not matter.
Second, select all existing rows before reading the Excel file. Don't do it in one query; read, say, 2000 rows at a time and collect them into an array.
Third, use the xlsx format and a chunked reader, which lets you avoid reading the whole file at once.
It's not a 100% guarantee, but I did the same.
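For reference, here is a Python analogue of that chunked approach using openpyxl's read-only mode (in PHPExcel a chunked read filter plays the same role; the file name and the upsert_batch helper are placeholders):

from openpyxl import load_workbook

wb = load_workbook("import.xlsx", read_only=True)  # streams rows, low memory
chunk = []
for row in wb.active.iter_rows(values_only=True):
    chunk.append(row)
    if len(chunk) == 100:
        upsert_batch(chunk)  # hypothetical: e.g. the upsert batch shown above
        chunk = []
if chunk:
    upsert_batch(chunk)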
I've just read the MySQL docs, where I found this sentence: "A consistent read means that InnoDB uses multi-versioning to present to a query a snapshot of the database at a point in time."
I have read a lot of MySQL doc pages, but still can't clarify to myself what exactly "to a query" means here. Definitely it relates to a SELECT statement, but what about if my transaction starts with an UPDATE, INSERT, or DELETE statement?
Thanks!
I found the answer myself, and I think it may be useful to others. After days of searching through the Oracle docs, I finally found:
InnoDB creates a consistent read view or a consistent snapshot either when the statement
mysql> START TRANSACTION WITH CONSISTENT SNAPSHOT;
is executed or when the first select query is executed in the transaction.
https://blogs.oracle.com/mysqlinnodb/entry/repeatable_read_isolation_level_in
When the query can change the data, the database also uses locks to synchronise queries.
So between queries that change data, locks are used to make sure that only one query at a time can change specific items. Between a query that reads data and a query that changes data, multi-versioning is used to present the data before the change to the query that reads it.
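A hedged two-session sketch of what that snapshot means in practice, using mysql-connector-python (the orders table and credentials are hypothetical; the default REPEATABLE READ isolation level is assumed):

import mysql.connector

def connect():
    return mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="mydb")

a, b = connect(), connect()
ca, cb = a.cursor(), b.cursor()

ca.execute("START TRANSACTION WITH CONSISTENT SNAPSHOT")
ca.execute("SELECT COUNT(*) FROM orders")
before = ca.fetchone()[0]            # snapshot is fixed no later than here

cb.execute("INSERT INTO orders (id) VALUES (42)")
b.commit()                           # session B commits a new row

ca.execute("SELECT COUNT(*) FROM orders")
assert ca.fetchone()[0] == before    # session A still reads its snapshot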
I have an SSIS package that inserts/updates rows in a database. First I use a Lookup to check whether a row has already been inserted into the DB; if yes, I update that row, else I insert it as a new row.
My problem is that when inserting, a row is inserted successfully but is also redirected to the error output. How can both happen at the same time? And it happens only sometimes, not always; it is very inconsistent. How can I track what caused the error? I used "redirect row" here to capture the failed rows.
This happens only when the package is deployed on the server. Running it on my local machine using BIDS works fine.
Your OLE DB Destination is likely set to the default values.
A quick recap of what all these values mean:
Data access mode: You generally want Table or view - fast load or Table or view name variable - fast load as this will perform bulk inserts. The non-fast load choices result in singleton inserts which for any volume of data will have you questioning the sanity of those who told you SSIS can load data really fast.
Keep Identity: This is only needed if you want to explicitly supply the identity values yourself
Keep nulls: This specifies whether you should allow the defaults to fire
Table lock: Preemptively lock the table. Unless you're dealing with Hekaton tables (new SQL Server 2014 candy), touching a table will involve locks. Yes, even if you use the NOLOCK hint. Inserting data obviously results in locking so we can assure our ACID compliance. SQL Server will start with a small lock, either Row or Page level. If you cross a threshold of modifying data, SQL Server will escalate that lock to encapsulate the whole table. This is a performance optimization, as it's easier to work if nobody else has their grubby little paws in the data. The penalty is that during this escalation, we might now have to wait for another process to finish so we can get exclusivity to the table. Had we gone big to begin with, we might have locked the table before the other process had begun.
Check constraints: Should we disable the checking of constraint values? Unless you have a post-import step to ensure the constraint is valid, don't uncheck this. Swiftly loaded data that is invalid for the domain is no good.
Rows per batch: This is a pass-through value to the INSERT BULK statement, as its ROWS_PER_BATCH value.
Maximum insert commit size: The FastLoadMaxInsertCommitSize property specifies how many rows should be held in the transaction before committing. The 2005 default was 0 which meant everything gets committed or none of it does. The 2008+ default of 2 billion may be effectively the same, depending on your data volume.
So what
You have bad data somewhere in your insert. Something is causing the insert to fail. It might be the first row, last row, both or somewhere in between. The net effect is that the insert itself is rolled back. You designed your package to allow for the bad data to get routed to a flat file so a data steward can examine it, clean it up and get it re-inserted into the pipeline.
The key, then, is to find a value that provides the optimal balance of insert performance (larger batches are better) against the granularity at which bad data can be isolated. For the sake of argument, let's use a commit size of 5003, because everyone likes prime numbers, and assume our data source supplies 10009 rows of data. Three rows in there will violate the integrity of the target table and will need to be examined by a data steward.
This is going to result in 3 total batches (5003 + 5003 + 3 rows) being sent to the destination. The result is one of the following scenarios:
Bad rows are the final 3 rows, resulting in only those 3 rows being sent to the text file and 10006 rows committed to the table in 2 batches.
Bad rows exist in only one of the two full batches. This would result in 5006 rows being written to the table (the other full 5003-row batch plus the final 3-row batch) and 5003 rows sent to our file
Bad rows are split across every commit set. This results in 0 rows written to the table and all the data in our file.
I always felt Murphy was an optimist and the disk holding the error file would get corrupt but I digress.
What would be ideal is to whittle down the space bad data can exist in while maximizing the amount of good data inserted at a shot. In fact, there are a number of people who have written about it but I'm partial to the approach outlined in "Error Redirection with the OLE DB Destination".
We would perform an insert at our initial commit size of 5003, and the successful rows will go through as they will. The bad rows would go to a second OLE DB Destination, this time with a smaller commit size. There are differences of opinion on whether you should go straight to singleton inserts there or add an intermediate bulk-insert attempt at half your primary commit size. This is where you can evaluate your process and find the optimal solution for your data.
If data still fails the insert at the single row level, then you write that to your error file/log. This approach allows you to put as much good data into the target table as you can in the most efficient mechanism as possible when you know you're going to have bad data.
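Stripped of the SSIS specifics, the cascade amounts to this pattern; a Python illustration with hypothetical insert_batch, insert_one, and log_error callables:

def insert_with_cascade(rows, insert_batch, insert_one, log_error, size=5003):
    """Try big bulk inserts; on failure, retry that chunk row by row."""
    for i in range(0, len(rows), size):
        chunk = rows[i:i + size]
        try:
            insert_batch(chunk)              # fast path: one bulk insert
        except Exception:
            for row in chunk:                # slow path: singleton inserts
                try:
                    insert_one(row)
                except Exception as err:
                    log_error(row, err)      # only truly bad rows reach the log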
Bad data addendum
Yet a final approach to inserting bad data is to not even try to insert it. If you know foreign keys must be satisfied, add a Lookup component to your data flow to ensure that only good values are presented to the insert. Same for NULLs. You're already checking your business key so duplicates shouldn't be an issue. If a column has a constraint that the Foo must begin with Pity, then check it in the data flow. Bad rows all get shunted off to a Derived Column Task that adds business friendly error messages and then they land at a Union All and then all the errors make it to the error file.
I've also written this logic where each check gets its own error column and I only split out the bad data prior to insertion. This prevents the business user from fixing one error in the source data only to learn that there's another error. Oh, and there's still another, try again. Whether this level of data cleansing is required is a conversation with your data supplier, data cleaner and possibly your boss (because they might want to know how much time you're going to have to spend making this solution bullet proof for the horrible data they keep flinging at you)
References
Keep nulls
Error Redirection with the OLE DB Destination
Batch Sizes, Fast Load, Commit Size and the OLE DB Destination
Default value for OLE DB Destination FastLoadMaxInsertCommitSize in SQL Server 2008
I have noticed that if you check Table lock and also have an update in the same data flow, you will get deadlocks between the two flows. So we do not check Table lock. The performance seems to be the same.
My finding might help those who visit here.
@billinkc made a broad comment; I had gone through all of that. Later, after digging into the system, the problem turned out to be something different.
My SSIS package has a Script Task inside it to do some operations. That task uses a folder called TEMP on the disk. The program which triggered this SSIS package was simultaneously using the same TEMP folder, and the file read/write exceptions were not handled.
This caused the Script Task to fail, resulting in a package failure error. Since the INSERT was carried out before the Script Task, the INSERT was successful. Later, when the script failed, the rows were redirected to the error output too!
I tried catching these file errors/exceptions, and it worked!