Sqoop export to MySQL

I am trying to run a Sqoop export from HDFS to MySQL. The mappers fail because the date format in the input file differs from what MySQL expects: the input file has dates in mm/dd/yyyy format, while the MySQL column is of type DATE, which I believe expects yyyy-mm-dd.
This produces the error:
caused by: java.lang.RuntimeException: Can't parse input data: '2/18/2019'
The source comes from a different provider, and we cannot ask them to change the format. What options do I have in this situation? Any suggestions are welcome.

edit
Unfortunately this answer may not be for you. If you are using a program that you don't have control over the source for, this won't help you.
I'll leave it up only because it is a common question that I see with people new to rdbms programming.
Original answer
Why are you treating dates and times as strings? For that matter, why are you building SQL for each row? On the MySQL side there is a better way to handle that.
Most RDBMSs support the concept of a prepared statement, although the implementation differs by vendor. Java supports every major vendor's flavor of prepared statement through JDBC, so you don't need to worry about the implementation details.
Every time you execute SQL, the database engine goes through several phases before the data is applied or returned. The first and most time-consuming phase, called the "prepare" phase, is to analyze the SQL string and compute the ideal access path to complete it with. 50 to 80 percent of the SQL "execution" time is spent in this "prepare" phase.
A simple optimization is to recognize that the ideal access path in a mature database rarely varies, which allows the programmer to prepare the statement once, return a handle to the access path, then pass only the handle and its parameters across the wire from the application to the database. This minimizes the overhead of access path computation, data type conversions, and network communication while automatically protecting from SQL injection attacks and taking care of such administrivia as date formatting.
In Java, this is represented with the PreparedStatement class.
Always use prepared statements. If used properly, they will eliminate 50 to 80% of the overhead of each database call. They also let you code more simply by using native Java types and passing the values straight into the execution of the PreparedStatement.
Using PreparedStatement also eliminates much of the need to sanitize inputs. By its nature, you don't need to worry about special characters, apart from those the target will reject (example: dropping a character with a code point greater than 127 into a database that was built for ASCII only on a platform that enforces the character set).
If you need to take input as a String and convert it to a Date, use Java's DateFormat class.
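For the question's case, where the file cannot be changed at the source, one option is to rewrite the date column before the export runs. The following is a minimal Python sketch, not the Java DateFormat route named above. It assumes a comma-delimited file and that the export expects yyyy-mm-dd, as the question guesses. The function names and the column index are hypothetical.

```python
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Convert an m/d/yyyy string such as '2/18/2019' to 'yyyy-mm-dd'."""
    # strptime accepts months and days with or without a leading zero.
    return datetime.strptime(text.strip(), "%m/%d/%Y").date().isoformat()


def convert_line(line: str, date_index: int, sep: str = ",") -> str:
    """Rewrite one delimited record, converting only the date field."""
    fields = line.rstrip("\n").split(sep)
    fields[date_index] = to_iso_date(fields[date_index])
    return sep.join(fields)


if __name__ == "__main__":
    # Hypothetical record with the date in the second column.
    print(convert_line("42,2/18/2019,widget", date_index=1))
```

A value that does not match the expected pattern raises ValueError, so bad records surface during preprocessing instead of inside a mapper.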

Related

Best methods to avoid MySQL 1406 errors on VARCHAR

My server is using a MySQL DB, connecting to it via the C++ connector. I'm nearing production and I've been spending some time trying to break things as part of hardening the server.
One action item I had was to see what would happen if I execute a statement with a string that is longer than VARCHAR. For example, if I have a column defined as VARCHAR(4) and then set it to the string "hello".
This of course throws an exception with the error code 1406 (Data too long for column).
What I was wondering was if there was a good or standard way to defend against this? Obviously one thing is to check against the string length and truncate manually. I can do this, however there are many tables and several columns with VARCHAR. So my worry is updating server code if one of the columns using VARCHAR has its length increased (i.e. code maintainability)
Note that the server does do some validation up front. I'm just trying to defend against a subtle bug or corner case that lets something slip through.
A couple of other options on the table are to disable strict so it will give a warning and truncate or to convert VARCHAR to TEXT.
I was wondering a few things.
Is there a recommended method to handle this situation?
What are the disadvantages of disabling strict?
Is it worthwhile (and is it possible) to query the DB at runtime for the VARCHAR lengths? Note that I'm using the C++ connector. I suppose I could also write a tool that runs before compiling and extracts the VARCHAR lengths from the SQL code used to generate the tables. But that then makes me wonder if I'm over-engineering this.
I'm just sorting through the possible approaches now and thought I'd seek advice from those with more experience with MySQL.
As an experienced database engineer I would recommend a combination of the following two strategies:
1) If you know that there is a chance, however small, that data for your varchar(4) could be longer than 4 characters, then make the varchar field larger than 4. For example, if you expect that the field can go as high as 8, then set the field to varchar(10). The beauty of using a varchar field instead of a char is that a varchar will only use whatever storage it needs.
2) If there is a real issue with data constantly being larger than the varchar field length, then you should write your own exception handler to trap the 1406 error. For the handler to work properly you will need to decide exactly how you want to handle the exception. For example, you could send an error to the user and ask them to fix the problem, you could accept the data but truncate it so it fits into the field, or you could send the error to a log file to be fixed at a later time.
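The handling policies from strategy 2 can be sketched as a check applied before the statement is sent. This is a minimal Python illustration, not the asker's C++ connector code. The column-length map, the `fit_to_column` helper, and the `ValueTooLong` exception are hypothetical names. In practice the lengths could be read once at startup from `INFORMATION_SCHEMA.COLUMNS` (`CHARACTER_MAXIMUM_LENGTH`) so the code does not need to change when a column is widened.

```python
import logging

# Hypothetical column limits; these could be loaded at startup from
# INFORMATION_SCHEMA.COLUMNS (CHARACTER_MAXIMUM_LENGTH) instead.
COLUMN_LENGTHS = {("users", "code"): 4}


class ValueTooLong(ValueError):
    """Raised when a value does not fit its column and must be rejected."""


def fit_to_column(table: str, column: str, value: str, policy: str = "reject") -> str:
    """Return a value that fits the column, or raise ValueTooLong."""
    max_length = COLUMN_LENGTHS[(table, column)]
    if len(value) <= max_length:
        return value
    if policy == "truncate":
        # Accept the data but cut it to size, leaving a trace in the log.
        logging.warning("%s.%s: truncating %r to %d characters",
                        table, column, value, max_length)
        return value[:max_length]
    # Default policy: reject so the caller can report the problem.
    raise ValueTooLong(f"{table}.{column} holds at most {max_length} characters")
```

Keeping the policy in one helper means the choice between rejecting and truncating is made in a single place, not per query.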

Is there any replacement for the InstrRev function in Microsoft Access 2016 for a calculated column?

I am attempting to create several calculated columns in a table with different parts of a parsed filename. Using the InstrRev function is critical to isolate the base file name or extension, but InstrRev is not supported in calculated columns.
I know that there are other ways to solve my problem that don't use calculated columns, but does anyone have a valid calculated column formula that could help me?
Access lets you use VBA functions (including user-defined functions) directly from within a SQL query - however they only work within an Access context - if you have another frontend for a JET (now ACE) database - or inside a computed/calculated column, they won't work - as you've just discovered.
Unfortunately Access (JET and ACE) have only a very meagre and anaemic selection of built-in functions, and the platform has now lagged-behind SQL Server (and even the open-source SQLite) significantly - Access 2016 has not made significant changes to its SQL implementation since Access 2000 (16 years of stagnation!) whereas SQL Server 2016's T-SQL language is so evolved it's almost unrecognizable compared to SQL Server 2000.
JET and ACE support the standard ODBC functions ( https://msdn.microsoft.com/en-us/library/bb208907(v=office.12).aspx ) however none of these perform a "reverse index-of" operation. Also absent is any form of pattern-matching function - though the LIKE operator works, it only returns a boolean result, not a character index.
In short: what you want to do is impossible.
This has been discovered by many people before you:
https://social.msdn.microsoft.com/Forums/office/en-US/6cf82b1b-8e74-4ac8-9997-61cad8bb9310/access-database-engine-incompatible-with-instrrev?forum=accessdev
He maintains a list of DAO/Jet/etc reserved words - and on that list you will see the InstrRev is a VBA() function, and is not a part of the Jet/Ace Engines.
using InStrRev() and similar functions in Jet/ACE queries outside of Access
As you have discovered, SQL queries executed from within Access can use many VBA functions that are not natively supported by the Jet/ACE dialect of SQL
That said, computed/calculated columns are only really of use in stored VIEW objects ("Queries" objects in Access parlance) - which in turn are used for user convenience, not for any programming advantage - especially as these are scalar functions that are evaluated for every row of data that the engine processes (making them potentially very expensive and inefficient to run).
...so the only real solution is to abandon computed/calculated columns and perform this processing in your own application code - but the advantage is that your program will likely be significantly faster.
...or don't use Access and switch to a different DBMS with better active support, such as SQLite (for an in-process database), SQL Server (now with LocalDb for in-process support), or VistaDB (proprietary, but 100% Managed code). Note that Access also supports acting as a front-end for a SQL Server "backend" data-store - where you could create a VIEW that performs this operation, then query the view from your Access code or other consuming client.
There is a workaround if you must: create a duplicate column that contains the string-reversed value of your original column. You can then evaluate the ODBC LOCATE or JET SQL InStr functions on it and get the result you want (albeit reversed), but this would require double the storage space.
e.g.
RowId, FileName , FileNameRev
1 , 'Foo.txt', 'txt.ooF'
2 , 'Bar.txt', 'txt.raB'
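The index arithmetic behind this workaround can be checked with a short sketch. It is written in Python purely for illustration. The `instr` and `instr_rev_via_reversed` helpers are hypothetical stand-ins for the 1-based InStr and InStrRev behaviour. The same formula would go into the calculated column, with the search string reversed as well.

```python
def instr(haystack: str, needle: str) -> int:
    """1-based position of the first match, or 0 if absent (like InStr)."""
    return haystack.find(needle) + 1


def instr_rev_via_reversed(file_name: str, file_name_rev: str, needle: str) -> int:
    """1-based position of the last match in file_name, computed only
    from a forward search on the stored reversed copy."""
    pos_in_rev = instr(file_name_rev, needle[::-1])
    if pos_in_rev == 0:
        return 0
    # Map the position in the reversed string back to the original string.
    return len(file_name) - pos_in_rev - len(needle) + 2


name = "archive.tar.gz"
last_dot = instr_rev_via_reversed(name, name[::-1], ".")
extension = name[last_dot:]        # text after the last dot
base_name = name[:last_dot - 1]    # text before the last dot
```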
Avoid any calculated field. It's a "super user" feature only, that will cause you nothing but trouble. Calculated fields - or expressions - belong in a query.
So create a simple select query:
SELECT
    *,
    InStrRev([FieldToCheck], "YourMatchingString") AS StringMatch
FROM
    YourTable
Save the query, and then use this whenever you need the table values and this expression.

MySQL: Alternate solution of SQL Server's HierarchyId datatype

My current application was built in Java with Hibernate on SQL Server 2008, and I used the HierarchyId data type for the department hierarchy in my database.
I have written SQL queries that deal with the HierarchyId data type, and the department tree has n levels.
Now I want to move my database server from SQL Server 2008 to MySQL, as per a business requirement.
After a feasibility check I concluded that the whole application can migrate to MySQL except for the HierarchyId data type.
So my main challenge is to find an alternative to the HierarchyId data type with minimal changes to the code.
What is the best way to implement department hierarchy in my database?
Thanks...
I faced a similar situation when our team decided to migrate from MS SQL to MySQL. We resolved the issue using the following steps:
1) Added a column of type varchar(100) to the same table in MS SQL.
2) Converted the hierarchyid from its hexadecimal value to a string using the hierarchyid.ToString() function and saved it in the newly created column using computed column functionality, e.g. 0x58 -> "/1/", 0x7CE0 -> "/3/7/".
3) The level of an entity is equal to the number of '/' characters minus 1.
4) These columns could then be migrated to MySQL.
5) The IsDescendantOf() method was replaced with a LIKE comparison against the path string concatenated with '%'.
Thus we got rid of the hierarchyid functionality in MySQL.
Whenever we face such an issue, we just need to ask ourselves what we would have done if this functionality had not been provided by the tool we use. We generally end up arriving at a workable answer.
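The path-string approach described in this answer can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module in place of MySQL so that it is self-contained. The `department` table, its rows, and the helper names are hypothetical. The prefix LIKE should behave the same way in MySQL as long as the paths contain only digits and slashes.

```python
import sqlite3


def level(path: str) -> int:
    """Depth of a node: number of '/' characters minus 1."""
    return path.count("/") - 1


def descendants(conn: sqlite3.Connection, ancestor_path: str) -> list:
    """Names of the node at ancestor_path and everything beneath it."""
    rows = conn.execute(
        "SELECT name FROM department WHERE path LIKE ? ORDER BY path",
        (ancestor_path + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (name TEXT, path VARCHAR(100))")
conn.executemany(
    "INSERT INTO department VALUES (?, ?)",
    [
        ("Head office", "/"),
        ("Sales", "/1/"),
        ("Sales EMEA", "/1/1/"),
        ("Engineering", "/3/"),
        ("Platform", "/3/7/"),
    ],
)

engineering_tree = descendants(conn, "/3/")
```

Because every path ends with a slash, the prefix "/1/" does not accidentally match a sibling such as "/10/".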
MySQL has no equivalent that I'm aware of, but you could store the same data in a varchar.
For operations involving the HierarchyId, you're probably going to have to implement them yourself, probably as either user-defined functions or stored procedures.
What SQL Server does looks like the "materialized path" method of storing a hierarchy. One example of that in MySQL can be seen at http://www.cloudconnected.fr/2009/05/26/trees-in-sql-an-approach-based-on-materialized-paths-and-normalization-for-mysql/

Parameterized OLEDB source query

I am creating an ETL package in SSIS in which I want my data source to be a restricted query, like select * from table_name where id='Variable'. This variable is one I defined as a user-created variable.
I do not understand how I can have my source query interact with the SSIS scoped Variable.
The only present options are
Table
Table from variable
SQL Command
SQL command from a variable
What I want is to have a SQL statement having a variable as parameter
Simple. Choose SQL command as the Data Access Mode. Enter your query with a question mark as a parameter placeholder. Then click the Parameters button and map your variable to Parameter0 in the Set Query Parameters dialog.
More information is available on MSDN.
An inferior alternative to @Edmund's approach is to use an Expression on another Variable to build your string. Assuming you have @[User::FirstName] already defined, you would then create another variable, @[User::SourceQuery].
In the properties for this variable, set EvaluateAsExpression to True and then set an Expression like "SELECT FirstName, LastName FROM Person.Person WHERE FirstName = '" + @[User::FirstName] + "'". The double quotes are required because we are building an SSIS String.
There are two big reasons this approach should not be employed.
Caching
This approach is going to bloat your plan cache in SQL Server with N copies of essentially the same query. The first time it runs and the value is "Edmund" SQL Server will create an execution plan and save it (because it can be expensive to build them). You then run the package and the value is "Bill". SQL Server checks to see if it has a plan for this. It doesn't, it only has one for Edmund and so it creates another copy of the plan, this time hard coded to Bill. Lather-rinse-repeat and watch your available memory dwindle until it unloads some plans.
By using the parameter approach, when the query is submitted to SQL Server, it should create a parameterized version of the plan internally and assume that all parameter values supplied will result in equally costly executions. Generally speaking, this is the desired behaviour.
If your server has the "optimize for ad hoc workloads" option enabled (it is turned off by default), the bloat should be mitigated, because only a small plan stub is cached the first time each distinct query text runs.
SQL Injection
The other big nasty you will run into with building your own string is that you open yourself up to SQL Injection attacks or at the least, you can get runtime errors. It's as simple as having a value of "d'Artagnan." That single quote will cause your query to fail resulting in package failure. Changing the value to "';DROP TABLE Person.Person;--" will result in great pain.
You might think it's trivial to safe quote everything but the effort of implementing it consistently everywhere you query is beyond what your employer is paying you. All the more so since there is native functionality provided to do the same thing.
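The difference between the two approaches can be reproduced outside SSIS. The sketch below is an illustration only. It uses Python's built-in sqlite3 module, not SQL Server or an OLE DB source, and the `person` table and function names are hypothetical. It shows the same "d'Artagnan" value breaking a concatenated query while a `?` placeholder handles it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO person VALUES (?, ?)", ("d'Artagnan", "de Batz"))


def find_by_concatenation(first_name: str) -> list:
    # Anti-pattern: the value becomes part of the SQL text itself.
    sql = "SELECT last_name FROM person WHERE first_name = '" + first_name + "'"
    return conn.execute(sql).fetchall()


def find_by_parameter(first_name: str) -> list:
    # The value travels separately from the SQL text, so quotes are harmless.
    return conn.execute(
        "SELECT last_name FROM person WHERE first_name = ?", (first_name,)
    ).fetchall()


try:
    find_by_concatenation("d'Artagnan")
    concatenation_failed = False
except sqlite3.OperationalError:
    # The stray single quote makes the statement syntactically invalid.
    concatenation_failed = True

matches = find_by_parameter("d'Artagnan")
hostile = find_by_parameter("';DROP TABLE person;--")  # treated as plain data
```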
When using OLEDB Connection manager (with SQL Server Native Client 11.0 provider in my case) you can catch an error like this:
Parameters cannot be extracted from the SQL command. The provider
might not help to parse parameter information from the command. In
that case, use the "SQL command from variable" access mode, in which
the entire SQL command is stored in a variable.
So you need to explicitly specify the database name in the OLEDB Connection manager properties. Otherwise SQL Server Native Client may use a different database than you intend (e.g. master in MSSQL Server).
For some cases you can explicitly specify database name for each database object used in query, e.g.:
select Name
from MyDatabase.MySchema.MyTable
where id = ?

Which SQL injection methods aren't "destroyed" by mysql_real_escape_string()?

Is there a list of SQL injection methods which can't be protected with just using mysql_real_escape_string(); with utf8 encoding?
For integer, I'm using intval();
Is it secure enough?
For those who think I want to get "tutorial" to hack anyone: No, I won't. I just want to know how to make my applications more secure, and I want to know if they're secured 99% against hackers
If given a valid database connection, mysql_real_escape_string() is supposed to be safe for string data under all circumstances (with the rare exception described in this answer).
However, anything outside a string, it won't escape:
$id = mysql_real_escape_string($_GET["id"]);
mysql_query("SELECT * FROM table WHERE id = $id");
is still vulnerable, because you don't have to "break out" of a string to add an evil additional command.
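The point about unquoted values can be demonstrated with a small sketch. It is written in Python with the built-in sqlite3 module, not PHP and MySQL, and `naive_escape` is a hypothetical stand-in for a string-escaping function, not a reimplementation of mysql_real_escape_string(). The `account` table and the other names are also made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, "alice"), (2, "bob")])


def naive_escape(value: str) -> str:
    """Hypothetical stand-in for a string-escaping function: doubles quotes."""
    return value.replace("'", "''")


def lookup_unquoted(raw_id: str) -> list:
    # Anti-pattern: the escaped value is spliced in outside any quotes.
    return conn.execute(
        "SELECT name FROM account WHERE id = " + naive_escape(raw_id)
    ).fetchall()


def lookup_validated(raw_id: str) -> list:
    # int() raises ValueError for anything that is not a plain integer.
    return conn.execute(
        "SELECT name FROM account WHERE id = ?", (int(raw_id),)
    ).fetchall()


attack = "0 OR 1=1"
leaked = lookup_unquoted(attack)  # nothing to escape, so every row matches
```

The attack string contains no quote characters, so escaping leaves it untouched. Only type validation, or a bound parameter, stops it.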
There are not many SQL injection methods. They are always due to input not being sanitized and escaped properly. So, while mysql_real_escape_string() will make any string safe to be included in a database query, you should follow these avoidance techniques to protect your data and users from SQL injection.
Never connect to the database as a superuser or as the database owner. Always use customized users with very limited privileges.
Check if the given input has the expected data type.
If the application expects numerical input, consider verifying the data with is_numeric(), or silently change its type using settype().
Quote each non-numeric user-supplied value that is passed to the database with the database-specific string escape function. So mysql_real_escape_string() will make all strings safe to be included in an SQL query to a MySQL database.
You could also learn to use stored procedures and prepared statements, which tend to be very safe but have other impacts.
See also: PHP page on SQL injection
There are many things that may not get protected by standard methods (e.g. string escaping, int casting), also depending on the version of software you use. For example, utf-8 is quite an issue by itself, as a tiny example (among many) you should make sure the request is valid utf-8 (or convert it into utf-8). See an example.
SQL injection is the undead bane of websites, and I think that MySQL injection protection cannot be squeezed into a single SO answer, hence I'm including these links as general starting points.
http://ferruh.mavituna.com/sql-injection-cheatsheet-oku/
And also : Search for mysql injection utf8