Handle extra columns while importing a delimited file in ADF - CSV

I have a CSV file with '|' as the delimiter. For some of the rows, the string data itself contains '|', so those rows end up with an additional column. Whenever I copy the data using a Copy activity, ADF throws an error. How can I skip these particular rows in the copy activity?
I have tried deleting these rows in the file itself, but the main problem is that I will be receiving files every day that need to be loaded into the database.

This problem comes up frequently, usually with commas, and there aren't any good answers. Below are my recommendations in order of preference.
If you can control the input file format, I would recommend doing both of these:
Change the file delimiter. Use a delimiter that does not occur in your data. Again, this issue occurs most frequently with the comma (,) delimiter because commas often show up in the underlying data. Pipe (|) is usually a good option because it rarely occurs organically in text; since that is not the case here, you may need to get more creative and use something like caret (^). Tab (\t) is also a solid option and probably the easiest change to implement.
Wrap the fields in quotes. Doing this allows the text inside the quotes to contain the delimiter character (see the sample record below). This is a good practice regardless of the delimiter, but it can add significant bloat to the file size depending on the number of rows and columns. You can also choose to quote only the fields that contain the delimiter in the text.
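For illustration, a hypothetical quoted record where one field legitimately contains the pipe might look like this (the values are made up):
"ACME WIDGETS"|"SIZES: S|M|L"|"2023"
Everything between a pair of quotes is treated as a single field, so the embedded pipes are no longer read as column breaks.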
If you cannot change the input file, you'll need a preprocessing step to remove the offending rows. Basically, I would read each line of the original file as a single text value (not parsed) and count the delimiters. If a row has the proper delimiter count, write it out to a secondary file, then use the secondary file for your downstream processing. This would be my last resort because of the data loss, but it might be tolerable in your situation. You could use a Data Flow with a schema-less source dataset to accomplish this step.
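As a rough sketch of the count-and-filter idea: if you staged each raw line into a single-column table first (the table, column, and expected count below are hypothetical), the same test a Data Flow filter would apply looks like this in T-SQL:
-- Each file line was loaded as one value into dbo.RawLines (hypothetical names).
-- Keep only rows with the expected delimiter count, e.g. 9 pipes for a 10-column file.
INSERT INTO dbo.CleanRows (RawLine)
SELECT RawLine
FROM dbo.RawLines
WHERE LEN(RawLine) - LEN(REPLACE(RawLine, '|', '')) = 9;
Rows that fail the check are simply left behind, which is where the data loss mentioned above comes from.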

Related

Is there an UPDATE to add commas to existing integers/decimals used as currencies in a table?

The data already exists, but since it refers to large amounts, it's not very readable without commas.
I've read that this is the function:
(screenshot: https://i.stack.imgur.com/LVoEC.png)
but how do I write a statement to modify the existing data in a table?
As far as I know, this should be done in your frontend application. The database should only store data in its raw format. That way you can change the formatting depending on the user's locale, for example, since not all countries use "." and "," the same way in numbers.
If you only need it while you are developing your queries and not for the end user, check whether your SQL client has an option to format the output; it is probably not something to do in the SQL query itself.
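If you do want to see formatted values while developing, and assuming this is SQL Server 2012 or later and the function in the screenshot is FORMAT, you can apply it at query time so the stored values stay raw; the table and column names here are made up:
-- Display-time formatting only; the stored value remains numeric.
SELECT Amount,
       FORMAT(Amount, 'N2') AS AmountSessionCulture,
       FORMAT(Amount, 'N2', 'de-DE') AS AmountGerman
FROM dbo.Invoices;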

SSIS - fields without the same number of characters causing space delimited file to push data to the next column

I am having a problem with a space-delimited file I am uploading to SQL Server. This works as needed except for one column that has occasional length differences, and it is separated by a space delimiter...
As you can see from the data viewer the column output from the flat file source is moving data across the columns.
I am using a space as a delimiter.
I would like the data to stay in the same columns even if the flight number is one character shorter.
You don't have a space-delimited file; you have a fixed-width file, or more likely a ragged-right one, since it displays well in Notepad.
You'll need to redefine your Flat File Source accordingly. Change Format from Delimited to Ragged Right. In the Columns selector, you can click the header bar to mark where the columns are (assuming the change from delimited to ragged right drops the existing column naming and typing).
On this one I had to load each row as a whole rather than being able to separate it into columns. I then took care of all the "heavy lifting" in SQL.

Best way to deal with CSV in preparation for MySQL

We have a CSV file, which we have meticulously checked and stripped so it shows the data in the format we want.
The CSV file is just under 500 KB in size. I have converted it to SQL (saved as .txt); I hope that's OK.
Each original CSV entry has 3 fields, like this:
'STANLEY','7331','TAS'
'GORMANSTON','7466','TAS'
After conversion it looks like this:
INSERT INTO suburbs ('Locality','Pcode','State') VALUES ('\'STANLEY\'','\'7331\'','\'TAS\'');
INSERT INTO suburbs ('Locality','Pcode','State') VALUES ('\'GORMANSTON\'','\'7466\'','\'TAS\'');
OK now, not being a DB aficionado, I would like to know: have I converted it correctly?
Should I be looking at making this code cleaner for import into the DB?
The SQL is over 1.6 MB for this file, with over 16,000 entries, so I want to make sure I have done things correctly.
Cheers
As Adam's comment said, you most likely do not want to insert the quotes, which is what you are doing with \'STANLEY\' etc.
Also, on the field side (Locality etc.), make sure those are backticks (the non-shifted tilde key), and on the data side (STANLEY) use single quotes.
Change it to:
INSERT INTO suburbs (`Locality`,`Pcode`,`State`) VALUES ('STANLEY','7331','TAS');
Other than that, I don't see anything wrong with it.
Looks fine, other than the escaped quotes. I usually use this technique with an Excel file, where I have my columns and then create a formula to generate the appropriate INSERT statements. Alternatively, you can use something like SSIS to get your data into your DB.
Your SQL looks good, although won't the extra escaped single quotes end up in your records? I'm not sure if you want 'STANLEY' or just STANLEY in your records, so I'll leave it up to you.
You have half of your work done. You have an insert strategy; do you have a rollback strategy as well? This seems like a big data migration for you, so if I may humbly suggest: try the insert with just a few rows in a junk table that you don't mind getting rid of first, along the lines of the sketch below. It's always a pain if the changes have to be undone and there is nothing in place or ready to go to undo any errors.
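A minimal sketch of that kind of dry run, with made-up table names and assuming a transactional (InnoDB) table for the rollback variant:
-- Trial run against a throwaway copy of the table.
CREATE TABLE suburbs_test LIKE suburbs;
INSERT INTO suburbs_test (Locality, Pcode, State) VALUES ('STANLEY','7331','TAS');
-- inspect the results, then: DROP TABLE suburbs_test;

-- Or run the real inserts inside a transaction you can undo (InnoDB only).
START TRANSACTION;
INSERT INTO suburbs (Locality, Pcode, State) VALUES ('STANLEY','7331','TAS');
-- ...the rest of the generated inserts...
ROLLBACK;  -- or COMMIT once you are satisfied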

Trim before destination write in SSIS?

I have a data flow where there is a DB source and a flat text file destination (delimited by the pipe '|').
The DB source is picking up the SQL query from a variable.
The problem is that if my DB field sizes for, say, firstname and lastname are 30 characters, I get the output as (spaces represented by dots):
saurabh......................|kumar.......................
What I need is the fields to be trimmed, so that the actual output is
saurabh|kumar
I have more than 40 columns to write, and I would not want to manually insert RTRIM after every column in my BIG sql query :(
I should add that the source can have up to 50,000 rows returned. I was thinking of putting a script component in between, but processing every row might have a performance impact.
Any ideas?
You have quite a few options, but some will obviously be undesirable or impossible to do because of your situation.
First, I'll assume that the trailing spaces in the data are there because the data types of the source columns are CHAR or NCHAR. You can change the data types in the source database to VARCHAR or NVARCHAR (sketched below). This probably isn't a good idea.
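For reference, that change would be a per-column ALTER; the table and column names here are hypothetical, and note that existing values keep their padding, so a cleanup pass may still be needed:
-- CHAR(30) -> VARCHAR(30); match the column's existing NULL/NOT NULL setting explicitly.
ALTER TABLE dbo.People ALTER COLUMN firstname VARCHAR(30) NOT NULL;
ALTER TABLE dbo.People ALTER COLUMN lastname VARCHAR(30) NOT NULL;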
If the data types in the source data are VARCHAR or NVARCHAR and the trailing spaces are stored in the data itself, you can update the data to remove the trailing spaces (for example, with the UPDATE below). This is probably not appealing either.
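A sketch of that cleanup, again with hypothetical table and column names; DATALENGTH is used in the WHERE clause because T-SQL string comparisons ignore trailing spaces:
-- One-time cleanup in the source database; repeat the pattern for each affected column.
UPDATE dbo.People
SET firstname = RTRIM(firstname),
    lastname  = RTRIM(lastname)
WHERE DATALENGTH(firstname) <> DATALENGTH(RTRIM(firstname))
   OR DATALENGTH(lastname)  <> DATALENGTH(RTRIM(lastname));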
So, you have SSIS and the best place to handle this is in the data flow. Unfortunately, you must develop a solution for each column that has the trailing spaces. I don't think you'll find a quick and simple "fix all columns" solution.
You can do the data trimming with a script transformation, but you must write the code to do the work. Or, you can use a Derived Column transformation component. In the Derived Column transformation you would add a derived column for each column that needs trimming. For example, you would have a firstname column and a lastname column. The derived column value would replace the existing column value.
In the Derived Column transformation you would use SSIS expression syntax to trim the data. The firstname and lastname trim expressions would be
RTRIM(firstname)
RTRIM(lastname)
Performance will probably be better for the Derived Column transformation, but it may not differ much from the script solution. However, the Derived Column transformation will probably be easier to read and understand later.
You could try using a Script Component in the data flow. Unlike the control flow, a data-flow Script Component has inputs and outputs.
Look at this example in MSDN: http://msdn.microsoft.com/en-us/library/ms345160.aspx
If you can iterate over each column of the row as it flows through the Script Component, you could do a .NET Trim on the column's data, then pass the trimmed row to the output.
Advantage there of course is it will trim future rows you add later.
Just an idea, I haven't tried this myself. Do post back if it works.
See this:
http://masstrimmer.codeplex.com
It will trim rows by using parallelism.

Order By varbinary column that holds docx files

I'm using MS SQL Server 2008, and I have a column that stores a Word document (".docx").
Within the Word document is a definition (i.e., a term). I need to sort by the definitions when returning a dataset.
so basically...
SELECT * FROM DocumentsTable
ORDER BY DefinitionsColumn ASC
So my problem is: how can this be accomplished? The binary column only sorts on the binary value, not on the Word document's content.
I was wondering if fulltext search/index would work. I already have that working, just not sure if I can use it with ORDER BY.
Thanks to all in advance.
I think you'd need to add another column and populate it with the term from inside the .docx. If it's possible at all to get SQL Server to read the .docx (maybe with a custom .NET function?), it's going to be pretty slow.
Better to populate and maintain another column, along the lines of the sketch below.
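A rough sketch of that approach; the extra column and key names are made up, and the term itself would be extracted by the application when the document is saved:
-- One-time schema change: add a sortable plain-text column next to the varbinary column.
ALTER TABLE DocumentsTable ADD DefinitionTerm NVARCHAR(200) NULL;

-- Populate it whenever the .docx is inserted or updated.
UPDATE DocumentsTable SET DefinitionTerm = 'example term' WHERE DocumentId = 1;

-- Sorting then never has to touch the binary column.
SELECT *
FROM DocumentsTable
ORDER BY DefinitionTerm ASC;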
You have a couple of options that may or may not be acceptable.
Store the string definition contents of the file in a field alongside the binary file column in the record.
Only store the string definition in the record, and build the .docx file at runtime for use within your application.