SSIS - fields without the same number of characters causing a space-delimited file to push data into the next column

I am having a problem with a space-delimited file I am uploading to SQL Server. It works as needed, except that one column occasionally varies in length, and it is separated from the next column only by the space delimiter...
As you can see from the data viewer, the column output from the Flat File Source is shifting data across the columns.
I am using a space as the delimiter.
I would like the data to stay in the same columns even when the flight number is one character shorter.

You don't have a space-delimited file; you have a fixed-width file, or more likely a ragged-right one, since it lines up well in Notepad.
You'll need to redefine your Flat File Source accordingly: change the Format from Delimited to Ragged Right. In the Columns selector you then click the header bar to mark where the columns break (assuming the change from Delimited to Ragged Right drops the existing column names and types).

On this one I had to load each row as a whole rather than splitting it into columns, and then took care of all the "heavy lifting" in SQL.

Related

Handle extra column while importing a delimited file in ADF

I have a CSV file with the delimiter '|'. For some of the rows the string itself contains '|', so those rows end up with an additional column, and whenever the data is copied using a Copy activity, ADF throws an error. How can I skip these particular rows during the copy?
I have tried deleting these rows in the file itself, but the main problem is that I will be receiving files every day that have to be loaded into the database.
This problem comes up frequently, usually with commas, and there aren't any good answers. Below are my recommendations in order of preference.
If you can control the input file format, I would recommend doing both of these:
Change the file delimiter. Switch to a delimiter that does not occur in your data. This issue arises most frequently with comma (,) delimiters because commas often show up in the underlying text. Pipe (|) is usually a good option because it rarely occurs organically in text; since pipe is already the problem here, you may need to get more creative and use something like a caret (^). Tab (\t) is also a solid option and probably the easiest change to implement.
Wrap the fields with quotes. Doing this allows the text inside the quotes to contain the delimiter character. This is good practice regardless of the delimiter, but it can add significant bloat to the file size depending on the number of rows and columns. You can also choose to quote only the fields that contain the delimiter in the text (see the sketch just below).
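As a rough illustration of the quoting approach, here is a minimal sketch using Python's standard csv module (the file name and sample rows are invented); the point is only that a quoted field can safely contain the delimiter:

import csv

rows = [
    ["id", "comment"],
    ["1", "value with a | pipe inside"],  # embedded delimiter, but the field will be quoted
    ["2", "plain value"],
]

# QUOTE_MINIMAL quotes only the fields that actually contain the delimiter.
with open("quoted_output.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Reading it back with the same dialect keeps the column count stable.
with open("quoted_output.csv", newline="", encoding="utf-8") as f:
    for record in csv.reader(f, delimiter="|", quotechar='"'):
        print(len(record), record)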
If you cannot change the input file, you'll need a preprocessor step to remove the offending rows. Basically, I would read each line of the original file as a single text value (not parsing) and count the delimiters. If a row has the proper delimiter count, then write it out to a secondary file. Then you can use the secondary file for your downstream processing. This would be my last resort because of the data loss, but it might be tolerable in your situation. You could use a Data Flow with a schema-less source dataset to accomplish this step.
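If you do end up writing that preprocessor, the logic is small enough to sketch. Below is a minimal example in Python; the file names, delimiter, and expected column count are placeholders for whatever your feed actually uses:

EXPECTED_COLUMNS = 5                      # placeholder: set this to your schema's column count
DELIMITER = "|"

# Keep only the lines whose delimiter count matches the expected schema;
# write everything else to a reject file so the data loss stays visible.
with open("raw_input.csv", encoding="utf-8") as src, \
     open("clean_input.csv", "w", encoding="utf-8") as good, \
     open("rejected_rows.csv", "w", encoding="utf-8") as bad:
    for line in src:
        if line.count(DELIMITER) == EXPECTED_COLUMNS - 1:
            good.write(line)
        else:
            bad.write(line)

Keeping the rejected rows in their own file also gives you something to review or reprocess later, rather than dropping them silently.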

Solr - how to index entire csv as one document vs. treating each row as document

I would like to index a csv file.
Ex:
col1,col2
cat,dog
boy,girl
Normally, once indexed, Solr creates three documents, one for each row.
I would like to treat the whole file as if it were a .txt while keeping it as a CSV. So far I figure I should just read the contents into a .txt file and index that. I would like to avoid this if possible, though, because I am using file paths for the id field, and the whole read-into-text, index, then point-the-id-back-at-the-original-CSV routine feels hacky.

SSIS from Excel to SQL Server: DataType length

I've got an SSIS package (SQL Server 2008). I have an Excel source file (XLS 97-2003) that I first want to import into a SQL table that stores everything as strings (numbers and dates are stored exactly as they are written, for instance). Then I take the data from this table into my other tables.
The Excel source is configured like this: Provider=Microsoft.Jet.OLEDB.4.0;Data Source=*********;Extended Properties="EXCEL 8.0;HDR=YES;IMEX=1";
My problem occurs at the first step. Let me explain:
Some of my columns MIGHT contain large text. I know exactly which columns those are.
The problem is that:
On the one hand, if the source columns are configured as ntext and there is long text (>255 characters), everything is OK. If there is no data in these columns, or only short text (<255 characters), or the long text only appears after the first 8 rows, I get an error (red box on the Excel source... the package goes no further).
On the other hand, if the source columns are configured as (wstr, 255) and there is no data, or only short data (<255 characters), everything is fine. If there is long text, I get an error (which seems logical).
I would like to configure my package so that it does not fail when the data source contains shorter data than expected. That seems quite reasonable to me, but I cannot achieve it...
Any help will be much appreciated.
According to the MSDN SSIS documentation, you should read these two notes:
Missing values. The Excel driver reads a certain number of rows (by default, 8 rows) in the specified source to guess at the data type of each column... For more information, see PRB: Excel Values Returned as NULL Using DAO OpenRecordset.
Truncated text. When the driver determines that an Excel column contains text data, the driver selects the data type (string or memo) based on the longest value that it samples. If the driver does not discover any values longer than 255 characters in the rows that it samples, it treats the column as a 255-character string column instead of a memo column. Therefore, values longer than 255 characters may be truncated. To import data from a memo column without truncation, you must make sure that the memo column in at least one of the sampled rows contains a value longer than 255 characters, or you must increase the number of rows sampled by the driver to include such a row. You can increase the number of rows sampled by increasing the value of TypeGuessRows under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel registry key. For more information, see PRB: Transfer of Data from Jet 4.0 OLEDB Source Fails w/ Error.
Thus, it seems either you are trying to change the Excel source structure on the fly (which does not work with the Excel provider), or you have data that does not meet the requirements listed above (i.e. no long text at all, or long text only after the first 8 rows). I suppose you can handle this using two possible methods:
Paste dummy NTEXT-sized data into those columns; it saves a lot of nerves. You can do this in the very first row, so the Excel provider won't be confused when it samples the column contents.
Increase the row-sampling setting, using the link from MSDN. This will still fail, though, if those columns happen to contain no long text at all.
PS: A third method is not to use the Excel provider at all. Save the Excel file as CSV and work with a Flat File Source; you won't be hit by this problem there. The Excel Source is only good when you are 100% sure that the source file meets all the requirements and will never accidentally change its structure.
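If you go the CSV route, the conversion itself can be scripted rather than done by hand. A minimal sketch in Python (assuming pandas plus an engine that can read the old XLS format is available; the file and sheet names are placeholders):

import pandas as pd

# Read every cell as text so nothing gets coerced into numbers or dates,
# then write a flat CSV for the SSIS Flat File Source to consume.
df = pd.read_excel("source_workbook.xls", sheet_name=0, dtype=str)
df.to_csv("source_workbook.csv", index=False, encoding="utf-8")

This sidesteps the driver's row-sampling behaviour entirely, because the Flat File Source reads columns with the types you define rather than guessing them from the data.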

Total types of CSV files

I would like to ask how many types of CSV there are.
I have checked the options when saving an Excel file: there are Windows comma separated, MS-DOS comma separated, and normal CSV. So far these are the ones I have found.
However, I could not find anything on Google about the total number of CSV types; maybe my keywords are wrong. May I know the other types besides those I have mentioned, and their differences?
I would also appreciate any links you can provide.
Although there is an RFC document for CSV, RFC 4180, it is not strictly followed.
Other CSV variants differ in how cells are quoted: no quotes, double quotes...
Some people treat lines with '#' as a comment. Others do not.
Some have header rows, some don't.
Some places will even merge CSV data from multiple sources by concatenation and sorting on some column, and use the number of columns in a particular row to identify the record type.
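Because of all this variation, most CSV libraries treat the rules as a configurable "dialect" rather than assuming a single format. A small illustration in Python (the sample data is made up), showing how the same text parses differently depending on the quoting rule:

import csv
import io

sample = 'name,remark\n"Smith, John","said ""hi"""\n'

# RFC 4180-style parsing: the quotes protect the embedded comma and the doubled quote.
for row in csv.reader(io.StringIO(sample)):
    print(row)    # ['name', 'remark'] then ['Smith, John', 'said "hi"']

# The same text with quoting disabled splits on every comma and keeps the quote marks.
for row in csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE):
    print(row)    # the quoted comma now produces an extra column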

Trim before destination write in SSIS?

I have a data flow with a DB source and a flat text file destination (delimited by the pipe '|').
The DB source picks up its SQL query from a variable.
The problem is that if my DB field sizes for, say, firstname and lastname are 30 characters, I get output like this (spaces represented by dots):
saurabh......................|kumar.......................
What I need is for the fields to be trimmed, so that the actual output is:
saurabh|kumar
I have more than 40 columns to write, and I would not want to manually insert RTRIM after every column in my BIG sql query :(
I should add that the source can return up to 50,000 rows. I was thinking of putting a Script Component in between, but processing every row might have a performance impact.
Any ideas?
You have quite a few options, but some will obviously be undesirable or impossible to do because of your situation.
First, I'll assume that trailing spaces in the data are because the data types for the source columns are CHAR or NCHAR. You can change the data types in the source database to VARCHAR or NVARCHAR. This probably isn't a good idea.
If the data types in the source data are VARCHAR or NVARCHAR and the trailing spaces are in the data, you can update the data to remove the trailing spaces. This is probably not appealing either.
So, you have SSIS and the best place to handle this is in the data flow. Unfortunately, you must develop a solution for each column that has the trailing spaces. I don't think you'll find a quick and simple "fix all columns" solution.
You can do the data trimming with a script transformation, but you must write the code to do the work. Or, you can use a Derived Column transformation component. In the Derived Column transformation you would add a derived column for each column that needs trimming. For example, you would have a firstname column and a lastname column. The derived column value would replace the existing column value.
In the Derived Column transformation you would use SSIS expression syntax to trim the data. The firstname and lastname trim expressions would be
RTRIM(firstname)
RTRIM(lastname)
Performance will probably be better for the Derived Column transformation, but it may not differ much from the script solution. However, the Derived Column transformation will probably be easier to read and understand later.
You could try using a Script Component in the data flow. Unlike the control flow version, a data-flow Script Component has inputs and outputs.
Look at this example in MSDN: http://msdn.microsoft.com/en-us/library/ms345160.aspx
If you can iterate over each column of the row (?) as it flows through the Script Component, you could do a .NET trim on the column's data, then pass the trimmed row to the output.
Advantage there of course is it will trim future rows you add later.
Just an idea, I haven't tried this myself. Do post back if it works.
See this:
http://masstrimmer.codeplex.com
It will trim rows by using parallelism.