String gets separated into different rows in SSIS - ssis

I am currently working on an input file that has a column containing 3 different values in a single cell. Although this data is not used in the transformation, I need to read it from the source and then ignore it when loading into the staging table.
But the issue I face is that it gets loaded into separate rows rather than one cell.
This particular column is input as a string datatype. What change do I need to make to resolve this issue? Please let me know if more details are needed to answer the question.
I have uploaded a sample file to google drive https://drive.google.com/file/d/17hn8xmRd4CWsgKBzHgdwnR9W4jTJ9lTn/view?usp=sharing
The following is a screenshot of the csv data as opened in a text editor

Having downloaded sample.csv from your link, the first thing I did was open it in a text editor (Notepad++, TextPad, Visual Studio, etc) and just looked at what you have.
Row 1 is column headers
Encoded in UTF-8 with BOM (byte order mark)
Line Endings are CR/LF (Carriage Return & Line Feed)
Column delimiter appears to be a comma ,
Double Quote, ", is used as the text qualifier but only when needed
There are CR/LF characters in the actual data
I then define my flat file connection manager based on that data
Finally, I have a data flow with a Flat File Source to a Derived Column and drop a Data Viewer between them
As you can see, configuring your Flat File Connection Manager as I show will allow all the data to flow into your table as expected.
What is happening now is that the CRLF row delimiter is being matched even where it appears inside the column data, so each embedded CRLF starts a new row. By setting the double quote as the Text Qualifier, the data reader correctly "skips" any CRLF inside the quotes and only treats a CRLF found outside the quotes as the end of the row.
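The same behaviour can be reproduced outside SSIS. The following is a minimal Python sketch, not part of the package, that parses a made-up sample (the column names and values are assumptions, not the contents of sample.csv) once with a double-quote text qualifier and once by splitting on CR/LF first:

```python
import csv
import io

# Hypothetical sample: the Notes field of the first data row contains embedded CR/LF.
SAMPLE = 'Id,Notes,Amount\r\n1,"first\r\nsecond\r\nthird",10\r\n2,plain,20\r\n'

def read_rows(text, qualifier='"'):
    """Parse with a text qualifier; newline='' keeps the raw CR/LF visible to csv."""
    stream = io.StringIO(text, newline='')
    return list(csv.reader(stream, delimiter=',', quotechar=qualifier))

def read_rows_without_qualifier(text):
    """Split on the row delimiter first, ignoring quotes entirely."""
    return [line.split(',') for line in text.split('\r\n') if line]

rows = read_rows(SAMPLE)                       # header + 2 data rows
broken = read_rows_without_qualifier(SAMPLE)   # the quoted field spills into extra rows

print(len(rows), len(broken))
```

With the qualifier the three values stay in one cell; without it the single record is spread over three rows, which matches the symptom in the question.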

Related

Pipeline unable to read field of plain text file

Using the latest version of Apache Hop, I'm trying to read in a plain text file. This text file is old and basically only structured by its lines (it has no delimiter, no separator, no enclosure, etc.). I would like to read and process the lines of this file as rows in my transformation.
I use the "Text file input" transformation to read the file. Apparently reading it works, but I seem to have no field available when trying to retrieve the fields. It simply states that no fields were found.
When I run "preview records" I do get empty records equal to the number of lines in the file, so that is good. However, there is no data shown as there is no field detected.
Curiously enough, when I press "Show file content" I DO get the desired content, nicely structured in the rows as desired, so I know the file is being read correctly.
Does anyone know how best to read this kind of file?
PS: The files can be anywhere from 10 to 100000 lines.
When there is no header row with field names or Hop is not able to detect any fields you can also create a field in the fields tab and it will put content in there.
As we just use a position based approach and split the content using the specified delimiter everything should go in "field1" when no delimiter is found in the data.
Figured it out. The naming is a bit misleading, but you can use the "CSV File input" and then set a TAB as the delimiter. Then use preview on your file and you should find that the lines are actually being parsed.
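To illustrate the positional splitting described in the first answer, here is a small Python sketch. It is not Hop's implementation; the function name, field name and sample lines are assumptions made for the example:

```python
def split_line(line, delimiter, field_names):
    """Split one line on the delimiter and assign the parts to fields by position."""
    parts = line.split(delimiter)
    return {
        name: (parts[i] if i < len(parts) else None)
        for i, name in enumerate(field_names)
    }

# Hypothetical file content: no TAB anywhere in the lines.
lines = ["REPORT OF 1987", "total units shipped 42"]

# With a single declared field and a delimiter that never occurs,
# the whole line ends up in "field1".
records = [split_line(line, "\t", ["field1"]) for line in lines]
print(records)
```

This is why declaring one field manually, or choosing a delimiter that never appears in the data, gives one row per line with the full text in a single field.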

Paste CSV or Tab-Delimited data to excel with NO formatting

I'm pasting tab-delimited data from Notepad++ into Excel (about 50k rows and 3 columns). No matter how many different ways I try it, Excel treats everything from one " up to the next " as the content of a single cell.
For Example, if my data looked like this:
"Apple 1.0 Store
Banana 1.3 Store
"Cherry" 2.5 Garden
Watermelon 4.0 Field
The excel file looks like this:
Apple1.0StoreBanana1.3Store
Cherry 2.5GardenWatermelon4.0Field
One way to get around this is to open the file as a CSV in excel, however this leads to Excel formatting the number values to simplified ones using Excel's "General" format. So the data would look like the following:
"Apple 1 Store
Banana 1.3 Store
"Cherry" 2.5 Garden
Watermelon 4 Field
The data I'm getting is coming from SQL Server Studio so my options for file formats are:
.CSV
.Txt (Tab-delimited)
Copy Pasting from Query results
The solution I'm looking for is to have the data represented in Excel with no excel processing taking place on the quotations, numbers or any other cell contents.
Don't open the file directly in excel. Instead import it and control the data types and file layout.
Open a new excel document:
Select Data menu:
Select From Text in get External Data section.
Select file to import
On step 1 of import wizard select delimited
Click next
Select tab checkbox and change text qualifier to {none}.
Click next
Set column data types to general, text, text
Click finish.
Excel auto-imports the data as best it can when you open the file directly, and you lose flexibility and control when this happens. It is better to import the file and control the settings yourself to get the fine adjustments you're looking for.
You end up with something like this:
By treating the numbers like text, the zeros don't get messed up.
By setting the text qualifier to none, the quotes don't get messed up.
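The effect of those two settings (no text qualifier, every column kept as text) can be sketched in Python. This is only an illustration of the idea, not what Excel does internally; the sample reuses the rows from the question with real TAB characters:

```python
import csv
import io

# The question's sample rows, tab-delimited, including the stray quote before Apple.
SAMPLE = (
    '"Apple\t1.0\tStore\n'
    'Banana\t1.3\tStore\n'
    '"Cherry"\t2.5\tGarden\n'
    'Watermelon\t4.0\tField\n'
)

def read_verbatim(text):
    """Read tab-delimited text with quoting disabled; every value stays a string."""
    stream = io.StringIO(text, newline='')
    return list(csv.reader(stream, delimiter='\t', quoting=csv.QUOTE_NONE))

rows = read_verbatim(SAMPLE)
for row in rows:
    print(row)
```

Because quoting is disabled, the quotation marks are ordinary characters, and because nothing is converted to a number, 1.0 and 4.0 keep their trailing zeros.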
Have you tried opening it via Text Import?
Go to Data tab > From Text (third from left by default)
You will get a window similar to Text To Columns.
Select the correct delimiter, remember to remove the quote sign from Text Qualifier, and mark all columns as text to avoid Excel autoformatting.
Excel tip for saving time when importing CSV files: if you pre-set your Text-To-Columns delimiter parameters correctly in Excel (e.g. specify tabs as the delimiter) and then copy and paste the CSV data, Excel will put the pasted data directly into the correct columns without you having to go through the Text-To-Columns rigmarole. This was particularly time-saving when I had to import hundreds of bank statements into Excel.
However, if your Text-To-Columns delimiters are pre-specified incorrectly (e.g. comma while you are importing tab-delimited files), Excel will dump all the data into one column, and you will have to go through the time-consuming process of converting Text-To-Columns for each statement.
Excel looks to the existing Text-To-Columns delimiters to see if it can use those when pasting data.
Hope that tip helps (It saved me several hours)

Adwords csv file in attachment is not parsing properly

I am trying to use google apps script to extract data from an email attachment which is basically an Adwords report as csv file.
Here is the gist of the code
var dataTest3 = Utilities.parseCsv(msg.getAttachments()[0].getDataAsString());
SpreadsheetApp.getActive().getSheetByName("Sheet1").getRange(1, 1, dataTest3.length, dataTest3[0].length).setValues(dataTest3);
msg is the GmailMessage object.
The result that I am getting is an array with a strange format
The data shows OK but its value is strange
Any idea how I can make it parse into the spreadsheet like a normal csv? It opens up like a normal csv when downloaded.
Thanks
The description "basically an Adwords report as csv file" needs to be investigated... what exactly is the file format? With only pictures of your problem, the best I can do is guess that your file is using some custom delimiter, not commas.
"CSV" stands for Comma Separated Values, but in practice it applies to text files with a number of different field delimiters - call them Delimiter-separated values. Common delimiters include commas (,), tabs (\t), colons (:), v-bars (|), and sometimes just spaces (usually between quote-enclosed text fields).
Instead of using the version of Utilities.parseCsv(csv) that assumes a comma delimiter, you can use Utilities.parseCsv(csv, delimiter) to specify a custom delimiter. You should be able to determine what the delimiter is by reviewing the attachment in the debugger.
You could also try adapting importFromCSV() from How to Import tab-delimited "CSV", which automatically detects tab or comma delimiters.
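The delimiter-detection idea can be sketched as follows. This is a Python illustration rather than Apps Script, and it is not the importFromCSV() function mentioned above; the candidate list, the function names and the sample report are assumptions:

```python
import csv
import io

CANDIDATES = [',', '\t', ';', '|', ':']

def detect_delimiter(text, candidates=CANDIDATES):
    """Pick the candidate that occurs most often in the first non-empty line."""
    first_line = next((ln for ln in text.splitlines() if ln.strip()), '')
    counts = {d: first_line.count(d) for d in candidates}
    best = max(candidates, key=lambda d: counts[d])
    # Fall back to a comma when none of the candidates appears at all.
    return best if counts[best] > 0 else ','

def parse_delimited(text):
    """Parse the text using whichever delimiter was detected."""
    delimiter = detect_delimiter(text)
    return list(csv.reader(io.StringIO(text, newline=''), delimiter=delimiter))

# Hypothetical tab-delimited report.
SAMPLE = 'Campaign\tClicks\tCost\nBrand\t120\t45.10\n'
table = parse_delimited(SAMPLE)
print(table)
```

Counting characters in the first line is a naive heuristic: it can be fooled by delimiters inside quoted values, so inspecting the attachment by hand remains the more reliable check.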

CSV manipulation - text removal

I'm trying to manipulate a .csv file to remove text at the beginning of the file before the data starts. The file contains a fixed text string followed by a date field, which will change from file to file and then another fixed text string.
eg.
"Text1"
"------"
"date"
"Text2"
"data column1","data column2" etc
How can I remove this text so i can then use SSIS to import the data to the SQL database?
If I understand the question correctly, you want to skip the first lines of the file. When you set up the Flat File connection, there is a "Header rows to skip" option in the Format section of the properties. Set this to the number of rows you need to skip and the file should import. If you have an actual header row, you will need to skip that as well and then map the columns manually.
Within the SSIS import configuration, is there not an option to tell SSIS that the "first row has headers," or something roughly similar? That's what I've used when importing through SSIS, at least.
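For anyone who wants to check the effect of skipping rows before configuring the package, here is a minimal Python sketch of the same idea. The sample values in the last row are made up, and the approach assumes the preamble contains no embedded line breaks:

```python
import csv
import io
from itertools import islice

# Four preamble lines as in the question, followed by the column row
# and one hypothetical data row.
SAMPLE = (
    '"Text1"\r\n'
    '"------"\r\n'
    '"date"\r\n'
    '"Text2"\r\n'
    '"data column1","data column2"\r\n'
    '"a","b"\r\n'
)

def read_after_preamble(text, rows_to_skip):
    """Drop the first rows_to_skip physical lines, then parse the rest as CSV."""
    lines = io.StringIO(text, newline='')
    remaining = islice(lines, rows_to_skip, None)
    return list(csv.reader(remaining))

rows = read_after_preamble(SAMPLE, 4)
print(rows)
```

Here the skip count is 4 because the preamble in the example is four lines long; the date line changes from file to file but its position does not.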

How to import a fixed width flat file into database using SSIS?

Does anyone have a tutorial on how to import a fixed width flat file into a database using an SSIS package?
I have a flat file containing columns with varying lengths.
Column name  Width
-----------  -----
First name   25
Last name    25
Id           9
Date         8
How do I convert a flat file into columns?
Here is a sample package created using SSIS 2008 R2 that explains how to import a flat file into a database table.
Create a fixed-width flat file named Fixed_Width_File.txt with data as shown in the screenshot. The screenshot uses Notepad++ to display the file contents. It has the capability to show the special characters like carriage return and line feed. CR LF denotes the row delimiters Carriage return and Line feed.
In the SQL server database, create a table named dbo.FlatFile using the create script provided under SQL Scripts section.
Create a new SSIS package and add a new OLE DB Connection manager that would connect to the SQL Server database. Let's assume that the OLE DB Connection manager is named as SQLServer.
On the package's control flow tab, place a Data Flow Task.
Double-click on the data flow task and you will be taken to the data flow tab. On the data flow tab, place a Flat File Source. Double-click on the flat file source and the Flat File Source Editor will appear. Click the New button to open the Flat File Connection Manager Editor.
On the General section of the Flat File Connection Manager Editor, enter a value in Connection manager name (say Source), then browse to the flat file location and select the file. This example uses the sample file in the path C:\temp\Fixed_Width_File.txt. If you have a header row in your file, you can enter the value 1 in the Header rows to skip textbox to skip it.
Click on the Columns section. Change the font if you like; I chose Courier New so I could see more data with less scrolling. Enter the value 69 in the Row width text box. This value is the sum of the widths of all your columns (25 + 25 + 9 + 8 = 67) plus 2 for the row delimiter. Once you have set the correct row width, you should see the fixed-width file data displayed correctly in the Source data columns section. Now click at the appropriate locations to set the column limits. Note the sections 4, 5, 6 and in the below screenshot.
Click on the Advanced section. You will notice 5 columns created for you automatically based on the column limits that we set on the Columns section in the previous step. The fifth column is for row delimiter.
Rename the column names as FirstName, LastName, Id, Date and RowDelimiter
By default, the columns will be set with DataType string [DT_STR]. If we are fairly certain that a column will be of a different data type, we can configure it in the Advanced section. We will change the Id column to data type four-byte signed integer [DT_I4] and the Date column to data type date [DT_DATE].
Click on the Preview section. The data will be shown as per the column configuration.
Click OK on the Flat file connection manager editor and the flat file connection will be assigned to the Flat File Source in the data flow task.
On the Flat File Source Editor, click on the Columns section. You will notice the columns that were configured in the flat file connection manager. Uncheck the RowDelimiter because we won't need that.
On the data flow task, place an OLE DB Destination. Connect the output from the Flat file source to the OLE DB Destination.
On the OLE DB Destination Editor, select the OLE DB Connection manager named SQLServer and set the Name of the table or the view drop down to [dbo].[FlatFile]
On the OLE DB Destination Editor, click on the Mappings section. Since the column names in the flat file connection manager are same as the columns in the database, the mapping will take place automatically. If the names are different, you have to manually map the columns. Click OK.
Now the package is ready. Execute the package to load the fixed-width flat file data into the database.
If you query the table dbo.FlatFile in the database, you will notice the flat file data imported into the database.
This sample should give you an idea about how to import fixed-width flat file into database. It doesn't explain how to handle error logging but this should get you started and help you discover other SSIS related features when you play with packages.
Hope that helps.
SQL Scripts:
CREATE TABLE [dbo].[FlatFile](
[Id] [int] NOT NULL,
[FirstName] [varchar](25) NOT NULL,
[LastName] [varchar](25) NOT NULL,
[Date] [datetime] NOT NULL
)
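The row-width arithmetic used in the walkthrough (25 + 25 + 9 + 8 + 2 = 69) can be checked with a short Python sketch that slices a fixed-width file the same way. This is an illustration only, not part of the SSIS package; the sample names and the yyyymmdd date format are assumptions, since the original file is only shown in a screenshot:

```python
from datetime import datetime

# (name, width) pairs from the question; each row ends with CR/LF.
LAYOUT = [("FirstName", 25), ("LastName", 25), ("Id", 9), ("Date", 8)]
ROW_DELIMITER = "\r\n"
ROW_WIDTH = sum(width for _, width in LAYOUT) + len(ROW_DELIMITER)  # 67 + 2 = 69

def parse_fixed_width(text):
    """Cut the text into ROW_WIDTH-sized rows, then each row into columns."""
    if len(text) % ROW_WIDTH != 0:
        raise ValueError("file length is not a multiple of the row width")
    records = []
    for start in range(0, len(text), ROW_WIDTH):
        row = text[start:start + ROW_WIDTH]
        record, offset = {}, 0
        for name, width in LAYOUT:
            record[name] = row[offset:offset + width].strip()
            offset += width
        record["Id"] = int(record["Id"])
        # Assumed date format: yyyymmdd (8 characters).
        record["Date"] = datetime.strptime(record["Date"], "%Y%m%d")
        records.append(record)
    return records

def make_row(first, last, ident, date):
    """Build one padded sample row (hypothetical data)."""
    return f"{first:<25}{last:<25}{ident:<9}{date:<8}" + ROW_DELIMITER

SAMPLE = make_row("Ada", "Lovelace", "1", "18151210") + make_row("Alan", "Turing", "2", "19120623")
records = parse_fixed_width(SAMPLE)
print(records)
```

If the two delimiter characters are left out of the row width, every row after the first is shifted by two characters, which is the misalignment the Preview section would show.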
In the Derived Column transformation you can use the SUBSTRING() function for each of the columns.
Example:
Columns DerivedColumn
FirstName SUBSTRING(Data, startFrom, Length)
Here FirstName has a width of 25 and begins at the first character of the row. SUBSTRING in the SSIS expression language counts positions from 1, so the derived column expression is SUBSTRING(Data, 1, 25).
Similarly for other columns.
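The start positions for the remaining columns follow from the widths. The Python sketch below just does that arithmetic and prints the expressions; it assumes the whole row is read into a single column named Data and that SUBSTRING positions are counted from 1, as in the SSIS expression language:

```python
# (name, width) pairs from the question.
LAYOUT = [("FirstName", 25), ("LastName", 25), ("Id", 9), ("Date", 8)]

def substring_expressions(layout, source_column="Data"):
    """Return one SUBSTRING expression per column, using 1-based start positions."""
    expressions, position = {}, 1
    for name, width in layout:
        expressions[name] = f"SUBSTRING({source_column}, {position}, {width})"
        position += width
    return expressions

expressions = substring_expressions(LAYOUT)
for name, expression in expressions.items():
    print(name, expression)
```

Each start position is 1 plus the sum of the widths of the columns before it.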
Very well explained, Siva! Your tutorial and excellent illustrations point out what Microsoft should have made clear:
that the width for a fixed length row has to include the Carriage Return and Line Feed (CR & LF) characters (which I figured out because the preview showed the rows were not lining up correctly)
the all important step of defining an extra column to contain those CR & LF characters, even though they won't be imported. I figured this out, too. I would have benefited by finding your answer before I began.
Without those two things, an attempt to run the import will give this error message:
The data conversion for column "Column x" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
I have added this error text in the hope that someone will find this page while searching for the cause of their error. Your tutorial is worth finding, even if after the fact!