I've got a third-party file coming in: UTF-8 encoded, 56 columns, a CSV export from MySQL. My intent is to load it into a SQL Server 2019 instance, into a table layout I do not have control over.
The SQL Server Import Wizard will automatically do the code page conversion to Latin-1 (and a couple of string-to-int conversions), but it will not handle the MySQL "\N" convention for NULL, so I thought I'd try my hand at SSIS to see if I could get the data cleaned up on ingestion.
I got a number of components set up to do various filtering and transforming (like the "\N" stuff), and that was all working fine. Then I tried to save the data using an OLE DB destination, and the wheels kind of fell off the cart.
SSIS appears to drop all of the automatic conversions the Import Wizard would do and forces you to make the conversions explicit.
I added a Data Conversion transformation to the flow and edited all 56 columns to be explicit about the various conversions; the trouble is that although it lets me edit the code pages on the "Copy of" output columns, it will not save them, either in the Editor or in the Advanced Editor.
I saw another article here saying "Use the Derived Column Transformation" but that seems to be on a column-by-column basis (so I'd have to add 56 of them).
It seems kind of crazy that SSIS is such a major step backwards in this regard from the Import Wizard, bcp, or BULK INSERT.
Is there a way to get the code page switch to work using SSIS components? None of the components I've seen recommended seem to work, and all of the other articles say "make another table using different code pages or NVARCHAR and then copy one table to the other," which kind of defeats the purpose.
It took synthesizing a number of different posts on tangentially related issues, but I think I've finally gotten SSIS to do a lot of what the Import Wizard and BULK INSERT give you for free.
It seems that reading a UTF-8 CSV file with SSIS and processing it all the way through to a table that's in code page 1252 and not using NVARCHAR involves the following:
1) Create a Flat File Source component and set the incoming code page to 65001 (UTF-8). In the Advanced Editor, convert all string columns from DT_STR/65001 to DT_WSTR (essentially NVARCHAR). Those outputs are easier to work with the rest of the way through your workflow, and (most importantly) a Data Conversion transform won't let you convert from 65001 to any other code page, but it will let you convert from DT_WSTR to DT_STR in a different code page.
1a) SSIS is pretty annoying about putting a default length of 50 on everything, and about not carrying lengths through as defaults from one component/transform to the next. So you have to go through and set the appropriate lengths on all the "Column 0" input columns from the Flat File Source and on all the DT_WSTR outputs you create in that component.
1b) If your input file contains, as mine apparently does, invalid UTF-8 encoding now and then, choose RD_RedirectRow as the truncation error handling for every column, then add a Flat File Destination to your workflow and attach the red line coming out of your Flat File Source to it. That's if you want to see which rows were bad; you can just choose RD_IgnoreError if you don't care about bad input. But leaving the default means your whole package will blow up if it hits any bad data.
2) Create a Script transform component; in its script you can check each column for the MySQL "\N" and change it to NULL.
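For what it's worth, a minimal sketch of that script as a C# Script Component (transformation); the column name here is hypothetical, and SSIS generates a typed property plus a matching _IsNull flag for each input column you mark as ReadWrite:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // MySQL dumps NULL as the two-character literal \N;
    // turn it into a real NULL in the pipeline.
    if (!Row.Column0_IsNull && Row.Column0 == @"\N")
    {
        Row.Column0_IsNull = true;
    }
    // ...repeat for the remaining ReadWrite columns.
}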
3) Create a Data Conversion transformation and add it to your workflow. Because of the DT_WSTR in step 1, you can now change each output back to DT_STR in a different code page here. If you don't change to DT_WSTR from the get-go, the Data Conversion component will not let you change the code page at this step. 99% of the data I'm getting in has only latinate characters, UTF-8 encoded (the accents), but there is a smattering of kanji in a small subset of the data, so to reproduce what the Import Wizard does for you, you must change the truncation error handling to RD_IgnoreError on every column here that might be affected. Contrary to some documentation I read, RD_IgnoreError does not put NULL in the column; it puts in the text with the non-mapping characters replaced by "?", like we're all used to.
4) Add your OLE DB Destination component and map all of the output columns from step 3 to the columns of your database table.
So, a lot of work just to get back to where the Import Wizard starts, before you can get to the extra things SSIS can do for you. And SSIS can be kind of annoying about snapping column widths back to the default 50 when you change something; if you've got a lot of columns, this gets pretty tedious.
I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual CSV but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers with a different Locale, or some other workaround that would let me convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, say, relying on the Locale set for the JVM or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that; I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert to Parquet?
For anybody else who might be looking for an answer: the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map(this::conversionFunction);
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with the fields converted to numeric values as appropriate (in fact, it could perform any conversion). Rows in Java can be created with RowFactory.create(newFields).
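In case it helps, here is a rough sketch of such a conversion function. It is hypothetical: isNumericColumn stands in for your own lookup of which column indices should be numeric, and java.text.NumberFormat with Locale.GERMANY handles the comma decimals. It belongs in the same class that runs the read, which must be Serializable for the method reference to work in the RDD map:

import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Sketch only: convert the numeric columns of a string-only Row,
// parsing German-formatted decimals such as "123,01".
private Row conversionFunction(Row row) {
    NumberFormat german = NumberFormat.getNumberInstance(Locale.GERMANY);
    Object[] fields = new Object[row.length()];
    for (int i = 0; i < row.length(); i++) {
        String raw = row.getString(i);
        if (raw != null && isNumericColumn(i)) { // isNumericColumn: your own schema lookup
            try {
                fields[i] = german.parse(raw).doubleValue(); // "123,01" -> 123.01
            } catch (ParseException e) {
                fields[i] = null; // or collect the row for error reporting
            }
        } else {
            fields[i] = raw;
        }
    }
    return RowFactory.create(fields);
}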
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)
When converting CSV to Avro, I would like to output all the rejections to a file (let's say error.csv).
A rejection is usually caused by a wrong data type - e.g. when a "string" value appears in a "long" field.
I am trying to do it using the incompatible output; however, instead of saving just the rows that failed to convert (2 in the example below), it saves the whole CSV file. Is it possible to somehow filter out only the records that failed to convert? (Does NiFi add some markers to these records, etc.?)
Both the RouteOnAttribute and RouteOnContent processors route whole files. Does the "incompatible" leg of the flow somehow mark individual records with something like an "error" attribute that is available after splitting the file into rows? I cannot find this in any doc.
I recommend using a SplitText processor upstream of ConvertCSVToAvro, if you can, so you are only converting one record at a time. You will also have a clear context for what the errors attribute refers to on any flowfiles sent to the incompatible output.
Sending the entire failed file to the incompatible relationship appears to be a purposeful choice. I assume it may be necessary if the CSV file is not well formed, especially with respect to records being neatly contained on one line (or properly escaped). If your data violates this assumption, SplitText might make things worse by creating a fragmented set of failed lines.
I have a project that imports a TSV file with a field set as text stream (DT_TEXT).
When I have invalid rows that get redirected, the DT_TEXT fields from the invalid rows get appended to the next valid row that follows.
Here's my test data:
Tab-delimited input file: ("tsv IN")
CatID Descrip
y "desc1"
z "desc2"
3 "desc3"
CatID is set as an integer (DT_I8)
Descrip is set as a text stream (DT_TEXT)
Here's my basic Data Flow Task:
(I apologize, I can't post images until my rep is above 10 :-/ )
So my 2 invalid rows get redirected, and my 3rd row goes to success.
But here is my "Success" output:
"CatID","Descrip"
"3","desc1desc2desc3"
Is this a bug when using DT_TEXT fields? I am fairly new to SSIS, so maybe I misunderstand the use of text streams; I chose DT_TEXT because I was having truncation issues with DT_STR.
If it's helpful, my tsv Fail output is below:
Flat File Source Error Output Column,ErrorCode,ErrorColumn
x "desc1",-1071607676,10
y "desc2",-1071607676,10
Thanks in advance.
You should really try to avoid using the DT_TEXT, DT_NTEXT, or DT_IMAGE data types within SSIS fields, as they can severely impact data flow performance. The problem is that these types come through not as a CLOB (Character Large OBject) but as a BLOB (Binary Large OBject).
For reference see:
CLOB: http://en.wikipedia.org/wiki/Character_large_object
BLOB: http://en.wikipedia.org/wiki/BLOB
Difference: Help me understand the difference between CLOBs and BLOBs in Oracle
With DT_TEXT you cannot simply pull the characters out as you would from a large array. The type is represented as an array of bytes and can store any kind of data, which in your case is not needed and is what is creating the problem of concatenated fields. (I recreated the problem in my environment.)
My suggestion would be to stick with DT_STR for your description, giving it a large OutputColumnWidth. Make it large enough that no truncation occurs when reading from your source file, and test it out.
Does anyone know of a way to export data from an Access DB for use by COBOL code?
Thanks
Fixed format is definitely the way to go; any COBOL can read a fixed-format file.
A simple way to create a fixed-format file in any SQL dialect (Oracle, DB2, H2, etc.) is to use the SQL string functions to combine all the fields into a single column, then export/write the query results to a file.
MS Access Example Query:
SELECT Left(Str([TblId])+Space(8),8)
+ Left(Str([tblkey])+Space(20),20)
+ Left([Details]+Space(30),30)
+ "<" AS ExportString
FROM Tbl_TI_IntTbls;
For COBOL it would be best to right-justify and zero-fill numeric fields and align their decimal points.
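For example (untested, and assuming [tblkey] is numeric), a right-justified, zero-filled eight-character field can be built the same way:

SELECT Right(String(8,"0") & Trim(Str([tblkey])), 8) AS ZeroFilledKey
FROM Tbl_TI_IntTbls;

String(8,"0") produces "00000000"; appending the trimmed number and taking the rightmost 8 characters right-justifies and zero-fills it.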
Also, if this is a one-off, you can run the query in Access and copy/paste the output into a text editor.
MS Access also allows you to define fixed formats and use them for import (and export?), but it is a long time since I used them (I was using them to import fixed-width data). I will leave discussion of this to an Access expert.
You could also look at the RecordEditor (http://record-editor.sourceforge.net/Record11.htm) / JRecord (http://jrecord.sourceforge.net/), because:
- Both let you view/edit a file using a COBOL copybook, which is useful for checking that the export matches the COBOL definition.
- Both have a copybook analysis option (File menu) that will calculate the start/length of the fields in a COBOL copybook.
- Both have a copy function that will copy a CSV file to/from a COBOL file using a COBOL copybook.
Note: this is a shameless plug for my software.
I would avoid a delimited file (in case the delimiter occurs in a field), but if you must, use an obscure character, e.g. ` or ~ or ^.
The easiest way would be to export to a fixed-width format from Access; that is the native format for COBOL file data descriptions.
However, if Access does not support this, you can export to a CSV (comma-separated values) or TSV (tab-separated values) file. COBOL, in its ANSI form, does not support these directly, but they are very easy to parse with a simple UNSTRING. For example:
Perform Read-A-Record
Perform until End-Of-File
Unstring Input-Record
delimited by ","
into Column-1-Field
Column-2-Field
...
Column-n-Field
Perform Read-A-Record
End-Perform
Access can export to fixed-field formats via its Export Wizard, or a simple VB6 program or script can do the same thing using the Jet OLEDB Provider and Jet's Text IISAM, with a Schema.ini file that defines the output format.
There are formatting limitations (no signed packed-decimal formats or other COBOL exotica), but in general this should suffice for creating files most COBOL variants support. If you truly must have numeric fields left-zero-filled, you can do that by using the Jet SQL Expression Service, which allows inline use of a subset of VBA functions, and by defining the result field as Text in the Schema.ini file.
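As an illustration, a minimal Schema.ini section for a fixed-length export might look like this (the file name, column names, and widths here are made up):

[EXPORT.TXT]
Format=FixedLength
Col1=TblId Text Width 8
Col2=TblKey Text Width 20
Col3=Details Text Width 30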
If what you really require is export to some sort of ISAM file, your best bet is to write an intermediate COBOL program to import the saved fixed-field text data. Some COBOL products may even include utilities to do this kind of importing.
I have an OLE DB Data source and a Flat File Destination in the Data Flow of my SSIS Project. The goal is simply to pump data into a text file, and it does that.
Where I'm having problems is with the formatting. I need to be able to rtrim() a couple of columns to remove trailing spaces, and I have a couple more that need their leading zeros preserved. The current process is losing all the leading zeros.
The rtrim() can be done by simple truncation and ignoring the truncation errors, but that's very inelegant and error-prone. I'd like to find a better way, like actually applying an rtrim() function where needed.
Exploring similar SSIS questions and answers on SO, the advice seems to be "use a Script Task," but that's usually just thrown out there with no details, and it's not at all an intuitive thing to set up.
I don't see how to use scripting to do what I need. Do I use a Script Task on the Control Flow, or a Script Component in the Data Flow? Can I do rtrim() and pad strings where needed in a script? Anybody got an example of doing this or similar things?
Many thanks in advance.
With SSIS, there are many possible solutions! From what you describe, you could use a Derived Column transform within the Data Flow to perform the trimming and padding; you would use an expression for each, and it is relatively straightforward. E.g.,
RTRIM([ColumnName])
to remove trailing spaces, and something along the lines of
RIGHT("000000" + [ColumnName], 6)
to pad to six characters with leading zeros (this is off the top of my head, so the syntax may not be exact).
As for the scripting method, that is also valid. You would use the Script Component (as a transformation) in the Data Flow and use VB.NET or C# (if you have 2008) string-manipulation methods (e.g. strVariable.Trim()).
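For example, a C# Script Component (transformation) could do both in one pass. The column names here are made up; SSIS generates a typed property and an _IsNull flag on the row buffer for each input column you mark as ReadWrite:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // rtrim(): drop trailing spaces
    if (!Row.Description_IsNull)
        Row.Description = Row.Description.TrimEnd();

    // preserve/restore leading zeros by padding to a fixed width of 6
    if (!Row.AccountCode_IsNull)
        Row.AccountCode = Row.AccountCode.PadLeft(6, '0');
}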