SSIS: write DT_NTEXT into a UTF-8 CSV file

I need to write the result of an SQL query to a CSV file encoded in UTF-8 (I need this encoding because the data contains French letters). One of the columns is too large (more than 20,000 characters), so I can't use DT_WSTR for it. The input type is DT_TEXT, so I use a Data Conversion to change it to DT_NTEXT. But when I try to write it to the file I get this error message:
Error 2 Validation error. The data type for "input column" is
DT_NTEXT, which is not supported with ANSI files. Use DT_TEXT instead
and convert the data to DT_NTEXT using the data conversion component
Is there a way I can write the data to my file?
Thank you

I have run into this kind of issue as well. When working with data larger than 255 characters, SSIS treats it as blob data and will always handle it as such.
I then converted this blob stream data to readable text with a Script Component; after that, other transformations were possible.
This was the case in the SSIS that came with SQL Server 2008, and I don't believe it has changed since.
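For reference, here is a minimal sketch of that conversion inside a Script Component; Input0Buffer, LargeTextColumn and LargeTextAsString are placeholder names for your own input buffer and columns:
// Inside the Script Component's row-processing override.
// LargeTextColumn is a DT_NTEXT input column, LargeTextAsString a string output column.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (!Row.LargeTextColumn_IsNull)
    {
        // DT_NTEXT arrives as a BlobColumn: read all bytes and decode them as Unicode.
        byte[] bytes = Row.LargeTextColumn.GetBlobData(0, (int)Row.LargeTextColumn.Length);
        Row.LargeTextAsString = System.Text.Encoding.Unicode.GetString(bytes);
    }
}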

I ended up doing just what Samyne suggested and used a script.
First I modified my SQL stored procedure: instead of returning several columns, I put all the info into one single column, like so:
Select Column1 + '^' + Column2 + '^' + Column3 ...
Then I used this code in a script
// SSIS script: dump the recordset stored in the FileData object variable to a UTF-8 file.
string fileName = Dts.Variables["SLTemplateFilePath"].Value.ToString();
// FileMode.Truncate assumes the file already exists; use FileMode.Create otherwise.
using (var stream = new FileStream(fileName, FileMode.Truncate))
{
    using (var sw = new StreamWriter(stream, Encoding.UTF8))
    {
        // Load the ADO recordset held in the SSIS object variable into a DataTable.
        OleDbDataAdapter oleDA = new OleDbDataAdapter();
        DataTable dt = new DataTable();
        oleDA.Fill(dt, Dts.Variables["FileData"].Value);
        foreach (DataRow row in dt.Rows)
        {
            foreach (DataColumn column in dt.Columns)
            {
                sw.WriteLine(row[column]);
            }
        }
        sw.WriteLine();
    }
}
Putting all the info in one column is optional; I just wanted to avoid handling the columns in the script, so if my stored procedure changes I don't need to modify the SSIS package.
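If you would rather keep separate columns in the stored procedure, the row loop above can join the values itself instead; a small sketch, reusing the '^' separator from the query (requires using System.Linq;):
foreach (DataRow row in dt.Rows)
{
    // Join every column value in the row with the same '^' separator used in the SP.
    string[] values = row.ItemArray.Select(v => Convert.ToString(v)).ToArray();
    sw.WriteLine(string.Join("^", values));
}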

Related

Error parsing JSON: more than one document in the input (Redshift to Snowflake SQL)

I'm trying to convert a query from Redshift to Snowflake SQL.
The Redshift query looks like this:
SELECT
cr.creatives as creatives
, JSON_ARRAY_LENGTH(cr.creatives) as creatives_length
, JSON_EXTRACT_PATH_TEXT(JSON_EXTRACT_ARRAY_ELEMENT_TEXT (cr.creatives,0),'previewUrl') as preview_url
FROM campaign_revisions cr
The Snowflake query looks like this:
SELECT
cr.creatives as creatives
, ARRAY_SIZE(TO_ARRAY(ARRAY_CONSTRUCT(cr.creatives))) as creatives_length
, PARSE_JSON(PARSE_JSON(cr.creatives)[0]):previewUrl as preview_url
FROM campaign_revisions cr
It seems like JSON_EXTRACT_PATH_TEXT isn't converted correctly, as the Snowflake query results in error:
Error parsing JSON: more than one document in the input
cr.creatives is formatted like this:
"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]"
It seems to me that you are not working with valid JSON data inside Snowflake.
Please review your file format used for the copy into command.
If you open the "JSON" text provided in a text editor, note that the information is not parsed or formatted as JSON because of the quoting you have. Once your issue with the doubled/escaped quotes is handled, you should be able to make good progress.
(Screenshot: proper JSON on the left, the original data on the right.)
If you are not inclined to reload your data, see if you can create a JavaScript user-defined function to remove the quotes from your string; then you can use Snowflake to process the variant column.
The following code is a working proof of concept that removes the doubled quotes for you.
var textOriginal = '[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]';

function parseText(input) {
    // Collapse the doubled quotes back to single double-quotes, then parse as JSON.
    var a = input.replaceAll('""', '"');
    a = JSON.parse(a);
    return a;
}

var x = parseText(textOriginal);
console.log(x);
For anyone else seeing this doubled double-quote issue in JSON fields coming from CSV files in a Snowflake external stage (a slightly different issue than the original question posted):
The issue is likely that you need to use the FIELD_OPTIONALLY_ENCLOSED_BY setting, specifically FIELD_OPTIONALLY_ENCLOSED_BY = '"' when setting up your file format (see the Snowflake CREATE FILE FORMAT documentation).
Example of creating such a file format:
create or replace file format mydb.myschema.my_tsv_file_format
type = CSV
field_delimiter = '\t'
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
And an example of querying from a stage using this file format:
select
    $1 field_one,
    $2 field_two
    -- ...and so on
from '@my_s3_stage/path/to/file/my_tab_separated_file.csv' (file_format => 'my_tsv_file_format');

CSV Parser: how to auto-discover column names & types

I am loading CSV files for data import. The files come from various sources, so the header names and column positions often change. I searched and found helpful libraries like CsvHelper & FileHelpers.
Question: using either FileHelpers.net or CsvHelper, how do we extract both the header names and the column data types, so that I can create a drop-down for each column to map between a .NET type and a SQL type?
Just read in the first line of the file with, say,
string headers = File.ReadLines("MyFile.txt").First();
And then use a class builder to build whatever CSV spec you need.
DelimitedClassBuilder cb = new DelimitedClassBuilder("MyProduct", delimiter: ",");
cb.AddField("Name", typeof(string));
cb.LastField.TrimMode = TrimMode.Both;
cb.AddField("Description", typeof(string));
cb.LastField.FieldQuoted = true;
cb.LastField.QuoteChar = '"';
cb.LastField.QuoteMode = QuoteMode.OptionalForBoth;
// etc... e.g., add a date field
cb.AddField("SomeDate", typeof(DateTime));
var engine = new FileHelperEngine(cb.CreateRecordClass());
DataTable dt = engine.ReadFileAsDT("test.txt");
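To avoid hard-coding the fields, the header line can drive the class builder. Here is a rough sketch under the assumption that every column is first read as a string and mapped to a SQL type separately; the file name and the type map are illustrative only:
// Read the header row and build a FileHelpers record class from it.
// Requires the FileHelpers package (DelimitedClassBuilder lives in FileHelpers.Dynamic
// or FileHelpers.RunTime depending on the version), plus System.Linq and System.Data.
string[] headers = File.ReadLines("MyFile.txt").First().Split(',');

DelimitedClassBuilder cb = new DelimitedClassBuilder("MyDynamicRecord", delimiter: ",");
cb.IgnoreFirstLines = 1; // skip the header row when reading the data

foreach (string header in headers)
{
    // Read everything as string; inferring richer types would need a scan of sample rows.
    cb.AddField(header.Trim(), typeof(string));
    cb.LastField.FieldQuoted = true;
    cb.LastField.QuoteMode = QuoteMode.OptionalForBoth;
}

var engine = new FileHelperEngine(cb.CreateRecordClass());
DataTable data = engine.ReadFileAsDT("MyFile.txt");

// Illustrative .NET-to-SQL type map for the drop-downs; extend as needed.
var sqlTypeMap = new Dictionary<Type, string>
{
    { typeof(string),   "NVARCHAR(MAX)" },
    { typeof(int),      "INT" },
    { typeof(DateTime), "DATETIME2" },
    { typeof(decimal),  "DECIMAL(18,4)" }
};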

Data omission while using the MS ACE driver to read a CSV file

There must be some explanation for this. My CSV file is something like this:
CustomerID,FirstName,LastName,EmpID,EmployeeName
1,John,Smith,2,Smith
2,Wilber,Wright,3,Shaney
3,Gloria,Johnathan,4,Dick
Notice that some field names have "ID" in them. I execute the code below and try to view the DataTable during debugging using the DataTable visualizer in Visual Studio.
using System;
using System.Data.OleDb;
using System.Data;

namespace caOledbFileOpen
{
    class Program
    {
        static void Main(string[] args)
        {
            OleDbConnection cxn = new OleDbConnection();
            cxn.ConnectionString = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=c:\tools;Extended Properties='text;HDR=No;Delimiter(,)'";
            cxn.Open();
            OleDbCommand cmd = cxn.CreateCommand();
            cmd.CommandText = "Select * from [OUt.csv]";
            DataTable tbl = new DataTable();
            OleDbDataAdapter adp = new OleDbDataAdapter(cmd);
            adp.Fill(tbl);
            Console.WriteLine("End");
            Console.ReadLine();
        }
    }
}
I observe that the cells which should show CustomerID or EmpID appear blank.
The problem is that the OLE DB text driver infers the data types of your CSV 'columns' from the data in the first rows of the file. It sees the numbers in the first and fourth columns of your data and assumes those columns are numeric, even though the values in those columns' first row (the header text) are not numeric. What you're seeing is the really annoying part of all this: values that do not match the data type inferred for their column are not imported.
The solution is to specify the data types of your columns by using a schema file alongside your CSV. This is a plain text file, always named schema.ini, that you create in the same folder as your CSV. You specify the CSV file name on the first line, and the following lines define your CSV.
A schema.ini like this should work for you (the section name must match your CSV file name):
[Out.csv]
ColNameHeader=False
Col1="My Field 1" Text
Col2="My Field 2" Text
Col3="My Field 3" Text
Col4="My Field 4" Text
Col5="My Field 5" Text
The Microsoft documentation on schema.ini files has more info on the available settings.
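If you would rather not maintain schema.ini by hand, it can also be generated from code before the connection is opened; a small sketch, where the folder, file name and column count are assumptions to adapt:
// Write a schema.ini next to the CSV so that every column is read as Text.
// Requires: using System.IO; using System.Text;
string folder = @"c:\tools";        // the folder the connection string points at
string csvName = "Out.csv";         // the file being queried
int columnCount = 5;

var sb = new StringBuilder();
sb.AppendLine("[" + csvName + "]");
sb.AppendLine("ColNameHeader=False");
for (int i = 1; i <= columnCount; i++)
{
    sb.AppendLine("Col" + i + "=\"My Field " + i + "\" Text");
}
File.WriteAllText(Path.Combine(folder, "schema.ini"), sb.ToString());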
Also, in your schema.ini file add:
MaxScanRows=1
For example:
[hdrdtl.txt]
Format=TabDelimited
ColNameHeader=True
MaxScanRows=1

Excluding Content From SQL Bulk Insert

I want to import my IIS logs into SQL Server for reporting using BULK INSERT, but the comment lines (the ones that start with a #) cause a problem because those lines do not have the same number of fields as the data lines.
If I manually delete the comments, I can perform the bulk insert.
Is there a way to perform a bulk insert while excluding lines based on a match, such as any line that begins with a "#"?
Thanks.
The approach I generally use with BULK INSERT and irregular data is to push the incoming data into a temporary staging table with a single VARCHAR(MAX) column.
Once it's in there, I can use more flexible decision-making tools like SQL queries and string functions to decide which rows I want to select out of the staging table and bring into my main tables. This is also helpful because BULK INSERT can be maddeningly cryptic about how and why it fails on a specific file.
The only other option I can think of is using pre-upload scripting to trim comments and other lines that don't fit your tabular criteria before you do your bulk insert.
I recommend using logparser.exe instead. LogParser has some pretty neat capabilities on its own, but it can also be used to format the IIS log to be properly imported by SQL Server.
Microsoft has a tool called "PrepWebLog" (http://support.microsoft.com/kb/296093) which strips out these hash/pound lines; however, I'm running it now (via a PowerShell script over multiple files) and am finding its performance intolerably slow.
I think it'd be faster if I wrote a C# program (or maybe even a macro).
Update: PrepWebLog just crashed on me. I'd avoid it.
Update #2: I looked at PowerShell's Get-Content and Set-Content commands but didn't like the syntax and the likely performance. So I wrote this little C# console app:
// Snippet from Main; requires using System.IO and System.Text.RegularExpressions.
if (args.Length == 2)
{
    string path = args[0];
    string outPath = args[1];
    // Matches whole lines that start with '#', including the trailing CRLF.
    Regex hashString = new Regex("^#.+\r\n", RegexOptions.Multiline | RegexOptions.Compiled);
    foreach (string file in Directory.GetFiles(path, "*.log"))
    {
        string data;
        using (StreamReader sr = new StreamReader(file))
        {
            data = sr.ReadToEnd();
        }
        string output = hashString.Replace(data, string.Empty);
        using (StreamWriter sw = new StreamWriter(Path.Combine(outPath, new FileInfo(file).Name), false))
        {
            sw.Write(output);
        }
    }
}
else
{
    Console.WriteLine("Source and Destination Log Path required or too many arguments");
}
It's pretty quick.
Following up on what PeterX wrote, I modified the application to handle large log files, since anything sufficiently large would create an out-of-memory exception. Also, since we're only interested in whether the first character of a line is a hash, we can just use the StartsWith() method while reading line by line.
class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 2)
        {
            string path = args[0];
            string outPath = args[1];
            string line;
            foreach (string file in Directory.GetFiles(path, "*.log"))
            {
                using (StreamReader sr = new StreamReader(file))
                {
                    using (StreamWriter sw = new StreamWriter(Path.Combine(outPath, new FileInfo(file).Name), false))
                    {
                        // Stream line by line so the whole log never has to fit in memory.
                        while ((line = sr.ReadLine()) != null)
                        {
                            if (!line.StartsWith("#"))
                            {
                                sw.WriteLine(line);
                            }
                        }
                    }
                }
            }
        }
        else
        {
            Console.WriteLine("Source and Destination Log Path required or too many arguments");
        }
    }
}

Problem using OleDb data types to write data to an Excel sheet

I am trying to insert data obtained from a MySQL database into an Excel sheet using an OleDb data adapter. The data contains very long texts whose MySQL data types are defined as Varchar(1023), Text, Longtext, etc. When passing these to the OleDb data adapter I tried OleDb.VarWChar and OleDb.LongVarWChar with sizes of 5000 and so on, but I get the following exception when I run the da.Update(...) command:
The field is too small to accept the amount of data you attempted to add. Try inserting or pasting less data
I am having trouble understanding which OleDb data types, and with what sizes, I should use to map these long text values.
Could someone please help me with this?
Thanks.
I am doing something similar and ran into the same error with varchar(max) data types that come from SQL Server. It doesn't matter where the data is coming from, though. When you get the data from your database, you need to define the schema for the column data types and sizes. I do this by calling FillSchema on the data adapter that I am using:
DataTable dt = new DataTable();
SqlDataAdapter da = new SqlDataAdapter(cmd);
da.Fill(dt);
da.FillSchema(dt, SchemaType.Source);
return dt;
You could also set the column properties individually, if you wanted.
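For instance, a minimal sketch of doing it by hand (the column name is a placeholder):
// Set the relevant schema properties on an individual DataColumn.
dt.Columns["Description"].MaxLength = 4000;   // hypothetical column name
dt.Columns["Description"].AllowDBNull = true;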
Then I loop through each column in my DataTable and set up my columns for export to OleDb using ADOX.NET. You don't have to use ADOX.NET; the main concept here is to use the sizes that came from the original database.
// Snippet from a larger method: columnList (a StringBuilder), cat (an ADOX.Catalog)
// and adoxTable (an ADOX.Table) are defined elsewhere.
foreach (DataColumn col in dt.Columns)
{
    columnList.Append(dt.TableName + "." + col.ColumnName + ", ");
    ADOX.Column adoxCol = new Column();
    adoxCol.ParentCatalog = cat;
    adoxCol.Name = col.ColumnName;
    adoxCol.Type = TranslateType(col.DataType, col.MaxLength);
    int size = col.MaxLength > 0 ? col.MaxLength : 0;
    if (col.AllowDBNull)
    {
        adoxCol.Attributes = ColumnAttributesEnum.adColNullable;
    }
    adoxTable.Columns.Append(adoxCol, adoxCol.Type, size);
}
Here is a snippet from my TranslateType method that determines whether to use LongVarWChar or VarWChar. These data types are the ADOX.NET versions of the OleDb data types. I believe that anything over 4000 characters should use the LongVarWChar type, but I'm not sure about that. You didn't mention which version of Excel is your target, but I have this working with both Excel 2003 and Excel 2007.
case "System.String":
if (maxLength > 4000)
{
return DataTypeEnum.adLongVarWChar;
}
return DataTypeEnum.adVarWChar;
LongVarWChar can take large sizes, accommodating up to 2 GB, so don't worry about making the size too big.
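For context, here is a rough sketch of what a fuller TranslateType method might look like; the set of mapped types is an assumption and should be extended for your own columns:
// Hypothetical fuller TranslateType: maps a .NET column type and max length
// to an ADOX data type for the Excel destination.
private static DataTypeEnum TranslateType(Type dataType, int maxLength)
{
    switch (dataType.FullName)
    {
        case "System.Boolean":
            return DataTypeEnum.adBoolean;
        case "System.Int32":
            return DataTypeEnum.adInteger;
        case "System.Double":
        case "System.Decimal":
            return DataTypeEnum.adDouble;    // treated as double here for simplicity
        case "System.DateTime":
            return DataTypeEnum.adDate;
        case "System.String":
        default:
            // Long text goes to the memo-style type; shorter text stays VarWChar.
            return maxLength > 4000 ? DataTypeEnum.adLongVarWChar : DataTypeEnum.adVarWChar;
    }
}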