Why are dataset columns not recognised with CSV on Azure Machine Learning?

I have created a CSV file with no header.
It has 1496 rows of data in 2 columns, in the form:
Real; String
example:
0.24; "Some very long string"
I go to New - Dataset - From local file,
pick my file, and choose the no-header CSV format.
But after it's done loading I get an error message I can't decipher:
Dataset upload failed. Internal Service Error. Request ID:
ca378649-009b-4ee6-b2c2-87d93d4549d7 2015-06-29 18:33:14Z
Any idea what is going wrong?

At this time, Azure Machine Learning only accepts comma-separated, American-style CSV.
You will need to convert your semicolon-separated file to a comma-separated CSV.
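If it helps, the conversion can be done with a few lines of Python (a minimal sketch; the file names are placeholders, not from the original post):

import csv

# Read the semicolon-separated file and rewrite it comma-separated.
# skipinitialspace handles the space after the semicolon in rows like
# 0.24; "Some very long string", so the quotes are parsed as quotes.
with open("data_semicolon.csv", newline="", encoding="utf-8") as src, \
        open("data_comma.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter=";", skipinitialspace=True)
    writer = csv.writer(dst, delimiter=",")
    for row in reader:
        writer.writerow(row)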

Related

Found more columns than expected column count in Azure data factory while reading CSV stored in ADLS

I am exporting F&O D365 data to ADLS in CSV format. Now I am trying to read the CSV stored in ADLS and copy it into an Azure Synapse dedicated SQL pool table using Azure Data Factory. I can create the pipeline and it works for a few tables without any issue, but it fails for one table (salesline) because of a mismatch in the number of columns.
Below is a sample of the CSV format. There is no column name (header) in the CSV because it's exported from the F&O system; the column names are stored in the salesline.CDM.json file.
5653064010,,,"2022-06-03T20:07:38.7122447Z",5653064010,"B775-92"
5653064011,,,"2022-06-03T20:07:38.7122447Z",5653064011,"Small Parcel"
5653064012,,,"2022-06-03T20:07:38.7122447Z",5653064012,"somedata"
5653064013,,,"2022-06-03T20:07:38.7122447Z",5653064013,"someotherdata",,,,test1, test2
5653064014,,,"2022-06-03T20:07:38.7122447Z",5653064014,"parcel"
5653064016,,,"2022-06-03T20:07:38.7122447Z",5653064016,"B775-92",,,,,,test3
I have created an ADF pipeline using the Copy data activity to copy the data from ADLS (CSV) to the Synapse SQL table; however, I am getting the below error.
Operation on target Copy_hs1 failed: ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'SALESLINE_00001.csv' with row number 4: found more columns than expected column count 6.,Source=Microsoft.DataTransfer.Common,'
Column mapping looks like below. Because the first row of the CSV has 6 columns, only 6 columns appear while importing the schema.
I have reproduced this with your sample data and got the same error while copying the file using the Copy data activity.
Alternatively, I tried to copy the file using a data flow and was able to load the data without any errors.
Source file:
Data flow:
Source dataset: only the first 6 columns are read, as the first row contains only 6 columns in the file.
Source transformation: connect the source dataset in the source transformation.
Source preview:
Sink transformation: connect the sink to the Synapse dataset.
Settings:
Mappings:
Sink output:
After running the data flow, data is loaded to the sink synapse table.
Changing my CSV to XLSX helped me solve this problem in the ADF Copy activity.
1. From the Copy data settings, set "Fault Tolerance" = "Skip incompatible rows".
2. From the Dataset connection settings, set the Escape character to double quote (").
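Before switching on fault tolerance, it can be worth checking how many rows would actually be skipped. A minimal Python sketch (it assumes a local copy of the exported file; the counts shown refer only to the sample rows above):

import csv
from collections import Counter

# Count how many fields each row has; anything other than 6 would be
# skipped by "Skip incompatible rows".
with open("SALESLINE_00001.csv", newline="", encoding="utf-8") as f:
    counts = Counter(len(row) for row in csv.reader(f))

print(counts)  # for the sample above: Counter({6: 4, 11: 1, 12: 1})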

SQL COPY INTO is unable to parse a CSV due to unexpected line feeds in some fields. ROWTERMINATOR and FIELDQUOTE parameters do not work

I have a CSV file in Azure Data Lake that, when opened with Notepad++, looks something like this:
a,b,c
d,e,f
g,h,i
j,"foo
bar,baz",l
Upon inspection in Notepad++ (view all symbols) it shows me this:
a,b,c[CR][LF]
d,e,f[CR][LF]
g,h,i[CR][LF]
j,"foo[LF]
[LF]
bar,baz",l[CR][LF]
That is to say, the normal Windows carriage return and line feed after each row,
with the exception that in one of the columns someone inserted a multi-line value such as:
foo
bar, baz
My T-SQL code to ingest the CSV looks like this:
COPY INTO dbo.SalesLine
FROM 'https://mydatalakeblablabla/folders/myfile.csv'
WITH (
    ROWTERMINATOR = '0x0d', -- Tried \n, \r\n, 0x0d0a here
    FILE_TYPE = 'CSV',
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ',',
    CREDENTIAL = (IDENTITY = 'Managed Identity') -- Used to access the data lake
)
But the query doesn't work. The common error message in SSMS is:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (NAME) in data file
I have no option to correct the faulty rows in the data lake or modify the CSV in any way.
Obviously it is a much larger file with real data, but I took a simple example.
How can I modify or rewrite the T-SQL code to correct the CSV as it is being read?
I recreated a similar file and uploaded it to my data lake, and the serverless SQL pool seemed to manage it just fine:
SELECT *
FROM OPENROWSET(
    BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result]
My results:
It probably seems like a bit of a workaround, but if the improved parser in serverless makes light work of problems like this, why not make use of the whole Azure Synapse Analytics suite? You could use a serverless query as the source in a Copy activity in a Synapse pipeline and load it into your dedicated SQL pool, which would have the same outcome as using the COPY INTO command.
In the past I've written special parsing routines, loaded the file up as a single column and split it in the database, or used regular expressions, but if there's a simple solution, why not use it.
I viewed my test file in an online hex editor; maybe I'm missing something:
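For what it's worth, if pre-cleaning the file ever were an option (the question above rules out modifying it in place), a minimal version of the kind of parsing routine mentioned earlier might look like this in Python (file names are placeholders):

import csv

# Read the file with a CSV parser that honours quoted fields (so the embedded
# line feeds stay inside their field), then rewrite it with those line feeds
# replaced by spaces, so one physical line equals one record.
with open("badFile.csv", newline="", encoding="utf-8") as src, \
        open("cleanFile.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(field.replace("\r", " ").replace("\n", " ") for field in row)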

How can I import a .TSV file using the "Get Data" command using SPSS syntax?

I'm importing a .TSV file, with the first row being the variable names and the first column as IDs, into SPSS using syntax, but I keep getting a Failure opening file error in my output. This is my code so far:
GET DATA
/TYPE=TXT
/FILE=\filelocation\filename.tsv
/DELCASE=LINE
/DELIMTERS="/t"
/QUALIFIER=''
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
/MAP
RESTORE.
CACHE.
EXECUTE.
SAVE OUTFILE = "newfile.sav"
I think I'm having an issue with the delimiters or qualifier subcommand. I'm also wondering if I should include the variables under the VARIABLES subcommand. Any advice would be helpful. Thanks!
The GET DATA command you cite above has an empty /VARIABLES= subcommand.
If you had used the "File -> Import Data -> Text Data" wizard, it would have populated this subcommand for you. If you are writing the GET DATA syntax yourself, then you have to supply that list of field names yourself.

Importing CSV file in Talend - how to set options to match Excel

I have a CSV file that I can open in Excel 2012 and it comes in perfectly. When I try to set up the metadata for this CSV file in Talend, the fields (columns) are not split the same way as Excel splits them. I suspect I am not setting the metadata properly.
The specific issue is that I have a column with string data in it which may contain commas within the string. For example suppose I have a CSV file with three columns: ID, Name and Age which looks like this:
ID,Name,Age
1,Ralph,34
2,Sue,14
3,"Smith, John", 42
When Excel reads this CSV file it treats the second element of the third row ("Smith, John") as a single token and places it into a cell by itself.
Talend tries to break this same token into two, since there is a comma within the token. Apparently Excel ignores delimiters within a quoted string, while Talend by default does not.
My question is: how do I get Talend to behave the same as Excel?
If you use the tFileInputDelimited component to read this CSV file, use "," as the field separator and, under the CSV options of this component, enable the Text Enclosure option with """. Even if you use metadata, there is an option to define the string/text enclosure; set it to """ there to resolve your problem.

SSIS - CSV to SQL Server Data Import Error

I'm playing around with a CSV file, trying to import its data into a SQL Server table using SSIS.
The package is simple with File Source Task and SQL Server Destination.
The CSV file has 2 fields, Transaction_Date and Account_Created. The dates in these fields are in the format 1/2/2009 6:00:00 AM. I am seeing the below error message:
"Error: An exception has occurred during data insertion, the message returned from the provider is: The given value of type String from the data source cannot be converted to type datetime of the specified target column."
Steps I tried below:
I tried using various other destination transformations.
I tried playing around with the data types inside the Flat File Connection Manager.
I tried using a Data Conversion transformation between the source task and the SQL Server destination.
When I try to load the data by mapping only Transaction_Date, it works. However, when I try to load by mapping only Account_Created, it fails.
I am sure I'm missing something silly. Please help.
Regards,
KK
I tried a different method, building the package from scratch using the wizard. I used the actual CSV file, which had many other columns like Price, Product_name and so on. When I tried to execute the package I saw a different error message, as below:
"[Destination for AdventureWorks [58]] Error: There was an error with input column "Price" (91) on input "OLE DB Destination Input" (71). The column status returned was: "The value could not be converted because of a potential loss of data.".
"
When I tried a CSV file with only the 2 date fields it worked fine.
I am really puzzled now and thinking this is some kind of data type issue which I am not getting right. Can someone please shed some light on this problem?
Regards,
KK
To load the first two fields (Transaction_Date, Account_Created), you need a Data Flow Task that contains:
Flatfile Source
Derived Column (create two columns to replace 'Transaction_Date' and 'Account_Created' using the formula below)
SQL Server Destination
Notes:
A date format like '1/2/2009 6:00:00 AM' is not parsed by SSIS; make sure the Flat File Connection Manager treats the fields as strings (length > 22)
In Derived Column, you can parse '01/02/2009' with this formula:
(DT_DBDATE)(SUBSTRING([Column 2],7,4) + "-" + SUBSTRING([Column 2],4,2) + "-" + SUBSTRING([Column 2],1,2))
The date format that you currently have in the file, '1/2/2009', makes the conversion trickier due to the lack of advanced datetime parsing functions in SSIS. Because the day and month have variable lengths, you have to extract them from a variable-length string, so you will have to combine SUBSTRING with FINDSTRING to determine the positions of the '/' separators, as sketched below.
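Purely to illustrate that positional logic, here is the same idea sketched in Python (not SSIS expression syntax; in the Derived Column you would use FINDSTRING and SUBSTRING in the same way). It mirrors the day-first assumption of the formula above:

# Find the two '/' separators, slice out the parts, and rebuild an ISO date.
def to_iso_date(value: str) -> str:
    date_part = value.split(" ")[0]               # '1/2/2009 6:00:00 AM' -> '1/2/2009'
    first = date_part.index("/")                  # position of the first '/'
    second = date_part.index("/", first + 1)      # position of the second '/'
    part1 = date_part[:first].zfill(2)            # day, per the formula above
    part2 = date_part[first + 1:second].zfill(2)  # month
    year = date_part[second + 1:]
    return f"{year}-{part2}-{part1}"

print(to_iso_date("1/2/2009 6:00:00 AM"))         # '2009-02-01'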
Good luck