U-SQL: Schematized input files - cortana-intelligence

How can I use schematized input files in a U-SQL script? That is, how can I use multiple files as input to an EXTRACT clause?
According to
https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx?f=255&MSPPError=-2147217396
and
https://social.msdn.microsoft.com/Forums/en-US/0ad563d8-677c-46e7-bb3e-e1627025f2e9/read-data-from-multiple-files-and-folder-using-usql?forum=AzureDataLake&prof=required
I tried both
@rs =
EXTRACT s_type string, s_filename string
FROM "/Samples/logs/{s_filename:*}.txt"
USING Extractors.Tsv();
and
@rs =
EXTRACT s_type string
FROM "/Samples/logs/{*}.txt"
USING Extractors.Tsv();
Both versions result in an error message complaining about '*' being an invalid character.

File sets are not supported in local runs so far. They will work when you run the script on a cloud Azure Data Lake Analytics account.

Related

Retrieve CSV format over https

I'm retrieving data in CSV format from a web service using an HTTP request.
The data can contain a web address with parameters in JSON format ( https://example.com?parm=abc&opt={"key1":"value1","key2":"value2"} ).
The comma within the JSON string causes the subsequent data processing in Spark to mess up the data.
The web service providing the data does not allow changing the CSV delimiter.
The resulting data in the CSV file has doubled double quotes inside, like
' "https....""key1"":""value1""... '
Are there any options in the HTTP protocol to 'correctly' transport/quote the data, or is this rather a Spark issue?
I used Postman to analyse the 'look and feel' of the delivered data.
I found the solution in Spark:
spark.read.load(file, format='csv', header=True, quote='"', escape='"')
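For reference, a self-contained PySpark sketch of the same fix; the session setup and the local file path are placeholders, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-with-embedded-json").getOrCreate()

# The doubled double quotes ("") inside a quoted field are standard CSV escaping,
# so declaring '"' as both the quote and the escape character lets Spark keep the
# whole JSON value, commas included, in a single column.
df = spark.read.load(
    "/tmp/service_response.csv",  # hypothetical path to the downloaded response
    format="csv",
    header=True,
    quote='"',
    escape='"',
)
df.show(truncate=False)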

SQL COPY INTO is unable to parse a CSV due to unexpected line feeds in some fields. ROWTERMINATOR and FIELDQUOTE parameters do not work

I have a CSV file in Azure Data Lake that, when opened with Notepad++, looks something like this:
a,b,c
d,e,f
g,h,i
j,"foo
bar,baz",l
Upon inspection in Notepad++ (view all symbols) it shows me this:
a,b,c[CR][LF]
d,e,f[CR][LF]
g,h,i[CR][LF]
j,"foo[LF]
[LF]
bar,baz",l[CR][LF]
That is to say, normal Windows carriage return and line feed characters after each row, with the exception that in one of the columns someone inserted a multi-line value like this:
foo
bar, baz
My T-SQL code to ingest the CSV looks like this:
COPY INTO dbo.SalesLine
FROM 'https://mydatalakeblablabla/folders/myfile.csv'
WITH (
    ROWTERMINATOR = '0x0d', -- Tried \n, \r\n and 0x0d0a here
    FILE_TYPE = 'CSV',
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ',',
    CREDENTIAL = (IDENTITY = 'Managed Identity') -- Used to access the data lake
)
But the query doesn't work. The common error message in SSMS is:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (NAME) in data file
I have no option to correct the faulty rows in the data lake or modify the CSV in any way.
Obviously it is a much larger file with real data, but I took a simple example.
How can I modify or rewrite the T-SQL code to correct the CSV while it is being read?
I recreated a similar file and uploaded it to my data lake, and a serverless SQL pool seemed to manage it just fine:
SELECT *
FROM OPENROWSET(
    BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result]
It probably seems like a bit of a workaround, but if the improved parser in serverless makes light work of problems like this, why not make use of the whole suite that is Azure Synapse Analytics? You could use the serverless query as a source in a Copy activity in a Synapse pipeline and load it into your dedicated SQL pool; that would have the same outcome as using the COPY INTO command.
In the past I've written special parsing routines, loaded the file up as one column and split it in the database, or used regular expressions, but if there's a simple solution, why not use it?
I viewed my test file via an online hex editor; maybe I'm missing something.
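If the serverless parser is not an option, one of the 'special parsing routines' mentioned above can be sketched in Python before COPY INTO ever sees the file: the csv module understands line breaks inside quoted fields, so the offending records can be flattened onto single lines. The file paths are hypothetical and this assumes the file can be processed locally:

import csv

# Hypothetical local copies of the raw and cleaned files.
src_path = "myfile.csv"
dst_path = "myfile_clean.csv"

with open(src_path, newline="", encoding="utf-8") as src, \
     open(dst_path, "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)  # understands [LF] inside quoted fields
    writer = csv.writer(dst, lineterminator="\r\n")
    for row in reader:
        # Flatten any line breaks that were inside a quoted field so every
        # logical record ends up on exactly one [CR][LF]-terminated line.
        writer.writerow([field.replace("\r", " ").replace("\n", " ") for field in row])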

Replace space of multiple files inside Google Cloud Storage

I have thousands of JSON files on Google Cloud Storage, but they have a specific field name (campaign name)
with a space, and before loading them (or creating an external table) in BigQuery I need to replace the space with an underscore (campaign_name). I'm getting the following error when I try to create the table without the replace:
Error in query string: Illegal field name: campaign name Table: raw_km_all_data
Is there any other solution that does not involve downloading all the files to a server, doing the replace, and then uploading them again to Cloud Storage?
Thanks!
You can pretend that these JSON files are CSVs with a single column containing one big string. Once they are loaded into BigQuery as a single-column table, use the REPLACE or REGEXP_REPLACE functions to replace the spaces with underscores, and then use the JSON_EXTRACT family of functions to parse the JSON and populate a table with real columns.
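A rough sketch of that approach with the google-cloud-bigquery Python client; the bucket path, project, dataset and staging table names are hypothetical, and only the campaign name field comes from the question:

from google.cloud import bigquery

client = bigquery.Client()

# 1. Load every JSON file as a one-column CSV so BigQuery never has to parse
#    the illegal field name. A delimiter that cannot occur in the data and an
#    empty quote character keep each line in a single STRING column.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\x01",  # assumed not to appear in the JSON
    quote_character="",
    schema=[bigquery.SchemaField("raw", "STRING")],
)
client.load_table_from_uri(
    "gs://my-bucket/raw/*.json",             # hypothetical bucket/path
    "my_project.staging.raw_km_all_data",    # hypothetical staging table
    job_config=job_config,
).result()

# 2. Fix the field name in SQL and extract real columns from the JSON.
#    Add a destination table (or CREATE TABLE AS) to persist the result.
client.query("""
    SELECT
      JSON_EXTRACT_SCALAR(fixed, '$.campaign_name') AS campaign_name
    FROM (
      SELECT REPLACE(raw, '"campaign name"', '"campaign_name"') AS fixed
      FROM `my_project.staging.raw_km_all_data`
    )
""").result()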

Why are Dataset columns not recognised with CSV on Azure Machine Learning?

I have created a CSV file with no header.
It's 1496 rows of data, in 2 columns of the form:
Real; String
example:
0.24; "Some very long string"
I go to New - Dataset - From local file,
pick my file, and choose the 'No header' CSV format.
But after it's done loading I get an error message I can't decipher:
Dataset upload failed. Internal Service Error. Request ID:
ca378649-009b-4ee6-b2c2-87d93d4549d7 2015-06-29 18:33:14Z
Any idea what is going wrong?
At this time Azure Machine Learning only accepts comma (,) separated, American-style CSV.
You will need to convert your file to a comma-separated CSV.
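A minimal conversion sketch in Python with pandas, assuming the hypothetical file names below and the semicolon-plus-space layout from the question:

import pandas as pd

# Hypothetical file names; the input is the semicolon-delimited file described above.
df = pd.read_csv("dataset_semicolon.csv", sep=";", header=None,
                 names=["value", "text"], skipinitialspace=True)

# Write a plain comma-separated file; pandas quotes the string column wherever it
# contains a comma, which is the format Azure Machine Learning expects.
df.to_csv("dataset_comma.csv", index=False, header=False)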

Getting multiple values from a single cell in a .CSV file in JMeter

How can I read multiple values from a single cell in a CSV file in JMeter? I have an Excel sheet as .csv input, and one of its columns has mobile numbers with 2 or more values, e.g.
987#765#456. Which sampler should I use?
I want it split at # as 987,765,456.
To read the CSV file in JMeter, use the CSV Data Set Config.
Check this link to understand how to use the CSV Data Set Config in JMeter.
Let's assume the column name is mobileNo, which has the value 987#765#456.
Use a Beanshell PreProcessor to replace '#' with ','.
mobileNo = vars.get("mobileNo");
mobileNo = mobileNo.replace("#", ",");
vars.put("mobileNo",mobileNo);
You can use JMeter's __javaScript function to replace all occurrences of # with , as follows:
Given that your 987#765#456 bit lives as ${mobileNumber} JMeter Variable:
${__javaScript("${mobileNumber}".split('#').join('\,'),mobileNumber)}
The script above replaces all '#' signs with commas and stores the result in the mobileNumber JMeter Variable.
To learn more about different JMeter functions, refer to the How to Use JMeter Functions post series.