My source is an Access Database.
I am dynamically generating the source query as 'Select * from <tableName>'.
But the source table has field names with spaces, and since the destination is of type .parquet, the Data Factory pipeline is failing with the error below.
For example, the table Employee has a column 'First Name':
{
"errorCode": "2200",
"message": "Failure happened on 'Sink' side. ErrorCode=UserErrorJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java,
message: java.lang.IllegalArgumentException:field ended by ';': expected ';' but got 'Area' at line 0: message adms_schema { optional binary RAM Area\ntotal entry:10\r
\norg.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:215)\r\norg.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:188)\r
\norg.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:112)\r\norg.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:100)\r
\norg.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:93)\r\norg.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:83)\r
\ncom.microsoft.datatransfer.bridge.parquet.ParquetWriterBuilderBridge.getSchema(ParquetWriterBuilderBridge.java:187)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetWriterBuilderBridge.build
(ParquetWriterBuilderBridge.java:159)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetWriterBridge.open(ParquetWriterBridge.java:13)\r
\ncom.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createWriter(ParquetFileBridge.java:27)\r
\n,Source=Microsoft.DataTransfer.Common,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
"failureType": "UserError",
"target": "Copy Table Data to Sink",
"details": []
}
If I change the query to SELECT [First Name] AS FirstName FROM Employee, it works fine.
As I am generating the query dynamically, I was using '*'.
Is there some setting on the sink (.parquet) to ignore spaces in column names?
EDIT: some related info here https://issues.apache.org/jira/browse/SPARK-4521, but I am not sure how to deal with it in ADF.
And this link: https://github.com/MicrosoftDocs/azure-docs/issues/28320
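For illustration only (the extra column names below are hypothetical), the aliased form that works has to be applied to every column that contains a space, so a dynamically generated query would need to emit something like:

SELECT [First Name] AS FirstName,
       [Last Name] AS LastName,
       [Hire Date] AS HireDate
FROM Employee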
Related
I have data coming into S3 from Mixpanel, and Mixpanel adds a '$' character before some event properties. Sample:
"event": "$ae_session",
"properties": {
"time": 1646816604,
"distinct_id": "622367f395dd06c26f311c46",
"$ae_session_length": 17.2,
"$app_build_number": "172",
"$app_release": "172",...}
As the '$' special character is not supported in Athena, I need some way of escaping it to proceed from here. I would really appreciate any help regarding this.
The error I am getting in subsequent DML queries after creating my DDL table:
HIVE_METASTORE_ERROR: Error: name expected at the position 262 of
'struct<distinct_id:string,
sheetid:string,
addedUserId:string,
memberId:string,
communityId:string,
businessId:string,
time:timestamp,
communityBusinessType:string,
initialBusinessType:string,
sheetRowIndex:string,
dataType:varchar(50),
screenType:varchar(50),
rowIndex:int,
$ae_session_length:int>' but '$' is found.
(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
Since I cannot change the column names, as they are populated directly from Mixpanel on a daily interval, I really think there should be a workaround for this somehow!
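One workaround that is often suggested for '$'-prefixed keys (not verified here, and worth checking against the SerDe documentation, especially for keys nested inside properties) is the OpenX JSON SerDe with its mapping SerDe properties, which map a '$'-free column or struct field name to the original JSON key. A minimal sketch, with a hypothetical table name, a trimmed field list, and a placeholder S3 location:

CREATE EXTERNAL TABLE mixpanel_events (
  event string,
  properties struct<
    distinct_id:string,
    time:bigint,
    ae_session_length:double,
    app_build_number:string,
    app_release:string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- map each '$'-free field name back to the '$'-prefixed Mixpanel key
WITH SERDEPROPERTIES (
  'mapping.ae_session_length' = '$ae_session_length',
  'mapping.app_build_number' = '$app_build_number',
  'mapping.app_release' = '$app_release'
)
LOCATION 's3://your-bucket/mixpanel/';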
I get the following error when trying to split a column by a space delimiter in Power Query in Data Factory:
UserQuery : Expression.Error: An error occurred invoking 'Table.AddColumn': We can't get the expression for the specified value.
What is causing this and how would I go about resolving it?
Many thanks
The Power Query itself is:
let
Source = dedupedetipscsv,
#"Split Column by Delimiter" = Table.SplitColumn(Source, "Candidate", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, true), {"Candidate.1", "Candidate.2"}),
#"Split Column by Delimiter1" = Table.SplitColumn(Table.TransformColumnTypes(#"Split Column by Delimiter", {{"ApprovedDate", type text}}, "en-GB"), "ApprovedDate", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, true), {"ApprovedDate.1", "ApprovedDate.2"})
in
#"Split Column by Delimiter1"
Note: Power Query will split the column into as many columns as needed. The name of the new columns will contain the same name as the original column. A suffix that includes a dot and a number that represents the split sections of the original column will be appended to the name of the new columns.
Your Table.AddColumn step might be referring to a variable that is a list. You need to refer to the last step that results in a table, for example #"Renamed Columns".
See the documentation on splitting columns by delimiter into columns.
Note: an alternative for splitting by length and by position is shown in the M script workarounds below.
// Split by length: take the first 7 characters of the [Email] column
Table.AddColumn(Source, "First characters", each Text.Start([Email], 7), type text)
// Split by position: take 9 characters starting at offset 4 of the [Email] column
Table.AddColumn(#"Inserted first characters", "Text range", each Text.Middle([Email], 4, 9), type text)
I am trying to load the AWS ip ranges from their provided API as seen in the query below. If I use the proper url "https://ip-ranges.amazonaws.com/ip-ranges.json", I get the error message shown below. But if I run the same query for the json hosted at my own url "https://raw.githubusercontent.com/anandmudgerikar/aws-ips/main/ip-ranges.json", the query works fine. Any idea what might be happening? Thanks in advance.
Kusto Query:
externaldata(syncToken:string, createDate:string, prefixes: dynamic , ipv6_prefixes: dynamic)
[
h@'https://ip-ranges.amazonaws.com/ip-ranges.json'
//h@'https://api.github.com/users/anandmudgerikar/repos'
// h@'https://reqres.in/api/product/3'
//h@'https://www.dropbox.com/s/24117ufuyfanmew/ip-ranges.json'
//h@'https://raw.githubusercontent.com/anandmudgerikar/aws-ips/main/ip-ranges.json'
]
with(format= 'multijson', ingestionMapping=
'[{"Column":"syncToken","Properties":{"Path":"$.syncToken"}},'
'{"Column":"createDate","Properties":{"Path":"$.createDate"}},'
'{"Column":"prefixes","Properties":{"Path":"$.prefixes"}},'
'{"Column":"ipv6_prefixes","Properties":{"Path":"$.ipv6_prefixes"}}]')
This is the error message I get:
Query execution has resulted in error (0x80004003): Partial query failure: Invalid pointer (message: 'Argument 'name' is null: at .ctor in C:\source\Src\Common\Kusto.Cloud.Platform.Azure\Storage\PersistentStorage\BlobPersistentStorageFile.cs: line 55
Parameter name: name: ', details: 'Source: Kusto.Cloud.Platform
System.ArgumentNullException: Argument 'name' is null: at .ctor in C:\source\Src\Common\Kusto.Cloud.Platform.Azure\Storage\PersistentStorage\BlobPersistentStorageFile.cs: line 55
Parameter name: name
at Kusto.Cloud.Platform.Utils.Ensure.FailNullOrEmpty(String value, String argName, String callerMemberName, String callerFilePath, Int32 callerLineNumber) in C:\source\Src\Common\Kusto.Cloud.Platform\Diagnostics\Ensure.cs:line 150
at Kusto.Cloud.Platform.Azure.Storage.PersistentStorage.BlobPersistentStorageFile..ctor(CloudBlobContainer blobContainer, String name, IPersistentStorageFileCompressor persistentStorageFileCompressor, IPersistentStorageUri persistentStorageUri, TriState validBlobStorage, FileKnownMetadata knownMetadata) in C:\source\Src\Common\Kusto.Cloud.Platform.Azure\Storage\PersistentStorage\BlobPersistentStorageFile.cs:line 56
at Kusto.Cloud.Platform.Azure.Storage.PersistentStorage.BlobPersistentStorageFactory.CreateFileRef(String uri, IKustoTokenCredentialsProvider credentialsProvider, String compressionType, IPersistentStorageFileCompressorFactory persistentStorageFileCompressorFactory, StorageItemLocationMode locationMode, FileKnownMetadata knownMetadata) in C:\source\Src\Common\Kusto.Cloud.Platform.Azure\Storage\PersistentStorage\BlobPersistentStorageFactory.cs:line 214
at Kusto.Cloud.Platform.Storage.PersistentStorage.PersistentStorageFactoryFactory.CreateFileRef(String uri, IKustoTokenCredentialsProvider credentialsProvider, String compressionType, IPersistentStorageFileCompressorFactory persistentStorageFileCompressorFactory, StorageItemLocationMode locationMode, FileKnownMetadata knownMetadata) in C:\source\Src\Common\Kusto.Cloud.Platform\Storage\PersistentStorage\PersistentStorageFactoryFactory.cs:line 154
at Kusto.DataNode.DataEngineQueryPlan.ExternalDataQueryUtils.ReadExternalDataAsCsv(ArtifactEntry artifactEntry, DataSourceStreamFormat format, ExternalDataQueryCallbackContext callbackContext, Int64 recordCountLimit, String& columnsMapping) in C:\source\Src\Engine\DataNode\QueryService\DataEngineQueryPlan\ExternalDataQueryUtils.cs:line 101
at Kusto.DataNode.DataEngineQueryPlan.DataEngineQueryProcessor.DataEngineQueryCallback.GetExternalData(String externalDataUri, DataSourceStreamFormat format, String serializedCallbackContext, Int64 recordCountLimit) in C:\source\Src\Engine\DataNode\QueryService\DataEngineQueryPlan\DataEngineQueryProcessor.cs:line 399').
clientRequestId: KustoWebV2;71c590e6-27ab-41c9-bf3c-03ba4ee0cf3b
Only storage connection strings that are documented (see the Kusto storage connection strings documentation) are officially supported; there are no guarantees for others.
To comply with that, you can, for example, copy the file from its original location to an Azure blob container and query it from there.
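As a sketch of that approach (the storage account, container, and credential below are placeholders), after copying ip-ranges.json into your own blob container the same query can point at the blob URI:

externaldata(syncToken:string, createDate:string, prefixes:dynamic, ipv6_prefixes:dynamic)
[
    // placeholder URI; append an account key or SAS token in one of the documented formats
    h@'https://<storageaccount>.blob.core.windows.net/<container>/ip-ranges.json;<account key or SAS>'
]
with(format='multijson', ingestionMapping=
    '[{"Column":"syncToken","Properties":{"Path":"$.syncToken"}},'
    '{"Column":"createDate","Properties":{"Path":"$.createDate"}},'
    '{"Column":"prefixes","Properties":{"Path":"$.prefixes"}},'
    '{"Column":"ipv6_prefixes","Properties":{"Path":"$.ipv6_prefixes"}}]')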
I have millions of files with the following (poor) JSON format:
{
"3000105002":[
{
"pool_id": "97808",
"pool_name": "WILDCAT (DO NOT USE)",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
},
{
"pool_id": "96838",
"pool_name": "DRY & ABANDONED",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
}]
}
I've tried to generate an Athena DDL that would accommodate this type of structure (especially the api field) with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The api key is not actually within any of the files, so I hope there is a way that Athena can accommodate just the string-type value for these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use a Create-Table-As-Select (CTAS) query that will convert the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
external_location = 's3://path/to/ctas_partitioned/',
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT
regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
regexp_extract(line, ' "pool_name": "([^"])",', 1) as pool_name,
...
FROM json_lines_table;
You will also improve the performance of queries against the new table, as you are using the Parquet format.
Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
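The CTAS above reads from json_lines_table; a minimal sketch of such a staging table, assuming each raw JSON file is exposed line by line through a single string column (the location is a placeholder):

CREATE EXTERNAL TABLE json_lines_table (
  line string
)
LOCATION 's3://foo/';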
I am using Parasoft SOAtest to test a service response and I got a failure:
Message: DataSource: products (row 1): Value Assertion: For element "../item", expected: abc but was: bcd
My requirement is to validate the following response:
{
"samples" : {
"prds" : [
"abc",
"bcd"
]
}
}
And I have a data source table as follows, with the first row as the column name:
prds
abc
bcd
In SOAtest I have a JSON Assertor, and inside the JSON Assertor I have configured a Value Assertion. In the Value Assertion I selected the first item, and in the next step I selected Apply to all "item[*]". Then Finish.
For the Expected Value I selected Parameterized and chose prds from the drop-down menu.
When the service returns the above payload, the assertion fails with the message given above.
Is this a bug/limitation of SOAtest, or am I missing some step here?
I believe this is just because you opted for Apply to all "item[*]" instead of Apply to "item[1]" only. With Apply to all, every element of the array is compared against the same expected value from the current data source row, so for row 1 the second element ("bcd") is compared against the expected value "abc" and the assertion fails.