I have an SSIS package reading web data from Azure Data Lake through the Azure Data Lake Store Source component.
The data I am reading is large, web-based data, i.e. there is a lot of unreadable content in it.
The data is JSON and I don't want to parse it in the Source component; I am parsing it in another component (a script transformation). I just need a delimiter that tells SSIS not to try to parse the data.
All is fine for an hour or two. SSIS loads the data for many files, but then I get the error below.
Error:
Microsoft.SqlServer.Dts.Pipeline.PipelineComponentHResultException (0xC02090F5): Pipeline component has returned HRESULT error code 0xC02090F5 from a method call. at Microsoft.SqlServer.IntegrationService.AdlsComponents.PipelineComponentSource.TransferToOutputBuffers(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers)
After some investigation, I found that this is the error you get when the delimiter appears in the data.
I have tried almost every character in the ASCII table and I still get the error after some processing.
Do you have any idea:
Is there a way to bypass the delimiter?
Is there a delimiter you can recommend (maybe some control characters) that can never appear in the data?
Thanks for reading & considering
Related
The values for data.CustomerNumber do not get written to the file; the other elements do.
The CustomerNumber is Pascal case because it is a custom value in the system I am pulling the data out of.
I have tried the map complex values option and the advanced editor, deleted everything, and started over. I originally was trying to save this to an Azure SQL table but gave up and just tried to write to a file instead. I have been able to do this type of mapping in other pipelines, so I am really at a loss as to why it doesn't work in this case.
The actual JSON content for the mapping
Using SAS 9.4 M6, is there an easy way to convert a .sas7bdat file into JSON format?
This is needed inside an STP (stored process), so that the JSON can be included in some way in a PROC STREAM.
Either have a function/macro that can print the file, formatted, inside a PROC STREAM, or create a separate STP that only returns the file as JSON, so that it can be loaded in the background from a web application.
SAS can write JSON in several ways - the two easiest are:
PROC JSON - this allows you to write JSON, within some limits on structure. If the JSON you're writing is fairly simple, this is a very easy way to do it - just pass it a dataset and it will produce a single-level JSON file from it. It doesn't work as well if you have complicated structures.
See also the PROC JSON Tip sheet.
The data step - if you need something more complicated, just write it with the data step! I've written custom macros before to simplify this - things like %write_array to write an array or %start_object to start a new object - but it does get a bit complex. It is the only way to produce truly complex structures, though, beyond what PROC JSON can do.
PROC JSON is as simple as a PROC EXPORT, though... the example from the documentation says it all:
proc json out="C:\Users\sasabc\JSON\DefaultOutput.json";
  export sashelp.class;
run;
To include it in a stream, you'd either use the _webout fileref, or if you really need PROC STREAM specifically, you could probably write it to a file, read it into a big macro variable or maybe use %include, and then write it back in during the PROC STREAM execution. But I suspect that you can use _webout here and skip PROC STREAM.
Here's an example using PROC JSON with _WEBOUT in a stored procedure.
PROC JSON is great, but it can sometimes cause issues if your data contains invalid characters - in that case you might want to use this macro: https://core.sasjs.io/mp__jsonout_8sas.html
FYI, my team built an entire (open source) framework to enable rapid development of SAS-powered web applications - it's called sasjs.
You can use it to deploy both backend (data) services and even the frontend (as a streamed app). It works the same on both SAS 9 and Viya, so there are no issues with upgrades. Further info here: https://sasjs.io/resources/
I searched but didn't find anything specific to the issue I was having. My apologies if I overlooked it.
I have a scenario where I'm pulling data from an OData source and everything works fine. I have the OData source in a loop where I iterate over each of our different companies and pull the data for that company. I'm doing this to reduce the volume of data returned by each OData call.
Everything works fine as long as data is being returned from the OData call. In the attached photo you can see that the call is being made and data is being returned.
But when the OData service is called with parameters/filters that return no data, that is when the VS_NEEDSNEWMETADATA error is thrown. The image below shows what I get when no data is returned.
So the issue isn't that I have invalid metadata due to changes made to the service (adding/removing fields). It's that nothing is being returned, so there is no metadata. It's possible that this is an issue with the system I'm pulling the OData from (SAP S4) and the way that system surfaces the OData call when there is no data.
Either way, I'm trying to figure out a way to handle this within SSIS. I tried setting ValidateExternalMetadata = False, but the package still fails. I could also fix this by excluding those companies in the script, but then once data does exist I'd have to remember to update the scripts and redeploy.
Ideas/suggestions?
You've hit the nail on the head, except your metadata is changing in the no-data case because the shape is different - at least from the SSIS engine's perspective.
You basically need to invoke the OData source twice. The first invocation simply identifies whether any data is returned; only if that evaluation is true does the Data Flow start running.
How can you evaluate whether your OData call will return data? You're likely looking at a Script Task and C#/VB.NET, and hopefully the SAP results don't vary between this call and the next one.
If that's the case, then you need to define an OnError event handler for the Data Flow, set the system-scoped Propagate variable to False, and maybe even force a success status to be returned to get the loop to continue.
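To illustrate the kind of check the Script Task would do (in C#/VB.NET inside the package), here is the idea sketched in Python; the service URL, entity set, and filter are made-up placeholders, not anything from your SAP system:

# Sketch only: the same "does this call return any rows?" pre-check an SSIS
# Script Task would perform in C#/VB.NET. URL, entity set and filter are made up.
import requests

def odata_has_rows(base_url, company_code, user, password):
    """Return True if the OData call would bring back at least one row."""
    url = f"{base_url}/CompanyDataSet"  # hypothetical entity set
    params = {
        "$filter": f"CompanyCode eq '{company_code}'",  # hypothetical filter
        "$top": "1",        # one row is enough to answer the question
        "$format": "json",
    }
    response = requests.get(url, params=params, auth=(user, password), timeout=30)
    response.raise_for_status()
    payload = response.json()
    # OData v2 wraps rows in d.results; v4 uses value. Check both.
    rows = payload.get("d", {}).get("results") or payload.get("value") or []
    return len(rows) > 0

If the check comes back false, you skip the Data Flow for that company (for example via an expression on a precedence constraint) and move on to the next iteration of the loop.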
I am trying to upload JSON files to BigQuery. The JSON files are outputs from the Lighthouse auditing tool. I have made some changes to them in Python to make the field names acceptable to BigQuery and converted the format into newline-delimited JSON.
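For context, the preprocessing is essentially a rename-and-reserialize pass over each report, roughly like the sketch below; the renaming rule and file names are simplified stand-ins for what I actually run:

# Rough shape of the preprocessing step: make field names BigQuery-friendly and
# re-serialize each report as one line of newline-delimited JSON.
import json
import re

def clean_name(name):
    """Replace characters BigQuery rejects in field names with underscores."""
    cleaned = re.sub(r'[^A-Za-z0-9_]', '_', name)
    return cleaned if re.match(r'[A-Za-z_]', cleaned) else '_' + cleaned

def clean_record(value):
    """Recursively rename keys in nested dicts/lists."""
    if isinstance(value, dict):
        return {clean_name(k): clean_record(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_record(v) for v in value]
    return value

with open('lighthouse_report.json') as src, open('report.ndjson', 'w') as dst:
    report = json.load(src)
    dst.write(json.dumps(clean_record(report)) + '\n')  # one report per line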
I am now testing this process and I have found that, while the upload runs without issue for many web pages, BigQuery is rejecting some of the JSON files. The rejected JSONs always seem to be from the same website; for example, many of the audit JSONs from Topshop have failed on upload (the manipulations in Python run without issue). What confuses me is that I can see no difference in the formatting/structure of the JSONs that succeed and those that fail.
I have included some examples here of the JSON files: https://drive.google.com/open?id=1x66PoDeQGfOCTEj4l3VqMIjjhdrjqs9w
The error I get from BigQuery when a JSON fails to load is this:
Error while reading table: build_test_2f38f439_7e9a_4206_ada6_ac393e55b8ec4_source, error message: Failed to parse JSON: No active field found.; ParsedString returned false; Could not parse value; Could not parse value; Could not parse value; Could not parse value; Could not parse value; Could not parse value; Parser terminated before end of string
I have also attempted to upload the failed JSONs to a new table through the interface using the autodetect feature (in an attempt to discover whether the schema was at fault), and these uploads fail too, with the same error.
This makes me think the JSON files must be wrong, but I have copied them into several different JSON validators, all of which accept them as one row of valid JSON.
Any help understanding this issue would be much appreciated, thank you!
When you load JSON files into BigQuery, it's good to remember that there are some limitations associated with this format. You can find them here. Even though your files might be valid JSON, some of them may not comply with BigQuery's limitations, so I would recommend double-checking that they are actually acceptable to BigQuery.
I hope that helps.
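One low-effort way to do that double check is to walk the newline-delimited file and confirm that every line parses on its own and that every field name sticks to BigQuery's naming rules (letters, digits, underscores, not starting with a digit). A sketch, with an example file name:

# Sanity check for a newline-delimited JSON file before loading it to BigQuery:
# flag lines that don't parse and field names BigQuery would reject.
import json
import re

VALID_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

def bad_names(value, path=''):
    """Yield the paths of field names that break BigQuery's naming rules."""
    if isinstance(value, dict):
        for key, child in value.items():
            if not VALID_NAME.match(key):
                yield f'{path}.{key}'
            yield from bad_names(child, f'{path}.{key}')
    elif isinstance(value, list):
        for child in value:
            yield from bad_names(child, path)

with open('report.ndjson') as f:   # example file name
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            print(f'line {line_number}: does not parse: {err}')
            continue
        for field in bad_names(record):
            print(f'line {line_number}: invalid field name {field}')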
I eventually found the error through a long trial-and-error process where I uploaded first the first half and then the second half of the JSON file to BigQuery. The second half failed, so I split that in half again to see which part the error occurred in. This continued until I found the offending line.
At a deep level of nesting there was a situation where one field was always a list of strings, but when there were no values associated with the field it appeared as an empty string (rather than an empty list). This inconsistency was causing the error. The trial-and-error process was long, but given the vague error message and the fact that the JSON was thousands of lines long, this seemed like the most efficient way to get there.
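In hindsight, a small scan of the file would have found the inconsistency much faster than splitting the file by hand; something like the sketch below, where the field path is just a stand-in for the real one:

# Scan a newline-delimited JSON file for records where a field that should be a
# list is something else (e.g. an empty string). 'audits.items' is a made-up
# path, not the actual Lighthouse field that was at fault.
import json

def get_path(record, path):
    """Walk a dotted path through nested dicts; return None if any step is missing."""
    current = record
    for part in path.split('.'):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

with open('report.ndjson') as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue
        value = get_path(json.loads(line), 'audits.items')
        if value is not None and not isinstance(value, list):
            print(f'line {line_number}: expected a list, got {type(value).__name__}')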
I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.
I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system generates JSON with a non-standard key:
{
  "id": 12345,
  "#address": "XXXXXX"
  ...
}
Of course the "#address" key throws everything off and the cloud function errors out ...
Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could disable autodetect and provide my own mapping, but the online documentation does not cover this situation.
I am contemplating the option of:
Loading the file into memory as a string variable
Replacing #address with address
Converting the newline-delimited JSON into a list of dictionaries
Using a BigQuery streaming insert to insert the rows into BQ
But I'm afraid this will take a lot longer, that the file size may exceed the 2 GB maximum for functions, that I'll have to deal with Unicode when loading the file into a variable, etc. etc. etc. A rough sketch of that plan is below.
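For reference, a minimal sketch of that plan, assuming the google-cloud-storage and google-cloud-bigquery client libraries and made-up bucket/dataset/table names:

# Minimal sketch of the plan above (bucket, dataset and table names are made up):
# download the file, rename the offending key, and stream the rows into BigQuery.
import json
from google.cloud import bigquery, storage

def load_cleaned(bucket_name, blob_name):
    # 1. Load the file into memory as a string.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    content = blob.download_as_text()

    # 2./3. Rename "#address" to "address" and parse each line into a dict.
    rows = [json.loads(line.replace('"#address"', '"address"'))
            for line in content.splitlines() if line.strip()]

    # 4. Stream the rows into BigQuery.
    client = bigquery.Client()
    errors = client.insert_rows_json('my_dataset.legacy_table', rows)  # hypothetical table
    if errors:
        raise RuntimeError(errors)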
What other options do I have?
And no, I cannot modify the legacy system to rename the "#address" field :(
Thanks!
I'm going to assume the error that you are getting is something like this:
Errors: query: Invalid field name "#address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
This is an error message on the BigQuery side, because cols/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.
Here's one way of doing it, which is completely serverless:
Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. The function simply passes the name of the file to process to the pipeline.
In said Cloud Dataflow pipeline, read the JSON file into a ParDo and, using a JSON parsing library (e.g. Jackson if you are using Java), read each object and get rid of the "#" before creating your output TableRow object (a sketch of this step follows the summary below).
Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
To sum up, you'll need the following in the conga line:
File > GCS > Cloud Function > Dataflow (template) > BigQuery
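For illustration, here is roughly what that pipeline could look like using the Beam Python SDK instead of the Java/Jackson route described above; the bucket, file, and table names are made up:

# Sketch of the Dataflow step above with the Beam Python SDK (names are made up):
# read the newline-delimited JSON, strip the leading '#' from top-level keys,
# and write the cleaned rows to BigQuery.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_keys(record):
    """Drop the leading '#' from any top-level key so BigQuery accepts it."""
    return {key.lstrip('#'): value for key, value in record.items()}

def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ReadNDJSON' >> beam.io.ReadFromText('gs://my-bucket/incoming/file.json')
         | 'ParseJSON' >> beam.Map(json.loads)
         | 'CleanKeys' >> beam.Map(clean_keys)
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.my_table',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

if __name__ == '__main__':
    run()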
The advantages of this:
Event driven
Scalable
Serverless/no-ops
You get monitoring and alerting out of the box with Stackdriver
Minimal code
See:
Reading nested JSON in Google Dataflow / Apache Beam
https://cloud.google.com/dataflow/docs/templates/overview
https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/
Disclosure: the last link is to a blog post written by one of the engineers I work with.