Merging dataset results in an Azure Data Factory pipeline - json

I am reading a JSON-formatted blob from Azure Storage. I am then using one of the values in that JSON to query a database to get more information. What I need to do is take the JSON from the blob, add the fields from the database to it, then write that combined JSON to another Azure Storage. I cannot, however, figure out how to combine the two pieces of information.
I have tried custom mapping in the copy activity for the pipeline. I have tried parameterized datasets, etc. Nothing seems to provide the results I'm looking for.
Is there a way to accomplish this using native activities and parameters (i.e. not by writing a simple utility and executing it as a custom activity)?

For this I would recommend creating a custom U-SQL job to do what you want: first look up both pieces of data you need, combine them in the U-SQL job, and copy the results to Azure Storage.
If you are not familiar with U-SQL, this can help you:
https://saveenr.gitbooks.io/usql-tutorial/content/
These will also help you work with JSON in your job:
https://www.taygan.co/blog/2018/01/09/azure-data-lake-series-working-with-json-part-2
https://www.taygan.co/blog/2018/03/02/azure-data-lake-series-working-with-json-part-3
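For clarity, the data flow being described boils down to the following lookup-and-merge; this is only a rough Python sketch of that flow (the U-SQL job would express the equivalent join), with placeholder connection strings, container names, table, and an assumed `id` field:

```python
# Rough sketch only: read a JSON blob, enrich it from a database, write it back.
# Assumes azure-storage-blob (v12) and pyodbc; connection strings, container
# names, the `id` field, and the Customers table are all placeholders.
import json

import pyodbc
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")

# 1. Read the JSON document from the source blob.
source = blob_service.get_blob_client("source-container", "input.json")
doc = json.loads(source.download_blob().readall())

# 2. Use one of its values to look up the extra fields in the database.
conn = pyodbc.connect("<sql-connection-string>")
cursor = conn.cursor()
cursor.execute("SELECT Name, Region FROM dbo.Customers WHERE CustomerId = ?", doc["id"])
row = cursor.fetchone()
extra = dict(zip([col[0] for col in cursor.description], row)) if row else {}
conn.close()

# 3. Merge the two and write the combined JSON to the destination blob.
doc.update(extra)
destination = blob_service.get_blob_client("destination-container", "output.json")
destination.upload_blob(json.dumps(doc), overwrite=True)
```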

Related

What's the best way to deploy predefined data via API calls?

I have several JSON files that represent the payloads for different APIs (I can map which API to call based on the file name, but other methods could be applied as well).
What is the best practice for populating the application's data from those JSON files?
My first thought was to use an automation framework (REST Assured, for example) to accomplish this, but I think it might be overkill for my scenario.
P.S. A snapshot of the DB or querying the DB directly is not an option because of the nature of the application.
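If a full test framework feels like overkill, a small script is often enough for this kind of mapping. Here is a minimal sketch assuming Python with requests; the base URL and the file-name-to-endpoint mapping are placeholder assumptions:

```python
# Minimal sketch: post each JSON payload file to the API mapped from its
# file name. The endpoint mapping and base URL are placeholder assumptions.
import json
from pathlib import Path

import requests

BASE_URL = "https://example.com/api"
# File name (without extension) -> API path; adjust to your own mapping rule.
ENDPOINTS = {
    "users": "/users",
    "products": "/products",
}

def deploy(payload_dir: str) -> None:
    for path in sorted(Path(payload_dir).glob("*.json")):
        endpoint = ENDPOINTS.get(path.stem)
        if endpoint is None:
            print(f"skipping {path.name}: no API mapped")
            continue
        payload = json.loads(path.read_text())
        response = requests.post(f"{BASE_URL}{endpoint}", json=payload, timeout=30)
        response.raise_for_status()
        print(f"deployed {path.name} -> {endpoint} ({response.status_code})")

if __name__ == "__main__":
    deploy("payloads")
```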

BigQuery to GCS JSON

I wanted to be able to store BigQuery results as JSON files in Google Cloud Storage. I could not find an out-of-the-box way of doing this, so what I had to do was:
1. Run the query against BigQuery and store the results in a permanent table. I use a random GUID to name the permanent table.
2. Read the data from BigQuery, convert it to JSON in my server-side code, and upload the JSON data to GCS.
3. Delete the permanent table.
4. Return the URL of the JSON file in GCS to the front-end application.
While this works, there are some issues with it.
A. I do not believe I am making use of BigQuery's caching, since I am using my own permanent tables. Can someone confirm this?
B. Step 2 will be a performance bottleneck. Pulling data out of GCP just to convert it to JSON and re-upload it into GCP feels wrong. A better approach would be to use a cloud-native serverless function, or some other GCP data-workflow service triggered on creation of a new table in the dataset, to do this step. What do you think is the best way to achieve this?
C. Is there really no way to do this without using permanent tables?
Any help appreciated. Thanks.
With a permanent table, you are able to leverage BigQuery's data export to write the table to GCS in JSON format. This has no cost, compared with reading the table from your server-side code.
There is also a way to avoid creating a permanent table, because every query result is already stored in a temporary table. If you go to "Job Information" you can find the full name of that temporary table, which can be used in the data export to be exported as JSON to GCS. However, this is more complicated than simply creating a permanent table and deleting it afterwards.
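For reference, here is a minimal sketch of that export path with a named destination table, using the google-cloud-bigquery Python client; the project, dataset, bucket and query below are placeholders:

```python
# Minimal sketch: query into a named table, export it to GCS as JSON, clean up.
# Project, dataset, bucket, and query are placeholders.
import uuid

from google.cloud import bigquery

client = bigquery.Client()

# 1. Run the query into a named destination table (the "permanent table with
#    a random GUID" step).
table_id = f"my-project.my_dataset.result_{uuid.uuid4().hex}"
query_config = bigquery.QueryJobConfig(destination=table_id)
client.query(
    "SELECT name, state FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 100",
    job_config=query_config,
).result()

# 2. Export the table straight to GCS as newline-delimited JSON, instead of
#    pulling rows into server-side code.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
destination_uri = "gs://my-bucket/exports/result.json"
client.extract_table(table_id, destination_uri, job_config=extract_config).result()

# 3. Clean up the intermediate table.
client.delete_table(table_id)
```

The second suggestion above amounts to swapping the destination in step 1 for the temporary table name shown under "Job Information".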

(Azure) Data Factory to Data warehouse - Dynamically name the landing tables and schemas

I plan to move data from a number of databases periodically using Azure Data Factory (ADF), and I want to move the data into Azure Parallel Data Warehouse (APDW). However, the 'destination' step in the ADF wizard offers me two options: 1) where the data is retrieved from a view, you are expected to map the columns to an existing table, and 2) where the data comes from a table, you are expected to generate a table object in the APDW.
Realistically, this is too expensive to maintain, and it is possible to erroneously map source data to a landing zone.
What I would like to achieve is an algorithmic approach, using variables to name schemas, customer codes and tables.
After the source data has landed I will be transforming it using our SSIS Integration Runtime. I am also wondering whether an SSIS package could request the source data instead of an ADF pipeline.
Are there any resources about connecting to on-premises IRs through SSIS objects?
Can the JSON of an ADF be modified to dynamically generate a schema for each data source?
For your second question, 'Can the JSON of an ADF be modified to dynamically generate a schema for each data source?':
You could put your table-generation script in the copy activity sink's pre-copy script.

Nativescript store JSON data in sqlite

I have some JSON data coming in from an API that I want to store in a NativeScript app.
Is there a simple way to store it in an SQLite database?
Currently, I am using loops to iterate over the data and store it as rows in SQLite.
I have tried using application-settings, as seen here: http://docs.nativescript.org/api-reference/modules/_application_settings_.html
I plan to store contact details for thousands of people, so which is the best way to go about it?
Kindly let me know of any other ways to handle JSON data.
Disclaimer: I'm the author of both nativescript-sqlite and nativescript-localstorage.
SQLite is very useful if you need to run searches and SQL queries against the data: unions, filtering, etc.
However, if all you need to do is store the data as-is (i.e. like a NoSQL database), you can use my nativescript-localstorage plugin to store the data as an object and then re-load it when you need it.

Dynamic JSON file vs API

I am designing a system with 30,000 objects or so and can't decide between two options: either have a JSON file precomputed for each one and get the data by pointing to the URL of the file (I think Twitter does something similar), or have a PHP/Perl/whatever script that produces the JSON object on the fly when requested, from, let's say, a database, and sends it back. Is one approach more suitable than the other? I guess if it takes a long time to generate the JSON data, it is better to have the JSON files already generated. What if generating is as quick as accessing the database? Although I suppose one would have a dedicated table in the database specifically for that. The data doesn't change very often, so updating is not a constant thing; in that respect the data is static for all intents and purposes.
Anyway, any thoughts would be much appreciated!
Alex
You might want to try MongoDB, which returns objects as JSON and is highly scalable and easy to set up.
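As a rough illustration of the "generate on request" option backed by a database, here is a minimal sketch assuming a Python/Flask endpoint with pymongo; the database, collection, and `slug` field are placeholders:

```python
# Minimal sketch: serve stored JSON objects on request from MongoDB.
# Assumes Flask and pymongo; database/collection/field names are placeholders.
from flask import Flask, abort, jsonify
from pymongo import MongoClient

app = Flask(__name__)
collection = MongoClient("mongodb://localhost:27017")["catalog"]["objects"]

@app.route("/objects/<slug>")
def get_object(slug):
    # MongoDB stores the documents as BSON and hands them back as dicts,
    # so there is no per-object JSON file to precompute.
    doc = collection.find_one({"slug": slug}, {"_id": 0})
    if doc is None:
        abort(404)
    return jsonify(doc)

if __name__ == "__main__":
    app.run()
```

Whether this beats precomputed files mostly comes down to how cheap the lookup is; for rarely-changing data, either approach can also sit behind an HTTP cache.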