BigQuery to GCS JSON

I wanted to be able to store BigQuery results as JSON files in Google Cloud Storage. I could not find an out-of-the-box way of doing this, so this is what I had to do (sketched in code below):
1. Run the query against BigQuery and store the results in a permanent table. I use a random GUID to name the permanent table.
2. Read the data from BigQuery, convert it to JSON in my server-side code, and upload the JSON data to GCS.
3. Delete the permanent table.
4. Return the GCS URL of the JSON file to the front-end application.
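A rough sketch of that workflow, assuming the google-cloud-bigquery and google-cloud-storage Python clients; all project, dataset, and bucket names below are placeholders, not taken from the original post:

from google.cloud import bigquery, storage
import json
import uuid

bq = bigquery.Client()
gcs = storage.Client()

# 1. Run the query into a permanent table named with a random GUID.
table_ref = bigquery.TableReference.from_string(
    f"my_project.my_dataset.result_{uuid.uuid4().hex}"
)
job_config = bigquery.QueryJobConfig(destination=table_ref)
bq.query("SELECT name, value FROM `my_project.my_dataset.source`",
         job_config=job_config).result()

# 2. Read the table, convert the rows to JSON, and upload the file to GCS.
rows = [dict(row) for row in bq.list_rows(table_ref)]
blob = gcs.bucket("my-bucket").blob(f"results/{uuid.uuid4().hex}.json")
blob.upload_from_string(json.dumps(rows, default=str),  # default=str handles DATE/TIMESTAMP values
                        content_type="application/json")

# 3. Delete the permanent table.
bq.delete_table(table_ref)

# 4. Return the GCS URL of the JSON file to the front end.
json_url = f"gs://my-bucket/{blob.name}"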
While this works, there are some issues with it.
A. I do not believe I am benefiting from BigQuery's query caching, since I am using my own permanent tables. Can someone confirm this?
B. Step 2 will be a performance bottleneck. Pulling data out of GCP just to convert it to JSON and re-upload it into GCP feels wrong. A better approach would be to use a cloud-native serverless function, or some other GCP data-workflow service, that is triggered when a new table is created in the dataset. What do you think is the best way to achieve this step?
C. Is there really no way to do this without using permanent tables?
Any help appreciated. Thanks.

With a persistent table, you are able to leverage BigQuery data export to export the table in JSON format to GCS. This has no cost, compared with reading the table from your server-side code.
There is actually a way to avoid creating a permanent table, because every query result is already stored in a temporary table. If you go to "Job Information" you can find the full name of the temporary table, which can be used in a data export job to be exported as JSON to GCS. However, this is more complicated than simply creating a persistent table and deleting it afterwards.
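If you go that route, here is a minimal sketch using the google-cloud-bigquery Python client; the query, bucket, and file names are placeholders. It runs the query, takes the temporary destination table from the finished job, and hands it to an extract job that writes newline-delimited JSON to GCS:

from google.cloud import bigquery

client = bigquery.Client()

# Run the query; the results land in an anonymous temporary table.
query_job = client.query("SELECT name, value FROM `my_project.my_dataset.my_table`")
query_job.result()                      # wait for the query to finish
temp_table = query_job.destination      # reference to the temporary result table

# Export that temporary table straight to GCS as newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(
    temp_table,
    "gs://my-bucket/results/result-*.json",
    job_config=extract_config,
).result()                              # wait for the export to finish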

Related

Variable schema and Hive integration using Kafka

I've been searching for an answer but haven't found any similar issue or thread that could help.
The problem is that I have a Kafka topic which receives data from a different topic coming from another Kafka cluster. The data is a continuous flow of various JSON files, each having its own schema; only a few fields are common.
I need the data from all of them to be ingested into a single Hive table. I thought of creating a table with only one column to store the whole .json content as a raw string, but I ultimately failed to integrate it with Hive (I was only able to move the data to HDFS, but I'd rather have a table receiving data directly from Kafka, as it's a continuous flow).
Unfortunately, I'm not able to alter the original topic in any way. Does anyone have an idea how to deal with this?

Connect Google Cloud Storage to MySQL

Hi, I am not sure if I am heading towards the right solution and need some advice.
I have some social media platform connectors that dump files in CSV format. What I am trying to achieve: for instance, a CSV file has impressions, reach, and clicks as columns, and I want to create a data pipeline in Google Cloud Platform, using MySQL Workbench, to load only impressions and clicks from the CSV files into a table.
Is this possible? If not, what are the recommendations? I could use BigQuery for this, but we only want to work with a subset of the CSV data, not all of it.
Suggestions please!
For this there is Dataprep, an intelligent cloud data service to visually explore, clean, and prepare data.
There you can:
- feed in all your CSVs from Cloud Storage
- explore them visually
- set up recipes for cleaning, transforming, filtering, or joining datasets
- and run jobs that write the combined results to a final CSV, for example, or to BigQuery
Given the power and the price of BigQuery, I would use it. You can either load your files into BigQuery staging/temp tables, or leave them in Cloud Storage and scan them with a federated (external) table.
The principle is to query the files, keep only the columns/data that you want, optionally transform/filter/clean them, and store them in your final table:
CREATE TABLE XXX AS
SELECT ... FROM <staging table / external table>
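Here is a sketch of the federated-table approach with the google-cloud-bigquery Python client. The impressions and clicks columns come from the question; the bucket, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Describe the CSV files sitting in Cloud Storage as an external (federated) table.
ext = bigquery.ExternalConfig("CSV")
ext.source_uris = ["gs://my-bucket/social-media/*.csv"]
ext.autodetect = True
ext.options.skip_leading_rows = 1  # skip the header row

# Query only the columns we care about and write them to a final table.
job_config = bigquery.QueryJobConfig(
    table_definitions={"social_csv": ext},
    destination=bigquery.TableReference.from_string("my_project.my_dataset.campaign_metrics"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query("SELECT impressions, clicks FROM social_csv", job_config=job_config).result()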

Merging dataset results in an Azure Data Factory pipeline

I am reading a JSON-formatted blob from Azure Storage. I am then using one of the values in that JSON to query a database to get more information. What I need to do is take the JSON from the blob, add the fields from the database to it, then write that combined JSON to another Azure Storage. I cannot, however, figure out how to combine the two pieces of information.
I have tried custom mapping in the copy activity for the pipeline. I have tried parameterized datasets, etc. Nothing seems to provide the results I'm looking for.
Is there a way to accomplish this using native activities and parameters (i.e. not by writing a simple utility and executing it as a custom activity)?
For this I would recommend creating a custom U-SQL job to do what you want: first look up both sets of data, do the merging in the U-SQL job, and copy the results to Azure Storage.
If you are not familiar with U-SQL, this tutorial can help you:
https://saveenr.gitbooks.io/usql-tutorial/content/
These posts will also help you work with JSON in your job:
https://www.taygan.co/blog/2018/01/09/azure-data-lake-series-working-with-json-part-2
https://www.taygan.co/blog/2018/03/02/azure-data-lake-series-working-with-json-part-3

Copy Data from MySQL (on-premises) to Cosmos DB

I have several questions, as follows:
1. I was wondering how I could transfer data from MySQL to Cosmos DB, using either Python, Azure Data Factory, or anything else.
2. If I understand correctly, a row from the table will be transformed into a document, is that correct?
3. Is there any way to create one more row for a doc during the copy activity?
4. If data in MySQL is changed, will the copied data in Cosmos DB be automatically changed too? If not, how do I set up such triggers?
I do understand that some of these questions may be simple; however, I'm new to this. Please bear with me.
1. I was wondering how I could transfer data from MySQL to Cosmos DB using either Python, Azure Data Factory, or anything else.
Yes, you can transfer data from MySQL to Cosmos DB by using the Azure Data Factory Copy Activity.
2. If I understand correctly, a row from the table will be transformed into a document, is that correct?
Yes.
3. Is there any way to create one more row for a doc during the copy activity?
If you want to merge multiple rows into one document, then the Copy Activity probably can't be used directly. You could implement your own logic (e.g. Python code) in an Azure Functions HTTP trigger, along the lines of the sketch below.
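A minimal sketch of that merging idea, assuming the pymysql and azure-cosmos Python packages; the connection settings, database and container names, and the order/order-line schema are made-up placeholders:

import pymysql
from azure.cosmos import CosmosClient

# Connection details below are placeholders.
mysql_conn = pymysql.connect(host="localhost", user="app", password="secret",
                             database="shop", cursorclass=pymysql.cursors.DictCursor)
cosmos = CosmosClient("https://my-account.documents.azure.com:443/", credential="my-key")
container = cosmos.get_database_client("shop").get_container_client("orders")

# Merge all order-line rows belonging to one order into a single document.
docs = {}
with mysql_conn.cursor() as cur:
    cur.execute("SELECT order_id, product, quantity FROM order_lines ORDER BY order_id")
    for row in cur.fetchall():
        doc = docs.setdefault(str(row["order_id"]), {"id": str(row["order_id"]), "lines": []})
        doc["lines"].append({"product": row["product"], "quantity": row["quantity"]})

for doc in docs.values():
    container.upsert_item(doc)  # one Cosmos DB document per order, built from many source rows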
4. If data in MySQL is changed, will the copied data in Cosmos DB be automatically changed too? If not, how do I set up such triggers?
If you can tolerate a delayed sync, you could sync the data with a scheduled Copy Activity between MySQL and Cosmos DB. If you need a near-real-time sync: as far as I know, Azure Functions does not support a SQL Server trigger natively, but you can find some solutions in this document:
Defining Custom Binding in Azure functions
If a binding on the Azure Functions side is not available, it can instead be a SQL trigger invoking an Azure Functions HTTP trigger, as in the sketch below.
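A hedged sketch of that HTTP-trigger pattern, using the Azure Functions Python worker and the azure-cosmos package; the environment variable names, database and container names, and the shape of the posted row are assumptions for illustration:

import json
import os

import azure.functions as func
from azure.cosmos import CosmosClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    # The SQL-side trigger is assumed to POST the changed row as a JSON body,
    # e.g. {"id": "42", "name": "new value"}.
    row = req.get_json()

    cosmos = CosmosClient(os.environ["COSMOS_URL"], credential=os.environ["COSMOS_KEY"])
    container = cosmos.get_database_client("shop").get_container_client("customers")
    container.upsert_item(row)  # insert or overwrite the matching document

    return func.HttpResponse(json.dumps({"status": "synced", "id": row["id"]}),
                             mimetype="application/json")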

How to transfer large data between pages in Perl/CGI?

I have worked with CGI pages a lot and dealt with cookies and storing the data in the /tmp directory in Linux.
Basically I am running a query for millions of records using SQL and am saving the results in a hash. I want to transfer that data to an Ajax call (which will eventually perform some calculations and return a graph using a Google API).
Or, I want to transfer that data to another CGI page somehow.
PS: The data I am talking about here is on the order of 10-100+ MB.
Until now, I've been saving that data to a file on the server, but again, it's a hassle to deal with that data on the server for each query.
You don't mention why it's a hassle to deal with the data on the server for each query, but assuming the hassle is working with the file, DBM::Deep might make it relatively easy to write the hash out and get it back again. Once you have that, you could create a simple script to return it as JSON and access it as needed from JavaScript or other pages, although I suspect the browser might slow down with a 100 MB JSON data structure.