I want to build a data warehouse in Google BigQuery, but I'm not sure how to actually schedule jobs to get the data into the cloud.
To give some background: I have a MySQL database hosted on-prem which I currently take a dump of each night as a backup. My idea is that I can send this dump to Google Cloud and have it import the data into BigQuery.
I thought I could send the dump and then use Cloud Scheduler or a Cloud Function to run something that opens the dump and does the import, but I'm unsure how these services all fit together.
I'm a bit of a newbie with Google Cloud, so if there is a better way to achieve this then I'm happy to change my plan of action.
Thanks in advance.
Now that the new EXTERNAL_QUERY has been launched and you can query a Cloud SQL instance from BigQuery, your best shot right now is:
Set up replication from your current instance to a Cloud SQL instance, following this guide.
Understand how Cloud SQL federated queries let you query Cloud SQL instances from BigQuery.
This way you get live access to your relational database. Example query that you run in BigQuery:
SELECT * FROM EXTERNAL_QUERY(
  'connection_id',
  '''SELECT * FROM mysqltable AS c ORDER BY c.customer_id''');
You can even join a BigQuery table with a Cloud SQL table:
Example:
SELECT c.customer_id, c.name, SUM(t.amount) AS total_revenue,
rq.first_order_date
FROM customers AS c
INNER JOIN transaction_fact AS t ON c.customer_id = t.customer_id
LEFT OUTER JOIN EXTERNAL_QUERY(
'connection_id',
'''SELECT customer_id, MIN(order_date) AS first_order_date
FROM orders
GROUP BY customer_id''') AS rq ON rq.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, rq.first_order_date;
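If you later want to run such a federated query from your own code (for example from the scheduled job mentioned in the question), here is a minimal sketch using the google-cloud-bigquery Python client; 'connection_id' and 'mysqltable' are the same placeholders as in the examples above, not real names.
# Minimal sketch: run the federated query above from Python.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT * FROM EXTERNAL_QUERY(
  'connection_id',
  '''SELECT * FROM mysqltable AS c ORDER BY c.customer_id''');
"""

# Run the query in BigQuery and print each row returned from Cloud SQL.
for row in client.query(sql).result():
    print(dict(row))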
In order to achieve this you will need to create a Cloud Storage bucket by running:
gsutil mb gs://BUCKET_NAME
After creating the bucket, you need to create a Cloud Function triggered by that bucket using the finalize event (google.storage.object.finalize).
You can follow this sample function:
'use strict';
const {Storage} = require('@google-cloud/storage');
const {BigQuery} = require('@google-cloud/bigquery');

// Instantiates the clients
const storage = new Storage();
const bigquery = new BigQuery();
/**
* Creates a BigQuery load job to load a file from Cloud Storage and write the data into BigQuery.
*
* @param {object} data The event payload.
* @param {object} context The event metadata.
*/
exports.loadFile = (data, context) => {
  const datasetId = 'Your_Dataset_name';
  const tableId = 'Your_Table_ID';
  const jobMetadata = {
    skipLeadingRows: 1,
    writeDisposition: 'WRITE_APPEND'
  };

  // Loads data from a Google Cloud Storage file into the table
  bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(data.bucket).file(data.name), jobMetadata)
    .catch(err => {
      console.error('ERROR:', err);
    });

  console.log(`Loading from gs://${data.bucket}/${data.name} into ${datasetId}.${tableId}`);
};
Then create your BigQuery dataset and table using your desired schema.
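If you prefer to script that step rather than use the console, here is a rough sketch with the BigQuery Python client; the dataset and table names match the placeholders used in the function above, and the schema fields are only examples.
from google.cloud import bigquery

client = bigquery.Client()

# Create the dataset (same placeholder name as in the Cloud Function above).
dataset = client.create_dataset('Your_Dataset_name', exists_ok=True)

# Create the destination table; these schema fields are just examples,
# replace them with the columns of your CSV dump.
schema = [
    bigquery.SchemaField('customer_id', 'INTEGER'),
    bigquery.SchemaField('name', 'STRING'),
]
client.create_table(bigquery.Table(dataset.table('Your_Table_ID'), schema=schema), exists_ok=True)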
Now you can upload your CSV file into your bucket and you will see the uploaded data in BigQuery.
I have a Pyramid 2.X + SQLAlchemy + Zope App created using the official CookieCutter.
There is a table called "schema_b.table_a" which starts with 0 records.
In the view below, count(*) should be more than 0 (the external API inserts rows into it), but it returns 0:
@view_config(route_name='home', renderer='myproject:templates/home.jinja2')
def my_view(request):
    # Call external REST API. This uses HTTP requests. The API inserts in schema_b.table_a
    call_thirdparty_api()

    mark_changed(request.dbsession)
    sql = "SELECT count(*) FROM schema_b.table_a"
    total = request.dbsession.execute(sql).fetchone()
    print(total)  # Total is 0
    return {}
On the other hand, the following code returns the correct count(*):
@view_config(route_name='home', renderer='myproject:templates/home.jinja2')
def my_view(request):
    engine = create_engine(request.registry.settings.get("sqlalchemy.url"), poolclass=NullPool)
    connection = engine.connect()

    # Call external REST API. This uses HTTP requests. The API inserts in table_a
    call_thirdparty_api()

    sql = "SELECT count(*) FROM schema_b.table_a"
    total = connection.execute(sql).fetchone()
    print(total)  # Total is not 0

    connection.invalidate()
    engine.dispose()
    return {}
It seems that request.dbsession is not able to see the data inserted by the external REST API, but it is not clear to me why, or how to correct it.
Pyramid and Zope provide transaction managers that extend transactions far beyond databases. In your example, I think a MySQL transaction was started by the pyramid_tm package when the request was received by the server; their documentation states:
"At the beginning of a request a new transaction is started using the request.tm.begin() function."
https://docs.pylonsproject.org/projects/pyramid_tm/en/latest/index.html
Because MySQL supports consistent nonblocking reads, the transaction you join when calling request.dbsession.execute queries a snapshot of the database made at the start of that transaction. When you use the plain SQLAlchemy engine to execute the query, a new transaction is created and the expected result is returned.
https://dev.mysql.com/doc/refman/8.0/en/innodb-consistent-read.html
This is very confusing in this situation, but I must admit it's impressive how well it seems to work.
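To illustrate the point: if the request-scoped session must see rows the external API committed after the request's transaction began, one option (a rough sketch, assuming the cookiecutter's default setup where request.tm is the pyramid_tm transaction manager and the session rejoins the next transaction via zope.sqlalchemy) is to end the current transaction and start a new one before querying:
from pyramid.view import view_config


@view_config(route_name='home', renderer='myproject:templates/home.jinja2')
def my_view(request):
    # The external API inserts and commits rows over its own connection.
    call_thirdparty_api()

    # Throw away the snapshot taken when the request started...
    request.tm.abort()
    # ...and begin a new transaction, whose snapshot will include the new rows.
    request.tm.begin()

    total = request.dbsession.execute(
        "SELECT count(*) FROM schema_b.table_a").fetchone()
    print(total)
    return {}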
Could not find any answer to this question: my Google Cloud Function eats up all available PostgreSQL connections.
Library used: pg@8.8.0.
Firebase reports 20 active users, yet PostgreSQL is up to 100 parallel connections.
const client = await pool.connect();
try {
  await client.query(`INSERT INTO ${table} (id, update_time, doc)
    VALUES ($1, NOW(), $2)
    ON CONFLICT (id) DO UPDATE
      SET update_time = excluded.update_time,
          doc = excluded.doc;`, [documentId, document]);
} finally {
  // Always release the client back to the pool, even if the query throws,
  // so a failed query does not leak a connection.
  client.release();
}
Am I doing something wrong?
How can I detect whether Google spins up multiple instances of the same function?
I have a CSV file in an S3 bucket which gets updated/refreshed with new data generated from an ML model every week. I have created an ETL pipeline in AWS Glue to read the data (CSV file) from the S3 bucket and load it into RDS (MySQL Server). I have connected to my RDS via SSMS. I was able to load the data successfully into RDS and validate the correct row count of 5000. When I run the job again, the same CSV file contents get appended to the table. Here is the sample code:
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "<dbname>", table_name = "<table schema name>", transformation_ctx = "datasink5")
Next week, when I run my model, there will be 1000 new rows in that CSV file. So when I run my ETL job in Glue, it should add the 1000 new rows to the previously loaded 5000 rows, and the total row count should then be 6000.
Can anyone tell me how to achieve this? Is there any way we can truncate or drop the table before inserting all the new data? That way we could avoid duplication (see the sketch after the sample code below).
Note: I will have to run the crawler every week to read the data from the S3 bucket and pick up the new rows along with the existing ones.
Sample code generated using AWS Glue:
## @params: [JOB_NAME]
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Any help would be appreciated.
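One way to get the truncate-and-reload behaviour asked about above (a sketch that is not part of the generated script; the JDBC URL, credentials and table name are placeholders, and it assumes the RDS target is reachable over MySQL JDBC) is to convert the DynamicFrame to a Spark DataFrame and write it in overwrite mode, so the table is emptied and reloaded on every run instead of appended to:
# Sketch: replace the table contents on each run instead of appending.
df = resolvechoice4.toDF()  # DynamicFrame -> Spark DataFrame

(df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/<dbname>")
    .option("dbtable", "<table name>")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("truncate", "true")   # keep the existing table definition, just empty it
    .mode("overwrite")            # replace existing rows instead of appending
    .save())
With this in place, the datasink5 call from the sample code would be dropped in favour of this write.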
I wonder what the best way is to read data from a CSV file (located on S3) and then insert it into a database table.
I have deployed Apache Flink on my Kubernetes cluster.
I have tried the DataSet API in the following way:
Source (read CSV) -> Map (transform POJO to Row) -> Sink (JdbcOutputFormat)
It seems that the sink (writing into the DB) is the bottleneck. The source and map tasks are idle about 80% of the time, while the sink's idle time is 0 ms per second, with an input rate of 1.6 MB/s.
I can only speed up the whole operation of inserting the CSV content into my database by splitting the work across new task manager replicas.
Is there any room for improving performance of my jdbc sink?
[edit]
DataSource<Order> orders = env.readCsvFile("path/to/file")
        .pojoType(Order.class, pojoFields)
        .setParallelism(6)
        .name("Read csv");

JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
        // driver class, DB URL and credentials omitted here
        .setQuery("INSERT INTO orders(...) values (...)")
        .setBatchInterval(10000)
        .finish();

orders.map(order -> {
            Row r = new Row(29);
            // assign values from Order pojo to Row
            return r;
        })
        .output(jdbcOutput)
        .name("Postgre SQL Output");
I have experimented with batch intervals in the range 100-50000, but it didn't affect the processing speed significantly; it's still 1.4-1.6 MB/s.
If, instead of writing to the external database, I print all entries from the CSV file to stdout (print()), I get a rate of 6-7 MB/s, which is why I assume the problem is with the JDBC sink.
With this post I just wanted to make sure my code doesn't have any performance issues and that I'm getting the maximum performance out of a single task manager.
I have the following problem. We are using Azure SQL Database for data processing. Instead of running the import wizard every time, we would like to automatically load the data through the APIs of our accounting platforms. (API documentation links: https://hellocashapi.docs.apiary.io/#introduction/authentication , https://www.zoho.com/books/api/v3/)
Basically, my task is to get the data from these platforms through their APIs, create the table in our Azure SQL Database, and insert the data there.
Can anyone recommend a platform to solve this, or send me a link to documentation that shows how to do it?
Thank you.
If you can put the JSON in a SQL variable like this:
DECLARE @json NVARCHAR(MAX) = N'[
{
"Order": {
"Number":"SO43659",
"Date":"2011-05-31T00:00:00"
},
"AccountNumber":"AW29825",
"Item": {
"Price":2024.9940,
"Quantity":1
}
},
{
"Order": {
"Number":"SO43661",
"Date":"2011-06-01T00:00:00"
},
"AccountNumber":"AW73565",
"Item": {
"Price":2024.9940,
"Quantity":3
}
}
]';
Then you can create a table using the WITH clause:
SELECT * INTO TableName1
FROM OPENJSON (@json)
WITH (
Number varchar(200) '$.Order.Number',
Date datetime '$.Order.Date',
Customer varchar(200) '$.AccountNumber',
Quantity int '$.Item.Quantity',
[Order] nvarchar(MAX) AS JSON
)
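To connect this back to the question of pulling the data from the accounting APIs in the first place, here is a rough sketch using the requests and pyodbc Python libraries (neither of which is mentioned in the answer); the endpoint, token and connection string are placeholders, and it assumes a table like TableName1 already exists:
import json

import pyodbc
import requests

# Placeholder endpoint and token for the accounting platform's API.
resp = requests.get("https://api.example.com/orders",
                    headers={"Authorization": "Bearer <token>"})
payload = json.dumps(resp.json())

# Placeholder Azure SQL Database connection string.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<db>;UID=<user>;PWD=<password>")
cursor = conn.cursor()

# Reuse the same OPENJSON mapping as above, but insert into the existing table.
cursor.execute("""
    INSERT INTO TableName1 (Number, Date, Customer, Quantity, [Order])
    SELECT Number, Date, Customer, Quantity, [Order]
    FROM OPENJSON (?)
    WITH (
        Number varchar(200) '$.Order.Number',
        Date datetime '$.Order.Date',
        Customer varchar(200) '$.AccountNumber',
        Quantity int '$.Item.Quantity',
        [Order] nvarchar(MAX) AS JSON
    )
""", payload)
conn.commit()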
Firstly, not every API is supported as a source in Data Factory.
Please reference this document: Azure Data Factory connector overview
Data Factory doesn't support the hellocash API. That means you can't do this with Data Factory.
Secondly, Data Factory now supports creating a destination table automatically.
Reference: Copy Activity in Azure Data Factory supports creating a destination table automatically.
Summary:
Load data faster with new support from the Copy Activity feature of Azure Data Factory. Now, if you're trying to copy data from any supported source into a SQL database/data warehouse and find that the destination table doesn't exist, Copy Activity will create it automatically. After the data ingestion, review and adjust the sink table schema as needed.
This feature is supported with:
Azure SQL Database
Azure SQL Database Managed Instance
Azure SQL Data Warehouse
SQL Server
Hope this helps.