I created an external table in Azure Synapse Analytics Serverless.
The file format is CSV and it points to a Data Lake Storage Gen2 folder with multiple CSV files that hold the actual data. The CSV files are updated from time to time.
I would like to anticipate the potential problems that may arise when a user executes a long-running query against the external table at the moment the underlying CSV files are being updated.
Will the query fail, or will the result set contain dirty data / inconsistent results?
Generally, there is no issue connecting a Synapse serverless pool to Azure Data Lake; Synapse is well suited to querying, transforming, and analyzing data stored in the data lake.
Microsoft provides a well-explained troubleshooting document in case of any error. Please refer to Troubleshoot Azure Synapse Analytics.
Synapse serverless SQL allows you to control what the behavior will be. If you want to avoid query failures due to constantly appended files, you can use the ALLOW_INCONSISTENT_READS option.
You can see the details here:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file#querying-appendable-files
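For illustration, here is a minimal sketch of an ad-hoc query with that read option, run from Python with pyodbc against the serverless SQL endpoint. The server name, credentials, storage account, container, and folder are all placeholders, and the ROWSET_OPTIONS read option is the one described on the linked page, so verify the exact syntax there:

```python
# Minimal sketch: query appendable CSV files in Synapse serverless SQL with
# ALLOW_INCONSISTENT_READS, via pyodbc. All names/credentials are placeholders.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=mydb;UID=<user>;PWD=<password>;"
)

query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<container>/<folder>/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE,
    -- Tolerate files that are appended to while the query runs
    ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) AS rows;
"""

with pyodbc.connect(conn_str) as conn:
    for row in conn.cursor().execute(query):
        print(row)
```

The linked documentation also shows how to apply the same read option to an external table itself via a TABLE_OPTIONS setting in CREATE EXTERNAL TABLE.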
I extracted data from an API using Airflow.
The data is extracted from the API and saved on cloud storage in JSON format.
The next step is to insert the data into an SQL DB.
I have a few questions:
Should I do it in Airflow or using another ETL tool like AWS Glue / Azure Data Factory?
How do I insert the data into the SQL DB? I googled "how to insert data into a SQL DB using Python" and found a solution that loops over all the JSON records and inserts the data one record at a time.
That is not very efficient. Is there another way I can do it?
Are there any other recommendations or best practices for inserting the JSON data into the SQL server?
I haven't decided on a specific DB so far, so feel free to pick the one you think fits best.
thank you!
You can use Airflow just as a scheduler to run Python/bash scripts at defined times with some dependency rules, but you can also take advantage of the operators and hooks provided by the Airflow community.
For the ETL part, Airflow isn't an ETL tool. If you need ETL pipelines, you can run and manage them using Airflow, but you need an ETL service/tool to create them (Spark, Athena, Glue, ...).
To insert data into the DB, you can create your own Python/bash script and run it, or use the existing operators. There are generic operators and hooks for the different databases (MySQL, Postgres, Oracle, MSSQL), and there are other optimized operators and hooks for each cloud service (AWS RDS, GCP Cloud SQL, GCP Spanner, ...). If you want to use one of the managed/serverless services, I recommend using its operators; if you want to deploy your database on a VM or a K8s cluster, you need to use the generic ones.
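To illustrate the hook approach, here is a minimal sketch assuming Postgres as the target, a hypothetical Airflow connection id my_postgres, and a JSON file that has already been pulled down from cloud storage; it bulk-inserts the records with PostgresHook.insert_rows instead of one INSERT per record:

```python
# Minimal sketch: bulk-insert JSON records with an Airflow hook instead of
# one INSERT per record. The connection id, table, columns and file path are
# hypothetical placeholders.
import json

from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_json_to_postgres(json_path: str = "/tmp/api_extract.json") -> None:
    with open(json_path) as f:
        records = json.load(f)  # expected: a list of dicts

    # Turn the dicts into tuples in a fixed column order
    target_fields = ["id", "name", "created_at"]
    rows = [tuple(r[field] for field in target_fields) for r in records]

    hook = PostgresHook(postgres_conn_id="my_postgres")
    # insert_rows batches the inserts (commit_every rows per transaction)
    hook.insert_rows(
        table="api_events",
        rows=rows,
        target_fields=target_fields,
        commit_every=1000,
    )
```

You would call a function like this from a PythonOperator (or a @task) in your DAG; for very large files, the database's native bulk-load/COPY mechanism is usually faster still.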
Airflow supports almost all the popular cloud services, so try to choose your cloud provider based on cost, performance, team knowledge, and the other needs of your project, and you will surely find a good way to achieve your goal with Airflow.
You can use Azure Data Factory or Azure Synapse Analytics to move the data in the JSON files to SQL Server. Azure Data Factory supports 90+ connectors as of now. (Refer to the MS doc Connector overview - Azure Data Factory & Azure Synapse for more details about the connectors supported by Data Factory.)
[Image 1: some of the connectors supported by ADF]
Refer to the MS docs on the prerequisites and required permissions to connect Google Cloud Storage with ADF.
Use Google Cloud Storage as the source connector in the copy activity. Reference: Copy data from Google Cloud Storage - Azure Data Factory & Azure Synapse | Microsoft Learn
Use the SQL DB connector for the sink.
ADF supports an Auto create table option when no table exists yet in the Azure SQL database. You can also map the source and sink columns in the mapping settings.
Is there any way to get the data and connect directly to Snowflake without any third-party or open-source software?
Our current setup gets the data from SAP BW into a DATAMART, which is then used by Power BI.
We have a client request to assess moving the data from SAP BW to Snowflake directly, because after some research I found that Snowflake doesn't allow a direct connection to SAP or OData data sources.
Are there any recommendations or concerns about going with this approach?
Thank you.
The only data ingestion capability Snowflake has is bulk loading from files held on a cloud platform (S3, Azure Blob, etc.).
If you can't, or don't want to, get your data into one of these file stores, then you'd need to use a commercial ETL tool or "roll your own" solution using e.g. Python.
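For illustration, here is a minimal sketch of the bulk-load route using the Snowflake Python connector; the account, credentials, stage URL, SAS token, and table are all hypothetical placeholders, and the files are assumed to have already been exported from SAP BW to Azure Blob storage:

```python
# Minimal sketch: bulk-load CSV files that were exported from SAP BW into
# Azure Blob storage, using an external stage and COPY INTO.
# All identifiers and credentials here are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="LOAD_WH",
    database="BW_DB",
    schema="STAGING",
)

cur = conn.cursor()
try:
    # Point an external stage at the container holding the exported files
    cur.execute("""
        CREATE OR REPLACE STAGE bw_extract_stage
        URL = 'azure://<storageaccount>.blob.core.windows.net/<container>/bw-extracts/'
        CREDENTIALS = (AZURE_SAS_TOKEN = '<sas_token>')
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    # Bulk-load every file in the stage into the target table
    cur.execute("COPY INTO STAGING.SALES_ORDERS FROM @bw_extract_stage")
finally:
    cur.close()
    conn.close()
```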
I have several questions as follows:
1. I was wondering how I could transfer data from MySQL to Cosmos DB using either Python or Azure Data Factory, or anything else.
2. If I understand correctly, a row from the table will be transformed into a document, is that correct?
3. Is there any way to put more than one row into a doc during the copy activity?
4. If data in MySQL is changed, will the copied data in Cosmos DB be automatically changed too? If not, how do I set up such triggers?
I understand that some of these questions may be simple; however, I'm new to this, so please bear with me.
1. I was wondering how I could transfer data from MySQL to Cosmos DB using either Python or Azure Data Factory, or anything else.
Yes, you can transfer data from MySQL to Cosmos DB using the Azure Data Factory Copy Activity.
2. If I understand correctly, a row from the table will be transformed into a document, is that correct?
Yes.
3. Is there any way to put more than one row into a doc during the copy activity?
If you want to merge multiple rows into one document, then the copy activity probably can't be used directly. You could write your own logic (e.g. Python code) in an Azure Function HTTP trigger.
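As a rough illustration of that "own logic" approach (not ADF itself), the sketch below groups MySQL rows by a hypothetical order_id and upserts one merged document per group into Cosmos DB; the endpoint, key, database, container, and field names are all placeholders:

```python
# Minimal sketch: merge several MySQL rows into one Cosmos DB document.
# Endpoint, key, database, container and the row/field names are placeholders.
from itertools import groupby
from operator import itemgetter

from azure.cosmos import CosmosClient

# Rows as you might fetch them from MySQL (e.g. with mysql-connector-python)
rows = [
    {"order_id": 1, "line_no": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "line_no": 2, "sku": "B", "qty": 1},
    {"order_id": 2, "line_no": 1, "sku": "C", "qty": 5},
]

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("shop").get_container_client("orders")

# Group the flat rows into one document per order
rows.sort(key=itemgetter("order_id"))
for order_id, lines in groupby(rows, key=itemgetter("order_id")):
    doc = {
        "id": str(order_id),
        "lines": [{"sku": r["sku"], "qty": r["qty"]} for r in lines],
    }
    container.upsert_item(doc)
```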
4. If data in MySQL is changed, will the copied data in Cosmos DB be automatically changed too? If not, how do I set up such triggers?
If you can tolerate a delayed sync, you could sync the data on a schedule using a Copy Activity between SQL and Cosmos DB. If you need a timely sync, as far as I know Azure Functions doesn't support a SQL Server trigger, but you can find some solutions in this document:
Defining Custom Binding in Azure functions
If not a custom binding on the Azure Functions side, then it can be a SQL trigger invoking an Azure Functions HTTP trigger.
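For that last variant, a minimal HTTP-triggered Azure Function (Python) that receives the changed row and upserts it into Cosmos DB might look roughly like this; the environment variable names, database, container, and payload shape are assumptions:

```python
# Minimal sketch of an HTTP-triggered Azure Function that upserts a changed
# source row (posted as JSON) into Cosmos DB. Names are placeholders.
import os

import azure.functions as func
from azure.cosmos import CosmosClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    row = req.get_json()  # e.g. {"id": "42", "name": "new value", ...}

    client = CosmosClient(
        os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"]
    )
    container = client.get_database_client("shop").get_container_client("customers")

    # Upsert keeps the Cosmos document in step with the source row
    container.upsert_item(row)
    return func.HttpResponse("synced", status_code=200)
```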
I have searched high and low, but it seems like mysqldump and "SELECT ... INTO OUTFILE" are both intentionally blocked by not granting file permissions to the DB admin. Wouldn't it save a lot more server resources to allow file permissions than to disallow them? Any other import/export method I can find executes much slower, especially with tables that have millions of rows. Does anyone know a better way? I find it hard to believe Azure left no good way to do this common task.
You did not list the other options you found to be slow, but have you thought about using Azure Data Factory:
Use Data Factory, a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.
It supports exporting data from Azure MySQL and MySQL:
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are supported as sources/sinks by the copy activity, see Supported data stores and formats
Azure Data Factory allows you to define mappings (optional!) and/or transform the data as needed. It has a pay-per-use pricing model.
You can start an export manually or on a schedule using the .NET or Python SDK, the REST API, or PowerShell.
It seems you are looking to export the data to a file, so Azure Blob Storage or Azure Files are likely to be good destinations. FTP or the local file system are also possible.
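As a small illustration of the Python SDK route, the sketch below triggers an existing Data Factory pipeline run and polls its status; the subscription, resource group, factory, and pipeline names are placeholders for whatever export pipeline you build:

```python
# Minimal sketch: start an existing ADF export pipeline and check its status
# with the Python SDK (azure-mgmt-datafactory). All names are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
resource_group = "my-rg"
factory_name = "my-data-factory"
pipeline_name = "ExportMySqlToBlob"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = adf_client.pipelines.create_run(
    resource_group, factory_name, pipeline_name, parameters={}
)

# Poll until the copy finishes
while True:
    pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print("Pipeline finished with status:", pipeline_run.status)
```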
"SELECT INTO ... OUTFILE" we can achieve this using mysqlworkbench
1.Select the table
2.Table Data export wizard
3.export the data in the form of csv or Json
I have been building a Grails application for quite a while with dummy data using MySQL Server; this was eventually supposed to be connected to a Greenplum DB (PostgreSQL cluster).
But this is not feasible anymore due to firewall issues.
We are contemplating connecting Grails to CSV files on a shared drive (which are constantly updated by the Greenplum DB; data is appended hourly only).
These CSV files are fairly large (3 MB, 30 MB, and 60 MB); the last file has 550,000+ rows.
Quick questions:
Is this even feasible? Can a CSV file be treated as a database, and can Grails directly access this CSV file and run queries on it, similar to a DB?
Assuming this is feasible, how much rework will be required in the Grails code in the DataSource, controller, and index? (Currently, we are connected to MySQL and we filter data in the controller and index using SQL queries and AJAX calls using remoteFunction.)
Will the constant reading (CSV -> Grails) and writing (Greenplum -> CSV) corrupt the CSV files or cause any other problems?
I know this is not a very robust method, but I really need to understand the feasibility of this idea. Can Grails function without any DB, with merely a CSV file on a shared drive accessible to multiple users?
The short answer is: no, this won't be a good solution.
No.
It would be nearly impossible, if at all possible, to rework this.
Concurrent access to a file like that in any environment is a recipe for disaster.
Grails is not suitable for a solution like this.
Update:
Have you considered using the built-in H2 database, which can be packaged with the Grails application itself? This way you can distribute the database engine along with your Grails application within the WAR. You could even have it populate its database from the CSV you mention the first time it runs, or periodically, depending on your requirements.