I extracted data from an API using Airflow.
The data is extracted from the API and saved on cloud storage in JSON format.
The next step is to insert the data into an SQL DB.
I have a few questions:
Should I do it in Airflow or with another ETL tool like AWS Glue/Azure Data Factory?
How do I insert the data into the SQL DB? I googled "how to insert data into SQL DB using python" and found a solution that loops over the JSON records and inserts them one at a time.
That is not very efficient. Is there another way I can do it?
Any other recommendations or best practices for inserting the JSON data into the SQL server?
I haven't decided on a specific DB so far, so feel free to pick the one you think fits best.
Thank you!
You can use Airflow just as a scheduler to run Python/bash scripts at defined times with some dependency rules, but you can also take advantage of the operators and hooks provided by the Airflow community.
For the ETL part, Airflow isn't an ETL tool. If you need ETL pipelines, you can run and manage them using Airflow, but you need an ETL service/tool to create them (Spark, Athena, Glue, ...).
To insert the data into the DB, you can write your own Python/bash script and run it, or use the existing operators. There are generic operators and hooks for the different databases (MySQL, Postgres, Oracle, MSSQL), and there are other, optimized operators and hooks for each cloud service (AWS RDS, GCP Cloud SQL, GCP Spanner, ...). If you want to use one of the managed/serverless services, I recommend using its operators; if you want to deploy your database on a VM or a K8s cluster, use the generic ones.
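For example, here's a minimal sketch of a task that batch-inserts the extracted JSON through the generic Postgres hook instead of committing one row at a time; the connection id, file path, table, and columns are placeholders, not from the question:

```python
# A minimal sketch of an Airflow task that batch-inserts the extracted JSON using the
# generic Postgres hook instead of committing one row at a time. The connection id,
# file path, table, and columns are hypothetical placeholders, not from the question.
import json

import pendulum
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def load_json_to_db():

    @task
    def insert_records():
        # Assume a prior task has downloaded the JSON file from cloud storage.
        with open("/tmp/extracted_data.json") as f:
            records = json.load(f)

        hook = PostgresHook(postgres_conn_id="my_postgres")
        # insert_rows groups the inserts and commits in chunks rather than per row.
        hook.insert_rows(
            table="products",
            rows=[(r["id"], r["name"], r["price"]) for r in records],
            target_fields=["id", "name", "price"],
            commit_every=5000,
        )

    insert_records()


load_json_to_db()
```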
Airflow supports almost all the popular cloud services, so choose your cloud provider based on cost, performance, team knowledge, and the other needs of your project; you will surely find a good way to achieve your goal with Airflow.
You can use Azure Data Factory or Azure Synapse Analytics to move the data in the JSON file to the SQL server. Azure Data Factory currently supports 90+ connectors (refer to the MS doc Connector overview - Azure Data Factory & Azure Synapse for details about the connectors supported by Data Factory).
Img 1: Some of the connectors supported by ADF.
Refer to the MS docs on the prerequisites and required permissions to connect Google Cloud Storage with ADF.
Take Google Cloud Storage as the source connector in the copy activity. Reference: Copy data from Google Cloud Storage - Azure Data Factory & Azure Synapse | Microsoft Learn
Take the SQL DB connector for the sink.
ADF supports the Auto create table option when there is no table created yet in the Azure SQL database. You can also map the source and sink columns in the mapping settings.
Is there any way to get the data and connect directly to Snowflake without any third-party or open-source software?
Our current setup is getting the data from SAP BW into DATAMART and then it is used by PowerBI.
We have a client request to do an assessment of moving the data from SAP BW to Snowflake directly, because after some research I found that Snowflake doesn't allow a direct connection with SAP or OData data sources.
Is there any recommendation or concerns in going with this approach?
Thank you.
The only data ingestion capability Snowflake has is bulk loading from files held on a cloud platform (S3, Azure Blob Storage, etc.).
If you can't, or don't want to, get your data into one of these file stores, then you'd need to use a commercial ETL tool or "roll your own" solution using e.g. Python.
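As a rough illustration of that bulk-load path, here's a minimal sketch using the Snowflake Python connector, assuming the SAP BW extracts have already been landed as CSV files behind an external stage; the account, stage, and table names are placeholders:

```python
# A minimal sketch of Snowflake's bulk-load path using the Python connector
# (snowflake-connector-python), assuming the SAP BW extracts already sit as CSV
# files behind an external stage. Account, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
try:
    # COPY INTO bulk-loads every file found under the stage path in one statement.
    conn.cursor().execute("""
        COPY INTO sap_bw_extract
        FROM @my_external_stage/bw_exports/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
finally:
    conn.close()
```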
I created an external table in Azure Synapse Analytics Serverless.
The file format is CSV and it points to a Data Lake Gen2 folder with multiple CSV files which hold the actual data. The CSV files are updated from time to time.
I would like to foresee the potential problems that may arise when a user executes a long-running query against the external table at the moment the underlying CSV files are being updated.
Will the query fail or maybe the result set will contain dirty data / inconsistent results?
In general there is no issue connecting a Synapse serverless pool to Azure Data Lake; Synapse is well suited to querying, transforming, and analyzing data stored in the data lake.
Microsoft provides a well-explained troubleshooting document in case of any error; please refer to Troubleshoot Azure Synapse Analytics.
Synapse SQL serverless allows you to control what the behavior will be. If you want to avoid query failures due to constantly appended files, you can use the ALLOW_INCONSISTENT_READS option.
You can see the details here:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file#querying-appendable-files
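For reference, a minimal sketch of an ad-hoc query that sets that read option, run from Python via pyodbc and following the linked doc; the serverless endpoint, database, external data source, and folder path are placeholders:

```python
# A minimal sketch (following the linked doc) of an ad-hoc query that sets the
# ALLOW_INCONSISTENT_READS read option, run from Python via pyodbc. The serverless
# endpoint, database, external data source, and folder path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=my-workspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;"
    "Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT *
FROM OPENROWSET(
        BULK 'csv-folder/*.csv',
        DATA_SOURCE = 'my_data_lake',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        -- tolerate files that are appended to while the query is running
        ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
    ) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)
```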
I am currently building a price-comparison serverless web application on Firebase Hosting with more than 300K products (documents) in a Firestore DB.
Firestore still has some limitations when it comes to filtering (see here), so I'm trying to find an alternative to Firestore that includes some sort of authentication.
Note: being able to filter on multiple keys (e.g. all blue cars sorted by ranking index and cheaper than 25K, which is currently not possible in Firestore) is more important than scalability.
Question: is it possible to use the Firebase Anonymous Authentication method to limit access to a Google MySQL Instance?
Google Cloud SQL is not integrated with Firebase Authentication. So you can't securely access your Google Cloud SQL database directly from client-side application code. This applies to all supported Cloud SQL databases (MySQL, PostgreSQL, and SQL Server at the moment).
I developed a web app on a MySQL database and now I am switching to Android mobile development, but I have a large amount of data to export into Firebase's Cloud Firestore. I could not find a way to do so; I have the MySQL data stored as JSON and CSV.
Do I have to write a script? If yes, can you share the script, or is there some sort of tool?
I have large amounts of data to be exported into Firebase's Cloud Firestore, I could not find a way to do so
If you're looking for a "magic" button that can convert your data from a MySQL database to a Cloud Firestore database, please note that there isn't one.
Do I have to write a script?
Yes, you have to write code in order to convert your actual MySQL database into a Cloud Firestore database. Please note that the two databases are based on different concepts: a Cloud Firestore database is composed of collections and documents, and there are no tables in the NoSQL world.
So, I suggest you read the official documentation regarding Get started with Cloud Firestore.
If yes then can you share the script or is there some sort of tool.
There is no such script or tool; you'll have to create your own mechanism for that.
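As a starting point, here's a minimal sketch of one possible migration script using the firebase-admin SDK, assuming the MySQL data was exported as a JSON array of row objects; the service-account key file, export file, and "products" collection are hypothetical placeholders:

```python
# A minimal sketch of one possible MySQL-to-Firestore migration script, assuming the
# MySQL data was exported as a JSON array of row objects. The key file, export file,
# and collection name are hypothetical placeholders. Requires firebase-admin.
import json

import firebase_admin
from firebase_admin import credentials, firestore

cred = firebase_admin.credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

with open("mysql_export.json") as f:
    rows = json.load(f)

# Firestore batched writes are limited to 500 operations, so chunk the rows.
for start in range(0, len(rows), 500):
    batch = db.batch()
    for row in rows[start:start + 500]:
        # Reuse the MySQL primary key as the document id to keep the load idempotent.
        doc_ref = db.collection("products").document(str(row["id"]))
        batch.set(doc_ref, row)
    batch.commit()
```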
I have searched high and low, but it seems like mysqldump and "SELECT ... INTO OUTFILE" are both intentionally blocked by not granting file permissions to the DB admin. Wouldn't it save a lot more server resources to allow file permissions than to disallow them? Every other import/export method I can find executes much more slowly, especially with tables that have millions of rows. Does anyone know a better way? I find it hard to believe Azure left no good way to do this common task.
You did not list the other options you found to be slow, but have you thought about using Azure Data Factory:
Use Data Factory, a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.
It supports exporting data from Azure MySQL and MySQL:
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are supported as sources/sinks by the copy activity, see Supported data stores and formats
Azure Data Factory allows you to define mappings (optional!), and / or transform the data as needed. It has a pay per use pricing model.
You can start an export manually or on a schedule using the .NET or Python SDK, the REST API, or PowerShell.
It seems you are looking to export the data to a file, so Azure Blob Storage or Azure Files are likely to be a good destination. FTP or the local file system are also possible.
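For the Python SDK route mentioned above, here's a minimal sketch of triggering an existing Data Factory pipeline run (azure-identity + azure-mgmt-datafactory); the subscription, resource group, factory, and pipeline names are placeholders:

```python
# A minimal sketch of triggering an existing Data Factory pipeline run via the Python
# SDK (azure-identity + azure-mgmt-datafactory). The subscription id, resource group,
# factory name, and pipeline name are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Kick off the export pipeline (e.g. MySQL source -> Blob Storage sink).
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-data-factory",
    pipeline_name="export-mysql-to-blob",
)
print("Started pipeline run:", run.run_id)

# Optionally check the run status afterwards.
status = adf_client.pipeline_runs.get("my-rg", "my-data-factory", run.run_id).status
print("Status:", status)
```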
"SELECT INTO ... OUTFILE" we can achieve this using mysqlworkbench
1.Select the table
2.Table Data export wizard
3.export the data in the form of csv or Json