I was reading articles about Kafka and StreamSets, and my understanding is:
Kafka acts as a broker between producer systems and subscribers. Producers push data into the Kafka cluster, and subscribers pull data from Kafka.
StreamSets is a technology for moving data from one source to another through a pipeline.
Now, below are my questions. Please help to clarify:
What is the fundamental difference between Kafka and StreamSets? Is it that Kafka doesn't move data but StreamSets does?
If Kafka doesn't move the data, what is Kafka used for? If it moves data like ETL solutions do, how is it different from SSIS, Informatica, etc.?
How is StreamSets different from SSIS, Informatica, etc.?
StreamSets is a graphical tool that contains components that allow for data movement, which happen to include Kafka producers and consumers, but you're not required to use them.
They're complementary, and by using Kafka, you can allow for back-pressure in streaming systems or have non-StreamSets producers/consumers interacting with other Kafka topics. No, Kafka doesn't move the data (except for internal replication), the clients that interact with the brokers do.
I've not used Informatica or SSIS, but I'm sure that if you contacted someone at StreamSets, they could explain how the products compare.
In StreamSets, most of the time we create "data pipelines". Think of a pipeline as an application that can consist of multiple steps/tasks: the first task might read data from a database, Kafka, or any number of other data sources; the second step might modify the data; the third step might run a script; and so on, until finally the transformed data is saved to a destination, which could be a database or some other cloud storage. So Kafka and StreamSets can work together: StreamSets can read data from and write data to Kafka.
I think of Kafka as a place where data from multiple sources is collected and made available to consumers for a certain time. For example, Kafka can read from a database table periodically and store the changes in a "topic", and read from a web service periodically and store that data in another topic. These topics are now available to consumers: a developer can create an application that reads data from the first topic and does something with it. Kafka keeps track of what each consumer has read by using offsets, and it offers replication and other options. It removes the need to write custom code that integrates multiple sources and destinations; instead, you can configure this part.
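To make the producer/consumer side of this concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name, consumer group, and record fields are assumptions for illustration, not anything prescribed by Kafka itself.

```python
# Minimal sketch of the producer/consumer pattern described above, using the
# kafka-python client. Broker address, topic name and group id are assumed.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push change records into a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("db-changes", {"id": 1, "status": "updated"})
producer.flush()

# Consumer side: another application reads the topic; Kafka tracks its
# position via the committed offsets of the consumer group
consumer = KafkaConsumer(
    "db-changes",
    bootstrap_servers="localhost:9092",
    group_id="reporting-app",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)
```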
StreamSets can read from and write to Kafka. StreamSets does not store the data in its own system while Kafka stores the data for a configurable period of time.
SSIS is similar to StreamSets in that it is used to create pipelines/packages that consist of multiple tasks; each task can take the data/result from the previous tasks and then do something with it. Both StreamSets and SSIS can connect to many kinds of data sources and destinations.
My personal view on how StreamSets and SSIS are different is:
StreamSets is web-based while SSIS needs Visual Studio; the StreamSets GUI is easier to use and does not require special software to be installed for each developer.
Deploying StreamSets pipelines to production with source control was easier than SSIS packages.
SSIS is a Microsoft product so it integrates very well with other Microsoft products. StreamSets can be installed on any platform which makes it ideal for the AWS cloud.
If you want to write SSIS scripting tasks you have to use C#/.NET. StreamSets script tasks can be written in Jython or JavaScript (a minimal Jython sketch follows this list).
SSIS is older and has tons of documentation online.
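To give a feel for the scripting point above, here is a rough sketch of a StreamSets Jython Evaluator script. The script bindings (records, output, error) follow the stage's default template, which may vary by StreamSets version, and the field names are assumptions for illustration.

```python
# Rough sketch of a StreamSets Jython Evaluator script. The bindings
# (records, output, error) are provided by the stage; the field names
# 'amount' and 'amount_with_tax' are assumptions.
for record in records:
    try:
        # add a derived field to each record passing through the pipeline
        if record.value['amount'] is not None:
            record.value['amount_with_tax'] = record.value['amount'] * 1.2
        output.write(record)
    except Exception as e:
        # send problem records to the stage's error stream
        error.write(record, str(e))
```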
Thanks to all. I'd like to add some thoughts on how to look at Kafka and StreamSets when both are used in the same cluster, and how to tell them apart:
We use the reliability of Kafka and the simplicity of StreamSets.
StreamSets removes the coding overhead of writing producers and consumers.
A StreamSets pipeline typically moves data from one source to one destination.
Kafka takes data from multiple sources to multiple destinations (pub-sub methodology).
StreamSets addresses the data drift problem.
I extracted data from an API using Airflow.
The data is extracted from the API and saved on cloud storage in JSON format.
The next step is to insert the data into an SQL DB.
I have a few questions:
Should I do it in Airflow, or using another ETL tool like AWS Glue/Azure Data Factory?
How should I insert the data into the SQL DB? I googled "how to insert data into SQL DB using python" and found a solution that loops over all the JSON records and inserts them one record at a time.
That is not very efficient. Is there any other way I can do it?
Any other recommendations and best practices on how to insert the JSON data into the SQL server?
I haven't decided on a specific DB so far, so feel free to pick the one you think fits best.
thank you!
You can use Airflow just as a scheduler to run some Python/bash scripts at defined times with some dependency rules, but you can also take advantage of the operators and hooks provided by the Airflow community.
For the ETL part, Airflow isn't an ETL tool. If you need ETL pipelines, you can run and manage them with Airflow, but you need an ETL service/tool to create them (Spark, Athena, Glue, ...).
To insert data into the DB, you can create your own Python/bash script and run it, or use the existing operators. There are generic operators and hooks for the different databases (MySQL, Postgres, Oracle, MSSQL), and there are other operators and hooks optimized for each cloud service (AWS RDS, GCP Cloud SQL, GCP Spanner, ...). If you want to use one of the managed/serverless services, I recommend using its operators; if you deploy your database on a VM or a K8s cluster, use the generic ones.
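As a hedged example of the hook approach, here is a sketch of an Airflow task that bulk-loads the extracted JSON into Postgres with the community PostgresHook instead of one INSERT per record. It assumes Airflow 2.4+ with the apache-airflow-providers-postgres package installed; the connection id, file path, table and column names are illustrative.

```python
# Sketch: bulk-load extracted JSON into Postgres via the PostgresHook,
# batching the inserts instead of one round trip per record.
# Connection id, path, table and columns are assumptions.
import json
import pendulum
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def load_api_extract():

    @task
    def insert_records(path: str = "/tmp/api_extract.json"):
        with open(path) as f:
            records = json.load(f)          # list of dicts from the API
        rows = [(r["id"], r["name"], r["value"]) for r in records]
        hook = PostgresHook(postgres_conn_id="my_postgres")
        # insert_rows batches statements and commits every commit_every rows
        hook.insert_rows(
            table="api_data",
            rows=rows,
            target_fields=["id", "name", "value"],
            commit_every=1000,
        )

    insert_records()

load_api_extract()
```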
Airflow supports almost all the popular cloud services, so choose your cloud provider based on cost, performance, team knowledge and the other needs of your project, and you will surely find a good way to achieve your goal with Airflow.
You can use Azure Data Factory or Azure Synapse Analytics to move the data in the JSON files to SQL Server. Azure Data Factory currently supports 90+ connectors. (Refer to the MS doc "Connector overview - Azure Data Factory & Azure Synapse" for more details about the connectors supported by Data Factory.)
[Image 1: some of the connectors supported by ADF]
Refer to the MS docs on the prerequisites and required permissions for connecting Google Cloud Storage to ADF.
Use Google Cloud Storage as the source connector in the Copy activity. Reference: Copy data from Google Cloud Storage - Azure Data Factory & Azure Synapse | Microsoft Learn.
Use the SQL DB connector for the sink.
ADF supports an Auto create table option for when no table exists yet in the Azure SQL database. You can also map the source and sink columns in the mapping settings.
Is there any way to get the data and connect directly to Snowflake without any third-party or open source software?
Our current setup is getting the data from SAP BW into DATAMART and then it is used by PowerBI.
We have a client request to assess moving the data from SAP BW to Snowflake directly, because from my research I found that Snowflake doesn't allow a direct connection to SAP or OData data sources.
Are there any recommendations or concerns about going with this approach?
Thank you.
The only data ingestion capability Snowflake has is bulk loading from files held on a cloud storage platform (S3, Azure Blob Storage, etc.).
If you can't, or don't want to, get your data into one of these file stores, then you'd need to use a commercial ETL tool or "roll your own" solution using e.g. Python.
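A minimal "roll your own" sketch, assuming the extracted files have already landed in S3 and an external stage pointing at that bucket exists: it uses the Snowflake Python connector to run a COPY INTO statement. The account, credentials, warehouse, stage, table and file format settings are all placeholders.

```python
# Sketch: bulk-load files already staged in S3 into Snowflake with the
# Python connector and COPY INTO. All names/credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # @sap_bw_stage is an assumed external stage pointing at the S3 bucket
    cur.execute("""
        COPY INTO staging.sap_bw_extract
        FROM @sap_bw_stage/daily/
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """)
finally:
    conn.close()
```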
I have searched high and low, but it seems like mysqldump and "select ... into outfile" are both intentionally blocked by not granting file permissions to the DB admin. Wouldn't it save a lot more server resources to allow file permissions than to disallow them? Any other import/export method I can find executes much more slowly, especially with tables that have millions of rows. Does anyone know a better way? I find it hard to believe Azure left no good way to do this common task.
You did not list the other options you found to be slow, but have you thought about using Azure Data Factory:
Use Data Factory, a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.
It supports exporting data from Azure MySQL and MySQL:
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are supported as sources/sinks by the copy activity, see Supported data stores and formats
Azure Data Factory allows you to define mappings (optional!), and / or transform the data as needed. It has a pay per use pricing model.
You can start an export manually or on a schedule using the .NET or Python SDK, the REST API, or PowerShell.
It seems you are looking to export the data to a file, so Azure Blob Storage or Azure Files are likely to be good destinations. FTP or the local file system are also possible.
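If you go the Python SDK route, a minimal sketch of triggering an existing Data Factory pipeline run could look like the following; it assumes the azure-identity and azure-mgmt-datafactory packages, and the subscription id, resource group, factory name and pipeline name are placeholders you would replace with your own.

```python
# Sketch: trigger an existing ADF pipeline (e.g. a MySQL-to-Blob copy) from
# Python. Subscription, resource group, factory and pipeline names are
# placeholders for illustration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
client = DataFactoryManagementClient(credential, "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="my-resource-group",
    factory_name="my-data-factory",
    pipeline_name="export-mysql-to-blob",
)
print("Started pipeline run:", run.run_id)
```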
"SELECT INTO ... OUTFILE" we can achieve this using mysqlworkbench
1.Select the table
2.Table Data export wizard
3.export the data in the form of csv or Json
I'm trying to get all the historic information about a sensor in FI-WARE.
I've seen that Orion uses Cygnus to store historic data in Cosmos. Is that information accessible, or is it only possible to get it through IDAS?
Where could I get more info about this?
There are several ways to consume the data; in increasing order of learning curve:
Working with the raw data, either "locally" (i.e. logging into the Head Node of the cluster) by using the Hadoop commands, or "remotely" by using the WebHDFS/HttpFS REST API (see the sketch after this list). Please observe that within this approach you have to implement whatever analysis logic you need, since Cosmos only allows you to manage, as said, raw data.
Working with Hive in order to query the data in a SQL-like approach. Again, you can do it locally by invoking the Hive CLI, or remotely by implementing your own Hive client in Java (there are some other languages) using the Hive libraries.
Working with MapReduce (MR) in order to implement strong analysis. In order to do this, you'll have to create your own MR-based application (typically in Java) and run it locally. Once you are done with the local run of the MR app, you can go with Oozie, which allows you to run such MR apps in a remote way.
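As a minimal illustration of the first (raw data) approach, here is a sketch of reading a file through the WebHDFS/HttpFS REST API with Python's requests library. The host, port (14000 is the usual HttpFS port), user and HDFS path are assumptions, and a real Cosmos instance may additionally require authentication headers.

```python
# Sketch: read a raw Cygnus-persisted file from HDFS via the WebHDFS/HttpFS
# REST API. Host, port, user and path are assumptions for illustration.
import requests

COSMOS_HOST = "cosmos.example.org"   # hypothetical HttpFS endpoint
USER = "myuser"
PATH = f"/user/{USER}/mysensor/mysensor.txt"

resp = requests.get(
    f"http://{COSMOS_HOST}:14000/webhdfs/v1{PATH}",
    params={"op": "OPEN", "user.name": USER},
)
resp.raise_for_status()
print(resp.text)  # raw records as persisted by Cygnus
```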
My advice is to start with Hive (step 1 is easy but does not provide any analysis capabilities): first locally, trying to execute some Hive queries, then remotely, implementing your own client. If this kind of analysis is not enough for you, then move to MapReduce and Oozie.
All the documentation regarding Cosmos can be found in the FI-WARE Catalogue of enablers. Within this documentation, I would highlight:
Quick Start for Programmers.
User and Programmer Guide (functionality described in sections 2.1 and 2.2 is not currently available in FI-LAB).
Has anyone had much experience with data migration into and out of NetSuite? I have to export DB2 tables into MySQL, manipulate the data, and then export it in a CSV file. Then I need to take a CSV file of accounts and manipulate the data again so that accounts from our old system match up with the new one. Has anyone tried to do this in MySQL?
A couple of options:
Invest in a data transformation tool that connects to NetSuite and DB2 or MySQL. Look at Dell Boomi, IBM Cast Iron, etc. These tools allow you to connect to both systems, define the data to be extracted, perform data transformation functions and mappings and do all the inserts/updates or whatever you need to do.
For MySQL to NetSuite, php scripts can be written to access MySQL and NetSuite. On the NetSuite side, you can either do SOAP web services, or you can write custom REST APIs within NetSuite. SOAP is probably a bit slower than REST, but with REST, you have to write the API yourself (server side JavaScript - it's not hard, but there's a learning curve).
Hope this helps.
I'm an IBM i programmer; try CPYTOIMPF to create a pretty generic CSV file. It will go to a stream file; if you have NetServer running you can map a network drive to the IFS directory, or you can use FTP to get the CSV file from the IFS to another machine on your network.
Try Adeptia's NetSuite integration tool to perform ETL. You can also try Pentaho ETL for this (as far as I know, Celigo's NetSuite connector is built on Pentaho). Jitterbit also has an extension for NetSuite.
We primarily have two options to pump data into NS:
i) SuiteTalk ---> SOAP-based integration. There are two versions of SuiteTalk: synchronous and asynchronous.
Typical tools like Boomi/Mule/Jitterbit use synchronous SuiteTalk to pump data into NS. They also have decent editors to help you do the mapping.
ii) RESTlets ---> typical REST-based architectures provided by NS can also be used, but you may have to write external brokers to communicate with them.
Depending on your needs you can use either one. In most cases you will be using SuiteTalk to bring data into NetSuite.
Hope this helps ...
We just got done doing this. We used an iPaaS platform called Jitterbit (similar to Dell Boomi). It can connect to MySQL and to NetSuite, and you can do transformations in the tool. I have been really impressed with the platform overall so far.
There are different approaches; I like the following for processing a batch job:
To import data into NetSuite:
Export a CSV from the old system and place it in a File Cabinet folder in NetSuite (use a RESTlet or web services for this).
Run a scheduled script to load the files in the folder and update the records.
Don't forget to handle errors. Ways to handle errors: send email, create custom record, log to file or write to record
Once the file has been processed move the file to another folder or delete it.
To export data out of NetSuite:
Gather the data and export it to a CSV (you can use a saved search or similar).
Place the CSV in a File Cabinet folder.
From an external server, call web services or a RESTlet to grab the new CSV files in the folder.
Process the file.
Handle errors.
Call web services or a RESTlet to move or delete the CSV file.
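As a hedged sketch of the "call a RESTlet from an external server" step above, the Python snippet below uses requests. The RESTlet URL, script/deploy ids, the Authorization header and the response shape are all placeholders; a real call needs NetSuite token-based authentication and a RESTlet deployed in your account that returns the list of new files.

```python
# Sketch: poll a (hypothetical) NetSuite RESTlet for new export files.
# URL, ids, auth header and response fields are placeholders.
import requests

RESTLET_URL = (
    "https://<account>.restlets.api.netsuite.com"
    "/app/site/hosting/restlet.nl?script=123&deploy=1"
)

resp = requests.get(
    RESTLET_URL,
    headers={
        "Authorization": "<token-based auth header for your account>",
        "Content-Type": "application/json",
    },
    params={"folder": "exports/outbound"},  # hypothetical RESTlet parameter
)
resp.raise_for_status()
for f in resp.json():  # assume the RESTlet returns a list of new CSV files
    print("new export file:", f["name"])
```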
You can also use Pentaho Data Integration; it's free and the learning curve is not that steep. I took a course on it and was able to play around with the tool within a couple of hours.