Data Cleaning Philosophy - Source, Data Warehouse, or Front-end? - mysql

I'm working in a traditional back-to-front ETL stack from a data source (Adobe Analytics) to a MySQL data warehouse to a Tableau front end for visualization. My question is about best practices for cleaning and mapping data, and at which step each should happen.
1) Cleaning: We have no automated connector (SSIS, etc.) from the source (Adobe) to the data warehouse, so we're left with periodic uploads of CSV files. For various reasons these files are less than optimal (misspellings, nulls, etc.). Question: should the cleaning be done on the CSV files, or once the data is uploaded into the MySQL data warehouse (in tables/views)?
2) Mapping: a number of different end-user use cases require us to map the data to lookup tables (geographic regions, types of accounts, etc.). Should this be done in the data warehouse (MySQL joins), or is it just as good in the front end (Tableau)? The real question is about performance, I believe, since you could do it relatively easily at either step.
Thanks!

1) Cleaning: I'd advise you to load the data in the CSV files into a staging database and clean it there, before it reaches the database that you connect Tableau to. This way you keep the original files, which you can reload later if necessary. I'm not sure what a "traditional Back to Front ETL stack" is, but an ETL tool like Microsoft SSIS or Pentaho Data Integration (free) will be a valuable help in building these processes, and you could then run your ETL jobs periodically or every time a new file is uploaded to the directory. Here is a good example of such a process: https://learn.microsoft.com/en-us/sql/2014/integration-services/lesson-1-create-a-project-and-basic-package-with-ssis
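For illustration, here is a minimal MySQL sketch of that approach; the table and column names (staging_adobe_visits, dw_visits, etc.) are hypothetical, and the cleaning rules would depend on what your files actually look like:

```sql
-- Hypothetical staging table mirroring the raw CSV; everything lands as text first.
CREATE TABLE staging_adobe_visits (
    visit_date  VARCHAR(20),
    region      VARCHAR(100),
    page_views  VARCHAR(20)
);

-- Load the periodic CSV export as-is (requires local_infile to be enabled).
LOAD DATA LOCAL INFILE '/uploads/adobe_visits.csv'
INTO TABLE staging_adobe_visits
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES;

-- Clean while moving into a warehouse table: trim, turn empty strings into NULLs,
-- and cast to proper types. The original CSV stays untouched and can be reloaded.
INSERT INTO dw_visits (visit_date, region, page_views)
SELECT STR_TO_DATE(TRIM(visit_date), '%Y-%m-%d'),
       NULLIF(TRIM(region), ''),
       CAST(NULLIF(TRIM(page_views), '') AS UNSIGNED)
FROM staging_adobe_visits;
```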
2) "Mapping": You should have a data model, probably a dimensional model, built on the database that Tableau connects to. This data model should store clean and "business modelled" data. You should perform the lookups (joins/mappings) when you are Transforming your data, so you can Load it into the data model. Having Tableau explore a dimensional model of clean data will also be better for UX/performance.
The overall flow would look something like: CSV -> Staging database -> Clean/Transform/Map -> Business data model (database) -> Tableau
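And a minimal sketch of doing the mapping in the warehouse rather than in Tableau, again with hypothetical table names, assuming dimension tables for dates and regions already exist:

```sql
-- Map cleaned rows to dimension tables while loading the business model, so Tableau
-- only sees surrogate keys and conformed attributes.
INSERT INTO fact_visits (date_key, region_key, page_views)
SELECT d.date_key,
       r.region_key,
       v.page_views
FROM dw_visits v
JOIN dim_date        d ON d.full_date   = v.visit_date
LEFT JOIN dim_region r ON r.region_name = v.region;  -- LEFT JOIN keeps rows with unmapped regions
```

Tableau then connects to the fact and dimension tables (or a view over them) instead of doing the joins itself, which generally performs better with large row counts.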

Related

SSIS staging truncate warehouse

Daily we receive data in Excel format. We load it into staging and then, in an SSIS package, use the Excel file as the connection manager, perform the transformations, and move the data to the warehouse.
Since we are taking the data from Excel and every manipulation is done within the SSIS package, why create a staging table and truncate it at all?
Can someone please explain a real-world scenario? I have looked at many websites and still can't understand how the concepts fit together: staging, source (Excel), lookup, target (warehouse).
Why create a stage when everything is being done in the SSIS package anyway?
The staging area is mainly used to extract data quickly from the data sources, minimizing the impact on the source systems. Once data has been loaded into the staging area, it is used to combine data from multiple sources and to apply transformations, validations, and data cleansing.
You can use a staging design pattern such as one of the following (a rough sketch follows this list):
Incremental load
Truncate Insert
Using Delimiters with HashBytes for Change Detection
You can also read up on package design patterns for loading a data warehouse.
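As a rough illustration of the truncate-insert pattern combined with hash-based change detection (T-SQL flavour, since this is an SSIS/SQL Server scenario; all table and column names are hypothetical):

```sql
-- Empty the staging table; the SSIS data flow then reloads it from the Excel source.
TRUNCATE TABLE stg.Sales;

-- Merge staging into the warehouse, using a hash of the delimited business columns
-- to update only rows that actually changed.
MERGE dw.Sales AS tgt
USING (
    SELECT OrderID, Amount, Region,
           HASHBYTES('SHA2_256', CONCAT(OrderID, '|', Amount, '|', Region)) AS RowHash
    FROM stg.Sales
) AS src
    ON tgt.OrderID = src.OrderID
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
    UPDATE SET Amount = src.Amount, Region = src.Region, RowHash = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderID, Amount, Region, RowHash)
    VALUES (src.OrderID, src.Amount, src.Region, src.RowHash);
```

The point of the stage is that the raw Excel data is persisted in the database first, so the load into the warehouse can be set-based, restartable, and auditable rather than a single in-memory pass through the SSIS package.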

How to replicate and sync a MySQL to graph database

I need to replicate and continuously sync a MySQL database to a graph database, to further query and analyze data from an eCommerce application.
I came across a DZone article that talks about syncing data between Oracle and Neo4j; they use Oracle GoldenGate and Apache Kafka for this.
However, I have a few questions on how this would be implemented:
What about the graph data model? Can I generate graph data model from existing RDBMS data model? Are there any open source ETL solutions to map data models between source and target?
The RDBMS data model may change as part of new features. How to keep data models in sync between a source RDBMS and target graph database?
Are there any other approaches to make this work?
I'm fairly new to SQL-to-NoSQL replication. Any input?

What is the most efficient way to export data from Azure Mysql?

I have searched high and low, but it seems like mysqldump and "select ... into outfile" are both intentionally blocked by not granting file permissions to the db admin. Wouldn't it save a lot more server resources to allow file permissions than to disallow them? Every other import/export method I can find executes much more slowly, especially with tables that have millions of rows. Does anyone know a better way? I find it hard to believe Azure left no good way to do this common task.
You did not list the other options you found to be slow, but have you thought about using Azure Data Factory:
Use Data Factory, a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.
It supports exporting data from Azure MySQL and MySQL:
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are supported as sources/sinks by the copy activity, see Supported data stores and formats
Azure Data Factory allows you to define mappings (optional!) and/or transform the data as needed. It has a pay-per-use pricing model.
You can start an export manually or on a schedule using the .NET or Python SDK, the REST API, or PowerShell.
It seems you are looking to export the data to a file, so Azure Blob Storage or Azure Files are likely to be good destinations. FTP or the local file system are also possible.
"SELECT INTO ... OUTFILE" we can achieve this using mysqlworkbench
1.Select the table
2.Table Data export wizard
3.export the data in the form of csv or Json

Qlikview and Qliksense VS MSBI

This question may seem very basic, but I'm actually struggling to get it clear in my head.
I have some academic experience with SSIS, SSAS and SSRS.
In simple terms:
SSIS - Integration of data from a data source to a data destination;
SSAS - Building a cube of data, which allows you to analyze and explore the data;
SSRS - Allows you to create dashboards with charts, etc. from the data sources.
Now, doing a comparison with QlikView and Qlik Sense...
Can the Qlik products do exactly the same as SSIS, SSAS, and SSRS? That is, can Qlik products do the extraction (SSIS), data processing (SSAS), and data visualization (SSRS)? Or do they work more on the SSRS side (creating dashboards from the data sources)? Do the Qlik tools cover the ETL stages (extract, transform, and load)?
I'm really struggling here, even after reading tons of information about it, so any clarification helps a LOT!
Thanks,
Anna
Yes. Qlik (View and Sense) can be used as both an ETL tool and a presentation layer. Each single file (qvw for View, qvf for Sense) contains the script that is used for ETL (load all the required data from all data sources and transform it if needed), the actual data, and the visuals.
Depending on the complexity, a single file can be used for everything, but the process can be organised into multiple files as well (if needed). For example:
Extract - contains the script for the data extract (possibly with incremental load implemented if the data volumes are big) and stores the data in qvd files
Transform - loads the qvd files from the extraction process (qvd loads are quite fast) and performs the required transformations
Load - loads the data model from the transformation file (binary load) and creates the visualisations
Another example with multiple files: I had a project that required multiple extractor and multiple transformation files. Because the data was extracted from multiple data sources, to speed up the process we ran all the extractor files at the same time, then ran all the transformation files at the same time, and then a main transform combined all the qvd files into a single data model.
In addition to the previous answer, have a look at the layered Qlik architecture.
It describes quite well how you should structure your files.
However, I would not recommend using Qlik for a full-blown data warehouse (which you could build easily with SSIS), as it lacks some useful features (e.g. helpers for slowly changing dimensions).

(Azure) Data Factory to Data warehouse - Dynamically name the landing tables and schemas

I plan to move data from a number of databases periodically using Azure Data Factory (ADF), and I want to move the data into Azure Parallel Data Warehouse (APDW). However, the 'destination' step in the ADF wizard offers me two options: 1) where data is retrieved from a view, you are expected to map the columns to an existing table, and 2) where the data comes from a table, you are expected to generate a table object in the APDW.
Realistically this is too expensive to maintain, and it is possible to erroneously map source data to a landing zone.
What I would like to achieve is an algorithmic approach using variables to name schemas, customer codes, and tables.
After the source data has landed, I will be transforming it using our SSIS Integration Runtime. I am also wondering whether an SSIS package could request the source data instead of an ADF pipeline.
Are there any resources about connecting to on-premises IRs through SSIS objects?
Can the JSON of an ADF be modified to dynamically generate a schema for each data source?
For your second question, "Can the JSON of an ADF be modified to dynamically generate a schema for each data source?":
You could put your table-generation script in the copy activity's pre-copy script.
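For example, a parameterised pre-copy script on the copy activity's APDW sink might look something like the sketch below; the @{...} parts are ADF expressions resolved at run time, and the parameter names and columns are made up for illustration:

```sql
-- Pre-copy script (T-SQL): create the landing table for this source if it does not
-- exist yet, otherwise empty it. Schema and table names come from pipeline parameters.
IF OBJECT_ID('@{pipeline().parameters.schemaName}.@{pipeline().parameters.tableName}', 'U') IS NULL
    EXEC('CREATE TABLE [@{pipeline().parameters.schemaName}].[@{pipeline().parameters.tableName}] (
              CustomerCode VARCHAR(20),
              RegionName   VARCHAR(100),
              Amount       DECIMAL(18,2)
          )');
ELSE
    TRUNCATE TABLE [@{pipeline().parameters.schemaName}].[@{pipeline().parameters.tableName}];
```

Driving the schema, customer code, and table name from pipeline parameters (for example via a ForEach over a control table) avoids hand-maintaining one mapping per source.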