Can anyone please explain how mappings and transformations execute on input data (Informatica / SSIS)

I have been searching around and have not found a good post explaining how mappings and transformations process input data.
Does the entire mapping execute for each record? (I understand that cannot be the case, as Sorter, Aggregator, etc. need to see the entire input, or a defined batch size, in order to do their job.)
Maybe it depends on the transformation type: certain transformations may hold the entire input (or a defined batch) before they release any output.
Maybe this question applies to all data integration tools.

Briefly speaking, this depends on the transformation type: for example, Expression processes rows one at a time without caching them, Aggregator and Sorter need to cache all input data before producing output, and Joiner caches only the master pipeline.
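As a rough mental model (this is only an illustration, not the actual Informatica or SSIS engine), you can think of streaming transformations as generators that emit each row as soon as they see it, and blocking transformations as ones that must buffer their whole input first. A minimal Python sketch of the difference, with made-up column names:

# Conceptual sketch only: streaming vs. blocking operators modeled with
# Python generators. Not a representation of Informatica/SSIS internals.

def expression(rows):
    # Streaming: each row is transformed and emitted immediately.
    for row in rows:
        yield {**row, "amount_with_tax": row["amount"] * 1.2}

def sorter(rows, key):
    # Blocking: every input row must be buffered before any output is emitted.
    buffered = list(rows)
    yield from sorted(buffered, key=key)

def aggregator(rows, group_key, value_key):
    # Blocking (unless the input is pre-sorted): totals are only known
    # after the whole input has been seen.
    totals = {}
    for row in rows:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    for k, v in totals.items():
        yield {group_key: k, value_key: v}

source = [{"customer": "A", "amount": 10},
          {"customer": "B", "amount": 5},
          {"customer": "A", "amount": 7}]

pipeline = aggregator(
    sorter(expression(source), key=lambda r: r["customer"]),
    group_key="customer",
    value_key="amount_with_tax",
)
print(list(pipeline))   # customer A totals ~20.4, customer B totals 6.0

In the real engines the picture is a bit more nuanced: an Aggregator can behave more like a streaming transformation when its input is already sorted on the group key, which is why sorted input is often recommended for it.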

Related

How to read from and write to the same dataset, e.g. to implement a caching mechanism

I have a general question: how would one read from and write to the same dataset, e.g. to implement something like a caching mechanism? Naively, this would create a cycle in the dependency graph and hence is not allowed?
What I want to do is something like:
if key not in cache:
    value = some_long_running_computation(key)
    cache[key] = value
return cache[key]
or an equivalent logic with PySpark dataframes.
I thought about incremental transforms, but they do not really fit this case since they do not allow checking whether a key exists in the cache, and hence you would always run your code under the brittle assumption that the cache is "complete" after the incremental transform.
Any ideas?
Thanks
Tobias
There is the ability to access the previous view of an output dataset within a transform. This is done in Python, when using a dataframe decorator, like so:
previous_output = foundry_output.dataframe("previous")
(You can also provide a schema.)
and in Java like so:
foundryOutput.getExistingOutputDataFrame().get()
However, I would encourage this to only be used when it's absolutely essential. There's a huge benefit to keeping your pipelines fully "repeatable"/"stateless" so that you can snapshot and recompute them any time and still get the same results repeatably.
Once you introduce a "statefulness" into your pipeline, doing certain things like adding a column to your output dataset becomes much harder, since you will have to write something akin to a database migration.
That being said, it's fine to use when you really need it, and ideally keep the impact of the added complexity small.
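To make the caching pattern from the question concrete, here is a minimal PySpark sketch built around that previous view. Only the foundry_output.dataframe("previous") call comes from the API above; the column names key/value and some_long_running_computation are hypothetical placeholders:

# Hedged sketch of the cache pattern using the previous output view.
# keys_df is assumed to have a "key" column; the cache has "key" and "value".

def update_cache(keys_df, foundry_output, some_long_running_computation):
    # Previous view of this output dataset, i.e. the cache as of the last build.
    cache_df = foundry_output.dataframe("previous")

    # Keys that are not in the cache yet.
    missing_df = keys_df.join(cache_df.select("key"), on="key", how="left_anti")

    # Run the expensive computation only for the missing keys
    # (assumed to return a dataframe with "key" and "value" columns).
    computed_df = some_long_running_computation(missing_df)

    # New cache contents = old cache plus the newly computed entries.
    return cache_df.unionByName(computed_df)

The anti-join is what gives you the "if not key in cache" check from the question; everything already in the previous output is skipped, and only new keys pay for the long-running computation.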

Analyzing multiple JSON files with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationships, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this ? Should I merge all the JSON first, or load them in a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guideline to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
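For example, here is a hedged PySpark sketch of that preprocessing step: reading a month of daily JSON logs and writing a much smaller, pre-aggregated Parquet table that Tableau (or an extract) can then consume. Paths and column names (timestamp, url) are placeholders:

# Pre-aggregate daily JSON logs with Spark and store the result as Parquet.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("log-preprocessing").getOrCreate()

logs = spark.read.json("/data/logs/2017-01-*.json")   # one file per day

daily_hits = (
    logs
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "url")
    .count()                                           # pre-aggregate before Tableau
)

daily_hits.write.mode("overwrite").parquet("/data/logs/daily_hits.parquet")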
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Additive functions like SUM(), MIN() and MAX() are safe: sums of partial sums are still correct sums. But averages of averages and count distincts of count distincts often are not.
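A tiny worked example of why averages of averages are not well defined:

# Two pre-aggregated groups with different row counts.
group_a = [10, 10, 10, 10]   # avg = 10, n = 4
group_b = [100]              # avg = 100, n = 1

avg_of_avgs = (sum(group_a) / len(group_a) + sum(group_b) / len(group_b)) / 2
true_avg = sum(group_a + group_b) / len(group_a + group_b)

print(avg_of_avgs)  # 55.0 -- wrong, ignores the group sizes
print(true_avg)     # 28.0 -- correct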
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query, which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related Stack Overflow answer.
For text files and extracts, Tableau loads them into memory via its Data Engine process today, to be replaced by a new in-memory database called Hyper in the future. The concept is the same though: Tableau sends the data source a query which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volumes of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer which nobody gave is: no, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First extract the column names from the JSON data as below (for example, using get_json_object in Hive or Spark SQL):
select
  get_json_object(JSON_column, '$.Attribute1') as Attribute1,
  get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for the required tables and join them in Tableau.

Which hash algorithm should I use to check for duplicate files

I have a WCF service which receives XML files (in a string parameter) for processing. Now I want to implement an error-logging procedure. I'd like to log each exception when it occurs, along with the XML file that generated the error.
I've created a MySQL database to do that, and the files will be stored in a long blob field.
My question is how I can avoid duplicates in the table that will store the files, since the user can submit the same file repeated times. To save storage space, I'd like to identify that the exact same file has already been saved and, in that case, just reuse the reference.
Which method is best for that? My first thought was generating a hash code and saving it in another field in the table, so I could use it to search later.
When searching for that I discovered that there are various algorithms available to calculate the hash:
System.Security.Cryptography.KeyedHashAlgorithm
System.Security.Cryptography.MD5
System.Security.Cryptography.RIPEMD160
System.Security.Cryptography.SHA1
System.Security.Cryptography.SHA256
System.Security.Cryptography.SHA384
System.Security.Cryptography.SHA512
Which one is better? Is it safe to use one of them to determine if the file is duplicated? What is the difference between using these methods and the .GetHashCode() function?
All hashes intrinsically have collisions, so you cannot use them to reliably identify a file. (If you attempt to, your system will appear to work fine for a while, the length of that while depending on random chance and the size of the hash, before failing catastrophically when it decides two completely different files are the same.)
Hashes may still be useful as the first step in a mechanism where the hash locates a "bucket" that can contain 0..n files, and you determine actual uniqueness by comparing the full file contents.
Since this is an application where speed of the hashing algorithm is a positive, I'd use MD5.
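Here is a minimal sketch of that bucket-plus-full-comparison idea (shown in Python with hashlib rather than the .NET classes above; the in-memory dict is just a stand-in for the MySQL table, with the hash as an indexed column and the blob holding the file contents):

# Hash-bucket deduplication: the hash narrows the search to a few candidates,
# and a full byte-for-byte comparison decides actual uniqueness.
import hashlib
from collections import defaultdict

buckets = defaultdict(list)   # md5 hex digest -> list of stored file contents

def store_file(content: bytes) -> bytes:
    """Return the stored copy, reusing an existing one if the content matches."""
    digest = hashlib.md5(content).hexdigest()
    for existing in buckets[digest]:
        if existing == content:        # full comparison resolves hash collisions
            return existing            # duplicate: reuse the existing reference
    buckets[digest].append(content)    # new file: store it
    return content

store_file(b"<order id='1'/>")
store_file(b"<order id='1'/>")         # detected as a duplicate
print(sum(len(v) for v in buckets.values()))   # 1 distinct file stored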

Cross Stream Data changes - EDW

I have a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B needs to be re-processed. So a common process is required to identify the changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, if you need to show multiple versions of one fact or dimension value to users etc.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here)
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source
3. If rows are completely new (i.e. the PK value in the raw data is new) they are processed normally
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures)
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer address changes, we have to re-assign him to a new sales region.
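As a hedged illustration of steps 3 to 5 in plain Python (our real implementation lives in stored procedures; loading_table and raw_batch are hypothetical stand-ins, keyed by the primary key):

# Classify an incoming batch into brand-new rows and updated rows by PK.
loading_table = {101: {"status": "shipped"}, 102: {"status": "open"}}
raw_batch     = {102: {"status": "closed"}, 103: {"status": "open"}}

new_rows, updated_rows = {}, {}
for pk, row in raw_batch.items():
    if pk not in loading_table:
        new_rows[pk] = row              # step 3: new PK, process normally
    elif loading_table[pk] != row:
        updated_rows[pk] = row          # steps 4-5: known PK, flag for re-processing

print(new_rows)       # {103: {'status': 'open'}}
print(updated_rows)   # {102: {'status': 'closed'}}

Everything flagged in updated_rows is what then drives the downstream re-generation, e.g. re-processing the whole order or re-assigning the customer.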
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?

How can I better visualize the path that a value takes through an SSIS Data Flow and the operations that are performed upon it?

I find that I regularly need to track a value through an SSIS Data Flow. The process that I use to do this is manual and time consuming. Are there any techniques or tools that I can use to reduce the effort (and potential for error)?
What I really need is a means of quickly identifying the data flow components that modify the values in a specific field and ideally the expressions within which it is referenced. I frequently get questions like 'Where did this value in the database come from?'. I would like to be able to answer with something like the following...
'The origin of this value is this field in this other database. It flows from the source to the destination through this data flow. Along the way, it is incremented here, negated there and concatenated with this other field there.'
You can use what's called a data viewer: How To Add a Data Viewer. It shows you the output data produced by each transformation component.
This is just a random thought, but as SSIS package files are just XML documents, couldn't you search the document for the field name you are interested in and find all the references to it that way?
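For example, a rough Python sketch of that idea: scan a .dtsx package (which is plain XML) for every element or attribute that mentions a given column name. The package path and column name are placeholders:

# Find every element/attribute in a .dtsx package that references a column name.
import xml.etree.ElementTree as ET

def find_column_references(dtsx_path, column_name):
    tree = ET.parse(dtsx_path)
    for elem in tree.iter():
        attrs = " ".join(f'{k}="{v}"' for k, v in elem.attrib.items())
        text = elem.text or ""
        if column_name in attrs or column_name in text:
            # Report the element tag plus enough context to locate it in the designer.
            yield elem.tag, attrs[:120]

for tag, context in find_column_references("Package.dtsx", "CustomerRevenue"):
    print(tag, context)

It will not tell you what an expression does to the value, but it does give you a quick list of every component that touches the field, which is usually the slow part of tracing it by hand.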