Why do I sometimes get a malformed data error for some Contour boards but not others? - palantir-foundry

I have a dataset that, when used as an input to build another dataset, produces a 'malformed record' error indicating that something is wrong with the raw data file (some malformed values). I would therefore expect not to be able to use that dataset in Contour at all. So why do I sometimes get a malformed record error for some boards in Contour but not others?

The answer here is that you will be able to perform some operations in Contour, and others you won't. This depends on whether the Spark job that Contour executes under the hood actually encounters the malformed records. Spark is lazily evaluated, so it won't perform every operation over all of the data - only what it needs to show you the results. So if the operation you perform in Contour doesn't touch the particular columns or rows where the malformed records exist, you'll be able to use the dataset.
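As a rough illustration of the same behaviour outside of Contour, here is a plain Spark sketch. Whether the malformed values blow up depends on which columns and rows an action actually forces Spark to parse; the file path, schema and FAILFAST option below are assumptions made for the example, not anything Contour-specific.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object LazyMalformedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-malformed-demo").master("local[*]").getOrCreate()

    // Hypothetical CSV in which some values in the "amount" column are not valid integers.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("name", StringType),
      StructField("amount", IntegerType)
    ))
    val df = spark.read
      .option("header", "true")
      .option("mode", "FAILFAST") // fail as soon as a malformed record is actually parsed
      .schema(schema)
      .csv("/tmp/example_with_bad_rows.csv")

    // Transformations are lazy: nothing has been read or parsed yet, so no error here.
    val projected = df.select("id", "name")

    // An action that never needs to convert the bad values (only a few rows are read,
    // and the "amount" column may be pruned entirely) can succeed...
    projected.show(5)

    // ...while an action that forces Spark over the malformed column surfaces the error:
    // df.agg(sum("amount")).show()

    spark.stop()
  }
}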

Related

Handling very large datasets (>1M records) using Power Automate

Before I go much further down this thought process, I wanted to check whether the idea was feasible at all. Essentially, I have two datasets, each of which will consist of ~500K records. For the sake of discussion, we can assume they will be presented in CSV files for ingesting.
Basically, what I'll need to do is take records from the first dataset, look each one up against the second dataset, and then merge the two together into an output CSV file with the results. The expected number of records after the merge will be in the range of 1.5-2M records.
So, my questions are,
Will Power Automate allow me to work with CSV datasets of those sizes?
Will the "Apply to each" operator function across that large of a dataset?
Will Power Automate allow me to produce the export CSV file with that size?
Will the process actually complete, or will it eventually just hit some sort of internal timeout error?
I know that I can use more traditional services like SQL Server Integration Services for this, but I'm wondering whether Power Automate has matured enough to handle this level of ETL operation.

Sequence of mysql queries in Spark

I have a requirement in Spark where I need to fetch data from MySQL instances and, after some processing, enrich it with more data from a different MySQL database.
However, when I try to access the database again from inside a map function, I get a
org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
My code looks like this:
val reader = sqlContext.read
initialDataset.map( r =>
  reader.jdbc(jdbcUrl, s"(select enrichment_data from other_table where id='${r.getString(1)}') result", connectionProperties)
    .rdd.first().get(0).toString
)
Any ideas / pointers? Should I use two different Datasets? Thanks!
First of all, the function you pass to map() takes a row from the existing Dataset, applies your changes and returns the updated row; to do that it has to be serialized and shipped to the executors. That is why you get this exception: Scala can't serialize the closure containing reader.jdbc(jdbcUrl, ...
To solve your issue you have multiple options according to your needs:
You could broadcast one of these datasets after collecting it. With a broadcast, the dataset is stored in each executor's memory, so this only works if it is reasonably small and fits there. You could then just look values up in it and combine the results with the 2nd dataset.
If both datasets are big and not suitable for loading into executor memory, then use mapPartitions (you can find more information about mapPartitions here). mapPartitions is called once per partition instead of once per element, as map() is. If you choose this option you can access the 2nd database from within mapPartitions - either looking up individual rows or retrieving all the related records for the partition up front. Both options are sketched below.
Please be aware that I assumed these two datasets have some kind of dependency (e.g. you need to access some value from the 2nd database before executing the next step). If they don't, then just create both ds1 and ds2 and use them normally as you would any dataset. Finally, remember to cache the datasets if you are sure you'll need to access them multiple times.
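Here is a rough sketch of both options in Spark/Scala. The JDBC URLs, table and column names, column positions and credentials are placeholders (apart from other_table and enrichment_data from the question), so treat it as an illustration of the pattern rather than a drop-in fix.

import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object EnrichmentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enrichment-sketch").getOrCreate()
    import spark.implicits._

    val connectionProperties = new java.util.Properties() // user/password would go here
    val mainJdbcUrl  = "jdbc:mysql://main-host:3306/maindb"   // placeholder
    val otherJdbcUrl = "jdbc:mysql://other-host:3306/otherdb" // placeholder

    val initialDataset = spark.read.jdbc(mainJdbcUrl, "main_table", connectionProperties)

    // Option 1: the enrichment table is small - collect it on the driver and broadcast it.
    val enrichmentMap = spark.read.jdbc(otherJdbcUrl, "other_table", connectionProperties)
      .select("id", "enrichment_data") // assumes both columns are strings
      .as[(String, String)]
      .collect()
      .toMap
    val enrichmentBc = spark.sparkContext.broadcast(enrichmentMap)

    val enrichedViaBroadcast = initialDataset.map { r =>
      val key = r.getString(1) // same column position as in the question
      (key, enrichmentBc.value.getOrElse(key, "n/a"))
    }

    // Option 2: the enrichment table is too big to broadcast - open one JDBC connection
    // per partition with mapPartitions and look rows up individually.
    val enrichedViaMapPartitions = initialDataset.mapPartitions { rows =>
      val conn = DriverManager.getConnection(otherJdbcUrl, "user", "password") // placeholder credentials
      val stmt = conn.prepareStatement("select enrichment_data from other_table where id = ?")
      val result = rows.map { r =>
        stmt.setString(1, r.getString(1))
        val rs = stmt.executeQuery()
        val value = if (rs.next()) rs.getString(1) else "n/a"
        rs.close()
        (r.getString(1), value)
      }.toList // materialize before closing the connection, since the iterator is lazy
      stmt.close()
      conn.close()
      result.iterator
    }

    enrichedViaBroadcast.show(5)
    enrichedViaMapPartitions.show(5)
    spark.stop()
  }
}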
Good luck

Analyzing multiple Json with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationship, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this ? Should I merge all the JSON first, or load them in a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guideline to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Functions like SUM(), MIN() and MAX() re-aggregate safely - sums of partial sums are still correct sums - but averages of averages and count distincts of count distincts often are not.
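A toy illustration of that pitfall (plain Scala, with made-up numbers):

// Two pre-aggregated groups, as they might appear in an extract.
val groupA = Seq(10.0, 10.0, 10.0, 10.0) // avg = 10, 4 rows
val groupB = Seq(100.0)                  // avg = 100, 1 row

// Sums re-aggregate safely: 40 + 100 = 140 either way.
val trueAverage       = (groupA ++ groupB).sum / (groupA.size + groupB.size)        // 28.0
val averageOfAverages = (groupA.sum / groupA.size + groupB.sum / groupB.size) / 2   // 55.0 - misleading

println(s"true average = $trueAverage, average of averages = $averageOfAverages")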
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related stack overflow answer
For text files and extracts, Tableau loads them into memory via its Data Engine process today -- to be replaced by a new in-memory database called Hyper in the future. The concept is the same though: Tableau sends the data source a query, which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volume of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer which nobody gave is: no, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe you can join two JSON tables in Tableau.
First extract the columns from the JSON data (e.g. using the get_json_object function in Hive or Spark SQL) as below:
select
  get_json_object(json_column, '$.Attribute1') as Attribute1,
  get_json_object(json_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for the required tables and join them.
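If you go the Spark preprocessing route suggested above, a rough equivalent in Scala looks like this (the paths and attribute names are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

object FlattenJsonLogs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flatten-json-logs").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical layout: one JSON object per line in each daily log file.
    val raw = spark.read.text("/logs/day-*.json").toDF("line")

    val flattened = raw.select(
      get_json_object($"line", "$.Attribute1").alias("Attribute1"),
      get_json_object($"line", "$.Attribute2").alias("Attribute2")
    )

    // Write a flat, columnar copy that Tableau or a database can consume directly.
    flattened.write.mode("overwrite").parquet("/logs/flattened")
    spark.stop()
  }
}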

informatica - Can anyone please explain how mappings and transformations execute on input data

I've been searching around and did not find a good post explaining how mappings and transformations process input data.
Does the entire mapping execute for each record? (I understand that can't be the case, as Sorter, Aggregator, etc. need to see the entire data set (or a defined batch size) in order to do their job.)
Maybe it depends on the transformation type - certain transformations may hold the entire input (or a defined batch) before they release any output.
Maybe this question applies to all data integration tools.
Briefly speaking: this depends on the transformation type - e.g. Expression processes rows without caching them, Aggregator and Sorter need to cache all of the input data, and Joiner caches just the Master pipeline.
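As a language-agnostic illustration of that difference (a toy sketch in Scala, not Informatica code): a row-wise transformation can emit each row as soon as it sees it, while a blocking one has to cache its whole input before producing anything.

object TransformationStyles {
  type Row = Map[String, String]

  // Expression-style: transform and emit each row immediately, nothing is cached.
  def expressionLike(rows: Iterator[Row]): Iterator[Row] =
    rows.map(r => r + ("full_name" -> s"${r("first")} ${r("last")}"))

  // Sorter-style: every input row must be cached before the first output row can be emitted.
  def sorterLike(rows: Iterator[Row]): Iterator[Row] =
    rows.toSeq.sortBy(_("last")).iterator

  def main(args: Array[String]): Unit = {
    val input = Iterator(
      Map("first" -> "Grace", "last" -> "Hopper"),
      Map("first" -> "Alan", "last" -> "Turing")
    )
    sorterLike(expressionLike(input)).foreach(println)
  }
}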

How can I better visualize the path that a value takes through an SSIS Data Flow and the operations that are performed upon it?

I find that I regularly need to track a value through an SSIS Data Flow. The process that I use to do this is manual and time consuming. Are there any techniques or tools that I can use to reduce the effort (and potential for error)?
What I really need is a means of quickly identifying the data flow components that modify the values in a specific field and ideally the expressions within which it is referenced. I frequently get questions like 'Where did this value in the database come from?'. I would like to be able to answer with something like the following...
'The origin of this value is this field in this other database. It flows from the source to the destination through this data flow. Along the way, it is incremented here, negated there and concatenated with this other field there.'
You can use what's called a data viewer: How To Add a Data Viewer. It shows you the data output by each transformation component.
This is just a random thought, but as SSIS package files are just XML documents, couldn't you search the document for the field name you are interested in and find all the references to it that way?
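Building on that idea, a minimal sketch (Scala, with a hypothetical package path and column name) that prints every line of a .dtsx package mentioning a given column:

import scala.io.Source

object FindColumnReferences {
  def main(args: Array[String]): Unit = {
    val packagePath = "MyPackage.dtsx" // hypothetical package file (.dtsx is plain XML)
    val columnName  = "CustomerTotal"  // hypothetical column to trace

    val source = Source.fromFile(packagePath)
    try {
      source.getLines().zipWithIndex.collect {
        case (line, idx) if line.contains(columnName) => f"${idx + 1}%6d: ${line.trim}"
      }.foreach(println)
    } finally {
      source.close()
    }
  }
}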