How to keep a unique set of keys with an incremental transformation in Palantir Foundry?

I am trying to save compute in a Python transform in Foundry.
I want to run my code incrementally, but I also want to keep a unique set of keys, without having to do a full snapshot read of the whole dataset and then run the deduplication.
If I try something like df_out = df.select("composite_key").dropDuplicates(), I am afraid it reads the full input dataset; I want to make use of the deduplication I have already done in previous builds.

The trick here is to use the previous version of the output dataset (in the snippets below, out is the transform's incremental output):
new_keys = df.select("composite_key")
df_out = new_keys.unionByName(
    out.dataframe('previous', schema=new_keys.schema)
).dropDuplicates()
Using this pattern you don't need to do a lookup on the full dataset: you take the previously computed unique set of keys, union them with the new data, and de-dupe again.

If there are other columns in the new data but you still want to de-dupe by key, you can use this approach.
# If there may be duplicates within the new data itself, de-dupe it first:
# df = df.dropDuplicates(['composite_key'])
df_prev = out.dataframe(mode='previous', schema=df.schema)
# Keep the new row for any key that already exists;
# swap the two tables to keep the old row instead.
existing = df_prev.join(df, on='composite_key', how='leftanti')
result = existing.unionByName(df)
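Putting this together, here is a minimal sketch of what the full incremental transform could look like. The dataset paths are hypothetical, and the replace write mode is my assumption about how the merged result should be persisted, so treat this as an illustration rather than the original author's exact code:

from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    out=Output("/Project/datasets/deduped_output"),  # hypothetical path
    source=Input("/Project/datasets/raw_input"),     # hypothetical path
)
def compute(source, out):
    # In an incremental run, 'added' contains only the rows appended since
    # the previous build, so there is no full read of the input dataset.
    df = source.dataframe('added')

    # 'previous' is the output as it stood after the last successful build.
    df_prev = out.dataframe('previous', schema=df.schema)

    # Keep the new row whenever a composite_key already exists.
    existing = df_prev.join(df, on='composite_key', how='leftanti')
    result = existing.unionByName(df)

    # Assumption: because 'result' already includes the previous rows,
    # write it as a replacement rather than appending to the output.
    out.set_mode('replace')
    out.write_dataframe(result)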

Related

JSON flattening in AWS Glue ETL job creates inferred schema with duplicated columns

I'm relatively new to AWS Glue and using the visual AWS Glue studio at the moment. Kind of a niche issue I'm having here...
Context:
I'm building an ETL job that, among other things, should parse/flatten JSON from a string column and replace it with separate fields, in the right formats, which I can then select and load into my data warehouse table.
Approach:
I first extract my data from the Glue catalog as a DynamicFrame (in this case only one table).
Then I'm trying to use the approach of unboxing and unnesting.
Let's call that json column data:
from awsglue.transforms import Unbox, UnnestFrame
from awsglue.dynamicframe import DynamicFrameCollection

def transformTable(glueContext, dfc) -> DynamicFrameCollection:
    # Take the single incoming frame, unbox the JSON string column, then flatten it.
    dyf = dfc.select(list(dfc.keys())[0])
    dyf = Unbox.apply(frame=dyf, path="data", format="json")
    dyf = UnnestFrame.apply(frame=dyf)
    return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)
(Then I have a step to select the right frame from the frame collection, and then I can apply mapping to my fields and load.)
My issue:
Glue automatically infers the data types of my frame's schema (rather successfully),
but it duplicates certain fields when the data type is ambiguous (similar to make_cols in the resolveChoice method). For example, I end up with two fields in the output schema, price_int and price_double, where price_int contains only the values that happened to be round numbers and nulls everywhere else, etc.
So it seems like the default behavior of this method is to split columns in case of data type doubt (make_cols).
I understand that I could write a resolveChoice for each field, but with this approach they are already split into separate columns in the output schema.
Note: There are dozens of fields in this json, so I'm trying to devise a blanket solution that automatically makes all the fields of the json available in the schema to select and map in the next step, and avoid having to add one line of code for each field I want to extract. (And the json structure will grow with new fields in the future, so I'm trying to limit future ETL maintenance...)
Questions/help needed:
Any idea if there's a way to change this default behavior (like in the resolveChoice method)?
Alternatively, is there a way to apply a kind of default resolveChoice to all problematic fields from the JSON unboxing? For instance, I could force all problematic fields into strings (similar to 'project:string') and then reformat them if needed in the applyMapping step. But resolveChoice seems to need to be applied field by field... (see the sketch after this question).
What's a different/better approach I could try? I would like to keep it as dynamic/automated as possible... e.g.:
I think I could maybe extract specific fields from the JSON line by line, but I'm not sure how (looks like the Unbox method is already splitting columns by format). And as explained, it's dozens of fields and growing... so it requires updating the code regularly, instead of just ticking boxes in the list of available fields.
The Relationalize method could be an option, but it creates distinct frames, and this quickly becomes much more complex (there are actually several columns with JSON, which all need to be flattened...).
Creating crawlers or classifiers that run automatically and regularly to extract the schema from that specific string column of the table could be an option as well...
Thanks in advance!
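As an illustrative sketch (not an answer from the original thread), the blanket resolution idea from the question might look roughly like this. The assumption is that ResolveChoice's choice argument accepts the same actions as the per-field specs (such as cast:string or project:string); that is worth verifying against the Glue documentation before relying on it:

from awsglue.transforms import Unbox, UnnestFrame, ResolveChoice
from awsglue.dynamicframe import DynamicFrameCollection

def transformTable(glueContext, dfc) -> DynamicFrameCollection:
    dyf = dfc.select(list(dfc.keys())[0])
    dyf = Unbox.apply(frame=dyf, path="data", format="json")
    dyf = UnnestFrame.apply(frame=dyf)
    # Assumption: one blanket choice resolves every ambiguous field at once,
    # instead of splitting each into price_int / price_double style columns.
    # cast:string converts values; project:string keeps only string values.
    dyf = ResolveChoice.apply(frame=dyf, choice="cast:string")
    return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)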

SSIS consolidate and concatenate multiple rows into single rows without using SQL

I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table and then using a query or view to obtain the summarized results, as follows. I created a SQL table called TESTTABLE that contains [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to the Account. I then wrote the following query as a view:
SELECT distinct accountid,
(SELECT TT2.ari_productsummary + '; '
FROM TESTTABLE TT2
WHERE TT2.accountid = TT.accountid
FOR XML PATH ('')
) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that query gives me the results that I want, which I can then use for importing into the Account entity.
But how do I do this in an SSIS data flow without writing to a SQL table as a temporary placeholder for the data? I want to do the entire process inside one data flow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly.
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamics 365 ETL toolset to develop my solution/package.
Spitballing here, as I don't have Dynamics, nor do I have the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Component -> Cache Transform
If you want to keep the sort in there, do it before the Script Component. The approach I'll take with the Script Component is that it's fully blocking - that is, all the rows must arrive before it can send any on. Transformations like the Merge Join are only partially blocking, because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Component is going to be an asynchronous transformation. You'll have two output columns: your key, accountid, and your new derived column, ari_activecontractitems. That column might need to be big - you'll know your data best, but if it's a blob type in Dynamics (> 4k unicode or > 8k ascii characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT.
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
In the pre-processing method (PreExecute), we'll initialize our variable:
// initialize in PreExecute
accumulator = new Dictionary<string, List<string>>();
In the per-row method (Input0_ProcessInputRow; the exact name depends on your input's name):
// For each inbound row: row_id would be something like Row.accountid
// and invoice something like Row.ariproductsummary
if (!accumulator.ContainsKey(row_id))
{
    // First time we've seen this key - start an empty list for it
    accumulator.Add(row_id, new List<string>());
}
// Only add the value if we don't already have it (keeps the list distinct)
if (!accumulator[row_id].Contains(invoice))
{
    accumulator[row_id].Add(invoice);
}
Once you get the signal that no more data is available, that's when you start sending output rows. The auto-generated code will have placeholders for all of this.
// This is how we shove data out the pipe
foreach (var kvp in accumulator)
{
    // one output row per key, with the distinct values joined together
    OutputBuffer1.AddRow();
    OutputBuffer1.row_id = kvp.Key;
    OutputBuffer1.ari_productsummary = string.Join("; ", kvp.Value);
}
We have an upcoming release that comes with a component that does exactly what you are trying to achieve, without the need to write custom code. The feature is currently in preview; please reach out to us for private access to it. You can find our contact information on our website.
UPDATE - June 5, 2020: we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as part of our 2020 Release Wave 1. We have two components that serve this kind of purpose. The Composition component takes input values and transforms them into a composite value in an SSIS column. The Decomposition component does the opposite: it takes an input value and splits it into multiple rows, using either delimiter-based text splitting or XML/JSON array splitting.

Apache NiFi: Creating new column using a condition

I have asked a similar question. Yet I wasn't able to find a solution for my problem through that approach. I have a csv which looks like this:
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
I want to add a new column and populate it according to the data in the number field. This is the logic:
If the first 4 digits of the value in the number column are equal to "0763", the new column (named status) must be set to INSIDE; for any other value it is OUTSIDE.
Following that logic, the output must look like this:
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
My Approach
I tried to achieve this by first duplicating the number column into the status column, and then taking the first 4 digits and working from there.
I hope you can suggest a NiFi workflow to make this possible.
I used the UpdateRecord processor twice and got the results that you want.
Input
I started with your input data.
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
Process
First, set the UpdateRecord processor as follows:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Record Path Value
/status: /number
This will create the new status column with the value of the number column.
Second, the first output should go to another UpdateRecord processor with the options
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Literal Value
/status: ${field.value:substring(0,4):equals('0763'):ifElse(${field.value:replace(${field.value},'INSIDE')},${field.value:replace(${field.value},'OUTSIDE')})}
and this will give you the final results.
Be aware that the number column must not be read as an integer (the leading zero matters), so set the CSVReader's Schema Access Strategy option to Use String Fields From Header.
Output
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
You can try the logic below:
SplitText ->
ExtractText ->
RouteOnAttribute (add a condition checking whether the first four digits are 0763)
-----Matched relation--> ReplaceText (extracted attribute from the file + "INSIDE") -> PutFile
-----Unmatched relation--> ReplaceText (extracted attribute from the file + "OUTSIDE") -> PutFile
Hope this helps.

Define custom POST method for MyDAC

I have three tables: objects (primary key object_ID), flags (primary key flag_ID), and object_flags (a cross table between objects and flags with some extra info).
I have a query returning all flags, and a one or zero indicating whether a given object has a certain flag:
SELECT
    f.*,
    of.*,
    of.object_ID IS NOT NULL AS object_has_flag
FROM
    flags f
    LEFT JOIN object_flags of
        ON (f.flag_ID = of.flag_ID) AND (of.object_ID = :objectID);
In the application (which is written in Delphi), all rows are loaded in a component. The user can assign flags by clicking check boxes in a table, modifying the data.
Suppose one line is edited. Depending on the value of object_has_flag, the following things have to be done:
If object_has_flag was true and still is true, an UPDATE should be done on the relevant row in object_flags.
If object_has_flag was false but is now true, an INSERT should be done.
If object_has_flag was true but is now false, the row should be deleted.
It seems that this cannot be done in one query https://stackoverflow.com/questions/7927114/conditional-replace-or-delete-in-one-query.
I'm using MyDAC's TMyQuery as a dataset. I have written separate code that executes the necessary queries to save changes to a row, but how do I couple this to the dataset? What event handler should I use, and how do I tell the TMyQuery that it should refresh instead of post?
EDIT: apparently, it is not completely clear what the problem is. The standard UpdateSQL, DeleteSQL and InsertSQL cannot be used because sometimes after editing a line (not deleting it or inserting a line), an INSERT or DELETE has to be done.
The short answer is, to paraphrase your own answer:
Look up the documentation for "Updating Data with MyDAC Dataset Components" (as of MyDAC 5.80).
Every TCustomDADataSet descendant (such as TMyQuery) has the capability to set update SQL statements using the SQLInsert, SQLUpdate and SQLDelete properties.
TMyUpdateSQL is also a promising component for custom update operations.
It seems that the easiest way is to use the BeforePost event, and determine what has to be done using the OldValue and NewValue properties of several fields.

Help me optimize an ActiveRecord object with too many attributes

I'm working on an app which ties to a legacy database. The primary model is based on a stupidly large 100+ column table. I don't know too much about the inner workings of ActiveRecord, but it seems to me that any request on this model slows down because it creates objects with 100+ attributes. Let's call this SlowModel.
Rendering pages with this model sometimes takes 17 seconds on my dev computer. Straight-up MySQL queries only take ~0.5-1 second.
I've managed to speed up one portion of the app by using a MySQL view that selects a subset of fields (20 or so). We'll call this QuickModel. Using views is OK but isn't the most portable solution.
I will likely continue to try and add this QuickModel into other parts of the site, but I was wondering if anyone had other ideas for speeding up the original object. For instance, is there a way to specify in the model which columns ActiveRecord should just ignore and avoid building? Maybe there are specific column types (:text?) that cause bloat in ActiveRecord objects.
Assume that columns have proper indices.
You can specify which columns are returned in the model lookup using the :select option of the ActiveRecord lookup:
SlowModel.all(:select => 'id, col1, col2, col3')
...will load instances of SlowModel with only the specified columns populated.
How about having a completely new QuickModel that sits in its own table... where a QuickModel has_one SlowModel?
You can use SQL to move the most-necessary data into the QuickModel table and only refer to the SlowModel using my_quick_model.slow_model when necessary.
Alternatively, you can add a "select" to the default scope (you can google "rails default scope" for more). By default it'll only fetch the reduced set - but you can ask for all attributes by passing :select => "*" if necessary.
Along the lines of what Winfield is saying, you may want to take a look at using an attribute tracker like SlimScrooge. The tracker attempts to fetch only the data that you're using, which reduces overhead. It attempts to automatically do what Winfield is suggesting.
Example from the Readme:
# 1st request, sql is unchanged but columns accesses are recorded
Brochure Load SlimScrooged 1st time (27.1ms) SELECT * FROM `brochures` WHERE (expires_at IS NULL)
# 2nd request, only fetch columns that were used the first time
Brochure Load SlimScrooged (4.5ms) SELECT `brochures`.expires_at,`brochures`.operator_id,`brochures`.id FROM `brochures` WHERE (expires_at IS NULL)
# 2nd request, later in code we need another column which causes a reload of all remaining columns
Brochure Reload SlimScrooged (0.6ms) `brochures`.name,`brochures`.comment,`brochures`.image_height,`brochures`.id, `brochures`.tel,`brochures`.long_comment,`brochures`.image_name,`brochures`.image_width FROM `brochures` WHERE `brochures`.id IN ('5646','5476','4562','3456','4567','7355')
# 3rd request
Brochure Load SlimScrooged (4.5ms) SELECT `brochures`.expires_at,`brochures`.operator_id,`brochures`.name, `brochures`.id FROM `brochures` WHERE (expires_at IS NULL)