How does Foundry Magritte append ingestion handle deleted rows in the data source? - palantir-foundry

If I have a Magritte ingestion that is set to append, will it detect if rows are deleted in the source data? Will it also delete the rows in the ingested dataset?

For your first question, on whether deletions are detected: this will depend on the database implementation you are extracting from (I'll assume JDBC for this answer). If a deletion shows up as a modification, and therefore as a new row, then yes, your deletes will show up.
This would look something like the following at first:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | CREATE | 0 |
| key_2 | 2 | CREATE | 0 |
| key_3 | 3 | CREATE | 0 |
Followed by some updates (in a subsequent run, incremental on update_ts):
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | UPDATE | 1 |
| key_2 | 2 | UPDATE | 1 |
Now your database would have to explicitly mark deleted rows with a DELETE update_type and increment update_ts for the deletion to be brought in:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | DELETE | 2 |
After this, you would then be able to detect the deleted records and adjust accordingly. Your full materialized table view will now look like the following:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | CREATE | 0 |
| key_2 | 2 | CREATE | 0 |
| key_3 | 3 | CREATE | 0 |
| key_1 | 1 | UPDATE | 1 |
| key_2 | 2 | UPDATE | 1 |
| key_1 | 1 | DELETE | 2 |
If you are running your raw ingestion incrementally, these rows will not be automatically deleted from your dataset; you'll have to explicitly write logic to detect these deleted records and remove them in your downstream clean step. If deletes are found, you'll have to SNAPSHOT the output to remove them (unless you're doing lower-level file manipulations where you could perhaps remove the underlying file).
It's worth noting you'll want to materialize the DELETES as late as possible (assuming your intermediate logic allows for it) since this will require a snapshot and will kill your overall pipeline performance.
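As a rough sketch only (not the canonical implementation), assuming Foundry's Python transforms API and the column names from the example above, the clean step could keep the latest record per primary key and drop any key whose latest record is a DELETE. The dataset paths below are placeholders:
from pyspark.sql import Window
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output

# Placeholder dataset paths.
@transform_df(
    Output("/project/clean/table_current"),
    raw=Input("/project/raw/table_append_log"),
)
def compute(raw):
    # Most recent record per primary key, based on update_ts.
    latest = Window.partitionBy("primary_key").orderBy(F.col("update_ts").desc())
    current = (
        raw.withColumn("rn", F.row_number().over(latest))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )
    # Drop keys whose most recent record is a DELETE. Because rows can vanish
    # from the output, this step has to be materialized as a SNAPSHOT rather
    # than incrementally.
    return current.filter(F.col("update_type") != "DELETE")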
If you aren't dealing with JDBC, then @Kellen's answer will apply.

If this is a file-based ingestion (as opposed to JDBC), note that Magritte operates on files, not on rows. If the transaction type for the ingestion is set to UPDATE and you make changes to the file, including deleting rows, then when the ingestion runs the new file will completely replace the existing file in that dataset, so any changes made in the file will be reflected in the dataset.
Two additional notes:
If you have the 'exclude files already synced' filter, you will probably want the last-modified-date and/or file-size options enabled, or the modified file won't be ingested.
If your transaction type is set to APPEND rather than UPDATE, the ingestion will fail, because APPEND doesn't allow changes to existing files.

Related

FDQuery and OnCalcFields, get the previous line

Delphi 10.3.3
FireDAC: DBGrid / FDQuery / MySQL
VCL
Hi all,
I have a table with these fields
--------------------
| id | data        |
--------------------
| 1  | 0=A;1=B;2=C |
| 2  | 2=Z         |
| 3  |             |
| 4  | 0=Y;1=X     |
| 5  |             |
| 6  |             |
Each row's data column holds only the changes relative to the previous row.
I would like this to be displayed in a DBGrid:
---------------------
| id | C0 | C1 | C2 |
---------------------
| 1  | A  | B  | C  |
| 2  | A  | B  | Z  |
| 3  | A  | B  | Z  |
| 4  | Y  | X  | Z  |
| 5  | Y  | X  | Z  |
| 6  | Y  | X  | Z  |
What I can do for now is only the following table:
---------------------
| id | C0 | C1 | C2 |
---------------------
| 1  | A  | B  | C  |
| 2  |    |    | Z  |
| 3  |    |    |    |
| 4  | Y  | X  |    |
| 5  |    |    |    |
| 6  |    |    |    |
To obtain this result, I create additional columns in the FDQuery1.BeforeOpen event.
In the OnCreateFields event I fill each column, but I don't know the previous row's content.
So how can I fill in the missing fields in the DBGrid?
Thanks
Franck
I think you mean OnCalcFields, rather than OnCreateFields.
What you need
is certainly possible, either server-side by deriving the necessary values from the prior
row using e.g. a SQL subquery or client-side using calculated fields. This answer is about doing it
client-side.
The problem with doing client-side calculations involving another dataset row is that
to do this you need to be able to move the dataset cursor during the OnCalcFields event. However, at the time, the DataSet will be in either dsCalcFields or dsInternalCalc state
and, while it is, you can't easily move to another row in the dataset. It is possible to do this, but
requires declaring a descendant dataset class (TMyFDQuery) so that you can access the protected SetTempState method
needed to revert to the prior state after you've picked up the necessary info from the "other"
row; and, if what you need involves more than one field, you need somewhere to store the values temporarily.
So doing it that way gets messy.
A much cleaner approach involves using functional similarity between FireDAC's datasets and TClientDataSets.
One of the nice features of TClientDataSets is the ease with which you can move the dataset contents between
two CDSs simply by doing
CDS2.Data := CDS1.Data;
FireDAC datasets can do the same trick, but between any FD dataset types. So here is what I would do in your
situation:
Add an FDMemTable to your form/datamodule and copy the query data into it in the FDQuery's AfterOpen event like
this:
procedure TForm2.FDQuery1AfterOpen(DataSet: TDataSet);
begin
  FDQuery1.DisableControls;
  try
    FDMemTable1.Data := FDQuery1.Data;
    FDMemTable1.Open;
  finally
    FDQuery1.First;
    FDQuery1.EnableControls;
  end;
end;
The FDQuery1.First is to force it to re-do its calculated fields once the FDMemTable data is available
(during the initial FDQuery1.Open, it can't be, of course).
In the FDQuery's OnCalcFields event, use code like this to base the calculated fields'
values on values picked up from the prior row (if there is one, of course; the first
row can't have a "prior" row):
procedure TForm2.FDQuery1CalcFields(DataSet: TDataSet);
begin
  if FDMemTable1.Active then begin
    if FDMemTable1.Locate('ContactID', FDQuery1.FieldByName('ContactID').AsInteger, []) then begin
      FDMemTable1.Prior;
      if not FDMemTable1.Bof then begin
        // Set FDQuery1's calculated fields that depend on the prior row
        FDQuery1.FieldByName('PriorRowID').AsInteger := FDMemTable1.FieldByName('ContactID').AsInteger;
      end;
    end;
  end;
end;
In this example, my queried dataset has a ContactID primary key and the calculated value is simply the ContactID value from the prior row. In real life, of course, it
would be more efficient to use persistent field variables rather than keep calling FieldByName.
I suppose another possibility might be to use the CloneCursor method to obtain a lookup cursor
to access the "prior" row, but I've not tried that myself and it may not be possible anyway
(what happens to the calculated fields in the CloneCursor copy?).

Is implementing an enrichment using Spark with MySQL a bad idea?

I am trying to build one giant schema that makes it easier for data users to query. To achieve that, streaming events have to be joined with User Metadata on USER_ID and ID. In data engineering, this operation is called "Data Enrichment", right? The tables below are an example.
# `Event` (Stream)
+---------+--------------+---------------------+
| USER_ID | EVENT        | TIMESTAMP           |
+---------+--------------+---------------------+
| 1       | page_view    | 2020-04-10T12:00:11 |
| 2       | button_click | 2020-04-10T12:01:23 |
| 3       | page_view    | 2020-04-10T12:01:44 |
+---------+--------------+---------------------+
# `User Metadata` (Static)
+----+-------+--------+
| ID | NAME  | GENDER |
+----+-------+--------+
| 1  | Matt  | MALE   |
| 2  | John  | MALE   |
| 3  | Alice | FEMALE |
+----+-------+--------+
==> # Result
+---------+--------------+---------------------+-------+--------+
| USER_ID | EVENT        | TIMESTAMP           | NAME  | GENDER |
+---------+--------------+---------------------+-------+--------+
| 1       | page_view    | 2020-04-10T12:00:11 | Matt  | MALE   |
| 2       | button_click | 2020-04-10T12:01:23 | John  | MALE   |
| 3       | page_view    | 2020-04-10T12:01:44 | Alice | FEMALE |
+---------+--------------+---------------------+-------+--------+
I was developing this using Spark, and the User Metadata is stored in MySQL. Then I realized it would be a waste of Spark's parallelism if the Spark code joins directly against MySQL tables, right?
The bottleneck will be on MySQL if traffic increases, I guess.
Should I store those tables in a key-value store and update it periodically?
Can you give me some ideas to tackle this problem? How do you usually handle this type of operation?
Solution 1:
As you suggested, you can keep a local cached copy of the table as key-value pairs and update the cache at a regular interval.
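For illustration only (not part of the original answer), here is a minimal PySpark sketch of that idea: pull the metadata out of MySQL once per refresh interval, cache and broadcast it, and join the events against that in-memory copy instead of hitting MySQL from every task. The connection details, table name and refresh mechanism are placeholders:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user-enrichment").getOrCreate()

def load_user_metadata():
    # Pull the small, mostly-static User Metadata table out of MySQL.
    # Re-run this on a timer to refresh the cached copy.
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/mydb")  # placeholder
        .option("dbtable", "user_metadata")                  # placeholder
        .option("user", "reader")
        .option("password", "secret")
        .load()
        .cache()
    )

user_metadata = load_user_metadata()

def enrich(events_df):
    # Broadcast the cached metadata so each task joins in memory
    # instead of querying MySQL.
    return events_df.join(
        F.broadcast(user_metadata),
        events_df["USER_ID"] == user_metadata["ID"],
        "left",
    )

# Example with a static batch of events; in a streaming job you would apply
# enrich() to each micro-batch (e.g. inside foreachBatch).
events = spark.createDataFrame(
    [(1, "page_view", "2020-04-10T12:00:11"),
     (2, "button_click", "2020-04-10T12:01:23")],
    ["USER_ID", "EVENT", "TIMESTAMP"],
)
enrich(events).show()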
Solution 2:
You can use a MySQL-to-Kafka connector such as the one below:
https://debezium.io/documentation/reference/1.1/connectors/mysql.html
For every DML or table-alter operation on your User Metadata table, a corresponding event is fired to a Kafka topic (e.g. db_events). You can run a thread in parallel in your Spark streaming job which polls db_events and updates your local key-value cache.
This solution would make your application near-real-time in the true sense.
One overhead I can see is that you will need to run a Kafka Connect service with the MySQL connector (i.e. Debezium) as a plugin.
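As a rough sketch of such a cache-updating thread, assuming the kafka-python client and Debezium's default JSON envelope (the topic name, servers and the ID column name are placeholders):
import json
import threading

from kafka import KafkaConsumer  # pip install kafka-python

# Shared cache: user ID -> metadata row. Guarded by a lock because the
# consumer thread and the enrichment code both touch it.
user_cache = {}
cache_lock = threading.Lock()

def follow_user_changes(servers="localhost:9092", topic="db_events"):
    # Tail the Debezium topic and keep the local cache up to date.
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v) if v else None,
    )
    for message in consumer:
        payload = (message.value or {}).get("payload", {})
        op = payload.get("op")  # 'c' = create, 'u' = update, 'd' = delete, 'r' = snapshot read
        if op in ("c", "u", "r"):
            row = payload["after"]
            with cache_lock:
                user_cache[row["ID"]] = row
        elif op == "d":
            row = payload["before"]
            with cache_lock:
                user_cache.pop(row["ID"], None)

# Run the follower in the background alongside the streaming job.
threading.Thread(target=follow_user_changes, daemon=True).start()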

Finding closest value. How to tell MySQL that the data is already ordered?

Let's say I have a table like the following:
+------------+------------+------+-----+---------+
| Field      | Type       | Null | Key | Default |
+------------+------------+------+-----+---------+
| datetime   | double     | NO   | PRI | NULL    |
| some_value | float      | NO   |     | NULL    |
+------------+------------+------+-----+---------+
The date has to be a double and is stored as Unix time with fractional seconds (there is no possibility of installing MySQL 5.6 to get fractional DATETIME). In addition, the values of the datetime field are not only the primary key, they are also always increasing. I would like to find the row closest to a certain value. Usually you can use something like:
select * from table order by abs(datetime - $myvalue) limit 1
However, I'm afraid that this implementation will be slow for hundreds of thousands of values, because it is going to scan the whole table. And since I have an ordered list, I know I could do some kind of binary search to speed up the process, but I have no idea how to tell MySQL to perform such a search.
To test the performance, I run the following:
SET profiling = 1;
SELECT * FROM table order by abs(datetime - $myvalue) limit 1;
SHOW PROFILE FOR QUERY 1;
With the following results:
+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000122 |
| Waiting for query cache lock   | 0.000051 |
| checking query cache for query | 0.000191 |
| checking permissions           | 0.000038 |
| Opening tables                 | 0.000094 |
| System lock                    | 0.000047 |
| Waiting for query cache lock   | 0.000085 |
| init                           | 0.000103 |
| optimizing                     | 0.000031 |
| statistics                     | 0.000057 |
| preparing                      | 0.000049 |
| executing                      | 0.000023 |
| Sorting result                 | 2.806665 |
| Sending data                   | 0.000359 |
| end                            | 0.000049 |
| query end                      | 0.000033 |
| closing tables                 | 0.000050 |
| freeing items                  | 0.000089 |
| logging slow query             | 0.000067 |
| cleaning up                    | 0.000032 |
+--------------------------------+----------+
As I understand it, sorting the result takes 2.8 seconds, even though my data is already sorted. As additional information, I have around 240,000 rows.
It won't scan the entire database. A primary key is indexed by a B-tree. Forcing it into a binary search would be slower, if you could do it, which you can't.
Try making it a field:
select abs(datetime - $myvalue) as date_diff, table.*
from table
order by date_diff
limit 1
Indexes are supported in RDBMSs. Define an index on datetime or the field of your interest and the database will not do a complete table scan.

Multiple Data Sources in Microsoft Excel SQL Query

I have a lot of spreadsheets that pull transactional information from our ERP software into Excel using the Microsoft Query that we then perform other calculations on automatically. Recently we upgraded our ERP system, but management made the decision to leave the transactional history in the old databases to have a clean one going forward in the new system. I still need to have some "rolling 12 months" graphs, but if I use only the old database, I'm missing new data and if I use only the new, I'm missing the last 11 months data.
Is there a way that I can write a query in Excel to pull data from the old database's PartTran table and merge it with the new database's PartTran table without user intervention each time? For instance, I don't want my users (if possible) to have to maintain two queries whose results they copy and paste into one Excel table. The schemas of the tables (at least the columns I need) are identically named and defined.
If you want to take a bit of a fun, hacky Excel approach, you could do the "copy-paste" bit FOR your users behind the scenes. Given two similar tables OLD and NEW with structures
+-----+------+-------+------------+
| id  | foo  | bar   | date       |
+-----+------+-------+------------+
| 95  | blah | $25   | 2015-06-01 |
| 96  | bork | $12   | 2015-07-01 |
| 97  | bump | $200  | 2015-08-01 |
| 98  | fizz |       | 2015-09-01 |
| 99  | buzz | $50   | 2015-10-01 |
| 100 | char | ($1)  | 2015-11-01 |
| 101 | mope |       | 2015-12-01 |
+-----+------+-------+------------+
and
+----+-----+-------+------------+------+---------+
| id | foo | bar   | date       | fizz | buzz    |
+----+-----+-------+------------+------+---------+
| 1  | cat | ($10) | 2016-01-01 | 285B | 1110111 |
| 2  | dog | $25   | 2016-02-01 | 27F5 | 1110100 |
| 3  | ant | $100  | 2016-03-01 | 1F91 | 1001111 |
+----+-----+-------+------------+------+---------+
... you can union together the data for these two datasets with some prudent Excel wizardry, as below:
Your UNION table (named using Alt+J+T+A) should have the following items:
New natural ID
DataSet pointer ( name of old or new table )
Derived ID from original dataset
Columns of data you want from Old & New DataSets
example:
+---------+------------+------------+----+------+-----+------------+------+------+
| UnionId | SourceName | SourceRank | id | foo  | bar | date       | fizz | buzz |
+---------+------------+------------+----+------+-----+------------+------+------+
| 1       | OLD        |            |    |      |     |            |      |      |
| 2       | NEW        |            |    |      |     |            |      |      |
+---------+------------+------------+----+------+-----+------------+------+------+
You will then make judicious use of INDIRECT() and VLOOKUP() to derive the lookup id and column targets. Sample code below.
SourceRank - helper column
=COUNTIFS([SourceName],[#SourceName],[UnionId],"<="&[#UnionId])
id - the id from the original DataSet
=SMALL(INDIRECT([#SourceName]&"[id]"),[#SourceRank])
Everything else is just VLOOKUP madness! I've taken the liberty of copying the sample code below for reference:
foo =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[foo]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
bar =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[bar]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
date =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[date]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
fizz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[fizz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
buzz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[buzz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
Output
You'll likely want to make prudent use of If() and/or IfError() to help your users ignore the new column references to the old table and those rows that do not yet have data. Without that, however, you'll end up with something like the below.
This is both ready to accept & read new inputs to both OLD and NEW DataSets and is sortable to get rid of those pesky placeholder rows...
Hope this helps! Happy coding!

How to split CSVs from one column to rows in a new table in MSSQL 2008 R2

Imagine the following (very bad) table design in MSSQL2008R2:
Table "Posts":
| Id (PK, int) | DatasourceId (PK, int) | QuotedPostIds (nvarchar(255)) | [...]
| 1            | 1                      |                               | [...]
| 2            | 1                      | 1                             | [...]
| 2            | 2                      | 1                             | [...]
[...]
| 102322       | 2                      | 123;45345;4356;76757          | [...]
So, the column QuotedPostIds contains a semicolon-separated list of self-referencing post Ids (kids, don't do that at home!). Since this design is ugly as hell, I'd like to extract the values from the QuotedPostIds column into a new n:m relationship table like this:
Desired new table "QuotedPosts":
| QuotingPostId (int) | QuotedPostId (int) | DatasourceId (int) |
| 2                   | 1                  | 1                  |
| 2                   | 1                  | 2                  |
[...]
| 102322              | 123                | 2                  |
| 102322              | 45345              | 2                  |
| 102322              | 4356               | 2                  |
| 102322              | 76757              | 2                  |
The primary key for this table could either be a combination of QuotingPostId, QuotedPostId and DatasourceID or an additional artificial key generated by the database.
It is worth noting that the current Posts table contains about 6,300,000 rows, but only about 285,000 of those have a value set in the QuotedPostIds column. Therefore, it might be a good idea to pre-filter those rows. In any case, I'd like to perform the normalization using internal MSSQL functionality only, if possible.
I already read other posts regarding this topic, which mostly dealt with split functions, but I could not find out how exactly to create the new table while also copying the appropriate value from the DatasourceId column, nor how to filter the affected rows accordingly.
Thank you!
Edit: I thought it through and finally solved the problem using an external C# program instead of internal MSSQL functionality. Since it seems it could have been done using Mikael Eriksson's suggestion, I will mark his post as the answer.
From the comments, you say you have a string split function that you don't know how to use with a table.
The answer is to use CROSS APPLY, something like this:
select P.Id,
       S.Value
from Posts as P
  cross apply dbo.Split(';', P.QuotedPostIds) as S