I'm creating a job in Talend where I have to generate files containing data generated with tRowGenerator along with other sources: a SQL Server database and delimited files.
The issue is that I'm getting duplicate records with the same primary key.
All I want to get is 100 records (420 rows):
For each random UUID generated I should get 42 rows, and so on, but instead I'm getting the same row 10 times (it's duplicated 10 times).
I'm getting data from 3 sources as shown below:
To get these fields in my output file:
If I understand correctly, you're using one of the functions in tRowGenerator to get random data.
The problem is that the data generation functions available in Talend are not really random; they get their values from a predefined list of values. You can look at the source code to verify that they have a hundred or so values, so you're bound to get duplicates.
To get unique values create a Talend routine with a simple method that generates a UUID:
public class Utils {

    /**
     * getRandom: return a random UUID.
     *
     * {talendTypes} String
     *
     * {Category} User Defined
     *
     * {param} string("world") input: dummy input
     *
     * {example} getRandom("world") # 01e98b98-05d6-427c-978d-1f86d0ea4712
     */
    public static String getRandom(String input) {
        return java.util.UUID.randomUUID().toString();
    }
}
You can then access this function from tRowGenerator:
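For example, assuming the routine class is saved as Utils (as above), the column's value in the tRowGenerator schema can be a custom expression that calls the routine directly; the argument is just the dummy input:

Utils.getRandom("uuid")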
One more thing: I'm not sure what exactly your requirement is, but since you don't have a join key between your inputs, you are getting a cartesian join between all your inputs (42x298x206 rows). So you might want to define a join condition.
If you do define a join condition, make sure the tMap inputs are in the right order (use the tRowGenerator flow as the main connection, and the others as lookups).
I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table, and then using a query or view to obtain the summarized results as follows. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to the Account. I then wrote the following query as a view:
SELECT DISTINCT accountid,
       (SELECT TT2.ari_productsummary + '; '
        FROM TESTTABLE TT2
        WHERE TT2.accountid = TT.accountid
        FOR XML PATH ('')
       ) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that Query provides me the results that I want, which I can then use for importing into the Account entity as shown below:
But how do I do this in an SSIS data flow without writing to a SQL table as a temporary placeholder for the data? I want to do the entire process inside one data flow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly:
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spitballing here, as I don't work with Dynamics, nor do I have the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Component -> Cache Transform
If you want to keep the sort in there, do it before the Script Component. The approach I'll take with the Script Component is fully blocking - that is, all the rows must arrive before it can send any on. Transformations like the Merge Join are only partially blocking, because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Component is going to be an asynchronous transformation. You'll have two output columns: your key accountid, and your new derived column ari_activecontractitems. That column might need to be big - you'll know your data best, but if it's a blob type in Dynamics (> 4k Unicode or > 8k ASCII characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT.
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
In the PreExecute method, we'll tack this in to initialize our variable:
// initialize in the PreExecute method
accumulator = new Dictionary<string, List<string>>();
In the Input0_ProcessInputRow method:
// simulate the inbound queue
// row_id would be something like Row.row_id
if (!accumulator.ContainsKey(row_id))
{
    // create an empty list for this key
    accumulator.Add(row_id, new List<string>());
}

// add the value if we don't already have it
if (!accumulator[row_id].Contains(invoice))
{
    accumulator[row_id].Add(invoice);
}
Once you get the signal that no more data is available (end of rowset), that's when you start sending rows to the output. The auto-generated code has placeholders for all of this.
// This is how we shove data out the pipe
foreach (var kvp in accumulator)
{
    // approximately thus
    OutputBuffer1.AddRow();
    OutputBuffer1.row_id = kvp.Key;
    OutputBuffer1.ari_productsummary = string.Join("; ", kvp.Value);
}
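Pulling those fragments together, the relevant parts of the asynchronous Script Component might look roughly like this - a sketch only, since the exact buffer property names (accountid, ari_productsummary, ari_activecontractitems) depend on how you name the input and output columns, and the default output name Output 0 (hence Output0Buffer) is assumed:

using System.Collections.Generic;

public class ScriptMain : UserComponent
{
    // key = accountid, value = distinct ari_productsummary values seen so far
    private Dictionary<string, List<string>> accumulator;

    public override void PreExecute()
    {
        base.PreExecute();
        accumulator = new Dictionary<string, List<string>>();
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // accumulate each inbound row, keeping only distinct summaries per account
        if (!accumulator.ContainsKey(Row.accountid))
        {
            accumulator.Add(Row.accountid, new List<string>());
        }
        if (!accumulator[Row.accountid].Contains(Row.ari_productsummary))
        {
            accumulator[Row.accountid].Add(Row.ari_productsummary);
        }
    }

    public override void Input0_ProcessInput(Input0Buffer Buffer)
    {
        // let the base class walk the buffer and call Input0_ProcessInputRow for each row
        base.Input0_ProcessInput(Buffer);

        // once the last buffer has arrived, flush the accumulated rows to the asynchronous output
        if (Buffer.EndOfRowset())
        {
            foreach (var kvp in accumulator)
            {
                Output0Buffer.AddRow();
                Output0Buffer.accountid = kvp.Key;
                Output0Buffer.ari_activecontractitems = string.Join("; ", kvp.Value);
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}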
We have an upcoming release that comes with a component that does exactly what you are trying to achieve, without the need to write custom code. The feature is currently in preview; please reach out to us for private access to it. You can find our contact information on our website.
UPDATE - June 5, 2020: we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as part of our 2020 Release Wave 1. We have two components that serve this kind of purpose. The Composition component takes input values and transforms them into a composite value in an SSIS column. The Decomposition component does the opposite: it takes an input value and splits it into multiple rows using either delimiter-based text splitting or XML/JSON array splitting.
My models have both id and counter attributes. The id is a UUID, and the counter is an integer which is auto-incremented by the database.
Both are unique however I rely on id as the primary key. The counter is just a human-friendly name that I sometimes display to the user.
Immediately before an object is created a listener gives it a UUID. This works fine.
When the record is saved, MySQL increments the counter field. This works fine except that the copy of the object which I have in memory does not have the counter value. I can reload the object to find out what its counter is, but that would require another database query.
Is there a way to find the value of the counter without a specific database query? For example, is it returned as part of the response from the database when a record is created?
A few things:
Use create(array $attributes) and you'll get exactly what you want. For this to work right, you have to ensure that the $fillable array contains all the attribute names passed to the create method.
You should use an Observer on the model instead of a listener (most likely the creating method).
A personal preference of mine when using Eloquent is to use id for the id (auto-increment field) and to forget custom key settings between models, because by default that is what relations expect, and so on:
public function secondModels()
{
    return $this->hasMany(SecondModel::class);
}
is pretty much a no-brainer. But for this to work, the best way would be (also following this guy's recommendations) FirstModel::id, SecondModel::id, SecondModel::first_model_id, with first_models and second_models as the table names. Avoiding and/or skipping this kind of unification means a lot of custom work afterward. I'm not saying it can't be done, but it is a lot of not-first-time-successful work.
Also, if you want the visitor to see something other than the id field, you can make a computed field with an accessor:
/**
 * Get the user's counter.
 *
 * @return string
 */
public function getCounterAttribute(): string
{
    return (string) $this->id;
}
You then call it with $user->counter.
Another personal preference of mine is to use the most descriptive variable names possible, so my UUID field would be something like:
$table->uuid('uuid4');
These are good and easy-to-adopt practices for using Eloquent.
Having said all that, note that create() returns the created object from the database (and save() fills in the model's auto-incrementing key), while insert() does not.
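A rough sketch of the difference (assuming the conventional setup described above, with an auto-incrementing id primary key; the uuid4 and name attributes are just examples and must be listed in $fillable):

use Illuminate\Support\Str;

// create() returns the persisted model, so $first->id is already populated
$first = FirstModel::create([
    'uuid4' => (string) Str::uuid(),
    'name'  => 'Example',
]);

// save() returns a bool, but it also fills $second->id on the in-memory model
$second = new FirstModel(['uuid4' => (string) Str::uuid(), 'name' => 'Another']);
$saved = $second->save();

// insert() goes straight to the query builder: it returns a bool and gives
// you no model instance back, so nothing on the PHP side gets refreshed
FirstModel::insert(['uuid4' => (string) Str::uuid(), 'name' => 'Bulk']);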
I need to export the result of a query on a Neo4j database to JSON or CSV, including relationships and nodes. My query is this:
MATCH
(s:Socio)-[:ES_SOCIO_DE]->(p1:Empresas)-[:OFERTA_A]->(lic:Licitaciones)<-[:OFERTA_A]-(p2:Empresas)<-[:ES_SOCIO_DE]-(s:Socio)
WHERE ID(p1) <> ID(p2) RETURN * limit 100
But when I try to export it to GraphML, for example, it only exports the nodes:
(image: exported result, showing nodes only)
Do you have access to the Neo4j browser interface for your installation? Usually, the URL will be something like:
http://[IP_ADDRESS_OF_YOUR_NEO4J_SERVER]:7474/browser/
In the browser interface, you can run your query in the query box, then click either the 'Text' or 'Table' panel on the left side of the returned query results box and you will see that you now have the option to 'Export CSV' in the top right portion of the returned query results box.
You can then either open the CSV directly or save it - and it will contain the nodes and the relationship properties.
If you want to return the type of the relationship (rather than just the properties) - which I have a hunch may be the case - return the relationship variable encapsulated in the built-in type() function. For example, using Neo4j's sample Movie database, I run the following query:
optional match (z:Person)-[x:ACTED_IN]->(v:Movie)
where z.name = "Tom Cruise"
return z,type(x),v
With the above query, rather than returning the properties of his [:ACTED_IN] relationships, it will simply return "ACTED_IN".
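Applied to the query from the question, that means naming each relationship and returning its type() alongside the nodes - roughly like this (the relationship variable and alias names are just illustrative):

MATCH (s:Socio)-[r1:ES_SOCIO_DE]->(p1:Empresas)-[r2:OFERTA_A]->(lic:Licitaciones),
      (lic)<-[r3:OFERTA_A]-(p2:Empresas)<-[r4:ES_SOCIO_DE]-(s)
WHERE ID(p1) <> ID(p2)
RETURN s, p1, lic, p2,
       type(r1) AS rel_s_p1, type(r2) AS rel_p1_lic, type(r3) AS rel_p2_lic, type(r4) AS rel_s_p2
LIMIT 100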
Edit: Judging from your included image, which I admittedly did not notice initially, it looks like zero relationships are being returned. Are you sure that the relationship type you are specifying actually exists?
I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp, etc.). However, there are different types of data groups and, even though they are fixed-length, the number of their fields varies depending on the data group type. The first three digits of a data group define its type. The number of data groups in each record also varies.
My idea is to have a staging table into which I would insert all the data groups. So two records like this,
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
would produce three records in the staging table:
ID    Timestamp    Data
123   01/01/2016   12323456KKSD3467
123   01/01/2016   456SSGFED43520160101173802
987   02/01/2016   456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some have their data broken down further). My question is how to break down the flat file into the staging table. I can split the entire record with pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.
The solution I came up with myself doesn't try to split at the flat file source, but rather in a script. My Data Flow looks like this.
So the Flat File Source output is just a single column containing the entire line. The Script Component contains output columns for each column in the Staging table. The script looks like this.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // CultureInfo requires "using System.Globalization;" at the top of the script
    var splits = Row.Line.Split('|');
    for (int i = 1; i < splits.Length; i++)
    {
        // skip the empty element produced by the trailing pipe
        if (string.IsNullOrWhiteSpace(splits[i]))
            continue;

        Output0Buffer.AddRow();
        Output0Buffer.ID = splits[0].Substring(0, 11);
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14), "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None; otherwise you won't have Output0Buffer available in your script. Finally, the OLE DB Destination just maps the script output columns to the staging table columns. This solves the problem I had with creating multiple output records from a single input record.
I have a (somewhat large) flat file (CSV) which I am trying to import into my SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.
E.g. Data:
Item Number | Item Name | Update Date
ITEM-01 | First Item | 1-Jan-2013
ITEM-01 | First Item | 5-Jan-2013
ITEM-24 | Another Item | 12-Mar-2012
ITEM-24 | Another Item | 13-Mar-2012
ITEM-24 | Another Item | 14-Mar-2012
Now I need to create my master Item record table using this data. As you can see, the data is duplicated because of the Update Date. It is guaranteed that the file will always be sorted by Item Number. So what I need to do is just check: if the next item number = the previous item number, then do NOT import this line.
I used Sort with Remove Duplicates in the SSIS package, but it actually tries to sort all the lines, which is useless because the lines are already sorted. Plus it takes forever to sort that many lines.
So is there any other way?
There are a couple of approaches you can take to do this.
1. Aggregate Transformation
Group by Item Number and Item Name and then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. In order to use the Minimum operation, you'll need to convert the Update Date column to a date (can't perform Minimum on a string). That conversion can be done in a Data Conversion Transformation. Below are the guts of what this would look like:
2. Script Component Transformation
Essentially, you could implement the logic you mentioned above:
if next item number = previous item number then do NOT import this line
First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):
Select Transformation as the Script Component type
Add the Script Component after the Flat File Source in your Data Flow:
Double Click the Script Component to open the Script Transformation Editor.
Under Input Columns, select all columns:
Under Inputs and Outputs, select Output 0, and set the SynchronousInputID property to None
Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types):
Finally, edit the script. There will be a method named Input0_ProcessInputRow - modify it as below, and add a private field named previousItemNumber, also as below:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // only pass a row through when the item number changes
    if (!Row.ItemNumber.Equals(previousItemNumber))
    {
        Output0Buffer.AddRow();
        Output0Buffer.ItemName = Row.ItemName;
        Output0Buffer.ItemNumber = Row.ItemNumber;
        Output0Buffer.UpdateDate = Row.UpdateDate;
    }
    previousItemNumber = Row.ItemNumber;
}

private string previousItemNumber = string.Empty;
If performance is a big concern for you, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT * to get the desired values.
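One caveat: because the Update Date differs between otherwise identical lines, a plain SELECT DISTINCT * would still return every row for this file, so the query against the temporary table would need to group instead - roughly like this (the table and column names are assumptions, and Update Date is assumed to be loaded as a date type):

SELECT ItemNumber,
       ItemName,
       MIN(UpdateDate) AS UpdateDate   -- keep the earliest date per item, matching the logic above
FROM #StagingItems
GROUP BY ItemNumber, ItemName;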