A tale of two cities, almost... I have 17,000 rows of data that come in as a pair of strings in 2 columns. There are always 5 item numbers and 5 item unit counts per row (unit counts are always 4 characters). Each unit count has to match up with its item number, or the row is invalid. What I'm trying to do is "unpivot" the strings into individual rows of Item Number and Item Units.
So here's an example of one row of data and the two columns
Record ID Column: 0
Item Number Column: A001E10 A002E9 A003R20 A001B7 XA917D3
Item Units Column: 001800110002000300293
I wrote a C# Windows app as a test harness to unpivot the data into individual rows, and it works fine and dandy. It unpivots the data into 85,000 rows (5 times 17,000) and displays them to me in a grid, which is what I expect (ID, Item Number and Item Units).
0 | A001E10 | 0018
0 | A002E9 | 0011
and so on...
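For reference, here's a minimal sketch of the kind of parsing logic described above, written as a plain C# console program. The names, and the sample units string in Main, are made up for the sketch; the real test harness and script task will differ in the details.

using System;
using System.Collections.Generic;

class UnpivotSketch
{
    // Splits one source row (ID plus the two packed strings) into five output rows.
    // Item numbers are space-separated; unit counts are fixed 4-character chunks.
    static IEnumerable<string[]> Unpivot(string id, string itemNumbers, string itemUnits)
    {
        string[] items = itemNumbers.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Exactly 5 item numbers and 5 matching 4-character unit counts are expected;
        // anything else means the row is invalid.
        if (items.Length != 5 || itemUnits.Length != items.Length * 4)
            throw new FormatException("Invalid row for ID " + id);

        for (int i = 0; i < items.Length; i++)
            yield return new[] { id, items[i], itemUnits.Substring(i * 4, 4) };
    }

    static void Main()
    {
        // Item numbers taken from the example above; the units string here is a
        // made-up 5 x 4-character value, not the exact one from the post.
        foreach (var row in Unpivot("0", "A001E10 A002E9 A003R20 A001B7 XA917D3", "00180011000200030029"))
            Console.WriteLine(string.Join(" | ", row));
    }
}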
In my SSIS package I added a script task to process this same data, basically using the same code that my test harness uses. When I run the task I can see that it loads the 17,000 rows, but it only generates roughly 15,000 rows on the output, so obviously something isn't right.
What I'm thinking is that I don't have the script task set up correctly, even though it uses the same code as my test harness, and that it's dropping records for some reason.
If I go back into my task and give it a particular record ID that it didn't get in the first pass, it will process that ID and generate the right output. So this tells me that the record is ok but for some reason it misses it or drops it on the initial process. Maybe something to do with buffers?
Well - I figured it out.
We have a sequence container with tons of data flow tasks inside it that run in parallel. We're relying on the engine to prioritize and handle the data extract and load correctly. However, this one particular script task was not being handled correctly by the engine within that sequence container.
The clue was that you could run the script task by itself, outside the whole process, and it worked fine. So we pulled the script task out of the sequence container and put it by itself after the container, and now it runs correctly.
I'm quite new to ADF, so here's my challenge.
I have a pipeline that consists of a Lookup activity and a ForEach activity, with a Copy activity inside the ForEach.
When I run this pipeline, the output of the Lookup activity contains 11 different values. From my perspective, those are the only 11 records that need to be copied to my sink, which is an Azure SQL DB.
The ForEach activity takes the Lookup output as its items, so it iterates 11 times.
While the pipeline runs, the Copy activity copies 11 times, and my SQL database now has 121 records: 11 rows multiplied by 11 iterations. This is not the output I expected.
I only expect 11 rows in my sink table. How can I change this pipeline to achieve the expected outcome of only 11 rows?
Many thanks!
To copy the data correctly, the Lookup activity and the Copy activity's source should not be given the same configuration. If they are, duplicate rows will be copied, because the Copy activity re-copies the full source on every ForEach iteration.
I tried to reproduce the same in my environment.
If there are 3 records in the source data, 3 times 3 = 9 records get copied.
To avoid the duplicates, we can use only the Copy activity to copy the data from source to sink.
Only 3 records then end up in the target table.
This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.
The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes represents the key and the remainder of each row represents the rest of the record's data. There are records in which the same key is repeated though the remaining data is different in each record. The records are already sorted on 'KEYn'.
KEY1aaaaaa
KEY1bbbbbb
KEY2cccccc
KEY3xxxxxx
KEY3yyyyyy
KEY3zzzzzz
KEY3wwwwww
KEY4uuuuuu
KEY5hhhhhh
KEY5ffffff
My requirement is to pick up the first record of each key and drop the remaining 'duplicates', so the output file for the above input should look like this:
KEY1aaaaaa
KEY2cccccc
KEY3xxxxxx
KEY4uuuuuu
KEY5hhhhhh
Since the data is already sorted, I don't want to use the SORT utility with SUM FIELDS=NONE or ICETOOL with the SELECT FIRST operand, since both of these will actually end up re-sorting the data on the deduplication key (KEYn). Also, the actual dataset I am referring to is huge (1.6 billion records, AVGRLEN 900, VB), and a job actually ran out of sort work space trying to sort it in one go.
My question is: is there any option available in JCL-based utilities to do this deduplication without re-sorting and without using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.
Try this (untested):
OPTION COPY
INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)
The positions assume your VB dataset, where bytes 1-4 are the RDW and the key is the first 4 data bytes starting at position 5. INREC inserts a 3-byte sequence number right after the RDW that restarts at 1 each time the key (5,4) changes; OUTFIL then keeps only the records whose sequence number is 1 (the first record for each key), and its BUILD drops the sequence number again, restoring the original record. Because everything runs under OPTION COPY, no sort work space is used.
Person -> Item -> Work -> Component.
These are the main tables in the database.
I have to search for Item by a criterion, which gives a list. I join Person to get each item's "parent". After this there may or may not be a record in the Work table; if there is, I join Work as well, and I also join the list of Components if any can be found.
The original code used nested tables. It would crash the browser because that design takes too much memory, and it was extremely slow at around 150 records.
I rewrote the nested tables with divs. Performance got a huge boost, but it started to get slow again because of the buttons and the design. (Before, it wasn't able to show 200 records even after 10 minutes of waiting; now it displays 5k rows in 23 seconds.)
Some of my benchmark logs:
SQL execution time: 0.18448090553284 seconds. Found 5624 rows.
For each result processing took: 0.29220700263977 seconds.
Writing the HTML code took: 0.4107129573822 seconds.
Rows in HTML: 26551 headers + data.
Total Cells in HTML: 302491 headers and data.
Time until DOMready: 23691 milliseconds (in JavaScript)
0.18 + 0.29 + 0.41 = 0.88, so that's around 1 second!
But when the browser actually wants to show it to you (paint), it takes something like 20 seconds!
Please don't suggest paging! The customer (end user) wants to see all the data on one web page, for whatever reason. No comment there.
Requiring an i7 processor and 8/16 GB of RAM is accepted.
Most of the data rows have a collapse/expand button.
Most of the data rows have CRUD buttons: Add, Edit, Delete, View details.
All 4 kinds of data tables have headers, and they don't match the other tables' headers, either in width or in number of columns.
When I just list the data on a blank page (without the design) in one single table, it takes 2 or 3 seconds, not 20-30.
The original nested table solution has buttons with functionality in the data rows.
I would like to reuse that and not implement it again.
My idea is to go back to the original nested table design (to avoid re-implementing a lot of the button functionality), then display only the top-level table, collapsed, with expand buttons. Then call AJAX to get the second-level data, and once that's ready, the 3rd level, then the 4th.
The user is on the intranet or on the same PC as the server, so maybe this is acceptable? It also avoids blocking the user interface for a long time.
How would you handle this case, when showing a next-page button with 20 records per page is not an option?
I have a file like the one seen below. Just an example:
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip the header information and load the flat file to the DB, like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary from file to file (it's not a fixed number of rows).
Any help? Thanks in advance.
The generic SSIS components cannot meet this requirement. You need to code for this, e.g. in an SSIS Script task.
I would code that script to read through the file looking for that header row (ID Name Address), and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component.
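A minimal sketch of that script logic, assuming the same header text as the example and hypothetical file paths (in a real Script Task you would normally read these from package variables):

using System;
using System.IO;

class HeaderStripper
{
    static void Main()
    {
        // Assumptions for the sketch; a real package would pass these in as variables.
        string inputPath = @"C:\Data\incoming.txt";
        string outputPath = @"C:\Data\incoming_clean.txt";
        string headerRow = "ID Name Address";

        bool headerFound = false;
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath, false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Skip everything until the real header row appears,
                // then copy it and all remaining lines to the new file.
                if (!headerFound)
                {
                    if (line.TrimStart().StartsWith(headerRow, StringComparison.OrdinalIgnoreCase))
                        headerFound = true;
                    else
                        continue;
                }
                writer.WriteLine(line);
            }
        }

        if (!headerFound)
            throw new InvalidOperationException("Header row not found in " + inputPath);
    }
}

The new file then starts with a single fixed header row, so the Flat File Source can treat it as a normal one-row header.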
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate this with numbers - not great if you have no idea how many rows might be coming in or if the IDs are actually unique each time, but if you're only loading in a few rows each time and they always start at 1, you could chuck the numbers 1 - 100 into a table and throw away any rows you load which don't match when doing a lookup onto this table.
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can just pick up the columns which you know will have data you care about in (i.e. columns 1 through 3), you can do any conversions you need to do in SQL rather than in SSIS, and you can make sure your columns have sensible names to be used in SSIS.
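To illustrate that follow-on step, the filter can be as simple as the query below, shown here wrapped in a small ADO.NET sketch so it is self-contained. The staging table name (stg_FlatFile), its columns (Col1-Col3), the destination table (dbo.People) and the connection string are all assumptions, and TRY_CAST needs SQL Server 2012 or later; inside the package this query would more naturally live in an Execute SQL Task or as the source query of the second data flow.

using System;
using System.Data.SqlClient;

class StagingFilterSketch
{
    static void Main()
    {
        // Assumed names: stg_FlatFile(Col1, Col2, Col3) holds the raw file rows as strings,
        // dbo.People(ID, Name, Address) is the real destination.
        const string sql = @"
            INSERT INTO dbo.People (ID, Name, Address)
            SELECT CAST(Col1 AS int), Col2, Col3
            FROM   stg_FlatFile
            WHERE  TRY_CAST(Col1 AS int) IS NOT NULL;"; // keeps only rows with a numeric ID

        // Connection string is a placeholder for the sketch.
        using (var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            int rows = cmd.ExecuteNonQuery();
            Console.WriteLine(rows + " data rows moved out of staging.");
        }
    }
}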
I am creating a departmental bar chart that shows time frames for a set of tasks. Some departments share tasks, others are unique. I have the chart running except that I don't want all possible tasks listed for every department. I would only like to display those tasks that the department actually did.
Here is an example of the data (# in days):
IT Pending 5
IT In Process 8
CD Pending 10
CD 1st Inspection 15
CD Re-inspection 5
In this case I don't want to see "1st Inspection" or "Re-inspection" for IT, because IT doesn't do those jobs, nor do we want CD to have "In Process".
Is it possible to remove these unneeded series for a category?
The primary reason for asking this is because our data set is so large, it is nearly impossible to read the report. I think removing these unneeded columns would really help.
It must have been a long week for me. I switched how I was generating the graph and got what I wanted. My data was fine; I am now using the status as the category field and got it working.