Neo4j LOAD CSV record processing order - csv

I've been using LOAD CSV with Neo4j for some time now to import data, but I think (I'm not sure) I've noticed that LOAD CSV starts importing rows from the bottom of the CSV file.
Or is it completely random?
I'm trying to create an (org)-[:has_suborg]->(subOrg) relationship as I process each row, but I want to make sure the parent orgs are created first, to avoid exceptions/errors when a sub-org is related to a parent org that is not present yet.
If rows are processed from the top or from the bottom, I can make sure my CSV records are already sorted in the order I want them processed.
Thanks in advance

The CSV will be processed from top to bottom, in that order. What might be worth considering is doing a double load of your data.
First pass, just CREATE/MERGE your org nodes. Second pass, MATCH the org nodes, then create the rest of the data (the relationships).
Using this approach you will avoid any potential ordering issues, as well as dodging Eager operations in the query plan.
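
As a rough sketch of that two-pass approach in Cypher (the file name, the Org label and the orgId/parentOrgId column names here are assumptions; only the has_suborg relationship type comes from the question):

// Pass 1: make sure every org node exists
LOAD CSV WITH HEADERS FROM 'file:///orgs.csv' AS row
MERGE (:Org {orgId: row.orgId});

// Pass 2: match both ends and create the relationship
// (rows with no parent, i.e. top-level orgs, are skipped)
LOAD CSV WITH HEADERS FROM 'file:///orgs.csv' AS row
WITH row WHERE row.parentOrgId IS NOT NULL
MATCH (parent:Org {orgId: row.parentOrgId})
MATCH (sub:Org {orgId: row.orgId})
MERGE (parent)-[:has_suborg]->(sub);

Because the second pass only MATCHes nodes that already exist, the order of the rows within the file no longer matters.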

Related

How can I regroup multiple tLogRow in Talend to one file where I can see all my rows?

I am working on a MongoDB database ETL; I need to process the data and move it to a PostgreSQL database. I used tExtractJson in Talend to extract all the documents and subdocuments, but I am having a problem regrouping all the rows into one output so that I can load my tables in the Postgres DB.
To load a single table I need information from multiple tLogRow outputs.
I tried a tMap, but it requires a main flow and the others become lookups.
I solved this using a tBufferOutput in the first job, then did the same thing again in a second job and used a tBufferInput to recover the result of the first job.

SSIS: How to store master-detail records by condition?

I'm new to SSIS and I'm completely stuck on a perhaps easy question.
I have two tables with a one-to-many relationship. I parse HTML data in a Script component and create two outputs, one for master records and one for detail records.
Then I check a condition for overwriting the existing data and, if it is satisfied, I write the master record to its table. Unfortunately, the detail records are added in every case. I would like the details to be stored only if the condition is met, but I can't work out how to do it.
I have faced the same problem when loading XML data into parent-child tables. To handle it, I added two Data Flow Tasks to the package. In the first DFT I parsed the XML and loaded data into the master table only. In the second DFT I parsed the child XML nodes and passed that output to a Merge Join operator as the first input. The second input to the Merge Join is data extracted back out of the master table.
Eventually I managed to resolve the problem. I split the whole process into two data flows. In the first one I parse the HTML, save the master data to its table if needed, and save the parsed detail data in a package Object variable. The first data flow also has a Row Count component which saves its value in a MasterRowCount variable. In the second data flow I save the detail data to its table. The two data flows are connected by an expression-based precedence constraint (@[User::MasterRowCount] > 0). Thus, the second data flow executes only if master data was added.

Process CSV file with multiple tables in SSIS

I'm trying to figure out if it's possible to pre-process a CSV file in SSIS before importing the data into SQL.
I currently receive a file that contains 8 tables with different structures in one flat file.
The tables are identified by a row containing the table name encapsulated in square brackets, i.e. [DOL_PROD].
The data is underneath in standard CSV format: headers first and then the data.
The tables are split by a blank line and the pattern repeats for the next 7 tables.
[DOL_CONSUME]
TP Ref,Item Code,Description,Qty,Serial,Consume_Ref
12345,abc,xxxxxxxxx,4,123456789,abc
[DOL_ENGPD]
TP Ref,EquipLoc,BackClyLoc,EngineerCom,Changed,NewName
Is it possible to split it out into separate CSV files, or process it in a loop?
I would really like to be able to perform all of this with SSIS automatically.
Kind Regards,
Adam
You can't do that with a Flat File Source and connection manager alone.
There are two ways to achieve your goal:
You can use a Script Component as the source of the rows and process the file there; then you can do whatever you want with the file programmatically.
The other way is to read your flat file treating every row as a single column (i.e. without specifying a delimiter), and then, via Data Flow transformations, split the rows, recognise the table names, split the flows, and so on.
I'd strongly advise you to use a Script Component, even if you have to learn some .NET first, because the second option will be a nightmare :). I'd use a Flat File Source to extract the lines from the file as a single column and then work on them in a Script Component, rather than reading the "raw" file directly.
Here's a resource that should get you started: http://furrukhbaig.wordpress.com/2012/02/28/processing-large-poorly-formatted-text-file-with-ssis-9/
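
If splitting the file into separate CSVs up front is acceptable, a rough sketch in plain C# (which could also be run from a Script Task) might look like the following. The input and output paths are assumptions; the [TABLE] markers and blank-line separators follow the sample in the question.

using System.IO;

class SplitMultiTableCsv
{
    static void Main()
    {
        string inputPath = @"C:\Import\combined.csv";   // assumed location of the combined file
        string outputDir = @"C:\Import\Split";          // assumed output folder
        Directory.CreateDirectory(outputDir);

        StreamWriter current = null;
        foreach (string line in File.ReadLines(inputPath))
        {
            if (string.IsNullOrWhiteSpace(line))
                continue;                                // blank lines just separate the tables

            if (line.StartsWith("[") && line.EndsWith("]"))
            {
                // A new table marker such as [DOL_CONSUME]: start a new output file.
                current?.Dispose();
                string tableName = line.Trim('[', ']');
                current = new StreamWriter(Path.Combine(outputDir, tableName + ".csv"));
            }
            else if (current != null)
            {
                current.WriteLine(line);                 // header row or data row for the current table
            }
        }
        current?.Dispose();
    }
}

Each resulting file then has a single, consistent structure, so a plain Flat File Source (or a Foreach Loop over the folder) can import them.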

Extract Distinct Records in SSIS

I am writing an SSIS package to import data from *.csv files into a SQL 2008 DB. The problem is that one of the files contains duplicate records within the CSV itself, and I want to extract only the distinct values from that source.
Unfortunately, the generated files are not under my control; they are owned by a third party and I cannot change the way they are generated.
I did try the Lookup component, but it only checks the incoming data against existing data; it does not catch duplicate records within the incoming data itself.
I believe the sort component gives an option to remove duplicate rows.
It depends on how serious you want to get about the duplicates. Do you need a record of what was duplicated, or is it enough to just get rid of them? A Sort component will get rid of dups on the sort fields; however, the dups may have different data in the other fields, and then you want a different strategy. Usually I load everything to staging tables and clean up from there. I send the removed dupes to an exception table (we have to answer a lot of questions from our customers about why things don't match what they sent), and I often use a set of business rules (enforced with either an Execute SQL task or data flow tasks) to determine which record to pick when rows are duplicates in one area but not another (say, two business addresses when we can only store one). I also make sure the client is aware of how we determine which of the two to pick.
Use the Sort component from the toolbox for that, then open it; you will see all the available input columns.
Check the column(s) to sort on, set the sort type and direction, and then tick "Remove rows with duplicate sort values".
Bring in the data from the csv file the way it is, then dedup it after it's loaded.
It'll be easier to debug, too.
I used the Aggregate component and grouped by both QualificationID and UnitID. If you want, you can also use the Sort component instead. Perhaps this information will help others.

Reading hierarchical flat file into SSIS

I have a flat file that is structured in a hierarchical format, which looks something like this:
Area|AreaCode|AreaDescription
Region|RegionCode|RegionDescriptoin
Zone|ZoneCode|ZoneDescription
District|DistrictCode|DistrictDescription
Route|RouteCode|RouateDescription
Record|Name|Address|Ect
RouteFooter
Route|RouteCode|RouateDescription
Record|Name|Address|Ect
RouteFooter
DistrictFooter
District|DistrictCode|DistrictDescription
Route|RouteCode|RouateDescription
Record|Name|Address|Ect
Record|Name|Address|Ect
RouteFooter
Route|RouteCode|RouateDescription
Record|Name|Address|Ect
RouteFooter
DistrictFooter
ZoneFooter
RegionFooter
AreaFooter
I have to bring this into SSIS and consume information about each Record row together with the header rows it falls under, combine that with information from several other sources, and output a simpler flat file as a result.
I would like to read the flat file above into a structure in which each row contains a record with the appropriate header information included.
My question is: what is the best way to do this, if it is even possible?
First, how do you tell what type of line you are on if you are on, say, line 3,987,986? How do you tell what is related to what? Is there a possibility you could get this in a better format? Before spending lots of time (and don't kid yourself, this will take lots of time to set up and test properly) I would kick the file back to the provider and request it in a different format. You won't always get it, but you should at least try.
When I have done this in the past in DTS, the first characters of each line told me which structure the line referred to. I imported everything into a staging table with two columns, one for the record-type data and one for the rest. Then I parsed the rest into staging tables with the correct column structure for each type of record (plus any fields you might need for the relationships), did the clean-up, and then imported into the prod tables. As you also have a different number of columns per line type, I would try that approach (only you may have to manually populate some columns instead of reading them directly from the file), and also give each record an identity field in the staging tables; this will help you figure out the relationships, I think.
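
As a rough illustration of carrying the header context down to each Record line (whether in a pre-processing step or inside an SSIS Script Component), a minimal C# sketch might look like this. The file paths and the output layout are assumptions; the record-type names follow the sample above.

using System.IO;

class FlattenHierarchicalFile
{
    static void Main()
    {
        string inputPath = @"C:\Import\hierarchy.txt";    // assumed input location
        string outputPath = @"C:\Import\flattened.txt";   // assumed output location

        // The most recent header code seen at each level of the hierarchy.
        string area = "", region = "", zone = "", district = "", route = "";

        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                string[] parts = line.Split('|');
                switch (parts[0])
                {
                    case "Area":     area = parts[1];     break;   // remember the current Area code
                    case "Region":   region = parts[1];   break;
                    case "Zone":     zone = parts[1];     break;
                    case "District": district = parts[1]; break;
                    case "Route":    route = parts[1];    break;
                    case "Record":
                        // Emit the detail row together with the header codes currently in scope.
                        writer.WriteLine(string.Join("|", area, region, zone, district, route, line));
                        break;
                    default:
                        break;   // footer lines carry no data to keep
                }
            }
        }
    }
}

Inside a Script Component the same state-tracking loop would run per input row, with the combined values sent to an output buffer instead of a file.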