I have a flat file with several hundred thousand rows. This file has no header rows. I need to load just the first row into a hold table and then read the last field into a variable. This hold table has just two columns: the first one for most of the row, and the second for the field I need to move into the variable. Alternatively, how can I read this one field directly from the flat file into a variable?
I should note that I am currently loading the entire file, then reading just the first row to get the FILE_NBR into a variable. I would like to speed it up a bit by only loading that first row, instead of the entire file.
My source is a fixed position file, so I am putting all fields except for the last 6 bytes into one field and then the last 6 bytes into the FILE_NBR field.
I am looking to only load one record, instead of the entire file, as I only need that field from one record (the number is the same on every record in the file), for comparison to another table.
For the use case you're describing, I would likely use a Data Flow Task with a Script Component (acting as a source) feeding an OLE DB/ADO.NET destination.
Assumptions
A variable named #[User::CurrentFileName] exists, is of type String and is populated with a fully qualified path to the source file.
The Script Component, acting as a source, will have two output columns defined, ROR and FILENBR, with lengths of the appropriate size (not to exceed 4000 characters) for ROR and 6 for FILENBR, and the output buffer is left as the default of Output0.
Approximate source component code (ensure you set CurrentFileName as a ReadOnly variable in the component)
// A variable for holding our data
string inputRow = "";
// Convert the SSIS variable into a C# variable
// (in a Script Component, unlike a Script Task, ReadOnly variables are exposed as typed properties)
string sourceFile = this.Variables.CurrentFileName;
// Read from the source file
// (I was lazy, feel free to improve this)
foreach (string line in System.IO.File.ReadLines(sourceFile))
{
inputRow = line;
// We have the one row we want, let's blow this popsicle stand
break;
}
// Split the row into the "rest of row" and the file number (final 6 characters)
int lineLen = inputRow.Length;
// Everything except the final 6 characters
string ror = inputRow.Substring(0, lineLen - 6);
// Python slicing would be more elegant, but this works
string fileNumber = inputRow.Substring(lineLen - 6);
// Now that we have the two pieces we need, let's do the SSIS specific thing
// Create a row in our output buffer and assign values
Output0Buffer.AddRow();
Output0Buffer.ROR = ror;
Output0Buffer.FILENBR = fileNumber;
Ref: Is File.ReadLines buffering read lines?
In PyArrow you can now do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every single one? It seems like this could be something in FragmentScanOptions, but that's not documented at all.
No, there is no way to do that today. I'm not sure exactly what you're after, but if you are trying to sample your data there are a few choices, though none achieves quite this effect.
To load only a fraction of your data from disk you can use the dataset's head() method (pyarrow.dataset.Dataset.head).
There is a request in place for randomly sampling a dataset, although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
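A minimal sketch of both of those options, assuming a reasonably recent pyarrow and using the same blah.parquet path as above:
import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("blah.parquet")

# Option 1: head() returns just the first N rows, so only part of the
# file needs to be scanned
first_rows = dataset.head(1000)

# Option 2: the "drop rows at random" approach still reads everything
# into memory, then filters with a random boolean mask (~10% kept here)
table = dataset.to_table()
keep = pa.array(np.random.rand(table.num_rows) < 0.1)
sample = table.filter(keep)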
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
So, like the title says, I'm using Talend ESB in order to handle camel messaging. In my case, I'm sending the contents of a file as the message body to the child Talend job. In some scenarios the contents of the file may have 2+ rows. All I need is to be able to iterate over each of those rows independently within the child-job itself.
I guess my question is two-fold: 1. If this is possible, how do I do it? 2. Is the iteration better handled at the route level, or in the child job the route calls?
Right now, the files I'm handling are | delimited. To handle this, I have tRouteInput_1 going directly to a tExtractDelimitedFields and use those values to set variables globally, like so.
The problem with this is that it only reads the first row of the file and moves on. I need to be able to iterate over each row within the file/camel message.
Thanks,
Alex
First you need to split your file on the row delimiter using a tNormalize.
In my example, I simulate your tRouteInput by using a tFixedFlowInput containing the whole file as a single line, with rows separated by \n. Then for each resulting row returned by tNormalize, extract the fields you want (in tExtractDelimitedFields, create the schema corresponding to your row structure):
And the result:
.--------+--------.
| tLogRow_1 |
|=-------+-------=|
|field1 |field2 |
|=-------+-------=|
|field1.1|field1.2|
|field2.1|field2.2|
|field3.1|field3.2|
'--------+--------'
You need to escape "|" by using "\\|" inside tExtractDelimitedFields, as the component accepts a regex for the separator, and the pipe has a special meaning in regular expressions.
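Just to illustrate why the escape is needed (a quick Python sketch outside Talend, with a made-up row):
import re

row = "field1.1|field1.2"
# "|" is the alternation operator in a regex, so an unescaped pipe won't
# split on the literal character; the same reasoning applies to the "\\|"
# used in tExtractDelimitedFields
print(re.split(r"\|", row))  # ['field1.1', 'field1.2']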
As for your second question, I think it's better to do this inside the child job rather than the route, as there are dedicated components for this that are not available in the routing perspective.
I'm new to Neo4j and looking for some guidance :-)
Basically, I want to create the graph below from the CSV below. The NEXT relationship is created between Points based on the order of their sequence property. I would like the solution to work even if the sequence numbers are not consecutive. Any ideas?
(s1:Shape)-[:POINTS]->(p1:Point)
(s1:Shape)-[:POINTS]->(p2:Point)
(s1:Shape)-[:POINTS]->(p3:Point)
(p1)-[:NEXT]->(p2)
(p2)-[:NEXT]->(p3)
and so on
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
"1-700-y11-1.1.I","53.42646060879","-6.23930113514121","1","0"
"1-700-y11-1.1.I","53.4268571616632","-6.24059395687542","2","96.6074531286277"
"1-700-y11-1.1.I","53.4269700485041","-6.24093540883784","3","122.549696670773"
"1-700-y11-1.1.I","53.4270439028769","-6.24106779537932","4","134.591291249566"
"1-700-y11-1.1.I","53.4268623569266","-6.24155684094256","5","172.866609667575"
"1-700-y11-1.1.I","53.4268380666968","-6.2417384245122","6","185.235926544428"
"1-700-y11-1.1.I","53.4268874080753","-6.24203735638874","7","205.851454672516"
"1-700-y11-1.1.I","53.427394066848","-6.24287421729846","8","285.060040065768"
"1-700-y11-1.1.I","53.4275257974236","-6.24327509689195","9","315.473852717259"
"1-700-y11-1.2.O","53.277024711771","-6.20739084216546","1","0"
"1-700-y11-1.2.O","53.2777605784999","-6.20671521402849","2","93.4772699644143"
"1-700-y11-1.2.O","53.2780318605927","-6.2068238246152","3","124.525619356934"
"1-700-y11-1.2.O","53.2786209984572","-6.20894363498438","4","280.387737910482"
"1-700-y11-1.2.O","53.2791038678913","-6.21057305710353","5","401.635418300665"
"1-700-y11-1.2.O","53.2790975844245","-6.21075327761739","6","413.677012879457"
"1-700-y11-1.2.O","53.2792296384738","-6.21116766400758","7","444.981964564454"
"1-700-y11-1.2.O","53.2799500357098","-6.21065767664905","8","532.073870043666"
"1-700-y11-1.2.O","53.2800290799386","-6.2105343995296","9","544.115464622458"
"1-700-y11-1.2.O","53.2815594673093","-6.20949562301196","10","727.987702875002"
It is the 3rd part that I can't finish: creating the NEXT relationship!
//1. Create Shape
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
WITH DISTINCT csv.shape_id AS shape_id
MERGE (s:Shape {id: shape_id});
//2. Create Point, and Shape to Point relationship
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
MATCH (s:Shape {id: csv.shape_id})
with s, csv
MERGE (s)-[:POINTS]->(p:Point {id: csv.shape_id,
lat : csv.shape_pt_lat, lon : csv.shape_pt_lon,
sequence : toInt(csv.shape_pt_sequence), dist_travelled : csv.shape_dist_traveled});
//3.Create Point to Point relationship
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
???
You'll want APOC Procedures installed for this one. It has both a means of batch processing, and a quick way to link all nodes in a collection together.
Since you already have all shapes and the points of each shape in the db, you don't need to do another LOAD CSV; just use the data you've got.
We'll use apoc.periodic.iterate() to batch process each shape, and apoc.nodes.link() to link all ordered points in the shape by relationships.
CALL apoc.periodic.iterate(
"MATCH (s:Shape) RETURN s",
"WITH {s} as shape
MATCH (shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER by point.sequence ASC
WITH shape, COLLECT(point) as points
CALL apoc.nodes.link(points,'NEXT')",
{batchSize:1000, parallel:true}) YIELD batches, total
RETURN batches, total
EDIT
Looks like there may be a bug when using procedure calls within apoc.periodic.iterate() where no mutating operations occur (I noticed this after adding a SET operation to the second part of the query to set a property on some nodes; the property was not added).
Unsure if this is a general case of procedure calls being executed within procedure calls, or if this is specific to apoc.periodic.iterate(), or if this only occurs with both iterate() and link().
I'll file a bug if I can learn more about the cause. In the meantime, if you don't need batching, you can forgo apoc.periodic.iterate():
MATCH (shape:Shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER by point.sequence ASC
WITH shape, COLLECT(point) as points
CALL apoc.nodes.link(points,'NEXT')
Am I missing an easy way to do this?
I have a CSV file with a number of params in it, and in my test I want to be able to make some of the fields unique across CSV repetitions with a suffix determined by the number of times I've looped through the file.
So suppose my CSV (simplified) had:
abc
def
ghi
I want to generate in the test
abc_1
def_1
ghi_1 <hit EOF>
abc_2
def_2
ghi_2 <hit EOF>
abc_3
def_3
ghi_3
I thought I could set up a counter to run parallel to my CSV loop, but that won't work unless I increment it by 1/n each iteration, where n is the number of lines in my CSV file, which you can't do because counters are integers.
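To make the arithmetic concrete, here is a plain-Python sketch (purely illustrative, not JMeter) of the pass number that a simple +1 counter can't track:
n = 3  # number of lines in the CSV file
rows = ["abc", "def", "ghi"]
# After consuming row i (0-based) across repeated passes over the file,
# the pass number is i // n + 1; a counter stepping by 1 per row counts
# rows, not passes, and a fractional step of 1/n isn't possible
for i in range(3 * n):
    print(f"{rows[i % n]}_{i // n + 1}")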
I'm going to go flail around and see if I can come up with a solution, but in case I'm not successful, has anyone got any suggestions?
I've used an EOF marker row (an index column with something like "EOF" or "END", etc.) and an IF Controller with either a non-resetting counter, or user variables incremented via JavaScript in a BSF element (a BSF Assertion or whatever, just a mechanism to run the script).
Unfortunately it's the best solution I've come up with without putting too much effort into it.