Parse a flat file with multiple layouts to multiple destination tables? - ssis

I am trying to load a flat file which mixed multiple data sets. The flat file looks like.
1999XX9999
2XXX99
1999XX9999
2XXX99
3XXXXX999.99
1999XX9999
The first character of the every row defines the record type of the line. I want to create a script component in data flow and parse the raw rows (as the below) and save three output (1, 2, 3) to three different tables. Is it possible?
Table1(col1, col2, col3):
999, XX, 9999
999, XX, 9999
999, XX, 9999
Table2(col1, col2):
XXX, 99
XXX, 99
Table3(col1, col2):
XXXXX, 999.99
Any other way in SSIS if script component cannot do it? The best solution is writing a program to split the file into three files and load them using SSIS?

It is possible, and you probably should use a script transformation to create a maintainable solution.
You won't be able to completely parse your input file into columns using a flat file source and connection manager. Read your lines as full and use string functions in the script transformation to parse each line into the desired columns.
Now to distribute records to different destinations, you can either:
Define multiple outputs on your transformation and use a condition on the first character of each line to determine the output to which you send the columns.
Only use the script transformation to parse the line into columns and use a Conditional Split Transformation to logically divide your records over multiple data paths.
Both methods are logically similar, the implementation is different.

Related

How to read 100GB of Nested json in pyspark on Databricks

There is a nested json with very deep structure. File is of the format json.gz size 3.5GB. Once this file is uncompressed it is of size 100GB.
This json file is of the format, where Multiline = True (if this condition is used to read the file via spark.read_json then only we get to see the proper json schema).
Also, this file has a single record, in which it has two columns of Struct type array, with multilevel nesting.
How should I read this file and extract information. What kind of cluster / technique to use to extract relevant data from this file.
Structure of the JSON (multiline)
This is a single record. and the entire data is present in 2 columns - in_netxxxx and provider_xxxxx
enter image description here
I was able to achieve this in a bit different way.
Use the utility - Big Text File Splitter -
BigTextFileSplitter - Withdata Softwarehttps://www.withdata.com › big-text-file-splitter ( as the file was huge and multiple level nested) the split record size I kept was 500. This generated around 24 split files of around 3gb each. Entire process took 30 -40 mins.
Processed the _corrupt_record seperately - and populated the required information.
Read the each split file in a using - this option removes the _corrupt_record and also removes the null rows.
spark.read.option("mode", "DROPMALFORMED").json(file_path)
Once the information is fetched form each file, we can merge all the files into a single file, as per standard process.

How can I extract some contents in the cells of web-scraped csv file?

I am struggling with dealing with a csv file that scraped one crowdfunding website.
My goal is successfully load all information as separate columns, but I found some information are mixed in a single column when I load it using 1) R, 2) Stata, and 3) Python.
Since the real data is really dirty, let me suggest abbreviate version of current dataset.
ID
Pledge
creator
000001
13.7
{"urls":{"web":{"user":"www.kickstarter.com/profile/731"}}, "name":John","id":709510333}
000002
26.4
{"urls":{"web":{"user":"www.kickstarter.com/profile/759"}}, "name":Kellen","id":703514812}
000003
7.6
{"urls":{"web":{"user":"www.kickstarter.com/profile/7522"}}, "name":Jach","id":609542647}
My goal was extracting the "name" and "id" as separate columns, though they are all mixed with URLs in the creator column.
Is there any way that I can extract names (John, Kellen, Jach) and ids as separate columns?
I prefer R, but Stata and Python would also be helpful!
Thank you so much for considering this.
if you want to extract the name and id without any other values you can simply replace the code that is setting the creator column with
replace the creator with what ever variable that holds the dictionary
{"name": creator["name"], "id": creator["id"]}
also if the json data is not formatted correctly (like missing a quote) you can try using regular expressions

Extract JSON data with Data Flow - Flatten Transformation

Following a previous question I started working on a Data Flow, with the purpose of flattening a JSON file, created as a result of an Application Insights REST query. You can find an anonymised version here.
My goal is to extract the data in the "rows" array of arrays, but I end up with the data duplicated in a cartezian manner (I got an original number of 18 rows and I end up with 324, 18*18).
I cannot understand what I am doing wrong or if it is an issue with the JSON "rows" array of arrays.
Here is my Data Flow - Source has the "Document per line" JSON option, "Single documents" raises a [unexpected character "] error, probably due to the strange formatting in the JSON:
This is the Data Preview in the Source - as you can see, it is only one "tables" node, with 18 elements in the "rows" array:
rows:
I tried to Flatten it, but I cannot map "rows" data to a column, I cannot use something like table.rows[0]:
Also, the rows data gets duplicated - 18 rows for each of the 18 rows outputted:
I am not sure how to get to the bottom of this, if it's the JSON format or if I am doing something wrong. From my experience it's probably the latter.
I think this is caused by your special format.
Please try this:
add a rowdata property
flatten your data
#Steve Zhao thank you! But that solution duplicated the data similar to the original situation:
I did not manage to treat these data as JSON so I ended up thinking about it as text that can be manipulated into an array.
So I split the text by "rows:" and retrieve the second part of the split (arrays in Expression Builder start from 1):
Then I split that text as an array:
Which then I can flatten (at last):
From here on I keep splitting these values to get the data I need - I was interested in the first two and fourth column.

Importing massive dataset in Neo4j where each entity has differing properties

I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity which will have specific properties, e.g.:
label
description
date
In addition to these there are zero or more properties specific to the Entity type, so for example if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected but worth a try).
I've therefore made use of the bulk import tool neo4j-admin import which works over a CSV file which has specified headers for each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property expressed across the set of entities, however I believe I will end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.
The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: 1 for the property name, and 1 for the property value.
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
NOTE: the above code would use strings for all values. If you want some property values to be integers, booleans, etc., then you can do something like this instead (but this is probably only sensible if the same property occurs frequently; if the property does not exist on a line no property will be created in the node, but it will waste some time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', TOINTEGER(data.foo)) AS data
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);

Neo4j Cypher - creating nodes and setting labels with LOAD CSV

I’m trying to use LOAD CSV to create nodes with the labels being set to values from the CSV. Is that possible? I’m trying something like:
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (x:line.label)
...but I get an invalid syntax error. Is there any way to do this?
bicpence,
First off, this is pretty easy to do with a Java batch import application, and they aren't hard to write. See this batch inserter example. You can use opencsv to read your CSV file.
If you would rather stick with Cypher, and if you have a finite set of labels to work with, then you could do something like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS LINE
CREATE (n:load {lab:line.label, prop:line.prop});
CREATE INDEX ON :load(lab);
MATCH (n:load {lab:'label1'})
SET n:label1
REMOVE n:load
REMOVE n.lab;
MATCH (n:load {lab:'label2'})
SET n:label2
REMOVE n:load
REMOVE n.lab;
Grace and peace,
Jim
Unfortunately not, parameterized labels are not supported
Chris
you can do a workaround - create all nodes and than filter on them and create the desired nodes, than remove those old nodes
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (tmp:line[1])
WITH tmp
CREATE (x:Person {name: labels(tmp)[0]})
WITH tmp
REMOVE tmp
paste this into http://console.neo4j.org to see example:
LOAD CSV
WITH HEADERS FROM "http://docs.neo4j.org/chunked/2.1.2/csv/import/persons.csv" AS csvLine
CREATE (p:tmp { id: toInt(csvLine.id), name: csvLine.name })
WITH p
CREATE (pp:Person { name: labels(p)[0]})
WITH p, pp
DELETE p
RETURN pp
I looked around at a few questions like this, and came to the conclusion that a nice concise way to handle these kinds of complex frustrations of not being able to easily add dynamic labels through 'LOAD CSV', is simply use your favorite programming language to read CSV lines, and produce a text output file of Cypher statements that will produce the Neo4j node/edge structure that you want. Then you will also be able to edit the text file directly, to alter whatever you want to further customize your commands.
I personally used Java given I am most comfortable with Java. I read each line of the CSV into a custom object that represents a row in my CSV file. I then printed to a file a line that reflects the Cypher statement I wanted. And then all I had to do was cut and paste those commands into Neo4j browser command line.
This way you can build your commands however you want, and you can completely avoid the limitations of 'LOAD CSV' commands with Cypher
Jim Biard's answer works but uses PERIODIC COMMIT which is useful however deprecated.
I was able to write a query that:
Loads from CSV
Uses multiple transactions
Creates nodes
Appends labels
Will work for 4.5 and onwards
:auto LOAD CSV WITH HEADERS FROM 'file:///nodes_build_ont_small.csv' AS row
CALL {
with row
call apoc.create.node([row.label], {id: row.id})
yield node
return null
} IN TRANSACTIONS of 100 rows
return null
Seems that apoc procedures are more useful then the commands themselves since this is not possible (at least in my attempts) with CREATE.