SSIS - Reading from a multi-part text file

I'm trying to build an SSIS package that reads from a text file and outputs into another text file. The catch is that the file I'm trying to read from has multiple sections and I can't find anything that shows how to do that.
The file looks like this:
[sectionA]
key1=value1
key2=value2
key3=value3
[sectionB]
key4=value4
key5=value5
key6=value6
I started with a couple of tutorials that read from a flat file source but the data gets pulled into an equally ugly table. Hoping someone has some input on this.

The SSIS Flat File Connection is built for speed, so it doesn't allow for niceties like that.
I would still use the Flat File Connection but just load all the data into a single, wide NVARCHAR column in a SQL table. I would add an IDENTITY column to that table to get a relative Row Number.
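For illustration, a minimal sketch of such a staging table (the table and column names simply match the query below; use whatever fits your conventions):
CREATE TABLE dbo.Staging_Table
(
    File_Row_Number  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- relative row number assigned as rows are loaded
    nvarchar_column  NVARCHAR(4000)    NULL                   -- the whole raw line from the text file
);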
Then I would add downstream tasks using SQL to select by Sections e.g. for Section A rows:
WHERE File_Row_Number > ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionA]' )
AND File_Row_Number < ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionB]' )
If the split requirements are as simple as those shown I might attempt them in SQL e.g.
How do I split a string so I can access item x?
But I would probably lean towards using Strings.Split in a Script Task where the code will be simpler and safer.
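For the pure SQL route, a minimal sketch of splitting the key=value rows with CHARINDEX and SUBSTRING (this assumes the staging table sketched above and skips the [section] header rows):
SELECT LEFT(nvarchar_column, CHARINDEX('=', nvarchar_column) - 1)            AS [key]
     , SUBSTRING(nvarchar_column, CHARINDEX('=', nvarchar_column) + 1, 4000) AS [value]
FROM   Staging_Table
WHERE  CHARINDEX('=', nvarchar_column) > 0;   -- only rows that actually contain key=value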

Related

SQL COPY INTO is unable to parse a CSV due to unexpected line feeds in some fields. ROWTERMINATOR and FIELDQUOTE parameters do not work

I have a CSV file in Azure Data Lake that, when opened with Notepad++, looks something like this:
a,b,c
d,e,f
g,h,i
j,"foo
bar,baz",l
Upon inspection in Notepad++ (view all symbols) it shows me this:
a,b,c[CR][LF]
d,e,f[CR][LF]
g,h,i[CR][LF]
j,"foo[LF]
[LF]
bar,baz",l[CR][LF]
That is to say normal Windows Carriage Return and Line Feed stuff after each row.
With the exception that for one of the columns someone inserted a fancy multi-line story such as:
foo
bar, baz
My TSQL code to ingest the CSV looks like this:
COPY INTO dbo.SalesLine
FROM 'https://mydatalakeblablabla/folders/myfile.csv'
WITH (
    ROWTERMINATOR = '0x0d',   -- Tried \n, \r\n and 0x0d0a here
    FILE_TYPE = 'CSV',
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ',',
    CREDENTIAL = (IDENTITY = 'Managed Identity')   -- Used to access the data lake
)
But the query doesn't work. The common error message in SSMS is:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (NAME) in data file
I have no option to correct the faulty rows in the data lake or modify the CSV in any way.
Obviously it is a much larger file with real data, but I took a simple example.
How can I modify or rescript the TSQL code to correct the CSV when it is being read?
I recreated a similar file and uploaded it to my data lake, and serverless SQL pool seemed to manage it just fine:
SELECT *
FROM OPENROWSET(
    BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result]
My results:
It probably seems like a bit of a workaround, but if the improved parser in serverless is making light work of problems like this, then why not make use of the whole suite that is Azure Synapse Analytics? You could use the serverless query as a source in a Copy activity in a Synapse pipeline and load it to your dedicated SQL pool, and that would have the same outcome as using the COPY INTO command.
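For reference, a minimal sketch of the same serverless query with an explicit schema added, which is roughly what you would expose as the source query for such a pipeline (the column names and types are my own assumptions, since the sample file has no header row):
SELECT *
FROM OPENROWSET(
    BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (
    col1 VARCHAR(100)  1,   -- assumed names and sizes; the trailing number is the column ordinal in the file
    col2 VARCHAR(4000) 2,
    col3 VARCHAR(100)  3
) AS [result];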
In the past I've done things like write special parsing routines, load the file as one column and split it in the database, or use regular expressions, but if there's a simple solution, why not use it?
I also viewed my test file via an online hex editor, in case I'm missing something:

Alternative to JSON Flattening via Target Table in Snowflake

Per Snowflake: https://docs.snowflake.net/manuals/user-guide/json-basics-tutorial-copy-into.html I created a target table (Testing_JSON) that is a single VARIANT column containing an uploaded JSON file.
My question is: how can I cut out creating this target table (i.e. Testing_JSON), the single VARIANT column I have to reference to create the actual and only table I want (TABLE1), which contains the flattened JSON? I have found no way to read in a JSON file from my desktop and 'parse it on the fly' to create a flattened table via the UI. I am not using the CLI, as I know this can be done using PUT/COPY INTO.
create or replace temporary table TABLE1 AS
SELECT
VALUE:col1::string AS COL_1,
VALUE:col2::string AS COL_2,
VALUE:col3::string AS COL_3
from TESTING_JSON
, lateral flatten( input => json:value);
You can't do this through the UI. If you want to do this then you need to use an external tool on your desktop or, as Mike mentioned, do it in the COPY statement.
You're going to need to do this in a few steps from your desktop.
Use SnowSQL or some other tool to get your JSON file up to blob storage:
https://docs.snowflake.net/manuals/sql-reference/sql/put.html
Use a COPY INTO statement to get the data loaded directly to the flattened table that you want to load to. This will require a SELECT statement in your COPY INTO:
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
There is a good example of this here:
https://docs.snowflake.net/manuals/user-guide/querying-stage.html#example-3-querying-elements-in-a-json-file
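For illustration, a minimal sketch of those two steps from SnowSQL, using the user stage (the local file path and stage folder are hypothetical, and it assumes the file is a JSON array of objects with col1/col2/col3 keys):
-- 1) upload the local file to the user stage (PUT compresses it to .gz by default)
PUT file:///tmp/my_file.json @~/json_stage;

-- 2) load straight into the flattened table with a transforming SELECT
COPY INTO TABLE1 (COL_1, COL_2, COL_3)
FROM (
    SELECT t.$1:col1::string
         , t.$1:col2::string
         , t.$1:col3::string
    FROM @~/json_stage/my_file.json.gz t
)
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE);
If the objects are nested under an outer key (as in the json:value flatten above), COPY's transformation SELECT is more limited than a regular query, so the staging-table/FLATTEN route may still be needed for that shape.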

Importing a series of .CSV files that contain one field while adding additional 'known' data in other fields

I've got a process that creates a csv file that contains ONE set of values that I need to import into a field in a MySQL database table. This process creates a specific file name that identifies the values of the other fields in that table. For instance, the file name T001U020C075.csv would be broken down as follows:
T001 = Test 001
U020 = User 020
C075 = Channel 075
The file contains a single row of data separated by commas for all of the test results for that user on a specific channel and it might look something like:
12.555, 15.275, 18.333, 25.000 ... (there are hundreds, maybe thousands, of results per user, per channel).
What I'm looking to do is to import directly from the CSV file adding the field information from the file name so that it looks something like:
insert into results (test_no, user_id, channel_id, result) values (1, 20, 75, 12.555)
I've tried to use "Bulk Insert" but that seems to want to import all of the fields where each ROW is a record. Sure, I could go into each file and convert the row to a column and add the data from the file name into the columns preceding the results but that would be a very time consuming task as there are hundreds of files that have been created and need to be imported.
I've found several "import CSV" solutions but they all assume all of the data is in the file. Obviously, it's not...
The process that generated these files is unable to be modified (yes, I asked). Even if it could be modified, it would only provide the proper format going forward and what is needed is analysis of the historical data. And, the new format would take significantly more space.
I'm limited to using either MATLAB or MySQL Workbench to import the data.
Any help is appreciated.
Bob
A possible SQL approach to getting the data loaded into the table would be to run a statement like this:
LOAD DATA LOCAL INFILE '/dir/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = '001'
, user_id = '020'
, channel_id = '075'
;
We need the comma to be the line separator. For the field separator, we can specify some character that is guaranteed not to appear in the data, so LOAD DATA sees a single "field" on each "line".
(If there isn't a trailing comma at the end of the file, after the last value, we need to test to make sure we are getting that last value, i.e. the last "line" as we're telling LOAD DATA to view the file.)
We could use user-defined variables in place of the literals, but that leaves the part about parsing the filename. That's really ugly in SQL, but it could be done, assuming a consistent filename format...
-- parse filename components into user-defined variables
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1) AS t
     , SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1) AS u
     , SUBSTRING_INDEX(f.n,'C',-1) AS c
     , f.n AS n
  FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX( i.filename ,'/',-1),'.csv',1) AS n
           FROM ( SELECT '/tmp/T001U020C075.csv' AS filename ) i
       ) f
  INTO @ls_t
     , @ls_u
     , @ls_c
     , @ls_n
;
While we're testing, we probably want to see the result of the parsing.
-- for debugging/testing
SELECT @ls_t
     , @ls_u
     , @ls_c
     , @ls_n
;
And then there's the part about running the actual LOAD DATA statement. We've got to specify the filename again, and we need to make sure we're using the same filename ...
LOAD DATA LOCAL INFILE '/tmp/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = @ls_t
, user_id = @ls_u
, channel_id = @ls_c
;
(The client will need read permission on the .csv file.)
Unfortunately, we can't wrap this in a procedure, because running a LOAD DATA statement is not allowed from a stored program.
Some would correctly point out that as a workaround, we could compile/build a user-defined function (UDF) to execute an external program, and a procedure could call that. Personally, I wouldn't do it. But it is an alternative we should mention, given the constraints.

Cypher - load multiple CSV files

I have many CSV files with names 0_0.csv, 0_1.csv, 0_2.csv, ..., 1_0.csv, 1_1.csv, ..., z_17.csv.
I wanted to know how I can import them in a loop or something.
Also, I wanted to know whether I am doing this well (each file is 50MB and the whole set is about 100GB).
This is my code :
create index on :name(v)
create index on :value(v)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.txt" AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
You could handle multiple files by constructing a file name. Unfortunately this seems to break when using the USING PERIODIC COMMIT query hint so it won't be a good option for you. You could create a script to wrap it up and send the commands to bin/cypher-shell though.
UNWIND ['0','1','z'] as outer
UNWIND range(0,17) as inner
LOAD CSV WITH HEADERS FROM 'file:///'+ outer +'_' + toString(inner) + '.csv' AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
As far as your actual load query goes: do your name and value nodes come up multiple times in the files? If they are unique, you would be better off loading the data in multiple passes. Load the nodes first without the indexes; then add the indexes once the nodes are loaded; and then do the relationships as the last step.
Using CREATE for the :kind relationship will result in multiple relationships even if it is the same value for csv.kind. You might want to use MERGE instead if that is the case.
For 100 GB of data though if you are starting with an empty database and are looking for speed, I would take a look at using bin/neo4j-admin import.

insert fields from database1 into table from database2

I am using PrestaShop and have data in Zen Cart. I am matching up information and want to select the data to be inserted into a different table under different fields.
insert into presta_table1 (c1, c2, ...)
select c1, c2, ...
from zen_table1
Since a lot is different, I need to do approximately 800 records once I match up which field is which in which table.
I recently found an example:
USE datab1;
INSERT INTO datab3.prestatable (author, editor)
SELECT author_name, editor_name
FROM author, datab2.editor
WHERE author.editor_id = datab2.editor.editor_id;
It would be nice to find a way to import while avoiding duplicates.
I am unable to find examples of this.
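For illustration, one way to make the cross-database INSERT ... SELECT skip rows that already exist is a NOT EXISTS guard (the table and column names are just the placeholders from the snippet above, and the duplicate test would need to match the real PrestaShop schema):
INSERT INTO presta_table1 (c1, c2)
SELECT z.c1, z.c2
FROM zen_table1 z
WHERE NOT EXISTS (
    SELECT 1
    FROM presta_table1 p
    WHERE p.c1 = z.c1   -- whatever uniquely identifies a duplicate in your data
);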
Here is what I did to get data out of this POS (point of sale) system that uses MySQL for a database.
I found the tables with the data I needed and exported that data, which came out in CSV format. I used Calc in LibreOffice to open it and then, in another sheet, manipulated the data into the example CSV fields, and that worked well.
I had some issues with some of the data, but I used console commands to help me get by these; let me share them with you.
The Zen Cart description data exported from MySQL Workbench had some model numbers that I needed to pull out into their own field:
43,1,"Black triple SS","Black Triple SS
12101-57 (7.5 inch)",,0
I used a command to add in "," so that I could extract the data and overlay it, essentially copying and pasting into the Calc spreadsheet where I needed it, in the normal order.
This sed command removes the ),( in the file and replaces it with a newline.
I did a database dump, removed the starting ( and ending ), saved it as zencart_product.csv, and then ran this in a console:
sed 's/),(/\n/g' zencart_product.csv > zencart_productNEW.csv
I had about 1000 files with $ and # in them, so I put them all in a dir and renamed them with:
get rid of the $ symbol
rename 's/\$//g' *
get rid of space
rename "s/\s+//g" *
I hope people stuck in some software who want their data out are able to get it out with some time and effort, and that this helps someone. Thanks.