Merge CSVs Appending Headers - MySQL

I have a number of CSV files with hundreds of columns and about 50,000 rows (when opened in Excel). The column headers are almost identical, but some headers may vary from one CSV file to the next, for example:
CSV1
Name,Surname,DOB
John,Smith,31/01/1989
CSV2
Name,Age,Surname,Address,DOB
Paul,29,Jones,123 Smith St,30/12/1981
CSV3
Name,Surname,Address,Telephone
Mick,Jones,123 Paul St,0123456
Is there any way I can merge all of these into one big CSV file, combining the headers so that the one main CSV has the headers "Name, Surname, DOB, Age, Address, Telephone", for example, with the entries from each source CSV falling under their respective column headings? The reason I want to do this is to then load the information into a big MySQL / SQL Server table, and it seems easier to build everything up as one big CSV first before importing.
Any suggestions?

Import them into three temporary tables and then merge them into one table using joins on Name, Surname and DOB. Otherwise the data will get all mixed up.

Manual method (bear with me, just giving an idea of the algorithm):
1. Generate a final list of columns that includes all possible headers in all CSVs.
2. Open each spreadsheet, one at a time. For each spreadsheet:
   - Click and drag the headers and insert missing columns so they all match your list from #1.
   - Save the file, and repeat step #2 for the next spreadsheet.
3. Combine all the spreadsheets into a single spreadsheet.
4. Import.
If you are going to automate this, you will take roughly the same steps. You need a way to determine which columns are possible, then put the CSVs in the right format and combine them, either in spreadsheet/CSV format (a rough sketch of that route is shown below), or by importing them as a bunch of temp tables and using INSERT...SELECT to re-arrange the columns where they belong.
What languages/technologies do you have available to you for the automation? .NET? Java? PHP? How often will this process occur, and how automated does it have to be? Is it a daily process, or weekly, or only going to happen once? How many spreadsheets roughly?
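If a scripting language is available, the CSV route only takes a couple of passes over the files. A minimal Python sketch (just one possible approach), assuming all the input files sit in one directory, are comma-delimited, and the output name combined.csv is free to use:

    import csv
    import glob

    # Hypothetical: every input CSV sits in the current directory.
    input_files = sorted(p for p in glob.glob("*.csv") if p != "combined.csv")

    # Pass 1: build the combined header as the union of every file's columns.
    all_columns = []
    for path in input_files:
        with open(path, newline="") as f:
            for column in csv.DictReader(f).fieldnames or []:
                if column not in all_columns:
                    all_columns.append(column)

    # Pass 2: write every row, leaving blanks for columns a file does not have.
    with open("combined.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=all_columns, restval="")
        writer.writeheader()
        for path in input_files:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    writer.writerow(row)

The combined file can then be imported into MySQL or SQL Server in a single load.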

Related

How to insert data into one table which is coming from 2 different csv files using conditional split transformation?

I have 3 CSV files: 1 teacher file and 2 student files. I have to insert the teacher data into one table, and the student data for students who got more than 50 marks into another table, from the 2 student CSV files. Please explain how to use the Conditional Split transformation on those 2 student files to put the data into one table.
Are you sure you want to use the Conditional Split? You need to combine the student flat files into one table, right? If so, what you want to use is a Merge Join transformation.
You can read more about how to use the Merge Join transformation in the SSIS documentation.
Not sure if I have understood the question correctly. My assumptions:
Teacher data is moved from its CSV into table 1 with no conditions.
The student files (CSV) contain only unique records.
Records where the student achieved a score greater than or equal to 50 are inserted into table 2.
If the above assumptions are correct, the simplest way will be to use a loop container (a Foreach Loop) to loop through the student files, with one data flow which does the following:
Reads student file
Passes the file to the conditional split
Writes to the destination table
The Conditional Split transformation allows one to configure conditions and route rows to outputs based on those conditions.
If the file contains a column called StudentScore, then in the Conditional Split the first condition should be set as in the attached screenshot; for example, something along the lines of (DT_I4)StudentScore >= 50. Please note that because StudentScore is set to a string in the source file, it has to be converted to an integer, hence the (DT_I4) cast; if it is set as an integer in the source file this conversion is redundant.
I have also given the output a name, StudentScore; this output will then be linked to the destination. I hope this helps.
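For intuition only (this is the filtering logic, not the SSIS task itself), the Conditional Split described above does roughly the following; a minimal Python sketch, with hypothetical file names and assuming both student files contain a StudentScore column:

    import csv

    # Hypothetical file names; in the real package a loop container supplies these.
    student_files = ["students1.csv", "students2.csv"]

    with open("students_passed.csv", "w", newline="") as out_f:
        writer = None
        for path in student_files:
            with open(path, newline="") as in_f:
                reader = csv.DictReader(in_f)
                if writer is None:
                    # Assumes both student files have the same columns.
                    writer = csv.DictWriter(out_f, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    # Equivalent of the (DT_I4) cast: the score arrives as text.
                    if int(row["StudentScore"]) >= 50:
                        writer.writerow(row)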

How to skip irregular header information of a Flat File in SSIS?

I have a file like the one seen below (just an example):
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip the header information and load the flat file to the DB
Like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary (it is not a fixed number of rows) from file to file.
Any help..Thanks in advance.
The generic SSIS components cannot meet this requirement. You need to code for this e.g. in an SSIS Script task.
I would code that script to read through the file looking for that header row ID Name Address, and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component.
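The script itself would normally be written in C#/VB.NET inside the SSIS Script Task; purely to illustrate the pre-processing logic, here is a minimal standalone sketch in Python (file names are hypothetical):

    # Copy everything from the real header row onward into a new file,
    # skipping the irregular lines above it. File names are hypothetical.
    header_marker = "ID Name Address"

    with open("input.txt") as src, open("cleaned.txt", "w") as dst:
        copying = False
        for line in src:
            if not copying and line.strip().startswith(header_marker):
                copying = True  # start copying from the header row itself
            if copying:
                dst.write(line)

The cleaned file can then be loaded with the ordinary Flat File Source, exactly as described above.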
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate this with numbers - not great if you have no idea how many rows might be coming in or if the IDs are actually unique each time, but if you're only loading in a few rows each time and they always start at 1, you could chuck the numbers 1 - 100 into a table and throw away any rows you load which don't match when doing a lookup onto this table.
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can just pick up the columns which you know will have data you care about in (i.e. columns 1 through 3), you can do any conversions you need to do in SQL rather than in SSIS, and you can make sure your columns have sensible names to be used in SSIS.

Insert missing rows in CSV of incrementally numbered files generated by directory listing?

I have created a CSV from a set of files in a directory that are numbered incrementally:
img1_1.jpg, img1_2.jpg ... img1_1999.jpg, img1_2000.jpg
The CSV output is like so:
filename, datetime
eg:
img1_1.JPG,2011-05-11 09:16:33.000000000
img1_3.jpg,2011-05-11 10:10:55.000000000
img1_4.jpg,2011-05-11 10:17:31.000000000
img1_6.jpg,2011-05-11 10:58:37.000000000
The problem is, there are a number of files missing in the listing, as some of the files don't exist. As a result, when imported, the actual row number does not match the file number.
Can anyone think of a reasonably efficient way to insert the missing rows so that the row numbers and filenames match up, other than manually inserting rows for the missing ones? (There are over 800 missing rows.)
Background
A previous programmer developed an uploader script and did not save the creation time of the mysql record in the database. I figured the easiest way to find the creation time for the majority of the records would be to output a directory listing of all the files and combine them in a spreadsheet.
You need to do exactly what you describe in your comment replying to @tadman.
Write a text parser script to inject the missing lines with, e.g., a date/time value that marks the record as an empty one, i.e. one with no real data behind it (e.g. date it to 1950-01-01 00:00:00). When that is done, bulk import the CSV. I think this is the best and most efficient solution.
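A minimal sketch of such a parser in Python, assuming the listing has no header row, is sorted by file number, and uses the hypothetical file names below:

    import csv
    import re

    PLACEHOLDER = "1950-01-01 00:00:00.000000000"  # marks rows with no real data behind them

    def file_number(name):
        # Extract N from names like img1_N.jpg / img1_N.JPG.
        return int(re.search(r"img1_(\d+)\.jpe?g", name, re.IGNORECASE).group(1))

    with open("listing.csv", newline="") as src, \
         open("listing_filled.csv", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        expected = 1
        for filename, taken_at in reader:
            # Insert a placeholder row for every missing number before this one.
            while expected < file_number(filename):
                writer.writerow([f"img1_{expected}.jpg", PLACEHOLDER])
                expected += 1
            writer.writerow([filename, taken_at])
            expected += 1

After that, row N of the bulk import lines up with img1_N again.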
Also, think about any future insert/delete/update events that might occur to your data.
Those could break the chain you initially had, so you might prefer instead to introduce a numeric field for the JPEG IDs (and index that field), and leave the PK as is (auto increment).
In this case you avoid the CSV manipulation, as well as being chained to your auto-increment PK (meaning you will not get into trouble if a new JPEG arrives with an ID that was previously deleted, or with an existing ID, etc.).
So the solution really depends on how you want to use this table in the future. If you give more details, I am sure the community can come up with even more ideas.
If it's a one-time thing, it might be easiest to open up your csv in a spreadsheet.
If your table above is in Sheet1, you could put something like the following in Sheet2 (this is OpenOffice, but there are similar functions in Excel):
pre_filename | filename | datetime
img1_1 | = A2&".JPG" | =OFFSET(Sheet1.$B$1;MATCH(B2;Sheet1.$A$2:$A$4;0);0)
You should be able to select the three cells above and drag them down to however many you need.

Using Multi Cast transformation for destinations with different columns

What is the use of the Multicast Transformation task? With this task, is it possible to send to two destinations from a single source, while each destination has different columns?
I assume that you are referring to Multicast Transformation inside the Data Flow task. If so, yes it is possible. The purpose of the transformation is to channel data from a single source to n number of Transformation tasks or Destinations.
If the source has the following columns:
Source: Column 1, Column 2, Column 3
and the destinations have these columns:
Destination 1: Column 1, Column 3
Destination 2: Column 2
Both destinations will be able to see Columns 1 - 3 that are available in the Source. You have to map the columns accordingly in the respective destinations. Refer to the example below:
Example:
Screenshot #1 shows that Source has two columns Header and Value.
Screenshot #2 shows that Destination 1 has both columns Header and Value. They are mapped accordingly.
Screenshot #3 shows that Destination 2 has only column Header. It is mapped accordingly.
Screenshot #4 shows sample package execution.
Hope that helps.
#Siva did a good job of explaining the how. I'm going to tackle the "What is the use of Multicast Transformation Task?" question.
Let me give you examples of how I have used it or seen it used. First, we like to store the data in a staging table that contains just the raw, unchanged data (this makes it easier for us to research data issues and see whether the problem came from a bug in our process or from bad data sent by the client), and at the same time I want to send the same data to another staging table that will be used to transform the data.
Sometimes we use Multicast to take denormalized files and send them to normalized data tables. So the names go to the person table, the addresses go to the address table and the phones go to the phone table.
Multicast can be used to do several different transformations on different data fields from the same source at the same time, rather than one at a time, and then bring all the revised data back together in a Merge Join. So one path checks the States to make sure they are valid, or converts the long names to the 2-character abbreviations, and another checks the zip codes and adds the leading zeros that got lost because the data came from an Excel file. Then the cleaned address data is put back together with the correct values we want for insertion into our database. This can speed up cleaning, as data is being scrubbed simultaneously rather than one step at a time.

Importing Excel Sheets into MySQL with Values that relate to a separate table

First up, this might be the wrong place to ask this question.. So, sincere apologies for that!
I have two MySQL Tables as follows:
First Table: employee_data
id name address phone_no
1 Mark Some Street 647-981-1512
2 Adam Some Street 647-981-1214
3 John Some Street 647-981-1452
Second Table: employee_wages
id employee_id wages start_date
1 3 $15 12 March 2007
2 1 $20 10 Oct 2008
3 2 $18 2 June 2006
I know both these tables could be combined into one and there is no need to split this data into two tables, but what I'm working on requires this data to be separate, in two different tables.
Now, previously my company used to handle all this data in Excel sheets and they followed the conventional method of having these two tables combined into one sheet as follows:
Excel Sheets
id name wages start_date
1 Mark $20 10 Oct 2008
2 Adam $18 2 June 2006
3 John $15 12 March 2007
Now, the objective is to Export the data from Excel sheets into MySQL Tables.
As you can see, employee_data.id is linked to employee_wages.employee_id.
How can I replace the values in the Excel sheet's 'name' column so that they represent the actual unique ID they're given in the employee_data.id column?
Maybe I can do it with PHP/MySQL, or I can get this done in VBScript, but I'm not an expert in VBScript.
Any help will be much appreciated..
Thanks!
Save your Excel data as a CSV file (in Excel 2007, using Save As).
Check the saved file using a text editor such as Notepad to see what it actually looks like, i.e. what delimiter was used etc.
Start the MySQL Command Prompt (I’m lazy so I usually do this from the MySQL Query Browser – Tools – MySQL Command Line Client to avoid having to enter username and password etc.)
Enter this command:
LOAD DATA LOCAL INFILE 'C:\\temp\\yourfile.csv' INTO TABLE database.table FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n' (field1, field2);
[Edit: Make sure to check your single quotes (') and double quotes (") if you copy and paste this code - it seems WordPress is changing them into some similar but different characters]
Done!
Use the VLOOKUP function in Excel:
Import the data into Excel
Use VLOOKUP to combine what you need
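If a scripted route is preferred to VLOOKUP (the question mentions PHP or VBScript), the same name-to-ID lookup can be sketched in Python; the file names below are hypothetical and assume both sheets/tables have been exported to CSV first:

    import csv

    # Build a name -> id map from an export of the employee_data table (hypothetical file name).
    with open("employee_data.csv", newline="") as f:
        name_to_id = {row["name"]: row["id"] for row in csv.DictReader(f)}

    # Rewrite the combined Excel export into rows ready for the employee_wages table.
    with open("excel_sheet.csv", newline="") as src, \
         open("employee_wages.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["employee_id", "wages", "start_date"])
        for row in csv.DictReader(src):
            writer.writerow([name_to_id[row["name"]], row["wages"], row["start_date"]])

The resulting file can then be bulk loaded into employee_wages, e.g. with LOAD DATA INFILE as shown above.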
You can query MySQL from Excel; this example uses INSERT: Excel VBA: writing to mysql database
A recordset can be written to Excel with CopyFromRecordset: http://support.microsoft.com/kb/246335