Insert missing rows in CSV of incrementally numbered files generated by directory listing? - mysql

I have created a CSV from a set of files in a directory that are numbered incrementally:
img1_1.jpg, img1_2.jpg ... img1_1999.jpg, img1_2000.jpg
The CSV output is like so:
filename, datetime
eg:
img1_1.JPG,2011-05-11 09:16:33.000000000
img1_3.jpg,2011-05-11 10:10:55.000000000
img1_4.jpg,2011-05-11 10:17:31.000000000
img1_6.jpg,2011-05-11 10:58:37.000000000
The problem is that a number of files are missing from the listing, because some of the files don't exist. As a result, when imported, the actual row number does not match the file number.
Can anyone think of a reasonably efficient way to insert the missing rows so that the row number and filename match up, other than manually inserting rows for the missing ones? (There are over 800 missing rows.)
Background
A previous programmer developed an uploader script and did not save the creation time of the mysql record in the database. I figured the easiest way to find the creation time for the majority of the records would be to output a directory listing of all the files and combine them in a spreadsheet.

You need to do exactly what you describe in your comment replying to @tadman.
Write a text parser script that injects the missing lines, each with a date/time value that marks the record as an empty one, i.e. one with no real data behind it (e.g. date it 1950-01-01 00:00:00). When that is done, bulk import the CSV. I think this is the best and most efficient solution.
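If you would rather fill the gaps in SQL after a bulk import instead of writing a text parser, a rough sketch of the same idea in MySQL (the table photo_times and the helper numbers table seq are hypothetical names; seq is assumed to already hold the numbers 1 to 2000):

-- Hypothetical staging table, loaded from the CSV (e.g. via LOAD DATA INFILE)
CREATE TABLE photo_times (
    filename VARCHAR(64) PRIMARY KEY,
    taken_at DATETIME
);

-- Add a placeholder row, dated 1950-01-01, for every file number with no real record
INSERT INTO photo_times (filename, taken_at)
SELECT CONCAT('img1_', s.n, '.jpg'), '1950-01-01 00:00:00'
FROM seq AS s
LEFT JOIN photo_times AS p
       ON p.filename = CONCAT('img1_', s.n, '.jpg')
WHERE p.filename IS NULL;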
Also, think about any future insert/delete/update events that might occur on your data.
Those could break the sequence you initially had, so you might prefer instead to introduce a numeric field for the jpeg IDs (and index that field), and leave the PK as is (auto-increment).
In this case you avoid CSV manipulation, as well as being chained to your auto-increment PK (meaning you will not get into trouble if a new jpeg arrives with an ID that was previously deleted, an ID that already exists, etc.).
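As a minimal sketch of that idea, assuming a hypothetical photos table whose filename column holds names such as img1_1234.jpg:

-- Keep the auto-increment PK untouched; add a separate, indexed column
-- holding the number embedded in the jpeg filename
ALTER TABLE photos
    ADD COLUMN img_no INT UNSIGNED NULL,
    ADD INDEX idx_photos_img_no (img_no);

-- Fill it from the filename, e.g. 'img1_1234.jpg' -> 1234
UPDATE photos
SET img_no = CAST(
        SUBSTRING_INDEX(SUBSTRING_INDEX(filename, '_', -1), '.', 1)
        AS UNSIGNED);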
So the solution really depends on how you want to use this table in the future. If you give more details, I am sure the community can come up with even more ideas.

If it's a one-time thing, it might be easiest to open up your CSV in a spreadsheet.
If your table above is in Sheet1, you could put something like the following in Sheet2 (this is OpenOffice, but there are similar functions in Excel):
pre_filename | filename | datetime
img1_1 | = A2&".JPG" | =OFFSET(Sheet1.$B$1;MATCH(B2;Sheet1.$A$2:$A$4;0);0)
You should be able to select the three cells above and drag them down to however many you need.

Related

Editing SQL code in multiple queries at one time

I'd like to automate a procedure some. Basically, what I do is import a few spreadsheets from Excel, delete the old spreadsheets that I previously imported, and then change a few queries to reflect the title of the new imports. And then I change the name of the queries to reflect that I've changed them.
I suppose I could make this a bit easier by keeping the imported documents the same name as the old ones, so I'm open to doing that, but that still leaves changing the queries. That's not too difficult, either. The name stays pretty much the same, except the reports I'm working with are dated. I wish I could just do a "find and replace" in the SQL editor, but I don't think there's anything like that.
I'm open to forms, macros, or visual basic. Just about anything.
I've just been doing everything manually.
Assuming I have correctly understood the setup, there are a few ways in which this could be automated, without the need to continually modify the SQL of the queries which operate on the imported spreadsheet.
Following the import, you could execute an append query to transfer the data into a known, pre-existing table (after first deleting any existing data from that table), avoiding the need to modify any of your other queries. Alternatively, you could rename the imported table.
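As a minimal sketch of the append-query route (the table names tblReportData and tblImport_Current are hypothetical placeholders):

-- Clear out the previous contents of the permanent table
DELETE FROM tblReportData;

-- Append everything from the freshly imported sheet
INSERT INTO tblReportData
SELECT *
FROM tblImport_Current;

Both statements could be saved as action queries or run from VBA (e.g. with DoCmd.RunSQL).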
The task is then reduced to identifying the name of the imported table, given that it will vary for each import.
If the name of the spreadsheet follows logical rules (you mention that the sheets are dated), then perhaps you could calculate the anticipated name based on the date on which the import occurs.
Alternatively, you could store a list of the tables present in your database and then query this list for additions following the import to identify the name of the imported table.
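If this is Access (the mention of queries, macros, and Visual Basic suggests it is), one way to obtain such a list without maintaining it by hand is to read the hidden MSysObjects catalogue; a rough sketch:

-- Type = 1 marks a local table; the Left() test filters out the MSys* system tables
SELECT Name
FROM MSysObjects
WHERE Type = 1
  AND Left(Name, 4) <> 'MSys';

Running this before and after the import and comparing the two result sets gives the name of the newly created table.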

Pentaho Import unique records into database

I am quite new to Pentaho Spoon and I would like to import the records of a CSV file into a database table. However, only unique records should be imported into the database table. That is why I need to compare EACH record with all records of the database table in order to determine whether the record should be imported or not.
So far, I tried out the suggested CRUD-pattern which looks like this:
As you can see in the picture, I merge the Excel input and the table input (ignore the cast steps; I needed to cast a value because they differed in float format: the database format was #.000000 and the CSV float format was #.0).
After the merge join, I compare the flag (which is given by the Merge Rows (diff) step): if the compared records are new, I import them into the database table; if they are changed, I update the record; and if they are deleted or identical, I simply do nothing. So far, so good.
But here is the problem: if I shuffle the records of the CSV input file and run the transformation again, all the records are imported anew and consequently there are duplicates in my database table (which I wanted to avoid). To emphasize again: the right way to solve this is that each row of the CSV input file is compared with ALL entries in the database table.
How can I realize this? Any suggestions? Thank you so much in advance!!
The Merge Rows (diff) step expects its input to be sorted. Normally, you will have been warned of this by a pop-up.
Put a Sort rows step on the output flow of the Excel Input, before it reaches the Merge Rows (diff).
You should do the same between the Table Input and the Merge Rows (diff). Of course, you may think you could do it in the SQL statement of the Table Input.
However, there is a beginner trap here. You have three other steps (Output Rows, Update, and Delete) which operate on the same table, and these steps may lock it. Because all the steps in Kettle run concurrently, you do not know which step will fire first, and the table may be locked and never able to deliver even the first record. This is known in jargon as an auto-lock, and the way to solve it is to put a Sort rows step in as a buffer.
You can use the 'Dimension lookup/update' step, which provides the same functionality you are trying to achieve.
Thanks,
Nilesh

How to skip irregular header information of a Flat File in SSIS?

I have a file like the one seen below (just an example):
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip the header information and load the flat file into the DB
Like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary (it is not a fixed number of rows) from file to file.
Any help? Thanks in advance.
The generic SSIS components cannot meet this requirement. You need to code for this, e.g. in an SSIS Script Task.
I would code that script to read through the file looking for that header row ID Name Address, and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component.
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate it with numbers - not great if you have no idea how many rows might be coming in or whether the IDs are actually unique each time, but if you're only loading in a few rows each time and they always start at 1, you could chuck the numbers 1 to 100 into a table and throw away any rows you load which don't match when doing a lookup onto this table.
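As a sketch of that last idea in SQL Server syntax (dbo.ValidRowID is a hypothetical name):

-- Lookup table of row IDs that are allowed through the data flow
CREATE TABLE dbo.ValidRowID (ID INT PRIMARY KEY);

-- Seed it with the numbers 1 to 100
INSERT INTO dbo.ValidRowID (ID)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;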
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can pick up just the columns which you know will contain data you care about (i.e. columns 1 through 3), you can do any conversions you need in SQL rather than in SSIS, and you can make sure your columns have sensible names to use in SSIS.
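For example, a rough sketch of the second data flow's source query, assuming a hypothetical staging table dbo.stg_FlatFile whose columns Col1 to Col3 are all loaded as varchar:

-- Keep only the data rows: those whose first column is a non-empty string of digits
SELECT CAST(Col1 AS INT) AS ID,
       Col2              AS Name,
       Col3              AS Address
FROM dbo.stg_FlatFile
WHERE Col1 NOT LIKE '%[^0-9]%'   -- SQL Server LIKE character class: no non-digit characters
  AND LEN(Col1) > 0;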

A way to update data in Oracle

I have a table that I need to update each day. The data comes in a text file every time. I wrote a program that extracts the data from the text file and writes it to the table, but now I want to modify it to just update the existing data. The data is mostly the same; only a few things might differ.
I was thinking about MERGE but I don't know very well how I could use this in my program. All the examples that I saw used a second table.
So it would be like creating a second table into which I extract the current data, and then merging it into the old table to update the records. I want to avoid creating a second table, so I was wondering if there is any way around this?
Thanks!
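For reference, MERGE does not require a second physical table: its USING clause can take a subquery, so a program reading the text file row by row could bind each parsed line into a statement like the sketch below (table and column names are hypothetical):

MERGE INTO daily_data t
USING (SELECT :id AS id, :name AS name, :amount AS amount FROM dual) s
   ON (t.id = s.id)
WHEN MATCHED THEN
    UPDATE SET t.name = s.name,
               t.amount = s.amount
WHEN NOT MATCHED THEN
    INSERT (id, name, amount)
    VALUES (s.id, s.name, s.amount);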

Extract Distinct Record in SSIS

I am writing an SSIS package to import data from *.csv files into a SQL 2008 DB. The problem is that one of the files contains duplicate records in the CSV itself, and I want to extract only the distinct values from that source. Please see the image below.
Unfortunately, the generated files are not under my control; they are owned by a third party and I cannot change the way they are generated.
I did try the Lookup component, but it only checks the incoming data against the existing data. It does not check for duplicate records within the incoming data.
I believe the sort component gives an option to remove duplicate rows.
It depends on how serious you want to get about the duplicates. Do you need a record of what was duplicated, or is it enough to just get rid of them? The Sort component will get rid of dups on the sort fields. However, the dups may have different data in the other fields, and then you want a different strategy. Usually I load everything to staging tables and clean up from there. I send the removed dupes to an exception table (we have to answer a lot of questions from our customers about why things don't match what they sent), and I often use a set of business rules (enforced with either an Execute SQL or Data Flow task) to determine which one to pick if there are duplicates in one area but not another (say two business addresses when we can only store one). I also make sure the client is aware of how we determine which of the two to pick.
Use the Sort tool from the Toolbox, then click on it. You will see all the available input columns.
Check the column, set the sort type and direction, and then tick "Remove rows with duplicate sort values".
Bring in the data from the csv file the way it is, then dedup it after it's loaded.
It'll be easier to debug, too.
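As a sketch of the dedup-after-load approach in SQL Server 2008 syntax (dbo.stg_Import is a hypothetical staging table; QualificationID and UnitID, mentioned in the answer below, serve as the duplicate key):

-- Keep one row per (QualificationID, UnitID) and delete the rest
WITH ranked AS (
    SELECT QualificationID,
           UnitID,
           ROW_NUMBER() OVER (
               PARTITION BY QualificationID, UnitID
               ORDER BY QualificationID
           ) AS rn
    FROM dbo.stg_Import
)
DELETE FROM ranked
WHERE rn > 1;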
I used the Aggregate component and grouped by both QualificationID and UnitID. If you want, you can also use the Sort component. Perhaps my information might help others.