Finding part of a sequence of numerical values in files with unknown names

I have the following problem I would like to solve.
I have a main table file that contains two columns: a unique identifier and a sequence of numerical values. The file looks something like this. The numerical values were copied into this file from a number of different Excel or .csv files (not more than 10). Each of these source files contain a part of the numerical values in the main file. These files look something like this.
I know the approximate, although not the exact length of each of the sub-sequences making up the numerical values in the column in the main file. I do not know the name of the .csv or Excel file where they originally came from.
I would like to do three things:
For each part of the values in the main file, determine which source file they originally came from.
Identify all numerical values that are in the same column but above and below this sub-sequence in the source file.
Copy these numerical values from the column of the sub-sequence in the source file into a new column in the main file.
Save. The results should look like this.
To restate this, the problems are:
I do not know the exact length of each sub-sequence, although I know the approximate length.
I do not know the source file for each sub-sequence of the numerical values in the main file.
In principle, this could be done by hand, but I would rather not. Does anyone have an idea of how to approach this?
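One way to approach this programmatically is to treat the main column as one long sequence and, for each candidate source file, search for the longest contiguous run of values it shares with the main file. Below is a minimal sketch in Python; the file names (main.csv, source*.csv), the column layout (identifier then value, no header row), and the exact-equality comparison are all assumptions for illustration, not a definitive implementation.

import csv
import glob

MIN_RUN = 5  # set this from the approximate sub-sequence length you know

def load_values(path, col=0):
    # Read one numeric column of a headerless CSV into a list of floats.
    with open(path, newline="") as f:
        return [float(row[col]) for row in csv.reader(f) if row]

def longest_common_run(main, source):
    # Longest contiguous run present in both sequences, via the standard
    # O(n*m) dynamic programming recurrence over run lengths.
    # Returns (length, start_in_main, start_in_source).
    best = (0, 0, 0)
    prev = [0] * (len(source) + 1)
    for i in range(1, len(main) + 1):
        cur = [0] * (len(source) + 1)
        for j in range(1, len(source) + 1):
            if main[i - 1] == source[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

main = load_values("main.csv", col=1)  # skip the identifier column
for path in glob.glob("source*.csv"):
    src = load_values(path)
    length, i_main, j_src = longest_common_run(main, src)
    if length >= MIN_RUN:
        above = src[:j_src]               # values above the match in the source
        below = src[j_src + length:]      # values below the match
        print(path, "matches main rows", i_main, "through", i_main + length - 1)

From there, step 3 is just writing the above/below values into a new column aligned with the matched rows and saving. If the values were rounded when they were copied, the exact-equality test would need to be relaxed to a tolerance.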

Related

Remove Column from CSV file using SSIS

I have a CSV file that I am using as a source in SSIS. The file contains additional blank columns.
There are additional blank columns at S, U, and V; is there a way I can remove these columns through an SSIS Script Task before using the file as a source?
Perhaps I misunderstand the problem, but you have a CSV with blank columns that you'd like to import into SQL Server but you do not want to import the blank columns into your table.
So, don't create the target columns in your table. That's it. There's no problem with having "extra" data in your data pipeline, just don't land it.
The Flat File Connection Manager must* have a definition for all the columns. As it appears you do not have headers for the blank columns, and a column name is required, you will need to set up the Flat File Connection Manager as though the file has no header row, and then skip one row so the header row is not read as data (I believe it is the "data starts on line" setting; I am doing this from memory). Specifying no header row means you need to provide your column names manually. I favor naming them something obvious like IgnoreBlankColumn_S, IgnoreBlankColumn_U, IgnoreBlankColumn_V; that way future maintainers will know this is an intentional design decision, since the source columns contain no data.
*You can write a query against a text file, which would allow you to pull in only specific columns, but this is not going to be worth the effort.
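If you would rather physically drop the blank columns before SSIS reads the file, a small preprocessing step can also do it. Here is a minimal sketch in Python rather than an actual SSIS Script Task (which would be C# or VB); the file names are placeholders, and spreadsheet columns S, U, V correspond to 0-based indexes 18, 20, and 21:

import csv

BLANK_COLUMNS = {18, 20, 21}  # S, U, V in spreadsheet terms (assumed)

with open("input.csv", newline="") as src, \
     open("input_trimmed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Keep every field whose position is not one of the blank columns.
        writer.writerow([v for i, v in enumerate(row) if i not in BLANK_COLUMNS])

An Execute Process Task could run such a script ahead of the Data Flow, but as noted above, simply not landing the columns is usually less work.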

How do I refresh a CSV data set in QuickSight without replacing it, as replacing loses my calculated fields?

I am looking to refresh a data set in QuickSight; it is in SPICE. The data set comes from a CSV file that has been updated and now has more data than the original file I uploaded.
I can't seem to find a way to simply repoint to the same file with the same format. I know how to replace the file, but whenever I do this it states that it can't create some of my calculated fields, and so drops multiple rows of data!
I assume I'm missing something obvious, but I can't seem to find the right method or any help on the issue.
Thanks
Unfortunately, QuickSight doesn't support refreshing file data sets, to my knowledge. One solution, however, is to put your CSV in S3 and refresh from there.
The one gotcha with this approach is that you'll need to create a manifest file pointing to your CSV. This isn't too difficult and the QuickSight documentation is pretty helpful.
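For reference, a minimal manifest for a single CSV in S3 could look something like this (bucket and key are placeholders):

{
  "fileLocations": [
    { "URIs": ["s3://your-bucket/path/data.csv"] }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "containsHeader": "true"
  }
}

With the dataset pointed at the manifest, refreshing pulls whatever the CSV in S3 currently contains.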
You can replace the data source by going into the Analysis and clicking on the pencil icon as highlighted in Step 1. By replacing the dataset there, you will not lose any calculated fields that were already defined on the old dataset.
If you instead replace the data source by going into the Datasets page as highlighted below, you'll lose all calculated fields, modifications, etc.
I don't know when this was introduced, but you can now do exactly this through "Edit Dataset", starting either from the Dataset page or from the pencil icon -> Edit dataset inside an Analysis. It is called "update file" and will result in an updated dataset (additional or different data) without losing anything from your analysis, including calculated fields, filters, etc.
The normal caveat applies: the newly uploaded file MUST contain the same column names and data types as the original, although it can also contain additional columns if needed.

MySQL, get (just) column names and write them to a file?

I would like to get more or less what is discussed in this question.
What more? I would like to redirect the output to a file.
What less? SHOW COLUMNS FROM your_table; actually returns a lot of stuff. In addition to column names, it also shows, e.g., the type.
Ideally, in the end, I would like a plain text file listing one column name per line.
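One hedged sketch of how this could be done: query information_schema.columns (which, unlike SHOW COLUMNS, lets you select just the name column) from Python and write one name per line. The connection details, database, and table names are placeholders, and it assumes the mysql-connector-python package is installed.

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="me", password="secret", database="mydb"
)
cur = conn.cursor()
cur.execute(
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_schema = %s AND table_name = %s "
    "ORDER BY ordinal_position",
    ("mydb", "your_table"),
)
with open("columns.txt", "w") as f:
    for (name,) in cur:
        f.write(name + "\n")
conn.close()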

Insert missing rows in CSV of incrementally numbered files generated by directory listing?

I have created a CSV from a set of files in a directory that are numbered incrementally:
img1_1.jpg, img1_2.jpg ... img1_1999.jpg, img1_2000.jpg
The CSV output is like so:
filename, datetime
eg:
img1_1.JPG,2011-05-11 09:16:33.000000000
img1_3.jpg,2011-05-11 10:10:55.000000000
img1_4.jpg,2011-05-11 10:17:31.000000000
img1_6.jpg,2011-05-11 10:58:37.000000000
The problem is, there are a number of files missing in the listing, as some of the files don't exist. As a result, when imported, the actual row number does not match the file number.
Can anyone think of a reasonably efficient way to insert the missing rows so that the row number and file number match up, other than manually inserting rows for the missing ones? (There are over 800 missing rows.)
Background
A previous programmer developed an uploader script and did not save the creation time of the mysql record in the database. I figured the easiest way to find the creation time for the majority of the records would be to output a directory listing of all the files and combine them in a spreadsheet.
You need to do exactly what you wrote in your comment answering #tadman.
A text parser script can inject the missing lines with a date/time value that marks the record as an empty one, i.e. one with no real data behind it (e.g. date it to 1950-01-01 00:00:00). When that is done, bulk import the CSV. I think this is the best and most efficient solution.
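A minimal sketch of such a parser in Python, under the assumptions that the listing is named listing.csv, has no header row, and covers img1_1.jpg through img1_2000.jpg as in the question; the sentinel timestamp marks injected rows as empty:

import csv
import re

SENTINEL = "1950-01-01 00:00:00.000000000"  # marks a row with no real data

# Index the existing rows by their file number.
rows = {}
with open("listing.csv", newline="") as f:
    for name, ts in csv.reader(f):
        m = re.match(r"img1_(\d+)\.jpg", name.strip(), re.IGNORECASE)
        if m:
            rows[int(m.group(1))] = (name, ts)

# Re-emit the full 1..2000 range, injecting placeholders for missing files.
with open("listing_filled.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for n in range(1, 2001):
        writer.writerow(rows.get(n, (f"img1_{n}.jpg", SENTINEL)))

After that, listing_filled.csv bulk-imports with row number and file number in lockstep.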
Also, think about any future insert/delete/update events that might occur to your data.
Those would possibly break the chain you initially had, so you might prefer instead to introduce a numeric field for the JPEG IDs (and index that field), and leave the PK as-is (auto-increment).
In this case you can avoid CSV manipulation, as well as being chained to your auto-increment PK (meaning you will not get into trouble if a new JPEG arrives with an ID that was previously deleted, an existing ID, etc.).
So the solution really depends on how you want to use this table in the future. If you give more details, I am sure the community can come up with even more ideas.
If it's a one-time thing, it might be easiest to open up your CSV in a spreadsheet.
If your table above is in Sheet1, you could put something like the following in Sheet2 (this is OpenOffice, but there are similar functions for Excel):
pre_filename | filename | datetime
img1_1 | = A2&".JPG" | =OFFSET(Sheet1.$B$1;MATCH(B2;Sheet1.$A$2:$A$4;0);0)
You should be able to select the three cells above and drag them down to however many you need.

Reading hierarchical flat file into SSIS

I have a flat file that is structured in a hierarchical format, which looks something like this:
Area|AreaCode|AreaDescription
Region|RegionCode|RegionDescription
Zone|ZoneCode|ZoneDescription
District|DistrictCode|DistrictDescription
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
RouteFooter
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
RouteFooter
DistrictFooter
District|DistrictCode|DistrictDescription
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
Record|Name|Address|Etc
RouteFooter
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
RouteFooter
DistrictFooter
ZoneFooter
RegionFooter
AreaFooter
I have to bring this into SSIS and consume information about each Record row, along with the header information for that row and information from several other sources, and output a simpler flat file as a result.
I would like to read the flat file above into a structure that each row contains a record with the appropriate header information included.
My question is, what is the best way to do this if it is even possible?
First, how do you tell what type of line you are on if you are on, say, line 3,987,986? How do you tell what is related to what? Is there a possibility you could get this in a better format? Before spending lots of time (and don't kid yourself, this will take lots of time to set up and test properly), I would kick the file back to the provider and request it in a different format. You won't always get it, but you should at least try.
When I have done this in the past in DTS, the first characters of each line told me which structure the line referred to. I imported everything into a staging table with two columns, one for the record-type data and one for the rest. Then I parsed the rest into staging tables for each type of record, with the correct column structure for that type (plus any fields you might need for the relationships), did clean-up, and then imported to the production tables. As you also have a different number of columns per type, I would try that approach (though you may have to manually populate some columns instead of deriving them directly from the file); also give each record an identity field in the staging tables. This will help you figure out the relationships, I think.
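For this particular file, the line-type question has a straightforward answer: the first pipe-delimited field names the record type. Given that, a small parser can carry the current header context down to each Record row, which is essentially the staging approach described above done in code. A minimal sketch in Python, with placeholder file names:

import csv

HEADER_TYPES = ("Area", "Region", "Zone", "District", "Route")

context = {}  # currently open header levels: type -> [code, description]
with open("hierarchy.txt") as src, \
     open("flattened.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        parts = line.rstrip("\n").split("|")
        tag = parts[0]
        if tag in HEADER_TYPES:
            context[tag] = parts[1:]           # open this level
        elif tag == "Record":
            # Emit the record prefixed with every active header's fields.
            row = []
            for h in HEADER_TYPES:
                row.extend(context.get(h, ["", ""]))
            writer.writerow(row + parts[1:])
        elif tag.endswith("Footer"):
            context.pop(tag[: -len("Footer")], None)  # close this level

Each output row then carries the Area through Route codes and descriptions alongside the Record fields, which is the flattened structure the question asks for.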