How to handle 25k .csv files and map them to a database? - mysql

I have a situation that requires mapping over 25k unique csv files, into a MySQL database.
The problem is that each csv file will have unique column headings, and thus requires mapping the columns to correct tables/columns in the MySQL database.
So for example, we may find that column 2 in one csv file is the Country, but in another csv file it is column 6. Or we might find that Country just doesn't exist in a particular csv file. You can see why doing manual mapping of columns for 25k files is not practical.
The reason it needs mapping is because we want the ability to perform a search across all the files, based on a predefined structure.
For example, we will want to find travel companies in the UK that have more than 20 employees.
We need the ability to perform this query across all data in the files, to obtain the right results. The database structure has already been defined, and we have been parsing csv files into it without problems; only now have we realised there will be a huge number of csv files to do this mapping for.
Are we better off with a NoSQL solution? Would you recommend something like Neo4j? Is there a better solution for mapping unique csv files to structured MySQL schema?
Edit:
My plan now is for us to first parse the first row of every file and store it in a new table associated with the file through a many-to-many relationship.
From this, we will allow the user to define which table column matches each file column (since I suspect a large number of columns will be practically the same).
Then save this mapping data against the file and have an automated process then perform the insertion based on this mapping.
This should hopefully reduce the workload from having to set the mapping on each file individually, and instead focus on the mapping of associated columns across all files. It will still be a heavy task though I imagine.
Thanks
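
A minimal sketch of that header-harvesting step in Python might look like the following; the files and file_columns tables, paths, and credentials are assumptions for illustration, not part of any actual schema, and only the first row of each file is read:

    # Sketch: record the header row of every csv so the mapping from file
    # columns to database columns can be defined afterwards.
    # Table names, paths and credentials are placeholders.
    import csv
    import glob

    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="imports")
    cur = conn.cursor()

    for path in glob.glob("incoming/*.csv"):
        with open(path, newline="", encoding="utf-8-sig") as f:
            header = next(csv.reader(f), [])          # first row only
        cur.execute("INSERT INTO files (path) VALUES (%s)", (path,))
        file_id = cur.lastrowid
        cur.executemany(
            "INSERT INTO file_columns (file_id, position, heading) VALUES (%s, %s, %s)",
            [(file_id, i, h.strip()) for i, h in enumerate(header)],
        )

    conn.commit()
    conn.close()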

Related

Flat File as Input - MySQL Best Practice

I receive a flat file (CSV) each day, the contents of which gets imported into my database (rather than data entry through a web form, POS or the like). There are 40 fields in a record and I'm up to 600,000 unique records.
Up until now, I haven't seen the need to make this a relational database though there certainly is some normalization that would make it more efficient; repeating products, stores, customers, resellers, etc.
If I were starting this from the beginning and incrementally inputting the data somehow, I'd know how to do all that (every resource I've gone through covers it that way, but none cover the case where you already have a large volume of data and need to make it relational). And with the CSVs coming in each day, I'm not quite sure how to import the data once the database is set up. If I were to split those 40 fields into, say, 5 tables, would I then have to split that daily file the same way and import them one at a time? Would foreign keys update that way?
If someone could push me in the right direction I'll go do more digging on my own.
If you were faced with the same project, how would you create such a database and perform the daily updates?
Thanks!
Create your database structure independently of what you have right now (CSV structure and data). E.g. organize your tables to fit your future needs, think through and define the relations between them properly, and apply proper indexes.
As the second step - unavoidable in my opinion - write a little program in the programming language of your choice (a minimal sketch in Python follows the list below). It should be able to
mainly read records/lines from a (CSV) file,
validate/sanitize the fetched data
import/save the data in the corresponding database tables, as needed. By "as needed" I mean that, over time, a multitude of factors can appear which could unexpectedly influence your first db-structural decisions. For example, the need for some temporary tables. You should also take advantage of triggers and stored procedures.
properly handle the errors and exceptions raised along the importing process. For example, due to possible "duplicate key" issues - because data in files can be error-prone - some records may not be importable on a given day. That doesn't mean that the import should break. Read a record, try to save it. If a problem appears, handle it (copy the line into another file, or save it in a special table, for later editing/revision and re-import) and let the program continue with the next records.
properly log all (main) operations and maintain counters of the records read and of the problematic ones.
automatically copy each daily file - after import - into a backup directory, until it's no longer needed.
optionally notify you by email about the status of the operations.
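
As a rough illustration of the read/validate/insert/reject part of that list, here is a minimal Python sketch; the table, column, and file names are placeholders, and the mysql-connector-python driver stands in for whatever language you actually pick:

    # Sketch of the daily import loop: read, validate, insert, divert problem
    # rows to a reject file, and keep counters. Names are placeholders.
    import csv
    import logging

    import mysql.connector

    logging.basicConfig(filename="import.log", level=logging.INFO)

    def import_daily_file(path):
        conn = mysql.connector.connect(host="localhost", user="app",
                                       password="secret", database="shop")
        cur = conn.cursor()
        imported = problems = 0
        with open(path, newline="") as src, open(path + ".rejected", "w", newline="") as rej:
            reject = csv.writer(rej)
            for row in csv.DictReader(src):
                try:
                    # validate/sanitize before touching the database
                    row["store"] = row["store"].strip()
                    cur.execute(
                        "INSERT INTO sales (store, product, qty) VALUES (%s, %s, %s)",
                        (row["store"], row["product"], int(row["qty"])),
                    )
                    imported += 1
                except (mysql.connector.Error, ValueError, KeyError) as exc:
                    problems += 1
                    reject.writerow(list(row.values()))   # keep it for later revision
                    logging.warning("row rejected: %s (%s)", row, exc)
        conn.commit()
        conn.close()
        logging.info("%s: %d imported, %d problems", path, imported, problems)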
The third step would be to find a solution to automate the whole cycle. For example, find a task/cron-job manager to start your program daily, once or even twice a day, without you having to do this manually.
Regarding splitting the file into different files based on your database structure: it wouldn't be necessary, i.e. it would be a redundant step, since your program should manage to read the file and handle the data import accordingly.
As for the type of program: it should be a web solution, so that you can access and modify it any time you need.
Good luck.

SSIS Flat File - How to handle file versions / generations

I am working on a data warehouse project with a lot of sources producing flat files, and we are using SSIS to load these into our staging tables; we are currently using the Flat File Source component.
However, after a while we need an extra column in one of the files, and from a given date the file specification changes to add that extra column. This happens quite frequently, and over time we accumulate quite a lot of versions.
According to the answers I can find here and on the rest of the internet, the agreed method to handle this scenario seems to be to set up a new flat file source in a new, separate data flow for this version, to keep re-runnability of the ETL process for old files.
Method is outlined here for example: SSIS pkg with flat-file connection with fewer columns will fail
In our specific setup, the new columns are always additional columns (we never remove old ones) and, for logical reasons, the new columns cannot be mandatory if we want to keep re-runnability for the older files in their separate data flows.
I don't think the method of creating a duplicate data flow handling the same set of columns over and over again is a good answer for a data warehouse project like ours. I would prefer a source component that takes the latest file version and has the ability to mark columns as "not mandatory" and deliver nulls if they are missing.
Is anybody aware of an SSIS flat file component that is more flexible in handling old file versions, or of a better solution to this problem?
I assume that such a component would need to approach the files on a named column basis rather than the existing left-to-right approach?
Any thoughts or suggestions are welcome!
The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.
You can arrange your flat file connection to return whole lines rather than individual columns, by only specifying the row delimiter. Connect this to a flat file source component, which will output a single column per row. We now have one line per row that represents one of the many file types that you are aware of; the next step is to determine which file type you have.
Consume the output from the flat file source with a script component. Pass in the single column and pass out the superset of all possible columns. We have lost the metadata normally gleaned from a file source, so you will need to build up the column names / types / sizes in the script component's output columns.
Within the script component, take the line and break it into its constituent columns. You will have to perform a pattern match (maybe using System.Text.RegularExpressions.Regex.Match) to identify where a new column starts. Hopefully the file is well formed, which will aid you - beware of quotes and commas within text columns.
You can now determine the file type from the number of columns you have, and default the missing columns. Set the row's output columns to pass out the constituent parts. You may want to attach an extra column to record the file type with your output.
The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.
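
SSIS script components are written in C# or VB rather than Python, but purely as a language-neutral sketch of the split-and-default logic described above (the column superset here is invented for illustration):

    # Language-neutral sketch of the split-and-default logic the script
    # component would perform. Column names and the superset are made up.
    import csv
    import io

    SUPERSET = ["OrderId", "Amount", "Currency", "Region"]   # all columns across versions

    def parse_line(line):
        # Let the csv machinery handle quotes and embedded commas in one line.
        values = next(csv.reader(io.StringIO(line)))
        # Older file versions simply have fewer columns; pad the rest with None.
        row = dict(zip(SUPERSET, values))
        for missing in SUPERSET[len(values):]:
            row[missing] = None
        row["FileVersion"] = len(values)          # record which version we saw
        return row

    print(parse_line('42,19.95,"EUR"'))   # Region comes back as None, FileVersion is 3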
I would not recommend that you undertake the above lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns / types etc.; however, it will provide you with a single data flow to handle every file version and can be extended as new columns are added.

Appending CSV files with columns in different orders

I need to regularly merge data from multiple CSV files into a single spreadsheet by appending the rows from each source file. Only OpenOffice/LibreOffice is able to read the UTF-8 CSV file, which has quote-delimited fields containing newline characters.
Now, each CSV file has column headings, but the order of the columns varies from file to file. Some files also have missing columns, and some have extra columns.
I have my master list of column names, and the order in which I would like them all to go. What is the best way to tackle this? LibreOffice gets the CSV parsing right (Excel certainly does not). Ultimately the files will all go into a single merged spreadsheet. Every row from each source file must be kept intact, apart from the column ordering.
The steps also need to be handed over to a non-technical third party eventually, so I am looking for an approach that will not offer too many non-expert technical hurdles.
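
For reference, the append-and-reorder step itself can be scripted quite compactly; a minimal Python sketch, with a made-up master column list and file locations (whether a script is acceptable for a non-technical party is another question):

    # Sketch: append rows from several CSV files into one file, reordered to a
    # master column list; missing columns become blank, extra columns are dropped.
    # The MASTER list and file names are placeholders.
    import csv
    import glob

    MASTER = ["Order ID", "Name", "Country", "Notes"]

    with open("merged.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=MASTER)
        writer.writeheader()
        for path in glob.glob("exports/*.csv"):
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):       # handles quoted newlines
                    writer.writerow({col: row.get(col, "") for col in MASTER})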
Okay, I'm approaching this problem a different way. I have instead gone back to the source application (WooCommerce) to fix the export, so the spreadsheets list the same columns, in the same order, on every export. This does have other consequences that I need to follow up, such as managing patches and trying to get the changes accepted by the source project. But it does avoid having to append CSV files with mismatched columns, which seems to be a common issue that no one has any real solutions for (yes, I have searched, a lot).

MySQL, load data from file, into number of tables

My basic task is to import parts of data from one single file, into several different tables as fast as possible.
I currently have a file per table, and I manage to import each file into the relevant table by using the LOAD DATA syntax.
Our product has received new requirements from a client: they are no longer interested in sending us multiple files, and instead want to send a single file which contains all the original records.
I thought of several suggestions:
I may require the client to write a single row before each batch of lines in the file, describing the table into which the batch should be loaded and the number of lines in the batch.
e.g.
Table2,500
...
Table3,400
Then I could try to apply LOAD DATA to each such block of lines, discarding the table-and-count description. Is this feasible?
I may require each record to contain the table name as an additional attribute; then I would need to iterate over the records and insert them one by one, although I am sure this is much slower than LOAD DATA.
I may also pre-process the file using, for example, Java, and execute the LOAD DATA statements in a loop.
I can require almost any format changes I desire, but it has to be one single file and the import must be fast.
(I should say that what I call the table description is actually the name of a feature; I have decided that all files relevant to a given feature should be saved under a different table name - this is transparent to the client.)
What sounds like the best solution? Is there any other suggestion?
It depends on your data file. We're doing something similar and made a small Perl script to read the data file line by line. If the line has the content we need (for example, it starts with table1,), we know it should go into table 1, so we print that line.
Then you can either save that output to a file or to a named pipe and use that with LOAD DATA.
This will probably give much better performance than loading it into temporary tables and from there into new tables.
The perl script (but you can do it in any language) can be very simple.
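
For illustration, here is the same splitting step sketched in Python; the routing prefixes and file names are assumptions about whatever format you agree on with the client:

    # Sketch: route each line of the combined file to a per-table file based on
    # its prefix, then LOAD DATA each output file. Names are placeholders.
    TABLES = ("table1", "table2", "table3")

    outputs = {t: open(f"{t}.csv", "w") for t in TABLES}
    with open("combined.csv") as src:
        for line in src:
            for t in TABLES:
                if line.startswith(t + ","):
                    # drop the routing prefix, keep the rest of the record
                    outputs[t].write(line[len(t) + 1:])
                    break
    for f in outputs.values():
        f.close()
    # then, per table:  LOAD DATA LOCAL INFILE 'table1.csv' INTO TABLE table1 ...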
You may have another option, which is to define a single table and load all your data into that table, then use select-insert-delete to transfer data from this table to your target tables. Depending on the total number of columns this may or may not be possible. However, if it is possible, you don't need to write an external Java program and can rely entirely on the database for loading your data, which can also offer you a cleaner and more optimized way of doing the job. You will most probably need an additional marker column which holds the name of the target table. If so, this can be considered a variant of option 2 above.
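
As a rough illustration of that staging-table variant (the table, column, and file names are placeholders, and the marker column is assumed to be called target_table):

    # Sketch: load everything into one wide staging table whose marker column
    # names the target table, then move rows with INSERT ... SELECT and clear
    # the staging table. All names are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                                   database="mydb", allow_local_infile=True)
    cur = conn.cursor()

    cur.execute("""
        LOAD DATA LOCAL INFILE 'combined.csv'
        INTO TABLE staging
        FIELDS TERMINATED BY ','
        (target_table, col_a, col_b, col_c)
    """)
    for target in ("table1", "table2"):
        cur.execute(
            f"INSERT INTO {target} (col_a, col_b) "
            "SELECT col_a, col_b FROM staging WHERE target_table = %s", (target,))
    cur.execute("DELETE FROM staging")
    conn.commit()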

Building a MTurk-like app -- how to use a db when column names change for each task?

I'm building a very simple MTurk-ish app in Rails. The idea is that people will upload csvs containing whatever columns they want (e.g., some id, name of a user, some piece of text, a link, whatever -- these columns will change from task to task), and these csvs will contain all the information for the MTurk task.
My question is: how would I store these csvs in a database? One way is to store each csv row as a blob of unstructured data in MySQL (i.e., I basically leave each row as a string and stick this into a MySQL column). A maybe better way is to use a NoSQL database like MongoDB, where I don't need a predefined schema.
Suggestions? Which way is better, or is there another option? I am using Rails for this, so options that work well with Rails would be great.
Well you pretty much answered your own question.
Either use a NoSQL document-based database (like MongoDB), or split up the CSV and save it in a 1:n relation within your database as key-value pairs, each attached to a row and a column. Your idea of storing blobs isn't ideal, however, as it would prevent you from searching within the columns.
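
The app in question is Rails, so the real code would go through ActiveRecord, but as a language-neutral sketch of that key-value layout (hypothetical task_rows and task_cells tables), the loading step amounts to something like:

    # Sketch: one row per csv line in task_rows, one (row_id, column_name, value)
    # triple per cell in task_cells. Table names are placeholders.
    import csv

    import mysql.connector

    def store_upload(task_id, csv_path, conn):
        cur = conn.cursor()
        with open(csv_path, newline="") as f:
            for record in csv.DictReader(f):          # headers can be anything
                cur.execute("INSERT INTO task_rows (task_id) VALUES (%s)", (task_id,))
                row_id = cur.lastrowid
                cur.executemany(
                    "INSERT INTO task_cells (row_id, column_name, value) VALUES (%s, %s, %s)",
                    [(row_id, name, value) for name, value in record.items()],
                )
        conn.commit()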