My basic task is to import parts of the data from a single file into several different tables as quickly as possible.
I currently have one file per table, and I manage to import each file into the relevant table using the LOAD DATA syntax.
Our product received a new requirement from a client: he no longer wants to send us multiple files; instead he wants to send a single file that contains all of the original records.
I have thought of several options:
I may require the client to write a single row before each batch of lines in the file, describing the table into which the batch should be loaded and how many lines belong to that batch.
e.g.
Table2,500
...
Table3,400
Then I could try to apply LOAD DATA to each such block of lines, discarding the table name and line count row. Is that feasible?
I may require each record to contain the table name as an additional attribute; then I would need to iterate over the records and insert them one by one, although I am sure that would be much slower than LOAD DATA.
I may also pre-process the file using, for example, Java, and execute a LOAD DATA statement for each part in a loop.
I can require almost any format change I desire, but it has to be one single file and the import must be fast.
(I should say that what I call the table description is actually the name of a feature; I have decided that all records relevant to a given feature should be saved under a different table name. This is transparent to the client.)
What sounds like the best solution? Is there any other suggestion?
It depends on your data file. We're doing something similar and made a small Perl script to read the data file line by line. If the line has the content we need (for example, it starts with table1,), we know it belongs in table1, so we print that line.
You can then either save that output to a file or send it to a named pipe and use that with LOAD DATA.
This will probably perform much better than loading everything into temporary tables and from there into the final tables.
The Perl script (but you can do it in any language) can be very simple.
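For illustration, here is a minimal Java sketch of the same idea (the Perl approach works just as well): it assumes each line starts with the target table name followed by a comma, writes one temporary file per table, and then bulk-loads each piece. File names, table names, and connection settings are placeholders, not anything from the original question.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class SplitAndLoad {

    public static void main(String[] args) throws Exception {
        Map<String, PrintWriter> outputs = new HashMap<>();

        // 1. Split the combined file into one file per table, based on the "tableN," prefix.
        try (BufferedReader in = new BufferedReader(new FileReader("combined.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) continue;                   // skip malformed lines
                String table = line.substring(0, comma);   // e.g. "table1"
                String record = line.substring(comma + 1); // the actual data
                outputs.computeIfAbsent(table, SplitAndLoad::newWriter).println(record);
            }
        }
        outputs.values().forEach(PrintWriter::close);

        // 2. Bulk-load every per-table file. With Connector/J, LOAD DATA LOCAL INFILE
        //    typically needs allowLoadLocalInfile=true on the connection.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
             Statement st = con.createStatement()) {
            for (String table : outputs.keySet()) {
                st.execute("LOAD DATA LOCAL INFILE '" + table + ".txt' " +
                           "INTO TABLE " + table + " FIELDS TERMINATED BY ','");
            }
        }
    }

    private static PrintWriter newWriter(String table) {
        try {
            return new PrintWriter(new FileWriter(table + ".txt"));
        } catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }
}

Writing to a named pipe instead of real files would avoid the intermediate disk writes, at the cost of a slightly more involved setup.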
You may have another option, which is to define a single staging table and load all your data into it, then use select-insert-delete to transfer the data from this table to your target tables. Depending on the total number of columns this may or may not be possible. However, if it is possible, you don't need to write an external Java program and can rely entirely on the database for loading your data, which can also offer you a cleaner and more optimized way of doing the job. You will most probably need an additional marker column that holds the name of the target table. If so, this can be considered a variant of option 2 above.
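A minimal sketch of this variant, driven from Java via JDBC; the staging table, the marker column name (target), the target tables and the connection settings are all assumptions for illustration, and the same statements could just as well be run by hand or from a stored procedure:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StagingLoad {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
             Statement st = con.createStatement()) {

            // 1. Bulk-load the whole client file into the wide staging table;
            //    the first field of every record is assumed to be the marker column `target`.
            st.execute("LOAD DATA LOCAL INFILE '/data/client_file.csv' " +
                       "INTO TABLE staging FIELDS TERMINATED BY ','");

            // 2. select-insert-delete: move each group of rows to its target table.
            for (String table : new String[] {"table2", "table3"}) {
                st.executeUpdate("INSERT INTO " + table + " (col1, col2) " +
                                 "SELECT col1, col2 FROM staging WHERE target = '" + table + "'");
                st.executeUpdate("DELETE FROM staging WHERE target = '" + table + "'");
            }
        }
    }
}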
Related
I have a situation that requires mapping over 25k unique CSV files into a MySQL database.
The problem is that each CSV file has unique column headings, which requires mapping the columns to the correct tables/columns in the MySQL database.
So, for example, we may find that column 2 in one CSV file is the Country, but in another CSV file it is column 6. Or we might find that Country just doesn't exist in a particular CSV file. You can see why doing manual mapping of columns for 25k files is not practical.
The reason it needs mapping is that we want the ability to perform a search across all the files, based on a predefined structure.
For example, we will want to find travel companies in the UK that have more than 20 employees.
We need the ability to perform this query across all data in the files to obtain the right results. The database structure has already been defined and we have been parsing CSV files into it fine; we only just realised there will be a huge number of CSV files to do this mapping with.
Are we better off with a NoSQL solution? Would you recommend something like Neo4j? Is there a better solution for mapping unique CSV files to a structured MySQL schema?
Edit:
I am now planning for us to first parse the first row (the headers) of every file and store it in a new table associated with the file via a many-to-many relationship.
From this, we will allow the user to define how each file column matches a table column (since I suspect a large number of columns will be practically the same).
Then we will save this mapping data against the file and have an automated process perform the insertion based on the mapping.
This should hopefully reduce the workload from having to set the mapping on each file individually to mapping the associated columns across all files. It will still be a heavy task, I imagine.
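A minimal Java sketch of the header-collection step, assuming hypothetical tables files(id, name) and file_columns(file_id, position, header); it reads only the first row of every CSV and records each heading with its position, so the mapping UI can later work against these rows instead of the raw files:

import java.io.BufferedReader;
import java.nio.file.*;
import java.sql.*;

public class HeaderScanner {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mapping_db", "user", "pass");
             PreparedStatement insFile = con.prepareStatement(
                 "INSERT INTO files (name) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement insCol = con.prepareStatement(
                 "INSERT INTO file_columns (file_id, position, header) VALUES (?, ?, ?)")) {

            try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("/data/csv"), "*.csv")) {
                for (Path csv : dir) {
                    // register the file and fetch its generated id
                    insFile.setString(1, csv.getFileName().toString());
                    insFile.executeUpdate();
                    long fileId;
                    try (ResultSet keys = insFile.getGeneratedKeys()) {
                        keys.next();
                        fileId = keys.getLong(1);
                    }

                    // read only the header row and store each column name with its position
                    try (BufferedReader r = Files.newBufferedReader(csv)) {
                        String header = r.readLine();
                        if (header == null) continue;
                        String[] cols = header.split(",");
                        for (int i = 0; i < cols.length; i++) {
                            insCol.setLong(1, fileId);
                            insCol.setInt(2, i);
                            insCol.setString(3, cols[i].trim());
                            insCol.addBatch();
                        }
                        insCol.executeBatch();
                    }
                }
            }
        }
    }
}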
Thanks
I receive a flat file (CSV) each day, the contents of which get imported into my database (rather than data entry through a web form, POS or the like). There are 40 fields in a record and I'm up to 600,000 unique records.
Up until now I haven't seen the need to make this a relational database, though there certainly is some normalization that would make it more efficient: repeating products, stores, customers, resellers, etc.
If I were starting this from the beginning and incrementally inputting the data somehow, I'd know how to do all that (every resource I've gone through covers it that way, but none cover it when you already have a large volume of data and need to make it relational). And with the CSVs coming in each day, I'm not quite sure how to import the data once the database is set up. If I were to split those 40 fields into, say, 5 tables, would I then have to split that daily file the same way and import them one at a time? Would foreign keys update that way?
If someone could push me in the right direction I'll go do more digging on my own.
If you were faced with the same project, how would you create such a database and perform the daily updates?
Thanks!
Create your database structure independently of what you have right now (the CSV structure and data). That is, organize your tables to fit your future needs, think through and define the relations between them well, and apply proper indexes.
As the second step (unavoidable, in my opinion), write a little program in the programming language of your choice; a rough sketch follows the list below. It should be able to:
mainly read records/lines from a (CSV) file,
validate/sanitize the fetched data
import/save the data into the corresponding database tables, as needed. By "as needed" I mean that, over time, a multitude of factors can appear which unexpectedly influence your initial structural decisions, for example the need for some temporary tables. You should also take advantage of triggers and stored procedures.
properly handle the errors and exceptions raised during the import process. For example, because data in files can be error-prone, some records may fail to import on a given day due to "duplicate key" issues. That doesn't mean the import should break: read a record and try to save it; if a problem appears, handle it (copy the line to another file, or save it in a special table for later editing/revision and re-import) and let the program continue with the next records.
properly log all (main) operations and maintain counters of the records read and of the problematic records.
automatically copy each daily file, after import, into a backup directory, until it is no longer needed.
optionally notify you by email about the status of the operations.
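Here is a rough Java sketch of such a program, with a hypothetical sales target table and an import_errors table for rejected lines (both names are invented for illustration); it reads the daily file record by record, tries to save each one, logs problems without stopping, and keeps counters:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.*;

public class DailyImport {
    public static void main(String[] args) throws Exception {
        int read = 0, failed = 0;
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement insert = con.prepareStatement(
                 "INSERT INTO sales (order_no, product, store, amount) VALUES (?, ?, ?, ?)");
             PreparedStatement reject = con.prepareStatement(
                 "INSERT INTO import_errors (line, reason) VALUES (?, ?)");
             BufferedReader in = Files.newBufferedReader(Paths.get("/data/daily.csv"))) {

            String line;
            while ((line = in.readLine()) != null) {
                read++;
                String[] f = line.split(",", -1);
                try {
                    if (f.length < 4) throw new IllegalArgumentException("too few fields");
                    insert.setString(1, f[0].trim());    // basic validation/sanitising
                    insert.setString(2, f[1].trim());
                    insert.setString(3, f[2].trim());
                    insert.setBigDecimal(4, new java.math.BigDecimal(f[3].trim()));
                    insert.executeUpdate();
                } catch (Exception e) {                  // e.g. duplicate key, bad number
                    failed++;
                    reject.setString(1, line);           // keep the raw line for later revision
                    reject.setString(2, e.getMessage());
                    reject.executeUpdate();
                }
            }
        }
        System.out.println("Records read: " + read + ", problematic: " + failed);
    }
}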
The third step would be to find a way to automate the whole cycle, for example a task scheduler or cron job that starts your program once or even twice a day, without you having to do it manually.
Regarding splitting the file into different files based on your database structure: that wouldn't be necessary; it would be a redundant step, since your program should be able to read the single file and handle the data import accordingly.
As for the type of program: it should be a web solution, so that you can access and modify it any time you need.
Good luck.
I'm migrating content out of an old proprietary database into a new, more structured solution. The new solution asks for CSV files. For the approval process (to be checked by human eyeballs) I need to have the column names as the first line in this CSV file.
SELECT b.Title as Title,
b.listinguuid as UID,
.
.
.
FROM b as biblioRecord
-- more join magic
INTO OUTFILE '/tmp/biblio-import.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
Given the above snippet from an otherwise larger statement, can I direct MySQL to include the column headers as the first line?
Richard
Having looked at the MySQL docs for data output, what you are asking doesn't look like it is possible.
You have some options for data validation.
Assuming you have some form of scripting knowledge, you may be able to create an internal stored procedure that will output the whole table (including column headings). If memory serves, the scripting language is based on Java (not JavaScript).
However, why not ask whether the validation can be done via a web interface? There are a large number of tools (phpMyAdmin comes to mind) that can be used to view the tables (with header info). phpMyAdmin may even be able to output the tables in CSV format for you :)
A better solution, depending on how much data needs to be validated and what the constraints are, may be to create a dedicated set of validation scripts. This is something you may need anyway as part of the larger project; it could be run after a system upgrade, for example. You should talk to the client. In fact, a script would be a better way to confirm everything has transferred correctly, as it could compare the old and new databases directly and report any anomalous results.
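For example, assuming (purely for illustration) that both the old and the new databases happen to be reachable over JDBC, a very small validation script could compare row counts per table and flag mismatches; the table names below are placeholders, and a real check would also compare sample rows or checksums:

import java.sql.*;
import java.util.Arrays;
import java.util.List;

public class MigrationCheck {
    public static void main(String[] args) throws Exception {
        List<String> tables = Arrays.asList("biblio", "authors", "subjects"); // assumed names
        try (Connection oldDb = DriverManager.getConnection("jdbc:mysql://oldhost/olddb", "user", "pass");
             Connection newDb = DriverManager.getConnection("jdbc:mysql://newhost/newdb", "user", "pass")) {
            for (String t : tables) {
                long before = count(oldDb, t);
                long after  = count(newDb, t);
                System.out.printf("%-12s old=%d new=%d %s%n",
                        t, before, after, before == after ? "OK" : "MISMATCH");
            }
        }
    }

    // count the rows of one table on one side of the migration
    private static long count(Connection con, String table) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}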
Other possibilities:
Do you have an XML schema for your new database structure? If you do, you could dump your data into an XML database, then view it in something like Excel, or use an XSLT to present it in a web page.
I'm sure there are other possibilities, but they are all going to involve some work to get to your desired end result. They will all be more time consuming, but will have other potentially useful knock-on effects that need to be elucidated and presented to the client.
Personally, if you have a lot of data, go for some form of validation script; human eyes get tired looking at lots of rows of data, and tired eyes confuse brains and cause mistakes.
I am working on a data warehouse project with a lot of sources producing flat files, and we are using SSIS to load these into our staging tables; we are currently using the Flat File Source component.
However, after a while we need an extra column in one of the files, and from a certain date the file specification changes to add that extra column. This exercise happens quite frequently, and over time we accumulate quite a lot of versions.
According to answers I can find here and on the rest of the internet, the agreed method for handling this scenario seems to be to set up a new flat file source in a new, separate data flow for this version, to keep re-runability of the ETL process for old files.
The method is outlined here, for example: SSIS pkg with flat-file connection with fewer columns will fail
In our specific setup, the additional columns are always additions (we never remove old columns), and for logical reasons the new columns cannot be mandatory if we want to keep re-runability for the older files in their separate data flows.
I don't think the method of creating a duplicate data flow handling the same set of columns over and over again is a good answer for a data warehouse project like ours, and I would prefer a source component that takes the latest file version and has the ability to mark columns as "not mandatory" and deliver NULLs if they are missing.
Is anybody aware of an SSIS flat file component that is more flexible in handling old file versions, or does anyone have a better solution for this problem?
I assume that such a component would need to approach the files on a named-column basis rather than the existing left-to-right approach?
Any thoughts or suggestions are welcome!
The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.
You can arrange your flat file connection to return whole lines rather than individual columns, by specifying only the row delimiter. Connect this to a Flat File Source component, which will output a single column per row. We now have a single row that represents one of the many file types you are aware of; the next step is to determine which file type you have.
Consume the output from the flat file source with a Script Component. Pass in the single column and pass out the superset of all possible columns. We have lost the metadata normally gleaned from a file source, so you will need to build up the column names / types / sizes in the script component's output columns.
Within the script component, take the line and break it into its constituent columns. You will have to perform a pattern match (maybe using Regex.Match from System.Text.RegularExpressions) to identify where a new column starts. Hopefully the file is well formed, which will aid you; beware of quotes and commas within text columns.
You can now identify the file type by determining the number of columns you have, and default the missing columns. Set the row's output columns to pass out the constituent parts. You may want to attach a new column to record the file type with your output.
The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.
I would not recommend that you perform the above lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns, types, etc.; however, it will provide you with a single data flow to handle each file version, and it can be extended as new columns are added.
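The script component itself would be written in C# or VB inside SSIS; the following is only a language-agnostic sketch of the core parsing idea, written here in Java for brevity: split the raw line, use the column count to distinguish file versions, and pad the columns missing from older versions with nulls. The column names are invented for illustration.

import java.util.Arrays;

public class LineParser {

    // Superset of all columns across every known file version (names invented here).
    static final String[] ALL_COLUMNS = {"id", "name", "amount", "region", "channel"};

    // Splits one raw line and pads it to the full superset of columns.
    public static String[] parse(String rawLine) {
        // naive split; a real implementation must honour quotes and embedded commas
        String[] parts = rawLine.split(",", -1);

        int fileVersionColumnCount = parts.length;   // could be emitted as an extra "file type" column
        System.out.println("detected a " + fileVersionColumnCount + "-column file version");

        // columns missing from older file versions come out as null (NULL downstream)
        return Arrays.copyOf(parts, ALL_COLUMNS.length);
    }

    public static void main(String[] args) {
        // an old-version line with 3 columns and a newer one with 5
        System.out.println(Arrays.toString(parse("42,Widget,9.99")));
        System.out.println(Arrays.toString(parse("43,Gadget,4.50,EMEA,web")));
    }
}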
I have an application that modifies a table dynamically (think spreadsheet); upon saving the form (which the table is part of), I store that changed table (with the user's modifications) in a database column named html_Spreadhseet, along with the rest of the form data. Right now I'm just storing the HTML in a plain text format with basic escaping of characters...
I'm aware that this could be stored as a separate file; the source table (html_workseeet) already is. But from a data handling perspective it's easier to save the changed HTML table to and from a column, so as to avoid having to come up with a file management strategy (which folder will this live in, the folder must now be included in backups, security now needs to apply to files, how to sync DB security with the file system, etc.). So, to minimize these issues, I'm only storing the ... part in the database column.
My question is: should I gzip the HTML, maybe use JSON, or some other format to easily store and retrieve the HTML from the database column? What is the best practice for storing HTML content in a database? Or should I just store it as I currently am, as an escaped text column?
If what you are trying to do is save the HTML for redisplay, what's wrong with saving it as is, then just retrieving it via a stored proc, and re-displaying it for them when needed?
Say you have an HTML page, which can select some kind of ID from a list, either on a ThickBox page, or from a select option.
Normally for this kind of situation, you would probably query the DB via an Ajax call, possibly returning JSON.
The result sent back to the Ajax call will be your data.
You then replace the div which holds your spreadsheet with the spreadsheet from the DB.
So, in answer to your original question, you could store the spreadsheet under some sort of ID, storing it as the HTML of the div.
When it is retrieved, you merely replace the div's HTML with what you have stored.
It depends on the size of the HTML. You could store it as a binary BLOB after zipping it. Most of the time it is just best to store the data directly after escaping the SQL characters that may cause problems.
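If you do go the zipped-BLOB route, a minimal Java sketch might look like the following; the spreadsheets table, its html_gz BLOB column, and the connection details are assumptions for illustration (reading it back is the reverse, via GZIPInputStream):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.zip.GZIPOutputStream;

public class SaveSpreadsheet {
    public static void main(String[] args) throws Exception {
        String html = "<table><tr><td>cell</td></tr></table>";   // the edited table fragment

        // compress the HTML to a byte array
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }

        // store the compressed bytes in a BLOB column; the prepared statement handles escaping
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO spreadsheets (form_id, html_gz) VALUES (?, ?)")) {
            ps.setLong(1, 123L);
            ps.setBytes(2, buf.toByteArray());
            ps.executeUpdate();
        }
    }
}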
As has been asked many times: why are you storing the view instead of the model?
You should probably bite the bullet and parse the table (using an HTML parser, perhaps), as otherwise you risk the user storing damaging JavaScript in the table data. (This means the cell contents should be parsed as well.) The data can still be stored as a blob (maybe compressed CSV or JSON), but you need to make certain it is not damaging.
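As a sketch of that sanitising step, assuming the jsoup HTML parser is available on the classpath (recent versions expose a Safelist; older ones call it Whitelist), the submitted table could be cleaned before it is stored:

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeTable {

    // strip scripts and event-handler attributes, keeping only harmless markup
    public static String clean(String submittedHtml) {
        Safelist safelist = Safelist.relaxed();   // relaxed() already allows table, tr, td, etc.
        return Jsoup.clean(submittedHtml, safelist);
    }

    public static void main(String[] args) {
        String dirty = "<table><tr><td onclick=\"steal()\">1<script>alert('x')</script></td></tr></table>";
        // prints the table with the script element and the onclick attribute removed
        System.out.println(clean(dirty));
    }
}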