I receive a flat file (CSV) each day, the contents of which are imported into my database (rather than data entry through a web form, POS or the like). There are 40 fields in a record and I'm up to 600,000 unique records.
Up until now, I haven't seen the need to make this a relational database, though there is certainly some normalization that would make it more efficient: repeating products, stores, customers, resellers, etc.
If I were starting this from the beginning and entering the data incrementally, I'd know how to do all that (every resource I've gone through covers it that way, but none covers the case where you already have a large volume of data and need to make it relational). And with the CSVs coming in each day, I'm not quite sure how to import the data once the database is set up. If I were to split those 40 fields into, say, 5 tables, would I then have to split that daily file the same way and import the pieces one at a time? Would foreign keys update that way?
If someone could push me in the right direction I'll go do more digging on my own.
If you were faced with the same project, how would you create such a database and perform the daily updates?
Thanks!
Create your database structure independently of what you have right now (CSV structure and data). That is, organize your tables to fit your future needs, think carefully about and define the relations between them, and apply proper indexes.
As the second step, which is unavoidable in my opinion, write a small program in the programming language of your choice. It should be able to:
mainly read records/lines from a (CSV) file,
validate/sanitize the fetched data,
import/save the data into the corresponding database tables, as needed. By "as needed" I mean that, over time, a multitude of factors can appear that unexpectedly affect your initial structural decisions, for example the need for some temporal tables. You should also take advantage of triggers and stored procedures.
properly handle the errors and exceptions raised during the import. For example, because data in files is error-prone, some records may fail to import on a given day due to "duplicate key" issues. That doesn't mean the import should break. Read a record and try to save it. If a problem appears, handle it (copy the line to another file, or save it in a special table, for later editing/revision and re-import) and let the program continue with the next records.
properly log all (main) operations and maintain counters of the records read and of the problematic records.
automatically copy each daily file, after import, to a backup directory, and keep it until it's no longer needed.
optionally notify you by email about the status of the operations.
The third step would be to automate the whole cycle. For example, find a task/cron-job manager to start your program daily, once or even twice a day, without you having to do this manually.
Regarding splitting the file into different files based on your database structure: it isn't necessary. It would be a redundant step, since your program should read the file and route each record's data to the right tables itself.
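To make the idea concrete, here is a minimal sketch of such an importer in Python. It is only an illustration under assumptions: it uses SQLite (via the standard sqlite3 module) as a stand-in for whatever database you run, and the table names (stores, products, sales) and CSV columns (order_ref, store, product, quantity) are hypothetical, not taken from your file. Each CSV line is read once; repeating values are looked up or created in their own tables, and the resulting ids are used as foreign keys. A bad record is set aside and counted instead of stopping the import.

```python
import csv
import io
import sqlite3

# Hypothetical schema: two lookup tables plus one fact table with foreign keys.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stores   (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS sales (
    id         INTEGER PRIMARY KEY,
    order_ref  TEXT UNIQUE NOT NULL,
    store_id   INTEGER NOT NULL REFERENCES stores(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity   INTEGER NOT NULL
);
"""


def lookup_id(conn, table, name):
    """Return the id for `name` in a lookup table, inserting the name if it is new."""
    # `table` is supplied by this program, never by the file, so formatting it in is safe.
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    return conn.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()[0]


def import_rows(conn, csv_file):
    """Import one daily file. Returns (rows_read, rejected), rejected being (row, reason) pairs."""
    rows_read, rejected = 0, []
    for row in csv.DictReader(csv_file):
        rows_read += 1
        try:
            store = (row.get("store") or "").strip()
            product = (row.get("product") or "").strip()
            if not store or not product:
                raise ValueError("missing store or product")
            quantity = int(row.get("quantity") or "")
            with conn:  # one transaction per record, so a bad row rolls back on its own
                conn.execute(
                    "INSERT INTO sales (order_ref, store_id, product_id, quantity) "
                    "VALUES (?, ?, ?, ?)",
                    (
                        row.get("order_ref"),
                        lookup_id(conn, "stores", store),
                        lookup_id(conn, "products", product),
                        quantity,
                    ),
                )
        except (ValueError, sqlite3.IntegrityError) as exc:
            rejected.append((row, str(exc)))  # keep for later review and re-import
    return rows_read, rejected


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
daily = io.StringIO(
    "order_ref,store,product,quantity\n"
    "A1,North,Widget,3\n"
    "A2,North,Gadget,1\n"
    "A1,South,Widget,2\n"     # duplicate key
    "A3,South,Widget,lots\n"  # quantity is not a number
)
rows_read, rejected = import_rows(conn, daily)
print(f"read={rows_read} imported={rows_read - len(rejected)} rejected={len(rejected)}")
```

In a real job you would open the daily file instead of the in-memory sample, write the rejected rows to a file or table, and probably commit in batches rather than per record for speed.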
As for the type of program: it should be a web solution, so that you can access and modify it any time you need.
Good luck.
Related
I have received several CSV files that I need to merge into a single file, all with a common key I can use to join them. Unfortunately, each of these files is about 5 GB in size (several million rows, about 20-100+ columns), so it's not feasible to just load them up in memory and execute a join against each one, but I do know that I don't have to worry about existing column conflicts between them.
I tried making an index of the row for each file that corresponded to each ID so I could just compute the result without using much memory, but of course that's slow as time itself when actually trying to look up each row, pull the rest of the CSV data from the row, concatenate it to the in-progress data and then write out to a file. This simply isn't feasible, even on an SSD, to process against the millions of rows in each file.
I also tried simply loading up some of the smaller sets in memory and running a Parallel.ForEach against them to match up the necessary data to dump back out to a temporary merged file. While this was faster than the last method, I simply don't have the memory to do this with the larger files.
I'd ideally like to just do a full left join of the largest of the files, then full left join to each subsequently smaller file so it all merges.
How might I otherwise go about approaching this problem? I've got 24 GB of memory and six cores on this system to work with.
While this might just be a problem to load into a relational database and do the join there, I thought I'd reach out before going that route to see if there are any ideas out there on solving this from my local system.
Thanks!
A relational database is the first thing that comes to mind and probably the easiest, but barring that...
Build a hash table mapping key to file offset. Parse the rows on-demand as you're joining. If your keyspace is still too large to fit in available address space, you can put that in a file too. This is exactly what a database index would do (though maybe with a b-tree).
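A rough sketch of the offset-index idea in Python follows. It rests on assumptions that are not in the question: the join key is the first column, keys are unique in the right-hand file, and no field contains an embedded newline (otherwise one record would span several physical lines and the offsets would be wrong). The function names are made up for illustration. Files are handled in binary mode so that offsets are plain byte positions.

```python
import csv
import io


def parse(line):
    """Parse one raw CSV line (bytes) into a list of fields."""
    return next(csv.reader([line.decode("utf-8")]))


def build_offset_index(f):
    """Map each key (first column) to the byte offset of its row."""
    index = {}
    f.seek(0)
    f.readline()  # skip the header row
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        if line.strip():
            index[parse(line)[0]] = offset
    return index


def left_join(left, right, out):
    """Write every left row, extended with the matching right row's columns if any."""
    index = build_offset_index(right)  # only keys and offsets are held in memory
    right.seek(0)
    right_header = parse(right.readline())
    blank = [""] * (len(right_header) - 1)
    writer = csv.writer(out)
    left.seek(0)
    writer.writerow(parse(left.readline()) + right_header[1:])
    for line in left:
        if not line.strip():
            continue
        row = parse(line)
        offset = index.get(row[0])
        if offset is None:
            writer.writerow(row + blank)
        else:
            right.seek(offset)  # parse the right-hand row on demand
            writer.writerow(row + parse(right.readline())[1:])


left = io.BytesIO(b"id,name\n1,Ann\n2,Bob\n3,Cy\n")
right = io.BytesIO(b"id,city\n3,Oslo\n1,Rome\n")
out = io.StringIO()
left_join(left, right, out)
print(out.getvalue())
```

With real files you would pass handles opened with open(path, "rb"). The random seeks are what made the asker's own index attempt slow, so this tends to work best when the indexed file fits in the operating system's page cache.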
You could also pre-sort the files based on their keys and do a merge join.
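For the sort-then-merge route, a minimal Python sketch could look like the one below. It assumes both files have already been sorted on the key (for example with an external sort tool), that the key is the first column, that keys are unique within each file, and that both files were sorted using the same plain string ordering the code compares with. The function name merge_join is just for illustration. Memory use stays constant because only one row per file is held at a time.

```python
import csv
import io


def merge_join(left_file, right_file, out_file):
    """Left-join two CSV streams that are both sorted by their first column."""
    left, right = csv.reader(left_file), csv.reader(right_file)
    writer = csv.writer(out_file)
    left_header, right_header = next(left), next(right)
    writer.writerow(left_header + right_header[1:])
    blank = [""] * (len(right_header) - 1)

    right_row = next(right, None)
    for left_row in left:
        # Advance the right side until its key catches up with the left key.
        while right_row is not None and right_row[0] < left_row[0]:
            right_row = next(right, None)
        if right_row is not None and right_row[0] == left_row[0]:
            writer.writerow(left_row + right_row[1:])
        else:
            writer.writerow(left_row + blank)  # no match: keep the left row anyway


left = io.StringIO("id,name\n001,Ann\n002,Bob\n004,Dee\n")
right = io.StringIO("id,city\n001,Rome\n003,Oslo\n004,Lima\n")
merged = io.StringIO()
merge_join(left, right, merged)
print(merged.getvalue())
```

Joining more than two files can be done by feeding the output of one merge into the next, since a left join keeps the output in the left file's order.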
The good news is that "several" 5GB files is not a tremendous amount of data. I know it's relative, but the way you describe your system...I still think it's not a big deal. If you weren't needing to join, you could use Perl or a bunch of other command-liney tools.
Are the column names known in each file? Do you care about the column names?
My first thoughts:
Spin up an Amazon Web Services (AWS) Elastic MapReduce (EMR) instance (even a pretty small one will work)
Upload these files
Import the files into Hive (as managed or not).
Perform your joins in Hive.
You can spin up an instance in a matter of minutes and be done with the work within an hour or so, depending on your comfort level with the material.
I don't work for Amazon, and can't even use their stuff during my day job, but I use it quite a bit for grad school. It works like a champ when you need your own big data cluster. Again, this isn't "Big Data (R)", but Hive will kill this for you in no time.
This article doesn't do exactly what you need (it copies data from S3); however, it will help you understand table creation, etc.
http://aws.amazon.com/articles/5249664154115844
Edit:
Here's a link to the EMR overview:
https://aws.amazon.com/elasticmapreduce/
I'm not sure whether you are manipulating the data. But if you're just combining the CSVs, you could try this...
http://www.solveyourtech.com/merge-csv-files/
My basic task is to import parts of the data from a single file into several different tables as fast as possible.
I currently have one file per table, and I import each file into the relevant table using the LOAD DATA syntax.
Our product received a new requirement from a client: he no longer wants to send us multiple files, and instead wants to send a single file that contains all the original records.
I thought of several options:
I may require the client to write a single row before each batch of lines in the file, giving the table into which the batch should be loaded and the number of following lines to import.
e.g.
Table2,500
...
Table3,400
Then I could try to apply LOAD DATA to each such block of lines, discarding the table name and line count description. Is this feasible?
I may require each record to contain the table name as an additional attribute; then I would need to iterate over the records and insert each one, although I am sure that is much slower than LOAD DATA.
I may also pre-process the file using, for example, Java, and execute the LOAD DATA statements in a for loop.
I may require almost any format change I want, but it has to be one single file and the import must be fast.
(By "table description" I actually mean the name of a feature; I have decided that all files relevant to a feature should be saved under a different table name. This is transparent to the client.)
Which sounds like the best solution? Is there any other suggestion?
It depends on your data file. We're doing something similar and wrote a small Perl script that reads the data file line by line. If a line has the content we need (for example, it starts with table1,), we know it belongs in table 1, so we print that line.
Then you can either save that output to a file or to a named pipe and use that with LOAD DATA.
This will probably perform much better than loading everything into temporary tables and from there into the new tables.
The Perl script (but you can do it in any language) can be very simple.
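As an illustration of how small such a script can be, here is a sketch in Python rather than Perl. It assumes each line starts with the target table name followed by a comma, as in the table1, example; the function name split_by_table and the sample table names are hypothetical. Each line is written, minus its prefix, to the output registered for its table. In practice each output would be a file or named pipe that a LOAD DATA statement reads.

```python
import io


def split_by_table(lines, sinks):
    """Route each 'table,record' line to the sink for its table.

    Returns the lines that matched no known table so they can be reviewed.
    """
    unmatched = []
    for line in lines:
        table, sep, record = line.partition(",")
        sink = sinks.get(table)
        if sep and sink is not None:
            sink.write(record)  # the table prefix is dropped, the rest is kept as-is
        else:
            unmatched.append(line)
    return unmatched


# In-memory stand-ins for per-table files or named pipes.
sinks = {"table1": io.StringIO(), "table2": io.StringIO()}
source = io.StringIO("table1,1,foo\ntable2,9,x\ntable1,2,bar\nnotes,ignored\n")
unmatched = split_by_table(source, sinks)
print(sinks["table1"].getvalue(), end="")
```

This makes a single pass over the input and never holds more than one line in memory, so it scales to large files.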
You may have another option, which is to define a single table, load all your data into it, and then use select-insert-delete to transfer data from this table to your target tables. Depending on the total number of columns this may or may not be possible. However, if it is possible, you don't need to write an external Java program and can rely entirely on the database for loading your data, which can also offer a cleaner and more optimized way of doing the job. You will most probably need an additional marker column holding the name of the target table. If so, this can be considered a variant of option 2 above.
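A small sketch of the staging-table approach is shown below. It uses SQLite through Python's sqlite3 module purely so that the example is self-contained; with MySQL you would fill the staging table with LOAD DATA and run the same kind of INSERT ... SELECT statements. The table and column names (staging, target, col1, col2, customers, products) are invented for the example.

```python
import sqlite3


def distribute(conn):
    """Move staged rows into their target tables, then empty the staging table."""
    with conn:  # single transaction: either every row moves or none does
        conn.execute(
            "INSERT INTO customers (name, city) "
            "SELECT col1, col2 FROM staging WHERE target = 'customers'"
        )
        conn.execute(
            "INSERT INTO products (sku, label) "
            "SELECT col1, col2 FROM staging WHERE target = 'products'"
        )
        conn.execute("DELETE FROM staging")


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging   (target TEXT, col1 TEXT, col2 TEXT);  -- 'target' is the marker column
CREATE TABLE customers (name TEXT, city TEXT);
CREATE TABLE products  (sku TEXT, label TEXT);
""")
# Stand-in for the bulk load of the single client file.
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [
        ("customers", "Ann", "Rome"),
        ("products", "W-1", "Widget"),
        ("customers", "Bob", "Oslo"),
    ],
)
distribute(conn)
print(conn.execute("SELECT name, city FROM customers ORDER BY name").fetchall())
```

The generic col1, col2 columns are the weak point: the staging table needs as many columns as the widest target table, which is why the total column count decides whether this is practical.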
I would love to hear some opinions or thoughts on a MySQL database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much; it is basically just configuration data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be databased for each system, and be accessible at any given time from a php page. Essentially for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering whether there are any design philosophies out there that cater for frequently changing and seldom changing data based on the engine. This would make sense, as I want to keep INSERT/UPDATE time low and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up data. I.e., if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to make a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice, I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID   VARCHAR    System identifier
POSTTIME    DATETIME   The time the information was posted
NAME        VARCHAR    The name of the parameter
VALUE       VARCHAR    The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ... ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
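Here is a rough sketch of the no-history variant, with (SYSTEM_ID, NAME) as the primary key. To keep it runnable anywhere it uses SQLite from Python, where INSERT OR REPLACE stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE; the two are not identical (REPLACE deletes the old row and inserts a new one), but for a plain name/value table the visible result is the same. The helper names post_config and read_config are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE config (
    system_id TEXT NOT NULL,
    name      TEXT NOT NULL,
    value     TEXT,
    PRIMARY KEY (system_id, name)   -- one row per parameter per system
)""")


def post_config(conn, system_id, params):
    """Store or overwrite the given parameters for one system."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO config (system_id, name, value) VALUES (?, ?, ?)",
            [(system_id, name, value) for name, value in params.items()],
        )


def read_config(conn, system_id):
    """Return all parameters of one system as a dict."""
    rows = conn.execute(
        "SELECT name, value FROM config WHERE system_id = ?", (system_id,)
    )
    return dict(rows)


post_config(conn, "sys-1", {"fw": "1.0", "mode": "auto"})
post_config(conn, "sys-1", {"fw": "1.1"})  # a later post overwrites only 'fw'
current = read_config(conn, "sys-1")
print(current)
```

Adding a new parameter needs no schema change: it is just another row.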
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY storage engine for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
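Such a copy job can be a single INSERT ... SELECT. The sketch below shows the shape of it, again with SQLite from Python as a stand-in (SQLite has no MEMORY engine, so both tables here are ordinary tables); the table names current_state and state_history and the function name snapshot are assumptions for illustration.

```python
import sqlite3


def snapshot(conn, taken_at):
    """Append a timestamped copy of the current-state table to the history table."""
    with conn:
        conn.execute(
            "INSERT INTO state_history (system_id, name, value, taken_at) "
            "SELECT system_id, name, value, ? FROM current_state",
            (taken_at,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE current_state (system_id TEXT, name TEXT, value TEXT,
                            PRIMARY KEY (system_id, name));
CREATE TABLE state_history (system_id TEXT, name TEXT, value TEXT, taken_at TEXT);
""")
conn.execute("INSERT INTO current_state VALUES ('sys-1', 'temp', '41')")
snapshot(conn, "2024-01-01T00:00:00")
conn.execute("UPDATE current_state SET value = '43' WHERE system_id = 'sys-1' AND name = 'temp'")
snapshot(conn, "2024-01-01T00:05:00")
history = conn.execute(
    "SELECT value, taken_at FROM state_history ORDER BY taken_at"
).fetchall()
print(history)
```

A scheduler such as cron would call the snapshot at whatever interval matches how much history you are willing to lose in a crash.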
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )
I have a user activity tracking log table that logs all user activity as it occurs. This is an extremely write-heavy table due to the in-depth, click-by-click tracking. Up to here the database design is perfect. The problem is the next step.
I need to output the data for the business folks, and these people can also query past activity data. Hence there is also a medium to high read load. I do not like the idea of reading and writing from the same high-traffic table.
So ideally I want to split the tables: the first one for quick writes (few or no FKs), then copy that data over, fully formatted and with all the labels for the IDs pulled in, into a read table for reading.
So questions:
1) Is this the best approach for me?
2) If I do keep 2 tables, how do I keep them in sync? I can't copy the data to the read table the instant it is written to the write table, as that would defeat the whole purpose of having separate tables. Nor can I let the read table get stale, because the tracked activity data links with other user data like session_id, etc., so if these IDs are not ready when their use case calls for them, the writes will fail.
I am using MySQL for user data and HBase for some large tables, with PHP CodeIgniter for my app.
Thanks.
Yes, having 2 separate tables is the best approach. I had the same problem to solve a few months ago, though for a daemon-type application and not a website.
Eventually I ended up with one MEMORY table keeping "live" data, which is inserted/updated/deleted on almost every event, and another table, my history table, that had duplicates of the live data rows but without the unnecessary system columns; it was used only for reading, per request.
The live table is only relevant to the running process, so I don't care if the contained data is lost due to a server failure - whatever data needs to be read later is already stored in the history table. So ... there's no problem in duplicating the data in the two tables - your goal is performance, not normalization.
So, at my workplace, they have a huge Access file (used with MS Access 2003 and 2007). The file size is about 1.2GB, so it takes a while to open the file. We cannot delete any of the records, and we have about 100+ tables (each month we create 4 more tables, don't ask!). How do I improve this, i.e. reduce the file size?
You can do two things:
use linked tables
"compact" the database(s) every once in a while
The linked tables will not in and of themselves limit the overall size of the database, but they will "package" it in smaller, more manageable files. To look into this:
'File' menu + 'Get External data' + 'Linked tables'
Linked tables also have many advantages, such as allowing one to keep multiple versions of a data subset and to select a particular set by way of the linked table manager.
Compacting databases reclaims space otherwise lost as various CRUD operations (Insert, Delete, Update...) fragment the storage. It also regroups tables and indexes, making searches more efficient. This is done with
'Tools' menu + 'Database Utilities' + 'Compact and Repair Database...'
You're really pushing up against the limits of MS Access there — are you aware that the file can't grow any larger than 2GB?
I presume you've already examined the data for possible space saving through additional normalization? You can "archive" some of the tables for previous months into separate MDB files and then link them (permanently or as needed) to your "current" database (in which case you'd actually be benefiting from what was probably an otherwise bad decision to start new tables for each month).
But, with that amount of data, it's probably time to start planning for moving to a more capacious platform.
You should really think about your db architecture. If there aren't any links between the tables, you could try moving some of them to another database (one db per year :) as a short-term solution.
A couple of “Grasping at straws” ideas
Look at the data types for each column; you might be able to store some numbers as bytes, saving a small amount per record
Look at the indexes and get rid of the ones you don’t use. On big tables unnecessary indexes can add a large amount of overhead.
I would +2^64 the suggestions about the database design being a bit odd, but it’s nothing that hasn’t already been said, so I won’t labour the point
Well... listen to @Larry, and keep in mind that, in the long term, you'll have to find another database to hold your data!
But in the short term, I am quite disturbed by this "4 new tables per month" thing. 4 tables per month is 48 per year... That surely sounds strange to every "database manager" here. So please tell us: how many rows, how are they built, what are they for, and why do you have to build tables every month?
Depending on what you are doing with your data, you could also think about archiving some tables as XML files (or even XLS?). This could make sense for "historic" data that does not have to be accessed through relations, views, etc. One good example would be the phone call list collected from a PABX. Data can be saved as/loaded from XML/XLS files through ADODB recordsets or the transferDatabase method.
Adding more tables every month: that is already a questionable practice, and looks suspicious with regard to data normalisation.
If you do that, I suspect that your database structure is also sub-optimal regarding field sizes, data types and indexes. I would really start by double-checking those.
If you really have a justification for monthly tables (which, again, I cannot imagine), why not have one back-end per month?
You could also have one main back-end with, let's say, 3 months of data online, and then an archive db to which you transfer your older records.
I use that for transactions, with the main table having about 650,000 records, and Access is very responsive.