I want to build an LP whose parameters are given by between 5 and 10 .csv files, each with 25,000,000 to 50,000,000 rows (approx. 500 MB to 1 GB per file).
My model is currently coded in AMPL and reads the parameter values directly from the .csv files. The Windows XP machine with 1 GB of RAM I am using runs out of memory trying to build a model based on data from only one 500 MB .csv file.
My question:
Is there a way to manage my data so that I can build the LP using less memory?
I appreciate all feedback from anyone with experience building massive LPs.
It is hard to see how you will ever be able to load and solve such large problems when a single .csv file alone is 500 MB or more and you only have 1 GB of RAM on your computer.
If adding considerably more RAM is not an option, you will need to analyze your LP problem to see whether it can be separated into smaller independent parts. For example, if you have a problem with 10,000 variables and 10,000,000 rows, perhaps it is possible to break the main problem up into, say, 100 independent sub-problems with 100 variables and 100,000 rows each?
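If the problem does decompose like that, the data itself can be split in one streaming pass so that AMPL only ever has to read one manageable chunk at a time. Here is a rough Python sketch of that idea; the block_id column and the file layout are assumptions about your data, not something taken from your model:

```python
import csv
from pathlib import Path

def split_csv_by_block(src_path, out_dir, key_column="block_id"):
    """Stream one huge parameter .csv and write a smaller .csv per sub-problem.

    Assumes each row has a column (here 'block_id') identifying which
    independent sub-problem it belongs to; adjust key_column accordingly.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}  # block id -> (file handle, csv.DictWriter)
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        for row in reader:
            block = row[key_column]
            if block not in writers:
                fh = open(out_dir / f"block_{block}.csv", "w", newline="")
                w = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[block] = (fh, w)
            writers[block][1].writerow(row)
    for fh, _ in writers.values():
        fh.close()

# Each block_<id>.csv can then be fed to its own, much smaller AMPL model.
```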
Here is a link to an albeit dated book chapter that discusses separation of a large LP problem into manageable sub-problems.
Related
I have received several CSV files that I need to merge into a single file, all with a common key I can use to join them. Unfortunately, each of these files is about 5 GB in size (several million rows, about 20-100+ columns), so it's not feasible to just load them up in memory and execute a join against each one, but I do know that I don't have to worry about column conflicts between them.
I tried building an index mapping each ID to its row offset in each file, so I could compute the result without using much memory, but of course that's slow as time itself when you actually have to look up each row, pull the rest of the CSV data from it, concatenate it to the in-progress data, and then write it out to a file. This simply isn't feasible, even on an SSD, against the millions of rows in each file.
I also tried simply loading up some of the smaller sets in memory and running a parallel.foreach against them to match up the necessary data to dump back out to a temporary merged file. While this was faster than the last method, I simply don't have the memory to do this with the larger files.
I'd ideally like to just do a full left join of the largest of the files, then full left join to each subsequently smaller file so it all merges.
How might I otherwise go about approaching this problem? I've got 24 GB of memory and six cores on this system to work with.
While this might just be a problem to load into a relational database and do the join there, I thought I'd reach out before going that route to see if there are any ideas out there on solving this from my local system.
Thanks!
A relational database is the first thing that comes to mind and probably the easiest, but barring that...
Build a hash table mapping key to file offset. Parse the rows on-demand as you're joining. If your keyspace is still too large to fit in available address space, you can put that in a file too. This is exactly what a database index would do (though maybe with a b-tree).
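For what it's worth, here is a minimal Python sketch of the offset-index idea, assuming each file has a header row and that the join key lives in a column called id (both assumptions); only key-to-offset mappings are held in memory, and rows are re-read from disk on demand:

```python
import csv

def build_offset_index(path, key_column="id"):
    """Map each join key to the byte offset of its row; the rows stay on disk."""
    index = {}
    with open(path, "rb") as f:
        header = f.readline().decode().rstrip("\r\n").split(",")
        key_pos = header.index(key_column)
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            # Assumes no embedded newlines inside quoted fields.
            fields = next(csv.reader([line.decode()]))
            index[fields[key_pos]] = offset
    return header, index

def read_row(path, offset):
    """Seek straight to a previously indexed row and parse only that row."""
    with open(path, "rb") as f:
        f.seek(offset)
        return next(csv.reader([f.readline().decode()]))
```

To join, you'd walk the keys of the biggest file's index and call read_row() against the other files as you go, writing each merged line straight out (in practice you'd keep the file handles open rather than reopening per row), so memory usage stays at roughly one key-to-offset dictionary per file.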
You could also pre-sort the files based on their keys and do a merge join.
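And a sketch of the sort-then-merge variant, assuming each file has already been sorted lexicographically on its first column (the key), e.g. with GNU sort, and that keys are unique within each file; it holds only one row per file in memory. As written it produces a full outer merge; restricting the output to keys present in your largest file would give the left join you describe:

```python
import csv

def merge_join_sorted(paths, out_path):
    """Merge pre-sorted CSV files on their first column (the join key).

    Assumes each file is sorted lexicographically on the key and that keys are
    unique within a file. Missing keys get empty fields for that file's columns.
    """
    files = [open(p, newline="") for p in paths]
    try:
        readers = [csv.reader(f) for f in files]
        headers = [next(r) for r in readers]
        currents = [next(r, None) for r in readers]
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["key"] + [c for h in headers for c in h[1:]])
            while any(row is not None for row in currents):
                key = min(row[0] for row in currents if row is not None)
                merged = [key]
                for i, row in enumerate(currents):
                    if row is not None and row[0] == key:
                        merged.extend(row[1:])
                        currents[i] = next(readers[i], None)  # advance this file
                    else:
                        merged.extend([""] * (len(headers[i]) - 1))
                writer.writerow(merged)
    finally:
        for f in files:
            f.close()
```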
The good news is that "several" 5GB files is not a tremendous amount of data. I know it's relative, but the way you describe your system...I still think it's not a big deal. If you weren't needing to join, you could use Perl or a bunch of other command-liney tools.
Are the column names known in each file? Do you care about the column names?
My first thoughts:
Spin up Amazon Web Services (AWS) Elastic MapReduce (EMR) instance (even a pretty small one will work)
Upload these files
Import the files into Hive (as managed or external tables).
Perform your joins in Hive.
You can spin up an instance in a matter of minutes and be done with the work within an hour or so, depending on your comfort level with the material.
I don't work for Amazon, and can't even use their stuff during my day job, but I use it quite a bit for grad school. It works like a champ when you need your own big data cluster. Again, this isn't "Big Data (R)", but Hive will kill this for you in no time.
This article doesn't do exactly what you need (it copies data from S3); however, it will help you understand table creation, etc.
http://aws.amazon.com/articles/5249664154115844
Edit:
Here's a link to the EMR overview:
https://aws.amazon.com/elasticmapreduce/
I'm not sure if you are manipulating the data, but if you're just combining the CSVs you could try this...
http://www.solveyourtech.com/merge-csv-files/
I would like to know if anyone has experience with the speed or optimization effects of JSON key size in a document-store database like MongoDB or Elasticsearch.
So, for example, I have 2 documents:
doc1: { keeeeeey1: 'abc', keeeeeeey2: 'xyz' }
doc2: { k1: 'abc', k2: 'xyz' }
Let's say I have 10 million records; storing the data in the doc1 format would mean a larger DB file size than storing it in the doc2 format.
Other than that, what are the disadvantages or negative effects in terms of speed, RAM, or any other optimization?
You correctly noticed that the documents will have different sizes, so you will save at least 15 bytes per document (around 60% for documents this small) if you decide to adopt the second schema. That adds up to something like 140 MB for your 10 million records, which gives you the following advantages:
HDD savings. The only problem is that, looking at current HDD prices, this is mostly negligible.
RAM savings. In comparison with hard disks, this matters more for indexing. In MongoDB the working set of indexes should fit in RAM to achieve good performance, so if you have indexes on these two fields you will not only save 140 MB of HDD space but also 140 MB of potential RAM space (which is actually noticeable).
I/O. A lot of bottlenecks happen because of the limits of the input/output system (the speed of reading from and writing to disk is limited). For your documents, this means that with schema 2 you can potentially read/write twice as many documents per second.
Network. In a lot of situations the network is even slower than I/O, and if your DB server is on a different machine than your application server, the data has to be sent over the wire. With the shorter keys you will also be able to send twice as much data.
Having told you about the advantages, I have to tell you about the disadvantages of small keys:
Readability of the database. When you do db.coll.findOne() and see {_id: 1, t: 13423, a: 3, b: 0.2}, it is pretty hard to understand what exactly is stored there.
Readability of the application. This is similar to the database case, but at least here you can have a solution: with mapping logic that transforms currentDate to c and price to p, you can write clean code while keeping the stored schema short.
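For example, a tiny sketch of such a mapping layer in Python (the field names and the pymongo call in the usage comment are purely illustrative assumptions):

```python
# Illustrative mapping between readable application names and short stored keys.
FIELD_MAP = {"currentDate": "c", "price": "p"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}

def to_short(doc):
    """Shrink keys just before writing to the database."""
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

def to_long(doc):
    """Expand keys right after reading, so application code stays readable."""
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

# Usage (assuming a pymongo collection `coll`):
#   coll.insert_one(to_short({"currentDate": "2016-01-18", "price": 0.2}))
#   doc = to_long(coll.find_one({"c": "2016-01-18"}))
```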
As per the attached, we have a Balanced Data Distributor set up in a data transformation covering about 2 million rows. The script tasks are identical - each one opens a connection to Oracle and executes first a delete and then an insert. (This isn't relevant, but it's done that way due to parameter issues with the OLE DB command and the Microsoft OLE DB provider for Oracle...)
The issue I'm running into is no matter how large I make my buffers or how many concurrent executions I configure, the BDD will not execute more than five concurrent processes at a time.
I've pulled back hundreds of thousands of rows in a larger buffer, and it just gets divided 5 ways. I've tried this on multiple machines - the current shot is from a 16 core server with -1 concurrent executions configured on the package - and no matter what, it's always 5 parallel jobs.
5 is better than 1, but with 2.5 million rows to insert/update, 15 rows per second at 5 concurrent executions isn't much better than 2-3 rows per second with 1 concurrent execution.
Can I force the BDD to use more paths, and if so how?
Short answer:
Yes, BDD can make use of more than five paths, and you shouldn't have to do anything special to force it; by design it should happen automatically. So why isn't it using more than 5 paths? Because your source is producing data faster than your destinations can consume it, causing backpressure. To resolve it, you have to tune your destination components.
Long answer:
In theory, "the BDD takes input data and routes it in equal proportions to its outputs, however many there are." In your setup there are 10 outputs, so the input data should be distributed equally to all 10 outputs at the same time, and you should see 10 paths executing at once - again, in theory.
But another aspect of BDD is that "instead of routing individual rows, the BDD operates on buffers of data." This means the data flow engine initiates a buffer, fills it with as many rows as possible, and moves that buffer to the next component (the script destination in your case). As you can see, 5 buffers are used, each with the same number of rows. If additional buffers had been started, you'd have seen more paths being used. SSIS couldn't use additional buffers, and ultimately additional paths, because of a mechanism called backpressure; it kicks in when the source produces data faster than the destination can consume it. If that were allowed to continue, all the memory would be used up by source data and SSIS would have none left for the transformation and destination components. To avoid this, SSIS limits the number of active buffers. The limit is 5 (and can't be changed), which is exactly the number of threads you're seeing.
PS: The text within quotes is from this article
There is a property in SSIS data flow tasks called EngineThreads which determines how many flows can be run concurrently, and its default value is 5 (in SSIS 2012 its default value is 10, so I'm assuming you're using SSIS 2008 or earlier.) The optimal value is dependent on your environment, so some testing will probably be required to figure out what to put there.
Here's a Jamie Thomson article with a bit more detail.
Another interesting thing I've discovered via this article on CodeProject.
[T]his component uses an internal buffer of 9,947 rows (as per the experiment, I found so) and it is pre-set. There is no way to override this. As a proof, instead of 10 lac rows, we will use only 9,947 (Nine thousand nine forty seven) rows in our input file and will observe the behavior. After running the package, we will find that all the rows are being transferred to the first output component and the other components received nothing.

Now let us increase the number of rows in our input file from 9,947 to 9,948 (Nine thousand nine forty eight). After running the package, we find that the first output component received 9,947 rows while the second output component received 1 row.
So I notice in your first buffer run that you pulled 50,000 records. Those got divided into 9,984 record buckets and passed to each output. So essentially the BDD takes the records it gets from the buffer and passes them out in ~10,000 record increments to each output. So in this case perhaps your source is the bottleneck.
Perhaps you'll need to split your original Source query in half and create two BDD-driven data flows to in essence double your parallel throughput.
I am looking for a ballpark estimate of the size of the database tables after I have converted the current CSV files to MyISAM tables. I know the file size of the CSVs and I need an estimate of the file size of the MyISAM tables. I guess it will be bigger because of indexes (just 1 simple index is enough), but how much bigger? About 2 times, 10 times, or 50 times?
Presumably you're looking for capacity-planning size data to give to your server wrangler.
If your CSV files total less than about 200-300 megabytes in size, just tell her "two gigabytes or less."
If your CSV files are larger than that, you are definitely going to need to run some trials to figure this out. Notice that indexes aren't necessarily smaller than the data they index, and that they can grow larger faster than linearly.
If you are wrangling with your server wrangler about any amount of disk space less than a couple of gigabytes, then it's unfortunate evidence that your organization is short of money and talent.
In this case you're going to have to prove it to them: load the database on your personal computer, measure the size of the MyISAM files, and tell them what you came up with.
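Measuring is easy because MyISAM stores each table as a .MYD (data) file and a .MYI (index) file in MySQL's data directory, so after loading a sample you can simply compare file sizes. A rough Python sketch, where the data-directory path, table name and CSV path are placeholders for your setup:

```python
import os

DATA_DIR = "/var/lib/mysql/mydb"  # assumed MySQL data directory + schema name
TABLE = "mytable"                 # assumed table name
CSV_PATH = "sample.csv"           # the CSV you loaded into that table

csv_size = os.path.getsize(CSV_PATH)
data_size = os.path.getsize(os.path.join(DATA_DIR, TABLE + ".MYD"))   # row data
index_size = os.path.getsize(os.path.join(DATA_DIR, TABLE + ".MYI"))  # indexes

ratio = (data_size + index_size) / csv_size
print(f"CSV: {csv_size:,} bytes; MyISAM data+index: {data_size + index_size:,} bytes")
print(f"Blow-up factor: {ratio:.2f}x")
```

Keep in mind the point above: the index portion does not have to grow linearly, so the bigger the sample you can load, the better the estimate.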
We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we’re concerned about the future when the amount of data will be considerably larger. Right now we’re considering some way of splitting up the tables (especially the one that accounts for the biggest part of the data) and I’m now wondering, what would be the best way to do it.
The options I’m currently aware of are
Using MySQL Partitioning that comes with version 5.1
Using some kind of third party library that encapsulates the partitioning of the data (like hibernate shards)
Implementing it ourselves inside our application
Our application is built on J2EE and EJB 2.1 (hopefully we’re switching to EJB 3 some day).
What would you suggest?
EDIT (2011-02-11):
Just an update: Currently the size of the database is 380 GB, the data size of our "big" table is 220 GB and the size of its index is 36 GB. So while the whole table does not fit in memory any more, the index does.
The system is still performing fine (still on the same hardware) and we're still thinking about partitioning the data.
EDIT (2014-06-04):
One more update: The size of the whole database is 1.5 TB, the size of our "big" table is 1.1 TB. We upgraded our server to a 4 processor machine (Intel Xeon E7450) with 128 GB RAM.
The system is still performing fine.
What we're planning to do next is putting our big table on a separate database server (we've already done the necessary changes in our software) while simultaneously upgrading to new hardware with 256 GB RAM.
This setup is supposed to last for two years. Then we will either have to finally start implementing a sharding solution or just buy servers with 1 TB of RAM which should keep us going for some time.
EDIT (2016-01-18):
We have since put our big table in its own database on a separate server. Currently the size of this database is about 1.9 TB, and the size of the other database (with all tables except for the "big" one) is 1.1 TB.
Current Hardware setup:
HP ProLiant DL 580
4 x Intel(R) Xeon(R) CPU E7- 4830
256 GB RAM
Performance is fine with this setup.
You will definitely start to run into issues with that 42 GB table once it no longer fits in memory. In fact, as soon as it does not fit in memory anymore, performance will degrade extremely quickly. One way to test this is to put that table on a machine with less RAM and see how poorly it performs.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
This is incorrect. Partitioning (either through the feature in MySQL 5.1, or the same thing using MERGE tables) can provide significant performance benefits even if the tables are on the same drive.
As an example, let's say that you are running SELECT queries on your big table using a date range. If the table is not partitioned, the query will be forced to scan through the entire table (and at that size, even using indexes can be slow). The advantage of partitioning is that your queries will only run on the partitions where it is absolutely necessary. If each partition is 1 GB in size and your query only needs to access 5 partitions to be satisfied, the combined 5 GB of data is a lot easier for MySQL to deal with than a monster 42 GB table.
One thing you need to ask yourself is how you are querying the data. If there is a chance that your queries will only need to access certain chunks of data (i.e. a date range or ID range), partitioning of some kind will prove beneficial.
I've heard that there is still some bugginess with MySQL 5.1 partitioning, particularly related to MySQL choosing the correct key. MERGE tables can provide the same functionality, although they require slightly more overhead.
Hope that helps...good luck!
If you think you're going to be IO/memory bound, I don't think partitioning is going to be helpful. As usual, benchmarking first will help you figure out the best direction. If you don't have spare servers with 64GB of memory kicking around, you can always ask your vendor for a 'demo unit'.
I would lean towards sharding if you don't expect single-query aggregate reporting. I'm assuming you'd shard the whole database and not just your big table: it's best to keep entire entities together. Well, if your model splits nicely, anyway.
This is a great example of what MySQL partitioning can do in a real-life case of huge data flows:
http://web.archive.org/web/20101125025320/http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
Hopefully it will be helpful for your case.
A while back at a Microsoft ArcReady event, I saw a presentation on scaling patterns that might be useful to you. You can view the slides for it online.
I would go for MariaDB InnoDB + Partitions (either by key or by date, depending on your queries).
I did this and now I don't have any Database problems anymore.
MySQL can be replaced with MariaDB in seconds...all the database files stay the same.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
Secondly, it's not necessarily the table with the largest physical size that you want to move. You may have a much smaller table that gets more activity, while your big table remains fairly constant or only appends data.
Whatever you do, don't implement it yourselves. Let the database system handle it.
What does the big table do?
If you're going to split it, you've got a few options:
- Split it using the database system (don't know much about that)
- Split it by row.
- Split it by column.
Splitting it by row would only be possible if your data can be separated easily into chunks. e.g. Something like Basecamp has multiple accounts which are completely separate. You could keep 50% of the accounts in one table and 50% in a different table on a different machine.
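Just to illustrate the routing idea behind splitting by row, here's a tiny Python sketch; the two-shard list and the get_connection helper are assumptions, not anything from your application:

```python
import hashlib

SHARDS = ["db_server_a", "db_server_b"]  # assumed shard identifiers

def shard_for(account_id):
    """Deterministically route an account to one shard."""
    digest = hashlib.md5(str(account_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for a given account then goes to that account's shard, e.g.
#   get_connection(shard_for(account_id)).execute(...)  # get_connection is hypothetical
```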
Splitting by column is good for situations where the rows contain large text fields or BLOBs. If you've got a table with (for example) a user image and a huge block of text, you could farm the image out into a completely different table (on a different machine).
You break normalisation here, but I don't think it would cause too many problems.
You would probably want to split that large table eventually. You'll probably want to put it on a separate hard disk, before thinking of a second server. Doing it with MySQL is the most convenient option. If it is capable, then go for it.
BUT
Everything depends on how your database is being used, really. Statistics.