What is the fastest way to do this:
One table, no references that I cannot prefill (i.e. there is one reference key there, but i have all the data filled in)
LOTS of data. We talk of hundreds of millions of rows per day, coming in dynamically through an API
Requests must / should be processed as soon as feasible in a near real time scenario (i.e. no writing out to a file for upload one per day). 2 seconds is the normal maximal delay
Separate machines for data / application and the SQL Server
What I do now:
Aggregate up to 32*1024 rows into an array, then queue it.
Read the queue in 2-3 threads. Insert into database using SqlBulkCopy.
I get about 60k-75k rows imported per second, which is not enough, but quite close. I would love to hit 250.000 rows.
So far nothing is really used. I get 20% time "network I/O" blocks, have one core 80% loaded CPU side. Discs are writing out 7mb-14mb, mostly idle. Average queue length on a RAID 10 of 6 raptors is.... 0.25.
Anyone any idea how to speed this up? Faster server (so far it is virtual, 8gb ram, 4 cores, physical disc pass through for data).
Adding some clarifications:
This is a 2008 R2 Enterprise SQL Server on a 2008 R2 server. machine has 4 cores, 8gb ram. All 64 bit. The 80% load average comes from this machine showing about 20% cpu load.
The table is simple, has no primary key, only an index on a relational reference (instrument reference) and a unique (within a set of instruments, so this is not enforced) timestamp.
The fields on the table are: timestamp, instrument reference (no enforced foreign key), data type (char 1, one of a number of characters indicating what data is posted), price (double) and volume (int). As you can see this is a VERY thin table. The data in question is tick data for financial instruments.
The question is also about hardware etc. - mostly because i see no real bottleneck. I am inserting in multiple transactions and it gives me a benefit, but a small one. Discs, CPU are not showing significant load, network io wait is high (300ms/second, 30% at the moment), but this is on the same virtualization platform which runs JSUT the two servers and has enough cores to run all. I pretty much am open to "buy another server", but i want to identify the bottleneck first.... especially given that at the end of the day I am not grabbing what the bottleneck is. Logging is irrelevant - the bulk inserts do NOT go into the data log as data (no clustered index).
Would vertical partitioning help, for example by a byte (tinyint) that would split the instrument universe by for example 16 tables, and me thus doing up to 16 inserts at the same time? As actually the data comes from different exchanges I could make a partition per exchange. This would be a natural split field (which is actually in instrument, but I could duplicate this data here).
Some more clarifications: Got the speed even higher (90k), now clearly limited by network IO between machines, which could be VM switching.
What I do now is do a connection per 32k rows, put up a temp table, insert into this with SqlBUlkdCopy, THEN use ONE sql statement to copy to main table - minimizes any lock times on the main table.
Most waiting time is now still on network IO. Seems I run into issues where VM wise. Will move to physical hardware in the next months ;)
If you manage 70k rows per second, you're very lucky so far. But I suspect it's because you have a very simple schema.
I can't believe you ask about this kind of load on
virtual server
single array
SATA disks
The network and CPUs are shared, IO is restricted: you can't use all resources.
Any load stats you see are not very useful. I suspect the network load you see is traffic between the 2 virtual servers and you'll become IO bound if you resolve this
Before I go on, read this 10 lessons from 35K tps. He wasn't using a virtual box.
Here is what I'd do, assuming no SAN and no DR capability if you want to ramp up volumes.
Buy 2 big phyical servers, CPU RAM kind of irreleveant, max RAM, go x64 install
Disks + controllers = fastest spindles, fastest SCSI. Or a stonking great NAS
1000MB + NICs
RAID 10 with 6-10 disk for one log file for your database only
Remaining disk RAID 5 or RAID 10 for data file
For reference, our peak load is 12 million rows per hour (16 core, 16GB, SAN, x64) but we have complexity in the load. We are not at capacity.
From the answers I read here, it seems you really have a hardware problem rather than a code problem. Ideally, you'll get your performance boosts by making available more disk I/O or network bandwidth, or by running the program on the same virtual machine that hosts the database.
However I do want to share the idea that table parameter inserts are really ideal for big data transfer; although SqlBulkCopy appears to be just as fast, it's significantly less flexible.
I wrote an article about the topic here: http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
The overall answer is that you roughly want to create a table type:
CREATE TYPE item_drop_bulk_table_rev4 AS TABLE (
item_id BIGINT,
monster_class_id INT,
zone_id INT,
xpos REAL,
ypos REAL,
kill_time datetime
)
Then, you create a stored procedure to copy from the table parameter into the actual table directly, so there are fewer in-between steps:
CREATE PROCEDURE insert_item_drops_rev4
#mytable item_drop_bulk_table_rev4 READONLY
AS
INSERT INTO item_drops_rev4
(item_id, monster_class_id, zone_id, xpos, ypos, kill_time)
SELECT
item_id, monster_class_id, zone_id, xpos, ypos, kill_time
FROM
#mytable
The SQL Server code behind looks like this:
DataTable dt = new DataTable();
dt.Columns.Add(new DataColumn("item_id", typeof(Int64)));
dt.Columns.Add(new DataColumn("monster_class_id", typeof(int)));
dt.Columns.Add(new DataColumn("zone_id", typeof(int)));
dt.Columns.Add(new DataColumn("xpos", typeof(float)));
dt.Columns.Add(new DataColumn("ypos", typeof(float)));
dt.Columns.Add(new DataColumn("timestamp", typeof(DateTime)));
for (int i = 0; i < MY_INSERT_SIZE; i++) {
dt.Rows.Add(new object[] { item_id, monster_class_id, zone_id, xpos, ypos, DateTime.Now });
}
// Now we're going to do all the work with one connection!
using (SqlConnection conn = new SqlConnection(my_connection_string)) {
conn.Open();
using (SqlCommand cmd = new SqlCommand("insert_item_drops_rev4", conn)) {
cmd.CommandType = CommandType.StoredProcedure;
// Adding a "structured" parameter allows you to insert tons of data with low overhead
SqlParameter param = new SqlParameter("#mytable", SqlDbType.Structured);
param.Value = dt;
cmd.Parameters.Add(param);
cmd.ExecuteNonQuery();
}
}
Are there any indexes on the table that you could do without? EDIT: asking while you were typing.
Is it possible to turn the price into an integer, and then divide by 1000 or whatever on queries?
Have you tried adding a pk to the table? Does that improve the speed?
There is also a set-based way to use tally tables to import csv data from http://www.sqlservercentral.com/articles/T-SQL/62867/ (near bottom, requires free registration but worth it).
You might like to try that and test its performance ... with a small tally properly indexed tally table.
It is all slow.
Some time ago we solved a similar problem (insert into DB tens of thousands price data, as I remember it was about 50K per time frame, and we had about 8 time frames that all clashed at :00, so it was about 400K records) and it worked very-very fast for us (MS SQL 2005). Imagine how it will work today (SQL 2012):
<...init...>
if(bcp_init(m_hdbc, TableName, NULL, NULL, DB_IN) == FAIL)
return FALSE;
int col_number = 1;
// Bind columns
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.SymbolName, 0, 16, (LPCBYTE)"", 1, 0, col_number++) == FAIL) return FALSE;
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.Time, 0, 4, 0, 0, 0, col_number++) == FAIL) return FALSE;
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.Open, 0, 8, 0, 0, 0, col_number++) == FAIL) return FALSE;
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.High, 0, 8, 0, 0, 0, col_number++) == FAIL) return FALSE;
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.Low, 0, 8, 0, 0, 0, col_number++) == FAIL) return FALSE;
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.Close, 0, 8, 0, 0, 0, col_number++) == FAIL) return FALSE;
if(bcp_bind(m_hdbc, (BYTE *)&m_sd.Volume, 0, 8, 0, 0, 0, col_number++) == FAIL) return FALSE;
<...save into sql...>
BOOL CSymbolStorage::Copy(SQL_SYMBOL_DATA *sd)
{
if(!m_bUseDB)
return TRUE;
memcpy(&m_sd, sd, sizeof(SQL_SYMBOL_DATA));
if(bcp_sendrow(m_hdbc) != SUCCEED)
return FALSE;
return TRUE;
}
Could you use horizontal partitioning?
See: http://msdn.microsoft.com/en-us/library/ms178148.aspx & http://msdn.microsoft.com/en-us/library/ms188706.aspx
You might also want to look at this question, and possibly change the recovery model:
Sql Server 2008 Tuning with large transactions (700k+ rows/transaction)
Some questions:
What edition of SQL Server are you using?
Why is the one core at 80%? That might be the bottleneck, so is probably something worth investigating.
What OS are you using, and is it 64 bit?
Related
So I am trying to simulate a 1-D physical model named Tasep.
I wrote a code to simulate this system in c++, but I definitely need a performance boost.
The model is very simple ( c++ code below ) - an array of 1's and 0's. 1 represent a particle and 0 is no-particle, meaning empty. A particle moves one element to the right, at a rate 1, if that element is empty. A particle at the last location will disappear at a rate beta ( say 0.3 ). Finally, if the first location is empty a particle will appear there, at a rate alpha.
One threaded is easy, I just pick an element at random, and act with probability 1 / alpha / beta, as written above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA at all good idea for such a thing?
How many threads should I have? I can have a thread for each site ( 10E+6 ), should I?
How do I synchronize the access to memory between different threads? I used atomic operations so far.
What is the right way to generate random data? If I use a million threads is it ok to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I managed to run code from CUDA samples and some tutorials. Although I have some code of the above ( still gives strange result though ), I do not put it here, because I think the questions are more general.
So here is the c++ one threaded version of it:
int Tasep()
{
const int L = 750000;
// rates
int alpha = 330;
int beta = 300;
int ProbabilityNormalizer = 1000;
bool system[L];
int pos = 0;
InitArray(system); // init to 0's and 1's
/* Loop */
for (int j = 0; j < 10*L*L; j++)
{
unsigned long randomNumber = xorshf96();
pos = (randomNumber % (L)); // Pick Random location in the the array
if (pos == 0 && system[0] == 0) // First site and empty
system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // Insert a particle with chance alpha
else if (pos == L - 1) // last site
system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // Remove a particle if exists with chance beta
else if (system[pos] && !system[pos + 1]) // If current location have a particle and the next one is empty - Jump right
{
system[pos] = false;
system[pos + 1] = true;
}
if ((j % 1000) == 0) // Just do some Loggingg
Log(system, j);
}
getchar();
return 0;
}
I would be truly grateful for whoever is willing to help and give his/her advice.
I think that your goal is to perform something called Monte Carlo Simulations.
But I have failed to fully understand your main objective (i.e. get a frequency, or average power lost, etc.)
Question 01
Since you asked about random data, I do believe you can have multiple random seeds (maybe one for each thread), I would advise you to generate the seed in the GPU using any pseudo random generator (you can use even the same as CPU), store the seeds in GPU global memory and launch as many threads you can using dynamic parallelism.
So, yes CUDA is a suitable approach, but keep in your mind the balance between time that you will require to learn and how much time you will need to get the result from your current code.
If you will take use this knowledge in the future, learn CUDA maybe worth, or if you can escalate your code in many GPUs and it is taking too much time in CPU and you will need to solve this equation very often it worth too. Looks like that you are close, but if it is a simple one time result, I would advise you to let the CPU solve it, because probably, from my experience, you will take more time learning CUDA than the CPU will take to solve it (IMHO).
Question 02
The number of threads is very usual question for rookies. The answer is very dependent of your project, but taking in your code as an insight, I would take as many I can, using every thread with a different seed.
My suggestion is to use registers are what you call "sites" (be aware that are strong limitations) and then run multiples loops to evaluate your particle, in the very same idea of a car tire a bad road (data in SMEM), so your L is limited to 255 per loop (avoid spill at your cost to your project, and less registers means more warps per block). To create perturbation, I would load vectors in the shared memory, one for alpha (short), one for beta (short) (I do assume different distributions), one "exist or not particle" in the next site (char), and another two to combine as pseudo generator source with threadID, blockID, and some current time info (to help you to pick the initial alpha, beta and exist or not) so u can reuse this rates for every thread in the block, and since the data do not change (only the reading position change) you have to sync only once after reading, also you can "random pick the perturbation position and reuse the data. The initial values can be loaded from global memory and "refreshed" after an specific number of loops to hide the loading latency. In short, you will reuse the same data in shared multiple times, but the values selected for every thread change at every interaction due to the pseudo random value. Taking in account that you are talking about large numbers and you can load different data in every block, the pseudo random algorithm should be good enough. Also, you can even use the result stored in the gpu from previous runs as random source, flip one variable and do some bit operations, so u can use every bit as a particle.
Question 03
For your specific project I would strongly recommend to avoid thread cooperation and make these completely independent. But, you can use shuffle inside the same warp, with no high cost.
Question 04
It is hard to generate truly random data, but you should worry about by how often last your period (since any generator has a period of random and them repeats). I would suggest you to use a single generator which can work in parallel to your kernel and use it feed your kernels (you can use dynamic paralelism). In your case since you want some random you should not worry a lot with consistency. I gave an example of pseudo random data use in the previous question, that may assist. Keep in your mind that there is no real random generator, but there are alternatives as internet bits for example.
Question 05
Already explained in the Question 03, but keep in your mind that you do not need a full sequence of values, only a enough part to use in multiple forms, to give enough time to your kernel just process and then you can refresh you sequence, if you guarantee to not feed block with the same sequence it will be very hard to fall into patterns.
Hope I have help, I’m working with CUDA for a bit more than a year, started just like you, and still every week I do improve my code, now it is almost good enough. Now I see how it perfectly fit my statistical challenge: cluster random things.
Good luck!
This is not a question about query optimization. Rather, a sanity check about what to expect of data transfer rates from MySQL 5.5.27 (Amazon RDS).
When running a particularly heavy query, MySQL Workbench is showing data transfer rate of about 1MB/s and the query runs for about 420 seconds. This adds up to about 420M bytes of data being transferred.
If this data is saved into a simple text file, the size of the file ends up being less than 7M bytes. I certainly expected to see some overhead due to metadata of the ResultSet, JDBC driver mechanisms, etc. But 420M vs. 7M seems like an extraordinary terrible ratio to me. Or, is this normal?
Any feedback is much appreciated.
Much thanks!
PS. More details:
-the JDBC Driver is mysql-connector-java-5.1.13
-the data is transferred between Amazon RDS and an EC2 instance
-Java 1.6 PreparedStatement is used to execute the query
Wireshark is a wonderful free and open-source (GPL) network analysis tool that can be used to great effect in cases like this. I ran the following test to see how much traffic a "typical" JDBC connection to a "normal" MySQL server might generate.
I created a table named jdbctest in MySQL (5.5.29-0ubuntu0.12.04.2) on my test server.
CREATE TABLE `jdbctest` (
`id` int(11) DEFAULT NULL,
`textcol` varchar(6) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I populated it with 100,000 rows of the form
id textcol
------ -------
1 ABCDEF
2 ABCDEF
3 ABCDEF
...
100000 ABCDEF
At 4 bytes per id value and 6 bytes per textcol value, retrieving all 100,000 rows should represent somewhere on the order of 1 MB of data.
I fired up Wireshark, started a trace, and ran the following Java code which uses mysql-connector-java-5.1.26:
import java.sql.*;
public class mysqlTestMain {
static Connection dbConnection = null;
public static void main(String[] args) {
try {
String myConnectionString = "";
myConnectionString =
"jdbc:mysql://192.168.1.3:3306/mytestdb";
dbConnection = DriverManager.getConnection(myConnectionString, "root", "whatever");
PreparedStatement stmt = dbConnection.prepareStatement("SELECT * FROM jdbctest");
ResultSet rs = stmt.executeQuery();
int i = 0;
int j = 0;
String s = "";
while (rs.next()) {
i++;
j = rs.getInt("id");
s = rs.getString("textcol");
}
System.out.println(String.format("Finished reading %d rows.", i));
rs.close();
stmt.close();
dbConnection.close();
} catch (SQLException ex) {
ex.printStackTrace();
}
}
}
The console output confirmed that I had retrieved all 100,000 rows.
Looking at the summary of the Wireshark trace, I found:
Packets captured: 1811
Avg. packet size: 992.708 bytes
Bytes: 1797795
The breakdown by direction was
packets bytes
------- -----
from me to server 636 36519
from server to me 1175 1761276
So it appears that to retrieve my ~1 MB of data I received 1.72 MB of total network traffic from the MySQL server. That ~72% overhead on the download (or ~76% including traffic in both directions) is certainly nowhere near the ~5900% overhead suggested by your (rate * time) calculation.
I strongly suspect that the ~1 MB/s rate being reported by MySQL Workbench is not the overall average transfer rate over the entire time. The best way to determine the overhead in your particular circumstance would be to use a tool like Wireshark and measure it yourself.
First, please refer to this block of code:
while(1) {
lt = time(NULL);
ptr = localtime(<);
int n = read (fd, buf, sizeof(buf));
strftime(str, 100, "%c", ptr);
int temp = sprintf(tempCommand, "UPDATE roomtemp SET Temperature='%s' WHERE Date='Today'", buf);
temp = sprintf(dateCommand, "UPDATE roomtemp SET Date='%s' WHERE Type='DisplayTemp'", str);
printf("%s", buf);
mysql_query(conn, tempCommand);
mysql_query(conn, dateCommand);
}
The read function is actually reading data coming in from a serial port. It works great, but the problem I am experiencing (I think) is the time it takes for the loop to execute. I have data being sent to the serial port every second. Suppose the data is "22" every second. What this loop does is read in "2222" or sometimes "222222". What I think is happening is that the loop takes too long to iterate, and that causes data to accumulate in the serial buffer. The read statement reads in everything in the buffer, hence giving me repeated values.
Is there any way to get around this? Perhaps at the end of the loop, I can flush the buffer. But I am not certain I know how to do this. Or perhaps there is some way to cut down the code inside the loop in order to reduce the overall time each iteration takes in the first place. My guess is that the MySQL queries are what take the most time anyway.
To start with you should check for errors from read, and also properly terminate the received "string".
To continue with your problem, there are a couple of ways to solve this. One it to put either the reading from the serial port or the database updates in a separate thread. Then you can pass "messages" between the the threads. Be careful though, as it seems your database is slow and the message queue might build up. This message-buildup can be averted by having a message queue of size one, which always contain the latest temperature read. Then you only need a single flag that the temperature reading thread sets, and the database updating thread checks and then clears.
Another solution is to modify the protocol used for the communication, so it includes a digit to tell how big the message is.
Does the value returned by MySQL's MD5 hash function continue to change indefinitely as the string given to it grows indefinitely?
E.g., will these continue to return different values:
MD5("A"+"B"+"C")
MD5("A"+"B"+"C"+"D")
MD5("A"+"B"+"C"+"D"+"E")
MD5("A"+"B"+"C"+"D"+"E"+"D")
... and so on until a very long list of values ....
At some point, when we are giving the function very long input strings, will the results stop changing, as if the input were being truncated?
I'm asking because I want to use the MD5 function to compare two records with a large set of fields by storing the MD5 hash of these fields.
======== MADE-UP EXAMPLE (YOU DON'T NEED THIS TO ANSWER THE QUESTION BUT IT MIGHT INTEREST YOU: ========
I have a database application that periodically grabs data from an external source and uses it to update a MySQL table.
Let's imagine that in month #1, I do my first download:
downloaded data, where the first field is an ID, a key:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
I store this
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
Month #2, I get
1,"A","B","C"
2,"A","D","X"
3,"B","D","E"
4,"B","F","E"
Notice that the record with ID 2 has changed. Record with ID 4 is new. So I store two new records:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
2,"A","D","X"
4,"B","F","E"
This way I have a history of *changes* to the data.
I don't want have to compare each field of the incoming data with each field of each of the stored records.
E.g., if I'm comparing incoming record x with exiting record a, I don't want to have to say:
Add record x to the stored data if there is no record a such that x.ID == a.ID AND x.F1 == a.F1 AND x.F2 == a.F2 AND x.F3 == a.F3 [4 comparisons]
What I want to do is to compute an MD5 hash and store it:
1,"A","B","C",MD5("A"+"B"+"C")
Let's suppose that it is month #3, and I get a record:
1,"A","G","C"
What I want to do is compute the MD5 hash of the new fields: MD5("A"+"G"+"C") and compare the resulting hash with the hashes in the stored data.
If it doesn't match, then I add it as a new record.
I.e., Add record x to the stored data if there is no record a such that x.ID == a.ID AND MD5(x.F1 + x.F2 + x.F3) == a.stored_MD5_value [2 comparisons]
My question is "Can I compare the MD5 hash of, say, 50 fields without increasing the likelihood of clashes?"
Yes, practically, it should keep changing. Due to the pigeonhole principle, if you continue doing that enough, you should eventually get a collision, but it's impractical that you'll reach that point.
The security of the MD5 hash function is severely compromised. A collision attack exists that can find collisions within seconds on a computer with a 2.6Ghz Pentium4 processor (complexity of 224).
Further, there is also a chosen-prefix collision attack that can produce a collision for two chosen arbitrarily different inputs within hours, using off-the-shelf computing hardware (complexity 239).
The ability to find collisions has been greatly aided by the use of off-the-shelf GPUs. On an NVIDIA GeForce 8400GS graphics processor, 16-18 million hashes per second can be computed. An NVIDIA GeForce 8800 Ultra can calculate more than 200 million hashes per second.
These hash and collision attacks have been demonstrated in the public in various situations, including colliding document files and digital certificates.
See http://www.win.tue.nl/hashclash/On%20Collisions%20for%20MD5%20-%20M.M.J.%20Stevens.pdf
A number of projects have published MD5 rainbow tables online, that can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking.
I recently gave a interview to one of the TOP software company. I was completely stuck with only one question asked by interviewer to me, which was
Q. I have a machine with 512 mb / 1 GB RAM and I have to sort a file (XML, or any) of 4 GB size. How will I proceed? What will be the data structure, and which sorting algorithm will I use and how?
Do you think it is achievable? If yes then can you please explain?
Thanks in advance!
The answer the interviewer might want maybe how you manage to efficiently sort the data set which exceeds system memory.The following section is taken from Wikipedia:
Memory usage patterns and index
sorting
When the size of the array to be
sorted approaches or exceeds the
available primary memory, so that
(much slower) disk or swap space must
be employed, the memory usage pattern
of a sorting algorithm becomes
important, and an algorithm that might
have been fairly efficient when the
array fit easily in RAM may become
impractical. In this scenario, the
total number of comparisons becomes
(relatively) less important, and the
number of times sections of memory
must be copied or swapped to and from
the disk can dominate the performance
characteristics of an algorithm. Thus,
the number of passes and the
localization of comparisons can be
more important than the raw number of
comparisons, since comparisons of
nearby elements to one another happen
at system bus speed (or, with caching,
even at CPU speed), which, compared to
disk speed, is virtually
instantaneous.
For example, the popular recursive
quicksort algorithm provides quite
reasonable performance with adequate
RAM, but due to the recursive way that
it copies portions of the array it
becomes much less practical when the
array does not fit in RAM, because it
may cause a number of slow copy or
move operations to and from disk. In
that scenario, another algorithm may
be preferable even if it requires more
total comparisons.
One way to work around this problem,
which works well when complex records
(such as in a relational database) are
being sorted by a relatively small key
field, is to create an index into the
array and then sort the index, rather
than the entire array. (A sorted
version of the entire array can then
be produced with one pass, reading
from the index, but often even that is
unnecessary, as having the sorted
index is adequate.) Because the index
is much smaller than the entire array,
it may fit easily in memory where the
entire array would not, effectively
eliminating the disk-swapping problem.
This procedure is sometimes called
"tag sort".[5]
Another technique for overcoming the
memory-size problem is to combine two
algorithms in a way that takes
advantages of the strength of each to
improve overall performance. For
instance, the array might be
subdivided into chunks of a size that
will fit easily in RAM (say, a few
thousand elements), the chunks sorted
using an efficient algorithm (such as
quicksort or heapsort), and the
results merged as per mergesort. This
is less efficient than just doing
mergesort in the first place, but it
requires less physical RAM (to be
practical) than a full quicksort on
the whole array.
Techniques can also be combined. For
sorting very large sets of data that
vastly exceed system memory, even the
index may need to be sorted using an
algorithm or combination of algorithms
designed to perform reasonably with
virtual memory, i.e., to reduce the
amount of swapping required.
Use Divide and Conquer.
Here's the pseudocode:
function sortFile(file)
if fileTooBigForMemory(file)
pair<firstHalfOfFile, secondHalfOfFile> = breakIntoTwoHalves()
sortFile(firstHalfOfFile)
sortFile(secondHalfOfFile)
else
sortCharactersInFile(file)
endif
MergeTwoHalvesInOrder(firstHalfOfFile, secondHalfOfFile)
end
Two well-known algorithms that fall in to the divide and conquer category are merge sort and quick sort algorithm. So you could use them for implementation.
As for the data structure, a char array containing characters in the file could do. If you want to be more object oriented, wrap it in a class called File:
class File {
private char[] characters;
//methods to access and mutate 'characters'
}
There is a nice post on the Guido van Rossum blog which has something to suggest. Beware that the code is in Python.
Split your file to chunks which fit into memory.
Sort each chunk using quick sort and save it to a separate file.
Then merge result files and you get your result.
I would use a multiway merge. There is an excellent book called Managing Gigabytes that shows several different ways of doing it. They also go into sort based inversion for files that are larger than physical memory. Look around page 240 for a pretty detailed algorithm on sorting through chunks on disk.
The post above is correct in that you split the file and sort each portion.
Say you have the 4GB file and only want to load a max of 512MB. That means you need to split the file into 8 chunks minimum. If you are not sure how much extra overhead your sort is going to use, you might even double that number to be safe to 16 chunks.
The 16 files are then sorted one at a time to be in a guaranteed order. So now you have chunk 0-15 as sorted files.
Now you open 16 file handles to those files and read one entry at a time, writing the lowest one to the final output. Since you know each of the files is already sorted, taking the lowest from each means you are then writing them in the correct order to the final output.
I have used such a system in C# for sorting large collections of spam words from emails. The original system required all of them to load into RAM in order to sort them and build a dictionary for spam counts. Once the file grew over 2 GB the in memory structures were requiring 6+GB of RAM and taking over 24 hours to sort due to paging and VM. The new system using the chunking above sorted the entire file in under 40 minutes. That was an impressive speedup for such a simple change.
I played with various load options (1/4 system memory per chunk, etc). It turned out that for our situation the best option was about 1/10 system memory. Then Windows had enough memory left over for decent File I/O buffering to offset the increased file traffic. And the machine was left very responsive to other processes running on it.
And yes, I do frequently like to ask these types of questions in interviews as well. Just to see if people can think outside the box. What do you do when you can't just use .Sort() on a list?
Just simulate a virtual memory, overload the array index operator, []
Find a quicksort implementation that sorts an array in C++ or C#. overload the indexer operator [] which will read from and save to file. That way, you can just plug existing sort algorithms, you just change what happens behind the scenes on those []
here's one example of simulating virtual memory on C#
source: http://msdn.microsoft.com/en-us/library/aa288465(VS.71).aspx
// indexer.cs
// arguments: indexer.txt
using System;
using System.IO;
// Class to provide access to a large file
// as if it were a byte array.
public class FileByteArray
{
Stream stream; // Holds the underlying stream
// used to access the file.
// Create a new FileByteArray encapsulating a particular file.
public FileByteArray(string fileName)
{
stream = new FileStream(fileName, FileMode.Open);
}
// Close the stream. This should be the last thing done
// when you are finished.
public void Close()
{
stream.Close();
stream = null;
}
// Indexer to provide read/write access to the file.
public byte this[long index] // long is a 64-bit integer
{
// Read one byte at offset index and return it.
get
{
byte[] buffer = new byte[1];
stream.Seek(index, SeekOrigin.Begin);
stream.Read(buffer, 0, 1);
return buffer[0];
}
// Write one byte at offset index and return it.
set
{
byte[] buffer = new byte[1] {value};
stream.Seek(index, SeekOrigin.Begin);
stream.Write(buffer, 0, 1);
}
}
// Get the total length of the file.
public long Length
{
get
{
return stream.Seek(0, SeekOrigin.End);
}
}
}
// Demonstrate the FileByteArray class.
// Reverses the bytes in a file.
public class Reverse
{
public static void Main(String[] args)
{
// Check for arguments.
if (args.Length == 0)
{
Console.WriteLine("indexer <filename>");
return;
}
FileByteArray file = new FileByteArray(args[0]);
long len = file.Length;
// Swap bytes in the file to reverse it.
for (long i = 0; i < len / 2; ++i)
{
byte t;
// Note that indexing the "file" variable invokes the
// indexer on the FileByteStream class, which reads
// and writes the bytes in the file.
t = file[i];
file[i] = file[len - i - 1];
file[len - i - 1] = t;
}
file.Close();
}
}
Use the above code to roll your own array class. Then just use any array sorting algorithms.