Split a CSV file by content of first column without creating a copy?

I am trying to accomplish something similar to what was described in this thread: How to split a huge csv file based on content of first column?
There, the best solution seemed to be to use awk, which does do the job. However, I am dealing with very large CSV files, and I would like to split the file up without creating a new copy, since disk I/O speed is killing me. Is there a way to split the original file without creating a new copy?

I'm not really sure what you're asking, but if your question is: "Can I take a huge file on disk and split it 'in-place' so I get many smaller files without actually having to write those smaller files to disk?", then the answer is no.
You will need to iterate through the original file and write the "segments" back to disk as new files, regardless of whether you use awk, Python, or a text editor. You do not need to make a copy of the original file beforehand, though.
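For what it's worth, a minimal C sketch of that approach might look like the following. It assumes the key is everything up to the first comma, that lines fit in the buffer, that quoted CSV fields don't need special handling, and that the number of distinct keys is small; the file names (input.csv, out_<key>.csv) are made up for illustration.

#include <stdio.h>
#include <string.h>

#define MAX_KEYS 256
#define MAX_LINE 8192

int main(void) {
    static char keys[MAX_KEYS][128];   /* distinct first-column values seen so far */
    static FILE *outs[MAX_KEYS];       /* one output file per distinct key */
    int nkeys = 0;
    char line[MAX_LINE], key[128], path[160];

    FILE *in = fopen("input.csv", "r");
    if (!in) { perror("input.csv"); return 1; }

    while (fgets(line, sizeof line, in)) {
        /* The key is the text up to the first comma (or end of line). */
        size_t len = strcspn(line, ",\r\n");
        if (len >= sizeof key) len = sizeof key - 1;
        memcpy(key, line, len);
        key[len] = '\0';

        /* Find the output file for this key, opening it on first use. */
        int i;
        for (i = 0; i < nkeys; i++)
            if (strcmp(keys[i], key) == 0) break;
        if (i == nkeys) {
            if (nkeys == MAX_KEYS) { fprintf(stderr, "too many distinct keys\n"); return 1; }
            snprintf(path, sizeof path, "out_%s.csv", key);
            outs[i] = fopen(path, "w");
            if (!outs[i]) { perror(path); return 1; }
            strcpy(keys[i], key);
            nkeys++;
        }
        fputs(line, outs[i]);   /* write the whole original line, newline included */
    }

    for (int i = 0; i < nkeys; i++) fclose(outs[i]);
    fclose(in);
    return 0;
}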

"Splitting a file" still requires RAM and disk I/O. There's no way around that; it's just how the world works.
However, you can certainly reduce the impact of I/O-bound processes on your system. Some obvious solutions are:
Use a RAM disk to reduce disk I/O.
Use a SAN disk to reduce local disk I/O.
Use an I/O scheduler to rate-limit your disk I/O. For example, most Linux systems support the ionice utility for this purpose.
Chunk up the file and use batch queues to reduce CPU load.
Use nice to reduce CPU load during file processing.
If you're dealing with files, then you're dealing with I/O. It's up to you to make the best of it within your system constraints.
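For instance (assuming a POSIX system, and just as a sketch), a long-running file-processing program can lower its own CPU priority with the standard setpriority() call, which is roughly what running it under nice does:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* 19 is the "nicest" (lowest) CPU priority; who = 0 means the calling process.
       This is the in-program equivalent of launching the tool with `nice -n 19`.
       I/O priority (what ionice controls) is a separate, Linux-specific knob. */
    if (setpriority(PRIO_PROCESS, 0, 19) != 0)
        perror("setpriority");

    /* ... do the heavy file splitting / copying here ... */
    puts("processing files at low CPU priority");
    return 0;
}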

Related

Best way to store large read-only files to be accessed from multiple nodes

I have a small collection of files (fewer than 50), each about 4 GB on average. I would like to access those files from different servers with as little latency as possible. How would you do it?
I at first considered using some kind of database, but MySQL has limitations for files larger than 2 GB.
The interface I'm looking for is really simple: I would like to publish files associated with a key and retrieve them using that key (like with Redis, but Redis is for caching and has a maximum value size of 512 MB).
If you are simply trying to make a file available with as little latency as possible, you're looking for a CDN (content delivery network).
A CDN will place the file in multiple data centers around the world and serve it from the one that is physically closest to the user, reducing latency as much as possible. CDNs are also optimized for retrieving and sending files as quickly as possible. Once the file is cached, the request won't even have to go to your server; it will just be between the CDN and the end user.
The "key" can simply be the file name.
When it comes to a file, storing it as a file is (almost*) always going to be fastest. If you stored it in a DB, you'd only be adding additional latency for retrieving the file. It's always going to be at least a little bit slower than retrieving the file directly.
There are a number of CDNs out there. AWS' CloudFront is probably the most accessible and cheapest (at least initially). Akamai is probably the largest. MaxCDN is also a good option.
*: From a super-technical, pure-speed perspective of just retrieving the file, keeping it in memory (RAM) instead of on storage (hard disk) will likely be faster, which you can do easily with a database, though you could also do it with custom file-system drivers. With a CDN you lose that kind of low-level control, but the benefit of a CDN's distributed servers is going to far outweigh it.
The lowest latency would come from reading the files from storage local to the app that reads them; that would eliminate all network latency.
Then the task becomes: how do you keep copies of these files in sync across your different servers? I'd consider using SyncThing to help with this.

Which is better: many files or a single file with a lot of data?

I am working with a lot of separate data entries and unfortunately do not know SQL, so I need to know which is the faster method of storing data.
I have several hundred, if not thousands, of individual files storing user data. In this case they are all lists of strings and nothing else, so I have been listing the entries line by line, as below, and accessing the files as needed. Encryption is not necessary.
test
buyhome
foo
etc. (About 75 or so entries)
More recently I have learned how to use JSON and had this question: Would it be faster to leave these as individual files to read as necessary, or as a very large JSON file I can keep in memory?
In-memory access will always be much faster than disk access; however, if your in-memory data is modified and the system crashes, you will lose any of that data that has not been saved to some form of persistent storage.
Given the amount of data you say you are working with, you really should be using a database of some sort. Either drop everything and go learn some SQL (the basics are not that hard) or leverage what you know about JSON and look into a NoSQL database like MongoDB.
You will find that using the right tool for the job often saves you more time in the long run than trying to force the tool you currently have to work. Even if you need to invest some time upfront to learn something new.
First thing: DO NOT keep the data only in memory. Unless you are building a portal like SO or Reddit, RAM as your primary storage is a bad idea.
Second thing: reading a file is slow, and opening and closing a file is slow too. Try to keep the number of files as low as possible (see the sketch below).
If you are going to use each and every one of those files (the key issue is EVERY), keep them together. If you will only need some of them at a time, store them separately.
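To get a feel for the open/close cost on your own machine, a rough C micro-benchmark like the one below (the file names and counts are made up, and POSIX clock_gettime() is assumed) compares reading many small files against reading the same data from one combined file:

#include <stdio.h>
#include <time.h>

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    char buf[4096], path[64];
    struct timespec t0, t1;

    /* Case 1: one open/read/close per small file. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++) {
        snprintf(path, sizeof path, "data/user_%d.txt", i);
        FILE *fp = fopen(path, "r");
        if (!fp) continue;
        while (fgets(buf, sizeof buf, fp)) { /* just consume the lines */ }
        fclose(fp);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("1000 small files: %.3f s\n", seconds(t0, t1));

    /* Case 2: the same data concatenated into a single file. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    FILE *fp = fopen("data/all_users.txt", "r");
    if (fp) {
        while (fgets(buf, sizeof buf, fp)) { /* just consume the lines */ }
        fclose(fp);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("one combined file: %.3f s\n", seconds(t0, t1));
    return 0;
}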

Can massive writing & deleting files hurt our server performance?

We run a system that, for caching purposes, currently writes and deletes about 1,000 small files (roughly 10 KB each) every hour.
In the near future this number will rise to about 10,000-20,000 files being written and deleted every hour.
For every file that is written, a new row is added to our MySQL DB, and that row is deleted when the file is deleted an hour later.
My question:
Can this excessive write & delete operation hurt our server performance eventually somehow?
(btw we currently run this on a VPS and soon on a dedicated server.)
Can writing and deleting so many rows eventually slow our DB?
This depends greatly on the operating system, the file system, and how file-system caching is configured. It also depends on whether your database is stored on the same disk as the files being written and deleted.
Usually, operations that affect file-system structure, such as file creation and deletion, require some synchronous disk I/O so that the operating system will not lose these changes after a power failure. However, some operating systems and file systems support a more relaxed policy for this; for example, the UFS file system on FreeBSD has a nice "soft updates" option that does this, and ext3 on Linux probably has a similar feature.
Once you move to a dedicated server, I think it would be reasonable to attach several HDDs and make sure the database is stored on one disk while the massive file operations are performed on another. In that case DB performance should not be affected.
You should do some calculations and estimate the throughput your storage needs. In your worst-case scenario, 20,000 files × 10 KB = 200 MB per hour (roughly 55 KB/s), which is a very low requirement.
Deleting a file, on modern file systems, takes very little time.
In my opinion you don't have to worry, especially if your application creates and deletes files sequentially.
Consider also that modern operating systems cache parts of the file system in memory to improve performance and reduce disk access (this is especially true for multiple deletes).
Your database will grow, but engines are optimized for that; there is no need to worry about it.
The only downside is that handling many small files can cause disk fragmentation if your file system is susceptible to it.
For a performance bonus, you should consider using separate physical storage for these files (e.g. a different disk drive or disk array) so that they get the full transfer bandwidth without interference from other I/O.

Stream a file to MySQL in C

I've been searching all over the place for streaming a file into MySQL using C, and I can't find anything. This is pretty easy to do in C++, C#, and many other languages, but I can't find anything for straight C.
Basically, I have a file, and I want to read that file into a TEXT or BLOB column in my MySQL database. This can be achieved fairly easily by looping through the file and using successive CONCAT() calls to append the data to the column. However, I don't think that is a very elegant solution, and it is probably quite error-prone.
I've looked into the prepared statements using mysql_stmt_init() and all the binds, etc, but it doesn't seem to accept a FILE pointer to read the data into the database.
It is important to note I am working with very large files that cannot be stored in RAM, so reading the entire file into a temporary variable is out of the question.
Simply put: how can I read a file from disk into a MySQL database using C? And keep in mind, there needs to be some type of buffer (ie, BUFSIZ due to the size of the files). Has anyone achieved this? Is it possible? And I'm looking for a solution that works both with text and binary files.
Can you use LOAD DATA INFILE in a call to mysql_query()?
#define STMT_SIZE 1024
char statement[STMT_SIZE];
snprintf(statement, STMT_SIZE, "LOAD DATA INFILE '%s' INTO TABLE `%s`",
         filename, tablename);
if (mysql_query(conn, statement) != 0)
    fprintf(stderr, "LOAD DATA failed: %s\n", mysql_error(conn));
See
http://dev.mysql.com/doc/refman/5.6/en/load-data.html and http://dev.mysql.com/doc/refman/5.6/en/mysql-query.html for the corresponding pages in the MySQL docs.
You can use a loop to read through the file, but instead of using a function like fgets() that reads one line at a time, use a lower-level function like read() or fread() that will fill an arbitrary-sized buffer at a time:
allocate large buffer
open file
while NOT end of file
    fill buffer
    CONCAT to MySQL
close file
release buffer
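In C, that loop might look roughly like the sketch below. The table name (files), column (data), and row id are made-up placeholders, conn is an already-connected MYSQL *, and mysql_real_escape_string() is used so the approach also works for binary data:

#include <stdio.h>
#include <mysql/mysql.h>

int append_file_to_row(MYSQL *conn, const char *path, unsigned long id) {
    char buf[BUFSIZ];
    char esc[2 * BUFSIZ + 1];        /* worst case: every byte escapes to two chars */
    char query[2 * BUFSIZ + 128];
    FILE *fp = fopen(path, "rb");
    size_t n;

    if (!fp) { perror(path); return -1; }
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* Escape the raw chunk, then append it to the (assumed) files.data column. */
        mysql_real_escape_string(conn, esc, buf, n);
        snprintf(query, sizeof query,
                 "UPDATE files SET data = CONCAT(IFNULL(data, ''), '%s') WHERE id = %lu",
                 esc, id);
        if (mysql_query(conn, query) != 0) {
            fprintf(stderr, "append failed: %s\n", mysql_error(conn));
            fclose(fp);
            return -1;
        }
    }
    fclose(fp);
    return 0;
}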
I don't like answering my own questions, but I feel the need in case someone else is looking for a solution to this down the road.
Unless I'm missing something, my research and testing has shown me that I have three general options:
Decent Solution: use a LOAD DATA INFILE statement to send the file
pros: only one statement will ever be needed. Unlike loading the entire file into memory, you can tune the performance of LOAD DATA on both the client and the server to use a given buffer size, and you can make that buffer much smaller, which will give you "better" buffer control without making numerous calls
cons: First of all, the file absolutely MUST be in a given format, which can be difficult to arrange for binary blob files. Also, this takes a fair amount of work to set up and requires a lot of tuning. By default, the client will try to load the entire file into memory and use swap space for the part of the file that does not fit. It's very easy to get terrible performance here, and every time you wish to make a change you have to restart the MySQL server.
Decent Solution: Have a buffer (e.g., char buf[BUFSIZ]), and make numerous queries with CONCAT() calls to update the content
pros: uses the least amount of memory, and gives the program better control over how much memory is being used
cons: takes up A LOT of processing time because you are making numerous mysql calls, and the server has to find the given row, and then append a string to it (which takes time, even with caching)
Worst Solution: Try to load the entire file into memory (or as much as possible), and make only one INSERT or UPDATE call to mysql
pros: limits the amount of processing performance needed on the client, as only a minimum number of calls (preferably one) will need to be buffered and executed.
cons: takes up a TON of memory. If you have numerous clients making these large calls simultaneously, the server will run out of memory quickly, and any performance gains will turn to losses very quickly.
In a perfect world, MySQL would implement a feature that allowed for buffered queries, something akin to buffering a video: you open a MySQL connection, then within that open a 'query connection', stream the data in buffered chunks, and then close the 'query connection'.
However, this is NOT a perfect world, and there is no such thing in MySQL. That leaves us with the three options shown above. I decided to stick with the second, making numerous CONCAT() calls, because my current server has plenty of processing time to spare and I'm very limited on memory in the clients. For my particular situation, trying to beat my head against tuning LOAD DATA INFILE doesn't make sense. Every application, however, will have to analyze its own problem.
I'll stress none of these are "perfect" for me, but you can only do the best with what you have.
Points to Adam Liss for giving the LOAD DATA INFILE direction.

Simultaneous 2 or more reads/writes to/from a hard disk

It is said that there is only one spindle in a hard disk that reads or writes data, so how is it possible to write or read two or more pieces of data to/from the hard disk SIMULTANEOUSLY? The operating system used is Windows XP. Example: I need to copy two different movies from a pen drive to the hard disk, so I select both movies, copy them from the pen drive, and paste them into a disk partition; the copying of the two movies to the hard disk happens simultaneously. How does this happen?
These operations aren't simultaneous at all, but the operating system manages both operations concurrently.
What happens is that the file manager (say, Windows Explorer) tells the operating system to copy a file from one location to another, once for each of the two copy operations.
The operating system breaks this command across two parts of its own system, the "filesystem" and the "disk driver". The file system works out what blocks on what disk are associated with the particular files in question, and tells the disk driver to read or write to those blocks.
The disk driver builds up a queue of reads and writes and figures out the most efficient way to satisfy them. A desktop operating system will usually try to service those requests quickly, to keep the system as responsive as possible, while a server operating system will queue up block operations for as long as possible so it can handle them in whatever order makes the most efficient use of the disk.
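As a toy illustration of that reordering idea (not what any real driver does verbatim), the sketch below takes pending block requests from two interleaved copies and services them in a single ascending sweep instead of arrival order:

#include <stdio.h>
#include <stdlib.h>

static int by_block(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a, y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* Arrival order: two copy operations interleave requests far apart on disk. */
    unsigned long pending[] = { 90210, 1024, 90211, 1025, 90212, 1026 };
    size_t n = sizeof pending / sizeof pending[0];

    /* A simplified elevator pass: sort so the head sweeps in one direction. */
    qsort(pending, n, sizeof pending[0], by_block);

    printf("service order:");
    for (size_t i = 0; i < n; i++)
        printf(" %lu", pending[i]);
    printf("\n");
    return 0;
}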
Once the disk driver decides to act on a block operation, it tells the disk to move its head and read or write some data. The result of the action is then passed back to the file system, and ultimately to the user application.
The fact that the operations appear simultaneous is only an illusion created by the multitasking facilities of the operating system. This is pretty easy to discern, since multiple concurrent file copies take a little longer than a single copy (or sometimes a LOT longer, if you're trying to do a bunch at the same time).
Of course, the OS can still drive two separate disks simultaneously if the files really are on different physical disks.