I've been searching all over the place for a way to stream a file into MySQL using C, and I can't find anything. This is pretty easy to do in C++, C#, and many other languages, but I can't find anything for straight C.
Basically, I have a file, and I want to read that file into a TEXT or BLOB column in my MySQL database. This can be achieved pretty easily by looping through the file and using subsequent CONCAT() calls to append the data to the column. However, I don't think this is an elegant solution, and it is probably very error prone.
I've looked into prepared statements using mysql_stmt_init() and all the binds, etc., but they don't seem to accept a FILE pointer for reading the data into the database.
It is important to note I am working with very large files that cannot be stored in RAM, so reading the entire file into a temporary variable is out of the question.
Simply put: how can I read a file from disk into a MySQL database using C? Keep in mind that there needs to be some type of buffering (i.e., reading BUFSIZ bytes at a time, due to the size of the files). Has anyone achieved this? Is it possible? I'm looking for a solution that works with both text and binary files.
Can you use LOAD DATA INFILE in a call to mysql_query()?
char statement[STMT_SIZE];
snprintf(statement, STMT_SIZE, "LOAD DATA INFILE '%s' INTO TABLE `%s`",
         filename, tablename);
mysql_query(conn, statement);
See http://dev.mysql.com/doc/refman/5.6/en/load-data.html and http://dev.mysql.com/doc/refman/5.6/en/mysql-query.html for the corresponding pages in the MySQL docs.
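One caveat with building the statement via snprintf(): a filename containing a quote or backslash will break (or inject into) the SQL string literal. A real client would run the filename through mysql_real_escape_string(); this self-contained sketch simply refuses such names so it can be shown without a live connection:

```c
#include <stdio.h>
#include <string.h>

#define STMT_SIZE 1024

/* Build the LOAD DATA statement. Returns 0 on success, -1 if the
 * filename contains characters that would break the quoted literal.
 * (With a live connection you would escape via mysql_real_escape_string()
 * instead of rejecting the name outright.) */
int build_load_stmt(char *stmt, size_t size,
                    const char *filename, const char *tablename)
{
    if (strpbrk(filename, "'\\") != NULL)
        return -1;  /* refuse rather than emit broken or injectable SQL */
    snprintf(stmt, size, "LOAD DATA INFILE '%s' INTO TABLE `%s`",
             filename, tablename);
    return 0;
}
```

The resulting string is what you would then hand to mysql_query(conn, stmt).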
You can use a loop to read through the file, but instead of using a function like fgets() that reads one line at a time, use a lower-level function like read() or fread() that will fill an arbitrary-sized buffer at a time:
allocate large buffer
open file
while NOT end of file
fill buffer
CONCAT to MySQL
close file
release buffer
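Fleshing that loop out in C: encoding each chunk as a MySQL hex literal (x'...') sidesteps quoting entirely, so the same code handles text and binary files. The files table, data column, and id key are placeholders, the row is assumed to already exist with an empty column, and the actual mysql_query() call is left as a comment so the sketch stays self-contained:

```c
#include <stdio.h>
#include <stdlib.h>

/* Build one "append this chunk" statement from a raw buffer, encoding
 * the bytes as a hex literal so binary data needs no escaping. */
char *chunk_to_update_stmt(const unsigned char *buf, size_t n, long id)
{
    char *stmt = malloc(128 + 2 * n);  /* prefix + 2 hex chars/byte + suffix */
    if (!stmt) return NULL;
    int off = sprintf(stmt, "UPDATE files SET data = CONCAT(data, x'");
    for (size_t i = 0; i < n; i++)
        off += sprintf(stmt + off, "%02X", buf[i]);
    sprintf(stmt + off, "') WHERE id = %ld", id);
    return stmt;
}

/* Streaming loop: each fread() chunk becomes one UPDATE ... CONCAT call. */
int stream_file(const char *path, long id)
{
    unsigned char buf[BUFSIZ];
    size_t n;
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        char *stmt = chunk_to_update_stmt(buf, n, id);
        if (!stmt) { fclose(fp); return -1; }
        /* mysql_query(conn, stmt);  -- send one chunk to the server */
        free(stmt);
    }
    fclose(fp);
    return 0;
}
```

Each statement stays bounded at roughly 2 × BUFSIZ bytes, so memory use is fixed regardless of file size.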
I don't like answering my own questions, but I feel the need in case someone else is looking for a solution to this down the road.
Unless I'm missing something, my research and testing has shown me that I have three general options:
Decent Solution: use a LOAD DATA INFILE statement to send the file
pros: only one statement is ever needed. Unlike loading the entire file into memory, you can tune the performance of LOAD DATA on both the client and the server to use a given buffer size, and you can make that buffer much smaller, which gives you finer buffer control without making numerous calls
cons: First of all, the file absolutely MUST be in a given format, which can be difficult to achieve with binary blob files. It also takes a fair amount of work to set up and requires a lot of tuning. By default, the client will try to load the entire file into memory and use swap space for the portion of the file that does not fit. It's very easy to get terrible performance here, and every time you wish to change a setting you have to restart the MySQL server.
Decent Solution: Have a buffer (eg, char buf[BUFSIZ]), and make numerous queries with CONCAT() calls to update the content
pros: uses the least amount of memory, and gives the program better control over how much memory is being used
cons: takes up A LOT of processing time because you are making numerous mysql calls, and the server has to find the given row, and then append a string to it (which takes time, even with caching)
Worst Solution: Try to load the entire file into memory (or as much as possible), and make only one INSERT or UPDATE call to mysql
pros: limits the amount of processing performance needed on the client, as only a minimum number of calls (preferably one) will need to be buffered and executed.
cons: takes up a TON of memory. If you have numerous clients making these large calls simultaneously, the server will run out of memory quickly, and any performance gains will turn to losses very quickly.
In a perfect world, MySQL would implement a feature which allowed for buffering queries, something akin to buffering a video: you open a MySQL connection, then within that open a 'query connection' and stream the data in buffered sets, then close the 'query connection'
However, this is NOT a perfect world, and there is no such thing in MySQL. This leaves us with the three options shown above. I decided to stick with the second, making numerous CONCAT() calls, because my current server has plenty of processing time to spare and I'm very limited on memory in the clients. For my particular situation, trying to beat my head against tuning LOAD DATA INFILE doesn't make sense. Every application, however, will have to analyze its own problem.
I'll stress that none of these are "perfect" for me, but you can only do the best with what you have.
Points to Adam Liss for giving the LOAD DATA INFILE direction.
Related
I am working with a lot of separate data entries and unfortunately do not know SQL, so I need to know which is the faster method of storing data.
I have several hundred, if not thousands, of individual files storing user data. In this case they are all lists of strings and nothing else, so I have been listing the entries line by line, accessing the files as needed. Encryption is not necessary.
test
buyhome
foo
etc. (About 75 or so entries)
More recently I have learned how to use JSON and had this question: Would it be faster to leave these as individual files to read as necessary, or as a very large JSON file I can keep in memory?
In-memory access will always be much faster than disk access. However, if your in-memory data is modified and the system crashes, you will lose that data unless it has been saved to some form of persistent storage.
Given the amount of data you say you are working with, you really should be using a database of some sort. Either drop everything and go learn some SQL (the basics are not that hard) or leverage what you know about JSON and look into a NoSQL database like MongoDB.
You will find that using the right tool for the job often saves you more time in the long run than trying to force the tool you currently have to work. Even if you need to invest some time upfront to learn something new.
First thing: DO NOT keep the data in memory. Unless you are building a portal like SO or Reddit, RAM as storage is a bad idea.
Second thing: reading a file is slow. Opening and closing a file is slow, too. Try to keep the number of files as low as possible.
If you are going to use each and every one of those files (the key issue is EVERY), keep them together. If you will only ever need some of them, store them separately.
I am currently importing a huge CSV file from my iPhone to a Rails server. The server parses the data and then starts inserting rows into the database. The CSV file is fairly large, and the operation takes a long time to complete.
Since I am doing this asynchronously, my iPhone is then able to go to other views and do other stuff.
However, when the phone requests another query on another table, it will HANG because the first operation is still inserting the CSV's information into the database.
Is there a way to resolve this type of issue?
As long as the phone doesn't care when the database insert is complete, you might want to try storing the CSV file in a tmp directory on your server and then have a script write from that file to the database. Or simply store it in memory. That way, once the phone has posted the CSV file, it can move on to other things while the script handles the database inserts asynchronously. And yes, @Barmar is right about using the InnoDB engine rather than MyISAM (which may be the default in some configurations).
Or, you might want to consider enabling "low-priority updates" which will delay write calls until all pending read calls have finished. See this article about MySQL table locking. (I'm not sure what exactly you say is hanging: the update, or reads while performing the update…)
Regardless, if you are posting the data asynchronously from your phone (i.e., not from the UI thread), it shouldn't be an issue as long as you don't try to use more than the maximum number of concurrent HTTP connections.
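The low-priority-updates option mentioned above can be enabled globally, per session, or per statement; note that it only matters for engines with table-level locking, such as MyISAM. The imports table and column here are just placeholders:

```sql
-- Delay writes from this session until no pending reads remain:
SET low_priority_updates = 1;

-- Or mark a single statement as low priority:
INSERT LOW_PRIORITY INTO imports (row_data) VALUES ('...');
```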
I am trying to accomplish something similar to what was described in this thread: How to split a huge csv file based on content of first column?
There, the best solution seemed to be to use awk, which does the job. However, I am dealing with truly massive CSV files, and I would like to split the file without creating a new copy, since the disk I/O speed is killing me. Is there a way to split the original file without creating a new copy?
I'm not really sure what you're asking, but if your question is: "Can I take a huge file on disk and split it 'in-place' so I get many smaller files without actually having to write those smaller files to disk?", then the answer is no.
You will need to iterate through the first file and write the "segments" back to disk as new files, regardless of whether you use awk, Python or a text editor for this. You do not need to make a copy of the first file beforehand, though.
"Splitting a file" still requires RAM and disk I/O. There's no way around that; it's just how the world works.
However, you can certainly reduce the impact of I/O-bound processes on your system. Some obvious solutions are:
Use a RAM disk to reduce disk I/O.
Use a SAN disk to reduce local disk I/O.
Use an I/O scheduler to rate-limit your disk I/O. For example, most Linux systems support the ionice utility for this purpose.
Chunk up the file and use batch queues to reduce CPU load.
Use nice to reduce CPU load during file processing.
If you're dealing with files, then you're dealing with I/O. It's up to you to make the best of it within your system constraints.
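The iterate-and-write-segments approach can be sketched in C as a single streaming pass. The output filenames are taken directly from the first field (so this assumes the keys are filesystem-safe), and re-opening in append mode on every line keeps the sketch short at the cost of speed; a real tool would cache open FILE handles:

```c
#include <stdio.h>
#include <string.h>

/* One-pass split: stream the input and append each line to a file named
 * after its first comma-separated field, e.g. "a,1" goes to "a.csv". */
int split_by_first_column(const char *path)
{
    char line[4096], name[300];
    FILE *in = fopen(path, "r");
    if (!in) return -1;
    while (fgets(line, sizeof line, in)) {
        size_t keylen = strcspn(line, ",\n");       /* length of first field */
        snprintf(name, sizeof name, "%.*s.csv", (int)keylen, line);
        FILE *out = fopen(name, "a");               /* append, create if new */
        if (!out) { fclose(in); return -1; }
        fputs(line, out);
        fclose(out);
    }
    fclose(in);
    return 0;
}
```

Only one line is held in memory at a time, so RAM use stays constant no matter how large the input file is.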
I'm writing a webcrawler in Python that will store the HTML code of a large set of pages in a MySQL database. I'd like to make sure my methods of storage and processing are optimal before I begin processing data. I would like to:
Minimize storage space used in the database - possibly by minifying HTML code, Huffman encoding, or some other form of compression. I'd like to maintain the possibility of fulltext searching the field - I don't know if compression algorithms like Huffman encoding will allow this.
Minimize the processor usage necessary to encode and store large volumes of rows.
Does anyone have any suggestions or experience in this or a similar issue? Is Python the optimal language to be doing this in, given that it's going to require a number of HTTP requests and regular expressions plus whatever compression is optimal?
If you don't mind the HTML being opaque to MySQL, you can use the COMPRESS function to store the data and UNCOMPRESS to retrieve it. You won't be able to use the HTML contents in a WHERE clause (using, e.g., LIKE).
Do you actually need to store the source in the database?
Trying to run LIKE queries against that data is going to suck big time anyway.
Store the raw data on the file system as standard files. Just don't stick them all in one folder; use hashes of the id to store them in predictable folders.
(While it is of course perfectly possible to store the text in the database, it bloats the size of your database and makes it harder to work with. Backups are (much!) bigger, changing storage engine becomes more painful, etc. Scaling your filesystem is usually just a case of adding another hard disk; that doesn't work so easily with a database, where you soon need to shard.)
To do any sort of searching on the data, you're looking at building an index. I only have experience with SphinxSearch, but that allows you to specify a filename in the input database.
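The hashed-folder scheme suggested above might look like this in C. The two-level, two-digit layout and the .html suffix are arbitrary choices for this sketch:

```c
#include <stdio.h>

/* Derive a two-level folder path from a numeric page id, e.g.
 * id 123456 -> "56/34/123456.html". Using the low digits spreads
 * sequential ids evenly across the folders. */
void page_path(char *out, size_t size, unsigned long id)
{
    snprintf(out, size, "%02lu/%02lu/%lu.html",
             id % 100, (id / 100) % 100, id);
}
```

With this layout no directory ever holds more than 100 subdirectories, which keeps directory listings fast on most filesystems.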
What do Repair and Compact operations do to an .MDB?
If these operations do not stop a 1GB+ .MDB backed VB application crashing, what other options are there?
Why would a large sized .MDB file cause an application to crash?
"What do compact and repair operations do to an MDB?"
First off, don't worry about repair. The fact that there are still commands that purport to do a standalone repair is a legacy of the old days. The behavior of that command changed greatly starting with Jet 3.51 and has remained the same since: a repair will never be performed unless Jet/ACE determines that it is necessary. When you do a compact, it will test whether a repair is needed and perform it before the compact.
So, what does it do?
A compact/repair rewrites the data file, eliminating any unused data pages, writing tables and indexes in contiguous data pages, and flagging all saved QueryDefs for re-compilation the next time they are run. It also updates certain metadata for the tables, as well as other metadata and internal structures in the header of the file.
All databases have some form of "compact" operation because they are optimized for performance. Disk space is cheap, so instead of writing data so as to use storage efficiently, they write to the first available space. Thus, in Jet/ACE, if you update a record, the record is written back to the original data page only if the new data fits within that page. If not, the original data page is marked unused and the record is rewritten to an entirely new data page. The file can therefore become internally fragmented, with used and unused data pages mixed throughout.
A compact organizes everything neatly and gets rid of all the slack space. It also rewrites data tables in primary key order (Jet/ACE clusters on the PK, but that's the only index you can cluster on). Indexes are also rewritten at that point, since over time those become fragmented with use, also.
Compact is an operation that should be part of regular maintenance of any Jet/ACE file, but you shouldn't have to do it often. If you're experiencing regular significant bloat, then it suggests that you may be mis-using your back-end database by storing/deleting temporary data. If your app adds records and deletes them as part of its regular operations, then you have a design problem that's going to make your data file bloat regularly.
To avoid that bloat, move the temp tables to a separate standalone MDB/ACCDB so that the churn won't cause your main data file to bloat.
On another note, not applicable in this context: front ends bloat in different ways because of the nature of what's stored in them. Since this question is about an MDB/ACCDB used from VB, I won't go into details, but suffice it to say that compacting a front end is something that's necessary during development, but only very seldom in production use. The only reason to compact a production front end is to update metadata and recompile the queries stored in it.
It's always been the case that MDB files become slow and prone to corruption as they grow past 1GB, but I've never known why; it's always been just a fact of life. I did some quick searching and can't find any official, or even well-informed insider, explanations of why this size is correlated with MDB problems, but my experience has always been that MDB files become incredibly unreliable as you approach and exceed 1GB.
Here's the MS KB article about Repair and Compact, detailing what happens during that operation:
http://support.microsoft.com/kb/209769/EN-US/
The app probably crashes as the result of improper/unexpected data returned from a database query to an MDB that large - what error in particular do you get when your application crashes? Perhaps there's a way to catch the error and deal with it instead of just crashing the application.
If it is crashing a lot then you might want to try a decompile on the DB and/or making a new database and copying all the objects over to the new container.
Try the decompile first. To do that, just add the /decompile flag to the startup options of your DB, for example:
"C:\Program Files\Microsoft Office\Office\msaccess.exe" "C:\mydb.mdb" /decompile
Then compact, compile, and then compact again.
EDIT:
You can't do it without Access being installed, and if the file is just storing data then a decompile will not do you any good. You can, however, look at JetComp to help with your compacting needs:
http://support.microsoft.com/kb/273956