It seems that the word "dump" or "dumping" is often used when a program retrieves the contents of memory, or some other data that we cannot get at directly. So when is it proper to use that word?
Dump is typically used when you are getting the raw data, right from the disk or RAM, unfiltered.
I'm testing an application written in Qt that deals with PDFs saved in a database. I was having trouble saving anything larger than about 1 MB; the application would crash. After some reading on Google I ended up changing max_allowed_packet, which let me save blobs.
I plotted several uploads of PDFs of different sizes and measured roughly 200 KB/sec when saving files. Then came my surprise: checking the database, I realized that anything over around 5 MB would not be stored. There is no error, and the handshake between the application and MySQL seems to go fine, as I don't get any errors.
I have some experience with MySQL and Oracle, but I have never dealt with BLOBs.
I read in a post somewhere that I should try changing the value of innodb_log_file_size (I tried 2000000000), but MySQL tells me that it is a read-only variable. Could somebody help me fix this problem? I'm running MySQL on Ubuntu.
It's not surprising that you're hitting a wall around 5 MB, because the default InnoDB log file size is 48 MB (50331648 bytes).
Your InnoDB log file size must be at least 10x the size of the largest blob you try to save. In other words, you can save a blob only if it is no larger than 1/10th of the log file size; with the default 48 MB, that caps blobs at roughly 4.8 MB, which matches the ~5 MB cutoff you're seeing. This started being enforced in MySQL 5.6; before that it was recommended in the manual, but not enforced.
You can change the log file size, but it requires restarting the MySQL Server. The steps are documented here: https://dev.mysql.com/doc/refman/5.7/en/innodb-data-log-reconfiguration.html
P.S. As for the comments about storing images in the database vs. as files on disk, this is a long debate. Some people will make unequivocal statements that it's bad to store images in the database, but there are pros and cons on both sides of the argument. See my answer to Should I use MySQL blob field type?
Bill has the immediate answer. But I suspect you will keep getting bigger and bigger documents, and keep hitting one limit after another.
Meanwhile, unless you have a very recent MySQL, changing the innodb_log_file_size is a pain.
If you ever get to 1 GB, you will hit the limit for max_allowed_packet. Even if you get past that, you will hit another hard limit: 4 GB, the maximum size of LONGBLOB or LONGTEXT.
So I suggest you bite the bullet and take one of two approaches:
Plan A: Put documents in the file system, or
Plan B: Chunk the documents into pieces, storing into multiple BLOB rows. This will avoid all limits, even the 4GB limit. But the code will be messy on input and output.
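As a rough illustration of Plan B, here is a hedged sketch in C using the MySQL C API. The table name doc_chunks and its columns (doc_id, seq, data) are assumptions made for the example, and error handling is kept minimal:

#include <stdio.h>
#include <string.h>
#include <mysql.h>

#define CHUNK_SIZE (1024 * 1024)   /* 1 MB per row; keep well under max_allowed_packet */

/* Assumed schema: CREATE TABLE doc_chunks (doc_id INT, seq INT, data MEDIUMBLOB) */
int store_document(MYSQL *conn, int doc_id, const char *path)
{
    static char chunk[CHUNK_SIZE];
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    MYSQL_STMT *stmt = mysql_stmt_init(conn);
    const char *sql = "INSERT INTO doc_chunks (doc_id, seq, data) VALUES (?, ?, ?)";
    if (mysql_stmt_prepare(stmt, sql, strlen(sql)) != 0) { fclose(fp); return -1; }

    int seq = 0;
    unsigned long chunk_len = 0;

    MYSQL_BIND bind[3];
    memset(bind, 0, sizeof(bind));
    bind[0].buffer_type = MYSQL_TYPE_LONG;  bind[0].buffer = &doc_id;
    bind[1].buffer_type = MYSQL_TYPE_LONG;  bind[1].buffer = &seq;
    bind[2].buffer_type = MYSQL_TYPE_BLOB;  bind[2].buffer = chunk;
    bind[2].buffer_length = CHUNK_SIZE;     bind[2].length = &chunk_len;
    mysql_stmt_bind_param(stmt, bind);

    size_t n;
    int rc = 0;
    while ((n = fread(chunk, 1, CHUNK_SIZE, fp)) > 0) {
        chunk_len = (unsigned long)n;        /* picked up via the bound length pointer */
        if (mysql_stmt_execute(stmt) != 0) { rc = -1; break; }
        seq++;
    }

    mysql_stmt_close(stmt);
    fclose(fp);
    return rc;
}

Reassembly is the same loop in reverse: SELECT data FROM doc_chunks WHERE doc_id = ? ORDER BY seq, writing each chunk out as it arrives.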
I am working with a lot of separate data entries and unfortunately do not know SQL, so I need to know which is the faster method of storing data.
I have several hundred, if not thousands of, individual files storing user data. In this case they are all lists of Strings and nothing else, so I have been writing them line by line as shown below and reading the files as needed. Encryption is not necessary.
test
buyhome
foo
etc. (About 75 or so entries)
More recently I have learned how to use JSON, which raised this question: would it be faster to leave these as individual files to read as necessary, or to combine them into one very large JSON file that I keep in memory?
In-memory access will always be much faster than disk access. However, if your in-memory data is modified and the system crashes, you will lose whatever has not been saved to some form of persistent storage.
Given the amount of data you say you are working with, you really should be using a database of some sort. Either drop everything and go learn some SQL (the basics are not that hard) or leverage what you know about JSON and look into a NoSQL database like MongoDB.
You will find that using the right tool for the job often saves you more time in the long run than trying to force the tool you currently have to work, even if you need to invest some time upfront to learn something new.
First thing: DO NOT keep the data in memory. Unless you are building a portal like SO or Reddit, RAM as storage is a bad idea.
Second thing: reading a file is slow, and opening and closing a file is slow too. Try to keep the number of files as low as possible.
If you are going to use each and every one of those files (the key word being EVERY), keep them together. If you will only ever need some of them, store them separately.
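For what it's worth, here is a rough sketch in C of what "keeping them together" can look like (adapt to whatever language you are using): pack the small list files into one data file plus a tiny index, so later reads open one file instead of hundreds. The file names userlists.dat and userlists.idx are made up for the example.

#include <stdio.h>

int main(int argc, char **argv)
{
    /* Each command-line argument is one small list file to pack. */
    FILE *data = fopen("userlists.dat", "wb");
    FILE *idx  = fopen("userlists.idx", "w");
    if (!data || !idx) return 1;

    for (int i = 1; i < argc; i++) {
        FILE *in = fopen(argv[i], "rb");
        if (!in) continue;

        long start = ftell(data);
        int c;
        while ((c = fgetc(in)) != EOF)   /* byte-by-byte copy; fine for small files */
            fputc(c, data);
        fclose(in);

        /* index line: original name, byte offset, byte length */
        fprintf(idx, "%s %ld %ld\n", argv[i], start, ftell(data) - start);
    }

    fclose(idx);
    fclose(data);
    return 0;
}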
I am trying to accomplish something similar to what was described in this thread: How to split a huge csv file based on content of first column?
There, the best solution seemed to be to use awk, which does do the job. However, I am dealing with massive CSV files, and I would like to split the file without creating a new copy, since the disk I/O speed is killing me. Is there a way to split the original file without creating a new copy?
I'm not really sure what you're asking, but if your question is: "Can I take a huge file on disk and split it 'in-place' so I get many smaller files without actually having to write those smaller files to disk?", then the answer is no.
You will need to iterate through the first file and write the "segments" back to disk as new files, regardless of whether you use awk, Python or a text editor for this. You do not need to make a copy of the first file beforehand, though.
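As a concrete example, here is a hedged C sketch of that streaming approach for a first-column split. The input name huge.csv is a placeholder; it assumes the key contains no characters that are illegal in file names, and a real version would cache open file handles instead of reopening one per line:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[8192];
    FILE *in = fopen("huge.csv", "r");
    if (!in) return 1;

    while (fgets(line, sizeof line, in)) {
        char key[256], outname[300];
        const char *comma = strchr(line, ',');
        size_t klen = comma ? (size_t)(comma - line) : strlen(line);

        /* trim a trailing newline in case the line has no comma */
        while (klen > 0 && (line[klen - 1] == '\n' || line[klen - 1] == '\r'))
            klen--;
        if (klen >= sizeof key)
            klen = sizeof key - 1;
        memcpy(key, line, klen);
        key[klen] = '\0';

        /* append the whole line to a file named after its first column */
        snprintf(outname, sizeof outname, "%s.csv", key);
        FILE *out = fopen(outname, "a");
        if (out) {
            fputs(line, out);
            fclose(out);
        }
    }
    fclose(in);
    return 0;
}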
"Splitting a file" still requires RAM and disk I/O. There's no way around that; it's just how the world works.
However, you can certainly reduce the impact of I/O-bound processes on your system. Some obvious solutions are:
Use a RAM disk to reduce disk I/O.
Use a SAN disk to reduce local disk I/O.
Use an I/O scheduler to rate-limit your disk I/O. For example, most Linux systems support the ionice utility for this purpose.
Chunk up the file and use batch queues to reduce CPU load.
Use nice to reduce CPU load during file processing.
If you're dealing with files, then you're dealing with I/O. It's up to you to make the best of it within your system constraints.
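If you control the reading code itself, some of the same throttling can be applied from inside the program. A minimal Linux-flavoured sketch in C; the file name and buffer size are arbitrary:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    /* Lower our CPU scheduling priority, similar in spirit to running under nice. */
    setpriority(PRIO_PROCESS, 0, 19);

    int fd = open("huge.csv", O_RDONLY);
    if (fd < 0) return 1;

    /* Hint that we will read sequentially, so the kernel can read ahead
       instead of treating the accesses as random. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ... process the chunk ... */
    }
    close(fd);
    return 0;
}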
I've been searching all over the place for a way to stream a file into MySQL using C, and I can't find anything. This is pretty easy to do in C++, C#, and many other languages, but I can't find anything for straight C.
Basically, I have a file, and I want to read that file into a TEXT or BLOB column in my MySQL database. This can be achieved fairly easily by looping through the file and using subsequent CONCAT() calls to append the data to the column. However, I don't think that is a very elegant solution, and it is probably error prone.
I've looked into prepared statements using mysql_stmt_init() and all the binds, etc., but they don't seem to accept a FILE pointer for reading the data into the database.
It is important to note that I am working with very large files that cannot be stored in RAM, so reading the entire file into a temporary variable is out of the question.
Simply put: how can I read a file from disk into a MySQL database using C? Keep in mind that there needs to be some kind of buffer (i.e., BUFSIZ, due to the size of the files). Has anyone achieved this? Is it possible? I'm looking for a solution that works with both text and binary files.
Can you use LOAD DATA INFILE in a call to mysql_query()?
char statement[STMT_SIZE];

/* The table name is an identifier, not a string literal, so it must not be
   wrapped in single quotes; only the file name is quoted. */
snprintf(statement, STMT_SIZE, "LOAD DATA INFILE '%s' INTO TABLE %s",
         filename, tablename);
mysql_query(conn, statement);
See
http://dev.mysql.com/doc/refman/5.6/en/load-data.html and http://dev.mysql.com/doc/refman/5.6/en/mysql-query.html for the corresponding pages in the MySQL docs.
You can use a loop to read through the file, but instead of using a function like fgets() that reads one line at a time, use a lower-level function like read() or fread() that will fill an arbitrary-sized buffer at a time:
allocate large buffer
open file
while NOT end of file
    fill buffer
    CONCAT to MySQL
close file
release buffer
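For what it's worth, here is a hedged C sketch of that loop, assuming a table docs with an integer id and a LONGBLOB body column (both names are made up), and assuming the row for the given id already exists with an empty body. Each chunk must also stay comfortably below max_allowed_packet:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mysql.h>

#define CHUNK 65536   /* bytes read from the file per round trip */

int append_file_to_blob(MYSQL *conn, const char *path, int id)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    char *raw   = malloc(CHUNK);
    char *esc   = malloc(2 * CHUNK + 1);    /* worst case for escaping */
    char *query = malloc(2 * CHUNK + 256);  /* escaped data plus the SQL around it */
    size_t n;
    int rc = 0;

    while ((n = fread(raw, 1, CHUNK, fp)) > 0) {
        /* escape the raw bytes so they can be embedded in the statement text */
        mysql_real_escape_string(conn, esc, raw, (unsigned long)n);
        snprintf(query, 2 * CHUNK + 256,
                 "UPDATE docs SET body = CONCAT(IFNULL(body, ''), '%s') WHERE id = %d",
                 esc, id);
        if (mysql_query(conn, query)) {     /* one round trip per chunk */
            fprintf(stderr, "append failed: %s\n", mysql_error(conn));
            rc = -1;
            break;
        }
    }

    free(raw);
    free(esc);
    free(query);
    fclose(fp);
    return rc;
}

The malloc results should of course be checked in real code; they are skipped here to keep the sketch short.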
I don't like answering my own questions, but I feel the need in case someone else is looking for a solution to this down the road.
Unless I'm missing something, my research and testing have shown that I have three general options:
Decent Solution: use a LOAD DATA INFILE statement to send the file
pros: only one statement is ever needed. Unlike loading the entire file into memory, you can tune LOAD DATA on both the client and the server to use a given buffer size, and that buffer can be much smaller, which gives you "better" buffer control without making numerous calls.
cons: first of all, the file absolutely MUST be in a given format, which can be difficult to achieve with binary blob files. Also, this takes a fair amount of work to set up and requires a lot of tuning. By default, the client will try to load the entire file into memory and use swap space for the part of the file that does not fit. It's very easy to get terrible performance here, and every time you wish to make a change you have to restart the MySQL server.
Decent Solution: have a buffer (e.g., char buf[BUFSIZ]) and make numerous queries with CONCAT() calls to append the content
pros: uses the least amount of memory, and gives the program better control over how much memory is being used
cons: takes up A LOT of processing time, because you are making numerous MySQL calls and the server has to find the given row and then append a string to it (which takes time, even with caching)
Worst Solution: Try to load the entire file into memory (or as much as possible), and make only one INSERT or UPDATE call to mysql
pros: limits the amount of processing needed on the client, as only a minimal number of calls (preferably one) need to be buffered and executed.
cons: takes up a TON of memory. If you have numerous clients making these large calls simultaneously, the server will run out of memory quickly, and any performance gains will turn to losses very quickly.
In a perfect world, MySQL would implement a feature that allowed buffered queries, something akin to buffering a video: you open a MySQL connection, then within it open a 'query connection', stream the data in buffered sets, and then close the 'query connection'.
However, this is NOT a perfect world, and there is no such thing in MySQL. That leaves us with the three options shown above. I decided to stick with the second, where I make numerous CONCAT() calls, because my current server has plenty of processing time to spare and I'm very limited on memory in the clients. For my particular situation, banging my head against tuning LOAD DATA INFILE doesn't make sense. Every application, however, will have to analyze its own problem.
I'll stress none of these are "perfect" for me, but you can only do the best with what you have.
Points to Adam Liss for giving the LOAD DATA INFILE direction.
I'm writing a webcrawler in Python that will store the HTML code of a large set of pages in a MySQL database. I'd like to make sure my methods of storage and processing are optimal before I begin processing data. I would like to:
Minimize the storage space used in the database, possibly by minifying the HTML code, Huffman encoding, or some other form of compression. I'd like to keep the possibility of full-text searching the field; I don't know whether compression algorithms like Huffman encoding will allow this.
Minimize the processor usage necessary to encode and store large volumes of rows.
Does anyone have suggestions or experience with this or a similar issue? Is Python the optimal language for this, given that it's going to require a number of HTTP requests and regular expressions, plus whatever compression turns out to be optimal?
If you don't mind the HTML being opaque to MySQL, you can use the COMPRESS function to store the data and UNCOMPRESS to retrieve it. You won't be able to use the HTML contents in a WHERE clause (using, e.g., LIKE).
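Sticking with the MySQL C API used elsewhere on this page (the same SQL works from Python's connectors), here is a minimal sketch of that approach. The table pages and its url/html columns are assumptions, and the URL is assumed to be already sanitized:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mysql.h>

/* Store one page compressed; retrieval is simply
   SELECT UNCOMPRESS(html) FROM pages WHERE url = '...'. */
int store_page(MYSQL *conn, const char *url, const char *html)
{
    unsigned long len = strlen(html);
    size_t qsize = 2 * (size_t)len + strlen(url) + 128;
    char *esc   = malloc(2 * len + 1);
    char *query = malloc(qsize);
    if (!esc || !query) { free(esc); free(query); return -1; }

    mysql_real_escape_string(conn, esc, html, len);
    snprintf(query, qsize,
             "INSERT INTO pages (url, html) VALUES ('%s', COMPRESS('%s'))",
             url, esc);
    int rc = mysql_query(conn, query);

    free(esc);
    free(query);
    return rc;
}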
Do you actually need to store the source in the database?
Trying to run 'LIKE' queries against the data is going to suck big time anyway.
Store the raw data on the file system, as standard files. Just don't stick them all in one folder; use hashes of the id to store them in predictable folders.
(While it is of course perfectly possible to store the text in the database, it bloats the size of your database and makes it harder to work with. Backups are (much!) bigger, changing storage engines becomes more painful, etc. Scaling your filesystem is usually just a case of adding another hard disk; that doesn't work so easily with a database, where you start needing to shard.)
To do any sort of searching on the data, you're looking at building an index. I only have experience with SphinxSearch, but it allows you to specify a filename in the input database.
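As a small illustration of the "hashes of the id" layout, a hedged C sketch; the pages root directory and the two-level split are arbitrary choices:

#include <stdio.h>

/* djb2 string hash, used only to spread files across folders. */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Build a predictable nested path for a page identifier (e.g. a URL),
   of the form pages/xx/yy/<hash>.html. */
static void page_path(const char *id, char *out, size_t outsize)
{
    unsigned long h = djb2(id);
    snprintf(out, outsize, "pages/%02lx/%02lx/%08lx.html",
             h & 0xff, (h >> 8) & 0xff, h);
}

int main(void)
{
    char path[128];
    page_path("http://example.com/", path, sizeof path);
    printf("%s\n", path);
    return 0;
}

With two hex levels of 256 folders each, even millions of files end up in reasonably small directories.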