I have a table with 100,000 rows, and it will soon double. The database is currently 5 GB, and most of that goes to one particular column: a text column holding the text extracted from PDF files. We expect the database to reach 20-30 GB, maybe 50 GB, within a couple of months, and this system will be used frequently.
I have a couple of questions regarding this setup:
1-) We are using InnoDB on every table, including the users table, etc. Is it better to use MyISAM on the table where we store the text version of the PDF files (from a memory usage/performance perspective)?
2-) We use Sphinx for searching, but the data must be retrieved for highlighting. Highlighting is done via the Sphinx API, but we still need to retrieve 10 rows from the database to send back to Sphinx. Those 10 rows can take up 50 MB of memory, which is quite a lot. So I am planning to split these PDF files into chunks of 5 pages in the database, which turns the 100,000 rows into roughly 3-4 million rows, and a couple of months later, instead of 300,000-350,000 rows, we will have around 10 million rows storing the text version of these PDF files. However, we will retrieve fewer pages per query: instead of retrieving 400 pages to send to Sphinx for highlighting, we can retrieve 5, which should have a big impact on performance. Currently, when we search a term and retrieve PDF files that have more than 100 pages, the execution time is 0.30-0.35 seconds, but if we retrieve PDF files with fewer than 5 pages, the execution time drops to 0.06 seconds, and it also uses less memory.
Do you think this is a good trade-off? We will have millions of rows instead of 100k-200k, but it will save memory and improve performance. Is this a good approach, and do you have any other ideas for solving this problem?
The text version of the data is used only for indexing and highlighting, so we are very flexible.
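For reference, a minimal sketch of the chunking idea in Python (the page delimiter, table name, and column names are placeholders, not our real schema, and it assumes a MySQL DB-API cursor):

# Sketch only: split one document's extracted text into 5-page chunks and
# insert one row per chunk. Assumes pages are separated by form-feed
# characters ("\f"), as pdftotext emits.
PAGES_PER_CHUNK = 5

def chunk_pages(full_text, pages_per_chunk=PAGES_PER_CHUNK):
    pages = full_text.split("\f")
    for start in range(0, len(pages), pages_per_chunk):
        yield start + 1, "\f".join(pages[start:start + pages_per_chunk])

def insert_chunks(cursor, pdf_id, full_text):
    # pdf_text_chunks(pdf_id, first_page, body) is a hypothetical table.
    for first_page, body in chunk_pages(full_text):
        cursor.execute(
            "INSERT INTO pdf_text_chunks (pdf_id, first_page, body) "
            "VALUES (%s, %s, %s)",
            (pdf_id, first_page, body),
        )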
Edit: We store the PDF files themselves on our cloud storage, but for search highlighting we need to retrieve the text version of the PDF and pass it to Sphinx; Sphinx then returns a highlighted 256-character snippet. To index the PDFs we need their text in the database, because they also have additional metadata, such as description, tags and title, and we need to link all of that together for the search engine. If we indexed txt or PDF files straight from the file server, we could not pull the other data from the DB and link it to those files in the search engine. So the PDF files stay on our cloud storage, but the text version must also be in our DB so we can index the tags, title and description along with it. These are different tables, but the text still has to be in the database.
Thanks,
It sounds like you don't really need to retrieve the entire PDF text every time you hit a row for that PDF.
Are you separating the metadata about your PDF files from the text itself? You definitely shouldn't have just one table here. You might want something like a pdf_info table with your 100 columns (do you really have that much metadata? why 100 columns?) and a foreign key to a pdf_files table containing the actual text for the files. Then you can experiment with, say, making the info table InnoDB and the files table MyISAM.
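Something along these lines; this is only a sketch, using SQLite DDL for brevity (the table and column names are placeholders, and in MySQL you would choose the engine per table):

import sqlite3

# Sketch of the split: small, frequently-read metadata in one table and the
# bulky extracted text in another, joined on pdf_id. All names are made up.
conn = sqlite3.connect("pdf_catalog.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS pdf_info (
    pdf_id      INTEGER PRIMARY KEY,
    title       TEXT,
    description TEXT,
    tags        TEXT,
    page_count  INTEGER,
    cloud_path  TEXT              -- where the actual PDF lives
);

CREATE TABLE IF NOT EXISTS pdf_text (
    pdf_id      INTEGER NOT NULL REFERENCES pdf_info(pdf_id),
    first_page  INTEGER NOT NULL,
    body        TEXT NOT NULL,    -- extracted text, read only for highlighting
    PRIMARY KEY (pdf_id, first_page)
);
""")
conn.close()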
IMHO: there are many, many reasons NOT to store your PDF files in the MySQL database. I would just store file paths pointing to a SAN or some other file distribution mechanism. SQL is good at storing any abstract data, and files certainly fall into that category, but file systems are specifically designed to store files, and web servers are specifically designed to deliver those files to you as quickly as possible. So... just something to think about.
Use Solr; it can index text files together with their metadata from a database. I have since switched the search engine to Solr.
That sounds like a really bad technology choice. If you can slow the growth so that everything fits in memory (affordable up to 128 GB or so), or partition for a larger size, you can basically be limited only by network transfer.
[edit]
If the PDFs are on disk and not in RAM, the disk needs to be accessed. If you don't have an SSD, you can do that about 50 times/second/disk. As long as a PDF is smaller than a disk track, splitting is not very interesting. If you split the PDFs and then need access to all the parts, you may need to read from several tracks, which slows you down a lot.
Handling large documents with an RDBMS in a multi-user setup is not a good idea, performance-wise.
Related
My endgame is to run a machine learning algorithm: research code, thus far unproven and unpublished, for text mining purposes. The text is already obtained, scraped from WARC files from the Common Crawl. I'm in the process of preparing the data for machine learning, and one of the analysis tasks I'd like to run is IDF (Inverse Document Frequency) analysis of the corpus before launching into the ML application proper.
It's my understanding that for IDF to work, each file should represent one speaker or one idea, generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, so that each comment from the social media site in question falls into an element of an array (and, in a more strongly typed language, would be of type string).
From here I'm experiencing performance issues. I've let my script run all day and it has only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker of HTML-stripped text produced with Perl's HTML::Strip module. As I look at my system, I see that disk utilization on my local data drive is very high: there is a tremendous number of ASCII text writes, each much smaller than 1 KB and each being crammed into a 1 KB sector of my local NTFS-formatted HDD.
Is it a worthwhile endeavor to stop the run, set up a MySQL database on my home system with a text field that is perhaps 500-1000 characters in maximum length, and then rerun the Perl script so that it slurps an input HTML file, splits it, HTML-strips it, and then prepares and executes an insert against a database table?
In general, will switching from an output format that is a tremendous number of individual text files to one that is a tremendous number of database inserts be easier on my hard drive / faster to write out in the long run, thanks to some caching or RAM/disk utilization magic in the DBMS?
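For concreteness, here is roughly what I have in mind, sketched in Python rather than Perl (the table and column names are made up):

import sqlite3

# Rough sketch of the proposed rewrite: instead of writing one tiny file per
# speaker, batch the HTML-stripped comments into a single database file.
# One transaction per input file keeps the writes sequential instead of
# millions of sub-1 KB random writes.
conn = sqlite3.connect("corpus.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS comments ("
    " source_file TEXT, speaker_no INTEGER, body TEXT)"
)

def store_comments(source_file, stripped_comments):
    # stripped_comments: list of already HTML-stripped strings from one input file
    with conn:  # one transaction per input file, not one per row
        conn.executemany(
            "INSERT INTO comments (source_file, speaker_no, body) VALUES (?, ?, ?)",
            ((source_file, i, body) for i, body in enumerate(stripped_comments, 1)),
        )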
A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending on the OS and file system you are using. In particular, file systems differ significantly in how access times scale with the number of files within one directory. See, e.g., "NTFS performance and large volumes of files and directories" and "How do you deal with lots of small files?": "NTFS performance severely degrades after 10,000 files in a directory."
You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
On the other hand, 2 million records are not that many, suggesting that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and using a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time, or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may cut the processing time quite noticeably.
Wow!
A few years ago, we had massive problems with a popular CMS. Plain pages mostly performed well, but things went downhill once inlined side content came into play.
So I wrote some ugly lines of code to find the fastest way. Note that the available resources set different limits, so your numbers will differ!
1st) I spent the time on establishing a directly addressable point. Every item gets its own set of flat files.
2nd) I made a RAM disk. Make sure it is big enough for your project!
3rd) For backup I used rsync, and for redundancy I compressed/extracted the RAM disk to/from a tar.gz.
In practice, this is the fastest way. Converting timecodes and generating recursive folder structures is very simple, and so are read, write, replace and delete.
The final release resulted in processing times of:
PHP/MySQL > 5 sec
Perl/HDD ~ 1.2 sec
Perl/RamDisk ~ 0.001 sec
From what I can see of what you are doing there, this construct may be usable for you. I don't know the internals of your project.
The hard disk will live much longer, and your workflow can be optimized through direct addressing. It is also accessible from other stages; that is, you can work on that base from other scripts as well, say data processing in R, a notifier from the shell, or anything else...
Buffering layers like MySQL are no longer needed, and your CPU no longer loops on no-ops.
Our employees do work for clients, and the clients send us files containing information that we turn into performance metrics (we do not have direct access to this information - it needs to be sent by the clients). These files are normally .csv or .xlsx, so typically I read them with pandas and output a much cleaner, smaller file.
1) Some files contain call drivers or other categorical information that repeats constantly (for example, Issue Driver 1 with about 20 possibilities and Issue Driver 2 with 100 possibilities). These files run to 100+ million records per year, so they become pretty large if I consolidate them. Is it better to create a dictionary and map each driver to an integer? I've read a bit about the category dtype in pandas: does it make the output files smaller too, or only the in-memory representation?
2) I store the output as .csv, which means I lose the dtypes if I ever read the file again. How do I maintain dtypes? Should I perhaps save the files to SQLite instead of massive .csv files? My issue now is that I literally write code to break the files up into separate .csv files per month and then maintain one massive consolidated file which I use for analysis (I normally dump it into Tableau). If I need to make changes to the monthly files, I have to rewrite them all, which is slow on my laptop's non-SSD hard drive. (See the sketch after this list.)
3) I normally only need to share data with one or two people, and most analysis requests are ad hoc but involve one to three years' worth of very granular data (individual surveys or interactions, each represented by a single row in separate files). In other words, I do not need a system with highly concurrent reads and writes; I just want something fast, efficient, and consolidated.
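To make points 1 and 2 concrete, here is a hedged sketch with pandas (the column names and file paths are invented; Parquet needs pyarrow or fastparquet installed and is just one of several dtype-preserving formats):

import pandas as pd

# Sketch: read a raw export, turn the repetitive driver columns into the
# pandas 'category' dtype, and write a format that preserves dtypes on re-read.
df = pd.read_csv("raw_export.csv")

for col in ["issue_driver_1", "issue_driver_2"]:
    df[col] = df[col].astype("category")

# Parquet keeps the categorical encoding and the other dtypes, so reading it
# back does not lose type information the way a round-trip through .csv does.
df.to_parquet("clean_export.parquet")

df2 = pd.read_parquet("clean_export.parquet")
print(df2.dtypes)  # the driver columns come back as 'category'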
We have about 60 million web pages in a compressed format. We would like to decompress them and work with the files individually.
Here are my questions!
First, if I decompress them onto the file system, will the FS cope with that number of files? My file system is ext4. (I have 4 different file systems, so I can divide the data between them, e.g. 15 million pages per file system.)
Secondly, would storing these files in a relational database be a better option, assuming that all the hassle of cleaning the HTML text is done before inserting them into the database?
Thanks,
If you extract them into a single directory you may exceed the maximum number of allocated indices in that folder. If you extract them into multiple directories you will fare better.
60 million is definitely a fair amount. If you plan on doing any indexing or searching on them, then a database would be your best option. You can also index files using something like Lucene; it all depends on what you want to do with the files after they have been extracted.
I currently have a similar issue with images on a large user site. The way I got around it was to give each image a GUID and use each byte of the GUID as a directory level, with the next byte becoming a subdirectory under that (down to 8 bytes). If my fill ratio goes up, I'll create more subdirectories to compensate. It also means I can spread the files across different network storage boxes.
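A tiny sketch of that GUID-to-directory mapping (two hex characters per level; the depth, base path and extension are arbitrary here):

import os
import uuid

def image_path(base_dir, guid, depth=3):
    # Use successive bytes of the GUID as nested directory names,
    # e.g. 3f/a2/9c/3fa29c....jpg. Add more levels if directories fill up.
    hex_id = guid.hex  # 32 hex characters, 2 per byte
    parts = [hex_id[i * 2:i * 2 + 2] for i in range(depth)]
    return os.path.join(base_dir, *parts, hex_id + ".jpg")

guid = uuid.uuid4()
path = image_path("/srv/images", guid)
os.makedirs(os.path.dirname(path), exist_ok=True)
print(path)  # e.g. /srv/images/3f/a2/9c/3fa29c....jpg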
I need a database that can store a large number of BLOBs. The BLOBs would be picture files and would also have a timestamp and a few basic fields (size, metrics, IDs of objects in other databases, things like that), but the main purpose of the database is to store the pictures.
We would like to keep the data in the database for a while, on the order of a few months. With data coming in maybe every few minutes, the number of BLOBs stored can grow quite quickly.
For now (the development phase) we will be using MySQL for this. I was wondering whether MySQL is a good direction to go in, in terms of:
Being able to store binary data efficiently
Scalability
Maintenance requirements.
Thanks,
MySQL is a good database and can handle large data sets. However, there is a great benefit in making your whole database fit into RAM; in that case, all database-related activity is much faster. By putting large, seldom-accessed objects into your database, you make that harder.
So I think a combined approach is best:
Save only the metadata in the database, and save the files on disk as-is. It's better to hash the directories if you're talking about 100,000s of files, and then save each file under the name of an index field in your database, e.g. a directory structure like this:
00/00001.jpg
00/00002.jpg
00/00003.jpg
....
....
10/10234.jpg
10/10235.jpg
In this case, your directories won't have too many files, and accessing the files is fast and easy. Of course, if your database server is distributed/redundant, things get more interesting; any such approach may or may not be warranted depending on the load, redundancy/failover requirements, etc.
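Deriving the path from the database id is then trivial; a sketch matching the layout above (the zero-padding width and base path are arbitrary):

import os

def image_path(base_dir, image_id):
    # The first two digits of the zero-padded id become the directory,
    # e.g. id 10234 -> 10/10234.jpg, matching the layout above.
    padded = "%05d" % image_id
    return os.path.join(base_dir, padded[:2], padded + ".jpg")

print(image_path("/data/images", 1))      # /data/images/00/00001.jpg
print(image_path("/data/images", 10234))  # /data/images/10/10234.jpg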
I suggest storing the images on the hard disk and maintaining the metadata of each image, possibly including the filename, in your MySQL database, so your script can easily pick the file up from your local hard drive.
The hard disk and most modern OSes are really good at reading and storing files, so I don't believe MySQL is going to solve anything here.
I have a very large dataset, with each item roughly 1 kB in size. The data needs to be queried rapidly by many applications distributed over a network. The dataset has more than a million items now and will grow to 500 million+ 1 kB chunks.
What would be the best method of storing this dataset (I need to be able to add more items and read them rapidly, but never modify already-added data)? Would a MySQL DB using a binary BLOB column be appropriate?
Or should each of these be stored as files on a file system?
Edit: the number is 1 million items now, but it needs to be able to scale easily to well over 500 million items.
Since there is no need to index anything inside the objects, I would have to say a filesystem is probably your best bet, not a relational database. With only a unique ID and a blob, there really isn't any structure here, so there's no value in putting it in a database.
You could use a web server to provide access to the repository, and then a caching solution like nginx with memcache to keep it all in memory and scale out using load balancing.
And if you run into further performance issues, you can remove the filesystem layer and roll your own store, like Facebook did with its photo system. This can eliminate unnecessary I/O operations, such as pulling unneeded metadata like security information from the file system.
If you need to retrieve the saved data, then storing it in files is certainly not a good idea.
MySQL is a good choice, but make sure you have the right indexes set.
Regarding binary BLOBs: it depends on what you plan to store. Give us more details.
That's one GB of data. What are you going to use the database for?
That's definitely just a file; read it into RAM at startup.
Scaling to 500 million is easy; that just takes some more machines.
Depending on the precise application characteristics, you might be able to normalize or compress the data in RAM.
You could keep things on disk and use a database, but that seriously limits your scalability in terms of simultaneous access. You get about 50 accesses/sec from a spinning disk, so just count how many disks you need.
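On the in-RAM point, a minimal sketch of keeping the ~1 kB items compressed in memory (zlib is just one option; the ratio depends entirely on the data):

import zlib

# Sketch: an append-only in-memory store that compresses each ~1 kB item.
# Items are never modified after being added, which matches the
# write-once / read-many access pattern described above.
class CompressedStore:
    def __init__(self):
        self._items = {}

    def add(self, item_id, data):
        self._items[item_id] = zlib.compress(data)

    def get(self, item_id):
        return zlib.decompress(self._items[item_id])

store = CompressedStore()
store.add(1, b"roughly one kilobyte of payload " * 32)
print(len(store.get(1)))  # original size comes back on decompression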