My endgame is to run a machine learning algorithm (research code that is thus far unproven and unpublished) for text mining purposes. The text has already been obtained; it was scraped from WARC files obtained from the Common Crawl. I'm in the process of preparing the data for machine learning, and one of the analysis tasks that's desirable is IDF (Inverse Document Frequency) analysis of the corpus prior to launching into the ML application proper.
It's my understanding that for IDF to work, each file should represent one speaker or one idea: generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, such that each comment from the social media site in question falls into an element of an array (and in a more strongly typed language would be of type string).
From here I'm experiencing performance issues. I've let my script run all day, and it has only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker of HTML-stripped text (stripped with Perl's HTML::Strip module). As I look at my system, I see that disk utilization on my local data drive is very high: there's a tremendous number of ASCII text writes, each much smaller than 1 KB, each being crammed into a 1 KB sector of my local NTFS-formatted HDD.
Is it a worthwhile endeavor to stop the run, set up a MySQL database on my home system with a text field of perhaps 500-1000 characters max length, then rerun the Perl script such that it slurps an input HTML file, splits it, HTML-strips it, and then prepares and executes an INSERT against a database table?
In general: will switching from an output format that is a tremendous number of individual text files to one that is a tremendous number of database inserts be easier on my hard drive / faster to write out in the long run, thanks to some caching or RAM/disk-space utilization magic in the DBMS?
A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending also on the OS and file system you are using. In particular, different file systems differ significantly in how access times scale with the number of files within one directory. See, e.g., "NTFS performance and large volumes of files and directories" and "How do you deal with lots of small files?": "NTFS performance severely degrades after 10,000 files in a directory."
You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
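For illustration, here is a minimal sketch of what that could look like from Perl with DBI and DBD::SQLite (the file, table, and column names are just placeholders, not anything from your project). Batching each input file's comments into one transaction keeps the number of disk syncs low:

    use strict;
    use warnings;
    use DBI;

    # One database file instead of millions of tiny output files.
    my $dbh = DBI->connect("dbi:SQLite:dbname=corpus.db", "", "",
        { RaiseError => 1, AutoCommit => 0 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS comments (
            id          INTEGER PRIMARY KEY,
            source_file TEXT,
            body        TEXT
        )
    });
    $dbh->commit;

    my $insert = $dbh->prepare(
        "INSERT INTO comments (source_file, body) VALUES (?, ?)");

    # Call this from the existing loop: after splitting and HTML-stripping an
    # input file, insert each speaker's comment as a row instead of writing
    # a separate output file.
    sub store_comments {
        my ($source_file, @comments) = @_;
        $insert->execute($source_file, $_) for @comments;
        $dbh->commit;    # one transaction per input file, not per comment
    }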
On the other hand, 2 million records are not that much, suggesting that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and use a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time? Or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may slash the processing time quite noticeably.
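If you haven't profiled Perl before, Devel::NYTProf from CPAN is one common choice; assuming your splitting script is called split_warc.pl (a placeholder name), the invocation would look roughly like:

    perl -d:NYTProf split_warc.pl some_input.html
    nytprofhtml          # writes an HTML report under ./nytprof/

The report breaks the time down per subroutine and per line, which should make it obvious whether file I/O, HTML::Strip, or the splitting itself dominates.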
Wow!
A few years ago we had massive performance problems with a popular CMS. Plain pages mostly performed well enough, but performance went downhill once inlined side content came into play.
So I wrote some ugly lines of code to find the fastest way. Note that your available resources set different limits!
1st) I invested the time in establishing a directly addressable layout: every item gets its own set of flat files.
2nd) I set up a RAM disk. Be sure you have enough for your project!
3rd) For backup and redundancy I used rsync and compressed/extracted the RAM disk contents to/from a tar.gz archive.
In practice this turned out to be the fastest way. Converting timecodes and generating recursive folder structures is very simple, and so are read, write, replace, and delete.
The final version's processing times came out as:
PHP/MySQL > 5 sec
Perl/HDD ~ 1.2 sec
Perl/RamDisk ~ 0.001 sec
Seeing what you are doing there, this construct may be usable for you too. I don't know the internals of your project.
The hard disk will live much longer, and your workflow can be optimized through direct addressing. It's accessible from other stages as well; that is, you can work on that same base from other scripts too, whether that's data processing in R, a notifier from the shell, or anything else...
Buffering layers such as MySQL are no longer needed, and your CPU no longer loops on no-ops.
Related
The JSON will be updated up to ~4 times a day.
The JSON will be loaded often (every user will use this as base data).
We will need to keep the last previous version of every saved change (one backup copy).
Given these cases, is there a definite pro/con to storing the JSON data in a file on the server vs. storing it in the database? And if storing it in the database, would it make sense for it to have its own table (two rows: one current version, one backup copy)?
Storing, fetching, and even querying JSON these days isn't a big deal, especially with NoSQL solutions like MongoDB & Cassandra. In fact, a platform like MongoDB will allow you to make direct queries into the JSON itself; it stores its data as JSON-style (BSON) documents and performs quite well. (I am going to assume you are not talking about massive scale, at least not yet.)
The point being that a system like MongoDB has done a lot of the hard work for you. It will effectively handle things for you like loading frequently used documents into memory, optimizing their sizes, and providing mechanisms for traversing large JSON documents without a huge footprint.
If you were to deal with this at the file-by-file level, there are going to be a lot of unforeseen issues that you will need to deal with down the road: managing file handles, watching out for read/write locks on concurrent access, filesystem permissions, handling disk I/O performance bottlenecks; the list goes on. Even web servers, which serve files day and night and have some pretty interesting optimizations for handling them, end up working with CDNs (Content Delivery Networks) to optimize performance at the edge and manage scale.
Retaining prior versions of the JSON data can be as simple as not overwriting the existing entry and marking the previous-previous (n-2) version for deletion. The clean-up can then be done in a separate thread, or in an overnight batch process that removes the extraneous data. (NOTE: this could lead to some fragmentation down the line, but it's something that can be compacted later on.)
So, long story short. I wouldn't store JSON on the filesystem anymore. Put it in something like MongoDB and let it handle the nitty gritty details. Until you really get to 1B+ transactions, this should probably do pretty well for you.
I am working with a lot of separate data entries and unfortunately do not know SQL, so I need to know which is the faster method of storing data.
I have several hundred, if not thousands of, individual files storing user data. In this case they are all lists of Strings and nothing else, so I have been listing the entries line by line as such, accessing the files as needed. Encryption is not necessary.
test
buyhome
foo
etc. (About 75 or so entries)
More recently I have learned how to use JSON and had this question: Would it be faster to leave these as individual files to read as necessary, or as a very large JSON file I can keep in memory?
In-memory access will always be much faster than disk access; however, if your in-memory data is modified and the system crashes, you will lose that data if it has not been saved to some form of persistent storage.
Given the amount of data you say you are working with, you really should be using a database of some sort. Either drop everything and go learn some SQL (the basics are not that hard) or leverage what you know about JSON and look into a NoSQL database like MongoDB.
You will find that using the right tool for the job often saves you more time in the long run than trying to force the tool you currently have to work. Even if you need to invest some time upfront to learn something new.
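To give a sense of how little SQL is actually involved for data like this, here is a minimal sketch using SQLite from Perl (chosen purely for illustration; the table layout and the sample user name are made up):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical layout: one table holding every user's list of strings.
    my $dbh = DBI->connect("dbi:SQLite:dbname=users.db", "", "", { RaiseError => 1 });
    $dbh->do("CREATE TABLE IF NOT EXISTS entries (user TEXT, value TEXT)");

    # Add an entry for a (made-up) user.
    $dbh->do("INSERT INTO entries (user, value) VALUES (?, ?)", undef, "alice", "buyhome");

    # Read one user's list back.
    my $values = $dbh->selectcol_arrayref(
        "SELECT value FROM entries WHERE user = ?", undef, "alice");
    print "$_\n" for @$values;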
First thing is: DO NOT keep the data in memory. Unless you are creating a portal like SO or Reddit, RAM as storage is a bad idea.
Second thing is: reading a file is slow. Opening and closing a file is slow too. Try to keep the number of files as low as possible.
If you are going to use each and every one of those files (the key issue is EVERY), keep them together. If you will only ever need some of them, store them separately.
This question already has answers here: Storing Images in DB - Yea or Nay? (56 answers). Closed 9 years ago.
I have a site with more than 100k static files in a single directory (600k+ dirs and files in total). I guess I could get a VPS to host it without inode issues, but it won't be a high traffic site, so I'd rather use a cheap webhost.
I'm thinking to store the files in a MySQL table indexed by URL path and serve through PHP. Are there better approaches?
EDIT: Just to clarify, this is NOT the same as storing images on the DB. I'm talking about HTML pages.
I think your best approach would be not to store them in the database in the first place. When it comes to storing and serving files, that is what a file system does best; there is no way a database can do this more efficiently than a normal file system.
If you were to store them in a database, then given the size constraints you would want a BLOB-style field (e.g. TEXT), and for efficiency you would hash the URL and index that column rather than indexing some huge VARCHAR field.
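A rough sketch of that lookup in Perl (the connection details, table, and column names are all made up for illustration):

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    # Hypothetical schema, e.g.:
    #   CREATE TABLE pages (url_hash CHAR(32) PRIMARY KEY, url TEXT, body MEDIUMTEXT)
    my $dbh = DBI->connect("dbi:mysql:database=site", "user", "password",
        { RaiseError => 1 });

    # A fixed-length 32-character hex key is cheaper to index than a long VARCHAR.
    my $url_path = "/articles/example-page.html";    # whatever path was requested
    my ($body) = $dbh->selectrow_array(
        "SELECT body FROM pages WHERE url_hash = ?", undef, md5_hex($url_path));
    print $body;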
However, as you've said they are static, there really isn't any point in doing this. Since they are static, have your webserver add some long caching headers to the pages so they will be cached locally for future hits from the same client.
[Edit 1 - in response to comment]
I was answering the question with the information given and keeping it generic where information wasn't provided by OP.
It depends on how much of the VARCHAR you index, which in turn depends on the length of the data (URL / path / page name) you're indexing.
If you're indexing fewer than about 45 characters for only 100k rows, I guess it wouldn't make much difference; a hash will use less memory, but for a set this small the size and performance differences probably aren't significant.
I answered it as the OP asked about the database, but I still can't see any reason why you would want to put them there in the first place; it will be slower than using the file system. Why connect to the database, deal with network latency (unless they are on the same box, which is unlikely in a web host), query an index, fetch a row, run that data through the database provider, and stream the output to the response stream, when the webserver can produce the same outcome with far fewer CPU cycles and, compared to a database, a fraction of the memory usage?
Yes - a filesystem is a database. All the filesystems I've come across in the last 10 years can easily accommodate this number of files in a directory - and the directories are implemented as trees (there are some using B-Trees - but structures with bigger fanouts such as H-Trees work better for this kind of application).
(Actually, given the choice I'd recommend structuring it into a hierarchy of directories, e.g. using directories keyed on the first 2 letters of the filename or an md5 hash of the content; it would make managing the content a lot easier without compromising performance.)
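A rough sketch of that kind of sharding in Perl (the path layout and names are just an example, not anything prescribed):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use File::Path qw(make_path);
    use File::Spec;

    # Spread files across subdirectories keyed on the first characters of a
    # content hash, so no single directory grows to hundreds of thousands of entries.
    sub sharded_path {
        my ($root, $content) = @_;
        my $hash = md5_hex($content);
        my $dir  = File::Spec->catdir($root, substr($hash, 0, 2), substr($hash, 2, 2));
        make_path($dir);
        return File::Spec->catfile($dir, $hash . ".html");
    }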
Relational databases are all about storing small pieces of structured data - they are not an efficient way to manage large variable sized data.
I don't have any benchmarks to hand, but just as I'd pick a station wagon over a sports motorcycle to move several petabytes of data quickly, I'd go with a suitable filesystem (such as BTRFS or ext4; ZFS would do the job too, but it's not a good choice on anything other than Solaris, and it's questionable whether Solaris makes any sense for a webserver).
The problem is that cheap hosting companies rarely provide this level of information up front.
Note that a wee tweak of the filesystem behaviour can yield big improvements in performance; in your case, if running on Linux, I'd recommend reducing vfs_cache_pressure significantly. But this requires root access.
An alternative approach would be to use a document database rather than a relational database (not a key/value store). These are a type of schema-free (NoSQL) database designed for fast replication and handling of large data structures, so this would provide a more scalable solution (if that's a concern), e.g. RavenDB. You could use a key/value store, but these are rarely optimized to handle large data payloads.
I'd only consider MySQL if you have a very strong reason other than what you've described here.
We run a system that, for cache purposes, currently writes and deletes about 1,000 small files (10 KB each) every hour.
In the near future this number will rise to about 10,000 - 20,000 files being written and deleted every hour.
For every file that is written, a new row is added to our MySQL DB, and that row is deleted when the file is deleted an hour later.
My question:
Can these excessive write & delete operations eventually hurt our server performance somehow?
(btw we currently run this on a VPS and soon on a dedicated server.)
Can writing and deleting so many rows eventually slow our DB?
This greatly depends on operating system, file system and configuration of file system caching. Also this depends on whether your database is stored on the same disk as files that are written/deleted.
Usually, operations that affect the file system structure, such as file creation and deletion, require some synchronous disk IO so that the operating system will not lose these changes after a power failure. Some operating systems and file systems may support a more relaxed policy for this, though. For example, the UFS file system on FreeBSD has a nice "soft updates" option that does this; ext3 on Linux probably has a similar feature.
Once you move to a dedicated server, I think it would be reasonable to attach several HDDs and make sure the database is stored on one disk while the massive file operations are performed on another. In that case DB performance should not be affected.
You should do some calculations and estimate the needed throughput for the storage. In your worst case, 20,000 files x 10 KB = 200 MB per hour, roughly 56 KB/s, which is a very low requirement.
Deleting a file on modern filesystems takes very little time.
In my opinion you don't have to worry, especially if your application creates and deletes files sequentially.
Consider also that modern operating systems cache parts of the file system in memory to improve performance and reduce disk access (this is especially true for multiple deletes).
Your database will grow, but engines are optimized for that; no need to worry about it.
The only downside is that handling many small files can cause disk fragmentation, if your file system is susceptible to it.
For a performance bonus, you should consider using separate physical storage for these files (e.g. a different disk drive or disk array) so you get the full transfer bandwidth with no interference from other workloads.
I'm completely new to databases, so pardon the simplicity of the question. We have an embedded Linux system that needs to store data collected over a span of several hours. The data needs to be searchable sequentially and includes things like GPS and environmental data. It will be saved to a folder on a removable SSD and labeled as a "Mission". Several "Missions" can exist on a single SSD and should not be mixed together, because they need to be copied and saved off individually, at the user's discretion, to external media. Data will be saved as often as 10 times a second and needs to be very robust because of the potential for power outages.
The data needs to be searchable on the system it is created on, but also after the removable disk is taken to another system (also Linux), where it needs to be loaded and used as well. In the past we have written custom file formats to store the data, but it seems like a database might be the better option. How portable are databases like MySQL? Can a user easily remove a disk with a database on it and plug it into a new machine to use without too much effort? Our queries will mostly be time based, because the user will be "playing" through the data after it is collected, at perhaps 10x the collection rate. Also, our base code is written in Qt (C++), so we would need to interact with the database that way.
I'd go with SQLite. It's small and light, and it stores all its data in one file. You can copy or move the file to another computer and read it there. Your data writer can simply create a fresh, empty file when it detects that the current SSD does not already have one.
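Since robustness against power loss was called out, it may help to see the relevant SQLite settings. The sketch below uses Perl DBI purely for illustration (from Qt the same PRAGMA and CREATE statements can be issued through QSqlQuery), and the path, table, and column names are made up:

    use strict;
    use warnings;
    use DBI;

    # One self-contained database file per "Mission" folder on the removable SSD
    # (the path here is hypothetical).
    my $db_file = "/media/ssd/mission_001/mission.db";
    my $dbh = DBI->connect("dbi:SQLite:dbname=$db_file", "", "", { RaiseError => 1 });

    # Durability settings: write-ahead logging plus full syncs trades some speed
    # for resilience against power loss. Note that WAL keeps -wal/-shm sidecar
    # files next to the database until it is checkpointed or closed cleanly.
    $dbh->do("PRAGMA journal_mode=WAL");
    $dbh->do("PRAGMA synchronous=FULL");

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS samples (
            ts      REAL,    -- timestamp: the main key for time-based playback
            lat     REAL,
            lon     REAL,
            payload TEXT     -- other environmental readings
        )
    });
    $dbh->do("CREATE INDEX IF NOT EXISTS samples_ts ON samples (ts)");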
It's also worth mentioning that SQLite undergoes testing at a level afforded to only a select few safety-critical pieces of software. The test suite, while partly autogenerated, is a staggering 100 million lines of code. SQLite is not "lite" at all when it comes to robustness; I would trust it more than a random self-made database implementation.
SQLite is used in certified avionics AFAIK.