It is said that there is only one spindle in a hard disk that reads or writes data to/from the disk, so how is it possible to read or write two or more pieces of data to/from the hard disk simultaneously? The operating system used is Windows XP. Example: I need to copy two different movies from a pen drive to the hard disk, so I select both movies, copy them from the pen drive, and paste them into a disk partition. The copying of the two movies to the hard disk appears to happen simultaneously. How does this happen?
These operations aren't actually simultaneous at all; the operating system just manages both of them concurrently.
What happens is that the file manager (say, Windows Explorer) tells the operating system to copy a file from one location to another, once for each of the two copy operations.
The operating system splits this command between two parts of its own system, the "filesystem" and the "disk driver". The filesystem works out which blocks on which disk are associated with the files in question and tells the disk driver to read or write those blocks.
The disk driver builds up a queue of reads and writes and figures out the most efficient way to satisfy them. A desktop operating system will usually try to service those requests quickly, to keep the system as responsive as possible, whereas a server operating system will queue up the block operations for as long as possible so it can handle them in an order that makes the most efficient use of the disk (a toy sketch of that kind of reordering follows).
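To make the reordering idea concrete, here is a toy sketch in Python (not Windows XP's actual scheduler) of servicing a mixed queue of block requests in a single head sweep; the block numbers are made up:

    # Toy "elevator" ordering: sweep upward from the current head position,
    # then sweep back down, instead of seeking back and forth between files.
    def elevator_order(pending, head_pos):
        upward = sorted(block for block in pending if block >= head_pos)
        downward = sorted((block for block in pending if block < head_pos), reverse=True)
        return upward + downward

    # Interleaved block requests from two file copies queued by the driver.
    queue = [520, 11, 523, 14, 527, 17]
    print(elevator_order(queue, head_pos=100))  # [520, 523, 527, 17, 14, 11]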
Once the disk driver decides to act on a block operation, it tells the disk to move its head and read or write some data. The result of the action is then passed back to the filesystem, and ultimately to the user application.
The fact that the operations appear simultaneous is only an illusion created by the multitasking facilities of the operating system. This is pretty easy to discern, since multiple file copies take a little longer than a single copy (or sometimes a LOT longer, if you're trying to do a bunch at the same time).
Of course, the OS can still drive two separate disks simultaneously if the operations really are on different disks.
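To see the "two copies at once" situation from the application side, here is a minimal sketch in Python (not what Explorer actually runs); the paths are placeholders:

    # Two copy requests handed to the OS "simultaneously"; the filesystem and
    # the disk driver interleave the underlying block operations on one disk.
    import shutil
    import threading

    jobs = [
        threading.Thread(target=shutil.copyfile, args=(r"E:\movie1.avi", r"D:\movies\movie1.avi")),
        threading.Thread(target=shutil.copyfile, args=(r"E:\movie2.avi", r"D:\movies\movie2.avi")),
    ]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    # Both copies finish around the same time, but their blocks were serviced
    # in whatever order the disk driver chose, not truly in parallel.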
I want to run a machine learning algorithm as my endgame: research code, thus far unproven and unpublished, for text mining purposes. The text is already obtained, but was scraped from WARC files obtained from the Common Crawl. I'm in the process of preparing the data for machine learning, and one of the analysis tasks that's desirable is IDF (Inverse Document Frequency) analysis of the corpus prior to launching into the ML application proper.
It's my understanding that for IDF to work, each file should represent one speaker or one idea, generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, such that each comment from the social media site in question falls into an element of an array (and in a more strongly-typed language would be of type string).
From here I'm experiencing performance issues. I've let my script run all day and it has only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker of HTML-stripped text produced with Perl's HTML::Strip module. As I look at my system, I see that disk utilization on my local data drive is very high: there's a tremendous number of ASCII text writes, each much smaller than 1 KB, and each one gets crammed into a 1 KB sector of my local NTFS-formatted HDD.
Is it a worthwhile endeavor to stop the run, set up a MySQL database on my home system, create a text field in the database that is perhaps 500-1000 characters in maximum length, then rerun the Perl script so that it slurps an input HTML file, splits it, HTML-strips it, then prepares and executes a string insert against a database table?
In general, will switching from an output format that is a tremendous number of individual text files to one that is a tremendous number of database inserts be easier on my hard drive and faster to write out in the long run, due to some caching or RAM/disk-space utilization magic in the DBMS?
A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending on the OS and file system you are using. In particular, file systems differ significantly in how access times scale with the number of files within one directory. See, e.g., “NTFS performance and large volumes of files and directories” and “How do you deal with lots of small files?”: “NTFS performance severely degrades after 10,000 files in a directory.”
You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
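As a rough illustration of that single-file approach, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up for the example:

    # Store each HTML-stripped comment as a row instead of as its own tiny file.
    import sqlite3

    conn = sqlite3.connect("corpus.db")
    conn.execute("CREATE TABLE IF NOT EXISTS comments (doc_id TEXT, body TEXT)")

    def store_comments(doc_id, comments):
        # One transaction with executemany is far cheaper than creating
        # one small file per comment.
        with conn:
            conn.executemany(
                "INSERT INTO comments (doc_id, body) VALUES (?, ?)",
                ((doc_id, body) for body in comments),
            )

    store_comments("input_000001.html", ["first comment", "second comment"])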
On the other hand, 2 million records are not that much, suggesting that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and use a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time? Or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may slash the processing time quite noticeably.
Wow!
A few years ago we had massive performance problems with a popular CMS. Plain pages mostly performed well, but things went downhill once side-pass inlines came into play as well.
So I wrote some ugly lines of code to find the fastest approach. Note that your available resources set different limits!
1st) I spent the time establishing a directly addressable layout: each item gets its own set of flat files.
2nd) I created a RAM disk. Make sure you have enough of it for your project!
3rd) For backup I used rsync, and for redundancy I compressed/extracted to and from the RAM disk as a tar.gz.
In practice this is the fastest way. Converting timecodes and generating recursive folder structures is very simple, and so are read, write, replace and delete.
The final release resulted in processing times of:
PHP/MySQL: > 5 sec
Perl/HDD: ~ 1.2 sec
Perl/RamDisk: ~ 0.001 sec
Looking at what you are doing there, this construct may be usable for you; I don't know the internals of your project.
The hard disk will live much longer, and your workflow can be optimized through direct addressing. It is also accessible from other stages, which is to say you can work on that data from other scripts too: data processing in R, a notifier from a shell script, or anything else...
A buffering layer like MySQL is no longer needed, and your CPU no longer spins on no-ops.
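For what it's worth, here is a minimal sketch (in Python rather than Perl) of the kind of comparison described above, timing many small writes to a normal disk directory versus a RAM disk mount; both paths are assumptions, and /mnt/ramdisk would need to exist, e.g. as a tmpfs mount:

    # Time writing many small flat files to two different backing stores.
    import os
    import time

    def write_many(base, count=10_000):
        os.makedirs(base, exist_ok=True)
        start = time.perf_counter()
        for i in range(count):
            with open(os.path.join(base, f"{i}.txt"), "w") as handle:
                handle.write("x" * 200)
        return time.perf_counter() - start

    print("HDD:     ", write_many("/var/tmp/flatfiles"))
    print("RAM disk:", write_many("/mnt/ramdisk/flatfiles"))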
I've read multiple times that I can cause read/write errors if I create a snapshot. Is it possible to create a snapshot of the disk my machine is booted off of?
It depends on what you mean by "snapshot".
A snapshot is not a backup; it is a way of temporarily capturing the state of a system so you can make changes, test the results, and revert to the previously known-good state if the changes cause issues.
How to take a snapshot varies depending on the OS you're using, whether you're talking about a physical or a virtual system, which virtualization platform you're using, which image types you're using for disks within a given virtualization platform, etc.
Once you have a snapshot, you can make a real backup from the snapshot. If it's a database server, you'll want to make sure you've flushed everything to disk and then write-locked it for the time it takes to make the snapshot (typically seconds); a sketch of doing that for MySQL follows. For other systems you'll similarly need to address things in a way that ensures you have a consistent state.
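Here is a hedged sketch of the database case, assuming MySQL, the mysql-connector-python package, and LVM as the snapshot mechanism (your platform's snapshot command will differ):

    # Flush MySQL to disk, hold a global read lock while the snapshot is taken,
    # then release the lock. Credentials and volume names are placeholders.
    import subprocess
    import mysql.connector

    conn = mysql.connector.connect(user="backup", password="secret", host="localhost")
    cursor = conn.cursor()

    cursor.execute("FLUSH TABLES WITH READ LOCK")  # flush and block further writes
    try:
        subprocess.run(
            ["lvcreate", "--snapshot", "--name", "db-snap",
             "--size", "5G", "/dev/vg0/data"],
            check=True,
        )  # platform-specific snapshot step, usually only seconds
    finally:
        cursor.execute("UNLOCK TABLES")            # writes resume immediately
        cursor.close()
        conn.close()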
If you want to make a complete backup of your system drive directly, rather than via a snapshot, then you should shut down and boot from an alternate boot device like a CD or an external drive.
If you don't do that and instead try to back up a running system directly, you will be leaving yourself open to all manner of potential issues. It might work some of the time, but you won't know until you try to restore it.
If you can provide more details about the system in question, then you'll get more detailed answers.
As far as moving apps and data to different drives goes, data is easy provided you can shut down whatever is accessing it. If it's a database, stop the database, move the data files, tell the database server where to find its files, and start it back up.
For applications, it depends. Often it doesn't matter and it's fine to leave it on the system disk. It comes down to how it's being installed.
It looks like that works a little differently. The first snapshot will create an entire copy of the disk and subsequent snapshots will act like ordinary snapshots. This means it might take a bit longer to do the first snapshot.
According to this, you ideally want to shut down the system before taking a snapshot of your boot disk. If you can't do that for whatever reason, then you want to minimize the amount of writes hitting the disk and then take the snapshot. Assuming you're using a journaling filesystem (ext3, ext4, XFS, etc.), it should be able to recover without issue.
You can use the GCE APIs. Use the Disks:insert API to create the persistent disk. There are code examples on how to start an instance using Python, but Google has client libraries for other programming languages like Java, PHP and others.
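As a hedged sketch (not one of the linked examples), creating a persistent disk via Disks:insert with the Python API client might look roughly like this; the project, zone, and disk name are placeholders:

    # Hypothetical Disks:insert call with the Compute Engine v1 API client.
    # Assumes application default credentials are available in the environment.
    from googleapiclient import discovery

    compute = discovery.build("compute", "v1")

    operation = compute.disks().insert(
        project="my-project",
        zone="us-central1-a",
        body={"name": "shared-data-disk", "sizeGb": "200"},
    ).execute()

    print(operation["name"])  # zone operation to poll until the disk is ready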
I plan to mount persistent disks onto the Apache (/var/www) and MySQL (/var/lib/mysql) folders to avoid having to replicate information between servers.
Has anyone done tests to see whether the I/O performance of a persistent disk stays similar when attaching the same disk to 100 instances as opposed to only 2 instances? Also, is there a limit on how many instances one persistent disk can be attached to?
I'm not sure exactly what setup you're planning to use, so it's a little hard to comment specifically.
If you plan to attach the same persistent disk to all servers, note that a disk can only be attached to multiple instances in read-only mode, so you may not be able to use temporary tables, etc. in MySQL without extra configuration.
It's a bit hard to give performance numbers for a hypothetical configuration; I'd expect performance to depend on the amount of data stored (e.g. 1 TB of data will behave differently than 100 MB), instance size (larger instances have more memory for page cache and more CPU for processing I/O), and access pattern (random reads vs. sequential reads).
The best option is to set up a small test system and run an actual load test using something like apachebench, jmeter, or httperf. Failing that, you can try to construct an artificial load that's similar to your target benchmark.
Note that just running bonnie++ or fio against the disk may not tell you if you're going to run into problems; for example, it could be that a combination of sequential reads from one machine and random reads from another causes problems, or that 500 simultaneous sequential reads from the same block causes a problem, but that your application never does that. (If you're using Apache+MySQL, it would seem unlikely that your application would do that, but it's hard to know for sure until you test it.)
We run a system that, for caching purposes, currently writes and deletes about 1,000 small files (10 KB each) every hour.
In the near future this number will rise to about 10,000-20,000 files being written and deleted every hour.
For every file that is written, a new row is added to our MySQL DB; the row is deleted when the file is deleted an hour later.
My questions:
Can this excessive write & delete activity eventually hurt our server performance somehow?
(btw we currently run this on a VPS and soon on a dedicated server.)
Can writing and deleting so many rows eventually slow our DB?
This greatly depends on the operating system, the file system, and the configuration of file system caching. It also depends on whether your database is stored on the same disk as the files that are written and deleted.
Usually, operations that affect file system structure, such as file creation and deletion, require some synchronous disk I/O so that the operating system will not lose these changes after a power failure. However, some operating systems and file systems support a more relaxed policy for this. For example, the UFS file system on FreeBSD has a nice "soft updates" option that does this; ext3 on Linux probably has a similar feature.
Once you move to a dedicated server, I think it would be reasonable to attach several HDDs to it and make sure the database is stored on one disk while the massive file operations are performed on another. In that case DB performance should not be affected.
You should do some calculations and estimate the throughput the storage needs to sustain. In your worst-case scenario, 20,000 files × 10 KB ≈ 200 MB per hour, which is a very low requirement (worked out below).
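Spelled out as a quick back-of-the-envelope check, using the numbers from the estimate above:

    # Worst-case cache churn: 20,000 files of ~10 KB written (and later deleted) per hour.
    files_per_hour = 20_000
    file_size_kb = 10

    mb_per_hour = files_per_hour * file_size_kb / 1024    # ~195 MB written per hour
    kb_per_second = files_per_hour * file_size_kb / 3600  # ~56 KB/s sustained

    print(f"{mb_per_hour:.0f} MB/hour, {kb_per_second:.0f} KB/s sustained")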
Deleting a file on a modern file system takes very little time.
In my opinion you don't have to worry, especially if your application creates and deletes files sequentially.
Consider also that modern operating systems cache parts of the file system in memory to improve performance and reduce disk access (this is especially true for multiple deletes).
Your database will grow, but engines are optimized for that; there is no need to worry about it.
The only downside is that handling many small files can cause disk fragmentation if your file system is susceptible to it.
For a performance bonus, you could consider using separate physical storage for these files (e.g. a different disk drive or disk array) so you can take advantage of the full transfer bandwidth without other interference.
Is my general understanding correct that typical database management systems bypass the file system? My understanding is that they manage their own space on disk and write the actual data and index structures, such as B-trees, directly to disk blocks, bypassing any intermediate help from the file system.
This assumes that root would grant the database user permission to read and write disk blocks directly. In Linux this is easier still, since a disk can be treated as a file.
Any pointer to real case studies will be greatly appreciated.
Most rely on the underlying file system for the WAL etc.; basically they outsource it to the OS.
Some DBMSs (Oracle, MySQL) support "raw" partitions, but it isn't typical. It's too much hassle (see this chat about Postgres), because you still need the WAL etc. on your raw partition.
DBMSs do not bypass the file system. If that were the case, table names would not be case-insensitive under Windows and case-sensitive under Linux (in MySQL). What they do is allocate a large amount of space on the file system (the data is still visible as a file or set of files in the underlying operating system) and manage their own internal data structures within it. This lowers fragmentation and the overall overhead. Cache systems work in a similar way: Varnish allocates all the memory it needs with a single call to the operating system, then maintains its own internal data structures. A toy sketch of the file-based version of this pattern follows.
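This sketch assumes nothing about any real DBMS's on-disk format; it just shows the idea of preallocating one big file and managing fixed-size pages inside it yourself:

    # Minimal page store inside a single preallocated file.
    import os

    PAGE_SIZE = 4096
    PATH = "tablespace.dat"

    # Grab room for 1024 pages up front; the file still lives on the normal filesystem.
    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.truncate(1024 * PAGE_SIZE)

    def write_page(page_no, payload: bytes):
        assert len(payload) <= PAGE_SIZE
        with open(PATH, "r+b") as f:
            f.seek(page_no * PAGE_SIZE)          # the "engine" decides where data lives
            f.write(payload.ljust(PAGE_SIZE, b"\x00"))

    def read_page(page_no) -> bytes:
        with open(PATH, "rb") as f:
            f.seek(page_no * PAGE_SIZE)
            return f.read(PAGE_SIZE)

    write_page(3, b"row data for some table")
    print(read_page(3).rstrip(b"\x00"))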
Not completely. MySQL asks for a data directory and stores the data there in a specific file format in which it tries to optimize reads and writes from a file, and it also stores indexes there.
Moreover, this can differ from one storage engine to another.
MongoDB uses memory-mapped files for disk I/O.
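For illustration, here is a general memory-mapped file sketch in Python; it shows the technique, not MongoDB's actual implementation:

    # Map a file into memory and read/write it like a byte array; the OS's page
    # cache decides when the changes actually reach the disk.
    import mmap

    with open("data.bin", "wb") as f:
        f.truncate(4096)                  # create a small file to map

    with open("data.bin", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)     # map the whole file
        mm[0:5] = b"hello"                # write through the mapping
        mm.flush()                        # ask the OS to write dirty pages back
        print(bytes(mm[0:5]))
        mm.close()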
Looking forward to more discussion here.