RAM disks in GCP Dataflow - is it possible?

Google Compute Engine supports RAM disks - see here.
I am developing a project that will reuse existing code which manipulates local files.
For scalability, I am going to use Dataflow.
The files are in GCS, and I will send them to the Dataflow workers for manipulation.
I was thinking of improving performance by using RAM disks on the workers: copy the files from GCS directly to the RAM disk and manipulate them there.
I could not find any example of such a capability.
Is this a valid solution, or should I avoid this kind of "trick"?

It is not possible to use a ramdisk as the disk type for the workers, since a ramdisk is set up at the OS level. The only disk types available for the workers are standard persistent disks (pd-standard) and SSD persistent disks (pd-ssd). Of these, SSD is definitely faster. You can also try adding more workers or using a faster CPU to process your data faster.
For comparison, I tried running a job with both standard and SSD disks, and it turned out to be about 13% faster with SSD. Note that I only tested the quick start from the Dataflow docs.
Using SSD: 3m 54s elapsed time
Using standard disk: 4m 29s elapsed time
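If you do switch to SSD, the disk type is chosen through the pipeline's worker options. A minimal sketch for the Python SDK is below; the project, region, bucket, and zone are hypothetical, and the exact flag name (--worker_disk_type) may vary by SDK version, so check your version's WorkerOptions:

```python
# Sketch: requesting pd-ssd worker disks for a Dataflow job.
# Project id, bucket, and zone below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                 # hypothetical project
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',   # hypothetical bucket
    '--disk_size_gb=50',
    # Full resource name of the disk type: pd-ssd instead of pd-standard.
    '--worker_disk_type=compute.googleapis.com/projects/my-project/'
    'zones/us-central1-a/diskTypes/pd-ssd',
])

with beam.Pipeline(options=options) as p:
    _ = (p | beam.Create(['hello']) | beam.Map(print))
```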

While what you want to do might be technically possible by creating a setup.py with custom commands, it will not help you increase performance. Beam already uses as much of the workers' RAM as it can in order to perform effectively. If you are reading a file from GCS and operating on it, then that file is already going to be loaded into RAM. By earmarking a big chunk of RAM to a ramdisk, you will probably make Beam run slower, not faster.
If you just want things to happen faster, try using SSD, increasing the number of workers, or using the c2 machine family.
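To illustrate the point that a file read from GCS is already sitting in the worker's RAM, here is a minimal sketch (hypothetical bucket and object name) of a DoFn that pulls a whole GCS object into memory and runs the existing file-manipulation logic on it, with no ramdisk involved:

```python
# Sketch: load a whole GCS object into worker memory inside a DoFn.
# The gs:// path is a placeholder.
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class ProcessWholeFile(beam.DoFn):
    def process(self, gcs_path):
        # FileSystems.open returns a readable stream; .read() loads the
        # object's bytes straight into RAM on the worker.
        with FileSystems.open(gcs_path) as f:
            data = f.read()
        # ... run the existing local-file logic on `data` here ...
        yield (gcs_path, len(data))


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['gs://my-bucket/input/file-1.bin'])  # hypothetical path
        | beam.ParDo(ProcessWholeFile())
        | beam.Map(print))
```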

Related

Which persistent storage is used by Dataflow to keep persistent state implemented with Apache Beam Timers?

It is not said explicitly, but I suppose that Dataflow could use Persistent Disk resources.
However, I cannot find confirmation of that.
I wonder if I can assume that the limitations and expected performance of Timers are the same as those described here: https://cloud.google.com/compute/docs/disks/performance
Dataflow uses Persistent disks to store timers. But there's also a significant amount of caching involved so performance should be better than just reading from persistent disks.
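For concreteness, this is roughly what the per-key state and Timers in question look like in the Beam Python SDK. This is only an illustrative sketch (names, coder, and the 60-second timeout are made up); it is this buffered state and pending timer that the runner persists for you:

```python
# Sketch: a stateful DoFn with a processing-time timer. On Dataflow the
# runner durably stores the per-key state and the pending timer.
import time

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer


class BufferThenFlush(beam.DoFn):
    BUFFER = BagStateSpec('buffer', VarIntCoder())    # per-key persistent state
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)  # processing-time timer

    def process(
        self,
        element,                                      # (key, value) pairs
        buffer=beam.DoFn.StateParam(BUFFER),
        flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element
        buffer.add(value)
        flush.set(time.time() + 60)                   # fire roughly 60s from now

    @on_timer(FLUSH)
    def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())
        buffer.clear()


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([('k', 1), ('k', 2), ('k', 3)])
        | beam.ParDo(BufferThenFlush())
        | beam.Map(print))
```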

Apache & MySQL with Persistent Disks to Multiple Instances

I plan to mount persistent disks at the Apache (/var/www) and MySQL (/var/lib/mysql) folders to avoid having to replicate information between servers.
Has anyone run tests to see whether the I/O performance of a persistent disk is similar when the same disk is attached to 100 instances versus only 2? Also, is there a limit on how many instances one persistent disk can be attached to?
I'm not sure exactly what setup you're planning to use, so it's a little hard to comment specifically.
If you plan to attach the same persistent disk to all servers, note that a disk can only be attached to multiple instances in read-only mode, so you may not be able to use temporary tables, etc. in MySQL without extra configuration.
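As a sketch of that constraint (hypothetical project, zone, disk, and instance names, using the google-api-python-client Compute API): attaching one persistent disk to several instances only works with READ_ONLY mode set on each attachment.

```python
# Sketch: attach a single persistent disk to multiple instances read-only.
# All resource names are placeholders.
import googleapiclient.discovery

compute = googleapiclient.discovery.build('compute', 'v1')

for instance in ['web-1', 'web-2']:                  # hypothetical instances
    compute.instances().attachDisk(
        project='my-project',
        zone='us-central1-a',
        instance=instance,
        body={
            'source': 'projects/my-project/zones/us-central1-a/'
                      'disks/shared-content',          # hypothetical disk
            'mode': 'READ_ONLY',  # required when a disk is attached to >1 instance
        },
    ).execute()
```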
It's a bit hard to give performance numbers for a hypothetical configuration; I'd expect performance to depend on the amount of data stored (e.g. 1TB of data will behave differently than 100MB), instance size (larger instances have more memory for page cache and more CPU for processing I/O), and access pattern (random reads vs. sequential reads).
The best option is to set up a small test system and run an actual load test using something like apachebench, jmeter, or httperf. Failing that, you can try to construct an artificial load that's similar to your target benchmark.
Note that just running bonnie++ or fio against the disk may not tell you if you're going to run into problems; for example, it could be that a combination of sequential reads from one machine and random reads from another causes problems, or that 500 simultaneous sequential reads from the same block causes a problem, but that your application never does that. (If you're using Apache+MySQL, it would seem unlikely that your application would do that, but it's hard to know for sure until you test it.)
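A rough sketch of such an artificial load (hypothetical target URL), in the same spirit as apachebench: fire a fixed number of concurrent readers at the site and look at latencies.

```python
# Sketch: crude concurrent HTTP load generator against a test instance.
import concurrent.futures
import time
import urllib.request

URL = 'http://my-test-instance/index.php'   # hypothetical target
CONCURRENCY = 50
REQUESTS = 1000

def fetch(_):
    start = time.time()
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    return time.time() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(fetch, range(REQUESTS)))

print(f'avg latency: {sum(latencies) / len(latencies):.3f}s, '
      f'max: {max(latencies):.3f}s')
```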

Does MySQL scale on a single multi-processor machine?

My application's typical DB usage is to read/update on one large table. I wonder if MySQL scales read operations on a single multi-processor machine? How about write operations - are they able to utilize multi-processors?
By the way - unfortunately I am not able to optimize the table schema.
Thank you.
Setup details:
x64, quad core
Single hard disk (no RAID)
Plenty of memory (4GB+)
Linux 2.6
MySQL 5.5
If you're using conventional hard disks, you'll often find you run out of IO bandwidth before you run out of CPU cores. The only way to pin a four core machine is to have a very high performance SSD striped RAID array.
If you're not able to optimize the schema you have very limited options. This is like asking to tune a car without lifting the hood. Maybe you can change the tires or use better gasoline, but fundamental performance gains come from several factors, including, most notably, additional indexes and strategically de-normalizing data.
In database land, 4GB of memory is almost nothing; 8GB is the absolute minimum for a system under any real load, and a single disk is a very bad idea. At the very least you should have some form of mirroring, for data integrity reasons.

MySQL vs SQLite on Amazon EC2

I have a Java program and PHP website I plan to run on my Amazon EC2 instance with an EBS volume. The program writes to and reads from a database. The website only reads from the same database.
On AWS you pay for the amount of IOPS (I/O requests Per Second) to the volume. Which database has the least IOPS? Also, can SQLite handle queries from both the program and website simultaneously?
The amount of IO is going to depend a lot on how you have MySQL configured and how your application uses the database. Caching, log file sizes, database engine, transactions, etc. will all affect how much IO you do. In other words, it's probably not possible to predict in advance, although I'd guess that SQLite would do more disk IO, simply because the database file has to be opened and closed all the time, while MySQL reads (in particular) and writes can be cached in memory by MySQL itself.
This site, Estimating I/O requests, has a neat method for calculating your actual IO and using that to estimate your EBS costs. You could run your application on a test system under simulated loads and use this technique to measure the difference in IO between a MySQL solution and a SQLite solution.
In practice, it may not really matter. The cost is $0.10 per million IO requests. On a medium-traffic e-commerce site with heavy database access we were doing about 315 million IO requests per month, or $31. This was negligible compared to the EC2, storage, and bandwidth costs which ran into the thousands. You can use the AWS cost calculator to plug in estimates and calculate all of your AWS costs.
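The arithmetic behind that "$31" figure, using the $0.10 per million IO requests price quoted above:

```python
# EBS IO cost for ~315 million requests/month at $0.10 per million.
io_requests_per_month = 315_000_000
price_per_million = 0.10
monthly_cost = io_requests_per_month / 1_000_000 * price_per_million
print(f'${monthly_cost:.2f} per month')   # -> $31.50 per month
```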
You should also keep in mind that the SQLite folks only recommend that you use it for low to medium traffic websites. MySQL is a better solution for high traffic sites.
Yes SQLite can handle queries from both the program and website simultaneously. SQLite uses file level locking to ensure consistency.
In-memory SQLite is intended for standalone or embedded programs.
Do not use in-memory-only SQLite:
when you share the db between multiple processes
when you have a PHP-based website, in which case you won't be able to leverage PHP FastCGI
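To make the distinction concrete, here is a minimal sketch (the file name and table are made up) of the two modes: an on-disk database that several processes can share thanks to file-level locking, versus an in-memory database that is private to one connection:

```python
# Sketch: shared on-disk SQLite vs. private in-memory SQLite.
import sqlite3

# On-disk database: other processes can open 'app.db' too; SQLite serializes
# writers via file-level locking, and the timeout makes a blocked writer wait
# instead of failing immediately.
conn = sqlite3.connect('app.db', timeout=10)   # hypothetical file name
conn.execute('CREATE TABLE IF NOT EXISTS hits (page TEXT, n INTEGER)')
conn.execute("INSERT INTO hits VALUES ('/index.php', 1)")
conn.commit()

# In-memory database: exists only inside this connection, so a separate
# program (or another PHP FastCGI process) cannot see this data at all.
mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE t (x)')
```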

Why should we store log files and bin-log files on a different path or disk in MySQL?

I have a MySQL replication setup. The log files and bin-log files are all at one path, which by default is my MySQL data directory.
I have read that for better performance one should store them separately.
Can anyone explain how this improves performance? Is there documentation available on this, and on the reason why one should do so?
Mainly because reads and writes can then happen almost in parallel. "Stored separately" means on different disks.
Linux and H/W optimizations for MySQL is a nice presentation of ways to improve MySQL performance - it presents benchmarks and conclusions of when to use SSD disks and when to use SCSI disks, what kind of processors are better for what tasks.
Very good presentation, a must read for any DBA!!
It also can be really embarrassing to have your log files fill the file system and bring the database to a halt.
One consideration is that using a separate disk for binlogging introduces another SPOF since if MySQL cannot write the binlog it will croak the same as if it couldn't write to the data files. Otherwise, adding another disk just better separates the two tasks so that binlog writes and data file writes don't have to contend for resources. With SSDs this is much less of an issue unless you have some crazy heavy write load and are already bound by SSD performance.
It's mostly for cases where your database write traffic is so high that a single disk volume can't keep up while writing for both data files and log files. Disks have a finite amount of throughput, and you could have a very busy database server.
But it's not likely that separating data files from binlogs will give better performance for queries, because MySQL writes to the binlog at commit time, not at query time. If your disks were too slow to keep up with the traffic, you'd see COMMIT become a bottleneck.
The system I currently support stores binlogs in the same directory as the datadir. The datadir is on a RAID10 volume over 12 physical drives. This has plenty of throughput to support our workload. But if we had about double our write traffic, this RAID array wouldn't be able to keep up.
You don't need to do every tip that someone says gives better performance, because any given tip might make no difference to your application's workload. You need to measure many metrics of performance and resource use, and come up with the right tuning or configuration to help the bottlenecks under your workload.
There is no magic configuration that makes everything have high performance.