Automatic distributed reading of csv file in Jmeter distributed load testing? - csv

My scenario is while doing distributed load testing through jmeter i want csv file should read in auto distributed manner . example
if i have 100 users entry in csv data set config file and number of slave server is 10. so in normal scenario i have to keep csv file entry in arranged manner like
user1- to 10 at slave-1
user-11to20 at slave-2
.
.
.
user-91 to 100 at slave 3
so i want same csv file have entry of all 100 users should be placed at all slave and jmeter automatically read entry from these files and distribute it .

You cannot do that at the moment but if you want a configuration in which each thread uses unique data even across slave machines, then you should use different Test Data files on different slave machine.
You have to place this Test Data file on each slave machine (with same location as on master). JMeter will use Test Data from the slave machine and not from the Master, so placing different data set on different machine will ensure uniqueness.

JMeter doesn't provide such functionality out of the box so the only option I can think of is reading required X lines with the given offset depending on the slave hostname or IP address somewhere in setUp Thread Group using JSR223 Sampler and Groovy language and writing this range of lines into a new file which will be used in the CSV Data Set Config.
Another possible solution would be going for HTTP Simple Table Server, it's READ endpoint allows removing the value after reading it so you will have unique data for all slaves.

Related

Using Consul for dynamic configuration management

I am working on designing a little project where I need to use Consul to manage application configuration in a dynamic way so that all my app machines can get the configuration at the same time without any inconsistency issue. We are using Consul already for service discovery purpose so I was reading more about it and it looks like they have a Key/Value store which I can use to manage my configurations.
All our configurations are json file so we make a zip file with all our json config files in it and store the reference from where you can download this zip file in a particular key in Consul Key/Value store. And all our app machines need to download this zip file from that reference (mentioned in a key in Consul) and store it on disk on each app machine. Now I need all app machines to switch to this new config at the same time approximately to avoid any inconsistency issue.
Let's say I have 10 app machines and all these 10 machines needs to download zip file which has all my configs and then switch to new configs at the same time atomically to avoid any inconsistency (since they are taking traffic). Below are the steps I came up with but I am confuse on how loading new files in memory along with switch to new configs will work:
All 10 machines are already up and running with default config files as of now which is also there on the disk.
Some outside process will update the key in my consul key/value store with latest zip file reference.
All the 10 machines have a watch on that key so once someone updates the value of the key, watch will be triggered and then all those 10 machines will download the zip file onto the disk and uncompress it to get all the config files.
(..)
(..)
(..)
Now this is where I am confuse on how remaining steps should work.
How apps should load these config files in memory and then switch all at same time?
Do I need to use leadership election with consul or anything else to achieve any of these things?
What will be the logic around this since all 10 apps are already running with default configs in memory (which is also stored on disk). Do we need two separate directories one with default and other for new configs and then work with these two directories?
Let's say if this is the node I have in Consul just a random design (could be wrong here) -
{"path":"path-to-new-config", "machines":"ip1:ip2:ip3:ip4:ip5:ip6:ip7:ip8:ip9:ip10", ...}
where path will have new zip file reference and machines could be a key here where I can have list of all machines so now I can put each machine ip address as soon as they have downloaded the file successfully in that key? And once machines key list has size of 10 then I can say we are ready to switch? If yes, then how can I atomically update machines key in that node? Maybe this logic is wrong here but I just wanted to throw out something. And also need to clean up all those machines list after switch since for the next config update I need to do similar exercise.
Can someone outline the logic on how can I efficiently manage configuration on all my app machines dynamically and also avoid inconsistency issue at the same time? Maybe I need one more node as status which can have details about each machine config, when it downloaded, when it switched and other details?
I can think of several possible solutions, depending on your scenario.
The simplest solution is not to store your config in memory and files at all, just store the config directly in the consul kv store. And I'm not talking about a single key that maps to the entire json (I'm assuming your json is big, otherwise you wouldn't zip it), but extracting smaller key/value sets from the json (this way you won't need to pull the whole thing every time you make a query to consul).
If you get the config directly from consul, your consistency guarantees match consul consistency guarantees. I'm guessing you're worried about performance if you lose your in-memory config, that's something you need to measure. If you can tolerate the performance loss, though, this will save you a lot of pain.
If performance is a problem here, a variation on this might be to use fsconsul. With this, you'll still extract your json into multiple key/value sets in consul, and then fsconsul will map that to files for your apps.
If that's off the table, then the question is how much inconsistencies are you willing to tolerate.
If you can stand a few seconds of inconsistencies, your best bet might be to put a TTL (time-to-live) on your in-memory config. You'll still have the watch on consul but you combine it with evicting your in-memory cache every few seconds, as a fallback in case the watch fails (or stalls) for some reason. This should give you a worst-case few seconds inconsistencies (depending on the value you set for your TTL), but normal case (I think) should be fast.
If that's not acceptable (does downloading the zip take a lot of time, maybe?), you can go down the route you mentioned. To update a value atomically you can use their cas (check-and-set) operation. It will give you an error if an update had happened between the time you sent the request and the time consul tried to apply it. Then you need to pull the list of machines, and apply your change again and retry (until it succeeds).
I don't see why you would need 2 directories, but maybe I'm misunderstanding the question: when your app starts, before you do anything else, you check if there's a new config and if there is you download it and load it to memory. So you shouldn't have a "default config" if you want to be consistent. After you downloaded the config on startup, you're up and alive. When your watch signals a key change you can download the config to directly override your old config. This is assuming you're running the watch triggered code on a single thread, so you're not going to be downloading the file multiple times in parallel. If the download failed, it's not like you're going to load the corrupt file to your memory. And if you crashed mid-download, then you'll download again on startup, so should be fine.

Local database replication from multiple devices to a single external database

I have a bunch of devices that process images from cameras. Many cameras can be connected to a single device. Device has an unique serial number.
Each image is stored on local filesystem and associated metadata is written to local database (with path to the file).
I'd like to move data from the device to external server. This means that files are moved to media server and new rows from database are moved to the external database.
That external database should aggregate data from many local ones.
I'd like also to synchronize both replication, this means that I don't want a new row to appear in the external database before file is accessible on media server.
Are there any solutions that allows for such synchronized replication or do I need to write some service that will do that?
I am open to suggestions on file transfer protocol, but database is created in MySQL.
Before you send the data from external server, first convert it the images to BASE64 and in the server save it in physical path, after validate if exists the image and save it in the data base.

Can Redshift's COPY command copy more than one table?

I am looking to move some MySQL databases to the cloud in Amazon Redshift. Currently I am creating a Python script to convert the tables to CSVs, encrypt them, put them in S3, then COPY the data into Redshift. However, the way it is set up I would have to copy the data one table at a time. I have read that you can split your data into multiple files and upload them in parallel, however I believe this is still only for loading data into one table. Is there a way to use COPY on multiple tables at once? Having to copy data over from each table individually seems very inefficient.
All of your statements are correct.
The COPY command can load from multiple files in parallel (in fact, that is recommended because it can then spread the load job across multiple nodes), but it only loads on table per COPY command.
You could connect to Redshift via multiple sessions and run a COPY command in each session to load multiple tables simultaneously (but be careful of the impact on production users).
If you are wishing to migrate data from an on-premises database to Amazon Redshift, consider using:
AWS Schema Conversion Tool
AWS Database Migration Service
The Database Migration Service can even perform on-going updates of Redshift whenever data is updated in the source database.

EC2 suitability for synching large CSV files from an FTP

I have to execute a task twice per week. The task consists on fetching a 1.4GB csv file from a public ftp server. Then I have to process it (apply some filters, discard some rows, make some calculations) and then synch it to a Postgres database hosted on AWS RDS. For each row I have to retrieve a SKU entry on the database and determine wether it needs an update or not.
My question is if EC2 could work as a solution for me. My main concern is the memory.. I have searched for some solutions https://github.com/goodby/csv which handle this issue by fetching row by row instead of pulling it all to memory, however they do not work if I try to read the .csv directly from the FTP.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the issue of the csv size and memory limitations?
You wont be able to stream the file directly from FTP, instead, you are going to copy the entire file and store it locally. Using curl or ftp command is likely the most efficient way to do this.
Once you do that, you will need to write some kind of program that will read the file a line at a time or several if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but its not a very efficient choice for this type of work and your parallelization options are limited.
Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
FTP protocol doesn't do "streaming". You cannot read file from Ftp chunks by chunk.
Honestly, downloading the file and trigger run a bigger instance is not a big deal if you only run twice a week, you just choose r3.large (it cost less than 0.20/hour ), execute ASAP and stop it. The internal SSD disk space should give you the best possible I/O compare to EBS.
Just make sure your OS and code are deployed inside EBS for future reuse(unless you have automated code deployment mechanism). And you must make sure RDS will handle the burst I/O, otherwise it will become bottleneck.
Even better, using r3.large instance, you can split the CSV file into smaller chunks, load them in parallel, then shutdown the instance after everything finish. You just need to pay the minimal root EBS storage cost afterwards.
I will not suggest lambda if the process is lengthy, since lambda is only mean for short and fast processing (it will terminate after 300 seconds).
(update):
If you open up a file, the simple ways to parse it is read it sequentially, it may not put the whole CPU into full use. You can split up of CSV file follow reference this answer here.
Then using the same script, you can call them simultaneously by sending some to the background process, example below show putting python process in background under Linux.
parse_csvfile.py csv1 &
parse_csvfile.py csv2 &
parse_csvfile.py csv3 &
so instead single file sequential I/O, it will make use of multiple files. In addition, splitting the file should be a snap under SSD.
So I made it work like this.
I used Python and two great libraries. First of all I created a Python code to request and download the csv file from the FTP so I could load it to the memory. The first package is Pandas, which is a tool to analyze large amounts of data. It includes methods to read files from a csv easily. I used the included features to filter and sort. I filtered the large csv by a field and created about 25 new smaller csv files, which allowed me to deal with the memory issue. I used as well Eloquent which is a library inspired by Laravel's ORM. This library allows you to create a connection using AWS public DNS, database name, username and password and make queries using simple methods, without writing a single Postgres query. Finally I created a T2 micro AWS instance, installed Pandas and Eloquent updated my code and that was it.

cassandra sstableloader load data from csv with various partition keys

I want to load a large CSV file to my cassandra cluster (1 node at this moment).
Basing on: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
My data is transformed by CQLSSTableWriter to SSTables files, then I use SSTableLoader to load that SSTables to a cassandra table already containing some data.
That CSV file contains various partition keys.
Now lets assume that multi-node cassandra cluser is used.
My questions:
1) Is the loading procedure that I use correct in case of multinode cluster?
2) Will that SSTable files be splitted by SSTableLoader and send to nodes responsible for the specific partition keys?
Thank you
1) Loading into a single-node cluster or 100-node cluster is the same. The only difference is that the data will be distributed around the ring if you have a multi-node cluster. The node where you run sstableloader becomes the coordinator (as #rtumaykin already stated) and will send the writes to the appropriate nodes.
2) No. As in my response above, the "splitting" is done by the coordinator. Think of the sstableloader utility as just another instance of a client sending writes to the cluster.
3) In response to your follow-up question, the sstableloader utility is not sending files to nodes but sending writes of the rows contained in those SSTables. The sstableloader reads the data and sends write requests to the cluster.
Yes
It will be actually done by the coordinator node, not by the SSTableLoader.