I have a stress-test script in JMeter. The problem is that the data I will be using is so large that I needed to split it into multiple CSV files.
Is it possible in JMeter to switch to another CSV file (the data source) once the current file has reached its last row?
Example:
I have 1 million rows in a CSV; at run time, when the iterations reach the last of those rows, JMeter should switch to a file with new data.
You can have multiple CSV Data Set Config elements with different variable names, e.g. id, id1, id2.
Set Recycle on EOF? to False:
Recycle on EOF? Should the file be re-read from the beginning on reaching EOF? (default is true)
When a file reaches its end, the variable is set to the literal value <EOF>, so check for "${id}" == "<EOF>" and override id / use ${id1} instead.
Example:
if ("<EOF>".equals(vars.get("Email")){
if ("<EOF>".equals(vars.get("Email2")){
vars.put("Email",vars.get("Email3"));
vars.put("Password",vars.get("Password3));
} else {
vars.put("Email",vars.get("Email2"));
vars.put("Password",vars.get("Password2));
}
}
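If you have more than one fallback file, the same idea can be generalized in a single JSR223 PreProcessor. A minimal Groovy sketch, assuming CSV Data Set Configs that define Email/Password, Email2/Password2 and Email3/Password3 (the suffixes are an assumption), all with Recycle on EOF set to False:
// Pick the first Email/Password pair whose CSV file is not yet exhausted
for (suffix in ['', '2', '3']) {                 // assumed variable suffixes, one per CSV Data Set Config
    def email = vars.get('Email' + suffix)
    if (email != null && email != '<EOF>') {
        vars.put('Email', email)
        vars.put('Password', vars.get('Password' + suffix))
        break
    }
}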
I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that whenever a new file is uploaded, it appears in the same folder as an additional file; the only way to tell which is the newest version is the last-modified date.
Every version of the data is stored in my external system; this takes up unnecessary storage unless I frequently go in and delete old files.
All of this adds unnecessary complexity to my downstream system; I just want to be able to pull the latest version of the data in a single step.
This is possible by renaming the single parquet file in the dataset so that it always has the same file name; that way, the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file; as pointed out in the question, calling .coalesce(1) (or .repartition(1)) in the upstream transform is necessary.
If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate, as only the latest version is kept and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
from . import utils


@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging
log = logging.getLogger(__name__)
def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
    """Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of
    the single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
    Magritte and want a human-readable name in the output; when not using separate transaction folders, this should
    cause the previous output to be automatically overwritten.

    The input to this function must contain a single ".snappy.parquet" file; this can be achieved by calling
    `.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.

    This function should not be used for large dataframes (e.g. those greater than 512 MB in size); instead,
    transaction folders should be enabled in the export. This function can work for larger sizes, but you may find
    you need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.

    This produces a dataset without a schema, so features like expectations can't be used.

    Parameters:
        output (Output): The transforms output to write the single custom-named ".snappy.parquet" file to; this is
            the dataset you want to export
        input (Input): The transforms input containing the data to be written to output; this must contain only one
            ".snappy.parquet" file (it can contain other files, for example logs)
        file_name (str): The name of the file to be written; ".snappy.parquet" will be automatically appended if not
            already there, and ".snappy" or ".parquet" will be corrected to ".snappy.parquet"

    Raises:
        RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
        RuntimeError: Input dataset file system cannot be empty.

    Returns:
        void: writes the response to output, no return value
    """
    output.set_mode("replace")  # Make sure it is snapshotting

    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files

    if len(input_files) > 1:
        raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")

    input_file_path = input_files[0]

    log.info("Initial output file name: " + file_name)
    # check for snappy.parquet and append if needed
    if file_name.endswith(".snappy.parquet"):
        pass  # if it is already correct, do nothing
    elif file_name.endswith(".parquet"):
        # if it ends with ".parquet" (and not ".snappy.parquet"), remove ".parquet" and append ".snappy.parquet"
        file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
    elif file_name.endswith(".snappy"):
        # if it ends with just ".snappy" then append ".parquet"
        file_name = file_name + ".parquet"
    else:
        # if it doesn't end with any of the above, add ".snappy.parquet"
        file_name = file_name + ".snappy.parquet"
    log.info("Final output file name: " + file_name)

    with input.filesystem().open(input_file_path, "rb") as in_f:  # open the input file
        with output.filesystem().open(file_name, "wb") as out_f:  # open the output file
            shutil.copyfileobj(in_f, out_f)  # write the file into a new file
You can also use the rewritePaths functionality of the export plugin to rename the file under spark/*.snappy.parquet to "export.parquet" while exporting. This of course only works if there is only a single file, so .coalesce(1) in the transform is a must:
excludePaths:
  - ^_.*
  - ^spark/_.*
rewritePaths:
  '^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true
I ran into the same requirement; the only difference was that the dataset needed to be split into multiple parts due to its size. Posting the code here, along with how I updated it to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input, file_name_prefix: str):
    """
    Slight improvement to allow multiple output files to be renamed
    """
    output.set_mode("replace")  # Make sure it is snapshotting

    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files

    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")

    print(f'input files {input_files}')
    print("prefix for target name: " + file_name_prefix)

    for i, f in enumerate(input_files):
        with input.filesystem().open(f, "rb") as in_f:  # open the input file
            with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f:  # open the output file
                shutil.copyfileobj(in_f, out_f)  # write the file into a new file
Also, to use this in a Code Workbook the input needs to be persisted and the output parameter can be retrieved as shown below.
def rename_outputs(persisted_input):
    output = Transforms.get_output()
    rename_multiple_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")
I'm using v5.1.1 of JMeter and attempting to use the "CSV Data Set Config". The file is read correctly as I can tell from the Debug Sampler/Results Tree, but the file is not being read line by line. In other words, it reads the first line and never proceeds to the next line for processing.
I would like to use the data inside the CSV to iterate over a series of HTTP Requests to an external API. I currently have a single thread with only the "CSV Data Set Config" and "HTTP Request".
Do I need to wrap this with a ForEach Controller or another looping construct? Perhaps I'm missing it, but I don't see anything in the documentation that would indicate it's necessary.
Thanks
You don't need to wrap this in a ForEach Controller. The first line in the CSV file provides the variable names:
Let's say your csv file looks like
foo, bar
1, John
2, George
3, Laura
And you use an HTTP Request sampler,
then ${foo} and ${bar} will be iterated sequentially. However, please make sure you are mindful of the CSV Data Set Config options. The following options work OK for me:
By default, CSV Data Set Config doesn't trigger any "looping"; it reads the next line from the CSV file for each thread (virtual user) on each iteration.
So if you want to see more values from the CSV file, either add more users or more loops (or both).
Given
This CSV file:
line1
line2
line3
The following CSV Data Set Config setup:
And the following Thread Group setup:
You will get the following values (using the __threadNum() function to visualize the current virtual user number and the ${__jm__Thread Group__idx} pre-defined variable to show the current Thread Group iteration):
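As a concrete sketch (not taken from the screenshots above): assuming a Thread Group with 1 thread and 3 loop iterations, and Recycle on EOF set to False, each iteration reads the next line, so the output would be:
Thread  Iteration  Value
1       1          line1
1       2          line2
1       3          line3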
Check out the JMeter Parameterization - The Complete Guide article for more information on various approaches to parameterizing JMeter tests using external data sources.
I need to create a test data preparation script and capture JSON response data to a CSV file.
In the actual test, I need to read parameters from the CSV file.
Is there any possibility of saving the entire JSON data as a field in the CSV file, or do I need to extract each field and save it to the CSV file?
The main issue is that JSON contains commas. You can overcome this by saving the JSON to a file and using a different delimiter instead of a comma, for example #.
Then read the file using a CSV Data Set Config with # as the Delimiter:
Delimiter to be used to split the records in the file. If there are fewer values on the line than there are variables the remaining variables are not updated - so they will retain their previous value (if any).
Also, you can save the JSON in every row and then read the data using a different delimiter such as #.
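As an illustration, a JSR223 PostProcessor (Groovy) sketch that appends the whole JSON response to a #-delimited file; the testdata.csv path and the userId variable are assumptions, not something from the original question:
// Append one record per sample: <userId>#<raw JSON body of the parent sampler>
// Note: assumes the JSON comes back on a single line; pretty-printed JSON would need to be compacted first
def record = vars.get('userId') + '#' + prev.getResponseDataAsString()
new File('testdata.csv') << record << System.getProperty('line.separator')
The resulting file can then be consumed in the actual test by a CSV Data Set Config with # as the Delimiter, as described above.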
You can save entire JSON response into a JMeter Variable by adding a Regular Expression Extractor as a child of the HTTP Request sampler which returns JSON and configuring it like:
Name of created variables: anything meaningful, i.e. response
Regular Expression: (?s)(^.*)
Template: $1$
Then you need to declare this response as a Sample Variable by adding the following line to the user.properties file:
sample_variables=response
And finally, you can use the Flexible File Writer plugin to store the response variable in a file; if you don't have any other Sample Variables, you should use variable#0.
I've a csv like this:
NAME;F1;F2;
test1;field1;field2
test2;field1;field2
test3;field1;field2
I would like to test only test1, so I would change the CSV to
ID;F1;F2;
test1;field1;field2
#test2;field1;field2
#test3;field1;field2
How can I skip the rows for test2 and test3 in JMeter?
There is always a way to do something...
Maybe my way is not the best or the prettiest, but it works!
Thread Group
Loop Controller
CSV Data Set Config
If Controller
HTTP Request
Inside the If Controller I added this code:
${__groovy(vars.get('ID').take(1)!='#')}
This way, when you put a # at the start of a row, it will be skipped.
I hope it can be helpful for someone.
You cannot; the only option I can think of is creating a new CSV file out of the existing one with just the first 2 lines, like:
Add setUp Thread Group to your Test Plan
Add JSR223 Sampler to the setUp Thread Group
Put the following code into "Script" area
new File('original.csv').readLines().take(2).each { line ->
    new File('new.csv') << line << System.getProperty('line.separator')
}
Replace original.csv with path to the current CSV file and set up CSV Data Set Config to use new.csv
The above code will write the first 2 lines from original.csv into new.csv, so you will be able to access limited external data instead of the full CSV file.
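If the goal is specifically to honour the # comments from the question rather than just keeping the first 2 lines, the same setUp Thread Group approach can filter instead of truncate. A rough Groovy sketch, assuming the same original.csv / new.csv file names:
// Copy only the lines that are not commented out with '#'
def lines = new File('original.csv').readLines().findAll { line -> !line.startsWith('#') }
def target = new File('new.csv')
target.text = ''                                   // start from an empty file on every run
lines.each { line -> target << line << System.getProperty('line.separator') }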
More information:
File.readLines()
Collection.take()
The Groovy Templates Cheat Sheet for JMeter
Is it possible for each thread to select the same row from the CSV file?
E.g. I have 5 users and only 5 records (rows) in my CSV file. In each iteration, the 1st row from the CSV should be assigned to User1, and similarly for the other users.
User1: myID1,pass1,item1,product1
User2: myID2,pass2,item2,product2
User3: myID3,pass3,item3,product3
User4: myID14,pass4,item4,product4
User5: myID15,pass5,item5,product5
.
.
Any solution, please?
If you have only 5 threads and 5 lines in the CSV, I would suggest considering switching to User Parameters instead of working with a CSV.
If your CSV file can have more than 5 lines, your test can have more than 5 virtual users, and a requirement like "user 1 takes line 1" is a must, you will have to pre-load the CSV file into memory with a scripting test element like the Beanshell Sampler:
Add setUp Thread Group to your Test Plan (with 1 thread and 1 iteration)
Add Beanshell Sampler and put the following code into "Script" area:
import org.apache.commons.io.FileUtils;
List lines = FileUtils.readLines(new File("test.csv"));
bsh.shared.lines = lines;
The above code will read the contents of the test.csv file (replace it with a relative or full path to your CSV file) and store it in the bsh.shared namespace.
Add Beanshell PreProcessor as a child of the request where you need to use the values from the CSV file and put the following code into "Script" area:
int user = ctx.getThreadNum();
String line = bsh.shared.lines.get(user);
String[] tokens = line.split(",");
vars.put("ID", tokens[0]);
vars.put("pass", tokens[1]);
vars.put("item", tokens[2]);
vars.put("product", tokens[3]);
The above code will fetch the line from the list stored in the bsh.shared namespace, based on the current virtual user number, split it by comma, and store the values into JMeter Variables so you will be able to access them as:
${ID}
${pass}
${item}
${product}
See How to Use BeanShell: JMeter's Favorite Built-in Component guide for more information on using Beanshell scripting in JMeter tests.
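As a side note, JSR223 elements with Groovy are generally recommended over Beanshell for performance; a hedged Groovy equivalent of the two scripts above, using props (JMeter properties) instead of the Beanshell-specific bsh.shared namespace:
JSR223 Sampler in the setUp Thread Group:
// Read the CSV once and share the lines with all threads via JMeter properties
props.put('csvLines', new File('test.csv').readLines())
JSR223 PreProcessor as a child of the request:
// Thread numbers are zero-based, so virtual user 1 gets the first line of the file
def tokens = props.get('csvLines')[ctx.getThreadNum()].split(',')
vars.put('ID', tokens[0])
vars.put('pass', tokens[1])
vars.put('item', tokens[2])
vars.put('product', tokens[3])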