How to increase Hadoop map tasks by implementing getSplits - CSV

I want to process multiline CSV files and for that I wrote a custom CSVInputFormat.
I would like to have about 40 threads processing CSV lines on each Hadoop node. However, when I create a cluster on Amazon EMR with 5 machines (1 master and 4 core nodes), I see only 2 map tasks running, even though there are 6 available map slots.
I implemented getSplits in my InputFormat so it would behave like NLineInputFormat. I was expecting this to get more things running in parallel, but it has had no effect. I also tried setting the arguments -s,mapred.tasktracker.map.tasks.maximum=10 --args -jobconf,mapred.map.tasks=10, but to no effect.
What can I do to have lines processed in parallel? The way Hadoop is running, it's not scalable: no matter how many instances I allocate to the cluster, at most two map tasks will run.
UPDATE:
When I use an uncompressed file (rather than the zip) as the origin, more map tasks are created, about 17 for 1.3 million rows. Even so, I wonder why it isn't more, and why more mappers aren't created when the data is zipped.

Change the split size to have more splits.
Configuration conf = new Configuration();
// Set a smaller max split size to increase the number of splits.
conf.set("mapred.max.split.size", "1020");
Job job = new Job(conf, "My job name");
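Also note, regarding the update about the zipped input: common compressed formats such as zip and gzip are not splittable, so each compressed file is handled as a single split by a single mapper, no matter what getSplits returns. Below is a rough sketch of how a custom CSVInputFormat can guard against this, mirroring what TextInputFormat does (CSVRecordReader stands in for the question's multiline record reader):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CSVInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true;  // uncompressed: the default getSplits can cut it into many splits
        }
        return codec instanceof SplittableCompressionCodec;  // gzip is not splittable, bzip2 is
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new CSVRecordReader();  // placeholder for the question's multiline CSV reader
    }
}
With an uncompressed input and a smaller mapred.max.split.size as set above, the default getSplits will then produce many more splits and therefore more map tasks.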

Related

JMeter - How to run parallel execution with CSV as dataset?

What I am trying to accomplish => Run 50 threads in parallel using a CSV file as the dataset.
Here's what the CSV looks like (let's say there are 50 records):
Username,Password
user1,password1
user2,password2
...,...
user50,password50
In JMeter, when I run my test case, each thread will consume 1 record of the CSV file in parallel. By that I mean, Thread 1 takes the first record (user1,password1), Thread 2 takes the second record (user2,password2), until the last record (50 in this example). And all of that happens at the same time.
I am still new to JMeter and I would like to know if this is something that is "doable" through this tool. If it is possible, your help is greatly appreciated! :)
Given the default CSV Data Set Config setup:
each thread (virtual user) will take the next line of the CSV file on each loop (iteration)
when the last line of the CSV file is reached, it will start over from the beginning
With regards to your "at the same time" requirement: the load pattern is controlled by the Thread Group settings (number of threads, loops, ramp-up period). Depending on your setup you will have anywhere from 0 to 50 concurrently active users; you can observe it using e.g. the Active Threads Over Time listener.
If you want to send 50 requests at exactly the same moment - consider using Synchronizing Timer
This is possible using the CSV Data Set Config element. It reads the data row by row, and the username and password can be assigned to each thread. You can use the values with the following syntax
Username ${Username}
password ${Password}
Also please note that you do not have to define the variable names in the CSV Dataset Config Element as you have them in the first row of the CSV file.

Jmeter Variable from CSV

I am running a JMeter Test Plan with an HTTP Request, to test the performance of a web service.
In my test, I need a variable named REF to be changed in the body data of my HTTP Request.
REF can have 3000 values. So I have created a CSV file with all these 3000 values, a CSV Data Set Config, and a parameter in JMeter named REF, and I use it in my HTTP Request like this:
<measure>
<measureRef id="${REF}"></measureRef>
<measureTime>${__time(yyyy-MM-dd'T'HH:mm:ss)}</measureTime>
<measureVal>
<value>${__Random(1,100,)}</value>
</measureVal>
</measure>
As you can see in this XML, I use JMeter's __time and __Random functions to have different requests every time I run the test plan.
I would like to run the test plan automatically for all 3000 values of REF. To do that, I tried to configure the Thread Group like this: Number of Threads = 3000 and Loop Count = 1.
The problem is that the test takes 3 minutes to complete. So I would like to know if there is another way to do it. The thing is that I need the test to be run for all 3000 different values in my CSV, and I don't see another way to do that. I tried to put another measure in my Body Data, like this:
<measure>
<measureRef id="${REF}"></measureRef>
<measureTime>${__time(yyyy-MM-dd'T'HH:mm:ss)}</measureTime>
<measureVal>
<value>${__Random(1,100,)}</value>
</measureVal>
</measure>
<measure>
<measureRef id="${REF}"></measureRef>
<measureTime>${__time(yyyy-MM-dd'T'HH:mm:ss)}</measureTime>
<measureVal>
<value>${__Random(1,100,)}</value>
</measureVal>
</measure>
But the value of REF that is used is the same in the two measures, and what I want is to have different values picked from the CSV.
I also tried to configure the Thread Group like this: Number of Threads = 1 and Loop Count = 3000. But it's not working; I get a lot of errors...
I want the test to run quickly because, to test my web service, I'd like to inject the 3000 values every minute (maybe by using a Flow Control Action). And if the test takes 3 minutes, it would be too long, and not all the values would be tested...
Thank you for your help!
If you use 2 values in 1 request, then you can use at most 1500 virtual users without repeating your data.
If that is fine, add 2 columns to the CSV file, e.g. REF1 and REF2, each holding 1500 values. Then you can use ${REF1} and ${REF2} in your requests.
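For example, the reshaped CSV could look like this (the REF values shown are only placeholders):
REF1,REF2
REF_0001,REF_1501
REF_0002,REF_1502
...,...
REF_1500,REF_3000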
If it takes you 3 minutes to add 3000 values with 3000 users and you need this to be done in 1 minute, I can think of 2 possible causes:
Your application cannot process the requests faster. You can try increasing the number of threads, e.g. to 9000 in the Thread Group, and see whether it helps or not. If you still see 3 minutes of processing time, it's your application that is to blame.
JMeter is not capable of sending requests fast enough.
First of all make sure to follow JMeter Best Practices
Then make sure to monitor the health of the machine where JMeter is running, using e.g. the JMeter PerfMon Plugin, as JMeter must have enough headroom to operate
If a single machine cannot produce 3000 requests per minute - you will have to go for Distributed Testing

Not all executors are being used when reading JSON (gzipped .gz) in GCP from a Google Dataproc Spark cluster using spark-submit

I just got introduced to this wonderful world of Big Data and Cloud technology, using GCP (Dataproc) and PySpark. I have a ~5 GB JSON file (gzipped, .gz) containing ~5 million records, and I need to read each row and process only those rows which satisfy a certain condition. I have working code, and I issued a spark-submit with --num-executors 5, but still only one worker is used to carry out the action.
This is the spark-submit command I am using:
spark-submit --num-executors 5 --py-files /home/user/code/dist/package-0.1-py3.6.egg job.py
job.py:
path = "gs://dataproc-bucket/json-files/data_5M.json.gz"
mi = spark.read.json(path)
inf_rel = mi.select(mi.client_id,
mi.user_id,
mi.first_date,
F.hour(mi.first_date).alias('hour'),
mi.notes).rdd.map(foo).filter(lambda x: x)
inf_relevance = inf_rel.map(lambda l: Row(**dict(l))).toDF()
save_path = "gs://dataproc-bucket/json-files/output_5M.json"
inf_relevance.write.mode('append').json(save_path)
print("END!!")
Dataproc config:
(I am using the free account for now; once I get a working solution I will add more cores and executors)
(Debian 9, Hadoop 2.9, Spark 2.4)
Master node:2 vCPU, 7.50 GB memory
Primary disk size: 32 GB
5 Worker nodes: 1 vCPU, 3.75 GB memory
Primary disk size: 32 GB
After spark-submit, I can see in the web UI that 5 executors were added, but then only 1 executor remains active and performs all the tasks, while the other 4 are released.
I did my research and most of the questions talk about accessing data via JDBC.
Please suggest what I am missing here.
P.S. Eventually I will read 64 JSON files of 5 GB each, so I might use 8 cores * 100 workers.
Your best bet is to preprocess the input. Given a single input file, spark.read.json(...) will create a single task to read and parse the JSON data, as Spark cannot know ahead of time how to parallelize it. If your data is in line-delimited JSON format (http://jsonlines.org/), the best course of action is to read it as plain text, repartition it, and only then parse the JSON:
path = "gs://dataproc-bucket/json-files/data_5M.json"
# read monolithic JSON as text to avoid parsing, repartition and *then* parse JSON
mi = spark.read.json(spark.read.text(path).repartition(1000).rdd)
inf_rel = mi.select(mi.client_id,
mi.user_id,
mi.first_date,
F.hour(mi.first_date).alias('hour'),
mi.notes).rdd.map(foo).filter(lambda x: x)
inf_relevance = inf_rel.map(lambda l: Row(**dict(l))).toDF()
save_path = "gs://dataproc-bucket/json-files/output_5M.json"
inf_relevance.write.mode('append').json(save_path)
print("END!!")
Your initial step here (spark.read.text(...)) will still bottleneck as a single task. If your data isn't line-delimited or (especially!) you anticipate you will need to work with this data more than once, you should figure out a way to turn your 5 GB JSON file into 1000 5 MB JSON files before getting Spark involved.
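For instance, assuming the data is line-delimited JSON, one way to pre-split it outside of Spark (the file names, chunk size, and destination path below are only illustrative):
gunzip data_5M.json.gz
# split on line boundaries, at most ~5 MB per chunk, with 4-digit numeric suffixes
split -d -a 4 -C 5M data_5M.json part_
gsutil -m cp part_* gs://dataproc-bucket/json-files/split/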
.gz files are not splittable, so they're read by one core and placed onto a single partition.
see Dealing with a large gzipped file in Spark for reference.

Google Dataflow (Apache beam) JdbcIO bulk insert into mysql database

I'm using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk insert. I cannot find any option in the official documentation to enable bulk insert mode.
I'm wondering if it's possible to set a bulk insert mode in a Dataflow pipeline. If yes, please let me know what I need to change in the code below.
.apply(JdbcIO.<KV<Integer, String>>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withStatement("insert into Person values(?, ?)")
    .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
        public void setParameters(KV<Integer, String> kv, PreparedStatement query)
                throws SQLException {
            query.setInt(1, kv.getKey());
            query.setString(2, kv.getValue());
        }
    }));
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, I have never actually seen it write more than 1 row at a time (I have to admit: this was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
// Count words in input file(s)
.apply(new CountWords())
// Format as text
.apply(MapElements.via(new FormatAsTextFn()))
// Make key-value pairs with the first letter as the key
.apply(ParDo.of(new FirstLetterAsKey()))
// Group the words by first letter
.apply(GroupByKey.<String, String> create())
// Get a PCollection of only the values, discarding the keys
.apply(ParDo.of(new GetValues()))
// Write the words to the database
.apply(JdbcIO.<String> writeIterable()
.withDataSourceConfiguration(
JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
.withStatement(INSERT_OR_UPDATE_SQL)
.withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference from the normal write() method of JdbcIO is the new method writeIterable(), which takes a PCollection<Iterable<RowT>> as input instead of PCollection<RowT>. Each Iterable is written as one batch to the database.
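For reference, the FirstLetterAsKey and GetValues transforms used above are not shown in the snippet; here is a rough sketch of what they could look like as static nested classes inside the pipeline class (assumed implementations based on the comments in the pipeline, not code from the linked example; the usual DoFn and KV imports are assumed):
// Assumed implementation: key each formatted word-count line by its first letter.
static class FirstLetterAsKey extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String line = c.element();
        c.output(KV.of(line.isEmpty() ? "" : line.substring(0, 1), line));
    }
}

// Assumed implementation: keep only the grouped values and discard the keys,
// so that writeIterable() receives a PCollection<Iterable<String>>.
static class GetValues extends DoFn<KV<String, Iterable<String>>, Iterable<String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().getValue());
    }
}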
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)

Can you create a disk and an instance with one command in Google Compute Engine?

Currently, I'm creating a disk from a snapshot. Then I wait for 60 seconds and create an instance which will use that disk as its system disk. I'm using the gcloud utility for this.
Is there any way I can create the disk and the instance in one command?
Mix of copy-pasted Python code and pseudocode below:
cmd_create_disk = [GCLOUD, 'compute', 'disks', 'create', new_instance,
                   '--source-snapshot', GCE_RENDER_SNAPSHOT_VERSION,
                   '--zone', GCE_REGION, '--project', GCE_PROJECT]
# wait for 60 seconds
cmd_make_instance = [GCLOUD, 'compute', 'instances', 'create', new_instance,
                     '--disk', 'name='+new_instance+',boot=yes,auto-delete=yes',
                     '--machine-type', instance_type, '--network', GCE_NETWORK,
                     '--no-address', '--tags', 'render', '--tags', 'vpn',
                     '--tags', proj_tag, '--zone', GCE_REGION,
                     '--project', GCE_PROJECT]
The instance uses the disk as its system disk. Waiting for 60 seconds is quite arbitrary and I'd rather leave this up to GCE, making sure the instance is indeed started with the system disk.
When you delete an instance you can specify that the disk should also get deleted. In the same manner, I'd like to create an instance and specify the disk to be created from an image.
The boot disk can be created automatically. You can specify the image to use for it with the --image and --image-project flags on the gcloud compute instances create command line. You'll need to make sure to create the image first, though: your current command creates the disk from a snapshot rather than an image.
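For example, once an image has been created from the snapshot, the disk and instance can be created in a single call along these lines (the instance, image, project, machine type, and zone names below are placeholders):
gcloud compute instances create my-render-instance \
    --image my-render-image --image-project my-project \
    --boot-disk-auto-delete \
    --machine-type n1-standard-4 \
    --zone europe-west1-b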