I am using Spark 2.3.1 with PySpark on AWS EMR.
I am getting memory errors:
Container killed by YARN for exceeding memory limits
Consider boosting spark.yarn.executor.memoryOverhead
My input is 160 files, each approximately 350-400 MB, and each is a gzip-compressed CSV.
To read the csv.gz files (with a wildcard) I use this PySpark code:
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
    format="csv", sep="^", inferSchema="false", header="false",
    multiLine="true", quote="^", nullValue="~", schema="id string,....")
To save the data frame I use this (PySpark)
(dfgz
.write
.partitionBy("yyyymm")
.mode("overwrite")
.format("parquet")
.option("path", "s3://mybucket/mytable_parquet")
.saveAsTable("data_test.mytable")
)
One line of code to save all 160 files.
I tried this with 1 file and it works fine.
The total size of all 160 files (csv.gz) is about 64 GB.
Each file, unzipped to plain CSV, is approximately 3.5 GB. I am assuming Spark may have to unzip each file in RAM and then convert it to Parquet in RAM?
I want to convert each csv.gzip file to Parquet format i.e. I want 160 Parquet files as output (ideally).
The task runs for a while and seems to create one Parquet file for each csv.gz file. After some time it always fails with the YARN memory error.
I tried various settings for executor memory and memoryOverhead, all with no change - the job always fails. I tried memoryOverhead values from 1 to 8 GB and executor memory of 8 GB.
Apart from manually breaking the 160-file input up into many smaller workloads, what else can I do?
Do I need a Spark cluster with a total RAM capacity of much greater than 64 GB?
I use 4 slave nodes, each with 8 CPUs and 16 GB of RAM, plus one master with 4 CPUs and 8 GB of RAM.
That is (even allowing for overhead) less than the 64 GB of gzipped CSV input I am trying to process, but the files are evenly sized at 350-400 MB, so I don't understand why Spark is throwing memory errors. It could easily process these one file at a time per executor, discard each, and move on to the next file, but it does not appear to work this way. I suspect it is trying to load all the input csv.gz files into memory, but I have no way of knowing (I am still new to Spark 2.3.1).
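For reference, one variation I am considering (a sketch only - the repartition count of 160 is an illustrative guess, not a tuned value) is to repartition the DataFrame after the read, so the Parquet write works on many smaller tasks rather than one task per gzip file:
# Sketch: dfgz is the DataFrame read above; repartition(160) is illustrative.
(dfgz
 .repartition(160)
 .write
 .partitionBy("yyyymm")
 .mode("overwrite")
 .format("parquet")
 .option("path", "s3://mybucket/mytable_parquet")
 .saveAsTable("data_test.mytable"))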
Late Update: I managed to get it to work with following memory config:
4 slave nodes, each 8 CPU and 16 GB of RAM
1 master node, 4 CPU and 8 GB of RAM:
spark maximizeResourceAllocation false
spark-defaults spark.driver.memoryOverhead 1g
spark-defaults spark.executor.memoryOverhead 2g
spark-defaults spark.executor.instances 8
spark-defaults spark.executor.cores 3
spark-defaults spark.default.parallelism 48
spark-defaults spark.driver.memory 6g
spark-defaults spark.executor.memory 6g
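(For reference, the executor-side settings can equivalently be passed when building the SparkSession rather than via spark-defaults - a sketch only; the app name is made up and the numbers simply mirror the list above.)
from pyspark.sql import SparkSession

# Sketch: executor-side settings mirroring the spark-defaults above.
# Driver memory/overhead generally must be set before the driver JVM starts
# (spark-defaults or spark-submit), so they are not repeated here.
spark = (SparkSession.builder
         .appName("csv_gz_to_parquet")  # illustrative name
         .config("spark.executor.memory", "6g")
         .config("spark.executor.memoryOverhead", "2g")
         .config("spark.executor.instances", "8")
         .config("spark.executor.cores", "3")
         .config("spark.default.parallelism", "48")
         .enableHiveSupport()
         .getOrCreate())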
Needless to say - I cannot explain why this config worked!
Also, this took over 2 hours to process 64 GB of gzipped data, which seems slow even for a small 4+1 node cluster with a total of 32+4 CPUs and 64+8 GB of RAM. Perhaps S3 was the bottleneck...
FWIW I just did not expect to micro-manage a database cluster for memory, disk I/O or CPU allocation.
Update 2:
I just ran another load on the same cluster with the same config - a smaller load of 129 files of the same sizes - and it failed with the same YARN memory errors.
I am very disappointed with Spark 2.3.1 memory management.
Thank you for any guidance
Related
What is a reasonable time to load a CSV file into a 2-dimensional array in memory, where the number of columns is fixed (406) and the number of rows is about 87,000? In Perl it takes about 12 seconds from either a hard disk (SATA) or an SSD. Other languages are fine if the speed can be greatly improved.
I expected the time to be much less!
Size on disk of the referenced CSV file is 302MB!
A snippet of the relevant Perl is below:
while ($iline = <$CSVFILE>)
{
    chomp($iline);
    @csv_values = split /,/, $iline;
    # Create a HASH key from $csv_values[0], which is the CODE/label!
    $hashname = $csv_values[0];
    $Greeks{$hashname} = [@csv_values]; # Create the reference & copy the array!
}
In the above, the majority of the time is consumed by the "split" and by adding the new hash key.
I tried a similar test in Python (not my strong suit), and the performance was much, much worse!
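For comparison, a minimal Python version of the same loop would look roughly like this (a sketch of the approach, assuming the same comma-separated layout with the CODE/label in the first column; the file name is illustrative):
# Sketch: mirrors the Perl loop - split each line on commas, key a dict on column 0.
greeks = {}
with open("data.csv") as csvfile:
    for iline in csvfile:
        csv_values = iline.rstrip("\n").split(",")
        greeks[csv_values[0]] = csv_values  # key on the CODE/label column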
FYI: the CPU is an Intel 3.2 GHz i7-3930K with 32 GB RAM and a 64-bit OS (Windows 10), for the referenced performance.
Thanks for constructive ideas!
I'm using Apache Hive.
I created a table in Hive (similar to an external table) and loaded data into it using the command LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih;
While I am able to retrieve simple data from the Hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc.), Hive gives me errors when I ask for anything involving even slight computation (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive CLI output is shown below:
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs / actions beyond the last line of "Kill Command"
I am not sure where I have gone wrong; many answers on Stack Overflow tend to point back to YARN configs (my environment config is detailed below).
I have the log as well but it contains more than 30000 characters (Stack Overflow limit)
My hadoop environment is configured as follows -
1 Name Node and 1 Data Node. Each has 20 GB of RAM with sufficient disk space. I have allocated 13 GB of RAM to each of yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb, with mapreduce.map.memory.mb set to 4 GB and mapreduce.reduce.memory.mb set to 12 GB. The number of reducers is currently set to the default (-1). Also, Hive is configured to run with a MySQL DB (rather than Derby).
You should set appropriate values for the properties shown in your trace,
e.g. edit the properties in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>67108864</value>
</property>
Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error as the number of reducers is missing in Hive config.
Try setting it using the command below:
hive> SET mapreduce.job.reduces=XX;
As per official documentation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (< no. of nodes > * < no. of maximum containers per node >).
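To make that rule of thumb concrete, here is a quick back-of-the-envelope calculation (the node and container counts are made-up, illustrative values):
# Sketch: reducer-count rule of thumb with illustrative numbers.
nodes = 1
max_containers_per_node = 3
low = 0.95 * nodes * max_containers_per_node   # ~2.85, round to 3 reducers
high = 1.75 * nodes * max_containers_per_node  # ~5.25, round to 5 reducers
print(round(low), round(high))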
I managed to get Hive and MR to work - increased the memory configurations for all the processes involved:
Increased the RAM allocated to YARN Scheduler and maximum RAM allocated to the YARN Nodemanager (in yarn-site.xml), alongside increasing the RAM allocated to the Mapper and Reducer (in mapred-site.xml).
I also incorporated parts of the answers by @Sathiyan S and @vmorusu - I set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through its heuristics).
I'm running the Caffe library on a GeForce GTX Titan X with CUDA 7.5 (Ubuntu 14). I'm not sure whether Caffe is properly configured for my setup. My dataset consists of images of 256 x 256 pixels (3 channels), with 100,000 training / 10,000 test samples. For the very first test I'm using AlexNet with new_height=256, new_width=256, crop_size=227. Running 1000 training iterations on one Titan X with batch_size=256 takes about 17 minutes... Isn't that too slow for this hardware?
Any help and advice is appreciated!
Running 1000 iterations on a batch of 256 images:
(256 height * 256 width * 256 batch size * 1000 iterations * 3 channels) bytes / (1024 * 1024 bytes per MB * (17 * 60) seconds) = roughly 47 MB/s of data consumed.
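The same arithmetic as a quick Python check:
# Back-of-the-envelope check of the ~47 MB/s figure above.
bytes_read = 256 * 256 * 3 * 256 * 1000  # height * width * channels * batch * iterations
seconds = 17 * 60
mb_per_s = bytes_read / (1024 * 1024) / seconds
print(round(mb_per_s))  # ~47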
The following may improve the performance:
If the original images are of higher resolution, try preprocessing them to 256x256, which greatly reduces the pixel reads from the hard disk.
Compile Caffe with the cuDNN flag. This may lead to a ~30% improvement in speed.
Try creating an LMDB dataset of the input set and use the LMDB data for training.
Try using an SSD instead of a SATA hard disk.
No, it is not. Check out this link for Caffe performance and hardware configuration.
I'm trying to import data from a .csv file into Cassandra 3.2.1 via the COPY command. The file contains only 299 rows with 14 columns. I get the error:
Failed to import 299 rows: InvalidRequest - code=2200 [Invalid query] message="Batch too large"
I used the following COPY command and tried to increase the batch size:
copy table (Col1,Col2,...) from 'file.csv'
with delimiter=';' and header=true and MAXBATCHSIZE=5000;
I think 299 rows are not too much to import into Cassandra, or am I wrong?
Adding the CHUNKSIZE keyword resolved the problem for me.
e.g.
copy event_stats_user from '/home/kiren/dumps/event_stats_user.csv' with CHUNKSIZE=1;
The error you're encountering is a server-side error message saying that the size (in terms of byte count) of your batch insert is too large.
This batch size is defined in the cassandra.yaml file:
# Log WARN on any batch size exceeding this value. 5kb per batch by default.
# Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
If you insert many large columns you can reach this threshold quickly. Try reducing MAXBATCHSIZE to 200.
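A rough way to see why even a small file can trip the limit (a Python sketch; the 250-byte average row size is an assumed, illustrative value since the real row width isn't known):
# Sketch: how many rows fit in one batch under the 50 KB fail threshold?
fail_threshold_bytes = 50 * 1024
avg_row_bytes = 250  # assumption for illustration
rows_per_batch = fail_threshold_bytes // avg_row_bytes
print(rows_per_batch)  # ~204 rows per batch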
More info on COPY options here
I am new to Cassandra and have just set up a Cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on it. However, I found that all the data is stored on one node (in the output below I have manually replaced IP addresses with node numbers):
Datacenter: 105
==========
Address  Rack  Status  State   Load       Owns      Token
                                                     4
node-1   155   Up      Normal  249.89 KB  100.00%   0
node-2   155   Up      Normal  265.39 KB  0.00%     1
node-3   155   Up      Normal  262.31 KB  0.00%     2
node-4   155   Up      Normal  98.35 KB   0.00%     3
node-5   155   Up      Normal  113.58 KB  0.00%     4
In their cassandra.yaml files I use all default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list the non-IP-address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "node-1, node-2"
internode_compression: none
All nodes use the same seeds.
Could you tell me where I might have gone wrong in the config? Please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
If you are starting with Cassandra 1.2.8 you should try using the vnodes feature. Instead of setting initial_token, uncomment # num_tokens: 256 in cassandra.yaml, and leave initial_token blank or comment it out. Then you don't have to calculate token positions: each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes also means you don't have to "rebalance" your cluster every time you add or remove nodes.
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
Your token assignment is the problem here. The assigned token determines a node's position in the ring and the range of data it stores. When you generate tokens, the aim is to spread them across the entire range from 0 to (2^127 - 1). Tokens aren't IDs as in a MySQL cluster, where you increment them sequentially.
There is a tool on git that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of tokens, and if you want to understand the meaning of the numbers that are generated, check this article out.
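If you do want to keep manually assigned tokens rather than vnodes, they should be spread evenly across the partitioner's range. A small Python sketch of the usual calculation for a 5-node cluster over the 0 to 2^127 - 1 range mentioned above:
# Sketch: evenly spaced initial_token values for 5 nodes (RandomPartitioner range).
num_nodes = 5
tokens = [i * (2 ** 127) // num_nodes for i in range(num_nodes)]
for node, token in enumerate(tokens, start=1):
    print("node-%d initial_token: %d" % (node, token))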
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you use DESCRIBE KEYSPACE x in cqlsh you'll see what replication_factor is currently set for your keyspace (I assume the answer is 1).
More details here