Hive query does not begin MapReduce process after starting job and generating Tracking URL - mysql

I'm using Apache Hive.
I created a table in Hive (similar to external table) and loaded data into the same using the LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih; command.
While I am able to retrieve simple data from the Hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc.), Hive gives me errors when I ask for anything involving even slight computation (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive CLI output is as follows:
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs / actions beyond the last line of "Kill Command"
Not sure where I have gone wrong; many answers on Stack Overflow tend to point back to the YARN configs (my environment config is detailed below).
I have the log as well, but it contains more than 30,000 characters (the Stack Overflow limit).
My Hadoop environment is configured as follows:
1 Name Node and 1 Data Node, each with 20 GB of RAM and sufficient disk space. I have allocated 13 GB each to yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb, with mapreduce.map.memory.mb set to 4 GB and mapreduce.reduce.memory.mb set to 12 GB. The number of reducers is currently left at the default (-1). Also, Hive is configured to run with a MySQL metastore DB (rather than Derby).

You should set appropriate values for the properties shown in your trace,
e.g. edit the properties in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>67108864</value>
</property>
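The other properties from the trace can be set the same way; for example (the value below is just Hive's usual default for this setting, shown for illustration rather than tuned for your cluster):
<property>
  <name>hive.exec.reducers.max</name>
  <value>1009</value>
</property>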

Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error because the number of reducers is missing from the Hive config.
Try setting it with the command below:
hive> SET mapreduce.job.reduces=XX;
As per official documentation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (< no. of nodes > * < no. of maximum containers per node >).
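As a quick worked example (the node and container counts here are purely illustrative, not taken from the question): with 2 worker nodes that can each run 4 containers, 0.95 * 2 * 4 ≈ 7, so you would try SET mapreduce.job.reduces=7;.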

I managed to get Hive and MR to work by increasing the memory configurations for all the processes involved:
Increased the RAM allocated to YARN Scheduler and maximum RAM allocated to the YARN Nodemanager (in yarn-site.xml), alongside increasing the RAM allocated to the Mapper and Reducer (in mapred-site.xml).
Also incorporated parts of the answers by @Sathiyan S and @vmorusu - set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through its heuristics).
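For reference, these are the properties involved; the values below are placeholders that show the shape of the change rather than the exact numbers I ended up with:
<!-- yarn-site.xml -->
<property><name>yarn.nodemanager.resource.memory-mb</name><value>16384</value></property>
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>16384</value></property>
<!-- mapred-site.xml -->
<property><name>mapreduce.map.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>8192</value></property>
<!-- hive-site.xml (or SET in the CLI): 1 GB per reducer -->
<property><name>hive.exec.reducers.bytes.per.reducer</name><value>1073741824</value></property>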

Kafka Consumer - How to set fetch.max.bytes higher than the default 50mb?

I want my consumers to process large batches, so I aim to have the consumer listener "awake", say, on 1800mb of data or every 5min, whichever comes first.
Mine is a Kafka Spring Boot application; the topic has 28 partitions, and this is the configuration I explicitly changed:
Parameter                  Value I set   Default value   Why I set it this way
fetch.max.bytes            1801mb        50mb            fetch.min.bytes + 1mb
fetch.min.bytes            1800mb        1b              desired batch size
fetch.max.wait.ms          5min          500ms           desired cadence
max.partition.fetch.bytes  1801mb        1mb             unbalanced partitions
request.timeout.ms         5min+1sec     30sec           fetch.max.wait.ms + 1sec
max.poll.records           10000         500             1500 found too low
max.poll.interval.ms       5min+1sec     5min            fetch.max.wait.ms + 1sec
Nevertheless, when I produce ~2gb of data to the topic, I see the consumer listener (a Batch Listener) being called many times per second -- way more than the desired rate.
I logged the serialized-size of the ConsumerRecords<?,?> argument, and found that it is never more than 55mb.
This hints that I was not able to set fetch.max.bytes above the default 50mb.
Any idea how I can troubleshoot this?
Edit:
I found this question: Kafka MSK - a configuration of high fetch.max.wait.ms and fetch.min.bytes is behaving unexpectedly
Is it really impossible as stated?
Finally found the cause.
There is a broker-side fetch.max.bytes setting, and it defaults to 55mb. I had only changed the consumer configuration, unaware of the broker-side limit.
See also the Kafka KIP and the actual commit.
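For completeness, a minimal sketch of the broker-side change; fetch.max.bytes is the broker property in question, but the value below is illustrative (roughly 1800 MB), not a recommendation:
# server.properties on each broker
fetch.max.bytes=1887436800
Restart the brokers (or roll the change through the cluster) for it to take effect.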

Sharing big binary data between processes

I have a big binary of IP data, about Xmb. Processes run a search algorithm over the binary to look up IP addresses. I see three options:
1. Put it in ETS, but I suppose every read access will copy the big binary into the process. :(
2. Put it in gen_server state and have processes call gen_server:call to get an address. The shortcoming is concurrency.
3. Compile the binary into a BEAM module, but when I compile I get:
eheap_alloc: Cannot allocate 1318267840 bytes of memory (of type "heap")
What is the best practice for sharing big data in Erlang?
Binaries over 64 bytes in size are stored as reference counted binaries and their data is stored outside the heap of any process. If such a binary is sent to any process, the underlying data is not duplicated. So, if you store such a binary in an ETS table and then access it from various processes, the underlying data will not be copied, only its reference count will be incremented/decremented. I'd suggest going with the ETS table solution.
Here's a demonstration of the memory usage at boot, after inserting a 100MB binary into an ETS table, and after fetching a copy of the binary into the shell process. The memory usage does not change after we have a copy of the binary stored in the shell process. The same would not be true if it were a million-character string (a list of integers) that we were copying in from ETS or another process.
1> erlang:memory().
[{total,21912472},
{processes,5515456},
{processes_used,5510816},
{system,16397016},
{atom,223561},
{atom_used,219143},
{binary,844872},
{code,4808780},
{ets,301232}]
2> ets:new(foo, [named_table, set]).
foo
3> ets:insert(foo, {foo, binary:copy(<<".">>, 104857600)}).
true
4> erlang:memory().
[{total,127038632},
{processes,5600320},
{processes_used,5599952},
{system,121438312},
{atom,223561},
{atom_used,220445},
{binary,105770576},
{code,4908097},
{ets,308416}]
5> X = ets:lookup(foo, foo).
[{foo,<<"........................................................................................................"...>>}]
6> erlang:memory().
[{total,127511632},
{processes,6082360},
{processes_used,6081992},
{system,121429272},
{atom,223561},
{atom_used,220445},
{binary,105761504},
{code,4908097},
{ets,308416}]
You can find a lot more info about how to work efficiently with binaries in the Erlang efficiency guide.

Cassandra .csv import error: batch too large

I'm trying to import data from a .csv file into Cassandra 3.2.1 via the COPY command. The file contains only 299 rows with 14 columns. I get the error:
Failed to import 299 rows: InvalidRequest - code=2200 [Invalid query] message="Batch too large"
I used the following COPY command and tried to increase the batch size:
copy table (Col1,Col2,...) from 'file.csv'
with delimiter =';' and header = true and MAXBATCHSIZE = 5000;
I think 299 rows are not too many to import into Cassandra, or am I wrong?
Adding the CHUNKSIZE keyword resolved the problem for me.
e.g.:
copy event_stats_user from '/home/kiren/dumps/event_stats_user.csv' with CHUNKSIZE=1;
The error you're encountering is a server-side error message saying that the size (in terms of byte count) of your batch insert is too large.
This batch size is defined in the cassandra.yaml file:
# Log WARN on any batch size exceeding this value. 5kb per batch by default.
# Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
If you insert a lot of big (in size) columns you may quickly reach this threshold. Try reducing MAXBATCHSIZE to 200.
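For example, reusing the command from the question (the column list is elided exactly as in the original):
copy table (Col1,Col2,...) from 'file.csv'
with delimiter =';' and header = true and MAXBATCHSIZE = 200;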
More info on COPY options here

Kafka SparkStreaming configuration specify offsets/message list size

I am fairly new to both Kafka and Spark and trying to write a job (either Streaming or batch). I would like to read from Kafka a predefined number of messages (say x), process the collection through workers and then only start working on the next set of x messages. Basically each message in Kafka is 10 KB and I want to put 2 GB worth of messages in a single S3 file.
So is there any way of specifying the number of messages that the receiver fetches?
I have read that I can specify 'from offset' while creating DStream, but this use case is somewhat different. I need to be able to specify both 'from offset' and 'to offset'.
There's no way to set an ending offset as an initial parameter (as you can for the starting offset), but
you can use createDirectStream (the fourth overloaded version in the listing) which gives you the ability to get the offsets of the current micro batch using HasOffsetRanges (which gives you back OffsetRange).
That means that you'll have to compare the values you get from OffsetRange with your ending offset in every micro-batch in order to see where you are and when to stop consuming from Kafka.
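A rough sketch of that pattern, assuming the 0.8-style direct DStream API (with the 0.10 connector the classes live in org.apache.spark.streaming.kafka010 instead); stream and targetEndOffset below are placeholders for your own DStream and offset bookkeeping:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// `stream` is the DStream returned by KafkaUtils.createDirectStream(...)
stream.foreachRDD { rdd =>
  // offsets covered by this micro-batch, one OffsetRange per Kafka partition
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    // r.fromOffset and r.untilOffset mark where this batch starts and ends
    if (r.untilOffset >= targetEndOffset(r.partition)) {
      // this partition has reached the desired "to offset":
      // flush what you have to S3 and stop consuming
    }
  }
}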
I guess you also need to account for the fact that each partition has its own sequential offsets. It would probably be easiest to let yourself go slightly over 2GB, as much as it takes to finish the current micro-batch (which could be a couple of kB, depending on the density of your messages), in order to avoid splitting the last batch into consumed and unconsumed parts, which might require you to fiddle with the offsets that Spark keeps to track what has and hasn't been consumed.
Hope this helps.

Only one node owns data in a Cassandra cluster

I am new to Cassandra and have just set up a Cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on it. However, I found all data are stored in one node (in the output below, I have replaced IP addresses with node numbers manually):
Datacenter: 105
==========
Address   Rack   Status   State    Load        Owns      Token
                                                         4
node-1    155    Up       Normal   249.89 KB   100.00%   0
node-2    155    Up       Normal   265.39 KB   0.00%     1
node-3    155    Up       Normal   262.31 KB   0.00%     2
node-4    155    Up       Normal   98.35 KB    0.00%     3
node-5    155    Up       Normal   113.58 KB   0.00%     4
In their cassandra.yaml files, I use all default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list the non-IP-address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "node-1, node-2"
internode_compression: none
All nodes use the same seeds.
Can I know where I might do wrong in the config? And please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
If you are starting with Cassandra 1.2.8 you should try using the vnodes feature. Instead of setting the initial_token, uncomment # num_tokens: 256 in cassandra.yaml and leave initial_token blank, or comment it out. Then you don't have to calculate token positions. Each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes also means that you don't have to "rebalance" your cluster every time you add or remove nodes.
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
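The relevant cassandra.yaml change is roughly this (a sketch; leave initial_token unset):
# cassandra.yaml
# initial_token:        <- leave blank or commented out
num_tokens: 256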
Your token assignment is the problem here. The assigned token determines the node's position in the ring and the range of data it stores. When you generate tokens, the aim is to use up the entire range from 0 to (2^127 - 1). Tokens aren't IDs as in a MySQL cluster, where you increment them sequentially.
There is a tool on git that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of the tokens. And if you want to understand the meaning of the numbers that are generated check this article out.
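As a quick illustration of what such a tool computes (assuming the RandomPartitioner range mentioned above): evenly spaced tokens for N nodes are token_i = i * (2^127 / N) for i = 0..N-1, so for 5 nodes the tokens would be 0, 2^127/5, 2*2^127/5, 3*2^127/5 and 4*2^127/5, rather than 0, 1, 2, 3, 4.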
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you use DESCRIBE KEYSPACE x in cqlsh you'll see what replication_factor is currently set for your keyspace (I assume the answer is 1).
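If the keyspace already exists with replication_factor 1, you can change it in place (the keyspace name is taken from the example above) and then run nodetool repair on each node so the existing data gets replicated:
ALTER KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};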
More details here