How do I get Hive to print nicely formatted results, with column names and pleasant spacing, the way mysql does? For example:
$ hive -f performanceStatistics.hql
...
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201306211023_1053
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
2013-09-04 17:30:56,092 Stage-1 map = 0%, reduce = 0%
2013-09-04 17:31:03,132 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 13.87 sec
...
MapReduce Total cumulative CPU time: 2 minutes 5 seconds 260 msec
Ended Job = job_201306211023_1053
MapReduce Jobs Launched:
Job 0: Map: 8 Reduce: 1 Cumulative CPU: 125.26 sec HDFS Read: 1568029694 HDFS Write: 93 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 5 seconds 260 msec
OK
19.866045211878546 0.023310810810810812 10 0 824821 25 1684.478659112734 0.16516737901191694
Time taken: 34.324 seconds
How do I get the results with column names and good spacing? I would also like an extended (vertical) view, like \G in MySQL or \x in PostgreSQL.
Use
set hive.cli.print.header=true;
to print column names [1].
As for the spacing, the output is already tab-separated, so how you process it further is up to you (for instance, by piping it through column, as sketched below).
[1] https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-CommandLineInterface
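For example, a minimal sketch of piping the tab-separated output through column to align it (my_table and the query are placeholders, and this assumes the standard column utility is available on the machine running the CLI):
hive -e "set hive.cli.print.header=true; select * from my_table limit 10" | column -t -s $'\t'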
You can now also use the Beeline command-line tool, which outputs data in a nicely formatted table. [0]
Should you want vertical output, like MySQL's \G, you can set --outputformat=vertical (see the sketch below).
[0] https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93NewCommandLineShell
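A rough sketch of a Beeline invocation (the JDBC URL and query are placeholders for your own HiveServer2 endpoint and table):
beeline -u jdbc:hive2://localhost:10000 --outputformat=vertical -e "select * from my_table limit 5"
Without --outputformat, Beeline defaults to a bordered table layout, which already gives you column names and aligned spacing.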
Related
I'm using Apache Hive.
I created a table in Hive (similar to an external table) and loaded data into it using the LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih; command.
While I am able to retrieve simple data from the hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc), Hive gives me errors when I ask for data involving slight computations (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive CLI output is shown below:
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs or actions beyond the "Kill Command" line.
I'm not sure where I have gone wrong; many answers on Stack Overflow tend to point back to YARN configs (my environment config is detailed below).
I have the full log as well, but it contains more than 30,000 characters (the Stack Overflow limit).
My hadoop environment is configured as follows -
1 Name Node and 1 Data Node, each with 20 GB of RAM and sufficient disk space. I have allocated 13 GB to both yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb, with mapreduce.map.memory.mb set to 4 GB and mapreduce.reduce.memory.mb set to 12 GB. The number of reducers is currently set to the default (-1). Also, Hive is configured to run with a MySQL database (rather than Derby).
You should set appropriate values for the properties shown in your trace,
e.g. by editing them in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>67108864</value>
</property>
Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error because the number of reducers is missing from the Hive config.
Try setting it with the command below:
hive> SET mapreduce.job.reduces=XX;
As per the official documentation: the right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
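As a rough worked example with the numbers from the question (1 data node, roughly 13 GB per NodeManager, 4 GB map containers, so about 3 containers per node), the formula gives 0.95 * (1 * 3) ≈ 3; this is only an illustration, not a tuned value:
# illustration only: 0.95 * (1 node * ~3 containers per node) ≈ 3 reducers
hive -e "SET mapreduce.job.reduces=3; SELECT DISTINCT add_user FROM adih;"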
I managed to get Hive and MapReduce to work by increasing the memory configuration for all the processes involved:
I increased the RAM allocated to the YARN scheduler and the maximum RAM allocated to the YARN NodeManager (in yarn-site.xml), alongside increasing the RAM allocated to the mappers and reducers (in mapred-site.xml).
I also incorporated parts of the answers by @Sathiyan S and @vmorusu: I set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through its heuristics).
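For reference, a minimal sketch of passing the reducer-size setting per query from the command line (1073741824 bytes is the 1 GB mentioned above; the query is just the one from the question):
hive --hiveconf hive.exec.reducers.bytes.per.reducer=1073741824 -e "SELECT DISTINCT add_user FROM adih;"
The memory limits themselves (yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.reduce.memory.mb) live in yarn-site.xml and mapred-site.xml, and changes to yarn-site.xml require restarting the YARN daemons.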
I'm trying to import data from a .csv file into Cassandra 3.2.1 via the COPY command. The file has only 299 rows with 14 columns. I get the error:
Failed to import 299 rows: InvalidRequest - code=2200 [Invalid query] message="Batch too large"
I used the following COPY command and tried to increase the batch size:
copy table (Col1,Col2,...) from 'file.csv'
with delimiter =';' and header = true and MAXBATCHSIZE = 5000;
I think 299 rows is not too much to import into Cassandra, or am I wrong?
Adding the CHUNKSIZE keyword resolved the problem for me.
e.g.
copy event_stats_user from '/home/kiren/dumps/event_stats_user.csv' with CHUNKSIZE=1;
The error you're encountering is a server-side error message saying that the size (in terms of byte count) of your batch insert is too large.
This batch size is defined in the cassandra.yaml file:
# Log WARN on any batch size exceeding this value. 5kb per batch by default.
# Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
If you insert a lot of big columns (in size), you may quickly reach this threshold. Try reducing MAXBATCHSIZE to 200.
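For instance, a hedged sketch of the same COPY with a smaller batch size, run through cqlsh (the table and column names are the placeholders from the question):
cqlsh -e "copy table (Col1,Col2,...) from 'file.csv' with delimiter=';' and header=true and MAXBATCHSIZE=200;"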
More info on COPY options here
I have a CSV file with two columns:
no. of packet    size
1 60
2 70
3 400
4 700
.
.
.
1000000 60
where the first column is the number of the packet and the second column is the size of the packet in bytes.
The total number of packets in the CSV file is one million. I need to plot a histogram for this data file with:
xrange = [0, 5, 10, 15]
which denotes the packet size in bytes: bin [0] denotes packets smaller than 100 bytes, bin [5] denotes packets smaller than 500 bytes, and so on.
yrange = [10, 100, 10000, 100000000],
which denotes the number of packets.
Any help will be highly appreciated.
I don't quite remember exactly how this works, but the commands given in my Gnuplot in Action book for creating a histogram are:
bin(x,s) = s*int(x/s)
plot "data-file" using (bin($1,0.1)):(1./(0.1*300)) smooth frequency with boxes
I believe smooth frequency is the command that's important to you, and you need to figure out what the using argument should be (possibly with a different function used).
This should do the job:
# binning function for arbitrary ranges, change as needed
bin(x) = x<100 ? 0 : x<500 ? 5 : x<2500 ? 10 : 15
# every occurrence is counted as (1)
plot datafile using (bin($2)):(1) smooth freq with boxes
I'm not really sure what you mean by "yrange [10 100 1000 ...]"; do you want a log-scaled ordinate?
Then just
set xrange [1:1e6]
set logscale y
before plotting.
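Putting the two snippets together, here is a minimal sketch of a complete run (the file name packets.csv is an assumption; add set datafile separator "," before the plot command if the file really is comma-separated):
gnuplot <<'EOF'
bin(x) = x<100 ? 0 : x<500 ? 5 : x<2500 ? 10 : 15   # size bins from the answer above
set logscale y                                      # log-scale the packet counts
set boxwidth 4
plot 'packets.csv' using (bin($2)):(1) smooth freq with boxes title 'packets per size bin'
EOF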
About 2 months ago, I imported EnWikipedia data (http://dumps.wikimedia.org/enwiki/20120211/) into MySQL.
After finishing the import, I have been creating indexes on the tables of the EnWikipedia database in MySQL for about 2 months.
Now, I have reached the point of creating the index on "pagelinks".
However, it seems to take forever to get past that point.
Therefore, I estimated the time remaining to check whether my intuition was correct.
As a result, the expected time remaining is 60 days (assuming that I create the index on "pagelinks" again from the beginning).
My EnWikipedia database has 7 tables:
"categorylinks"(records: 60 mil, size: 23.5 GiB),
"langlinks"(records: 15 mil, size: 1.5 GiB),
"page"(records: 26 mil, size 4.9 GiB),
"pagelinks"(records: 630 mil, size: 56.4 GiB),
"redirect"(records: 6 mil, size: 327.8 MiB),
"revision"(records: 26 mil, size: 4.6 GiB) and "text"(records: 26 mil, size: 60.8 GiB).
My server is:
Linux 2.6.32-5-amd64 (Debian 2.6.32-39), 16 GB of memory, 2.39 GHz Intel 4-core CPU
Is it a common phenomenon for creating an index to take so many days?
Does anyone have a good solution for creating the index more quickly?
Thanks in advance!
P.S.: I performed the following operations to check the time remaining.
Reference (sorry, the following page is written in Japanese): http://d.hatena.ne.jp/sh2/20110615
1st. I counted the records in "pagelinks".
mysql> select count(*) from pagelinks;
+-----------+
| count(*) |
+-----------+
| 632047759 |
+-----------+
1 row in set (1 hour 25 min 26.18 sec)
2nd. I measured how many records were written per minute.
getHandler_write.sh
#!/bin/bash
while true
do
cat <<_EOF_
SHOW GLOBAL STATUS LIKE 'Handler_write';
_EOF_
sleep 60
done | mysql -u root -p -N
command
$ sh getHandler_write.sh
Enter password:
Handler_write 1289808074
Handler_write 1289814597
Handler_write 1289822748
Handler_write 1289829789
Handler_write 1289836322
Handler_write 1289844916
Handler_write 1289852226
3rd. I computed the write speed.
According to the result of step 2, the write speed is about
7233 records/minute.
4th. The time remaining is therefore
(632047759 / 7233) / 60 / 24 ≈ 60 days
Those are pretty big tables, so I'd expect the indexing to be pretty slow; 630 million records is a LOT of data to index. One thing to look at is partitioning: with data sets that large, performance will be very slow without correctly partitioned tables. Here are some useful links:
using partitioning on slow indexes. You could also try looking at the buffer size settings for building the indexes (the default is 8 MB, so for your large table that's going to slow you down a fair bit): buffer size documentation.
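As one hedged example of the buffer-size idea (this assumes MyISAM tables, a database named enwiki, and the standard MediaWiki pagelinks columns, all of which may differ in your import), raising the session sort buffer to 256 MB, well above the 8 MB default, before building the index:
mysql -u root -p enwiki -e "SET SESSION myisam_sort_buffer_size = 268435456; ALTER TABLE pagelinks ADD INDEX pl_from_idx (pl_from);"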
I'm currently developing a practice application in Node.js. The application consists of a JSON REST web service that offers two operations:
Insert log (a PUT request to /log, with the message to log)
Last 100 logs (a GET request to /log, which returns the latest 100 logs)
The current stack is a Node.js server holding the application logic and a MongoDB database taking care of persistence. To expose the JSON REST web services I'm using the node-restify module.
I'm currently executing some stress tests using Apache Bench (5000 requests with a concurrency of 10) and get the following results:
Execute stress tests
1) Insert log
Requests per second: 754.80 [#/sec] (mean)
2) Last 100 logs
Requests per second: 110.37 [#/sec] (mean)
I'm surprised by the difference in performance; the query I'm executing uses an index. Interestingly, in the deeper tests I have performed, the JSON output generation seems to take most of the time.
Can node applications be profiled in detail?
Is this behaviour normal? Should retrieving data take so much longer than inserting data?
EDIT:
Full test information
1) Insert log
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Server Software: log-server
Server Hostname: localhost
Server Port: 3010
Document Path: /log
Document Length: 0 bytes
Concurrency Level: 10
Time taken for tests: 6.502 seconds
Complete requests: 5000
Failed requests: 0
Write errors: 0
Total transferred: 2240634 bytes
Total PUT: 935000
HTML transferred: 0 bytes
Requests per second: 768.99 [#/sec] (mean)
Time per request: 13.004 [ms] (mean)
Time per request: 1.300 [ms] (mean, across all concurrent requests)
Transfer rate: 336.53 [Kbytes/sec] received
140.43 kb/s sent
476.96 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 3
Processing: 6 13 3.9 12 39
Waiting: 6 12 3.9 11 39
Total: 6 13 3.9 12 39
Percentage of the requests served within a certain time (ms)
50% 12
66% 12
75% 12
80% 13
90% 15
95% 24
98% 26
99% 30
100% 39 (longest request)
2) Last 100 logs
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Server Software: log-server
Server Hostname: localhost
Server Port: 3010
Document Path: /log
Document Length: 4601 bytes
Concurrency Level: 10
Time taken for tests: 46.528 seconds
Complete requests: 5000
Failed requests: 0
Write errors: 0
Total transferred: 25620233 bytes
HTML transferred: 23005000 bytes
Requests per second: 107.46 [#/sec] (mean)
Time per request: 93.057 [ms] (mean)
Time per request: 9.306 [ms] (mean, across all concurrent requests)
Transfer rate: 537.73 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 28 93 16.4 92 166
Waiting: 26 85 18.0 86 161
Total: 29 93 16.4 92 166
Percentage of the requests served within a certain time (ms)
50% 92
66% 97
75% 101
80% 104
90% 113
95% 121
98% 131
99% 137
100% 166 (longest request)
Retrieving data from the database
To query the database I use the mongoosejs module. The log schema is defined as:
{
date: { type: Date, 'default': Date.now, index: true },
message: String
}
and the query I execute is the following:
Log.find({}, ['message']).sort('date', -1).limit(100)
Can node applications be profiled in detail?
Yes. Use node --prof app.js to create a v8.log, then use linux-tick-processor, mac-tick-processor or windows-tick-processor.bat (in deps/v8/tools in the node src directory) to interpret the log. You have to build d8 in deps/v8 to be able to run the tick processor.
Here's how I do it on my machine:
apt-get install scons
cd ~/development/external/node-0.6.12/deps/v8
scons arch=x64 d8
cd ~/development/projects/foo
node --prof app.js
D8_PATH=~/development/external/node-0.6.12/deps/v8 ~/development/external/node-0.6.12/deps/v8/tools/linux-tick-processor > profile.log
There are also a few tools to make this easier, including node-profiler and v8-profiler (with node-inspector).
Regarding your other question, I would like some more information on how you fetch your data from Mongo, and what the data looks like (I agree with beny23 that it looks like a suspiciously low amount of data).
I strongly suggest taking a look at the DTrace support of Restify. It will likely become your best friend when profiling.
http://mcavage.github.com/node-restify/#DTrace