Unicode characters with Flume - CSV

I'm trying to put a CSV file into HDFS using Flume; the file also contains some Unicode characters.
Once the file is in HDFS I tried to view the content, but I was unable to see the records properly.
File content
Name age sal msg
Abc 21 1200 Lukè éxample àpple
Xyz 23 1400 er stîget ûf mit grôzer
Output in console
I did hdfs dfs -get /flume/events/csv/events.1234567
Below is the output
Name,age,sal,msg
Abc,21,1200,Luk��xample��pple
Xyz,23,1400,er st�get �f mit gr�zer
Does Flume support Unicode characters? If not, how can it be handled?

Yes, Flume does support Unicode characters. You can read your Unicode file with Flume and transfer the data to HDFS, so this looks like some other issue. Change hdfs.fileType to DataStream and see if you can read the output properly.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/user/shashi/unicode/french.txt
a1.sources.r1.restart = true
#sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.fileType = DataStream
#channel
a1.channels.c1.type = memory
#connect
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The above is a sample configuration that I have used.
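To confirm whether the bytes stored in HDFS are actually corrupted or are just being rendered badly by the console, you can pull the file down with hdfs dfs -get and check it for invalid UTF-8 sequences. A minimal sketch (the local path events.1234567 is simply the copy fetched in the question; adjust as needed):
path = "events.1234567"  # the local copy fetched with `hdfs dfs -get`
with open(path, "rb") as f:  # read raw bytes, no decoding yet
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")  # succeeds if the bytes are valid UTF-8
        except UnicodeDecodeError as err:
            print(f"line {lineno}: invalid UTF-8 at byte {err.start}: {raw!r}")
If every line decodes cleanly, the data in HDFS is fine and the replacement characters are only a display problem in the console (for example a terminal locale that is not UTF-8).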

Related

Python csv data logging doesn't work in while loop

I have been trying to log the data received from an Arduino over the USB port. The strange thing is that the code works fine on my Mac, but on Windows it won't write anything. At the start I expected the initial "DATA" header, but it didn't even write that. When I commented out the entire loop, it worked (it wrote "DATA" to the csv file).
import serial

count = 1
port = serial.Serial('COM4', baudrate=9600, bytesize=8)
log = open("data_log.csv", "w")
log.write("DATA")
log.write("\n")
while 1:
    value = str(port.read(8), 'utf-8')
    value = value.replace('\r', '').replace('\n', '')
    if value.strip():
        log.write(str(count))
        log.write(',')
        log.write(value)
        log.write('\n')
        print(count)
        count += 1
    print(value)
\n = LF (Line Feed) // used as the newline character on Unix
\r = CR (Carriage Return) // used as the newline character on classic Mac OS
\r\n = CR + LF // used as the newline sequence on Windows
I think it's not working on Windows because you need to account for the CR LF pair.
You might try os.linesep (the Python counterpart of Environment.NewLine), as it resolves to the right sequence for whichever operating system the script runs on.
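For illustration, here is a minimal sketch of that idea in Python (the COM4 port name and serial settings are taken from the question). It strips both CR and LF from the incoming bytes, writes each row with an explicit OS-appropriate line ending via os.linesep, and flushes after every row so the CSV is updated while the loop is still running:
import os
import serial  # pyserial, same library as in the question

port = serial.Serial('COM4', baudrate=9600, bytesize=8)  # port name taken from the question
count = 1
with open("data_log.csv", "w", newline="") as log:  # newline="" so we control line endings ourselves
    log.write("DATA" + os.linesep)
    while True:
        value = port.read(8).decode("utf-8").replace('\r', '').replace('\n', '')
        if value.strip():
            log.write(str(count) + ',' + value + os.linesep)
            log.flush()  # push the row to disk right away
            count += 1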

Splitting a csv file into multiple files

I have a csv file of 150500 rows and I want to split it into multiple files containing 500 rows (entries) each.
I'm using Jupyter and I know how to open and read the file. However, I don't know how to specify an output_path for the newly created files that result from splitting the big one.
I have found the code below online, but since I don't know what my output_path is, I don't know how to use it. Moreover, I don't understand how the input file is specified for this block of code.
import os

def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    import csv
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)  # reader.next() in the original is Python 2 only
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
My file name is DataSet2.csv and it's in the same folder in Jupyter as the .ipynb notebook I'm running.
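For reference, a minimal sketch of how the function above could be called in this case, assuming DataSet2.csv sits in the same directory as the notebook and the pieces should be written there too (the output_name_template value is just an illustrative naming pattern):
with open('DataSet2.csv', 'r', newline='') as big_file:  # this is the "filehandler" argument
    split(
        big_file,
        row_limit=500,                                # 500 data rows per output file
        output_name_template='DataSet2_part_%s.csv',  # illustrative naming pattern
        output_path='.',                              # '.' = the directory the notebook runs in
    )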
number_of_small_files = 301
lines_per_small_file = 500
largeFile = open('large.csv', 'r')
header = largeFile.readline()
for i in range(number_of_small_files):
    smallFile = open(str(i) + '_small.csv', 'w')
    smallFile.write(header)  # This line copies the header to all small files
    for x in range(lines_per_small_file):
        line = largeFile.readline()
        smallFile.write(line)
    smallFile.close()
largeFile.close()
This will create many small files in the same directory. About 301 of them. They will be named from 0_small.csv to 300_small.csv.
Using standard unix utilities:
cat DataSet2.csv | tail -n +2 | split -l 500 --additional-suffix=.csv - output_
This pipeline takes the original file, strips off the first line with tail -n +2, and then splits the rest into 500-line chunks that are written to files whose names start with output_ and end with .csv (output_aa.csv, output_ab.csv, and so on). The '-' tells split to read from standard input rather than from a named file.

Error while querying Hive table (Twitter data uploaded using Flume)

I am trying to analyse Twitter data using Cloudera. Currently, I am able to stream Twitter data into HDFS via Flume, but I am running into issues when I try to query the data from the Hive table with SQL; I get the following exception:
java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
Does this mean that the data was loaded into Hive but cannot be queried or was it not loaded into Hive at all?
My flume.conf file is
TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS
#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = FileChannel
TwitterAgent.sources.Twitter.consumerKey = nmmRpbWjQPAViWlJLjkJuq7mO
TwitterAgent.sources.Twitter.consumerSecret = *****
TwitterAgent.sources.Twitter.accessToken = *****
TwitterAgent.sources.Twitter.accessTokenSecret = *****
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100
#TwitterAgent.sources.Twitter.keywords = Canada, TTC,ttc, Toronto, Free, and, Apache,city, City, Hadoop, Mapreduce, hadooptutorial, Hive, Hbase, MySql
TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume-ng/checkpoint/
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume-ng/data/
I have added the JAR file "hive-serdes-1.0-SNAPSHOT.jar":
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
My .avsc file is located at '/home/cloudera/twitterDataAvroSchema.avsc' and contains the code below:
{"type":"record",
"name":"Doc",
"doc":"adoc",
"fields":[{"name":"id","type":"string"},
{"name":"user_friends_count","type":["int","null"]},
{"name":"user_location","type":["string","null"]},
{"name":"user_description","type":["string","null"]},
{"name":"user_statuses_count","type":["int","null"]},
{"name":"user_followers_count","type":["int","null"]},
{"name":"user_name","type":["string","null"]},
{"name":"user_screen_name","type":["string","null"]},
{"name":"created_at","type":["string","null"]},
{"name":"text","type":["string","null"]},
{"name":"retweet_count","type":["long","null"]},
{"name":"retweeted","type":["boolean","null"]},
{"name":"in_reply_to_user_id","type":["long","null"]},
{"name":"source","type":["string","null"]},
{"name":"in_reply_to_status_id","type":["long","null"]},
{"name":"media_url_https","type":["string","null"]},
{"name":"expanded_url","type":["string","null"]}
]
}
I used the command below to create the Hive table:
CREATE TABLE my_tweets
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///home/cloudera/twitterDataAvroSchema.avsc') ;
I used the following command to load the data into the Hive table:
LOAD DATA INPATH '/user/hive/warehouse/tweets/FlumeData.*' OVERWRITE INTO TABLE my_tweets;
== output ===
Loading data to table robin.my_tweets
Table robin.my_tweets stats: [numFiles=1, numRows=0, totalSize=421380, rawDataSize=0]
OK
Time taken: 1.928 seconds
I got an error while trying to run SQL against the table:
hive> select user_location from robin.my_tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
Time taken: 1.247 seconds
I am using Cloudera (version=2.6.0-cdh5.5.0).
Any assistance on this issue is appreciated.
Thanks
Robin
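One way to narrow this down: the AvroSerDe expects Avro container files, which always begin with the 4-byte magic header Obj\x01, and this error often means the files Flume wrote are plain JSON/text rather than Avro. A minimal diagnostic sketch in Python (the FlumeData file name is a placeholder for a file copied locally with hdfs dfs -get):
path = "FlumeData.1234567890"  # placeholder: one of the sink's files copied down with `hdfs dfs -get`
with open(path, "rb") as f:
    magic = f.read(4)
if magic == b"Obj\x01":
    print("Looks like an Avro container file")
else:
    print("Not an Avro container file; first bytes:", magic)
If the check fails, the table definition (AvroSerDe) and the data written by the HDFS sink (fileType = DataStream, writeFormat = Text) do not match, which would explain the AvroRuntimeException at query time.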

How to read a gzip file with fopen in Octave 3.2.4?

I need to open a gzip file with fopen. The manual (help fopen) says to add b and z to the mode string:
[f, msg] = fopen('file.gz', 'rbz')
results in the error:
f = -1
msg =
rb and r work separately, but not with z. Do I misunderstand the manual?
An example file can be generated by
echo -e "1,2\n2,3\n3,4\n4,3\n5,5" | gzip > file.gz
The Octave version 3.2.4 is dictated by my operating system: Ubuntu 12.04.3 LTS.
function data = zcatcsvfile(filename, firstline)
  data = [];
  [status, content] = system(cstrcat('zcat ', filename, ' | tail -n +', num2str(firstline)));
  data = str2num(content);
endfunction
Use this function to read a gzipped file filename, starting at line firstline. If the file has a header of 5 lines:
data=zcatcsvfile('data.gz', 6)

How can I get flume-ng to store logs in JSON format?

I have a Flume consolidator that writes entries from a custom log to an S3 bucket in AWS.
The problem I am having is that it is not storing them in JSON format. I am using flume-ng (Flume 1.2.0), as I have upgraded from flume-og (really just Flume 0.9.4-cdh3u3). When I was using the og version, it moved logs in JSON format by default, without any params set. Is it possible for flume-ng to parse the log and store it in JSON format?
Any help is much appreciated. Thank you
My setup config is below:
agent.sources = source1
agent.sinks = sink1
agent.channels = channel1
agent.sources.source1.type = netcat
agent.sources.source1.bind = localhost
agent.sources.source1.port = 4555
agent.sinks.sink1.type=hdfs
agent.sinks.sink1.hdfs.path = s3://KEY:SECRET#BUCKET/flume/apache/incoming
agent.sinks.sink1.hdfs.filePrefix = log-file-
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 1000
agent.channels.channel1.transactionCapacity = 100
agent.sources.source1.channels = channel1
agent.sinks.sink1.channel = channel1
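One possible approach, sketched below: if the HDFS sink is also configured with hdfs.fileType = DataStream (as in the first answer above) so the event body is written through as plain text, the sender can serialize each record to JSON before it reaches the netcat source. A minimal, illustrative Python sender (the field names and records are made up; only the port matches the config above):
import json
import socket
import time

records = [
    {"host": "web01", "status": 200, "msg": "GET /index.html"},  # made-up sample records
    {"host": "web02", "status": 500, "msg": "GET /broken"},
]

with socket.create_connection(("localhost", 4555)) as sock:  # the netcat source from the config above
    for record in records:
        sock.sendall((json.dumps(record) + "\n").encode("utf-8"))  # one JSON object per line = one Flume event
        time.sleep(0.1)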