Error while querying Hive table (Twitter data uploaded using Flume) - JSON

I am trying to analyse Twitter data using Cloudera. Currently I am able to stream Twitter data into HDFS via Flume, but when I try to query the data from the Hive table using SQL I get the following exception:
java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
Does this mean that the data was loaded into Hive but cannot be queried or was it not loaded into Hive at all?
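A count query forces Hive to actually read the files in the table's directory, so a diagnostic sketch like the one below should tell the two cases apart (assuming the my_tweets table created further down):
SELECT COUNT(*) FROM my_tweets;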
My flume.conf file is
TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS
#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = FileChannel
TwitterAgent.sources.Twitter.consumerKey = nmmRpbWjQPAViWlJLjkJuq7mO
TwitterAgent.sources.Twitter.consumerSecret = *****
TwitterAgent.sources.Twitter.accessToken = *****
TwitterAgent.sources.Twitter.accessTokenSecret = *****
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100
#TwitterAgent.sources.Twitter.keywords = Canada, TTC,ttc, Toronto, Free, and, Apache,city, City, Hadoop, Mapreduce, hadooptutorial, Hive, Hbase, MySql
TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume-ng/checkpoint/
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume-ng/data/
I have added the JAR file "hive-serdes-1.0-SNAPSHOT.jar":
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
My .avsc file is at '/home/cloudera/twitterDataAvroSchema.avsc' and contains the code below:
{"type":"record",
"name":"Doc",
"doc":"adoc",
"fields":[{"name":"id","type":"string"},
{"name":"user_friends_count","type":["int","null"]},
{"name":"user_location","type":["string","null"]},
{"name":"user_description","type":["string","null"]},
{"name":"user_statuses_count","type":["int","null"]},
{"name":"user_followers_count","type":["int","null"]},
{"name":"user_name","type":["string","null"]},
{"name":"user_screen_name","type":["string","null"]},
{"name":"created_at","type":["string","null"]},
{"name":"text","type":["string","null"]},
{"name":"retweet_count","type":["long","null"]},
{"name":"retweeted","type":["boolean","null"]},
{"name":"in_reply_to_user_id","type":["long","null"]},
{"name":"source","type":["string","null"]},
{"name":"in_reply_to_status_id","type":["long","null"]},
{"name":"media_url_https","type":["string","null"]},
{"name":"expanded_url","type":["string","null"]}
]
}
I used the command below to create the Hive table:
CREATE TABLE my_tweets
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///home/cloudera/twitterDataAvroSchema.avsc') ;
I used the following command to load data into the Hive table:
LOAD DATA INPATH '/user/hive/warehouse/tweets/FlumeData.*' OVERWRITE INTO TABLE my_tweets;
=== output ===
Loading data to table robin.my_tweets
Table robin.my_tweets stats: [numFiles=1, numRows=0, totalSize=421380, rawDataSize=0]
OK
Time taken: 1.928 seconds
I got the following error when running SQL against the table:
hive> select user_location from robin.my_tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
Time taken: 1.247 seconds
I am using Cloudera (version=2.6.0-cdh5.5.0).
Any assistance on this issue is appreciated.
Thanks
Robin

Related

Is there a faster way to upload data from R to MySQL?

I am using the following code to upload a new table into a MySQL database.
library(RMySQL)
library(RODBC)
con <- dbConnect(MySQL(),
user = 'user',
password = 'pw',
host = 'amazonaws.com',
dbname = 'db_name')
dbSendQuery(con, "CREATE TABLE table_1 (
var_1 VARCHAR(50),
var_2 VARCHAR(50),
var_3 DOUBLE,
var_4 DOUBLE);
")
channel <- odbcConnect("db name")
sqlSave(channel, dat = df, tablename = "tb_name", rownames = FALSE, append = TRUE)
The full data set is 68 variables and 5 million rows. It is taking over 90 minutes to upload 50 thousand rows to MySQL. Is there a more efficient way to upload the data to MySQL? I originally tried dbWriteTable(), but this would result in an error message saying the connection to the database was lost.
Consider a CSV export from R for an import into MySQL with LOAD DATA INFILE:
...
write.csv(df, "/path/to/filename.csv", row.names=FALSE)
dbSendQuery(con, "LOAD DATA LOCAL INFILE '/path/to/filename.csv'
INTO TABLE mytable
FIELDS TERMINATED by ','
ENCLOSED BY '"'
LINES TERMINATED BY '\\n'")
You could try to disable the MySQL query log:
dbSendQuery(con, "SET GLOBAL general_log = 'off'")
I can't tell if your mysql user account has the appropriate permissions to do that, or if it conflicts with your business needs.
Off the top of my head: otherwise you could try to send the data in, say, 1000-row batches, using a for loop in your R script (sketched below), and maybe set verbose = TRUE in your call to sqlSave.
If you send the data in a single batch, MySQL might try to run the INSERT as a single transaction ("all-or-nothing"), and if it fails it goes into recovery or just fails after inserting some random number of rows.
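A minimal sketch of that batching idea, assuming the df, channel and "tb_name" from the question (the 1000-row chunk size is arbitrary):
batch_size <- 1000
n <- nrow(df)
for (start in seq(1, n, by = batch_size)) {
  end <- min(start + batch_size - 1, n)
  # append each slice separately instead of one huge insert
  sqlSave(channel, dat = df[start:end, ], tablename = "tb_name",
          rownames = FALSE, append = TRUE, verbose = TRUE)
}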

odoo 9 migrate binary field db to filestore

In an Odoo 9 custom module, the attachment=True parameter was added to a binary field later on, so from then on new records are stored in filesystem (filestore) storage.
The old records were created before attachment=True was used, so no entry was created for them in the ir.attachment table and nothing was saved to the filesystem.
I would like to know how to migrate the binary field values of those old records into filestore storage. How can I create/insert rows in ir_attachment based on the old records' binary field values? Is any script available?
You have to include the Postgres bin path in pg_path in your configuration file. This will restore the filestore that contains the binary fields:
pg_path = D:\fx\upsynth_Postgres\bin
I'm sure that you no longer need a solution to this as you asked 18 months ago, but I have just had the same issue (many gigabytes of binary data in the database) and this question came up on Google so I thought I would share my solution.
When you set attachment=True the binary column will remain in the database, but the system will look in the filestore instead for the data. This left me unable to access the data from the Odoo API so I needed to retrieve the binary data from the database directly, then re-write the binary data to the record using Odoo and then finally drop the column and vacuum the table.
Here is my script, which is inspired by this solution for migrating attachments, but this solution will work for any field in any model and reads the binary data from the database rather than from the Odoo API.
import xmlrpclib
import psycopg2
username = 'your_odoo_username'
pwd = 'your_odoo_password'
url = 'http://ip-address:8069'
dbname = 'database-name'
model = 'model.name'
field = 'field_name'
dbuser = 'postgres_user'
dbpwd = 'postgres_password'
dbhost = 'postgres_host'
conn = psycopg2.connect(database=dbname, user=dbuser, password=dbpwd, host=dbhost, port='5432')
cr = conn.cursor()
# Get the uid
sock_common = xmlrpclib.ServerProxy ('%s/xmlrpc/common' % url)
uid = sock_common.login(dbname, username, pwd)
sock = xmlrpclib.ServerProxy('%s/xmlrpc/object' % url)
def migrate_attachment(res_id):
    # 1. get the binary data directly from the database
    cr.execute("SELECT %s from %s where id=%s" % (field, model.replace('.', '_'), res_id))
    data = cr.fetchall()[0][0]
    # 2. re-write the attachment through the Odoo API
    if data:
        data = str(data)
        sock.execute(dbname, uid, pwd, model, 'write', [res_id], {field: str(data)})
        return True
    else:
        return False
# SELECT attachments:
records = sock.execute(dbname, uid, pwd, model, 'search', [])
cnt = len(records)
print cnt
i = 0
for res_id in records:
    att = sock.execute(dbname, uid, pwd, model, 'read', res_id, [field])
    status = migrate_attachment(res_id)
    print 'Migrated ID %s (attachment %s of %s) [Contained data: %s]' % (res_id, i, cnt, status)
    i += 1
cr.close()
print "done ..."
Afterwards, drop the column and vacuum the table in psql.
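For example, a sketch of that last step in psql (model_name and field_name stand for the placeholders used in the script above):
-- free the space previously held by the binary column
ALTER TABLE model_name DROP COLUMN field_name;
VACUUM FULL model_name;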

Error: JsonStorage in Pig local mode

I am running my Pig script in local mode in Eclipse. When I try to store the output using JsonStorage, I get the following exception:
Exception in thread "main" java.lang.RuntimeException: Cannot instantiate:org.apache.pig.builtin.JsonStorage
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:473)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:4976)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3473)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1351)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:706)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.PigServer.registerScript(PigServer.java:407)
at com.paypal.debugpig.DebugPig.main(DebugPig.java:13)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.pig.builtin.JsonStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:458)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:470)
... 14 more
Pig script:
REGISTER C:/path/to/jar/pig.jar;
REGISTER C:/path/to/jar/UpperUDf/UpperUDf_fat.jar;
A = LOAD 'C:/path/to/data/file/student.txt' using PigStorage('\t') AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name) ,age, gpa ;
Store B into 'output_student_Json' using org.apache.pig.builtin.JsonStorage();
When I dump or store the output in a text file it works, but the issue occurs when I try to store it in JSON format.
Any pointers appreciated
Thank you
I have verified it, and it works for me if I use the line below to store the output in JSON file format.
store B into 'json_output' using JsonStorage();
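For reference, here is a sketch of the question's script with only the store line changed (it assumes the same jars and paths as above):
REGISTER C:/path/to/jar/pig.jar;
REGISTER C:/path/to/jar/UpperUDf/UpperUDf_fat.jar;
A = LOAD 'C:/path/to/data/file/student.txt' USING PigStorage('\t') AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name), age, gpa;
STORE B INTO 'json_output' USING JsonStorage();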

Unicode character with flume

I'm trying to put a CSV file into HDFS using Flume; the file also contains some Unicode characters.
Once the file is in HDFS I tried to view the content, but I am unable to see the records properly.
File content
Name age sal msg
Abc 21 1200 Lukè éxample àpple
Xyz 23 1400 er stîget ûf mit grôzer
Output in console
I did hdfs dfs -get /flume/events/csv/events.1234567
Below is the output
Name,age,sal,msg
Abc,21,1200,Luk��xample��pple
Xyz,23,1400,er st�get �f mit gr�zer
Does Flume support Unicode characters? If not, how can this be handled?
Yes, Flume does support Unicode characters. You can read your Unicode file using Flume and transfer the data to HDFS. This looks like some other issue. Change hdfs.fileType to DataStream and see if you can read the output properly.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/user/shashi/unicode/french.txt
a1.sources.r1.restart = true
#sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.fileType = DataStream
#channel
a1.channels.c1.type = memory
#connect
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The configuration above is a sample that I have used.
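One more thing worth checking (an assumption about the existing files, not part of this config): if they were written with the sink's default fileType (SequenceFile), hdfs dfs -text decodes the container format, whereas -cat or -get shows raw header bytes mixed into the text. For example:
hdfs dfs -text /flume/events/csv/events.1234567 | head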

How can I get flume-ng to store logs in JSON format?

I have a Flume consolidator that writes entries from a custom log to an S3 bucket in AWS.
The problem I am having is that it is not storing them in JSON format. I am using flume-ng (Flume 1.2.0), as I have upgraded from flume-og (really just Flume 0.9.4-cdh3u3). When I was using flume-og, it defaulted to moving logs in JSON format without any params set. Is it possible for flume-ng to parse the log and write it in JSON format?
Any help is much appreciated. Thank you
My setup config is below:
agent.sources = source1
agent.sinks = sink1
agent.channels = channel1
agent.sources.source1.type = netcat
agent.sources.source1.bind = localhost
agent.sources.source1.port = 4555
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = s3://KEY:SECRET#BUCKET/flume/apache/incoming
agent.sinks.sink1.hdfs.filePrefix = log-file-
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 1000
agent.channels.channel1.transactionCapacity = 100
agent.sources.source1.channels = channel1
agent.sinks.sink1.channel = channel1