MapReduce error when selecting column from JSON file in Cosmos - json

The problem is the following:
After creating a table with Cygnus 0.2.1, I get a MapReduce error when trying to select a column from Hive. If we look at the files created in Hadoop by Cygnus, we can see that the format used is JSON. This problem did not appear in previous versions of Cygnus, which created the Hadoop files in CSV format.
To test it, I have left two tables created, one from each format. You can compare and see the error with the following queries:
SELECT entitytype FROM fiware_ports_meteo; (it fails, created with 0.2.1 in JSON format)
SELECT entitytype FROM fiware_test_table; (it works, created with 0.2 in CSV format)
The paths to the HDFS files are, respectively:
/user/fiware/ports/meteo
/user/fiware/testTable/
I suspect the error comes from the MapReduce job parsing the JSON files, since the CSV format works as expected.
How can this issue be avoided?

You simply have to add the JSON SerDe to the Hive classpath. As a non-privileged user, you can do that from the Hive CLI:
hive> ADD JAR /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar;
If you have developed a remote Hive client, you can perform the same operation as any other query execution. Let's say you are using Java:
Statement stmt = con.createStatement();
stmt.executeQuery("ADD JAR /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar");
stmt.close();
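Note that ADD JAR only applies to the current Hive session, so a remote client has to re-issue it on every new connection before querying the JSON-backed table. A minimal session sketch using the jar path and table from above:
ADD JAR /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar;
SELECT entitytype FROM fiware_ports_meteo;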

Related

Found more columns than expected column count in Azure data factory while reading CSV stored in ADLS

I am exporting F&O D365 data to ADLS in CSV format. Now I am trying to read the CSV stored in ADLS and copy it into an Azure Synapse dedicated SQL pool table using Azure Data Factory. I can create the pipeline, and it works for a few tables without any issue, but it fails for one table (salesline) because of a mismatch in the number of columns.
Below is a sample of the CSV. There are no column names (header) in the CSV because it is exported from the F&O system; the column names are stored in the salesline.CDM.json file.
5653064010,,,"2022-06-03T20:07:38.7122447Z",5653064010,"B775-92"
5653064011,,,"2022-06-03T20:07:38.7122447Z",5653064011,"Small Parcel"
5653064012,,,"2022-06-03T20:07:38.7122447Z",5653064012,"somedata"
5653064013,,,"2022-06-03T20:07:38.7122447Z",5653064013,"someotherdata",,,,test1, test2
5653064014,,,"2022-06-03T20:07:38.7122447Z",5653064014,"parcel"
5653064016,,,"2022-06-03T20:07:38.7122447Z",5653064016,"B775-92",,,,,,test3
I have created an ADF pipeline using the Copy data activity to copy the data from ADLS (CSV) to the Synapse SQL table; however, I am getting the error below.
Operation on target Copy_hs1 failed: ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'SALESLINE_00001.csv' with row number 4: found more columns than expected column count 6.,Source=Microsoft.DataTransfer.Common,'
In the column mapping, only 6 columns appear when importing the schema, because the first row of the CSV has 6 columns.
I reproduced the issue with your sample data and got the same error while copying the file using the Copy data activity.
Alternatively, I tried to copy the file using a data flow and was able to load the data without any errors:
Source dataset: only the first 6 columns are read, as the first row of the file contains only 6 columns.
Source transformation: connect the source dataset in the source transformation.
Sink transformation: connect the sink to the Synapse dataset, then configure the settings and mappings.
After running the data flow, the data is loaded into the sink Synapse table.
Changing my CSV to XLSX helped me solve this problem in the Copy activity in ADF.
1. From the Copy data settings, set "Fault tolerance" = "Skip incompatible rows".
2. From the dataset connection settings, set the escape character to double quote (").

Hive 3.x causing error for compressed (bz2) json in external table

I have some JSON data (about 60 GB) that I have to load into a Hive external table. I am using Hive 3.x with Hadoop 3.x. The schema of the table is as follows:
CREATE TABLE people(a string, liid string, link string, n string, t string, e string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE LOCATION '/data/db/';
I have also loaded the jar for serde as follows:
ADD JAR /usr/hive/lib/hive-hcatalog-core-3.1.2.jar;
If I copy (or load) a small plain-text JSON file, DML queries (SELECT, etc.) work fine. Since the data file is very large, I compressed it (20 GB now) and loaded the compressed file into the Hive table created above.
hive> select * from people;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Field name expected
Time taken: 0.096 seconds
hive>
It works fine with uncompressed data. What is the issue here?
I have tried some solutions like this one, but without success.
I found the solution myself. The actual issue was that two columns in the JSON are arrays; they should be mapped to ARRAY in Hive. The sample I used to derive the schema did not contain these arrays. Changing the field type to array<string> for the affected column solved my issue.
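For illustration, a sketch of what the corrected DDL might look like. Which column actually holds an array depends on the data; the link column is used here purely as a hypothetical example:
CREATE TABLE people(
  a string,
  liid string,
  link array<string>,  -- hypothetical: the column that holds a JSON array
  n string,
  t string,
  e string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE LOCATION '/data/db/';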

AWS Athena output result.json to s3 - CREATE TABLE AS / INSERT INTO SELECT?

Is it anyhow possible to write the results of an AWS Athena query to a results.json within an s3 bucket?
My first idea was to use INSERT INTO SELECT ID, COUNT(*) ... or INSERT OVERWRITE, but this does not seem to be supported according to Amazon Athena DDL Statements and tdhopper's blog post.
Is it anyhow possible to CREATE TABLE with new data with AWS Athena?
Is there any work around with AWS Glue?
Is it anyhow possible to trigger a Lambda function with the results of Athena?
(I'm aware of S3 Hooks)
I would not mind overwriting the whole JSON file/table and always creating a new JSON, since the statistics I aggregate are very limited.
I do know that AWS Athena automatically writes the results to an S3 bucket as CSV. However, I want to do simple aggregations and write the output directly to a public S3 bucket so that an Angular single-page application in the browser can read it. Thus the JSON format and a specific path are important to me.
My workaround uses Glue: use the Athena JDBC driver to run the query and load the result into a DataFrame, then save the DataFrame in the required format to the specified S3 location.
df = spark.read.format('jdbc').options(
        url='jdbc:awsathena://AwsRegion=region;UID=your-access-key;PWD=your-secret-access-key;Schema=database name;S3OutputLocation=s3 location where the jdbc driver stores athena query results',
        driver='com.simba.athena.jdbc42.Driver',
        dbtable='(your athena query)'
    ).load()
df.repartition(1).write.format("json").save("s3 location")
Specify the query in the format dbtable='(select * from foo)'.
Download the jar from here and store it in S3.
While configuring the ETL job on Glue, specify the S3 location of the jar in the "Jar lib path".
You can get Athena to create data in S3 by using a "create table as select" (CTAS) query. In that query you can specify where, and in what format, you want the created table to store its data.
https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
For json, the example you are looking for is:
CREATE TABLE ctas_json_unpartitioned
WITH (
format = 'JSON',
external_location = 's3://my_athena_results/ctas_json_unpartitioned/')
AS SELECT key1, name1, address1, comment1
FROM table1;
This results in single-line (newline-delimited) JSON output.

Twitter Json data not getting queried in Hive

I am trying to do Twitter sentiment analysis using Flume, Hadoop, and Hive.
I am following this article. I was able to get tweets into HDFS successfully using Flume. This is my Twitter-agent configuration:
#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1
#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata
#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100
#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=10000
Twitter-agent.sinks.sink1.batchSize=1000
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text
Then I created a table just as in the article, and the table was created. When I query the table, it gives an error like this:
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#31228d83; line: 1, column: 2]
Time taken: 0.914 seconds
I tried other queries like select count(id) from tweets, but they show a lot of errors.
This is one of the FlumeData files (tweets) present in HDFS:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable;�#z_�>��<���N ����{"in_reply_to_status_id_str":"613363183034601472","in_reply_to_status_id":613363183034601472,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":"604605328","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":"AlexiBlue","id_str":"613363262760034304","in_reply_to_user_id":604605328,"favorite_count":0,"id":613363262760034304,"text":"#AlexiBlue good morning ☺️","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172237","entities":{"urls":[],"hashtags":[],"user_mentions":[{"indices":[0,10],"screen_name":"AlexiBlue","id_str":"604605328","name":"Alexi Blue ★","id":604605328}],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":null,"friends_count":1175,"profile_image_url_https":"https://pbs.twimg.com/profile_images/604664190763212800/Nmqxn_p5_normal.jpg","listed_count":6,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","default_profile_image":false,"favourites_count":31695,"description":"PIZZA & TACOS ARE LIFE. #flippinfamily #rudunation #ABNation #5quad #7squad #SamCollinsisbaeaf","created_at":"Sun Mar 09 02:40:15 +0000 2014","is_translator":false,"profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","protected":false,"screen_name":"Sonja_Campbell1","id_str":"2379671544","profile_link_color":"3B94D9","id":2379671544,"geo_enabled":true,"profile_background_color":"C0DEED","lang":"en","profile_sidebar_border_color":"C0DEED","profile_text_color":"333333","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/604664190763212800/Nmqxn_p5_normal.jpg","time_zone":null,"url":null,"contributors_enabled":false,"profile_background_tile":false,"profile_banner_url":"https://pbs.twimg.com/profile_banners/2379671544/1434956813","statuses_count":17254,"follow_request_sent":null,"followers_count":871,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"Sonita✨","location":"","profile_sidebar_fill_color":"DDEEF6","notifications":null}}
Can anyone help me with this?
SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O, and JSON is one of many supported formats. I can see the SerDe exception and JSON in your error message, so it is something to do with marshalling and unmarshalling of the JSON data in a Hive table column. Identify which column you are adding the JSON data to. Happy coding.
Download "hive-json-serde.jar" and Add it to hive shell before you query any table which contains SerDe data such json, etc.
You will have to do this everytime you open Hive shell.
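For example, a minimal sketch (the jar path is just a placeholder for wherever you downloaded it):
hive> ADD JAR /path/to/hive-json-serde.jar;
hive> SELECT * FROM tweets;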
You need to download and add hive-serdes-1.0-SNAPSHOT.jar, which contains the JSON SerDe by Cloudera, in your Hive shell. Then you need to create a table based on your required columns.
For example
create external table load_tweets(id BIGINT, text STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
To perform sentiment analysis, the tweet_id and the tweet text are enough. Now if you run
select * from load_tweets;
then you can see the data in your Hive table, containing the tweet_id and the tweet text.
You can refer to the link below, in which sentiment analysis is clearly explained with screenshots.
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset.
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use an MR job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete rahul's answer: you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to give it the storage plugin (dfs). The "root" workspace can read from the whole disk but is not writable, while the "tmp" workspace (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But if the JSON is nested, or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"} I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
I assume the reason is that Parquet is a "column" store, so you need flat columns; JSON isn't flat by default, so you need to convert it.
The problem is that I have to know my JSON schema and build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.
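Putting the pieces together, a sketch of a full CTAS for the nested example above (the output table name is arbitrary; dfs.tmp, the source path, and the members struct come from the earlier snippets):
create table dfs.tmp.`members_flat.parquet` as
select t.members.id as members_id, t.members.name as members_name
from dfs.`/tmp/filename.json` t;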