Slick db.run does not perform insert action - MySQL

I'm trying to do a simple insert into a MySQL table using Slick. As you can see in the debug output below, the code gets executed but the values do not get inserted into the database.
This is the Database.scala code:
//import slick.jdbc.JdbcBackend._
import slick.dbio.DBIOAction
import slick.driver.MySQLDriver.api._
import slick.lifted.TableQuery
import java.sql.Timestamp
class Database {
val url = "jdbc:mysql://username=root:password=xxx@localhost/playdb"
val db = Database.forURL(url, driver = "com.mysql.jdbc.Driver")
val emrepo = TableQuery[EmailMessageTable]
override def finalize() {
db.close()
super.finalize()
}
protected class EmailMessageTable(tag: Tag) extends Table[EmailMessage](tag, "email_message") {
def id = column[Option[Long]]("id", O.AutoInc, O.PrimaryKey)
def email = column[String]("email")
def subject = column[String]("subject")
def body = column[String]("body")
def datain = column[Timestamp]("datain")
def email_id= column[Long]("email_id")
def * = (id, email, subject, body, datain, email_id) <> ((EmailMessage.apply _).tupled, EmailMessage.unapply)
def ? = (id.get.?, email.?, subject.?, body.?, datain.?, email_id.?).shaped.<>({ r => import r._; _1.map(_ =>
EmailMessage.tupled((_1, _2.get, _3.get, _4.get, _5.get, _6.get))) }, (_: Any) =>
throw new Exception("Inserting into ? projection not supported."))
}
def insert(m: EmailMessage) {
db.run(
(emrepo += m)
)
}
}
The calling code:
def toDatabase(m: EmailMessage): EmailMessage = {
val db = new Database()
println("HIT")
db.insert(m)
println("HIT 2")
println(m)
m
}
The case class that is inserted into the database:
import java.sql.Timestamp
case class EmailMessage(
id: Option[Long],
email: String,
subject:String,
body:String,
datain: Timestamp,
email_id: Long
)
Debug output, showing the call to Slick and Slick's own logging:
HIT
2016-09-06 16:08:41:563 -0300 [run-main-0] DEBUG slick.compiler.QueryCompiler - Source:
| TableExpansion
| table s2: Table email_message
| columns: TypeMapping
| 0: ProductNode
| 1: Path s2.id : Option[Long']
| 2: Path s2.email : String'
| 3: Path s2.subject : String'
| 4: Path s2.body : String'
| 5: Path s2.datain : java.sql.Timestamp'
| 6: Path s2.email_id : Long'
2016-09-06 16:08:41:587 -0300 [run-main-0] DEBUG slick.compiler.AssignUniqueSymbols - Detected features: UsedFeatures(false,true,false,false)
2016-09-06 16:08:41:597 -0300 [run-main-0] DEBUG slick.compiler.QueryCompiler - After phase assignUniqueSymbols:
| TableExpansion
| table s3: Table email_message
| columns: TypeMapping
| 0: ProductNode
| 1: Path s3.id : Option[Long']
| 2: Path s3.email : String'
| 3: Path s3.subject : String'
| 4: Path s3.body : String'
| 5: Path s3.datain : java.sql.Timestamp'
| 6: Path s3.email_id : Long'
2016-09-06 16:08:41:605 -0300 [run-main-0] DEBUG slick.compiler.QueryCompiler - After phase inferTypes: (no change)
2016-09-06 16:08:41:624 -0300 [run-main-0] DEBUG slick.compiler.QueryCompiler - After phase insertCompiler:
| ResultSetMapping : Vector[(String', String', String', java.sql.Timestamp', Long')]
| from s5: Insert allFields=[id, email, subject, body, datain, email_id] : (String', String', String', java.sql.Timestamp', Long')
| table s6: Table email_message : Vector[#t4<UnassignedType>]
| linear: ProductNode : (String', String', String', java.sql.Timestamp', Long')
| 1: Path s6.email : String'
| 2: Path s6.subject : String'
| 3: Path s6.body : String'
| 4: Path s6.datain : java.sql.Timestamp'
| 5: Path s6.email_id : Long'
| map: TypeMapping : Mapped[(Option[Long'], String', String', String', java.sql.Timestamp', Long')]
| 0: ProductNode : (Option[Long'], String', String', String', java.sql.Timestamp', Long')
| 1: InsertColumn id : Option[Long']
| 2: InsertColumn email : String'
| 0: Path s5._1 : String'
| 3: InsertColumn subject : String'
| 0: Path s5._2 : String'
| 4: InsertColumn body : String'
| 0: Path s5._3 : String'
| 5: InsertColumn datain : java.sql.Timestamp'
| 0: Path s5._4 : java.sql.Timestamp'
| 6: InsertColumn email_id : Long'
| 0: Path s5._5 : Long'
2016-09-06 16:08:41:638 -0300 [run-main-0] DEBUG slick.compiler.CodeGen - Compiling server-side and mapping with server-side:
| Insert allFields=[id, email, subject, body, datain, email_id] : (String', String', String', java.sql.Timestamp', Long')
| table s6: Table email_message : Vector[#t4<UnassignedType>]
| linear: ProductNode : (String', String', String', java.sql.Timestamp', Long')
| 1: Path s6.email : String'
| 2: Path s6.subject : String'
| 3: Path s6.body : String'
| 4: Path s6.datain : java.sql.Timestamp'
| 5: Path s6.email_id : Long'
2016-09-06 16:08:41:673 -0300 [run-main-0] DEBUG slick.relational.ResultConverterCompiler - Compiled ResultConverter
| TypeMappingResultConverter
| child: ProductResultConverter
| 1: CompoundResultConverter
| 2: SpecializedJdbcResultConverter$$anon$1 idx=1, name=email : String'
| 3: SpecializedJdbcResultConverter$$anon$1 idx=2, name=subject : String'
| 4: SpecializedJdbcResultConverter$$anon$1 idx=3, name=body : String'
| 5: SpecializedJdbcResultConverter$$anon$1 idx=4, name=datain : java.sql.Timestamp'
| 6: BaseResultConverter$mcJ$sp idx=5, name=email_id : Long'
2016-09-06 16:08:41:675 -0300 [run-main-0] DEBUG slick.compiler.CodeGen - Compiled server-side to:
| CompiledStatement "insert into `email_message` (`email`,`subject`,`body`,`datain`,`email_id`) values (?,?,?,?,?)" : (String', String', String', java.sql.Timestamp', Long')
2016-09-06 16:08:41:681 -0300 [run-main-0] DEBUG slick.compiler.QueryCompiler - After phase codeGen:
| ResultSetMapping : Vector[(String', String', String', java.sql.Timestamp', Long')]
| from s5: CompiledStatement "insert into `email_message` (`email`,`subject`,`body`,`datain`,`email_id`) values (?,?,?,?,?)" : (String', String', String', java.sql.Timestamp', Long')
| map: CompiledMapping : Mapped[(Option[Long'], String', String', String', java.sql.Timestamp', Long')]
| converter: TypeMappingResultConverter
| child: ProductResultConverter
| 1: CompoundResultConverter
| 2: SpecializedJdbcResultConverter$$anon$1 idx=1, name=email : String'
| 3: SpecializedJdbcResultConverter$$anon$1 idx=2, name=subject : String'
| 4: SpecializedJdbcResultConverter$$anon$1 idx=3, name=body : String'
| 5: SpecializedJdbcResultConverter$$anon$1 idx=4, name=datain : java.sql.Timestamp'
| 6: BaseResultConverter$mcJ$sp idx=5, name=email_id : Long'
2016-09-06 16:08:41:682 -0300 [run-main-0] DEBUG slick.compiler.QueryCompilerBenchmark - ------------------- Phase: Time ---------
2016-09-06 16:08:41:702 -0300 [run-main-0] DEBUG slick.compiler.QueryCompilerBenchmark - assignUniqueSymbols: 32,729098 ms
2016-09-06 16:08:41:703 -0300 [run-main-0] DEBUG slick.compiler.QueryCompilerBenchmark - inferTypes: 7,924984 ms
2016-09-06 16:08:41:703 -0300 [run-main-0] DEBUG slick.compiler.QueryCompilerBenchmark - insertCompiler: 18,786989 ms
2016-09-06 16:08:41:703 -0300 [run-main-0] DEBUG slick.compiler.QueryCompilerBenchmark - codeGen: 57,406605 ms
2016-09-06 16:08:41:704 -0300 [run-main-0] DEBUG slick.compiler.QueryCompilerBenchmark - TOTAL: 116,847676 ms
2016-09-06 16:08:41:709 -0300 [run-main-0] DEBUG slick.backend.DatabaseComponent.action - #1: SingleInsertAction [insert into `email_message` (`email`,`subject`,`body`,`datain`,`email_id`) values (?,?,?,?,?)]
HIT 2
EmailMessage(None,fernando@localhost,Me,teste daqui para ali rapido.,2016-09-06 16:08:41.099,1)
2016-09-06 16:08:41:746 -0300 [AsyncExecutor.default-1] DEBUG slick.jdbc.JdbcBackend.statement - Preparing statement: insert into `email_message` (`email`,`subject`,`body`,`datain`,`email_id`) values (?,?,?,?,?)
[success] Total time: 18 s, completed 06/09/2016 16:08:41
The values do not get inserted into the database. Why?

Most probably the statement db.insert(m) is asynchronous (db.run returns a Future), and your program finishes before the Future completes. Try waiting for the Future to complete (or, as a quick test, add a sleep).
You can try something like this:
import scala.concurrent.Await
import scala.concurrent.duration.Duration
// Note: insert must return the Future, e.g. def insert(m: EmailMessage) = db.run(emrepo += m)
val result = db.insert(m)
Await.result(result, Duration.Inf)
...
I had a similar problem before which you can see here: How to configure Slick 3.1.1 for PostgreSQL? It seems to ignore my config parameters while running plain sql queries

Related

setValue using an array of schema values, each schema with an array of fields, and each field with an array of "inner" fields

[Edited, for simplicity]
I want to set the value of a few cells in a Google Sheet from the values retrieved from a schema list (from a Google Workspace domain), using "AdminDirectory.Schemas.list('my_customer').schemas". So far I only achieve partial solutions...
In detail, I want each of the Google Sheet cells U4, U5, U6, and so on (one cell for each schema), to contain a single schema, with the entire array of fields, and the entire array of inner fields, as follows (below is the expected content of a single cell):
________________ Cell U4 ________________
| |
|⏵ Display Name: Test Schema 1 |
|⏵ Safe_Name: Test_Schema_1 |
|⏵ Fields: |
| • Field Nr.1 of 3: |
| ◦ Display Name: Field 1: |
| ◦ Field_Name: Field1: |
| ◦ Field Type: BOOL: |
| ◦ Multi-valued? :false: |
| ◦ Accessible by: ADMINS_AND_SELF |
| |
| • Field Nr.2 of 3: |
| ◦ Display Name: Field 2: |
| ◦ Field_Name: Field2: |
| ◦ Field Type: BOOL: |
| ◦ Multi-valued? :false: |
| ◦ Accessible by: ADMINS_AND_SELF |
| |
| • Field Nr.3 of 3: |
| ◦ Display Name: Field 3: |
| ◦ Field_Name: Field3: |
| ◦ Field Type: BOOL: |
| ◦ Multi-valued? :false: |
| ◦ Accessible by: ADMINS_AND_SELF |
|_______________________________________|
The next cell might be slightly different, as each schema will have a different number of fields, and each field has five inner fields (the one seen in the example above).
So far, the best I have been able to achieve is this:
________________ Cell U4 ________________
| |
|⏵ Display Name: Test Schema 1 |
|⏵ Safe_Name: Test_Schema_1 |
|⏵ Fields: |
| • Field Nr.1 of 3: |
| ◦ Display Name: Field 1: |
| ◦ Field_Name: Field1: |
| ◦ Field Type: BOOL: |
| ◦ Multi-valued? :false: |
| ◦ Accessible by: ADMINS_AND_SELF |
| |
|_______________ Cell U5 _______________|
| |
|⏵ Display Name: Test Schema 1 |
|⏵ Safe_Name: Test_Schema_1 |
|⏵ Fields: |
| • Field Nr.2 of 3: |
| ◦ Display Name: Field 2: |
| ◦ Field_Name: Field2: |
| ◦ Field Type: BOOL: |
| ◦ Multi-valued? :false: |
| ◦ Accessible by: ADMINS_AND_SELF |
| |
|_______________ Cell U6 _______________|
| |
|⏵ Display Name: Test Schema 1 |
|⏵ Safe_Name: Test_Schema_1 |
|⏵ Fields: |
| • Field Nr.3 of 3: |
| ◦ Display Name: Field 3: |
| ◦ Field_Name: Field3: |
| ◦ Field Type: BOOL: |
| ◦ Multi-valued? :false: |
| ◦ Accessible by: ADMINS_AND_SELF |
|_______________________________________|
As you can see, the schema "Display Name" and "Schema Name" are repeated in every cell, and the fields spread into the following cells until the end of the fields loop. When there are no more fields in that schema, the same happens for the next one. What I want is to have everything related to each schema in a single cell. So, in short, what I need is to be able to join or concatenate the results of the fields loop (more details after Script 2).
SCRIPT 1:
[removed for simplicity. Kept this reference out of respect for those who may have read it before]
SCRIPT 2:
function listSchemaB() {
const sheet = SpreadsheetApp.getActive().getSheetByName("Domain Schema");
const schemaLength = AdminDirectory.Schemas.list('my_customer').schemas.length;
for(var i=0;i<schemaLength;i++) {
var data = AdminDirectory.Schemas.list('my_customer').schemas[i];
var fieldsLenght = data.fields.length;
var schemaTitles = "⏵ Display Name: " + data.displayName + "\n\⏵ Safe_Name: " + data.schemaName + "\n\⏵ Fields:";
for(var x=0;x<fieldsLenght;x++) {
var schemaFields = ("\n\ • Field Nr." + (x+1) + " of " + (fieldsLenght+1) + ":\n\ ◦ Display Name: " + data.fields[x].displayName + ":\n\ ◦ Field_Name: " + data.fields[x].fieldName + ":\n\ ◦ Field Type: " + data.fields[x].fieldType + ":\n\ ◦ Multi-valued? :" + data.fields[x].multiValued + ":\n\ ◦ Accessible by: " + data.fields[x].readAccessType).concat("");
}
sheet.getRange(i+4,21).setValue(schemaTitles + schemaFields);
}
}
This one almost works, but I get the results from the loop with the x variable all separated from each other, so they all go to a different cell when I use "setValue", and I can't find a way to merge/join/concatenate the results from the inner loop into a single cell.
SCRIPT 3:
[removed for simplicity. Kept this reference out of respect for @doubleunary, who tried to help based on this script]
Additionally (but secondary, for now), I'd also like to know how I can use the output of "console.log(something)" as a variable with "setValue", to push the result to a Google Sheet.
Note: "console.log(ret)" gives me a perfect result but I can't find a way to use the logged result inside the "setValue", to push the result to the Google Sheet.
It is not entirely clear what your desired result is, but given that console.log(ret) gives what you want, try this:
function loopSchemaC() {
const sheet = SpreadsheetApp.getActive().getSheetByName('Domain Schema');
const data = AdminDirectory.Schemas.list('my_customer').schemas;
const output = [];
data.forEach(schema => {
const ret = {};
ret.displayName = schema.displayName;
ret.schemaName = schema.schemaName;
ret.fields = [];
for (let f of schema.fields) {
const obj = {};
obj.readAccessType = f.readAccessType;
obj.displayName = f.displayName;
obj.fieldType = f.fieldType;
obj.fieldName = f.fieldName;
obj.multiValued = f.multiValued;
ret.fields.push(obj);
}
output.push([JSON.stringify(ret, null, 2)]);
});
sheet.getRange('U1')
.offset(0, 0, output.length, output[0].length)
.setValues(output);
}

How to programmatically check for duplicate row before insertion into MySQL database

I am a database newbie and am learning with Python 3.7 and MySQL using end-of-day stock data. I've managed to programmatically load data into the database. However, I want to avoid inserting duplicate rows. I am parsing a text file line by line.
Here is my code so far.
import pymysql
import pandas as pd
import sys
ticker_file = 'C:/testfile.txt'
# Read the text file and add , to the end of the line.
def fun_read_file(ticker_file):
    host = 'localhost'
    user = 'user'
    password = 'password'
    db = 'trading'
    with open(ticker_file, 'r') as f:
        for line in f:
            # Do something with 'line'
            stripped = line.strip('\n\r')
            value1,value2,value3,value4,value5,value6,value7 = stripped.split(',')
            print(value1,value2,value3,value4,value5,value6,value7)
            # Call the csv_to_mysql function
            csv_to_mysql(host, user, password, db, value1, value2, value3, value4, value5, value6, value7)

def csv_to_mysql(host, user, password, db, value1, value2, value3, value4, value5, value6, value7):
    '''
    This function loads a csv file to a MySQL table according to
    the load_sql statement.
    '''
    load_sql = 'INSERT INTO asx (Symbol,Date,Open,High,Low,Close,Volume) VALUES (%s, %s, %s, %s, %s, %s, %s)'
    args = [value1, value2, value3, value4, value5, value6, value7]
    print('You are in csv_to_mysql')
    print(args)
    try:
        con = pymysql.connect(host=host,
                              user=user,
                              password=password,
                              db=db,
                              autocommit=True,
                              local_infile=1)
        print('Connected to DB: {}'.format(host))
        # Create cursor and execute Load SQL
        cursor = con.cursor()
        cursor.execute(load_sql, args)
        print('Successfully loaded the table from csv.')
        con.close()
    except Exception as e:
        print('Error: {}'.format(str(e)))
        sys.exit(1)

# Execute the script
fun_read_file(ticker_file)
And here is the current data in the table called asx:
mysql> select * from asx;
+--------+------------+--------+--------+--------+--------+---------+
| Symbol | Date | Open | High | Low | Close | Volume |
+--------+------------+--------+--------+--------+--------+---------+
| 14D | 2019-01-11 | 0.2950 | 0.2950 | 0.2750 | 0.2750 | 243779 |
| 14D | 2019-01-11 | 0.2950 | 0.2950 | 0.2750 | 0.2750 | 243779 |
| 14D | 2019-01-11 | 0.2950 | 0.2950 | 0.2750 | 0.2750 | 243779 |
| 14DO | 2019-01-11 | 0.0700 | 0.0700 | 0.0700 | 0.0700 | 0 |
| 1AD | 2019-01-11 | 0.2400 | 0.2400 | 0.2400 | 0.2400 | 0 |
| 1AG | 2019-01-11 | 0.0310 | 0.0320 | 0.0310 | 0.0310 | 719145 |
| 1AL | 2019-01-11 | 0.9100 | 0.9100 | 0.9100 | 0.9100 | 0 |
| 1ST | 2019-01-11 | 0.0280 | 0.0280 | 0.0280 | 0.0280 | 0 |
| 3DP | 2019-01-11 | 0.0500 | 0.0560 | 0.0500 | 0.0520 | 3919592 |
+--------+------------+--------+--------+--------+--------+---------+
9 rows in set (0.02 sec)
As you can see, the first three rows of data are all duplicates.
I have a ton of these files to import, and chances of duplicate rows are high.
Is there a way to check that the row I will be inserting doesn't already exist in the table?
Checking the Symbol and Date values should be enough to ensure uniqueness for this dataset. But I am unsure of how to accomplish this.
Thanks in advance for your help.
Added for clarification:
Thanks very much for your input so far.
I've read the primary key responses and have follow up questions regarding them.
My understanding is that primary keys need to be unique inside a table. Due to the nature of End of Day Stock Data I may end up with the following rows.
+--------+------------+--------+--------+--------+--------+---------+
| Symbol | Date | Open | High | Low | Close | Volume |
+--------+------------+--------+--------+--------+--------+---------+
| 14D | 2019-01-12 | 0.3000 | 0.4950 | 0.2950 | 0.4900 | 123456 |
| 14D | 2019-01-11 | 0.2950 | 0.2950 | 0.2750 | 0.2750 | 243779 |
| 14D | 2019-01-11 | 0.2950 | 0.2950 | 0.2750 | 0.2750 | 243779 |
| 14DO | 2019-01-11 | 0.0700 | 0.0700 | 0.0700 | 0.0700 | 0 |
| 1AD | 2019-01-11 | 0.2400 | 0.2400 | 0.2400 | 0.2400 | 0 |
As you can see Symbol 14D will have a row for each date. The data in row 1 is valid. However, rows 2 and 3 are duplicates. I would need to remove either row 2 or 3 in order to keep the table accurate.
In this scenario, should I still make Symbol and Date Primary Keys?
I suggest you read about the INSERT IGNORE and ON DUPLICATE KEY UPDATE keywords for MySQL, and also look into PRIMARY KEY and UNIQUE constraints.
Here is a quick link that can solve your problem:
Mysql Handling Duplicates
If you still have questions, I can answer them.
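For example, here is a minimal sketch of the INSERT IGNORE approach, in the same pymysql style as your script. It assumes you first add a unique key on (Symbol, Date) to the existing asx table; the key name uq_symbol_date is arbitrary, and the duplicate rows already in the table would have to be deleted before the key can be created:
import pymysql

# Connection details follow the ones used in the question
con = pymysql.connect(host='localhost', user='user', password='password',
                      db='trading', autocommit=True)
cursor = con.cursor()

# One-time schema change: enforce uniqueness of (Symbol, Date) at the database level
cursor.execute('ALTER TABLE asx ADD UNIQUE KEY uq_symbol_date (Symbol, Date)')

# INSERT IGNORE silently skips any row that would violate the unique key
insert_sql = ('INSERT IGNORE INTO asx (Symbol, Date, Open, High, Low, Close, Volume) '
              'VALUES (%s, %s, %s, %s, %s, %s, %s)')
cursor.execute(insert_sql, ['14D', '2019-01-11', 0.2950, 0.2950, 0.2750, 0.2750, 243779])
con.close()
INSERT ... ON DUPLICATE KEY UPDATE works the same way, except that it lets you update the existing row instead of silently skipping the new one.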
I am still a beginner in Python, but I know databases. What I would do is first do a SELECT query to verify if a record with the given Symbol and Date exists in the MySQL table, and only perform the INSERT if the SELECT returned 0 rows. You should also consider making these two columns your primary key for that table. That will ensure that no duplicates are inserted (but inserting a duplicate might raise an exception which must be handled).
Thanks for the heads up about how to answer correctly.
I ended up creating a new function called check_row and used a select statement to check whether the row already exists. In this dataset, I only need to check whether a row in the table already contains value1(Symbol) and value2(Date) in order to keep the data accurate.
Thank you tutiplain for pointing me in this direction.
query = 'SELECT COUNT(*) from asx WHERE Symbol = %s AND Date = %s'
args = [value1, str_query_value2]
Here is the full code below.
import pymysql
import pandas as pd
import sys
ticker_file = 'C:/test.txt'
# Read the text file and add , to the end of the line.
def fun_read_file(ticker_file):
    #load_sql = "LOAD DATA INFILE 'C:/test.txt' INTO TABLE asx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';"
    host = 'localhost'
    user = 'user'
    password = 'password'
    db = 'trading'
    with open(ticker_file, 'r') as f:
        for line in f:
            # Do something with 'line'
            stripped = line.strip('\n\r')
            value1,value2,value3,value4,value5,value6,value7 = stripped.split(',')
            print(value1,value2,value3,value4,value5,value6,value7)
            # Call the check_row function
            check_row(host, user, password, db, value1, value2, value3, value4, value5, value6, value7)

# Insert row into table
def csv_to_mysql(host, user, password, db, value1, value2, value3, value4, value5, value6, value7):
    '''
    This function loads a csv file to a MySQL table according to
    the load_sql statement.
    '''
    load_sql = 'INSERT INTO asx (Symbol,Date,Open,High,Low,Close,Volume) VALUES (%s, %s, %s, %s, %s, %s, %s)'
    args = [value1, value2, value3, value4, value5, value6, value7]
    try:
        con = pymysql.connect(host=host,
                              user=user,
                              password=password,
                              db=db,
                              autocommit=True,
                              local_infile=1)
        print('Connected to DB: {}'.format(host))
        # Create cursor and execute Load SQL
        cursor = con.cursor()
        cursor.execute(load_sql, args)
        print('Successfully loaded the table from csv.')
        con.close()
    except Exception as e:
        print('Error: {}'.format(str(e)))
        sys.exit(1)

# Check for duplicate row before insertion into table
def check_row(host, user, password, db, value1, value2, value3, value4, value5, value6, value7):
    # Reformat the value2 (date) string from 20190111 into 2019-01-11
    str_value2 = value2
    year = str_value2[:4]
    day = str_value2[-2:]
    month = str_value2[4:6]
    str_query_value2 = year + '-' + month + '-' + day
    print(str_query_value2)
    # Select statement to query whether the row already exists
    query = 'SELECT COUNT(*) from asx WHERE Symbol = %s AND Date = %s'
    args = [value1, str_query_value2]
    try:
        con = pymysql.connect(host=host,
                              user=user,
                              password=password,
                              db=db,
                              autocommit=True,
                              local_infile=1)
        print('Connected to DB: {}'.format(host))
        # Create cursor and execute the SELECT
        cursor = con.cursor()
        cursor.execute(query, args)
        print('Successfully queried the asx table.')
        result = cursor.fetchall()
        print(result)
        # fetchall returns a tuple of rows; take the count from the first row.
        int_result = result[0][0]
        print(int_result)
        con.close()
        if int_result >= 1:
            # Row already exists: skip the insert and move on to the next line
            return
        else:
            # Call the csv_to_mysql function
            csv_to_mysql(host, user, password, db, value1, value2, value3, value4, value5, value6, value7)
    except Exception as e:
        print('Error: {}'.format(str(e)))
        sys.exit(1)

# Execute the script
fun_read_file(ticker_file)

How to extract data from JSON using an Oracle Text index

I have a table with an Oracle Text index. I created the index because I need very fast search. The table contains JSON data. Oracle's json_textcontains performs very poorly, so I tried to play with CONTAINS (json_textcontains is actually rewritten to CONTAINS, as the query plan shows).
I want to find all JSON documents by a given class_type and id value, but Oracle searches the whole JSON without checking that the class_type and the id belong to the same JSON section, i.e. it treats the JSON not as structured data but as one huge string.
Well formatted JSON looks like this:
{
"class":[
{
"class_type":"ownership",
"values":[{"nm":"id","value":"1"}]
},
{
"class_type":"country",
"values":[{"nm":"id","value":"640"}]
},
{
"class_type":"features",
"values":[{"nm":"id","value":"15"},{"nm":"id","value":"20"}]
}
]
}
The second one, which shouldn't be found, looks like this:
{
"class":[
{
"class_type":"ownership",
"values":[{"nm":"id","value":"18"}]
},
{
"class_type":"country",
"values":[{"nm":"id","value":"11"}]
},
{
"class_type":"features",
"values":[{"nm":"id","value":"7"},{"nm":"id","value":"640"}]
}
]
}
Please see how to reproduce what I'm trying to achieve:
create table perso.json_data(id number, data_val blob);
insert into perso.json_data
values(
1,
utl_raw.cast_to_raw('{"class":[{"class_type":"ownership","values":[{"nm":"id","value":"1"}]},{"class_type":"country","values":[{"nm":"id","value":"640"}]},{"class_type":"features","values":[{"nm":"id","value":"15"},{"nm":"id","value":"20"}]}]}')
);
insert into perso.json_data values(
2,
utl_raw.cast_to_raw('{"class":[{"class_type":"ownership","values":[{"nm":"id","value":"18"}]},{"class_type":"country","values":[{"nm":"id","value":"11"}]},{"class_type":"features","values":[{"nm":"id","value":"7"},{"nm":"id","value":"640"}]}]}')
)
;
commit;
ALTER TABLE perso.json_data
ADD CONSTRAINT check_is_json
CHECK (data_val IS JSON (STRICT));
CREATE INDEX perso.json_data_idx ON json_data (data_val)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('section group CTXSYS.JSON_SECTION_GROUP SYNC (ON COMMIT)');
select *
from perso.json_data
where ctxsys.contains(data_val, '(640 INPATH(/class/values/value)) and (country inpath (/class/class_type))')>0
The query returns 2 rows but I expect to get only the record where id = 1.
How can I use the full-text index so that the search matches class_type and value within the same JSON section, without using JSON_TABLE?
There is no option to put the data in relational format.
Thanks in advance.
Please don't use the text index directly to try to solve this kind of problem. It's not what it's designed for.
In 12.2.0.1.0 this should work for you (and yes, it does use a specialized version of the text index under the covers, but it also applies selective post-filtering to ensure the results are correct).
SQL> create table json_data(id number, data_val blob)
2 /
Table created.
SQL> insert into json_data values(
2 1,utl_raw.cast_to_raw('{"class":[{"class_type":"ownership","values":[{"nm":"id","value":"1"}]},{"class_type":"cou
ntry","values":[{"nm":"id","value":"640"}]},{"class_type":"features","values":[{"nm":"id","value":"15"},{"nm":"id","valu
e":"20"}]}]}')
3 )
4 /
1 row created.
Execution Plan
----------------------------------------------------------
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 100 | 1 (0)| 00:00:01 |
| 1 | LOAD TABLE CONVENTIONAL | JSON_DATA | | | | |
--------------------------------------------------------------------------------------
SQL> insert into json_data values(
2 2,utl_raw.cast_to_raw('{"class":[{"class_type":"ownership","values":[{"nm":"id","value":"18"}]},{"class_type":"co
untry","values":[{"nm":"id","value":"11"}]},{"class_type":"features","values":[{"nm":"id","value":"7"},{"nm":"id","value
":"640"}]}]}')
3 )
4 /
1 row created.
Execution Plan
----------------------------------------------------------
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 100 | 1 (0)| 00:00:01 |
| 1 | LOAD TABLE CONVENTIONAL | JSON_DATA | | | | |
--------------------------------------------------------------------------------------
SQL> commit
2 /
Commit complete.
SQL> ALTER TABLE json_data
2 ADD CONSTRAINT check_is_json
3 CHECK (data_val IS JSON (STRICT))
4 /
Table altered.
SQL> CREATE SEARCH INDEX json_SEARCH_idx ON json_data (data_val) for JSON
2 /
Index created.
SQL> set autotrace on explain
SQL> --
SQL> set lines 256 trimspool on pages 50
SQL> --
SQL> select ID, json_query(data_val, '$' PRETTY)
2 from JSON_DATA
3 /
ID
----------
JSON_QUERY(DATA_VAL,'$'PRETTY)
------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
----------------
1
{
"class" :
[
{
"class_type" : "ownership",
"values" :
[
{
"nm" : "id",
"value" : "1"
}
]
},
{
"class_type" : "country",
"values" :
[
{
"nm" : "id",
"value" : "640"
}
]
},
{
"class_type" : "features",
"values" :
[
{
"nm" : "id",
"value" : "15"
},
{
"nm" : "id",
"value" : "20"
}
]
}
]
}
2
{
"class" :
[
ID
----------
JSON_QUERY(DATA_VAL,'$'PRETTY)
------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
----------------
{
"class_type" : "ownership",
"values" :
[
{
"nm" : "id",
"value" : "18"
}
]
},
{
"class_type" : "country",
"values" :
[
{
"nm" : "id",
"value" : "11"
}
]
},
{
"class_type" : "features",
"values" :
[
{
"nm" : "id",
"value" : "7"
},
{
"nm" : "id",
"value" : "640"
}
]
}
]
}
Execution Plan
----------------------------------------------------------
Plan hash value: 3213740116
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 4030 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS FULL| JSON_DATA | 2 | 4030 | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
SQL> select ID, to_clob(data_val)
2 from json_data
3 where JSON_EXISTS(data_val,'$?(exists(@.class?(@.values.value == $VALUE && @.class_type == $TYPE)))' passing '640'
as "VALUE", 'country' as "TYPE")
4 /
ID TO_CLOB(DATA_VAL)
---------- --------------------------------------------------------------------------------
1 {"class":[{"class_type":"ownership","values":[{"nm":"id","value":"1"}]},{"class_
type":"country","values":[{"nm":"id","value":"640"}]},{"class_type":"features","
values":[{"nm":"id","value":"15"},{"nm":"id","value":"20"}]}]}
Execution Plan
----------------------------------------------------------
Plan hash value: 3248304200
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2027 | 4 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| JSON_DATA | 1 | 2027 | 4 (0)| 00:00:01 |
|* 2 | DOMAIN INDEX | JSON_SEARCH_IDX | | | 4 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(JSON_EXISTS2("DATA_VAL" FORMAT JSON , '$?(exists(@.class?(@.values.value
== $VALUE && @.class_type == $TYPE)))' PASSING '640' AS "VALUE" , 'country' AS "TYPE"
FALSE ON ERROR)=1)
2 - access("CTXSYS"."CONTAINS"("JSON_DATA"."DATA_VAL",'{640} INPATH
(/class/values/value) and {country} INPATH (/class/class_type)')>0)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
SQL> select ID, TO_CLOB(DATA_VAL)
2 from JSON_DATA d
3 where exists (
4 select 1
5 from JSON_TABLE(
6 data_val,
7 '$.class'
8 columns (
9 CLASS_TYPE VARCHAR2(32) PATH '$.class_type',
10 NESTED PATH '$.values.value'
11 columns (
12 "VALUE" VARCHAR2(32) path '$'
13 )
14 )
15 )
16 where CLASS_TYPE = 'country' and "VALUE" = '640'
17 )
18 /
ID TO_CLOB(DATA_VAL)
---------- --------------------------------------------------------------------------------
1 {"class":[{"class_type":"ownership","values":[{"nm":"id","value":"1"}]},{"class_
type":"country","values":[{"nm":"id","value":"640"}]},{"class_type":"features","
values":[{"nm":"id","value":"15"},{"nm":"id","value":"20"}]}]}
Execution Plan
----------------------------------------------------------
Plan hash value: 1621266031
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2027 | 32 (0)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS FULL | JSON_DATA | 2 | 4054 | 3 (0)| 00:00:01 |
|* 3 | FILTER | | | | | |
|* 4 | JSONTABLE EVALUATION | | | | | |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT 0 FROM JSON_TABLE( :B1, '$.class' COLUMNS(
"CLASS_TYPE" VARCHAR2(32) PATH '$.class_type' NULL ON ERROR , NESTED PATH
'$.values.value' COLUMNS( "VALUE" VARCHAR2(32) PATH '$' NULL ON ERROR ) ) )
"P" WHERE "CTXSYS"."CONTAINS"(:B2,'({country} INPATH (/class/class_type))
and ({640} INPATH (/class/values/value))')>0 AND "P"."CLASS_TYPE"='country'
AND "P"."VALUE"='640'))
3 - filter("CTXSYS"."CONTAINS"(:B1,'({country} INPATH
(/class/class_type)) and ({640} INPATH (/class/values/value))')>0)
4 - filter("P"."CLASS_TYPE"='country' AND "P"."VALUE"='640')
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
SQL>

Load multi-line JSON data into HIVE table

I have JSON data which is multi-line JSON. I have created a Hive table to load that data into it. I have another JSON which is a single-line JSON record. When I load the single-line JSON record into its Hive table and try to query it, it works fine. But when I load the multi-line JSON into its Hive table, it gives the exception below:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: java.io.ByteArrayInputStream@8b89b3a; line: 1, column: 0]) at [Source: java.io.ByteArrayInputStream@8b89b3a; line: 1, column: 3]
Below is my JSON data:
{
"uploadTimeStamp" : "1486631318873",
"PDID" : "123",
"data" : [ {
"Data" : {
"unit" : "rpm",
"value" : "0"
},
"EventID" : "E1",
"PDID" : "123",
"Timestamp" : 1486631318873,
"Timezone" : 330,
"Version" : "1.0",
"pii" : { }
}, {
"Data" : {
"heading" : "N",
"loc3" : "false",
"loc" : "14.022425",
"loc1" : "78.760587",
"loc4" : "false",
"speed" : "10"
},
"EventID" : "E2",
"PDID" : "123",
"Timestamp" : 1486631318873,
"Timezone" : 330,
"Version" : "1.1",
"pii" : { }
}, {
"Data" : {
"x" : "1.1",
"y" : "1.2",
"z" : "2.2"
},
"EventID" : "E3",
"PDID" : "123",
"Timestamp" : 1486631318873,
"Timezone" : 330,
"Version" : "1.0",
"pii" : { }
}, {
"EventID" : "E4",
"Data" : {
"value" : "50",
"unit" : "percentage"
},
"Version" : "1.0",
"Timestamp" : 1486631318873,
"PDID" : "123",
"Timezone" : 330
}, {
"Data" : {
"unit" : "kmph",
"value" : "70"
},
"EventID" : "E5",
"PDID" : "123",
"Timestamp" : 1486631318873,
"Timezone" : 330,
"Version" : "1.0",
"pii" : { }
} ]
}
I am using /hive/lib/hive-hcatalog-core-0.13.0.jar
Below is my create table statement:
create table test7(
uploadtime bigint,
pdid string,
data array<
struct<Data:struct<
unit:string,
value:int>,
eventid:string,
pdid:bigint,
time:bigint,
timezone:int,
version:int,
pii:struct<pii:string>>,
struct<Data:struct<
heading:string,
Location:string,
latitude:bigint,
longitude:bigint,
Location2:string,
speed:int>,
eventid:string,
pdid:bigint,
time:bigint,
timezone:int,
version:int,
pii:struct<pii:string>>,
struct<Data:struct<
unit:string,
value:int>,
eventid:string,
pdid:bigint,
time:bigint,
timezone:int,
version:int,
pii:struct<pii:string>>,
struct<Data:struct<
x:int,
y:int,
z:int>,
eventid:string,
pdid:bigint,
time:bigint,
timezone:int,
version:int,
pii:struct<pii:string>>,
struct<Data:struct<
heading:string,
loc3:string,
latitude:bigint,
longitude:bigint,
loc4:string,
speed:int>,
eventid:string,
pdid:bigint,
time:bigint,
timezone:int,
version:int,
pii:struct<pii:string>>
>
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
'/xyz/abc/';
Edit:
Adding the single line JSON and new table create stmt with error:
{"uploadTimeStamp":"1487183800905","PDID":"123","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"event1","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.0","pii":{}},{"Data":{"heading":"N","loc1":"false","latitude":"16.032425","longitude":"80.770587","loc2":"false","speed":"10"},"EventID":"event2","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.1","pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"event3":"AccelerometerInfo","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.0","pii":{}},{"EventID":"event4","Data":{"value":"50","unit":"percentage"},"Version":"1.0","Timestamp":1487183800905,"PDID":"123","Timezone":330},{"Data":{"unit":"kmph","value":"70"},"EventID":"event5","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.0","pii":{}}]}
create table test1(
uploadTimeStamp string,
PDID string,
data array<struct<
Data:struct<unit:string,value:int>,
EventID:string,
PDID:string,
TimeS:bigint,
Timezone:int,
Version:float,
pii:struct<>>,
struct<
Data:struct<heading:string,loc1:string,latitude:double,longitude:double,loc2:string,speed:int>,
EventID:string,
PDID:string,
TimeS:bigint,
Timezone:int,
Version:float,
pii:struct<>>,
struct<
Data:struct<x:float,y:float,z:float>,
EventID:string,
PDID:string,
TimeS:bigint,
Timezone:int,
Version:float,
pii:struct<>>,
struct<
EventID:string,
Data:struct<value:int,unit:percentage>,
Version:float,
TimeS:bigint,
PDID:string,
Timezone:int>,
struct<
Data:struct<unit:string,value:int>,
EventID:string,
PDID:string,
TimeS:bigint,
Timezone:int,
Version:float,
pii:struct<>>
>
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
'/ABC/XYZ/';
MismatchedTokenException(320!=313)
...
...
...
FAILED: ParseException line 11:10 mismatched input '<>' expecting < near 'struct' in struct type
Sample data
{"uploadTimeStamp":"1486631318873","PDID":"123","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.0","pii":{}},{"Data":{"heading":"N","loc3":"false","loc":"14.022425","loc1":"78.760587","loc4":"false","speed":"10"},"EventID":"E2","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.1","pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.0","pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Version":"1.0","Timestamp":1486631318873,"PDID":"123","Timezone":330},{"Data":{"unit":"kmph","value":"70"},"EventID":"E5","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.0","pii":{}}]}
add jar /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
create external table myjson
(
uploadTimeStamp string
,PDID string
,data array
<
struct
<
Data:struct
<
unit:string
,value:string
,heading:string
,loc3:string
,loc:string
,loc1:string
,loc4:string
,speed:string
,x:string
,y:string
,z:string
>
,EventID:string
,PDID:string
,`Timestamp`:bigint
,Timezone:smallint
,Version:string
,pii:struct<dummy:string>
>
>
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile
location '/tmp/myjson'
;
select * from myjson
;
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| myjson.uploadtimestamp | myjson.pdid | myjson.data |
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1486631318873 | 123 | [{"data":{"unit":"rpm","value":"0","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E1","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}},{"data":{"unit":null,"value":null,"heading":"N","loc3":"false","loc":"14.022425","loc1":"78.760587","loc4":"false","speed":"10","x":null,"y":null,"z":null},"eventid":"E2","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.1","pii":{"dummy":null}},{"data":{"unit":null,"value":null,"heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":"1.1","y":"1.2","z":"2.2"},"eventid":"E3","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}},{"data":{"unit":"percentage","value":"50","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E4","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":null},{"data":{"unit":"kmph","value":"70","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E5","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}}] |
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
select j.uploadTimeStamp
,j.PDID
,d.val.EventID
,d.val.PDID
,d.val.`Timestamp`
,d.val.Timezone
,d.val.Version
,d.val.Data.unit
,d.val.Data.value
,d.val.Data.heading
,d.val.Data.loc3
,d.val.Data.loc
,d.val.Data.loc1
,d.val.Data.loc4
,d.val.Data.speed
,d.val.Data.x
,d.val.Data.y
,d.val.Data.z
from myjson j
lateral view explode (data) d as val
;
+-------------------+--------+---------+------+---------------+----------+---------+------------+-------+---------+-------+-----------+-----------+-------+-------+------+------+------+
| j.uploadtimestamp | j.pdid | eventid | pdid | timestamp | timezone | version | unit | value | heading | loc3 | loc | loc1 | loc4 | speed | x | y | z |
+-------------------+--------+---------+------+---------------+----------+---------+------------+-------+---------+-------+-----------+-----------+-------+-------+------+------+------+
| 1486631318873 | 123 | E1 | 123 | 1486631318873 | 330 | 1.0 | rpm | 0 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1486631318873 | 123 | E2 | 123 | 1486631318873 | 330 | 1.1 | NULL | NULL | N | false | 14.022425 | 78.760587 | false | 10 | NULL | NULL | NULL |
| 1486631318873 | 123 | E3 | 123 | 1486631318873 | 330 | 1.0 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1.1 | 1.2 | 2.2 |
| 1486631318873 | 123 | E4 | 123 | 1486631318873 | 330 | 1.0 | percentage | 50 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1486631318873 | 123 | E5 | 123 | 1486631318873 | 330 | 1.0 | kmph | 70 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+-------------------+--------+---------+------+---------------+----------+---------+------------+-------+---------+-------+-----------+-----------+-------+-------+------+------+------+
I was having the same issue, so I decided to create a custom input format which can extract the multi-line (pretty-printed) JSON records.
This JsonRecordReader can read a multi-line JSON record in Hive. It extracts the record by balancing the curly braces { and }, so the content from the first '{' to the matching last '}' is considered one complete record. Below is the code snippet:
public static class JsonRecordReader implements RecordReader<LongWritable, Text> {
public static final String START_TAG_KEY = "jsoninput.start";
public static final String END_TAG_KEY = "jsoninput.end";
private byte[] startTag = "{".getBytes();
private byte[] endTag = "}".getBytes();
private long start;
private long end;
private FSDataInputStream fsin;
private final DataOutputBuffer buffer = new DataOutputBuffer();
public JsonRecordReader(FileSplit split, JobConf jobConf) throws IOException {
// uncomment the below lines if you need to get the configuration
// from JobConf:
// startTag = jobConf.get(START_TAG_KEY).getBytes("utf-8");
// endTag = jobConf.get(END_TAG_KEY).getBytes("utf-8");
// open the file and seek to the start of the split:
start = split.getStart();
end = start + split.getLength();
Path file = split.getPath();
FileSystem fs = file.getFileSystem(jobConf);
fsin = fs.open(split.getPath());
fsin.seek(start);
}
@Override
public boolean next(LongWritable key, Text value) throws IOException {
if (fsin.getPos() < end) {
AtomicInteger count = new AtomicInteger(0);
if (readUntilMatch(false, count)) {
try {
buffer.write(startTag);
if (readUntilMatch(true, count)) {
key.set(fsin.getPos());
// create json record from buffer:
String jsonRecord = new String(buffer.getData(), 0, buffer.getLength());
value.set(jsonRecord);
return true;
}
} finally {
buffer.reset();
}
}
}
return false;
}
@Override
public LongWritable createKey() {
return new LongWritable();
}
@Override
public Text createValue() {
return new Text();
}
@Override
public long getPos() throws IOException {
return fsin.getPos();
}
@Override
public void close() throws IOException {
fsin.close();
}
@Override
public float getProgress() throws IOException {
return ((fsin.getPos() - start) / (float) (end - start));
}
private boolean readUntilMatch(boolean withinBlock, AtomicInteger count) throws IOException {
while (true) {
int b = fsin.read();
// end of file:
if (b == -1)
return false;
// save to buffer:
if (withinBlock)
buffer.write(b);
// check if we're matching start/end tag:
if (b == startTag[0]) {
count.incrementAndGet();
if (!withinBlock) {
return true;
}
} else if (b == endTag[0]) {
count.getAndDecrement();
if (count.get() == 0) {
return true;
}
}
// see if we've passed the stop point:
if (!withinBlock && count.get() == 0 && fsin.getPos() >= end)
return false;
}
}
}
This input format can be used along with the JSON SerDe supplied by Hive to read the multi-line JSON file.
CREATE TABLE books (id string, bookname string, properties struct<subscription:string, unit:string>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'JsonInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The working code with samples is here: https://github.com/unayakdev/hive-json

Warning on Cygnus prevents data persistence on Cosmos

This is a Cygnus agent log:
16 Sep 2015 12:30:19,820 INFO [521330370@qtp-1739580287-1] (com.telefonica.iot.cygnus.handlers.OrionRestHandler.getEvents:236) - Received data (<notifyContextRequest><subscriptionId>55f932e6c06c4173451bbe1c</subscriptionId><originator>localhost</originator>...<contextAttribute><name>utctime</name><type>string</type><contextValue>2015-9-16 9:37:52</contextValue></contextAttribute></contextAttributeList></contextElement><statusCode><code>200</code><reasonPhrase>OK</reasonPhrase></statusCode></contextElementResponse></contextResponseList></notifyContextRequest>)
16 Sep 2015 12:30:19,820 INFO [521330370@qtp-1739580287-1] (com.telefonica.iot.cygnus.handlers.OrionRestHandler.getEvents:258) - Event put in the channel (id=1145647744, ttl=0)
16 Sep 2015 12:30:19,820 WARN [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:184) -
16 Sep 2015 12:30:19,820 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:193) - Finishing transaction (1442395508-572-0000013907)
We keep the same configuration as in this question:
Fiware Cygnus Error.
Although the Cygnus agent receives data correctly from the Context Broker subscription, Cosmos doesn't receive any data.
Thanks in advance, again :)
Independently of the reason that led you to comment out the grouping rules part (nevertheless, I think it was because of my own wrong advice at https://jira.fiware.org/browse/HELC-986 :)), that part cannot be commented out and must be added to the configuration:
# Source interceptors, do not change
cygnusagent.sources.http-source.interceptors = ts gi
# TimestampInterceptor, do not change
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
# GroupingInterceptor, do not change
cygnusagent.sources.http-source.interceptors.gi.type = com.telefonica.iot.cygnus.interceptors.GroupingInterceptor$Builder
# Grouping rules for the GroupingInterceptor, put the right absolute path to the file if necessary
# See the doc/design/interceptors document for more details
cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file = /usr/cygnus/conf/grouping_rules.conf
Once that part is added, most probably another problem will arise: the performance of Cygnus will be very poor (that was the reason I wrongly advised the user at https://jira.fiware.org/browse/HELC-986 to comment out the grouping feature, in an attempt to increase performance by removing processing steps). The reason is that the latest version of Cygnus (0.8.2) is not ready to deal with the HiveServer2 running on the Cosmos side (this server was recently upgraded from the old HiveServer1 to HiveServer2), and each persistence operation is delayed a lot. For instance:
time=2015-09-21T12:42:30.405CEST | lvl=INFO | trans=1442832138-143-0000000000 | function=getEvents | comp=Cygnus | msg=com.telefonica.iot.cygnus.handlers.OrionRestHandler[150] : Starting transaction (1442832138-143-0000000000)
time=2015-09-21T12:42:30.407CEST | lvl=INFO | trans=1442832138-143-0000000000 | function=getEvents | comp=Cygnus | msg=com.telefonica.iot.cygnus.handlers.OrionRestHandler[236] : Received data ({ "subscriptionId" : "51c0ac9ed714fb3b37d7d5a8", "originator" : "localhost", "contextResponses" : [ { "contextElement" : { "attributes" : [ { "name" : "temperature", "type" : "centigrade", "value" : "26.5" } ], "type" : "Room", "isPattern" : "false", "id" : "Room1" }, "statusCode" : { "code" : "200", "reasonPhrase" : "OK" } } ]})
time=2015-09-21T12:42:30.409CEST | lvl=INFO | trans=1442832138-143-0000000000 | function=getEvents | comp=Cygnus | msg=com.telefonica.iot.cygnus.handlers.OrionRestHandler[258] : Event put in the channel (id=1966649489, ttl=10)
time=2015-09-21T12:42:30.462CEST | lvl=INFO | trans=1442832138-143-0000000000 | function=process | comp=Cygnus | msg=com.telefonica.iot.cygnus.sinks.OrionSink[128] : Event got from the channel (id=1966649489, headers={fiware-servicepath=rooms, destination=other_rooms, content-type=application/json, fiware-service=deleteme2, ttl=10, transactionId=1442832138-143-0000000000, timestamp=1442832150410}, bodyLength=460)
time=2015-09-21T12:42:30.847CEST | lvl=INFO | trans=1442832138-143-0000000000 | function=persist | comp=Cygnus | msg=com.telefonica.iot.cygnus.sinks.OrionHDFSSink[330] : [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (deleteme2/rooms/other_rooms/other_rooms.txt), Data ({"recvTimeTs":"1442832150","recvTime":"2015-09-21T10:42:30.410Z","entityId":"Room1","entityType":"Room","attrName":"temperature","attrType":"centigrade","attrValue":"26.5","attrMd":[]})
time=2015-09-21T12:42:31.529CEST | lvl=INFO | trans=1442832138-143-0000000000 | function=provisionHiveTable | comp=Cygnus | msg=com.telefonica.iot.cygnus.backends.hdfs.HDFSBackendImpl[185] : Creating Hive external table=frb_deleteme2_rooms_other_rooms_row
(a big timeout)
A workaround is to configure hive_host with an unreachable address such as fake.cosmos.lab.fiware.org.
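For reference, this is a single property in the agent configuration file. The exact property prefix below is an assumption based on the hdfs-sink name that appears in the logs above, so adjust it to your own agent and sink names:
# Hive endpoint deliberately pointing at an unreachable host, so that Hive table
# provisioning fails fast instead of hanging against HiveServer2
cygnusagent.sinks.hdfs-sink.hive_host = fake.cosmos.lab.fiware.org
With that setting in place, the log looks like this: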
time=2015-09-21T12:44:58.278CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=getEvents | comp=Cygnus | msg=com.telefonica.iot.cygnus.handlers.OrionRestHandler[150] : Starting transaction (1442832280-746-0000000001)
time=2015-09-21T12:44:58.280CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=getEvents | comp=Cygnus | msg=com.telefonica.iot.cygnus.handlers.OrionRestHandler[236] : Received data ({ "subscriptionId" : "51c0ac9ed714fb3b37d7d5a8", "originator" : "localhost", "contextResponses" : [ { "contextElement" : { "attributes" : [ { "name" : "temperature", "type" : "centigrade", "value" : "26.5" } ], "type" : "Room", "isPattern" : "false", "id" : "Room1" }, "statusCode" : { "code" : "200", "reasonPhrase" : "OK" } } ]})
time=2015-09-21T12:44:58.280CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=getEvents | comp=Cygnus | msg=com.telefonica.iot.cygnus.handlers.OrionRestHandler[258] : Event put in the channel (id=1640732993, ttl=10)
time=2015-09-21T12:44:58.283CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=process | comp=Cygnus | msg=com.telefonica.iot.cygnus.sinks.OrionSink[128] : Event got from the channel (id=1640732993, headers={fiware-servicepath=rooms, destination=other_rooms, content-type=application/json, fiware-service=deleteme3, ttl=10, transactionId=1442832280-746-0000000001, timestamp=1442832298280}, bodyLength=460)
time=2015-09-21T12:44:58.527CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=persist | comp=Cygnus | msg=com.telefonica.iot.cygnus.sinks.OrionHDFSSink[330] : [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (deleteme3/rooms/other_rooms/other_rooms.txt), Data ({"recvTimeTs":"1442832298","recvTime":"2015-09-21T10:44:58.280Z","entityId":"Room1","entityType":"Room","attrName":"temperature","attrType":"centigrade","attrValue":"26.5","attrMd":[]})
time=2015-09-21T12:44:59.148CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=provisionHiveTable | comp=Cygnus | msg=com.telefonica.iot.cygnus.backends.hdfs.HDFSBackendImpl[185] : Creating Hive external table=frb_deleteme3_rooms_other_rooms_row
time=2015-09-21T12:44:59.304CEST | lvl=ERROR | trans=1442832280-746-0000000001 | function=doCreateTable | comp=Cygnus | msg=com.telefonica.iot.cygnus.backends.hive.HiveBackend[77] : Runtime error (The Hive table cannot be created. Hive query='create external table frb_deleteme3_rooms_other_rooms_row (recvTimeTs bigint, recvTime string, entityId string, entityType string, attrName string, attrType string, attrValue string, attrMd array<string>) row format serde 'org.openx.data.jsonserde.JsonSerDe' location '/user/frb/deleteme3/rooms/other_rooms''. Details=Could not establish connection to fake.cosmos.lab.fiware.org:10000/default?user=frb&password=llBl3dQsMhX2sEPtPuf3izUGS92RZo: java.net.UnknownHostException: fake.cosmos.lab.fiware.org)
time=2015-09-21T12:44:59.305CEST | lvl=WARN | trans=1442832280-746-0000000001 | function=provisionHiveTable | comp=Cygnus | msg=com.telefonica.iot.cygnus.backends.hdfs.HDFSBackendImpl[210] : The HiveQL external table could not be created, but Cygnus can continue working... Check your Hive/Shark installation
time=2015-09-21T12:44:59.305CEST | lvl=INFO | trans=1442832280-746-0000000001 | function=process | comp=Cygnus | msg=com.telefonica.iot.cygnus.sinks.OrionSink[193] : Finishing transaction (1442832280-746-0000000001)
This will allow Cygnus to continue working, despite the fact that the Hive tables are not automatically created, which is a minor problem (anyway, they would never have been created because of the current incompatibility with HiveServer2). Of course, this will be fixed in Cygnus 0.9.0 (to be released at the end of September 2015).