I have an Orion Context Broker connected to Cosmos via Cygnus.
It works fine: I send new entities to the Context Broker, and Cygnus forwards them to Cosmos, which saves them in files.
The problem I have is when I try to do some searches.
I start Hive and see that some tables have been created for the files Cosmos has stored, so I launch some queries.
The simple one works fine:
select * from Table_name;
Hive doesn't launch any MapReduce jobs.
But when I want to filter, join, count, or select only some fields, this is what happens:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = JOB_NAME, Tracking URL = JOB_DETAILS_URL
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -kill JOB_NAME
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-08 14:35:12,723 Stage-1 map = 0%, reduce = 0%
2015-07-08 14:35:38,943 Stage-1 map = 100%, reduce = 100%
Ended Job = JOB_NAME with errors
Error during job, obtaining debugging information...
Examining task ID: TASK_NAME (and more) from job JOB_NAME
Task with the most failures(4):
-----
Task ID:
task_201409031055_6337_m_000000
URL: TASK_DETAIL_URL
-----
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
I have found that the files created by Cygnus differ from the other files: in the Cygnus case, they have to be deserialized with a JAR.
So my doubt is whether in those cases I have to apply some specific MapReduce method, or whether there is already a general way to do this.
Before executing any Hive statement, do this:
hive> add jar /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar;
If you are using Hive through JDBC, execute it as any other statement:
Connection con = ...
Statement stmt = con.createStatement();
stmt.executeQuery("add jar /usr/local/hive-0.9.0-shark-0.8.0-bin/lib/json-serde-1.1.9.3-SNAPSHOT.jar");
stmt.close();
stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("select ...");
I have a Pyramid 2.x + SQLAlchemy + Zope app created using the official cookiecutter.
There is a table called "schema_b.table_a" with 0 records.
In the view below, count(*) should be more than 0, but it returns 0:
from pyramid.view import view_config
from zope.sqlalchemy import mark_changed

@view_config(route_name='home', renderer='myproject:templates/home.jinja2')
def my_view(request):
    # Call external REST API. This uses HTTP requests. The API inserts in schema_b.table_a
    call_thirdparty_api()
    mark_changed(request.dbsession)

    sql = "SELECT count(*) FROM schema_b.table_a"
    total = request.dbsession.execute(sql).fetchone()
    print(total)  # Total is 0
    return {}
On the other hand, the following code returns the correct count(*):
from pyramid.view import view_config
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

@view_config(route_name='home', renderer='myproject:templates/home.jinja2')
def my_view(request):
    engine = create_engine(request.registry.settings.get("sqlalchemy.url"), poolclass=NullPool)
    connection = engine.connect()

    # Call external REST API. This uses HTTP requests. The API inserts in table_a
    call_thirdparty_api()

    sql = "SELECT count(*) FROM schema_b.table_a"
    total = connection.execute(sql).fetchone()
    print(total)  # Total is not 0

    connection.invalidate()
    engine.dispose()
    return {}
It seems that request.dbsession is not able to see the data inserted by the external REST API, but it is not clear to me why, or how to correct it.
Pyramid and Zope provide transaction managers that extend transactions far beyond databases. In your example, I think a transaction was started in MySQL when the request was received on the server by the pyramid_tm package; its documentation states:
"At the beginning of a request a new transaction is started using the request.tm.begin() function."
https://docs.pylonsproject.org/projects/pyramid_tm/en/latest/index.html
Because MySQL supports consistent nonblocking reads, within the transaction you join when calling request.dbsession.execute you query a snapshot of the database taken at the start of the transaction. When you use the plain SQLAlchemy engine to execute the query, a new transaction is created and the expected result is returned.
https://dev.mysql.com/doc/refman/8.0/en/innodb-consistent-read.html
This is very confusing in this situation, but I must admit it's impressive how well it seems to work.
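One hedged option, not from the original answer: if you want the transaction-managed session to see rows committed after the request began, you could create the engine behind request.dbsession with READ COMMITTED isolation, so each statement reads the latest committed data instead of the snapshot pinned by the long-running transaction. A minimal sketch, assuming the sqlalchemy.url setting already used above:

from sqlalchemy import create_engine

# Sketch only: use this engine when configuring the session factory that backs
# request.dbsession (the exact wiring depends on your cookiecutter project).
engine = create_engine(
    settings["sqlalchemy.url"],
    isolation_level="READ COMMITTED",  # each SELECT sees the latest committed data
)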
Strange problem with my database hosted on AWS RDS. For a certain table, I sometimes suddenly get timeouts for almost all queries. Interestingly, for the other tables there are almost no timeouts (after 150,000 ms, which is the maximum I have set for the Lambda, it terminates), while they contain similar data.
This is the Lambda (the function that gets the data from the database) log:
15:38:10 Connecting db: jdbc:mysql://database.rds.amazonaws.com:3306/database_name Connected
15:38:10 Connection retrieved for matches_table matches, proceeding to statement
15:38:10 Statement created, proceeding to executing SQL
15:40:35 END RequestId: 410f7edf-0f48-45df-b509-a9b822fa5c1c
15:40:35 REPORT RequestId: 410f7edf-0f48-45df-b509-a9b822fa5c1c Duration: 150083.43 ms Billed Duration: 150000 ms Memory Size: 1024 MB Max Memory Used: 115 MB
15:40:35 2019-06-04T15:40:35.514Z 410f7edf-0f48-45df-b509-a9b822fa5c1c Task timed out after 150.08 seconds
And this is the Java code that I use:
LinkedList<Object> matches = new LinkedList<Object>();
try {
    String sql = db_conn.getRetrieveAllMatchesSqlSpecificColumn(userid, websiteid, profileid, matches_table, "matchid");
    Connection conn = db_conn.getConnection();
    System.out.println("Connection retrieved for matches_table " + matches_table + ", proceeding to statement");
    Statement st = conn.createStatement();
    System.out.println("Statement created, proceeding to executing SQL");
    // execute the query, and get a java resultset
    ResultSet rs = st.executeQuery(sql);
    System.out.println("SQL executed, now iterating to resultset");
    // iterate through the java resultset
    st.close();
} catch (SQLException ex) {
    Logger.getLogger(AncestryDnaSQliteJDBC.class.getName()).log(Level.SEVERE, null, ex);
}
return matches;
A couple of months ago I did a big database resource upgrade and removed some unwanted data, and that more or less fixed it. But if I look at the current stats, everything looks fine: plenty of RAM (1 GB) available, no swap used, enough CPU credits.
So I am not sure if this is a MySQL problem or a problem linked to AWS RDS. Any suggestions?
Alright, it turned out to be an AWS-specific thing. It turns out there is a kind of I/O credit system linked to the database. Interestingly, the chart that shows the number of credits left is not available in the default monitoring view of AWS RDS; you have to dive into CloudWatch, where it is quite hidden. By increasing the allocated storage for this database you earn more credits, and by doing so I fixed the problem.
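For anyone hunting for that hidden chart: the metric behind it is BurstBalance in the AWS/RDS CloudWatch namespace. A minimal sketch, not from the original answer, that reads it with boto3 (the region and DB instance identifier are placeholders):

import datetime
import boto3

# Placeholder region and instance identifier; replace with your own.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")
now = datetime.datetime.utcnow()
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="BurstBalance",  # percent of storage burst credits remaining
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
    StartTime=now - datetime.timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])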
I have a large query to execute through SQLAlchemy which returns approximately 2.5 million rows. It's connecting to a MySQL database. When I do:
transactions = Transaction.query.all()
it eventually times out after around ten minutes with this error: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
I've tried setting different parameters when calling create_engine, like:
create_engine(connect_args={'connect_timeout': 30})
What do I need to change so the query will not timeout?
I would also be fine if there is a way to paginate the results and go through them that way.
Solved by pagination:
page_size = 10000  # get x number of items at a time
step = 0
f = open('transactions.txt', 'w')  # placeholder output file; f was not shown in the original
while True:
    start, stop = page_size * step, page_size * (step + 1)
    transactions = sql_session.query(Transaction).slice(start, stop).all()
    if not transactions:  # .all() returns an empty list, never None
        break
    for t in transactions:
        f.write(str(t))
        f.write('\n')
    if len(transactions) < page_size:
        break
    step += 1
f.close()
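A hedged alternative to manual slicing, not part of the original answer, is to let SQLAlchemy stream the rows in batches with Query.yield_per(); the output file name is again a placeholder:

# Fetch rows in batches of 10,000 while iterating over a single query.
with open('transactions.txt', 'w') as f:  # placeholder file name
    for t in sql_session.query(Transaction).yield_per(10000):
        f.write(str(t))
        f.write('\n')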
I have one issue in my Hive code. I want to extract JSON data using Hive. The following is the sample JSON format:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"versionModified"{"machine":"123.dfer","founder":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
I want to get the following fields
ver
type
vehicle
ts
founder
state
The issue is that founder and state are inside the array "Version".
Can anybody help me figure out how to handle this?
Sometimes, instead of versionModified, some other key may come; e.g., sometimes my data will be like:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"anotherCriteria":{"engine":"123.dfer","developer":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
adding some sample data below:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC"{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP"{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX"{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
I need to put this data into various tables based on the key inside Version: if it is "BOX" then put it in one table, if it is "GAP" put it in another, and so on...
You can use a JSON SerDe to fetch all the fields.
Just follow the steps below.
1. Download the JSON SerDe from http://www.congiu.net/hive-json-serde/1.3/
2. Add the JSON SerDe JAR:
hive> ADD jar /root/json-serde-1.3-jar-with-dependencies.jar;
Added [/root/json-serde-1.3-jar-with-dependencies.jar] to class path
Added resources: [/root/json-serde-1.3-jar-with-dependencies.jar]
3. Create the table:
CREATE TABLE json_serde_table (
Rtype struct<ver:int, os:string,type:string,vehicle:string,MOD: struct<Version:Array<struct<versionModified:struct<machine:string,founder:string,state:string,fashion:string,cdc:string,dof:string,ts:string>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
4. Load the JSON file into the table:
hive> load data local inpath '/root/json.txt' INTO TABLE json_serde_table;
Loading data to table default.json_serde_table
Table default.json_serde_table stats: [numFiles=1, totalSize=234]
OK
Time taken: 0.877 seconds
5. Run the query below to get the result:
hive> select Rtype.ver ver ,Rtype.type type ,Rtype.vehicle vehicle ,Rtype.MOD.version[0].versionModified.ts ts,Rtype.MOD.version[0].versionModified.founder founder,Rtype.MOD.version[0].versionModified.state state from json_serde_table;
Query ID = root_20170412170606_a674d31b-31d7-477b-b9ff-3ebd76636cf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1491484583384_0018, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0018/
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-12 17:06:44,990 Stage-1 map = 0%, reduce = 0%
2017-04-12 17:06:53,361 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1491484583384_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 4891 HDFS Write: 50 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
1 ns Mh-3412 2000-04-01T00:00:00.171Z 3.0 Florida
Time taken: 19.745 seconds, Fetched: 1 row(s)
I'm connecting to a MySQL database through the Matlab Database Toolbox in order to run the same query over and over again within 2 nested for loops. After each iteration I get this warning:
Warning: com.mathworks.toolbox.database.databaseConnect#26960369 is not serializable
In Import_Matrices_DOandT_julaugsept_inflow_nomettsed at 476
Warning: com.mysql.jdbc.Connection#6e544a45 is not serializable
In Import_Matrices_DOandT_julaugsept_inflow_nomettsed at 476
Warning: com.mathworks.toolbox.database.databaseConnect#26960369 not serializable
In Import_Matrices_DOandT_julaugsept_inflow_nomettsed at 476
Warning: com.mysql.jdbc.Connection#6e544a45 is not serializable
In Import_Matrices_DOandT_julaugsept_inflow_nomettsed at 476
My code is basically structured like this:
%Server
host =
user =
password =
dbName =
%# JDBC parameters
jdbcString = sprintf('jdbc:mysql://%s/%s', host, dbName);
jdbcDriver = 'com.mysql.jdbc.Driver';
%# Create the database connection object
conn = database(dbName, user, password, jdbcDriver, jdbcString);
setdbprefs('DataReturnFormat', 'numeric');
%Loop
for SegmentNum = 3:41
    for tl = 1:15
        tic;
        sqlquery = ['giant string'];
        results = fetch(conn, sqlquery);
        % (some code here that saves the results into a few variables)
        save('inflow.mat');
    end
end
time = toc
close(conn);
clear conn
Eventually, after some iterations, the code crashes with this error:
Error using database/fetch (line 37)
Query execution was interrupted
Error in Import_Matrices_DOandT_julaugsept_inflow_nomettsed (line
466)
results = fetch(conn, sqlquery);
Last night it errored out after 25 iterations. I have about 600 iterations total that I need to do, and I don't want to have to keep checking back on it every 25. I've heard there can be memory issues with database connection objects... is there a way to keep my code running?
Let's take this one step at a time.
Warning: com.mathworks.toolbox.database.databaseConnect#26960369 is not serializable
This comes from this line
save('inflow.mat');
You are trying to save the whole workspace, including the database connection object, and that doesn't work. Try specifying only the variables you wish to save, and it should work better.
There are a couple of tricks for excluding specific variables, but honestly, I suggest you just find the most important variables you wish to save and save those. If you wish, you can piece together a solution from this page.
save inflow.mat a b c d e
Try wrapping the query in a try/catch block. Whenever you catch an error, reset the connection to the database, which should free up the object.
nQuery = 100;
while (nQuery > 0)
    try
        query_the_database();
        nQuery = nQuery - 1;
    catch
        reset_database_connection();
    end
end
The main reason for this is that database connection objects wrap TCP/IP ports, and multiple processes cannot access the same port. That is why database connection objects are not serializable: ports cannot be serialized.
A workaround is to create the connection within the for loop.