I have an external table with a single column, data, where each value is a JSON object.
When I run the following Hive query:
hive> select get_json_object(data, "$.ev") from data_table limit 3;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201212171824_0218, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201212171824_0218
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201212171824_0218
2013-01-24 10:41:37,271 Stage-1 map = 0%, reduce = 0%
....
2013-01-24 10:41:55,549 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212171824_0218
OK
2
2
2
Time taken: 21.449 seconds
But when I run the sum aggregation, the result looks strange:
hive> select sum(get_json_object(data, "$.ev")) from data_table limit 3;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201212171824_0217, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201212171824_0217
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201212171824_0217
2013-01-24 10:39:24,485 Stage-1 map = 0%, reduce = 0%
.....
2013-01-24 10:41:00,760 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212171824_0217
OK
9.4031522E7
Time taken: 100.416 seconds
Could anyone explain why this happens? And what should I do to make it work properly?
Hive seems to be reading the values in your JSON as floats instead of ints, and since your table is pretty big, Hive is probably using the "exponent" (scientific) notation for large float values, so 9.4031522E7 probably means 94031522.
If you want to make sure you're summing over ints, you can cast the JSON field to int, and the sum will then return an int:
$ hive -e "select sum(get_json_object(dt, '$.ev')) from json_table"
8.806305E7
$ hive -e "select sum(cast(get_json_object(dt, '$.ev') as int)) from json_table"
88063050
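(The scientific notation is only a display format; a quick check in Python, for example, confirms the two spellings are the same number.)
# 9.4031522E7 is just scientific notation for 94,031,522
value = float("9.4031522E7")
print(int(value))  # prints 94031522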
I am using the following approach to make DB calls:
for record in records:
    num = "'" + str(record['Number']) + "'"
    id = "'" + str(record['Id']) + "'"
    query = """select col2_text,col3_text from table where id= {} and num = {} and is_active = 'Y';""".format(id, num)
Since this is an iteration, the total number of DB calls equals the number of records. I want to optimize this and make the minimum number of DB calls, ideally a single one.
You can reduce the number of DB calls to a single one. You might want to have a look at the SQL IN operator.
You could do the following:
values = ""
for record in records:
num = "'"+str(record['Number'])+"'"
id = "'"+str(record['Id'])+"'"
values += "({},{}),".format(num, id)
values = values[:-1]
query = """select col2_text,col3_text from table where (id, num) in ({}) and is_active = 'Y';""".format(values)
I have a data frame in pyspark like below
df = spark.createDataFrame(
    [
        ('2021-10-01', 'A', 25),
        ('2021-10-02', 'B', 24),
        ('2021-10-03', 'C', 20),
        ('2021-10-04', 'D', 21),
        ('2021-10-05', 'E', 20),
        ('2021-10-06', 'F', 22),
        ('2021-10-07', 'G', 23),
        ('2021-10-08', 'H', 24),
    ],
    ("RUN_DATE", "NAME", "VALUE"))
Now, using this data frame, I want to update a table in MySQL.
# query to run should be similar to this
update_query = "UPDATE DB.TABLE SET DATE = '2021-10-01', VALUE = 25 WHERE NAME = 'A'"
# mysql_conn is a function which I use to connect to `MySql` from `pyspark` and run queries
# Invoking the function
mysql_conn(host, user_name, password, update_query)
When I invoke the mysql_conn function with these parameters, the query runs successfully and the record gets updated in the MySQL table.
Now I want to run the update statement for all the records in the data frame.
For each NAME it has to pick the RUN_DATE and VALUE and replace in update_query and trigger the mysql_conn.
I think we need a for loop, but I am not sure how to proceed.
Instead of iterating through the dataframe with a for loop, it would be better to distribute the workload across the partitions using foreachPartition. Moreover, since you are building a custom query anyway, it is more efficient to execute one batch operation per partition instead of one query per row, which reduces round trips, latency and concurrent connections. For example:
def update_db(rows):
    # Build one derived table (via UNION ALL) from all rows in this partition.
    temp_table_query = ""
    for row in rows:
        if len(temp_table_query) > 0:
            temp_table_query = temp_table_query + " UNION ALL "
        temp_table_query = temp_table_query + " SELECT '%s' as RUNDATE, '%s' as NAME, %d as VALUE " % (row.RUN_DATE, row.NAME, row.VALUE)
    if not temp_table_query:
        # Empty partition: nothing to update.
        return
    update_query = """
    UPDATE DBTABLE
    INNER JOIN (
        %s
    ) new_records ON DBTABLE.NAME = new_records.NAME
    SET
        DBTABLE.DATE = new_records.RUNDATE,
        DBTABLE.VALUE = new_records.VALUE
    """ % (temp_table_query)
    mysql_conn(host, user_name, password, update_query)

df.foreachPartition(update_db)
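Here, mysql_conn is your own helper, so only you know its exact contents; for completeness, a minimal sketch of what such a helper could look like, assuming mysql-connector-python and a database named DB (both assumptions):
import mysql.connector

def mysql_conn(host, user_name, password, query):
    # Open a connection, run a single statement, commit and clean up.
    conn = mysql.connector.connect(host=host, user=user_name,
                                   password=password, database="DB")  # "DB" is assumed
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        conn.commit()
    finally:
        conn.close()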
Let me know if this works for you.
I am completely new to Power BI. I have a sample data table as below.
I want to calculate the PASS percentage (total number of PASS cases / total number of cases) and the FAIL percentage (total number of FAIL cases / total number of cases) for each day. I tried the measure below, but it is not giving me the expected results.
Blocked by date = Calculate (
    countrows (TableName),
    (TableName[TestDate].[Date],[Total BLOCKED])
)
Here, Total BLOCKED is the measure I created to filter on the status.
Where am I going wrong here? How should I calculate the percentage of PASS and FAIL status for each day?
It would be similar to this:
The total number of statuses measure:
status_total =
VAR passed = COUNT(Table_1[status])
RETURN IF( ISBLANK(passed), 0, passed)
Passed measure:
passed =
VAR passed = CALCULATE(
    COUNTROWS(Table_1),
    FILTER(Table_1, Table_1[Status] = "PASS")
)
RETURN IF( ISBLANK(passed), 0, passed)
Failed measure:
failed =
VAR failed = CALCULATE(
    COUNTROWS(Table_1),
    FILTER(Table_1, Table_1[Status] = "FAIL")
)
RETURN IF( ISBLANK(failed), 0, failed )
PASS ratio:
passed % = DIVIDE([passed], [status_total],0)
FAIL ratio:
failed % = DIVIDE([failed], [status_total],0)
Those three separate measures can, of course, be combined into one if needed.
You can also use the "Divide" quick measure, which is filter based: the count of the "status" column filtered to "PASS", divided by the count of the "status" column without any filter.
We have a table with 13M rows. Its name and surname fields are nil by default; when we try to push some data, the job stops running after about 1.2M queries. We looped in batches of 10k rows because of a RAM issue.
The algorithm is:
$i = 0
until $i > 13000 do
  b = Tahsil.where("NO < ?", (10000 * ($i + 1))).offset(10000 * $i)
  b.each do |a|
    a.name = Generator('name')
    a.surname = Generator('surname')
    a.save
  end
  $i += 1
end
Ruby on Rails has some methods built in that you might want to use:
Tahsil.find_each do |tahsil|
  tahsil.update(name: Generator('name'), surname: Generator('surname'))
end
find_each iterates through all records in batches (with a default batch size of 1000). update updates a record.
I am trying to increase the speed of an UPDATE using JDBC. The table has 400,000 rows, and I want to set column X to 5 for 300,000 of them. I need the rows sorted by their primary keys, because I want to change the top 300,000 rows.
Here is what I have done so far.
First, I tried to execute each update as soon as it was created, using a prepared statement:
long time = System.currentTimeMillis();
System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date(time)));
for (int i = number - 1; i >= 0; i--) {
    try {
        pStmt.setInt(1, (int) hashMap.get(i));
        time = System.currentTimeMillis();
        result = pStmt.executeUpdate();
        time = System.currentTimeMillis() - time;
        totalTime += time;
    } catch (Exception e) {}
}
System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date(System.currentTimeMillis())));
time = System.currentTimeMillis() - time;
But it took a lot of time, about 900 seconds.
Then I tried to add them to a batch, using addBatch()
java.sql.PreparedStatement pStmt2 = conn.prepareStatement(query);
for (int i = 0; i < number; i++) {
    pStmt2.setInt(1, (int) hashMap.get(i));
    pStmt2.addBatch();
}
long time = System.currentTimeMillis();
System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date(time)));
pStmt2.executeBatch();
System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(new Date(System.currentTimeMillis())));
time = System.currentTimeMillis() - time;
It was better, but not by much: only about a 50-second difference.
I also tried executing in blocks: for the 300,000 rows, I added every 30,000 queries to a batch and then executed it, and repeated this 10 times.
That was much better but still too slow, about 800 seconds.
Then I tried to do this with a smaller number of rows, such as 10,000, in a single statement:
Update tabel1 set outcome=3 order by key DESC limit 10000.
This takes the first 10,000 rows and changes the outcome cell of each of them. It took about 24 seconds, which is close to the other tests when I tried them with 10,000 rows.
Then I tried to do this using a cursor. Updating 10,000 rows took only 4 seconds.
The result is interesting.
But I want to update them using JDBC. Are there any other ways to do it? And why is the cursor so fast?