I am trying to build a system that tracks vehicle fuelings, and have run into a problem with one report: determining fuel efficiency in distance per unit of fuel. Sample data is:
odometer  fuel     partial_fillup
61290     10.3370  0
61542     6.4300   0
61735     4.3600   0
61994     7.5000   0
62242     5.4070   0
62452     8.1100   0
62713     5.7410   1
62876     9.4850   0
63243     6.1370   1
63499     10.7660  0
Here, odometer is the total distance the vehicle has traveled, fuel is the number of gallons or liters added, and partial_fillup is a boolean: non-zero means the fuel tank was not completely filled.
If the user fills the tank each time, the query I can use is:
set @a = null;

select
    odometer,
    odometer - previousOdometer as distance,
    fuel,
    (odometer - previousOdometer) / fuel as mpg,
    partial_fillup
from (
    select
        @a as previousOdometer,
        @a := odometer,
        odometer,
        fuel / 1000 as fuel,
        partial_fillup
    from fuel
    where vehicle_id = 1
      and odometer >= 61290
    order by odometer
) as readings
where readings.previousOdometer is not null;
However, when the user only partially fills the tank, the correct procedure is to subtract the odometer reading at the last full fill-up from the current odometer reading, then divide by the sum of all fuel added since that full fill-up. So at odometer 63499, the calculation would be (63499 - 62876) / (10.7660 + 6.1370).
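That procedure can be sketched in Python (hypothetical function and field names; data taken from the sample table above):

```python
# Sketch of the partial-fillup logic described above: a partial fill-up
# accumulates fuel, and the next full fill-up computes distance from the
# last full fill-up divided by all fuel added since then.

def efficiency(rows):
    """rows: list of (odometer, fuel, partial) tuples, ordered by odometer.
    Returns a list of (odometer, mpg) for each full fill-up after the first."""
    results = []
    last_full = None   # odometer at the last full fill-up
    pending = 0.0      # fuel added at partial fill-ups since then
    for odometer, fuel, partial in rows:
        if partial:
            pending += fuel
            continue
        if last_full is not None:
            results.append((odometer, (odometer - last_full) / (fuel + pending)))
        last_full = odometer
        pending = 0.0
    return results

rows = [
    (61290, 10.337, 0), (61542, 6.430, 0), (61735, 4.360, 0),
    (61994, 7.500, 0), (62242, 5.407, 0), (62452, 8.110, 0),
    (62713, 5.741, 1), (62876, 9.485, 0), (63243, 6.137, 1),
    (63499, 10.766, 0),
]
mpg = dict(efficiency(rows))
# e.g. mpg[63499] == (63499 - 62876) / (10.766 + 6.137), roughly 36.86
```

The same running-state idea is what the variable-based SQL further down implements with @oldOdometer and @totalFuel.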
This will get the average for the last interval between fill-ups:
select
    odometer,
    odometer - lag(odometer) over (order by odometer) as distance,
    fuel,
    (odometer - lag(odometer) over (order by odometer)) / fuel as mpg
from fuel
output:

odometer  distance  fuel     mpg
61290               10.3370
61542     252       6.4300   39.1913
61735     193       4.3600   44.2661
61994     259       7.5000   34.5333
62242     248       5.4070   45.8665
62452     210       8.1100   25.8940
62713     261       5.7410   45.4625
62876     163       9.4850   17.1850
63243     367       6.1370   59.8012
63499     256       10.7660  23.7786
Or you can calculate the total driving distance and the total amount of fuel used:
select
    distance,
    sum_fuel,
    distance / sum_fuel as mpg
from (
    select
        f.odometer,
        f.odometer - (select min(odometer) from fuel) as distance,
        fuel,
        sum_fuel
    from fuel f
    inner join (
        select
            odometer,
            sum(fuel) over (order by R) as sum_fuel
        from (
            select
                odometer,
                fuel,
                row_number() over (order by odometer) as R
            from fuel
        ) x
    ) x on x.odometer = f.odometer
) x2
which produces the following output; the figure gets closer to a true average the longer the measurement period:
distance  sum_fuel  mpg
0         10.3370   0.0000
252       16.7670   15.0295
445       21.1270   21.0631
704       28.6270   24.5922
952       34.0340   27.9720
1162      42.1440   27.5721
1423      47.8850   29.7170
1586      57.3700   27.6451
1953      63.5070   30.7525
2209      74.2730   29.7416
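The running average in that output can be reproduced with a short Python sketch (hypothetical names; data from the sample table):

```python
# Cumulative distance divided by cumulative fuel, mirroring the windowed
# sum in the query above. Note the first row's fuel is included in the
# running total even though it was consumed before the first reading.

def running_mpg(rows):
    """rows: list of (odometer, fuel) tuples, ordered by odometer.
    Returns (distance, sum_fuel, mpg) per row, distance measured from
    the first odometer reading."""
    start = rows[0][0]
    total_fuel = 0.0
    out = []
    for odometer, fuel in rows:
        total_fuel += fuel
        distance = odometer - start
        out.append((distance, total_fuel, distance / total_fuel))
    return out

rows = [(61290, 10.337), (61542, 6.430), (61735, 4.360), (61994, 7.500),
        (62242, 5.407), (62452, 8.110), (62713, 5.741), (62876, 9.485),
        (63243, 6.137), (63499, 10.766)]
result = running_mpg(rows)
# last row: distance 2209, sum_fuel 74.273, mpg roughly 29.74
```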
I was able to figure it out after studying Luuk's answer. I'm sure there is a more efficient way to do this; I am not used to using variables in SQL. But the answers are correct on the test data.
set @oldOdometer = null;
set @totalFuel = 0;

select
    s.odometer,
    format(fuel, 3) as fuel,
    s.distance,
    format(distance / fuel, 2) as mpg
from (
    select
        partial_fillup as partial,
        odometer,
        (fuel + @totalFuel) as fuel,
        @totalFuel as totalFuel,
        @oldOdometer as oldOdometer,
        if(partial_fillup, null, odometer - @oldOdometer) as distance,
        @totalFuel := if(partial_fillup, @totalFuel + fuel, 0) as pastFuel,
        @oldOdometer := if(partial_fillup, @oldOdometer, odometer) as runningOdometer
    from fuel
    order by odometer
) s
where s.distance is not null
order by s.odometer
limit 1, 999;
The limit 1,999 is there simply to skip the first row returned, since there is not enough data to calculate distance or mpg for it. On my copy of MySQL, doing this means you do not need to initialize the two variables (you can omit the set commands at the beginning), so it works very well with my reporting tool. If you do initialize them, you do not need the limit clause. This works as long as no more than 999 rows are returned.
The following scenario needs to be implemented in SQL:
Group by "Sl.No", compare the "Date" column to the current date (6/17/2022), and select one row to represent each group using these conditions:
If all dates in the group are in the future, pick the date nearest to the current date.
If the dates are in the past, pick the date nearest to the current date.
Here is the sample data:

Sl.No  Date       status  flag
1      8/25/2022  1       Y
1      6/17/2022  0       N
1      8/24/2022  0       Y
1      6/20/2022  1       N
2      6/28/2019  1       N
2      6/11/2019  1       N
2      6/30/2019  1       Y
3      7/25/2023  1       Y
3      6/17/2023  0       Y
3      8/14/2022  0       N
3      8/5/2023   0       N
Expected output:

Sl.No  Date       status  flag
1      6/20/2022  1       N
2      6/30/2019  1       Y
3      8/14/2022  0       Y
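The selection rule can be sketched in plain Python first (hypothetical names; this mirrors the accepted approach of taking the smallest non-zero distance to the target date):

```python
# Nearest-date-per-group sketch: for each Sl.No, keep the row whose date
# is closest to the target date, excluding rows that fall exactly on the
# target (the ddiff > 0 condition in the SQL answer below this text).
from collections import defaultdict
from datetime import date

def pick_nearest(rows, target):
    """rows: list of (sl_no, d, status, flag) tuples; returns {sl_no: row}."""
    groups = defaultdict(list)
    for row in rows:
        if row[1] != target:  # skip exact matches, like ddiff > 0
            groups[row[0]].append(row)
    return {k: min(v, key=lambda r: abs((r[1] - target).days))
            for k, v in groups.items()}

rows = [
    (1, date(2022, 8, 25), 1, "Y"), (1, date(2022, 6, 17), 0, "N"),
    (1, date(2022, 8, 24), 0, "Y"), (1, date(2022, 6, 20), 1, "N"),
    (2, date(2019, 6, 28), 1, "N"), (2, date(2019, 6, 11), 1, "N"),
    (2, date(2019, 6, 30), 1, "Y"),
    (3, date(2023, 7, 25), 1, "Y"), (3, date(2023, 6, 17), 0, "Y"),
    (3, date(2022, 8, 14), 0, "N"), (3, date(2023, 8, 5), 0, "N"),
]
picked = pick_nearest(rows, date(2022, 6, 17))
# picked[1] is the 6/20/2022 row, picked[2] the 6/30/2019 row,
# picked[3] the 8/14/2022 row
```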
Please refer to the query below:
with cte as (
    select slno, dates, status, flag,
           abs(datediff(now(), dates)) as ddiff
    from near_date
), cte_1 as (
    select slno, min(ddiff) as mdiff
    from cte
    where ddiff > 0
    group by slno
)
select cte.slno, cte.dates, cte.status, cte.flag
from cte
join cte_1
  on cte.slno = cte_1.slno
 and cte.ddiff = cte_1.mdiff
If you can fit the entire group into the memory of an executor, then here's a solution for you.
import pyspark.sql.functions as f
target = "6/17/2022"
df = spark.createDataFrame( data=[
(1,"8/25/2022",1,"Y"),
(1,"6/17/2022",0,"N"),
(1,"8/24/2022",0,"Y"),
(1,"6/20/2022",1,"N"),
(2,"6/28/2019",1,"N"),
(2,"6/11/2019",1,"N"),
(2,"6/30/2019",1,"Y"),
(3,"7/25/2023",1,"Y"),
(3,"6/17/2023",0,"Y"),
(3,"8/14/2022",0,"N"),
(3,"8/5/2023",0,"N"),],
schema = ["Sl_No","Date","status","flag"]
).withColumn("date", f.to_date(f.col("Date"), "MM/dd/yyyy")  # convert to date
).withColumn("target", f.to_date(f.lit(target), "MM/dd/yyyy"))  # add target column
grpd = df.groupby("Sl_No"
).agg(
    f.reverse(f.sort_array(  # sort descending into the correct order. Structs sort in column order, and all "Sl_No" in a group are equal, so the date field decides
        f.collect_list(  # collect all rows for the group into an array -- MUST FIT IN MEMORY
            f.struct(  # a struct keeps all the data from each row together
                *[f.col(column) for column in df.columns]  # shorthand to pass varargs of all columns
            )
        )
    )).alias("grouped_rows")
)
grpd.select(
    f.when(
        f.col("grouped_rows")[0].Date > f.col("grouped_rows")[0].target,  # first condition of the problem
        f.expr("sort_array(filter(grouped_rows, x -> x.Date > x.target))")[0]  # re-sort ascending after removing items at or before the target
    ).otherwise(  # second condition
        f.expr("filter(grouped_rows, x -> x.Date <= x.target)")[0]  # already sorted descending, so this works
    ).alias("rep")
).select(
    f.col("rep.*")  # turn the struct's fields into columns of the table
).show()
+-----+----------+------+----+----------+
|Sl_No| date|status|flag| target|
+-----+----------+------+----+----------+
| 1|2022-06-20| 1| N|2022-06-17|
| 3|2022-08-14| 0| N|2022-06-17|
| 2|2019-06-30| 1| Y|2022-06-17|
+-----+----------+------+----+----------+
The explain plan for this code has one shuffle:
== Physical Plan ==
Project [CASE WHEN (grouped_rows#886[0].Date > grouped_rows#886[0].target) THEN sort_array(filter(grouped_rows#886, lambdafunction((lambda x#993.Date > lambda x#993.target), lambda x#993, false)), true)[0] ELSE filter(grouped_rows#886, lambdafunction((lambda x#994.Date <= lambda x#994.target), lambda x#994, false))[0] END.Sl_No AS Sl_No#996L, CASE WHEN (grouped_rows#886[0].Date > grouped_rows#886[0].target) THEN sort_array(filter(grouped_rows#886, lambdafunction((lambda x#993.Date > lambda x#993.target), lambda x#993, false)), true)[0] ELSE filter(grouped_rows#886, lambdafunction((lambda x#994.Date <= lambda x#994.target), lambda x#994, false))[0] END.date AS date#997, CASE WHEN (grouped_rows#886[0].Date > grouped_rows#886[0].target) THEN sort_array(filter(grouped_rows#886, lambdafunction((lambda x#993.Date > lambda x#993.target), lambda x#993, false)), true)[0] ELSE filter(grouped_rows#886, lambdafunction((lambda x#994.Date <= lambda x#994.target), lambda x#994, false))[0] END.status AS status#998L, CASE WHEN (grouped_rows#886[0].Date > grouped_rows#886[0].target) THEN sort_array(filter(grouped_rows#886, lambdafunction((lambda x#993.Date > lambda x#993.target), lambda x#993, false)), true)[0] ELSE filter(grouped_rows#886, lambdafunction((lambda x#994.Date <= lambda x#994.target), lambda x#994, false))[0] END.flag AS flag#999, CASE WHEN (grouped_rows#886[0].Date > grouped_rows#886[0].target) THEN sort_array(filter(grouped_rows#886, lambdafunction((lambda x#993.Date > lambda x#993.target), lambda x#993, false)), true)[0] ELSE filter(grouped_rows#886, lambdafunction((lambda x#994.Date <= lambda x#994.target), lambda x#994, false))[0] END.target AS target#1000]
+- ObjectHashAggregate(keys=[Sl_No#708L], functions=[collect_list(named_struct(Sl_No, Sl_No#708L, date, date#716, status, status#710L, flag, flag#711, target, 19160), 0, 0)])
+- Exchange hashpartitioning(Sl_No#708L, 200)
+- ObjectHashAggregate(keys=[Sl_No#708L], functions=[partial_collect_list(named_struct(Sl_No, Sl_No#708L, date, date#716, status, status#710L, flag, flag#711, target, 19160), 0, 0)])
+- *(1) Project [Sl_No#708L, cast(cast(unix_timestamp(Date#709, MM/dd/yyyy, Some(America/Toronto)) as timestamp) as date) AS date#716, status#710L, flag#711]
+- Scan ExistingRDD[Sl_No#708L,Date#709,status#710L,flag#711]
The explain plan for the SQL from the other answer has six shuffles:
== Physical Plan ==
*(6) Project [slno#980L, date#716, status#710L, flag#711]
+- *(6) SortMergeJoin [slno#980L, ddiff#981], [slno#984L, mdiff#982], Inner
:- *(2) Sort [slno#980L ASC NULLS FIRST, ddiff#981 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(slno#980L, ddiff#981, 200)
: +- *(1) Project [Sl_No#708L AS slno#980L, cast(cast(unix_timestamp(Date#709, MM/dd/yyyy, Some(America/Toronto)) as timestamp) as date) AS date#716, status#710L, flag#711, abs(datediff(19160, cast(cast(unix_timestamp(Date#709, MM/dd/yyyy, Some(America/Toronto)) as timestamp) as date))) AS ddiff#981]
: +- *(1) Filter (isnotnull(Sl_No#708L) && isnotnull(abs(datediff(19160, cast(cast(unix_timestamp(Date#709, MM/dd/yyyy, Some(America/Toronto)) as timestamp) as date)))))
: +- Scan ExistingRDD[Sl_No#708L,Date#709,status#710L,flag#711]
+- *(5) Sort [slno#984L ASC NULLS FIRST, mdiff#982 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(slno#984L, mdiff#982, 200)
+- *(4) Filter isnotnull(mdiff#982)
+- *(4) HashAggregate(keys=[slno#984L], functions=[min(ddiff#985)])
+- Exchange hashpartitioning(slno#984L, 200)
+- *(3) HashAggregate(keys=[slno#984L], functions=[partial_min(ddiff#985)])
+- *(3) Project [Sl_No#708L AS slno#984L, abs(datediff(19160, cast(cast(unix_timestamp(Date#709, MM/dd/yyyy, Some(America/Toronto)) as timestamp) as date))) AS ddiff#985]
+- *(3) Filter ((abs(datediff(19160, cast(cast(unix_timestamp(Date#709, MM/dd/yyyy, Some(America/Toronto)) as timestamp) as date))) > 0) && isnotnull(Sl_No#708L))
+- Scan ExistingRDD[Sl_No#708L,Date#709,status#710L,flag#711]
From my SQL query I'm getting output such as datetime.datetime(2020, 9, 22, 0, 0).
query = '''SELECT checkin_date FROM `table1`
           WHERE checkin_date BETWEEN %s AND %s'''
cursor.execute(query, (startDate, endDate))
results = cursor.fetchall()
# results:
# [(datetime.datetime(2020, 9, 22, 0, 0), datetime.datetime(2020, 9, 24, 0, 0))]

for res in results:
    # When I print the type I get the expected result
    print(type(res[0]))  # <type 'datetime.datetime'>

    # When I compare with another datetime.date (the currentDate variable),
    # I get `TypeError: can't compare datetime.datetime to datetime.date`, which is expected
    if res[0] < currentDate:
        pass

    # But when I use .date(),
    # I get `TypeError: can't compare datetime.date to unicode`
    if res[0].date() < currentDate:
        pass
I tried converting currentDate to datetime.datetime, but it still doesn't work. I can't figure out what the issue is here.
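The second TypeError mentions unicode, which suggests currentDate is actually a string rather than a date object. A minimal sketch of a comparison that works, assuming currentDate arrives as text in YYYY-MM-DD form (the variable names here are illustrative):

```python
# Sketch: a "can't compare datetime.date to unicode" error usually means
# the other operand is a string. Parse it once, then compare date to date.
import datetime

row_value = datetime.datetime(2020, 9, 22, 0, 0)  # what the cursor returned
current_date = "2020-09-25"  # assumed to arrive as a string, per the error

if isinstance(current_date, str):
    current_date = datetime.datetime.strptime(current_date, "%Y-%m-%d").date()

is_past = row_value.date() < current_date
# is_past is True here: 2020-09-22 falls before 2020-09-25
```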
To force your query to spit out the date format you want, change it to this:
SELECT DATE_FORMAT(checkin_date, '%Y-%c-%d')
FROM table1
WHERE DATE(checkin_date) BETWEEN %s AND %s
To make it able to use an index on your checkin_date column, change it to this:
SELECT DATE_FORMAT(checkin_date, '%Y-%c-%d')
FROM table1
WHERE checkin_date >= DATE(%s)
AND checkin_date < DATE(%s) + INTERVAL 1 DAY
Try this for splitting a datetime column into year, month, and day parts:
SELECT Year(checkin_date), Month(Checkin_date), Day(Checkin_date),
FORMAT(GETDATE(),'HH'), FORMAT(GETDATE(),'mm')
FROM table1
WHERE (CAST(checkin_date AS DATE) BETWEEN '2018-01-01' AND '2020-01-01')
Note: Use 'HH' for 24 hours format and 'hh' for 12.
How can I remove the spurious rows I'm getting after the several joins that I ran? My entire query is:
SELECT
distinct vortex_dbo.vw_public_material_location.material_name
,vw_public_request_material_location_mir.material_request_id
,vw_public_request_material_location_mir.parttype_name
,operation_code
,vw_public_request_material_location_mir.result_name
,vw_public_request_material_location_mir.qdf_number
, requestor
,[vortex_hvc].[vortex_dbo].[material_request].created_by
,[vortex_hvc].[vortex_dbo].[material_request].created_datetime as time1
,[vortex_hvc].[vortex_dbo].[material_request].distribution_list
,[vortex_hvc].[vortex_dbo].[material_request].recipient_name
, DATEPART(WW,[vortex_hvc].[vortex_dbo].[material_request].created_datetime) as WW
,vw_public_request_material_location_mir.product_code_name
,task_name
,vw_public_request_material_location_mir.full_location_name
FROM [vortex_hvc].[vortex_dbo].[vw_public_request_material_location_mir]
left join request on vw_public_request_material_location_mir.material_request_id = request.request_key
left join vortex_dbo.material_request on vw_public_request_material_location_mir.material_request_id = vortex_dbo.material_request.material_request_id
left join vortex_dbo.vw_public_material_location on vw_public_request_material_location_mir.last_result_id = vortex_dbo.vw_public_material_location.last_result_id
left join vortex_dbo.vw_public_material_history on vw_public_request_material_location_mir.material_request_id like (substring(vw_public_material_history.comments,12,6))
where (vw_public_request_material_location_mir.qdf_number not like 'null' and vw_public_request_material_location_mir.qdf_number not like '')
and vw_public_request_material_location_mir.product_code_name like 'LAKE%'
and vw_public_request_material_location_mir.task_id not like 'null'
and (vw_public_request_material_location_mir.result_name like 'bin 100' or vw_public_request_material_location_mir.result_name like 'bin 01'
or vw_public_request_material_location_mir.result_name like 'bin 02' or vw_public_request_material_location_mir.result_name like 'pass')
and (requestor like 'BUGANIM, RINAT' and employee_name like 'BUGANIM, RINAT')
and ( DateDiff(DD,[vortex_hvc].[vortex_dbo].[material_request].created_datetime, getdate()) < 180)
and (concat('',substring(vortex_dbo.vw_public_material_location.comments,12,6)) like vw_public_request_material_location_mir.material_request_id
or vortex_dbo.vw_public_material_location.comments like 'Changed by Matrix Transaction Handler' or vortex_dbo.vw_public_material_location.comments like 'Unit Ownership:%')
and (unit_number = vortex_dbo.vw_public_material_location.material_name or unit_number is null)
and vortex_dbo.vw_public_material_location.material_name like 'D7QM748200403'
order by vortex_dbo.vw_public_material_location.material_name desc
The results I'm getting are two rows, of which only the second contains true data:
material_name material_request_id parttype_name operation_code result_name qdf_number requestor created_by time1 WW product_code_name task_name full_location_name
D7QM748200403 332160 H6 4GXDCV K Y 7295 BIN 01 Q1T5 BUGANIM, RINAT SMS_Interface 2017-12-03 20:27:30.327 49 CANNON LAKE Y 2+2 PPV-M SAMPLE: QDF INVENTORY
D7QM748200403 332176 H6 4GXDCV K Y 7295 BIN 01 Q1T5 BUGANIM, RINAT SMS_Interface 2017-12-03 21:02:33.247 49 CANNON LAKE Y 2+2 PPV-M SAMPLE: QDF INVENTORY
What can I do in order to retrieve only true data? I have more cases like this.
Thanks!!
I have a table named item with four attributes: name, code, class, value.
Now I want to group the rows in the following way:
group a: name='A', code=11, class='high', value between (5300 and 5310), (7100 and 7200), or (8210 and 8290)
group b: name='b', code=11, class='high', value between (1300 and 1310), (2100 and 2200), or (3210 and 3290)
How can I do it?
You might want to try something like this:

SELECT
    CASE
        WHEN code = 11 AND
             class = 'high' AND
             (value BETWEEN 5300 AND 5310 OR
              value BETWEEN 7100 AND 7200 OR
              value BETWEEN 8210 AND 8290)
            THEN 'A'
        WHEN code = 11 AND
             class = 'high' AND
             (value BETWEEN 1300 AND 1310 OR
              value BETWEEN 2100 AND 2200 OR
              value BETWEEN 3210 AND 3290)
            THEN 'B'
        ELSE 'Unknown'
    END AS name,
    *
FROM your_table
ORDER BY name
You might wish to change ORDER BY to GROUP BY, and you should be aware that BETWEEN includes both endpoints.
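The same classification rule can be checked in Python (hypothetical names; note the range checks apply to value, per the question, and that both range endpoints are included, matching SQL's BETWEEN):

```python
# Sketch of the grouping rule from the question: rows with code 11 and
# class 'high' fall into group 'A' or 'B' depending on which set of
# value ranges they hit; anything else is 'Unknown'.

A_RANGES = [(5300, 5310), (7100, 7200), (8210, 8290)]
B_RANGES = [(1300, 1310), (2100, 2200), (3210, 3290)]

def in_any(value, ranges):
    # inclusive on both endpoints, like SQL's BETWEEN
    return any(lo <= value <= hi for lo, hi in ranges)

def group_name(code, klass, value):
    if code == 11 and klass == "high":
        if in_any(value, A_RANGES):
            return "A"
        if in_any(value, B_RANGES):
            return "B"
    return "Unknown"

# group_name(11, "high", 7150) returns "A";
# group_name(11, "high", 2200) returns "B" (endpoint included)
```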
First group:

select *
from item
where name like 'A'
  and code like '11'
  and class like 'high'
  and (value between 5300 and 5310
       or value between 7100 and 7200
       or value between 8210 and 8290)
The same idea applies for group b.