Logical error in MySQL query while selecting data

Table pc -
code model speed ram hd cd price
1 1232 500 64 5.0 12x 600.0000
10 1260 500 32 10.0 12x 350.0000
11 1233 900 128 40.0 40x 980.0000
12 1233 800 128 20.0 50x 970.0000
2 1121 750 128 14.0 40x 850.0000
3 1233 500 64 5.0 12x 600.0000
4 1121 600 128 14.0 40x 850.0000
5 1121 600 128 8.0 40x 850.0000
6 1233 750 128 20.0 50x 950.0000
7 1232 500 32 10.0 12x 400.0000
8 1232 450 64 8.0 24x 350.0000
9 1232 450 32 10.0 24x 350.0000
Desired output -
model speed hd
1232 450 10.0
1232 450 8.0
1232 500 10.0
1260 500 10.0
Query 1 -
SELECT model, speed, hd
FROM pc
WHERE cd = '12x' AND price < 600
OR
cd = '24x' AND price < 600
Query 2 -
SELECT model, speed, hd
FROM pc
WHERE cd = '12x' OR cd = '24x'
AND price < 600
Query 1 is definitely working correctly; however, when I tried to reduce the query so that price is tested only once, it does not show the correct result. Let me know what I am missing in the logic.
Find the model number, speed and hard drive capacity of the PCs having
12x CD and prices less than $600 or having 24x CD and prices less than
$600.

Since AND comes before OR, your query is being interpreted as:
WHERE (cd = '12x') OR ((cd = '24x') AND (price < 600))
Or, in words: all PCs having a 12x CD at any price, or PCs under $600 having a 24x CD. With the sample data, that is why the two 12x machines priced at exactly $600 (models 1232 and 1233, hd 5.0) show up in Query 2's result even though Query 1 correctly excludes them.
You need to use parentheses to specify order of operations:
WHERE (cd = '12x' OR cd = '24x') AND price < 600
Or, you can use IN:
WHERE cd IN ('12x', '24x') AND price < 600
See Also: http://dev.mysql.com/doc/refman/5.5/en/operator-precedence.html

Your table may contain duplicate rows, so try using a GROUP BY clause as shown below; that should give you the solution. Let me know the output after trying it, thanks.
SELECT model, speed, hd
FROM PC
WHERE cd IN ('12x','24x') AND price < 600
GROUP BY model, speed, hd

Try using IN:
SELECT model, speed, hd
FROM PC
WHERE cd IN ('12x','24x') AND price < 600
Good luck.

Related

How to diagnose cause of slow JDBC write operations while running PySpark job on AWS Glue

I've been stuck on this one for a couple of days so any help is greatly appreciated. I have a Spark ETL job running on AWS Glue and am having a hard time optimizing write performance.
Summary of the job:
A large dataset is being read from S3 (~5 million records)
A smaller dataset is being read from an RDS MySQL instance (hundreds of records)
The task at hand calls for a pairwise comparison of each record from both datasets. I went with a cross join (cartesian product), which itself seems to work fine. I know this is an expensive operation, but I haven't been able to figure out how to avoid expanding the dataset in that manner in order to do the pairwise comparison; it's just the nature of the problem. The good news is that all of the core transformation work is done on a row-by-row basis: there are no group-bys or aggregate functions required, and there is one numerical filter that is executed after the cross join and subsequent computations. If I use .show() as an action to force all of these operations to run, it takes around 40 minutes. Not bad.
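For orientation, here is a minimal sketch of what that cross-join step might look like (df_large, df_small, the "score" column, the metric columns, and the 0.5 threshold are placeholder names, not the job's real ones; broadcasting the small RDS dataset is an assumption worth testing, since it lets the cartesian product run without shuffling the large side):
from pyspark.sql import functions as F

# Hedged sketch: df_large (~5M rows from S3) and df_small (hundreds of rows from RDS)
# stand in for the two dataframes described above. broadcast() ships the small side
# to every executor so the cross join avoids shuffling the large dataset.
pairs = df_large.crossJoin(F.broadcast(df_small))

# Row-by-row comparison followed by the single numerical filter
# ("score", the metric columns, and 0.5 are hypothetical).
result = pairs.withColumn("score", F.abs(df_large["metric"] - df_small["metric"])) \
              .filter(F.col("score") < 0.5)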
The problem arises when I go to write this data to MySQL using the built-in Spark JDBC connector as such:
joined_df.write.format("jdbc") \
    .option("url", url_write) \
    .option("batchsize", 100) \
    .option("dbtable", "{0}.{1}".format(database, table)) \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()
After 4 hours, not a single record is written.
When I run the job over a smaller sample of data, the writes go slowly but they do occur so database connectivity isn't the issue. I have also tried modifying the connection string to set useServerPrepStmts=false and rewriteBatchedStatements=true and using a small batch size of 100.
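For reference, here is a minimal sketch of how those connection-string flags and the batch size might be combined in the write (the batch size of 1000, the coalesce to 36 partitions, and appending the flags directly to url_write are assumptions to experiment with, not a confirmed fix):
# Hedged sketch: url_write, database, table, user and password come from the job above.
# rewriteBatchedStatements/useServerPrepStmts are MySQL Connector/J URL parameters;
# appending them like this assumes url_write has no query string yet.
url_batched = url_write + "?rewriteBatchedStatements=true&useServerPrepStmts=false"

joined_df.coalesce(36) \
    .write.format("jdbc") \
    .option("url", url_batched) \
    .option("dbtable", "{0}.{1}".format(database, table)) \
    .option("user", user) \
    .option("password", password) \
    .option("batchsize", 1000) \
    .mode("append") \
    .save()
Each partition opens its own JDBC connection, so capping the partitions at roughly one per core keeps the number of simultaneous connections to MySQL bounded, which can matter if the RDS instance has a low connection or IOPS ceiling.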
I am using 10 DPUs with the G.1X worker type, which provides 36 cores in total. I used the 3x rule of thumb to pick 108 as the number of partitions, which I apply to the larger dataset before converting the DynamicFrame into a DataFrame as such:
df1 = glue_context.create_dynamic_frame.from_catalog(
    database = glue_database,
    table_name = "test_tbl1",
    push_down_predicate = "(upload_date >= CAST('{0}' AS TIMESTAMP))".format(lookback_date)
).repartition(108).toDF()
I know that 10 DPUs is relatively light on compute, but I have a feeling something else is going on here, since no records are being written after hours. I have also tried various numbers of partitions. I set up the Spark UI to get some additional metrics, and I am including the metrics for my latest attempt, which timed out after 4 hours with zero records written. I would have posted a screenshot if my reputation allowed it (sorry that this is so ugly). My question is: what should I be focusing on here that may help me get to the answer of why the writes are hanging? Any wisdom would be greatly appreciated.
Executor ID Address Status RDD Blocks Storage Memory Disk Used Cores Active Tasks Failed Tasks Complete Tasks Total Tasks Task Time (GC Time) Input Shuffle Read Shuffle Write
driver 172.31.33.169:37095 Active 0 0.0 B / 5.8 GiB 0.0 B 0 0 0 0 0 0.0 ms (0.0 ms) 0.0 B 0.0 B 0.0 B
1 172.31.45.233:45821 Active 0 0.0 B / 5.8 GiB 0.0 B 4 3 0 4 7 1.5 min (4 s) 91.1 MiB 0.0 B 306.3 MiB
2 172.31.47.118:44467 Active 0 0.0 B / 5.8 GiB 0.0 B 4 4 0 5 9 1.2 min (3 s) 82.1 MiB 0.0 B 281.3 MiB
3 172.31.42.73:45151 Active 0 0.0 B / 5.8 GiB 0.0 B 4 3 0 4 7 1.4 min (4 s) 89.7 MiB 0.0 B 303.4 MiB
4 172.31.43.123:34465 Active 0 0.0 B / 5.8 GiB 0.0 B 4 4 0 4 8 1.3 min (5 s) 79.5 MiB 0.0 B 274.1 MiB
5 172.31.40.44:46117 Active 0 0.0 B / 5.8 GiB 0.0 B 4 3 0 4 7 1.5 min (5 s) 89.1 MiB 0.0 B 304.6 MiB
6 172.31.36.20:42763 Active 0 0.0 B / 5.8 GiB 0.0 B 4 4 0 4 8 1.1 min (3 s) 21.9 MiB 0.0 B 74.3 MiB
7 172.31.32.240:45533 Active 0 0.0 B / 5.8 GiB 0.0 B 4 4 0 7 11 1.3 min (4 s) 91.6 MiB 59 B 312.4 MiB
8 172.31.36.49:34471 Active 0 0.0 B / 5.8 GiB 0.0 B 4 3 0 7 10 1.5 min (5 s) 101.6 MiB 0.0 B 347.4 MiB
9 172.31.43.177:40385 Active 0 0.0 B / 5.8 GiB 0.0 B 4 3 0 5 8 1.4 min (5 s) 85.7 MiB 0.0 B 294.1 MiB

Error with anova_test: contrasts can be applied only to factors with 2 or more levels

I'm trying to run a repeated measures ANOVA using the rstatix package but I'm experiencing an error that doesn't make sense to me.
My dataframe is this:
id Group Time startle
<fct> <fct> <fct> <dbl>
1 55 WT SGH S1_120db 5.24
2 102 WT SGH S2_120db 7.12
3 167 WT SGH S3_120db 9.64
4 226 WT SGH S4_120db 20.7
5 278 WT SGH S5_120db 15.4
6 345 WT SGH S6_120db 10.8
7 394 WT SGH S7_120db 15.1
8 456 WT SGH S8_120db 9.52
9 508 WT SGH S9_120db 10.4
10 571 WT SGH S10_120db 12.8
And I would like to analyse differences of startle between Groups within Time. My ANOVA code is:
res5 <- anova_test(
  data = maleP120,
  dv = startle,
  wid = id,
  between = Group,
  within = Time)
However, when I run this code I get this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I have looked it up, but most people who ask about this error either have NAs in their datasets or have only one level in their factor variable(s). I can attest that I have no NAs and all of my factor variables have at least two levels. If I run:
sapply(lapply(maleP120, unique), length)
I get this output:
id Group Time startle
754 3 13 719
I have no idea why this ANOVA isn't working. Any help would be much appreciated.
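One diagnostic worth running (a sketch only, not a confirmed fix) is to compare each factor's declared levels with the levels actually present after dropping unused ones, since unique() counts distinct values while the contrasts error is about factor levels:
# Hedged diagnostic sketch: maleP120 is the data frame shown above.
# nlevels() reports declared levels; droplevels() removes unused ones, so a
# "used" value below 2 for any factor would explain the contrasts error.
facs <- maleP120[sapply(maleP120, is.factor)]
sapply(facs, function(x) c(declared = nlevels(x), used = nlevels(droplevels(x))))
If every factor reports at least two used levels here, the most common cause of this error is at least ruled out.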

Want to calculate the highest Total Pay value for a table?

I want a list of employees who have worked on the activity that has the highest Total Pay value.
Don't use hard-coded values such as …where actid = 151… etc.
• Note: Total Pay for an activity is the sum of (Hours Worked * matching Hourly Rate)
(e.g. Total Pay for Activity 151 is 10.5 hrs @ $50.75 + 11.5 hrs @ $25 + 3 hrs @ $33).
You must use a subquery in your solution.
ACTID HRSWORKED HOURLYRATE Total Pay
163 10 45.5 455
163 8 45.5 364
163 6 45.5 273
151 5 50.75 253.75
151 5.5 50.75 279.125
155 10 30 300
155 10 30 300
165 20 25 500
155 10 30 300
155 8 27 216
151 11.5 25 287.5
151 1 33 33
151 1 33 33
151 1 33 33
Your time and effort are much appreciated. Thanks!
Without knowledge of the schema, I can only provide a possible sketch (you'll have to compute total pay and provide all necessary JOINs and predicates):
SELECT DISTINCT(employee id) -- reconfigure if more than just employee id
FROM <table(s)>
[WHERE...]
{ WHERE | AND } total pay = (SELECT MAX(total pay)
FROM <table(s)>
[WHERE...]);
I used DISTINCT because it's possible to have more than one activity with the same MAX value and overlapping employees. If you're including ACTID in the output, then you won't need DISTINCT, because the same employee shouldn't be on a project twice (unless they are tracked by roles on a project, in which case a single employee might have multiple roles; it all depends on the data set).
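To make the sketch concrete, here is one possible shape using the columns shown above (the table name WORKS and the EMPID column are assumptions, since the real schema and employee column are not shown):
-- Hedged sketch: WORKS and EMPID are assumed names for the table and the
-- employee column. Total Pay per activity is SUM(HRSWORKED * HOURLYRATE),
-- and the HAVING clause keeps every activity tied for the highest total.
SELECT DISTINCT EMPID
FROM WORKS
WHERE ACTID IN (
    SELECT ACTID
    FROM WORKS
    GROUP BY ACTID
    HAVING SUM(HRSWORKED * HOURLYRATE) = (
        SELECT MAX(total_pay)
        FROM (SELECT SUM(HRSWORKED * HOURLYRATE) AS total_pay
              FROM WORKS
              GROUP BY ACTID) t
    )
);
Because the HAVING comparison keeps ties, if two activities share the top Total Pay, employees from both are returned.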

Cannot print out the latest results of table

I have the following table:
NAMES:
Fname | stime | etime | Ver | Rslt
x 4 5 1.01 Pass
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 4 8 1.01 Fail
y 9 10 1.01 Fail
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 1 2 1.01 Fail
m 4 6 1.01 Fail
The result I am trying to output is:
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 4 6 1.01 Fail
What the result means:
Fnames are examples of tests that are run. Each test was run on different platforms of software (the version numbers). Some tests were run on the same platform twice: it passed the first time and failed the second time, or vice versa. My required output is basically the latest result of each test for each version. So the rows above are unique by their combination of Fname and Ver(sion), and each one is selected by the latest etime within that group.
The query I have so far is:
select Fname,stime,max(etime),ver,Rslt from NAMES group by Fname,Rslt;
This however, does not give me the required output.
The output I get is (wrong):
x 4 10 1.01 Fail
x 6 7 1.02 Pass
y 4 12 1.01 Pass
y 10 14 1.02 Fail
m 1 6 1.01 Fail
Basically it takes the max time, but it does not print the correct data: it pairs the max etime with the stime of the whole group instead of the stime of that particular record.
I have tried for a long time to fix this, but I seem to be getting nowhere. I have a feeling there is a join somewhere in here, but I tried that too, with no luck.
Any help is appreciated,
Thank you.
Use a subquery to get the max ETime by FName and Ver, then join your main table to it:
SELECT
NAMES.FName,
NAMES.STime,
NAMES.ETime,
NAMES.Ver,
NAMES.Rslt
FROM NAMES
INNER JOIN (
SELECT FName, Ver, MAX(ETime) AS MaxETime
FROM NAMES
GROUP BY FName, Ver
) T ON NAMES.FName = T.FName AND NAMES.Ver = T.Ver AND NAMES.ETime = T.MaxETime
You could first find the latest result, i.e. max(etime), for each test and each version:
select Fname,Ver,max(etime) from NAMES group by Fname,Ver;
From there you can display the whole row by joining back to the table:
select *
from NAMES
inner join
  (select Fname, Ver, max(etime) as etime
   from NAMES
   group by Fname, Ver) sub1
using (Fname, Ver, etime)
order by Fname, Ver;

MySQL 10x slower on one server compared to another

I have a live server and my dev server, and I am finding that queries on my LIVE (not dev) server run 10x slower, even though the live server is more powerful and they are both running comparable load. It's not a database structure thing because I load the backup from the live server into my dev server.
Does anybody have any ideas on where I could look for the discrepancy? Could it be a MySQL config thing? Where should I start looking?
Live Server:
mysql> SELECT count(`Transaction`.`id`) as count, sum(`Transaction`.`amount`) as sum, sum(Transaction.citiq_margin+rounding + Transaction.citiq_margin_vat) as revenue FROM `transactions` AS `Transaction` LEFT JOIN `meters` AS `Meter` ON (`Transaction`.`meter_id` = `Meter`.`id`) LEFT JOIN `units` AS `Unit` ON (`Meter`.`unit_id` = `Unit`.`id`) WHERE (NOT (`Unit`.`building_id` IN ('1', '85')) AND NOT (`Transaction`.`state` >= 90)) AND DAY(`Transaction`.`created`) = DAY(NOW()) AND YEAR(`Transaction`.`created`) = YEAR(NOW()) AND (MONTH(`Transaction`.`created`)) = MONTH(NOW());
+-------+---------+---------+
| count | sum | revenue |
+-------+---------+---------+
| 413 | 3638550 | 409210 |
+-------+---------+---------+
1 row in set (2.62 sec)
[root@mises ~]# uptime
17:11:57 up 55 days, 1 min, 1 user, load average: 0.45, 0.56, 0.60
Dev Server (result count is different because of slight time delay from backup):
mysql> SELECT count(`Transaction`.`id`) as count, sum(`Transaction`.`amount`) as sum, sum(Transaction.citiq_margin+rounding + Transaction.citiq_margin_vat) as revenue FROM `transactions` AS `Transaction` LEFT JOIN `meters` AS `Meter` ON (`Transaction`.`meter_id` = `Meter`.`id`) LEFT JOIN `units` AS `Unit` ON (`Meter`.`unit_id` = `Unit`.`id`) WHERE (NOT (`Unit`.`building_id` IN ('1', '85')) AND NOT (`Transaction`.`state` >= 90)) AND DAY(`Transaction`.`created`) = DAY(NOW()) AND YEAR(`Transaction`.`created`) = YEAR(NOW()) AND (MONTH(`Transaction`.`created`)) = MONTH(NOW());
+-------+---------+---------+
| count | sum | revenue |
+-------+---------+---------+
| 357 | 3005550 | 338306 |
+-------+---------+---------+
1 row in set (0.22 sec)
[www@smith test]$ uptime
18:11:53 up 12 days, 1:57, 4 users, load average: 0.91, 0.75, 0.62
Live Server (2 x Xeon Quadcore):
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
cpu MHz : 2395.000
cache size : 12288 KB
physical id : 0
siblings : 8
core id : 10
cpu cores : 4
Dev Server (1 x Quadcore)
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
stepping : 10
microcode : 0xa07
cpu MHz : 1998.000
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
Live Server:
CentOS 5.7
MySQL ver 5.0.95
Dev Server:
ArchLinux
MySQL ver 5.5.25a
The obvious first thing to check would be your MySQL configuration file, to make sure you are allocating an appropriate amount of memory for queries, such as key_buffer, sort_buffer, etc. There are far smarter people than me out there who have entire blogs dedicated to configuring MySQL.
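As a starting point, the relevant settings can be compared directly from the mysql client on both servers (a quick check, not an exhaustive list):
-- Run on both servers and compare the values side by side.
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'query_cache%';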
You can also prepend your query with "explain" to see what is taking the most time... but that might just be something for general use later on.
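For example, running EXPLAIN on the query from the question on both servers will show whether 5.0 and 5.5 pick different indexes or row estimates (this is just the original query with EXPLAIN prepended):
EXPLAIN SELECT count(`Transaction`.`id`) AS count,
       sum(`Transaction`.`amount`) AS sum,
       sum(Transaction.citiq_margin + rounding + Transaction.citiq_margin_vat) AS revenue
FROM `transactions` AS `Transaction`
LEFT JOIN `meters` AS `Meter` ON (`Transaction`.`meter_id` = `Meter`.`id`)
LEFT JOIN `units` AS `Unit` ON (`Meter`.`unit_id` = `Unit`.`id`)
WHERE (NOT (`Unit`.`building_id` IN ('1', '85')) AND NOT (`Transaction`.`state` >= 90))
  AND DAY(`Transaction`.`created`) = DAY(NOW())
  AND YEAR(`Transaction`.`created`) = YEAR(NOW())
  AND MONTH(`Transaction`.`created`) = MONTH(NOW());
Comparing the key and rows columns of the two plans shows whether the servers are choosing different join orders or indexes.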
In reality, your "live" server has caching capabilities and twice the number of cores available for these queries, and it has more than enough horsepower and memory, so raw hardware is unlikely to explain the difference in query times between the servers.
So, I ran the same database and queries on a virtual machine running CentOS with 1 CPU and 512 MB of memory: it provides the answer to that query in 0.3 seconds, with a system load of 0.4 :/
The only real difference seems to be that I am running MySQL 5.5 on that server, and it seems that there really is a 10x performance improvement in my case from MySQL 5.0 to MySQL 5.5.
I will only know for sure once I have migrated my live servers from MySQL 5.0 to MySQL 5.5; I will confirm the results once I have done that.