Duplicates issue - duplicates

I have a problem with duplicates.
What I actually need is to see only the duplicates, but my table has many variables, something like the one below:
a b c d e
32 ayi dam som kem
32 ayi dam som tws
32 ayi dam tsm tws
12 mm ds de ko
12 mm tmm to ko
I am trying to keep only the rows where the 'a', 'b', 'c' and 'd' variables are all the same, so here I need only the first 2 rows. I tried this:
proc sort data=al nodupkey dupout=dups;
by a b c d;
run;
Any idea if this works?

In SAS 9.3+ you can do this very easily with the new nouniquekey option.
proc sort data=have nouniquekey out=want;
by a b c d;
run;
That removes any rows which are unique and leaves duplicates.
If you have an earlier version of SAS, you can do something fairly simple after a regular sort.
So, after sorting as in your example above but without nodupkey:
data want;
set have;
by a b c d;
if not (first.d and last.d);
run;
That removes every record that is both the first AND the last record of its a b c d group, i.e. the only record with that key.
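For readers who want to check the first./last. logic outside SAS, here is a minimal Python sketch of the same idea: sort by the key, walk the groups, and drop any group with a single row. The sample rows are the ones from the question; the function name keep_duplicates is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: columns a, b, c, d, e.
rows = [
    (32, "ayi", "dam", "som", "kem"),
    (32, "ayi", "dam", "som", "tws"),
    (32, "ayi", "dam", "tsm", "tws"),
    (12, "mm", "ds", "de", "ko"),
    (12, "mm", "tmm", "to", "ko"),
]

# The key is the first four columns (a, b, c, d).
key = itemgetter(0, 1, 2, 3)


def keep_duplicates(rows):
    """Keep only rows whose (a, b, c, d) key occurs more than once."""
    kept = []
    # groupby needs sorted input, just like BY-group processing in a data step.
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        # A group of one row is "first and last" at the same time: drop it.
        if len(group) > 1:
            kept.extend(group)
    return kept


dups = keep_duplicates(rows)
```

With the sample data, dups holds the two rows that share 32 / ayi / dam / som.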

Another option could be Proc SQL, no pre-sort needed:
proc sql;
create table want as
select * from have
group by a, b, c, d
having count(*)>1; /*This is to tell SAS only to keep those dups*/
quit;

Related

Counting the number of occurrences of unique values using Pig Latin

I am trying to figure out the top 5 most downloaded RStudio packages on December 1, 2019 (from http://cran-logs.rstudio.com/) using Apache Pig Latin. The columns I need are 'r_os' and 'package'. Here is my code:
A = load '2019-12-01.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
B = FOREACH A GENERATE r_os,package;
C = DISTINCT B;
D = GROUP C BY package;
result = FOREACH C GENERATE flatten($0), COUNT($1) as package_distr;
I'm getting the following result, which is wrong:
(magrittr,10)
(htmltools,10)
(httr,10)
(lubridate,10)
(ellipsis,10)
The number of occurrences should be much higher than 10. My desired output should look approximately like:
(magrittr,10000)
(htmltools,9876)
(httr,8700)
(lubridate,5320)
(ellipsis,3000)
Any idea what I'm doing wrong?
How about this?
result = FOREACH D GENERATE group, COUNT(C) as package_distr;
Here group is the package name, and C is the name of the bag that results from grouping C, which we then count.
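To see what the group-then-count step computes, here is a small Python sketch using collections.Counter. The download records are invented for illustration and are not taken from the CRAN log. It counts packages twice, once over all records and once after a DISTINCT-style deduplication of (r_os, package) pairs.

```python
from collections import Counter

# Hypothetical (r_os, package) records; NOT real CRAN log data.
downloads = [
    ("linux", "magrittr"),
    ("linux", "magrittr"),
    ("osx", "magrittr"),
    ("linux", "httr"),
]

# GROUP BY package + COUNT over every record: the number of downloads.
counts_all = Counter(package for _, package in downloads)

# The same count after a DISTINCT on (r_os, package): each pair is
# counted once, so this is the number of distinct r_os values per package.
counts_distinct = Counter(package for _, package in set(downloads))

# Top 5 packages by download count.
top5 = counts_all.most_common(5)
```

One observation about the posted script, not a claim made in the answer above: because DISTINCT runs before the GROUP, the count can never exceed the number of distinct r_os values, which may be why every package shows the same small number.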

How to use the LIKE operator to remove whitespace and similar words at the same time?

Below is my query, which strips some characters from one of the selected columns. It works for me.
SELECT
b.TRANS_DETAILS_ID,
b.DEBIT_AMOUNT,
b.ENTITY_NAME,
REPLACE(b.DESCRIPTION, 'SP Z O.O.', '') AS DESCRIPTION,
DATE_FORMAT(a.TRANSACTION_DATE_TIME, '%d-%m-%Y') AS TRANS_DATE
FROM bank_book_transaction_master a,
bank_book_transaction_details b
WHERE a.TRANSACTION_DATE_TIME
BETWEEN '$from_date_srch' AND '$to_date_srch'
AND DEBIT_CREDIT_FLAG = 2
AND a.ORG_ID = '$org_id'
AND a.BANK_ID = '$bank_id'
AND a.TRANSACTION_ID = b.TRANS_MASTER_ID
My problem is that in my database the string SP Z O.O. is sometimes stored differently. In some cases it's SP Z O.o. or sp z O.O. or SP Z O.O.. My query removes it only when the characters match SP Z O.O. exactly. I want to cover all the variants: lowercase letters, extra whitespace, etc.
I tried LIKE '%SP Z O.O.%', but that only gives me a result of 1 or 0.
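One way to state the matching rule precisely is a case-insensitive regular expression that tolerates variable whitespace. The sketch below does that in Python rather than in MySQL; the function name strip_company_suffix and the sample strings are made up for illustration. MySQL 8.0 and later provides REGEXP_REPLACE, which could apply a similar pattern inside the query, but that is untested against this schema.

```python
import re

# "SP Z O.O." in any letter case, with one or more whitespace characters
# between the words and optional whitespace between "O." and "O.".
SUFFIX = re.compile(r"\bSP\s+Z\s+O\.\s*O\.", re.IGNORECASE)


def strip_company_suffix(description):
    """Remove every variant of 'SP Z O.O.' and tidy the leftover whitespace."""
    cleaned = SUFFIX.sub("", description)
    # Collapse runs of whitespace left behind by the removal.
    return " ".join(cleaned.split())


samples = ["ACME SP Z O.O.", "ACME sp z O.O.", "ACME SP  Z  O.o. LTD"]
cleaned = [strip_company_suffix(s) for s in samples]
```

The word boundary at the start keeps the pattern from firing inside a longer word that happens to end in "SP".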

mySQL, two-dimensional into one-dimensional

I've forgotten whatever I used to know about pivots, but this seems to me the reverse. Suppose I have a set of items A, B, C, D, … and a list of attributes W, X, Y, Z. I have in a spreadsheet something like
  A B C D
W 1 P 3 Q
X 5 R 7 S
Y T 2 U 4
Z D 6 F 7
where the value of attribute W for item B is 'P'. In order to do some statistics on comparisons, I'd like to change it from table to list, i.e.,
W A 1
X A 5
Y A T
Z A D
W B P
X B R
Y C U
Z C F
W D Q
X D S
Y B 2
Z B 6
Etc.
I can easily write a nested loop macro in the spreadsheet to do it, but is there an easy way to import it into mySQL in the desired format? Queries to get the statistics needed are simple in SQL (and formulas not very hard in a spreadsheet) if the data is in the second format.
Since there apparently isn't a "spreadsheet" tag, I used "excel." :-)
There are a lot of questions that looked similar at first glance, but the five I looked at all wanted to discard one of the indices (A-D or W-Z), i.e. creating something like
W 1
W P
X 5
X R
EDITED
You can use PowerQuery to unpivot tables. See the answer by teylyn for the following question. I have Office 365 and didn't need to install the plugin first. The functionality was already available.
Convert matrix to 3-column table ('reverse pivot', 'unpivot', 'flatten', 'normalize')
Another way to unpivot data without using VBA is with PowerQuery, a free add-in for Excel 2010 and higher, available here: http://www.microsoft.com/en-us/download/details.aspx?id=39379
...
Click the column header of the first column to select it. Then, on the Transform ribbon, click the Unpivot Columns drop-down and select Unpivot other columns.
...
OLD ANSWER
If you import the spreadsheet as is, you can run a query to output the correct format.
For a fixed, small number of items, you can use UNION for each of the columns.
SELECT attr, 'A' AS 'item', A AS 'value'
FROM sheet
UNION
SELECT attr, 'B' AS 'item', B AS 'value'
FROM sheet
UNION
SELECT attr, 'C' AS 'item', C AS 'value'
FROM sheet
UNION
SELECT attr, 'D' AS 'item', D AS 'value'
FROM sheet;
Working example: http://sqlfiddle.com/#!9/c274e7/7
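The same unpivot can be tried locally with Python's built-in sqlite3 module standing in for MySQL (an assumption; the SQL dialects differ in places). The table name sheet and column attr follow the query above, and the rows are the example grid from the question. This sketch deliberately swaps UNION for UNION ALL so that no rows are deduplicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet (attr TEXT, A TEXT, B TEXT, C TEXT, D TEXT)")
conn.executemany(
    "INSERT INTO sheet VALUES (?, ?, ?, ?, ?)",
    [
        ("W", "1", "P", "3", "Q"),
        ("X", "5", "R", "7", "S"),
        ("Y", "T", "2", "U", "4"),
        ("Z", "D", "6", "F", "7"),
    ],
)

# One SELECT per item column, glued together with UNION ALL.
# The item names are fixed literals here, not user input.
items = ["A", "B", "C", "D"]
query = " UNION ALL ".join(
    f"SELECT attr, '{item}' AS item, {item} AS value FROM sheet"
    for item in items
)

# Each result row is (attribute, item, value).
rows = conn.execute(query).fetchall()
```

Building the query from a list of column names keeps the approach manageable when there are more than a handful of items.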

Big Query : how to retrieve values in field 1 corresponding to field 2

I am fairly new to SQL and BigQuery.
I have a dataset and I want to retrieve the values in column 2 that correspond to the values in column 1 when they satisfy certain conditions. I want to know how to do that. I am using the BigQuery platform.
Example Dataset D :
Col 1 ; Col 2
A ; 1
B ; 2
C ; 3
D ; 4
E ; 5
Query to retrieve values of col1, col2 such that col2 >2
Expected Output :
C ; 3
D ; 4
E ; 5
As far as I understand,
SELECT col1,col2
FROM [D]
WHERE col2>2
will give col1 and col2 as outputs where col2>2, but the values in col2 may or may not be the ones corresponding to col1.
Am I wrong? If so, please suggest a query to get the necessary output.
If you don't have a row A;5 in your table, it will never appear in your result. A WHERE clause filters whole rows, so col1 and col2 always stay paired. The only time you need to worry about a mismatch is if you're doing a join between one data set of {A, B, C, D, E} and another of {1, 2, 3, 4, 5}. Then every possible combination from A;1, A;2, ... to ..., E;4, E;5 would be output, and filtering on col2 > 2 would produce A;3, B;3, C;3, and so on. But that isn't how the data in your question is set up, so don't worry. If you wonder how a SELECT query will behave, it's usually okay to just run it, unless it will take hours, consume tons of resources, and you have a budget... but it seems more like you're doing homework.
Also, don't ask for homework help on Stack Overflow.
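The difference between filtering one table and filtering a cross join can be checked with a quick local experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption; the syntax is not identical). The tables letters and numbers are hypothetical and exist only to illustrate the join case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The dataset from the question: each row pairs col1 with col2.
conn.execute("CREATE TABLE D (col1 TEXT, col2 INTEGER)")
pairs = [("A", 1), ("B", 2), ("C", 3), ("D", 4), ("E", 5)]
conn.executemany("INSERT INTO D VALUES (?, ?)", pairs)

# Filtering a single table keeps rows intact: col1 stays with its col2.
filtered = conn.execute(
    "SELECT col1, col2 FROM D WHERE col2 > 2 ORDER BY col1"
).fetchall()

# Hypothetical separate tables, to show where mismatches come from.
conn.execute("CREATE TABLE letters (col1 TEXT)")
conn.execute("CREATE TABLE numbers (col2 INTEGER)")
conn.executemany("INSERT INTO letters VALUES (?)", [(p[0],) for p in pairs])
conn.executemany("INSERT INTO numbers VALUES (?)", [(p[1],) for p in pairs])

# A cross join pairs every letter with every number before filtering.
crossed = conn.execute(
    "SELECT l.col1, n.col2 FROM letters l CROSS JOIN numbers n "
    "WHERE n.col2 > 2"
).fetchall()
```

Here filtered contains only the original C, D and E rows, while crossed contains every letter combined with 3, 4 and 5.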

How to select the column names of a table in MySQL based on the values they contain

Hi, I have a table named test. It has 7 columns: id, a, b, c, d, e, f. All of these columns contain either 1 or 0. Now I want to write a query that returns only those columns whose value is 1.
Something like this:
select (condition) from test where id = 5;
The reason is that I have a hotel table with 50 columns, 11 of which contain either 1 or 0 to represent the facilities of the hotel. I want to write a query that just tells me which facilities a hotel has.
Any help would be great.
select id, (a*64)+(b*32)+(c*16)+(d*8)+(e*4)+(f*2)+(g*1)
from test
You can convert this number back into a 7-digit binary code, one digit per column.
Examples:
18 = 0010010, 64 = 1000000
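Decoding the number back into column names can be done in application code. The following is a minimal Python sketch, assuming the seven flag columns a to g and the weights from the query above (a is the highest bit, g the lowest); the function name decode_flags is made up for illustration.

```python
# Flag columns in the same order as the weights in the query:
# a*64, b*32, c*16, d*8, e*4, f*2, g*1.
FLAG_COLUMNS = ["a", "b", "c", "d", "e", "f", "g"]


def decode_flags(value):
    """Return the names of the columns whose bit is set in value."""
    # Zero-padded 7-digit binary string, e.g. 18 -> "0010010".
    bits = format(value, "07b")
    return [name for name, bit in zip(FLAG_COLUMNS, bits) if bit == "1"]


set_for_18 = decode_flags(18)
set_for_64 = decode_flags(64)
```

For 18 (binary 0010010) this yields columns c and f, and for 64 it yields column a alone.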
With SQL you can select rows, NOT columns.
If that is what you want, you can build your query like this:
select id, a, b, c, d -- columns to select
from test -- table
where (a = 1 or b=1 or c = 1 or d = 1) -- these are the conditions