Counting the number of occurrences of unique values using Pig Latin - csv

I am trying to work out the top 5 most downloaded R packages on December 1, 2019, using the RStudio CRAN download logs (from http://cran-logs.rstudio.com/) and Apache Pig Latin. The columns I need are 'r_os' and 'package'. Here is my code:
A = load '2019-12-01.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
B = FOREACH A GENERATE r_os,package;
C = DISTINCT B;
D = GROUP C BY package;
result = FOREACH C GENERATE flatten($0), COUNT($1) as package_distr;
I'm getting the following result, which is wrong:
(magrittr,10)
(htmltools,10)
(httr,10)
(lubridate,10)
(ellipsis,10)
The counts should be far larger than 10. My desired output should look approximately like:
(magrittr,10000)
(htmltools,9876)
(httr,8700)
(lubridate,5320)
(ellipsis,3000)
Any idea what I'm doing wrong?

Shouldn't that last line be
result = FOREACH D GENERATE group, COUNT(C) as package_distr;
? You are generating from C instead of the grouped relation D. Here group is the package name, and C is the name of the bag produced when you grouped C, which is what gets counted.
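Putting it together, a sketch (untested; it assumes, as your script does, that the loader exposes the r_os and package fields, and it adds an ORDER/LIMIT step for the top 5 you are after):
A = load '2019-12-01.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
B = FOREACH A GENERATE r_os, package;
C = DISTINCT B;                          -- one row per distinct (r_os, package) pair
D = GROUP C BY package;
result = FOREACH D GENERATE group, COUNT(C) AS package_distr;
ordered = ORDER result BY package_distr DESC;
top5 = LIMIT ordered 5;                  -- keep only the 5 largest counts
DUMP top5;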

Count or read only actual entries

How can I count or read only the actual entries of a column, as distinct from non-zero entries?
In other words, if I have the file:
4000,1,5221,0
4001,0,5222,1
4002,3,,,
column 4 has 2 actual entries, of which one vanishes (is zero). I can count entries like so:
R = csvread("bugtest.csv");
for i = 1:4
  VanishingColEntries(i) = numel (find (R(:,i) == 0));
  NonVanishingColEntries(i) = nnz(R(:,i));
endfor
VanishingColEntries
NonVanishingColEntries
yielding:
octave:2> nument
VanishingColEntries =
0 1 1 2
NonVanishingColEntries =
3 2 2 1
But I don't know how to extract the number of "actual" entries, that is, the sum of non-zero and explicitly zero entries!
csvread is only for numeric data. If csvread encounters an entry which is not strictly numeric, it checks if the string starts with a number, and uses that as the result (e.g. 1direction, 2pac, 7up will result in 1,2,7 ). 'Empty' entries here are effectively considered to be an empty string, which is parsed as the number 0. However, there are some special strings, like nan and inf which are parsed specially.
If you can / are happy to preprocess your csv file, then you can replace all empty entries with the string nan (without quotes). csvread will then treat this string specially and replace it with an actual nan value in the resulting numerical matrix. You can then use this with isnan to count the number of nan / non-nan entries as follows:
R = csvread( 'bugtest.csv' );
% Count nan / non-nan entries along rows
VanishingColEntries = sum( isnan( R ), 1 )
NonVanishingColEntries = sum( ~isnan( R ), 1 )
If you do not have the luxury of preprocessing your csv file (or you simply want to process it programmatically throughout, without the need for human intervention), then you can use the csv2cell function from the io package instead, and process the resulting cell to get what you want, e.g.
pkg load io
C = csv2cell( 'bugtest.csv' )
% Convert cells with empty strings to nan
for i = 1 : numel(C), if ischar(C{i}), C{i} = nan; endif, endfor
% Convert numeric cell array (nan is a valid number) to a matrix
R = cell2mat( C );
You can then use isnan as before to get your result.
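Once R is in that form (empties as nan, explicit zeros as 0), the counts you describe fall out directly. A small sketch reusing your variable names (ActualColEntries is a new name for the figure you are ultimately after):
% Entries that are actually present, whether zero or not
ActualColEntries = sum( ~isnan( R ), 1 )
% Of those, the explicitly zero ("vanishing") ones
VanishingColEntries = sum( R == 0, 1 )
% And the non-zero ones
NonVanishingColEntries = sum( R ~= 0 & ~isnan( R ), 1 )
For the sample file this gives ActualColEntries = 3 3 2 2, which is the sum of non-zero and explicitly zero entries you were asking for.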

MySQL query seems to give different results when wrapped in perl DBI

The query below works fine when I try it in the console.
mysql> SELECT COUNT(l.ID), a.MAX_PER_PRD, a.PLURAL, d.TIME_DENOM FROM logro l, challenge c, lib_accomp_type a, lib_deporte d WHERE l.PERIOD=3 AND l.GAME_ID=2 AND l.PLR_ID=3 AND l.ACC_TYPE_ID=11 AND a.sport=d.ID AND c.ACC_TYPE_ID=a.ID AND l.ACC_TYPE_ID=c.ACC_TYPE_ID;
+-------------+-------------+---------------------+------------+
| COUNT(l.ID) | MAX_PER_PRD | PLURAL              | TIME_DENOM |
+-------------+-------------+---------------------+------------+
|           0 |           3 | general commodities | quarter    |
+-------------+-------------+---------------------+------------+
1 row in set (0.01 sec)
However, when I wrap it in a Perl DBI statement handle and fetch it with $sth->fetchrow_array, the second value is undefined.
my $q = "SELECT COUNT(l.ID), a.MAX_PER_PRD, a.PLURAL, d.TIME_DENOM
FROM logro l, challenge c, lib_accomp_type a, lib_deporte d
WHERE l.PERIOD=?
AND l.GAME_ID=?
AND l.PLR_ID=?
AND l.ACC_TYPE_ID=?
AND a.sport=d.ID
AND c.ACC_TYPE_ID=a.ID
AND l.ACC_TYPE_ID=c.ACC_TYPE_ID";
my $sth = $dbh->prepare($q);
$sth->execute(3, 2, 3, 11);
my ($CNT, $MAX, $ANAMEP, $TD) = $sth->fetchrow_array;
print "COUNT: ", $CNT;
print "MAX: ", $MAX;
$ perl test_sql2.pl
Use of uninitialized value $MAX in print at test_sql2.pl line 29.
COUNT: 0MAX:
Any idea as to what I could be doing wrong?
Depending on your MySQL client/library version, MySQL handles this situation differently.
For MySQL <= 5.6, see Group By Handling in the 5.6 manual
For MySQL >= 5.7, see Group By Handling in the 5.7 manual
Your query has a count of 0, yet there are values returned for the other columns. That doesn't seem to make any sense. It seems that running through Perl is actually doing the logical thing, and MySQL is just populating the MAX_PER_PRD, PLURAL and TIME_DENOM columns with arbitrary values.
The main issue here is that you are referencing non-aggregated columns without them being part of a GROUP BY clause.
Perhaps if you include a sample data set, it could help us get to the result you are looking for.
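To illustrate only (a sketch, since without sample data I am guessing at the grouping you actually want), an explicit GROUP BY version of the query would look something like this:
SELECT COUNT(l.ID), a.MAX_PER_PRD, a.PLURAL, d.TIME_DENOM
FROM logro l
JOIN challenge c        ON l.ACC_TYPE_ID = c.ACC_TYPE_ID
JOIN lib_accomp_type a  ON c.ACC_TYPE_ID = a.ID
JOIN lib_deporte d      ON a.sport = d.ID
WHERE l.PERIOD = ? AND l.GAME_ID = ? AND l.PLR_ID = ? AND l.ACC_TYPE_ID = ?
GROUP BY a.MAX_PER_PRD, a.PLURAL, d.TIME_DENOM;
Note that with an explicit GROUP BY, a query that matches no rows returns no rows at all instead of a single row with COUNT = 0 and NULLs, so the Perl side should check whether fetchrow_array actually returned anything.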

Duplicates issue

I have a problem with duplicates.
What I actually need is to see only the duplicates, but my table has many variables; it looks something like this:
a b c d e
32 ayi dam som kem
32 ayi dam som tws
32 ayi dam tsm tws
12 mm ds de ko
12 mm tmm to ko
I am trying to keep rows where the 'a', 'b', 'c' and 'd' variables are the same, so from the data above I need only the first 2 rows. I tried this:
proc sort data=al nodupkey dupout=dups;
by a b c d;
run;
any idea if this works?
In SAS 9.3+ you can do this very easily with the new nouniquekey option.
proc sort data=have nouniquekey out=want;
by a b c d;
run;
That removes any rows which are unique and leaves duplicates.
If you have an earlier version of SAS, you can do something fairly simple after a regular sort.
So, after sorting as in your example above but without nodupkey:
data want;
set have;
by a b c d;
if not (first.d and last.d);
run;
That removes records that are the first AND last record based on a b c d.
Another option could be Proc SQL, no pre-sort needed:
proc sql;
create table want as
select * from have
group by a, b, c, d
having count(*)>1; /*This is to tell SAS only to keep those dups*/
quit;
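If you want to try these approaches out, the sample table from the question can be recreated with a small DATA step (a sketch; have is the dataset name used in the answers above):
data have;
   input a b $ c $ d $ e $;
   datalines;
32 ayi dam som kem
32 ayi dam som tws
32 ayi dam tsm tws
12 mm ds de ko
12 mm tmm to ko
;
run;
On this data all three approaches keep only the first two rows (32 ayi dam som ...), the one group where a, b, c and d all repeat.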

Selecting from multiple tables with multiple criteria

I have a MySQL database with three tables: sample, method, compound.
sample has the following columns: id(PK)(int), date(date), compound_id(int), location(varchar), method(int), value(float)
method has the following columns: id(PK)(int), label(varchar)
And compound has: id(PK)(int), name(varchar), unit(varchar)
I am trying to generate a SQL command that only pulls in the unique row for the following criteria:
Date (sample.date)
Compound Name (compound.name)
Location (sample.location)
Method (sample.method)
However, I want to substitute in the labels for some of the sample columns instead of the numbers:
sample.compound_id is matched to compound.id which has a corresponding compound.name and compound.unit
The first SQL command I tried to query was:
SELECT sample.id, sample.date, compound.name, sample.location, method.label, sample.value, compound.unit
FROM sample, compound, method
WHERE sample.date = "2011-11-03"
AND compound.name = "Zinc (Dissolved)"
AND sample.location = "13.0"
AND method.id = 1;
The output from the above command:
id date name location label value unit
1 2011-11-03 Zinc (Dissolved) 13.0 (1) Indivi... 378.261 μg/L
5 2011-11-03 Zinc (Dissolved) 13.0 (1) Indivi... 197.917 μg/L
9 2011-11-03 Zinc (Dissolved) 13.0 (1) Indivi... 92.4051 μg/L
But when I look at sample and compare sample.id to what was returned:
id date compound_id location method value
1 2011-11-03 13 13.0 1 378.261
5 2011-11-03 14 13.0 1 197.917
9 2011-11-03 47 13.0 1 92.4051
Here sample.compound_id 47 corresponds to compound.id 47 and compound.name "Zinc (Dissolved)", while compound IDs #13 and #14 are "Copper (Dissolved)" and "Copper (Total)", respectively.
So it seems to be returning rows that meet the criteria for sample.date and sample.location without regard to compound.name. Given the above criteria, I know that my database should only return one row, but instead I get some sample.id rows that have a completely different sample.compound_id than the matching compound.name that I specified.
I would like the columns SELECTed in the first line to end up in the same order as I wrote them. This code is for a little database viewer/reporter program I'm writing in Python/Tkinter and relies on the columns being uniform. The code that I use to initialize the data for the program works as I expect:
SELECT sample.id, sample.date, compound.name, sample.location, method.label, sample.value, compound.unit
FROM sample, compound, method
WHERE sample.compound_id = compound.id
AND sample.method = method.id;
Which puts out each unique line in sample with the substitutions for sample.compound_id to compound.name and sample.method to method.label and adds in the compound.unit at the end.
Question #1: How do I need to restructure my query so that it only returns the row that meets that specific criteria?
Question #2: Eventually I'm going to need to specify multiple sample.locations at one time. Is that as simple as adding an OR statement for each individual location that I need?
SELECT sample.id, sample.date, compound.name, sample.location, method.label, sample.value, compound.unit
FROM sample
INNER JOIN compound ON compound.id = sample.compound_id
INNER JOIN method ON method.id = sample.method
WHERE sample.date = '2011-11-03'
AND compound.name = 'Zinc (Dissolved)'
AND sample.location = '13.0'
AND method.id = 1;
Now that I have the first question figured out, I figured out my second question:
SELECT sample.id, sample.date, compound.name, sample.location, method.label, sample.value, compound.unit
FROM sample
INNER JOIN compound ON compound.id = sample.compound_id
INNER JOIN method ON method.id = sample.method
WHERE sample.date = '2011-11-03'
AND compound.name = 'Zinc (Dissolved)'
AND sample.location IN ('13.0', '22.0')
AND method.id = 1;
And just keep adding locations inside the parentheses, separated by commas, for each additional location.
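Since the viewer itself is written in Python, here is a rough sketch of my own (not from the original post) of building that IN (...) list with DB-API placeholders for however many locations the user picks; it assumes a MySQL driver such as MySQLdb or mysql.connector, where the placeholder is %s:
# Build the IN (...) clause from however many locations were selected
locations = ["13.0", "22.0"]                      # e.g. chosen in the Tkinter UI
placeholders = ", ".join(["%s"] * len(locations))

query = (
    "SELECT sample.id, sample.date, compound.name, sample.location, "
    "method.label, sample.value, compound.unit "
    "FROM sample "
    "INNER JOIN compound ON compound.id = sample.compound_id "
    "INNER JOIN method ON method.id = sample.method "
    "WHERE sample.date = %s "
    "AND compound.name = %s "
    "AND sample.location IN (" + placeholders + ") "
    "AND method.id = %s"
)
params = ["2011-11-03", "Zinc (Dissolved)", *locations, 1]
# cursor.execute(query, params)   # cursor comes from your database connection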

Access partition function: Is there a way to make it show bin categories that don't have a count?

I'm trying to use the Access Partition function to generate the bins for a histogram chart showing the frequency distribution of my % utilization data set. However, the Partition function only shows bin ranges (e.g. 0: 9, 10: 19, etc.) for the categories that actually have a count. I would like it to show all bins up to 100.
Example:
Using this function:
% Utilization: Partition([Max],0,100,10)
The Full SQL is:
SELECT Count([qry].[Max]) AS Actuals, Partition([Max],0,100,10) AS [% Utilization]
FROM [qry]
GROUP BY Partition([Max],0,100,10);
gives me:
Actuals | % Utilization
4 | 0: 9
4 | 10: 19
4 | 20: 29
but I want it to show 0s for the ranges that don't have values up to 90:99. Can this be done?
Thanks in Advance
The only way I can think of doing this is with an additional Bins table that contains all the bins you wish to illustrate:
SELECT Bins.[% Utilization], t.Actuals FROM Bins
LEFT JOIN
(SELECT Count(max) AS Actuals,
Partition([max],0,100,10) AS [% Utilization]
FROM qry
GROUP BY Partition([max],0,100,10)) t
ON t.[% Utilization]=bins.[% Utilization]
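One way to populate that Bins table (a sketch; BinStarts is an assumed helper table with a single numeric column Val holding 0, 10, 20, ..., 90) is to let Partition() generate the labels itself, so they match the strings in the grouped subquery exactly, padding included:
SELECT Partition(Val, 0, 100, 10) AS [% Utilization]
INTO Bins
FROM BinStarts;
With Bins filled that way the LEFT JOIN lines up, and wrapping the count as Nz(t.Actuals, 0) in the outer SELECT will display 0 rather than Null for the empty bins.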