Unique ID when creating Hive table from CSV files - csv

I have a list of CSV files that I want to export as Hive tables, but I'm pretty sure that some records are redundant in the CSVs. Each record (row) in the CSV is identified by a key, and I want to generate the table using that key as the primary key. How can I generate the Hive table so that there are no repeating rows?

ROW_NUMBER() OVER([partition_by_clause] order_by_clause)
returns an ascending sequence of integers, starting with 1.
select x, row_number() over(order by x, property) as row_number, property from int_t;
+----+------------+----------+
| x | row_number | property |
+----+------------+----------+
| 1 | 1 | odd |
| 1 | 2 | square |
| 2 | 3 | even |
| 2 | 4 | prime |
| 3 | 5 | odd |
| 3 | 6 | prime |
| 4 | 7 | even |
| 4 | 8 | square |
| 5 | 9 | odd |
| 5 | 10 | prime |
| 6 | 11 | even |
| 6 | 12 | perfect |
| 7 | 13 | lucky |
| 7 | 14 | lucky |
| 7 | 15 | lucky |
| 7 | 16 | odd |
| 7 | 17 | prime |
| 8 | 18 | even |
| 9 | 19 | odd |
| 9 | 20 | square |
| 10 | 21 | even |
| 10 | 22 | round |
+----+------------+----------+
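To connect this back to the duplicate rows in the question: one common pattern is to partition by the key, number the rows inside each partition, and keep only row number 1. Below is a minimal sketch of that idea, run against an in-memory SQLite database from Python as a stand-in for Hive. The table name staging and the columns record_key and payload are hypothetical, and the exact Hive syntax for loading the result into a table may differ.

```python
import sqlite3

# Hypothetical staging data: (record_key, payload) rows as loaded from the CSVs.
rows = [("k1", "a"), ("k2", "b"), ("k1", "a"), ("k3", "c"), ("k2", "b")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (record_key TEXT, payload TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# Number the rows within each key, then keep only the first row per key.
dedup_sql = """
    SELECT record_key, payload
    FROM (
        SELECT record_key, payload,
               ROW_NUMBER() OVER (PARTITION BY record_key
                                  ORDER BY payload) AS rn
        FROM staging
    ) AS t
    WHERE rn = 1
    ORDER BY record_key
"""
deduped = conn.execute(dedup_sql).fetchall()
print(deduped)
```

The ORDER BY inside the window decides which of the duplicates survives, so if the redundant rows are not identical it should name the column that marks the preferred record.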

Related

Issue concatenating rows with duplicates

I have run into some issues trying to combine rows into one value where duplicates can be found.
Computers with Ids are saved in the Computer table:
| Computer.Id |
|-------------|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
Hard drives are saved in a HardDisk table, with a HardDisk Id unique to the hard drive and a ComputerId linked to the Id in the Computer table:
| Harddisk.ComputerId | Harddisk.Id |
|---------------------|-------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 10 |
The output I am looking to achieve is:
| Harddisk.ComputerId | Harddisk.Id |
|---------------------|-------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6,7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 10 |
The output I'm currently getting is:
| Harddisk.ComputerId | Harddisk.Id |
|---------------------|-------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 8 |
| 8 | 9 |
| 9 | 10 |
Notice how Harddisk 7 which is the disk that shares Computer 6 is gone.
My current query looks like the following, courtesy of scaisEdge:
SELECT *, group_concat(HardDisk.Id)
from Computer
inner join HardDisk on Computer.Id = HardDisk.ComputerId
group by Computer.Id
I hope someone is able to help me out!
You can't use * because it produces a wrong aggregation in MySQL versions before 5.7.
Try using explicit column names in the SELECT:
SELECT computer.ID, group_concat(HardDisk.Id) my_disk
from Computer
inner join HardDisk on Computer.Id = HardDisk.ComputerId
group by Computer.Id
If you need more columns that are not at the same aggregation level, you need a join.
In MySQL versions before 5.7, if a column in the SELECT clause is neither aggregated nor listed in GROUP BY, the query returns the value from an arbitrary row of the group (typically the first one encountered) rather than a correctly aggregated result.
Then read the aliased column in your PHP code:
echo $row['my_disk'];
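For anyone who wants to try the explicit-column version, here is a small sketch using the tables from the question. It runs on an in-memory SQLite database from Python instead of MySQL (SQLite also has group_concat, but it does not guarantee the order of the concatenated values), and the alias computer_id is my own addition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by alias, like $row['my_disk']
conn.execute("CREATE TABLE Computer (Id INTEGER)")
conn.execute("CREATE TABLE HardDisk (Id INTEGER, ComputerId INTEGER)")
conn.executemany("INSERT INTO Computer VALUES (?)", [(i,) for i in range(1, 10)])
# (HardDisk.Id, HardDisk.ComputerId) pairs from the question; computer 6 has two disks.
conn.executemany("INSERT INTO HardDisk VALUES (?, ?)", [
    (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (6, 6), (7, 6), (8, 7), (9, 8), (10, 9),
])

sql = """
    SELECT Computer.Id AS computer_id,
           group_concat(HardDisk.Id) AS my_disk
    FROM Computer
    INNER JOIN HardDisk ON Computer.Id = HardDisk.ComputerId
    GROUP BY Computer.Id
    ORDER BY Computer.Id
"""
disks_by_computer = {row["computer_id"]: row["my_disk"] for row in conn.execute(sql)}
for computer_id, my_disk in disks_by_computer.items():
    print(computer_id, my_disk)
```

Reading the result through the alias avoids picking up the plain HardDisk.Id column, which only ever holds one disk per group.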

How to find all the employees under a manager who is also an employee in MySQL version 5.7.22 without using a CTE and with no predefined manager level?

ManagerId is simply the EmpId of the employee's manager. I need all the EmpIds that come under a given EmpId, including the whole subtree, without using a CTE, because I'm trying to do this with HQL. No hierarchy level is defined.
+-------+-----------+
| EmpId | ManagerId |
+-------+-----------+
| 1 | null |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 1 |
| 6 | 3 |
| 7 | 6 |
| 8 | 6 |
| 9 | null |
| 10 | 3 |
| 11 | 10 |
| 12 | 1 |
| 13 | 12 |
+-------+-----------+
When the given EmpId is 3, the expected response is:
4
6
10
7
8
11
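One way to get the subtree without a CTE is to walk it from the application side, one level per query, feeding each level's EmpIds into the next WHERE ManagerId IN (...). The sketch below does that in Python against an in-memory SQLite database as a stand-in for MySQL. The table name Employee and the helper subordinates are hypothetical, and it costs one round trip per level of the hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpId INTEGER, ManagerId INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [
    (1, None), (2, 1), (3, 2), (4, 3), (5, 1), (6, 3), (7, 6),
    (8, 6), (9, None), (10, 3), (11, 10), (12, 1), (13, 12),
])

def subordinates(conn, emp_id):
    """Collect every EmpId below emp_id, querying one level at a time."""
    found = []
    seen = {emp_id}
    frontier = [emp_id]
    while frontier:
        placeholders = ",".join("?" * len(frontier))
        sql = ("SELECT EmpId FROM Employee "
               f"WHERE ManagerId IN ({placeholders}) ORDER BY EmpId")
        level = [row[0] for row in conn.execute(sql, frontier)]
        # Skip ids already seen so a cyclic table cannot loop forever.
        frontier = [e for e in level if e not in seen]
        seen.update(frontier)
        found.extend(frontier)
    return found

print(subordinates(conn, 3))
```

The ids come back level by level (direct reports first), which happens to match the order of the expected response in the question.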

select non duplicates by column from select query [duplicate]

I've looked all over stackoverflow but without any luck, so here goes nothing.
I have a table populated with information on house positions. I select these positions and calculate the distance between each house coordinate and my desired coordinate, then order the rows by distance ascending, like so:
SELECT id, type, distance FROM (SELECT b.id, b.type, b.x, b.y, b.z,
SQRT(POWER(ABS(1654.5413 - b.x), 2) + POWER(ABS(-2293.7571 - b.y), 2) + POWER(ABS(-1.1996 - b.z), 2)) AS "distance"
FROM businesses b ORDER BY distance ASC) as T;
Example output;
+------+------+------------------------+
| id | type | distance |
+------+------+------------------------+
| 1953 | 2 | 0.00004489639611771451 |
| 2 | 100 | 8.757256937390904 |
| 1959 | 2 | 8.999959765646956 |
| 1960 | 2 | 10.499959765643807 |
| 1961 | 2 | 11.999959765641446 |
| 1962 | 2 | 13.499959765639607 |
| 1963 | 2 | 14.999959765638138 |
| 1964 | 2 | 16.499959765636934 |
| 2055 | 3 | 17.11486010149676 |
| 2054 | 1 | 17.751048488860313 |
| 1965 | 2 | 17.999959765635932 |
| 1966 | 2 | 19.499959765635083 |
| 1967 | 2 | 20.999959765634358 |
| 2056 | 5 | 22.26658275782834 |
| 1968 | 2 | 22.499959765633726 |
| 1969 | 2 | 23.999959765633175 |
| 2057 | 5 | 24.054132659013334 |
| 1970 | 2 | 25.49995976563269 |
| 2058 | 5 | 26.001138245767084 |
| 2061 | 4 | 26.853239370669378 |
| 1971 | 2 | 26.99995976563226 |
| 1972 | 2 | 28.49995976563187 |
| 2060 | 5 | 28.55999771765475 |
| 1973 | 2 | 29.999959765631523 |
| 2059 | 5 | 31.414688663981224 |
| 1974 | 2 | 31.499959765631207 |
| 1 | 100 | 121468.4587678613 |
+------+------+------------------------+
What I want to do with these results is keep only one row per value of the "type" column, namely the nearest one, while keeping the distance ASC order, like so:
+------+------+------------------------+
| id | type | distance |
+------+------+------------------------+
| 1953 | 2 | 0.00004489639611771451 |
| 2 | 100 | 8.757256937390904 |
| 2055 | 3 | 17.11486010149676 |
| 2054 | 1 | 17.751048488860313 |
| 2056 | 5 | 22.26658275782834 |
| 2061 | 4 | 26.853239370669378 |
+------+------+------------------------+
If I attempt "SELECT DISTINCT type", it does not keep the order of the rows and always seems to select the last duplicate of "type" (I think I said that correctly).
How would I go about getting my desired result?
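A possible approach is the usual "row with the minimum value per group" pattern: compute MIN(distance) per type in a subquery, then join back to pick up the matching id. The sketch below tries it on a subset of the example rows, with the distances already computed and stored in a hypothetical table called nearby. It runs on an in-memory SQLite database from Python rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nearby (id INTEGER, type INTEGER, distance REAL)")
# A subset of the example output from the question (id, type, distance).
conn.executemany("INSERT INTO nearby VALUES (?, ?, ?)", [
    (1953, 2, 0.00004489639611771451),
    (2, 100, 8.757256937390904),
    (1959, 2, 8.999959765646956),
    (2055, 3, 17.11486010149676),
    (2054, 1, 17.751048488860313),
    (2056, 5, 22.26658275782834),
    (2057, 5, 24.054132659013334),
    (2061, 4, 26.853239370669378),
    (1, 100, 121468.4587678613),
])

sql = """
    SELECT n.id, n.type, n.distance
    FROM nearby AS n
    JOIN (SELECT type, MIN(distance) AS min_distance
          FROM nearby
          GROUP BY type) AS m
      ON m.type = n.type AND m.min_distance = n.distance
    ORDER BY n.distance ASC
"""
nearest_per_type = conn.execute(sql).fetchall()
for row in nearest_per_type:
    print(row)
```

If two rows of the same type are exactly equally distant, this join returns both of them, so a tie-breaker would be needed in that case.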

Picking out specific values from a group in MySQL

This seems like such a simple problem, but I can't find a good solution. I'm trying to select information from a slightly misformatted table. Basically, wherever sequence=0, the person_id should actually be a company_id. This company_id then applies to all the rows which have the same group_id.
Someone thought it was a good idea to format things this way instead of simply having a company_id column, but it makes trying to select by company very difficult. It would make my programming much easier to simply add this extra column, and fix the formatting.
I want to turn something like this:
+----------+------------+-----------+----------+
| group_id | date | person_id | sequence |
+----------+------------+-----------+----------+
| 1 | 2012-08-31 | 10 | 0 |
| 1 | 2012-08-31 | 11 | 1 |
| 1 | 2012-08-31 | 12 | 2 |
| 2 | 1999-04-16 | 10 | 0 |
| 2 | 1999-04-16 | 21 | 1 |
| 2 | 1999-04-16 | 22 | 2 |
| 2 | 1999-04-16 | 23 | 3 |
| 2 | 1999-04-16 | 24 | 4 |
| 3 | 2001-01-09 | 30 | 0 |
| 3 | 2001-01-09 | 31 | 1 |
| 3 | 2001-01-09 | 11 | 2 |
| 3 | 2001-01-09 | 12 | 3 |
+----------+------------+-----------+----------+
Into this:
+------------+----------+------------+-----------+----------+
| company_id | group_id | date | person_id | sequence |
+------------+----------+------------+-----------+----------+
| 10 | 1 | 2012-08-31 | 11 | 1 |
| 10 | 1 | 2012-08-31 | 12 | 2 |
| 10 | 2 | 1999-04-16 | 21 | 1 |
| 10 | 2 | 1999-04-16 | 22 | 2 |
| 10 | 2 | 1999-04-16 | 23 | 3 |
| 10 | 2 | 1999-04-16 | 24 | 4 |
| 30 | 3 | 2001-01-09 | 31 | 1 |
| 30 | 3 | 2001-01-09 | 11 | 2 |
| 30 | 3 | 2001-01-09 | 12 | 3 |
+------------+----------+------------+-----------+----------+
The only way I can think of how to achieve this is with nested SELECT statements, which are very inefficient considering I have about 100M rows. It's a one time fix though, so I don't mind letting it run overnight.
If you permanently want to change your table to include a company_id column then do this:
First alter the table and add the new column:
alter table your_table add company_id int;
Then update all rows to set company_id to the person_id of the row with sequence = 0 in the same group:
UPDATE your_table a
JOIN your_table b ON a.group_id = b.group_id
SET a.company_id = b.person_id
WHERE b.sequence = 0;
And finally remove the rows with sequence = 0:
DELETE FROM your_table WHERE sequence = 0;
The end result will be:
| group_id | date | person_id | sequence | company_id |
|----------|------------|-----------|----------|------------|
| 1 | 2012-08-31 | 11 | 1 | 10 |
| 1 | 2012-08-31 | 12 | 2 | 10 |
| 2 | 1999-04-16 | 21 | 1 | 10 |
| 2 | 1999-04-16 | 22 | 2 | 10 |
| 2 | 1999-04-16 | 23 | 3 | 10 |
| 2 | 1999-04-16 | 24 | 4 | 10 |
| 3 | 2001-01-09 | 31 | 1 | 30 |
| 3 | 2001-01-09 | 11 | 2 | 30 |
| 3 | 2001-01-09 | 12 | 3 | 30 |
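The three steps can be tried end to end on the sample data. The sketch below does so in Python on an in-memory SQLite database, which is only a stand-in for MySQL: since the UPDATE ... JOIN form is MySQL syntax, the update is written here as a correlated subquery, which should have the same effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE your_table (
    group_id INTEGER, date TEXT, person_id INTEGER, sequence INTEGER)""")
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?)", [
    (1, "2012-08-31", 10, 0), (1, "2012-08-31", 11, 1), (1, "2012-08-31", 12, 2),
    (2, "1999-04-16", 10, 0), (2, "1999-04-16", 21, 1), (2, "1999-04-16", 22, 2),
    (2, "1999-04-16", 23, 3), (2, "1999-04-16", 24, 4),
    (3, "2001-01-09", 30, 0), (3, "2001-01-09", 31, 1), (3, "2001-01-09", 11, 2),
    (3, "2001-01-09", 12, 3),
])

# Step 1: add the new column.
conn.execute("ALTER TABLE your_table ADD COLUMN company_id INTEGER")

# Step 2: copy person_id from the sequence = 0 row of the same group.
conn.execute("""
    UPDATE your_table
    SET company_id = (SELECT b.person_id
                      FROM your_table AS b
                      WHERE b.group_id = your_table.group_id
                        AND b.sequence = 0)
""")

# Step 3: drop the rows that only carried the company id.
conn.execute("DELETE FROM your_table WHERE sequence = 0")

result = conn.execute("""
    SELECT group_id, person_id, sequence, company_id
    FROM your_table
    ORDER BY group_id, sequence
""").fetchall()
for row in result:
    print(row)
```

With about 100M rows, an index on (group_id, sequence) would probably help the update, though that is an assumption and was not measured here.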

MySQL Query for averages

Good morning. I have this table:
mysql> select * from Data;
+---------------------------+--------+-------+
| affyId | exptId | level |
+---------------------------+--------+-------+
| 31315_at | 3 | 250 |
| 31324_at | 3 | 91 |
| 31325_at | 1 | 191 |
| 31325_at | 2 | 101 |
| 31325_at | 4 | 51 |
| 31325_at | 5 | 71 |
| 31325_at | 6 | 31 |
| 31356_at | 3 | 91 |
| 31362_at | 3 | 260 |
| 31510_s_at | 3 | 257 |
| 5321_at | 4 | 90 |
| 5322_at | 4 | 90 |
| 5323_at | 4 | 90 |
| 5324_at | 3 | 57 |
| 5324_at | 4 | 90 |
| 5325_at | 4 | 90 |
| AFFX-BioB-3_at | 3 | 97 |
| AFFX-BioB-5_at | 3 | 20 |
| AFFX-BioB-M_at | 3 | 20 |
| AFFX-BioB-M_at | 5 | 214 |
| AFFX-BioB-M_at | 7 | 20 |
| AFFX-BioB-M_at | 8 | 40 |
| AFFX-BioB-M_at | 9 | 20 |
| AFFX-HSAC07/X00351_M_at | 3 | 86 |
| AFFX-HUMBAPDH/M33197_3_st | 3 | 277 |
| AFFX-HUMTFFR/M11507_at | 3 | 90 |
| AFFX-M27830_3_at | 3 | 271 |
| AFFX-MurIL10_at | 3 | 8 |
| AFFX-MurIL10_at | 5 | 8 |
| AFFX-MurIL10_at | 6 | 4 |
| AFFX-MurIL2_at | 3 | 20 |
| AFFX-MurIL4_at | 5 | 78 |
| AFFX-MurIL4_at | 6 | 20 |
| U95-32123_at | 1 | 128 |
| U95-32123_at | 2 | 128 |
| U98-40474_at | 1 | 57 |
| U98-40474_at | 2 | 57 |
+---------------------------+--------+-------+
37 rows in set (0.00 sec)
If I want the average expression level (level) of each array probe (affyId) across all experiments, I do SELECT affyId, AVG(level) AS average FROM Data GROUP BY affyId;
However, I can't figure out how to get the average expression level of each array probe (affyId) for each experiment. It must be something similar to the last query, but I don't get good results. Any help?
PS: someone told me I should give some reputation or click some green button if somebody solves my question. Is that right? How do I do it? I'm pretty new on this website.
This shows the average for every affyId:
SELECT affyId, AVG(level) AS average FROM Data GROUP BY affyId
This is the average for every exptId:
SELECT exptId, AVG(level) AS average FROM Data GROUP BY exptId
and this the average for every exptId in every affyId:
SELECT affyId, exptId, AVG(level) AS average FROM Data GROUP BY exptId, affyId
Just add exptId to the GROUP BY clause:
SELECT affyId, exptId, AVG(level) AS average
FROM Data
GROUP BY affyId, exptId;
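To see the difference between the two groupings, here is a quick sketch on a few rows from the table in the question, run on an in-memory SQLite database from Python as a stand-in for MySQL. The ORDER BY clause is my own addition so that the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (affyId TEXT, exptId INTEGER, level INTEGER)")
# A subset of the rows shown in the question.
conn.executemany("INSERT INTO Data VALUES (?, ?, ?)", [
    ("5324_at", 3, 57), ("5324_at", 4, 90),
    ("AFFX-MurIL10_at", 3, 8), ("AFFX-MurIL10_at", 5, 8),
    ("AFFX-MurIL10_at", 6, 4),
    ("U95-32123_at", 1, 128), ("U95-32123_at", 2, 128),
])

# One average per probe, across all experiments.
per_probe = dict(conn.execute(
    "SELECT affyId, AVG(level) AS average FROM Data GROUP BY affyId"))

# One average per (probe, experiment) pair.
per_pair = conn.execute("""
    SELECT affyId, exptId, AVG(level) AS average
    FROM Data
    GROUP BY affyId, exptId
    ORDER BY affyId, exptId
""").fetchall()

print(per_probe)
for row in per_pair:
    print(row)
```

In this data every (affyId, exptId) pair occurs only once, so the per-pair average is just the stored level. The second query only differs from a plain SELECT once a probe has several measurements within the same experiment.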