I am dealing with the post-processing of CSV log files arranged in a multi-column format. Usually the first column corresponds to the line number (ID), the second one contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of this ID (which is always negative):
ID, POP, dG
1, 7, -6.9700
2, 2, -6.9500
3, 2, -6.8500
4, 6, -6.7200
5, 14, -6.7100
6, 5, -6.7000
7, 10, -6.5600
8, 10, -6.4800
9, 7, -6.4500
10, 3, -6.4400
11, 8, -6.4300
12, 10, -6.4200
13, 3, -6.3300
14, 7, -6.2200
15, 1, -6.2000
16, 3, -6.2000
17, 4, -6.1700
18, 1, -6.0500
19, 9, -6.0200
20, 1, -6.0100
21, 1, -6.0000
22, 3, -5.9900
23, 4, -5.9800
24, 3, -5.9200
25, 2, -5.9100
26, 1, -5.8900
27, 1, -5.8500
28, 1, -5.8200
29, 1, -5.7900
30, 8, -5.7800
31, 1, -5.7800
32, 1, -5.7200
33, 3, -5.7100
34, 2, -5.7100
35, 1, -5.6900
36, 4, -5.6800
37, 2, -5.6500
38, 4, -5.6100
39, 1, -5.5900
40, 1, -5.5600
41, 1, -5.5500
42, 2, -5.5500
43, 1, -5.5200
44, 1, -5.5100
45, 2, -5.5000
46, 1, -5.5000
47, 3, -5.4700
48, 2, -5.4500
49, 1, -5.4500
50, 4, -5.4300
51, 1, -5.4300
52, 1, -5.3800
53, 2, -5.3800
54, 1, -5.3500
55, 1, -5.2800
56, 1, -5.2500
57, 2, -5.2500
58, 2, -5.2400
59, 2, -5.2300
60, 1, -5.1400
61, 1, -5.1100
62, 1, -5.1000
63, 2, -5.0300
64, 2, -5.0100
65, 2, -5.0100
66, 1, -4.9700
67, 1, -4.9200
68, 1, -4.9000
69, 2, -4.9000
70, 1, -4.8900
71, 1, -4.8600
72, 3, -4.7900
73, 2, -4.7900
74, 1, -4.7900
75, 1, -4.7700
76, 2, -4.7600
77, 1, -4.7500
78, 1, -4.7400
79, 1, -4.7300
80, 1, -4.7200
81, 2, -4.7100
82, 1, -4.6800
83, 2, -4.6300
84, 1, -4.5500
85, 1, -4.5000
86, 1, -4.4800
87, 2, -4.4500
88, 1, -4.4300
89, 1, -4.3900
90, 1, -4.3000
91, 1, -4.2500
92, 1, -4.2300
93, 1, -4.2200
94, 2, -4.1600
95, 1, -4.1500
96, 1, -4.1100
97, 1, -4.0300
98, 1, -4.0100
I need to reduce the total number of these lines, keeping in the output CSV only the first N lines, i.e. from the first line up to the line with the biggest population (POP, the value of the second column) observed in the whole dataset. So in my example the expected output should be the first 5 lines, since the 5th ID has the biggest value of the second column (POP) compared to all 98 lines:
ID, POP, dG
1, 7, -6.9700
2, 2, -6.9500
3, 2, -6.8500
4, 6, -6.7200
5, 14, -6.7100
Could you suggest an AWK solution that would accept my CSV file and produce a new one after such filtering based on the values in the second column?
You could try this awk command:
awk -F "," 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0} a > $2 && NR > 1 {arr[i]=$0; i++}' input
See demo at: https://awk.js.org/?gist=c8751cc25e444fb2e2b1a8f29849f127
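For readability, the same one-liner can be written out with comments; this is just a restatement of the command above, not a different algorithm (note that a line whose POP equals the current running maximum is neither printed nor buffered, which does not matter for the sample data):
awk -F "," '
a < $2 {                            # column 2 holds a new maximum
    for (idx = 0; idx < i; idx++)   # flush the lines buffered since the previous maximum
        print arr[idx]
    print $0                        # print the new-maximum line itself
    a = int($2)                     # remember the maximum seen so far
    i = 0                           # reset the buffer
}
a > $2 && NR > 1 {                  # smaller POP: buffer the line in case a bigger maximum follows
    arr[i] = $0
    i++
}' input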
This approach processes the file twice: once to find the max, and again to print the lines up to the max. I've incorporated your request to print a minimum number of lines.
awk -F ', ' -v min_lines=5 '
NR == FNR {                 # first pass: find the largest POP
    if ($2 > max) max = $2
    next
}
{print}                     # second pass: print every line
$2 == max {                 # the line with the maximum POP has just been printed
    for (i = FNR; i <= min_lines; i++) {   # top up to min_lines data lines if the max came early
        getline
        print
    }
    exit
}
' file.csv file.csv
$ awk -F, -v minlines=5 'NR==FNR { if($2>=max && NR>1) {max=$2; maxi=NR} next }
FNR<=minlines+1 || FNR<=maxi' file{,}
ID, POP, dG
1, 7, -6.9700
2, 2, -6.9500
3, 2, -6.8500
4, 6, -6.7200
5, 14, -6.7100
This will print until the last occurrence of the max value. If you want the first occurrence instead, change $2>=max to $2>max.
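For reference, the first-occurrence variant described above would read:
$ awk -F, -v minlines=5 'NR==FNR { if($2>max && NR>1) {max=$2; maxi=NR} next }
FNR<=minlines+1 || FNR<=maxi' file{,}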
I am dealing with the post-processing of a multi-column CSV arranged in a fixed format: the first column corresponds to the line number (ID), the second one contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of this ID (always negative):
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600
In order to reduce the number of lines, I filtered this CSV with the aim of finding the line with the highest number in the second column (POP), using the following AWK expression:
# search the CSV for the line with the highest POP and save all lines before it, while keeping a minimal number of lines (3) in case this line is found at the beginning of the CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv
For the simple case, when the line with the maximum POP is located on the first line, the script will save this line (POP max) plus the 2 lines after it (min_lines=3).
For the more complicated case, where the line with the maximum POP is located in the middle of the CSV, the script detects this line plus all the preceding lines from the beginning of the CSV and lists them in the new CSV, keeping the original order. However, in that case output.csv contains too many lines, since the line with the highest POP is located on the 26th line:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
In order to reduce the total number of lines to 3-5 in the output CSV, how would it be possible to customize my filter so that it saves only the lines close to the maximum (e.g. the values in the POP column should satisfy POP > 0.5 * max(POP)), comparing each line against the line with the biggest value in the POP column? Finally, I always need to keep the first line as well as the line with the maximal value in the output. So the AWK solution should filter the multi-line CSV in the following manner (please ignore the comments after #):
ID, POP, dG
1, 7, -9.6000
9, 16, -7.8100
26, 20, -6.6500 # this is POP max detected over all lines
This two-phase awk should work for you:
awk -F ', ' -v n=2 'NR == 1 {next}
FNR==NR { if (max < $2) {max=$2; if (FNR==n) n++} next}
FNR <= n || $2 > (.5 * max)' file file
ID, POP, dG
1, 7, -9.6000
9, 16, -7.8100
26, 20, -6.6500
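If you want to experiment with the cutoff, the 0.5 factor can be passed in as a variable; this is the same command as above with the threshold parameterised, nothing else changed:
awk -F ', ' -v n=2 -v frac=0.5 'NR == 1 {next}
FNR==NR { if (max < $2) {max=$2; if (FNR==n) n++} next}
FNR <= n || $2 > (frac * max)' file file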
Bring back the count of all the batsmen who have played a certain number of innings, but only if more than one batsman has played that number of innings: 4 innings – 2 batsmen, 5 innings – 3 batsmen, etc.
SELECT COUNT(Innings) AS TotalInnings, COUNT(player) AS Totalplayer
FROM Batting
GROUP BY Player
Player is the field name for the batsman's name. I am kind of stuck on how to write the statement for this last part:
4 innings – 2 batsmen, 5 innings – 3 batsmen etc
This is the table design:
CREATE TABLE Batting (
Player CHAR(25),
CarrerSpan VARCHAR(20),
Matches INT,
Innings INT,
Playing INT,
Runs INT,
HighestScore CHAR(10),
AverageScore NUMERIC,
BallsFaced INT,
StrikeRate NUMERIC,
Hundreds INT,
Fifties INT,
Zeros INT,
Fours INT,
Sixs INT
);
INSERT INTO Batting
VALUES ('JP Duminy', '2007-2018', 77, 71, 22, 1825, '96*', 37.24, 1468, 124.31, 0, 11, 6, 130, 65);
INSERT INTO Batting
VALUES ('AB de Villiers', '2006-2017', 78, 75, 11, 1672, '79*', 26.12, 1237, 135.16, 0, 10, 5, 140, 60);
INSERT INTO Batting
VALUES ('HM Amla', '2009-2018', 41, 41, 5, 1158, '97*', 32.16, 883, 131.14, 0, 7, 2, 133, 23);
INSERT INTO Batting
VALUES ('F du Plessis', '2012-2017', 36, 36, 6, 1129, 119, 37.63, 849, 132.97, 1, 7, 0, 102, 35);
INSERT INTO Batting
VALUES('DA Miller', '2010-2018', 58, 51, 15, 1043, '101*', 28.97, 745, 140.00, 1, 1, 0, 71, 46);
INSERT INTO Batting
VALUES('GC Smith', '2005-2011', 33, 33, 2, 982, '89*', 31.67, 770, 127.53, 0, 5, 1, 123, 26);
INSERT INTO Batting
VALUES('Q de Kock','2012-2018', 32, 32, 4, 821, 59, 29.32, 637, 128.88, 0, 2, 3, 97, 24);
INSERT INTO Batting
VALUES('JH Kallis', '2005-2012', 25, 23, 4, 666, 73, 35.05, 558, 119.35, 0, 5, 0, 56, 20);
INSERT INTO Batting
VALUES('JA Morkel', '2005-2015', 50, 38, 11, 572, 43, 21.18, 402, 142.28, 0, 0, 1, 29, 39);
INSERT INTO Batting
VALUES('F Behardien', '2012-2018', 37, 29, 13, 515, '64*', 32.18, 402, 128.10, 0, 1, 1, 37, 16);
INSERT INTO Batting
VALUES('HH Gibbs', '2005-2010', 23, 23, 1, 400, '90*', 18.18, 318, 125.78, 0, 3, 4, 45, 12);
INSERT INTO Batting
VALUES('RR Rossouw', '2014-2016', 15, 14, 3, 327, 78, 29.72, 237, 137.97, 0, 2, 2, 29, 12);
INSERT INTO Batting
VALUES('LE Bosman', '2006-2010', 14, 14, 1, 323, 94, 24.84, 219, 147.48, 0, 3, 2, 27, 20);
INSERT INTO Batting
VALUES('RR Hendricks', '2014-2018', 13, 13, 0, 294, 70, 22.61, 262, 112.21, 0, 1, 2, 33, 3);
INSERT INTO Batting
VALUES('MV Boucher', '2005-2010', 25, 21, 6, 268, '36*', 17.86, 275, 97.45, 0, 0, 0, 22, 2);
INSERT INTO Batting
VALUES('RE Levi', '2012-2012', 13 ,13, 2, 236, '117*', 21.45, 167, 141.31, 1, 1, 3, 20, 15);
INSERT INTO Batting
VALUES('MN van Wyk', '2007-2015', 8, 7, 1, 225, '114*', 37.50, 157, 143.31, 1, 1, 0, 19, 14);
INSERT INTO Batting
VALUES('CA Ingram', '2010-2012', 9, 9, 1, 210, 78, 26.25, 162, 129.62, 0, 1, 1, 23, 7);
INSERT INTO Batting
VALUES('JM Kemp', '2005-2007', 8, 7, 3, 203, '89*', 50.75, 160, 126.87, 0, 1, 0, 17, 10);
INSERT INTO Batting
VALUES('J Botha', '2006-2012', 40, 0, 9, 201, 34, 18.27, 165, 121.81, 0, 0, 1, 15, 9);
INSERT INTO Batting
VALUES('H Davids', '2012-2013', 9, 9, 0, 161, 68, 17.88, 134, 120.14, 0, 2, 2, 18, 4);
INSERT INTO Batting
VALUES('JL Ontong', '2008-2015', 14, 10, 0, 158, 48, 15.80, 109, 144.95, 0, 0, 1, 6, 11);
INSERT INTO Batting
VALUES('JT Smuts', '2017-2018', 8, 8, 0, 126, 45, 15.75, 114, 110.52, 0, 0, 1, 14, 5);
INSERT INTO Batting
VALUES('RJ Peterson', '2006-2014', 21, 12, 4, 124, 34, 15.50, 113, 109.73, 0, 0, 1, 13, 2);
INSERT INTO Batting
VALUES('WD Parnell', '2009-2017', 40, 13, 9, 114, '29*', 28.50, 96, 118.75, 0, 0, 0, 10, 3);
INSERT INTO Batting
VALUES( 'H Klaasen', '2018-2018', 4, 4, 0, 110, '69', 27.50, 64, 171.87, 0, 1, 0, 5, 9);
INSERT INTO Batting
VALUES( 'M Mosehle', '2017-2017', 7, 6, 1, 105, '36', 21.00, 65, 161.53, 0, 0, 0, 6, 9);
INSERT INTO Batting
VALUES( 'D Wiese', '2013-2016', 20, 11, 4, 92, '28', 13.14, 75, 122.66, 0, 0, 1, 4, 3);
INSERT INTO Batting
VALUES( 'SM Pollock', '2005-2008', 12, 9, 2, 86, '36*', 12.28, 70, 122.85, 0, 0, 3, 4, 4);
INSERT INTO Batting
VALUES( 'CH Morris', '2012-2018', 17, 10, 3, 77, '17*', 11.00, 70, 110.00, 0, 0, 2, 7, 2);
INSERT INTO Batting
VALUES( 'RE van der Merwe', '2009-2010', 13, 6, 3, 57, '48', 19.00, 50, 114.00, 0, 0, 1, 2, 4);
INSERT INTO Batting
VALUES( 'C Jonker', '2018-2018', 1, 1, 0, 49, '49', 49.00, 24, 204.16, 0, 0, 0, 5, 2);
INSERT INTO Batting
VALUES( 'HG Kuhn', '2009-2017', 7, 6, 2, 49, '29', 12.25, 42, 116.66, 0, 0, 0, 3, 2);
INSERT INTO Batting
VALUES( 'JJ van der Wath', '2006-2007', 8, 4, 1, 46, '21', 15.33, 39, 117.94, 0, 0, 0, 3, 1);
I think this could be the solution to your problem:
Query1
SELECT Innings,
COUNT(player) AS Totalplayer
FROM Batting
GROUP BY Innings
HAVING COUNT(player) > 1;
OR
Query2
SELECT Innings,
COUNT(player) AS Totalplayer
FROM Batting
GROUP BY Innings
HAVING COUNT(Innings) > 1;
Sample Output:

Innings  Totalplayer
4        2
6        3
7        2
9        3
10       2
13       3
14       2
23       2
Well, you can count the number of rows for which the Innings column has a certain value, save this number in a temporary table, and then query that table. See the following SQL:
CREATE TEMPORARY TABLE BatsmenCount (SELECT count(*) as total FROM Batting WHERE Innings=71);
SELECT * FROM BatsmenCount WHERE total > 1;
This query will return the number of batsmen that have played 71 innings and save the result in a temporary table. Next, this number is fetched with a SELECT query. If the query returns no results, then the number of batsmen that have played 71 innings is 1 or less. If the query returns a result, then that result is the number of batsmen that have played 71 innings.
See the SQL code on dbfiddle.
I'm having trouble with a JOIN and a GROUP_CONCAT. The query is concatenating additional data that should not be associated with the join.
Here's my table structure:
linkages
ID table_name tag_id
1 subcategories 6
2 categories 9
music
ID artwork
1 5
2 4
artwork
ID url_path
1 /some/file/path
2 /some/file/path
And here's my query:
SELECT music.*,
artwork.url_path AS artwork_url_path,
GROUP_CONCAT( linkages.tag_id ) AS tag_ids,
GROUP_CONCAT( linkages.table_name ) AS table_name
FROM music
LEFT JOIN artwork ON artwork.id = music.artwork
LEFT JOIN linkages ON music.id = linkages.track_id
WHERE music.id IN( '1356',
'1357',
'719',
'169',
'170',
'171',
'805' )
ORDER BY FIELD( music.id,
1356,
1357,
719,
169,
170,
171,
805 )
This is the result of the GROUP_CONCAT:
[tag_ids] => 3, 6, 9, 17, 19, 20, 26, 49, 63, 64, 53, 57, 63, 65, 67, 73, 79, 80, 85, 96, 98, 11, 53, 67, 3, 6, 15, 17, 26, 38, 50, 63, 74, 53, 56, 57, 62, 63, 65, 66, 67, 72, 85, 88, 98, 24, 69, 71, 3, 6, 15, 17, 26, 38, 50
The first portion of the result is correct:
[tag_ids] => 3, 6, 9, 17, 19, 20, 26, 49, 63, 64, 53, 57, 63, 65, 67, 73, 79, 80, 85, 96, 98, 11, 53, 67
Everything after the correct values seems random, and most of the values aren't associated with this record in the database, but it's still pulling them in. It seems to repeat a portion of the correct result (3, 6, 15, 17 - the 3, 6 and 17 are correct, but 15 shouldn't be there; similar with a bunch of other numbers - 71, etc.). I can't use DISTINCT because I need to match up the tag_ids and table_name results as a multidimensional array from the results.
Any thoughts as to why?
UPDATE:
I ended up solving it with the initial push from Gordon. It needed a GROUP BY clause; otherwise it was putting every result's tag IDs into each row. The final query ended up becoming this:
SET SESSION group_concat_max_len = 1000000;
SELECT
music.*,
artwork.url_path as artwork_url_path,
GROUP_CONCAT(linkages.tag_id, ':', linkages.table_name) as tags
FROM music
LEFT JOIN artwork ON artwork.id = music.artwork
LEFT JOIN linkages ON music.id = linkages.track_id
WHERE music.id IN('1356', '1357', '719', '169', '170', '171', '805')
GROUP BY music.id
ORDER BY FIELD(music.id,1356,1357,719,169,170,171,805);
Your join is generating duplicate rows. I would suggest that you fix the root cause of the problem. But a quick-and-dirty solution is to use group_concat(distinct):
GROUP_CONCAT(DISTINCT linkages.tag_id) as tag_ids,
GROUP_CONCAT(DISTINCT linkages.table_name) as table_name
You can put the columns in a single field using GROUP_CONCAT():
GROUP_CONCAT(DISTINCT linkages.tag_id, ':', linkages.table_name) as tags
This is probably something very simple, so forgive my blonde moment :)
I have a table 'album'
* albumId
* albumOwnerId (who created)
* albumCSD (create stamp date)
Now what I am trying to do is select the top 10 most recently updated albums. But I don't want 10 albums from the same person coming back - I only want one album per unique person, i.e. 10 albums from 10 different people.
So, this is what I have below, but it is not working properly and I just can't figure out why. Any ideas?
Thanks
SELECT DISTINCT(albumOwnerId), albumId
FROM album
ORDER BY albumCSD DESC
LIMIT 0,10
Here is some example data, followed by what I am trying to get. Hope this makes it clearer.
DATA:
albumOwnerID, albumId, albumCSD
18, 194, '2010-10-23 11:02:30'
23, 193, '2010-10-22 11:39:59'
22, 192, '2010-10-12 21:48:16'
21, 181, '2010-10-12 20:34:11'
21, 178, '2010-10-12 20:20:16'
19, 168, '2010-10-12 18:31:55'
18, 167, '2010-10-11 21:06:55'
20, 166, '2010-10-11 21:01:47'
18, 165, '2010-10-11 21:00:32'
20, 164, '2010-10-11 20:50:06'
17, 145, '2010-10-10 18:54:24'
17, 144, '2010-10-10 18:49:28'
17, 143, '2010-10-10 18:48:08'
17, 142, '2010-10-10 18:46:54'
16, 130, '2010-10-10 16:17:57'
16, 129, '2010-10-10 16:17:26'
16, 128, '2010-10-10 16:07:21'
15, 119, '2010-10-10 15:24:28'
15, 118, '2010-10-10 15:24:11'
14, 100, '2010-10-09 18:22:49'
14, 99, '2010-10-09 18:18:46'
11, 98, '2010-10-09 15:50:13'
11, 97, '2010-10-09 15:44:09'
11, 96, '2010-10-09 15:42:28'
11, 95, '2010-10-09 15:37:25'
DESIRED DATA:
18, 194, '2010-10-23 11:02:30'
23, 193, '2010-10-22 11:39:59'
22, 192, '2010-10-12 21:48:16'
21, 181, '2010-10-12 20:34:11'
19, 168, '2010-10-12 18:31:55'
17, 145, '2010-10-10 18:54:24'
16, 130, '2010-10-10 16:17:57'
15, 119, '2010-10-10 15:24:28'
14, 100, '2010-10-09 18:22:49'
11, 98, '2010-10-09 15:50:13'
I get the results you want with this query:
SELECT albumOwnerID, albumId, albumCSD
FROM album
WHERE albumCSD in
(SELECT Max(album.albumCSD) AS MaxvonalbumCSD
FROM album
GROUP BY album.albumOwnerID);
However in MS Access
select albumOwnerID, albumID
from album
Group by albumOwnerID, albumID
Order by albumcsd desc
LIMIT 0,10
EDIT:
select albumOwnerID, albumID
from album
where albumOwnerID in (select distinct albumOwnerID from album order by albumCSD )
LIMIT 0,10
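For comparison, a common greatest-per-group pattern in MySQL is sketched below (untested here; table and column names are taken from the question). It picks each owner's newest album via a derived table and then keeps the 10 newest of those:
SELECT a.albumOwnerID, a.albumId, a.albumCSD
FROM album a
JOIN (SELECT albumOwnerID, MAX(albumCSD) AS maxCSD   -- newest album per owner
      FROM album
      GROUP BY albumOwnerID) latest
  ON  a.albumOwnerID = latest.albumOwnerID
  AND a.albumCSD = latest.maxCSD
ORDER BY a.albumCSD DESC
LIMIT 10;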