Related
I am dealing with the post-processing of CSV logs arranged in a multi-column format, in the following order: the first column corresponds to the line number (ID), the second one contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of this ID (which is always negative):
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600
In order to reduce the number of lines, I filtered this CSV with the aim of finding the line with the highest number in the second column (POP), using the following AWK expression:
# search CSV for the line with the highest POP and save all lines before it, while keeping a minimal number of lines (3) in case this line is found at the beginning of the CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv
thus obtaining the following reduced output CSV, which still has many lines, since the matching line (the one with the highest POP) is located on the 26th line:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
How would it be possible to further customize my filter, by modifying my AWK expression (or piping it to something else), in order to additionally consider only the lines with a small difference in the negative value of the third column (dG) compared to the first line (which has the most negative value)? For example, to consider only the lines that differ by no more than 20% in dG from the first line, while keeping all the other conditions the same:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
Both tasks can be done in a single awk that reads the file twice:
awk -F ', ' 'NR==1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR}; if (FNR==2) dg = $3 * .8; next} $3+0 == $3 && (FNR == n+1 || $3 > dg) {exit} 1' file file
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
The first pass over the file (FNR == NR) records the position of the line with the maximum POP and the 20% dG cutoff taken from the first data line; the second pass prints every line until it reaches either the line after the maximum or a dG outside the cutoff. To make it more readable:
awk -F ', ' '
NR == 1 {            # skip the header (first pass only; NR==1 never repeats)
    next
}
FNR == NR {          # first pass: gather statistics over the whole file
    if (max < $2) {
        max = $2
        n = FNR      # line number of the highest POP seen so far
    }
    if (FNR == 2)
        dg = $3 * .8 # 20% dG cutoff, relative to the first data line
    next
}
$3 + 0 == $3 && (FNR == n+1 || $3 > dg) {  # second pass, numeric lines only:
    exit             # stop after the max-POP line, or once dG leaves the band
}
1' file file
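Note that the two passes (hence passing file twice on the command line) require a regular file that can be read twice; with data arriving on a pipe, the same logic can be done in one pass by buffering lines in an array. A sketch under the same assumptions, with the input file named input.csv as in the question:
awk -F ', ' '
NR == 1 { print; next }               # always keep the header
NR == 2 { dg = $3 * .8 }              # 20% dG cutoff from the first data line
{ line[NR] = $0                       # buffer every data line
  if (max < $2) { max = $2; n = NR }  # remember where the max POP sits
  if (!stop && $3 > dg) stop = NR }   # first line outside the 20% band
END {
  last = stop ? stop - 1 : NR
  if (n < last) last = n              # cut at the max-POP line if it comes first
  for (i = 2; i <= last; i++) print line[i]
}' input.csv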
I am dealing with the post-processing of a multi-column CSV arranged in a fixed format: the first column corresponds to the line number (ID), the second one contains its population (POP, the number of samples that fell into this ID), and the third column (dG) represents some inherent value of this ID (always negative):
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600
In order to reduce the number of lines, I filtered this CSV with the aim of finding the line with the highest number in the second column (POP), using the following AWK expression:
# search CSV for the line with the highest POP and save all lines before it, while keeping a minimal number of lines (3) in case this line is found at the beginning of the CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv
In the simple case, when the line with the maximum POP is the first data line, the script will save this line (POP max) plus the 2 lines after it (= min_lines = 3).
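For instance, on a hypothetical toy.csv whose maximum POP sits on the first data line:
ID, POP, dG
1, 9, -9.0000
2, 3, -8.5000
3, 2, -8.0000
4, 1, -7.5000
the expression keeps the header, the max-POP line and the 2 lines after it, so only ID 4 is dropped.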
In the more complicated case, when the line with the maximum POP is located in the middle of the CSV, the script detects this line plus all the preceding lines from the beginning of the CSV and lists them in the new CSV, keeping the original order. However, in that case output.csv would contain too many lines, since the matching line (with the highest POP) is located on the 26th line:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
In order to reduce the total number of lines to 3-5 in the output CSV, how would it be possible to customize my filter so that it saves only the lines with a minor difference from the line having the biggest value in the POP column (e.g. the values in the POP column should satisfy POP > 0.5 * max(POP))? Finally, I always need to keep the first line as well as the line with the maximal value in the output. So the AWK solution should filter the multi-line CSV in the following manner (please ignore the comments after #):
ID, POP, dG
1, 7, -9.6000
9, 16, -7.8100
26, 20, -6.6500 # this is POP max detected over all lines
This two-pass awk should work for you: the first pass finds the maximum POP, and the second pass keeps the header, the first data line, and every line whose POP exceeds half of that maximum (the line holding the maximum itself always passes this test):
awk -F ', ' 'NR == 1 {next}
FNR == NR {if (max < $2) max = $2; next}
FNR <= 2 || $2 > (.5 * max)' file file
ID, POP, dG
1, 7, -9.6000
9, 16, -7.8100
26, 20, -6.6500
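If the data cannot be read twice (e.g. it comes from a pipe), a one-pass variant that buffers the lines gives the same result; a sketch, again assuming the input file from the question:
awk -F ', ' '
NR == 1 { print; next }              # always keep the header
{ line[NR] = $0; pop[NR] = $2        # buffer each data line and its POP
  if (max < $2) max = $2 }
END {
  for (i = 2; i <= NR; i++)          # keep the first data line and every
    if (i == 2 || pop[i] > .5 * max) # line above half of the maximum POP
      print line[i]
}' input.csv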
Suppose we have a table in a MySQL database where each fname has a connection to another fname (BB_Connection_name). We would like a query to find the pair(s) of friends who are connected to each other.
E.g. Sidharth and Asim both have each other's BBid as their BB_Connection_ID.
I have looked at the similar father/son/grandson question, but in that case not every father has a son, so inner joining them makes the problem easier to solve. I tried using that approach, but it didn't work.
Here I need to check the BB_Connection_ID for every fname (A), and then check whether the corresponding fname has A's BBid as his BB_Connection_ID or not.
The pairs chosen should be like Sidharth <-> Asim.
We need to find the pairs whose connection IDs point to each other.
==========================================================================
Code for recreation of the table:
-----------------------------------------------------------------------------
create table world.bigbb(
BBid int not null auto_increment,
fname varchar(20) NOT NULL,
lname varchar(30),
BBdays int not null,
No_of_Nom int,
BB_rank int not null,
BB_Task varchar(10),
BB_Connection_ID int,
BB_Connection_name varchar(10),
primary key (BBid)
);
insert into world.bigbb (fname, lname, BBdays, No_of_Nom, BB_rank, BB_Task, BB_Connection_ID, BB_Connection_name)
values
('Sidharth', 'Shukla', 40, 4, 2, 'Kitchen', 11, 'Asim'),
('Arhaan', 'Khan', 7, 1, 9, 'Kitchen', 16, 'Rashmi'),
('Vikas', 'Bhau', 7, 1, 8, 'Bedroom', 11, 'Asim'),
('Khesari', 'Bihari', 7, 1, 12, 'Kitchen', 9, 'Paras'),
('Tehseem', 'Poonawala', 7, 1, 11, 'Washroom', 12, 'Khesari'),
('Shehnaaz', 'Gill', 40, 4, 4, 'Washroom', 9, 'Paras'),
('Himanshi', 'Khurana', 7, 0, 7, 'Bedroom', 8, 'Shefali'),
('Shefali', 'Zariwala', 7, 1, 10, 'Bedroom', 1, 'Sidharth'),
('Paras', 'Chabra', 40, 3, 1, 'Bathroom', 10, 'Mahira'),
('Mahira', 'Sharma', 40, 4, 5, 'Kitchen', 9, 'Paras'),
('Asim', 'Khan', 40, 3, 3, 'Bathroom', 1, 'Sidharth'),
('Arti', 'Singh', 40, 5, 6, 'Captain', 1, 'Sidharth'),
('Sidharth', 'Dey', 35, 6, 16, 'None', 14, 'Shefali'),
('Shefali', 'Bagga', 38, 5, 15, 'None', 13, 'Sidharth'),
('Abu', 'Fifi', 22, 5, 17, 'None', 11, 'Asim'),
('Rashmi', 'Desai', 38, 5, 13, 'None', 17, 'Debolina'),
('Debolina', 'Bhattacharjee', 38, 5, 14, 'None', 16, 'Rashmi');
One solution would be to self-join the table:
select
b1.fname name1,
b2.fname name2
from bigbb b1
inner join bigbb b2
on b1.BB_Connection_ID = b2.BBid
and b2.BB_Connection_ID = b1.BBid
and b1.BBid < b2.BBid
This will give you one record for each pair, with the member having the smallest BBid in the first column.
This demo on DB Fiddle with your sample data returns:
name1 | name2
:------- | :-------
Sidharth | Asim
Paras | Mahira
Sidharth | Shefali
Rashmi | Debolina
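Note that the sample data has two people named Sidharth and two named Shefali; the Sidharth | Shefali row above comes from Sidharth Dey and Shefali Bagga (BBids 13 and 14), which is also why the join matches on BBid rather than on the name columns. A small hypothetical variation that selects the last names as well, to keep the pairs unambiguous:
select
  concat(b1.fname, ' ', b1.lname) name1,
  concat(b2.fname, ' ', b2.lname) name2
from bigbb b1
inner join bigbb b2
  on b1.BB_Connection_ID = b2.BBid
  and b2.BB_Connection_ID = b1.BBid
  and b1.BBid < b2.BBid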
I have two tables, and I want to update the rows of torrents from scrapes every day.
scrapes:
id, torrent_id, name, status, complete, incomplete, downloaded
1, 1, http://tracker1.com, 1, 542, 23, 542
2, 1, http://tracker2.com, 1, 542, 23, 542
3, 2, http://tracker1.com, 1, 123, 34, 43
4, 2, http://tracker2.com, 1, 123, 34, 43
5, 3, http://tracker1.com, 1, 542, 23, 542
6, 3, http://tracker2.com, 1, 542, 23, 542
7, 4, http://tracker1.com, 1, 123, 34, 43
8, 4, http://tracker2.com, 1, 123, 34, 43
9, 5, http://tracker1.com, 1, 542, 23, 542
10, 5, http://tracker2.com, 1, 542, 23, 542
11, 6, http://tracker1.com, 1, 123, 34, 43
12, 6, http://tracker2.com, 1, 123, 34, 43
torrents:
id, name, complete, incomplete, downloaded
1, CentOS, 0, 0, 0
2, Ubuntu, 0, 0, 0
3, Debian, 0, 0, 0
4, Redhat, 0, 0, 0
5, Fedora, 0, 0, 0
6, Gentoo, 0, 0, 0
The scrapes table may have multiple rows (tracker names) per torrent, but I want to get the values only from the first row found (for better performance). Also, I need to update only torrent ids 1, 3 and 6, in one query. This is my attempt:
UPDATE (SELECT * FROM scrapes WHERE torrent_id IN(1,3,6) GROUP BY torrent_id) as `myview` JOIN torrents ON myview.torrent_id=torrents.id SET torrent.complete=myview.complete WHERE 1
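The query above fails as written: SET torrent.complete references an alias that doesn't exist (the table is torrents), and SELECT * combined with GROUP BY torrent_id is ambiguous (MySQL rejects it under ONLY_FULL_GROUP_BY). A minimal sketch of one way to express it, assuming "first found" means the scrape row with the lowest id per torrent, and that complete, incomplete and downloaded are the values to copy:
UPDATE torrents t
JOIN (SELECT torrent_id, MIN(id) AS first_id   -- first scrape row per torrent
      FROM scrapes
      WHERE torrent_id IN (1, 3, 6)
      GROUP BY torrent_id) f ON f.torrent_id = t.id
JOIN scrapes s ON s.id = f.first_id
SET t.complete   = s.complete,
    t.incomplete = s.incomplete,
    t.downloaded = s.downloaded;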
I have a JSON file whose rows are in the following format:
{"checkin_info": {"11-3": 17, "8-5": 1, "15-0": 2, "15-3": 2, "15-5": 2, "14-4": 1, "14- 5": 3, "14-6": 6, "14-0": 2, "14-1": 2, "14-3": 2, "0-5": 1, "1-6": 1, "11-5": 3, "11-4": 11, "13-1": 1, "11-6": 6, "11-1": 18, "13-6": 5, "13-5": 4, "11-2": 9, "12-6": 5, "12-4": 8, "12-5": 5, "12-2": 12, "12-3": 19, "12-0": 20, "12-1": 14, "13-3": 1, "9-5": 2, "9-4": 1, "13-2": 6, "20-1": 1, "9-6": 4, "16-3": 1, "16-1": 1, "16-5": 1, "10-0": 3, "10-1": 4, "10-2": 4, "10-3": 4, "10-4": 1, "10-5": 2, "10-6": 2, "11-0": 3}, "type": "checkin", "business_id": "KO9CpaSPOoqm0iCWm5scmg"}
and so on... it has 8282 entries like this.
I want to convert it into a CSV file like this:
business_id "0-0" "1-0" "2-0" "3-0" ….. "23-0" "0-1" ……. "23-1" …….. "0-4" …… "23-4" …… "23-6"
1 KO9CpaSPOoqm0iCWm5scmg 2 1 0 1 NA 1 1 NA NA NA NA NA 6 NA 7
2 oRqBAYtcBYZHXA7G8FlPaA 1 2 2 NA NA 2 NA NA 1 NA 2 NA 2 NA 2
I tried this code:
urlc <- "C:\\Users\\Ayush\\Desktop\\yelp_training_set\\yelp_training_set_checkin.json"
conc = file(urlc, "r")
inputc <- readLines(conc, -1L)
usec <- lapply(X=inputc,fromJSON)
for (i in 1:8282) {
  tt <- usec[[i]]$checkin_info
  bb <- toString(tt)
  usec[[i]]$checkin_info <- bb
}
dfc <- data.frame(matrix(unlist(usec), nrow=length(usec), byrow=T))
write.csv(dfc,file="checkin_tr.csv")
to convert it into a form like this:
X1
business_id
1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1
D0IB17N66FiyYDCzTlAI4A
1, 1, 2, 1, 1
HLQGo3EaYVvAv22bONGkIw
1, 1, 1, 1
J6OojF0R_1OuwNlrZI-ynQ 2, 1, 2, 1, 2, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 2, 1, 2
But I want the entries in the column "X1" above in separate columns, as shown in the first table.
How can I do this? Please help.
Using RJSONIO you can do something like this:
library(RJSONIO)
tt <- fromJSON(tt)   # here tt holds the JSON string of a single row
data.frame(business_id = tt$business_id,
           do.call(rbind, list(tt$checkin_info)))
business_id X11.3 X8.5 X15.0 X15.3 X15.5 X14.4 X14.5 X14.6 X14.0 X14.1 X14.3 X0.5 X1.6 X11.5 X11.4 X13.1 X11.6 X11.1 X13.6 X13.5 X11.2 X12.6 X12.4
1 KO9CpaSPOoqm0iCWm5scmg 17 1 2 2 2 1 3 6 2 2 2 1 1 3 11 1 6 18 5 4 9 5 8
X12.5 X12.2 X12.3 X12.0 X12.1 X13.3 X9.5 X9.4 X13.2 X20.1 X9.6 X16.3 X16.1 X16.5 X10.0 X10.1 X10.2 X10.3 X10.4 X10.5 X10.6 X11.0
1 5 12 19 20 14 1 2 1 6 1 4 1 1 1 3 4 4 4 1 2 2 3
EDIT
I use a new idea here: it is easier to create a long-format data.frame and then convert it to a wide format, using reshape2 for example.
library(RJSONIO)
## I create 2 shorter lines with different id
tt <- '{"checkin_info": {"11-3": 17, "8-5": 1, "15-0": 2}, "type": "checkin", "business_id": "KO9CpaSPOoqm0iCWm5scmg"}'
tt1 <- '{"checkin_info": {"12-0": 17, "7-5": 1, "15-0": 5}, "type": "checkin", "business_id": "iddd2"}'
## use inputc <- readLines(conc, -1L) in your case
inputc <- list(tt,tt1)
usec <- lapply(X = inputc, function(x) {
  tt <- fromJSON(x)
  data.frame(business_id = tt$business_id,
             names = names(tt$checkin_info),
             values = unlist(tt$checkin_info))
})
## create a long data frame
dat <- do.call(rbind,usec)
## put in the wide format
library(reshape2)
dcast(business_id ~ names, data = dat)
business_id 11-3 15-0 8-5 12-0 7-5
1 KO9CpaSPOoqm0iCWm5scmg 17 2 1 NA NA
2 iddd2 NA 5 NA 17 1
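To go from there to a CSV file like the one requested in the question, cast once over the full long data.frame and write it out; a sketch, assuming dat is built from all 8282 rows as above (missing day-hour combinations come out as NA, as in the desired table):
wide <- dcast(business_id ~ names, data = dat, value.var = "values")
write.csv(wide, file = "checkin_tr.csv", row.names = FALSE)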