Removing duplicate/opposite entries in Google Sheets - google-apps-script

I have a sheet containing the following data:
URL A, URL B, similar score in percent.
If URL A is 98% similar to URL B, it means that URL B is 98% similar to URL A, and listed as well.
I want to find and eliminate these duplicates/reversed entries. For now, I have tried having two extra columns concatenating URL A+URL B in one, and URL B+URL A in one. This way I have unique identifiers.
After this I'm stuck, because the matching values sit in two different rows and two different columns. I might need a script that takes each A+B value and iterates through the B+A values until it finds a match, then marks it (or simply deletes it), since my knowledge of formulas for highlighting these duplicates falls short.
This sheet shows the concept - the first 100 rows (it's about 11K in total): https://docs.google.com/spreadsheets/d/1YKsguAn1lYjV4FlP_6_TlKGvFcpFAEzn7bpAyOEmozQ/edit?usp=sharing
Any suggestions for what I should look into?

Try the filter(match()) pattern to find duplicate values, like this:
=unique(
  flatten(
    filter(
      A2:B,
      match(A2:A & B2:B, B2:B & A2:A, 0),
      C2:C >= 90
    )
  )
)

I ended up with a solution where I sorted by URL A and implemented this formula:
=IF(A2<B2,A2&B2,B2&A2)
This way I had the concatenation the same way for the real one and the opposite. I didn't know you could use "<" on strings.
After this, I could delete duplicated values in the column with the formula above.
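For anyone who would rather do this step outside Sheets, here is a minimal Python sketch of the same canonical-key idea, assuming the rows have been exported as (URL A, URL B, score) tuples. The function name and the sample URLs are made up for illustration.

```python
def dedupe_reversed_pairs(rows):
    """Keep the first row of each unordered (url_a, url_b) pair."""
    seen = set()
    kept = []
    for url_a, url_b, score in rows:
        # Same idea as =IF(A2<B2,A2&B2,B2&A2): order the two URLs so that
        # (A, B) and (B, A) produce the same key.
        key = (url_a, url_b) if url_a < url_b else (url_b, url_a)
        if key not in seen:
            seen.add(key)
            kept.append((url_a, url_b, score))
    return kept


# Hypothetical sample data: the second row is the reverse of the first.
rows = [
    ("example.com/a", "example.com/b", 98),
    ("example.com/b", "example.com/a", 98),
    ("example.com/a", "example.com/c", 91),
]
deduped = dedupe_reversed_pairs(rows)
```

The key is a tuple rather than a concatenated string, which avoids the edge case where two different pairs concatenate to the same text.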

Related

Google script custom function for different column [duplicate]

I'm trying to do a couple of different things with a spreadsheet in Google and running into some problems with the formulas I am using. I'm hoping someone might be able to direct me to a better solution or be able to correct the current issue I'm having.
First of all, here is a view of the data on Sheet 1 that I am pulling from:
Example Spreadsheet
The first task I'm trying to accomplish is to create a sheet that lists all of these shift days with the date in one column and the subject ("P: Ben" or "S: Nicole") in another column. This sheet would be used to import the data via a CSV into our calendar system each month. I tried doing an Index-Match that used the date to pull the associated values; however, I found that I had to keep adjusting the formula offsets in order to capture new information. It doesn't seem like Index-Match works when multiple rows/columns are involved. Is there a better way to pull this information?
The second task I am trying to accomplish is to create a new tab which lists all the dates a specific person is assigned to (that way this tab will update in real time and everyone can just look at their own sheet to see what days they are on-call). However, I run into the same problem here, because for each new row I have to change the formula to reflect the correct information; otherwise it doesn't pull the correct cell when it finds a match.
I would appreciate any and all information/advice on how to accomplish these tasks with the formula combination I mentioned or suggestions on other formulas to use that I have not been able to find.
Thanks in advance!
Brandon. There are a few ways to attack your tasks, but looking at the structure of your data, I would use curly brackets {} to create arrays. Here is an excerpt of how Google explains arrays in Sheets:
You can also create your own arrays in a formula in your spreadsheet
by using brackets { }. The brackets allow you to group together
values, while you use the following punctuation to determine which
order the values are displayed in:
Commas: Separate columns to help you write a row of data in an array.
For example, ={1, 2} would place the number 1 in the first cell and
the number 2 in the cell to the right in a new column.
Semicolons: Separate rows to help you write a column of data in an array. For
example, ={1; 2} would place the number 1 in the first cell and the
number 2 in the cell below in a new row.
Note: For countries that use
commas as decimal separators (for example €1,00), commas would be
replaced by backslashes (\) when creating arrays.
You can join multiple ranges into one continuous range using this same
punctuation. For example, to combine values from A1-A10 with the
values from D1-D10, you can use the following formula to create a
range in a continuous column: ={A1:A10; D1:D10}
Knowing that, here's a sample sheet of your data.
First Task:
create a sheet that lists all of these shift days with the date in one
column and the subject ("P: Ben" or "S: Nicole") in another column.
To organize dates and subjects into discrete arrays, we'll collect them using curly brackets...
Dates: {A3:G3,A7:G7,A11:G11,A15:G15}
Subjects: {A4:G4,A5:G5,A8:G8,A9:G9,A12:G12,A13:G13,A16:G16,A17:G17}
This actually produces two rows rather than columns, but we'll deal with that in a minute. You'll note that, because there are two subjects per every one date, we need to effectively double each date captured.
Dates: {A3:G3,A3:G3,A7:G7,A7:G7,A11:G11,A11:G11,A15:G15,A15:G15}
Subjects: {A4:G4,A5:G5,A8:G8,A9:G9,A12:G12,A13:G13,A16:G16,A17:G17}
Still with me? If so, all that's left is to (a) stack these two rows into one array using another pair of curly brackets and a semicolon, (b) turn the two rows into two columns using the TRANSPOSE function and (c) add a SORT function to list the dates in chronological order...
=SORT(TRANSPOSE({{A3:G3,A3:G3,A7:G7,A7:G7,A11:G11,A11:G11,A15:G15,A15:G15};{A4:G4,A5:G5,A8:G8,A9:G9,A12:G12,A13:G13,A16:G16,A17:G17}}),1,TRUE)
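To make the shape of that formula easier to follow, here is a rough Python sketch of the same steps: pair every date with both subject rows, then sort by date. The dates are hypothetical placeholders and the function name is made up; only the "P: ..." / "S: ..." labels come from the question.

```python
from datetime import date


def list_shifts(weeks):
    """weeks: list of (dates, primary_row, secondary_row), one per calendar block."""
    pairs = []
    for dates, primary, secondary in weeks:
        # Each date is paired with both subject rows, which is why the
        # formula repeats every date range twice.
        pairs.extend(zip(dates, primary))
        pairs.extend(zip(dates, secondary))
    # Sort on the date column only, as SORT(..., 1, TRUE) does.
    return sorted(pairs, key=lambda pair: pair[0])


# Hypothetical sample data standing in for the calendar grid.
weeks = [
    ([date(2024, 1, 1), date(2024, 1, 2)],
     ["P: Ben", "P: Nicole"],
     ["S: Nicole", "S: Ben"]),
    ([date(2024, 1, 8)], ["P: Ben"], ["S: Nicole"]),
]
shifts = list_shifts(weeks)
```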
Second Task:
create a new tab which lists all the dates a specific person is
assigned to (that way this tab will update in real time and everyone
can just look at their own sheet to see what days they are on-call).
Assuming the two-column array we just created lives in A2:B53 on a new sheet called "Shifts," then we can use the FILTER function and SEARCH based on each name. The formula at the top of Ben's sheet would look like this:
=FILTER(Shifts!A2:B53,SEARCH("Ben",Shifts!B2:B53))
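As a sketch of what FILTER plus SEARCH is doing here, the Python below keeps the rows whose subject contains the name, ignoring case (SEARCH is case-insensitive). The sample rows and the function name are hypothetical.

```python
def shifts_for(shifts, name):
    """Return the (day, subject) rows whose subject contains `name`."""
    needle = name.lower()
    return [(day, subject) for day, subject in shifts
            if needle in subject.lower()]


# Hypothetical rows in the same two-column layout as the "Shifts" sheet.
shifts = [
    ("2024-01-01", "P: Ben"),
    ("2024-01-01", "S: Nicole"),
    ("2024-01-02", "P: Nicole"),
    ("2024-01-02", "S: Ben"),
]
ben_shifts = shifts_for(shifts, "Ben")
```

Because this is a substring match, a search for "Ben" would also pick up a name such as "Benjamin", so longer or more specific search strings may be needed if names overlap.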
Hopefully this helps, but please let me know if I've misinterpreted anything. Cheers.

Countif Formula to exclude Duplicates

I sought help regarding this once, but I failed to outline my problem.
This time I am happy to share the sheet with dummy data in hope it explains my problem a bit better: Link to the sheet
My issue is the following:
In column E I am counting the number of opportunities for a rep (listed in column A). The data I am considering is in a separate sheet named "Pipeline".
I do this with the countif formula and I use additional criteria to filter on date as well. My dates for February are in B4 and G4, because I only want to see opportunities in February.
My formula looks like this:
=countIFS(Pipeline!$A:$A,$A7,Pipeline!$F:$F,">="&$B$4,Pipeline!$F:$F,"<="&$G$4)
This works perfectly fine. However, sometimes I have two opportunities in my pipeline sheet with the same name (these are split opportunities). If an opportunity has the same name it should be counted only once. I can't seem to find an easy way to update my countif formula.
In the dummy sheet I shared above, you can see that Peter has two "New - CC Tech" opportunities. I want this to count as one opportunity. Everything I googled so far suggests using rather complex formulas, which is not so easy as I have multiple criteria in the formula that I need to filter my results (such as name of the rep and dates). Please feel free to suggest a solution within the sheet above and play around with it.
I really appreciate the help!
Try this ('unique' based on A,B and F)
=query(unique({Pipeline!A:B,Pipeline!F:F}),"select count(Col1) where Col1='"&A7&"' and Col3>=DATE'"&TEXT($B$4,"yyyy-MM-dd")&"' and Col3<=DATE'"&TEXT($G$4,"yyyy-MM-dd")&"' label count(Col1) '' ")
or, if you consider that the date could be different between two lines ('unique' based only on A and B, the date could be different but within the limits)
=query(unique({Pipeline!A$2:B,arrayformula((Pipeline!F$2:F>=$B$4)*(Pipeline!F$2:F<=$G$4))}),"select count(Col1) where Col1='"&A7&"' and Col3>0 label count(Col1) '' ")
In this second formula, we construct a matrix with A, B and a 0/1 flag (the answer to the question: is F within the limits?). We then apply unique and query for the rows where Col3 is 1 and Col1 is the name we are looking for.
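The logic of the second formula can be sketched in Python as "collect the distinct opportunity names for one rep inside the date window, then count them". This assumes column A holds the rep, column B the opportunity name and column F the date; the sample rows and the function name are invented for illustration.

```python
from datetime import date


def count_unique_opportunities(pipeline, rep, start, end):
    """Count distinct opportunity names for `rep` dated within [start, end]."""
    names = {
        name
        for owner, name, close_date in pipeline
        if owner == rep and start <= close_date <= end
    }
    return len(names)


# Hypothetical rows: (rep, opportunity name, date).
pipeline = [
    ("Peter", "New - CC Tech", date(2021, 2, 3)),
    ("Peter", "New - CC Tech", date(2021, 2, 10)),  # split opportunity
    ("Peter", "Renewal - Acme", date(2021, 2, 15)),
    ("Peter", "New - Initech", date(2021, 3, 1)),   # outside February
    ("Mary", "New - CC Tech", date(2021, 2, 5)),
]
peter_feb = count_unique_opportunities(
    pipeline, "Peter", date(2021, 2, 1), date(2021, 2, 28)
)
```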

Split and repeat without

In this sheet, I've the below input data:
As seen, the courses are separated by /
I want to display the same in the format below, where each line shows one course only, with the data of the student repeated:
I know that =split(C3," / ",true,true) can split the courses into 2 columns in the same row, but I need them in the same column, so I tried =TRANSPOSE(split(C3," / ",true,true)). That works fine for the first line only, but it fails when used with ARRAYFORMULA.
Any thoughts? I'm open to any potential solution: formula, script or anything else.
UPDATE
I tried this trick: creating a new column showing the number of courses for each student with =ArrayFormula(LEN(REGEXREPLACE(C11:C13, "[^/]", ""))+1)
Then I used REPT to repeat each row based on the number of courses: =arrayformula({transpose(split(concatenate(rept(B11:B13 & ",",D11:D13)),",",false,true)),transpose(split(concatenate(REPT(C11:C13 & ",",D11:D13)),",",false,true))}) and ended up with:
But here the courses are still joined together. How can I split them?
I've added two sheets to your sample spreadsheet. "Sheet2" is a cleanup of your testing sheet, "Sheet1." The other sheet ("Erik Help") references Sheet2, not Sheet1, and contains the following formula in cell A1:
=ArrayFormula({"Student ID","Student Name","Course";SUBSTITUTE(SPLIT(QUERY(FLATTEN(SPLIT(FILTER(SUBSTITUTE("/ "&Sheet2!C3:C,"/","/ "&Sheet2!A3:A&"zzz~"&Sheet2!B3:B&"~"),Sheet2!A3:A<>""),"/")),"Select * WHERE Col1 Is Not Null"),"~"),"zzz","")})
This one array formula produces all headers and results.
A virtual array is formed between the curly brackets { }. Headers are introduced first followed by a semicolon, which means "bump down one row to continue." The header titles can be changed as you like.
How It Works:
An additional "/ " is concatenated to the front of every non-blank entry in Sheet2!C3:C. Then SUBSTITUTE replaces every one of these forward slashes with Col A data, "zzz~", Col B data and "~". The tildes (~) will be used later by the outer SPLIT. The "zzz" is added to make sure that ID numbers are converted to text so that they hold formatting throughout the processing and don't turn into real numbers; later, the outer SUBSTITUTE will replace those with an empty string (i.e., get rid of the 'zzz').
Once the initial concatenations are complete, they are SPLIT at the forward slash and then FLATTENed into one column. QUERY removes any blank rows in this virtual array so far. The remaining results are again SPLIT at the tilde. Finally, that outer SUBSTITUTE removes the temporary instances of 'zzz'.
I also added a custom conditional formatting (CF) formula for the color banding on alternate rows.
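For comparison, the end result the formula is after (one output row per course, with the student data repeated) can be sketched in a few lines of Python. The sample students and the function name are hypothetical; IDs are kept as strings, which plays the same role as the "zzz" trick in preserving their formatting.

```python
def split_courses(rows, sep="/"):
    """Expand (id, name, 'A / B / C') rows into one row per course."""
    out = []
    for student_id, name, courses in rows:
        for course in courses.split(sep):
            course = course.strip()
            if course:  # skip empty pieces, e.g. from a trailing separator
                out.append((student_id, name, course))
    return out


# Hypothetical input in the same three-column layout as the question.
rows = [
    ("007", "Ann", "Math / Physics"),
    ("012", "Bob", "Art"),
]
expanded = split_courses(rows)
```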
You can try this one:
Formula:
=ARRAYFORMULA(TRIM(QUERY(SPLIT(FLATTEN(IF(IFERROR(SPLIT(C3:C5, "/"))="",,
A3:A5&"×"&B3:B5&"×"&SPLIT(C3:C5, "/"))), "×"),
"where Col3 is not null")))
Output:
Reference:
How to transpose & split multiple columns and repeat specific cells in a column

How do you separate comma separated values in different columns while maintaining values in the rest of the row in Google Sheets?

How do you split a cell containing comma-separated values so that each value gets its own new row, with the other values copied from the row the value came from? That would look like this:
From this..
..to this.
I'm actually looking for an answer that doesn't use google script when possible and without using gigantic long and complex formulas. The use of a pivot table within Google sheets may be used, but is also not my preference. But if it's not possible to use only formulas then I'm open to other answers as well.
I've had this question for over a year and I can't find serious answers online after a few hours of searching. There will be answers using a google script, but that doesn't really fall within the scope of my question. I am willing to adjust or rephrase my question if the current question remains unanswered.
I myself have no idea how to answer the question and the attempts I have made are not to be taken seriously.
Lambda Update
It's 2022. We now have LAMBDA and a bunch of array functions. Thus we can combine everything into a single formula, as originally desired. The idea is still the same as before, just much cleaner. (Also FLATTEN is no longer undocumented.)
=ArrayFormula(
  SPLIT(
    TRANSPOSE(
      SPLIT(
        JOIN(
          "",
          BYROW(
            A1:E4,
            LAMBDA(row,
              JOIN(
                ",",
                REDUCE(
                  ";",
                  row,
                  LAMBDA(cell1,cell2,
                    FLATTEN(FLATTEN(cell1)&","&SPLIT(cell2,","))
                  )
                )
              )
            )
          )
        ),
        ";,",
      )
    ),
    ","
  )
)
How it works
REDUCE combines all elements in an array to a single result using a function (Named Function or LAMBDA). In this case, we use the same permutation trick (combine column vector with row vector) as in the old solution to serialize every row. This gives us a column of rows, each starting with ;,
The rows are JOINed together, serializing the array.
BYROW applies a function to each row in a range, returning a single value for each row. Here, we use the process above in a LAMBDA function which gives a single serialized string for each row in the range.
We then JOIN all these together. Each row is delimited by ;,.
Split on the ;, delimiter, giving a row of serialized rows.
Transpose the row to get a column.
Split again on the comma, deserializing each row.
For a much more readable solution, of course, you can name the lambdas, resulting in a formula that's cleaner still.
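Stripped of the serialization, the underlying operation is a per-row Cartesian product of the comma-separated values in each cell. A minimal Python sketch of that idea follows; the sample table and the function name are hypothetical, and the order of the output rows is not guaranteed to match what the Sheets formula produces.

```python
from itertools import product


def expand_rows(table, sep=","):
    """For each row, emit one output row per combination of cell values."""
    expanded = []
    for row in table:
        # Every cell becomes a list of options; a cell without a
        # separator contributes a single option.
        options = [cell.split(sep) for cell in row]
        expanded.extend(product(*options))
    return expanded


# Hypothetical input: some cells hold several comma-separated values.
table = [
    ["1", "a,b", "x"],
    ["2", "c", "y,z"],
]
result = expand_rows(table)
```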
Draggable Formula solution (obsoleted by LAMBDA)
Let's see if I can get this ball rolling.
At a Glance
This solution is unfortunately unstable, as it relies on the undocumented Flatten function (which turns any range into a column array), and requires two formulas to work. While I'm sure that you can do the same thing without Flatten(), this at least saves us some typing, as we rely on it heavily. Without Flatten, we can achieve the same with TRANSPOSE(SPLIT(TEXTJOIN(...))), which is not nearly as elegant.
The core formula, while it does have a linear growth factor and can get messy with more columns, does have an easy pattern to follow for the setup. It can also be dragged, which is the next best thing to a single ArrayFormula.
Stage 1: Serialize Rows
As you might have expected, we're going to use some string serialization tricks to get what we want. Here's the core formula:
=TEXTJOIN(",",,
ArrayFormula(
Flatten(Flatten(Flatten(Flatten(Flatten(
SPLIT(A1,",")&",")&
SPLIT(B1,",")&",")&
SPLIT(C1,",")&",")&
SPLIT(D1,",")&",")&
SPLIT(E1,",")&";")
))&","
As you can see, it accounts for any commas inside each cell in the row. To add more columns, simply add another Flatten( and add your column to the list. Just make sure that the last one uses a ; and not a ,.
We take advantage of the fact that, in general, when ArrayFormula is applied to a column vector and a row vector, we can do an operation on every permutation of the two mixed together.
Examples:
=ArrayFormula({0;1}&{2,3}) is equivalent to ={"02","03";"12","13"}
=ArrayFormula(SEQUENCE(10)*SEQUENCE(1,10)) gives us a 10x10 multiplication table.
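The column-vector-meets-row-vector behaviour in these two examples can be mimicked with a small Python helper, shown here only as an illustration (the helper name `outer` is made up):

```python
def outer(column, row, op):
    """Apply op to every (column item, row item) pair.

    Output rows follow the column vector and output columns follow
    the row vector, as ArrayFormula does.
    """
    return [[op(c, r) for r in row] for c in column]


# Counterpart of ArrayFormula({0;1}&{2,3})
concat = outer(["0", "1"], ["2", "3"], lambda c, r: c + r)

# Counterpart of ArrayFormula(SEQUENCE(10)*SEQUENCE(1,10))
times_table = outer(range(1, 11), range(1, 11), lambda c, r: c * r)
```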
In our case, we use this to generate every possible permutation of rows based on the commas in each cell, serialize each row into a CSV string ending in a semicolon, then join all the rows into one long string. The extra "," is so we can concatenate multiple tables together in the next stage.
When you're set up with the proper number of columns, drag this down to the height of the table. (Note: If some of your values can be blank, you also have to do some error checking around each SPLIT.)
Stage 2: Deserialize
This formula is considerably simpler. (Assuming serialization data is in column F.)
=ArrayFormula(
  SPLIT(
    TRANSPOSE(
      SPLIT(
        JOIN(,F:F),
        ";,",
      )
    ),
    ","
  )
)
First, glue all the strings together using JOIN. Since we know that each row ends with a ";,", we split on that to get our rows. After that, we can split each row up into cells by splitting on ",", resulting in our table.
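Here is a small Python sketch of the same deserialization, assuming each cell in the helper column holds strings of the form produced in Stage 1 (every serialized row ending in ";,"). The sample strings and the function name are hypothetical, and unlike Sheets' SPLIT, Python keeps numeric-looking cells as text.

```python
def deserialize(serialized_cells):
    """Rebuild a table from strings whose rows each end in ';,'."""
    joined = "".join(serialized_cells)  # JOIN(,F:F)
    # Split on the whole ';,' delimiter and drop the empty trailing piece.
    rows = [chunk for chunk in joined.split(";,") if chunk]
    # Split each row into cells, dropping empty pieces as SPLIT does by default.
    return [[cell for cell in row.split(",") if cell] for row in rows]


# Hypothetical contents of the serialization column.
column_f = [
    "1,a,x;,1,b,x;,",
    "2,c,y;,2,c,z;,",
]
table = deserialize(column_f)
```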
Conclusion
It's not a single ArrayFormula, sure, but neither of these formulas is really all that complex, which is nice. ArrayFormulas can get messy and confusing quickly.
We've managed to avoid scripting too, which is a plus.
You can also hide the serialization column if you find it unsightly!

Find answer for data in A in predefined list of answers - VLOOKUP - Google Spreadsheets

I've got a Google spreadsheet with a main sheet tab. Column A contains a bunch of company names, some of which repeat and are included multiple times. In column B I want to have a predefined unique code for each company. For instance, if I had a company name Nike in a10, a14, a21, I would have the same code each time in b10, b14, b21.
I was initially looking at if / else blocks and switch statements (not sure if google spreadsheet can even do them) to accomplish this, but they would become massive and unmanageable as single line pieces of code will involve several hundred company names.
Instead I've setup another tab called Codes Data with a predefined list of all of the company names in column A and the code in column B. This list will be added to over time.
What I'm trying to do is have a formula in the main sheet column B that will check the value of the corresponding column A cell, find the unique code for that company in the Codes Data tab and place that code in column B.
I started using VLOOKUP for this and at first it seemed to work, but now I'm getting inconsistent results (i.e. it's outputting Addidas | am-1121 and ACMECO RESTAUR | am-1121 where according to the Codes Data sheet it should output Addidas | ad-5426).
I've provided an example spreadsheet here : https://docs.google.com/spreadsheets/d/156Lla5IyLjB-hp7s50jpotC1qcaov9RdFkpUzATe710/edit#gid=458436476
Is VLOOKUP the correct function to be doing this? If so, how can I use it more properly, and if not what would be a better approach?
You must add FALSE to your lookup formula, or it won't work:
=VLOOKUP(A5, 'Codes Data'!$A:$B, 2, FALSE)
Use the Help menu, and choose "Sheets Help", and then type in "vlookup". You'll get the documentation:
is_sorted - [OPTIONAL - TRUE by default] - Indicates whether the column to be searched (the first column of the specified range) is sorted.
If is_sorted is TRUE or omitted, the nearest match (less than or equal to the search key) is returned. If all values in the search column are greater than the search key, #N/A is returned.
If is_sorted is set to TRUE or omitted, and the first column of the range is not in sorted order, an incorrect value might be returned.
If is_sorted is FALSE, only an exact match is returned. If there are multiple matching values, the content of the cell corresponding to the first value found is returned, and #N/A is returned if no such value is found.
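To see why the sorted mode can misfire on unsorted data, here is a rough Python model of the two modes. It assumes the sorted mode behaves like a binary search, which is a simplification and not a statement about how Sheets is implemented; the function name and the Nike code are made up, while the other two codes come from the question. Python compares strings case-sensitively, which Sheets does not.

```python
from bisect import bisect_right


def vlookup(key, table, exact=True):
    """table: list of (key, value) rows. Returns None where Sheets gives #N/A."""
    if exact:
        # is_sorted = FALSE: first exact match wins.
        for k, v in table:
            if k == key:
                return v
        return None
    # is_sorted = TRUE (modelled as binary search): nearest key <= search key.
    # Only meaningful if the keys really are sorted.
    keys = [k for k, _ in table]
    i = bisect_right(keys, key)
    return table[i - 1][1] if i else None


# Deliberately unsorted, like a code list that grows over time.
codes = [
    ("Nike", "nk-0001"),            # hypothetical code
    ("Addidas", "ad-5426"),
    ("ACMECO RESTAUR", "am-1121"),
]
exact_result = vlookup("Addidas", codes)
approx_result = vlookup("Addidas", codes, exact=False)
```

With this particular ordering, the exact lookup returns ad-5426 while the sorted-mode lookup lands on the wrong row.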