When running Remove Duplicates in Power Query, does it leave the 1st instance alone and then delete any following duplicates? E.g. if there were duplicates on rows 10, 11 and 12, would it delete rows 11 & 12? Is there documentation on this somewhere?
Thanks!
As far as I am aware, remove duplicates will remove items based on the order the data was initially loaded into Power Query. Any sorting or other operations you have performed after the data is loaded will not be factored into this. So duplicate items on rows 11 and 12 would be removed in your example, even if you sorted the data so the items on rows 11 and 12 were now above the item on row 10.
It is possible to make Remove Duplicates follow the current sort order if you use the function Table.Buffer() on the data before using the Remove Duplicates function in PQ (the actual function it runs is Table.Distinct()). This is because Table.Buffer() loads the table into memory in the state it is in when called, and this resets the "load" order that Table.Distinct() uses to remove duplicates.
In practice the simplest way to do it looks like changing the default function when you use Remove Duplicates from this
= Table.Distinct(#"Sorted Rows", {"DuplicateColumn"})
to this
= Table.Distinct(Table.Buffer(#"Sorted Rows"), {"DuplicateColumn"})
Not sure about documentation but from experience: Yes, the first item is retained, any duplicates following will be removed.
With this knowledge under your belt you can use an Index column to manipulate the order of entries if the default order does not produce the result you want.
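The keep-first rule described in these answers can be illustrated outside Power Query. The following is a plain Python sketch, not M code and not how Table.Distinct() is implemented; the helper keep_first and the sample rows are made up for illustration. It only shows why the row order at the moment of de-duplication decides which duplicate survives.

```python
def keep_first(rows, key):
    """Return rows with later duplicates of `key` dropped, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Hypothetical sample data: two rows share the same "id".
rows = [
    {"id": "A", "updated": 1},
    {"id": "B", "updated": 2},
    {"id": "A", "updated": 3},
]

# In the original order, the older "A" row is the one that survives.
print(keep_first(rows, "id"))

# Sorting newest-first before de-duplicating keeps the newer "A" row instead.
newest_first = sorted(rows, key=lambda r: r["updated"], reverse=True)
print(keep_first(newest_first, "id"))
```

In Power Query terms, the second call corresponds to de-duplicating a table whose sorted state has actually been materialised, which is what the Table.Buffer() suggestion above is meant to achieve.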
I'd like to save some data in a MySQL table. The rows have an order that I can modify at will. Say I have 10 rows. I want to move the 10th row to the 5th position, or insert some more rows between the 2nd and 3rd positions. Then, with a viewer, I can get the data in the order I set. How can I implement such a table?
My idea is to save the order as a float number in a new column. Each time I change the order, say, move the 10th row between the 5th and 6th, I would take the order numbers of the 5th and 6th rows, average them, and write the result to the order column of the 10th row. Then I can retrieve the data with ORDER BY on the order column. But I don't think it's a good idea. Any help with this problem?
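For reference, the averaging idea from the question can be sketched as follows. This uses Python's sqlite3 module rather than MySQL so that it runs anywhere; the table name items, the column sort_key and the helper names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, sort_key REAL)")
conn.executemany(
    "INSERT INTO items (id, name, sort_key) VALUES (?, ?, ?)",
    [(i, f"item{i}", float(i)) for i in range(1, 11)],
)


def move_between(conn, item_id, before_id, after_id):
    """Place item_id between two neighbours by averaging their sort keys."""
    (lo,) = conn.execute(
        "SELECT sort_key FROM items WHERE id = ?", (before_id,)
    ).fetchone()
    (hi,) = conn.execute(
        "SELECT sort_key FROM items WHERE id = ?", (after_id,)
    ).fetchone()
    conn.execute(
        "UPDATE items SET sort_key = ? WHERE id = ?", ((lo + hi) / 2, item_id)
    )


def ordered_ids(conn):
    """Return the ids in display order."""
    return [row[0] for row in conn.execute("SELECT id FROM items ORDER BY sort_key")]


# Move the 10th row between the 5th and 6th: its key becomes 5.5.
move_between(conn, 10, 5, 6)
print(ordered_ids(conn))
```

Only one row is updated per move, which is the attraction. The catch is that repeatedly inserting into the same gap keeps halving it, and floating-point precision eventually runs out, so this scheme needs an occasional renumbering pass.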
You don't order the table, that makes no sense. You are actually not interested in the order the entries are placed in that table. Consider it random.
You are interested in the order you want to see the entries in. For that you create an additional column, call it "select_order" or "priority", however you like. In that you store simple integers which you use to describe the order you want to see the entries in.
Now you can "re-order" the entries however you like by changing those numbers in that order column. At query time you add an ORDER BY select_order clause to your SELECT query and will receive the entries in exactly the order you want.
This is the standard approach for relational database models. That does not mean there are no other approaches worth looking into for very special situations:
a priority table instead of a column, joined during the SELECT query. This might make sense for situations with much more write than read operations. Note the "much", however.
a multiple-column approach for situations where you can group entries and only re-order inside such groups. That dramatically reduces the number of entries you have to update when re-ordering.
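A minimal sketch of the order-column approach described above, using Python's sqlite3 module so it is runnable as-is (the same SQL should carry over to MySQL). The table entries, its sample rows and the helper swap_order are assumptions for illustration; only the select_order column name comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT, select_order INTEGER)"
)
conn.executemany(
    "INSERT INTO entries (id, title, select_order) VALUES (?, ?, ?)",
    [(1, "alpha", 1), (2, "beta", 2), (3, "gamma", 3)],
)


def titles_in_order(conn):
    """Read the entries in the order defined by select_order."""
    return [
        row[0]
        for row in conn.execute("SELECT title FROM entries ORDER BY select_order")
    ]


def swap_order(conn, id_a, id_b):
    """Exchange the select_order values of two rows."""
    (a,) = conn.execute(
        "SELECT select_order FROM entries WHERE id = ?", (id_a,)
    ).fetchone()
    (b,) = conn.execute(
        "SELECT select_order FROM entries WHERE id = ?", (id_b,)
    ).fetchone()
    conn.execute("UPDATE entries SET select_order = ? WHERE id = ?", (b, id_a))
    conn.execute("UPDATE entries SET select_order = ? WHERE id = ?", (a, id_b))


# Swap the first and last entries, then read them back in display order.
swap_order(conn, 1, 3)
print(titles_in_order(conn))
```

The physical order of rows in the table never matters here; the display order comes entirely from the ORDER BY clause.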
What is the best practice for moving rows, so that you can change the order of items? If I make a new column called order_id or something, wouldn't that fail when I delete or select rows?
Another method, I guess, is to swap the values of two rows completely, so that everything except the primary ID is exchanged. However, I do not know what people usually use. There are so many websites that give you the ability to change the order of things. How do they do that?
Every SQL statement that returns a visible result set should include an ORDER BY clause so that the results are consistent. The Standard does not guarantee that the order of rows in a particular table will remain constant or consistent, even if obvious changes aren't made to the table.
What you use for your ORDER BY clause depends on the use case. A date value is the usual choice for a comment thread or blog entry ordering. However, if you want the user to be able to customize the order that a result set shows in, then you have to provide a column that represents the position of the row, and adjust the value of that column when the user makes changes to the order they see.
For example, if you decide that the column will contain a sequential number, starting with 1 for the first row, 2 for the second, etc., then you can delete rows when they need to be deleted without having to do updates. However, if you insert a row, you will need to give the row you insert the sequential number appropriate for its position, and update all rows below that with their new positions. The same goes for moving a row from somewhere else to a new location: the rows between the new and old locations need to be updated with new position indexes.
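The insert and move cases described above could look roughly like this. It is a sketch using Python's sqlite3 module; the table items, its columns and the helper names insert_at and move are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT PRIMARY KEY, position INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
)


def insert_at(conn, name, pos):
    """Insert a row at `pos`, pushing rows at or after `pos` down by one."""
    with conn:  # commit both statements together, or roll both back
        conn.execute(
            "UPDATE items SET position = position + 1 WHERE position >= ?", (pos,)
        )
        conn.execute("INSERT INTO items VALUES (?, ?)", (name, pos))


def move(conn, name, new_pos):
    """Move an existing row to `new_pos`, renumbering only the rows in between."""
    (old_pos,) = conn.execute(
        "SELECT position FROM items WHERE name = ?", (name,)
    ).fetchone()
    with conn:
        if new_pos < old_pos:
            # Moving up: rows in [new_pos, old_pos) shift down by one.
            conn.execute(
                "UPDATE items SET position = position + 1 "
                "WHERE position >= ? AND position < ?",
                (new_pos, old_pos),
            )
        elif new_pos > old_pos:
            # Moving down: rows in (old_pos, new_pos] shift up by one.
            conn.execute(
                "UPDATE items SET position = position - 1 "
                "WHERE position > ? AND position <= ?",
                (old_pos, new_pos),
            )
        conn.execute("UPDATE items SET position = ? WHERE name = ?", (new_pos, name))


def names(conn):
    """Return the item names in display order."""
    return [row[0] for row in conn.execute("SELECT name FROM items ORDER BY position")]


insert_at(conn, "x", 2)  # order is now: a x b c d
move(conn, "d", 1)       # order is now: d a x b c
print(names(conn))
```

Both helpers assume the positions are a gap-free sequence starting at 1 and that no other writer changes the table between the SELECT and the UPDATE.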
I have a column in my database named position, used for ordering. However, when some records are deleted, the sequence gets messed up. I want to renumber this column when the table changes (maybe using a trigger).
position(old) -> position(new)
1 1
3 2
7 3
8 4
like this.
I think there will be no duplicate numbers even in position(old), because I have already attached a PHP function that reorders the column when updates occur. However, when a record is deleted because its parent was deleted, the function is not called.
Thanks for help!
If you are using the column just for ordering, you do not need to update the column on deletion, because the order will still be correct. And you will save some resources.
But if you really need to update by sequence, look at this answer:
updating columns with a sequence number mysql
I believe (as scrowler wrote) the better way in such case is to update rows from the application, after application deletes the parent record.
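If the renumbering is done from the application, one possible shape is sketched below with Python's sqlite3 module. The table records and the helper renumber are assumptions; the sample positions 1, 3, 7, 8 are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, position INTEGER)")
# Positions with gaps, as in the question: 1, 3, 7, 8.
conn.executemany(
    "INSERT INTO records (id, position) VALUES (?, ?)",
    [(10, 1), (11, 3), (12, 7), (13, 8)],
)


def renumber(conn):
    """Rewrite position as 1..n while keeping the current relative order."""
    ids = [row[0] for row in conn.execute("SELECT id FROM records ORDER BY position")]
    with conn:  # apply all updates in one transaction
        conn.executemany(
            "UPDATE records SET position = ? WHERE id = ?",
            [(n, record_id) for n, record_id in enumerate(ids, start=1)],
        )


renumber(conn)
positions = [
    row[0] for row in conn.execute("SELECT position FROM records ORDER BY position")
]
print(positions)
```

This can be called once after any batch of deletions, including cascaded ones, instead of reacting to each deleted row.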
If you decide to update it in the application then...
If position = n is deleted, your logic should be: set position = position - 1 where position > n
Please note that this will work only if you delete one record at a time from your application, and it assumes the data is already in sequence before the delete is triggered.
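The delete-then-shift logic above can be sketched like this, again with Python's sqlite3 module and a hypothetical records table and delete_at helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, position INTEGER)")
conn.executemany(
    "INSERT INTO records (id, position) VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)],
)


def delete_at(conn, n):
    """Delete the row at position n and close the gap it leaves behind."""
    with conn:  # delete and shift in one transaction
        conn.execute("DELETE FROM records WHERE position = ?", (n,))
        conn.execute(
            "UPDATE records SET position = position - 1 WHERE position > ?", (n,)
        )


delete_at(conn, 2)
remaining = conn.execute(
    "SELECT id, position FROM records ORDER BY position"
).fetchall()
print(remaining)
```

As the answer notes, this relies on the positions being sequential beforehand and on one row being removed per call.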
I need to do a count on the items in a joined result set where a condition is true. I thus have a "from join where where" type of expression. This expression must end with a select or groupby. I do not need the column data actually and figure it is thus faster not to select it:
count = (from e in dc.entries select new {}).Count();
I have 2 questions:
Is there a faster way to do this in terms of the db load?
I have to duplicate my entire copy of the query. Is there a way to structure my query where I can have it one place for both counts and for getting say a list with all fields?
Thanks.
Please pay especial attention:
The query is a join and not a simple table thus I must use a select statement.
I will need 2 different query bodies because I do not need to load all the actual fields for the count but will for the list.
I assume when I use the select query it is filling up with data when I use query.Count vs Table.Count. Look forward to those who understand what I'm asking for possible better ways to do this and some detailed knowledge of what actually happens. I need to pull out the logging to look into this deeper.
Queryable.Count
The query behavior that occurs as a result of executing an expression tree that represents calling Count<TSource>(IQueryable<TSource>) depends on the implementation of the type of the source parameter. The expected behavior is that it counts the number of items in source.
In fact, if you use LINQ to SQL or LINQ to Entities, Queryable.Count() is translated into SQL and run in the database. No columns are loaded into memory. Check the generated SQL to confirm.
I assume when I use the select query it is filling up with data when I use query.Count vs Table.Count
This is not true. Check the generated sql to confirm.
I have to duplicate my entire copy of the query. Is there a way to structure my query where I can have it one place for both counts and for getting say a list with all fields
If you need both the count and the list, get the list and count it.
If you need the count sometimes and other times you need the list... write a method that returns the complex IQueryable, and sometimes call .Count() and other times call .ToList();
I do not need the column data actually and figure it is thus faster not to select it.
This is basically false in your scenario. It can be true in a scenario where an index covers the result columns, but you don't have any result columns.
In your scenario, whatever index is chosen by the query optimizer, that index can be used to make the count.
Sum up: Query optimizer will perform the optimization you desire.
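The "write the query once, then either count it or list it" idea is not specific to LINQ. As a loose analogue (Python with sqlite3, not C#, and not a substitute for a method returning IQueryable), the shared join and filter can live in one place while the two callers differ only in what they select. All table, column and function names here are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entries (id INTEGER PRIMARY KEY, author_id INTEGER, published INTEGER);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO entries VALUES (1, 1, 1), (2, 1, 0), (3, 2, 1);
    """
)

# The join and filter are written once and shared by both queries.
QUERY_BODY = """
    FROM entries AS e
    JOIN authors AS a ON a.id = e.author_id
    WHERE e.published = ?
"""


def count_entries(conn, published):
    """Count matching rows in the database without fetching any columns."""
    (n,) = conn.execute("SELECT COUNT(*)" + QUERY_BODY, (published,)).fetchone()
    return n


def list_entries(conn, published):
    """Fetch the matching rows themselves."""
    return conn.execute(
        "SELECT e.id, a.name" + QUERY_BODY + " ORDER BY e.id", (published,)
    ).fetchall()


print(count_entries(conn, 1))
print(list_entries(conn, 1))
```

In the LINQ case the composition happens on the expression tree rather than on SQL text, but the effect described in the answer is the same: the count is computed by the database.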
// You can add a where condition here
var queryEntries = from e in dc.entries select e;
// Get the count (runs as a COUNT in the database, no rows are loaded)
int count = queryEntries.Count();
// Enumerate the same query, which loads all entries
foreach (entry en in queryEntries)
{
}
I have a query that fetches 100 rows of events ordered by timestamp. I want to ignore the top 2 entries of the result set. The problem is that there is no criterion to match on (I simply want to ignore the first 2 rows).
I am using a pager (Drupal) which shows 10 events per page. If I drop the rows after fetching 10, I lose 2 entries (the first page contains only 8 entries). How can I solve this problem?
If you are using Views, you can just set the offset to 2 which will ignore the first two records.
Use LIMIT with an offset of 2 (in MySQL the form is LIMIT offset, row_count), for example one of:
LIMIT 2,98
LIMIT 2,100
Add that to your SQL command, I think it should work.
You can't use offsets with pager_query() which I assume you're using here. Maybe you need to reconsider how you're querying? Maybe run a query for the first two records, and then in your pager SQL use a WHERE condition to exclude the IDs of the first two results.
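Both suggestions in this thread (a shifted offset, and excluding the first two IDs in the WHERE clause) can be compared in a small sketch. It uses Python's sqlite3 module with plain SQL and does not involve Drupal or pager_query(); the events table, the constants and the function names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER)")
# 30 hypothetical events; a lower id has a newer timestamp.
conn.executemany(
    "INSERT INTO events (id, ts) VALUES (?, ?)",
    [(i, 1000 - i) for i in range(1, 31)],
)

SKIP = 2        # rows to ignore at the top of the result set
PAGE_SIZE = 10  # rows per page


def page_by_offset(conn, page):
    """Page through events newest-first, shifting every page by SKIP rows."""
    offset = SKIP + page * PAGE_SIZE
    return [
        row[0]
        for row in conn.execute(
            "SELECT id FROM events ORDER BY ts DESC LIMIT ? OFFSET ?",
            (PAGE_SIZE, offset),
        )
    ]


def page_by_exclusion(conn, page):
    """Exclude the first SKIP ids in the WHERE clause, then page normally."""
    skipped = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM events ORDER BY ts DESC LIMIT ?", (SKIP,)
        )
    ]
    placeholders = ",".join("?" * len(skipped))
    sql = (
        f"SELECT id FROM events WHERE id NOT IN ({placeholders}) "
        "ORDER BY ts DESC LIMIT ? OFFSET ?"
    )
    return [
        row[0] for row in conn.execute(sql, (*skipped, PAGE_SIZE, page * PAGE_SIZE))
    ]


print(page_by_offset(conn, 0))
print(page_by_exclusion(conn, 0))
```

The exclusion variant costs one extra query, but the paged query keeps a standard page-sized LIMIT and offset, which is the property a pager that controls those values itself would need. Every page stays full because the skipped rows are removed before paging, not after.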