I seem to be having an issue when calling up the history of a versioned SQLAlchemy class.
I have the following inheritance structure set up:
Node(Versioned, Base)
Specimen(Node)
Animal(Specimen)
If I attempt to fetch the animal history using the query generated by:
AnimalHistory = self.__history_mapper__.class_
q = object_session(self).query(AnimalHistory).filter(AnimalHistory.id == self.id).order_by(AnimalHistory.version.desc())
logger.debug(q)
I get the following query:
SELECT bla bla #trimmed for brevity FROM node_history
JOIN specimen_history ON node_history.id = specimen_history.id AND node_history.version = specimen_history.version
JOIN animal_history ON specimen_history.id = animal_history.id
WHERE animal_history.id = 28
ORDER BY animal_history.version DESC
Basically, I seem to be missing the appropriate "AND" condition on the animal_history JOIN.
Because of this, I get an unwanted Cartesian product between animal_history and (specimen_history, node_history).
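That is, I expected the last join to look like the one above it:

JOIN animal_history ON specimen_history.id = animal_history.id AND specimen_history.version = animal_history.version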
Could anyone point out the modification needed inside history_meta.py in order to fix this?
Thanks !!
The answer was actually provided on the SQLAlchemy Google Groups:
https://groups.google.com/forum/#!topic/sqlalchemy/YVAI4C94NBs
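For future readers, the gist of the fix (a hedged sketch only, based on the history_meta.py recipe from the SQLAlchemy examples; the exact surrounding code differs between versions) is to give each subclass's history mapper an explicit inherit_condition joining on both id and version, instead of letting SQLAlchemy infer a join on id alone:

from sqlalchemy import and_
from sqlalchemy.orm import mapper

# Inside _history_mapper(), where the history mapper for a subclass is
# created. `table` is this class's *_history table, super_history_mapper
# is the parent class's history mapper, and versioned_cls/local_mapper
# are the names used by the example recipe.
super_table = super_history_mapper.local_table
m = mapper(
    versioned_cls,
    table,
    inherits=super_history_mapper,
    polymorphic_identity=local_mapper.polymorphic_identity,
    # join child history to parent history on id AND version
    inherit_condition=and_(
        table.c.id == super_table.c.id,
        table.c.version == super_table.c.version,
    ),
)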
Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source so it is possible to have duplicates for a single ID for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times)
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset(history, filesystem=filesystem, partitioning=partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy API so I can make use of
# partition_filename_cb - that way I can have one file per date_id, e.g.
# container/dataset/date_id=20210127/20210127.parquet
# (our visualization tool connects to these files directly)
pq.write_to_dataset(table, final, filesystem=filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this particular one isn't here yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    # pc.index returns the position of the *first* occurrence of each value,
    # so this keeps the first row per distinct value.
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
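Since the first occurrence wins, sort before deduplicating if you want the latest version per ID. A hypothetical usage (assumes an updated_at column and a pyarrow recent enough to have Table.sort_by):

table = table.sort_by([("updated_at", "descending")])
latest = drop_duplicates(table, "ID")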
//end edit
I saw your question because I had a similar one, and I solved it for my work. (Due to IP issues I can't post the whole code, but I'll try to answer as well as I can. I've never done this before.)
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
# map each distinct value in the column to its number of occurrences
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    # do stuff
I used 'value_counts' to find the unique values and how many of each there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or the last one by using numpy boolean arrays as a mask again.
After iterating, it was as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick & dirty, try the "unique" option (also in the same link I provided).
edit/extra: you can make it a bit faster / less memory intensive by keeping a numpy boolean mask up to date while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask); a rough sketch of what I mean is below.
I don't know how to calculate the speed though...
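Something like this (my guess at a concrete version of that idea; note this variant keeps the last occurrence per value):

import numpy as np
import pyarrow as pa

def drop_duplicates_keep_last(table: pa.Table, column_name: str) -> pa.Table:
    # Walk the column once, remembering the last row index seen per value,
    # then keep those rows with a single boolean-mask filter.
    last_index = {}
    for i, value in enumerate(table[column_name].to_pylist()):
        last_index[value] = i  # later occurrences overwrite earlier ones
    mask = np.full(len(table), False)
    mask[list(last_index.values())] = True
    return table.filter(mask=mask)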
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    # np.unique(..., return_index=True) yields the index of the first
    # occurrence of each value, so this also keeps the first row per value.
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't do combine_chunks(): there is a bug where Arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Build a global row index as a chunked array aligned with tb3's chunk
# layout, so it can be appended without combine_chunks().
chunk_lengths = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
offsets = np.cumsum([0] + chunk_lengths)[:-1]
index_chunks = [np.arange(n) + off for n, off in zip(chunk_lengths, offsets)]
tb3 = tb3.append_column('index', pa.chunked_array(index_chunks))
# For each ID, keep the first row (the one with the minimum index).
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found DuckDB gives better performance on the group-by. Changing the last two lines above into the following gives a 2x speedup:
import duckdb

duck = duckdb.connect()
# DuckDB's replacement scans let the SQL refer to the in-scope pyarrow table tb3 by name
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
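If you are taking the DuckDB route anyway, the whole dedupe can be done in one statement, without the helper index column. A sketch, assuming a DuckDB version recent enough to support QUALIFY (add an ORDER BY inside the OVER(...) if you need a specific row per ID rather than an arbitrary one):

import duckdb

duck = duckdb.connect()
# keep exactly one row per ID in a single statement
tb3 = duck.execute(
    "select * from tb3 qualify row_number() over (partition by ID) = 1"
).fetch_arrow_table()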
I have been revamping my webapp with Django 1.7. It has been a lot of fun... and tears and blood!
Thing is, I've been struggling for the last few days to do a simple LEFT JOIN to aggregate some values from a table with no FKs. The query result should go to a class-based view (DetailView). Believe me that I have searched and searched for an answer on the entire web (including the World Wide Web) to no avail.
You might ask: why is it that my tables have no foreign keys? Well, the original database design hadn't any, and now the tables hold hundreds of millions of rows. I could add FK constraints, but that would be costly, it would break things, and it would require remaking the entire scripts that do extraction and loading!
I thought of falling back to good old raw SQL because, according to Django,
raw() has a bunch of other options that make it very powerful...
Yeah, right. The truth is that model.objects.raw()'s power is limited, and it doesn't work for what I want to do (it just doesn't aggregate).
Tables/Models (simplified)
Table `customer` (customer_id, order_id)
Table `order` (order_id, order_name)
MySQL/Django query (simplified)
'SELECT a.order_id, SUM(a.order_value)
FROM order a
LEFT JOIN customer b
ON a.order_id = b.order_id
WHERE b.customer_id = %s', [customer_id]
It looks so innocent, right? Hell no! It's a Django nightmare! Of course, I could easily do that in Django with a __set, but alas, I don't have FKs.
On top of my problem I am trying to add aggregates to a context in my DetailView template. So I tried hacking it with View() and made a function inside my DetailView custom class:
def NewContextFTW(self):
    # here get the freaking queryset in my own terms
    return myhighlycomplexqueryset
And then in template:
{% for rows in view.NewContextFTW %}
    {{ rows.id }}
    {{ rows.sum_order_value }}
{% endfor %}
...but it failed.
Edit:
I found a solution today! And I want to share the love to the world! See my answer below.
This solution solved my problem. Hope it solves yours too.
Override get_context_data, execute the SQL directly through a connection, convert the resulting rows into dictionaries, add them to the context, enjoy:
def get_context_data(self, **kwargs):
    # get context from the class this very function belongs to
    context = super(MyClassView, self).get_context_data(**kwargs)
    user = self.user  # example
    cursor = connection.cursor()
    cursor.execute('SELECT a.order_id, SUM(a.order_value) '
                   'FROM order a LEFT JOIN customer b ON a.order_id = b.order_id '
                   'WHERE b.customer_id = %s', [user])
    d = dictfetchall(cursor)
    new_object_list = []
    for row in d:
        order_id = row['order_id']
        sum_order_value = row['SUM(a.order_value)']
        new_object_list.append(aggregated_row(order_id, sum_order_value))
    context['aggregated_values'] = new_object_list
    return context
To make that work don't forget to:
from django.views.generic import DetailView
from django.db import connection
...then define the following function to turn each row into a dictionary (this boilerplate comes from Django 1.7's own documentation):
def dictfetchall(cursor):
    "Returns all rows from a cursor as a dict"
    desc = cursor.description
    return [
        dict(zip([col[0] for col in desc], row))
        for row in cursor.fetchall()
    ]
...create a class to hold your rows as objects:
class aggregated_row(object):
    def __init__(self, order_id, sum_order_value):
        self.order_id = order_id
        self.sum_order_value = sum_order_value
...and last but not least, the template:
{% for rows in aggregated_values %}
    <tr>
        <td>{{ rows.order_id }}</td>
        <td>{{ rows.sum_order_value }}</td>
    </tr>
{% endfor %}
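Side note: if you alias the aggregate in the SQL, you can skip the wrapper class entirely, because Django's template dot-lookup also works on dictionary keys. A minimal sketch of that variant:

cursor.execute('SELECT a.order_id, SUM(a.order_value) AS sum_order_value '
               'FROM order a LEFT JOIN customer b ON a.order_id = b.order_id '
               'WHERE b.customer_id = %s', [user])
context['aggregated_values'] = dictfetchall(cursor)  # the dicts work in the template as-is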
Any hint on what's wrong with the query below?
return new ItemPricesViewModel()
{
    Source = (from o in XpoSession.Query<PRICE>()
              select new ItemPriceViewModel()
              {
                  ID = o.ITEM_ID.ITEM_ID,
                  ItemCod = o.ITEM_ID.ITEM_COD,
                  ItemModifier = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_COD,
                  ItemName = o.ITEM_ID.ITEM_COD,
                  ItemID = o.ITEM_ID.ITEM_ID,
                  ItemModifierID = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID,
                  ItemPrices = (from d in o
                                where d.ITEM_ID.ITEM_ID == o.ITEM_ID.ITEM_ID && d.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID == o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID
                                select new Price()
                                {
                                    ID = o.PRICE_ID,
                                    PriceList = o.PRICELIST_ID.PRICELIST_,
                                    Price = o.PRICE_
                                }).ToList()
              }).ToList()
};
The o in the subquery is marked in red, and I get the message "Could not find an implementation of the query pattern for source type . 'Where' not found."
I would like to have distinct ItemID, ItemModifier pairs: should I create a custom IEqualityComparer to do it?
Thank you!
It seems like XPO is not able to handle this scenario. For reference, this is what you could do with DbContext.
It sounds like maybe you want a GroupBy. Try something like this.
var result = dbContext.Prices
    .GroupBy(p => new { p.ItemName, p.ItemTypeName })
    .Select(g => new Item
    {
        ItemName = g.Key.ItemName,
        ItemTypeName = g.Key.ItemTypeName,
        Prices = g.Select(p => new Price
        {
            Price = p.Price
        }).ToList()
    })
    .Skip(x)
    .Take(y)
    .ToList();
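One caveat (an assumption on my side, since dbContext suggests Entity Framework): Skip/Take generally require a deterministic ordering, otherwise the provider throws, so you would add an ordering on the projected items just before the paging, e.g.:

    .OrderBy(i => i.ItemName)  // any stable ordering works
    .Skip(x)
    .Take(y)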
Probable cause
In general, XPO does not support "free joins" in most cases. It was explicitly written somewhere in their knowledge base or Q/A site; if I hit that article again, I'll include a link to it.
In your original code example, you were trying to perform a "free join" in the INNER query. The WHERE clause was doing a join-by-key, probably navigational, but it also contained an extra filter by "modifier", which probably is not part of the definition of the relation.
Moreover, the query tried to reuse the IQueryable<PRICE> o in the inner query - which actually seems somewhat supported by XPO - but if you ever added any prefiltering ('where') to the top-level o, it would have high odds of breaking again.
The docs state that XPO supports only navigational joins, along paths formed by properties and/or XPCollections defined in your XPObjects. This applies to XPO as a whole, so to XPQuery too. All other kinds of joins are called "free joins" and either:
are silently emulated by XPO by fetching the related objects, extracting key values from them, and rewriting the query into multiple roundtrips with a series of partial queries that fetch full objects with WHERE-id-IN-(#p0,#p1,#p2,...) - but this happens only in some of the simplest cases
or are "not fully supported", meaning they throw exceptions and require you to manually split the query or rephrase it
Possible direct solution schemes
If ITEM_ID is a relation and an XPCollection in the PRICE class, then you could rewrite your query so that it fetches PRICE objects and then builds up the result objects, initializing their fields from the PRICE objects' properties. Something like:
return new ItemPricesViewModel()
{
    Source = (from o in XpoSession.Query<PRICE>().AsEnumerable()
              select new ItemPriceViewModel()
              {
                  ID = o.ITEM_ID.ITEM_ID,
                  ItemCod = o.ITEM_ID.ITEM_COD,
                  ....
                  ItemModifierID = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID,
                  ItemPrices = (from d in o
                                where d.ITEM_ID.ITEM_ID == ....
                                select new Price()
                                .... .... ....
};
Note the AsEnumerable that breaks the query and ensures that the PRICE objects are fetched first, instead of XPO just trying to translate the whole query. It is very probable that this would "just work".
Also, splitting the query into explicit stages sometimes helps XPO analyze it:
return new ItemPricesViewModel()
{
    Source = (from o in XpoSession.Query<PRICE>()
              select new
              {
                  id = o.ITEM_ID.ITEM_ID,
                  itemcod = o.ITEM_ID.ITEM_COD,
                  ....
              }
             ).AsEnumerable()
              .Select(temp =>
                  new ItemPriceViewModel()
                  {
                      ID = temp.id,
                      ItemCod = temp.itemcod,
                      ....
                      ItemPrices = (from d in XpoSession.Query<PRICE>()
                                    where d.ITEM_ID.ITEM_ID == ....
                                    select new Price()
                                    .... .... ....
};
Here, note that I first fetch the item data from the server, then construct the items on the 'client', and then build the required groupings. Note that I could not refer to the variable o anymore. In this precise case, unsurprisingly, the second (split) version would probably be even slower than the first, since it would fetch all PRICEs and then refetch the groupings through additional queries, while the first one would just fetch all PRICEs and then calculate the groups in-memory based on the PRICEs already fetched. This is not a side effect of my laziness; it is a common pitfall when rewriting LINQ queries, so I included it as a warning :)
Both of these code examples are NOT RECOMMENDED for your case, as they would probably have very poor performance, especially if you have many PRICEs in the table, which is highly likely. I included them only as an example of how you could rewrite the query to simplify its structure so that XPO can eat it without choking. However, you have to be really careful and pay attention to little details, as you can very easily spoil the performance.
Observations and real solution
However, it is worth noting that they are not that much worse than your original query. It was itself quite poor, since it tried to perform something near O(N^2) row-fetches from the table just to group the rows by "ITEM_ID" and then format the results as separate objects. Properly done, it would be something like O(N lg N) + O(N), so regardless of what is or isn't supported, your alternate attempt with GroupBy is surely a much better approach, and I'm really glad you found it yourself.
Very often, when you are trying to split/simplify XPQuery expressions as I did above, you implicitly rethink the problem and find an easier and simpler way to express a query that initially was not supported or just crashed.
Unfortunately, your query was in fact quite simple. For really complex queries that cannot be "just rephrased", splitting into stages and doing some of the join-filter work on the 'client' is unavoidable.. But again, doing that on XPCollections or XPViews with CriteriaOperators is impossible too, so either we have to bear with it or use plain direct handcrafted SQL.
Sidenote:
XPO as a whole has problems with "free joins": they are "not fully supported" not only in XPQuery - there's not much support for them in XPCollection, XPView, CriteriaOperators, etc., either. It is also worth noting that, at least in "my version" of DX11, XPQuery has very poor LINQ support overall.
I've hit many cases where a proper LINQ query was:
throwing "NotSupportedException", mostly in FreeJoins, but also very often with complex GroupBy or Select-with-Projection, GroupJoin, and many others - sometimes even Distinct(!) seemed to malfunction
throwing "NullReferenceExceptions" at some proper type conversions (XPO tried to interprete a column that held INT/NULL as an object..), often I had to write some completely odd and artificial expressions like foo!=null && foo.bar!=123 instead of foo = 123 despite the 'foo' being an public int Foo {get;set;}, all because the DX could not cope properly with NULLs in the database (because XPO created nullable-INT column for this property.. but that's another story)
throwing other random ArgumentException/InvalidOperation exceptions from other constructs
or even analyzing the query structure improperly, for example this one is usually valid:
session.Query<ABC>()
.Where( abc => abc.foo == "somefilter" )
.Select( abc => new { first = abc, b = abc } )
.ToArray();
but things like this one usually throw:
session.Query<ABC>()
.Select( abc => new { first = abc, b = abc } )
.Where ( temp => temp.first.foo == "somefilter" )
.ToArray();
but this one is valid:
session.Query<ABC>()
.Select( abc => new { first = abc, b = abc } )
.ToArray()
.Where ( temp => temp.first.foo == "somefilter" )
.ToArray();
The middle code example usually throws an error revealing that the XPO layer was trying to find the ".first.foo" path inside the ABC class, which is obviously wrong, since at that point the element type isn't ABC anymore but an anonymous class.
Disclaimer
I've already noted this, but let me repeat: these observations relate to DX11 and most probably earlier versions too. I do not know how much of that has been fixed in DX12 and above (if anything at all was!).
I've stumbled upon a very strange LINQ to SQL behaviour / bug, that I just can't understand.
Let's take the following tables as an example: Customers -> Orders -> Details.
Each table is a subtable of the previous table, with a regular Primary-Foreign key relationship (1 to many).
If I execute the follow query:
var q = from c in context.Customers
        select (c.Orders.FirstOrDefault() ?? new Order()).Details.Count();
Then I get an exception: Could not format node 'Value' for execution as SQL.
But the following queries do not throw an exception:
var q = from c in context.Customers
        select (c.Orders.FirstOrDefault() ?? new Order()).OrderDateTime;
var q = from c in context.Customers
        select (new Order()).Details.Count();
If I change my primary query as follows, I don't get an exception:
var q = from c in context.Customers.ToList()
        select (c.Orders.FirstOrDefault() ?? new Order()).Details.Count();
Now, I could understand why the last query works, because of the following logic:
since there is no mapping of "new Order()" to SQL (I'm guessing here), I need to work on a local list instead.
But what I can't understand is: why do the other two queries work?!?
I could potentially accept working with the "local" version of context.Customers.ToList(), but how would I then speed up the query?
For instance, in the last query example, I'm pretty sure that each select will cause a new SQL query to be executed to retrieve the Orders. Now, I could avoid lazy loading by using DataLoadOptions, but then I would be retrieving thousands of Order rows for no reason whatsoever (I only need the first row)...
If I could execute the entire query in one SQL statement as I would like (my first query example), then the SQL engine itself would be smart enough to retrieve only one Order row for each Customer...
Is there perhaps a way to rewrite my original query in such a way that it will work as intended and be executed in one swoop by the SQL server?
EDIT:
(longer answer for Arturo)
The queries I provided are purely for example purposes. I know they are pointless in their own right, I just wanted to show a simplistic example.
The reason your example works is that you have avoided using "new Order()" altogether. If I slightly modify your query to still use it, then I still get an exception:
var results = from e in (from c in db.Customers
                         select new { c.CustomerID, FirstOrder = c.Orders.FirstOrDefault() })
              select new { e.CustomerID, Count = (e.FirstOrder != null ? e.FirstOrder : new Order()).Details.Count() };
Although this time the exception is slightly different - Could not format node 'ClientQuery' for execution as SQL.
If I use the ?? syntax instead of (x ? y : z) in that query, I get the same exception as I originally got.
In my real-life query I don't need Count(); I need to select a couple of properties from the last table (which in my previous examples would be Details). Essentially, I need to merge the values of all the rows in each table. In order to give a heftier example, I'll first have to restate my tables:
Models -> ModelCategoryVariations <- CategoryVariations -> CategoryVariationItems -> ModelModuleCategoryVariationItemAmounts -> ModelModuleCategoryVariationItemAmountValueChanges
The -> sign represents a 1 -> many relationship. Do notice that one of the signs points the other way round...
My real query would go something like this:
var q = from m in context.Models
        from mcv in m.ModelCategoryVariations
        ... // select some more tables
        select new
        {
            ModelId = m.Id,
            ModelName = m.Name,
            CategoryVariationName = mcv.CategoryVariation.Name,
            ..., // values from other tables
            Categories = (from cvi in mcv.CategoryVariation.CategoryVariationItems
                          let mmcvia = cvi.ModelModuleCategoryVariationItemAmounts.SingleOrDefault(mmcvia2 => mmcvia2.ModelModuleId == m.ModelModuleId) ?? new ModelModuleCategoryVariationItemAmount()
                          select new
                          {
                              cvi.Id,
                              Amount = (mmcvia.ModelModuleCategoryVariationItemAmountValueChanges.FirstOrDefault() ?? new ModelModuleCategoryVariationItemAmountValueChange()).Amount
                              ... // select some more properties
                          })
        };
This query blows up at the line let mmcvia =.
If I recall correctly, by using let mmcvia = new ModelModuleCategoryVariationItemAmount(), the query would instead blow up at the next ?? operand, which is at Amount =.
If I start the query with from m in context.Models.ToList() then everything works...
Why are you looking at only the individual count, without selecting anything related to the customer?
You can do the following.
var results = from e in (from c in db.Customers
                         select new { c.CustomerID, FirstOrder = c.Orders.FirstOrDefault() })
              select new { e.CustomerID, DetailCount = e.FirstOrder != null ? e.FirstOrder.Details.Count() : 0 };
EDIT:
OK, I think you are overcomplicating your query.
The problem is that you are using new WhateverObject() in your query; T-SQL doesn't know anything about that. T-SQL knows about records on your hard drive, and you are throwing in something that doesn't exist; only C# knows about that. DON'T USE new IN YOUR QUERIES OTHER THAN IN THE OUTERMOST SELECT STATEMENT, because that outermost projection is what C# receives, and C# knows how to create new instances of objects.
Of course it's going to work if you use the ToList() method, but performance is affected, because now you have your application host and SQL Server working together to give you the results, and it might take many calls to your database instead of one.
Try this instead:
Categories = (from cvi in mcv.CategoryVariation.CategoryVariationItems
              let mmcvia =
                  cvi.ModelModuleCategoryVariationItemAmounts.SingleOrDefault(
                      mmcvia2 => mmcvia2.ModelModuleId == m.ModelModuleId)
              select new
              {
                  cvi.Id,
                  Amount = mmcvia != null
                      ? mmcvia.ModelModuleCategoryVariationItemAmountValueChanges.Select(
                            x => x.Amount).FirstOrDefault()
                      : 0
                  ... // select some more properties
              })
Using the Select() method allows you to get the first Amount or its default value without constructing a new entity. I used 0 as an example only; I don't know what your default value for Amount is.
So, I'm trying to return a collection of People whose ID is contained within a locally created collection of ids (IQueryable).
By "locally created collection", I mean that the Ids collection hasn't come from a LinqToSql query and has been created programmatically (based upon user input).
My query looks like this:
var qry = from p in DBContext.People
          where Ids.Contains(p.ID)
          select p.ID;
This causes the following exception...
"queries with local collections are not supported"
How can I find all the People with an id that is contained within my locally created Ids collection?
Is it possible using LinqToSql?
If Ids is a List, array or similar, L2S will translate it into an IN clause.
If Ids is an IQueryable, just turn it into a list before using it in the query. E.g.:
List<int> listOfIDs = IDs.ToList();
var query = from st in dc.SomeTable
            where listOfIDs.Contains(st.ID)
            select .....
I was struggling with this problem as well. I solved it by using Any() instead:
people.Where(x => ids.Any(id => id == x.ID))
As the guys mentioned above, converting ids (of type IQueryable) to a List or array will solve the issue; it gets translated to the "IN" operator in SQL. But be careful: if the count of ids is >= 2100, you will hit another issue, "The server supports a maximum of 2100 parameters", which is the maximum number of values you can pass to "IN" in SQL Server. One way around that is to batch the lookups, as sketched below.
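A minimal sketch of that batching idea (dc.People and listOfIDs are placeholder names carried over from the earlier examples):

// query in batches small enough to stay under the 2100-parameter limit
const int batchSize = 2000;
var people = new List<Person>();
for (int i = 0; i < listOfIDs.Count; i += batchSize)
{
    var batch = listOfIDs.Skip(i).Take(batchSize).ToList();
    people.AddRange(dc.People.Where(p => batch.Contains(p.ID)));
}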
Another alternative would be keeping ids as an IQueryable and using the LINQ "Any" operator instead of "Contains"; this will be translated to "EXISTS" in SQL Server.
I'm sorry, but the answers here didn't work for me, as I'm doing dynamic types further along.
What I did was use Union in a loop, which works great. Here's how:
var firstID = cityList.First().id;
var cities = dc.zs_Cities.Where(c => c.id == firstID);
foreach (var c in cityList)
{
    var tempCity = c;  // local copy avoids the foreach-variable closure pitfall
    cities = cities.Union(dc.zs_Cities.Where(cty => cty.id == tempCity.id));
}