Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source so it is possible to have duplicates for a single ID for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times)
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionalities, though this one isn't here yet. My approach now would be:
def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
unique_values = pc.unique(table[column_name])
unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
mask = np.full((len(table)), False)
mask[unique_indices] = True
return table.filter(mask=mask)
//end edit
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code but I'll try to answer as well as I can. I've never done this before)
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
# do stuff
I used the 'value_counts' to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the value was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more then 1 I selected either the first or last one by using numpy boolean arrays as a mask again.
After iterating it was just as simple as pa.concat_tables(tables)
warning: this is a slow process. If you need something quick&dirty, try the "Unique" option (also in the same link I provided).
edit/extra:: you can make it a bit faster/less memory intensive by keeping up a numpy array of boolean masks while iterating over the dictionary. then in the end you return a "table.filter(mask=boolean_mask)".
I don't know how to calculate the speed though...
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) ->pa.Table:
column_array = table.column(col_name)
mask_x = np.full((table.shape[0]), False)
_, mask_indices = np.unique(np.array(column_array), return_index=True)
mask_x[mask_indices] = True
return table.filter(mask=mask_x)
The following gives a good performance. About 2mins for a table with half billion rows. The reason I don't do combine_chunks(): there is a bug, arrow seems can not combine chunk arrays if there size are too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0]+a)[:-1]
c = pa.chunked_array(c+np.cumsum(a))
tb3= tb3.set_column(tb3.shape[1], 'index', c)
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found duckdb can give better performance on group by. Change the last 2 lines above into the following will give 2X speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
Related
I'm trying to break this problem down into manageable parts: Spatial Query.
I want to create an automation script that will put a work order's LatitudeY coordinate in the work order's DESCRIPTION field.
I understand that a work order's coordinates are not stored in the WORKORDER table; they're stored in the WOSERVICEADDRESS table.
Therefore, I believe the script needs to reference a relationship in the Database Configuration application that will point to the related table.
How can I do this?
(Maximo 7.6.1.1)
You can get the related Mbo and get the values from the related Mbo and use it as shown in the below code. By getting the related Mbo you can also alter it's attributes.
from psdi.mbo import MboConstants
serviceAddressSet = mbo.getMboSet("SERVICEADDRESS")
if(serviceAddressSet.count() > 0):
serviceAddressMbo = serviceAddressSet.moveFirst()
latitudeY = serviceAddressMbo.getString("LATITUDEY")
longitudeX = serviceAddressMbo.getString("LONGITUDEX")
mbo.setValue("DESCRIPTION","%s, %s" % (longitudeX, latitudeY),MboConstants.NOACCESSCHECK)
serviceAddressSet.close()
I've got a sample script that compiles successfully:
from psdi.mbo import MboConstants
wonum = mbo.getString("WONUM")
mbo.setValue("DESCRIPTION",wonum,MboConstants.NOACCESSCHECK)
I can change it to get the LatitudeY value via the SERVICEADDRESS relationship:
from psdi.mbo import MboConstants
laty = mbo.getString("SERVICEADDRESS.LatitudeY")
longx = mbo.getString("SERVICEADDRESS.LONGITUDEX")
mbo.setValue("DESCRIPTION",laty + ", " + longx,MboConstants.NOACCESSCHECK)
This appears to work.
I have a work order in Maximo 7.6.1.1:
The WO has LatitudeY and LongitudeX coordinates in the Service Address tab.
The WO has a custom zone field.
And there is a feature class (polygons) in a separate GIS database.
I want to do spatial query to return an attribute from the polygon record that the WO intersects and use it to populate zone in the WO.
How can I do this?
Related keyword: Maximo Spatial
To do this live in Maximo using an automation script is possible or by writing custom code into Spatial (more challenging). You want to use the /MapServer/identify tool and post the geometry xy, coordinate system, and the layer you want to query. identify window
You will have to format the geometry object correctly and test your post from the window. I usually grab the post from the network section of developer tools once I get it to work and change the output format to json and use it in my code.
You may actually not need to touch your Maximo environment at all. How about just using a trigger on your work orders table ? That trigger can then automatically fill the zone ID from a simple select statement that matches x and y with the zones in the zones table. Here is how that could look like.
This assumes that your work orders are in a table like this:
create table work_orders (
wo_id number primary key,
x number,
y number,
zone_id number
);
and the zones in a table like this
create table zones (
zone_id number primary key,
shape st_geometry
)
Then the trigger would be like this
create or replace trigger work_orders_fill_zone
before insert or update of x,y on work_orders
for each row
begin
select zone_id
into :new.zone_id
from zones
where sde.st_contains (zone_shape, sde.st_point (:new.x, :new.y, 4326) ) = 1;
end;
/
Some assumptions:
The x and y columns contain coordinates in WGS84 longitude/latitude (not in some projection or some other long/lat coordinate system)
Zones don't overlap: a work order point is always therefore in one and only one zone. If not, then the query may return multiple results, which you then need to handle.
Zones fully cover the territory your work orders can take place in. If a work order location can be outside all your zones, then you also need to handle that (the query would return no result).
The x and y columns are always filled. If they are optional, then you also need to handle that case (set zone_id to NULL if either x or y is NULL)
After that, each time a new work order is inserted in the work_orders table, the zone_id column will be automatically updated.
You can initialize zone_id in your existing work orders with a simple update:
update work_orders set x=x, y=y;
This will make the trigger run for each row in the table ... It may take some time to complete if the table is large.
Adapt the code in the Library Scripts section of Maximo 76 Scripting Features (pdf):
#What the script does:
# 1. Takes the X&Y coordinates of a work order in Maximo
# 2. Generates a URL from the coordinates
# 3. Executes the URL via a separate script/library (LIB_HTTPCLIENT)
# 4. Performs a spatial query in an ESRI REST feature service (a separate GIS system)
# 5. Returns JSON text to Maximo with the attributes of the zone that the work
# order intersected
# 6. Parses the zone number from the JSON text
# 7. Inserts the zone number into the work order record
from psdi.mbo import MboConstants
from java.util import HashMap
from com.ibm.json.java import JSONObject
field_to_update = "ZONE"
gis_field_name = "ROADS_ZONE"
def get_coords():
"""
Get the y and x coordinates(UTM projection) from the WOSERVICEADDRESS table
via the SERVICEADDRESS system relationship.
The datatype of the LatitdeY and LongitudeX fields is decimal.
"""
laty = mbo.getDouble("SERVICEADDRESS.LatitudeY")
longx = mbo.getDouble("SERVICEADDRESS.LongitudeX")
#Test values
#laty = 4444444.7001941890
#longx = 666666.0312127020
return laty, longx
def is_latlong_valid(laty, longx):
#Verify if the numbers are legitimate UTM coordinates
return (4000000 <= laty <= 5000000 and
600000 <= longx <= 700000)
def make_url(laty, longx, gis_field_name):
"""
Assembles the URL (including the longx and the laty).
Note: The coordinates are flipped in the url.
"""
url = (
"http://hostname.port"
"/arcgis/rest/services/Example"
"/Zones/MapServer/15/query?"
"geometry={0}%2C{1}&"
"geometryType=esriGeometryPoint&"
"spatialRel=esriSpatialRelIntersects&"
"outFields={2}&"
"returnGeometry=false&"
"f=pjson"
).format(longx, laty, gis_field_name)
return url
def fetch_zone(url):
# Get the JSON text from the feature service (the JSON text contains the zone value).
ctx = HashMap()
ctx.put("url", url)
service.invokeScript("LIBHTTPCLIENT", ctx)
json_text = str(ctx.get("response"))
# Parse the zone value from the JSON text
obj = JSONObject.parse(json_text)
parsed_val = obj.get("features")[0].get("attributes").get(gis_field_name)
return parsed_val
try:
laty, longx = get_coords()
if not is_latlong_valid(laty, longx):
service.log('Invalid coordinates')
else:
url = make_url(laty, longx, gis_field_name)
zone = fetch_zone(url)
#Insert the zone value into the zone field in the work order
mbo.setValue(field_to_update, zone, MboConstants.NOACCESSCHECK)
service.log(zone)
except:
#If the script fails, then set the field value to null.
mbo.setValue(field_to_update, None, MboConstants.NOACCESSCHECK)
service.log("An exception occurred")
LIBHTTPCLIENT: (a reusable Jython library script)
from psdi.iface.router import HTTPHandler
from java.util import HashMap
from java.lang import String
handler = HTTPHandler()
map = HashMap()
map.put("URL", url)
map.put("HTTPMETHOD", "GET")
responseBytes = handler.invoke(map, None)
response = String(responseBytes, "utf-8")
I have a script that scans a DynamoDB table that stores my instance IDs. Then I try to query another table to see if it also has that same instance and get all of the metadata attributes in a master table. When I iterate through the query using the instance ID from the initial scan of the first table, I am noticing each character of the instance id string is being printed to a new line, instead of the entire string on one line. I am confused how to fix this. Below is my code, sample output, and the expected output.
CODE:
import boto3
import json
from boto3.dynamodb.conditions import Key, Attr
def table_diff():
dynamo = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')
table_missing = dynamodb.Table('RunningInstances')
missing_response = dynamo.scan(TableName='CWPMissingAgent')
for instances in missing_response['Items']:
instance_id = instances['missing_instances']['S']
# This works how I want, prints i-xxxxx
print(instance_id)
for id in instance_id:
# This does not print how I want (vertically)
print(id)
query_response = table_missing.query(KeyConditionExpression=Key('ID').eq(id))
OUTPUT:
i
-
x
x
x
x
x
EXPECTED OUTPUT:
i-xxxxx
etc etc
instance_id is a string. Thus, when you loop over it (for id in instance_id), you are actually looping over each character in the string, and printing them out individually.
Why do you try to loop over it, when you say that just printing it produces the correct result?
I'd like to update a table with Django - something like this in raw SQL:
update tbl_name set name = 'foo' where name = 'bar'
My first result is something like this - but that's nasty, isn't it?
list = ModelClass.objects.filter(name = 'bar')
for obj in list:
obj.name = 'foo'
obj.save()
Is there a more elegant way?
Update:
Django 2.2 version now has a bulk_update.
Old answer:
Refer to the following django documentation section
Updating multiple objects at once
In short you should be able to use:
ModelClass.objects.filter(name='bar').update(name="foo")
You can also use F objects to do things like incrementing rows:
from django.db.models import F
Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)
See the documentation.
However, note that:
This won't use ModelClass.save method (so if you have some logic inside it won't be triggered).
No django signals will be emitted.
You can't perform an .update() on a sliced QuerySet, it must be on an original QuerySet so you'll need to lean on the .filter() and .exclude() methods.
Consider using django-bulk-update found here on GitHub.
Install: pip install django-bulk-update
Implement: (code taken directly from projects ReadMe file)
from bulk_update.helper import bulk_update
random_names = ['Walter', 'The Dude', 'Donny', 'Jesus']
people = Person.objects.all()
for person in people:
r = random.randrange(4)
person.name = random_names[r]
bulk_update(people) # updates all columns using the default db
Update: As Marc points out in the comments this is not suitable for updating thousands of rows at once. Though it is suitable for smaller batches 10's to 100's. The size of the batch that is right for you depends on your CPU and query complexity. This tool is more like a wheel barrow than a dump truck.
Django 2.2 version now has a bulk_update method (release notes).
https://docs.djangoproject.com/en/stable/ref/models/querysets/#bulk-update
Example:
# get a pk: record dictionary of existing records
updates = YourModel.objects.filter(...).in_bulk()
....
# do something with the updates dict
....
if hasattr(YourModel.objects, 'bulk_update') and updates:
# Use the new method
YourModel.objects.bulk_update(updates.values(), [list the fields to update], batch_size=100)
else:
# The old & slow way
with transaction.atomic():
for obj in updates.values():
obj.save(update_fields=[list the fields to update])
If you want to set the same value on a collection of rows, you can use the update() method combined with any query term to update all rows in one query:
some_list = ModelClass.objects.filter(some condition).values('id')
ModelClass.objects.filter(pk__in=some_list).update(foo=bar)
If you want to update a collection of rows with different values depending on some condition, you can in best case batch the updates according to values. Let's say you have 1000 rows where you want to set a column to one of X values, then you could prepare the batches beforehand and then only run X update-queries (each essentially having the form of the first example above) + the initial SELECT-query.
If every row requires a unique value there is no way to avoid one query per update. Perhaps look into other architectures like CQRS/Event sourcing if you need performance in this latter case.
Here is a useful content which i found in internet regarding the above question
https://www.sankalpjonna.com/learn-django/running-a-bulk-update-with-django
The inefficient way
model_qs= ModelClass.objects.filter(name = 'bar')
for obj in model_qs:
obj.name = 'foo'
obj.save()
The efficient way
ModelClass.objects.filter(name = 'bar').update(name="foo") # for single value 'foo' or add loop
Using bulk_update
update_list = []
model_qs= ModelClass.objects.filter(name = 'bar')
for model_obj in model_qs:
model_obj.name = "foo" # Or what ever the value is for simplicty im providing foo only
update_list.append(model_obj)
ModelClass.objects.bulk_update(update_list,['name'])
Using an atomic transaction
from django.db import transaction
with transaction.atomic():
model_qs = ModelClass.objects.filter(name = 'bar')
for obj in model_qs:
ModelClass.objects.filter(name = 'bar').update(name="foo")
Any Up Votes ? Thanks in advance : Thank you for keep an attention ;)
To update with same value we can simply use this
ModelClass.objects.filter(name = 'bar').update(name='foo')
To update with different values
ob_list = ModelClass.objects.filter(name = 'bar')
obj_to_be_update = []
for obj in obj_list:
obj.name = "Dear "+obj.name
obj_to_be_update.append(obj)
ModelClass.objects.bulk_update(obj_to_be_update, ['name'], batch_size=1000)
It won't trigger save signal every time instead we keep all the objects to be updated on the list and trigger update signal at once.
IT returns number of objects are updated in table.
update_counts = ModelClass.objects.filter(name='bar').update(name="foo")
You can refer this link to get more information on bulk update and create.
Bulk update and Create
any hint on what's wrong with the below query?
return new ItemPricesViewModel()
{
Source = (from o in XpoSession.Query<PRICE>()
select new ItemPriceViewModel()
{
ID = o.ITEM_ID.ITEM_ID,
ItemCod = o.ITEM_ID.ITEM_COD,
ItemModifier = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_COD,
ItemName = o.ITEM_ID.ITEM_COD,
ItemID = o.ITEM_ID.ITEM_ID,
ItemModifierID = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID,
ItemPrices = (from d in o
where d.ITEM_ID.ITEM_ID == o.ITEM_ID.ITEM_ID && d.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID == o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID
select new Price()
{
ID = o.PRICE_ID,
PriceList = o.PRICELIST_ID.PRICELIST_,
Price = o.PRICE_
}).ToList()
}).ToList()
};
o in subquery is in read and I got the message "Could not find an implementation of the query pattern for source type . 'Where' not found."
I would like to have distinct ItemID, ItemModifier: should I create a custom IEqualityComparer to do it?
Thank you!
It seems like XPO it's not able to respond to this scenario. For reference this is what you could do with DbContext.
It sounds like maybe you want a GroupBy. Try something like this.
var result = dbContext.Prices
.GroupBy(p => new {p.ItemName, p.ItemTypeName)
.Select(g => new Item
{
ItemName = g.Key.ItemName,
ItemTypeName = g.Key.ItemTypeName,
Prices = g.Select(p => new Price
{
Price = p.Price
}
).ToList()
})
.Skip(x)
.Take(y)
.ToList();
Probable cause
In general, XPO does not support "free joins" in most of the cases. It was explicitelly written somewhere in their knowledgebase or Q/A site. If I hit that article again, I'll include a link to it.
In your original code example, you were trying to perform a "free join" in the INNER query. The 'WHERE' clause was doing a join-by-key, probably navigational, but also it contained an extra filter by "modifier" which probably is not a part of the definition of the relation.
Moreover, the query tried to reuse the IQueryable<PRICE> o in the inner query - what actually seems somewhat supported by XPO - but if you ever add any prefiltering ('where') to the toplevel 'o', it would have high odds of breaking again.
The docs state that XPO supports only navigational joins, along paths formed by properties and/or xpcollections defined in your XPObjects. This applies to XPO as whole, so XPQuery too. All other kinds of joins are called "free joins" and either:
are silently emulated by XPO by fetching related objects, extracting key values from them and rewriting the query into a multiple roundtrips with a series of partial queries that fetch full objects with WHERE-id-IN-(#p0,#p1,#p2,...) - but this happens only in the some simpliest cases
or are "not fully supported", meaning they throw exceptions and require you to manually split the query or rephrase it
Possible direct solution schemes
If ITEM_ID is a relation and XPCollection in PRICE class, then you could rewrite your query so that it fetches a PRICE object then builds up a result object and initializes its fields with PRICE object's properties. Something like:
return new ItemPricesViewModel()
{
Source = (from o in XpoSession.Query<PRICE>().AsEnumerable()
select new ItemPriceViewModel()
{
ID = o.ITEM_ID.ITEM_ID,
ItemCod = o.ITEM_ID.ITEM_COD,
....
ItemModifierID = o.ITEM_MODIFIER_ID.ITEM_MODIFIER_ID,
ItemPrices = (from d in o
where d.ITEM_ID.ITEM_ID == ....
select new Price()
.... .... ....
};
Note the 'AsEnumerable' that breaks the query and ensures that PRICE objects are first fetched instead of just trying to translate the query. Very probable that this would "just work".
Also, splitting the query into explicit stages sometimes help the XPO to analyze it:
return new ItemPricesViewModel()
{
Source = (from o in XpoSession.Query<PRICE>()
select new
{
id = o.ITEM_ID.ITEM_ID,
itemcod = o.ITEM_ID.ITEM_COD,
....
}
).AsEnumerable()
.Select(temp =>
select new ItemPriceViewModel()
{
ID = temp.id
ItemCod = temp.itemcod,
....
ItemPrices = (from d in XpoSession.Query<PRICE>()
where d.ITEM_ID.ITEM_ID == ....
select new Price()
.... .... ....
};
Here, note that I first fetch the item-data from server, and then conctruct the item on the 'client', and then build the required groupings. Note that I could not refer to the variable o anymore. In these precise case and examples, unsuprisingly, the second one (splitted) would be probably even slower than the first one, since it would fetch all PRICEs and then refetch the groupings through additional queries, while the first one would just fetch all PRICEs and then would calculate the groups in-memory basing on the PRICEs already fetched. This is not an a sideeffect of my laziness, but it is a common pitfall when rewriting the LINQ queries, so I included it as a warning :)
Both of these code examples are NOT RECOMMENDED for your case, as they would probably have very poor performace, especially if you have many PRICEs in the table, which is highly likely. I included them to present as only an example of how you could rewrite the query to siplify its structure so the XPO can eat it without choking. However, you have to be really careful and pay attention to little details, as you can very easily spoil the performance.
observations and real solution
However, it is worth noting that they are not that much worse than your original query. It was itself quite poor, since it tried to perform something near O(N^2) row-fetches from the table just to perform to group te rows by "ITEM_ID" and then formatting the results as separate objects. Properly done, it would be something like O(N lg N)+O(N), so regardless of being supported or not, your alternate attempt with GroupBy is surely a much better approach, and I'm really glad you found it yourself.
Very often when you are trying to split/simplify the XPQuery expressions as I did above, you implicitely rethink the problem and find an easier and simplier way to express the query that was initially not-supported or just were crashing.
Unfortunatelly, your query was in fact quite simple. For a really complex queries that cannot be "just rephrased", splitting into stages and making some of the join-filter work at 'client' is unavoidable.. But again, doing them on XPCollections or XPViews with CritieriaOperators is impossible too, so either we have to bear with it or use plain direct handcrafted SQL..
Sidenote:
Whole XPO has problems with "free joins", they are "not fully supported" not only in XPQuery, but also there's not much for them in XPCollection, XPView, CriteriaOperators, etc, too. But, it is worth noting that at least in "my version" of DX11, the XPQuery has very poor LINQ support at all.
I've hit many cases where a proper LINQ query was:
throwing "NotSupportedException", mostly in FreeJoins, but also very often with complex GroupBy or Select-with-Projection, GroupJoin, and many others - sometimes even Distinct(!) seemed to malfunction
throwing "NullReferenceExceptions" at some proper type conversions (XPO tried to interprete a column that held INT/NULL as an object..), often I had to write some completely odd and artificial expressions like foo!=null && foo.bar!=123 instead of foo = 123 despite the 'foo' being an public int Foo {get;set;}, all because the DX could not cope properly with NULLs in the database (because XPO created nullable-INT column for this property.. but that's another story)
throwing other random ArgumentException/InvalidOperation exceptions from other constructs
or even analyzing the query structure improperly, for example this one is usually valid:
session.Query<ABC>()
.Where( abc => abc.foo == "somefilter" )
.Select( abc => new { first = abc, b = abc } )
.ToArray();
but things like this one usually throws:
session.Query<ABC>()
.Select( abc => new { first = abc, b = abc } )
.Where ( temp => temp.first.foo == "somefilter" )
.ToArray();
but this one is valid:
session.Query<ABC>()
.Select( abc => new { first = abc, b = abc } )
.ToArray()
.Where ( temp => temp.first.foo == "somefilter" )
.ToArray();
The middle code example usually throws with an error that reveals that XPO layer were trying to find ".first.foo" path inside the ABC class, which is obviously wrong since at that point the element type isn't ABC anymore but instead a' anonymous class.
disclaimer
I've already noted that, but let me repeat: these observations are related to DX11 and most probably also earlier. I do not know what of that was fixed in DX12 and above (if anything at all was!).