SQLAlchemy: How to check for existence in the Database

Is there a better way than doing this with SQLAlchemy?

def has_keyword(self, kw):
    s = self.sessionmaker()
    return 0 < s.query(Keyword).filter(Keyword.word == kw.word)

You can make the query marginally more optimal by adding .count() to the end, e.g.:

return 0 < s.query(Keyword).filter(Keyword.word == kw.word).count()
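If you only need a true/false answer, another common pattern is an EXISTS query, which can stop at the first match instead of counting every row. A small sketch, assuming the same Keyword model with a word column:

from sqlalchemy import exists

def has_keyword(self, kw):
    s = self.sessionmaker()
    # issues a SELECT EXISTS(...) and returns True/False
    return s.query(exists().where(Keyword.word == kw.word)).scalar()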

A Pythonic way I like to do it:

from sqlalchemy.orm.exc import NoResultFound, MultipleResultsFound

query = session.query(Model).filter( ... )
try:
    model = query.one()
except NoResultFound:
    # it does not exist!
    ...
except MultipleResultsFound:
    # there is more than one match for the filter criteria!
    ...

This allows for, say, creating a new model when one does not exist, and warning the user if multiple exist (pick the first one, etc.).
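For example, a rough get-or-create sketch along those lines (assuming a Keyword model whose constructor accepts word=...):

from sqlalchemy.orm.exc import NoResultFound, MultipleResultsFound

def get_or_create_keyword(session, word):
    query = session.query(Keyword).filter(Keyword.word == word)
    try:
        return query.one()
    except NoResultFound:
        keyword = Keyword(word=word)  # it does not exist yet, so create it
        session.add(keyword)
        return keyword
    except MultipleResultsFound:
        return query.first()  # more than one match: just pick the first one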


Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem=filesystem, partitioning=partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)

# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset writer so I can make use of
# partition_filename_cb - that way I can have one file per date_id, e.g.
# container/dataset/date_id=20210127/20210127.parquet
# Our visualization tool connects to these files directly.
pq.write_to_dataset(table, final, partition_cols=["date_id"], filesystem=filesystem,
                    use_legacy_dataset=True,
                    partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this one isn't there yet. My approach now would be:

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    # keep the first row seen for each unique value in column_name
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
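For example, a small usage sketch (assuming pyarrow >= 7.0 for Table.sort_by): since pc.index returns the first occurrence of each value, sorting descending on the update timestamp first keeps the latest row per ID:

t = pa.table({"ID": [1, 1, 2], "updated_at": [1, 2, 1], "value": ["a", "b", "c"]})
latest = drop_duplicates(t.sort_by([("updated_at", "descending")]), "ID")
# keeps the rows with ID=1/updated_at=2 and ID=2/updated_at=1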
//end edit
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np

array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    # do stuff with each unique value and its count
    ...
I used value_counts to find the unique values and how many of each there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using

mask = pa.array(np.array(array) == key)
row = table.filter(mask)

and if the count was more than 1 I selected either the first or the last one, again using a numpy boolean array as the mask.
After iterating, it was as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory intensive by keeping a single numpy boolean mask up to date while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask). I don't know how to calculate the speedup though; a rough sketch of that single-mask idea follows.
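A sketch (keeping the first row per value; the helper name is just illustrative, and the same table/column_name as above are assumed):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def keep_first_per_value(table: pa.Table, column_name: str) -> pa.Table:
    values = np.array(table.column(column_name))
    counts = pc.value_counts(table.column(column_name)).to_pylist()
    boolean_mask = np.full(len(table), False)
    for entry in counts:
        # mark only the first row holding this value, whatever its count
        first_idx = np.flatnonzero(values == entry['values'])[0]
        boolean_mask[first_idx] = True
    return table.filter(mask=boolean_mask)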
Edit 2 (sorry for the many edits; I've been doing a lot of refactoring and trying to get it to work faster):
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't do combine_chunks(): there is a bug where Arrow seems unable to combine chunked arrays if they are too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# build a global row index aligned with the chunk layout of tb3['ID']
chunk_lengths = [len(chunk) for chunk in tb3['ID'].chunks]
offsets = np.concatenate([[0], np.cumsum(chunk_lengths)[:-1]])
index_chunks = [np.arange(n) + off for n, off in zip(chunk_lengths, offsets)]
tb3 = tb3.append_column('index', pa.chunked_array(index_chunks))
# keep the first occurrence (minimum global index) of each ID
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that duckdb can give better performance on the group by. Changing the last 2 lines above into the following gives a 2X speedup:

import duckdb

duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))

Django bulk update setting each to different values? [duplicate]

I'd like to update a table with Django - something like this in raw SQL:
update tbl_name set name = 'foo' where name = 'bar'
My first result is something like this - but that's nasty, isn't it?
list = ModelClass.objects.filter(name='bar')
for obj in list:
    obj.name = 'foo'
    obj.save()
Is there a more elegant way?
Update:
Django 2.2 version now has a bulk_update.
Old answer:
Refer to the following Django documentation section:
Updating multiple objects at once
In short you should be able to use:
ModelClass.objects.filter(name='bar').update(name="foo")
You can also use F objects to do things like incrementing rows:
from django.db.models import F
Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)
See the documentation.
However, note that:
This won't call the ModelClass.save method (so if you have some logic inside it, it won't be triggered).
No Django signals will be emitted.
You can't perform an .update() on a sliced QuerySet; it must be done on an original QuerySet, so you'll need to lean on the .filter() and .exclude() methods.
Consider using django-bulk-update found here on GitHub.
Install: pip install django-bulk-update
Implement: (code taken directly from the project's README file)
import random

from bulk_update.helper import bulk_update

random_names = ['Walter', 'The Dude', 'Donny', 'Jesus']
people = Person.objects.all()
for person in people:
    r = random.randrange(4)
    person.name = random_names[r]
bulk_update(people)  # updates all columns using the default db
Update: as Marc points out in the comments, this is not suitable for updating thousands of rows at once, though it is fine for smaller batches of tens to hundreds. The size of batch that is right for you depends on your CPU and query complexity. This tool is more like a wheelbarrow than a dump truck.
Django 2.2 version now has a bulk_update method (release notes).
https://docs.djangoproject.com/en/stable/ref/models/querysets/#bulk-update
Example:
# get a pk: record dictionary of existing records
updates = YourModel.objects.filter(...).in_bulk()
....
# do something with the updates dict
....
if hasattr(YourModel.objects, 'bulk_update') and updates:
    # Use the new method
    YourModel.objects.bulk_update(updates.values(), [list the fields to update], batch_size=100)
else:
    # The old & slow way
    with transaction.atomic():
        for obj in updates.values():
            obj.save(update_fields=[list the fields to update])
If you want to set the same value on a collection of rows, you can use the update() method combined with any query term to update all rows in one query:
some_list = ModelClass.objects.filter(some condition).values('id')
ModelClass.objects.filter(pk__in=some_list).update(foo=bar)
If you want to update a collection of rows with different values depending on some condition, you can in the best case batch the updates according to values. Say you have 1000 rows where you want to set a column to one of X values; you could prepare the batches beforehand and then run only X update queries (each essentially having the form of the first example above) plus the initial SELECT query (see the sketch below).
If every row requires a unique value there is no way to avoid one query per update. Perhaps look into other architectures like CQRS/Event sourcing if you need performance in this latter case.
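A sketch of that batching idea (compute_new_name here stands in for whatever per-row logic decides the new value; it is not a real helper):

from collections import defaultdict

# group primary keys by the value they should receive,
# then issue one UPDATE per distinct value instead of one per row
batches = defaultdict(list)
for obj in ModelClass.objects.filter(name='bar'):
    batches[compute_new_name(obj)].append(obj.pk)

for new_name, pks in batches.items():
    ModelClass.objects.filter(pk__in=pks).update(name=new_name)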
Here is some useful content I found on the internet regarding the question above:
https://www.sankalpjonna.com/learn-django/running-a-bulk-update-with-django
The inefficient way
model_qs = ModelClass.objects.filter(name='bar')
for obj in model_qs:
    obj.name = 'foo'
    obj.save()
The efficient way
ModelClass.objects.filter(name='bar').update(name="foo")  # for a single value 'foo'; otherwise add a loop
Using bulk_update
update_list = []
model_qs = ModelClass.objects.filter(name='bar')
for model_obj in model_qs:
    model_obj.name = "foo"  # or whatever the value is; for simplicity I'm using "foo" only
    update_list.append(model_obj)
ModelClass.objects.bulk_update(update_list, ['name'])
Using an atomic transaction

from django.db import transaction

with transaction.atomic():
    model_qs = ModelClass.objects.filter(name='bar')
    for obj in model_qs:
        obj.name = 'foo'
        obj.save()
To update with the same value we can simply use this:
ModelClass.objects.filter(name = 'bar').update(name='foo')
To update with different values:

obj_list = ModelClass.objects.filter(name='bar')
obj_to_be_updated = []
for obj in obj_list:
    obj.name = "Dear " + obj.name
    obj_to_be_updated.append(obj)
ModelClass.objects.bulk_update(obj_to_be_updated, ['name'], batch_size=1000)

This won't trigger the save signal for every object; instead we collect all the objects to be updated in a list and perform the update in one go (in batches of batch_size).
update() returns the number of rows updated in the table:
update_counts = ModelClass.objects.filter(name='bar').update(name="foo")
You can refer to this link for more information on bulk update and create:
Bulk update and Create

find row in ruby array

I have a MySQL query that returns this type of data:
{"id"=>1, "serviceCode"=>"1D00", "price"=>9.19}
{"id"=>2, "serviceCode"=>"1D01", "price"=>9.65}
I need to return the id field based on a match of the serviceCode.
i.e. I need a method like this
def findID(serviceCode)
  # find the row that has the service code and return the ID
end

I was thinking of having a serviceCodes.each do |row| loop and essentially doing:

if row['serviceCode'] == serviceCode
  return row['id']
end

Is there a faster / easier way?
You can use the method Enumerable#find:
service_codes = [
{"id"=>1, "serviceCode"=>"1D00", "price"=>9.19},
{"id"=>2, "serviceCode"=>"1D01", "price"=>9.65}
]
service_codes.find { |row| row['serviceCode'] == '1D00' }
# => {"id"=>1, "serviceCode"=>"1D00", "price"=>9.19}
If you use Rails Active Record as your ORM and your model is named Product (only as an example), you can use something like this:

def findID(serviceCode)
  Product.select(:id).where(serviceCode: serviceCode).first
end

If you have a plain SQL query in a plain Ruby class (not recommended), you should change the query to fetch only the id, as Luiggi mentioned. But be aware of SQL injection if your serviceCode comes from external requests.

Rails select random record

I don't know if I'm just looking in the wrong places here or what, but does Active Record have a method for retrieving a random object?
Something like:
@user = User.random
Or... well, since that method doesn't exist, is there some amazing "Rails Way" of doing this? I always seem to be too verbose. I'm using MySQL as well.
Most of the examples I've seen that do this end up counting the rows in the table, then generating a random number to choose one. This is because alternatives such as RAND() are inefficient in that they actually fetch every row and assign each a random number, or so I've read (and they are database specific, I think).
You can add a method like the one I found here.
module ActiveRecord
  class Base
    def self.random
      if (c = count) != 0
        find(:first, :offset => rand(c))
      end
    end
  end
end
This will make it so any model you use has a method called random which works in the way I described above: it generates a random number within the count of the rows in the table, then fetches the row associated with that random number. So basically, you're only doing one fetch, which is what you probably prefer :)
You can also take a look at this rails plugin.
We found that offsets ran very slowly on MySQL for a large table. Instead of using an offset like:
model.find(:first, :offset => rand(c))
...we found the following technique ran more than 10x faster (off-by-one fixed):
max_id = Model.maximum("id")
min_id = Model.minimum("id")
id_range = max_id - min_id + 1
random_id = min_id + rand(id_range).to_i
Model.find(:first, :conditions => "id >= #{random_id}", :limit => 1, :order => "id")
Try using Array's sample method:
@user = User.all.sample(1)
In Rails 4 I would extend ActiveRecord::Relation:
class ActiveRecord::Relation
  def random
    offset(rand(count))
  end
end
This way you can use scopes:
SomeModel.all.random.first # Return one random record
SomeModel.some_scope.another_scope.random.first
I'd use a named scope. Just throw this into your User model.
named_scope :random, :order=>'RAND()', :limit=>1
The random function isn't the same in each database though. SQLite and others use RANDOM() but you'll need to use RAND() for MySQL.
If you'd like to be able to grab more than one random row you can try this.
named_scope :random, lambda { |*args| { :order=>'RAND()', :limit=>args[0] || 1 } }
If you call User.random it will default to 1 but you can also call User.random(3) if you want more than one.
If you need a random record but only within certain criteria, you can use "random_where" from this code:

module ActiveRecord
  class Base
    def self.random
      if (c = count) != 0
        find(:first, :offset => rand(c))
      end
    end

    def self.random_where(*params)
      if (c = where(*params).count) != 0
        where(*params).find(:first, :offset => rand(c))
      end
    end
  end
end
For example:
@user = User.random_where("active = 1")
This function is very useful for displaying random products based on some additional criteria
I strongly recommend this gem for random records; it is specially designed for tables with lots of rows:
https://github.com/haopingfan/quick_random_records
Simple usage:
@user = User.random_records(1).take
All other answers perform badly with a large database, except this gem:
quick_random_records only costs 4.6 ms in total.
the accepted answer User.order('RAND()').limit(10) costs 733.0 ms.
the offset approach costs 245.4 ms in total.
the User.all.sample(10) approach costs 573.4 ms.
Note: my table only has 120,000 users. The more records you have, the bigger the performance difference will be.
UPDATE: performance on a table with 550,000 rows:
Model.where(id: Model.pluck(:id).sample(10)) costs 1384.0 ms.
The gem quick_random_records only costs 6.4 ms in total.
Here is the best solution for getting random records from the database. Rails makes this easy: to get random records from the DB, use sample. Below is the description with examples.
Backport of Array#sample based on Marc-Andre Lafortune's github.com/marcandre/backports/. Returns a random element or n random elements from the array. If the array is empty and n is nil, it returns nil. If n is passed and its value is less than 0, it raises an ArgumentError exception. If the value of n is equal to or greater than 0, it returns [].
[1,2,3,4,5,6].sample # => 4
[1,2,3,4,5,6].sample(3) # => [2, 4, 5]
[1,2,3,4,5,6].sample(-3) # => ArgumentError: negative array size
[].sample # => nil
[].sample(3) # => []
You can add conditions as per your requirement, as in the example below:
User.where(active: true).sample(5)
This will return 5 random active users from the User table.
For more help please visit : http://apidock.com/rails/Array/sample

Recalculate Counter Cache of 120k Records [Rails / ActiveRecord]

The following situation:
I have a poi model which has many pictures (1:n). I want to recalculate the counter_cache column because the values are inconsistent.
I've tried iterating over each record in Ruby, but this takes much too long and sometimes dies with "segmentation fault" errors.
So I wonder: is it possible to do this with a raw SQL query?
If, for example, you have Post and Picture models, and Post has_many :pictures, you can do it with update_all:
Post.update_all("pictures_count = (SELECT COUNT(*) FROM pictures WHERE pictures.post_id = posts.id)")
I found a nice solution on krautcomputing.
It uses reflection to find all counter caches of a project, SQL queries to find only the objects that are inconsistent, and Rails' reset_counters to clean things up.
Unfortunately it only works with "conventional" counter caches (no class name, no custom counter cache names), so I refined it:
Rails.application.eager_load!
ActiveRecord::Base.descendants.each do |many_class|
  many_class.reflections.each do |name, reflection|
    if reflection.options[:counter_cache]
      one_class = reflection.class_name.constantize
      one_table, many_table = [one_class, many_class].map(&:table_name)

      # more reflections, use :inverse_of, :counter_cache etc.
      inverse_of = reflection.options[:inverse_of]
      counter_cache = reflection.options[:counter_cache]
      if counter_cache === true
        counter_cache = "#{many_table}_count"
        inverse_of ||= many_table.to_sym
      else
        inverse_of ||= counter_cache.to_s.sub(/_count$/, '').to_sym
      end

      ids = one_class
        .joins(inverse_of)
        .group("#{one_table}.id")
        .having("MAX(#{one_table}.#{counter_cache}) != COUNT(#{many_table}.id)")
        .pluck("#{one_table}.id")

      ids.each do |id|
        puts "reset #{id} on #{many_table}"
        one_class.reset_counters id, inverse_of
      end
    end
  end
end