Stream a NOT NULL selection of a table? - sqlalchemy

I'm trying to select the primary key for all rows in a table based on if another column is NULL.
The following shows what it would look like as a pure select(), but it does not do what I want: the table is so large that the query nearly fills up memory before returning any results.
s = tweets.select().where(tweets.c.coordinates != None)
result = engine.execute(s)
for row in result:
    print(row)
Because the table is so large, I found a streaming solution that works for the session.query() object:
def page_query(q):
    r = True
    offset = 0
    while r:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
So I'm trying to restructure the above select() as a query(), but when I do, it returns every row in the table, including ones with coordinates = 'null':
q = session.query(Tweet).filter(Tweet.coordinates.is_not(None))
for i in page_query(q):
    print(f' {i}')
If I instead do
q = session.query(Tweet).filter(Tweet.coordinates.is_not('null'))
for i in page_query(q):
    print(f' {i}')
I get an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "'null'"
LINE 3: WHERE milan_tweets.coordinates IS NOT 'null'
^
(Using != appears to give the same results as the built-in .is_not().)
So how can I make this selection?
EDIT: Code block at the top does NOT do what I expected originally, my mistake.
Rows are added to the database as Python Nones, and looking in DBeaver shows the values as "null".

You have correctly diagnosed the problem. The query returns e.g. a million rows, and the psycopg2 driver drags all of those result rows over the network, buffering them locally, before returning even a single row up to your app.
Why? Because the public API includes a detail where your app could ask "how many rows were in that result?", and the driver must retrieve them all in order to learn that bit of trivia.
If you promise not to ask the "how many?" question, you can stream results with this:
import sqlalchemy as sa
engine = sa.create_engine(uri).execution_options(stream_results=True)
Then rows will be delivered up to your app nearly as soon as they become available, rather than being buffered for a long time. This yields a significantly smaller memory footprint for your Python process, as the DB driver layer does not need to malloc() storage sufficient to store all million result rows.
https://docs.sqlalchemy.org/en/14/core/connections.html#streaming-with-a-fixed-buffer-via-yield-per
cf test_core_fetchmany_w_streaming
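The buffered-versus-streamed distinction can be sketched without a running Postgres server. Below is a minimal illustration using Python's standard sqlite3 module (the table name is borrowed from the question; fetchmany here merely stands in for what stream_results / server-side cursors do in psycopg2, it is not the same mechanism):

```python
import sqlite3

def stream_rows(conn, sql, params=(), batch_size=1000):
    """Yield rows one at a time, pulling batch_size rows per fetch
    instead of materializing the whole result set in memory."""
    cur = conn.execute(sql, params)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield from batch

# Demo with a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, coordinates TEXT)")
conn.executemany(
    "INSERT INTO tweets (coordinates) VALUES (?)",
    [("41.9,12.5",), (None,), ("45.4,9.2",), (None,)],
)

# Only rows whose coordinates are a real SQL NULL are filtered out.
found = list(stream_rows(conn, "SELECT id FROM tweets WHERE coordinates IS NOT NULL"))
print(found)  # [(1,), (3,)]
```

Note the filter works here because the values really are SQL NULL; if the column instead holds the literal string 'null' (as the asker's EDIT suggests), an IS NOT NULL predicate will match those rows too.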

Related

Best way to check for near objects in SQL database according to x and y columns

I have a database with pretty static objects, e.g. buildings with x and y coordinates, for a game in which I will be sending HTTP requests to my server to get all objects around some given x and y coordinates.
Currently I am using this simple SQL on the server, which then returns the data in JSON:
SELECT OBJECTS.id "id", POINTS.x, POINTS.y
FROM OBJECTS, OBJECTPOINTS, POINTS
WHERE OBJECTPOINTS.OID = OBJECTS.ID AND OBJECTPOINTS.PID = POINTS.ID
  AND ABS(POINTS.x - " + x + ") < 0.01 AND ABS(POINTS.y - " + y + ") < 0.01
Each object is represented by points which will be used to draw on the client.
I am currently achieving a ~5 second response time for around 1.5M points and 200k objects.
To me this is fairly reasonable; however, the problem I have is that the DB is blocked with each request. Here are 10 requests sent at the same time:
36 seconds for map data with 10 clients requesting at the same time is way too much.
So my question is: what would be a better way to handle the request than comparing distances in SQL?
Would it be considerably faster to hold all of those objects in memory and iterate over them on the server?
I have also thought of abstracting all the data into some kind of grid, and then first checking which grid cell the request coords fall in, to then run the same query as above on the DB with only the objects in that square. Is there some clever solution I might be overlooking, in the SQL maybe?
Your query cannot make use of an index, because you are applying a function to your column data in the WHERE clause (ABS(POINTS.x ...)). If you rewrite your query to compare the raw value of your columns against other values, you can add an index to your table, and your query no longer needs to scan the full table.
Rewrite your WHERE clause to something like this to replace the ABS() function:
(POINTS.x < (x + 0.01) AND POINTS.x > (x - 0.01))
Then add an index to your table like:
alter table POINTS add index position(x, y);
Check how the number of scanned rows changes between the two queries, with and without the index, by adding the EXPLAIN keyword in front of your query.
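The effect is easy to verify even without MySQL. A small sketch with Python's built-in sqlite3 (schema and index name are made up) shows the planner choosing a full scan for the ABS() form but an index search for the range form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.execute("CREATE INDEX position ON points (x, y)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether the planner scans or uses an index.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function applied to the column: not sargable, so a full table scan.
scan = plan("SELECT id FROM points WHERE ABS(x - 5.0) < 0.01")

# Raw column compared to a range: the index on (x, y) can be used.
search = plan("SELECT id FROM points WHERE x < 5.01 AND x > 4.99")

print(scan)    # e.g. 'SCAN points'
print(search)  # e.g. 'SEARCH points USING COVERING INDEX position (x>? AND x<?)'
```

The same reasoning applies to the MySQL EXPLAIN output; only the report format differs.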

Does Statement.RETURN_GENERATED_KEYS generate any extra round trip to fetch the newly created identifier?

JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    Statement.RETURN_GENERATED_KEYS
);
ps.setString(1, "Some title");
ps.executeUpdate();

ResultSet resultSet = ps.getGeneratedKeys();
while (resultSet.next()) {
    LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested in whether the Oracle, SQL Server, PostgreSQL, or MySQL driver uses a separate round trip to fetch the identifier, or whether there is a single round trip which executes the insert and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However, batches that could otherwise be streamed to the server may sometimes be broken up into smaller chunks, or run one by one, if generated keys are requested. To avoid this, use the String[] form where you name the columns you want returned, and name only columns of fixed-width data types like integer. This only matters for batches, and it's due to a design problem in PgJDBC.
(I posted a patch to add batch pipelining support in libpq that doesn't have that limitation, it'll do one client/server round trip for arbitrary sized batches with arbitrary-sized results, including returning keys.)
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
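To make the "no extra round trip" idea concrete: SQLite's Python driver behaves analogously, handing back the generated key as part of executing the INSERT itself (via cursor.lastrowid), with no follow-up query. This is an analogy to illustrate the concept, not the MySQL wire protocol:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)")

cur = conn.execute("INSERT INTO post (title) VALUES (?)", ("hello",))
# The generated key travels back with the statement result itself;
# no second SELECT is needed to learn it.
print(cur.lastrowid)  # 1
```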
In my opinion, even for such a trivial thing, a single approach working in all database systems will fail. The only pragmatic solution is (in analogy to Hibernate) to find the best working solution for each target RDBMS (and call it a dialect of your one-for-all solution :)).
Here is the information for Oracle.
I'm using a sequence to generate the key; the same behavior is observed for IDENTITY columns.
create table auto_pk (
  id  number,
  pad varchar2(100)
);
This works and uses only one round trip:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
        Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getString(1)
}
But unfortunately you get the ROWID as a result, not the generated key.
How is it implemented internally? You can see it if you activate a 10046 trace (BTW this is also the best way to see how many round trips were performed):
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see the JDBC 3.0 standard is implemented, but you don't get the requested result. Under the cover, the RETURNING clause is used.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement("insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getLong(1)
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace use
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a known DBMS. (Statement.RETURN_GENERATED_KEYS is relatively innocuous, although it apparently does raise a question for you; but where frameworks are built on separate entities and do all sorts of joins and filters in code, or have custom-built transaction isolation logic, things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.

Rails - how to fetch random records from an object?

I am doing something like this:
data = Model.where('something="something"')
random_data = data.rand(100..200)
returns:
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Once I get this random data, I need to iterate through that data, like this:
random_data.each do |rd|
  # ...
end
I know there's a way to fetch random data in MySQL, but I need to pick the random data about 400 times, so I think loading the data once from the database and picking random records from it 400 times is more efficient than running the query 400 times on MySQL.
But - how to get rid of that error?
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Thank you in advance
I would add the following scope to the model (depends on the database you are using):
# to model/model.rb
# 'RANDOM' works with postgresql and sqlite, whereas mysql uses 'RAND'
scope :random, -> { order('RAND()') }
Then the following query would load a random number (in the range of 200-400) of objects in one query:
Model.random.limit(rand(200...400))
If you really want to do that in Rails and not in the database, then load all records and use sample:
Model.all.sample(rand(200..400))
But that is likely to be slower (depending on the number of entries in the database), because Rails would load all records from the database and instantiate them, which might take loads of memory.
It really depends how much effort you want to put into optimizing this, because there's more than one solution. Here are 2 options.
Something simple is to use ORDER BY RAND() LIMIT 400 to randomly select 400 items.
Alternatively, just select everything under the moon and then use Ruby to randomly pick 400 out of the total result set, ex:
data = Model.where(something: 'something').all # all is necessary to exec query
400.times do
  data.sample # returns a random model
end
I wouldn't recommend the second method, but it should work.
Another way, which is not DB specific is :
def self.random_record
  self.where('something = ? and id = ?', 'something', rand(self.count))
end
The only catch here is that 2 queries are performed. self.count does one query - SELECT COUNT(*) FROM models - and the other is your actual query to get a random record.
Well, now suppose you want n random records. Then write it like :
def self.random_records(n)
  records = self.count
  rand_ids = Array.new(n) { rand(records) }
  self.where('something = ? and id IN (?)', 'something', rand_ids)
end
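The count-then-IN idea generalizes beyond ActiveRecord. A sketch in Python using the built-in sqlite3 module (the schema is illustrative) shows the two queries involved and their main caveat:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (id INTEGER PRIMARY KEY, something TEXT)")
conn.executemany(
    "INSERT INTO models (id, something) VALUES (?, ?)",
    [(i, "something") for i in range(1, 101)],
)

def random_records(conn, n):
    # Query 1: COUNT(*) to learn the id range.
    total = conn.execute("SELECT COUNT(*) FROM models").fetchone()[0]
    # Draw n candidate ids; duplicates and id gaps can shrink the result.
    rand_ids = [random.randint(1, total) for _ in range(n)]
    placeholders = ",".join("?" * len(rand_ids))
    # Query 2: fetch whichever of those ids exist.
    return conn.execute(
        f"SELECT id FROM models WHERE something = ? AND id IN ({placeholders})",
        ["something", *rand_ids],
    ).fetchall()

rows = random_records(conn, 10)
print(len(rows))  # at most 10; duplicate random ids collapse
```

Note this assumes ids are roughly contiguous; deleted rows leave gaps that reduce the number of records actually returned.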
Use data.sample(rand(100..200))
For more info on why rand is not working, read here: https://rails.lighthouseapp.com/projects/8994-ruby-on-rails/tickets/4555

What's the mongoid equivalent to COUNT(`column`) in MySQL?

I want to count the number of matching documents for a query using mongoid, such as:
Chain.where(:updated_at.gte => past_time).count
However, I am worried that what's actually happening here is that mongoid is selecting and PARSING everything from MongoDB and then returning the count to me. This seems very slow. I want Mongo to directly return a count to me, so that ruby/mongoid doesn't have to parse a large number of objects. In MySQL I would do this with COUNT(column), which would spare PHP (for instance) the hassle of parsing/mapping a bunch of rows just to disregard them, since I'm only interested in the number of rows returned.
You're worrying needlessly. If you check the Mongoid docs, you'll see that Criteria#count is a thin wrapper around Moped::Query#count. If you look at how Moped::Query#count works, you'll see this:
def count(limit = false)
  command = { count: collection.name, query: selector }
  command.merge!(skip: operation.skip, limit: operation.limit) if limit
  result = collection.database.command(command)
  result["n"].to_i
end
So Moped::Query#count simply sends a count command down into MongoDB, then MongoDB does the counting and sends the count back to your Ruby code.
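The same distinction exists in any database API: asking the server for a count transfers one integer, while counting client-side transfers and materializes every matching row first. A sqlite3 sketch of the difference (illustrative table name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chains (id INTEGER PRIMARY KEY, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO chains (updated_at) VALUES (?)",
    [(t,) for t in range(1000)],
)

past_time = 500

# Server-side count: the database scans, only the integer crosses the wire.
n = conn.execute(
    "SELECT COUNT(*) FROM chains WHERE updated_at >= ?", (past_time,)
).fetchone()[0]

# Client-side count: every matching row is fetched and materialized first.
rows = conn.execute(
    "SELECT * FROM chains WHERE updated_at >= ?", (past_time,)
).fetchall()

print(n, len(rows))  # 500 500 - same answer, very different transfer cost
```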

How can I know how much time it takes to query items in a table of MySQL?

Our website has a problem: the load time of one page is too long. We have found that it has an n*n matrix on that page, and for each item in the matrix it queries three tables from the MySQL database. Every item in the matrix does a very similar query.
So I wonder whether the large number of MySQL queries is what leads to the problem, and I want to try to fix it. Here is one of my confusions, listed below:
1.

m = store.execute('SELECT X FROM TABLE1 WHERE I=1')
result = store.execute('SELECT Y FROM TABLE2 WHERE X in m')

2.

r = store.execute('SELECT X, Y FROM TABLE2')
result = []
for each in r:
    i = store.execute('SELECT I FROM TABLE1 WHERE X=%s', each[0])
    if i[0][0] == 1:
        result.append(each)
There are about 200 items in TABLE1 and more than 400 items in TABLE2. I don't know which part takes the most time, so I can't make a good decision about how to write my SQL statement.
How can I find out how much time it takes to do some operation in MySQL? Thank you!
Rather than installing a bunch of special tools, you could take a dead-simple approach like this (pardon my Ruby):
start = Time.new
# DB query here
puts "Query XYZ took #{Time.now - start} sec"
Hopefully you can translate that to Python. OR... pardon my Ruby again...
QUERY_TIMES = {}

def query(sql)
  start = Time.new
  result = connection.execute(sql)
  elapsed = Time.new - start
  QUERY_TIMES[sql] ||= []
  QUERY_TIMES[sql] << elapsed
  result
end
Then run all your queries through this custom method. After doing a test run, you can make it print out the number of times each query was run, and the average/total execution times.
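A rough Python translation of that Ruby sketch (the connection object here is an in-memory SQLite stand-in; swap in your real MySQL connection):

```python
import sqlite3
import time
from collections import defaultdict

QUERY_TIMES = defaultdict(list)

def timed_query(connection, sql, params=()):
    """Run a query and record how long it took, keyed by the SQL text."""
    start = time.perf_counter()
    result = connection.execute(sql, params).fetchall()
    QUERY_TIMES[sql].append(time.perf_counter() - start)
    return result

# Demo with a throwaway in-memory database as the stand-in connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

timed_query(conn, "SELECT x FROM t")
timed_query(conn, "SELECT x FROM t")

for sql, times in QUERY_TIMES.items():
    print(sql, "runs:", len(times), "avg:", sum(times) / len(times))
```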
For the future, plan to spend some time learning about "profilers" (if you haven't already). Get a good one for your chosen platform, and spend a little time learning how to use it well.
I use MySQL Workbench for SQL development. It gives response times and can connect remotely to MySQL servers, provided you have permission (which in this case will give you a more accurate reading).
http://www.mysql.com/products/workbench/
Also, as you've realized, it appears you have a SQL statement in a for loop. That could drastically affect performance. You'll want to take a different route to retrieve that data.
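To make that "different route" concrete: the per-item queries from the question can usually be collapsed into a single JOIN. A sketch with sqlite3 (TABLE1/TABLE2 schemas are guessed from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (x INTEGER PRIMARY KEY, i INTEGER)")
conn.execute("CREATE TABLE table2 (x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# N+1 pattern from the question: one extra query per row of TABLE2.
slow = []
for x, y in conn.execute("SELECT x, y FROM table2"):
    i = conn.execute("SELECT i FROM table1 WHERE x = ?", (x,)).fetchone()
    if i[0] == 1:
        slow.append((x, y))

# Single JOIN: the database does the matching in one statement.
fast = conn.execute(
    "SELECT t2.x, t2.y FROM table2 t2 "
    "JOIN table1 t1 ON t1.x = t2.x WHERE t1.i = 1"
).fetchall()

print(sorted(slow) == sorted(fast))  # True: same rows, one query instead of N+1
```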