Why are my two SQL iterations different? - sqlalchemy

I have the following SQLAlchemy iteration in python:
for business in session.query(Business).filter(Business.name.like("%" + term + "%")):
print business.name
I have about 7000 businesses in my list, and this runs in under 10ms. Great!
However, I want to support a more specific search algorithm than like - for example, I want to match & with and, and so forth. So, I tried the following:
for business in session.query(Business):
if term in business.name:
print business.name
The latter takes about 600ms to run. What is SQLAlchemy doing in its filter call that makes the iteration so much faster? How can I make mine faster?
Thanks!

SA does nothing to optimize your query.
The difference you have is between:
filter data on RDBMS directly on one hand and
load all rows from the database into memory, then filter out those not needed.
Obviously, the second version is going to be much slower, especially if the number of rows is large. In fact, the SA might even contribute to your second version being slower, as it creates Business objects for every row in the database, which depending on your model might be relatively expensive operation. Furthermore, if you have eager-load relationships, it will load those too. Just run two queries below against database directly to see how many rows each returns:
SELECT * FROM Business WHERE Name LIKE '%term%';
SELECT * FROM Business;
I would suggest using SQL engine filtering capabilities for large number of objects, as you can do most of the filtering directly there much more efficiently

Related

Is filtered selection faster than fetching all the rows and then filtering

So I want to create a table in the frontend where I will list every single user. The thing is that the tables are relational and I have to get data from multiple tables in order to fulfill my goal.
Now here comes my question (keep in mind I have a MySQL database) :
Which method is better on the long run :
Generate joined queries that fetch all the data from each table where a user has any information (it outputs ~80 column per row and only 15 of them are needed)
Fetch the data that I need with multiple queries and then just "stick" the values together and output them (15 columns and all of them are needed, but I have to do extra work)
I would suggest you to go for third option.
Generate joined queries that fetch only necessary 15 columns for your front end. It would be the most efficient way.
If you are facing challenges with joining the tables then you can share table structures with sample data and desired output here with your query. We can try to help you achieve your goal.
This is a bit long for a comment.
I don't understand your first option. Why would you be selecting columns that you don't need? If there are 15 columns that you specifically want, then select those columns and nothing else.
In general, it is faster to have the database do most of the work. It can take advantage of its optimizer to produce the best execution plan that it can.
From Experience with embedded hardware mysql server.
If the hardware can do it and has enough resources you let the databse server run it course, as it can run its optimizer.
But if the server hardware lags on some fronts, you transpport all data to the client and let it run Javascript on all returned data.
The same goes for bandwith of the internet connection, it is slow, you want lesser number of rows, to transport because that the user will notice it, even old smartphones have to much power in cpu, amd can so handle everything with easy what you through at them.
In Basic there is no sime answer, you have to check server hardware and the usual bandwith offered and then program a solution that works best
A simple Rule of Thumb:
Fewer round-trips to the database server is usually the faster alternative.

Cosmos DB : Faster Search Options

We have huge cosmosDB container with billions of rows and almost 300 columns. Data is partitioned and modeled in a way we query it most of the time.
For example : User table is partitioned by userId thats why below query works fine.
Select * from User where userId = "user01234"
But in some cases, we need to query data differently that need sorting and then query.
For example : Get data from User Table using userpost and date of post
Select * from user where userPostId = "P01234" orderBy date limit 100
This query takes lot of time because of the size of data and data is not partitioned based on query2 (user Post).
My question is - How can we make query2 and other similar queries faster when data is not partitioned accordingly.
Option 1: "Create separate collection which is partitioned as per Query2" -
This will make query faster but for any new query we will end up creating a new collection, which is duplication of billions of records. [Costly Option]
Option 2: "Build elastic search on top of DB?" This is time consuming option and may be over killing for this slow query problem.
Is there any other option that can be used? Let me know your thoughts.
Thanks in advance!
Both options are expensive. The key is deciding which is cheaper, including running the cross-partition query. This will require you costing each of these options out.
For the cross-partition query, capture the RU charge in the response object so you know the cost of it.
For change feed, this will have an upfront cost as you run it over your existing collection, but whether that cost remains high depends on how much data is inserted or updated each month. Calculating the cost to populate your second collection will take some work. You can start by measuring the RU Charge in the response object when doing an insert then multiply by the number of rows. Calculating how much throughput you'll need will be a function of how quickly you want to populate your second collection. It's also a function of how much compute and how many instances you use to read and write the data to the second collection.
Once the second collection is populated, Change Feed will cost 2 RU/s to poll for changes (btw, this is configurable) and 1 RU/s to read each new item. The cost of inserting data into a second collection costs whatever it is when you measured it earlier.
If this second query doesn't get run that often and your data doesn't change that much, then change feed could save you money. If you run this query a lot and your data changes frequently too, change feed could still save you money.
With regards to Elastic Search or Azure Search, I generally find this can be more expensive than keeping the cross-partition query or change feed. Especially if you're doing it to just answer a second query. Generally this is a better option when you need true free text query capabilities.
A third option you might explore is using Azure Synapse Link and then run both queries using SQL Serverless or Spark.
Some other observations.
Unless you need all 300 properties in these queries you run, you may want to consider shredding these items into separate documents and storing as separate rows. Especially if you have highly asymmetric update patterns where only a small number of properties get frequently updated. This will save you a ton of money on updates because the smaller the item you update, the cheaper (and faster) it will be.
The other thing I would suggest is to look at your index policy and exclude every property that is not used in the where clause for your queries and include properties that are. This will have a dramatic impact on RU consumption for inserts. Also take a look at composite index for your date property as this has a dramatic impact on queries that use order by.

How to optimize and Fast run SQL query

I have following SQL query that taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id").where("renewals.customer_id IS NULL && customers.status_id = 4").order("created_at DESC").select('first_name, last_name, customer_state, customers.created_at, customers.customer_state, customers.id, customers.status_id')
Above query takes 230976.6ms to execute.
I added indexing on firstname, lastname, customer_state and status_id.
How can I execute query within less then 3 sec. ?
Try this...
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
RESOURCES
VIDEO/WEBCAST
Sponsored
Discover your Data Dilemma
WHITE PAPER
Best Practices when Designing a Digital Workplace
SEE ALL
Search Resources
Go
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
YOU MIGHT ALSO LIKE
Microsoft Dynamics AX ERP
Microsoft Dynamics AX: A new ERP is born, this time in the cloud
Joseph Sirosh
Why Microsoft’s data chief thinks current machine learning tools are like...
Urs Holzle Structure
Google's infrastructure czar predicts cloud business will outpace ads in 5...
This is a case where understanding that all tables are partitions slashed hours from a data load.
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it call from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
SET #CT = (SELECT COUNT(*) FROM dbo.T1);
If #CT > 0
BEGIN
END
It’s completely unnecessary. If you want to check for existence, then do this:
If EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds, and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows from sysindexes where object_name(id) = 'T1' and indexid = 1. In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
Ref:http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html

What is the optimal solution, use Inner Join or multiple queries?

What is the optimal solution, use Inner Join or multiple queries?
something like this:
SELECT * FROM brands
INNER JOIN cars
ON brands.id = cars.brand_id
or like this:
SELECT * FROM brands
... (while query)...
SELECT * FROM cars WHERE brand_id = [row(brands.id)]
Generally speaking, one query is better, but there are come caveats to that. For example, older versions of SQL Server had a great decreases in performance if you did more than seven joins. The answer will really depend on the database engine, version, query, schema, fields, etc., so we can't say for sure which is better. Always look into minimizing the number of queries when possible without going too overboard and creating result sets that are crazy or impossible to maintain.
This is a very subjective question but remember that each time you call the database there's significant overhead.
Almost without exception the optimum is to issue as few commands and pull out all the data you'll need. However for practical reasons this clearly may not be possible.
Generally speaking if a database is well maintained one query is quicker than two. If it's not you need to look at your data/indicies and determine why.
A final point, you're hinting in your second example that you'd load the brand then issue a command to get all the cars in each brand. This is without a doubt your worst option as it doesn't issue 2 commands - it issues N+1 where N is the number of brands you have... 100 brands is 101 DB hits!
Your two queries are not exactly the same.
The first returns all fields from brands and cars in one row. The second returns two different result sets that need to be combined together.
In general, it is better to do as many operations in the database as possible. The database is more efficient for processing large amounts of data. And, it generally reduces the amount of data being brought back to the client.
That said, there are a few circumstances where more data is being returned in a single query than with multiple queries. For instance in your example, if you have one brand record with 100 columns and 10,000 car records with three columns, then the two-query method is probably faster. You are only bringing back the columns from brands table once rather than 10,000 times.
These examples where multiple queries is better are few and far between. In general, it is better to do the processing in the database. If performance needs to be improved, then in a few rare cases, you might be able to break up queries and improve performance.
In general, use first query. Why? Because query execution time is not just query itself time, but also some overheads, such as:
Creating connection overhead
Network data sending overhead
Closing (handling) connection overhead
Depending of situation, some overheads may present or not. For example, if you're using persistent connection, then you'll not get connection overhead. But in common case that's not true, thus, it will have place. And creating/maintaining/closing connection overhead is very significant part. Imagine that you have this overhead as only 1% from total query time (in real situation it will be much more). And you have - let's say, 1.000.000 rows. Then first query will produce that overhead only once, while second will be 1.000.000/100 = 10.000 times. Just think about - how slow it will be.
Besides, INNER JOIN will also be done using key - if it exists, thus, in terms of query itself speed it will be near same as second. So I highly recommend to use INNER JOIN option.
Breaking complex query into simple queries may be useful in a very specific cases. For example, case with IN subquery. In this situation, if you're using WHERE id IN (subquery), where (subquery) is some SQL, MySQL will treat that as = ANY subquery and will not use key for that, even if subquery results in narrow list of ids. And - yes, split it into two queries may have sense since WHERE IN(static list) will work in another way - MySQL will use range index scan for that (strange, but true - because for IN (static list) statement IN will be treated as comparison operator, and not =ANY subquery qualifier). This part isn't directly about your case - but to show that - yes, cases, when splitting processing from DBMS may be useful in terms of performance - exist.
One query is better, because up to about 90% of the expense of executing a query is in the overheads:
communication traffic to/from database
syntax checking
authority checking
access plan calculation by optimizer
logging
locking (even read-only requires a lock)
lots of other stuff too
Do all that just once for one query, or do it all n times for n queries, but get the same data.

Would using Redis with Rails provide any performance benefit for this specific kind of queries

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users which are in nested-set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to category (we call it Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What user enters isn't always the same what he sees in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me sum of all entered values for this category for this user for this week/month/year".
Problem is that those stats needs also to be summed for a subset of users under selected user (so it will basically return sum of all values for all users under the user, including self).
This app is in production for 2 years and it is doing its job pretty well... but with more and more users it's also pretty slow when it comes to server-expensive reports, like "give me list of all users under myself and their statistics. One line for summed by their sub-group and one line for their personal stats"). Of course, users wants (and needs) their reports to be as actual as possible, 5 mins to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot do the high-intensive sqls directly... That would kill the server. So I'm computing them only once via background process and frontend just reads the results.
Those sqls are hard to optimize and I'm glad I've moved from this approach... (caching is not an option. See below.)
Current app goes like this:
frontend: when user enters new data, it is saved to simple mysql table, like [user_id, pointer_id, date, value] and there is also insert to the queue.
backend: then there is calc_daemon process, which every 5 seconds checks the queue for new "recompute requests". We pop the requests, determine what else needs to be recomputed along with it (pointers have dependencies... simplest case is: when you change week stats, we must recompute month and year stats...). It does this recomputation the easy way.. we select the data by customized per-pointer-different sqls generated by their classes.
those computed results are then written back to mysql, but to partitioned tables (one table per year). One line in this table is like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced 5x # of records).
when frontend needs those results it does simple sums on those partitioned data, with 2 joins (because of the nested set conds).
The problem is that those simple sqls with sums, group by and join-on-the-subtree can take like 200ms each... just for a few records.. and we need to run a lot of these sqls... I think they are optimized the best they can, according to explain... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or other fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I'll rewrite it to use redis, I'll have to run much more queries against it than I have to with mysql, and then perform the sum in ruby manually... so the performance can be hurt considerably... I'm not really sure if I could write all the possible queries I have now with redis... Loading the users in rails and then doing something like "redis, give me sum for users 1,2,3,4,5..." doesn't seem like right idea... But maybe there is some feature in redis that could make this simpler?)...
Also the tree structure needs to be like nested set, i.e. it cannot have one entry in redis with list of all child-ids for some user (something like children_for_user_10: [1,2,3]) because the tree structure changes frequently... That's also the reason why I can't have those sums in those partitioned tables, because when the tree changes, I would have to recompute everything.. That's why I perform those sums realtime.)
Or would you suggest me to rewrite this app to different language (java?) and to compute the results in memory instead? :) (I've tried to do it SOA-way but it failed on that I end up one way or another with XXX megabytes of data in ruby... especially when generating the reports... and gc just kills it...) (and a side effect is that one generating report blocks the whole rails app :/ )
Suggestions are welcome.
Redis would be faster, it is an in-memory database, but can you fit all of that data in memory? Iterating over redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events), for example it has a fast INCR command.
I'm guessing that you would get sufficient speed improvement by using a stored procedure or a faster language than ruby (eg C-inline or Go) to do the recalculation. Are you doing group-by in the recalculation? Is it possible to change group-bys to code that orders the result-set and then manually checks when the 'group' changes. For example if you are looping by user and grouping by week inside the loop, change that to ordering by user and week and keep variables for the current and previous values of user and week, as well as variables for the sums.
This is assuming the bottleneck is the recalculation, you don't really mention which part is too slow.