Efficient way of getting the count of a particular document type from Couchbase

We have a Couchbase DB and a Spring Boot application that interacts with it. The database has many types of documents, among which are order transaction documents. One of the queries gets the count of order transaction documents in the DB. There are around 200 million such documents, and the number is increasing. The query looks like the following:
SELECT COUNT(META(d).id) AS count FROM `orders` AS d WHERE META(d).id LIKE ':order:trx:%'
We are using the following index for this:
CREATE INDEX `idx_order_trx_meta_id` ON `orders`(meta().id) WHERE meta().id LIKE ':order:trx:%'
However, this query times out even after adding an index on the bucket.
Is there a way to write efficient count queries for large document sets? For example, could we use a view that keeps a running count of these documents, or some other approach?

The count needs to be computed by scanning the index.
You can increase the index timeout if the timeout comes from the indexer, or the query timeout otherwise (or both). Also try the second approach suggested below.
In Enterprise Edition, try the following query and see if it can use index aggregation. If required, create a partitioned index (a sketch follows the query below).
SELECT COUNT(1) AS count
FROM `orders` AS d
WHERE META().id LIKE ":order:trx:%";
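A minimal sketch of such a partitioned index, assuming Couchbase EE 5.5+ partitioned-index syntax; the index name is made up:
CREATE INDEX idx_order_trx_part ON `orders`(META().id)
PARTITION BY HASH(META().id)
WHERE META().id LIKE ':order:trx:%';
Partitioning spreads the index, and therefore the count scan, across the index nodes.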
Also try the following, which stores one character per document in the index (no partitioned index), and see if it can use the fast index count path. Because the query is answered entirely from the index, a fast count should return in a few milliseconds, as it is a single lookup from the index stats (see MB-34624):
CREATE INDEX `ix1` ON `orders`(SUBSTR(meta().id,0,1))
WHERE meta().id LIKE ':order:trx:%'
SELECT COUNT(1) AS count
FROM `orders` AS d
WHERE META().id LIKE ":order:trx:%" AND SUBSTR(meta().id,0,1) IS NOT MISSING;
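To check whether the fast count path is actually used, inspect the plan; assuming the operator naming of recent server versions, you should see an IndexCountScan2 operator rather than a plain IndexScan feeding an aggregate:
EXPLAIN SELECT COUNT(1) AS count
FROM `orders` AS d
WHERE META().id LIKE ":order:trx:%" AND SUBSTR(meta().id,0,1) IS NOT MISSING;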
Also, if you don't need this programmatically, the index stats or the UI should give you the number of indexed items.

Related

Why is querying all columns way slower than querying only the id column in MySQL with a high LIMIT offset?

1. select * from inv_inventory_change limit 1000000,10
2. select id from inv_inventory_change limit 1000000,10
The first SQL takes about 1.6 s and the second about 0.37 s, so the difference between them is about 1.27 s.
I understand MySQL will use a covering index when only an indexed column is queried, which is why 'select id' is faster.
However, when I execute the IN-list SQL below, it takes only about 0.2 s, which is much shorter than the 1.27 s difference, and that confuses me:
select * from inv_inventory_change c where c.id in (1013712,1013713,1013714,1013715,1013716,1013717,1013718,1013719,1013720,1013721);
My key question is why the time difference is much bigger than the time taken by the WHERE id IN query.
The inv_inventory_change table has 2,321,211 records.
Adding 'order by id asc' to the queries above does not change the timings.
Run EXPLAIN on each query and the rule becomes very simple: the id-only query can be served without reading data pages from disk or the memory cache.
select id from inv_inventory_change limit 1000000,10
This can be served directly from the index (a B-tree or a variant of it) without reading the data pages or other row information.
select * from inv_inventory_change limit 1000000,10
This query requires two steps to fetch records. First it scans the index, which is quick, but then it needs to read the data pages for those records, which requires disk I/O, populating the cache, and so on. Without an explicit ORDER BY the rows most likely come back in id (index) order, but the real cost is the offset: the server has to generate and discard 1,000,000 rows before returning the 10 you asked for, so the per-row page fetch dominates.
select * from inv_inventory_change c where c.id in (1013712,1013713,1013714,1013715,1013716,1013717,1013718,1013719,1013720,1013721);
This query is served using a range scan on the index: it can find the entry corresponding to 1013712 in O(log N) time, and it should be able to serve the query quickly.
You should also look at the number of records you're reading: the query with limit 1000000,10 requires many disk I/Os because of the large number of entries it must pass over, whereas the third query reads only a handful of pages.
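A common pattern that follows from this observation is the deferred join: page through the covering index first, then fetch full rows only for the ids that survive the offset. A minimal sketch against the table above:
select c.*
from inv_inventory_change c
join (select id from inv_inventory_change order by id limit 1000000, 10) t
  on c.id = t.id;
This keeps the expensive row lookups down to the 10 rows actually returned.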

Query cannot be executed, or waits too long

I am using MySQL 5.0 and working with some large tables. I want to calculate something and wrote a query like this:
SELECT
shuttle_payments.payment_user as user,
SUM(-1 * (shuttle_payments.payment_price + meal_payments.payment_price ) +
print_payments.payment_price) as spent
FROM
((shuttle_payments
INNER JOIN meal_payments ON shuttle_payments.payment_user = meal_payments.payment_user)
INNER JOIN print_payments ON meal_payments.payment_user = print_payments.payment_user)
GROUP BY
shuttle_payments.payment_user
ORDER BY
spent DESC
LIMIT 1
Well, there are 3 tables here, each with approx. 60,000 rows. Is it taking too long because the tables are so large (so should I move to NoSQL or something), or is it a normal query and my server is just slow because its CPU is weak? Or is my query wrong?
I want this query to sum all price columns from the three tables and find which user spent the most money.
Thanks for your time :)
It looks like your query is OK. You have to check whether indexes are present on these three tables.
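A quick way to check, using standard MySQL syntax:
SHOW INDEX FROM shuttle_payments;
SHOW INDEX FROM meal_payments;
SHOW INDEX FROM print_payments;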
If they are missing, create indexes like these:
CREATE INDEX idx_shuttle_payments ON shuttle_payments(payment_user);
CREATE INDEX idx_meal_payments ON meal_payments(payment_user);
CREATE INDEX idx_print_payments ON print_payments(payment_user);
The statements above will create non-clustered indexes on the payment_user column.
If payment_user's data type is BLOB/TEXT, then:
CREATE INDEX idx_shuttle_payments ON shuttle_payments(payment_user(100));
CREATE INDEX idx_meal_payments ON meal_payments(payment_user(100));
CREATE INDEX idx_print_payments ON print_payments(payment_user(100));
In the statements above I have set the prefix length to 100. You have to decide this prefix length based on your data.
From MySQL documentation:
BLOB and TEXT columns also can be indexed, but a prefix length must be
given.

Why does a PostgreSQL query give inconsistent results unless you enforce a predictable result ordering with ORDER BY?

The PostgreSQL docs say:
The query optimizer takes LIMIT into account when generating query plans, so you are very likely to get different plans (yielding different row orders) depending on what you give for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
But in a MySQL InnoDB table, the results appear to be delivered in PRIMARY KEY order.
Why does the query give inconsistent results? What happens in the query?
What the Postgres documentation and your observations are telling you is that records in SQL tables have no internal order. Instead, database tables are modeled as unordered sets of records. Hence, in the following query, which could run on either MySQL or Postgres:
SELECT *
FROM yourTable
LIMIT 5
The database is free to return whichever 5 records it wants. In the case of MySQL, if you are seeing an ordering based on the primary key, it is only a coincidence; MySQL offers no contract that this will always happen.
To resolve this problem, you should always be using an ORDER BY clause when using LIMIT. So the following query is well-defined:
SELECT *
FROM yourTable
ORDER BY some_column
LIMIT 5
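One caveat worth adding: if some_column is not unique, rows that tie on it can still come back in any order. Appending a unique column (here an assumed primary key named id) as a tie-breaker makes the ordering fully deterministic:
SELECT *
FROM yourTable
ORDER BY some_column, id
LIMIT 5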

SQL Performance of grouping by DATE(TIMESTAMP) vs separate columns for DATE and TIME

I'm facing a problem displaying data from a MySQL database.
I have a table with all user requests in the format:
| TIMESTAMP Time / +INDEX | Some other params |
I want to show this data on my website as a table with the number of requests on each day.
The query is quite simple:
SELECT DATE(Time) as D, COUNT(*) as S FROM Stats GROUP BY D ORDER BY D DESC
But the EXPLAIN output drives me mad:
Using index; Using temporary; Using filesort
The MySQL docs say it may create a temporary table for this query on disk.
How fast would it be with 1,000,000 records? And with 100,000,000?
Is there any way to put an INDEX on the result of a function?
Maybe I should create separate columns for DATE and TIME and then group by the DATE column?
What are other good ways of dealing with such a problem? Caching? Another DB engine?
If you have an index on your Time column this operation is going to perform tolerably well. I'm guessing you do have that index, because your EXPLAIN output says it's using an index.
Why does this work well? Because MySQL can access this index in order -- it can scan the index -- to satisfy your query.
Don't be confused by Using temporary; Using filesort. This simply means MySQL needs to create and return a virtual table with a row for each day. That's pretty small and almost surely fits in memory. filesort doesn't necessarily mean the sort has spilled to a temp file on disk; it just means MySQL has to sort the virtual table. It has to sort it to get the most recent day first.
By the way, if you can restrict the date range of the query you'll get predictable performance even when your application has been in use for years. Try something like this:
SELECT DATE(Time) as D, COUNT(*) as S
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 30 DAY
GROUP BY D ORDER BY D DESC
First: a GROUP BY implies sorting, and sorting is an expensive operation. The data in the index is sorted, but even so the database needs to group the dates. So I feel that indexing by DATE may help, as it will improve the speed of the query at the cost of maintaining another index on every insert. Please test it; I am not 100% sure.
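One way to get such an index without maintaining a separate column by hand is a stored generated column, assuming MySQL 5.7+ (the column and index names here are made up):
ALTER TABLE Stats
ADD COLUMN Day DATE AS (DATE(Time)) STORED,
ADD INDEX idx_stats_day (Day);
Grouping by Day instead of DATE(Time) can then use the index.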
Other alternatives are:
Using a table partitioned by month.
Using materialized views.
Updating a counter on every visit.
Precalculating and storing historical data, refreshing only the current day with something like WHERE DATE(Time) = CURDATE(), so the server has to sort a much smaller amount of data (a sketch follows this list).
It depends on how often users visit your page and when you need this data. Do not optimize prematurely if you do not need to.
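A minimal sketch of that precalculation idea, assuming a summary table (the name daily_stats is made up) refreshed once a day from cron:
CREATE TABLE daily_stats (
D DATE PRIMARY KEY,
S INT UNSIGNED NOT NULL
);
REPLACE INTO daily_stats (D, S)
SELECT DATE(Time), COUNT(*)
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 1 DAY AND Time < CURDATE()
GROUP BY DATE(Time);
The website then reads daily_stats and only computes today's count live.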

Running count and count distinct on many rows (tens of thousands)

I'm trying to run this query:
SELECT
COUNT(events.event_id) AS total_events,
COUNT(matches.fight_id) AS total_matches,
COUNT(players.fighter_id) AS total_players,
COUNT(DISTINCT events.organization) AS total_organizations,
COUNT(DISTINCT players.country) AS total_countries
FROM
events, matches, players
These are table details:
Events = 21k
Players = 90k
Matches = 155k
All of those are unique, so the query's first three counts should simply be those numbers. The other two values are total_organizations, based on the organization column in the events table (should return a couple hundred), and total_countries, which should count distinct countries using the country column in the players table (also a couple hundred).
All three of those ID columns are unique and indexed.
This query as it stands takes forever; I never even have the patience to let it complete. Is there a faster way of doing this?
Also, I need these results on every page load, so should I just put this query in a script, set a cron job to run it at midnight, and populate a "totals" table so I can retrieve the numbers from that table quickly?
Thanks!
First, remove the unnecessary join here; with no join condition, the three tables form a Cartesian product (21k × 155k × 90k rows), which is why the query never finishes and why your indexes go unused. You want three different queries:
SELECT
COUNT(events.event_id) AS total_events,
COUNT(DISTINCT events.organization) AS total_organizations
FROM
events;
SELECT
COUNT(matches.fight_id) AS total_matches
FROM
matches;
SELECT
COUNT(players.fighter_id) AS total_players,
COUNT(DISTINCT players.country) AS total_countries
FROM
players;
This should go a long way to improving the performance of these queries.
Now, consider adding these indexes:
CREATE INDEX "events_organization" ON events (organization);
CREATE INDEX "players_country" ON events (country);
Compare the EXPLAIN SELECT ... results before and after adding these indexes. They might help and they might not.
Note that if you are using the InnoDB storage engine then all table rows will be visited anyway, to enforce transactional isolation. In this case, indexes will only be used to determine which table rows to visit. Since you are counting the entire table, the indexes will not be used at all.
If you are using MyISAM, which does not support MVCC, then a COUNT(*) query can be answered from the table's stored row count, resulting in nearly instant results. This is possible because transactions are not supported on MyISAM, which means that isolation becomes a non-issue.
So if you are using InnoDB, then you may wind up having to use a cronjob to create a cache of this data anyway.
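A minimal sketch of that cron cache, assuming a summary table (the name totals and its columns are made up):
CREATE TABLE totals (
stat VARCHAR(32) PRIMARY KEY,
value BIGINT UNSIGNED NOT NULL
);
REPLACE INTO totals (stat, value)
SELECT 'total_events', COUNT(*) FROM events;
REPLACE INTO totals (stat, value)
SELECT 'total_organizations', COUNT(DISTINCT organization) FROM events;
The page then does a single primary-key lookup per number, and the cron job repeats the REPLACE for the other three counts.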