As part of an ETL process, I am attempting to read all document key ids from a Couchbase bucket (Couchbase Enterprise 4.5). This bucket may have tens of millions of documents. To test this (at this stage I'm just trying to see if this approach will be quick enough for our needs), I am setting a large serverSideTimeout value using code like this:
final N1qlParams n1qlParams = N1qlParams.build().serverSideTimeout(1L, TimeUnit.DAYS);
aBucket.query(N1qlQuery.simple("select meta(b).id from `my_bucket` as b", n1qlParams));
This starts to execute, and my subscriber starts getting ids back from the query, but then I get this error:
{"msg":"Index scan timed out - cause: Index scan timed out","code":12015}
I'm not surprised that it needs to do an index scan, as I am in fact trying to read out everything in the primary index. The obvious related question about "Index scan timed out" mentions in its comments that there is a setting to adjust the index scan timeout, but I can't find where that setting lives. I've looked in the N1qlParams object, in the CouchbaseEnvironment, and in the Index Settings section of the cluster settings in the Couchbase Admin UI, and I can't find it anywhere. How do I set the index scan timeout to a longer value for queries where I expect a full index scan?
As found in a Couchbase forum post, one needs to send an HTTP POST to http://<server>:9102/settings with content {"indexer.settings.scan_timeout": <new_timeout_in_milliseconds>}.
It appears that there are many of these low-level index service settings that can be configured using this /settings page; sending a GET will retrieve them all with their current values.
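For illustration, here is a minimal Java sketch of that POST. The localhost host name and the one-day timeout value are assumptions (the latter to match the serverSideTimeout above), and depending on your cluster configuration you may also need to supply admin credentials.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RaiseIndexScanTimeout {
    public static void main(String[] args) throws Exception {
        // Assumes the index service node is reachable on localhost; adjust the host as needed.
        URL url = new URL("http://localhost:9102/settings");
        // One day in milliseconds, mirroring the serverSideTimeout used for the N1QL query.
        String body = "{\"indexer.settings.scan_timeout\": 86400000}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}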
There are a few large tables in one of a customer's databases (each table is ~50M rows and not too wide). The intent is to read these tables completely, but infrequently. As there are no reasonable CDC indices present, the plan is to read the tables by querying them directly:
SELECT * from large_table;
The reads will be performed using a JDBC driver. With the following fetch configuration, the intent is to stream the data roughly one record at a time (even if that takes a significant amount of time) so that the client code is never overwhelmed.
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
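For context, a minimal sketch of the full streaming read might look like the following. The connection URL, credentials, and table name are placeholders; with MySQL Connector/J, the Integer.MIN_VALUE fetch size is what switches the driver into row-by-row streaming.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LargeTableStream {
    public static void main(String[] args) throws Exception {
        String queryString = "SELECT * FROM large_table";
        try (Connection connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {
            PreparedStatement stmt = connection.prepareStatement(
                    queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            // Connector/J streams rows one at a time when the fetch size is Integer.MIN_VALUE.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Process one row at a time; the driver does not buffer the whole result set.
                    process(rs);
                }
            }
        }
    }

    private static void process(ResultSet rs) {
        // placeholder for per-row handling
    }
}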
I went through the execution path of a query in High Performance MySQL, but some questions remain unanswered:
1. With no temporary tables explicitly created and the query cache not in use, how are the streamed reads tracked on the server?
2. Is any temporary data created (in main memory or in files on disk) at all? If so, where is it created and how much?
3. If no temporary data is created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? If several such queries are running on the server, are the earliest tracked files purged in favor of more recently submitted queries?
PS: I want to understand the effect of this approach on the MySQL server (I'm not saying there aren't better ways of reading the tables).
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the details of how to specify that in JDBC.
You may want to page through the table. If so, don't use OFFSET, but use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
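A rough JDBC sketch of that "remember where you left off" pattern, assuming an auto-increment PRIMARY KEY column named id (the table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class KeysetPager {
    // Reads the table in chunks of batchSize, ordered by the primary key,
    // resuming each chunk from the last id seen instead of using OFFSET.
    static void readAll(Connection connection, int batchSize) throws SQLException {
        long lastId = 0;  // assumes ids are positive
        String sql = "SELECT * FROM large_table WHERE id > ? ORDER BY id LIMIT ?";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            while (true) {
                stmt.setLong(1, lastId);
                stmt.setInt(2, batchSize);
                int rows = 0;
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");  // remember where we left off
                        rows++;
                        // ... process the row ...
                    }
                }
                if (rows < batchSize) {
                    break;  // reached the end of the table
                }
            }
        }
    }
}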
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (e.g., a power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.
I have a big table which saves data with an ID based on input from an external API. The ID is stored in an INT field. When I developed the system I encountered no problems, because the IDs of records in the external API were always below 2147483647.
The system has been fetching data from the API for the last few months, and apparently the IDs crossed the 2147483647 mark. I now have a database with thousands of unusable records with ID 2147483647.
It is not possible to fetch this information from the database again (basically, the API allows us to look up data from max x days ago).
I am pretty sure that I am doomed. But might there be any backlog, or any other way, to retrieve the original input queries, or numbers that were truncated by MySQL to fit in the int field?
As already discussed in the comments, there is no way to retrieve the information from the table. The out-of-range values were silently(?!!!) clamped to the 32-bit INT maximum, 2147483647.
First, call the API provider, explain your situation, and see if you can redo the queries. Best that happens is they say yes and you don't have to try to reconstruct things from logs. Worst that happens is they say no and you're back where you are now.
Then there are some logs I would check.
First is the MySQL General Query Log. If you had this turned on, it may contain the queries which were run. Another possibility is the Slow Query Log, which is more often enabled, and would help if your queries happened to be slow.
In MySQL, data truncation is a warning by default. It's possible those warnings went into a log and included the original data. The MySQL Error Log is one possibility. On Windows it may have gone into the Windows Event Log. On a Mac, it might be in a log visible to the Console. In Unix, it might have gone to syslog.
Then it's possible the API queries themselves are logged somewhere. If you used a proxy it might contain them in its log. The program fetching from the API and adding to the database may also have its own logs. It's a long shot.
As a last resort, try grepping all of /var/log and /var/local/log and anywhere else you might think could contain a log.
In the future there are some things you can do to prevent this sort of thing from happening again. The most important is to turn on strict SQL mode. This turns warnings, such as out-of-range data being truncated, into errors.
Set UNIQUE constraints on unique columns. Had your API ID column been declared UNIQUE the error would have been detected.
Use UNSIGNED BIGINT for numeric IDs. 2 billion is a number easily exceeded these days. It will mean 4 extra bytes per row or about 8 gigabytes extra to store 2 billion rows. Disk is cheap.
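To make the first three suggestions concrete, here is a rough sketch of what the fixes might look like over JDBC. The table and column names (my_table, api_id) are made up, and setting sql_mode in my.cnf is the more permanent fix than a session variable.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HardenSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement st = conn.createStatement()) {

            // Turn out-of-range warnings into hard errors for this session.
            st.execute("SET SESSION sql_mode = 'STRICT_ALL_TABLES'");

            // Widen the API id column so it can hold values past 2147483647.
            st.execute("ALTER TABLE my_table MODIFY api_id BIGINT UNSIGNED NOT NULL");

            // Make silent duplicates impossible
            // (only after the duplicated 2147483647 rows have been cleaned up).
            st.execute("ALTER TABLE my_table ADD UNIQUE KEY uk_api_id (api_id)");
        }
    }
}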
Consider turning on ANSI SQL mode. This will disable a lot of MySQL extensions and make your SQL more portable.
Finally, consider switching to PostgreSQL. Over the years MySQL has accumulated a lot of bad ideas, mish-mashes of functions, and bad default behaviors. You just got bit by one. PostgreSQL is far better designed, more powerful and flexible, and usually as fast or faster.
In Postgres, you would have gotten an error.
test=# CREATE TABLE foo ( id INTEGER );
CREATE TABLE
test=# INSERT INTO foo (id) VALUES (2147483648);
ERROR: integer out of range
If you have binary logging enabled, you still have backups of the binlogs, and your binlog_format is not set to ROW, then your original INSERT and/or UPDATE statements should be preserved there. You could extract them and replay them into another server with a more appropriate table definition.
If you don't have the binlog enabled and/or you aren't archiving the binlogs in perpetuity... this is one of the reasons why you should consider doing it.
I am creating an ASP.NET *MVC* application using EF Code First. I had been using SQL Azure as my database, but it turns out SQL Azure is not reliable, so I am thinking of using MySQL/PostgreSQL for the database.
I wanted to know the repercussions/implications of using EF Code First with MySQL/PostgreSQL with regard to performance.
Has anyone used this combo in production or knows anyone who has used it?
EDIT
I keep getting the following exceptions with SQL Azure.
SqlException: "*A transport-level error has occurred when receiving results from the server.*
(provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)"
SqlException: *"Database 'XXXXXXXXXXXXXXXX' on server 'XXXXXXXXXXXXXXXX' is not
currently available. Please retry the connection later.* If the problem persists, contact
customer support, and provide them the session tracing ID of '4acac87a-bfbe-4ab1-bbb6c-4b81fb315da'.
Login failed for user 'XXXXXXXXXXXXXXXX'."
First, your problem seems to be a network issue, perhaps with your ISP. If you move to a remote PostgreSQL or MySQL database, I think you will run into the same problems.
Secondly, comparing MySQL and PostgreSQL performance is relatively tricky. In general, MySQL is optimized for primary-key lookups, while PostgreSQL is optimized more generally for complex use cases. This may be a bit low-level but....
MySQL InnoDB tables are basically B-tree indexes where the leaf node includes the table data. The primary key is the key of the index. If no primary key is provided, one will be created for you. This means two things:
select * from my_large_table will be slow as there is no support for a physical order scan.
Select * from my_large_table where secondary_index_value = 2 requires two index traversals, since the secondary index can only refer to the primary key values.
In contrast, a lookup by primary key value will be faster than on PostgreSQL, because the index itself contains the data.
PostgreSQL, by comparison, stores information in an unordered way in a series of heap pages, with the indexes separate from the data. If you want to pull a row by primary key, you scan the index, then read the data page where the row is found, and then pull the data. By comparison, pulling from a secondary index is not any slower. Additionally, the tables are structured so that sequential disk access is possible, so a long select * from my_large_table lets the operating system's read-ahead cache speed performance significantly.
In short, if your queries are simple, joinless selections by primary key, then MySQL will give you better performance. If you have joins and such, PostgreSQL will do better.
Consider the following situation: You want to update the number of page views of each profile in your system. This action is very frequent, as almost every visit to your website results in a page view increment.
The basic way is update Users set page_views=page_views+1. But this is far from optimal, because we don't really need instant updates (an hour late is OK). Is there any other way in MySQL to postpone a sequence of updates and make the cumulative updates at a later time?
I tried another method myself: storing a counter (the number of increments) for each profile in a file. But this means handling a few thousand small files, and I think the disk I/O cost (even with a deep tree structure for the files) would probably exceed that of the database.
What is your suggestion for this problem (other than MySQL)?
To improve performance you could store your page view data in a MEMORY table - this is very fast but temporary: the table only persists while the server is running, and on restart it will be empty...
You could then create an EVENT to update a table that will persist the data on a timed basis. This would help improve performance a little with the risk that, should the server go down, only the number of visits since the last run of the event would be lost.
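A rough sketch of that idea, with made-up table and column names (page_view_buffer, Users.id), assuming the event scheduler is enabled and the connecting user has the EVENT privilege:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PageViewBuffer {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement st = conn.createStatement()) {

            // 1. A MEMORY table that accumulates the raw page-view counts.
            st.execute("CREATE TABLE page_view_buffer (" +
                       "  user_id INT UNSIGNED NOT NULL PRIMARY KEY," +
                       "  hits INT UNSIGNED NOT NULL DEFAULT 1" +
                       ") ENGINE=MEMORY");

            // 2. Each page view touches only the in-memory table.
            st.execute("INSERT INTO page_view_buffer (user_id) VALUES (42) " +
                       "ON DUPLICATE KEY UPDATE hits = hits + 1");

            // 3. An EVENT that folds the buffered counts into Users once an hour.
            //    Requires event_scheduler = ON and the EVENT privilege. The UPDATE and
            //    DELETE are not atomic, so a hit landing between them could be lost -
            //    acceptable given the stated tolerance.
            st.execute("CREATE EVENT flush_page_views ON SCHEDULE EVERY 1 HOUR DO BEGIN " +
                       "  UPDATE Users u JOIN page_view_buffer b ON b.user_id = u.id " +
                       "    SET u.page_views = u.page_views + b.hits; " +
                       "  DELETE FROM page_view_buffer; " +
                       "END");
        }
    }
}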
The link posted by James in the comment on your question, which leads to an accepted answer with a further comment about memcached, was my first thought also. Store the profileIds in memcached, then set up a cron job to run every 15 minutes, grab all the entries, and issue the updates to MySQL in a batch. There are a few things to consider, though:
1. When you run the batch script to grab the ids out of memcached, you will have to ensure you remove all entries which have been parsed; otherwise you run the risk of counting the same profile views multiple times.
2. Because memcached doesn't support wildcard searching on keys, and because you will have to purge existing keys for the reason stated in #1, you will probably have to set up a separate memcached server pool dedicated solely to tracking profile ids, so you don't end up purging cached values that have no relation to profile view tracking. You could avoid this by storing the profileId and a timestamp in the value payload, then having your batch script step through each entry and check the timestamp: if it's within the time range you specified, add it to the queue to be updated; once you hit the upper limit of your time range, the script stops.
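Here is a rough sketch of that batch job. The CacheClient interface is a hypothetical stand-in for whatever memcached client you use - in particular, listKeys() glosses over the fact that memcached cannot enumerate keys, which is exactly why a dedicated pool (or a separately maintained key list) is suggested above. The key scheme, payload format, and table/column names are all assumptions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class ProfileViewFlusher {

    // Hypothetical stand-in for the real memcached client; only the calls the batch job needs.
    interface CacheClient {
        List<String> listKeys();   // keys are profile ids, e.g. "1234" (purely illustrative)
        String get(String key);    // payload assumed to be "<count>:<lastSeenMillis>"
        void delete(String key);   // purge once flushed so views are not counted twice
    }

    // Run from cron every 15 minutes: fold buffered counts into MySQL in one batch.
    static void flush(CacheClient cache, Connection db, long windowStartMillis) throws Exception {
        String sql = "UPDATE Users SET page_views = page_views + ? WHERE id = ?";  // names assumed
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            for (String key : cache.listKeys()) {
                String[] payload = cache.get(key).split(":");
                long count = Long.parseLong(payload[0]);
                long lastSeen = Long.parseLong(payload[1]);
                if (lastSeen < windowStartMillis) {
                    continue;  // older than the window this run is responsible for
                }
                stmt.setLong(1, count);
                stmt.setLong(2, Long.parseLong(key));
                stmt.addBatch();
                cache.delete(key);  // remove the entry so the next run doesn't recount it
            }
            stmt.executeBatch();
        }
    }
}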
Another option may be to parse your access logs. If user profiles live at a known location like /myapp/profile/1234, you could parse for this pattern and add profile views that way. I had to go this route for advertiser tracking, as it turned out to be the only repeatable way to generate billing numbers. If there was a billing dispute, we would offer to send the advertiser the access logs so they could parse them themselves.
I have MySQL Proxy running, which takes a query, performs an MD5 on it, and caches the result into memcached. The problem occurs when an update happens in the Rails app that should invalidate that cache. Any ideas on how to invalidate all of the proper keys in the cache at that time?
The core of the problem is that you don't know what the key is, since it is MD5-generated.
However, you can mitigate the problem by not storing the full result data for that query.
Your query may look like this: "SELECT my_data.* FROM my_data WHERE conditions"
Instead, you can reduce the redundancy of the data by using this query:
SELECT my_data.id FROM my_data WHERE conditions
Which is then followed up by
Memcache.mget( ids )
This won't prevent the return of data that no longer matches the conditions, but it may mitigate returning stale data.
--
Another option is to look into using namespaces. See here:
http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Namespacing
You can namespace all of your major queries. You won't be able to delete the keys, but you can change the key version id, which will in effect expire your data.
Logistically messy, but you could use it on a few bad queries.
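A small sketch of that versioned-namespace idea, again with a hypothetical CacheClient standing in for the real memcached client and made-up key names:

public class NamespacedCache {

    // Hypothetical stand-in for the real memcached client.
    interface CacheClient {
        String get(String key);
        void set(String key, String value);
        long incr(String key, long by);  // atomic increment, returns the new value
    }

    private final CacheClient cache;

    NamespacedCache(CacheClient cache) {
        this.cache = cache;
    }

    // Every cached query key is prefixed with the current namespace version,
    // e.g. "users:v42:<md5-of-query>".
    String namespacedKey(String namespace, String queryMd5) {
        String version = cache.get("ns_version:" + namespace);
        if (version == null) {
            cache.set("ns_version:" + namespace, "1");
            version = "1";
        }
        return namespace + ":v" + version + ":" + queryMd5;
    }

    // Call this after an update that should invalidate the namespace: bumping the
    // version makes every key built with the old version unreachable, so the stale
    // entries simply age out on their own.
    void invalidate(String namespace) {
        cache.incr("ns_version:" + namespace, 1);
    }
}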
--
Lastly, you could store those queries in a different memcached server and flush it on a more frequent basis.