I am trying to understand the compaction feature of Couchbase. I would also like to know the best time to compact my bucket and why compacting a bucket is necessary.
Couchbase uses an append-only file for writing data to disk. Since the file is append-only, every write makes it larger, whether you are adding new data or updating existing data.
If you keep writing to an append-only file, you will eventually run out of disk space unless you reclaim space by removing the no-longer-used portions of the file. This process is called compaction. Below is a simple example of how compaction works.
Imagine an append-only file that holds key-value data:
key1, value1
key2, value2
key3, value3
If you update key1, the file will look like this:
key1, value1
key2, value2
key3, value3
key1, value4
As you can see, the file grew because of the update. After the compaction process runs, the file would look like this:
key2, value2
key3, value3
key1, value4
This is a very simplified example; in real append-only data stores the compaction process is much more involved.
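To make the keep-only-the-latest-value idea concrete, here is a minimal, hypothetical Java sketch of compacting an in-memory append-only log. It is illustration only, not Couchbase code; the Entry record and compact method are invented for the example.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class CompactionSketch {
        // One appended record in the "file".
        record Entry(String key, String value) {}

        // Keep only the latest value per key, in the order the surviving
        // records would be rewritten to the compacted file.
        static List<Entry> compact(List<Entry> appendOnlyLog) {
            Map<String, String> latest = new LinkedHashMap<>();
            for (Entry e : appendOnlyLog) {
                latest.remove(e.key());         // drop the stale record
                latest.put(e.key(), e.value()); // re-append at the tail
            }
            return latest.entrySet().stream()
                    .map(en -> new Entry(en.getKey(), en.getValue()))
                    .toList();
        }

        public static void main(String[] args) {
            List<Entry> log = List.of(
                    new Entry("key1", "value1"),
                    new Entry("key2", "value2"),
                    new Entry("key3", "value3"),
                    new Entry("key1", "value4")); // the update that grew the file
            compact(log).forEach(e -> System.out.println(e.key() + ", " + e.value()));
            // Prints: key2, value2 / key3, value3 / key1, value4
        }
    }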
In Couchbase it is recommended that you schedule compaction to run at night (or at whatever time your application has the lowest usage), because compaction is a disk-intensive task. If you cannot run compaction at night, it will also start automatically once a file reaches a certain amount of fragmentation (unused data). At the end of the day it really depends on your deployment and workload characteristics, but most people find that the Couchbase defaults work fine for them.
I have used the SSIS Balanced Data Distributor (BDD) just to move 50,000 records from an OLE DB Source to OLE DB Destinations.
When I don't use BDD it takes 2 minutes 40 seconds; when I use BDD it takes 1 minute 55 seconds, which is not that much of a difference.
I also find that the data does not load into the destinations in parallel: it loads into the first destination and only later fills the next one (one at a time). Can anyone help me get them to load in parallel?
Balanced Data Distributor is not a silver bullet for performance and runtime. It is good when:
You have CPU-intensive transformations, and can benefit from parallel execution
Your destination supports concurrent insert
The first case is clear and depends on your dataflow. As for concurrent insert on the OLE DB Destination: the best results are on heap tables, i.e. tables without a primary key/clustered index and without other indexes as well; alternatively, the clustered key has to be defined on an auto-incremented surrogate key. On the OLE DB Destination you might need to disable table lock, since otherwise it could prevent the insert from being parallel. But check for yourself, as written in Mark's answer - sometimes parallel insert works with table lock, but on a heap table or columnstore.
Other table types (with indexes, clustered or not) might escalate locks to table level or require index maintenance, effectively disabling parallel insert. Drop or disable those indexes for the duration of the load if you can.
So you have to evaluate for yourself whether the parallel execution justifies the additional effort in development and support.
When you are using BDD to insert into the same table, you will only get parallel inserts if the table is a heap (no clustered index) and does not have a unique constraint.
If TABLOCK is turned on for all the destinations, SQL Server will take a special bulk update (BU) lock, which allows parallel inserts into the same heap.
If there are no other indexes on the table and the database is NOT in full recovery model, you will gain the additional benefit of minimal logging.
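Outside of SSIS, the same lock behaviour can be sketched with two plain sessions. The following JDBC sketch is only an illustration of the heap + TABLOCK idea; the connection string, dbo.OrdersHeap and dbo.OrdersSource are hypothetical, and in SSIS itself you would simply tick Table lock on each OLE DB Destination.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ParallelHeapLoad {
        // Hypothetical connection string and object names.
        static final String URL =
                "jdbc:sqlserver://localhost;databaseName=Staging;integratedSecurity=true";

        static Runnable load(int shardId) {
            return () -> {
                try (Connection con = DriverManager.getConnection(URL);
                     Statement st = con.createStatement()) {
                    // Heap target + TABLOCK: each session takes a BU lock,
                    // and BU locks are compatible with each other, so the two
                    // sessions can insert at the same time (and, outside the
                    // full recovery model, with minimal logging).
                    st.executeUpdate(
                        "INSERT INTO dbo.OrdersHeap WITH (TABLOCK) " +
                        "SELECT * FROM dbo.OrdersSource WHERE ShardId = " + shardId);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            };
        }

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(load(1));
            Thread t2 = new Thread(load(2));
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }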
As you noted in your question, using BDD saved you about 45 seconds - it's actually working. You'll probably see different performance when running on a server that has more cores and memory, so be sure to test there as well. The measure of success will be total duration, rather than what Visual Studio displays in its debugger.
Planning for server execution, it would also be helpful to increase these two properties on the data flow:
- DefaultBufferMaxRows (try adding a zero to bring it to 100,000)
- DefaultBufferSize (add a zero to max out the memory per buffer)
If you're on SQL Server 2016, you can instead set AutoAdjustBufferSize to true, which sizes the buffer automatically for the best performance (DefaultBufferSize is then ignored).
These adjustments will increase the commit size of the inserts which will result in faster writes to some degree.
The bottom line: keep TABLOCK on and test on the server. BDD is working.
I have a huge amount of data which is loaded from an ETL tool into the database. Sometimes the ETL tool generates some unusual data and puts it into a table: say, for simplicity, I want to load 5 correct rows and end up with 10 rows in my database, so I detect the inconsistency.
To bring the data back to the state I want, I had to TRUNCATE the schema in the MySQL database and INSERT the data from the ETL tool again under my control. In this case everything looks fine, but it takes too much time to reload the data.
I investigated this issue and found out that DELETEing the data and INSERTing it again takes much more time than, for example, using INSERT ... ON DUPLICATE KEY UPDATE. So I don't need to delete all the data; I can just check and update it when necessary, which will save load time.
I want to use this query, but I am a little bit confused because of these additional 5 wrong rows that are already sitting in my database. How can I remove them without deleting everything from my table before inserting?
As you mention: "Sometimes the ETL tool generates some unusual data and puts it into a table."
You need to investigate your ETL code and correct it. It's not supposed to generate any data; an ETL tool only transforms your data according to the rules you define. Focus on the ETL code rather than the MySQL database.
To me that sounds like there's a problem in the dataflow setup in your ETL tool. You don't say what you are using, but I would go back over the selection criteria and review which fields you are selecting and what your WHERE criteria are. Perhaps something in your WHERE statements is causing the extra data.
As for the INSERT ... ON DUPLICATE KEY UPDATE syntax, be careful with an AUTO_INCREMENT column on an InnoDB table: the INSERT branch increases the auto-increment value but the UPDATE branch does not. Also check that your table doesn't have multiple unique indexes, because if the new row matches several existing rows (say a=1 matches one row and b=2 another), only one of them will be updated. (MySQL 5.7, see the reference manual: https://dev.mysql.com/doc/refman/5.7/en/ .)
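For reference, a minimal JDBC sketch of the upsert being discussed might look like the following; the etl_rows table, its columns, and the connection details are made up for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class UpsertSketch {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/etl";
            String sql = "INSERT INTO etl_rows (id, payload) VALUES (?, ?) "
                       + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)";
            try (Connection con = DriverManager.getConnection(url, "etl_user", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, 42L);
                ps.setString(2, "corrected value");
                // MySQL reports 1 affected row for a plain insert and 2 when an
                // existing row was updated, which is handy for sanity checks.
                System.out.println("affected rows: " + ps.executeUpdate());
            }
        }
    }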
If you find that your ETL tool is not providing enough flexibility, you could investigate other options; there are good articles comparing ETL tools.
I have a scenario where I want to invalidate a write operation. The scenario is briefly described below:
There are two processes running on entirely different systems, and we have one common database. Call the processes 'P1' and 'P2'. Each process can do read (R) and write (W) operations; R1 and W1 denote operations done by P1, and R2 and W2 those done by P2. Let 'O' be a common DB object.
Now the operations execute in the following sequence:
R1 (read 'O' by process 'P1')
R2 (read 'O' by process 'P2')
W1 (write 'O' by process 'P1') -> this makes P2's copy of 'O' stale.
Now what I want is for P2 to fail when it performs its W2 operation, since it holds an old, inconsistent copy of the object.
I have read a few blogs about checking a timestamp before persisting, but this is not the solution (it is error prone even when the difference is a matter of milliseconds).
I want to know an enterprise-level solution.
I would also like to know how this can be achieved using a third-party solution like Hibernate.
You need optimistic locking. In Hibernate you get optimistic locking by adding a separate numeric field annotated with @Version to each entity. Every successful update increments this field by one, and stale data (with a lower version value) is prevented from being persisted to the database.
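A minimal sketch of such an entity, assuming plain JPA annotations; the class and field names are illustrative, and on newer Hibernate versions the imports come from jakarta.persistence instead of javax.persistence.

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Version;

    @Entity
    public class SharedObject {

        @Id
        private Long id;

        private String payload;

        // Hibernate adds this column to the WHERE clause of its UPDATE and
        // increments it on every successful flush. If another process has
        // already committed a newer version, the UPDATE matches zero rows and
        // Hibernate throws an OptimisticLockException (StaleObjectStateException
        // underneath), which is exactly the "fail W2" behaviour asked for.
        @Version
        private long version;

        // getters and setters omitted for brevity
    }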
I have a system with two processes, one of which does a single insert, and the other a bulk insert. Obviously the second process is faster, and I'm working on migrating the first process to a bulk insert mechanism, but I was stumped this morning by a question from a colleague about "why bulk insert would be faster than single inserts".
So indeed, why is bulk insert faster than single insert?
Also, are there differences between bulk and single inserts in MySQL and HBase, given that their database architectures are completely different? I am using both for my project, and am wondering if there are differences in the bulk and single inserts for these two databases.
As far as I know, this also depends on the HBase configuration. Normally a bulk insert means using a List of Puts together; in that case the insert (the flush in the HBase client layer) happens when you call table.put for the whole list. Single inserts might each wait for other insert calls so that a batch flush can be done in the middle layer. However, this depends on the configuration as well.
Another reason may be the nature of the task itself: Map and Reduce are more efficient when there is more work per job. The movement of file chunks is decided once for all the inputs, whereas with individual inserts this becomes a crucial cost paid each time.
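As a hedged illustration of the "List of Puts" idea, a batched write with the standard HBase client might look like this; the table name, column family, and row count are placeholders.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedPuts {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {

                List<Put> batch = new ArrayList<>();
                for (int i = 0; i < 10_000; i++) {
                    Put put = new Put(Bytes.toBytes("row-" + i));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
                                  Bytes.toBytes("payload-" + i));
                    batch.add(put);
                }
                // One client call: the puts are grouped per region server, so
                // the network round-trip cost is paid per batch rather than
                // per row - much of why "bulk" feels faster here.
                table.put(batch);
            }
        }
    }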
In short: a bulkload operation bypasses the regular write path. That's why it is fast.
So what happens during the normal write process, when you do a simple row-by-row put operation?
All the data is written to the WAL and the memstore, and when the memstore is full the data is flushed to a new HFile. With bulkload, however, the StoreFiles are written directly into the running HBase cluster - no intermediate steps.
Quick tip: if you don't want to use bulkload (it is often done in short bursts, which puts an additional burden on the cluster), you can turn off writing to the WAL with Put.setWriteToWal(false) to save some time.
But this will increase your chances of data loss.
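On older HBase clients that was Put.setWriteToWal(false); on current clients the equivalent knob is setDurability(Durability.SKIP_WAL). A hedged sketch follows; the row key, column family, and qualifier are placeholders.

    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SkipWalPut {
        public static Put fastButUnsafePut(String rowKey, String value) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
                          Bytes.toBytes(value));
            // Skip the write-ahead log: faster writes, but any rows that are
            // only in the memstore are lost if the region server crashes
            // before the next flush.
            put.setDurability(Durability.SKIP_WAL);
            return put;
        }
    }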
I have an SSIS package which I run using a SQL Agent job to bulk copy data from one database to another. The destination is our integration server, where we have enough space for the database. But when I run this job (i.e. the package), it creates a huge number of temp files in the Local Settings/Temp folder: for a 1 GB mdf file it creates some 20 GB of temp files. I created this package manually and did not use the Import/Export Wizard. Can anyone help me avoid these huge temp files during execution? If any further details are needed, please mention it.
Note: many have said this happens if you create a package with the Import/Export Wizard and set "optimize for many tables" to true. But in this package I query only one table, and I created it manually without using the wizard.
Why is the package creating temp files?
SSIS is an in-memory ETL solution, except when it can't keep everything in memory and begins swapping to disk.
Why would restructuring the package as @Jeff Hornby suggested help?
Fully and partially blocking transformations force memory copies in your data flow. Assume you have 10 buffers carrying 1 MB of data each. When you use a blocking transformation, as those buffers arrive at the transformation the data has to be copied from one memory location to another. You've now doubled your package's total memory consumption: you have 10 MB of data in use before the Union All transformation and then another 10 MB after it.
Only use columns that you need. If a column is not in your destination, don't add it to the data flow. Use the database to perform sorts and merges. Cast your data to the appropriate types before it ever hits the data flow.
What else could be causing temp file usage?
Lookup transformations. I've seen people crush their ETL server when they use SELECT * FROM dbo.BillionRowTable when all they needed was one or two columns for the current time period. The default behaviour of a lookup operation is to execute that source query and cache the results in memory. For large tables, wide and/or deep, this can make it look like your data flow isn't even running as SSIS is busy streaming and caching all of that data as part of the pre-execute phase.
Binary/LOB data. Have an (n)varchar(max)/varbinary(max) or classic BLOB data type in your source table? Sorry, that's not going to be in memory. Instead, the data flow is going to carry a pointer along and write a file out for each one of those objects.
Too much parallel processing. SSIS is awesome in that you get free parallelization of your processing. Except you can have too much of a good thing. If you have 20 data flows all floating in space with no precedence between them, the Integration Services engine may try to run all of them at once. Add a precedence constraint between them, even if it's just on completion (on success/on failure), to force some serialization of operations. Inside a data flow, you can introduce the same challenge by having unrelated operations going on. My rule of thumb is that starting at any source or destination, I should be able to reach all the other sources/destinations.
What else can I do?
Examine what else is using memory on the box. Have you set a sane (non-default) maximum memory value for SQL Server? SSIS likes RAM like a fat kid loves cake, so you need to balance the memory needs of SSIS against those of the database itself; they have completely separate memory spaces.
Each data flow has the ability to set the BufferTempStoragePath and BlobTempStoragePath properties. Take advantage of this and point them at a drive with sufficient storage.
Finally, add more RAM. If you can't make the package better by doing the above, throw more hardware at it and be done.
If you're getting that many temp files, then you probably have a lot of blocking transforms in your data flow. Try to eliminate the following types of transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Row Sampling, Sort, Term Extraction. Partially blocking transformations can create the same problem, though not on the same scale: Data Mining Query, Merge, Merge Join, Pivot, Term Lookup, Union All, Unpivot. You might want to minimize these transformations as well.
Probably the problem is a Sort transformation somewhere in your data flow (this is the most common cause). You might be able to eliminate it by using an ORDER BY clause in your SQL statement. Just remember to set the IsSorted property on the source's output (and SortKeyPosition on the sorted columns) so the data flow knows the data arrives pre-sorted.
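For illustration, the source query might simply carry the sort; the table and column names below are hypothetical, and the IsSorted/SortKeyPosition settings themselves are made in the source's Advanced Editor, not in the SQL.

    public final class SortedSourceQuery {
        // Use something like this as the OLE DB Source's SQL command text,
        // then set IsSorted = true on the source output and SortKeyPosition = 1
        // on BusinessKey so downstream Merge/Merge Join transforms trust the order.
        public static final String SQL =
                "SELECT BusinessKey, Amount, LoadDate "
              + "FROM dbo.StagingOrders "
              + "ORDER BY BusinessKey";

        private SortedSourceQuery() { }
    }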