I created a console application which sends data from a SQL Server database to RavenDB.
I have a freakish amount of data to transfer, so it's taking an incredibly long time.
(1,000,000 rows takes RavenDB about 2 hours to store)
RavenDB takes longer to store the data than the console application takes to collect it from SQL Server.
Is there any way to speed up the transfer or perhaps an existing tool which does this already?
using (var session = this._store.OpenSession())
{
    // row.Count is never more than 1024
    var i = 0;
    while (i < row.Count)
    {
        session.Store(row[i]);
        i++;
    }
    session.SaveChanges();
}
Could you post the code where you insert into RavenDB? That is likely where the bottleneck lies. You should be making requests concurrently.
Setting:
HttpJsonRequest.ConfigureRequest += (e,x)=>((HttpWebRequest)x.Request).UnsafeAuthenticatedConnectionSharing = true;
should help, as should processing your insert records in batches rather than one at a time.
As for insert performance, you'll likely never match SQL Server, as RavenDB is optimized for reads over writes.
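For example, here is a rough sketch of batching the transfer; the session/SaveChanges pattern is the standard RavenDB client API, while GetNextBatch and MyDocument are hypothetical placeholders for your own SQL-reading code and document type:
// Sketch only: store documents in batches, one session per batch, so each
// SaveChanges() sends the whole batch to RavenDB in a single request.
const int batchSize = 1024;
List<MyDocument> batch;
while ((batch = GetNextBatch(batchSize)).Count > 0)
{
    using (var session = this._store.OpenSession())
    {
        foreach (var doc in batch)
        {
            session.Store(doc);
        }
        session.SaveChanges(); // one round trip per batch instead of per document
    }
}
Running a few of these batches concurrently (e.g. with Task.Run) is where the connection-sharing setting above should start to pay off.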
I am working with Google BigQuery for the first time on a client project and have created packages in SSIS to insert data into tables (an odd combination but one required by my client), using an SSIS plugin (CData).
I am looking to insert around 100k rows into a BigQuery table; however, when I try to run further update queries on this table, they cannot be performed because the data is still in the buffer. How does one know how long this will take in BigQuery, and are there ways to speed up the process?
It doesn't matter if the data is still in the buffer. If you query the table, the data in the buffer will be included too. Just one of the many awesome things about BigQuery.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert
A record that arrives in the streaming buffer will remain there for some minimum amount of time (minutes). During this period while the record is buffered, it's possible that you may issue a query that will reference the table. The Instant Availability Reader allows workers from the query engine to read the buffered records prior to being committed to managed storage.
data is still in the buffer. How does one know how long this will take in BigQuery?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table.
Data can take up to 90 minutes to become available for copy and export operations. See more in the documentation.
Meanwhile, keep in mind that tables that have been written to recently via BigQuery streaming (tabledata.insertAll) cannot be modified using UPDATE or DELETE statements. So, as stated above, expect to wait up to 90 minutes.
are there ways to speed up the process?
The only way in your case is to load data instead of streaming it. As I understand your setup, the data is in MS SQL Server, so you could potentially make your SSIS package batch-aware and load it batch by batch through Cloud Storage.
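For illustration only (outside SSIS), a Cloud Storage load job with the Google.Cloud.BigQuery.V2 .NET client looks roughly like the sketch below; the project, dataset, table and bucket names are placeholders, and you should verify the exact method names against the client version you use:
using Google.Cloud.BigQuery.V2;

// Sketch: load one staged CSV batch from Cloud Storage into an existing table.
// All identifiers are placeholders.
var client = BigQueryClient.Create("my-project-id");
var destination = client.GetTableReference("my_dataset", "my_table");

var job = client.CreateLoadJob(
    "gs://my-bucket/batch-0001.csv",          // batch file staged in Cloud Storage
    destination,
    schema: null,                             // table already exists, reuse its schema
    options: new CreateLoadJobOptions
    {
        SourceFormat = FileFormat.Csv,
        SkipLeadingRows = 1                   // skip the CSV header row
    });

job.PollUntilCompleted().ThrowOnAnyError();   // rows loaded this way are not in the streaming buffer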
This is just to explain how I think it probably works:
Let's say the web server needs data from 10 tables. The data that will finally be displayed on the client needs some kind of formatting, which can be done either on the database or on the web server. Let's say the time to fetch the raw data for one table is 1 sec and the time to fetch formatted data for one table is 2 sec (it takes one second to format the data for one table, and the formatting can be done equally easily on the web server or the database).
Let's consider the following cases for communication:
Case 1:
for(i = 0; i < 10; i++)
{
table[i].getDataFromDB(); //2 sec - gets formatted data from DB, call is completed before control goes to next statement
table[i].sendDataToClient(); //0 sec - control doesn't wait for completion of this step
}
Case 2:
for(i = 0; i < 10; i++)
{
table[i].getDataFromDB(); //1 sec - gets raw data from DB, Call is completed before control goes to next statement
table[i].formatData(); //0 sec - will be executed as a parallel process which takes 1 sec to complete (control moves to next statement before completion)
}
formatData()
{
//format the data which takes 1 sec
sendDataToClient(); //0 sec - control doesn't wait for completion of this step
}
Assume it takes no time (0 sec) to send the data from the web server to the client since it will be constant for both cases.
In case 1, the data for each table will be displayed at intervals of 2 seconds on the client, and the complete data will be on the client after 20 seconds.
In case 2, the data for the first table will be displayed after 2 sec, but the data for the next 9 tables will then be displayed at seconds 3, 4, ..., 11, so the complete data is on the client after 11 seconds.
Which is the correct approach, and how is this handled between popular web servers and databases?
Popular web servers and databases can work either way, depending on how the application is written.
That said, unless you have an extreme situation, you will likely find that the performance impact is small enough that your primary concern should instead be code maintainability. From this point of view, formatting the data in the application (which runs on the web server) is usually preferred, as it is usually harder to maintain business logic that is implemented at the database level.
Many web application frameworks will do much of the formatting work for you, as well.
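As a rough sketch of how the Case 2 style can be written in application code (all names here are hypothetical and not tied to any particular framework):
// Sketch of Case 2: while one table's data is being formatted and sent in the
// background, the loop already fetches the next table's raw data.
// tables, Format and SendToClient are placeholders.
async Task SendAllTablesAsync()
{
    var pending = new List<Task>();
    for (var i = 0; i < 10; i++)
    {
        var rawData = tables[i].GetRawDataFromDb();   // ~1 sec, blocking fetch
        pending.Add(Task.Run(() =>
        {
            var formatted = Format(rawData);          // ~1 sec, overlaps the next fetch
            SendToClient(formatted);
        }));
    }
    await Task.WhenAll(pending);                      // last table arrives around second 11
}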
I have Couchbase Community Edition v4, build 4047. Everything seemed to be great with it until I started issuing queries against a simple view. The view is just projecting the documents like so, which seems harmless:
function (doc, meta) {
    if (doc.applicationId) {
        emit(doc.applicationId, meta.id);
    }
}
I'm using the .Net client to connect and execute the query from my application, though I don't think that matters. It's a single node configuration. I'm clocking time in between the actual http requests and the queries are taking between 4 seconds up to over 2 minutes if I send something like 15 requests in at a time through Fiddler.
I am using a stale index to try and boost that time, but it doesn't seem to have much impact. The bucket is not very large. There are only a couple of documents in the bucket. I've allocated 100M RAM for indexing. I'd think that's fine for at least the few documents we're working with at the moment.
This is primarily local development, but we are observing similar behaviors when promoted to our servers. The servers don't use a significant amount of RAM either, but at the same time we aren't storing a significant amount of documents. We're only talking about 10 or 20 at the most? These documents only contain like 5 primitive-type properties.
Do you have some suggestions for diagnosing this? The logs through the couchbase admin console don't show anything unusual as far as I can tell and this doesn't seem like normal behavior.
Update:
Here is my code to query the documents:
public async Task ExpireCurrentSession(string applicationId)
{
using (var bucket = GetSessionBucket())
{
var query = bucket
.CreateQuery("activeSessions", "activeSessionsByApplicationId")
.Key(applicationId)
.Stale(Couchbase.Views.StaleState.Ok);
var result = await bucket.QueryAsync<string>(query);
foreach (var session in result.Rows)
{
await bucket.RemoveAsync(session.Value);
}
}
}
The code seems fine and should work as you expect. The 100MB of RAM you mention allocating actually isn't for views; it only affects N1QL global secondary indexes, which brings me to the following suggestion:
You don't need to use a view for this in Couchbase 4.0; you can use N1QL to do this more simply and (probably) more efficiently.
Create a N1QL index on the applicationId field (either in code or from the cbq command-line shell) like so:
CREATE INDEX ix_applicationId ON bucketName(applicationId) USING GSI;
You can then use a simple SELECT query to get the relevant document IDs:
SELECT META(bucketName).id FROM bucketName WHERE applicationId = '123';
Or even simpler, you can just use a DELETE query to delete them directly:
DELETE FROM bucketName WHERE applicationId = '123';
Note that DML statements like DELETE are still considered a beta feature in Couchbase 4.0, so do your own risk assessment.
To run N1QL queries from .NET you use almost the same syntax as for views:
await bucket.QueryAsync<dynamic>("QUERY GOES HERE");
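For example, here is a sketch of the ExpireCurrentSession method from the question rewritten with a parameterized N1QL DELETE (assuming the 2.x .NET SDK's QueryRequest type; the bucket name is a placeholder):
using Couchbase.N1QL;

// Sketch: expire all sessions for an application with a single N1QL statement.
// "sessions" is a placeholder bucket name.
public async Task ExpireCurrentSession(string applicationId)
{
    using (var bucket = GetSessionBucket())
    {
        var request = new QueryRequest(
                "DELETE FROM sessions WHERE applicationId = $1")
            .AddPositionalParameter(applicationId);   // avoids string concatenation

        var result = await bucket.QueryAsync<dynamic>(request);
        // result.Success and result.Errors can be inspected here if needed
    }
}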
Straight to the question:
The problem: to do async bulk inserts (not necessarily bulk, if MySQL can handle it) using Node.js (coming from a .NET and PHP background).
Example:
Assume I have 40 (adjustable) functions doing some work (async), each adding a record to the table after its single iteration. It is very probable that more than one function makes an insertion call at the same time. Can MySQL handle that directly, considering there is going to be an auto-update field?
In C# (.NET) I would have used a DataTable to collect all the rows from each function, bulk-insert the DataTable into the database table at the end, and launch many threads, one for each function.
What approach would you suggest in this case?
Should the approach change if I need to handle 10,000 or 4 million rows per table?
Also, the DB schema is not going to change; would MongoDB be a better choice for this?
I am new to Node and NoSQL and in the noob learning phase at the moment, so if you can provide some explanation with your answer, it would be awesome.
Thanks.
EDIT:
Answer: Neither MySQL nor MongoDB supports any special sort of bulk insert; under the hood it is just a foreach loop.
Both of them are capable of handling a large number of connections simultaneously; the performance will largely depend on your requirements and production environment.
1) In MySQL, queries are executed sequentially per connection. If you are using one connection, your ~40 functions will result in 40 queries being enqueued (via an explicit queue in the mysql library, your code, or a system queue based on synchronisation primitives), not necessarily in the same order in which you started the 40 functions. MySQL won't have any race-condition problems with auto-update fields in that case.
2) If you really want to execute 40 queries in parallel, you need to open 40 connections to MySQL (which is not a good idea from a performance point of view, but again, MySQL is designed to handle auto-increments correctly for multiple clients).
3) There is no special bulk-insert command in the MySQL protocol at the wire level; any library exposing a bulk-insert API is in fact just issuing one long 'INSERT ... VALUES' query.
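To illustrate point 3, here is a sketch (in C#, since the question mentions a .NET background; the table, column and variable names are placeholders) of what such a "bulk" insert boils down to, one long parameterized INSERT ... VALUES statement:
using System.Linq;
using MySql.Data.MySqlClient;

// Sketch: build one multi-row INSERT for the whole batch.
// "items", "my_table" and the column names are placeholders.
using (var conn = new MySqlConnection("server=localhost;database=test;uid=user;pwd=pass"))
{
    conn.Open();
    var rows = string.Join(",", items.Select((_, i) => $"(@name{i}, @value{i})"));
    using (var cmd = new MySqlCommand($"INSERT INTO my_table (name, value) VALUES {rows}", conn))
    {
        for (var i = 0; i < items.Count; i++)
        {
            cmd.Parameters.AddWithValue($"@name{i}", items[i].Name);
            cmd.Parameters.AddWithValue($"@value{i}", items[i].Value);
        }
        cmd.ExecuteNonQuery();   // MySQL fills in the auto-increment column itself
    }
}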
I am working on a tool to optimize LINQ to SQL queries. Basically, it intercepts the LINQ execution pipeline and makes some optimizations, for example removing a redundant join from a query. Of course, there is an overhead in the execution time before the query gets executed in the DBMS, but after that the query should be processed faster. I don't want to use a SQL profiler because I know that the generated query will perform better in the DBMS than the original one; I am looking for a correct way of measuring the global time between the creation of the query in LINQ and the end of its execution. Currently, I am using the Stopwatch class and my code looks something like this:
var sw = new Stopwatch();
sw.Start();
const int amount = 100;
for (var i = 0; i < amount; i++)
{
ExecuteNonOptimizedQuery();
}
sw.Stop();
Console.WriteLine("Executing the query {2} times took: {0}ms. On average, each query took: {1}ms", sw.ElapsedMilliseconds, sw.ElapsedMilliseconds / amount, amount);
Basically, the ExecuteNonOptimizedQuery() method creates a new DataContext, creates a query and then iterates over the results.
I did this for both versions of the query, the normal one and the optimized one. I took the idea from this post from Frans Bouma.
Is there any other approach/considerations I should take?
Thanks in advance!
You could run under a profiler, but an instrumented build will impact the performance of the code, possibly distorting the result (with a large part of the total execution time spent in SQL Server, this is very likely). A sampling profiler might help. (The Visual Studio Team System profiler can do both.)
For an isolated test comparing two approaches, using Stopwatch is generally preferred. You can subtract the time shown in SQL Profiler from the stopwatch time to get the time spent in client code (including your post-processing).
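A small refinement worth considering for the harness (a sketch along the lines of the code in the question): run each variant once before starting the Stopwatch, so JIT compilation and connection-pool start-up don't land in the first measured iteration.
// Sketch: warm up once outside the measured loop, then time the real iterations.
ExecuteNonOptimizedQuery();                // warm-up, not measured

var sw = Stopwatch.StartNew();
for (var i = 0; i < amount; i++)
{
    ExecuteNonOptimizedQuery();
}
sw.Stop();
Console.WriteLine("Average per query: {0}ms", sw.ElapsedMilliseconds / (double)amount);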