Azure Search partitions - How does partitioning work?

While adding more partitions in Azure Search Service, I see it doesn't require any partition key. We push data from our application and don't use an indexer to pull it. Assuming I have only one index and I am using 3 partitions, I have the following questions:
While pushing a document into the index, how does the service know which partition to create that particular doc in?
While querying documents, does the service fan-out query across all partitions every time and then collate the results?

The service decides which partition the document should be created in based on the document's id. We don't expose which partition a document is in, and you don't need to know this information to search for a document.
Yes. Please see our tutorial on service scalability for more information on how to plan for search and indexing capacity and optimize performance.

Azure Search automatically balances documents across the available partitions.
When querying documents, the service queries the relevant partitions and then collects the results.
To learn more about partitions/replicas, see search capacity planning.
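For illustration, here is a minimal sketch of the push model over the REST API (the service name, index name, key field, admin key, and documents below are placeholders): the client only supplies each document's key, and the service maps that key to a partition internally.

    # Minimal push-model sketch; service name, index name, key and documents are placeholders.
    import json
    import urllib.request

    SERVICE = "my-search-service"
    INDEX = "my-index"
    API_KEY = "<admin-api-key>"
    URL = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
           f"/docs/index?api-version=2020-06-30")

    # Only the key field ("id" here) is supplied per document; which partition
    # the document lands on is decided by the service from that key.
    batch = {"value": [
        {"@search.action": "upload", "id": "1", "title": "first doc"},
        {"@search.action": "upload", "id": "2", "title": "second doc"},
    ]}

    request = urllib.request.Request(
        URL,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(response.status)  # 200/207 means the batch was accepted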

Related

What are best practices for partitioning DocumentDB across accounts?

I am developing an application that uses DocumentDB to store customer data. One of the requirements is that we segregate customer data by geographic region, so that US customers' data is stored within the US, and European customers' data lives in Europe.
The way I planned to achieve this is to have two DocumentDB accounts, since an account is associated with a data centre/region. Each account would then have a database, and a collection within that database.
I've reviewed the DocumentDB documentation on client- and server-side partitioning (e.g. 1, 2), but it seems to me that the built-in partitioning support will not be able to deal with multiple regions. Even though an implementation of IPartitionResolver could conceivably return an arbitrary collection self-link, the partition map is associated with the DocumentClient and therefore tied to a specific account.
Therefore it appears I will need to create my own partitioning logic and maintain two separate DocumentClient instances - one for the US account and one for the Europe account. Are there any other ways of achieving this requirement?
Azure's best practices on data partitioning says:
All databases are created in the context of a DocumentDB account. A single DocumentDB account can contain several databases, and it specifies in which region the databases are created. Each DocumentDB account also enforces its own access control. You can use DocumentDB accounts to geo-locate shards (collections within databases) close to the users who need to access them, and enforce restrictions so that only those users can connect to them.
So, if your intention is to keep the data near the users (and not just keep it stored separately), your only option is to create different accounts. Luckily, billing is not per account but per collection.
DocumentDB's resource model gives the impression that you cannot (at least out of the box) mix DocumentDB accounts. Partition keys don't look useful here either, since partitioning can only happen within a single account.
Maybe this sample will help you or give some hints.
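As a rough sketch of the two-client approach (endpoints, keys, database/container names, and the region rule are made-up placeholders; shown with the azure-cosmos Python SDK, the successor to the original DocumentDB SDK), the application itself decides which account a customer's data goes to:

    # Sketch of app-level routing across two regional accounts.
    # Endpoints, keys, names and the region rule are hypothetical placeholders.
    from azure.cosmos import CosmosClient

    clients = {
        "US": CosmosClient("https://us-account.documents.azure.com:443/", credential="<us-key>"),
        "EU": CosmosClient("https://eu-account.documents.azure.com:443/", credential="<eu-key>"),
    }

    def container_for(region):
        # Each account holds its own database and collection/container.
        database = clients[region].get_database_client("customers")
        return database.get_container_client("profiles")

    def save_customer(customer):
        # customer is a dict with an "id" field; the application, not the SDK,
        # decides which account (and therefore which region) receives it.
        region = "EU" if customer["country"] in {"DE", "FR", "GB"} else "US"
        container_for(region).upsert_item(customer)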

Why does the Couchbase Server API require a name for new documents?

When you create a document using the Couchbase Server API, one of the arguments is a document name. What is this used for and why is it needed?
When using Couchbase Lite you can create an empty document and it is assigned an _id and _rev. You do not need to give it a name. So what is this argument for in Couchbase Server?
In Couchbase Server it is a design decision that all objects are identified by an object ID, key, or name (all the same thing under different names), and those are not auto-assigned. The reason for this is that keys are not embedded in the document itself; key lookups are the fastest way to get an object, and the technology dictates this under the hood of the server. Getting a document by ID is much faster than querying for it. Querying means you are asking a question, whereas getting the object by ID means you already know the answer and are just telling the DB to go get it for you, and is therefore faster.
If the ID is something random, then more than likely you must query the DB, and that is less efficient. Couchbase Mobile's Sync Gateway together with Couchbase Lite handles this on your behalf if you want it to, as it can have its own keyspace and key pattern that it manages for key lookups. If you are going straight to the DB on your own with the Couchbase SDK, though, knowing the key will be the fastest way to get that object. Like I said, Sync Gateway handles this lookup for you, as it is the app server. When you go direct with the SDKs you get more control, and different design patterns emerge.
Many people using Couchbase Server create a key pattern that means something to their application. As an example, for a user profile store I might consider breaking the profile up into three separate documents, with a unique username (in this example hernandez94) in each key:
1) login-data::hernandez94 is the object that has the encrypted password since I need to query that all of the time and want it in Couchbase's managed cache for performance reasons.
2) sec-questions::hernandez94 is the object that has the user's 3 security questions and since I do not use that very often, do not care if it is in the managed cache
3) main::hernandez94 is the user's main document that has everything else that I might need to query often, but not nearly as often as other times.
This way I have tailored my keyspace naming to my application's access patterns and therefore get only the data I need, exactly when I need it, for best performance. Since these key names are standardized in my app, I could do a parallelized bulk get on all three of these documents if I wanted to: my app can construct the names, so it would be VERY fast. Again, I am not querying for the data; I have the keys, so I just go get them. I could normalize this keyspace naming further depending on the access patterns of my application: email-addresses::hernandez94, phones::hernandez94, appl-settings::hernandez94, etc.
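As a rough illustration of that pattern (cluster address, credentials, and bucket name are placeholders; this assumes a recent Couchbase Python SDK), the app constructs the keys itself and fetches by key instead of querying:

    # Sketch: build the keys from the username and fetch them by key in parallel.
    # Cluster address, credentials and bucket name are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    cluster = Cluster("couchbase://localhost",
                      ClusterOptions(PasswordAuthenticator("user", "password")))
    collection = cluster.bucket("profiles").default_collection()

    def fetch_profile(username):
        keys = [f"login-data::{username}",
                f"sec-questions::{username}",
                f"main::{username}"]
        # Key lookups, not queries: each get goes straight to the document.
        with ThreadPoolExecutor() as pool:
            docs = list(pool.map(lambda k: collection.get(k).content_as[dict], keys))
        return dict(zip(("login", "security", "main"), docs))

    profile = fetch_profile("hernandez94")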

What would be the best DB cache to use for this application?

I am about 70% of the way through developing a web application which contains what is essentially a largeish datatable of around 50,000 rows.
The app itself is a filtering app providing various different ways of filtering this table, such as range filtering by number, drag-and-drop filtering that ultimately performs regexp filtering, live text searching, and I could go on and on.
Due to this I coded my MySQL queries in a modular fashion so that the actual query itself is put together dynamically depending on the type of filtering happening.
At the moment each filtering action takes between 250-350ms in total on average. For example:
The user grabs one end of a visual slider and drags it inwards; when he/she lets go, a range-filtering query is dynamically put together by my PHP code and the results are returned as a JSON response. The total time from the user letting go of the slider until the user has received all the data and the table is redrawn is between 250-350ms on average.
I am concerned about scalability further down the line, as users can be expected to perform a huge number of these filtering actions in a short space of time in order to retrieve the data they are looking for.
I have toyed with trying to do some fancy cache-expiry work with memcached but couldn't get it to play ball correctly with my dynamically generated queries. Although everything would cache correctly, I was having trouble expiring the cache when the query changed and keeping the data relevant. I am, however, extremely inexperienced with memcached. My first few attempts have led me to believe that memcached isn't the right tool for this job (due to the highly dynamic nature of the queries), although this app could ultimately see very high concurrent usage.
So... My question really is, are there any caching mechanisms/layers that I can add to this sort of application that would reduce hits on the server? Bearing in mind the dynamic queries.
Or... If memcached is the best tool for the job, and I am missing a piece of the puzzle with my early attempts, can you provide some information or guidance on using memcached with an application of this sort?
Huge thanks to all who respond.
EDIT: I should mention that the database is MySQL. The site itself is running on Apache with an nginx proxy. But this question relates purely to speeding up and reducing the database hits, of which there are many.
I should also add that the quoted 250-350ms roundtrip time is fully remote. As in from a remote computer accessing the website. The time includes DNS lookup, Data retrieval etc.
If I understand your question correctly, you're essentially asking for a way to reduce the number of queries against the database even though very few of the queries will be exactly the same.
You essentially have three choices:
1) Live with having a large number of queries against your database: optimise the database with appropriate indexes and normalise the data as far as you can. Make sure to avoid the usual performance pitfalls in your query building (lots of ORs in ON clauses or WHERE clauses, for instance). Provide views for mashup queries, etc.
2) Cache the generic queries (that is, without some or all of the filters) in memcached or similar, and apply the filters in the application layer.
3) Implement a search index server, like Solr.
I would recommend the first, though. A roundtrip time of 250-300 ms sounds a bit high even for complex queries, and it sounds like you have a lot to gain just by improving what you already have at this stage.
For much higher workloads, I'd suggest solution number 3; it will help you achieve what you are trying to do while being a champ at handling lots of different queries.
Use Memcache and set the key to be the filtering query or some unique key based on the filter. Ideally you would write your application to expire the key as new data is added.
You can only make good use of caches when you occasionally run the same query.
A good way to work with memcache caches is to define a key that matches the function that calls it. For example, if the model named UserModel has a method getUser($userID), you could cache all users as USER_id. For more advanced functions (Model2::largerFunction($arg1, $arg2)) you can simply use MODEL2_arg1_arg2 - this will make it easy to avoid namespace conflicts.
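A rough sketch of that convention (the key format, expiry time, and pymemcache client here are assumptions of this example, not part of the original answer):

    # Sketch: derive the cache key from the model name and arguments,
    # then do a get-or-compute against memcached via pymemcache.
    import json
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))

    def cache_key(model, *args):
        # cache_key("Model2", arg1, arg2) -> "MODEL2_arg1_arg2"
        return "_".join([model.upper()] + [str(a) for a in args])

    def cached_call(model, func, *args):
        key = cache_key(model, *args)
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        value = func(*args)                            # hit the database only on a miss
        cache.set(key, json.dumps(value), expire=600)  # 10-minute expiry, illustrative
        return value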
For fulltext searches, use a search indexer such as Sphinx or Apache Lucene. They improve your queries a LOT (I was able to do a fulltext search on a 10-million-record table on a 1.6 GHz Atom processor in less than 500 ms).

Which strategy to use for designing a log data storage?

We want to design a data store on a relational database to keep request-message (HTTP/S, XMPP, etc.) logs. For generating the logs we use a solution based on the Apache Synapse ESB. Since we want to store the logs and read them only for maintenance issues, the read/write ratio will be low (the write volume will be heavy, since the system will receive many messages to be logged). We thought of using Cassandra for its distributed nature and clustering capabilities. However, with Cassandra schemas, search queries with filters are difficult and always require secondary indexes.
To keep it short, my question is whether we should try MySQL's clustering solutions or use Cassandra with a schema designed for search queries with filters.
If you wish to do real-time analytics over your semi-structured or unstructured data, you can go with a Cassandra + Hadoop cluster. The Cassandra wiki itself suggests the DataStax Brisk edition for this kind of architecture, so it is worth giving it a try.
On the other hand, if you wish to do real-time queries over raw logs for a small set of data, e.g.
    select useragent from raw_log_table where id='xxx'
then you should put a lot of research into your row-key and column-key design, because that determines the complexity of the query. Have a look at the case studies from people here: http://www.datastax.com/cassandrausers1
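As a rough sketch with the DataStax Python driver (keyspace, table, and column names are made up for the example), keying the table on the log id makes the query above a single-key lookup:

    # Sketch: the partition key is the log id, so the example query above
    # is a direct key lookup. Keyspace/table/column names are illustrative.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS logs
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS logs.raw_log_table (
            id text PRIMARY KEY,   -- row key: one log entry per id
            useragent text,
            body text
        )
    """)

    # Filtering on any column other than the key would need a secondary index
    # or a second table keyed differently, which is why key design matters.
    row = session.execute(
        "SELECT useragent FROM logs.raw_log_table WHERE id = %s", ("xxx",)
    ).one()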
Regards,
Tamil

Pros and Cons of the MySQL Archive Storage Engine?

For a website I am developing, I am going to store user activity information (like Facebook) and HTTP requests to the server into a MySQL database.
I was going to use InnoDB as the storage engine; however, I briefly read about the Archive storage engine, and it sounds like the way forward for information which is only inserted or selected and never updated or deleted.
Before I proceed, I would like to know any pros and cons to help me decide whether to use it or not.
We are not currently using the Archive storage engine, but have done some research on it. One requirement of our application is that it stores lots of historical information for long periods of time. That data is never updated, but needs to be selected occasionally.
So ... with that in mind:
PROs
Very small footprint. Uses zlib for data compression.
More efficient/quicker backups due to small data file size.
CONs
Does not allow column indexes. Selects require a full table scan.
Does not support spatial data.
--
It is likely that we will not use this storage engine due to the indexing limitations.
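For anyone who does want to try it, here is a minimal sketch of an insert-only activity log on the ARCHIVE engine (connection details and the table layout are placeholders; this assumes the mysql-connector-python package):

    # Sketch: an insert/select-only activity log stored with ENGINE=ARCHIVE.
    # Connection details and the table layout are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="site")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS activity_log (
            user_id INT NOT NULL,
            action VARCHAR(64) NOT NULL,
            created_at DATETIME NOT NULL
        ) ENGINE=ARCHIVE
    """)

    cur.execute(
        "INSERT INTO activity_log (user_id, action, created_at) VALUES (%s, %s, NOW())",
        (42, "page_view"),
    )
    conn.commit()

    # With no column indexes, any SELECT is a full table scan.
    cur.execute("SELECT COUNT(*) FROM activity_log WHERE user_id = %s", (42,))
    print(cur.fetchone()[0])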