I am a developer on a Couchbase/AngularJS/Express.js/Node.js application. The application is at the beginning of user testing. As users access the new application, one of them has asked to have a retrieval from Couchbase reviewed for efficiency; the user feels the search is taking too long to return. I am an experienced SQL programmer who is still learning N1QL, and my SQL experience tells me the request is not feasible.
The process involves the following:
1700 to 1800 JSON Documents
Expected return is either zero or some documents
Returning all documents is impossible (I will explain shortly).
The data must travel across the Atlantic because the data and application will reside on a machine within England.
The user will enter a minimum of three characters to search for a name and expects a list of names containing the characters entered. Keep in mind that the more characters the user enters, the more refined the search becomes. The user's issue is that it takes a couple of seconds to return.
The N1QL statement looks as follows:
SELECT id,
name,
abbr,
meta(bucket).id meta_id
FROM bucket
WHERE type = 'store'
and Upper(name) like Upper('%passedinname%')
order by name
If the user enters 'ABC', the N1QL statement will return all JSON document information where the store JSON document contains 'ABC' in the name. For example, if I had stores named:
ABC Market
Mary’s AbC Deli
Both documents would return. As you may notice, I have to keep the search case insensitive. This is per the user's request.
I looked over the options within Couchbase and believe there is no viable means to increase the retrieval speed. I looked into Couchbase Views, but I do not see how I can apply my N1QL statement against them. The couch.model.js script only allows retrieval from a view; it does not allow search criteria that are not key values.
I looked into Couchbase indexes. I do not see how I can apply my N1QL statement against them either; the statement will still need to search each store JSON document for the 'passedinname'.
I even looked into Couchbase Full Text Search, but I cannot use that option; the department has a policy that prevents it.
If you have a possible solution, please feel free to make a suggestion. I gratefully appreciate all suggestions.
TIA
Anthony
You are in luck. Unlike SQL, N1QL handles this with an indexed speedup. There are two approaches: one for arbitrary substrings, including partial words, and another for whole words and/or tokens.
For partial words, see the SUFFIXES() function. It is more appropriate for shorter fields (e.g. tags, name, title) than longer fields (e.g. long description).
For whole words, see TOKENS() and SPLIT().
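For example, here is a minimal sketch of the SUFFIXES() approach against the query in the question (the index name is made up; the bucket is assumed to be the `bucket` from above). The case-insensitive contains search becomes a prefix match on an array index of name suffixes:

/* Array index over every suffix of the lower-cased store name. */
CREATE INDEX idx_store_name_suffixes ON `bucket`
    (DISTINCT ARRAY s FOR s IN SUFFIXES(LOWER(name)) END)
    WHERE type = 'store';

/* The '%abc%' contains search, rewritten as a prefix match
   against the indexed suffixes. */
SELECT id, name, abbr, META(`bucket`).id AS meta_id
FROM `bucket`
WHERE type = 'store'
  AND ANY s IN SUFFIXES(LOWER(name)) SATISFIES s LIKE 'abc%' END
ORDER BY name;

The whole-word variant follows the same pattern, with TOKENS() or SPLIT() in place of SUFFIXES() and an equality match on the token instead of LIKE.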
Both approaches are described in these articles.
https://dzone.com/articles/a-couchbase-index-technique-for-like-predicates-wi
https://dzone.com/articles/split-and-conquer-efficient-string-search-with-n1q
https://dzone.com/articles/more-than-like-efficient-json-search-with-couchbas
We are building a REST API in .NET deployed to Azure App Service / Azure API App. From this API, clients can create "Products" and query "Products". The product entity has a set of fields that are common, and that all clients have to provide when creating a product, like the fields below (example):
{
  "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
  "name": "Product One",
  "description": "The number one product"
}
We store these products currently as self-contained documents in Azure Cosmos DB.
Question 1: Partitioning.
The collection will not store a huge number of documents; we are talking about a maximum of around 2,500,000 documents of 1-5 KB each (estimates). We have currently chosen the id field (our system-generated id, not the internal Cosmos DB document id) as the partition key, which means 2,500,000 logical partitions with one document each. The documents will be used in some low-latency workloads, but these workloads will query by id (the partition key). Clients will also query by e.g. name, and then we have a fan-out query, but those queries will not be latency-critical. In the portal, you can't create a single-partition collection anymore, but you can do it from the SDK or use a fixed partition key value. If we put all these documents in one single partition (the data is far below 10 GB), we will never get any fan-out queries, but will rely more on the index within that one logical partition. So the question: even if we don't have huge amounts of data, is it still wise to partition like we currently have done?
Question 2: Extended metadata.
We will face clients that want to write client/application/customer-specific metadata beyond the basic common fields. What is the best way to do this?
Some brainstorming from me below.
1: Just dump everything in one self-contained document.
One option is to allow clients in the API to add a kind of nested "extendedMetadata" field with key-value pairs when creating a product. Cosmos DB is schema-agnostic, so in theory this should work fine. Some products can have zero extended metadata, while other products can have a lot. For the clients, we can promise the basic common fields, but for the extended metadata field we cannot promise anything in terms of number of fields, naming, etc. The document size will then vary. These products will, as mentioned, still be used in latency-critical workloads that query by "id" (the partition key). The extended metadata will never be used in any latency-critical workloads. How much, and in what way, does document size in general affect performance / throughput? For the latency-critical read scenario, will the query optimizer go straight to the right partition and then use the index to quickly retrieve the document fields of interest, or will the whole document always be loaded and processed regardless of which fields you want to query?
{
  "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
  "name": "Product One",
  "description": "The number one product",
  "extendedMetadata": {
    "prop1": "prop1",
    "prop2": "prop2",
    "propN": "propN"
  }
}
The extended metadata is only useful to retrieve from the same API in certain situations. We can then do something like:
api.org.com/products/{id} -- will always return a product with the basic common fields
api.org.com/products/{id}/extended -- will return the full document (basic + extended metadata)
2: Split the document
One option might be to do some kind of splitting. If a client creates a product through the API that contains extended metadata, we can implement some logic that splits the document if extendedMetadata contains data. I guess the split can be done in many ways; some brainstorming below. I guess the main objective of splitting the documents (which requires more work on write operations) is to get better throughput in case document size plays a significant role here (in most cases, clients will be fine with just the basic common fields).
One basic document that only contains the basic common fields, and one extended document that (with the same id) contains the basic common fields plus the extended metadata (duplicating the basic common fields). We can add a "type" field that differentiates between the basic and the extended document. If a client asks for extended, we will only query documents of type "extended".
One basic document that only contains the basic common fields, plus a reference to an extended document that only contains the extended metadata. This means a read operation where the client asks for a product with extended metadata requires reading two documents.
Splitting into different collections: one collection holds the basic documents, with throughput dedicated to low-latency read scenarios, and one collection holds the extended metadata.
Sorry for the long post. I hope this was understandable; looking forward to your feedback!
Answer 1:
If you can guarantee that the documents' total size will never be more than 10 GB, then creating a fixed collection is the way to go, for two reasons.
First, there is no need for a cross-partition query. I'm not saying it will be lightning fast without partitioning, but because you are only interacting with a single physical partition, it will be faster than going into every single physical partition looking for data.
(Keep in mind however that every time people think that they can guarantee things like max size of something, it usually doesn't work out.)
The /id partitioning strategy is only efficient if you can ALWAYS provide the id. This is called a read. If you need to search by any other property, you are performing a query, and this is where the system wouldn't do so well.
Ideally you should design your Cosmos DB collection in a way that you never do a cross-partition query as part of your everyday workload; maybe once in a blue moon for reporting reasons.
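To illustrate with the product example from the question (and /id as the partition key), the first query below can be routed to a single logical partition, while the second must fan out because it filters on a property that is not the partition key:

SELECT * FROM c WHERE c.id = "cbf3f7aa-4743-4198-b307-260f703c42c1"

SELECT * FROM c WHERE c.name = "Product One"

The first is essentially a point lookup; the second is the kind of query to keep out of the latency-critical path (or to serve from a single fixed partition, if you go that route).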
Answer 2:
Cosmos DB is a NoSQL schema-less database for a reason.
The second approach in your brainstorming would be fitting for a traditional RDBMS, but we don't have that here.
You can simply go with your first approach and either have everything under a single property or just have them at the top level.
Remember that you can just map the response to any object that you want, so you can simply have two DTOs, a slim version and an extended version, and map to different versions depending on the endpoint.
Hope this helps.
Let's say we have a requirement to create a system that consumes a high-volume, real-time data stream of documents, and that matches those documents against a set of user-defined search queries as those documents become available. This is a prospective, as opposed to a retrospective, search service. What would be an appropriate persistence solution?
Suppose that users want to see a live feed of documents that match their queries--think Google Alerts--and that the feed must display certain metadata for each document. Let's assume an indefinite lifespan for matches; i.e., the system will allow the user to see all of the matches for a query from the time when the particular query was created. So the metadata for each document that comes in the stream, and the associations between the document and the user queries that matched that document, must be persisted to a database.
Let's throw in another requirement, that users want to be able to facet on some of the metadata: e.g., the user wants to see only the matching documents for a particular query whose metadata field "result type" equals "blog," and wants a count of the number of blog matches.
Here are some hypothetical numbers:
200,000 new documents in the data stream every day.
-The metadata for every document is persisted.
1000 users with about 5 search queries each: about 5000 total user search queries.
-These queries are simple boolean queries.
-As each new document comes in, it is processed against all 5000 queries to see which queries are a match.
Each feed--one for each user query--is refreshed to the user every minute. In other words, for every feed, a query to the database for the most recent page of matches is performed every minute.
Speed in displaying the feed to the user is of paramount importance. Scalability and high availability are essential as well.
The relationship between users and queries is relational, as is the relationship between queries and matching documents, but the document metadata itself is just key-value pairs. So my initial thought was to keep the relational data in a relational DB like MySQL and the metadata in a NoSQL DB, but can the faceting requirement be achieved in a NoSQL DB? Also, constructing a feed would then require making a call to two separate data stores, which is additional complexity. Or perhaps shove everything into MySQL, but this would entail lots of joins and counts. If we store all the data as key-value pairs in some other kind of data store, again, how would we do the faceting? And there would be a ton of redundant metadata for documents that match more than one search query.
What kind of database(s) would be a good fit for this scenario? I'm aware of tools such as Twitter Storm and Yahoo's S4, which could be used to construct the overall architecture of such a system, but I'd like to focus on the database, given the data storage, volume, and query/faceting requirements.
First, I disagree with Ben. 200k new records per day against 86,400 seconds in a day means we are talking about two to three records per second. This is not earth-shattering, but it is a respectable clip for new data.
Second, I think this is a real problem that people face. I'm not going to be one that says that this forum is not appropriate for the topic.
I think the answer to the question has a lot to do with the complexity and type of user queries that are supported. If the queries consist of a bunch of binary predicates, for instance, then you can extract the particular rules from the document data and then readily apply the rules. If, on the other hand, the queries consist of complex scoring over the text of the documents, then you might need an inverted index paired with a scoring algorithm for each user query.
My approach to such a system would be to parse the queries into individual data elements that can be determined from each document (which I might call a "queries signature" since the results would contain all fields needed to satisfy the queries). This "queries signature" would be created each time a document was loaded, and it could then be used to satisfy the queries.
Adding a new query would require processing all the documents to assign new values. Given the volume of data, this might need to be more of a batch task.
Whether SQL is appropriate depends on the features that you need to extract from the data. This in turn depends on the nature of the user queries. It is possible that SQL is sufficient. On the other hand, you might need more sophisticated tools, especially if you are using text mining concepts for the queries.
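As a minimal relational sketch of the "queries signature" idea, assuming each user query reduces to a set of equality predicates over fields extracted from the document (all table and column names below are invented):

/* One row per predicate in a user query, e.g. (42, 'result_type', 'blog'). */
CREATE TABLE query_predicates (
    query_id    INT          NOT NULL,
    field_name  VARCHAR(64)  NOT NULL,
    field_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (query_id, field_name)
);

/* The signature extracted from one incoming document. */
CREATE TABLE document_features (
    doc_id      BIGINT       NOT NULL,
    field_name  VARCHAR(64)  NOT NULL,
    field_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (doc_id, field_name)
);

/* Queries for which every predicate is satisfied by document 123. */
SELECT qp.query_id
FROM query_predicates qp
JOIN document_features df
  ON df.doc_id = 123
 AND df.field_name = qp.field_name
 AND df.field_value = qp.field_value
GROUP BY qp.query_id
HAVING COUNT(*) = (SELECT COUNT(*)
                   FROM query_predicates p
                   WHERE p.query_id = qp.query_id);

The matched (query_id, doc_id) pairs would then be persisted to drive each user's feed.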
Thinking about this, it sounds like an event-processing task, rather than a regular data processing operation, so it might be worth investigating Complex Event Processing systems - rather than building everything on a regular database, using a system which processes the queries on the incoming data as it streams into the system. There are commercial systems which can hit the speed & high-availability criteria, but I haven't researched the available OSS options (luckily, people on quora have done so).
Take a look at Elasticsearch. It has a percolator feature that matches a document against registered queries.
http://www.elasticsearch.org/blog/2011/02/08/percolator.html
I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step three, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000-record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems; I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
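For example, in MySQL (the table and column names here are only illustrative), the raw digest goes into a BINARY(20) column and is converted at the boundaries:

/* Store the raw 20-byte digest rather than its 40-character hex form. */
CREATE TABLE dataset_items (
    item_id BINARY(20) NOT NULL
);

/* UNHEX()/HEX() convert between the hex text and the raw bytes. */
INSERT INTO dataset_items (item_id)
VALUES (UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709'));

SELECT HEX(item_id) FROM dataset_items;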
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
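A minimal sketch of that table, assuming InnoDB and following the column hints above (the BINARY(20) note reflects the first trick; everything else is as described):

/* One row per (saved search, item). The composite primary key keeps
   each search's rows clustered together in InnoDB. */
CREATE TABLE saved_searches (
    search_id BIGINT UNSIGNED NOT NULL,
    item_id   CHAR(20)        NOT NULL,  /* BINARY(20) if storing raw digests */
    PRIMARY KEY (search_id, item_id)
) ENGINE=InnoDB;

/* Insert a search's ids in sorted order to limit fragmentation. */
INSERT INTO saved_searches (search_id, item_id)
VALUES (1, 'aaaa...'), (1, 'bbbb...');

/* Expire a whole search in one pass. */
DELETE FROM saved_searches WHERE search_id = 1;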
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
I am pretty excited about the new MySQL XML functions.
Now I can finally embed something like "object-oriented" documents in my old-school relational database.
For an example use case, consider a user who signs up at your website using Facebook Connect.
You can fetch an object for the user using the Graph API and get nice information. This information, however, can vary vastly. Some fields may or may not be set, some may be added over time, and so on.
Well, if you are just interested in very specific fields (for example friend relations, gender, movies...), you can project them into your relational database schema.
However, using the XML functions you could store the whole object inside a field, and then your different models can access the data using the ExtractValue function. You can store everything right away without needing to worry about what you will need later.
But what will the performance be?
For example, I have a table with 50,000 entries which represent users.
I have an enum field that states "male", "female" (or various other genders to be politically correct).
The performance of, for example, fetching all males will be very fast.
But what about something like WHERE ExtractValue(userdata, '/gender') = 'male'?
How will the performance vary if the object gets bigger?
Can I maybe somehow put an index on specific XPath selections?
How do field types work together with these functions and their performance? Varchar/blob?
Do I need fulltext indexes?
To sum up my question:
MySQL XML functions look great. And I am sure they are really great if you just want to store structured data that you fetch and analyze further in your application.
But how will they hold up in procedures where internal scans/sorting/comparisons/calculations are performed on them?
Can MySQL replace document-oriented databases like CouchDB/Sesame?
What are the gains and trade-offs of the XML functions?
How and why are they better or worse than a dynamic application that stores various data as attributes?
For example, a key/value table with an XPath as the key and the value as the value, connected to the document entity.
Has anyone had any other experiences with this, or noticed anything worth mentioning?
I tend to make comments similar to Pekka's, but I think the reason we cannot laugh this off is your statement "This information however can vary vastly." That means it is not realistic to plan to parse it all and project it into the database.
I cannot answer all of your questions, but I can answer some of them.
Most notably, I cannot tell you about performance on MySQL. I have seen it in SQL Server, tested it, and found that SQL Server performs in-memory XML extractions very slowly; to me it seemed as if it were reading from disk, but that is a bit of an exaggeration. Others may dispute this, but that is what I found.
"Can Mysql replace document oriented databases like CouchDB/Sesame?" This question is a bit over-broad but in your case using MySQL lets you keep ACID compliance for these XML chunks, assuming you are using InnoDB, which cannot be said automatically for some of those document oriented databases.
"How and why are they better/worse than a dynamic application that stores various data as attributes?" I think this is really a matter of style. You are given XML chunks that are (presumably) documented and MySQL can navigate them. If you just keep them as-such you save a step. What would be gained by converting them to something else?
The MySQL docs suggest that the XML file will go into a clob field. Performance may suffer on larger docs. Perhaps then you will identify sub-documents that you want to regularly break out and put into a child table.
Along these same lines, if there are particular sub-docs you know you will want to know about, you can make a child table, "HasDocs", do a little pre-processing, and populate it with names of sub-docs with their counts. This would make for faster statistical analysis and also make it faster to find docs that have certain sub-docs.
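A rough sketch of such a child table (all names here are invented), populated during that pre-processing step with one row per sub-document element and its count:

/* Lets statistical queries and "has this sub-doc" lookups avoid
   ExtractValue() scans over the stored XML. */
CREATE TABLE HasDocs (
    user_id      INT         NOT NULL,
    subdoc_name  VARCHAR(64) NOT NULL,
    subdoc_count INT         NOT NULL,
    PRIMARY KEY (user_id, subdoc_name)
);

/* Populated when the XML chunk is stored. */
INSERT INTO HasDocs (user_id, subdoc_name, subdoc_count)
VALUES (42, 'movies', 17), (42, 'friends', 250);

/* Fast lookup of users that have a particular sub-document. */
SELECT user_id FROM HasDocs WHERE subdoc_name = 'movies';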
Wish I could say more, hope this helps.
I have a table containing the cities around the world; it contains more than 70,000 cities.
I also have an auto-suggest input on my home page - which is used intensively - that makes a SQL query (a LIKE search) for each character the user enters (after the second letter).
So I am afraid of that heavy load, and I am looking for any solution or technique that can help in such a situation.
Cache the table, preferably in memory. 70,000 cities is not that much data. If each city takes up 50 bytes, that's only 70000 * 50 / (1024 ^ 2) ≈ 3 MB. And after all, a list of cities doesn't change that fast.
If you are using AJAX calls exclusively, you could cache the data for every combination of the first two letters in JSON. Assuming a Latin-like alphabet, that would be around 680 combinations. Save each of those to a text file in JSON format, and have jQuery access the text files directly.
Create an index on the city 'names' to begin with. This speeds up queries that look like:
SELECT name FROM cities WHERE name LIKE 'ka%'
Also try making your autocomplete form a little 'lazy'. The more letters a user enters, the fewer records your database has to deal with.
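For example (the table and index names are assumed from the query above), a plain index on the name column serves such prefix searches; note that it helps LIKE 'ka%' but not a leading-wildcard pattern like '%ka%':

CREATE INDEX idx_cities_name ON cities (name);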
What resources exist for Database performance-tuning?
You should cache as much data as you can on the web server. Data that does not change often, like lists of countries, cities, etc., is a good candidate for this. Realistically, how often do you add a country? Even if you change the list, a simple refresh of the cache will handle it.
You should make sure that your queries are tuned properly to make the best use of indexes and join techniques.
You may have load on your DB from other queries as well. You may want to look into techniques to improve performance of MySQL databases.
Just get your table to fit in memory, which should be trivial for 70k rows.
Then you can do a scan very easily. Maybe don't even use a sql database for this (as it doesn't change very often), just dump the cities into a text file and scan that. That'd definitely be better if you have many web servers but only one db server as each could keep its own copy of the file.
How many queries per second are you seeing peak? I can't imagine there being that many people typing city names in, even if it is a very busy site.
Also you could cache the individual responses (e.g. in memcached) if you get a good hit rate (e.g. because people tend to type the same things in)
Actually, you could also probably precalculate the responses for all one- to three-letter combinations; that's only 26*26*26 (≈17k) entries. As a four-or-more-letter input must logically be a subset of one of those, you could then scan the appropriate one of the 17k entries.
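A minimal sketch of that precalculation in SQL (table and column names are assumptions); a longer input is then answered by scanning only the rows stored under its three-letter prefix:

/* One row per (prefix, city name), for every 1-3 letter prefix. */
CREATE TABLE city_prefixes (
    prefix VARCHAR(3)   NOT NULL,
    name   VARCHAR(100) NOT NULL,
    PRIMARY KEY (prefix, name)
);

INSERT INTO city_prefixes (prefix, name)
SELECT LOWER(LEFT(name, 1)), name FROM cities
UNION
SELECT LOWER(LEFT(name, 2)), name FROM cities
UNION
SELECT LOWER(LEFT(name, 3)), name FROM cities;

/* A four-or-more letter input such as 'kath' scans only the 'kat' bucket. */
SELECT name FROM city_prefixes
WHERE prefix = 'kat' AND name LIKE 'kath%';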
If you have an index on the city name, it should be handled by the database efficiently. This statement is wrong, see comments below
To lower the demands on your server resources, you can offer autocompletion only after n or more characters. Also allow for some timeout, i.e. don't issue a request while a user is still typing.
Once the user has stopped typing for a while, you can request autocompletion.