Database solutions for a prospective (not retrospective) search - mysql

Let's say we have a requirement to create a system that consumes a high-volume, real-time data stream of documents, and that matches those documents against a set of user-defined search queries as those documents become available. This is a prospective, as opposed to a retrospective, search service. What would be an appropriate persistence solution?
Suppose that users want to see a live feed of documents that match their queries--think Google Alerts--and that the feed must display certain metadata for each document. Let's assume an indefinite lifespan for matches; i.e., the system will allow the user to see all of the matches for a query from the time when the particular query was created. So the metadata for each document that comes in the stream, and the associations between the document and the user queries that matched that document, must be persisted to a database.
Let's throw in another requirement, that users want to be able to facet on some of the metadata: e.g., the user wants to see only the matching documents for a particular query whose metadata field "result type" equals "blog," and wants a count of the number of blog matches.
Here are some hypothetical numbers:
200,000 new documents in the data stream every day.
-The metadata for every document is persisted.
1000 users with about 5 search queries each: about 5000 total user search queries.
-These queries are simple boolean queries.
-As each new document comes in, it is processed against all 5000 queries to see which queries are a match.
Each feed--one for each user query--is refreshed to the user every minute. In other words, for every feed, a query to the database for the most recent page of matches is performed every minute.
Speed in displaying the feed to the user is of paramount importance. Scalability and high availability are essential as well.
The relationship between users and queries is relational, as is the relationship between queries and matching documents, but the document metadata itself are just key-value pairs. So my initial thought was to keep the relational data in a relational DB like MySQL and the metadata in a NoSQL DB, but can the faceting requirement be achieved in a NoSQL DB? Also, constructing a feed would then require making a call to two separate data stores, which is additional complexity. Or perhaps shove everything into MySQL, but this would entail lots of joins and counts. If we store all the data as key-value pairs in some other kind of data store, again, how would we do the faceting? And there would be a ton of redundant metadata for documents that match more than one search query.
What kind of database(s) would be a good fit for this scenario? I'm aware of tools such as Twitter Storm and Yahoo's S4, which could be used to construct the overall architecture of such a system, but I'd like to focus on the database, given the data storage, volume, and query/faceting requirements.

First, I disagree with Ben. 200k new records per day compares with 86,400 seconds in a day, so we are talking about two to three records per second (200,000 / 86,400 ≈ 2.3). This is not earth-shattering, but it is a respectable clip for new data.
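For reference, the arithmetic behind that estimate can be sketched as follows, using only the hypothetical numbers from the question. The variable names are illustrative, and the "evaluations" figure assumes every document is naively tested against every query, which is the worst case.

```python
# Back-of-the-envelope load figures from the question's hypothetical numbers.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

docs_per_day = 200_000
total_queries = 5_000  # 1000 users x about 5 queries each

docs_per_second = docs_per_day / SECONDS_PER_DAY
# Worst case: each incoming document is evaluated against every stored query.
evaluations_per_second = docs_per_second * total_queries
# Each feed (one per query) is refreshed once a minute.
feed_reads_per_second = total_queries / 60

print(f"{docs_per_second:.2f} documents/s")
print(f"{evaluations_per_second:.0f} query evaluations/s")
print(f"{feed_reads_per_second:.1f} feed reads/s")
```

The document ingest rate is modest; the per-document matching work and the once-a-minute feed reads are the larger numbers.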
Second, I think this is a real problem that people face. I'm not going to be one that says that this forum is not appropriate for the topic.
I think the answer to the question has a lot to do with the complexity and type of user queries that are supported. If the queries consist of a bunch of binary predicates, for instance, then you can extract the particular rules from the document data and then readily apply the rules. If, on the other hand, the queries consist of complex scoring over the text of the documents, then you might need an inverted index paired with a scoring algorithm for each user query.
My approach to such a system would be to parse the queries into individual data elements that can be determined from each document (which I might call a "queries signature" since the results would contain all fields needed to satisfy the queries). This "queries signature" would be created each time a document was loaded, and it could then be used to satisfy the queries.
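A minimal sketch of that idea, assuming the user queries are simple boolean term queries (all of these terms, none of those). The `UserQuery` shape, the tokenizer, and the helper names are hypothetical illustrations, not part of any particular product.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UserQuery:
    query_id: int
    all_of: frozenset = frozenset()   # terms that must all be present
    none_of: frozenset = frozenset()  # terms that must be absent


def needed_terms(queries):
    """Collect every term that any registered query refers to."""
    terms = set()
    for q in queries:
        terms |= q.all_of | q.none_of
    return terms


def signature(text, vocabulary):
    """Reduce a document to the subset of query-relevant terms it contains."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return frozenset(tokens & vocabulary)


def matching_queries(sig, queries):
    """Evaluate each boolean query against the signature only, not the full text."""
    return [q.query_id for q in queries
            if q.all_of <= sig and not (q.none_of & sig)]


queries = [
    UserQuery(1, all_of=frozenset({"mysql", "sharding"})),
    UserQuery(2, all_of=frozenset({"mongodb"}), none_of=frozenset({"mysql"})),
    UserQuery(3, all_of=frozenset({"database"})),
]
vocabulary = needed_terms(queries)

sig = signature("Sharding a MySQL database across several hosts", vocabulary)
matches = matching_queries(sig, queries)
print(sorted(sig), matches)
```

The signature is computed once per document at load time; each (query, document) pair that matches would then be persisted as an association row.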
Adding a new query would require processing all the documents to assign new values. Given the volume of data, this might need to be more of a batch task.
Whether SQL is appropriate depends on the features that you need to extract from the data. This in turn depends on the nature of the user queries. It is possible that SQL is sufficient. On the other hand, you might need more sophisticated tools, especially if you are using text mining concepts for the queries.
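As an illustration of the case where SQL is sufficient, here is a small sketch of the feed and facet queries from the question. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names are assumptions made up for the example. Each document's metadata is stored once, and matches are rows in an association table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    doc_id      INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    result_type TEXT NOT NULL,
    published   TEXT NOT NULL
);
-- One row per (query, matching document); metadata is not duplicated.
CREATE TABLE query_matches (
    query_id INTEGER NOT NULL,
    doc_id   INTEGER NOT NULL REFERENCES documents(doc_id),
    PRIMARY KEY (query_id, doc_id)
);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", [
    (1, "Post A", "blog", "2013-01-01T10:00:00"),
    (2, "Story B", "news", "2013-01-01T10:05:00"),
    (3, "Post C", "blog", "2013-01-01T10:10:00"),
])
conn.executemany("INSERT INTO query_matches VALUES (?, ?)",
                 [(42, 1), (42, 2), (42, 3), (7, 2)])

# Facet counts for one user query, grouped by the "result type" field.
facets = conn.execute("""
    SELECT d.result_type, COUNT(*)
    FROM query_matches m JOIN documents d ON d.doc_id = m.doc_id
    WHERE m.query_id = ?
    GROUP BY d.result_type
    ORDER BY d.result_type
""", (42,)).fetchall()

# Most recent page of matches for the same query, restricted to one facet value.
blog_page = conn.execute("""
    SELECT d.doc_id, d.title
    FROM query_matches m JOIN documents d ON d.doc_id = m.doc_id
    WHERE m.query_id = ? AND d.result_type = ?
    ORDER BY d.published DESC
    LIMIT 20
""", (42, "blog")).fetchall()

print(facets)
print(blog_page)
```

Whether this holds up at the question's volumes would need to be measured; the sketch only shows that the faceting requirement maps onto a single join plus GROUP BY.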

Thinking about this, it sounds like an event-processing task rather than a regular data-processing operation, so it might be worth investigating Complex Event Processing systems: rather than building everything on a regular database, use a system that processes the queries against the incoming data as it streams into the system. There are commercial systems which can hit the speed and high-availability criteria, but I haven't researched the available OSS options (luckily, people on Quora have done so).

Take a look at Elasticsearch. It has a percolator feature that matches a document against registered queries.
http://www.elasticsearch.org/blog/2011/02/08/percolator.html

Related

Store "extended" metadata on entities stored in Azure Cosmos DB as JSON documents

We are building a REST API in .NET deployed to Azure App Service / Azure API App. From this API, client can create "Products" and query "Products". The product entity has a set of fields that are common, and that all clients have to provide when creating a product, like the fields below (example)
{
"id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
"name": "Product One",
"description": "The number one product"
}
We store these products currently as self-contained documents in Azure Cosmos DB.
Question 1: Partitioning.
The collection will not store a huge number of documents; we are talking about a maximum of around 2 500 000 documents of 1 - 5 kb each (estimates). We have currently chosen the id field (which is our system-generated id, not the internal Cosmos DB document id) as the partition key, which means 2 500 000 logical partitions with one document in each partition. The documents will be used in some low-latency workloads, but these workloads will query by id (the partition key). Clients will also query by e.g. name, which results in a fan-out query, but those queries will not be latency-critical. In the portal, you can't create a single-partition collection anymore, but you can do it from the SDK or use a fixed partition key value. If we keep all these documents in one single partition (we are talking about data far below 10 GB here), we will never get any fan-out queries, but will rely more on the index within the one logical partition. So the question: even if we don't have huge amounts of data, is it still wise to partition the way we currently do?
Question 2: Extended metadata.
We will face clients that want to write client/application/customer-specific metadata beyond the basic common fields. What is the best way to do this?
Some brainstorming from me below.
1: Just dump everything in one self-contained document.
One option is to allow clients of the API to add a nested "extendedMetadata" field with key-value pairs when creating a product. Cosmos DB is schema-agnostic, so in theory this should work fine. Some products can have zero extended metadata, while other products can have a lot of it. For the clients, we can promise the basic common fields, but for the extended metadata field we cannot promise anything in terms of number of fields, naming etc. The document size will then vary. As mentioned, these products will still be used in latency-critical workloads that query by "id" (the partition key). The extended metadata will never be used in any latency-critical workloads. How, and how much, does the document size affect performance / throughput in general? For the latency-critical read scenario, will the query optimizer go straight to the right partition and then use the index to quickly retrieve the document fields of interest? Or will the whole document always be loaded and processed, regardless of which fields you want to query?
{
"id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
"name": "Product One",
"description": "The number one product",
"extendedMetadata": {
"prop1": "prop1",
"prop2": "prop2",
"propN": "propN"
}
}
The extended metadata is only useful to retrieve from the same API in certain situations. We can then do something like:
api.org.com/products/{id} -- will always return a product with the basic common fields
api.org.com/products/{id}/extended -- will return the full document (basic + extended metadata)
2: Split the document
One option might be to do some kind of splitting. If a client of the API creates a product that contains extended metadata, we can implement some logic that splits the document if extendedMetadata contains data. I guess the split can be done in many ways; brainstorming below. I guess the main objective of splitting the documents (which requires more work on write operations) is to get better throughput in case the document size plays a significant role here (in most cases, the clients will be OK with the basic common fields).
One basic document that only contains the basic common fields, and one extended document (with the same id) that contains the basic common fields + extended metadata (duplication of the basic common fields). We can add a "type" field that differentiates between the basic and extended document. If a client asks for extended, we will only query documents of type "extended".
One basic document that only contains the basic common fields + a reference to an extended document that only contains the extended metadata. This means that a read operation where the client asks for a product with extended metadata requires reading two documents.
Look into splitting it in different collections, one collection holds the basic documents with throughput dedicated to low-latency read scenarios, and one collection for the extended metadata.
Sorry for the long post. I hope this was understandable; looking forward to your feedback!
Answer 1:
If you can guarantee that the documents' total size will never be more than 10 GB, then creating a fixed collection is the way to go, for two reasons.
First, there is no need for a cross-partition query. I'm not saying it will be lightning fast without partitioning, but because you are only interacting with a single physical partition, it will be faster than going into every single physical partition looking for data.
(Keep in mind, however, that every time people think that they can guarantee things like the max size of something, it usually doesn't work out.)
The /id partitioning strategy is only efficient if you can ALWAYS provide the id. This is called a read. If you need to search by any other property, you are performing a query, and this is where the system wouldn't do so well.
Ideally you should design your Cosmos DB collection in a way that you never do a cross-partition query as part of your everyday workload. Maybe once in a blue moon for reporting reasons.
Answer 2:
Cosmos DB is a NoSQL schema-less database for a reason.
The second approach in your brainstorming would be fitting for a traditional RDBMS database but we don't have that here.
You can simply go with your first approach and either have everything under a single property or just have them at the top level.
Remember that you can just map the response to any object that you want, so you can simply have two DTOs, a slim and an extended version, and map to one or the other depending on the endpoint.
Hope this helps.
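The two-DTO idea can be sketched roughly like this. The example is in Python rather than .NET purely for brevity, and the class and function names are hypothetical; `stored` stands for the single self-contained document read from the database.

```python
from dataclasses import dataclass, field


@dataclass
class ProductSlim:
    """Shape returned by /products/{id}: basic common fields only."""
    id: str
    name: str
    description: str


@dataclass
class ProductExtended(ProductSlim):
    """Shape returned by /products/{id}/extended: basic fields + extended metadata."""
    extendedMetadata: dict = field(default_factory=dict)


def to_slim(doc: dict) -> ProductSlim:
    # Extra properties in the stored document are simply not mapped.
    return ProductSlim(doc["id"], doc["name"], doc["description"])


def to_extended(doc: dict) -> ProductExtended:
    return ProductExtended(doc["id"], doc["name"], doc["description"],
                           dict(doc.get("extendedMetadata", {})))


stored = {
    "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
    "name": "Product One",
    "description": "The number one product",
    "extendedMetadata": {"prop1": "prop1", "prop2": "prop2"},
}
slim = to_slim(stored)
extended = to_extended(stored)
print(slim)
print(extended)
```

Both endpoints read the same stored document; only the mapping step differs.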

Mongo vs MySql Search Optimization

So I'm in the process of designing a system that is going to store document-type data (i.e. transcribed documents). Immediately, I thought this would be a great opportunity to leverage a NoSQL implementation like MongoDB. However, given that I have zero experience with Mongo, I'm wondering: on each of these documents, I have a number of metadata tags I want to be able to search across: things like date, author, keywords, etc. If I were to use an RDBMS like MySQL, I'd probably store these items in a separate table linked by a foreign key and then index the items that were most likely to be searched on. Then I could run queries against that table and only pull back the full-text results for the items that matched (which saves on disk reads by not having to read through a row that contains a lot of text or BLOB information).
Would something similar be possible with Mongo? I know in Mongo I could simply create 1 document that would have all the metadata AND the actual transcription but is it easy and highly performant to search the various fields in the metadata if the document is stored like that? Is there a best practice when needing to perform searches across various items in a document in Mongo? Or is this type of scenario more suited for an RDBMS rather than a NOSQL implementation?
You can add indexes for individual fields in mongodb documents. Only when the indexes get larger than your memory, performance of index based searches may become a problem.
When deciding whether to go with MongoDB, keep in mind that there is no join operation. Joins have to be done by your DB layer or above.
If your primary concern is searching, there is an ElasticSearch river for mongodb, so you can utilize ElasticSearch on your dataset.
The NoSQL model is geared toward long-term data storage (OLTP model). Yes, you can create indexes and run the queries you want; instead of having related entities spread across tables, you have one complete entity that contains all of its dependent entities within itself.
When you have to extract complex reports with many joins from a relational database holding millions of rows, doing so becomes impractical, because you may end up degrading other applications.
For example:
We have the room and student entities.
Each room has many students; in the relational model we would have the following:
SELECT * FROM ROOM R
INNER JOIN
STUDENT S
ON S.ID = R.STUDENTID
Imagine doing that with some 20 tables holding thousands of rows each. The performance will be horrible.
With MongoDB you would do this:
db.sala.find(null)
and get all the rooms with their students embedded ("sala" is the rooms collection here).
MongoDB is a database that scales horizontally.
You can read:
http://openmymind.net/mongodb.pdf
This site also has an interactive tutorial that uses the book's examples. Very nice.
You can also try MongoDB online and test your commands: search for "try mongo db".
Also read about shards with replicaSets. I believe it will open your mind greatly.
You can install Robomongo which is a graphical interface for you to tinker with mongodb.
http://robomongo.org/

Database optimized for searching in large number of objects with different attributes

I am currently searching for an alternative to our aging MySQL database, which uses an EAV approach. Current projects seem to have outgrown traditional table-oriented database structures, and especially searches in such a database.
I have heard and read about various NoSQL database systems, but I can't find anything that seems to be what I'm looking for. Maybe you can help.
I'll show you a generalized example on what kind of data I have and what operations I want to execute on them:
I have an object that has a small number of META attributes: attributes that are common to all instances of my objects. For example:
DataObject Common (META) Attributes
Unique ID (Some kind of string containing a unique identifier)
Created Date (A date time showing creation time of the object)
Type (Some kind of type identifier, maybe something like "Article", "News", "Image" or "Video")
... I think you get the Idea
Then each of my Objects has a variable number of other attributes. Most probably, many Objects will share a number of these attributes, but there is no rule. For my sample, we say each Object instance has between 5 to 20 such attributes. Here are some samples
Data Object variable Attributes
Color (Some CSS like color string)
Name (A string)
Category (The category or Tag of this item) (Maybe we also have more than one of these?)
URL (a url containing some website)
Cost (a number with decimals)
... And a whole lot of other stuff mostly being of the usual column types
References to other data is an idea, but not a MUST at the moment. I could provide those within my application logic if needed.
A small sample:
Image
Unique ID = "0s987tncsgdfb64s5dxnt"
Created Date = "2013-11-21 12:23:11"
Type = "Image"
Title = "A cute cat"
Category = "Animal"
Size = "10234"
Mime = "image/jpeg"
Filename = "cat_123.jpg"
Copyright = "None"
Typical Operations
An average storage would probably have around 1-5 million such objects, each with 5-20 attributes.
Apart from the usual stuff like writing one object to the database or reading it by its uid, the most problematic operations are these:
Search by several attributes - Select every DataObject that has Type "News", whose Title contains "blue", and whose Created Date is after 2012.
Paged bulk read - Get a large number of objects from a search (see above), starting at element 100 and ending at 250.
Get many objects with all of their attributes - When reading larger numbers of objects, I need to get every object with all of its attributes in one call.
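To make the first two operations concrete, here is a rough sketch of how they might be expressed against a document store that accepts MongoDB-style filter documents. The field names (`type`, `title`, `created`) are assumptions, "after 2012" is read as "on or after 2013-01-01", and "element 100 to 250" is read as a 0-based range with an exclusive end. The code only builds the filter and paging values; it does not talk to a server.

```python
import re
from datetime import datetime


def build_filter(type_name, title_contains, created_after_year):
    """Build a MongoDB-style filter for 'search by several attributes'."""
    return {
        "type": type_name,
        # Substring match on the title; re.escape guards user-supplied text.
        "title": {"$regex": re.escape(title_contains)},
        # Assumption: 'after 2012' means from 1 January of the following year.
        "created": {"$gte": datetime(created_after_year + 1, 1, 1)},
    }


def page_window(first, last):
    """Translate 'elements first..last' (last exclusive) into skip/limit values."""
    if first < 0 or last < first:
        raise ValueError("expected 0 <= first <= last")
    return {"skip": first, "limit": last - first}


search = build_filter("News", "blue", 2012)
window = page_window(100, 250)
print(search)
print(window)

# With PyMongo, these values could be passed along the lines of:
#   collection.find(search).sort("created", -1)
#             .skip(window["skip"]).limit(window["limit"])
```

Because each object is stored as one document, a search like this returns every object with all of its attributes in a single call, which covers the third operation as well.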
Storage Requirements
Persistence - The storage needs to be persistent and not in memory only. If the server reboots, the data has to be at the same point in time as when it shut down. No memory-only systems.
Integrity - All data is important; nothing can be ignored. So every single write action has to be securely stored. Systems (Redis?) that tend to lose something now and then aren't usable. Systems with large asynchronous delays are also problematic. If data changes, every responsible node should see that.
Complexity - The system should be fairly easy to set up and maintain. So, systems that force the admin to take many week-long courses in their use aren't really a solution here. The same goes for huge data warehouses with loads of nodes. Clustering is nice, but it should also be possible to get a cheap system with one node.
tl;dr
Need a super fast database system for object-oriented data, with fast searches even with hundreds of thousands of items.
A reason as to why I am searching for a better alternative to mysql can be found here: Need MySQL optimization for complex search on EAV structured data
Update
Key-value stores like Redis weren't an option, as we need to do some heavy searching inside our data, something which isn't possible in a typical key-value store.
In the end, we are using MongoDB with a slightly optimized schema to make the best use of MongoDB's indexes.
Some small drawbacks still remain but are acceptable at the moment:
- MongoDB's aggregate function cannot work with very large result sets. We have to use find (and refine our data structure to make that one sufficient).
- You cannot sort large datasets on specific values, as it would take up too much memory. You also can't create indexes on those values, as they are schema-free.
I don't know if you want a more sophisticated answer than mine, but maybe I can inspire you a little.
MySQL is scalable and can be used for exactly your case. I think it's more of an optimization and server problem if your database is slow. Many systems with massive amounts of data use MySQL and work perfectly, though NoSQL (Not Only SQL) is built for large amounts of data with different attributes.
There are many different NoSQL providers, and they have different ways of handling your data.
Think about that before you choose a NoSQL platform.
The possibilities are
Key–value Stores - ex. Redis, Voldemort, Oracle BDB
Column Store - ex. Cassandra, HBase
Document Store - ex. CouchDB, MongoDb
Graph Database - ex. Neo4J, InfoGrid, Infinite Graph
Most websites use document-based storage, but Facebook, for example, uses a column-based store because of its many dynamic attributes.
You can try the Document based NoSql at http://try.mongodb.org/
In the end, it really depends on how you build and optimize your database, and not on which technology you choose, though choosing the right technology can save a bunch of time.
The system we have developed uses a combination of MySQL and NoSQL, depending on what data we are working with: MySQL for the system itself and NoSQL for all the data we import via APIs.
Hope this inspires a little, and feel free to ask any questions.

MySQL scalable data model

I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource with every collection that is relevant to it. Assume that typically a resource is relevant to about half of the 1,000 collections (200 users x 5 collections each), so that's 30,000 x (1,000 / 2) = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage is a 12-hour window. That's .23/sec up to 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data. Write some benchmark transactions. Run them under load to benchmark the sharding alternatives.
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like any of the requirements of your problem.
"MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported."

MySQL: Complex queries or tracking/counter fields

I'm just thinking about MySQL database design and there are often situations where
A particular action is or is not carried out and consequently data is or is not stored in the database
Whether or not a user undertook a particular action is displayed statistically
An example of this would be:
A user does or does not fill out a survey. If they do fill out a survey, the data they provide is stored in the database. The total number of users who filled out the survey is displayed.
Now, in order to get the number of users who filled out the survey, we could either
create a field of type BOOL which is set to TRUE on survey completion; we then calculate the number of users who completed the survey using a simple COUNT(*) WHERE field=TRUE
calculate the number of users who filled out the survey using the data they provided by joining the users and survey results tables and grouping on the user
This isn't a particularly complex example, but there are cases where, without the BOOL flag, queries can become very complex and expensive. But the flag is an almost unnecessary addition to the database tables; we use it only for convenience. It also means we have to ensure that we UPDATE all user flags at the relevant time, as well as storing user data.
What would be your approach to this kind of problem? For smaller applications, I'll usually just write complex queries and cache their results (occasionally using views to make things more manageable). But in larger applications, with potentially many joins, I might be tempted to flag the users with an action field so that reads are simpler and cheaper.
The best solution is an indexed view (SQL Server terminology) or a materialized view (Oracle terminology) or a materialized query table (DB2 terminology). All those solutions keep the data up to date in real time. No maintenance.
When your platform doesn't support those kinds of database objects, you have to resort to using a table, along with all the other things necessary to keep the data right. You can keep the data right with
triggers
cron jobs
If you use triggers, you should probably also run a periodic cron job to make sure the data stored matches the data calculated.
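A small sketch of the trigger-plus-verification approach, using Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names are made up for the example, and the trigger is written in SQLite's syntax (MySQL's CREATE TRIGGER differs in details such as FOR EACH ROW). The final comparison is the kind of check a periodic job could run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id          INTEGER PRIMARY KEY,
    completed_survey INTEGER NOT NULL DEFAULT 0   -- the convenience flag
);
CREATE TABLE survey_results (
    result_id INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    answer    TEXT NOT NULL
);
-- Keep the flag in step with the data instead of updating it from application code.
CREATE TRIGGER mark_survey_completed
AFTER INSERT ON survey_results
BEGIN
    UPDATE users SET completed_survey = 1 WHERE user_id = NEW.user_id;
END;
""")
conn.executemany("INSERT INTO users (user_id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO survey_results (user_id, answer) VALUES (?, ?)",
                 [(1, "yes"), (1, "no"), (3, "yes")])

# Cheap read path: count the stored flags.
flagged = conn.execute(
    "SELECT COUNT(*) FROM users WHERE completed_survey = 1").fetchone()[0]
# Verification path: recompute the number from the underlying data.
computed = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM survey_results").fetchone()[0]

print(flagged, computed, "consistent" if flagged == computed else "MISMATCH")
```

Note that this sketch only handles inserts; deletes or corrections to survey data would need their own triggers, which is part of the maintenance cost described above.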
It helps that, in the real world, most of these kinds of requirements really don't have to be up to date in real time. These kinds of numbers usually support management decisions; a lag of even a day is often acceptable. (In other words, it sometimes helps to think of it as a data warehouse problem or as a report rather than as an OLTP problem.) I've had to negotiate these kinds of requirements many times. I've never had anyone refuse to accept a two-hour update cycle. (But that's certainly application-dependent.)
calculate the number of users . . . by joining the users and
survey results tables and grouping on
the user
If you can join the users and the survey results tables, then the survey results table must have a user identifier, right? If that's right, you don't need to join those two tables to determine the number of users who completed a survey.
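To illustrate the point, the following sketch runs both versions of the count and shows that they agree. It again uses sqlite3 as a stand-in for MySQL, with hypothetical table and column names, and assumes every row in the results table carries a valid user identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE survey_results (
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    question INTEGER NOT NULL,
    answer   TEXT NOT NULL
);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO survey_results VALUES (1, 1, 'yes'), (1, 2, 'no'), (3, 1, 'yes');
""")

# Version from the question: join users to results and group on the user.
with_join = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT u.user_id
        FROM users u JOIN survey_results r ON r.user_id = u.user_id
        GROUP BY u.user_id
    ) AS completed
""").fetchone()[0]

# Version suggested here: the results table alone already identifies the users.
without_join = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM survey_results").fetchone()[0]

print(with_join, without_join)
```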
What you are describing is called a "denormalized view", i.e. a table that contains results which can be computed from other data already in the database. The reason to do this is indeed performance. Whether to do it or not depends on the cost of (re-)generating the data, the effort required in your code to keep it coherent, and the extra database space needed to store the computed values.