DocumentDB partitioning strategy for real estate data

Say I am building a DocumentDB collection of real estate properties in the US and Canada (eventually, I may need to add other countries as well), and I expect to have several million documents in my collection. Also, let's assume that the most popular query will be to retrieve the top X properties within a certain radius of a given location.
Given these requirements, what would be a good partitioning strategy? Would using the ZIP code/Postal code be a good partitioning key? Would a strategy involving the geo location be better? Any other suggestion?

Actually, I suggest that you use partitioned collections with the id as your partition key, and then use geo queries. It's dirt simple and will get you maximum fan-out on your queries, which will give you the best throughput. Later, if that doesn't work, you can think about a more performant partitioning strategy.
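For the radius query itself, something along these lines should work with DocumentDB's SQL grammar (a minimal sketch: the properties collection name, the location GeoJSON Point property, and the ~8000 m radius are assumptions; GeoJSON coordinates are [longitude, latitude] and ST_DISTANCE works in meters):
SELECT TOP 10 p.id, p.address
FROM properties p
WHERE ST_DISTANCE(p.location, {'type': 'Point', 'coordinates': [-122.33, 47.61]}) < 8000
With id as the partition key, this runs as a cross-partition query, which is the fan-out mentioned above.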

Related

Best practices for creating a huge SQL table

I want to create a table about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each of which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiple states at the same time (i.e. I won't need to frequently join tables if I go with Option 2). Here is a common use case: the front-end passes a state id to the back-end, and based on that id, I need to query data from the db regarding the specified state and return data back to the front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, splitting by state (the first scenario above) is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
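If partitioning does turn out to be necessary, a minimal sketch of what it could look like in MySQL follows (table and column names are hypothetical; note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):
CREATE TABLE users (
  user_id INT NOT NULL,
  state_id TINYINT NOT NULL,
  name VARCHAR(100),
  PRIMARY KEY (user_id, state_id)
) PARTITION BY LIST (state_id) (
  PARTITION p_alabama VALUES IN (1),
  PARTITION p_alaska VALUES IN (2),
  PARTITION p_arizona VALUES IN (3)
  -- ...one partition per remaining state
);
Queries that filter on state_id (as in the use case above) are then pruned to a single partition.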

What is an optimal approach to storing locations that can be queried by distance?

I want to implement a feature where a list of nearby venues can be presented sorted by the distance from user's location. The approach I have right now is to store lat and lon values as floats and to make a query that is looking for +/- values of the location of the user (searching for a square that extends north, south, east and west of the user). Then I do a quick calculation across the resultset determining the distance and sort in my business logic. Now I am approaching this with the perspective of someone who has primarily used relational databases (the app is running MySQL with Hibernate), but is there a better approach (in a different database like Neo4J or with a better column type?)
Also, the approach I have requires a semi-complex workaround for queries at or near 0 lat or 0 lon.
As for my working definition of optimal, I'm looking for approaches that are scalable to potentially hundreds of venues in a 10-mile radius and hundreds of thousands of venues in total. To put it another way, approximately 1% of SimpleGeo; so if the scale of this problem doesn't require an optimal solution, then "you're alright" would also be an interesting answer, though I'd be interested in knowing why.
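For reference, the bounding-box prefilter described above amounts to something like this in MySQL (a sketch; the venues table, its columns, and the :delta degree offset are assumptions), with the exact great-circle distance and sorting then done in the business logic:
SELECT id, name, lat, lon
FROM venues
WHERE lat BETWEEN :userLat - :delta AND :userLat + :delta
  AND lon BETWEEN :userLon - :delta AND :userLon + :delta;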
You could have a look at Lucene/Solr. Lucene has supported location-aware search since at least v2.9.
If you're worried about the Lucene complexities, there's Hibernate Search, which is meant to replicate all database changes across to Lucene transparently.
MongoDB has native support for geospatial indexes and extensions to the query language to support a lot of different ways of querying your geospatial documents.
But if you are looking for a relational database, try PostgreSQL with PostGIS.
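With PostGIS, for example, the whole radius-plus-sort query can be pushed into the database (a sketch, assuming a venues table with a geography(Point, 4326) column named geom and a GiST index on it; ST_DWithin takes meters):
SELECT id, name,
       ST_Distance(geom, ST_MakePoint(-122.33, 47.61)::geography) AS meters
FROM venues
WHERE ST_DWithin(geom, ST_MakePoint(-122.33, 47.61)::geography, 16093)
ORDER BY meters
LIMIT 100;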
Have you looked at Hibernate Spatial?
Hibernate Spatial is a generic extension to Hibernate for handling geographic data, and it has a MySQL provider.
http://www.hibernatespatial.org/

Follower System, better in MySQL or Redis?

I'm just wondering which solution to choose to implement a follower system?
In MySQL I would have a table like:
CREATE TABLE followers (
  userID INT NOT NULL,
  followID INT NOT NULL,
  PRIMARY KEY (userID, followID));
And in Redis I would just use a SET per userID and add all the followIDs to it.
What would be faster for, let's say, someone having 2000 followers when you want to list all the followers? (in a table that has about 1M entries)
What would be faster to find out if two Users follow each other?
Thank you very much!
By modern standards, 1M items are nothing. Any database or NoSQL system will work fine with such volume, so you just have to pick the one you are the most comfortable with.
In terms of absolute performance, Redis will be faster than MySQL for this use case, because:
the whole dataset will be in memory
hash tables are faster than btrees
there is no SQL query to parse or execute
However, please note a relational database is far more flexible than a key/value store like Redis. If you can anticipate all the access paths to your data, then Redis is a good solution. Otherwise you will be better served by a more traditional database.
In my opinion, go with MySQL.
The two biggest points you will think about when making the decision are:
1) Have you thought about your use-cases?
You said you want to implement a follower system. If you're only going to be displaying a list of followers which each user has, then the Redis SET will be enough.
But what if you want to get "a list of users which you are currently following"? You can't dig that up easily from your Redis SET, right? Or how about if you wanted to know whether User-X is following User-A? If User-A had 10,000 followers, this wouldn't be easy either, would it?
MySQL is much more flexible when querying different types of results in different scenes.
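For instance, with a table like the one in the question (a sketch, assuming a (userID, followID) row means that followID follows userID; the IDs are made up, and the reverse lookup wants a secondary index on followID):
-- followers of user 42
SELECT followID FROM followers WHERE userID = 42;
-- users that user 42 is following: the reverse lookup that is awkward with a single Redis set
SELECT userID FROM followers WHERE followID = 42;
-- is user 7 following user 42?
SELECT 1 FROM followers WHERE userID = 42 AND followID = 7;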
2) Do you really need the performance difference?
As you know, Redis IS faster than MySQL in these kinds of cases.
It is a simple Key-Value system, so it will exceed the performance of MySQL.
Check out performance results like these:
http://colinhowe.wordpress.com/2009/04/27/redis-vs-mysql/
http://ruturaj.net/redis-memcached-tokyo-tyrant-and-mysql-comparision/
But the performance difference between Redis and MySQL really starts to kick in only after about 5,000 requests/sec. Otherwise you wouldn't be seeing a difference of more than 50 ms.
The performance difference will not be an issue until you have VERY large traffic.
So, after thinking about these two points, MySQL would be a better answer.
Redis will be good only if:
1) The purpose of the set/list is specific, and there is no need for flexibility in the future
2) You feel that the performance difference will actually have an effect on your architecture.
It depends on what you want to do with the data. You gave some examples, but it does not sound as though you are really giving a full definition of what the product needs to do. If all you really want to do is show users whether they follow each other, then either is fine, as you are just talking about two simple queries. However, what if you want to show two users the intersection of users they share, or you want to make suggestions off of the data based on profile data for the users? Then it becomes more interesting, as Redis has functionality to easily give you the intersection of sets very, very quickly (we're talking magnitude differences in terms of speed, not just milliseconds - and the difference gets exponentially larger as there are more users/relationships to parse, since the SQL joins required to get the data can become prohibitive if you want to serve the data in real time).
sadd friends:alex george paul bart
sadd friends:alice mary sarah bart
sinterstore friends:alex_alice friends:alex friends:alice
Note that the above can be done with MySQL as well, but your performance will suffer, and it would be something that you are more likely to run as a batch job and then store the results for future use. On the other hand, keep in mind that the largest "friends" network in the world, Facebook, started with MySQL to store relationships. The graphs of those relationships were batched and heavily denormalized for storage in thousands of memcached servers to get decent performance.
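For comparison, the rough MySQL equivalent of the SINTERSTORE above is a self-join (a sketch; the friendships(owner, friend) table is an assumption, and the names come from the Redis example):
SELECT a.friend
FROM friendships a
JOIN friendships b ON b.friend = a.friend
WHERE a.owner = 'alex' AND b.owner = 'alice';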
Then if you are looking for more options beyond MySQL or Redis, you might want to read what Michael Stonebraker has to say (he helped create Postgres and Ingres) about using an RDBMS for graph data such as friend relationships: http://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/. Of course, he's trying to sell his new VoltDB, but it is interesting food for thought.
So I think you really need to map out the requirements for the app (as I assume it will do more than just show you who your friends are) in terms of both expected load (did you just throw out 2000 or is that really what you expect to handle) and features and budget. Then really examine many of the different options on the market.

Normalize database or not? Read only MyISAM table, performance is the main priority (MySQL)

I'm importing data to a future database that will have one, static MyISAM table (will only be read from). I chose MyISAM because as far as I understand it's faster for my requirements (I'm not very experienced with MySQL / SQL at all).
That table will have various columns such as ID, Name, Gender, Phone, Status... and Country, City, Street columns. Now the question is, should I create tables (e.g. Country: Country_ID, Country_Name) for the last 3 columns and refer to them in the main table by ID (normalize... [?]), or just store them as VARCHAR in the main table (having duplicates, obviously)?
My primary concern is speed - since the table won't be written to, data integrity is not a priority. The only actions will be selecting a specific row or searching for rows that match certain criteria.
Would searching by the Country, City and/or Street columns (and possibly other columns in the same search) be faster if I simply use VARCHAR?
EDIT: The table has about 30 columns and about 10m rows.
It can be faster to search if you normalize, as the database will only have to compare an integer instead of a string. The table data will also be smaller, which makes it faster to search since more can be loaded into memory at once.
If your tables are indexed correctly then it will be very fast either way - you probably won't notice a significant difference.
You might also want to look at full-text search if you find yourself writing LIKE '%foo%', as such leading-wildcard queries can't use an ordinary index and will result in a full table scan.
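For instance, on MyISAM a FULLTEXT index can replace those leading-wildcard scans (a sketch; the table and column names are assumptions based on the question):
ALTER TABLE properties ADD FULLTEXT INDEX ft_street (Street);
SELECT * FROM properties WHERE MATCH(Street) AGAINST('elm' IN NATURAL LANGUAGE MODE);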
I'll try to give you something more than the usual "It Depends" answer.
#1 - Everything is fast for small N - if you have less than 100,000 rows, just load it flat, index it as you need to and move on to something higher priority.
Keeping everything flat in one table is faster for reading everything (all columns), but to seek or search into it you usually need indexes. If your data is very large with redundant City and Country information, it might be better to have surrogate foreign keys into separate tables, but you can't really say that hard and fast.
This is why some kind of data modeling principles are almost always used - either traditional normalized (e.g. Entity-Relationship) or dimensional (e.g. Kimball) is usually used - the rules or methodologies in both cases are designed to help you model the data without having to anticipate every use case. Obviously, knowing all the usage patterns will bias your data model towards supporting them - so a lot of aggregations and analysis is a strong indicator to use a denormalized dimensional model.
So it really depends a lot on your data profile (row width and row count) and usage patterns.
I don't have much more than the usual "It Depends" answer, unfortunately.
Go with as much normalization as you need for the searches you actually do. If you never actually search for people who live on Elm Street in Sacramento or on Maple Avenue in Denver, any effort to normalize those columns is pretty much wasted. Ordinarily you would normalize something like that to avoid update errors, but you've stated that data integrity is not a concern.
Watch your slow query log like a hawk! That will tell you what you need to normalize. Do EXPLAIN on those queries and determine whether you can add an index to improve it or whether you need to normalize.
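For example (a sketch; the table name is an assumption, the columns come from the question):
EXPLAIN SELECT * FROM properties WHERE Country = 'Canada' AND City = 'Toronto';
-- if EXPLAIN reports a full table scan, a composite index may be enough:
CREATE INDEX idx_country_city ON properties (Country, City);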
I've worked with some data models that we called "hyper-normalized." They were in all the proper normal forms, but often for things that just didn't need it given how we used the data. Those kinds of data models are difficult to understand at a casual glance, and they can be very annoying.

Top k problem - finding usage for my academic work

Top k problem - searching for the BEST k (3 or 1000) elements in a DB
There is a fundamental problem with relational DBs: to find the top k elements, ALL rows in the table have to be processed, which makes it useless on big data.
I'm building an application (for university research - not really my invention; I'm implementing and trying to improve the original idea) that allows you to effectively find the top k elements by visiting only 3-5% of the stored data, which makes it really fast.
There are even user preferences: for a given domain you can specify a value function that defines the best value of each attribute for the user, and an aggregation function that weights the most significant attributes.
For example, a DB of cars with attributes (price, mileage, age of car, ccm, fuel/mile, type of car...) and a user aggregation of, say, 10*price + 5*fuel/mile + 4*mileage + age of car; (s)he doesn't care about the type of car or the rest - this is the aggregation specification.
Then for each attribute (price, mileage, ...) there can be a totally different "value function" that specifies the best value for the user. For example, price: the lower, the better; the value drops until $50k, where it reaches 0 (the user doesn't want a car more expensive than $50k). Mileage: another function based on his/her criteria, and so on...
You can see that there is quite a lot of freedom to specify your preferences, and according to them the best k elements in the DB will be found quickly.
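For contrast, the naive relational version of such a query is a full scan plus sort, roughly like this (a sketch; the weights come from the car example above, the column names are assumptions, and lower scores are taken as better):
SELECT car_id, 10*price + 5*fuel_per_mile + 4*mileage + age AS score
FROM cars
ORDER BY score
LIMIT 3;
Every row has to be scored before the sort, which is exactly the full scan the proposed approach avoids.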
I've spent many sleepless nights thinking about real-life usability - who could benefit from such a query system? - but I have failed to come up with anything and am stuck in a purely academic, write-only stance. :-( I hope there can be some real usage for it, but I don't see any...
... do YOU have any idea how to use this for a real-life, real problem, etc.?
I'd love to hear from you.
Have a database of people's CVs and establish hiring criteria for different jobs, allowing for a dynamic display of the top k candidates.
Also, considering the fast nature of your solution, you can think of exploiting it in rendering near real-time graphs of highly dynamic data, like stock market quotes or even applications in molecular or DNA-related studies.
New idea: perhaps your research might have applications in clustering, where you would use it to implement fast k-nearest-neighbor clustering by complex criteria without having to scan the whole data set each time. This would lead to faster clustering of larger data sets with respect to more complex criteria when picking the k-NN for each data node.
There are unlimited possible real-use scenarios. Getting the top-n values is used all the time.
But I highly doubt that it's possible to get top-n objects without having an index. An index can only be built if the properties that will be searched are known ahead of searching. And if that's the case, a simple index in a relational database is able to provide the same functionality.
It's used in financial organizations all the time: you need to see the most profitable / least profitable assets, etc.