I'm helping someone do homework for their databases class, but the syllabus has changed since I took it. Specifically there is one problem about selectivity that I have no background knowledge for and relentless Google-ing isn't turning up proper results. If anyone could look at the problem and point me towards an online resource that covers this type of problem I would be really grateful!
"Let R be a relation instance with schema(NAME, GENDER, AGE, INCOME). The selectivity of δGENDER='F'(R) is 0.4, the selectivity of δAGE<=30(R) is 0.3 and the selectivity of δINCOME>90,000(R) is 0.6. What is the selectivity of δGENDER='F' AND (AGE<=30 OR INCOME<=90,000)(R)"
To me this looks like a probability problem (which meshes with what little I can find about selectivity - it's just the ratio of rows that survive a filter), so my first instinct is to say that the answer is 0.4 * (0.3 + 0.6), but I have no idea if I'm in the right ballpark.
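For reference, the usual textbook treatment assumes the predicates are independent and combines them with sel(A AND B) = sel(A) * sel(B), sel(A OR B) = sel(A) + sel(B) - sel(A) * sel(B), and sel(NOT A) = 1 - sel(A). Under that (purely assumed) independence, and noting that the question asks about INCOME <= 90,000 while the given selectivity is for INCOME > 90,000, the arithmetic would go roughly like this:

```latex
\[
\begin{aligned}
\operatorname{sel}(\mathrm{INCOME} \le 90{,}000) &= 1 - 0.6 = 0.4 \\
\operatorname{sel}(\mathrm{AGE} \le 30 \ \lor\ \mathrm{INCOME} \le 90{,}000) &= 0.3 + 0.4 - 0.3 \cdot 0.4 = 0.58 \\
\operatorname{sel}(\mathrm{GENDER}=\text{'F'} \ \land\ (\dots)) &= 0.4 \cdot 0.58 = 0.232
\end{aligned}
\]
```

These combination rules are covered in the query optimization / cost estimation chapter of most database textbooks, usually under "selectivity estimation".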
I have to build the structure of a posts table to handle a large amount of data (let's say, 1 million rows), notably with these two fields:
latitude
longitude
What I'd like to do is optimise the time consumed by read queries, when sorting by distance.
I have chosen this type: decimal (precision: 10, scale: 6), thinking it is more precise than float and a good fit here.
Would it be appropriate to add an index on latitude and an index on longitude?
I'm always scared watching all the operations, such as SIN(), that ORMs perform to build such queries. I'd like to follow best practices, to be sure it will scale, even with a lot of rows.
Note: If a general solution is not possible, let's say the database is MySQL.
Thanks.
INDEX(latitude) will help some, but to make it significantly faster you need a more complicated data structure and code. See my blog.
In there, I point out that 6 decimal places is probably overkill in resolution, unless you are trying to distinguish two persons standing next to each other.
There is also reference code that includes the trigonometry to handle great circle distances.
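For a rough idea of the shape of such a query (this is not the blog's code, just the common bounding-box-plus-haversine pattern; the centre point and box sizes are placeholders, and a posts(id, latitude, longitude) table is assumed):

```sql
-- Bounding box + haversine: the latitude range can use INDEX(latitude),
-- so the trigonometry only runs on the rows that survive the box.
SET @lat := 48.8566, @lng := 2.3522;   -- search centre (placeholder values)

SELECT id, latitude, longitude,
       2 * 6371 * ASIN(SQRT(
           POWER(SIN(RADIANS(latitude - @lat) / 2), 2) +
           COS(RADIANS(@lat)) * COS(RADIANS(latitude)) *
           POWER(SIN(RADIANS(longitude - @lng) / 2), 2)
       )) AS dist_km                                     -- great-circle distance in km
FROM posts
WHERE latitude  BETWEEN @lat - 0.09 AND @lat + 0.09      -- roughly 10 km of latitude
  AND longitude BETWEEN @lng - 0.13 AND @lng + 0.13      -- roughly 10 km at mid-latitudes
ORDER BY dist_km
LIMIT 20;
```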
Every single user has, say, 3 GROUP_As, 10 GROUP_Bs per GROUP_A, and 20 GROUP_Cs per GROUP_B. Each of the 20 GROUP_Cs involves lots of inserts/deletes... and every piece of data is unique across all GROUPs/users.
I'm not an expert; I've done some research, but it's all theoretical at this point, of course, and I certainly don't have hands-on experience with the implementation. I think my options are something like 'adjacency lists' or 'nested sets'?
Any guidance into the right direction would be very much appreciated!
(I posted this on DBA stackexchange too but I'd really appreciate if I could get more opinions and help from the community!)
I know the trivial solution is just to have simple tables with foreign keys to the parent 'container' but I'm thinking about in the long term, in the event there's a million users or so.
I would go with just that approach. As long as the number of hierarchy levels remains fixed, the resulting scheme will likely scale well because it is so trivial. Fancy table structures and elaborate queries might work well enough for small data sets, but for large amounts of data, simple structures will work best.
Things would be a lot more difficult if the number of levels could vary. If you want to be prepared for such cases, you could devise a different approach, but that would probably scale badly as the amount of data increases.
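As a minimal sketch of what that trivial foreign-key layout could look like for a fixed three-level hierarchy (table and column names are invented; adjust types to your actual data):

```sql
CREATE TABLE group_a (
    id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id BIGINT UNSIGNED NOT NULL,
    INDEX (user_id)
);

CREATE TABLE group_b (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    group_a_id BIGINT UNSIGNED NOT NULL,
    FOREIGN KEY (group_a_id) REFERENCES group_a(id) ON DELETE CASCADE
);

CREATE TABLE group_c (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    group_b_id BIGINT UNSIGNED NOT NULL,
    payload    VARCHAR(255),
    FOREIGN KEY (group_b_id) REFERENCES group_b(id) ON DELETE CASCADE
);

-- Pulling one user's whole subtree stays a simple, index-friendly join:
SELECT c.*
FROM group_a a
JOIN group_b b ON b.group_a_id = a.id
JOIN group_c c ON c.group_b_id = b.id
WHERE a.user_id = ?;
```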
I'm working on a browser-based RPG for one of my websites, and right now I'm trying to determine the best way to organize my SQL tables for performance and maintenance.
Here's my question:
Does the number of columns in an SQL table affect the speed in which it can be queried?
I am not a newbie when it comes to PHP or MySQL. I used to develop things with the common goal of getting them to work, but I've recently advanced to the stage where a functional program is not good enough unless it's fast and reliable.
Anyways, right now I have a members table that has around 15 columns. It contains information such as the player's username, password, email, logins, page views, etcetera. It doesn't contain any information on the player's progress in the game, however. If I added columns for things such as army size, gold, turns, and whatnot, then it could easily rise to around 40 or 50 total columns.
Oh, and my database structure IS normalized.
Will a table with 50 columns that gets constantly queried be a bad idea? Should I split it into two tables; one for the user's general information and one for the user's game statistics?
I know I could check the query time myself, but I haven't actually created the tables yet and I think I'd be better off with some professional advice on this important decision for my game.
Thank you for your time! :)
The number of columns can have a measurable cost if you're relying on table-scans or on caching pages of table data. But the best way to get good performance is to create indexes to assist your queries. If you have indexes in place that benefit your queries, then the width of a row in the table is pretty much inconsequential. You're looking up specific rows through much faster means than scanning through the table.
Here are some resources for you:
EXPLAIN Demystified
More Mastering the Art of Indexing
Based on your caveat at the end of your question, you already know that you should be measuring performance and only fixing code that has problems. Don't try to make premature optimizations.
Unfortunately, there are no one-size-fits-all rules for defining indexes. The best set of indexes needs to be designed specifically for the queries that you need to be fastest. It's hard work, requiring a lot of analysis, testing, and comparative performance measurements. It also requires a lot of reading to understand how your particular RDBMS uses indexes.
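For example (column names taken from your question; the index is only a guess at what a hot query might need):

```sql
-- If one hot query is "top players by gold", a small composite index lets
-- MySQL answer it from the index alone, no matter how many other columns
-- the members table carries.
ALTER TABLE members ADD INDEX idx_gold_username (gold, username);

-- Served by scanning the index in reverse order; the 40-50 other columns
-- are never touched because the index covers the query:
SELECT username, gold
FROM members
ORDER BY gold DESC
LIMIT 25;

-- Check the plan; you want to see "Using index" in the Extra column:
EXPLAIN SELECT username, gold FROM members ORDER BY gold DESC LIMIT 25;
```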
Top-k problem - searching for the BEST k (3 or 1000) elements in a DB
There is a fundamental problem with relational DBs: to find the top k elements, you have to process ALL the rows in the table, which makes them useless on big data.
I'm making an application (for university research - not really my invention; I'm implementing and trying to improve the original idea) that lets you find the top k elements efficiently by visiting only 3-5% of the stored data, which makes it really fast.
There are even user preferences: for a given domain, you can specify a value function that defines what is best for the user, and an aggregation function that specifies the most significant attributes.
For example, a DB of cars with attributes (price, mileage, age of car, ccm, fuel/mile, type of car, ...), where the user values, say, 10*price + 5*fuel/mile + 4*mileage + age of car and doesn't care about the type of car or the rest - this is the aggregation specification.
Then for each attribute (price, mileage, ...) there can be a totally different "value function" that specifies what is best for the user. For example, price: the lower, the better; the value decreases up to $50k, where it becomes 0 (the user doesn't want a car more expensive than $50k). Mileage: another function based on his/her criteria, and so on...
You can see that there is quite a lot of freedom in specifying your preferences, and according to them the best k elements in the DB will be found quickly.
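To make the cost of the straightforward relational approach concrete, the naive version of the car example would be something like this (table, columns, and the exact value functions are only illustrative):

```sql
-- The per-attribute value functions and the weighted aggregation are folded
-- into one score, and MySQL has to compute that score for EVERY row before
-- LIMIT can apply - exactly the full scan the top-k approach tries to avoid.
SELECT id,
       10 * GREATEST(1 - price / 50000, 0)       -- price: $0 is best, $50k or more is worth 0
     +  5 * (1 / (1 + fuel_per_mile))            -- fuel consumption: lower is better
     +  4 * (1 / (1 + mileage / 10000))          -- mileage: lower is better
     +  1 * (1 / (1 + age_years))                -- age: newer is better
       AS score
FROM cars
ORDER BY score DESC
LIMIT 10;
```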
I've spent many sleepless nights thinking about real-life usability. Who could benefit from such a query DB? But I've failed to come up with anything and am stuck with a purely academic, write-only stance. :-( I hope there is some real use for it, but I don't see any...
... do YOU have any idea how to use this in real life, on a real problem, etc.?
I'd love to hear from you.
Have a database of people's CVs and establish hiring criteria for different jobs, allowing for a dynamic display of the top k candidates.
Also, considering the fast nature of your solution, you can think of exploiting it in rendering near real-time graphs of highly dynamic data, like stock market quotes or even applications in molecular or DNA-related studies.
New idea: perhaps your research might have applications in clustering, where you would use it to implement fast k-nearest-neighbor selection by complex criteria without having to scan the whole data set each time. This would lead to faster clustering of larger data sets with respect to more complex criteria for picking the k-NN of each data node.
There are unlimited possible real-use scenarios. Getting the top-n values is used all the time.
But I highly doubt that it's possible to get top-n objects without having an index. An index can only be built if the properties that will be searched are known ahead of searching. And if that's the case, a simple index in a relational database is able to provide the same functionality.
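For example (illustrative table and column names), when the ranking column is known up front, a plain index already gives "top n without reading everything":

```sql
-- MySQL walks the index from one end and stops after n rows, so only a
-- handful of index entries are ever read.
CREATE INDEX idx_profit ON assets (profit);

SELECT asset_id, profit
FROM assets
ORDER BY profit DESC
LIMIT 100;
```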
It's used in financial organizations all the time; you need to see the most profitable / least profitable assets, etc.
I am working on a roguelike and am using a GA to generate levels. My question is: how many levels should be in each generation of my GA, and how many generations should it have? Is it better to have a few levels in each generation with many generations, or the other way around?
There really isn't a hard and fast rule for this type of thing - most experiments like to use at least 200 members in a population at the barest minimum, scaling up to millions or more. The number of generations is usually in the 100 to 10,000 range. In general, to answer your final question, it's better to have lots of members in the population so that "late-bloomer" genes stay in a population long enough to mature, and then use a smaller number of generations.
But really, these aren't the important thing. The most critical part of any GA is the fitness function. If you don't have a decent fitness function that accurately evaluates what you consider to be a "good" level or a "bad" level, you're not going to end up with interesting results no matter how many generations you use, or how big your population is :)
Just as Mike said, you need to try different numbers. If you have a large population, you need to make sure to have a good selection function. With a large population, it is very easy to cause the GA to converge to a "not so good" answer early on.