Different data structures & Complexities [closed] - language-agnostic

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
I know this wiki link exists, which lists different data structures.
I want to know if there is a place where I can get the complexities (for insert, delete, update, etc.) in a neat table format (for reference).

The page that you linked to in your question has a list of many data structures. Each of them has a page that details the specific data structure. I know you want the table of comparisons in a ready-made format, but since that does not appear to exist, it might be something that you can put together easily by browsing through the various pages. For instance, the complexities of the various array operations are given here, and for the B-tree here. So it may require some work to compile it all into a simple reference. Hmmm... maybe there is a blog post in the making.

Here it is on Wikipedia: Worst-case analysis of data structures
+----------------------+----------+------------+----------+-------------+
|                      | Insert   | Delete     | Search   | Space Usage |
+----------------------+----------+------------+----------+-------------+
| Unsorted array       | O(1)     | O(1)       | O(n)     | O(n)        |
| Value-indexed array  | O(1)     | O(1)       | O(1)     | O(n)        |
| Sorted array         | O(n)     | O(n)       | O(log n) | O(n)        |
| Unsorted linked list | O(1)*    | O(1)*      | O(n)     | O(n)        |
| Sorted linked list   | O(n)*    | O(1)*      | O(n)     | O(n)        |
| Balanced binary tree | O(log n) | O(log n)   | O(log n) | O(n)        |
| Heap                 | O(log n) | O(log n)** | O(n)     | O(n)        |
| Hash table           | O(1)     | O(1)       | O(1)     | O(n)        |
+----------------------+----------+------------+----------+-------------+
* The cost to add or delete an element at a known location in the list
(i.e. if you have an iterator to that location) is O(1).
If you don't know the location, you need to traverse the list to the point of deletion/insertion, which takes O(n) time.
** The deletion cost is O(log n) for the minimum or maximum, O(n) for an arbitrary element.

Related

Many to many relationship with different data types

I am trying to create a database for different types of events. Each event has arbitrary, user-created properties of different types, for example "number of guests", "special song to play", "time the clown arrives". Not every event has a clown, but one user could still have different events with a clown. My basic concept is:
propID | name   | type
------ | ------ | ------
1      | #guest | number
2      | clown  | time
and another table with every event, with a unique eventID. The problem is that a simple approach like
eventID | propID | value
------- | ------ | -----
1       | 1      | 20
1       | 2      | 10:00
does not really work because of the different data types.
Now I have thought about some possible solutions, but I don't really know which one is best, or whether there is an even better solution:
1. I store all values as strings and keep the data type in the property table. I think this is called EAV and is not considered good practice.
2. There is only a limited number of meaningful data types, which could lead to a table like this:
eventID | propID | stringVal | timeVal | numberVal
------- | ------ | --------- | ------- | ---------
1       | 1      | null      | null    | 20
1       | 2      | null      | 10:00   | null
3. Use one table per possible data type, like:
propDateEvent                    propNumberEvent
--------------------------       --------------------------
eventID | propId | value         eventID | propId | value
--------|--------|--------       --------|--------|--------
1       | 2      | 10:00         1       | 1      | 20
Somehow I think every solution has its ups and downs. #1 feels like the simplest but least robust. #3 seems like the cleanest solution, but pretty complicated if I wanted to add e.g. a priority for the properties per event.
All the options you propose are variations on entity/attribute/value or EAV. The basic concept is that you store entities (in your case events), their attributes (#guest, clown), and the values of those attributes as rows, not columns.
There are lots of EAV questions on Stack Overflow, discussing the benefits and drawbacks.
Your 3 options provide different ways of storing the data - but you don't address the ways in which you want to retrieve that data, or verify the data you're about to store. This is the biggest problem with EAV.
How will you enforce the rule that all events must have "#guests" as a mandatory field (for instance)? How will you find all events that have at least 20 guests and no clown booked? How will you show a list of events between 2 dates, ordered by date and number of guests?
If those requirements don't matter to you, EAV is fine. If they do, consider using a document to store this user-defined data (JSON or XML). MySQL can query those documents natively, you can enforce business logic much more easily, and you won't have to write horribly convoluted queries for even the simplest business cases.
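For instance, here is a minimal sketch of the document approach in MySQL 5.7+; the table and column names are illustrative, not from the question:

-- Minimal sketch of the JSON-document approach (MySQL 5.7+).
-- Table and column names are illustrative.
CREATE TABLE events (
    eventID    INT PRIMARY KEY,
    eventDate  DATE NOT NULL,
    properties JSON NOT NULL
);

INSERT INTO events VALUES
    (1, '2017-06-01', '{"guests": 20, "clown_arrives": "10:00"}'),
    (2, '2017-06-02', '{"guests": 50}');

-- Events with at least 20 guests and no clown booked:
SELECT eventID
FROM events
WHERE CAST(properties->>'$.guests' AS UNSIGNED) >= 20
  AND JSON_EXTRACT(properties, '$.clown_arrives') IS NULL;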

How to add related properties to a table in mysql correctly

We have been developing the system at my place of work for some time now, and I feel the database design is getting out of hand somewhat.
For example, we have a widgets table (I'm spoofing these somewhat):
+------------------------+
| Widget                 |
+------------------------+
| Id | Name     | Price  |
| 1  | Sprocket | 100    |
| 2  | Dynamo   | 50     |
+------------------------+
*There are already 40+ columns on this table
We want to add a property to each widget for packaging information. We need to know whether it has packaging information, whether it doesn't, or whether we simply don't know either way. We then also need to store the type of packaging (assuming it has any; otherwise that detail is redundant).
We already have another table which stores the detail information (I personally think this table should be divided up, but that's another issue).
PD = PackageDetails
+---------------------------+
| System Properties         |
+---------------------------+
| Id | Type | Value         |
| 28 | PD   | Boxed         |
| 29 | PD   | Vacuum Sealed |
+---------------------------+
*There are thousands of rows in this table for all the system-wide properties
Instinctively I would create a number of mapping tables to capture this information. I have however been instructed to just add another column onto each table to avoid doing a join.
My solution:
Create tables:
+---------------------------------------------------+
| widgets_packaging                                  |
+---------------------------------------------------+
| Id | widget_id | packing_info | packing_detail_id |
| 1  | 27        | PACKAGED     | 2                 |
| 2  | 28        | UNKNOWN      | NULL               |
+---------------------------------------------------+
+--------------------+
| packaging          |
+--------------------+
| Id | Value         |
| 1  | Boxed         |
| 2  | Vacuum Sealed |
+--------------------+
If I want to know what packaging a widget has, I join through to widgets_packaging, and join again to packaging if I want to know the exact details. Therefore: no more columns on the widgets table.
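For illustration, that lookup would look something like this (a sketch using the table and column names sketched above; the packaging description column being called Value is my assumption):

-- Sketch of the mapping-table lookup; names as sketched above,
-- with the packaging description column assumed to be called Value.
SELECT w.Id,
       w.Name,
       wp.packing_info,
       p.Value AS packing_detail
FROM Widget w
JOIN widgets_packaging wp ON wp.widget_id = w.Id
LEFT JOIN packaging p ON p.Id = wp.packing_detail_id
WHERE w.Id = 27;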
I have, however, been told to ignore this and instead store an int value for the packing information, plus another column as a foreign key to the System Properties table to find the packaging details. Therefore adding another two columns to the widgets table and creating yet more rows in the System Properties table to store package details.
+-----------------------------------------------------------+
| Widget                                                     |
+-----------------------------------------------------------+
| Id | Name     | Price | has_packaging | packaging_details |
| 1  | Sprocket | 100   | 1             | 28                |
| 2  | Dynamo   | 50    | 0             | 29                |
+-----------------------------------------------------------+
The reason for this is that it's simpler and doesn't involve a join if you only want to know whether the widget has packaging (there are lots of widgets). They are concerned that more joins will slow things down.
Which is the more correct solution here, and are their concerns about speed legitimate? My gut instinct is that we can't just keep adding columns onto the widgets table, which is already growing and growing with flags for properties.
The answer to this really depends on whether the application(s) using this database are read or write intensive. If it's read intensive, the de-normalized structure is a better approach because you can make use of indexes. Selects are faster with fewer joins, too.
However, if your application is write intensive, normalization is a better approach (the structure you're suggesting is a more normalized approach). Tables tend to be smaller, which means they have a better chance of fitting into the buffer. Also, normalization tends to lead to less duplication of data, which means updates and inserts only need to be done in one place.
To sum it up:
Write Intensive --> normalization
smaller tables have a better chance of fitting into the buffer
less duplicated data, which means quicker updates / inserts
Read Intensive --> de-normalization
better structure for indexes
fewer joins means better performance
If your application is not heavily weighted toward reads over writes, then a more mixed approach would be better.
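To illustrate the read-intensive case: with the wide table, an index on the flag lets the common check run without touching a second table at all. A sketch, assuming the denormalized Widget layout from the question (the index name is made up):

-- Sketch: a simple index on the denormalized flag answers
-- "which widgets have packaging?" with no join.
CREATE INDEX idx_widget_has_packaging ON Widget (has_packaging);

SELECT Id, Name
FROM Widget
WHERE has_packaging = 1;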

How to store all geographical locations around the world in Database?

I am working for a travel site, where I need to store the tourist spots which tourists have traveled to. I need the spots to be unique in the locations table so that I can know the popularity of a particular spot, etc.
I will also need all countries, states, and cities stored with me, because I cannot depend on user input.
The database is MySQL.
Looking at the data sets available for such locations, I see there is a problem of nesting of cities across countries, which may use provinces, states, counties, etc.
So, my question is how to design the schema so that I can store all the locations.
I was thinking about having tables for countries, states, cities, and spots.
The spots table will contain spot_name, cityId, stateId, countryId, and some fields for longitude and latitude bounds.
This way I can identify the same spots by their geopositions.
But again, this solution won't work because of the states/provinces/counties etc. problem.
Can you please suggest how to build the schema and go about seeding it with correct data, so that dependency on user input is minimal.
You should use a geospatial database, as then you can store your locations like countries and states as spatial entities and so determine the nesting correctly.
If you can't use one, you can simulate geospatial positions using strings in a normal table by dividing the world up into a grid, then subdividing each square of the grid recursively.
For example, divide the world into 9 squares, numbered 1-9 from top left to bottom right. Anything which is in these large squares has only a single-digit reference. Then divide each square into 9; anything which is at this level has a 2-digit reference, so 11 is the top left square and 99 is the bottom right square.
Repeat this process until you have the precision that you need. A single feature might have a reference 10 digits long, e.g. 5624357899, but you would know that it is inside any larger feature whose shorter reference is a prefix of it, like 5624357. So your countries would have fewer digits because they are larger, but your individual locations would have more because they are smaller and more accurately located.
This will only give you a coarse approximation of location (and will be bad for long, thin features) but might be good enough.
The first grid will look like this:
______________________________
| | | |
| 1 | 2 | 3 |
| | | |
|_________|_________|_________|
| | | |
| 4 | 5 | 6 |
| | | |
|_________|_________|_________|
| | | |
| 7 | 8 | 9 |
| | | |
|_________|_________|_________|
The second round looks like this (only first square completed for simplicity):
______________________________
|11|12 |13| | |
|---------| 2 | 3 |
|14|15 |16| | |
|---------| | |
|17|18 |19| | |
|_________|_________|_________|
| | | |
| 4 | 5 | 6 |
| | | |
|_________|_________|_________|
| | | |
| 7 | 8 | 9 |
| | | |
|_________|_________|_________|
You repeat this process until you have a fine enough approximation for your purposes.
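In a plain MySQL table, the prefix scheme might look something like this (a sketch; the table and column names are illustrative):

-- Sketch of the grid-reference scheme in an ordinary table.
CREATE TABLE locations (
    location_id INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    grid_ref    VARCHAR(16)  NOT NULL,   -- e.g. '5624357899'
    KEY idx_grid_ref (grid_ref)
);

-- Everything nested inside the feature with reference '5624357':
-- a prefix match, which can still use the index on grid_ref.
SELECT location_id, name
FROM locations
WHERE grid_ref LIKE '5624357%';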
I think the schema part of your problem would be pretty simple. But the real problem is how you would get the data for your user to select - you are imagining the (almost) impossible! I don't think there is any database in existence which would translate a co-ordinate to a place name. Even Google can't (yet) do that for you - for example, a search for "Lat Long Taj Mahal" provides 27.1750, 78.0419 (Google have used their own and other people's experience to tell you that); but a search for "27.1750, 78.0419" just yields a pin on the map, and then our human eyes can see that the pin is 'pretty close' to a place named "Taj Mahal" (or ताज महल in Hindi, or તાજ મહેલ in Gujarati )...
Just imagine how you would populate your schema. Think about how many co-ordinates you would need in your table if you wanted decent accuracy (needing at least 6 decimal places)! And who would be the authority on place names?
So I think your best approach might be to:
Use the publicly available lists of country/city names translated to their co-ordinates,
Build your app so it pre-populates the closest co-ordinate to the user's precise location, and then
Allow the user to qualify the match with their own (more specific) chosen place name.
Then YOU could store the precise co-ordinate gathered by your app, along with the place name the user specified; and sell the data for $millions! (I suspect Google are already doing this ;)

More efficient to have more columns or more rows?

I'm currently redesigning a database which could contain a lot of data - I have the option to either include a number of different columns in the database or use a lot of rows instead. It's probably easier if I did some kind of outline below:
item_id | user_id | title | description | content | category | template | comments | status
-------------------------------------------------------------------------------------------
1       | 1       | ABC   | DEF         | GHI     | 1        | default  | 1        | 1
2       | 1       | ZYX   |             | QWE     | 2        | default  | 0        | 1
3       | 1       | A     |             | RTY     | 2        | default  | 0        | 0
4       | 2       | ABC   | DEF         | GHI     | 3        | custom   | 1        | 1
5       | 2       | CBA   |             | GHI     | 3        | custom   | 1        | 1
Versus something in the following structure:
item_id | user_id | attribute   | value
---------------------------------------
1       | 1       | title       | ABC
1       | 1       | description | DEF
1       | 1       | content     | GHI
...     | ...     | ...         | ...
I may want to create additional attributes in the future (50, for argument's sake), so there could be a lot of empty cells if using multiple columns. The attribute names would be reused, where possible, across different types of content - say a blog entry, event, and gallery - title would easily be reused.
So my question is: is it more efficient to use multiple columns or multiple rows, in terms of query speed and disk space? Or would you instead recommend relationship tables, so there's a table for blogs, a table for events, etc.? I'm just trying to come up with an easily expandable solution, where I ideally do not want to create a table for every kind of content, as I'm thinking of developers creating new kinds of content via an app/API system (with attributes being tightly controlled).
Supplementary Question if Multiple Rows
How could I, in MySQL, convert multiple rows into a usable column format (I guess temporary tables) - so I could do some filtering by content type, as an example.
Basically, MySQL has a variable row length as long as one does not change the row format on a per-table level. Thus, empty columns will not use any space (well, almost).
But with blob or text columns it might be better to normalize those, as they may have large data to store, and this needs to be read / skipped every time the table is scanned. Even if the column is not in the result set and you're doing queries outside of an index, it will take its time on a large number of rows.
As a good practice, I think it will be fast to put all administrative and often-used columns in one table and normalize all the rest. A kind of "vertical" design as in your second example will be complex to read, and as soon as you work with temporary tables you will run into performance issues sooner or later.
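As for the supplementary question, attribute rows can usually be pivoted back into columns with conditional aggregation rather than temporary tables. A sketch, assuming the attribute table from the question is called item_attributes (that name is my assumption):

-- Sketch: pivot attribute rows into columns with conditional aggregation
-- (the table name item_attributes is an assumption).
SELECT item_id,
       user_id,
       MAX(CASE WHEN attribute = 'title'       THEN value END) AS title,
       MAX(CASE WHEN attribute = 'description' THEN value END) AS description,
       MAX(CASE WHEN attribute = 'content'     THEN value END) AS content
FROM item_attributes
GROUP BY item_id, user_id
-- e.g. keep only items whose 'category' attribute is '2':
HAVING MAX(CASE WHEN attribute = 'category' THEN value END) = '2';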
For a traditional row-based store, the cost of spooling through rows will depend on their width, so scanning a table with wide rows will take longer than one with narrow rows.
That said, if you're using an index to locate the rows that are of interest, this won't be so much of an issue.
If you normalise your data by replacing columns with keys to rows in other tables, you can reduce the amount of storage if the linked tables end up being significantly smaller than the original table; however, any query will need to include the cost of the required joins to the related tables.
As with all these things, it's a balancing act that depends on your requirements, but understanding what's going on under the hood can certainly help you to make more informed decisions.
This question is very difficult to answer, as it all comes down to what you are looking for and how your database will grow in size and complexity over time. I find the best way to answer these types of questions is to read case studies from other successful sites. For example, Reddit would be a case study where they use a lot of rows but very few tables and/or columns. The article is here and a question on it is here.
There is also the option of exploring a NoSQL solution which may be more applicable to what you are trying to achieve.
Google case studies of sites that would have a similar structure to your own and see how they accomplished it as they have most likely encountered all the issues you will and already overcome them.

I've hit the DB performance bottleneck, where now?

I have some queries that are taking too long (300ms) now that the DB has grown to a few million records. Luckily for me the queries don't need to look at the majority of this data; the latest 100,000 records will be sufficient, so my plan is to maintain a separate table with the most recent 100,000 records and run the queries against this. If anyone has any suggestions for a better way of doing this, that would be great. My real question is: what are the options if the queries did need to run against the historic data? What is the next step? Things I've thought of:
Upgrade hardware
Use an in memory database
Cache the objects manually in your own data structure
Are these things correct and are there any other options? Do some DB providers have more functionality than others to deal with these problems, e.g. specifying a particular table/index to be entirely in memory?
Sorry, I should've mentioned this: I'm using MySQL.
I forgot to mention indexing in the above. Indexing has been my only source of improvement thus far, to be quite honest. In order to identify bottlenecks I've been using maatkit for the queries to show whether or not indexes are being utilised.
I understand I'm now getting away from what the question was intended for so maybe I should make another one. My problem is that EXPLAIN is saying the query takes 10ms rather than 300ms which jprofiler is reporting. If anyone has any suggestions I'd really appreciate it. The query is:
select bv.*
from BerthVisit bv
inner join BerthVisitChainLinks on bv.berthVisitID = BerthVisitChainLinks.berthVisitID
inner join BerthVisitChain on BerthVisitChainLinks.berthVisitChainID = BerthVisitChain.berthVisitChainID
inner join BerthJourneyChains on BerthVisitChain.berthVisitChainID = BerthJourneyChains.berthVisitChainID
inner join BerthJourney on BerthJourneyChains.berthJourneyID = BerthJourney.berthJourneyID
inner join TDObjectBerthJourneyMap on BerthJourney.berthJourneyID = TDObjectBerthJourneyMap.berthJourneyID
inner join TDObject on TDObjectBerthJourneyMap.tdObjectID = TDObject.tdObjectID
where
BerthJourney.journeyType='A' and
bv.berthID=251860 and
TDObject.headcode='2L32' and
bv.depTime is null and
bv.arrTime > '2011-07-28 16:00:00'
and the output from EXPLAIN is:
+----+-------------+-------------------------+-------------+---------------------------------------------+-------------------------+---------+------------------------------------------------+------+-------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+-------------+---------------------------------------------+-------------------------+---------+------------------------------------------------+------+-------------------------------------------------------+
| 1 | SIMPLE | bv | index_merge | PRIMARY,idx_berthID,idx_arrTime,idx_depTime | idx_berthID,idx_depTime | 9,9 | NULL | 117 | Using intersect(idx_berthID,idx_depTime); Using where |
| 1 | SIMPLE | BerthVisitChainLinks | ref | idx_berthVisitChainID,idx_berthVisitID | idx_berthVisitID | 8 | Network.bv.berthVisitID | 1 | Using where |
| 1 | SIMPLE | BerthVisitChain | eq_ref | PRIMARY | PRIMARY | 8 | Network.BerthVisitChainLinks.berthVisitChainID | 1 | Using where; Using index |
| 1 | SIMPLE | BerthJourneyChains | ref | idx_berthJourneyID,idx_berthVisitChainID | idx_berthVisitChainID | 8 | Network.BerthVisitChain.berthVisitChainID | 1 | Using where |
| 1 | SIMPLE | BerthJourney | eq_ref | PRIMARY,idx_journeyType | PRIMARY | 8 | Network.BerthJourneyChains.berthJourneyID | 1 | Using where |
| 1 | SIMPLE | TDObjectBerthJourneyMap | ref | idx_tdObjectID,idx_berthJourneyID | idx_berthJourneyID | 8 | Network.BerthJourney.berthJourneyID | 1 | Using where |
| 1 | SIMPLE | TDObject | eq_ref | PRIMARY,idx_headcode | PRIMARY | 8 | Network.TDObjectBerthJourneyMap.tdObjectID | 1 | Using where |
+----+-------------+-------------------------+-------------+---------------------------------------------+-------------------------+---------+------------------------------------------------+------+-------------------------------------------------------+
7 rows in set (0.01 sec)
Make sure all your indexes are optimized. Use explain on the query to see if it is using your indexes efficiently.
If you are doing some heavy joins then start thinking about doing this calculation in Java.
Think of using other DBs, such as NoSQL stores. You may be able to do some preprocessing and put data in Memcache to help you a little.
Considering a design change like this is not a good sign - I bet you still have plenty of performance to squeeze out using EXPLAIN, adjusting db variables and improving the indexes and queries. But you're probably past the point where "trying stuff" works very well. It's an opportunity to learn how to interpret the analyses and logs, and use what you learn for specific improvements to indexes and queries.
If your suggestion were a good one, you should be able to tell us why already. And note that this is a popular pessimization:
What is the most ridiculous pessimization you've seen?
Well, if you have optimised the database and queries, I'd say that rather than chop up the data, the next step is to look at:
a) the MySQL configuration, to make sure that it is making the most of the hardware
b) the hardware itself. You don't say what hardware you are using. You may find that replication is an option in your case if you can buy two or three servers to divide up the reads from the database (writes have to be done to a central server, but reads can be read from any number of slaves).
Instead of creating a separate table for the latest results, think about table partitioning. MySQL has had this feature built in since version 5.1.
Just to make it clear: I am not saying this is THE solution for your issues, just one thing you can try.
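A rough sketch of what that could look like for the table in the question (column types and partition boundaries are assumptions; note that MySQL requires the partitioning column to be part of every unique key):

-- Sketch of range partitioning by arrival time (MySQL 5.1+).
-- Column types and partition boundaries are illustrative.
CREATE TABLE BerthVisitPartitioned (
    berthVisitID BIGINT   NOT NULL,
    berthID      BIGINT   NOT NULL,
    arrTime      DATETIME NOT NULL,
    depTime      DATETIME NULL,
    PRIMARY KEY (berthVisitID, arrTime)
)
PARTITION BY RANGE (TO_DAYS(arrTime)) (
    PARTITION p2011_06 VALUES LESS THAN (TO_DAYS('2011-07-01')),
    PARTITION p2011_07 VALUES LESS THAN (TO_DAYS('2011-08-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

Queries that filter on arrTime then only touch the relevant partitions, which gives much the same effect as the separate "recent records" table without maintaining it by hand.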
I would start by trying to optimize the tables/indexes/queries before taking any of the measures you listed. Have you dug into the poorly performing queries to the point where you are absolutely convinced you have reached the limit of your RDBMS's capabilities?
Edit: if you are indeed properly optimized, but still have problems, consider creating a Materialized View for the exact data you need. That may or may not be a good idea based on more factors than you have provided, but I would put it at the top of the list of things to consider.
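MySQL has no native materialized views, so in practice this is emulated with a summary table that is refreshed periodically (e.g. from a cron job or a MySQL EVENT). A very rough sketch, reusing the BerthVisit table from the question (the summary table name and the 7-day window are assumptions):

-- Sketch: emulate a materialized view with a periodically refreshed table.
CREATE TABLE recent_berth_visits AS
SELECT bv.*
FROM BerthVisit bv
WHERE bv.arrTime > NOW() - INTERVAL 7 DAY;

-- Refresh step (run on a schedule):
TRUNCATE TABLE recent_berth_visits;
INSERT INTO recent_berth_visits
SELECT bv.*
FROM BerthVisit bv
WHERE bv.arrTime > NOW() - INTERVAL 7 DAY;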
Searching in the last 100,000 records should be terribly fast, you definitely have problems with the indexes. Use EXPLAIN and fix it.
I understand I'm now getting away from what the question was intended for
so maybe I should make another one. My problem is that EXPLAIN is saying
the query takes 10ms rather than 300ms which jprofiler is reporting.
Then your problem (and solution) must be in Java, right?