MySQL table structure for JSTree

I'm looking for the most practical way to save the node data for my JSTree.
Currently I have everything stored in a single MySQL table, with each row holding the data for a full branch. Each row is unique, meaning no two rows have all columns exactly the same. The problem with this is that it leads to a lot of data duplication.
I have tried setting up an adjacency list, but since each row can only reference a single parent ID, it again leads to a lot of duplication and also increases the possibility of linking errors.
I also considered the nested set model; however, with 100,000+ branches, adding data gets rather expensive.
I'm also currently stuck using MySQL.
So the question becomes: what is my best option for storing the tree data while keeping retrieval time to a minimum, keeping duplication to a minimum, and making it easy to update and add new data?
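For reference, the adjacency-list layout described above boils down to something like the following (table and column names are illustrative, not the actual schema); retrieving a full branch then needs a recursive query, which MySQL supports from version 8.0 via WITH RECURSIVE:

```sql
-- Minimal adjacency-list sketch; names are assumptions for illustration only.
CREATE TABLE nodes (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    parent_id INT UNSIGNED NULL,              -- NULL for root nodes
    name      VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_parent (parent_id),
    CONSTRAINT fk_parent FOREIGN KEY (parent_id) REFERENCES nodes (id)
);

-- Fetch the entire subtree below node 42 (MySQL 8.0+ recursive CTE).
WITH RECURSIVE subtree AS (
    SELECT id, parent_id, name FROM nodes WHERE id = 42
    UNION ALL
    SELECT n.id, n.parent_id, n.name
    FROM nodes n
    JOIN subtree s ON n.parent_id = s.id
)
SELECT * FROM subtree;
```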

Related

Are there any (potential) problems with having large gaps in auto-increment IDs in the neighbouring rows of a table?

I have a MySQL web app that allows users to edit personal information.
A single record is stored in the database across multiple tables. There is a single row in one table for the record, plus additional one-to-many tables for related information. Rows in the one-to-many tables can in turn point to other one-to-many tables.
All this is to say, data for a single personal information record is a tree that is very spread out in the database.
To update a record, rather than trying to deal with a hodgepodge of update and delete and insert statements to address all the different information that may change from save to save, I simply delete the entire old tree, and then re-insert a new one. This is much simpler on the application side, and so far it has been working fine for me without any problems.
However I do note that some of the auto-incrementing IDs in the one-to-many tables are starting to creep higher. It will still be decades at least before I come anywhere close to bumping against the limits of INT, let alone BIGINT -- however I am still wondering if there are any drawbacks to this approach that I should be aware of.
So I guess my question is: for database structures like mine, which consist of large trees of information spread across multiple tables, is it OK, when updating information that may have changed anywhere in the tree, to just delete the old tree and re-insert a new one? Or should I be rethinking this? In other words, is it OK or not OK for there to be large gaps between the IDs of the rows in a table?
Thanks (in advance) for your help.
If your primary keys are indexed (which they should be) you shouldn't run into problems, apart from the database files needing some compacting from time to time.
However, the kind of data you are storing could probably be stored better in a document database such as MongoDB; have you considered using one of those?
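As a sketch of the delete-and-reinsert pattern the question describes, wrapped in a transaction so the record is never half-gone (the table names, the child data, and the choice to keep the root ID stable are all made up for illustration):

```sql
-- Hypothetical tables: person (the root row) and phone_numbers (one-to-many).
START TRANSACTION;

DELETE FROM phone_numbers WHERE person_id = 123;   -- remove the old child rows first
DELETE FROM person        WHERE id = 123;          -- then remove the root row

INSERT INTO person (id, name) VALUES (123, 'Alice');   -- re-insert the root
INSERT INTO phone_numbers (person_id, number)          -- re-insert children; their new
VALUES (123, '555-0100'), (123, '555-0101');           -- AUTO_INCREMENT ids create the gaps discussed above

COMMIT;
```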

Large amounts of Data

I've been working with MS Access 2010 for a while now and for the most part everything works. YAY. However, I have large amounts of data to eventually plot (x-axis y-axis pairs) that come from a piece of equipment I use for work. I can import this data as a separate table, but I am not particularly fond of the idea of having my database overloaded with separate tables that exist purely to store this data. To my understanding, each table should represent an entity that fits into the larger context of the database. Also, for the equipment I'm using right now, all the x-axis data is redundant. The question is: what is the best way to divide the data for efficient storage?
Considerations:
I keep running into the same problems as I think about this question. Suppose that in either case I made two tables, one to store the x-axis data and another to store the y-axis data, and then had a linking table between the two allowing for a many to many relationship.
On the one hand, I could store one value per Record (all values in one Column). But, then there would need to be a tag field in each of these two tables, thus defeating the purpose of the split.
On the other hand, I could store one value per Field (all values in one Row), which in my case would yield over 2000 fields in each table.
There is a third option, the one I'm currently using, to store one pair per row in a single table. However, there is much redundancy.
You should stick with your current method. This is by far the simplest method to both retrieve and add to the data. Below I have my reactions to your other suggestions.
Suppose that in either case I made two tables, one to store the x-axis data and another to store the y-axis data, and then had a linking table between the two allowing for a many to many relationship.
This might provide a slight hard drive space improvement if X and Y are not integers. However, it would complicate things significantly for questionable benefit.
On the one hand, I could store one value per Record (all values in one Column). But, then there would need to be a tag field in each of these two tables, thus defeating the purpose of the split.
This would make it a lot more complicated to work with the data and is a bad idea. You would need complicated queries to get both data points into the same row. You could do this, but it complicates both input and retrieval.
On the other hand, I could store one value per Field (all values in one Row), which in my case would yield over 2000 fields in each table.
If you do this, you will regret it. This would make it nearly impossible to do any meaningful data analysis later on.
There is a third option, the one I'm currently using, to store one pair per row in a single table. However, there is much redundancy.
This is ideal. You can easily import your data into the two columns, and the data is easily retrievable. The redundancy in the x-axis values is not an important drawback here.
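For illustration, the one-pair-per-row layout endorsed above amounts to something like the following (shown here in generic MySQL-style DDL; in Access the equivalent would be an AutoNumber key plus Number fields, and all names are made up):

```sql
-- One row per (x, y) measurement, tagged with the equipment run it came from.
CREATE TABLE measurements (
    measurement_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    run_id         INT UNSIGNED NOT NULL,   -- which run this pair belongs to
    x_value        DOUBLE NOT NULL,
    y_value        DOUBLE NOT NULL,
    PRIMARY KEY (measurement_id),
    KEY idx_run (run_id)
);
```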

Loading a fact table in SSIS when obtaining the dimension key isn't easy

I have a fact table that needs a join to a dimension table however obtaining that relationship from the source data isn't easy. The fact table is loaded from a source table that has around a million rows, so in accordance with best practice, I'm using a previous run date to only select the source rows that have been added since the previous run. After getting the rows I wish to load I need to go through 3 other tables in order to be able to do the lookup to the dimension table. Each of the 3 tables also has around a million rows.
I've read that best practice says not to extract source data that you know won't be needed. And best practice also says to have as light a touch as possible on the source system, and therefore to avoid SQL joins. But in my case, those two best practices become mutually exclusive. If I only extract changed rows in the intermediary tables then I'll need to do a join in the source query. If I extract all the rows from the source system then I'm extracting much more data than I need, and that may cause SSIS memory/performance problems.
I'm leaning towards a join in the extraction of the source data but I've been unable to find any discussions on the merits and drawbacks of that approach. Would that be correct or incorrect? (The source tables and the DW tables are in Oracle).
Can you stage the 3 source tables that you are referencing? You may not need them in the DW, but you could have them sitting in a staging database purely for this purpose. You would still need to keep these up to date, but assuming you can just pull over the changes, this may not be too bad.
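A rough sketch of what that incremental staging pull could look like, assuming each intermediary table carries a modification timestamp (table and column names are placeholders, and the bind parameter is supplied by the package):

```sql
-- Copy only the rows changed since the previous run into a local staging table,
-- so the dimension-key lookup can be done with joins on the DW/staging side
-- instead of joining against the source system.
INSERT INTO stg_intermediate1 (business_key, lookup_key, last_modified)
SELECT business_key, lookup_key, last_modified
FROM   src_intermediate1
WHERE  last_modified > :previous_run_date;   -- :previous_run_date supplied by the SSIS package
```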

How to decide between row based and column based table structures?

I have a data set which has hundreds of parameters (with more coming in).
If I dump them in one table, it'll probably end up having hundreds of columns (and I am not even sure how many, at this point).
I could go row-based, with a bunch of meta tables, but somehow a row-based structure feels unintuitive.
One more way would be to keep it column-based, but have multiple tables (split the tables logically), which seems like a good solution.
Is there any other way to do it? If yes, could you point me to some tutorial? (I am using MySQL.)
EDIT:
Based on the answers, I should clarify one thing: updates and deletes are going to be far less frequent than inserts and selects. As it is, selects are going to be the bulk of the operations, so selects have to be fast.
I ran across several designs where a #4 was possible:
Split your columns into searchable and auxiliary
Define a table with only searchable columns, and an extra BLOB column
Put everything in one table: searchable columns go as-is, auxiliary go as a BLOB
We used this approach with BLOBs of XML data or even binary data representing the entire serialized object. The downside is that your auxiliary columns remain non-searchable for all practical purposes. The upside is that you can add new auxiliary columns at will without changing the schema. You can also make previously auxiliary columns searchable later, with a schema change and a very simple program.
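A minimal sketch of that layout, assuming the searchable attributes are known up front and everything else is serialized into the blob (all names here are invented):

```sql
CREATE TABLE items (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    -- searchable columns: indexed and queried directly
    name        VARCHAR(255) NOT NULL,
    category_id INT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    -- auxiliary columns: serialized (e.g. XML) into a single blob, not searchable
    extra_data  BLOB NULL,
    PRIMARY KEY (id),
    KEY idx_category (category_id),
    KEY idx_name (name)
);
```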
It all depends on the kind of data you need to store.
If it's not "relational" at all - for instance, a collection of web pages, documents, etc - it's usually not a good fit for a relational database.
If it's relational, but highly variable in schema - e.g. a product catalogue - you have a number of options:
single table with every possible column (your option 1)
"common" table with the attributes that each type shares, and joined tables for attributes for subtypes
table per subtype
If the data is highly variable and you don't want to make schema changes to accommodate the variations, you can use "entity-attribute-value" or EAV - though this has some significant drawbacks in the context of a relational database. I think this is what you have in mind with option 2.
If the data is indeed relational, and there is at least the core of a stable model in the data, you could of course use traditional database design techniques to come up with a schema. That seems to correspond with option 3.
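As a sketch of the "common table plus subtype tables" option mentioned above, using a product catalogue purely as an example (attribute names are invented):

```sql
-- Shared attributes live in one table...
CREATE TABLE product (
    product_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name       VARCHAR(255) NOT NULL,
    price      DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (product_id)
);

-- ...and each subtype gets its own table keyed by the same id.
CREATE TABLE product_book (
    product_id BIGINT UNSIGNED NOT NULL,
    isbn       CHAR(13) NOT NULL,
    page_count INT NOT NULL,
    PRIMARY KEY (product_id),
    CONSTRAINT fk_book_product FOREIGN KEY (product_id) REFERENCES product (product_id)
);
```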
Does every item in the data set have all those properties? If yes, then one big table might well be fine (although scary-looking).
On the other hand, perhaps you can group the properties. The idea being that if an item has one of the properties in the group, then it has all the properties in that group. If you can create such groupings, then these could be separate tables.
So should they be separate? Yes, unless you can prove that the cost of performing joins is unacceptable. Perform all SELECTs via stored procedures and you can denormalise later on without much trouble.
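A minimal example of wrapping a SELECT in a stored procedure so the physical layout can be denormalised later without touching callers (procedure, table, and column names are all illustrative):

```sql
DELIMITER //
CREATE PROCEDURE get_item_properties (IN p_item_id BIGINT UNSIGNED)
BEGIN
    -- Callers only know about this procedure; if the underlying tables are
    -- later merged or split, only this query has to change.
    SELECT i.id, i.name, g.prop_a, g.prop_b
    FROM   items i
    LEFT JOIN item_group_props g ON g.item_id = i.id
    WHERE  i.id = p_item_id;
END //
DELIMITER ;
```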

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
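In MySQL terms, that first trick amounts to storing the checksum in a BINARY(20) column and converting at the boundary (a sketch; the table name is made up):

```sql
CREATE TABLE dataset_items (
    item_id BINARY(20) NOT NULL PRIMARY KEY   -- raw SHA1 bytes, not the hex string
);

-- Hex string in, raw bytes stored:
INSERT INTO dataset_items (item_id)
VALUES (UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709'));

-- Raw bytes out, hex string back for the application:
SELECT LOWER(HEX(item_id)) AS sha1_hex FROM dataset_items;
```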
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing large amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
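Concretely, and folding in the earlier point about storing the raw bytes, that single-table layout might be declared like this (a sketch, not a prescription):

```sql
CREATE TABLE saved_searches (
    search_id BIGINT UNSIGNED NOT NULL,   -- one value per saved dataset
    item_id   BINARY(20)      NOT NULL,   -- raw SHA1 of the Solr document
    PRIMARY KEY (search_id, item_id)
) ENGINE=InnoDB;

-- Expiring a dataset is a single range delete over the clustered key:
DELETE FROM saved_searches WHERE search_id = 42;
```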
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything already in the database, that is the most efficient. Alternatively, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always applying a maximum id to the saved query.
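The zero-space variant described above could look roughly like this, assuming the items table has a monotonically increasing primary id (all names are hypothetical):

```sql
-- Store only the high-water mark that existed when the search was saved.
CREATE TABLE saved_search_bounds (
    search_id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    query_text  TEXT            NOT NULL,   -- the original search criteria
    max_item_id BIGINT UNSIGNED NOT NULL    -- highest item id at save time
);

-- Re-running the saved search excludes anything added later:
SELECT i.*
FROM   items i
JOIN   saved_search_bounds b ON b.search_id = 42
WHERE  i.id <= b.max_item_id;
-- The application re-applies the criteria stored in b.query_text against this bounded set.
```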