Structure in MySQL (for compactness I am using a simplified notation)
Notation: table name->[column1(key or index), column2, …]
documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(index), title, description]
Each document can contain a large number of elements (between 1 and 100k+)
We have two key requirements:
Load all elements for a given doc_id quickly
Update the value of one individual element by its element_id quickly
Structure in Cassandra
1st solution
documents->[doc_id(primary key), title, description, elements] (elements could be a SET or a TEXT; each time new elements are added (they are never removed) we would append their IDs to this column)
elements->[element_id(primary key), title, description]
To load a document we would need:
Load the document with the given doc_id and get all element IDs: SELECT * FROM documents WHERE doc_id='id'
Load all elements with the given IDs: SELECT * FROM elements WHERE element_id IN (ids loaded from query a)
Updating elements would be done by their primary key.
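Spelled out as CQL, the access pattern of this 1st solution would look roughly like the following sketch (the ? bind markers stand in for the actual ids):

SELECT * FROM documents WHERE doc_id = ?;              -- step a: read the document and its element ids
SELECT * FROM elements WHERE element_id IN (?, ?, ?);  -- step b: read the elements themselves
UPDATE elements SET title = ? WHERE element_id = ?;    -- point update by primary key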
2nd solution
documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(secondary index), title, description]
To load a document we would need:
SELECT * FROM elements WHERE doc_id='id'
Updating elements would be done by their primary key.
Questions regarding our solutions:
1st: Will it be efficient to query 100k+ primary keys in the elements table?
SELECT * FROM elements WHERE element_id IN (element_id1, ..., element_id100K+)?
2nd: Will it be efficient to query just by a secondary index?
Could anyone give any advice on how we should model our use case?
With Cassandra it's all about the access pattern (I hope I understood it correctly; if not, please comment).
1st
documents should not use sets, because a set is limited to 65,535 elements and has to be read and updated in its entirety every time a change is made. Since you need 100k+, it's not what you want. You could use frozen collections etc., but then again, reading everything into memory every time is bound to be slow.
2nd
Secondary indexes can be fine for small-cardinality data, but from what I understand you have 100k+ elements per document. It might even be fine, but it's not best practice. I would simply try it out in your concrete case.
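For reference, a minimal CQL sketch of this 2nd option (the index name is an assumption):

CREATE TABLE elements (
    element_id uuid PRIMARY KEY,
    doc_id uuid,
    title text,
    description text
);
CREATE INDEX elements_doc_id_idx ON elements (doc_id);

SELECT * FROM elements WHERE doc_id = ?;   -- served by the secondary index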
3rd - the "disk is cheap" approach - always write the data the way you are going to read it. Cassandra's writes are dirt cheap, so prepare the views at write time.
This one satisfies reading all the elements belonging to a doc_id:
documents->[doc_id(primary key), title_doc (static), description_doc(static), element_id(clustering key), title, description]
elements remain pretty much as they were:
elements->[element_id(primary key), doc_id, title, description]
When doing updates, you update both documents and elements (for consistency you can use a batch operation, should you need it). If you only have the element_id, you can quickly issue another query after you get its doc_id.
Depending on your updating needs, doc_id could also be a set. (I might not have gotten this part right because I'm not sure what data is available when updating an element: do you also have the doc_id, and can one element be in more than one doc?)
Also, since having 100k+ elements in a single partition is not the best thing (all requests for a document will go to one node), I would propose a composite partition key with a bucket; I think in your case a simple fixed int would be just fine. Every time you go to retrieve the elements, you just issue selects for doc_id + (1, 2, 3, 4, ...) and merge the results at the client - this will be significantly faster.
One tricky part is that you shouldn't have to look in every single bucket for an element_id that is stored in the document... when I think about it, it would be better to use a power of two for the bucket count. In your case 16 would be ideal: when you want to update a specific element, just apply some simple hash function known to you to the element_id and use the last 4 bits.
Now that I think about it, if the element_id + doc_id are always known to you, you might not even need the elements table at all.
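A rough CQL sketch of the bucketed layout described above (the table name, the 16-bucket count and the client-side hash are assumptions to be tuned):

CREATE TABLE document_elements (
    doc_id uuid,
    bucket int,            -- last 4 bits of a hash of element_id, i.e. 0..15
    element_id uuid,
    title text,
    description text,
    PRIMARY KEY ((doc_id, bucket), element_id)
);

-- read: fire one select per bucket in parallel and merge client-side
SELECT * FROM document_elements WHERE doc_id = ? AND bucket = 0;
-- ... repeat for buckets 1..15 ...

-- update: recompute the bucket from element_id, then hit a single row;
-- wrap in a batch if the elements table is kept around for consistency
BEGIN BATCH
    UPDATE document_elements SET title = ? WHERE doc_id = ? AND bucket = ? AND element_id = ?;
    UPDATE elements SET title = ? WHERE element_id = ?;
APPLY BATCH;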
Hope this helps
Based on the suggestion of Marko, our solution is:
CREATE TABLE documents (
doc_id uuid,
description text,
title text,
PRIMARY KEY (doc_id)
);
CREATE TABLE elements (
doc_id uuid,
element_id uuid,
title text,
description text,
PRIMARY KEY (doc_id, element_id)
);
We can retrieve all elements with the following query:
SELECT * FROM elements WHERE doc_id='id'
And update the elements:
UPDATE elements SET title='Hello' WHERE doc_id='id' AND element_id='id';
Related
My question is actually about the usability/performance of a concept/idea I had:
The Setup:
Throughout my database, two (actually three) fields constantly reappear: title and description (and created). The title is always a VARCHAR(100) and the description always a TEXT.
Now, to simplify those tables, I thought about something (and changed it that way): wouldn't it be more useful to just create a table named content, with id, title, description and created as its only fields, and always point to that table from all the others?
Example:
table tab has id, key and content_id (instead of title, description and created)
table chapter has id, story_id and content_id (" ")
etc
The Question:
Everything works fine so far, but my only fear is performance. Will I run into a bottleneck, doing it this way, or should I be fine? I have about 23 different tables pointing to content right now, and some of them will hold user-defined content (journals, comments, etc) - so the number of entries in content could get quite high.
Is this setup better, or equal to having title and description in every separate table?
Edit: And if it turns out to be a bad idea, what are the alternatives to maintaining/copying certain fields like title and description into ~25 tables?
Thanks in advance for the help!
There is no clear answer to your question, because it mainly depends on how the tables are used, so just consider the following points:
How often will you need to write to the tables? In the case of many inserts/updates, having the data in one big table can cause problems, because all write operations will target the same table.
How often do you need the data stored in the table with common data? If title or description are not needed most of the time for your selects, this can be OK. If you need the title every time, then take into account that you will always have to JOIN the table with common data.
How do you manage your database schema? It can be easier to write a simple tool for creating/checking table structure. In MySQL you can easily access the data dictionary with DESCRIBE table_name or through the INFORMATION_SCHEMA database (see the sketch after this answer).
I'm working on project with 700+ tables where some of the fields have to be present in every table (when was record created, timestamp of last modification). We have simple script that helps with this, because having all data in one table would be disastrous.
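As a sketch of such a structure check (the schema name is hypothetical), this lists every base table in a schema that is missing a description column:

SELECT t.table_name
FROM information_schema.tables t
WHERE t.table_schema = 'mydb'          -- assumed schema name
  AND t.table_type = 'BASE TABLE'
  AND NOT EXISTS (SELECT 1
                  FROM information_schema.columns c
                  WHERE c.table_schema = t.table_schema
                    AND c.table_name = t.table_name
                    AND c.column_name = 'description');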
I have a maybe stupid question but I need to ask it :-)
My Friendly URL (furl) database design approach is fairly summarized in the following diagram (InnoDB at MySQL 5.5 used)
Each item will generate as many furls as there are languages available on the website. The furl_redirect table represents the controller path for each item. Here is an example:
item.id = 1000
item.title = 'Example title'
furl_redirect = 'item/1000'
furl.url = 'en/example-title-1000'
furl.url = 'es/example-title-1000'
furl.url = 'it/example-title-1000'
When you insert a new item, its furl_redirect and furls must also be inserted. The problem appears because of the (necessary) unique constraint in the furl table. As you can see above, in order to get unique URLs I use the title of the item (which is not necessarily unique) + the id to create the unique URL. That means the order of inserting rows would have to be as follows:
1. Insert item -- (and get the id of the new item inserted) ERROR!! furl_redirect_id must not be null!!
2. Insert furl_redirect -- (needs the item id to create the path)
3. Insert furl -- (needs the item id to create the url)
I would like an elegant solution to this problem, but I can not get it!
Is there a way of getting the next auto-increment value on an InnoDB table, and is it recommended to use it?
Can you think of another way to ensure the uniqueness of the friendly urls that is independent of the items' id? Am I missing something crucial?
Any solution is welcome!
Thanks!
You can get an auto-increment in InnoDB, see here. Whether you should use it or not depends on what kind of throughput you need and can achieve. Any auto-increment/identity type column, when used as a primary key, can create a "hot spot" which can limit performance.
Another option would be to use an alphanumeric ID, like bit.ly or other URL shorteners. The advantage of these is that you can have short IDs that use base 36 (a-z+0-9) instead of base 10. Why is this important? Because you can use a random number generator to pick a number out of a fairly big domain - 6 characters gets you 2 billion combinations. You convert the number to base 36, and then check to see if you already have this number assigned. If not, you have your new ID and off you go, otherwise generate a new random number. This helps to avoid hotspots if that turns out to be necessary for your system. Auto-increment is easier and I'd try that first to see if it works under the loads that you're anticipating.
You could also use the base 36 ID and the auto-increment together so that your friendly URLs are shorter, which is often the point.
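A minimal MySQL sketch of the random base-36 idea (the furl.short_id column is a hypothetical name for wherever the short ID is stored):

-- draw a random number below 36^6 and render it in base 36
SET @candidate = LPAD(CONV(FLOOR(RAND() * POW(36, 6)), 10, 36), 6, '0');
-- collision check: if a row comes back, draw a new random number and retry
SELECT COUNT(*) INTO @taken FROM furl WHERE short_id = @candidate;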
You might consider another ways to deal with your project.
First, you are using "en/", "de/", etc. for changing the language. May I ask how that works in the script? If you have different folders for different languages, your script and your users must suffer a lot. Try to use gettext or any other localisation method (depending on the size of your project).
About the friendly url's. My favorite method is to have only one extra column in item's table. For example:
Table picture
id, path, title, alias, created
Values:
1, uploads/pics/mypicture.jpg, Great holidays, great-holidays, 2011-11-11 11:11:11
2, uploads/pics/anotherpic.jpg, Great holidays, great-holidays-1, 2011-12-12 12:12:12
Now in the script, while inserting the item, create the alias from the title and check whether the alias already exists; if it does, you can append the id, a random number, or a count (depending on how many identical titles you already have).
After you store the alias like this, it's very simple. A user tries to access
http://www.mywebsite.com/picture/great-holidays
So in your script you just see that the user wants a picture, namely the picture with the alias great-holidays. Find it in the DB and show it. (A rough sketch of the insert-time logic follows.)
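A possible MySQL sketch of the insert-time de-duplication, using the count-suffix variant and the picture table from the example:

-- how many pictures already use this alias, exactly or with a suffix?
SELECT COUNT(*) INTO @n
FROM picture
WHERE alias = 'great-holidays' OR alias LIKE 'great-holidays-%';

-- the first occurrence keeps the plain alias, later ones get '-<count>'
INSERT INTO picture (path, title, alias, created)
VALUES ('uploads/pics/anotherpic.jpg', 'Great holidays',
        IF(@n = 0, 'great-holidays', CONCAT('great-holidays-', @n)),
        NOW());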
Apologies for the long topic; I didn't intend for it to be this long, but it's a pretty simple issue I've been having. :)
Let's say you have a simple table called tags that has the columns tag_id and tag. The tag_id is simply an auto-increment column and the tag is the title of the tag. If I need to add a description field that would be around 1-2 paragraphs on average (max around 3-4 paragraphs, probably), should I simply add a description field to the table, or should I create a new table called tag_descriptions and store the descriptions with the tag_id?
I remember reading that it is better to do this, because if you run a query that doesn't select the description, that description field will still slow MySQL down. Is this true? I don't even remember where I read it, but I've been kind of following it for a couple of years now... Finally, I question whether I even need to do this; I have a feeling I don't. You'd also need an inner join whenever you do need the description field.
Another question I have: is it generally bad to create new tables that will only ever hold a few rows at most? What if this data doesn't fit anywhere else?
I have a simple case below which relates to these two questions.
I have three tables content, tags, and content_tags that make up a many to many relationship:
content
content_id
region (enum column with about 6-7 different values and most likely won't grow later on)
tags
tag_id
tag
content_tags
content_id
tag_id
I want to store a description around 1-2 paragraphs for each tag, but also for each region. I'm wondering what would be the best way to do this?
Option A:
Just add a description column to the tags table
Create a new table for region_descriptions
Option B:
Create a new table called descriptions with fields: id, description, and type (sketched at the end of this question)
The id would be the id of the content or the id of the enum field
The type would be whether it is a tag description or a region description (would use the enum column for this)
Maybe have a primary key on the id and type?
Option C:
Create a new table for tag_descriptions
Create a new table for region_descriptions
Option A seems to be a good choice if adding the description column doesn't slow down MySQL SELECT queries that don't need the description.
Assuming the description column would slow down MySQL, option B might be a good choice. It also removes the need for a small table with just 6-7 rows to hold the region descriptions. Although, now that I think of it, would lookups against this shared table be slower than against a tiny table where you'd only ever scan a handful of rows?
Option C would be ideal if the description columns would slow down MySQL and if a small table like region_descriptions would not matter.
Maybe none of these options are the best, feel free to offer another option. Thanks.
P.S. What would be an ideal column type to hold data that is usually 1-2 paragraphs but might be a little more sometimes?
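For concreteness, a sketch of what option B from this question could look like in MySQL (all types are assumptions):

CREATE TABLE descriptions (
    id          INT UNSIGNED NOT NULL,          -- tag_id, or the ordinal of the region enum value
    type        ENUM('tag', 'region') NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (id, type)                      -- the composite key mentioned above
);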
I don't think it really matters if you don't handle thousands of queries per minute. If you are going to have a zillion queries per minute, then I would implement the various options and perform benchmarks for all these options. Based on the results, you can make a decision.
In my (admittedly somewhat uninformed) opinion, it really depends on how much you'll be using both of them.
If properly indexed, that JOIN should not be very expensive. Also, a larger table will be slower. It inhibits caching, and takes longer to access stuff, although indexing seriously mitigates this problem.
If you'll be joining tag names to tag IDs a LOT, and only rarely will be using the descriptions, I'd say go with separate tables. If you'll be using the descriptions more often, go with one table.
For the first part of your question: if you have a tag with an id, a name and a description, you should save it in 1 table.
Now, this query
SELECT name FROM tags WHERE id = 1;
will NOT slow down if you have 1, 2 or 20 extra fields in there.
At the moment I'm doing this:
gems(id, name, colour, level, effects, source)
id is the primary key and is not auto-increment.
A typical row of data would look like this:
id => 40153
name => Veiled Ametrine
colour => Orange
level => 80
effects => +12 sp, +10 hit
source => Ametrine
(Some of you gamers might see what I'm doing here :) )
But I realise this could be organised a lot better. I have studied database relationships and secondary keys in my A-Level computing class, but never got as far as setting one up properly. I just need help with how this database should be organised: what tables should hold what data, and with what secondary and foreign keys?
I was thinking maybe 3 tables: gem, effects, source. Which then have relationships to each other?
Can anyone shed some light on this? Is a complex setup like the one I'm proposing really the way to go, or should I just carry on with what I'm doing?
Cheers.
I happen to be passingly familiar with the environment you're describing (:))
Despite what you have convinced yourself, what you are doing is not particularly complex.
Anyway, currently, you have a table with no relationships. It's simple. It's easy. Each gem exists in the database.
If you were to move to the three tables you proposed, you would also need link tables to assemble the tables into usable data, especially since (and mind, I'm not quite sure how your distinctions boil down) the effects and source tables are involved in many-to-x relationships: each gem has up to two effects, each effect is present in up to Y gems, and each source yields up to Z gems.
I'd stick with the single table. The individual records may be longer, but it's much simpler, and you'll encounter fewer errors than if you were trying to maintain linking tables or the like.
Questions to ask yourself:
Is there a 1 to 1 relationship between gem, effects, and source?
Would you more often be pulling effects without pulling data from gem?
If the proposed tables have a 1 to 1 relationship, then I'd suggest leaving them combined in one table. The only time I would consider splitting them out under this condition is if I only needed data from effects without needing other data AND these tables were going to be large enough to justify storing them on different drives. Otherwise, you're just making work for yourself, adding more storage requirements and reaping exactly zero benefits.
You should also consider whether you will need the effects information for actual usage or for display only. If it is display only, it's no big deal to have it in one column in a table. If you have to use it, for example to apply the +12 and +10 appropriately, then I think you should store each occurrence as a separate row. Accordingly, you should have a separate table for effects, and then a separate table storing which gems have which effects, maybe gemeffects. The effects table might have better descriptions of what "sp" stands for, maybe the min and max ranges, etc. The gemeffects table would just have the gem id, the value, and the effect itself. For example:
Effects
effect => hit
desc => How many hit points
minimum => 0
maximum => 100
GemEffects
id => 40153
effect => sp
value => 12
and
id => 40153
effect => hit
value => 10
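A possible DDL for this split (column names, types and sizes are guesses):

CREATE TABLE effects (
    effect  VARCHAR(20) NOT NULL PRIMARY KEY,   -- e.g. 'sp', 'hit'
    descr   TEXT,                               -- what the stat stands for
    minimum INT,
    maximum INT
);

CREATE TABLE gemeffects (
    id     INT NOT NULL,                        -- the gem id, e.g. 40153
    effect VARCHAR(20) NOT NULL,
    value  INT NOT NULL,
    PRIMARY KEY (id, effect),
    FOREIGN KEY (effect) REFERENCES effects (effect)
);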
You would answer your own question if you did a simple exercise: describe your system in natural, descriptive language - the entities, their attributes, how they interact with other entities, and so on. Underline the nouns and verbs. Ask which entities you actually mean to manage (e.g.: will there be an interface to manage the "effects" table?). You'll be surprised how naturally it all comes together.
Now for your example, I'd suggest two approaches (without syntactic details)
1) to gain experience in relational design, with some complexity overhead and granular control over each entity
gem (id, name, color_id, source_id, effect_assoc_id)
color (id, name)
source (id, name)
effect (id,value,nature_id)
nature (id, name)
effect_assoc (id, gem_id, effect_id)
2) straight to the point, possibly valid depending on the cardinality of your relations
just carry on ;)
From your description, I'd go with #1.
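For reference, a rough DDL sketch of approach #1 (all types are assumptions; gem.effect_assoc_id is omitted here because effect_assoc already carries gem_id, which appears to be the intent):

CREATE TABLE color  (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(30) NOT NULL);
CREATE TABLE source (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL);
CREATE TABLE nature (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(30) NOT NULL);

CREATE TABLE effect (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    value     INT NOT NULL,                     -- e.g. 12 for '+12 sp'
    nature_id INT NOT NULL,
    FOREIGN KEY (nature_id) REFERENCES nature (id)
);

CREATE TABLE gem (
    id        INT PRIMARY KEY,                  -- the in-game item id, e.g. 40153
    name      VARCHAR(100) NOT NULL,
    color_id  INT,
    source_id INT,
    FOREIGN KEY (color_id)  REFERENCES color (id),
    FOREIGN KEY (source_id) REFERENCES source (id)
);

CREATE TABLE effect_assoc (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    gem_id    INT NOT NULL,
    effect_id INT NOT NULL,
    FOREIGN KEY (gem_id)    REFERENCES gem (id),
    FOREIGN KEY (effect_id) REFERENCES effect (id)
);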
I would recommend the following:
Move all effects into their own table (e.g., ID, Name, Description, Enabled, ...)
Move source into its own table (e.g., ID, Name, Description, Enabled, ...)
Drop gems "effects" column (migrates to step 5 below)
Convert the gems "source" column into a foreign key value that corresponds to the PK from the "source" table
Add a new table to link a single gem entity to zero or more effect entities
Example: tbl_GemsEffectsLink, with two columns named "GemID" and "EffectID", which by themselves are foreign keys back to the entity tables and, taken together, make up the composite primary key.
A sample view of this link table would be as follows:
GemID EffectID
1 1
1 2
2 1
2 2
2 3
So, in summary, you would have the following tables:
gems
effects
source
gemseffectslink
With each table having the following columns:
gems
id (PK)
name
colour
level
sourceid (FK)
effects
id (PK)
name
description
enabled
...
source
id (PK)
name
description
enabled
...
gemseffectslink
gemid (FK)
effectid (FK)
Lastly, this assumes each gem can have zero or more effects and a single source (you can enforce NULL or NOT NULL on the gems.sourceid FK field), and that the level integer value is just that (i.e., not standing in for some more elaborate "Level" entity, of which the value "80" in your sample data row would uniquely identify one).
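The link table from step 5 would look roughly like this (assuming integer surrogate keys on gems and effects):

CREATE TABLE tbl_GemsEffectsLink (
    GemID    INT NOT NULL,
    EffectID INT NOT NULL,
    PRIMARY KEY (GemID, EffectID),              -- the composite primary key described above
    FOREIGN KEY (GemID)    REFERENCES gems (id),
    FOREIGN KEY (EffectID) REFERENCES effects (id)
);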
Hope this helps!
Michael
I'll explain briefly what I want to accomplish from a functional perspective. I'm working on an image gallery. On the page that shows a single image, I want the sidebar to show thumbnails of images uploaded by the same user. At a maximum, there should be 6, 3 that were posted before the current main image, and 3 that were posted after the main image. You could see it as a stream of images by the same user through which you can navigate. I believe Flickr has a similar thing.
Technically, my image table has an autoincremented id, a user id and a date_uploaded field, amongst many other columns.
What would your advice be on how to implement such a query? Can I combine this into a single query? Are there any handy MySQL utilities that can deal with offsets and such?
PS: I prefer not to create an extra "rank" column, since that would make handling deletions difficult. Also, using the auto-increment id seems risky; I might change it to a GUID later on. Finally, I'm of course looking for a query that performs and scales.
I know I'm asking for a lot, but it seems simpler than it is?
The query could look like the following.
With a UserID+image_id index (and possibly additional fields for covering purposes), this should perform relatively well.
SELECT field1, field2, whatever
FROM myTable
WHERE UserID = some_id
-- AND image_id > id_of_the_previously_first_image
ORDER BY image_id
LIMIT 7;
To help with scaling, you should consider using a bigger LIMIT value and caching accordingly.
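If both directions really are wanted in a single round trip, one possible variant (same assumed table and columns as above) unions the two directions:

-- 3 images posted before the current one, plus 3 posted after it
(SELECT field1, field2, image_id
 FROM myTable
 WHERE UserID = some_id AND image_id < current_image_id
 ORDER BY image_id DESC
 LIMIT 3)
UNION ALL
(SELECT field1, field2, image_id
 FROM myTable
 WHERE UserID = some_id AND image_id > current_image_id
 ORDER BY image_id ASC
 LIMIT 3)
ORDER BY image_id;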
Edit (answering remarks/questions):
The combined index...
is made of several fields, specifically
CREATE [UNIQUE] INDEX UserId_Image_id_idx
ON myTable (UserId, image_id [, field1 ...] )
Note that optional elements of this query are in brackets ([]). I would assume the UNIQUE constraint would be a good thing. The additional "covering" fields (field1, ...) may be beneficial, but that depends on the "width" of those additional fields as well as on the overall setup and usage patterns (since [large] indexes slow down INSERTs/UPDATEs/DELETEs, one may wish to limit the number and size of such indexes, etc.).
Such an index data "type" is neither numeric nor string etc. It is simply made of the individual data types. For example if UserId is VARCHAR(10) and Image_id is INT, the resulting index would use these two types for the underlying search criteria, i.e.
... WHERE UserId = 'JohnDoe' AND image_id > 12389
in other words one needn't combine these criteria into a single key.
On image_id
when you say image_id, you mean the combined user/image id, right?
No, I mean only image_id. I'm assuming this field is a separate field in the table. The UserID is taken care of in the other predicate of the WHERE clause.
The original question's write-up indicates that this field is auto-generated, and I'm assuming we can rely on it for sorting purposes. Alternatively, we could rely on other fields, such as the timestamp of when the image was uploaded.
Also, as an afterthought: whether ordered by a [monotonically increasing] image_id or by the upload timestamp, we may want to use DESC order, to show the latest "stuff" first.