How to move from MySQL to Cassandra modeling - mysql

I am trying to move from MySQL to Cassandra for a music service application I am building.
I have read this Stack Exchange question: MySQL Data Model to Cassandra Help?
and checked out https://wiki.apache.org/cassandra/DataModel - as well as the DataStax Cassandra modeling example built around a music service. But the documentation so far is too small and narrow for me to let go of MySQL-style queries, so I need some help.
This is my album table that works so far in mysql
CREATE TABLE `albums` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(150) NOT NULL,
`description` varchar(300) NOT NULL,
`release_date` int(10) unsigned NOT NULL,
`status` enum('active','inactive','pending') NOT NULL,
`licensor_id` int(11) NOT NULL,
`score` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`),
KEY `licensor_id` (`licensor_id`),
KEY `batch_id` (`batch_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1720100 ;
I also have one-to-many relationships with the following tables: artist (many artists to one album), genre (many genres to one album), and songs (one album contains many songs).
I have many pivot tables in place to tie these together.
Because Cassandra doesn't allow joins, I figured that using set, list and map types would help me arrive at the proper dataset.
At first my thought was to solve the mapping by just reusing the same table:
CREATE TABLE `albums` (
`id` int(10),
`title` varchar(150),
`description` varchar(300),
`release_date` date,
`status` enum('active','inactive','pending'),
`licensor_id` int(11),
`data_source_provider_id` int(10),
`score` int(10),
`genre` <set>,
`artist` <set>,
PRIMARY KEY (`id`)
);
(Apologies if the above is not correct Cassandra syntax; I've only just begun installing the system on a dev box.)
My queries are the following:
Give me all albums sorted by Score (Descending)
Give me all albums from a particular genre, sorted by score
Give me all albums from a particular artist, sorted by score
Give me all albums sorted by release date, then by score.
In SQL the four are easy with joins. Since Cassandra doesn't allow joins, I figured my modelling was adequate, except that #4 cannot be satisfied (there is no double ORDER BY as far as I can tell).
Multiple indexes are slow, and this is a large dataset (there are 1.8M records for now, and I plan to load at least triple that amount, which is why Cassandra would be useful).
My questions are:
1) Is my path from MySQL to Cassandra correct despite being stuck on my four queries, or did I get it wrong? (I've done some Active Record work with MongoDB, where you can have a sub-entity within the document, but Cassandra only has set, list and map.)
2) If I want to expand my modelling to "I want to create a list X that contains a predefined number of elements from the albums table", would tagging each album with a new field "tag" that holds X be the smart way to filter things, or would it be better to create a new table that contains all the elements I need and just query that?

The general advice for Cassandra is to design your tables based on your queries. Don't be shy about writing the same data to multiple tables if some of those queries are not compatible with each other. (Twitter, for example, would write each tweet to a table for every follower of that user.)
That said, looking at your queries, your challenge will be that Cassandra does not inherently have a way of handling some of your sorting needs. You will need to add an analytics engine like Spark or Hadoop's M/R to sort on a non-unique (constantly changing?) field like score.
Let's look at some table definitions that will be a good start. Then you can determine if you need a full blown distributed analytics engine or whether locally sorting the results of the query will be enough.
CREATE TABLE albums(
id uuid,
title text,
description text,
releasedate timestamp,
status text,
license_id varint,
data_source_provider_id varint,
score int,
genre set<text>,
artist set<text>,
PRIMARY KEY (id)
);
This table will store all your albums by id. Based on your use case, selecting all the albums and sorting them by score would definitely be out of the question. You could, potentially, do something clever like modulo-ing the score and putting the albums in buckets, but I'm not convinced that would scale. Any of your queries could be answered using this table plus analytics, but in the interest of completeness, let's look at some other options for putting your data in Cassandra. Each of the following tables could readily reduce the load from any analytics investigations you run that have additional parameters (like a range of dates or set of genres).
CREATE TABLE albums_by_artist(
id uuid,
title text,
description text,
releasedate timestamp,
status text,
license_id varint,
data_source_provider_id varint,
score int,
genre set<text>,
artist text,
PRIMARY KEY (artist, releasedate, title)
);
Cassandra can automatically sort on immutable fields. The table above will store each artist's albums in a separate partition (each partition is colocated in your cluster and replicated based on your replication factor). If an album has multiple artists, the record would be duplicated under each artist's entry, and that's OKAY. The second and third keys (releasedate and title) are clustering (sorting) keys: Cassandra will sort the albums first by releasedate and then by title (for the other priority, reverse their order above). Each combination of artist, releasedate and title is logically one row (although on disk, they will be stored as one wide row per artist). For one artist, you can probably sort the entries by score locally, without direct intervention from the database.
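For example, pulling back one artist's albums newest-first is then a single-partition query. A sketch only; the artist value and LIMIT are placeholders:
SELECT title, releasedate, score
FROM albums_by_artist
WHERE artist = 'Some Artist'
ORDER BY releasedate DESC
LIMIT 50;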
Sorting by release date can easily be accomplished by a similar table, but changing the PRIMARY KEY to: PRIMARY KEY (releasedate, ..?). In this case, however, you probably will face a challenge in sorting (locally) if you have a substantial range of release dates.
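One way to keep a release-date table manageable is to bucket the partition key by a coarser time unit and keep the exact date as a clustering column. This is a sketch only; the month-sized bucket is my assumption, not something taken from your data:
CREATE TABLE albums_by_release_month (
release_month text,      -- e.g. '2014-06', the partition bucket
releasedate timestamp,
id uuid,
title text,
score int,
PRIMARY KEY (release_month, releasedate, id)
) WITH CLUSTERING ORDER BY (releasedate DESC, id DESC);
Reading a date range then means querying one bucket per month and merging client-side; sorting that result by score is still a local or analytics job, as above.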
Finally, don't try something similar for genre: a single genre covers far too many albums to fit comfortably in one partition. Hypothetically, if you had a secondary way of splitting that set up, you could use PRIMARY KEY ((genre, artist)) (double parens intentional), but I don't think this fits your particular use case well, as both keys would then be required to look up an entry.

Related

Will an INSERT or UPDATE query be affected by indexed columns which are not included in the query?

MySQL and SQL Server users, this question relates to both of you. It's about indexing. I have this table structure for a classified-ads website. I have one common table to store the title, description, posting user, etc. I also have this table structure to store detailed attributes for a particular ad category.
CREATE TABLE `ad_detail` (
`ad_detail_id_pk` int(10) NOT NULL AUTO_INCREMENT,
`header_id_fk` int(10) NOT NULL,
`brand_id_fk` smallint(5) NULL,
`brand_name` varchar(200) NULL,
`is_brand_new` bool,
.......
`transmission_type_id_fk` tinyint(3) NULL,
`transmission_type_name` varchar(200) NULL,
`body_type_id_fk` tinyint(3) unsigned NULL,
`body_type_name` varchar(200) NULL,
`mileage` double NULL,
`fuel_type_id_fk` tinyint(3) NULL,
......
PRIMARY KEY (`ad_detail_id_pk`)
)
So, as you can see, the first part of the attributes belongs to mobile-phone ads and the second part belongs to vehicle ads, and so on; I have other attributes for other categories. header_id_fk holds the relationship to the header table, which contains the common information. All of these foreign keys are somehow involved in filtering ads. Someone may want to find all the mobile phones made by Nokia, in which case brand_id_fk will be used. Someone may want to filter vehicles by fuel type. So, as you can see, I need to index every filtering attribute in this table. Now, this is my question.
When a user posts a mobile-phone ad, the insert statement will contain a certain number of fields to store. As we all know, an index speeds up data retrieval but adds cost to insert and update queries. So if I insert a mobile-phone ad, will that insert query suffer from the indexes on the other attributes, the ones relevant to vehicle ads?
Yes. Normal indexes contain one entry for every row in the table (unless you use Oracle: http://use-the-index-luke.com/sql/where-clause/null). So every index will have a new entry inserted every time you insert a row into the table, with the associated index-maintenance overhead (page splits etc.).
You could create a filtered/partial index which excludes NULLs, which would solve the particular issue of INSERT performance being slowed down by indexes on fields into which you're inserting NULL, but you would need to test the solution thoroughly to make sure the indexes are still used by the queries you expect them to serve. Note that MySQL does not support partial indexes, AFAIK; the following is for SQL Server.
Create Index ix_myFilteredIndex On ad_detail (brand_id_fk) Where brand_id_fk Is Not Null;
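On the MySQL side there is no partial-index equivalent, so you are left with ordinary secondary indexes (which do store NULL entries) and then measuring whether they actually hurt your inserts. A rough sketch; the index name and the query are illustrative:
-- Plain secondary index; in MySQL, NULL values are still stored in it
CREATE INDEX ix_ad_detail_brand ON ad_detail (brand_id_fk);

-- Check that the filtering query actually uses the index
EXPLAIN SELECT ad_detail_id_pk
FROM ad_detail
WHERE brand_id_fk = 42;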

Store huge json data of indefinite size with key

I need to store geo-path data made up of geo-points, indexed by a unique key. For example: the path traveled by a vehicle, indexed by its trip id. This path can be of indefinite length.
As of now, I am thinking of storing the path as a JSON object. The options I have in mind are Riak and MongoDB. I want to go with open-source technology, and it would be nice if it supports clustering, so that if one node goes down we won't have any downtime in our application.
MySQL is currently our source of raw data (which we will be moving to a NoSQL DB anyhow, but not right now). But with the huge amount of data (2 million geo-point entries per day), it takes MySQL a lot of time to filter the data based on timestamp. MySQL will still be our primary data source; the solution I am looking for will act as a cache for faster path retrieval based on id.
In current MySQL schema, the fields I have are:
system_timestamp,
gps_timestamp,
speed,
lat,
lng
This table stores all the geo-points of the vehicle, whether the vehicle is on a trip or not. A trip exists only when the driver wants to track the movement: in that case we generate a unique trip id and associate it with the driver, along with the trip's start time and end time. Later, to display the path for a trip id, we use the trip's start and end times to filter the data from the raw table.
I want to store the trip path in a secondary database as a cache so that its retrieval will be fast.
Which database should be my ideal choice? What other options do I have?
I'm going to go out on a limb here and say that I believe there is a less complicated way of fixing your performance issue.
I assume you are using MySQL with InnoDB and you are indexing the timestamp field(s).
If I were you, I would simply turn the relevant timestamp (system or gps) into the primary key. With InnoDB, the table data is physically organized to do ultra-fast lookups based on the primary key column(s). Also, make sure that the relevant timestamp column is of the unsigned non-null type.
Now, instead of doing a lookup for the paths between start and end time (as you're currently doing), I would create a separate table within the same MySQL database containing pairs of trip ID / path timestamp, where "path timestamp" is the primary key from the paths table, as mentioned earlier. Make the trip ID the leading column of that table's primary key so lookups by trip are fast. Populate this table using the same logic/mechanism you initially imagined for Riak or MongoDB. This will basically be your "caching" system, using nothing but MySQL.
A typical lookup would take the trip ID to find all of the path timestamps associated and thus all of the path data.
CREATE TABLE IF NOT EXISTS `paths` (
`system_timestamp` int(10) unsigned NOT NULL,
`gps_timestamp` int(10) NOT NULL,
`speed` smallint(8) unsigned NOT NULL,
`lat` decimal(10,6) NOT NULL,
`lng` decimal(10,6) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `paths` ADD PRIMARY KEY (`system_timestamp`);
CREATE TABLE IF NOT EXISTS `trips` (
`trip_id` int(10) unsigned NOT NULL,
`system_timestamp` int(10) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `trips` ADD PRIMARY KEY (`trip_id`, `system_timestamp`);
SELECT * FROM `trips`
INNER JOIN `paths` ON
`trips`.`system_timestamp` = `paths`.`system_timestamp`
WHERE `trip_id` = 1;
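Whatever code records a trip's points would also insert the matching (trip_id, timestamp) pair into the cache table. A minimal sketch with placeholder values:
-- Record the raw geo-point as before
INSERT INTO `paths` (`system_timestamp`, `gps_timestamp`, `speed`, `lat`, `lng`)
VALUES (1464000000, 1464000000, 55, 12.971599, 77.594566);

-- If the driver is currently on trip 1, also register the point against that trip
INSERT INTO `trips` (`trip_id`, `system_timestamp`)
VALUES (1, 1464000000);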

How to create index on massive data (mysql)

I am currently evaluating a strategy for storing supplier catalogs.
A catalog can contain anywhere from 100 to 0.25 million items.
Each item may have multiple errors. The application should support browsing of catalog items
grouped by type of error, category, manufacturer, supplier etc.,
browsing items for any group, and sorting and searching on multiple columns (part id,
names, price etc.).
The question is: when I have to provide "multiple SEARCH and SORT and GROUP" functionality, how should I create my indexes?
According to the MySQL docs and blogs on indexing, it seems that an index on an individual column will not be used by every query,
and creating a multi-column index is not specific enough for my case either:
there might be 20-30 combinations of group, search and sort.
How do I scale, and how can I make search fast?
I expect to handle 50 million records of data;
I am currently evaluating on 15 million.
Suggestions are welcome.
CREATE TABLE CATALOG_ITEM
(
AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
TENANT_ID VARCHAR(40) NOT NULL,
CATALOG_ID VARCHAR(40) NOT NULL,
CATALOG_VERSION INT NOT NULL,
ITEM_ID VARCHAR(40) NOT NULL,
VERSION INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
DESCRIPTION VARCHAR(2000) NOT NULL,
CURRENCY VARCHAR(5) NOT NULL,
PRICE DOUBLE NOT NULL,
UOM VARCHAR(10) NOT NULL,
LEAD_TIME INT DEFAULT 0,
SUPPLIER_ID VARCHAR(40) NOT NULL,
SUPPLIER_NAME VARCHAR(100) NOT NULL,
SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
MANUFACTURER_PART_ID VARCHAR(40),
MANUFACTURER_NAME VARCHAR(100),
CATEGORY_CODE VARCHAR(40) NOT NULL,
CATEGORY_NAME VARCHAR(100) NOT NULL,
SOURCE_TYPE INT DEFAULT 0,
ACTIVE BOOLEAN,
SUPPLIER_PRODUCT_URL VARCHAR(250),
MANUFACTURER_PRODUCT_URL VARCHAR(250),
IMAGE_URL VARCHAR(250),
THUMBNAIL_URL VARCHAR(250),
UNIQUE(TENANT_ID,ITEM_ID,VERSION),
UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);
CREATE TABLE CATALOG_ITEM_ERROR
(
ITEM_REF BIGINT,
FIELD VARCHAR(40) NOT NULL,
ERROR_TYPE INT NOT NULL,
ERROR_VALUE VARCHAR(2000)
);
If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting, but you can probably do it with far fewer indexes than that.
You also need to plan how these indexes will be maintained. I'm assuming, because this is for supplier catalogs, that the data is not going to change much. In that case, simply creating all the indexes you need should do the job nicely. If the data rows are going to be edited or inserted frequently in realtime, then you have to factor that into your indexing: having 20 or 30 indexes might not be such a good idea, since MySQL will constantly have to update them all.
You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (the traditional default engine, basically fast flat files) is a reasonable choice. If it changes a lot, then you should be using InnoDB so you get row-level locking. InnoDB also clusters the table on its primary key, a special index that controls the order in which rows are stored on disk. So if you had one particular query that is run 99% of the time, you could choose a primary key that matches it and the data would already be in the right order on disk, returning super fast. But every insert or update then has to slot new rows into that order, causing page splits and fragmentation, which is not fast for lots of data. You wouldn't rely on this if the data changed at all frequently, and you might have to batch-load data updates (like new versions of a supplier's million rows). Again, it comes down to whether you will be updating it never, now and then, or constantly in realtime.
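As a concrete illustration of "far fewer indexes": a composite index whose leading columns match your most common equality filters can serve several query shapes at once. The column choices below are just assumptions based on the filters you listed:
-- Serves filter-by-supplier, filter-by-supplier-and-category,
-- and supplier-plus-category browsing sorted by price.
CREATE INDEX IX_ITEM_SUPPLIER_CAT_PRICE
    ON CATALOG_ITEM (TENANT_ID, SUPPLIER_ID, CATEGORY_CODE, PRICE);

-- A second one aimed at manufacturer-centric browsing.
CREATE INDEX IX_ITEM_MANUFACTURER_PRICE
    ON CATALOG_ITEM (TENANT_ID, MANUFACTURER_NAME, PRICE);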
Finally, you should consider alternative means of doing this outside MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding up the search interfaces themselves. You could index the catalogs in one of these and then use it to provide some really awesome search features like full-text and/or faceted search. A good way to describe how these work is that it's like having a private Google search engine indexing your stuff. It takes time to write the code to interface with the search server, but you will most likely save that time by not having to write, and wrap your head around, the indexing problem and the other issues I mentioned above.
If you do just go with creating all the indexes, though, learn how to use the EXPLAIN command in MySQL. That will let you see what MySQL's plan for executing a query will be. You can create indexes, then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query methods has indexes supporting it and is not falling back to scanning your entire table of data to find things. With as many rows as you're talking about, every query MUST be able to use indexes to find its data. If you get those right, it'll perform fine.
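For example (a hypothetical browse query; the composite index sketched earlier is the one it should pick up), the check is simply to run the query through EXPLAIN and confirm the key column names an index and the type column isn't ALL:
EXPLAIN
SELECT ITEM_ID, NAME, PRICE
FROM CATALOG_ITEM
WHERE TENANT_ID = 'acme' AND SUPPLIER_ID = 'sup-1' AND CATEGORY_CODE = 'fasteners'
ORDER BY PRICE
LIMIT 50;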

Approach for multiple "item sets" in Database Design

I have a database design where I store image filenames in a table called resource_file.
CREATE TABLE `resource_file` (
`resource_file_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`resource_id` int(11) NOT NULL,
`filename` varchar(200) NOT NULL,
`extension` varchar(5) NOT NULL DEFAULT '',
`display_order` tinyint(4) NOT NULL,
`title` varchar(255) NOT NULL,
`description` text NOT NULL,
`canonical_name` varchar(200) NOT NULL,
PRIMARY KEY (`resource_file_id`)
) ENGINE=InnoDB AUTO_INCREMENT=592 DEFAULT CHARSET=utf8;
These "files" are gathered under another table called resource (which is something like an album):
CREATE TABLE `resource` (
`resource_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`description` text NOT NULL,
PRIMARY KEY (`resource_id`)
) ENGINE=InnoDB AUTO_INCREMENT=285 DEFAULT CHARSET=utf8;
The logic behind this design comes in handy if I want to assign a certain type of "resource" (album) to a certain type of "item" (product, user, project etc.), for example:
CREATE TABLE `resource_relation` (
`resource_relation_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`module_code` varchar(32) NOT NULL DEFAULT '',
`resource_id` int(11) NOT NULL,
`data_id` int(11) NOT NULL,
PRIMARY KEY (`resource_relation_id`)
) ENGINE=InnoDB AUTO_INCREMENT=328 DEFAULT CHARSET=utf8;
This table holds the relationship of a resource to a certain type of item like:
Product
User
Gallery
& etc.
I do exactly this by giving module_code a value like "product" or "user" and assigning data_id to the corresponding unique id, in this case product_id or user_id.
So at the end of the day, if I want to query the resources assigned to a product with the id of 123, I query the resource_relation table (very simplified pseudo-query):
SELECT * FROM resource_relation WHERE data_id = 123 AND module_code = 'product'
And this gives me the resources, through which I can find the corresponding images.
I find this approach very practical, but I don't know whether it is a correct approach to this particular problem.
What is the name of this approach?
Is it a valid design?
Thank you
This one uses super-type/sub-type. Note how the primary key propagates from the super-type table into the sub-type tables.
To answer your second question first: the table resource_relation is an implementation of an Entity-attribute-value model.
So the answer to the next question is: it depends. According to relational database theory it is bad design, because we cannot enforce a foreign key relationship between data_id and, say, product_id, user_id, etc. It also obfuscates the data model, and it can make impact analysis harder.
On the other hand, lots of people find, as you do, that EAV is a practical solution to a particular problem, with one table instead of several. Although, if we're talking practicality, EAV doesn't scale well (at least in relational products; there are NoSQL products which do things differently).
From which it follows that the answer to your first question, "is it the correct approach?", is "strictly, no". But does it matter? Perhaps not.
" I can't see a problem why this would "not" scale. Would you mind
explaining it a little bit further? "
There are two general problems with EAV.
The first is that small result sets (say DATA_ID = USER_ID) and big result sets (say DATA_ID = PRODUCT_ID) use the same query, which can lead to sub-optimal execution plans.
The second is that adding more attributes to the entity means the query needs to return more rows, whereas a relational solution would return the same number of rows, with more columns. This is the major scaling cost. It also means we end up writing horrible queries like this one.
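For a flavour of what "horrible" means here, an EAV-style lookup typically has to join the attribute table once per attribute it wants back as a column. A generic sketch only; the table and attribute names are made up, not taken from your schema:
-- Reassembling one logical "row" from a generic entity/attribute table
SELECT e.entity_id,
       a1.value AS title,
       a2.value AS colour,
       a3.value AS size
FROM entity e
JOIN entity_attribute a1 ON a1.entity_id = e.entity_id AND a1.attr_name = 'title'
JOIN entity_attribute a2 ON a2.entity_id = e.entity_id AND a2.attr_name = 'colour'
JOIN entity_attribute a3 ON a3.entity_id = e.entity_id AND a3.attr_name = 'size'
WHERE e.entity_id = 123;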
Now, in your specific case perhaps neither of these concerns are relevant. I'm just explaining the reasons why EAV can cause problems.
"How would i be supposed to assign "resources" to for example, my
product table, "the normal way"?"
The more common approach is to have a different intersection table (AKA junction table) for each relationship, e.g. USER_RESOURCES, PRODUCT_RESOURCES, etc. Each table would consist of a composite primary key, e.g. (USER_ID, RESOURCE_ID), and probably not much else.
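A minimal sketch of one such junction table (the table name is assumed; the column types follow your existing schema):
CREATE TABLE `product_resources` (
`product_id` int(11) unsigned NOT NULL,
`resource_id` int(11) unsigned NOT NULL,
PRIMARY KEY (`product_id`, `resource_id`),
KEY `resource_id` (`resource_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;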
The other approach is to use a generic super-type table with specific sub-type tables. This is the implementation which Damir has modelled. The normal use case for super-types is when we have a bunch of related entities which share some attributes, behaviours and usages, plus some distinct features of their own. For instance, PERSON as a super-type of USER, CUSTOMER, SUPPLIER.
Regarding your scenario, I don't think USER, PRODUCT and GALLERY fit this approach. Sure, they are all consumers of RESOURCE, but that is pretty much all they have in common. So trying to map them to an ITEM super-type is a procrustean solution; gaining a generic ITEM_RESOURCE table is likely to be a small reward for the additional hoops you're going to have to jump through elsewhere.
I have a database design where i store images in a table called
resource_file.
You're not storing images; you're storing filenames. The filename may or may not identify an image. You'll need to keep database and filesystem permissions in sync.
Your resource_file table structure says, "Image filenames are identifiable in the database, but are unidentifiable in the filesystem." It says that because resource_file_id is the primary key, but there are no unique constraints besides that id. I suspect your image files actually are identifiable in the filesystem, and you'd be better off with database constraints that match that reality. Maybe a unique constraint on (filename, extension).
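A sketch of what that constraint might look like, assuming filename plus extension really is unique on disk:
ALTER TABLE `resource_file`
    ADD UNIQUE KEY `uq_resource_file_name` (`filename`, `extension`);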
Same idea for the resource table.
For resource_relation, you probably need a unique constraint on either (resource_id, data_id) or (resource_id, data_id, module_code). But . . .
I'll try to give this some more thought later. It's kind of hard to figure out what you're trying to do with resource_relation, which is usually a red flag.

Should I normalize this table?

I have a table Items which stores fetched book data from Amazon. This Amazon data is inserted into Items as users browse the site, so any INSERT that occurs needs to be efficient.
Here's the table:
CREATE TABLE IF NOT EXISTS `items` (
`Item_ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`Item_ISBN` char(13) DEFAULT NULL,
`Title` varchar(255) NOT NULL,
`Edition` varchar(20) DEFAULT NULL,
`Authors` varchar(255) DEFAULT NULL,
`Year` char(4) DEFAULT NULL,
`Publisher` varchar(50) DEFAULT NULL,
PRIMARY KEY (`Item_ID`),
UNIQUE KEY `Item_Data` (`Item_ISBN`,`Title`,`Edition`,`Authors`,`Year`,`Publisher`),
KEY `ISBN` (`Item_ISBN`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT AUTO_INCREMENT=1 ;
Normalizing this table would presumably mean creating tables for Titles, Authors, and Publishers. My concern with doing this is that the insert would become too complex. To insert a single Item, I'd have to do the following (a rough SQL sketch of this flow appears after the list):
Check for the Publisher in Publishers to SELECT Publisher_ID, otherwise insert it and use mysql_insert_id() to get Publisher_ID.
Check for the Authors in Authors to SELECT Authors_ID, otherwise insert it and use mysql_insert_id() to get Authors_ID.
Check for the Title in Titles to SELECT Title_ID, otherwise insert it and use mysql_insert_id() to get Title_ID.
Use those IDs to finally insert the Item (which may in fact be a duplicate, so this whole process would have been a waste...).
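A rough sketch of step 1 and the final insert, using LAST_INSERT_ID() in place of mysql_insert_id(). The Publishers table, the normalized Items columns and the literal values are all hypothetical, since those tables don't exist yet:
-- 1. Look the publisher up (no row means it has to be inserted)
SELECT Publisher_ID INTO @publisher_id FROM Publishers WHERE Name = 'Some Publisher';

-- 2. If it wasn't there, insert it and grab the new id
INSERT INTO Publishers (Name) VALUES ('Some Publisher');
SET @publisher_id = LAST_INSERT_ID();

-- 3. Repeat for Authors and Titles, then insert the Item with the collected ids
INSERT INTO Items (Item_ISBN, Title_ID, Authors_ID, `Year`, Publisher_ID)
VALUES ('9780000000000', @title_id, @authors_id, '2012', @publisher_id);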
Does that argue against normalization for this table?
Note: The goal of Items is not to create a comprehensive database of books, so that a user would say "Show me all the books by Publisher X." The Items table is just used to cache Items for my users' search results.
Considering your goal, I definitely wouldn't normalize this.
You've answered your own question - don't normalize it!
YES, you should normalize it if you don't think it is already. However, as far as I can tell it's already in 5th Normal Form anyway, at least based on the "obvious" interpretation of those column names and if you ignore the nullable columns. Why do you doubt it? I'm not sure why you want to allow NULLs for some of those columns, though.
1.Check for the Publisher in Publishers to SELECT Publisher_ID,
otherwise insert it and use
mysql_insert_id() to get Publisher_ID
There is no "Publisher_ID" in your table. Normalization has nothing to do with inventing a new "Publisher_ID" attribute. Substituting a "Publisher_ID" in place of Publisher certainly wouldn't make it any more normalized than it already is.
The only place where I can see normalization being useful in your case is if you want to store information about each author.
However -
Where normalization could help you is in saving space, especially if there is a lot of repetition among publishers and authors (that is, if you normalize out an individual authors table).
So if you are dealing with tens of millions of rows, normalization will show an impact in terms of space (and even performance). If you don't face that situation (which I believe is the case), you don't need to normalize.
PS: also think of the future... will there ever be a need? Databases are long-term infrastructure; never design them with only the present in mind.