Translating a MySQL data/query-set into the equivalent Cassandra representation

Consider a 500 million row MySQL table with the following table structure ...
CREATE TABLE foo_objects (
    id int NOT NULL AUTO_INCREMENT,
    foo_string varchar(32),
    metadata_string varchar(128),
    lookup_id int,
    PRIMARY KEY (id),
    UNIQUE KEY (foo_string),
    KEY (lookup_id)
);
... which is being queried using only the following two queries ...
# lookup by unique string key, maximum of one row returned
SELECT * FROM foo_objects WHERE foo_string = ?;
# lookup by numeric lookup key, may return multiple rows
SELECT * FROM foo_objects WHERE lookup_id = ?;
Given those queries, how would you represent the given data-set using Cassandra?

You have two options:
(1) is sort of traditional: have one CF (column family) with your foo objects, one row per foo, one column per field. Then create two index CFs, where the row key in one is the string values and the row key in the other is lookup_id. Columns in the index rows are foo ids. So you do a GET on the index CF, then a MULTIGET on the ids returned.
Note that if you can make id the same as lookup_id then you have one less index to maintain.
High-level clients like Digg's lazyboy (http://github.com/digg/lazyboy) will automate maintaining the index CFs for you. Cassandra itself does not do this automatically (yet).
(2) is like (1), but you duplicate the entire foo objects into subcolumns of the index rows (that is, the index top-level columns are supercolumns). If you're not actually querying by the foo id itself, you don't need to store it in its own CF at all.
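For illustration only, here is roughly what the query-per-table idea behind option (1) looks like in modern CQL, which postdates the Thrift-era CF/supercolumn API this answer describes; the table names are invented:
-- hypothetical CQL sketch: one table per query pattern
CREATE TABLE foo_by_string (
    foo_string text PRIMARY KEY,    -- unique key, so at most one row per value
    id int,
    metadata_string text,
    lookup_id int
);

CREATE TABLE foo_by_lookup (
    lookup_id int,
    id int,
    foo_string text,
    metadata_string text,
    PRIMARY KEY (lookup_id, id)     -- partitioned by lookup_id, many rows per key
);

-- the two original queries then become:
SELECT * FROM foo_by_string WHERE foo_string = ?;
SELECT * FROM foo_by_lookup WHERE lookup_id = ?;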

Related

which database should i use for multi-value column

I don't know much about databases. I have a table on paper that has 4 or 5 columns (1 for the id). One is primary and single-valued, while the other columns are secondary and can have multiple values. Now I have some values, and I have to search for those values in the secondary columns and return the highest-matched primary column, i.e.
Id  Primary  Secondary_1  Secondary_2  Secondary_3
1   ABCD     12,11,9      51,52        77
2   ABCE     9,15,17      12,14,7      71,77
3   ABEF     8,9,14,12    51,7         77,71
4   ABEG     7,9,15       52,14        77,78
Secondary columns can have string type. Now suppose I have to search for (8,9,14,77): it should return ABEF. If I search for (9,51,77), it should return (ABCD,ABEF), and so on. So my problem is how I should store this in a database, and which schema I should use for this type of problem.
You should create 3 tables in addition to the primary table, and use foreign keys to connect their rows to the primary table:
primary table: Id, Primary
secondary table 1: ForeignId, Value
secondary table 2: ForeignId, Value
secondary table 3: ForeignId, Value
Then, create a foreign key on all "ForeignId" columns, connected with the "Id" column of the primary table.
Of course, these column names are placeholders. Don't name them "Value"; be more precise.
The goal is to have single values in one field, not multiple ones. With such a normalized design, you can query your secondary tables with simple string comparisons, while joining your primary table rows with the matched rows.
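A minimal sketch of that normalized design in MySQL, with invented table and column names, plus a query that ranks primary values by match count:
CREATE TABLE primary_items (
    id INT PRIMARY KEY,
    primary_value VARCHAR(10)
);

CREATE TABLE secondary_1 (
    foreign_id INT NOT NULL,
    value VARCHAR(10),
    FOREIGN KEY (foreign_id) REFERENCES primary_items(id)
);
-- secondary_2 and secondary_3 have the same shape

-- rank primary values by how many of the searched values they match:
SELECT p.primary_value, COUNT(*) AS matches
FROM primary_items p
JOIN (
    SELECT foreign_id, value FROM secondary_1
    UNION ALL SELECT foreign_id, value FROM secondary_2
    UNION ALL SELECT foreign_id, value FROM secondary_3
) s ON s.foreign_id = p.id
WHERE s.value IN ('8', '9', '14', '77')
GROUP BY p.primary_value
ORDER BY matches DESC;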

How to use an auto-incrementing integer primary key to combine multiple files?

How do you set up a valid auto-incrementing integer primary key on a table if you want to join it with separate files? I get data like this on a daily basis:
Interaction data:
Date | PersonID | DateTime | CustomerID | Other values...
The primary key there would be PersonID + DateTime + CustomerID. If I have an integer key, how can I get that to relate back to another table? I want to know the rows where a specific person interacted with a specific customer, so I can tie those pieces of data together into one master file.
Survey return data:
Date | PersonID | DateTime | CustomerID | Other values...
I normally process all raw data in pandas first before loading it into a database. Some other files do not have a datetime stamp and only have a date. It is rare for one person to interact with the same customer on the same day, so I normally drop all rows with duplicates (all instances) so that my sample of joins is purely unique.
Other Data:
Date | PersonID | CustomerID | Other values...
I can't imagine how I can set it up so I know that row 56,547 in the 'Interaction Data' table matches row 10,982 in the 'Survey Return Data' table. Or should I keep doing it the way I am, with a composite key of three columns?
(I'm assuming postgresql since you have tag-spammed this post; it's up to you to translate for other database systems).
It sounds like you're loading data with a complex natural key like (PersonID,DateTime,CustomerID) and you don't want to use the natural key in related tables, perhaps for storage space reasons.
If so, for your secondary tables you might want to CREATE UNLOGGED TABLE a staging table matching the original input data, COPY the data into it, then do an INSERT INTO ... SELECT ... into the final target table, joining on the table with the natural key mapping.
In your case, for example, you'd have table interaction:
CREATE TABLE interaction (
    interaction_id serial primary key,
    "PersonID" integer,
    "DateTime" timestamp,
    "CustomerID" integer,
    UNIQUE("PersonID", "DateTime", "CustomerID"),
    ...
);
and for table survey_return just a reference to interaction_id:
CREATE TABLE survey_return (
    survey_return_id serial primary key,
    interaction_id integer not null references interaction(interaction_id),
    col1 integer, -- data cols
    ...
);
Now create:
CREATE UNLOGGED TABLE survey_return_load (
    "PersonID" integer,
    "DateTime" timestamp,
    "CustomerID" integer,
    PRIMARY KEY ("PersonID", "DateTime", "CustomerID"),
    col1 integer, -- data cols
    ...
);
and COPY your data into it, then do an INSERT INTO ... SELECT ... to join the loaded data against the interaction table and insert the result with the derived interaction_id instead of the original natural keys:
INSERT INTO survey_return (interaction_id, col1, ...)
SELECT i.interaction_id, l.col1, ...
FROM survey_return_load l
LEFT JOIN interaction i ON ( (i."PersonID", i."DateTime", i."CustomerID") = (l."PersonID", l."DateTime", l."CustomerID") );
This will fail with a null violation if there are natural key tuples in the input survey returns that do not appear in the interaction table.
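For completeness, the COPY step mentioned above might look like this; the file path and format options are assumptions:
COPY survey_return_load FROM '/path/to/survey_returns.csv' WITH (FORMAT csv, HEADER);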
There are always many ways. Here might be one.
Consider a potential customer (table: cust) walking into a car dealership and test driving 3 cars (table: car), with an intersection/junction table between cust and car named cust_car.
3 tables, each with an int auto-increment primary key.
Read this answer I wrote up for someone. Happy to work your tables if you need help.
SQL result table, match in second table SET type
That question had nothing to do with yours. But the solution is the same.
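A minimal sketch of that three-table shape in MySQL; the column names are invented:
CREATE TABLE cust (
    cust_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE car (
    car_id INT AUTO_INCREMENT PRIMARY KEY,
    model VARCHAR(100)
);

-- junction table: one row per (customer, car) test drive
CREATE TABLE cust_car (
    cust_car_id INT AUTO_INCREMENT PRIMARY KEY,
    cust_id INT NOT NULL,
    car_id INT NOT NULL,
    test_driven_at DATETIME,
    FOREIGN KEY (cust_id) REFERENCES cust(cust_id),
    FOREIGN KEY (car_id) REFERENCES car(car_id)
);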

Query with primary key obtained from primary keys of underlying tables

I have a table in Access which I'd like to replace with a query that gathers data from the table and other new tables. The table is used by many queries which rely on a primary key (autonumber) in the table, so the new query must have a primary key which is a unique combination of the primary keys of the tables used by the query. What can I do?
--EDIT--
Solution found: Since I want to "merge" tables with a query, and since the pk is an autonumber, I can define the new pk (of the query) by "expanding the numbering": I multiply both pkeys by 2 (because I have two tables) and add or subtract 1 to one of the two (or 1 for the first table and 2 for the second, and so on).
For example:
PK1 = 1,2,3,4,5,6
PK2 = 1,3,4,5,8,9,10 (some records may have been deleted, so the number is skipped)
new PK = (2*PK1, (2*PK2 + 1)) = (2,4,6,8,10,12),(3,7,9,11,17,19,21)
as you can see they will never overlap: every value of 2*PK1 is even and every value of 2*PK2 + 1 is odd, so no value from one table's new key can ever equal a value from the other's.
Hope it may help somebody
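A sketch of that interleaving as an Access query, assuming two source tables named Table1 and Table2 with autonumber ID columns (all names invented):
SELECT (2 * t1.ID) AS NewPK, t1.SomeField
FROM Table1 AS t1
UNION ALL
SELECT (2 * t2.ID + 1) AS NewPK, t2.SomeField
FROM Table2 AS t2;
With N tables the same idea generalizes: multiply each key by N and add a distinct offset from 0 to N-1.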
Use composite key (Multiple-field primary key)

Sphinx Search, compound key

After my previous question (http://stackoverflow.com/questions/8217522/best-way-to-search-for-partial-words-in-large-mysql-dataset), I've chosen Sphinx as the search engine on top of my MySQL database.
I've done some small tests with it, and it looks great. However, I'm at a point right now where I need some help / opinions.
I have a table articles (structure isn't important), a table properties (structure isn't important either), and a table with values of each property per article (this is what it's all about).
The table where these values are stored, has the following structure:
articleID UNSIGNED INT
propertyID UNSIGNED INT
value VARCHAR(255)
The primary key is a compound key of articleID and propertyID.
I want Sphinx to search through the value column. However, to create an index in Sphinx, I need a unique id, which I don't have here.
Also when searching, I want to be able to filter on the propertyID column (only search values for propertyID 2 for example, which I can do by defining it as attribute).
On the Sphinx forum, I found I could create a multi-value attribute, and set this as query for my Sphinx index:
SELECT articleID, value, GROUP_CONCAT(propertyID) FROM t1 GROUP BY articleID
articleID will be unique now, however, now I'm missing values. So I'm pretty sure this isn't the solution, right?
There are a few other options, like:
Add an extra column to the table, which is unique
Create a calculated unique value in the query (like articleID*100000+propertyID)
Are there any other options I could use, and what would you do?
Regarding your suggestions:
Add an extra column to the table, which is unique
This cannot be done for an existing table with a large number of records, as adding a new field to a large table takes some time, and during that time the database will not be responsive.
Create a calculated unique value in the query (like articleID*100000+propertyID)
If you do this, you have to find a way to get the articleID and propertyID back from the calculated unique id.
Another alternative is to create a new table having a key field for Sphinx and another two fields to hold articleID and propertyID.
new_sphinx_table with following fields
id - UNSIGNED INT/ BIGINT
articleID - UNSIGNED INT
propertyID - UNSIGNED INT
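A possible shape for that table in MySQL, with the populate step included; the exact types and key names are assumptions:
CREATE TABLE new_sphinx_table (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    articleID INT UNSIGNED NOT NULL,
    propertyID INT UNSIGNED NOT NULL,
    UNIQUE KEY (articleID, propertyID)
);

-- assign one id per existing (articleID, propertyID) pair:
INSERT INTO new_sphinx_table (articleID, propertyID)
SELECT articleID, propertyID FROM t1;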
Then you can write an indexing query like below
SELECT nt.id, t1.articleID, t1.propertyID, t1.value FROM t1 INNER JOIN new_sphinx_table nt ON t1.articleID = nt.articleID AND t1.propertyID = nt.propertyID;
This is a sample so you can modify it to fit to your requirements.
What Sphinx returns is the matched new_sphinx_table.id values along with the other attribute columns. You can then get the final result by taking those new_sphinx_table.id values and joining your t1 table with new_sphinx_table.

MySQL unique index by multiple fields

We have a special kind of table in our DB that stores the history of its changes in itself. So called "self-archived" table:
CREATE TABLE coverages (
    id INT AUTO_INCREMENT PRIMARY KEY,
    subscriber_id INT,
    current CHAR(1), # could be "C" or "H"
    record_version INT,
    # etc.
);
It stores "coverages" of our subscribers. Field "current" indicates if this is a current/original record ("C") or history record ("H").
We can only have one current "C" coverage for a given subscriber, but we can't create a unique index over the 2 fields (subscriber_id, current), because for any given "C" record there could be any number of "H" records - the history of changes.
So the index should only be unique for current == 'C' for any subscriber_id.
That could be done in Oracle using something like a materialized view: we could create a materialized view that only includes records with current = 'C' and create a unique index over these 2 fields: subscriber_id, current.
The question is: how can this be done in MySQL?
You can do this using NULL values. If you use NULL instead of "H", MySQL will ignore the row when evaluating the UNIQUE constraint:
A UNIQUE index creates a constraint such that all values in the index must be
distinct. An error occurs if you try to add a new row with a key value that
matches an existing row. This constraint does not apply to NULL values except
for the BDB storage engine. For other engines, a UNIQUE index permits multiple
NULL values for columns that can contain NULL.
Now, this is cheating a bit, and it means that you can't have your data exactly as you want it. So this solution may not fit your needs. But if you can rework your data in this way, it should work.
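A minimal sketch of the NULL trick (column sizes and the key name are assumptions):
CREATE TABLE coverages (
    id INT AUTO_INCREMENT PRIMARY KEY,
    subscriber_id INT NOT NULL,
    current CHAR(1), # 'C' for current, NULL instead of 'H' for history
    record_version INT,
    UNIQUE KEY uniq_current (subscriber_id, current)
);

INSERT INTO coverages (subscriber_id, current, record_version) VALUES (1, 'C', 1);  # ok
INSERT INTO coverages (subscriber_id, current, record_version) VALUES (1, NULL, 2); # ok: history row
INSERT INTO coverages (subscriber_id, current, record_version) VALUES (1, NULL, 3); # ok: multiple NULLs allowed
INSERT INTO coverages (subscriber_id, current, record_version) VALUES (1, 'C', 4);  # fails: duplicate (1, 'C')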