RapidMiner Correlation Table

I'm trying to produce a correlation table of two nominal attributes from a clustered dataset (Ripley). I'm aiming for a correlation table of [label]:[cluster]. My problem is that the correlations for these attributes are shown as "?" in the correlation table. Does anybody know why?
Tags: rapidminer, correlation, design, dataset

The generated cluster attribute is nominal. The Correlation Matrix operator calculates the Pearson correlation coefficient, which cannot be computed for nominal (categorical) attributes, so the correlation is unknown ("missing") and displayed as a ?.
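For reference, the Pearson coefficient is defined purely in terms of numeric means and deviations, so it is simply undefined when the values are nominal labels:

r_{XY} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}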

Related

Database design for autoencoder output

I have a dataset of images on which I'm running an autoencoder to encode each image as a vector of 32 floats. To store these float values, should I create 32 named columns or just put them in a TEXT/BLOB and parse the text when needed? What would be the performance benefits of the former vs. the latter?
Example of the data:
key:72
value:[1.8609547680625838e-8,2.9573993032272483e-8,0.9999995231628418,0.03153182193636894,
0.000003173188815708272,0.9999996423721313,0.8707512617111206,0.00005991563375573605,
0.9999498128890991,0.9999982118606567,0.947956383228302,0.9749470353126526,
0.9999994039535522,5.490094281412894e-7,0.9999681711196899,0.9958689212799072]
I would always be retrieving all the values for given image IDs.
Tables don't have performance. Queries have performance. Any decision you make about database storage for the sake of performance has to be made in the context of the types of queries you will run against the data.
If you will always query for the full array of values as a single entity, then use a blob.
If you will always query for a specific value in the Nth position in the array, then maybe a series of columns is good.
If you want to do aggregate queries like MIN(), MAX(), AVG() on the data using SQL, then make a second table with one float value per row.
You can't make this decision until you know the queries you will need to run.
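As a rough sketch of that last option (one float value per row; table and column names here are made up, not taken from the question):
CREATE TABLE image_embedding_value (
    image_id INT  NOT NULL,
    position INT  NOT NULL,   -- 0..31, index within the vector
    value    REAL NOT NULL,
    PRIMARY KEY (image_id, position)
);
-- aggregates then become plain SQL
SELECT image_id, MIN(value), MAX(value), AVG(value)
FROM image_embedding_value
GROUP BY image_id;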
Usually you would use a mapping table to record which values belong to which vector.
But since the array you provided is all part of one value, one vector, and a mapping table would add 32 rows per vector, it is probably best to just save it as TEXT/BLOB.
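A minimal sketch of that single TEXT/BLOB-column approach (made-up names); parsing the text back into 32 floats is then the application's job:
CREATE TABLE image_embedding (
    image_id  INT PRIMARY KEY,
    embedding TEXT NOT NULL   -- the 32 floats serialized as one array-like string
);
-- retrieving all values for a given image ID is a single-row lookup
SELECT embedding FROM image_embedding WHERE image_id = 72;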

how to avoid duplicate vertex entries in DSE graph/Titan

I have the following graph definition.
schema.propertyKey("ID").text().create()
schema.vertexLabel("Student").properties("ID").create()
When I execute the below Gremlin query, a new vertex is created.
g.addV(label, 'Student').property('ID', '1234')
When I execute it again, a new vertex with the same ID is created. I'm looking for a way to make the ID value unique, meaning I should get an error when I try to add a new student with the same ID (1234). Any help is highly appreciated.
I don't know about DSE Graph, but in Titan you can create an index and configure it to be unique. But it is not recommended to do that (at least not if it affects many vertices) as Titan has to use locks to insert vertices with such an index in order to avoid duplicates.
You will get a better performance if you check whether the vertex exists already before inserting it.
Daniel Kuppitz provided a query for that on the mailing list [1]:
g.V().has('Student','ID','1234').tryNext().orElseGet{ g.addV(T.label, 'Student', 'ID', '1234').next() }
Of course, you could get into race conditions here where two of those queries are evaluated for the same ID at the same time. But this should only occur very rarely, and you could probably perform a regular clean-up with an OLAP job in an upcoming version of TinkerPop. (Unfortunately, it is currently not possible to modify the graph with an OLAP job.)
[1] https://groups.google.com/forum/#!topic/gremlin-users/pCYf6h3Frb8
When you define the schema for your graph, set the cardinality of the ID property to SINGLE.
From the Titan schema docs:
SINGLE: Allows at most one value per element for such key. In other
words, the key→value mapping is unique for all elements in the graph.
The property key birthDate is an example with SINGLE cardinality since
each person has exactly one birth date.
Here's a link to the docs http://s3.thinkaurelius.com/docs/titan/1.0.0/schema.html

Data modeling in multiple aggregation levels

I have a question about data modeling.
I have a table called "sales" where I store different levels of aggregation of customer sales. It has the following attributes:
id (integer)
period_id (integer)
customer_id (integer)
product_category_id (integer)
channel_id (integer)
value (float)
Depending on what "id" attributes are filled, I know the level of aggregation. For example:
If period_id, customer_id and product_category_id are filled, but channel_id is NULL, I know it's aggregated over all channels. If product_category_id is also NULL, I know it's aggregated over all channels and product categories.
Associated with each row of that sales table, I have a row in the performance_analysis table, which stores statistical analysis of those sales. This table has the following attributes:
sales_id (integer)
and a bunch of numerical statistical values
I believe that storing those different levels of aggregation in the same (sales) table is not good practice, and I'm planning to make some changes. My idea is to store just the most disaggregated level and compute each level of aggregation on the fly, using SQL to aggregate. In that scenario, all the reference attributes of the "sales" table will be filled, and I'll just GROUP BY and SUM according to my needs.
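A rough sketch of such an on-the-fly roll-up over all channels (using the column names above) might look like:
SELECT period_id, customer_id, product_category_id, SUM(value) AS value
FROM sales
GROUP BY period_id, customer_id, product_category_id;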
The problem is: by doing this, I lose the 1:1 association with the performance_analysis table. Then I would have to move the reference attributes to the analysis table, and the problem persists.
I would still have to use that NULL-attributes hack to know which level of aggregation it is.
It is important to note that aggregating that analysis data is not trivial. I can't just SUM the attributes; they're specific to the analyzed values. So it's not data duplication as it is in the "sales" case, but it still mixes different levels of "aggregation" in the same table.
What is the best way to store that data?
You're certainly on the right track as far as holding the sales data at its most granular. What you're describing is very much like a dimensional model's fact table, and Ralph Kimball (a key figure in dimensional modelling) would always advise that you hold your measures at the lowest grain possible. If you're not already familiar with dimensional modelling, I would suggest you do some reading into it, as you are working in a very similar way and might find some useful information, both for this particular issue and perhaps for other design decisions you need to make.
As far as your statistical values, the rules of dimensional modelling would also tell you that you simply cannot store measures which are at different grains in the same table. If you really cannot calculate them on-the-fly, then make separate tables at each aggregation level, and include the appropriate ID columns for each level.
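A sketch of what those per-grain tables could look like (names are hypothetical, and the statistical measure columns are omitted):
-- one analysis table per aggregation level, keyed by exactly the IDs that define that grain
CREATE TABLE performance_analysis_by_channel (
    period_id INT NOT NULL,
    customer_id INT NOT NULL,
    product_category_id INT NOT NULL,
    channel_id INT NOT NULL,
    PRIMARY KEY (period_id, customer_id, product_category_id, channel_id)
);
CREATE TABLE performance_analysis_by_product_category (
    period_id INT NOT NULL,
    customer_id INT NOT NULL,
    product_category_id INT NOT NULL,
    PRIMARY KEY (period_id, customer_id, product_category_id)
);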
It could be worth looking into multidimensional tools (OLAP cubes, etc.), as it's possible that rather than carrying these calculations out and then storing them in the database, you might be able to add a layer which allows those - and more - calculations to be carried out at run time. For some use cases this has obvious benefits over being restricted to only those calculations which have been defined at design time. They would certainly be an obvious fit on top of the dimensional data structure that you are creating.

SSIS look up vs fuzzy lookup

In SQL Server Integration Services, there are two types of lookups:
Normal lookups
Fuzzy lookups
What is the difference between them?
There are good descriptions of all of the SSIS transformations on MSDN.
Lookup transformations perform lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.
As an example, if you are populating a fact table, you might need to use a lookup to get the surrogate key from a dimension table by joining based upon the business key.
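In plain SQL terms, what that lookup does is roughly an equi-join like the following (table and column names are invented for illustration):
SELECT s.order_id,
       d.customer_sk            -- surrogate key taken from the dimension
FROM   staging_sales s
JOIN   dim_customer d
  ON   d.customer_business_key = s.customer_business_key;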
Fuzzy Lookup transformations perform data cleaning tasks such as standardizing data, correcting data, and providing missing values. The Fuzzy Lookup transformation differs from the Lookup transformation in its use of fuzzy matching. The Lookup transformation uses an equi-join to locate matching records in the reference table, so it returns either an exact match or nothing. In contrast, the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches from the reference table.
Fuzzy lookups are commonly used to standardize addresses and names.

Do the "columns" in a table in a RMDB have order?

I learned that there is no concept of order for the tuples (i.e. rows) in a table, but according to Wikipedia "a tuple is an ordered list of elements". Does that mean that attributes do have an order? If so, why would they be treated differently? Couldn't one add another column to a table, just as one can add rows (which is the reason tuples don't have an order)?
"In this notation, attribute–value pairs may appear in any order." Does this mean attributes have no order?
There are two kinds of tuples, so to speak. There is "pure mathematics", and there tuples are indeed typically defined as "ordered lists of values". Thus, in mathematical theories, it makes sense to speak of "the first value in a tuple" and so on. This may be the sense, or the context, that your Wikipedia article is referring to.
The Haskell language supports this kind of tuple and, for example, has a fst function to extract the "first" value out of such a pair.
Codd realized, however, that this would be extremely impractical when applying the mathematical concept of tuples-as-ordered-lists to the field of data management. In data management, he wanted addressability of the values by attribute name, rather than by ordinal position. Indeed, imagine the devastating consequences if "the second attribute out of five" is removed from a table, and all the programs that address "the third" and "the fourth" attribute of that same table now have to be inventoried and adapted.
So in the relational model, tuples are sets-of-named-values instead, and consequently, in the kind of tuples that play a role in the relational model of data, there is indeed no concept of ordering of the values.
And then, as indicated in that other response, there is SQL and its blasphemous deviations from relational theory. In SQL, the ordering of attributes in tuples and headings is quite meaningful, and the consequences are all over the place. In FK declarations, the correspondence between the respective referring and referred attributes is by ordinal position, not by name. Other cases are UNIONs and EXCEPTs. Take a table T with columns X and Y of the same type.
SELECT X,Y FROM T UNION SELECT Y,X FROM T
is not invalid per se, but the standard prescribes that the column names in the result are system-defined (!). Implementations that "do the sensible thing" and deviate from this, producing a table with columns named X and Y, respectively, then face their users with the consequence that the former expression is not identical to
SELECT Y,X FROM T UNION SELECT X,Y FROM T
(because the column ordering X,Y is a different ordering from Y,X, hence the headings are unequal, and consequently the tables are unequal.)
SELECT X,Y FROM T EXCEPT SELECT Y,X FROM T
gives results that will leave many novice SQL users scratching their heads for quite a while.
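The earlier point about foreign keys can be sketched like this (made-up names): the referencing and referenced columns are paired by their positions in the two lists, not by their names.
CREATE TABLE parent (a INT, b INT, PRIMARY KEY (a, b));
CREATE TABLE child (
    x INT,
    y INT,
    -- y is matched with a, and x with b, purely by position in the two lists
    FOREIGN KEY (y, x) REFERENCES parent (a, b)
);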
The operations of the relational database model as it is normally understood and used certainly do not depend on the order of attributes in a relation. The attributes of a relation variable can always be identified by name rather than position. However, the notation used in relational database theory doesn't always specify attribute names (or attribute types) and sometimes does imply ordered attributes. This is primarily a matter of written notation rather than the structure or behaviour of the relational model. There's some more discussion of these different "named" and "ordered" perspectives in the following reference.
http://webdam.inria.fr/Alice/pdfs/Chapter-3.pdf
E.F.Codd's earliest papers on the relational model actually proposed a relational system supporting both ordered and unordered versions of relations simultaneously. That idea seems to have been forgotten or ignored ever since, even by Codd himself. It doesn't form part of modern relational database theory.
Unfortunately, SQL definitely does have the concept of column order. It is an essential part of SQL syntax and semantics and it's something that every SQL DBMS supports in order to comply with the ISO SQL standard. SQL is not relational and this is just one of the ways in which it differs from the relational model. The awkward differences between the SQL model and the relational one cause a certain amount of confusion for students of the relational model who also use SQL.
Mathematical and philosophical arguments aside, look at it practically with some real-world examples.
Suppose you write this SQL:
INSERT INTO mytable
SELECT
a1, a2, ... a99
FROM anothertable;
Suppose this works just fine, because the order in which the SELECT returns the 99 columns is the same at each run, and matches the order of the columns needed for mytable.
But suppose that I am given the task of reviewing this code, written by my colleague. How do I check that the columns are in the correct order? I will have to do some digging in other parts of the system, where mytable is constructed, to check whether the columns are in the correct order. So the correctness of this SQL depends on other code, perhaps in far-away and obscure places.
Case 2: suppose the hypothetical list of columns looks like this:
...
a63,
apt_inspecties.buurtcode_dominant,
apt_inspecties.buurtnaam_dominant,
--
apt_inspecties.buurtnaam,
apt_inspecties.buurtcode,
--
apt_inspecties.wijknaam_dominant,
apt_inspecties.wijkcode_dominant,
--
apt_inspecties.wijknaam,
apt_inspecties.wijkcode,
--
apt_inspecties.stadsdeelnaam_dominant,
apt_inspecties.stadsdeelcode_dominant,
--
apt_inspecties.stadsdeelnaam,
apt_inspecties.stadsdeelcode,
--
apt_inspecties.ggwnaam_dominant,
apt_inspecties.ggwcode_dominant,
--
apt_inspecties.ggwnaam,
apt_inspecties.ggwcode,
a80,
a81,
...
Then sooner or later someone will want to reorder the first naam/code lines so that all eight pairs consistently follow the naam-then-code order, like this:
a63,
apt_inspecties.buurtnaam_dominant,
apt_inspecties.buurtcode_dominant,
But this is not possible unless the order of attributes is also changed in the other code. And then there is the risk that yet other code that also relies on the implicit order of attributes breaks.
The obvious solution is to ALWAYS practice defensive coding like:
INSERT INTO mytable(a1, ... a99)
SELECT a1...a99
FROM anothertable;
In other words, assume that you cannot rely on an implicit, constant order of attributes.
Case 3:
Suppose mytable gets lost and needs to be recreated, for example from backup data, and suppose that it is recreated with a different ordering of attributes. Or someone does an ALTER TABLE to remove one attribute and later another ALTER TABLE to add it back, thereby changing the order of attributes.
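For example, on many DBMSs dropping a column and adding it back places it at the end of the table, silently changing the implicit order:
ALTER TABLE mytable DROP COLUMN a2;
ALTER TABLE mytable ADD COLUMN a2 FLOAT;   -- type is just illustrative; a2 now typically ends up last instead of second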
I believe that everybody would agree that the resulting table is the same as the original table. You would look at the data and see that the relation between a1 and a2 (etc) is the same. Same number of rows.
If you do any query like
SELECT ai, aj FROM mytable;
the output will be what you would expect from the original table.
However, the INSERT ... SELECT from Case 1 would now fail, because it relies on one particular implicit ordering of attributes.
Conclusion:
For a relational database, such as Oracle, PostgreSQL, MySQL, or whatever, the ordering of rows and of attributes has no meaning. Name the attributes in your SQL to force the desired order of attributes, and use an ORDER BY clause to force the ordering of rows.
This is related to strong typing in programming languages. Not everybody likes it, and some modern languages, such as JavaScript and Python, do not have strong typing. However, for writing larger code bases strong typing is very important, which is why TypeScript seems to be replacing (or supplementing) JavaScript.