Name for the opposite of 'interleaved' data layout - terminology

In computer graphics as well as in data-oriented design, there is the term 'interleaved', referring to a specific way to lay out data in memory. So for some data with attributes a, b and c, the memory layout a_1, b_1, c_1, a_2, b_2, c_2, ... would be called 'interleaved'. The opposite layout would be a_1, a_2, ..., a_n, b_1, b_2, ..., b_n, c_1, c_2, ..., c_n.
Is there a single word that describes this other layout that could be used to name a type? So InterleavedBuffer vs. ???Buffer.
I know the terms 'Struct-of-Arrays' (SoA) and 'Array-of-Structs' (AoS), but neither one is a single word and I don't want to use abbreviations, especially ones that are so similar.

In the context of image formats and the storage of color channel data, the opposite of interleaved would be planar.
See https://stackoverflow.com/a/26459276/1084865.

Related

How to replicate only the intersection of certain channels in Couchbase Mobile

I have four channels in my application: A, B, C, D. Some application users are only interested in documents contained in both channels A and B only. Also can be expressed as: A ∩ B. Others may be interested in a different combination like: A ∩ B ∩ D.
UPDATE
I don't think the following will work anyway
What has been suggested so far is that I can create a new channel (like A_B and A_B_D) for each combination and then tag the documents that meet the intersection criteria accordingly. But you can see how this could easily get out of hand since with just 4 channels, you end up with 15 combinations (11 extra channels).
Is there a way to do this with channels or perhaps some other feature I have missed in Couchbase?
The assignment of channels to a document is done via the sync function. So a document is not "contained" in a channel, but it may have attributes from which the channels to which it is routed can be derived. Only in the simplest default case, the document's channel attribute will route it to the channel having that value of that attribute.
So what you intend can be achieved by putting statements like
if (doc.areas.includes("A") && doc.areas.includes("B") {
channel("AB");
}
into the sync function. (I renamed the channels attribute to areas to make clear to the reader of the program that these are not the actual channels, but that channels are only derived from combinations of them.)

Comparision of data in huge databases

I have a database in mysql which has a collection of attributes (ex. 'weight', 'height', 'no of pages' etc) and attribute values (ex. '30 tons', '12 inches', '2 pgs' etc) and mapped with the respective product ids.
The data has been collected from different sites and hence the attribute values have different formats (ex. '222 pgs' or '222 pages' or '222') (ex2. '12 inches', '12 meters', '12 cms').
What I need to do is that I have to compare the values of same attributes of different products. So I have to compare '222 pgs' with '222 pages' for all the attributes which differ in formats.
There are around 4000 attributes and the number will increase further. Is there any way to compare these without having to assign each attribute a specific type individually? Or what is the fastest way to compare these?
Well, until they invent a clairvoyant computer, a human being will have to tell it that pgs and pages mean the same thing and that inches and meters are convertible.
You'll have to sanitize the data one way or another. I'd probably start by identifying units that measure the same dimension1 and common aliases2 for each unit, then parse the data to split the quantity from the unit and normalize3 the unit. Once you have done that, the data becomes directly comparable.
But all this is really just a remedy for the problem that should not have been there in the first place, were the database designed properly.
1 A "mass" is a dimension measured by units such as kg, t, lb etc. A "length" is a dimension measured by m, km, in etc.
2 E.g. an in and inch denote exactly the same unit, pgs and pages are the same etc.
3 I.e. make sure a particular dimension is always represented by the same unit: for example convert all lengths to m, all masses to kg, all pages to pages etc.
You haven't explained what you want to do after you find out that attributes for a pair of products differ (while still meaning the same thing).
I.e.: if I see that in Instance A has field Length set to "12 pgs" and Instance B has Length reporting "12 pages" what do you do?
List this? Autocorrect? Drop one of the two values? Open a window for a human user to correct?
Personally I'd go for a "select attribute,count(*) from X group by attribute" so that you can find out the most common spelling of the unit, and then you can also write corrective scripts that may automatically convert ".. pgs" to " pages" as soon as you have decided the correct representation.
Of course this will not help at all unless you enforce correct spelling of the units, and this requires for sure better input-output filters, including the main UI, but also any kind of bulk uploader utility you may use to create or update products.
A redesign of the DB to add "Unit" as an extra, categorized attribute for each measure would also help a lot.

Text-correlation in MySQL [duplicate]

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and more importantly its production of many false positive, would be generic string distance algorithms such as
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substituions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the begining of a line is "West" etc.
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, and Extended ZIP code, and also city
Identify (lookup) some of these entities (for example a relative short database table can include all the Cities / town in the targeted area
Identify (lookup) some domain-related entities (if all/many of the address deal with say folks in the legal profession, a lookup of law firm names or of federal buildings may be of help.
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (ex: "Drive", "Street", "Court" should with much less than the tokens which precede them.
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework, is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain, allowing to abort the evaluation early, with some strong heuristics (eg: different City => Correlation = 0, level of confidence = 95% etc...).
An important consideration with search for correlations is the need to a priori compare every single item (here address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sort that can be used as [very rough] indicator of a possible correlation (for example a key made of the 5 digit ZIP-Code followed by the SOUNDEX value of the "primary" name).
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If you fulfil the following criteria then it helps:
distance between an object and
itself is zero. (reflexive)
distance from a to b is the same in
both directions (transitive)
distance from a to c is not more
than distance from a to b plus
distance from a to c. (triangle
rule)
If your metric obeys these they you can arrange your objects in metric space which means you can run queries like:
Which other object is most like
this one
Give me the 5 objects
most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
Disclaimer: I don't know any algorithm that does that, but would really be interested in knowing one if it exists. This answer is a naive attempt of trying to solve the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too laud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (eg a = 1, b = 2, etc.. and corr(a, b) = |a - b| = 1) :
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that their are the same if the correlation is above a given threshold.
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.

Examples of monoids/semigroups in programming

It is well-known that monoids are stunningly ubiquitous in programing. They are so ubiquitous and so useful that I, as a 'hobby project', am working on a system that is completely based on their properties (distributed data aggregation). To make the system useful I need useful monoids :)
I already know of these:
Numeric or matrix sum
Numeric or matrix product
Minimum or maximum under a total order with a top or bottom element (more generally, join or meet in a bounded lattice, or even more generally, product or coproduct in a category)
Set union
Map union where conflicting values are joined using a monoid
Intersection of subsets of a finite set (or just set intersection if we speak about semigroups)
Intersection of maps with a bounded key domain (same here)
Merge of sorted sequences, perhaps with joining key-equal values in a different monoid/semigroup
Bounded merge of sorted lists (same as above, but we take the top N of the result)
Cartesian product of two monoids or semigroups
List concatenation
Endomorphism composition.
Now, let us define a quasi-property of an operation as a property that holds up to an equivalence relation. For example, list concatenation is quasi-commutative if we consider lists of equal length or with identical contents up to permutation to be equivalent.
Here are some quasi-monoids and quasi-commutative monoids and semigroups:
Any (a+b = a or b, if we consider all elements of the carrier set to be equivalent)
Any satisfying predicate (a+b = the one of a and b that is non-null and satisfies some predicate P, if none does then null; if we consider all elements satisfying P equivalent)
Bounded mixture of random samples (xs+ys = a random sample of size N from the concatenation of xs and ys; if we consider any two samples with the same distribution as the whole dataset to be equivalent)
Bounded mixture of weighted random samples
Let's call it "topological merge": given two acyclic and non-contradicting dependency graphs, a graph that contains all the dependencies specified in both. For example, list "concatenation" that may produce any permutation in which elements of each list follow in order (say, 123+456=142356).
Which others do exist?
Quotient monoid is another way to form monoids (quasimonoids?): given monoid M and an equivalence relation ~ compatible with multiplication, it gives another monoid. For example:
finite multisets with union: if A* is a free monoid (lists with concatenation), ~ is "is a permutation of" relation, then A*/~ is a free commutative monoid.
finite sets with union: If ~ is modified to disregard count of elements (so "aa" ~ "a") then A*/~ is a free commutative idempotent monoid.
syntactic monoid: Any regular language gives rise to syntactic monoid that is quotient of A* by "indistinguishability by language" relation. Here is a finger tree implementation of this idea. For example, the language {a3n:n natural} has Z3 as the syntactic monoid.
Quotient monoids automatically come with homomorphism M -> M/~ that is surjective.
A "dual" construction are submonoids. They come with homomorphism A -> M that is injective.
Yet another construction on monoids is tensor product.
Monoids allow exponentation by squaring in O(log n) and fast parallel prefix sums computation. Also they are used in Writer monad.
The Haskell standard library is alternately praised and attacked for its use of the actual mathematical terms for its type classes. (In my opinion it's a good thing, since without it I'd never even know what a monoid is!). In any case, you might check out http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Monoid.html for a few more examples:
the dual of any monoid is a monoid: given a+b, define a new operation ++ with a++b = b+a
conjunction and disjunction of booleans
over the Maybe monad (aka "option" in Ocaml), first and last. That is,first (Just a) b = Just a
first Nothing b = band likewise for last
The latter is just the tip of the iceberg of a whole family of monoids related to monads and arrows, but I can't really wrap my head around these (other than simply monadic endomorphisms). But a google search on monads monoids turns up quite a bit.
A really useful example of a commutative monoid is unification in logic and constraint languages. See section 2.8.2.2 of 'Concepts, Techniques and Models of Computer Programming' for a precise definition of a possible unification algorithm.
Good luck with your language! I'm doing something similar with a parallel language, using monoids to merge subresults from parallel computations.
Arbitrary length Roman numeral value computation.
https://gist.github.com/4542999

Can coordinates of constructable points be represented exactly?

I'd like to write a program that lets users draw points, lines, and circles as though with a straightedge and compass. Then I want to be able to answer the question, "are these three points collinear?" To answer correctly, I need to avoid rounding error when calculating the points.
Is this possible? How can I represent the points in memory?
(I looked into some unusual numeric libraries, but I didn't find anything that claimed to offer both exact arithmetic and exact comparisons that are guaranteed to terminate.)
Yes.
I highly recommend Introduction to constructions, which is a good basic guide.
Basically you need to be able to compute with constructible numbers - numbers that are either rational, or of the form a + b sqrt(c) where a,b,c were previously created (see page 6 on that PDF). This could be done with algebraic data type (e.g. data C = Rational Integer Integer | Root C C C in Haskell, where Root a b c = a + b sqrt(c)). However, I don't know how to perform tests with that representation.
Two possible approaches are:
Constructible numbers are a subset of algebraic numbers, so you can use algebraic numbers.
All algebraic numbers can be represented using polynomials of whose they are roots. The operations are computable, so if you represent a number a with polynomial p and b with polynomial q (p(a) = q(b) = 0), then it is possible to find a polynomial r such that r(a+b) = 0. This is done in some CASes like Mathematica, example. See also: Computional algebraic number theory - chapter 4
Use Tarski's test and represent numbers. It is slow (doubly exponential or so), but works :) Example: to represent sqrt(2), use the formula x^2 - 2 && x > 0. You can write equations for lines there, check if points are colinear etc. See A suite of logic programs, including Tarski's test
If you turn to computable numbers, then equality, colinearity etc. get undecidable.
I think the only way this would be possible is if you used a symbolic representation,
as opposed to trying to represent coordinate values directly -- so you would have
to avoid trying to coerce values like sqrt(2) into some numerical format. You will
be dealing with irrational numbers that are not finitely representable in binary,
decimal, or any other positional notation.
To expand on Jim Lewis's answer slightly, if you want to operate on points that are constructible from the integers with exact arithmetic, you will need to be able to operate on representations of the form:
a + b sqrt(c)
where a, b, and c are either rational numbers, or representations in the form given above. Wikipedia has a pretty decent article on the subject of what points are constructible.
Answering the question of exact equality (as necessary to establish colinearity) with such representations is a rather tricky problem.
If you try to compare co-ordinates for your points, then you have a problem. Leaving aside co-linearity for a moment, how about just working out whether two points are the same or not?
Supposing that one has given co-ordinates, and the other is a compass-straightedge construction starting from certain other co-ordinates, you want to determine with certainty whether they're the same point or not. Either way is a theorem of Euclidean geometry, it's not something you can just measure. You can prove they aren't the same by spotting some difference in their co-ordinates (for example by computing decimal places of each until you encounter a difference). But in general to prove they are the same cannot be done by approximate methods. Compute as many decimal places as you like of some expansions of 1/sqrt(2) and sqrt(2)/2, and you can prove they're very close together but you won't ever prove they're equal. That takes algebra (or geometry).
Similarly, to show that three points are co-linear you will need theorem-proving software. Represent the points A, B, C by their constructions, and attempt to prove the theorem "A, B and C are colinear". This is very hard - your program will prove some theorems but not others. Much easier is to ask the user for a proof that they are co-linear, and then verify (or refute) that proof, but that's probably not what you want.
In general, constructable points may have an arbitrarily complex symbolic form, so you must use a symbolic representation to work them exactly. As Stephen Canon noted above, you often need numbers of the form a+b*sqrt(c), where a and b are rational and c is an integer. All numbers of this form form a closed set under arithmetic operations. I have written some C++ classes (see rational_radical1.h) to work with these numbers if that is all you need.
It is also possible to construct numbers which are sums of any number of terms of rational multiples of radicals. When dealing with more than a single radicand, the numbers are no longer closed under multiplication and division, so you will need to store them as variable length rational coefficient arrays. The time complexity of operations will then be quadratic in the number of terms.
To go even further, you can construct the square root of any given number, so you could potentially have nested square roots. Here, the representations must be tree-like structures to deal with root hierarchy. While difficult to implement, there is nothing in principle preventing you from working with these representations. I'm not sure just what additional numbers can be constructed, but beyond a certain point, your symbolic representation will be expressive enough to handle very large classes of numbers.
Addendum
Found this Google Books link.
If the grid axes are integer valued then the answer is fairly straight forward, the points are either exactly colinear or they are not.
Typically however, one works with real numbers (well, floating points) and then draws the rounded values on the screen which does exist in integer space. In this case you have no choice but to pick a tolerance and use it to determine colinearity. Keep it small and the users will never know the difference.
You seem to be asking, in effect, "Can the normal mathematics (integer or floating point) used by computers be made to represent real numbers perfectly, with no rounding errors?" And, of course, the answer to that is "No." If you want theoretical correctness, then you will be stuck with the much harder problem of symbolic manipulation and coding up the equivalent of the inferences that are done in geometry. (In short, I'm agreeing with Steve Jessop, above.)
Some thoughts in the hope that they might help.
The sort of constructions you're talking about will require multiplication and division, which means that to preserve exactness you'll have to use rational numbers, which are generally easy to implement on top of a suitable sort of big integer (i.e., of unbounded magnitude). (Common Lisp has these built-in, and there have to be other languages.)
Now, you need to represent square roots of arbitrary numbers, and these have to be mixed in.
Therefore, a number is one of: a rational number, a rational number multiplied by a square root of a rational number (or, alternately, just the square root of a rational), or a sum of numbers. In order to prove anything, you're going to have to get these numbers into some sort of canonical form, which for all I can figure offhand may be annoying and computationally expensive.
This of course means that the users will be restricted to rational points and cannot use arbitrary rotations, but that's probably not important.
I would recommend no to try to make it perfectly exact.
The first reason for this is what you are asking here, the rounding error and all that stuff that comes with floating point calculations.
The second one is that you have to round your input as the mouse and screen work with integers. So, initially all user input would be integers, and your output would be integers.
Beside, from a usability point of view, its easier to click in the neighborhood of another point (in a line for example) and that the interface consider you are clicking in the point itself.