I am clustering a bunch of words with the k-means algorithm in RapidMiner 5.2.
I am converting nominal attributes to numerical before clustering. However, to really inspect my clusters, I need to see the numbers back as words. How can I convert them back?
Use the Parse Numbers or Guess Types operators.
I have lat/lons as POINT coordinates and want to cluster them by location with Presto. For now, I am rounding the lat/lons to 2 decimals, converting them to strings, concatenating them, and finally grouping by the result. But this way I lose the information about the individual points. Is there a good, clean way to do this in Presto (like, say, the ST_Cluster* functions in PostGIS)?
Trino (formerly known as Presto SQL) has geospatial functions (https://trino.io/docs/current/functions/geospatial.html), but nothing equivalent to the ST_Cluster* functions you asked for. You could use a function like ST_Distance to group nearby points instead of rounding and concatenating strings.
It is not as clean as a native ST_Cluster*, but it is a workaround that builds clustering-like behavior from the existing geospatial functions.
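For example, here is a minimal sketch of the grid-cell approach that keeps the individual points. It assumes a hypothetical table events(id, lat, lon); array_agg preserves each cell's member points, so grouping loses nothing. Note that Trino's ST_Point takes longitude first.

-- Bucket points into 0.01-degree grid cells, keeping the members.
SELECT
    round(lat, 2) AS cell_lat,
    round(lon, 2) AS cell_lon,
    count(*)      AS points_in_cell,
    array_agg(ST_AsText(ST_Point(lon, lat))) AS member_points
FROM events
GROUP BY round(lat, 2), round(lon, 2);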
I'd like to store a simple map of key-value strings as a field in my PostgreSQL table. I intend to treat the map as a whole; i.e., always select the entire map, and never query by its keys or values.
I've read articles comparing between hstore, json and jsonb, but those didn't help me choose which data-type is most fitting for my requirements, which are:
Only key-value, no need for nesting.
Only strings, no other types nor null.
Storage efficiency, given my intended use for the field.
Fast parsing of the queried maps.
What data-type would best fit my use case?
You could use hstore, which is a key-value map, but I personally would use jsonb. It's a little overkill; however, most languages can parse JSON natively without you having to decode it yourself.
With json, I'd just store a simple object or array for the info you're trying to store.
Both types support indexes and are stored efficiently.
hstore's text representation is a custom format that your language may be unaware of, and so may require extra processing to insert or query the data.
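As a minimal sketch (the table and column names here are my own assumptions), a jsonb field for your use case could look like this:

CREATE TABLE settings (
    id    serial PRIMARY KEY,
    attrs jsonb NOT NULL DEFAULT '{}'::jsonb  -- the whole key-value map
);

INSERT INTO settings (attrs)
VALUES ('{"color": "red", "size": "large"}'::jsonb);

-- Always read the map as a whole; most client libraries hand this
-- back as a native map/dict without manual decoding.
SELECT attrs FROM settings WHERE id = 1;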
I was wondering why a computer would need binary code converters to convert from BCD to Excess-3, for example. Why is this necessary? Can't computers just use one form of binary code?
Some older forms of binary representation persist even after a newer, "better" form comes into use. For example, legacy hardware still in use may be running legacy code that would be too costly to rewrite. Word lengths were not standardized in the early years of computing, so machines with words varying from 5 to 12 bits in length naturally required different schemes for representing the same numbers.
In some cases, a company might persist in using a particular representation for self-compatibility (i.e., with the company's older products) reasons, or because it's an ingrained habit or "the company way." For example, the use of big-endian representation in Motorola and PowerPC chips vs. little-endian representation in Intel chips. (Though note that many PowerPC processors support both types of endian-ness, even if manufacturers typically only use one in a product.)
The previous paragraph only really touches upon byte ordering, but that can still be an issue for data interchange.
Even for BCD, there are many ways to store it (e.g., 1 BCD digit per word, or 2 BCD digits packed per byte). IBM has a clever representation called zoned decimal where they store a value in the high-order nybble which, combined with the BCD value in the low-order nybble, forms an EBCDIC character representing the value. This is pretty useful if you're married to the concept of representing characters using EBCDIC instead of ASCII (and using BCD instead of 2's complement or unsigned binary).
Tangentially related: IBM mainframes from the 1960s apparently converted BCD into an intermediate form called qui-binary before performing an arithmetic operation, then converted the result back to BCD. This is sort of a Rube Goldberg contraption, but according to the linked article, the intermediate form gives some error detection benefits.
The IBM System/360 (and probably a bunch of newer machines) supported both packed BCD and pure binary representations, though you have to watch out for IBM nomenclature — I have heard an old IBMer refer to BCD as "binary," and pure binary (unsigned, 2's complement, whatever) as "binary coded hex." This provides a lot of flexibility; some data may naturally be best represented in one format, some in the other, and the machine provides instructions to convert between forms conveniently.
In the case of floating point arithmetic, there are some values that cannot be represented exactly in binary floating point, but can be with BCD or a similar representation. For example, the number 0.1 has no exact binary floating point equivalent. This is why BCD and fixed-point arithmetic are preferred for things like representing amounts of currency, where you need to exactly represent things like $3.51 and can't allow floating point error to creep in when adding.
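To make that concrete, here is a small SQL illustration (PostgreSQL syntax, my choice of example); the same effect shows up in any language that uses IEEE 754 binary floats:

-- Binary floating point cannot represent 0.1 exactly, so the sum
-- drifts; fixed-point decimal stays exact.
SELECT 0.1::float8  + 0.2::float8  AS binary_float,   -- typically 0.30000000000000004
       0.1::numeric + 0.2::numeric AS fixed_decimal;  -- exactly 0.3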
Intended application is important. Arbitrary precision arithmetic will require a different representation strategy compared to the fixed-width registers in your CPU (e.g., Java's BigDecimal class). Many languages support arbitrary precision (e.g., Scheme, Haskell), though the underlying implementation of arbitrary precision numbers varies. I'm honestly not sure what is preferable for arbitrary precision, a BCD-type scheme or a denser pure binary representation. In the case of Java's BigDecimal, conversion from binary floating point to BigDecimal is best done by first converting to a String — this makes such conversions potentially inefficient, so you really need to know ahead of time whether float or double is good enough, or whether you really need arbitrary precision, and when.
Another tangent: Groovy, a JVM language, quietly treats all floating point numeric literals in code as BigDecimal values, and uses BigDecimal arithmetic in preference to float or double. That's one reason Groovy is very popular with the insurance industry.
tl;dr There is no one-size-fits-all numeric data type, and as long as that remains the case (probably until the heat death of the universe), you'll need to convert between representations.
I am tasked with finding the binary representation of the number 3.4219087*10^12. This is a very large number (and I have to do this by hand), so I was wondering if there is some sort of shortcut or technique I could use to convert it instead.
I want to store a large number of time series (time vs. value) data points. I would prefer to use MySQL for this. Right now I am planning to store each time series as a binary BLOB in MySQL. Is this the best way, and if not, what would be a better approach?
You should store your values as whatever type they are (INT, BOOLEAN, CHAR) and your times as either a DATE or an INT containing a UNIX timestamp, whichever fits your application better.
If you want to process the information in any way using MySQL, you should store it using the native date and numeric types.
The only scaling issue I see (if you only intend to store the information) is the extra disk space.
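As a minimal sketch (the table name and the DOUBLE value type are my assumptions; adjust the types to fit your data):

CREATE TABLE time_series (
    ts    DATETIME NOT NULL,  -- or INT UNSIGNED for a UNIX timestamp
    value DOUBLE   NOT NULL,
    PRIMARY KEY (ts)
);

-- Native types let MySQL process the data directly:
SELECT AVG(value)
FROM time_series
WHERE ts BETWEEN '2012-01-01' AND '2012-01-31';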
As both Tom and aeon said, you should store data in its native format in MySQL if you want to do anything with that data (-> process the data with SQL).
If, on the other hand, you don't want to work on the data, but just store and retrieve it, then you are using MySQL merely as a blob container where every blob is an opaque group of multiple time/data points, and it might not be the best tool for the job: that's what files are designed for.
You could investigate a hybrid approach where you store the data unstructured but keep the timestamps as discrete values. In other words, a key/value store with your timestamps as keys. That opens up the possibility of NoSQL solutions, and you might find that they fit better (e.g., think about running map/reduce jobs directly on the database in a Riak cluster).
Depending on your application, using the spatial data extensions of MySQL could be an option. You could then use spatial indexing to query the data fast.
To represent a time series, the LineString class might be the most appropriate choice. It represents a sequence of tuples you could use to store time and value.
The MySQL documentation says that "On a world map, LineString objects could represent rivers."
Creating line strings is easy:
-- if the result displays as a BLOB, use CONVERT(AsText(...) USING 'utf8')
SELECT AsText(GeomFromText('LineString(1 1, 2 2, 3.5 3.9)'));
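And a minimal sketch of storing one series per row (the table layout and the use of UNIX timestamps as the X coordinate are my assumptions):

-- MySQL 5.1 requires MyISAM for SPATIAL indexes, and spatially
-- indexed columns must be NOT NULL.
CREATE TABLE series (
    id     INT PRIMARY KEY,
    points LINESTRING NOT NULL,
    SPATIAL INDEX (points)
) ENGINE=MyISAM;

-- Each tuple is (unix_timestamp value):
INSERT INTO series VALUES
    (1, GeomFromText('LineString(1325376000 42.5, 1325376060 43.1)'));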
Some links to get started:
https://dev.mysql.com/doc/refman/5.1/en/spatial-extensions.html
https://dev.mysql.com/doc/refman/5.1/en/populating-spatial-columns.html