How to weight data of different orders of magnitude - data-analysis

Indicator 1 is on the order of 10^-2 (for example, 0.056); indicator 2 is on the order of 10^5 or even larger (for example, 1,000,000); indicator 3 is on the order of 10^2 (for example, 105). A comprehensive score is obtained by a weighted sum of these indicators. Because there are many possible combinations, it is impossible to enumerate them all. I plan to use a normalization strategy to make the indicators dimensionless, assign a weight to each indicator directly, and then calculate the combined score.
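A minimal sketch in Python of one way to do this, assuming min-max normalization and purely illustrative weights:

```python
import numpy as np

# Hypothetical sample of three indicators on very different scales
# (roughly 10^-2, 10^5..10^6 and 10^2, as in the question).
data = np.array([
    [0.056, 1_000_000, 105],
    [0.031,   850_000,  98],
    [0.074, 1_200_000, 130],
])

# Assumed weights for the three indicators (must sum to 1).
weights = np.array([0.5, 0.3, 0.2])

# Min-max normalization: rescale each column to [0, 1] so the weighted sum
# is not dominated by the largest-magnitude indicator.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

# Comprehensive score = weighted sum of the normalized indicators.
scores = normalized @ weights
print(scores)
```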

Related

When ranking is more important than fitting the value of y in a regression model

Let's say you have a model that predicts the purchases of a specific user over a specific period of time.
When we build a model that predicts whether or not a user will buy and sort users by that probability, it seems to work well.
On the other hand, when a model that predicts the purchase amount is built and users are sorted by the predicted amount, it does not seem to achieve the expected performance.
For me, it is important to predict that A will pay more than B. Matching the purchase amount is not important.
What metrics and models can be used in this case?
I am using lightgbm regression as a base.
There is a large variance in the purchase amount. Most users spend 0 won for a certain period, but purchasers spend from a minimum of $1,000 to a maximum of $100,000.
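One possible way to frame this (an assumption, not something stated above) is to keep the LightGBM regressor but score it with a rank-based metric such as Spearman correlation instead of RMSE; a rough sketch on synthetic data:

```python
import numpy as np
import lightgbm as lgb
from scipy.stats import spearmanr

# Hypothetical data mirroring the setup in the question: most users spend
# nothing, a small fraction spend highly skewed, feature-dependent amounts.
# All numbers here are invented.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
buys = rng.random(5000) < 0.1
amounts = np.exp(8 + X[:, 0]) * 10
y = np.where(buys, amounts, 0.0)

X_train, X_test = X[:4000], X[4000:]
y_train, y_test = y[:4000], y[4000:]

# Base model as in the question: plain LightGBM regression.
model = lgb.LGBMRegressor(n_estimators=200)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# If only the ordering of users matters, evaluate with a rank-based metric
# such as Spearman correlation between predicted and actual amounts,
# rather than RMSE/MAE on the raw values.
rho, _ = spearmanr(pred, y_test)
print(f"Spearman rank correlation: {rho:.3f}")
```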

Is the effectiveness of a column index related to the entropy of the column data

As a consumer, and occasionally an administrator, of relational databases (Postgres, MySQL), I often have to consider query speeds in the context of various queries. However, you often don't know how a database will be used or where the bottlenecks might be until it's in production.
This makes me wonder, can I use a rule of thumb about the predicted entropy of a column as a heuristic for guessing the speed increase of indexing that column?
A quick Google search turns up papers written by Computer Science graduates for Computer Science graduates. Can you sum it up in "layman" terms for a self-taught programmer?
Entropy? I'm defining entropy as the number of rows divided by the number of times a value is repeated on average (the mean). If this is a poor choice of words for those with a CS vocabulary, please suggest a better word.
This question is really too broad to answer thoroughly, but I'll attempt to sum up the situation for PostgreSQL (I don't know enough about other RDBMS, but some of what I write will apply to most of them).
Instead of entropy as you propose above, the PostgreSQL term is the selectivity of a certain condition, which is a number between 0 and 1, defined as the number of rows that satisfy the condition, divided by the total number of rows in the table. A condition with a low selectivity value is (somewhat counter-intuitively) called highly selective.
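A quick illustration of the definition, with invented numbers:

```python
# Selectivity of a condition = matching rows / total rows.
# Numbers are invented for the example.
total_rows = 1_000_000
matching_rows = 50_000

selectivity = matching_rows / total_rows
print(selectivity)   # 0.05 -> a "highly selective" condition
```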
The only sure way to figure out if an index is useful or not is to compare the execution times with and without the index.
When PostgreSQL decides if using an index for a condition on a table is effective or not, it compares the estimated cost of a sequential scan of the whole table with the cost of an index scan using an applicable index.
Since sequential reads and random I/O (as used for accessing indexes) often differ in speed, there are a few parameters that influence the cost estimate and hence the decision:
seq_page_cost: Cost of a sequentially fetched disk page
random_page_cost: Cost of a non-sequentially fetched disk page
cpu_tuple_cost: Cost of processing one table row
cpu_index_tuple_cost: Cost of processing an index entry during an index scan
These costs are measured in arbitrary units; it is customary to define seq_page_cost as 1 and to express the others relative to it.
The database collects table statistics so that it knows how big each table is and how the column values are distributed (most common values and their frequencies, histograms, correlation to physical position).
To see an example of how all these numbers are used by PostgreSQL, look at this example from the documentation.
Using the default settings, a rule of thumb might be that an index will not help much unless the selectivity is less than 0.2.
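For a feel of the comparison the planner makes, here is a very rough back-of-envelope sketch using the default parameter values and invented table sizes; the real cost model is considerably more detailed, so this is illustrative only:

```python
# Very rough back-of-envelope version of the seq-scan vs. index-scan
# comparison, using PostgreSQL's default cost parameters and invented
# table sizes. The real planner model is far more detailed (caching,
# correlation, index size), so treat this as illustrative only.
seq_page_cost = 1.0
random_page_cost = 4.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005

table_pages = 10_000
table_rows = 1_000_000
selectivity = 0.2                      # fraction of rows matching the condition
matching_rows = int(table_rows * selectivity)

# Sequential scan: read every page, process every row.
seq_scan_cost = table_pages * seq_page_cost + table_rows * cpu_tuple_cost

# Index scan (pessimistic: one random heap fetch per matching row, plus
# per-row index and tuple processing).
index_scan_cost = matching_rows * (random_page_cost
                                   + cpu_index_tuple_cost
                                   + cpu_tuple_cost)

print(f"seq scan cost:   {seq_scan_cost:,.0f}")
print(f"index scan cost: {index_scan_cost:,.0f}")
# Under this crude model the index only wins at much lower selectivity;
# the planner's smarter estimates are what rules of thumb like the one
# above are based on.
```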
What I think you are asking is what the impact of an index is, relative to the distribution of data in a column. There is a bunch of theory here. In GENERAL, you will find that index lookup efficiency depends on the distribution of data in the index. In other words, an index is more efficient if you are pulling 0.01% of the table than if you are pulling 5% of the table. This is because random disk I/O is always less efficient (even on SSDs, due to read-ahead caching by the OS) than sequential reads.
Now this is not the only consideration. There are always questions about the best way to retrieve a set, particularly if ordered, using an index. Do you scan the ordering index or the filtering index and then sort? Usually you have an assumption here that data is evenly distributed between the two but where this is a bad assumption you can get bad query plans.
So what you should do here is look up index cardinality and get experience with query plans, particularly when the planner makes a mistake so you can understand why it is in error.

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people who had an input for a factor that was pairwise deleted?
How does it calculate (if it calculates it at all) factor scores for those people who had a response pairwise deleted during the creation of the factors and who therefore have missing data for one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X_1, ..., X_n and loadings L_i, we have
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
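A small sketch of this point, with invented loadings, showing how a missing input leaves the factor score undefined:

```python
import numpy as np

# Sketch of a factor score as a loading-weighted sum of standardized
# variables (n = 2, as in the question). The loadings are invented.
loadings = np.array([0.8, 0.8])

complete = np.array([0.4, -1.2])        # both flavor and texture observed
with_missing = np.array([0.4, np.nan])  # texture missing

print(loadings @ complete)       # a valid factor score
print(loadings @ with_missing)   # nan: the sum is undefined unless the
                                 # missing value is imputed first
```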
I don't know what the analyst who created your report did. I suggest you ask them.

Which of these would be safer/better to run?

I have 451 cities with coordinates. I want to calculate the distance between each pair of cities and then order some results by that distance. I have 2 options:
I can run a loop that calculates the distance for every possible combination of cities and stores them in a table, which would result in roughly 200k rows.
Or, I can leave the cities without pre-calculating and then, when results are displayed (about 30 per page), calculate the distance for each city separately.
I don't know which would be better for performance, but I would prefer going for option one, in which case I have another concern: is there a way I could end up with as few rows as possible? Currently I count the possibilities as 451^2, but I think I could divide that by 2, since the distance for City1-City2 is the same as for City2-City1.
Thanks
If your table of cities is more or less static, then you should definitely pre-calculate all distances and store them in a separate table. In this case you will have (451^2 / 2) rows (just make sure that the id of City1 is always lower than the id of City2, or the other way round; it doesn't really matter).
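A small sketch of how the stored pairs could be enumerated (the city ids are hypothetical):

```python
from itertools import combinations

# Sketch of the pre-calculation approach: store each unordered pair once
# by always pairing the lower city id with the higher one.
# City ids 1..451 are hypothetical.
city_ids = range(1, 452)

pairs = list(combinations(city_ids, 2))   # every (id1, id2) with id1 < id2
print(len(pairs))                         # 101475, i.e. 451 * 450 / 2
```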
Normally the cost of a single MySQL query is quite high and the cost of mathematical operations really low. Especially if the scale of your map is small and the required precision is low, so that you can calculate with a fixed distance per degree, you will be faster calculating on the fly.
Furthermore, you would have a problem if the number of cities rises because of a change in your project, and the number of combinations you would have to store in the DB then exceeds practical limits.
So you'd probably be better off without pre-calculating.
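For the calculate-on-display option, a minimal sketch using the haversine formula (the coordinates are invented; a fixed distance per degree would be even cheaper, as noted above):

```python
import math

# Sketch of the on-the-fly approach: compute the distance for each of the
# ~30 displayed cities only when the page is rendered.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical page of results relative to a reference city.
reference = (52.52, 13.405)
page_of_cities = [("CityA", 48.137, 11.575), ("CityB", 50.110, 8.682)]

for name, lat, lon in page_of_cities:
    print(name, round(haversine_km(*reference, lat, lon), 1), "km")
```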

Is SPATIAL Geometry index performance dependent on the size and density of geometry shapes?

Spatial Indexes
Given a spatial index, is the index's utility, that is to say the overall performance of the index, only as good as the geometries it covers?
For example, if I were to take a million geometry data types and insert them into a table so that their points are densely located relative to one another, does the index perform better than it would for identical geometry shapes whose relative locations are significantly more sparse?
Question 1
For example, take these two geometry shapes.
Situation 1
LINESTRING(0 0,1 1,2 2)
LINESTRING(1 1,2 2,3 3)
Geometrically they are identical, but their coordinates are offset by a single unit. Imagine this repeated one million times.
Now take this situation,
Situation 2
LINESTRING(0 0,1 1,2 2)
LINESTRING(1000000 1000000,1000001 1000001,1000002 1000002)
LINESTRING(2000000 2000000,2000001 2000001,2000002 2000002)
LINESTRING(3000000 3000000,3000001 3000001,3000002 3000002)
In the above example:
the lines' dimensions are identical to those in situation 1,
the lines have the same number of points
the lines have identical sizes.
However,
the difference is that the lines are massively further apart.
Why is this important to me?
The reason I ask this question is that I want to know whether I should remove as much precision from my input geometries as I possibly can, and reduce their density and closeness to each other as much as my application allows, without losing accuracy.
Question 2
This question is similar to the first, but instead of being about spatial closeness to other geometry shapes, it asks: should the shapes themselves be reduced to the smallest possible shape that describes what the application requires?
For example, if I were to use a SPATIAL index on a geometry datatype to provide data on dates.
If I wanted to store a date range of two dates, I could use a datetime data type in MySQL. However, what if I wanted to use a geometry type instead, conveying the date range by taking each individual date and converting it into a unix_timestamp()?
For example:
Date("1st January 2011") to Timestamp = 1293861600
Date("31st January 2011") to Timestamp = 1296453600
Now, I could create a LINESTRING based on these two integers.
LINESTRING(1293861600 0,1296453600 1)
If my application is actually only concerned about days, and the number of seconds isn't important for date ranges at all, should I refactor my geometries so that they are reduced to their smallest possible size in order to fulfil what they need?
So instead of "1293861600", I would use "1293861600" / (3600 * 24), which happens to be "14975.25".
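A small sketch of that reduction, reusing the timestamps from above:

```python
# Sketch of the proposed reduction: divide the Unix timestamps from the
# question by the number of seconds in a day, so the LINESTRING
# coordinates stay small.
SECONDS_PER_DAY = 3600 * 24

start_ts = 1293861600   # "1st January 2011" as used above
end_ts = 1296453600     # "31st January 2011"

start_day = start_ts / SECONDS_PER_DAY   # 14975.25
end_day = end_ts / SECONDS_PER_DAY       # 15005.25

print(f"LINESTRING({start_day} 0,{end_day} 1)")
```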
Can someone help fill in these gaps?
When inserting a new entry, the engine chooses the MBR which would be minimally extended.
By "minimally extended", the engine can mean either "area extension" or "perimeter extension", the former being default in MySQL.
This means that as long as your nodes have non-zero area, their absolute sizes do not matter: the larger MBRs remain larger and the smaller ones remain smaller, and ultimately all nodes will end up in the same MBRs.
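A toy illustration of what "minimally extended" means, using area extension (the MySQL default mentioned above) and two invented MBRs:

```python
# Toy illustration of the "minimal extension" heuristic, using area
# extension. MBRs and the new point are invented; a real R-tree works
# on pages of entries, not Python tuples.
def area(mbr):
    (x1, y1), (x2, y2) = mbr
    return (x2 - x1) * (y2 - y1)

def extend(mbr, point):
    (x1, y1), (x2, y2) = mbr
    px, py = point
    return ((min(x1, px), min(y1, py)), (max(x2, px), max(y2, py)))

mbrs = [((0, 0), (10, 10)), ((100, 100), (110, 110))]
new_point = (12, 12)

# Pick the MBR whose area would grow the least when enlarged to cover the point.
growth = [area(extend(m, new_point)) - area(m) for m in mbrs]
chosen = growth.index(min(growth))
print(growth, "-> insert into MBR", chosen)   # [44, 9504] -> insert into MBR 0
```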
These articles may be of interest to you:
Overlapping ranges in MySQL
Join on overlapping date ranges
As for the density, the MBRs are recalculated on page splits, and there is a high chance that all points too far away from the main cluster will be moved on the first split into their own MBR. That MBR would be large, but within a few iterations it would become a parent to all the outlying points.
This will decrease the search time for the outlying points and increase the search time for the cluster points by one page seek.