Equations for 2 variable Linear Regression

We are using a programming language that does not have a built-in linear regression function. We have already implemented the single-variable linear equation:
y = Ax + B
and simply calculate the A and B coefficients from the data using a solution similar to this Stack Overflow answer.
I know this problem gets geometrically harder as variables are added, but for our purposes we only need to add one more:
z = Ax + By + C
Does anyone have the closed-form equations, or code in any language, that can solve for A, B and C given arrays of x, y and z values?

So you have three linear equations:
k = a*x1 + b*y1 + c*z1
k = a*x2 + b*y2 + c*z2
k = a*x3 + b*y3 + c*z3
You can rewrite them in matrix form:
| x1 y1 z1 |   | a |   | k |
| x2 y2 z2 | * | b | = | k |
| x3 y3 z3 |   | c |   | k |
To work out [a b c], apply the inverse matrix:
| a |            | x1 y1 z1 |     | k |
| b | = inverse( | x2 y2 z2 | ) * | k |
| c |            | x3 y3 z3 |     | k |
The formula for a 3x3 matrix inverse can be found here.
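For the least-squares fit the question asks about (z = Ax + By + C over many data points), the 3x3 system to solve is the set of normal equations built from sums over the data. The following is a minimal sketch in plain Python, assuming Cramer's rule is acceptable for a system this small. The names fit_plane and det3 are hypothetical helpers, not taken from any answer here.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_plane(xs, ys, zs):
    """Least-squares fit of z = A*x + B*y + C; returns (A, B, C)."""
    n = len(xs)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs))

    # Normal equations: m * [A, B, C]^T = rhs
    m = [[sxx, sxy, sx],
         [sxy, syy, sy],
         [sx, sy, n]]
    rhs = [sxz, syz, sz]

    d = det3(m)
    if d == 0:
        raise ValueError("points are degenerate (e.g. all collinear in x/y)")

    def with_column_replaced(col):
        # Cramer's rule: swap one column of m for the right-hand side.
        return [[rhs[r] if c == col else m[r][c] for c in range(3)]
                for r in range(3)]

    return tuple(det3(with_column_replaced(col)) / d for col in range(3))


# Points generated from z = 2x + 3y + 1, so the fit should recover those values.
A, B, C = fit_plane([0, 1, 0, 1, 2], [0, 0, 1, 1, 1], [1, 3, 4, 6, 8])
print(A, B, C)
```

Cramer's rule keeps the code free of any matrix library, which suits a language without one. For badly scaled data a proper linear solver would be numerically safer.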

Yes, it's an easy linear algebra problem if you think of it the way Gil Strang does it. Here's a written explanation.

Can you use MatLab or does the calculation have to occur inside your software?
MatLab instructions on multiple regression analysis.
Integrating MatLab with C#.

Related

Grafana visualize cartesian data points (coordinates)

Is it possible to generate a chart, for example a line chart, that plots a function of X on the Y axis, where X is not expressed in time but in some other unit?
Let's say you have
f(x) = 2x +1
and would like to plot it for a data range between 1 and 3 so:
x = 1; y = 3;
x = 2; y = 5;
x = 3; y = 7;
If it is possible, how do I create such a chart, ideally working with a MySQL DB where one column holds X and another holds Y?
I have this example test table (MySQL), where X is the X axis, Y is the value (Y axis) and Z is the data series:
+---+---+------+
| x | y | z |
+---+---+------+
| 1 | 3 | 1 |
| 2 | 5 | 1 |
| 3 | 7 | 1 |
+---+---+------+
How do I write the simplest Grafana MySQL query so that it shows this data with Plotly?

Stacked bar charts not based on time = huge advantage for Grafana if they can ever figure it out. Until then I will stick it out with Kibana.

combining dataframes, and adding values of common elements

I have multiple data sets like this
data set 1
index| name | val|
1 | a | 1 |
2 | b | 0 |
3 | c | 3 |
data set 2
index| name | val|
1 | g | 4 |
2 | a | 2 |
3 | k | 3 |
4 | l | 2 |
I want to combine these data sets so that if both of them have a row with a common name (in this example, "a"), the combined data set has only a single row for it, whose value is the sum of the two. In this case the combined row "a" would have a val of 3 (2+1). The index number of the elements does not matter. Is there an effective way to do this in Excel itself? I'm new to querying data, but I'm trying to learn. If I can do this in pandas (I'm trying to become familiar with it) or SQL, I will do so. My data sets are of different sizes.

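Use:
# Sum per name in each frame, then add the two results; fill_value=0 keeps names that occur in only one frame.
df3 = df1.groupby('name').sum().add(df2.groupby('name').sum(), fill_value=0).reset_index()
# Only needed if one frame's column was read in as ' val' (with a leading space):
# df3['val'] = df3.fillna(0)[' val'] + df3.fillna(0)['val']
# df3 = df3.drop([' val'], axis=1)
print(df3)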
Output:
name index val
0 a 3.0 3.0
1 b 2.0 0.0
2 c 3.0 3.0
3 g 1.0 4.0
4 k 3.0 3.0
5 l 4.0 2.0

In SQL you can try the query below:
select name, sum(val) as val
from
(select name, val from dataset1
union all
select name, val from dataset2) tmp
group by name
In pandas:
df3 = pd.concat([df1, df2], ignore_index=True)
df3.groupby('name', as_index=False)['val'].sum()
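The SQL variant can be tried end to end without a database server. This is a small sketch using Python's built-in sqlite3 module with the sample rows from the question. The table names dataset1 and dataset2 follow the answer above, and using SQLite instead of the asker's database is an assumption.

```python
import sqlite3

# In-memory database holding the two sample data sets from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset1 (name TEXT, val INTEGER);
    CREATE TABLE dataset2 (name TEXT, val INTEGER);
    INSERT INTO dataset1 VALUES ('a', 1), ('b', 0), ('c', 3);
    INSERT INTO dataset2 VALUES ('g', 4), ('a', 2), ('k', 3), ('l', 2);
""")

# Stack both tables with UNION ALL, then sum val per name.
rows = conn.execute("""
    SELECT name, SUM(val) AS val
    FROM (
        SELECT name, val FROM dataset1
        UNION ALL
        SELECT name, val FROM dataset2
    ) AS tmp
    GROUP BY name
    ORDER BY name
""").fetchall()

totals = dict(rows)
print(totals)
conn.close()
```

UNION ALL (rather than UNION) matters here: plain UNION would drop duplicate rows before summing, which would silently lose values when both sets contain the same name and val.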

sum of minterm vs product of maxterm

Given the following Boolean expression of F(A,B,C): F(A,B,C) = A' + B + C'
Which of the following statements is/are true about the above expression?
(i) It is an SOP expression
(ii) It is a POS expression
(iii) It is a sum-of-minterms expression
(iv) It is a product-of-maxterms expression
The model answer for this question is i),ii) and iv)
My question is: why is iii) not one of the answers? I drew the K-map and found that it is possible to derive such a sum-of-minterms expression.

A group of literals in a Boolean expression forms a minterm or a maxterm only if it includes every variable of the given function, each either plain or negated.
A minterm is a product of all literals of a function, a maxterm is a sum of all literals of a function.
In a K-map a minterm or a maxterm marks out only one cell. In a truth table a maxterm or a minterm matches only one row.
The following truth-table corresponds to the given function:
index | a | b | c || f(a,b,c) | term matching the row/K-map cell
-------|---|---|---||----------|----------------------------------
0 | 0 | 0 | 0 || 1 | minterm: m0 = (¬a⋅¬b⋅¬c)
1 | 0 | 0 | 1 || 1 | minterm: m1 = (¬a⋅¬b⋅c)
2 | 0 | 1 | 0 || 1 | minterm: m2 = (¬a⋅b⋅¬c)
3 | 0 | 1 | 1 || 1 | minterm: m3 = (¬a⋅b⋅c)
-------|---|---|---||----------|----------------------------------
4 | 1 | 0 | 0 || 1 | minterm: m4 = (a⋅¬b⋅¬c)
5 | 1 | 0 | 1 || 0 | MAXTERM: M5 = (¬a + b + ¬c)
6 | 1 | 1 | 0 || 1 | minterm: m6 = (a⋅b⋅¬c)
7 | 1 | 1 | 1 || 1 | minterm: m7 = (a⋅b⋅c)
There is only one maxterm in the truth table (and in your K-map): the single row where the function's output is logical 0. A single maxterm is still a valid product-of-maxterms expression, and since it is the same Boolean expression as the original one, the original is a valid product-of-maxterms expression too.
However, the original expression is not a valid sum of minterms, because it contains no minterm at all:
f(a,b,c) = ∏(5) = M5 = (¬a + b + ¬c)
For the original expression to be also the sum of minterms, it would need to mark out every single true/one cell in your K-map separately like this:
f(a,b,c) = ∑(0,1,2,3,4,6,7) = m0 + m1 + m2 + m3 + m4 + m6 + m7 =
= (¬a⋅¬b⋅¬c)+(¬a⋅¬b⋅c)+(¬a⋅b⋅¬c)+(¬a⋅b⋅c)+(a⋅¬b⋅¬c)+(a⋅b⋅¬c)+(a⋅b⋅c)
As you can see, even if these two boolean expressions are equivalent to each other, the original one (on the left side of the equation) is not written as the sum-of-minterms expression (on the right side of the equation).
(¬a+b+¬c) = (¬a⋅¬b⋅¬c)+(¬a⋅¬b⋅c)+(¬a⋅b⋅¬c)+(¬a⋅b⋅c)+(a⋅¬b⋅¬c)+(a⋅b⋅¬c)+(a⋅b⋅c)
Not every product is a minterm, so the original expression is in both product-of-sums and sum-of-products form, but it is not a valid sum-of-minterms.
f(a,b,c) = (¬a + b + ¬c) = (¬a) + (b) + (¬c)
In the picture (created using LaTeX) you can see the expression, which is the same in its minimal DNF and minimal CNF, and the sum of minterms equivalent to it.
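The truth table above can be reproduced mechanically. This short Python sketch enumerates all eight input rows of F(A,B,C) = A' + B + C' and sorts the row indices into those needing a minterm (output 1) and those needing a maxterm (output 0). The function name f and the list names are illustrative only.

```python
from itertools import product


def f(a: int, b: int, c: int) -> bool:
    # F(A,B,C) = A' + B + C'
    return bool((not a) or b or (not c))


minterm_rows = []  # rows where F = 1, one minterm each
maxterm_rows = []  # rows where F = 0, one maxterm each

# product((0, 1), repeat=3) yields rows in truth-table order: 000, 001, ..., 111
for index, (a, b, c) in enumerate(product((0, 1), repeat=3)):
    if f(a, b, c):
        minterm_rows.append(index)
    else:
        maxterm_rows.append(index)

print("sum of minterms:    ", minterm_rows)
print("product of maxterms:", maxterm_rows)
```

Seven rows land in the minterm list and one in the maxterm list, which is why the three-literal sum is already a complete product of maxterms but would need seven product terms to be written as a sum of minterms.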

Openquery insert not working

I have a MySQL server linked to an MSSQL server, and I am trying to INSERT data into the table admin_user on the MySQL server, but I end up getting the error:
Cannot process the object "dbo.admin_user". The OLE DB provider
"MSDASQL" for linked server "MYDB" indicates that either the object
has no columns or the current user does not have permissions on that
object.
This works fine:
SELECT * FROM openquery([MYDB], 'SELECT * FROM admin_user')
This gets the error:
INSERT into openquery([MYDB], 'dbo.admin_user') values ('Testi','Testaaja','me@google.com','koe','','','','','','1','N;','','')
Here are the rights of the user whom I used for creating the ODBC-connection
| xx.xxx.xxx.xx | me | *qweqweqwdq2edqdadasd | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | | | | | 0 | 0 | 0 | 0 | | NULL |
| % | me | *asdasadasdsadasdasdsad | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | | | | | 0 | 0 | 0 | 0 | | NULL |
My catalog is bitnami_magento, I have the provider string configured with
DRIVER={MySQL ODBC 5.3 ANSI Driver};SERVER=XX.XXX.XXX.XXX;PORT=3306;DATABASE=bitnami_magento;USER=me;PASSWORD=mypass;OPTION=3;
I have also unchecked the "Level zero only" box in Provider Options (MSDASQL) and made sure that ad hoc queries are allowed. What am I doing wrong?
These are the instructions that I followed:
http://dbperf.wordpress.com/2010/07/22/link-mysql-to-ms-sql-server2008/

You have an error in your query:
In OPENQUERY() you have to use the MySQL table name instead of the MSSQL one (if you want to insert into the MySQL table).
The following syntax should work:
INSERT INTO OPENQUERY([MYDB], 'SELECT * FROM mysqlDbName.mysqlTableName') VALUES
('Testi','Testaaja','me@google.com','koe','','','','','','1','N;','','')
Please change mysqlDbName.mysqlTableName to your MySQL database and table name accordingly.

The problem was I am an idiot. The syntax for OPENQUERY expects a result set to be returned.
So it apparently needs a "dummy query" as part of the actual statement, so that it gets a result set in response. Adding "where 1=0" makes the query faster, as it will not return any actual rows.
Working example:
insert openquery(MYDB, 'select firstname from admin_user where 1=0') values ('3','Testi','Testaaja','me@google.com','koe12','koe22','','','','0','0','1','','','')
OpenQuery requires a result set to be returned, but UPDATE, DELETE,
and INSERT statements that are used with OpenQuery do not return a
result set.
http://support.microsoft.com/kb/270119/fi

How can I optimize this stored procedure?

I need some help optimizing this procedure:
DELIMITER $$
CREATE DEFINER=`ryan`@`%` PROCEDURE `GetCitiesInRadius`(
    cityID numeric (15),
    `range` numeric (15)
)
BEGIN
    DECLARE lat1 decimal (5,2);
    DECLARE long1 decimal (5,2);
    DECLARE rangeFactor decimal (7,6);
    SET rangeFactor = 0.014457;
    SELECT `latitude`, `longitude` INTO lat1, long1
    FROM world_cities AS wc WHERE city_id = cityID;
    SELECT
        wc.city_id,
        wc.accent_city AS city,
        s.state_name AS state,
        c.short_name AS country,
        GetDistance(lat1, long1, wc.`latitude`, wc.`longitude`) AS dist
    FROM world_cities AS wc
    LEFT JOIN states s ON wc.state_id = s.state_id
    LEFT JOIN countries c ON wc.country_id = c.country_id
    WHERE
        wc.`latitude` BETWEEN lat1 - (`range` * rangeFactor) AND lat1 + (`range` * rangeFactor)
        AND wc.`longitude` BETWEEN long1 - (`range` * rangeFactor) AND long1 + (`range` * rangeFactor)
        AND GetDistance(lat1, long1, wc.`latitude`, wc.`longitude`) <= `range`
    ORDER BY dist LIMIT 6;
END$$
DELIMITER ;
Here is my explain on the main portion of the query:
+----+-------------+-------+--------+---------------+--------------+---------+--------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+--------------+---------+--------------------------+------+----------------------------------------------+
| 1 | SIMPLE | B | range | idx_lat_long | idx_lat_long | 12 | NULL | 7619 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | s | eq_ref | PRIMARY | PRIMARY | 4 | civilipedia.B.state_id | 1 | |
| 1 | SIMPLE | c | eq_ref | PRIMARY | PRIMARY | 1 | civilipedia.B.country_id | 1 | Using where |
+----+-------------+-------+--------+---------------+--------------+---------+--------------------------+------+----------------------------------------------+
3 rows in set (0.00 sec)
Here are the indexes:
mysql> show indexes from world_cities;
+--------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| world_cities | 0 | PRIMARY | 1 | city_id | A | 3173958 | NULL | NULL | | BTREE | |
| world_cities | 1 | country_id | 1 | country_id | A | 23510 | NULL | NULL | YES | BTREE | |
| world_cities | 1 | city | 1 | city | A | 3173958 | NULL | NULL | YES | BTREE | |
| world_cities | 1 | accent_city | 1 | accent_city | A | 3173958 | NULL | NULL | YES | BTREE | |
| world_cities | 1 | idx_pop | 1 | population | A | 28854 | NULL | NULL | YES | BTREE | |
| world_cities | 1 | idx_lat_long | 1 | latitude | A | 1057986 | NULL | NULL | YES | BTREE | |
| world_cities | 1 | idx_lat_long | 2 | longitude | A | 3173958 | NULL | NULL | YES | BTREE | |
| world_cities | 1 | accent_city_2 | 1 | accent_city | NULL | 1586979 | NULL | NULL | YES | FULLTEXT | |
+--------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
8 rows in set (0.01 sec)
I wouldn't think the function used in the query would cause the slowdown, but here it is:
CREATE DEFINER=`ryan`@`%` FUNCTION `GetDistance`(
    lat1 numeric (9,6),
    lon1 numeric (9,6),
    lat2 numeric (9,6),
    lon2 numeric (9,6)
) RETURNS decimal(10,5)
BEGIN
    DECLARE x decimal (20,10);
    DECLARE pi decimal (21,20);
    SET pi = 3.14159265358979323846;
    -- cosine of the central angle (spherical law of cosines)
    SET x = sin(lat1 * pi/180) * sin(lat2 * pi/180)
          + cos(lat1 * pi/180) * cos(lat2 * pi/180) * cos((lon2 * pi/180) - (lon1 * pi/180));
    -- central angle in radians; atan(sqrt(1 - x^2) / x) equals acos(x) for x > 0
    SET x = atan((sqrt(1 - power(x, 2))) / x);
    -- degrees -> arc minutes (nautical miles) -> km -> statute miles
    RETURN (1.852 * 60.0 * ((x/pi)*180)) / 1.609344;
END
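To sanity-check values returned by GetDistance outside the database, the same spherical-law-of-cosines formula can be sketched in Python. Note one deliberate swap: this version uses math.acos (with clamping) instead of the atan(sqrt(1 - x^2) / x) form, which only agrees with acos when x > 0, so results will differ for points more than a quarter of the globe apart. The function name get_distance_miles is hypothetical.

```python
import math


def get_distance_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in statute miles via the spherical law of cosines."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)

    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # Guard against rounding pushing the value slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))

    angle_deg = math.degrees(math.acos(cos_angle))
    # 1 degree = 60 arc minutes = 60 nautical miles; 1 nmi = 1.852 km; 1 mile = 1.609344 km
    return 1.852 * 60.0 * angle_deg / 1.609344


# One degree of longitude along the equator, in miles.
print(get_distance_miles(0.0, 0.0, 0.0, 1.0))
```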

As far as I can tell there is nothing directly wrong with your logic that would make this slow, so the problem ends up being that the query can't use indexes effectively.
MySQL has to apply the functions in your WHERE clause to each row it reads to determine whether that row passes the conditions. Currently there is one index in use: idx_lat_long.
It's a bit of a bad index: the long portion will never be used, because the lat portion is a float that is matched as a range. But at the very least you manage to filter out all rows that are outside the latitude range. It's likely that a lot of rows still remain, though.
You'd actually get slightly better results on the longitude, because humans only really live in the middle 30% of the earth. We're very much spread out horizontally, but not really vertically.
Regardless, the best way to narrow the field further is to filter on the general area. Right now it's a full band around the earth; try to make it a bounding box.
You could naively dice up the earth into, say, 10x10 segments. In the best case this would limit the query to 1% of the earth ;).
But as soon as your bounding box spans two separate segments, only the first coordinate (lat or lng) can be used in the index and you end up with the same problem.
So when I thought about this problem I started approaching it differently. Instead, I divided the earth up into 4 segments (let's say north east, north west, south east and south west on the map). So this gives me coordinates like:
0,0
0,1
1,0
1,1
Instead of putting the x and y value in 2 separate fields, I used it as a bit field and store both at once.
Then I divided each of the 4 boxes up again, which gives us 2 sets of coordinates: the outer and the inner coordinates. I'm still encoding this in the same field, which means we now use 4 bits for our 4x4 coordinate system.
How far can we go? If we assume a 64-bit integer field, 32 bits can be used for each of the 2 coordinates. This gives us a grid system of 4294967295 x 4294967295, all encoded into one database field.
The beauty of this field is that you can index it. This is sometimes called (I believe) a quadtree. If you need to select a big area in your database, you just calculate the 64-bit top-left coordinate (in the 4294967295 x 4294967295 grid system) and the bottom-right one, and it's guaranteed that anything that lies in that box will also lie between those two numbers.
How do you get to those numbers? Let's be lazy and assume that both our x and y coordinates range from -180 to 180 degrees. (The y coordinate of course covers half that, but we're lazy.)
First we make it positive:
// assuming x and y are our long and lat.
x += 180;
y += 180;
So the max for those is 360 now, and 4294967295 / 360 is around 11930464.
So to convert to our new grid system, we just do:
x *= 11930464;
y *= 11930464;
Now we have two distinct numbers, and we need to turn them into 1 number. First bit 1 of x, then bit 1 of y, bit 2 of x, bit 2 of y, etc.
// The 'morton number'
morton = 0
// The current bit we're interleaving
bit = 1
// The position of the bit we're interleaving
position = 0
while(bit <= latitude or bit <= longitude) {
if (bit & latitude) morton = morton | 1 << (2*position+1)
if (bit & longitude) morton = morton | 1 << (2*position)
position += 1
bit = 1 << position
}
I'm calling the final variable 'morton', after the guy who came up with it in 1966.
So this leaves us finally with the following:
For each row in your database, calculate the morton number and store it.
Whenever you do a query, first determine the maximum bounding box (as the morton number) and filter on that.
This will greatly reduce the number of records you need to check.
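The three steps above can be sketched in Python to see the rough filter at work on in-memory rows. This is an illustration under assumptions, not the answer's actual implementation: the names interleave, geo_morton and prefilter are hypothetical, rows are plain dicts, and Python's int() truncates where MySQL's CAST rounds, so individual codes may differ by a cell from the database version.

```python
SCALE = 11930464  # about (2**32 - 1) / 360, the grid factor used in the text


def interleave(lat_cell: int, lng_cell: int) -> int:
    """Interleave bits: longitude on even bit positions, latitude on odd ones."""
    morton = 0
    position = 0
    while (lat_cell >> position) or (lng_cell >> position):
        if (lat_cell >> position) & 1:
            morton |= 1 << (2 * position + 1)
        if (lng_cell >> position) & 1:
            morton |= 1 << (2 * position)
        position += 1
    return morton


def geo_morton(lat: float, lng: float) -> int:
    # Shift to non-negative degrees, scale to grid cells, then interleave.
    return interleave(int((lat + 90) * SCALE), int((lng + 180) * SCALE))


def prefilter(rows, lat_min, lat_max, lng_min, lng_max):
    """Keep rows whose stored code lies between the codes of the box corners."""
    low = geo_morton(lat_min, lng_min)
    high = geo_morton(lat_max, lng_max)
    return [row for row in rows if low <= row["morton"] <= high]


# Step 1: compute and store the code for every row (made-up sample points).
cities = [
    {"name": "inside", "lat": 52.1, "lng": 5.1},
    {"name": "far_south", "lat": -33.9, "lng": 18.4},
]
for city in cities:
    city["morton"] = geo_morton(city["lat"], city["lng"])

# Step 2: filter on the code range of the bounding box.
candidates = prefilter(cities, 51.0, 53.0, 4.0, 6.0)
print([city["name"] for city in candidates])
```

The range check never drops a point that is inside the box, because the interleaved code grows whenever either coordinate grows. It can still let through points outside the box, so the exact distance test has to run on the remaining candidates.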
Here's a stored function I wrote that will do the calculation for you:
CREATE FUNCTION getGeoMorton(lat DOUBLE, lng DOUBLE) RETURNS BIGINT UNSIGNED DETERMINISTIC
BEGIN
    -- 11930464 is floor(maximum value of a 32bit integer / 360 degrees)
    DECLARE bit, morton, pos BIGINT UNSIGNED DEFAULT 0;
    SET @lat = CAST((lat + 90) * 11930464 AS UNSIGNED);
    SET @lng = CAST((lng + 180) * 11930464 AS UNSIGNED);
    SET bit = 1;
    WHILE bit <= @lat || bit <= @lng DO
        -- latitude bits go to odd positions, longitude bits to even positions
        IF(bit & @lat) THEN SET morton = morton | ( 1 << (2 * pos + 1)); END IF;
        IF(bit & @lng) THEN SET morton = morton | ( 1 << (2 * pos)); END IF;
        SET pos = pos + 1;
        SET bit = 1 << pos;
    END WHILE;
    RETURN morton;
END;
A few caveats:
The absolute worst-case scenario will still scan 50% of your entire table. The chance of that is extremely low though, and I've seen significant performance increases for most real-world queries.
The bounding box in this case assumes a Euclidean space, meaning a flat surface. In reality your bounding boxes are not exact squares, and they warp heavily when getting closer to the poles. By just making the boxes a bit larger (depending on how exact you want to be) you can get quite far. Most real-world data is also not close to the poles ;). Remember that this filter is just a 'rough filter' to get most of the likely unwanted rows out.
This is based on a so-called Z-order curve. To get even better performance, if you're feeling adventurous, you could try the Hilbert curve instead. This curve oddly rotates, which ensures that in a worst-case scenario you will only scan about 25% of the table. Magic! In general this one will also filter out many more unwanted rows.
Source for all this: I wrote 3 blog posts about this topic when I ran into the same problems and tried to creatively get to a solution. I got much better performance with this compared to MySQL's GEO indexes.
http://www.rooftopsolutions.nl/blog/229
http://www.rooftopsolutions.nl/blog/230
http://www.rooftopsolutions.nl/blog/231

We Keep Coding
