Is there a way to combine these variables in a way that makes sense? - gis

Hello stack overflow community!
I am a sociology student working on a thesis project comparing home value appreciation and neighborhood racial composition over time.
I'm currently using two separate data sources and trying to combine them in a way that makes sense without aggregating anything.
The first data source is GIS data which has information on home sales in each year by home. The second is census data which has yearly estimates of racial composition by census tract. Both are in .csv formats.
My goal is to create a set of variables for each home row in the GIS data that represent the racial composition of the tract the home is in during the year it was sold (e.g. home 1 | 2010 | $500,000 | Census tract 10 | 10% white).
I began doing this by going into Stata and using the following strategy:
For example, if I'm looking at a home sold in 2010 in Census tract 10 and I find that this tract was 10% white in 2010, using something like
replace percentwhite = 10 if censustract==10 & year==2010
However, this seemed incredibly time consuming, as I'm using data that go back decades and cover a couple dozen Census tracts.
Does anyone have any suggestions on how I might do this smarter, not harder? The first thought I had was to aggregate the data by census tract and year, but was hoping to avoid that if possible. Thank you so much in advance for your help and have a terrific day and start to the new year!

It sounds like you can simply merge census data onto your GIS data. That will be much less painful than using -replace-. Here's an example:
*GIS data: information on home sales in each year by home
clear
input censustract house_id year house_value_k
10 100 2010 200
11 101 2020 500
11 102 1980 100
end
tempfile GIS_data
save `GIS_data'
*census data: yearly estimates of racial composition by census tract
clear
input censustract year percentwhite
10 2010 20
10 2000 10
11 2010 25
11 2000 5
end
tempfile census_data
save `census_data'
*easy method: merge the census data onto your GIS data
use `GIS_data', clear
merge m:1 censustract year using `census_data'
drop if _merge==2
list
*hard method: use -replace-
use `GIS_data', clear
gen percentwhite=.
replace percentwhite=20 if censustract==10 & year==2010
replace percentwhite=10 if censustract==10 & year==2000
replace percentwhite=25 if censustract==11 & year==2010
replace percentwhite=5 if censustract==11 & year==2000
list
Both methods "work", but using -merge- is much easier and less prone to errors.
Note: I intentionally created the data sets so that the merge wouldn't be perfect. You will likely want to drop some of the observations in that case. In the code above I dropped the observations with _merge==2, i.e. the census tract-years that had no matching home sale.
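Since both of your sources are CSVs, the same m:1 join looks roughly like this in Python/pandas, in case that route is ever more convenient (just a sketch; the file names and column names are assumptions mirroring the toy example above):
import pandas as pd
# hypothetical file names; columns mirror the toy data sets above
gis = pd.read_csv("gis_home_sales.csv")          # censustract, house_id, year, house_value_k
census = pd.read_csv("census_composition.csv")   # censustract, year, percentwhite
# a left join keeps every home sale and attaches the tract-year composition,
# which matches -merge m:1- followed by -drop if _merge==2- in the Stata code above
merged = gis.merge(census, on=["censustract", "year"], how="left")
print(merged)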

Related

Visualizing two sets of data on one map in D3: Is there a way to map two datasets to a range -1 to 1?

I have two csv files containing countries and values that correspond to each country.
The data from CSV 1 denotes the number of times a country has been attacked on its own soil.
The data from CSV 2 denotes the number of times a country has attacked another country abroad.
There is overlap between the two sets of data, and I intend to display values from both data sets on a single grey-scale range shown on a choropleth map.
I have some (obviously) phony data below to demonstrate what I'm working with.
TARGET.csv
country, code, value
Iran, IRN, 5
Russia, RUS, 4
United States, USA, 0
Egypt, EGY, 2
Spain, ESP, 1
ATTACKER.csv
country, code, value
Iran, IRN, 3
Russia, RUS, 9
United States, USA, 4
Egypt, EGY, 0
Spain, ESP, 0
There are more targets than attackers.
I want to ensure that I represent the data accurately, but do not know how I would create a normalized range of values between -1 and 1.
It is my understanding that displaying the data in this way would best represent the reality, but I feel like I may be wrong.
In summation:
1) Am I thinking about this problem properly? Is this even the right way to think about displaying the data?
2) What is the proper language used to describe my question?
I am usually able to figure these things out but I'm stumped with dead-end search queries.
3) How do I make sure that my range is normalized? Notice that the USA above appears to be the only attacker that has never been a target. Would that make the USA the nearest value to +1, despite Russia's larger number of attacks? (One possible scheme is sketched below.)
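For what it's worth, here is a sketch (in Python, using the phony data above) of one possible scheme: score each country by (attacks - times targeted) / (attacks + times targeted), so a pure attacker sits at +1 and a pure target at -1. This is only one of several defensible normalizations, not necessarily the right one for your map:
# signed attacker-vs-target score in [-1, 1] per country (one possible scheme)
target = {"IRN": 5, "RUS": 4, "USA": 0, "EGY": 2, "ESP": 1}    # TARGET.csv values
attacker = {"IRN": 3, "RUS": 9, "USA": 4, "EGY": 0, "ESP": 0}  # ATTACKER.csv values
for code in sorted(set(target) | set(attacker)):
    t, a = target.get(code, 0), attacker.get(code, 0)
    score = (a - t) / (a + t) if (a + t) else 0.0   # 0 if a country appears in neither role
    print(code, round(score, 2))
Under this particular scheme the USA (4 attacks, 0 times targeted) does land exactly at +1, while Russia (9 and 4) comes out around +0.38, so whether that "accurately represents reality" depends on whether the ratio or the absolute counts matter more for the story the map should tell.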
I would appreciate whatever input you all can offer.

Small sample size for a regression marketing model

I have sales, advertising spend, and price data for 10 brands in the same industry from 2013-2018. I want to develop an equation to predict 2019 sales.
The variables I have (price and ad spend by type) are: PricePerUnit, Magazine, News, Outdoor, Broadcasting, Print.
My confusion is that I am not sure whether to run the regression using only 2018 data, with 2018 sales as the target variable and an additional variable like Past_2Years_Sales (2016-17) alongside the price and ad spend variables above (for clarity, refer to the image of the data). With this type of data I would have a sample size of only 10, as there are only 10 brands, which I think is too small for linear regression to give reliable results.
The second option (which would increase the sample size) is, instead of treating each brand as an observation, to treat each brand-year as an observation, which increases the sample size to 60: e.g. brand A has 6 observations (A-2013, A-2014, ..., A-2018), brand B has B-2013 through B-2018, and so on for all 10 brands (refer to the image for the data).
Is the second option a valid way to run the regression? What is the right way to run a regression in such small-sample situations?
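To make the second option concrete, here is a sketch in Python/statsmodels of a pooled brand-year regression with brand fixed effects. The file and column names are assumptions, and the lagged-sales term is included only to show where it would go:
import pandas as pd
import statsmodels.formula.api as smf
# hypothetical long-format file: one row per brand-year (10 brands x 6 years = 60 rows)
df = pd.read_csv("brand_year_data.csv")   # Brand, Year, Sales, PricePerUnit, Magazine, News, Outdoor, Broadcasting, Print
# lag of sales within each brand (the first year per brand is dropped)
df = df.sort_values(["Brand", "Year"])
df["LagSales"] = df.groupby("Brand")["Sales"].shift(1)
# pooled OLS with brand dummies (C(Brand)) so each brand keeps its own intercept
model = smf.ols(
    "Sales ~ PricePerUnit + Magazine + News + Outdoor + Broadcasting + Print + LagSales + C(Brand)",
    data=df.dropna(subset=["LagSales"]),
).fit()
print(model.summary())
Even with 60 rows the fit can be fragile with this many regressors, so some variable selection (or combining media types) may still be needed.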

Linear Regression of Subset depending on specific date

I'm currently doing a regression analysis for every company on the Norwegian stock market, where I regress the stock returns of each company against a benchmark. The period is 2009-2018. I've managed to do the regression for each company over the whole period, but I also want to do a regression for each month for every company.
The original dataset consists of 26,000 observations, which I've then split into subsets for a total of 390 elements (companies).
What I've done so far is shown below:
# one subset per company
data_subset <- by(data, data$Name, subset)
# regress each company's returns on the benchmark
data_lm <- lapply(data_subset, function(d) lm(CompanyReturn ~ DJReturn, data = d))
data_coef <- lapply(data_lm, coef)
# collect intercept and slope for each company
data_tabell <- matrix(0, length(data_subset), 2)
for (i in 1:length(data_subset)) {
  data_tabell[i, ] <- coef(data_lm[[i]])
}
colnames(data_tabell) <- c("Intercept", "Coefficient")
rownames(data_tabell) <- names(data_subset)
Does anyone know how I can specify that I want to run the regression for a company only over a specific period, for example for each year or each month, for every company?
Thank you in advance for the help!
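For illustration, here is the per-(company, month) grouping logic sketched in Python/pandas; the same idea carries over to R by splitting on both the company name and a year-month variable (for example with split() or dplyr::group_by()). The column names Name, Date, CompanyReturn, and DJReturn mirror the question and are otherwise assumptions:
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("returns.csv", parse_dates=["Date"])
data["month"] = data["Date"].dt.to_period("M")   # year-month key, e.g. 2013-05
rows = []
for (name, month), grp in data.groupby(["Name", "month"]):
    fit = smf.ols("CompanyReturn ~ DJReturn", data=grp).fit()
    rows.append({"Name": name, "month": str(month),
                 "Intercept": fit.params["Intercept"],
                 "Coefficient": fit.params["DJReturn"]})
table = pd.DataFrame(rows)
print(table.head())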

pyephem, libnova, stellarium, JPL Horizons disagree on moon RA/DEC?

MINOR EDIT: I say below that JPL's Horizons library is not open source. Actually, it is, and it's available here: http://naif.jpl.nasa.gov/naif/tutorials.html
At 2013-01-01 00:00:00 UTC, at 0 degrees north latitude, 0 degrees east longitude, and sea level elevation, what is the J2000 epoch right ascension and declination of the moon?
Sadly, different libraries give slightly different answers. Converted
to degrees, the summarized results (RA first):
Stellarium: 141.9408333000, 9.8899166666 [precision: .0004166640, .0000277777]
Pyephem: 142.1278749990, 9.8274722221 [precision .0000416655, .0000277777]
Libnova: 141.320712606865, 9.76909442356909 [precision unknown]
Horizons: 141.9455833320, 9.8878888888 [precision: .0000416655, .0000277777]
My question: why? Notes:
I realize these differences are small, but I use pyephem and libnova to calculate sun/moon rise/set, and these times can be very sensitive to position at higher latitudes (e.g., midnight sun).
I can understand JPL's Horizons library not being open source, but the other three are. Shouldn't someone work out the differences in these libraries and merge them? This is my main complaint. Do the stellarium/pyephem/libnova library authors have a fundamental difference in how to make these calculations, or do they just need to merge their code?
I also realize there might be other reasons the calculations are different, and would appreciate any help in rectifying these possible errors:
- Pyephem and Libnova may be using the epoch of the date instead of J2000.
- The moon is close enough that observer location can affect its RA/DEC (parallax effect).
- I'm using Perl's Astro::Nova and Python's pyephem, not the original C implementations of these libraries. However, if these differences are caused by using Perl/Python, that is important in my opinion.
My code (w/ raw results):
First, Perl and Astro::Nova:
#!/bin/perl
# RA/DEC of moon at 0N 0E at 0000 UTC 01 Jan 2013
use Astro::Nova;
# 1356998400 == 01 Jan 2013 0000 UTC
$jd = Astro::Nova::get_julian_from_timet(1356998400);
$coords = Astro::Nova::get_lunar_equ_coords($jd);
print join(",",($coords->get_ra(), $coords->get_dec())),"\n";
RESULT: 141.320712606865,9.76909442356909
- Second, Python and pyephem:
#!/usr/local/bin/python
# RA/DEC of moon at 0N 0E at 0000 UTC 01 Jan 2013
import ephem
e = ephem.Observer()
e.date = '2013/01/01 00:00:00'
moon = ephem.Moon()
moon.compute(e)
print moon.ra, moon.dec
RESULT: 9:28:30.69 9:49:38.9
- The stellarium result (snapshot):
- The JPL Horizons result (snapshot):
[JPL Horizons requires POST data (not really, but pretend), so I couldn't post a URL].
I haven't linked them (lazy), but I believe there are many unanswered questions on stackoverflow that effectively reduce to this question (inconsistency of precision astronomical libraries), including some of my own questions.
I'm playing with this stuff at: https://github.com/barrycarter/bcapps/tree/master/ASTRO
I have no idea what Stellarium is doing, but I think I know about the other three. You are correct that only Horizons is using J2000 instead of the epoch-of-date for this apparent, locale-specific observation. You can bring it into close agreement with PyEphem by clicking "change" next to the "Table Settings" and switching from "1. Astrometric RA & DEC" to "2. Apparent RA & DEC."
The difference with Libnova is a bit trickier, but my late-night guess is that Libnova uses UT instead of Ephemeris Time, and so to make PyEphem give the same answer you have to convert from one time to the other:
import ephem
moon, e = ephem.Moon(), ephem.Observer()
e.date = '2013/01/01 00:00:00'
e.date -= ephem.delta_t() * ephem.second
moon.compute(e)
print moon.a_ra / ephem.degree, moon.a_dec / ephem.degree
This outputs:
141.320681918 9.77023197401
Which is, at least, much closer than before. Note that you might also want to do this in your PyEphem code if you want it to ignore refraction like you have asked Horizons to; though for this particular observation I am not seeing it make any difference:
e.pressure = 0
Any residual difference is probably (but not definitely; there could be other sources of error that are not occurring to me right now) due to the different programs using different formulae to predict where the planets will be. PyEphem uses the old but popular VSOP87. Horizons uses the much more recent — and exact — DE405 and DE406, as stated in its output. I do not know what models of the solar system the other products use.

Stock management of assemblies and their sub-parts (relationships)

I have to track the stock of individual parts and kits (assemblies) and can't find a satisfactory way of doing this.
Sample bogus and hyper-simplified database:
Table prod:
prodID 1
prodName Flux capacitor
prodCost 900
prodPrice 1350 (900*1.5)
prodStock 3
-
prodID 2
prodName Mr Fusion
prodCost 300
prodPrice 600 (300*2)
prodStock 2
-
prodID 3
prodName Time travel kit
prodCost 1200 (900+300)
prodPrice 1560 (1200*1.3)
prodStock 2
Table rels
relID 1
relSrc 1 (Flux capacitor)
relType 4 (is a subpart of)
relDst 3 (Time travel kit)
-
relID 2
relSrc 2 (Mr Fusion)
relType 4 (is a subpart of)
relDst 3 (Time travel kit)
prodPrice: it's calculated from the cost, but not in a linear way. In this example, for costs of 500 or less the markup is 200% (price = cost x 2); for costs of 500-1000 it is 150%; for costs of 1000+ it is 130%.
That's why the time travel kit is much cheaper than buying the individual parts separately.
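For concreteness, the tiered rule described above could be written like this (a sketch in Python; the behavior at exactly 500 and 1000 is an assumption based on the wording):
def prod_price(cost):
    # tiered markup: 200% of cost up to 500, 150% from 500 to 1000, 130% above 1000
    if cost <= 500:
        return cost * 2.0
    elif cost <= 1000:
        return cost * 1.5
    return cost * 1.3
# matches the sample products: Mr Fusion 300 -> 600, Flux capacitor 900 -> 1350, Time travel kit 1200 -> 1560
for cost in (300, 900, 1200):
    print(cost, round(prod_price(cost)))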
prodStock: here is my problem. I can sell kits or the individual parts, so the stock of the kits is virtual.
The problem when I buy:
Some providers sell me the Time travel kit as a whole (with one barcode) and some sell me the individual parts (each with a different barcode).
So when I load the stock I don't know how to allocate it.
The problem when I sell:
If I only sold kits, calculating the stock would be easy: "I have 3 Flux capacitors and 2 Mr Fusions, so I have 2 Time travel kits and a Flux capacitor."
But I can sell kits or individual parts, so I have to track the stock of the individual parts and of the possible kits at the same time (and I have to account for the selling price).
Probably this is really simple, but I can't see a simple solution.
To sum up: I have to find a way of tracking the stock, and the database/program has to do it (I can't ask the clerk to correct the stock).
I'm using PHP + MySQL, but this is more a logic problem than a programming one.
Update: Sadly, Eagle's solution won't work:
The relationships can be, and are, recursive (one kit uses another kit).
There are kits that use more than one of the same part (2 flux capacitors + 1 Mr Fusion).
I really need to store a value for the stock of the kit. The same database is used for the web page where users buy the parts, and I need to show the available stock (otherwise they won't even try to buy). I can't afford to calculate the stock on every user search on the web page.
But I liked the idea of a boolean marking the stock as virtual.
Okay, well first of all, since the prodStock for the Time travel kit is virtual, you cannot store it in the database; it will essentially be a calculated field. It would probably help if you had a boolean on the table which says whether the prodStock is calculated or not. I'll pretend as though you had this field in the table and I'll call it isKit for now (where TRUE implies it's a kit and the prodStock should be calculated).
Now to calculate the amount of each item that is in stock:
select p.prodID, p.prodName, p.prodCost, p.prodPrice, p.prodStock
from prod p
where not p.isKit
union all
select p.prodID, p.prodName, p.prodCost, p.prodPrice, min(c.prodStock) as prodStock
from prod p
inner join rels r on (p.prodID = r.relDst and r.relType = 4)
inner join prod c on (r.relSrc = c.prodID and not c.isKit)
where p.isKit
group by p.prodID, p.prodName, p.prodCost, p.prodPrice
I used the alias c for the second prod to stand for 'component'. I explicitly wrote not c.isKit since this won't work recursively. union all is used rather than union for efficiency reasons, since they will both return the same results.
Caveats:
- This won't work recursively (e.g. if a kit requires components from another kit).
- This only works on kits that require only one of a particular item (e.g. if a time travel kit were to require 2 flux capacitors and 1 Mr. Fusion, this wouldn't work).
- I didn't test this so there may be minor syntax errors.
- This only calculates the prodStock field; to do the other fields you would need similar logic.
If your query is much more complicated than what I assumed, I apologize, but I hope that this can help you find a solution that will work.
As for how to handle the data when you buy a kit, this assumes you would store the prodStock only in the component parts. So, for example, if you purchase a time machine from a supplier, instead of increasing the prodStock on the time machine product, you would increase it on the flux capacitor and the Mr. Fusion.
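Given the update (kits can contain other kits, and can need more than one unit of a part), here is a sketch in Python of the quantity-aware, recursive "how many can I build" logic the database layer would need. The data structures are assumptions, and it presumes sub-kits don't share components with each other; shared components would need a proper allocation step:
# physical stock for plain parts, and a bill of materials for kits: kit -> [(component, qty needed), ...]
parts_stock = {"flux_capacitor": 3, "mr_fusion": 2}
bom = {
    "time_travel_kit": [("flux_capacitor", 2), ("mr_fusion", 1)],
}
def buildable(prod):
    # plain part: the stock is whatever is physically on the shelf
    if prod not in bom:
        return parts_stock.get(prod, 0)
    # kit: limited by its scarcest component, scaled by how many of it each kit needs
    return min(buildable(component) // qty for component, qty in bom[prod])
print(buildable("time_travel_kit"))   # min(3 // 2, 2 // 1) = 1 kit
Since the web page needs a stored stock value rather than an on-the-fly calculation, the result of a computation like this could be cached into prodStock whenever a purchase or sale touches any component, instead of being recomputed on every search.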