Logit regression including year and industry fixed effects in Python

I am currently working with a dataframe containing post-IPO financial information for traditional IPO firms as well as firms that went public through a SPAC merger, both in the years 2020 to 2022. I am trying to model the likelihood of a firm going public via the SPAC merger route as a function of a few key post-IPO financial variables (the independent variables). I want to employ a logistic regression model with the binary dependent variable P(SPAC)i, which equals 1 for SPAC firms and 0 for IPO firms. The main specification is:
P(SPAC)_i = 1 / (1 + e^(-(α + β1 X1,i + β2 X2,i + β3 X3,i + ... + Σj βj Year fixed effect_i,j + Σl βl Industry fixed effect_i,l + u_i)))
Where individual firms are indexed by i.
I don't know how to include year and industry fixed effects in my logit regression. Could anybody give me a hand with this?
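One common approach, sketched below under the assumption that the dataframe is called df and has a binary SPAC indicator plus categorical year and industry columns (the column names here are hypothetical, as are the financial covariates), is to let the formula interface expand the fixed effects into dummy variables with C():

import statsmodels.formula.api as smf

# Hypothetical column names: 'spac' (1 = SPAC merger, 0 = traditional IPO),
# 'year' and 'industry' as categorical identifiers, and a few illustrative
# post-IPO financial covariates ('leverage', 'roa', 'firm_size').
model = smf.logit(
    "spac ~ leverage + roa + firm_size + C(year) + C(industry)",
    data=df,
)
result = model.fit()
print(result.summary())

The C() terms create one dummy per year and per industry (dropping a reference category), which is what the Σ fixed-effect terms in the specification above amount to. An equivalent route is to build the dummy columns yourself with pd.get_dummies before fitting.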

Related

Testing Data Consistency and its effect on Multilevel Modeling Multivariate Inference

I have a MLM model looking at the effect of demographics of a few cities on a region wide outcome variable as follows:
RegionalProgress = β0j + β1j * Demographics + u0j + e0ij
The data used in this analysis consists of 7 different cities with different data sources. The point I am trying to make is that these 7 data sets (which I have combined) have inconsistent structure and substance, and that these differences do (or do not) alter, or at least complicate, multivariate relationships. A tip I got was to use β1j and its variation across cities. I'm having trouble understanding how this would relate to demonstrating inconsistencies in the data sets. I'm doing all of this in R, and my model looks like this in case that's helpful:
model11 <- lmerTest::lmer(RegionalProgress ~ 1 + (1|CITY) + PopDen + HomeOwn + Income + Black + Asian + Hispanic + Age, data = data, REML = FALSE)
Can anyone help me understand the tip, or give me other tips on how to find evidence of:
whether there are meaningful differences (or not) between the data sets across cities, and
whether these differences affect multivariate relationships?
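To make the tip concrete: letting β1j vary by city means fitting a random slope for a demographic predictor in addition to the random intercept (in lmer syntax, something like (1 + PopDen | CITY) instead of (1 | CITY)). Since your model is in R, take the sketch below only as an illustration of the idea in Python's statsmodels, with PopDen chosen purely as an example predictor; the estimated slope variance across cities and the city-specific slope deviations are what you would inspect for evidence of differences between the data sets:

import statsmodels.formula.api as smf

# Random intercept and random PopDen slope by CITY; the other demographics
# enter as fixed effects only (an illustrative choice, not a recommendation).
model = smf.mixedlm(
    "RegionalProgress ~ PopDen + HomeOwn + Income + Black + Asian + Hispanic + Age",
    data=data,
    groups=data["CITY"],
    re_formula="~PopDen",
)
fit = model.fit(reml=False)
print(fit.cov_re)           # variance of the city-level intercepts and PopDen slopes
print(fit.random_effects)   # per-city deviations of intercept and slope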

Regression model using two collinear time-related variables?

I am building a regression model to assess how a certain outcome, tracked from 2015-2018, changed in the year 2018 specifically relative to 2015-2017. However, the outcome underwent a natural year-by-year decline that I would also like to capture in the regression model. As a result, I am currently using a variable X as my independent variable (X=0 for 2015-2017 vs. X=1 for 2018), and a variable Y as a confounder (modeled continuously) to adjust for changes across the entire study period (Y=0 for 2015, Y=1 for 2016, Y=2 for 2017, Y=3 for 2018).
However, as you can see in the table below, there is a great deal of collinearity between these two variables. One solution would be removing confounder Y from the model but I believe it is important to capture year-by-year change from 2015-2017. Is there an alternative way I can set up this model or an alternative methodology I can use (ex. time series) to perform this analysis? Thank you very much!
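As a small illustration of why these two regressors fight each other, the sketch below (with made-up, balanced data) builds the 2018 indicator and the linear year trend from a hypothetical year column and checks their correlation:

import numpy as np
import pandas as pd

# Made-up balanced sample: 50 observations per year, 2015-2018.
df = pd.DataFrame({"year": np.repeat([2015, 2016, 2017, 2018], 50)})
df["X"] = (df["year"] == 2018).astype(int)   # 2018 indicator
df["Y"] = df["year"] - 2015                  # linear year trend, 0..3

# X and Y are strongly correlated by construction, which is the
# collinearity described above.
print(df[["X", "Y"]].corr())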

How to do reinforcement learning with regression instead of classification

I'm trying to apply reinforcement learning to a problem where the agent produces continuous numerical outputs from a recurrent network. Basically, it is a control problem where two outputs control how the agent behaves.
I define a policy as epsilon-greedy: (1 - eps) of the time the agent uses the output control values directly, and eps of the time it uses the output values perturbed by a small Gaussian noise term.
In this sense the agent can explore.
In most of the reinforcement learning literature I see that policy learning requires discrete actions, which can be learned with the REINFORCE (Williams 1992) algorithm, but I'm unsure what method to use here.
At the moment what I do is use masking to only learn the top choices, using an algorithm based on Metropolis-Hastings to decide whether a transition goes toward the optimal policy. Pseudocode:
import numpy as np

def build_target_mask(rewards, timeIndices, std):
    # rewards in (0, 1) and optimal is 1
    # relate rewards to likelihood via L(r) = exp(-|r - 1| / std)
    # since r <= 1, |r - 1| = 1 - r
    targetMask = np.zeros(len(timeIndices))
    neglogLi = (1 - np.mean(rewards)) / std
    # go through the rewards in random order to approximate a Markov process
    pairs = list(zip(rewards, timeIndices))
    np.random.shuffle(pairs)
    for r, idx in pairs:
        neglogLj = (1 - r) / std
        # Metropolis-Hastings style accept/reject step
        if neglogLj < neglogLi or np.log(np.random.uniform()) < neglogLi - neglogLj:
            # accept the transition, i.e. learn this action
            targetMask[idx] = 1
            neglogLi = neglogLj
    return targetMask
This provides a targetMask with ones for the actions that will be learned using standard backprop.
Can someone point me to the proper or better way to do this?
Policy gradient methods are good for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor Critic methods are covered there, as well.
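To make the policy-gradient suggestion concrete: for continuous actions a standard approach is to let the network output the mean of a Gaussian and apply REINFORCE to the Gaussian log-likelihood. The sketch below uses a plain linear mean and a hypothetical env object with reset()/step() methods, just to show the shape of the update; your recurrent network would take the place of the linear map.

import numpy as np

state_dim, action_dim = 4, 2           # placeholder sizes
W = np.zeros((action_dim, state_dim))  # parameters of the policy mean
sigma, lr, gamma = 0.1, 1e-3, 0.99

def run_episode(env):
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        mu = W @ s
        a = mu + sigma * np.random.randn(action_dim)  # sample from N(mu, sigma^2 I)
        s_next, r, done = env.step(a)                 # hypothetical interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def reinforce_update(states, actions, rewards):
    global W
    G = 0.0
    # walk the episode backwards, accumulating the discounted return
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        G = r + gamma * G
        mu = W @ s
        grad_logp = np.outer((a - mu) / sigma**2, s)  # d/dW of log N(a; mu, sigma^2 I)
        W += lr * G * grad_logp                        # REINFORCE: lr * return * grad log-prob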

Naive Bayes: Heterogeneous CPDs for observation variables

I am using a naive Bayes model for binary classification with a combination of discrete and continuous variables. My question is, can I use different conditional probability distribution (CPD) functions for the continuous and discrete observation variables?
For example, could I use a Gaussian CPD for the continuous variables and some deterministic CPD for the discrete variables?
Thank you
Yes, it is normal to mix continuous and discrete variables within the same model. Consider the following example.
Suppose I have two random variables:
T - the temperature today
D - the day of the week
Note T is continuous and D is discrete. Suppose I want to predict whether John will go to the beach, represented by the binary variable B. Then I could set up my inference as follows, assuming T and D are conditionally independent given B.
p(B|T,D) = p(T|B) · p(D|B) · p(B) / (p(T) · p(D)) ∝ p(T|B) · p(D|B) · p(B)
p(T|B) could be a Gaussian distribution, p(D|B) could be a discrete distribution, and p(B) could be a discrete prior on how often John goes to the beach.
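A minimal numeric sketch of that posterior, with made-up parameters (in practice they would be estimated from training data): a Gaussian density for T given B, a categorical table for D given B, and a Bernoulli prior on B.

import numpy as np
from scipy.stats import norm

# Hypothetical parameters, purely for illustration:
prior_B = {1: 0.3, 0: 0.7}                       # p(B): how often John goes to the beach
temp_given_B = {1: (28.0, 3.0), 0: (15.0, 6.0)}  # p(T|B): Gaussian (mean, std)
day_given_B = {                                  # p(D|B): discrete, Mon..Sun
    1: np.array([0.05, 0.05, 0.05, 0.05, 0.10, 0.35, 0.35]),
    0: np.array([0.17, 0.17, 0.17, 0.17, 0.14, 0.09, 0.09]),
}

def posterior_beach(temp, day):
    # p(B=1 | T=temp, D=day) under the naive Bayes factorisation above
    unnorm = {}
    for b in (0, 1):
        mean, std = temp_given_B[b]
        unnorm[b] = norm.pdf(temp, mean, std) * day_given_B[b][day] * prior_B[b]
    return unnorm[1] / (unnorm[0] + unnorm[1])

print(posterior_beach(temp=27.0, day=5))  # e.g. 27 degrees on a Saturday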

How Do I Avoid using Running Totals in My Code?

I am learning programming, software design, and Java in school right now. The class that is getting me mixed up is Software Design. We are using Word to run simple VB code to do simple programs. My instructor says I am losing cohesion by using running totals. I am having a hard time thinking of a way to avoid them. Here is an example of some pseudocode I am talking about (the modules are called from a driver module that is not shown):
CalculateDiscountPrice module
    DiscountPrice = (FloorPrice * (1 - DiscountRate))
End module

CalculateDeliveryPrice module
    If DeliveryFee = "Yes" Then
        DeliveryPrice = DiscountPrice + 20
    ElseIf DeliveryFee = "No" Then
        DeliveryPrice = DiscountPrice
    End If
End module

CalculateTradeInCredit module
    If TradeInCredit = "Yes" Then
        CreditedPrice = DeliveryPrice - 5
    ElseIf TradeInCredit = "No" Then
        CreditedPrice = DeliveryPrice
    End If
End module

CalculateCostOfBed module
    CostOfBed = CreditedPrice
End module
Basically, DiscountPrice is used to join the first two modules, and then DeliveryPrice the second two. Supposedly, the last module may not even need to be there if I fixed this problem. Any help for a beginner?
When I look at your example, what jumps out at me is a problem with coupling between modules. (If you haven't already studied that concept, you probably soon will.) However, too much coupling and too little cohesion often go together, so hopefully I can still give you a helpful answer. (Oversimplified but adequate-for-here definitions: Cohesive modules do one focused thing instead of several unrelated things, and Coupled modules depend on one another to do whatever it is they do. We usually want modules to have strong cohesion internally but weak coupling to other modules.)
I infer from your pseudocode that you want to calculate the price of a bed like so:
* start with the floor price
* discount it
* add in a delivery fee
* subtract a trade-in credit
* the result is the cost of the bed
When you express it like that, you might notice that those operations are (or can be) pretty independent of each other. For example, the delivery fee doesn't really depend on the discounted price, just on whether or not a delivery fee is to be charged.
Now, the way you've structured your design, your 'DeliveryPrice' variable is really an "as delivered" price that does depend on the discounted price. This is the kind of thing we want to get rid of. We can say that your modules are too tightly coupled because they depend on each other in ways that are not really required to solve the problem. We can say that they lack cohesion because they are really doing more than one thing - i.e. the delivery price module is adding the delivery fee to the discounted price instead of just calculating the delivery fee.
It's hard to see with toy examples, but this matters as designs get more complex. With just a few lines of pseudocode, it seems perfectly natural to have a "running total" threaded between them. But what if the delivery fee depends on a complex calculation involving the distance to the customer's house, the weight of the purchase, and the day of the week? Now, having it also involve the discounted price would get really confusing.
So, with all that in mind, consider this alternate design:
CalculateDeliveryFee module
    If DeliveryFeeCharged = "Yes" Then
        DeliveryFee = 20
    Else
        DeliveryFee = 0
    End If
End module

CalculateTradeInCredit module
    If TradeInCreditApplied = "Yes" Then
        TradeInCredit = 5
    Else
        TradeInCredit = 0
    End If
End module

CalculateCostOfBed module
    DiscountPrice = (FloorPrice * (1 - DiscountRate))
    AsDeliveredPrice = DiscountPrice + DeliveryFee
    WithTradeInPrice = AsDeliveredPrice - TradeInCredit
    CostOfBed = WithTradeInPrice
End module
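If it helps to see it as ordinary functions, here is the same decomposition sketched in Python (the function names and the sample prices passed at the end are illustrative): each helper computes one focused value, and only the final function combines them, so no running total is threaded through shared variables.

def calculate_delivery_fee(delivery_fee_charged):
    # flat 20 when delivery is requested, otherwise nothing
    return 20 if delivery_fee_charged else 0

def calculate_trade_in_credit(trade_in_applied):
    # flat 5 credit when a trade-in is applied
    return 5 if trade_in_applied else 0

def calculate_cost_of_bed(floor_price, discount_rate, delivery_fee_charged, trade_in_applied):
    discount_price = floor_price * (1 - discount_rate)
    return (discount_price
            + calculate_delivery_fee(delivery_fee_charged)
            - calculate_trade_in_credit(trade_in_applied))

print(calculate_cost_of_bed(500, 0.10, delivery_fee_charged=True, trade_in_applied=False))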
Now, coupling is reduced - the delivery and trade-in modules don't know anything at all about bed prices. This also improves their cohesion, since they are doing something more focused - calculating fees, not summing prices and fees. The actual price calculation does depend on the other modules, but that's inherent in the problem. And the calculation is cohesive - the "one thing" it's doing is calculating the price of the bed!