Seurat cross-species integration

I am currently working with single-cell data from human and zebrafish, both from brain tissue.
My assignment is to integrate them. The steps I have followed so far:
Find human orthologs for the zebrafish genes in BioMart
Keep only the one-to-one orthologs
Subset the zebrafish Seurat object based on the orthologs and replace the gene names with the human gene names
Create a new object for zebrafish and run NormalizeData and FindVariableFeatures
Then use this object together with my human object for integration (a rough sketch of these steps is below)
Human object: 20620 features across 2989 samples
Zebrafish object: 6721 features across 6036 samples
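Roughly, the subsetting and renaming step looks like this (a minimal sketch; zebrafish_obj, human_obj, and the ortholog table's column names are placeholders for my actual objects):
library(Seurat)

# orthologs: the BioMart export, already filtered to one-to-one hits;
# the column names 'zfin_gene' and 'human_gene' are placeholders
counts <- GetAssayData(zebrafish_obj, assay = "RNA", slot = "counts")
counts <- counts[rownames(counts) %in% orthologs$zfin_gene, ]

# replace the zebrafish gene names with their human ortholog symbols
rownames(counts) <- orthologs$human_gene[match(rownames(counts), orthologs$zfin_gene)]

zf_renamed <- CreateSeuratObject(counts = counts, meta.data = zebrafish_obj@meta.data)
zf_renamed <- NormalizeData(zf_renamed)
zf_renamed <- FindVariableFeatures(zf_renamed)

double.list <- list(human_obj, zf_renamed)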
# double.list holds the two objects (human + renamed zebrafish)
features <- SelectIntegrationFeatures(object.list = double.list)
anchors <- FindIntegrationAnchors(object.list = double.list,
                                  anchor.features = features,
                                  normalization.method = "LogNormalize",
                                  nn.method = "rann")
This identifies 2085 anchors.
I used nn.method = "rann" because with the default I get this error:
Error: C stack usage 7973252 is too close to the limit
Then I run the integration like this:
ZF_HUMAN.combined <- IntegrateData(anchorset = anchors,
new.assay.name = "integrated")
and the error I am receiving looks like this:
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 9265 anchors
Filtering anchors
Retained 2085 anchors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=22s
To solve this I tried playing around with the arguments of FindIntegrationAnchors,
e.g. I used l2.norm = FALSE. The only thing that changed was the number of anchors, which decreased.
I am wondering whether using nn.method = "rann" in FindIntegrationAnchors is messing things up.
Any help will be appreciated; I have been struggling with this for a long time and I don't know what else to do.

Related

How to get cost/reward current estimation for all arms using Vowpal Wabbit with Python

I am starting to work with Vowpal Wabbit in Python and I am somewhat struggling with its lack of documentation.
Do you know what model it uses for the cost/reward estimation of each arm? And do you know how to retrieve this current estimate?
from vowpalwabbit import pyvw  # Python bindings for Vowpal Wabbit

vw = pyvw.vw("--cb_explore 2 --epsilon 0.2")
input = "2:-20:0.5 | Anna"
vw.learn(input)
input = "1:-10:0.1 | Anna"
vw.learn(input)
vw.predict(" | Anna")
Output would be:
[0.10000000149011612, 0.9000000357627869]
How can I also get the expected value for each arm? Something like
[-10.00, -20.00]
When using _explore you get back a PMF over the given actions. This is true for both CB and CB_adf.
However, when using the non-explore version, the two differ a bit:
--cb is going to give you the chosen action directly, whereas --cb_adf is going to return the score for each given action.
So in this situation, changing to action dependent features (ADF) should provide the score/estimated cost for each arm.

How to download IMF Data Through JSON in R

I recently took an interest in retrieving data in R through JSON. Specifically, I want to be able to access data through the IMF. I know virtually nothing about JSON so I will share what I [think I] know so far, and what I have accomplished.
I browsed their web page for JSON, which helped a little bit. It gave me the starting URL. Here is the web page: http://datahelp.imf.org/knowledgebase/articles/667681-using-json-restful-web-service
I managed to download some lists (using the GET() and fromJSON() functions), which are really bulky. I know enough about the lists to tell that the "call" was successful, but I cannot for the life of me get actual data. So far, I have been trying to use the rawToChar() function on the "content" data, but I am virtually stuck there.
If anything, I managed to create data frames that contain the codes, which I presume would be used somewhere in the JSON link. Here is what I have.
library(jsonlite)  # provides fromJSON()

all.imf.data = fromJSON("http://dataservices.imf.org/REST/SDMX_JSON.svc/Dataflow/")
str(all.imf.data)
#all.imf.data$Structure$Dataflows$Dataflow$Name[[2]] #for the catalogue of sources
catalogue1 = cbind(all.imf.data$Structure$Dataflows$Dataflow$KeyFamilyRef,
                   all.imf.data$Structure$Dataflows$Dataflow$Name[[2]])
catalogue1 = catalogue1[,-2] # catalogue of all the countries
data.structure = fromJSON("http://dataservices.imf.org/REST/SDMX_JSON.svc/DataStructure/IFS")
info1 = data.frame(data.structure$Structure$Concepts$ConceptScheme$Concept[,c(1,4)])
View(data.structure$Structure$CodeLists$CodeList$Description)
str(data.structure$Structure$CodeLists$CodeList$Code)
#Units
units = data.structure$Structure$CodeLists$CodeList$Code[[1]]
#Countries
countries = data.frame(data.structure$Structure$CodeLists$CodeList$Code[[3]])
countries = countries[,-length(countries)]
#Series Codes
codes = data.frame(data.structure$Structure$CodeLists$CodeList$Code[[4]])
codes = codes[,-length(codes)]
# all.imf.data # JSON from the starting point, provided on the website
# catalogue1 # data frame of all the data bases, International Financial Statistics, Government Financial Statistics, etc.
# codes # codes for the specific data sets (GDP, Current Account, etc).
# countries # data frame of all the countries and their ISO codes
# data.structure # large list, with starting URL and endpoint "IFS". Ideally, I want to find some data set somewhere within this data base.
"info1" # looks like parameters for retrieving the data (for instance, dates, units, etc).
# units # data frame that indicates the options for units
I would just like some advice about how to go about retrieving any data, something as simple as GDP (PPP) for a constant year. I have been following an article on R blogs (which retrieved data from the EU's database), but I cannot replicate the procedure for the IMF. I feel like I am close to retrieving something useful but cannot quite get there. Given that I have data frames containing the names of the databases, the series, and the codes for the series, I think it is just a matter of figuring out how to construct the appropriate URL for getting the data, but I could be wrong.
The data frame codes presumably contains the codes for the data sets. Is there a way to make a call for, let's say, US data for BK_DB_BP6_USD, which is "Balance of Payments, Capital Account, Total, Debit, etc."? How should I go about doing this in R?
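In case it helps to make this concrete, here is a minimal sketch of the kind of call I think I need, based on the CompactData endpoint described in the IMF help page (the frequency, country, and database/indicator pairing are assumptions and may well be wrong):
library(jsonlite)

# Hypothetical request: annual ("A") US data for one indicator code.
# The key "A.US.BK_DB_BP6_USD" is a dot-separated list of dimension values
# whose order comes from the DataStructure, so it may need adjusting
# (and this code may belong to a database other than IFS).
url <- paste0("http://dataservices.imf.org/REST/SDMX_JSON.svc/",
              "CompactData/IFS/A.US.BK_DB_BP6_USD",
              "?startPeriod=2010&endPeriod=2015")
resp <- fromJSON(url)

# If the query is valid, the observations usually sit here:
series <- resp$CompactData$DataSet$Series
str(series)  # look for Obs entries with @TIME_PERIOD / @OBS_VALUE pairs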

How to deal with missing row when binding column to data frame (a scraping issue!)

I'm attempting to create data frames by attaching URLs to a scraped HTML table, and then writing these to individual CSV files. The data concern the passage of Bills through their respective stages in both the House of Commons and the Lords. I've written a function (see below) which reads the tables, parses the HTML code, scrapes the required URLs, binds the two together, extracts the rows concerned with the House of Lords, and then writes the CSV files. This function is then run across two lists (one of links to the Bill stage pages and another of simplified file names).
library(XML)
lords_tables <- function (x, y) {
  tables <- as.data.frame(readHTMLTable(x))
  sitePage <- htmlParse(x)                      # parse the web code
  hrefs <- xpathSApply(sitePage, "//td/descendant::a[1]",
                       xmlGetAttr, 'href')      # first href child of the a nodes
  table_bind <- cbind(tables, hrefs)
  row_no <- grep(".+: House of Lords|Royal Assent",
                 table_bind$NULL.V2)            # row positions of Lords / Royal Assent
  lords_rows <- table_bind[row_no, ]            # subset rows containing House of Lords / Royal Assent
  write.csv(lords_rows, file = paste0(y, ".csv"))
}
# x = a list of links to the Bill pages/ y = list of simplified names
mapply(lords_tables, x=link_list, y=gsub_URL)
This works perfectly well for the cases where debates occurred for every stage. However, some cases pose a problem, such as:
browseURL("http://services.parliament.uk/bills/2010-12/armedforces/stages.html")
For this example, no debate occurred at the '3rd reading: House of Commons' and again at the 'Royal Assent'. This results in the following error being returned:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 21, 19
To overcome this error, I'd like to have an NA against the missing stage. Has anyone got an idea of how to achieve this? I'm a relative n00b, so feel free to suggest a more elegant approach to the whole problem.
Thanks in advance!
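One way this could be done (a sketch only, not tested against the real pages; the "//table//tr" XPath and the header-row handling are assumptions that may need adjusting) is to extract the href per table row, so that a row without a link gets an NA instead of shortening the vector:
library(XML)

get_row_hrefs <- function(sitePage) {
  rows <- getNodeSet(sitePage, "//table//tr")   # one node per table row
  vapply(rows, function(r) {
    a <- getNodeSet(r, ".//td/descendant::a[1]")
    if (length(a) == 0) NA_character_ else xmlGetAttr(a[[1]], "href")
  }, character(1))
}

# Inside lords_tables(), using this per-row vector (dropping a header row if
# readHTMLTable() treats it separately) keeps hrefs the same length as the
# scraped table, so cbind() no longer fails when a stage has no debate link.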

Post hoc Tukey test for two-way mixed model ANOVA

Some of my students and I have searched for a solution to this in numerous places, with no luck and literally for months. I keep being referred to the lme command, which I do NOT want to use. The output it provides is not the one my colleagues or I have used for over 15 years. Moreover, given that I am using R as a teaching tool, it does not flow as well after t-tests and one-way ANOVAs for intro stats students. I am conducting a two-way repeated-measures ANOVA with repetition on one factor. I have succeeded in getting R to replicate what SigmaPlot gives for the main effects. However, the post hoc analysis given by R differs significantly from the same post hoc in SigmaPlot. Here is the code I used, with notes (as I am also using this to teach students).
#IV between: IVB1 - Independent variable - between subject factor
#IV within: IVW1 - Independent variable - within subject factor
#DV: DV - Dependent variable.
aov1 <- aov(DV ~ IVB1*IVW1 + Error(Subject/IVW1) + (IVB1), data = objectL)
summary(aov1)
# post hoc analysis
ph1 <- TukeyHSD(aov(DV ~ IVB1*IVW1, data = objectL))
ph1
I hope somebody can help.
Thank you!
I have also had this problem, and I found a convenient alternative: the aov_ez() function from the afex package instead of aov(), followed by post hoc analysis using lsmeans() instead of TukeyHSD():
library(afex)     # aov_ez()
library(lsmeans)  # lsmeans(), contrast()

model <- aov_ez(id = "SubjID",
                dv = "DV",
                data = data,
                within = c("IVW1", "IVW2"),
                between = "IVB1")
# Post hoc
comp <- lsmeans(model, specs = ~ IVB1:IVW1:IVW2, adjust = "tukey")
contrast(comp, method = "pairwise")
You will find a detailed tutorial here:
https://www.psychologie.uni-heidelberg.de/ae/meth/team/mertens/blog/anova_in_r_made_easy.nb.html

Graphs - find common data

I've just started to read up on graph theory and data structures.
I'm building an example application which should be able to find the XPath for the most common links. Imagine a Google SERP: my application should be able to find the XPath of all links pointing to a result.
Imagine that these XPaths were found:
/html/body/h2/a
/html/body/p/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/div[#class=footer]/span[#id=copyright]/a
From these XPaths, I've thought of a graph like this (I might be completely lost here):
                 html
                   |
                 body
                   |
    h2 ----------- p ----------- div[#class=footer]
     |            / \                    |
   a (1)         a   strong      span[#id=copyright]
                       |                 |
                     a (3)             a (1)
Is this the best approach to this problem?
What would be the best way (data structure) to store this in memory? The language does not matter. We can see that we have 3 links matching the path html -> body -> p -> strong -> a.
As I said, I'm totally new to this, so please forgive me if I thought of this completely wrong.
EDIT: I may be looking for the trie data structure?
Don't worry about tries yet. Just construct a tree using a standard graph representation (node = {value, count, parent}), collapsing identical branches as you go and incrementing the counter. Then sort all the leaves by count in descending order and traverse from each leaf upwards to get a path.
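For illustration, here is a minimal sketch of that idea in R (the question says the language does not matter; children are stored here instead of parent pointers, but the collapse-and-count logic is the same):
# Build a count tree from XPath strings: each node keeps a count and a list of
# children keyed by path segment, so identical branches collapse and the
# counter on the shared node is incremented.
insert_path <- function(node, parts) {
  if (length(parts) == 0) return(node)
  key <- parts[1]
  if (is.null(node$children[[key]])) {
    node$children[[key]] <- list(count = 0, children = list())
  }
  node$children[[key]]$count <- node$children[[key]]$count + 1
  node$children[[key]] <- insert_path(node$children[[key]], parts[-1])
  node
}

xpaths <- c("/html/body/h2/a",
            "/html/body/p/a",
            "/html/body/p/strong/a",
            "/html/body/p/strong/a",
            "/html/body/p/strong/a",
            "/html/body/div[#class=footer]/span[#id=copyright]/a")

root <- list(count = 0, children = list())
for (p in xpaths) {
  root <- insert_path(root, strsplit(sub("^/", "", p), "/")[[1]])
}

# Count of the collapsed branch html -> body -> p -> strong -> a:
root$children$html$children$body$children$p$children$strong$children$a$count  # 3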