Rpy2 - Select Results and Output to CSV File - output

I'm currently doing Cox Proportional Hazards Modeling using Rpy2 - I imagine my question will cover other functions and the results from calling them as well though.
After I run the function, I have a variable which contains the results from the function, in the form of a vector. I have tried explicitly converting this to a DataFrame (resultsDataFrame = DataFrame(resultVector)). There are no errors returned when doing this. However, when I do resultsDataFrame.to_csvfile(filename) I get the following error:
Traceback (most recent call last):
File "<pyshell#171>", line 1, in <module>
modelFrame.to_csvfile('/Users/fortylashes/Documents/Matthews_Research/Cox_PH/ResultOutput_Exp1.csv')
File "/Library/Python/2.7/site-packages/rpy2/robjects/vectors.py", line 1031, in to_csvfile
'col.names': col_names, 'qmethod': qmethod, 'append': append})
RRuntimeError: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""coxph"" to a data.frame
Furthermore, when I simply do:
for result in resultVector:
print (result)
I get an extremely long list of results- including information on each entry in the dataset used in the model, for each variable (so 9,000 records x 9 variables = 81,000 unneeded results). The results I really need are at the bottom of this vector and look like this:
coef exp(coef) se(coef) z p
age_age6574 -0.057775 0.944 0.05469 -1.056 2.9e-01
age_age75plus -0.020795 0.979 0.04891 -0.425 6.7e-01
sex_female -0.005304 0.995 0.03961 -0.134 8.9e-01
stage_late -0.261609 0.770 0.04527 -5.779 7.5e-09
access -0.000494 1.000 0.00069 -0.715 4.7e-01
Likelihood ratio test=36.6 on 5 df, p=7.31e-07 n= 9752, number of events= 2601
*NOTE: There were several more variables for which data was reported in the initial results (the 9,000 x 9 that I was talking about) but weren't actually used in the model.
I was wondering if there was a way to explicitly get this data, put it in one long ordered row, and then output it to a csv file?
::::UPDATE::::
When I call theModel.names I get a list of the various measures which can be called by numerical index:
[1] "coefficients" "var" "loglik"
[4] "score" "iter" "linear.predictors"
[7] "residuals" "means" "concordance"
[10] "method" "n" "nevent"
[13] "terms" "assign" "wald.test"
[16] "y" "formula" "call"
From this I can get the coefficients, which can then be exponentiated. I have not found, however, the p-value, the z score or the likelihood test ratio, which I will need.

Related

Why the parsed dicts are equal while the pickled dicts are not?

I'm working on an aggregated config file parsing tool, hoping it can support .json, .yaml and .toml files. So, I have done the next tests:
The example.json config file is as:
{
"DEFAULT":
{
"ServerAliveInterval": 45,
"Compression": true,
"CompressionLevel": 9,
"ForwardX11": true
},
"bitbucket.org":
{
"User": "hg"
},
"topsecret.server.com":
{
"Port": 50022,
"ForwardX11": false
},
"special":
{
"path":"C:\\Users",
"escaped1":"\n\t",
"escaped2":"\\n\\t"
}
}
The example.yaml config file is as:
DEFAULT:
ServerAliveInterval: 45
Compression: yes
CompressionLevel: 9
ForwardX11: yes
bitbucket.org:
User: hg
topsecret.server.com:
Port: 50022
ForwardX11: no
special:
path: C:\Users
escaped1: "\n\t"
escaped2: \n\t
and the example.toml config file is as:
[DEFAULT]
ServerAliveInterval = 45
Compression = true
CompressionLevel = 9
ForwardX11 = true
['bitbucket.org']
User = 'hg'
['topsecret.server.com']
Port = 50022
ForwardX11 = false
[special]
path = 'C:\Users'
escaped1 = "\n\t"
escaped2 = '\n\t'
Then, the test code with output is as:
import pickle,json,yaml
# TOML, see https://github.com/hukkin/tomli
try:
import tomllib
except ModuleNotFoundError:
import tomli as tomllib
path = "example.json"
with open(path) as file:
config1 = json.load(file)
assert isinstance(config1,dict)
pickled1 = pickle.dumps(config1)
path = "example.yaml"
with open(path, 'r', encoding='utf-8') as file:
config2 = yaml.safe_load(file)
assert isinstance(config2,dict)
pickled2 = pickle.dumps(config2)
path = "example.toml"
with open(path, 'rb') as file:
config3 = tomllib.load(file)
assert isinstance(config3,dict)
pickled3 = pickle.dumps(config3)
print(config1==config2) # True
print(config2==config3) # True
print(pickled1==pickled2) # False
print(pickled2==pickled3) # True
So, my question is, since the parsed obj are all dicts, and these dicts are equal to each other, why their pickled codes are not the same, i.e., why is the pickled code of the dict parsed from json different to other two?
Thanks in advance.
The difference is due to:
The json module performing memoizing for object attributes with the same value (it's not interning them, but the scanner object contains a memo dict that it uses to dedupe identical attribute strings within a single parsing run), while yaml does not (it just makes a new str each time it sees the same data), and
pickle faithfully reproducing the exact structure of the data it's told to dump, replacing subsequent references to the same object with a back-reference to the first time it was seen (among other reasons, this makes it possible to dump recursive data structures, e.g. lst = [], lst.append(lst), without infinite recursion, and reproduce them faithfully when unpickled)
Issue #1 isn't visible in equality testing (strs compare equal with the same data, not just the same exact object in memory). But when pickle sees "ForwardX11" the first time, it inserts the pickled form of the object and emits a pickle opcode that assigns a number to that object. If that exact object is seen again (same memory address, not merely same value), instead of reserializing it, it just emits a simpler opcode that just says "Go find the object associated with the number from last time and put it here as well". If it's a different object though, even one with the same value, it's new, and gets serialized separately (and assigned another number in case the new object is seen again).
Simplifying your code to demonstrate the issue, you can inspect the generated pickle output to see how this is happening:
s = r'''{
"DEFAULT":
{
"ForwardX11": true
},
"FOO":
{
"ForwardX11": false
}
}'''
s2 = r'''DEFAULT:
ForwardX11: yes
FOO:
ForwardX11: no
'''
import io, json, yaml, pickle, pickletools
d1 = json.load(io.StringIO(s))
d2 = yaml.safe_load(io.StringIO(s2))
pickletools.dis(pickle.dumps(d1))
pickletools.dis(pickle.dumps(d2))
Try it online!
The output from that code for the json parsed input is (with # comments inline to point out important things), at least on Python 3.7 (the default pickle protocol and exact pickling format can change from release to release), is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes 'ForwardX11'
38: q BINPUT 3 # Assigns the serialized form the ID of 3
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: h BINGET 3 # Looks up whatever object was assigned the ID of 3
57: \x89 NEWFALSE
58: s SETITEM
59: u SETITEMS (MARK at 5)
60: . STOP
highest protocol among opcodes = 2
while the output from the yaml loaded data is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes as before
38: q BINPUT 3 # and assigns code 3 as before
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: X BINUNICODE 'ForwardX11' # Doesn't see this 'ForwardX11' as being the exact same object, so reserializes
70: q BINPUT 6 # and marks again, in case this copy is seen again
72: \x89 NEWFALSE
73: s SETITEM
74: u SETITEMS (MARK at 5)
75: . STOP
highest protocol among opcodes = 2
printing the id of each such string would get you similar information, e.g., replacing the pickletools lines with:
for k in d1['DEFAULT']:
print(id(k))
for k in d1['FOO']:
print(id(k))
for k in d2['DEFAULT']:
print(id(k))
for k in d2['FOO']:
print(id(k))
will show a consistent id for both 'ForwardX11's in d1, but differing ones for d2; a sample run produced (with inline comments added):
140067902240944 # First from d1
140067902240944 # Second from d1 is *same* object
140067900619760 # First from d2
140067900617712 # Second from d2 is unrelated object (same value, but stored separately)
While I didn't bother checking if toml behaved the same way, given that it pickles the same as the yaml, it's clearly not attempting to dedupe strings; json is uniquely weird there. It's not a terrible idea that it does so mind you; the keys of a JSON dict are logically equivalent to attributes on an object, and for huge inputs (say, 10M objects in an array with the same handful of keys), it might save a meaningful amount of memory on the final parsed output by deduping (e.g. on CPython 3.11 x86-64 builds, replacing 10M copies of "ForwardX11" with a single copy would reduce 590 MB for string data to just 59 bytes).
As a side-note: This "dicts are equal, pickles are not" issue could also occur:
When the two dicts were constructed with the same keys and values, but the order in which the keys were inserted differed (modern Python uses insertion-ordered dicts; comparisons between them ignore ordering, but pickle would be serializing them in whatever order they iterate in naturally).
When there are objects which compare equal but have different types (e.g. set vs. frozenset, int vs. float); pickle would treat them separately, but equality tests would not see a difference.
Neither of these is the issue here (both json and yaml appear to be constructing in the same order seen in the input, and they're parsing the ints as ints), but it's entirely possible for your test of equality to return True, while the pickled forms are unequal, even when all the objects involved are unique.

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

I'm trying to get the sentiments for comments with the help of hugging face sentiment analysis pretrained model. It's returning error like Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier.
Below I'm attaching the code please look at it
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending the total rows to empty list
text = []
for index, row in data.iterrows():
text.append(row['Review'])
I'm trying to get the sentiment for all the rows
sent = []
for i in range(len(data)):
sentiment = classifier(data.iloc[i,0])
sent.append(sentiment)
The error is :
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self
some of the sentences in your Review column of the data frame are too long. when these sentences are converted to tokens and sent inside the model they are exceeding the 512 seq_length limit of the model, the embedding of the model used in the sentiment-analysis task was trained on 512 tokens embedding.
to fix this issue you can filter out the long sentences and keep only smaller ones (with token length < 512 )
or you can truncate the sentences with truncating = True
sentiment = classifier(data.iloc[i,0], truncation=True)
If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
token = AutoTokenizer.from_pretrained("your model")
tokens = token.tokenize(
text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.

How to use cbind to add a vector to a dataframe that has different number of rows

I'm trying to add a column within my dataframe with the values stored in a vector, but I'm getting the error message Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 82, 63
This is the code I used:
Patient5GeneExpressionProfile <- cbind(Patient5GeneExpressionProfile, geneclusters)

How to find intersection or subset of two CSV files

I have 2 CSV files containing two columns and a large number of rows. The first column is the id, and the second is the set of paired values.
e.g.:
CSV1:
1 {[1,2],[1,4],[5,6],[3,1]}
2 {[2,4] ,[6,3], [8,3]}
3 {[3,2], [5,2], [3,5]}
CSV2:
1 {[2,4] ,[6,3], [8,3]}
2 {[3,4] ,[3,3], [2,3]}
3 {[1,4],[5,6],[3,1],[5,5]}
Now I need to get a CSV file which contains either exact matching items or subset which belongs to both CSVs.
Here the result should be:
{[2,4] ,[6,3], [8,3]}
{[1,4],[5,6],[3,1]}
Can anyone suggest python code to do this?
As suggested by this answer you can use set.intersection to get the intersection of two sets, however this does not work with lists as items. Instead you can also use filter (comparable to this answer):
>>> l1 = [[1,2],[1,4],[5,6],[3,1]]
>>> l2 = [[1,4],[5,6],[3,1],[5,5]]
>>> filter(lambda q: q in l2, l1)
[[1, 4], [5, 6], [3, 1]]
In Python 3 you should convert it to list since there filter returns an iterable:
>>> list(filter(lambda x: x in l2,l1))
You can load CSV files (if they are really comma [or some other character] separated files) with csv.reader or pandas.read_csv for example.

permuting a path while keeping end vertices constant?

I have animal movement paths from GPS collars (the animal's location was recorded every 2h). To study how the actual path compares to random paths I need to generate alternate paths by randomly distributing the original route segments between the actual beginning and end locations (first and last vertices). I thought a good way to go would be to use the permute.vertices function in igraph. However, I cannot figure out how to keep the first and last vertices constant.
Here is a sample data set:
I'm starting out with a matrix of from-coordinates and to-coordinates that define the steps:
library(igraph)
path <- matrix (c(-111.52, -111.49, -111.48, -111.47, -111.46,
35.34, 35.35, 35.33, 35.32, 35.31,
-111.49, -111.48, -111.47, -111.46, -111.5,
35.35, 35.33, 35.32, 35.31, 35.4),
nrow=5, ncol=4)
path<-as.data.frame(path)
names(path)<-c("From.x","From.y","To.x","To.y")
From <- 0:(nrow(path)-1)
To <- 1:nrow(path)
path <- cbind(From, To, path)
Turning the data.frame into a graph:
path <- graph.data.frame(path,directed=FALSE)
V(path)
Randomly permuting the vertices:
path2 <- permute.vertices(path, permutation=sample(vcount(path)))
V(path2)
How could I write the code to keep the first and last vertices always "0" and "5"? (or depending on the path, of course, a different number than "5")
I also then need to extract the coordinates from the permuted path and get them into a matrix. I tried it with the tkplot.getcoords command, but am not sure how to transform them back (I suppose tkplot transforms them somehow).
tkplot(path2)
kplot.getcoords(1, norm = TRUE)
I'm using RStudio on Windows 8.
Then just permute the rest of the vertices, and keep 0 and 5:
perm <- c(1, sample(2:(vcount(path)-1)), 5)
perm
# [1] 1 4 5 3 2 5
path2 <- permute.vertices(path, permutation=perm)
V(path2)
# Vertex sequence:
# [1] "0" "4" "3" "1" "5" "0"
For your other question, please explain better what you want, because I am not sure what kind of matrix you want to create.