Sweave: How to get blank lines as in the source?

How can I get a blank line before the second comment in the PDF file of the following .Rnw file? I tried to work with keep.source and strip.white, but I still don't get the blank line -- all the chunks are "pasted" together.
\documentclass{article}
\usepackage{fancyvrb}
\usepackage{Sweave}
\begin{document}
<<setup, eval=FALSE>>=
## some comment
a <- 1
b <- 2

## some comment (there is no newline before this comment...)
c <- 3
d <- 4
@
\end{document}
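For reference, this is roughly how I tried those options, either globally,
\SweaveOpts{keep.source=TRUE, strip.white=false}
or per chunk,
<<setup, eval=FALSE, keep.source=TRUE, strip.white=false>>=
but the blank line still disappears.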

I wouldn't know about any option for keeping the blank line. What you could do:
(1) If you need the named chunk kept together, you could simply put a # at the beginning of the supposedly blank line:
<<setup, eval=FALSE>>=
## some comment
a <- 1
b <- 2
#
## some comment (there is no newline before this comment...)
c <- 3
d <- 4
@
Less attractive, but you at least get something similar to a blank line and you keep your chunk.
(2) If you don't care for keeping the chunk together, you can separate it into two blocks:
\documentclass{article}
\usepackage{fancyvrb}
\usepackage{Sweave}
\begin{document}
<<setup, eval=FALSE>>=
## some comment
a <- 1
b <- 2
@

<<eval=FALSE>>=
## some comment (there is no newline before this comment...)
c <- 3
d <- 4
@
\end{document}
This gives you the appearance you like, at the price of losing the structure of your chunk.
Hope this helps,
Rainer

Related

reverse tail for gnuplot

I would like to display data in a repeatedly updated (online) diagram with:
plot "tail -140 logging.dat | tac -r" with lines
I get the error message "file cannot be opened", although on the command line the same pipe prints the reversed data as expected. Could anyone help me with the correct syntax, please?
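To address the error first: gnuplot treats the quoted string as a plain file name, so reading from a shell command needs the pipe prefix <, roughly like this (assuming a Unix-like shell where tail and tac are available):
plot "< tail -140 logging.dat | tac -r" with lines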
Just for the record, here is a gnuplot-only, hence platform-independent, solution. Check the total number of lines via stats. If there are fewer than N lines (here: 140), all lines are plotted, otherwise only the last 140.
Remark: if you do plot ... with lines, gnuplot will by default plot column 1 as x and column 2 as y. The resulting graph looks the same whether you reverse the data or not, so I don't see why reversing would be necessary, unless you want to plot something you haven't shown here or, e.g., list a reversed table as text.
Script:
### plot N last lines of a file
reset session
FILE = "SO55221113.dat"
# create some random test data
set table FILE
set samples 1000
y0 = rand(0)
plot '+' u 1:(y0=y0+rand(0)-0.5)
unset table
N = 140
stats FILE u 0 nooutput # get total number of lines into STATS_records
M = STATS_records<=N ? 0 : STATS_records-N   # index of the first line to plot
plot FILE u 1:2 w l lc rgb "green" ti "all values", \
'' u 1:2 every ::M w l lc rgb "red" ti sprintf("last %d values",N)
### end of script
Result: (plot of all values in green with the last 140 values overlaid in red)

How to do a low RAM full cross join?

I hope to perform a full self cross join on a large data file of points. However, I cannot simply load the file in a programming language, because it does not fit in memory. I would like to find all combinations of points within the set. Below is an example of my dataset.
x y
1 9
2 8
3 7
4 6
5 5
I would like to cross join this data to generate a 25-row table containing all the combinations of points. Is there a low-memory solution, perhaps with awk?
Thank you,
Nicholas Hayden
P.S. I am a novice programmer.
Perhaps in two steps: create a header file plus column1 and column2 files, then join column1 with column2 and append the result to the header file.
awk 'NR==1{print > "cross"} NR>1 {print $1 > "col1"; print $2 > "col2"}' file
join -j9 col1 col2 -o1.1,2.1 >> cross
rm col1 col2
Obviously, make sure the temporary and final file names don't clash with existing ones.
Note: the join command on macOS doesn't have the -j option, so change it to the equivalent long form
join -19 -29 col1 col2 -o1.1,2.1 >> cross
In both alternatives we're asking join to use the non-existent 9th field as the key, which matches every line of the first file with every line of the second and generates the cross product of the two files.
If the memory usage wasn't an issue I'd probably do this:
$ awk 'NR==1 { print; next }             # print the header
       { x[NR]=$1; y[NR]=$2 }            # read data to two hashes x and y
       END { for(i=2;i<=NR;i++)
                 for(j=2;j<=NR;j++)
                     print x[i],y[j]     # print all combinations of x and y
       }' file
Keeping the memory usage low obviously requires keeping data out of memory and that means accessing the file a lot. So while processing FILENAME for x, open the same file with another name (file below) and process that record by record for y:
$ awk 'NR==1 { print; next }               # print header
       { file=FILENAME; x=$1; nr=1         # duplicate FILENAME, keep $1, reset the line counter nr
         while((getline <file) > 0)        # process file record by record
             if(nr++>1) { print x, $2 }    # print $1 of FILENAME and $2 of file
         close(file) }' file               # close the file
x y
1 9
1 8
1 7
1 6
1 5
2 9
...
I'd probably never use that code as it is for anything useful but maybe you can mix those 2 solutions to create something suitable.

Count only a subset of motifs of size k

I want to count the motifs of size 4 in a tree graph:
library(igraph)
g <- barabasi.game(100)
census.motifs <- motifs(g, size=4)[c(4,8,13,30)]
There are 217 possible graphs with 4 vertices, but only 4 of them can appear in a directed rooted tree.
Is there a way to tell igraph that it only has to look for those 4? Or a faster/clever way to do this?
The four motifs in a directed rooted tree could be counted as k-instars using the ergm package http://svitsrv25.epfl.ch/R-doc/library/ergm/html/ergm-terms.html
A k-instar is a set of k nodes all sharing one common root. If n is the number of nodes in your tree, the counts for your 4 motifs will be: the number of 3-instars (fully connected); (n-3) times the number of 2-instars (two edges connecting to the root and one other node); (n-2) choose 2 times the number of 1-instars (one edge connecting to the root and two other nodes); and n choose 4 minus the sum of the previous three counts. In R you could use,
library(intergraph)
library(ergm)
library(igraph)
n <- 100
g <- barabasi.game(n)
kistars <- summary(asNetwork(g) ~ istar(1:3))   # number of 1-, 2- and 3-instars
kistars[3]                                      # motif 1: number of 3-instars
(n-3)*kistars[2]                                # motif 2: (n-3) times the number of 2-instars
choose(n-2,2)*kistars[1]                        # motif 3: choose(n-2,2) times the number of 1-instars
choose(n,4) - (kistars[3] + (n-3)*kistars[2] + choose(n-2,2)*kistars[1])   # motif 4: the remaining subsets

Jaccard distance between tweets

I'm currently trying to measure the Jaccard Distance between tweets in a dataset
The dataset is here:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I've tried a few things to measure the distance
This is what I have so far
I saved the linked dataset to a file called Tweets.json
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")),collapse=",")))
Then I converted json_alldata to tweet.features and got rid of the geo column
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
These are what the first two tweets look like
> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
The first thing I tried was the stringdist function from the stringdist package
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run that, I get
[1] 0.1621622
I'm not sure that's correct, though. A intersection B = 23, and A union B = 25. The Jaccard distance is A intersection B/A union B -- right? So by my calculation, the Jaccard distance should be 0.92?
So I figured I could do it with sets: simply calculate the intersection and the union, and divide.
This is what I tried
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try to compute the intersection, the output is just list():
Intersection <- intersect(A1, A2)
list()
When I try Union, I get this:
union(A1, A2)
[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
This doesn't seem to be grouping the words into a single set.
I figured I'd be able to divide the intersection by the union. But I guess I would need the program to count the number of words in each set, then do the calculations.
Needless to say, I'm a bit stuck and I'm not sure if I'm on the right track.
Any help would be appreciated. Thank you.
intersect and union expect vectors (as.set is not a base R function). I think you want to compare words, so you can use strsplit; how the split is done is up to you. An example below:
tweet.features <- list(
  tweet1 = "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
  tweet2 = "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")
jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this what you want?
Here the strsplit is done on every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
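If you want to go further, here is a rough sketch of one possible refinement (my own choice of splitting rule, not something prescribed by your data: lower-case everything and split on any run of characters that is not a letter, digit, # or @):
jaccard_words <- function(tw1, tw2){
  # lower-case, split on anything that is not a letter, digit, # or @, drop duplicates
  w1 <- unique(tolower(unlist(strsplit(tw1, "[^[:alnum:]#@]+"))))
  w2 <- unique(tolower(unlist(strsplit(tw2, "[^[:alnum:]#@]+"))))
  i <- length(intersect(w1, w2))
  u <- length(union(w1, w2))
  list(i = i, u = u, j = i/u)
}
jaccard_words(tweet.features[[1]], tweet.features[[2]])
For what it's worth, stringdist's "jaccard" method works on sets of character q-grams (q = 1 by default) and returns a distance, i.e. 1 minus the index, which is why it gave 0.16 above rather than a word-based value.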

Get the last line with consecutive pattern

I have a pattern at the start of each line, and I want to get only the last line of each run of consecutive occurrences in the file.
Example file:
apple 1
banana 5
banana 6
apple 2
apple 5
apple 7
banana 9
Expected output:
apple 1
banana 6
apple 7
banana 9
Assuming that each line is a proper Tcl list, it's a matter of remembering the previous line and printing it when its first element differs from the current one.
gets $fin oldline;                 # Assume there's at least one line for simplicity of coding
while {[gets $fin newline] >= 0} {
    if {[lindex $newline 0] ne [lindex $oldline 0]} {
        puts $oldline;             # There was a difference, so print out the old one
    }
    set oldline $newline;          # Save the new line we read for the next iteration
}
puts $oldline;                     # The last line to be read hasn't been printed yet
Determining whether two lines are the same is the main problem; it's likely to be more complex with real data than just applying lindex. This is where you get into using regexp or scan to parse the data, and how you do that is a non-trivial problem that requires actually understanding the format of the real data.
Dealing with the case of having no lines at all is a separate matter. Do that by checking for the return value of that initial gets, and if it is less than zero, not going into the loop or printing the final value at all.
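Putting it together, a minimal sketch (assuming the records are in a plain text file called data.txt, a name I made up; the empty-file check is the one just described):
set fin [open "data.txt" r]
if {[gets $fin oldline] >= 0} {
    # Only enter the loop (and print at the end) if there was at least one line
    while {[gets $fin newline] >= 0} {
        if {[lindex $newline 0] ne [lindex $oldline 0]} {
            puts $oldline
        }
        set oldline $newline
    }
    puts $oldline
}
close $fin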