Is there a Counter object in Julia? - containers

In Python, it's possible to count items in list using a high-performance object collections.Counter:
>>> from collections import Counter
>>> l = [1,1,2,4,1,5,12,1,51,2,5]
>>> Counter(l)
Counter({1: 4, 2: 2, 5: 2, 4: 1, 12: 1, 51: 1})
I've search in http://docs.julialang.org/en/latest/search.html?q=counter but I can't seem to find a counter object.
I've also looked at http://docs.julialang.org/en/latest/stdlib/collections.html but I couldn't find it either.
I've tried the histogram function in Julia and it returned a wave of deprecation messages:
> l = [1,1,2,4,1,5,12,1,51,2,5]
> hist(l)
[out]:
WARNING: sturges(n) is deprecated, use StatsBase.sturges(n) instead.
in depwarn(::String, ::Symbol) at ./deprecated.jl:64
in sturges(::Int64) at ./deprecated.jl:623
in hist(::Array{Int64,1}) at ./deprecated.jl:646
in include_string(::String, ::String) at ./loading.jl:441
in execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/liling.tan/.julia/v0.5/IJulia/src/execute_request.jl:175
in eventloop(::ZMQ.Socket) at /Users/liling.tan/.julia/v0.5/IJulia/src/eventloop.jl:8
in (::IJulia.##13#19)() at ./task.jl:360
while loading In[65], in expression starting on line 1
WARNING: histrange(...) is deprecated, use StatsBase.histrange(...) instead
in depwarn(::String, ::Symbol) at ./deprecated.jl:64
in histrange(::Array{Int64,1}, ::Int64) at ./deprecated.jl:582
in hist(::Array{Int64,1}, ::Int64) at ./deprecated.jl:645
in hist(::Array{Int64,1}) at ./deprecated.jl:646
in include_string(::String, ::String) at ./loading.jl:441
in execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/liling.tan/.julia/v0.5/IJulia/src/execute_request.jl:175
in eventloop(::ZMQ.Socket) at /Users/liling.tan/.julia/v0.5/IJulia/src/eventloop.jl:8
in (::IJulia.##13#19)() at ./task.jl:360
while loading In[65], in expression starting on line 1
WARNING: hist(...) and hist!(...) are deprecated. Use fit(Histogram,...) in StatsBase.jl instead.
in depwarn(::String, ::Symbol) at ./deprecated.jl:64
in #hist!#994(::Bool, ::Function, ::Array{Int64,1}, ::Array{Int64,1}, ::FloatRange{Float64}) at ./deprecated.jl:629
in hist(::Array{Int64,1}, ::FloatRange{Float64}) at ./deprecated.jl:644
in hist(::Array{Int64,1}, ::Int64) at ./deprecated.jl:645
in hist(::Array{Int64,1}) at ./deprecated.jl:646
in include_string(::String, ::String) at ./loading.jl:441
in execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/liling.tan/.julia/v0.5/IJulia/src/execute_request.jl:175
in eventloop(::ZMQ.Socket) at /Users/liling.tan/.julia/v0.5/IJulia/src/eventloop.jl:8
in (::IJulia.##13#19)() at ./task.jl:360
while loading In[65], in expression starting on line 1
**Is there a Counter object in Julia?**

If you are using Julia 0.5+, the histogram functions has been deprecated and you are supposed to use the StatsBase.jl module instead. It is also described in the warning:
WARNING: hist(...) and hist!(...) are deprecated. Use fit(Histogram,...) in StatsBase.jl instead.
But if you are using StatsBase.jl, probably countmap is closer to what you need:
julia> import StatsBase: countmap
julia> countmap([1,1,2,4,1,5,12,1,51,2,5])
Dict{Int64,Int64} with 6 entries:
4 => 1
2 => 2
5 => 2
51 => 1
12 => 1
1 => 4

The DataStructures.jl package also has Accumulators / Counters with a more
general set of methods for using and combining counters.
Once you've added the package
using Pkg
Pkg.add("DataStructures")
you can count the elements of a sequence by constructing a counter
# generate some data to count
using Random
seq = [ Random.randstring('a':'c', 2) for _ in 1:100 ]
# count the elements in seq
using DataStructures
counts = counter(seq)

Related

Why the parsed dicts are equal while the pickled dicts are not?

I'm working on an aggregated config file parsing tool, hoping it can support .json, .yaml and .toml files. So, I have done the next tests:
The example.json config file is as:
{
"DEFAULT":
{
"ServerAliveInterval": 45,
"Compression": true,
"CompressionLevel": 9,
"ForwardX11": true
},
"bitbucket.org":
{
"User": "hg"
},
"topsecret.server.com":
{
"Port": 50022,
"ForwardX11": false
},
"special":
{
"path":"C:\\Users",
"escaped1":"\n\t",
"escaped2":"\\n\\t"
}
}
The example.yaml config file is as:
DEFAULT:
ServerAliveInterval: 45
Compression: yes
CompressionLevel: 9
ForwardX11: yes
bitbucket.org:
User: hg
topsecret.server.com:
Port: 50022
ForwardX11: no
special:
path: C:\Users
escaped1: "\n\t"
escaped2: \n\t
and the example.toml config file is as:
[DEFAULT]
ServerAliveInterval = 45
Compression = true
CompressionLevel = 9
ForwardX11 = true
['bitbucket.org']
User = 'hg'
['topsecret.server.com']
Port = 50022
ForwardX11 = false
[special]
path = 'C:\Users'
escaped1 = "\n\t"
escaped2 = '\n\t'
Then, the test code with output is as:
import pickle,json,yaml
# TOML, see https://github.com/hukkin/tomli
try:
import tomllib
except ModuleNotFoundError:
import tomli as tomllib
path = "example.json"
with open(path) as file:
config1 = json.load(file)
assert isinstance(config1,dict)
pickled1 = pickle.dumps(config1)
path = "example.yaml"
with open(path, 'r', encoding='utf-8') as file:
config2 = yaml.safe_load(file)
assert isinstance(config2,dict)
pickled2 = pickle.dumps(config2)
path = "example.toml"
with open(path, 'rb') as file:
config3 = tomllib.load(file)
assert isinstance(config3,dict)
pickled3 = pickle.dumps(config3)
print(config1==config2) # True
print(config2==config3) # True
print(pickled1==pickled2) # False
print(pickled2==pickled3) # True
So, my question is, since the parsed obj are all dicts, and these dicts are equal to each other, why their pickled codes are not the same, i.e., why is the pickled code of the dict parsed from json different to other two?
Thanks in advance.
The difference is due to:
The json module performing memoizing for object attributes with the same value (it's not interning them, but the scanner object contains a memo dict that it uses to dedupe identical attribute strings within a single parsing run), while yaml does not (it just makes a new str each time it sees the same data), and
pickle faithfully reproducing the exact structure of the data it's told to dump, replacing subsequent references to the same object with a back-reference to the first time it was seen (among other reasons, this makes it possible to dump recursive data structures, e.g. lst = [], lst.append(lst), without infinite recursion, and reproduce them faithfully when unpickled)
Issue #1 isn't visible in equality testing (strs compare equal with the same data, not just the same exact object in memory). But when pickle sees "ForwardX11" the first time, it inserts the pickled form of the object and emits a pickle opcode that assigns a number to that object. If that exact object is seen again (same memory address, not merely same value), instead of reserializing it, it just emits a simpler opcode that just says "Go find the object associated with the number from last time and put it here as well". If it's a different object though, even one with the same value, it's new, and gets serialized separately (and assigned another number in case the new object is seen again).
Simplifying your code to demonstrate the issue, you can inspect the generated pickle output to see how this is happening:
s = r'''{
"DEFAULT":
{
"ForwardX11": true
},
"FOO":
{
"ForwardX11": false
}
}'''
s2 = r'''DEFAULT:
ForwardX11: yes
FOO:
ForwardX11: no
'''
import io, json, yaml, pickle, pickletools
d1 = json.load(io.StringIO(s))
d2 = yaml.safe_load(io.StringIO(s2))
pickletools.dis(pickle.dumps(d1))
pickletools.dis(pickle.dumps(d2))
Try it online!
The output from that code for the json parsed input is (with # comments inline to point out important things), at least on Python 3.7 (the default pickle protocol and exact pickling format can change from release to release), is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes 'ForwardX11'
38: q BINPUT 3 # Assigns the serialized form the ID of 3
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: h BINGET 3 # Looks up whatever object was assigned the ID of 3
57: \x89 NEWFALSE
58: s SETITEM
59: u SETITEMS (MARK at 5)
60: . STOP
highest protocol among opcodes = 2
while the output from the yaml loaded data is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes as before
38: q BINPUT 3 # and assigns code 3 as before
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: X BINUNICODE 'ForwardX11' # Doesn't see this 'ForwardX11' as being the exact same object, so reserializes
70: q BINPUT 6 # and marks again, in case this copy is seen again
72: \x89 NEWFALSE
73: s SETITEM
74: u SETITEMS (MARK at 5)
75: . STOP
highest protocol among opcodes = 2
printing the id of each such string would get you similar information, e.g., replacing the pickletools lines with:
for k in d1['DEFAULT']:
print(id(k))
for k in d1['FOO']:
print(id(k))
for k in d2['DEFAULT']:
print(id(k))
for k in d2['FOO']:
print(id(k))
will show a consistent id for both 'ForwardX11's in d1, but differing ones for d2; a sample run produced (with inline comments added):
140067902240944 # First from d1
140067902240944 # Second from d1 is *same* object
140067900619760 # First from d2
140067900617712 # Second from d2 is unrelated object (same value, but stored separately)
While I didn't bother checking if toml behaved the same way, given that it pickles the same as the yaml, it's clearly not attempting to dedupe strings; json is uniquely weird there. It's not a terrible idea that it does so mind you; the keys of a JSON dict are logically equivalent to attributes on an object, and for huge inputs (say, 10M objects in an array with the same handful of keys), it might save a meaningful amount of memory on the final parsed output by deduping (e.g. on CPython 3.11 x86-64 builds, replacing 10M copies of "ForwardX11" with a single copy would reduce 590 MB for string data to just 59 bytes).
As a side-note: This "dicts are equal, pickles are not" issue could also occur:
When the two dicts were constructed with the same keys and values, but the order in which the keys were inserted differed (modern Python uses insertion-ordered dicts; comparisons between them ignore ordering, but pickle would be serializing them in whatever order they iterate in naturally).
When there are objects which compare equal but have different types (e.g. set vs. frozenset, int vs. float); pickle would treat them separately, but equality tests would not see a difference.
Neither of these is the issue here (both json and yaml appear to be constructing in the same order seen in the input, and they're parsing the ints as ints), but it's entirely possible for your test of equality to return True, while the pickled forms are unequal, even when all the objects involved are unique.

Julia CSV.read not recognizing "select" keyword

I am reading in a space-delimited file using the CSV library in Julia.
edgeList = CSV.read(
joinpath(dataDirectory, "out.file"),
types=[Int, Int],
header=["node1", "node2"],
skipto=3,
select=[1,2]
)
This yields the following error:
MethodError: no method matching CSV.File(::String; types=DataType[Int64, Int64], header=["node1", "node2"], skipto=3, select=[1, 2])
Closest candidates are:
CSV.File(::Any; header, normalizenames, datarow, skipto, footerskip, limit, transpose, comment, use_mmap, ignoreemptylines, missingstrings, missingstring, delim, ignorerepeated, quotechar, openquotechar, closequotechar, escapechar, dateformat, decimal, truestrings, falsestrings, type, types, typemap, categorical, pool, strict, silencewarnings, threaded, debug, parsingdebug, allowmissing) at /Users/n.jordanjameson/.julia/packages/CSV/4GOjG/src/CSV.jl:221 got unsupported keyword argument "select"
I am using Julia v. 1.6.2. Here is the output versioninfo():
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i7-5650U CPU # 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)
The version of CSV is 0.10.4. The wiki for this version of CSV is here: https://csv.juliadata.org/stable/reading.html#CSV.read, and it has a select / drop entry.
The file I am trying to read is from here: http://konect.cc/networks/moreno_crime/ (the file I'm using is called "out.moreno_crime_crime"). The first few lines are:
% bip unweighted
% 1476 829 551
1 1
1 2
1 3
1 4
2 5
2 6
2 7
2 8
2 9
2 10
I get a different error than you, can you restart Julia and make sure?
julia> CSV.read("/home/akako/Downloads/moreno_crime/out.moreno_crime_crime"; types=[Int, Int],
header=["node1", "node2"],
skipto=3,
select=[1,2]
)
ERROR: ArgumentError: provide a valid sink argument, like `using DataFrames; CSV.read(source, DataFrame)`
Stacktrace:
[1] read(source::String, sink::Nothing; copycols::Bool, kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:types, :header, :skipto, :select), Tuple{Vector{DataType}, Vector{String}, Int64, Vector{Int64}}}})
# CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:89
[2] top-level scope
# REPL[8]:1
Stacktrace:
this error is telling you you can't CSV.read without a target sink, you might want to use CSV.File
julia> CSV.File("/home/akako/Downloads/moreno_crime/out.moreno_crime_crime"; types=[Int, Int],
header=["node1", "node2"],
skipto=3,
select=[1,2]
)
┌ Warning: thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 1. Parsing extra columns and widening final columnset
└ # CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:579
1476-element CSV.File:
CSV.Row: (node1 = 1, node2 = 1, Column3 = missing)
CSV.Row: (node1 = 1, node2 = 2, Column3 = missing)
CSV.Row: (node1 = 1, node2 = 3, Column3 = missing)
CSV.Row: (node1 = 1, node2 = 4, Column3 = missing)

Nesting displayed when mapping collection in HAML

I'm working on a Rails application, using HAML. I needed to present a list to the user, when I stumbled upon this weird behaviour:
This code generates no output, obviously.
- 5.times.map do |n|
- n + 1
This code generates 1 2 3 4 5, as it's outputting the return value for the inner statement.
- 5.times.map do |n|
= n + 1 # Notice the =
This code generates [1, 2, 3, 4, 5], as it's outputting the mapped array.
= 5.times.map do |n| # Notice the =
- n + 1
So far, so good...
This code generates 1 2 3 4 5 [0, 0, 0, 0, 0], as it doesn't like humans.
= 5.times.map do |n| # Notice the =
= n + 1 # on both lines
This code generates 1 2 3 4 5 [1, 1, 1, 1, 1], as it's nested in the <p>.
%p # Here we are nested one <p> deep
= 5.times.map do |n|
= n + 1
This code generates 1 2 3 4 5 [2, 2, 2, 2, 2], as it's double nested (and so on).
%p # Here we are nested two <p> deep
%p
= 5.times.map do |n|
= n + 1
Does anyone have an explanation for what is happening here in the HAML internals? Is = n+1 both adding a string to some output buffer, and then returning it's nesting level?
Actually, what you are asking is very version dependent. I mean it will produce different results in upcoming version of Haml (v4.1.0).
But first let's find out why this output in version < 4.1.0?
Under the hood, Haml translates templates into executable Ruby code. This job is done in Haml::Compiler class and you can easily debug this code:
require 'haml'
puts Haml::Engine.new(%q{
%p
%p
%p
= 5.times.map do |i|
= i
}).render # => produces the output
# you have in last example
to find out what the associated Ruby code looks like.
With some simplifications, it looks like this:
haml_temp = 5.times.map do |i|
_hamlout.push_text(" #{_hamlout.format_script_false_false_false_false_false_true_false(( i ));}\n", 0, false);
end
You can easily see by now, that your values are mapped to the expression:
_hamlout.push_text(...)
_hamlout here is an instance of Haml::Buffer and this is the last line, which returns value from the push_text method:
#real_tabs += tab_change
#real_tabs is indent level. If we use no indent, then it's 0, when one %p is involved, it becomes 1 and so on. tab_change argument is 0 (see debugged code).
So the output for Haml version 4.0.7 is equal to the level of nesting. This is exactly how your output looks like.
But this behavior will be probably "broken" in upcoming version 4.1.0. Compare the last line of the same method from current master branch:
#buffer << text
which will return some textual value.

Rpy2 - Select Results and Output to CSV File

I'm currently doing Cox Proportional Hazards Modeling using Rpy2 - I imagine my question will cover other functions and the results from calling them as well though.
After I run the function, I have a variable which contains the results from the function, in the form of a vector. I have tried explicitly converting this to a DataFrame (resultsDataFrame = DataFrame(resultVector)). There are no errors returned when doing this. However, when I do resultsDataFrame.to_csvfile(filename) I get the following error:
Traceback (most recent call last):
File "<pyshell#171>", line 1, in <module>
modelFrame.to_csvfile('/Users/fortylashes/Documents/Matthews_Research/Cox_PH/ResultOutput_Exp1.csv')
File "/Library/Python/2.7/site-packages/rpy2/robjects/vectors.py", line 1031, in to_csvfile
'col.names': col_names, 'qmethod': qmethod, 'append': append})
RRuntimeError: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""coxph"" to a data.frame
Furthermore, when I simply do:
for result in resultVector:
print (result)
I get an extremely long list of results- including information on each entry in the dataset used in the model, for each variable (so 9,000 records x 9 variables = 81,000 unneeded results). The results I really need are at the bottom of this vector and look like this:
coef exp(coef) se(coef) z p
age_age6574 -0.057775 0.944 0.05469 -1.056 2.9e-01
age_age75plus -0.020795 0.979 0.04891 -0.425 6.7e-01
sex_female -0.005304 0.995 0.03961 -0.134 8.9e-01
stage_late -0.261609 0.770 0.04527 -5.779 7.5e-09
access -0.000494 1.000 0.00069 -0.715 4.7e-01
Likelihood ratio test=36.6 on 5 df, p=7.31e-07 n= 9752, number of events= 2601
*NOTE: There were several more variables for which data was reported in the initial results (the 9,000 x 9 that I was talking about) but weren't actually used in the model.
I was wondering if there was a way to explicitly get this data, put it in one long ordered row, and then output it to a csv file?
::::UPDATE::::
When I call theModel.names I get a list of the various measures which can be called by numerical index:
[1] "coefficients" "var" "loglik"
[4] "score" "iter" "linear.predictors"
[7] "residuals" "means" "concordance"
[10] "method" "n" "nevent"
[13] "terms" "assign" "wald.test"
[16] "y" "formula" "call"
From this I can get the coefficients, which can then be exponentiated. I have not found, however, the p-value, the z score or the likelihood test ratio, which I will need.

find function matlab in numpy/scipy

Is there an equivalent function of find(A>9,1) from matlab for numpy/scipy. I know that there is the nonzero function in numpy but what I need is the first index so that I can use the first index in another extracted column.
Ex: A = [ 1 2 3 9 6 4 3 10 ]
find(A>9,1) would return index 4 in matlab
The equivalent of find in numpy is nonzero, but it does not support a second parameter.
But you can do something like this to get the behavior you are looking for.
B = nonzero(A >= 9)[0]
But if all you are looking for is finding the first element that satisfies a condition, you are better off using max.
For example, in matlab, find(A >= 9, 1) would be the same as [~, idx] = max(A >= 9). The equivalent function in numpy would be the following.
idx = (A >= 9).argmax()
matlab's find(X, K) is roughly equivalent to numpy.nonzero(X)[0][:K] in python. #Pavan's argmax method is probably a good option if K == 1, but unless you know apriori that there will be a value in A >= 9, you will probably need to do something like:
idx = (A >= 9).argmax()
if (idx == 0) and (A[0] < 9):
# No value in A is >= 9
...
I'm sure these are all great answers but I wasn't able to make use of them. However, I found another thread that partially answers this:
MATLAB-style find() function in Python
John posted the following code that accounts for the first argument of find, in your case A>9 ---find(A>9,1)-- but not the second argument.
I altered John's code which I believe accounts for the second argument ",1"
def indices(a, func):
return [i for (i, val) in enumerate(a) if func(val)]
a = [1,2,3,9,6,4,3,10]
threshold = indices(a, lambda y: y >= 9)[0]
This returns threshold=3. My understanding is that Python's index starts at 0... so it's the equivalent of matlab saying 4. You can change the value of the index being called by changing the number in the brackets ie [1], [2], etc instead of [0].
John's original code:
def indices(a, func):
return [i for (i, val) in enumerate(a) if func(val)]
a = [1, 2, 3, 1, 2, 3, 1, 2, 3]
inds = indices(a, lambda x: x > 2)
which returns >>> inds [2, 5, 8]
Consider using argwhere in Python to replace MATLAB's find function. For example,
import numpy as np
A = [1, 2, 3, 9, 6, 4, 3, 10]
np.argwhere(np.asarray(A)>=9)[0][0] # Return first index
returns 3.
import numpy
A = numpy.array([1, 2, 3, 9, 6, 4, 3, 10])
index = numpy.where(A >= 9)
You can do this by first convert the list to an ndarray, then using the function numpy.where() to get the desired index.