I hope to perform a full self cross join on a large data file of points. However, I cannot simply load the file in a programming language, because it does not fit in memory. I would like to find all combinations of points within the set. Below is an example of my dataset.
x y
1 9
2 8
3 7
4 6
5 5
I would like to cross join this data to generate a 25-row table containing all the combinations of points. Is there a low-memory solution, perhaps with awk?
Thank you,
Nicholas Hayden
P.S. I am a novice programmer.
Perhaps in two steps: create a header file and column1 and column2 files, then join column1 with column2 and append the result to the header file:
awk 'NR==1{print > "cross"} NR>1 {print $1 > "col1"; print $2 > "col2"}' file
join -j9 col1 col2 -o1.1,2.1 >> cross
rm col1 col2
Obviously make sure the temp and final file names won't clash with existing ones.
Note: the join command on macOS doesn't have the -j option, so change it to the equivalent long form:
join -19 -29 col1 col2 -o1.1,2.1 >> cross
In both alternatives we're asking join to use the non-existent 9th field as the key, which matches every line of the first file to every line of the second and generates the cross product of the two files.
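For example, with two tiny single-column files the same trick yields all four pairings (output as produced by GNU join; the file names are just for illustration):
$ printf '1\n2\n' > col1
$ printf '9\n8\n' > col2
$ join -j9 col1 col2 -o1.1,2.1
1 9
1 8
2 9
2 8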
If the memory usage wasn't an issue I'd probably do this:
$ awk 'NR==1 { print; next }              # print the header
       { x[NR]=$1; y[NR]=$2 }             # read data to two hashes x and y
       END {
           for(i=2;i<=NR;i++)
               for(j=2;j<=NR;j++)
                   print x[i],y[j]         # print all combinations of x and y
       }' file
Keeping the memory usage low obviously requires keeping data out of memory and that means accessing the file a lot. So while processing FILENAME for x, open the same file with another name (file below) and process that record by record for y:
$ awk 'NR==1 { print; next }               # print header
       { file=FILENAME; x=$1; nr=1         # duplicate FILENAME, keep $1, create local nr
         while((getline < file) > 0)       # process file record by record
             if(nr++>1) { print x, $2 }    # print $1 of FILENAME and $2 of file
         close(file) }' file               # close the file
x y
1 9
1 8
1 7
1 6
1 5
2 9
...
I'd probably never use that code as it is for anything useful but maybe you can mix those 2 solutions to create something suitable.
I would like to use xattr to store some metadata directly on my files. These are essentially tags that I use for the categorization of files when I do searches on them. My goal is to extend the usual Mac OS X tags by associating more info to each tag, for instance the date of addition of that tag and maybe other things.
I was thinking of adding an xattr to the files, using xattr -w. My first guess would be to store something like JSON in this xattr value, but I was wondering:
1) What are the limits on the size I can store in an xattr? (The man page of xattr is vague and refers to something called _PC_XATTR_SIZE_BITS, which I cannot locate anywhere.)
2) Is there anything wrong with storing a JSON-formatted string as an xattr?
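For example, something along these lines (the attribute name and JSON payload here are just made up for illustration):
xattr -w com.example.mytag '{"tag":"project-x","added":"2018-01-15"}' somefile.txt
xattr -p com.example.mytag somefile.txt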
According to man pathconf, there is a “configurable system limit or option variable” called _PC_XATTR_SIZE_BITS which is
the number of bits used to store maximum extended attribute size in bytes. For
example, if the maximum attribute size supported by a file system is 128K, the value
returned will be 18. However a value 18 can mean that the maximum attribute size can be
anywhere from (256KB - 1) to 128KB. As a special case, the resource fork can have much
larger size, and some file system specific extended attributes can have smaller and preset
size; for example, Finder Info is always 32 bytes.
You can determine the value of this parameter using this small command line tool written in Swift 4:
import Foundation
let args = CommandLine.arguments.dropFirst()
guard let pathArg = args.first else {
print ("File path argument missing!")
exit (EXIT_FAILURE)
}
let v = pathconf(pathArg, _PC_XATTR_SIZE_BITS)
print ("_PC_XATTR_SIZE_BITS: \(v)")
exit (EXIT_SUCCESS)
I get:
31 bits for HFS+ on OS X 10.11
64 bits for APFS on macOS 10.13
as the number of bits used to store maximum extended attribute size. These imply that the actual maximum xattr sizes are somewhere in the ranges
1 GiB ≤ maximum < 2 GiB for HFS+ on OS X 10.11
8 EiB ≤ maximum < 16 EiB for APFS on macOS 10.13
I seem to be able to write at least 260kB, like this, by generating 260kB of nulls and converting them to the letter a so I can see them:
xattr -w myattr "$(dd if=/dev/zero bs=260000 count=1|tr '\0' a)" fred
1+0 records in
1+0 records out
260000 bytes transferred in 0.010303 secs (25235318 bytes/sec)
And then read them back with:
xattr -l fred
myattr: aaaaaaaaaaaaaaaaaa...aaa
And check the length returned:
xattr -l fred | wc -c
260009
I suspect this is actually the ARG_MAX limit on the command line:
sysctl kern.argmax
kern.argmax: 262144
Also, just because you can store 260kB in an xattr, that does not mean it is advisable. I don't know about HFS+, but on some Unixy filesystems, the attributes can be stored directly in the inode, but if you go over a certain limit, additional space has to be allocated on disk for the data.
With the advent of High Sierra and APFS replacing HFS+, be sure to test on both filesystems. Also make sure that Time Machine backs up and restores the data, and that utilities such as ditto, tar and the Finder propagate the attributes when copying/moving/archiving files.
Also consider what happens when you email a tagged file, or copy it to a FAT-formatted USB memory stick.
I also tried setting multiple attributes on a single file, and the following script successfully wrote 1,000 attributes (called attr-1, attr-2 ... attr-1000), each of 260kB, to a single file, meaning that the file effectively carries 260MB of attributes:
#!/bin/bash
for ((a=1;a<=1000;a++)) ; do
    echo Setting attr-$a
    xattr -w attr-$a "$(dd if=/dev/zero bs=260000 count=1 2> /dev/null | tr '\0' a)" fred
    if [ $? -ne 0 ]; then
        echo ERROR: Failed to set attr
        exit
    fi
done
These can all be seen and read back too - I checked.
I have a pattern at the start of each line, and when the same pattern occurs on consecutive lines I want to keep only the last line of that run.
Example file:
apple 1
banana 5
banana 6
apple 2
apple 5
apple 7
banana 9
Expected output:
apple 1
banana 6
apple 7
banana 9
Assuming that each line is a proper list, it's a matter of remembering the last line and printing the previous value when it is different to the current one.
gets $fin oldline; # Assume there's at least one line for simplicity of coding
while {[gets $fin newline] >= 0} {
    if {[lindex $newline 0] ne [lindex $oldline 0]} {
        puts $oldline;          # There was a difference, so print out the old one
    }
    set oldline $newline;       # Save the new line we read for the next iteration
}
puts $oldline; # The last line to be read hasn't been printed yet
Determining whether two lines are the same is the main problem; it's likely to be more complex with real data than just applying lindex. This is where you get into using regexp or scan to parse the data, and how you do that is a non-trivial problem that requires actually understanding the format of the real data.
Dealing with the case of having no lines at all is a separate matter. Handle it by checking the return value of that initial gets and, if it is less than zero, skipping both the loop and the final puts.
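For example, a minimal sketch of that guard around the loop above:
if {[gets $fin oldline] >= 0} {
    while {[gets $fin newline] >= 0} {
        if {[lindex $newline 0] ne [lindex $oldline 0]} {
            puts $oldline
        }
        set oldline $newline
    }
    puts $oldline;              # only reached when at least one line was read
}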
I know that you can "hack" nested associative arrays in Tcl, and I also know that with dictionaries (which I have no experience with) you can nest them pretty easily. I'm trying to find a way to store the values of a function that has two variables, and then at the end I just want to print out a table of the two variables (column and row headers) with the function values in the cells. I can make this work, but it is neither succinct nor efficient.
Here's what should be printed. The rows are values of a and the columns are values of b (1,2,3,4,5 for simplicity):
                                b
            1       2       3       4       5
      1  y(1,1)  y(1,2)  y(1,3)  y(1,4)  y(1,5)
      2  y(2,1)  y(2,2)  y(2,3)  y(2,4)  y(2,5)
a     3  y(3,1)  y(3,2)  y(3,3)  y(3,4)  y(3,5)
      4  y(4,1)  y(4,2)  y(4,3)  y(4,4)  y(4,5)
      5  y(5,1)  y(5,2)  y(5,3)  y(5,4)  y(5,5)
To store this, I imagine I would simply do two nested for loops over a and b and somehow store the results in nested dictionaries. For example, one dictionary with 5 entries, one for each value of a, where each entry is itself another dictionary with an entry for each value of b.
To print it, the only way I can think of is to just explicitly print out each table line and call each dictionary entry. I'm not too versed in output formatting with Tcl, but I can probably manage there.
Can anyone think of a more elegant way to do this?
Here are a couple of examples of how you might use the struct::matrix package.
Example 1 - Simple Create/Display
package require struct::matrix
package require Tclx
# Create a 3x4 matrix
set m [::struct::matrix]
$m add rows 3
$m add columns 4
# Populate data
$m set rect 0 0 {
{1 2 3 4}
{5 6 7 8}
{9 10 11 12}
}
# Display it
puts "Print matrix, cell by cell:"
loop y 0 [$m rows] {
    loop x 0 [$m columns] {
        puts -nonewline [format "%4d" [$m get cell $x $y]]
    }
    puts ""
}
Output
Print matrix, cell by cell:
   1   2   3   4
   5   6   7   8
   9  10  11  12
Discussion
In the first part of the script, I create a matrix and add 3 rows and 4 columns, a straightforward process.
Next, I call set rect to populate the matrix with data. Depending on your needs, you might want to look into set cell, set column, or set row. For more information, please consult the struct::matrix reference.
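For instance, a single row or cell can be replaced in one call (a small sketch using the matrix $m created above; the values are arbitrary):
$m set row 1 {50 60 70 80}
$m set cell 0 2 99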
When it comes to displaying the matrix, instead of using Tcl's for command, I prefer the loop command from the Tclx package, which is simpler to read and use.
Example 2 - Read from a CSV file
package require csv
package require struct::matrix
package require Tclx
# Read a matrix from a CSV file
set m [::struct::matrix]
set fileHandle [open data.csv]
::csv::read2matrix $fileHandle $m "," auto
close $fileHandle
# Displays the matrix
loop y 0 [$m rows] {
    loop x 0 [$m columns] {
        puts -nonewline [format "%4d" [$m get cell $x $y]]
    }
    puts ""
}
The data file, data.csv:
1,2,3,4
5,6,7,8
9,10,11,12
Output
   1   2   3   4
   5   6   7   8
   9  10  11  12
Discussion
The csv package provides a simple way to read from a CSV file to a matrix.
The heart of the operation is in the ::csv::read2matrix command, but before that, I have to create an empty matrix and open the file for reading.
The code to display the matrix is the same as in the previous example.
Conclusion
While the struct::matrix package seems complicated at first, I only needed to learn a couple of commands to get started.
Elegance is in the eye of the beholder :)
With basic core Tcl, I think you understand your options reasonably well. Both arrays and nested dictionaries have clunky edges when it comes to tabular data.
If you are willing to explore extensions (and Tcl is all about extensions) then you might consider the matrix package from the standard Tcl library. It deals with rows and columns as key concepts. If you need to do transformations on tabular data then I would suggest TclRAL, a relational algebra library that defines a Relation data type, handles all kinds of tabular data, and provides a large number of operations on it. Alternatively, you could try something like SQLite, which will also handle tabular data, provides for manipulating it, and has robust persistent storage; a small sketch is below. The Tcl wiki will direct you to details of all of these extensions.
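A minimal sketch of the SQLite route, using the standard Tcl sqlite3 binding (the table and column names are invented for illustration):
package require sqlite3
sqlite3 db :memory:
db eval {CREATE TABLE y(a INTEGER, b INTEGER, val REAL)}
for {set a 1} {$a <= 5} {incr a} {
    for {set b 1} {$b <= 5} {incr b} {
        db eval {INSERT INTO y VALUES($a, $b, $a * $b)}
    }
}
# Print one row of the table, e.g. all values for a == 3, ordered by b
db eval {SELECT val FROM y WHERE a = 3 ORDER BY b} r {
    puts -nonewline "\t$r(val)"
}
puts ""
db close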
However, if these seem too heavyweight for your taste or if you don't want to suffer the learning curve, rolling up your sleeves and banging out an array or nested dictionary solution, while certainly being rather ad hoc, is probably not that difficult. Elegant? Well, that's for you to judge.
Nested lists work reasonably well for tabular data from 8.4 onwards (with multi-index lindex and lset) provided you've got compact numeric indices. 8.5's lrepeat is good for constructing an initial matrix too.
set mat [lrepeat 5 [lrepeat 5 0.0]]
lset mat 2 3 1.3
proc printMatrix {mat} {
    set height [llength $mat]
    set width [llength [lindex $mat 0]]
    for {set j 0} {$j < $width} {incr j} {
        puts -nonewline \t$j
    }
    puts ""
    for {set i 0} {$i < $height} {incr i} {
        puts -nonewline $i
        for {set j 0} {$j < $width} {incr j} {
            puts -nonewline \t[lindex $mat $i $j]
        }
        puts ""
    }
}
printMatrix $mat
You should definitely consider using the struct::matrix and report packages from tcllib.
package require csv
package require struct::matrix
array set OPTS [subst {
    csv_input_filename {input.csv}
}]
::struct::matrix indata
set chan [open $OPTS(csv_input_filename)]
csv::read2matrix $chan indata , auto
close $chan
# prints matrix as list format
puts [join [indata get rect 0 0 end end] \n];
# prints matrix as csv format
puts [csv::joinmatrix indata]
# cleanup
indata destroy
These are one-liner ways to print out a matrix in list or CSV format, respectively.
Write the shortest program that calculates the Frobenius number for a given set of positive numbers. The Frobenius number is the largest number that cannot be written as a sum of positive multiples of the numbers in the set.
Example: For the set of the Chicken McNugget™ sizes [6,9,20] the Frobenius number is 43, as there is no solution for the equation a*6 + b*9 + c*20 = 43 (with a,b,c >= 0), and 43 is the largest value with this property.
It can be assumed that a Frobenius number exists for the given set. If this is not the case (e.g. for [2,4]) no particular behaviour is expected.
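(For clarity, a straightforward, un-golfed reference sketch in Python: it uses the product of the inputs as the search limit, the same upper bound several of the answers below use, and assumes a Frobenius number exists.)
from math import prod

def frobenius(nums):
    limit = prod(nums)                      # crude upper bound, as in the golfed answers
    reachable = [False] * (limit + 1)
    reachable[0] = True                     # the empty sum
    for i in range(1, limit + 1):
        reachable[i] = any(i >= n and reachable[i - n] for n in nums)
    return max(i for i, ok in enumerate(reachable) if not ok)

print(frobenius([6, 9, 20]))                # 43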
References:
http://en.wikipedia.org/wiki/Coin_problem
http://mathworld.wolfram.com/FrobeniusNumber.html
[Edit]
I decided to accept the GolfScript version. While the MATHEMATICA version might be considered "technically correct", it would clearly take the fun out of the competition. That said, I'm also impressed by the other solutions, especially Ruby (which was very short for a general purpose language).
Mathematica 0 chars (or 19 chars counting the invoke command)
Invoke with
FrobeniusNumber[{a,b,c,...}]
Example
In[3]:= FrobeniusNumber[{6, 9, 20}]
Out[3]= 43
Is it a record? :)
Ruby 100 86 80 chars
(newline not needed)
Invoke with frob.rb 6 9 20
a=$*.map &:to_i;
p ((1..eval(a*"*")).map{|i|a<<i if(a&a.map{|v|i-v})[0];i}-a)[-1]
Works just like the Perl solution (except better:). $* is an array of command line strings; a is the same array as ints, which is then used to collect all the numbers which can be made; eval(a*"*") is the product, the max number to check.
In Ruby 1.9, you can save one additional character by replacing "*" with ?*.
Edit: Shortened to 86 using Symbol#to_proc in $*.map, inlining m and shortening its calculation by folding the array.
Edit 2: Replaced .times with .map, traded .to_a for ;i.
Mathematica PROGRAM - 28 chars
Well, this is a REAL (unnecessary) program. As the other Mathematica entry shows clearly, you can compute the answer without writing a program ... but here it is
f[x__]:=FrobeniusNumber[{x}]
Invoke with
f[6, 9, 20]
43
GolfScript 47/42 chars
Faster solution (47).
~:+{0+{.1<{$}{1=}if|}/.!1):1\{:X}*+0=-X<}do];X(
Slow solution (42). Checks all values up to the product of every number in the set...
~:+{*}*{0+{.1<{$}{1=}if|}/1):1;}*]-1%.0?>,
Sample I/O:
$ echo "[6 9 20]"|golfscript frobenius.gs
43
$ echo "[60 90 2011]"|golfscript frobenius.gs
58349
Haskell 155 chars
The function f does the work and expects the list to be sorted. For example f [6,9,20] = 43
b x n=sequence$replicate n[0..x]
f a=last$filter(not.(flip elem)(map(sum.zipWith(*)a)(b u(length a))))[1..u] where
  h=head a
  l=last a
  u=h*l-h-l
P.S. since that's my first code golf submission I'm not sure how to handle input, what are the rules?
C#, 360 characters
using System;using System.Linq;class a{static void Main(string[]b)
{var c=(b.Select(d=>int.Parse(d))).ToArray();int e=c[0]*c[1];a:--e;
var f=c.Length;var g=new int[f];g[f-1]=1;int h=1;for(;;){int i=0;for
(int j=0;j<f;j++)i+=c[j]*g[j];if(i==e){goto a;}if(i<e){g[f-1]++;h=1;}
else{if(h>=f){Console.Write(e);return;}for(int k=f-1;k>=f-h;k--)
g[k]=0;g[f-h-1]++;h++;}}}}
I'm sure there's a shorter C# solution than this, but this is what I came up with.
This is a complete program that takes the values as command-line parameters and outputs the result to the screen.
Perl 105 107 110 119 122 127 152 158 characters
Latest edit: Compound assignment is good for you!
$h{0}=$t=1;$t*=$_ for@ARGV;for$x(1..$t){$h{$x}=grep$h{$x-$_},@ARGV}@b=grep!$h{$_},1..$t;print pop@b,"\n"
Explanation:
$t = 1;
$t *= $_ foreach(@ARGV);
Set $t to the product of all of the input numbers. This is our upper limit.
foreach $x (1..$t)
{
    $h{$x} = grep {$_ == $x || $h{$x-$_} } @ARGV;
}
For each number from 1 to $t: If it's one of the input numbers, mark it using the %h hash; otherwise, if there is a marked entry from further back (difference being anything in the input), mark this entry. All marked entries are non-candidates for Frobenius numbers.
@b=grep{!$h{$_}}(1..$t);
Extract all UNMARKED entries. These are Frobenius candidates...
print pop @b, "\n"
...and the last of these, the highest, is our Frobenius number.
Haskell 153 chars
A different take on a Haskell solution. I'm a rank novice at Haskell, so I'd be surprised if this couldn't be shortened.
m(x:a)(y:b)
 |x==y=x:m a b
 |x<y=x:m(y:b)a
 |True=y:m(x:a)b
f d=l!!s-1 where
 l=0:foldl1 m[map(n+)l|n<-d]
 g=minimum d
 s=until(\n->l!!(n+g)-l!!n==g)(+1)0
Call it with, e.g., f [9,6,20].
FrobeniusScript 5 characters
solve
Sadly there does not yet exist any compiler/interpreter for this language.
No params, the interpreter will handle that:
$ echo solve > myProgram
$ frobeniusScript myProgram
6
9
20
^D
Your answer is: 43
$ exit
I have a simple program which reads a bunch of things from the filesystem, filters the results, and prints them. This simple program implements a domain-specific language to make selection easier. This DSL "compiles" down into an execution plan that looks like this (the input was C:\Windows\System32\* OR -md5"ABCDEFG" OR -tf):
Index  Success  Failure  Description
0      S        1        File Matches C:\Windows\System32\*
1      S        2        File MD5 Matches ABCDEFG
2      S        F        File is file. (Not directory)
The filter is applied to the given file; if it succeeds, the index pointer jumps to the index indicated in the success field, and if it fails, the index pointer jumps to the index indicated in the failure field. "S" means that the file passes the filter; "F" means that the file should be rejected.
Of course, a filter based upon a simple file attribute check (!FILE_ATTRIBUTE_DIRECTORY) is much faster than a check based upon the MD5 of the file, which requires opening the file and computing the actual hash. Each filter "opcode" has a time class associated with it; MD5 gets a high timing number, ISFILE gets a low timing number.
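Here is one way such a plan could be represented and evaluated (a sketch; the field names and cost values are illustrative rather than my actual format):
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Opcode:
    test: Callable[[str], bool]   # predicate applied to a file path
    on_success: Union[int, str]   # next index, or "S"/"F" for accept/reject
    on_failure: Union[int, str]
    cost: int                     # time class: MD5 high, ISFILE low

def run_plan(plan, path):
    nxt = 0
    while nxt not in ("S", "F"):
        op = plan[nxt]
        nxt = op.on_success if op.test(path) else op.on_failure
    return nxt == "S"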
I would like to reorder this execution plan so that opcodes that take a long time are executed as rarely as possible. For the above plan, that would mean it would have to be:
Index  Success  Failure  Description
0      S        1        File is file. (Not directory)
1      S        2        File Matches C:\Windows\System32\*
2      S        F        File MD5 Matches ABCDEFG
According to the "Dragon Book", picking the best order of execution for three-address code is an NP-complete problem (at least according to page 511 of the second edition of that text), but in that case they are talking about register allocation and other machine-level issues. In my case, the actual "intermediate code" is much simpler. I'm wondering if a scheme exists that would allow me to reorder the source IL into the optimal execution plan.
Here is another example:
{ C:\Windows\Inf* AND -tp } OR { -tf AND NOT C:\Windows\System32\Drivers* }
Parsed to:
Index  Success  Failure  Description
0      1        2        File Matches C:\Windows\Inf\*
1      S        2        File is a Portable Executable
2      3        F        File is file. (Not directory)
3      F        S        File Matches C:\Windows\System32\Drivers\*
which is optimally:
Index  Success  Failure  Description
0      1        2        File is file. (Not directory)
1      2        S        File Matches C:\Windows\System32\Drivers\*
2      3        F        File Matches C:\Windows\Inf\*
3      S        F        File is a Portable Executable
It sounds like it might be easier to pick an optimal order before compiling down to your opcodes. If you have a parse tree, and it is as "flat" as possible, then you can assign a score to each node and then sort each node's children by the lowest total score first.
For example:
{ C:\Windows\Inf* AND -tp } OR { -tf AND NOT C:\Windows\System32\Drivers* }
        1             2           3                    4

             OR
            /  \
         AND    AND
         / \    / \
        1   2  3   4
You could sort the children of the AND nodes, (1, 2) and (3, 4), by lowest score, assign each AND node the combined score of its children, and then sort the children of the OR node by those scores.
Since AND and OR are commutative, this sorting operation won't change the meaning of your overall expression.
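A minimal sketch of that scoring pass, assuming a hypothetical AST where inner nodes are dicts such as {"op": "AND", "children": [...]} and each leaf carries a precomputed "cost":
def score(node):
    # Leaves return their own cost; inner nodes reorder their children so the
    # cheapest subtree comes first, then return the combined cost.
    if "children" not in node:
        return node["cost"]
    scored = sorted(((score(c), c) for c in node["children"]), key=lambda p: p[0])
    node["children"] = [c for _, c in scored]
    return sum(s for s, _ in scored)

tree = {"op": "OR", "children": [
    {"op": "AND", "children": [{"cost": 5}, {"cost": 1}]},   # e.g. path match, -tp
    {"op": "AND", "children": [{"cost": 1}, {"cost": 5}]},   # e.g. -tf, NOT drivers match
]}
score(tree)   # every node's children are now ordered cheapest-first

Compiling opcodes from the reordered tree then gives the cheap-first plan described above.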
@Greg Hewgill is right, this is easier to perform on the AST than on the intermediate code. Since you want to work on the intermediate code, the first goal is to transform it into a dependency tree (which will look like the AST /shrug).
Start with the leaves; it is probably easiest if you use negated predicates for NOT.
Index  Success  Failure  Description
0      1        2        File Matches C:\Windows\Inf\*
1      S        2        File is a Portable Executable
2      3        F        File is file. (Not directory)
3      F        S        File Matches C:\Windows\System32\Drivers\*
Extract Leaf (anything with both children as S, F, or an extracted Node; insert NOT where required; Replace all references to Leaf with reference to parent node of leaf)
Index  Success  Failure  Description
0      1        2        File Matches C:\Windows\Inf\*
1      S        2        File is a Portable Executable
2      L1       F        File is file. (Not directory)

L1=NOT(cost(child))
         |
  Pred(cost(PATH))
Extract Node (If Success points to Extracted Node use conjunction to join; Failure uses disjunction; Replace all references to Node with reference to resulting root of tree containing Node).
Index  Success  Failure  Description
0      1        L3       File Matches C:\Windows\Inf\*
1      S        L3       File is a Portable Executable

L3=AND L1 L2 (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
          /  \
L1=NOT(cost(child))     L2=IS(cost(child))
          |                      |
 3=Pred(cost(PATH))     2=Pred(cost(ISFILE))
Extract Node
Index  Success  Failure  Description
0      L5       L3       File Matches C:\Windows\Inf\*

L5=OR L3 L4 (cost(Min(L3,L4) + (1.0 - Selectivity(Min(L3,L4))) * Max(L3,L4)))
   /  \
   |   L4=IS(cost(child))
   |          |
   |   1=Pred(cost(PORT_EXE))
   |
L3=AND L1 L2 (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
          /  \
L1=NOT(cost(child))     L2=IS(cost(child))
          |                      |
 3=Pred(cost(PATH))     2=Pred(cost(ISFILE))
Extract Node (In the case where Success and Failure both refer to Nodes, you will have to inject the Node into the tree by pattern matching on the root of the sub-tree defined by the Node)
If root is OR, invert predicate if necessary to ensure reference is Success and inject as conjunction with child not referenced by Failure.
If root is AND, invert predicate if necessary to ensure reference is Failure and inject as disjunction with child root referenced by Success.
Resulting in:
L5=OR L3 L4 (cost(Min(L3,L4) + (1.0 - Selectivity(Min(L3,L4))) * Max(L3,L4)))
   /  \
   |   L4=AND(cost(as for L3))
   |            /  \
   |   L6=IS(cost(child))       L7=IS(cost(child))
   |            |                        |
   |   1=Pred(cost(PORT_EXE))   0=Pred(cost(PATH))
   |
L3=AND L1 L2 (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
          /  \
L1=NOT(cost(child))     L2=IS(cost(child))
          |                      |
 3=Pred(cost(PATH))     2=Pred(cost(ISFILE))