How can I control the size of a RecordBatch precisely? - pyarrow

While using pyarrow to generate a RecordBatch (or Table), I need to construct the data (a list of arrays) first. For example:
import pyarrow as pa

data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['foo', 'bar', 'baz', 'big string xxxxxxxxx']),
    pa.array([True, None, False, True])
]
batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
batch.nbytes
The result is:
the size of the 'batch' is 83 bytes;
the number of rows in the 'batch' is 4;
the number of columns in the 'batch' is 3;
The question is: can I control the size of the RecordBatch so that it does not exceed 64 bytes? And how?
Is there a user API in pyarrow to do this?
Thanks!
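As far as I know, pyarrow exposes no direct byte-size cap: Table.to_batches(max_chunksize=...) limits the number of rows, not bytes. A minimal workaround sketch is to slice the batch row-wise with RecordBatch.slice and grow each slice while it fits the budget; split_batch below is a made-up helper name, and the sketch assumes nbytes of a slice counts only the buffers the slice references (true in recent pyarrow versions):
import pyarrow as pa

def split_batch(batch, max_bytes=64):
    """Split a RecordBatch into zero-copy slices of at most max_bytes
    each (a single oversized row still becomes its own slice)."""
    chunks = []
    start = 0
    while start < batch.num_rows:
        length = 1
        # grow the slice while the next row still fits the budget
        while (start + length < batch.num_rows
               and batch.slice(start, length + 1).nbytes <= max_bytes):
            length += 1
        chunks.append(batch.slice(start, length))
        start += length
    return chunks

chunks = split_batch(batch, max_bytes=64)
print([c.nbytes for c in chunks])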

Related

Get the value at a specific index in PyTorch

I have a ground truth label array of size 5.
y=tensor([958, 85, 244, 182, 294])
I have the output scores array of shape [5, 1000]:
scores = tensor([[ 1.0406, 1.1808, 4.4227, ..., 4.6864, 8.0145, 5.2128],
[ 6.9101, 4.6083, 6.9259, ..., 9.7415, 9.6305, 9.3974],
[ 7.6097, 4.0396, 4.4560, ..., 3.4892, 11.6411, 2],
[ 1.0693, 4.6295, 5.3638, ..., 10.9041, 10.8380, 9.2077],
[ 1.7085, 1.4938, 8.6876, ..., 15.1423, 9.6055, 9.8920]],
grad_fn=<ViewBackward>)
I want the value from the scores array based on the corresponding index of y. So for y[0], which is 958, I want the corresponding value from scores[0] at position 958.
Is there some direct Pytorch function I can use?
Yes, you can do it by using your y array as an index:
scores[torch.arange(5), y]
Alternatively, a more general approach is to use torch.gather:
torch.gather(scores, 1, y[:, None])[:, 0]
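For completeness, a quick self-contained check (with placeholder random scores; only y comes from the question) that both approaches pick scores[i, y[i]] for each row i:
import torch

scores = torch.randn(5, 1000)                  # placeholder scores
y = torch.tensor([958, 85, 244, 182, 294])     # labels from the question

a = scores[torch.arange(5), y]                 # advanced indexing
b = torch.gather(scores, 1, y[:, None])[:, 0]  # gather along dim 1

assert torch.equal(a, b)                       # one value per row, at y[i]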

Groovy: compare two lazy maps/jsons

I have two JSONs/lazy maps in the format shown below. I now need to compare them to find whether there is any difference between them. The reason I combine each set of values into a string is so that the comparison becomes faster, as my actual inputs (i.e. JSON messages) are going to be really large.
reqJson:
[["B1": 100, "B2": 200, "B3": 300, "B4": 400],["B1": 500, "B2": 600, "B3": 700, "B4": 800], ["B1": 900, "B2": 1000, "B3": 2000, "B4": 3000], ["B1": 4000, "B2": 5000, "B3": 6000, "B4": 7000]]
respJson:
[["B1": 100, "B2": 200, "B3": 300, "B4": 400],["B1": 500, "B2": 600, "B3": 700, "B4": 800], ["B1": 900, "B2": 1000, "B3": 2000, "B4": 3000], ["B1": 4000, "B2": 5000, "B3": 6000, "B4": 7000], ["B1": 8000, "B2": 9000, "B3": 10000, "B4": 11000]]
My code looks something like what is shown below, but somehow I am unable to get the desired result and cannot figure out what is going wrong. I take each value from the response JSON and check it against the values in the request JSON to find whether there is a difference.
def diffCounter = 0
Set diffSet = []
respJson.each { respJ ->
    reqJson.any { reqJ ->
        if (respJ.B1 + respJ.B2 + respJ.B3 + respJ.B4 != reqJ.B1 + reqJ.B2 + reqJ.B3 + reqJ.B4) {
            diffCounter += 1
            diffSet << [
                "B1": respJ.B1,
                "B2": respJ.B2,
                "B3": respJ.B3,
                "B4": respJ.B4
            ]
        }
    }
}
println ("Difference Count: "+ diffCounter)
println ("Difference Set: "+ diffSet)
Actual Output:
Difference Count: 5
Difference Set: [[B1:100, B2:200, B3:300, B4:400], [B1:500, B2:600, B3:700, B4:800], [B1:900, B2:1000, B3:2000, B4:3000], [B1:4000, B2:5000, B3:6000, B4:7000], [B1:8000, B2:9000, B3:10000, B4:11000]]
Expected Output:
Difference Count: 1
Difference Set: [["B1": 8000, "B2": 9000, "B3": 10000, "B4": 11000]]
NOTE: It can also happen that the request JSON is bigger than the response JSON; in that case I need to store the difference obtained from the request JSON into the diffSet.
Any inputs/suggestions in this regard will be helpful.
As @daggett mentioned, if your JSONs become more nested/complicated, you will want to use a library to do this job for you.
In your use case of pure Lists of elements (with values that can be concatenated/added to form a unique key for that element) there is no problem with doing it 'manually'.
The problem with your code is that you check if any reqJson entry has a different count, which for 2+ different reqJson entries is always true.
What you really want to check is if there is any matching reqJson entry that has the same count. And if you can't find any matching entry, then you know that entry only exists in respJson.
def diffCounter = 0
Set diffSet = []
respJson.each { respJ ->
    def foundMatching = reqJson.any { reqJ ->
        respJ.B1 + respJ.B2 + respJ.B3 + respJ.B4 == reqJ.B1 + reqJ.B2 + reqJ.B3 + reqJ.B4
    }
    if (!foundMatching) {
        diffCounter += 1
        diffSet << [
            "B1": respJ.B1,
            "B2": respJ.B2,
            "B3": respJ.B3,
            "B4": respJ.B4
        ]
    }
}
println ("Difference Count: "+ diffCounter)
println ("Difference Set: "+ diffSet)
You mention that reqJson can become bigger than respJson and that in that case you want to switch the roles of the two arrays in the comparison, so that you always get the unmatched elements from the larger array. A trick to do this is to start by swapping the two variables around.
if (reqJson.size() > respJson.size()) {
    (reqJson, respJson) = [respJson, reqJson]
}
Note that the time complexity of this algorithm is O(m * n * 2i): it grows with the product of the sizes of the two arrays (m and n, here 5 and 4), times the number of property accesses per comparison (i on each of the two elements, here 4 because there are four Bs), because we potentially check each element of the smaller array once for each element of the bigger array.
So if the arrays are tens of thousands of elements long, this will become very slow. A simple way to speed it up to O(m * i + n * i) would be to:
make a Set smallArrayKeys out of the concatenated messages/added values of the smaller array
iterate through the bigger array and check whether each element's concatenated message is contained in the smallArrayKeys Set; if not, that element only exists in the bigger array (see the sketch after this list).
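A minimal sketch of that approach, reusing the B1..B4 sum as the element key the way the code above does (keyOf is just an illustrative name):
def keyOf = { it.B1 + it.B2 + it.B3 + it.B4 }

// make sure reqJson is the smaller array (same swap trick as above)
if (reqJson.size() > respJson.size()) {
    (reqJson, respJson) = [respJson, reqJson]
}

Set smallArrayKeys = reqJson.collect(keyOf)   // O(m * i)
Set diffSet = respJson.findAll {              // O(n * i) with O(1) lookups
    !smallArrayKeys.contains(keyOf(it))
}.toSet()

println("Difference Count: " + diffSet.size())
println("Difference Set: " + diffSet)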

Return a value of a dictionary where a variable is in between keys or values

I have some data that I think would work best as a dictionary or JSON. The data has an initial category, a, b...z, and five bands within each category.
What I want to be able to do is give a function a category and a value and for the function to return the corresponding band.
I tried to create a dictionary like this, where the value of each band is its lower threshold, i.e. for category 'a', Band 1 is between 0 and 89:
bandings = {
    'a': {
        'Band 1': 0,
        'Band 2': 90,
        'Band 3': 190,
        'Band 4': 420,
        'Band 5': 500
    },
    'b': {
        'Band 1': 0,
        'Band 2': 500,
        'Band 3': 1200,
        'Band 4': 1700,
        'Band 5': 2000
    }
}
So if I was to run a function:
lookup_band(category='a', value=100)
it would return 'Band 2', as 100 is between 90 and 189 in category 'a'
I also experimented with setting keys as ranges, but struggled with how to handle values greater than the max value in Band 5.
I can change the structure of the dictionary or use a different way of referencing the data.
Any ideas, please?
You can structure your data a little differently (using sorted lists instead of dictionaries) and use the bisect module. For example:
from bisect import bisect

bandings = {
    'a': [0, 90, 190, 420, 500],
    'b': [0, 500, 1200, 1700, 2000]
}

def lookup_band(bandings, band, value):
    return 'Band {}'.format(bisect(bandings[band], value))

print(lookup_band(bandings, 'a', 100))   # Band 2
print(lookup_band(bandings, 'b', 1700))  # Band 4
print(lookup_band(bandings, 'b', 9999))  # Band 5
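If you already have the nested dict from the question, you do not need to retype the thresholds. A small sketch (nested_bandings stands for the question's dict-of-dicts; it assumes the band names sort correctly as strings, which 'Band 1'..'Band 5' do):
# derive the sorted threshold lists from the question's nested dict
bandings = {
    category: [bands[name] for name in sorted(bands)]
    for category, bands in nested_bandings.items()
}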

Filter a json data by another array in underscore.js

I have a search field and I want to add some complex functionality using underscore.js.
Sometimes users search for a whole "sentence" like "Samsung galaxy A20s ultra". I want to filter the JSON data using any of the words in the search string, ranking results that contain more of the words first.
Sample data:
var phones = [
    {name: "Samsung A10s", id: 845},
    {name: "Samsung galaxy", id: 839},
    {name: "Nokia 7", id: 814},
    {name: "Samsung S20s ultra", id: 514},
    {name: "Apple iphone ultra", id: 159},
    {name: "LG S20", id: 854}
];
What is the best way to do it in underscore?
In this answer, I'll be building a function searchByRelevance that takes two arguments:
a JSON array of phones with name and id properties, and
a search string,
and which returns a new JSON array, with only the phones of which the name has at least one word in common with the search string, sorted such that the phones with the most common words come first.
Let's first identify all the subtasks and how you could implement them with Underscore. Once we've done that, we can compose them into the searchByRelevance function. In the end, I'll also spend some words on how we might determine what is "best".
Subtasks
Split a string into words
You don't need Underscore for this. Strings have a builtin split method:
"Samsung galaxy A20s ultra".split(' ')
// [ 'Samsung', 'galaxy', 'A20s', 'ultra' ]
However, if you have a whole array of strings and you want to split them all, so you get an array of arrays, you can do so using _.invoke:
_.invoke([
    'Samsung A10s',
    'Samsung galaxy',
    'Nokia 7',
    'Samsung S20s ultra',
    'Apple iphone ultra',
    'LG S20'
], 'split', ' ')
// [ [ 'Samsung', 'A10s' ],
// [ 'Samsung', 'galaxy' ],
// [ 'Nokia', '7' ],
// [ 'Samsung', 'S20s', 'ultra' ],
// [ 'Apple', 'iphone', 'ultra' ],
// [ 'LG', 'S20' ] ]
Find the words that two arrays have in common
If you have two arrays of words,
var words1 = [ 'Samsung', 'galaxy', 'A20s', 'ultra' ],
words2 = [ 'Apple', 'iphone', 'ultra' ];
then you can get a new array with just the words they have in common using _.intersection:
_.intersection(words1, words2) // [ 'ultra' ]
Count the number of words in an array
This is again something you don't need Underscore for:
[ 'Samsung', 'A10s' ].length // 2
But if you have multiple arrays of words, you can get the word counts for all of them using _.map:
_.map([
    [ 'Samsung', 'A10s' ],
    [ 'Samsung', 'galaxy' ],
    [ 'Nokia', '7' ],
    [ 'Samsung', 'S20s', 'ultra' ],
    [ 'Apple', 'iphone', 'ultra' ],
    [ 'LG', 'S20' ]
], 'length')
// [ 2, 2, 2, 3, 3, 2 ]
Sort an array by some criterion
_.sortBy does this. For example, the phones data by id:
_.sortBy(phones, 'id')
// [ { name: 'Apple iphone ultra', id: 159 },
// { name: 'Samsung S20s ultra', id: 514 },
// { name: 'Nokia 7', id: 814 },
// { name: 'Samsung galaxy', id: 839 },
// { name: 'Samsung A10s', id: 845 },
// { name: 'LG S20', id: 854 } ]
To sort descending instead of ascending, you can first sort ascending and then reverse the result using the builtin reverse method:
_.sortBy(phones, 'id').reverse()
// [ { name: 'LG S20', id: 854 },
// { name: 'Samsung A10s', id: 845 },
// { name: 'Samsung galaxy', id: 839 },
// { name: 'Nokia 7', id: 814 },
// { name: 'Samsung S20s ultra', id: 514 },
// { name: 'Apple iphone ultra', id: 159 } ]
You can also pass a criterion function. The function receives the current item and it can do anything, as long as it returns a string or number to use as the rank of the current item. For example, this sorts the phones by the last letter of the name (using _.last):
_.sortBy(phones, function(phone) { return _.last(phone.name); })
// [ { name: 'LG S20', id: 854 },
// { name: 'Nokia 7', id: 814 },
// { name: 'Samsung S20s ultra', id: 514 },
// { name: 'Apple iphone ultra', id: 159 },
// { name: 'Samsung A10s', id: 845 },
// { name: 'Samsung galaxy', id: 839 } ]
Group the elements of an array by some criterion
Instead of sorting directly, we might also first only group the items by a criterion. Here's grouping the phones by the first letter of the name, using _.groupBy and _.first:
_.groupBy(phones, function(phone) { return _.first(phone.name); })
// { S: [ { name: 'Samsung A10s', id: 845 },
// { name: 'Samsung galaxy', id: 839 },
// { name: 'Samsung S20s ultra', id: 514 } ],
// N: [ { name: 'Nokia 7', id: 814 } ],
// A: [ { name: 'Apple iphone ultra', id: 159 } ],
// L: [ { name: 'LG S20', id: 854 } ] }
We have seen that we can pass keys to sort or group by, or a function that returns something to use as a criterion. There is a third option which we can use here instead of the function above:
_.groupBy(phones, ['name', 0])
// { S: [ { name: 'Samsung A10s', id: 845 },
// { name: 'Samsung galaxy', id: 839 },
// { name: 'Samsung S20s ultra', id: 514 } ],
// N: [ { name: 'Nokia 7', id: 814 } ],
// A: [ { name: 'Apple iphone ultra', id: 159 } ],
// L: [ { name: 'LG S20', id: 854 } ] }
Get the keys of an object
This is what _.keys is for:
_.keys({name: "Samsung A10s", id: 845}) // [ 'name', 'id' ]
You can also do this with the standard Object.keys. _.keys works in old environments where Object.keys doesn't. Otherwise, they are interchangeable.
Turn an array of things into other things
We have previously seen the use of _.map to get the lengths of multiple arrays of words. In general, it takes an array or object and something that you want to be done with each element of that array or object, and it will return an array with the results:
_.map(phones, 'id')
// [ 845, 839, 814, 514, 159, 854 ]
_.map(phones, ['name', 0])
// [ 'S', 'S', 'N', 'S', 'A', 'L' ]
_.map(phones, function(phone) { return _.last(phone.name); })
// [ 's', 'y', '7', 'a', 'a', '0' ]
Note the similarity with _.sortBy and _.groupBy. This is a general pattern in Underscore: you have a collection of something and you want to do something with each element, in order to arrive at some sort of result. The thing you want to do with each element is called the "iteratee". Underscore has a function that ensures you can use the same iteratee shorthands in all functions that work with an iteratee: _.iteratee.
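A brief illustration of how the same shorthands all resolve through _.iteratee (phone is one of the sample objects from above):
var phone = {name: 'LG S20', id: 854};

_.iteratee('id')(phone)         // 854 (property shorthand)
_.iteratee(['name', 0])(phone)  // 'L' (property path shorthand)
_.iteratee(function(p) { return _.last(p.name); })(phone) // '0' (function used as-is)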
Sometimes you may want to do something with each element of a collection and combine the results in a way that is different from what _.map, _.sortBy and the other Underscore functions already do. In this case, you can use _.reduce, the most general function of them all. For example, here's how we can create a mixture of the names of the phones, by taking the first letter of the name of the first phone, the second letter of the name of the second phone, and so forth:
_.reduce(phones, function(memo, phone, index) {
    return memo + phone.name[index];
}, '')
// 'Sakse0'
The function that we pass to _.reduce is invoked for each phone. memo is the result that we've built so far. The result of the function is used as the new memo for the next phone that we process. In this way, we build our string one phone at a time. The last argument to _.reduce, '' in this case, sets the initial value of memo so we have something to start with.
Concatenate multiple arrays into a single one
For this we have _.flatten:
_.flatten([
    [ 'Samsung', 'A10s' ],
    [ 'Samsung', 'galaxy' ],
    [ 'Nokia', '7' ],
    [ 'Samsung', 'S20s', 'ultra' ],
    [ 'Apple', 'iphone', 'ultra' ],
    [ 'LG', 'S20' ]
])
// [ 'Samsung', 'A10s', 'Samsung', 'galaxy', 'Nokia', '7',
// 'Samsung', 'S20s', 'ultra', 'Apple', 'iphone', 'ultra',
// 'LG', 'S20' ]
Putting it all together
We have an array of phones and a search string; we want to compare each of those phones to the search string, and finally combine the results so we get the phones sorted by relevance. Let's start with the middle part.
Does "each of those phones" ring a bell? We are creating an iteratee! We want it to take a phone as its argument, and we want it to return the number of words that its name has in common with the search string. This function will do that:
function relevance(phone) {
    return _.intersection(phone.name.split(' '), searchTerms).length;
}
This assumes that there is a searchTerms variable defined outside of the relevance function. It has to be an array with the words in the search string. We'll deal with this in a moment; let's address how to combine our results first.
While there are many ways possible, I think the following is quite elegant. I start with grouping the phones by relevance,
_.groupBy(phones, relevance)
but I want to omit the group of phones that have zero words in common with the search string:
var groups = _.omit(_.groupBy(phones, relevance), '0');
Note that I'm omitting the string key '0', not the number key 0, because the result of _.groupBy is an object, and the keys of an object are always strings.
Now we need to order the remaining groups by the number of matching words. We know the number of matching words for each group by taking the keys of our groups,
_.keys(groups)
and we can sort these ascending first, but we must take care to cast them back to numbers, so that we will sort 2 before 10 (numerical comparison) instead of '10' before '2' (lexicographical comparison):
_.sortBy(_.keys(groups), Number)
then we can reverse this in order to arrive at the final order of our groups.
var tiers = _.sortBy(_.keys(groups), Number).reverse();
Now we just need to transform this sorted array of keys into an array with the actual groups of phones. To do this, we can use _.map and _.propertyOf:
_.map(tiers, _.propertyOf(groups))
Finally, we only need to flatten this into one big array, in order to have our search results by relevance.
_.flatten(_.map(tiers, _.propertyOf(groups)))
Let's wrap all of this up into our searchByRelevance function. Remember that we still needed to define searchTerms outside of our relevance iteratee:
function searchByRelevance(phones, searchString) {
    var searchTerms = searchString.split(' ');
    function relevance(phone) {
        return _.intersection(phone.name.split(' '), searchTerms).length;
    }
    var groups = _.omit(_.groupBy(phones, relevance), '0');
    var tiers = _.sortBy(_.keys(groups), Number).reverse();
    return _.flatten(_.map(tiers, _.propertyOf(groups)));
}
Now put it to the test!
searchByRelevance(phones, 'Samsung galaxy A20s ultra')
// [ { name: 'Samsung galaxy', id: 839 },
// { name: 'Samsung S20s ultra', id: 514 },
// { name: 'Samsung A10s', id: 845 },
// { name: 'Apple iphone ultra', id: 159 } ]
What is "best"?
If you measure "goodness" by the number of lines of code, then less code is generally better. We implemented searchByRelevance above in just eight lines of code, so that seems pretty good.
It is, however, a bit dense. The number of lines increases, but the readability improves a bit, if we use chaining:
function searchByRelevance(phones, searchString) {
    var searchTerms = searchString.split(' ');
    function relevance(phone) {
        return _.intersection(phone.name.split(' '), searchTerms).length;
    }
    var groups = _.chain(phones)
        .groupBy(relevance)
        .omit('0');
    return groups.keys()
        .sortBy(Number)
        .reverse()
        .map(_.propertyOf(groups.value()))
        .flatten()
        .value();
}
Yet another dimension of "goodness" is performance. Could searchByRelevance be faster? To get a sense of this, we usually take the smallest and most frequent operation, and we calculate how often we'll be executing that operation for a given size of input.
The main thing we'll be doing a lot in searchByRelevance is comparing words. This is not the smallest operation, because comparing words consists of comparing letters, but because words in English tend to be short, we can pretend for now that comparing two words is our smallest and most frequent operation. This makes the calculations a bit easier.
For each phone, we will be comparing each word in its name with each word in our search string. If we have 100 phones, and the average phone name has 3 words, and the search string has 5 words, then we will be making 100 * 3 * 5 = 1500 word comparisons.
Computers are fast, so 1500 is nothing. Generally, if the number of times you execute your smallest step remains under 100000 (100k), you probably won't even notice a delay unless that smallest step is very expensive.
However, the number of word comparisons will grow quite explosively with larger inputs. If we have 20000 (20k) phones, 5 words in the average name and a search string of 10 words, we are already making a million word comparisons. That could mean staring at your screen for a few seconds before the results come in.
Can we write a variant of searchByRelevance that can search 20k phones with long names in an eyeblink? Yes, and in fact we can probably also do a million or more! I won't go into the details line by line, but we can get much better speed by using appropriate lookup structures:
// lookup table by word in the name
function createIndex(phones) {
    return _.reduce(phones, function(lookup, phone) {
        _.each(phone.name.split(' '), function(word) {
            var matchingPhones = (lookup[word] || []);
            matchingPhones.push(phone.id);
            lookup[word] = matchingPhones;
        });
        return lookup;
    }, {});
}
// search using lookup tables
function searchByRelevance(phonesById, idsByWord, searchString) {
    var groups = _.chain(searchString.split(' '))
        .map(_.propertyOf(idsByWord))
        .compact()
        .flatten()
        .countBy()
        .pairs()
        .groupBy('1');
    return groups.keys()
        .sortBy(Number)
        .reverse()
        .map(_.propertyOf(groups.value()))
        .flatten(true) // only one level of flattening
        .map('0')
        .map(_.propertyOf(phonesById))
        .value();
}
To use this, we create the lookup tables once, then reuse them for each search. We need to recreate the lookup tables only if the JSON data of phones change.
var phonesById = _.indexBy(phones, 'id');
var idsByWord = createIndex(phones);
searchByRelevance(phonesById, idsByWord, 'Samsung galaxy A20s ultra')
// [ { name: 'Samsung galaxy', id: 839 },
// { name: 'Samsung S20s ultra', id: 514 },
// { name: 'Samsung A10s', id: 845 },
// { name: 'Apple iphone ultra', id: 159 } ]
searchByRelevance(phonesById, idsByWord, 'Apple')
// [ { name: 'Apple iphone ultra', id: 159 } ]
To appreciate how much faster this is, let's count the smallest operations again. In createIndex, the smallest most frequent operation is storing an association between a word and the id of a phone. We do this once for each phone, for each word in its name. In searchByRelevance, the smallest most frequent operation is incrementing the relevance of a given phone in the countBy step. We do this once for each word in the search string, for each phone that matches that word.
We can estimate the number of matching phones for a given search string if we make some reasonable assumptions. The most frequent words in the phone names are probably the brands, such as "Samsung" and "Apple". Since there are at least ten brands, we can assume that the number of phones that match a given search term is generally less than 10% of the total number of phones. So the time it takes to execute one search is the number of words in the search string, times the number of phones, times 10% (i.e., divided by 10).
So if we have 100 phones with on average 3 words in the name, then indexing takes 100 * 3 = 300 times storing an association in the idsByWord lookup table. Performing a search with 5 words in the search string takes only 5 * 100 * 10% = 50 relevance increments. This is already much faster than the 1500 word comparisons we needed without lookup tables, although the human behind the computer will not notice the difference in this case.
The speed advantage of the approach with the lookup table further increases with larger inputs:
┌───────────────────┬───────┬────────┬───────┐
│ Problem size │ Small │ Medium │ Large │
├───────────────────┼───────┼────────┼───────┤
│ phones │ 100 │ 20k │ 1M │
│ words per name │ 3 │ 5 │ 8 │
│ search terms │ 5 │ 10 │ 15 │
├───────────────────┼───────┼────────┼───────┤
│ w/o lookup tables │ │ │ │
│ word comparisons │ 1500 │ 1M │ 120M │
├───────────────────┼───────┼────────┼───────┤
│ w/ lookup tables │ │ │ │
│ associations │ 300 │ 100k │ 8M │
│ increments │ 50 │ 20k │ 1.5M │
└───────────────────┴───────┴────────┴───────┘
This is, in fact, still underestimating the speed advantage, since the percentage of phones that match a given search term is likely to drop as the number of phones increases.
Lookup tables make searching much faster. But is it better? As I said before, for small problem sizes, the speed difference will not be noticeable. A disadvantage of the lookup tables is that they require more code, which makes the solution a bit harder to understand and costlier to maintain. The idsByWord table also grows with the number of associations, which means we will be using much more additional memory than before.
To conclude, what is "best" always depends on a tradeoff between different constraints, such as code size, speed and memory usage. It is up to you to decide how you want to weigh these constraints relative to each other.

Integer CSV Compression Algorithm

I did surface-level research about the existence of an algorithm that compresses comma-separated integers, however I did not find anything relevant.
My goal is to compress large amounts of structured comma-separated integers whose value ranges are known. Is there a known algorithm to do such a thing? If not, where would be a good place to start reading about relevant areas of interest that would get me started on developing such an algorithm? Of course the algorithm has to be reversible and lossless, such that I can decompress the compressed data to retrieve the CSV values.
The data structure is an array of three values: the first number's domain is from 0 to 4, the second is from 0 to 6, and the third is from 0 to n, where n is not a large number. This structure is repeated to create data in a two-dimensional array.
Using standard compression algorithms such as gzip or bzip2 on structured data does not yield optimal compression efficiency, so constructing a case-specific algorithm did the trick.
The data structure is shown below with an example.
// cell: a data structure, array of three numbers
// digits[0]: { 0, 1, 2, 3, 4 }
// digits[1]: { 0, 1, 2, 3 }
// digits[2]: { 0, 1, 2, ..., n } n is not an absurdly large number
// Below it is reused in a multi-dimensional array.
var cells = [
    [ [3, 0, 1], [4, 2, 4], [3, 0, 2], [4, 1, 3] ],
    [ [4, 2, 3], [3, 0, 3], [4, 3, 3], [1, 1, 0] ],
    [ [3, 3, 0], [2, 3, 1], [2, 2, 5], [0, 2, 4] ],
    [ [2, 1, 0], [3, 0, 0], [0, 2, 3], [1, 0, 0] ]
];
I ran various tests on this data structure (as a string, excluding the whitespace) using standard compression algorithms:
gz compressed from 171 to 88 bytes
bzip2 compressed from 171 to 87 bytes
deflate compressed from 171 to 76 bytes
The algorithm I constructed compresses the data down to 33 bytes and works up to n = 192. So on a case-specific basis I was able to compress my data with more than double the efficiency of standard text compression algorithms.
The way I achieved this compression is by mapping all the possible combinations of values a cell can hold to integers. If you want to investigate such a concept, it is known as combinatorics in mathematics. I then converted the base-10 integer into a higher base for string representation.
Since I am aiming for human usability (the compressed code will be typed), I used base 62, which I represented as {[0-9], [a-z], [A-Z]} for 0 to 61 respectively. I padded each cell's encoding to two digits when converting to base 62. This allows for 62 * 62 (3844) different cell combinations.
Finally, I added a base 62 digit at the beginning of the compressed string which represents the number of columns. When decompressing the y size is used to deduce the x size from the string's length. Thus the data can be correctly decompressed with no loss of data.
The compressed string of the above example looks like this:
var compressed = compress(cells); // "4n0w1H071c111h160i0B0O1s170308110"
I have provided an explanation of my method to help others facing a similar problem. I have not provided my code for obscurity reasons.
TL;DR
To compress structured data (a code sketch follows after the TL;DR):
Represent each discrete object as an integer
Encode the base-10 integer in a higher base
Repeat for all objects
Append the number of rows or columns to the compressed string
To decompress structured data:
Read the rows or columns and deduce the other from the string length
Reverse steps 1 and 2 in compression
Repeat for all objects
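Since the original code was withheld, below is a minimal reconstruction of these steps from the description alone; the function names, the place-value order within a cell, and the least-significant-digit-first base-62 output are choices made so that the sketch reproduces the example string above, not necessarily what the author did:
var ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz' +
               'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Step 1: map one cell [d0, d1, d2] with domains 5, 4 and n to an integer.
function cellToInt(cell) {
    return cell[0] + 5 * (cell[1] + 4 * cell[2]);
}

// Step 2: pad to exactly two base-62 digits, least significant first.
function toBase62Pair(x) {
    return ALPHABET[x % 62] + ALPHABET[Math.floor(x / 62)];
}

// Steps 3-4: encode every cell and prepend one digit for the row count
// (this assumes fewer than 62 rows).
function compress(cells) {
    return cells.reduce(function(out, row) {
        return out + row.map(function(cell) {
            return toBase62Pair(cellToInt(cell));
        }).join('');
    }, ALPHABET[cells.length]);
}

compress(cells); // "4n0w1H071c111h160i0B0O1s170308110"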
Unless there's some specific structure to your list that you're not divulging and that might drastically help compression, standard lossless compression algorithms such as gzip or bzip2 should handle a string of numbers just fine.
Libraries for such common algorithms should be ubiquitously available for pretty much all languages and platforms.