What is the Universal Google Analytics custom dimension character/byte limit (cross-browser)?

I know that classic GA sends its data in a request for a 1-pixel .gif, where the combined length of a custom [key, value] pair had a fixed maximum size. One limiting factor was older IE browsers, which had a maximum request size (I'm not sure, but I assumed it was 2048 bytes for all values combined).
I also assumed that it was encoded in UTF-8 (a variable-length multibyte encoding), where a character can take 1-4 bytes. At some point I read that the maximum length of a custom value is 150 characters. At the time I did not need to know any better.
In the case of Universal GA, the key part is stored on the GA server and never sent. If I am not mistaken, each custom value has a separate request.
ga('create', 'UA-xxxxxxxx-x', 'xxx');
ga('set', {
'dimension1': '...',
'dimension2': '...',
'dimension3': '...',
'dimension4': '...',
'dimension5': '...'
});
ga('send', 'pageview');
My question is: how many bytes/characters can '...' have in the case of Universal GA?

The encoded custom dimension value may not exceed 150 bytes.
Best case (1 byte per character): 150 characters
Worst case (4 bytes per character): 37 characters
Reference: link
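A minimal sketch of how a value could be trimmed to fit, assuming the 150-byte limit quoted above applies to the UTF-8 encoding of the value before any URL escaping (the answer does not spell this out). The helper name truncateUtf8 is hypothetical and is not part of analytics.js.

```javascript
// Hypothetical helper: keep whole characters until the UTF-8 byte budget is used up.
// Assumption: the limit counts UTF-8 bytes of the raw value (150 per the answer above).
function truncateUtf8(value, maxBytes = 150) {
  const encoder = new TextEncoder();
  let out = '';
  let used = 0;
  // for...of iterates by code point, so a 4-byte character is never split.
  for (const ch of value) {
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    out += ch;
    used += size;
  }
  return out;
}

const asciiValue = truncateUtf8('a'.repeat(200));
const emojiValue = truncateUtf8('😀'.repeat(50));
console.log(asciiValue.length, [...emojiValue].length);
```

With ASCII input the result keeps 150 characters; with 4-byte characters it keeps 37, which matches the best and worst cases listed above.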

Unexpected bytes used in chrome storage

I'm writing an extension that makes use of the chrome.storage API. I want to truncate each item to make sure it is below the maximum bytes threshold of the storage (local and sync).
The documentation states that the byte size of each individual item is
measured by the JSON stringification of its value plus its key length.
I use the following code to calculate the expected byte size:
new TextEncoder().encode(JSON.stringify(value)).length + key.length
I use the following code to check actual usage:
chrome.storage.<storage-area>.set({ [key]: value }, () => {
chrome.storage.<storage-area>.getBytesInUse(key, bytes => {
console.log("actual bytes in use", bytes);
});
});
Given a key of "test" and a value of "abc", the expected byte usage is 9b. The actual byte usage is 9b.
Given a key of "test" and a value of "«ταБЬℓσ»", the expected byte usage is 23b. The actual byte usage is 23b.
Given a key of "test" and a value of "<", the expected byte usage is 7b. The actual byte usage is 12b.
The storage is of course cleared between each check.
In the last example, what is causing those 5 extra, unexpected, bytes? What am I missing?
Edit: I'm using Google Chrome version 73.0.3683.75 (Official Build) (64-bit)
I found the reason thanks to w0xx0m's comment.
Chrome/Chromium replaces the less than character with "\u003C" to prevent script execution.
Source code can be found here.
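A sketch of the size calculation adjusted for this behavior. The function name estimateBytesInUse is hypothetical; only the replacement of `<` is confirmed above, so other characters that Chrome might escape are not handled.

```javascript
// Hypothetical helper: the question's formula, plus the "<" -> "\u003C" escaping.
// Assumption: "<" is the only character escaped (the only one confirmed above).
function estimateBytesInUse(key, value) {
  const json = JSON.stringify(value).replace(/</g, '\\u003C');
  return new TextEncoder().encode(json).length + key.length;
}

console.log(estimateBytesInUse('test', 'abc'));      // 9
console.log(estimateBytesInUse('test', '«ταБЬℓσ»')); // 23
console.log(estimateBytesInUse('test', '<'));        // 12
```

The three calls reproduce the actual byte counts reported in the question, including the 12 bytes for "<".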

Is Not BigInt Enough To House sha1?

I want to know if BigInt is enough in size.
I have created a registration.php where the user gets emailed an account activation link to click to verify his email so his account gets activated.
Account Activation Link is in this format:
[php]
$account_activation_link =
"http://www.".$site_domain."/".$social_network_name."/activate_account.php?primary_website_email=".$primary_website_email."&account_activation_code=".$account_activation_code."";
[/php]
Account Activation Code is in this format:
$account_activation_code = sha1( (string) mt_rand(5, 30)); //Type Casted the INT to STRING on the 1st parameter of sha1 as it needs to be a STRING.
Now, the following link got emailed:
http://www.myssite.com/folder/activate_account.php?primary_website_email=my.email#gmail.com&account_activation_code=22d200f8670dbdb3e253a90eee5098477c95c23d
Note the account activation code that got generated by sha1:
22d200f8670dbdb3e253a90eee5098477c95c23d
But in my MySQL db, in the "account_activation_code" column, I only see:
"22". The rest of the activation code is missing. Why is that?
The column is set to BigInt. Isn't that enough to house the SHA-1 generated code?
What is your suggestion?
Thank You
SHA-1 produces a 160-bit binary value, and other hash functions produce longer ones; the common SHA-256 is 256 bits long. No cryptographic hash will fit in a 64-bit BIGINT field, because a 64-bit hash would be uselessly small: collisions would be far too easy to find.
Normally people store hashes as their hex-encoded equivalents in a VARCHAR(255) column. These can be indexed and perform well enough in most situations, especially one where you do periodic lookups based on clicks. From a performance and storage perspective there are no problems here.
Short answer: BIGINT is way too small.
A hash is basically a stream of bits (160 bits in the case of SHA-1). While it's certainly possible to render those bits as a base 2 number and convert it to base 10, you need a really big storage type to do so (as far as I know, it's not common to see integer variables larger than 64 bits) and there aren't obvious advantages. BIGINT is a 64-bit type, so it cannot do the job.
Unless you have a good reason to store it as number, I'd simply go for either a binary column type or its plain-text hexadecimal representation in a good old VARCHAR (the latter tends to be more practical to handle).
You are trying to store a string in a BIGINT column. That is your issue. A hex-encoded SHA hash mixes digits and letters, so it is not a plain number. Change the field to a VARCHAR and you'll be fine.
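A small sketch of the arithmetic behind these answers, using the activation code from the question. The stored "22" is consistent with the database reading only the leading decimal digits of the string when it converts it to an integer; that is an assumption about the server's non-strict conversion, and parseInt is used here only as an analogy for it.

```javascript
// The activation code from the question: 40 hex characters.
const code = '22d200f8670dbdb3e253a90eee5098477c95c23d';

// Each hex digit carries 4 bits, so the hash is 160 bits wide.
const bits = code.length * 4;

// Largest value of a signed 64-bit integer, such as a signed BIGINT.
const maxSignedBigint = 2n ** 63n - 1n;
const fitsInBigint = BigInt('0x' + code) <= maxSignedBigint;

// Analogy for a lenient string-to-integer conversion: parsing stops at the first non-digit ("d").
const leadingDigits = parseInt(code, 10);

console.log(bits, fitsInBigint, leadingDigits); // 160 false 22
```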

Correct way to store a bit array

I'm working on a project that needs to store something like
101110101010100011010101001
into the database. It's not a file or archive: it's only a bit array, and I think that storing it in a varchar column is a waste of space/performance.
I've searched about the BLOB and VARBINARY types. But both of them allow inserting a value like 54563423523515453453, which is not exactly a bit array.
For sure, if I store a bit array like 10001000 in a BLOB/varbinary/varchar column, it will consume more than a byte, and I want it to consume the minimum space. In the case of eight bits, it needs to consume only one byte, 16 bits two bytes, and so on.
If it's not possible, then what is the best approach to waste the minimum amount of space in this case?
Important notes: The size of the array is variable, and is not divisible by eight in every situation. Sometimes I will need to store 325 bits, other times 7143 bits....
In one of my previous projects, I converted streams of 1's and 0's to decimal, but they were shorter. I don't know if that would be applicable in your project.
On the other hand, IMHO, you should clarify what you will need to do with that data once you have it stored. Search? Compare? It might largely depend on the purpose of the database.
Could you gzip it and then store it? Is that applicable?
Binary is a string representation of a number. The string
101110101010100011010101001
represents the number
... + 1*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
As such, it can be stored in a 32-bit integer if it were converted from a binary string to the number it represents. In Perl, one would use
oct('0b'.$binary)
But you have a variable number of bits. Not a problem! Just process them 8 at a time to create a string of bytes to place in a BLOB or similar.
Ah, but there's a catch. You'll need to add padding to get a number divisible by 8, which means you'll have to use a means of removing that padding. A simple approach if there's a known maximum length is to use a length prefix. e.g. If you know the number of bits is never going to exceed 65,535, encode the number of bits in the first two bytes of the string.
pack('nB*', length($binary), $binary)
which is reverted using
my ($length, $binary) = unpack('nB*', $packed);
substr($binary, $length) = '';
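The same length-prefix idea, sketched in JavaScript for readers who do not use Perl. The function names packBits and unpackBits are hypothetical; the layout (a 2-byte big-endian bit count followed by the bits, most significant bit first, zero-padded to a whole byte) is meant to mirror the 'nB*' template above.

```javascript
// Hypothetical helpers mirroring pack('nB*', ...) / unpack('nB*', ...).
function packBits(bits) {
  if (!/^[01]*$/.test(bits)) throw new Error('expected only 0 and 1');
  if (bits.length > 0xffff) throw new RangeError('too many bits for a 16-bit prefix');
  const out = new Uint8Array(2 + Math.ceil(bits.length / 8));
  out[0] = bits.length >> 8;   // high byte of the bit count
  out[1] = bits.length & 0xff; // low byte of the bit count
  for (let i = 0; i < bits.length; i++) {
    // Bit i lands in byte i/8, starting from the most significant bit.
    if (bits[i] === '1') out[2 + (i >> 3)] |= 0x80 >> (i & 7);
  }
  return out;
}

function unpackBits(packed) {
  const length = (packed[0] << 8) | packed[1];
  let bits = '';
  for (let i = 0; i < length; i++) {
    bits += packed[2 + (i >> 3)] & (0x80 >> (i & 7)) ? '1' : '0';
  }
  return bits; // the padding bits are never read
}

const sample = '101110101010100011010101001'; // 27 bits
const packed = packBits(sample);
console.log(packed.length, unpackBits(packed) === sample); // 6 true
```

The 27-bit sample costs 4 data bytes plus the 2-byte prefix; a 325-bit array would cost 41 + 2 bytes.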

What column type/length should I use for storing a Bcrypt hashed password in a Database?

I want to store a hashed password (using BCrypt) in a database. What would be a good type for this, and which would be the correct length? Are passwords hashed with BCrypt always of same length?
EDIT
Example hash:
$2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu
After hashing some passwords, it seems that BCrypt always generates 60 character hashes.
EDIT 2
Sorry for not mentioning the implementation. I am using jBCrypt.
The modular crypt format for bcrypt consists of
$2$, $2a$ or $2y$ identifying the hashing algorithm and format
a two digit value denoting the cost parameter, followed by $
a 53-character base-64-encoded value (it uses the alphabet ., /, 0–9, A–Z, a–z, which differs from the standard Base 64 Encoding alphabet) consisting of:
22 characters of salt (effectively only 128 bits of the 132 decoded bits)
31 characters of encrypted output (effectively only 184 bits of the 186 decoded bits)
Thus the total length is 59 or 60 bytes respectively.
As you use the 2a format, you’ll need 60 bytes. And thus for MySQL I recommend using CHAR(60) BINARY or BINARY(60) (see The _bin and binary Collations for information about the difference).
CHAR is not binary safe and equality does not depend solely on the byte value but on the actual collation; in the worst case A is treated as equal to a. See The _bin and binary Collations for more information.
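A sketch that splits the example hash from the question into the parts listed above and checks the lengths. The parseBcrypt helper and its regular expression are illustrative assumptions; they only accept the three prefixes named in this answer and do not validate the hash cryptographically.

```javascript
// Hypothetical parser for the modular crypt format described above:
// "$" + id + "$" + two-digit cost + "$" + 22 salt chars + 31 digest chars.
const BCRYPT_MCF = /^\$(2[ay]?)\$(\d{2})\$([./0-9A-Za-z]{22})([./0-9A-Za-z]{31})$/;

function parseBcrypt(hash) {
  const match = BCRYPT_MCF.exec(hash);
  if (match === null) return null;
  const [, id, cost, salt, digest] = match;
  return { id, cost: Number(cost), salt, digest, length: hash.length };
}

const example = '$2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu';
const parsed = parseBcrypt(example);
console.log(parsed.id, parsed.cost, parsed.salt.length, parsed.digest.length, parsed.length);
```

For the example hash this reports the 2a format, cost 10, a 22-character salt, a 31-character digest, and 60 characters in total.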
A Bcrypt hash can be stored in a BINARY(40) column.
BINARY(60), as the other answers suggest, is the easiest and most natural choice, but if you want to maximize storage efficiency, you can save 20 bytes by losslessly deconstructing the hash. I've documented this more thoroughly on GitHub: https://github.com/ademarre/binary-mcf
Bcrypt hashes follow a structure referred to as modular crypt format (MCF). Binary MCF (BMCF) decodes these textual hash representations to a more compact binary structure. In the case of Bcrypt, the resulting binary hash is 40 bytes.
Gumbo did a nice job of explaining the four components of a Bcrypt MCF hash:
$<id>$<cost>$<salt><digest>
Decoding to BMCF goes like this:
$<id>$ can be represented in 3 bits.
<cost>$, 04-31, can be represented in 5 bits. Put these together for 1 byte.
The 22-character salt is a (non-standard) base-64 representation of 128 bits. Base-64 decoding yields 16 bytes.
The 31-character hash digest can be base-64 decoded to 23 bytes.
Put it all together for 40 bytes: 1 + 16 + 23
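The byte counts above can be checked with a few lines of arithmetic. This is only a sketch of the size calculation, not of the BMCF encoder itself; decodedBytes is a hypothetical helper that drops the unused trailing bits of a base-64 group.

```javascript
// Each base-64 character carries 6 bits; only whole bytes are kept after decoding.
function decodedBytes(chars) {
  return Math.floor((chars * 6) / 8);
}

const headerBytes = (3 + 5) / 8;       // 3-bit id + 5-bit cost = 1 byte
const saltBytes = decodedBytes(22);    // 132 bits -> 16 bytes (128 bits used)
const digestBytes = decodedBytes(31);  // 186 bits -> 23 bytes (184 bits used)
const totalBytes = headerBytes + saltBytes + digestBytes;

console.log(headerBytes, saltBytes, digestBytes, totalBytes); // 1 16 23 40
```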
You can read more at the link above, or examine my PHP implementation, also on GitHub.
If you are using PHP's password_hash() with the PASSWORD_DEFAULT algorithm to generate the bcrypt hash (which I would assume is a large percentage of people reading this question) be sure to keep in mind that in the future password_hash() might use a different algorithm as the default and this could therefore affect the length of the hash (but it may not necessarily be longer).
From the manual page:
Note that this constant is designed to change over time as new and
stronger algorithms are added to PHP. For that reason, the length of
the result from using this identifier can change over time. Therefore,
it is recommended to store the result in a database column that can
expand beyond 60 characters (255 characters would be a good choice).
Using bcrypt, even if you have 1 billion users (i.e. you're currently competing with Facebook), storing 255-byte password hashes would take only ~255 GB of data, about the size of a smallish SSD. It is extremely unlikely that storing the password hash is going to be the bottleneck in your application. However, on the off chance that storage space really is an issue for some reason, you can use PASSWORD_BCRYPT to force password_hash() to use bcrypt, even if that's not the default. Just be sure to stay informed about any vulnerabilities found in bcrypt and review the release notes every time a new PHP version is released. If the default algorithm is ever changed, it would be good to review why and make an informed decision whether to use the new algorithm or not.
I don't think that there are any neat tricks you can do storing this as you can do for example with an MD5 hash.
I think your best bet is to store it as a CHAR(60) as it is always 60 chars long
I think the best choice is a nonbinary type, because each byte can take fewer values, so comparison should be faster. If the data is encoded with base64_encode, each byte has only 64 possible values. If it is encoded with bin2hex, each byte has only 16 possible values, but the string is much longer. In binary form, each byte has 256 possible values.
For hashes encoded with base64, I use a VARCHAR(255) column with the ascii character set and the same collation.
VARBINARY causes comparison problems, as described in the MySQL documentation. I don't know why the answers that advise using VARBINARY have so many upvotes.
I checked this on my own site, where I measure the time (just refresh to see).

Creating a hash of a string that's sortable

Is there any way to create hashes of strings where the hashes can be sorted and give the same results as if the strings themselves were sorted?
This won't be possible, at least if you allow strings longer than the hash size. You have 256^(max. string size) possible strings mapped to 256^(hash size) hash values, so you'll end up with some of the strings unsorted.
Just imagine the simplest hash: Truncating every string to (hash size) bytes.
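A sketch of that truncating hash, to make the trade-off concrete. The name prefixHash and the sample words are illustrative assumptions. Truncation never puts two hashes in the wrong order, but strings that share the first bytes collapse to the same hash, so their relative order is lost.

```javascript
// Hypothetical "simplest hash": keep only the first `size` characters.
function prefixHash(s, size) {
  return s.slice(0, size);
}

const words = ['zebra', 'stackunderflow', 'google', 'stackoverflow', 'stack'].sort();
const hashes = words.map((w) => prefixHash(w, 5));

// The hash order never contradicts the string order...
const neverOutOfOrder = hashes.every((h, i) => i === 0 || hashes[i - 1] <= h);
// ...but distinct strings with a common prefix become indistinguishable.
const collides = prefixHash('stackoverflow', 5) === prefixHash('stackunderflow', 5);

console.log(hashes, neverOutOfOrder, collides);
```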
Yes. It's called using the entire input string as the hash.
As others have pointed out it's not practical to do exactly what you've asked. You'd have to use the string itself as the hash which would constrain the lengths of strings that could be "hashed" and so on.
The obvious approach to maintaining a "sorted hash" data structure would be to maintain both a sorted list (heap or binary tree, for example) and a hashed mapping of the data. Inserts and removals would be O(log(n)) while retrievals would be O(1). Off hand I'm not sure how often this would be worth the additional complexity and overhead.
If you had a particularly large data set, mostly read-only and such that logarithmic time retrieval was overly expensive then I suppose it might be useful. Note that the cost of updates is actually the sum of the constant time (hash) and the logarithmic time (binary tree or heap) operations. However O(1) + O(log(n)) reduces to the larger of the two terms during asymptotic analysis. (The underlying cost is still there --- relevant to any implementation effort regardless of its theoretical irrelevance).
For a significant range of data set sizes the cost of maintaining this hypothetical hybrid data structure could be estimated as "twice" the cost of maintaining either of the pure ones. (In other words, many implementations of a binary tree can scale to billions of elements (2^~32 or so) in time cost that's comparable to the cost of the typical hash functions). So I'd be hard-pressed to convince myself that such added code complexity and run-time cost (of a hybrid data structure) would actually be of benefit to a given project.
(Note: I saw that Python 3.1.1 added the notion of "ordered" dictionaries ... and this is similar to being sorted, but not quite the same. From what I gather the ordered dictionary preserves the order in which elements were inserted into the collection. I also seem to remember some talk of "views" ... objects in the language which can access keys of a dictionary in some particular manner (sorted, reversed, reverse sorted, ...) at (possibly) lower cost than passing the set of keys through the built-in "sorted()" and "reversed()." I haven't used these, nor have I looked at the implementation details. I would guess that one of these "views" would be something like a lazily evaluated index, performing the necessary sorting on call, and storing the results with some sort of flag or trigger (observer pattern or listener) that's reset when the back-end source collection is updated. In that scheme a call to the "view" would update its index; subsequent calls would be able to use those results so long as no insertions or deletions had been made to the dictionary. Any call to the view subsequent to key changes would incur the cost of updating the view. However, this is all pure speculation on my part. I mention it because it might also provide insight into some alternative ways to approach the question).
Not unless there are fewer strings than hashes, and the hashes are perfect. Even then you still have to ensure the hash order is the same as the string order, this is probably not possible unless you know all the strings ahead of time.
No. The hash would have to contain the same amount of information as the string it is replacing. Otherwise, if two strings mapped to the same hash value, how could you possibly sort them?
Another way of thinking about it is this: If I have two strings, "a" and "b", then I hash both of them with this sort preserving hash function and get f(a) and f(b). However, there are an infinite number of strings that are greater than "a" but less than "b". This would require hashing the strings to arbitrary precision Real values (because of cardinality). In the end, you would basically just have the string encoded as a number.
You're essentially asking if you can compress the key strings into smaller keys while preserving their collation order. So it depends on your data. If your strings are composed of only hexadecimal digits, for example, they can be replaced with 4-bit codes.
But for the general case, it can't be done. You'd end up "hashing" each source key into itself.
I stumbled upon this, and although everyone is correct with their answers, I needed a solution exactly like this to use in elasticsearch (don't ask why). Sometimes we don't need a perfect solution for all cases; we just need one that works within acceptable constraints. My solution generates a sortable hashcode for the first n chars of the string. I did some preliminary tests and didn't have any collisions. You need to define the charset beforehand and tune n, the number of leading chars used for sorting, so that the resulting hash code stays in the positive range of the chosen type. In my case, for the Java long type, I could go up to 13 chars.
Below is my code in Java, hopefully, it will help someone else that needs this.
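String charset = "abcdefghijklmnopqrstuvwxyz";

// Sortable code for the first n chars of s.
// Returns 0 for an empty string or a leading char outside the charset.
public long orderedHash(final String s, final String charset, final int n) {
    Long hash = 0L;
    if (s.isEmpty() || n == 0)
        return hash;
    // Position of the first character in the charset; -1 if it is absent.
    Long charIndex = (long) (charset.indexOf(s.charAt(0)));
    if (charIndex == -1)
        return hash;
    // Count the longer strings (2..n chars) that start with an earlier character.
    for (int i = 1; i < n; i++)
        hash += (long) (charIndex * Math.pow(charset.length(), i));
    // Add the earlier single chars, this prefix itself, and the code of the remainder.
    hash += charIndex + 1 + orderedHash(s.substring(1), charset, n - 1);
    return hash;
}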
Examples:
orderedHash("a", charset, 13) // 1
orderedHash("abc", charset, 13) // 4110785825426312
orderedHash("b", charset, 13) // 99246114928149464
orderedHash("google", charset, 13) // 651008600709057847
orderedHash("stackoverflow", charset, 13) // 1858969664686174756
orderedHash("stackunderflow", charset, 13) // 1858969712216171093
orderedHash("stackunderflo", charset, 13) // 1858969712216171093 same, 13 chars limitation
orderedHash("z", charset, 13) // 2481152873203736576
orderedHash("zzzzzzzzzzzzz", charset, 13) // 2580398988131886038
orderedHash("zzzzzzzzzzzzzz", charset, 14) // -4161820175519153195 no good, overflow
orderedHash("ZZZZZZZZZZZZZ", charset, 13) // 0 no good, not in charset
If more precision is needed, use an unsigned type or a composite one (made of two longs, for example) and compute the hashcode with substrings.
Edit: Although the previous algorithm sufficed for my use, I noticed that it did not order strings correctly when their length was not greater than the chosen n. With this new algorithm it should be OK now.
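For comparison, a sketch of the same recursion ported to JavaScript with BigInt, which is one way to get the extra precision mentioned above without overflow. This port is an illustration under that assumption, not the author's code; it keeps the function name and the behavior of returning 0 for characters outside the charset.

```javascript
// BigInt port of the recursion above; BigInt values do not overflow.
function orderedHash(s, charset, n) {
  if (s.length === 0 || n === 0) return 0n;
  const index = charset.indexOf(s[0]);
  if (index === -1) return 0n;
  const base = BigInt(charset.length);
  let hash = 0n;
  // Count the longer strings that start with an earlier character.
  for (let i = 1; i < n; i++) hash += BigInt(index) * base ** BigInt(i);
  // Add the earlier single chars, this prefix itself, and the code of the remainder.
  return hash + BigInt(index) + 1n + orderedHash(s.slice(1), charset, n - 1);
}

const charset = 'abcdefghijklmnopqrstuvwxyz';
const sortedWords = ['a', 'abc', 'b', 'google', 'stackoverflow', 'z'];
const codes = sortedWords.map((w) => orderedHash(w, charset, 13));
const increasing = codes.every((c, i) => i === 0 || codes[i - 1] < c);
console.log(codes.map(String), increasing);
```

With n = 14 the result for fourteen z's stays positive here, whereas the long version above overflows.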