Is PayPal's "PNREF" always 12 characters? - mysql

Does anyone know if PayPal's "PNREF" (returned from zero-dollar authorizations) is always 12 characters?
I ask because I want to optimize my MySQL storage.
And also, I trust SO's answer more than PP's :-D

Don't "optimize" your storage. Not only do server-grade terabyte sized drives cost just a few hundred dollars, making the cost of storing a handful of bytes nearly zero, but VARCHAR(255) columns only take up as much space as you have content because they are variable length.
If you ran a million transactions and saved ten bytes on each, you've saved all of ten megabytes of data, a fraction of a cent's worth of storage. I'm presuming if you've run a million transactions you can afford the bytes. The PayPal fees alone will be many orders of magnitude more than the storage cost.
In actuality there's zero savings between storing a 12-character value in VARCHAR(12) and storing it in VARCHAR(255). Internally both are represented as a single length byte plus N bytes for the content. For regular single-byte characters that means 13 bytes per entry.
The only difference is that you're arbitrarily limiting the former to 12 characters: you'll either get truncation errors (if strict SQL mode is enabled, as it is by default on newer versions of MySQL) when you insert longer values, or you'll silently lose data and have no idea until it's probably too late to fix it.
Just use VARCHAR(255) so that your code doesn't explode when PayPal decides today's the day to use 14 characters. These things can change without warning and without any logical reason.
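For reference, a minimal sketch of what that column might look like; the table and column names here are illustrative, not anything PayPal specifies:

CREATE TABLE paypal_transactions (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  pnref      VARCHAR(255) NOT NULL,   -- no arbitrary length cap
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_pnref (pnref)               -- make it UNIQUE if your integration guarantees uniqueness
);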

Related

Store large chart data points in MySQL

I am creating an application that stores ECG data.
I want to eventually graph this data in react but for now I need help storing it.
The biggest problem is storing the data points that will go along the x and y axes of the graph. Along the bottom (x) is time, and the y axis is the measured value. There are no hard limits, but as it's basically a heart rhythm most points will lie close to 0.
What is the best way to store the x and y data??
An example of the y data : [204.77, 216.86 … 3372.872]
The files that I will be getting this data from can contain millions of data points, depending on the sampling rate and the time the experiment took.
What is the best way to store this type of data in MySQL. I cannot use any other DB as they’re not installed on the server this will be hosted on.
Thanks
Well, as you said there are millions of points, so JSON is the best way to store these points.
The space required to store a JSON document is roughly the same as for LONGBLOB or LONGTEXT;
Please have a look into this -
https://dev.mysql.com/doc/refman/8.0/en/json.html
The JSON encoding of your sample data would take 7-8 bytes per reading. Multiply that by the number of readings you will get at a single time. There is a practical limit of 16MB for a string being fed to MySQL. That seems "too big".
A workaround is to break the list into, say, 1K points at a time. Then there would be several rows, each row being manageable. There would be virtually no limit on the number of points you could store.
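A rough sketch of that chunking layout (table and column names are just illustrative):

CREATE TABLE ecg_chunks (
  recording_id INT UNSIGNED NOT NULL,
  chunk_no     INT UNSIGNED NOT NULL,  -- 0, 1, 2, ... in time order
  points       JSON NOT NULL,          -- e.g. '[204.77, 216.86, 219.01]'
  PRIMARY KEY (recording_id, chunk_no)
);

INSERT INTO ecg_chunks (recording_id, chunk_no, points)
VALUES (1, 0, '[204.77, 216.86, 219.01]');

-- reassemble one recording in order (concatenate the chunks in the application):
SELECT points FROM ecg_chunks WHERE recording_id = 1 ORDER BY chunk_no;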
FLOAT is 4 bytes, but you would need a separate row for each reading, so figure about 25 bytes per row including overhead. Size is not a problem, but two other issues could arise: 7 significant digits is about the limit of precision for FLOAT, and fetching a million rows will not be very fast.
DOUBLE is 8 bytes, 16 digits of precision.
DECIMAL(6,2) is 3 bytes and overflows above 9999.99.
Considering that a computer monitor has less than 4 digits of precision (4K pixels < 10^4), I would argue for FLOAT as "good enough".
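For comparison, a one-row-per-reading layout as described above (hypothetical names); x is implied by sample_no when the sampling rate is fixed:

CREATE TABLE ecg_samples (
  recording_id INT UNSIGNED NOT NULL,
  sample_no    INT UNSIGNED NOT NULL,
  y            FLOAT NOT NULL,
  PRIMARY KEY (recording_id, sample_no)
);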
Another option is to take the JSON string, compress it, then store that in a LONGBLOB. The compression will give you an average of about 2.5 bytes per reading and the row for one complete reading will be a few megabytes.
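If you go that route, MySQL's built-in COMPRESS()/UNCOMPRESS() is one way to do it (a sketch only, with hypothetical names; compressing in the application before the INSERT works just as well):

CREATE TABLE ecg_blobs (
  recording_id INT UNSIGNED NOT NULL PRIMARY KEY,
  points_gz    LONGBLOB NOT NULL       -- zlib-compressed JSON string
);

INSERT INTO ecg_blobs (recording_id, points_gz)
VALUES (1, COMPRESS('[204.77, 216.86, 219.01]'));

SELECT UNCOMPRESS(points_gz) AS points FROM ecg_blobs WHERE recording_id = 1;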
I have experienced difficulty INSERTing a row bigger than 1MB. Changing a setting let me get to 16MB; I have not tried any row bigger than that. If you run into trouble there, start a new question on just that topic. I will probably come along and explain how to chunk up the data, thereby even allowing a "row" spread over multiple database rows that could effectively be bigger than 4GB, which is the hard limit for JSON, LONGTEXT and LONGBLOB.
You did not mention the X values. Are you assuming that they are evenly spaced? If you need to provide X,Y pairs, the computations above get a bit messier, but I have provided some of the data for analysis.

Does a 64-bit Integer PRACTICALLY have a limit?

I am developing a chat feature for my website.
In my MySQL database I use a 64-bit signed integer for the chat_id attribute, which is an auto-increment.
So I am worried that once my system gets a lot of traffic, the chat_id value could overflow.
So my question is: does a 64-bit integer practically overflow?
And if so, is there a 128-bit integer in MySQL, JavaScript and PHP?
A 64-bit signed integer has a highest value of 9,223,372,036,854,775,807. It is extremely unlikely that the service you are building will ever have this many chat sessions. Not because I don't believe it will be popular one day, but because this number is incredibly large.
To give an indication of how large this number is (from Wikipedia):
In Java the time in milliseconds is stored as a 64-bit signed integer; it will take about 292 million years to overflow...
So, no, you won't need more than a 64-bit signed integer for a unique chat_id, especially since you are using auto-increment.
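A minimal sketch of such a column (names are illustrative); BIGINT UNSIGNED roughly doubles the headroom if you never need negative IDs:

CREATE TABLE chats (
  chat_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);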
The largest 64-bit unsigned integer is a staggering 18,446,744,073,709,551,615 - that is 18 quintillion, 446 quadrillion, 744 trillion, 73 billion, 709 million, 551 thousand and 615.
If by chat ID you're referring to each unique chat started between two users, you'd struggle to reach this amount even if you ran the largest messaging service in the entire world.
If for some reason you believe you'd need a value greater than this, store it in a VARCHAR as a combination of digits and characters instead, e.g. 00-09, then 0a-0Z (really crude, and not actually a good idea in this exact form, but you get the point), going up systematically like that for as long as you like; that way, if you need more room you can just increase the size of the field.
When it comes to being an ID value, no, its theoretical limit can not really be reached in practical usage for your scenario.
However, if you tried to store some randomly generated number in it instead, such as a UUID (which needs more than 64 bits to store, but I've seen people try to "speed it up" by truncating it down to a long integer and storing it as such), then you could get matching values once you reach a large number of entries. Note this point about UUIDs: most folks just observe how theoretically improbable a match looks, then they pump tens of millions of records in within a month and find that they start getting collisions on a regular basis, because the collision chance rises dramatically as more and more of these "fake UUIDs" are generated.
To sum it up: for your auto-increment it's fine. But don't try to "squeeze" a 128-bit UUID into it and randomize it for some (future) migration or something. "Hacks" like that often backfire.

index on url or hashing considering RAM

I am working on a project which needs to add/update around 1 million urls daily. Some days are mostly updates, some days are mostly adds, and some days are a mix.
So, on every query there is a need to check the uniqueness of the url in the url table.
How can the url lookup be made really fast? At the moment the index is on the url column and it works well, but in the coming weeks RAM will not be enough if the index stays on that column, since new records will be added by the millions.
That's why I am looking for a solution so that when there are 150+ million urls in total the lookup is still fast. I am thinking of indexing an md5 of the url, but I worry about the chance of collisions. A friend tipped me to also calculate a crc32 hash and concatenate it with the md5 to make the collision possibility effectively zero, and store it in BINARY(20); that way only 20 bytes would be used as the index instead of the current varchar(255) url column.
Currently there are around 50 million urls in total and with 8GB of RAM it is working fine.
Yesterday, I asked a question url text compression (not shortening) and storing in mysql related to the same project.
[Edit]
I have thought of another solution: store only a crc32 hash, in decimal form, to speed up the lookup, and add a check at the application level on how many records are returned. If more than one record is returned, the exact url is matched as well.
That way collisions are also avoided, while keeping the load on RAM and disk space low by storing 4 bytes for each row instead of 20 bytes (md5+crc32). What do you say?
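A minimal sketch of that lookup as described in the edit above, assuming the table is simply called urls; CRC32() returns an unsigned 32-bit number, so INT UNSIGNED (4 bytes) holds it:

CREATE TABLE urls (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url     VARCHAR(255) NOT NULL,
  url_crc INT UNSIGNED NOT NULL,
  KEY idx_url_crc (url_crc)
);

INSERT INTO urls (url, url_crc)
VALUES ('http://example.com/a', CRC32('http://example.com/a'));

-- the index narrows the candidates by hash, the url comparison resolves collisions
SELECT id FROM urls
WHERE url_crc = CRC32('http://example.com/a')
  AND url = 'http://example.com/a';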
After reading all your questions ( unique constraint makes hashes useless? , 512 bit hash vs 4 128bit hash and url text compression (not shortening) and storing in mysql), I understood that your problem is more or less the following:
"I need to store +150M URLs in mySQL, using 8GB of RAM, and still have a good performance on writing them all and retrieving them, because daily I'll update them, so I'll retrive a lot of URLs, check them against the database. Actually it has 50M URLs, and will grow about 1M each day in the following 3 monts."
Is that it?
The following points are important:
What is the format of the URLs you'll save? Will you need to read the URLs back, or just update information about them, never searching on partial URLs, etc.?
Assuming URL = "http://www.somesite.com.tv/images/picture01.jpg" and that you want to store everything, including the filename. If it's different, please provide more details or correct my assumptions.
You can save space by replacing some groups of characters in the URL. Not all ASCII characters are valid in a URL, as you can see here: RFC 1738, so you can use those to represent (and compress) the URL. For example: using character 0x81 to represent "http://" saves you 6 characters, 0x82 to represent ".jpg" saves you another 3 bytes, etc.
Some words might be very common (like "image", "picture", "video", "user"). If you choose to use characters 0x90 up to 0x9F plus any other character (so 0x90 0x01, 0x90 0x02, ..., 0x9F 0xFA) to encode such words, you can have 16 * 256 = 4,096 "dictionary entries" to encode the most used words. You'll use 2 bytes to represent 4-8 characters.
Edit: as you can read in the RFC mentioned above, a URL can only contain the printable ASCII characters. This means that only characters 0x20 to 0x7F should appear, with some exceptions noted in the RFC. So any character from 0x80 upwards (hexadecimal notation; that would be character 128 decimal in the ASCII table) shouldn't occur in a URL. That means you can choose one character (let's say 0x90) as a flag meaning "the following byte is an index into the dictionary". One flag character (0x90) * 256 values (0x00 up to 0xFF) = 256 entries in the dictionary. But you can also choose to use characters 0x90 to 0x9F (144 to 159 in decimal) as dictionary flags, giving you 16 * 256 possibilities...
These two methods can save you a lot of space in your database and are reversible, without any need to worry about collisions, etc. You'll simply create a dictionary in your application and encode/decode URLs with it, very fast, making your database much lighter.
Since you already have +50M URLs, you can generate statistics based on them, to generate a better dictionary.
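A hypothetical sketch of that byte-substitution idea, done in MySQL itself just to show the mechanics (the marker bytes and the use of a VARBINARY column are my assumptions; in practice you would encode/decode in application code with a full dictionary):

-- bytes >= 0x80 never occur in a literal URL (RFC 1738), so they are free to use as markers
SELECT REPLACE(REPLACE(CAST('http://www.somesite.com.tv/images/picture01.jpg' AS BINARY),
               'http://', X'81'), '.jpg', X'82') AS packed;

-- decoding applies the same replacements in reverse:
-- REPLACE(REPLACE(packed, X'82', '.jpg'), X'81', 'http://')
-- store the packed form in a VARBINARY(255) column so the high bytes survive unchanged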
Using hashes : Hashes, in this case, are a tradeoff between size and security. How bad will it be if you get a collision?
And in this case you can use the birthday paradox to help you.
Read the article to understand the problem: if all inputs (possible characters in the URL) were equally likely, you could estimate the probability of a collision. And you could calculate the opposite: given your acceptable collision probability and your number of files, how broad should your range be? And your range is exactly determined by the number of bits produced by the hash function...
Edit: if you have a hash function that gives you 128 bits, you'll have 2^128 possible outcomes. So your "range" in the birthday paradox is 2^128: it's as if your year had 2^128 days instead of 365. So you calculate the probability of a collision ("two files being born on the same day") with a year that has 2^128 days instead of 365. If you choose to use a hash that gives you 512 bits, your range goes from 0 to 2^512...
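To put rough numbers on it (a standard birthday-bound approximation, not something taken from the linked article): for n items hashed into a space of N possible values, the collision probability is approximately p ≈ 1 - e^(-n^2 / (2N)). With a 128-bit hash (N = 2^128 ≈ 3.4 x 10^38) and n = 150 million URLs, n^2 / (2N) is on the order of 3 x 10^-23, so a collision is vanishingly unlikely; with a 32-bit hash like CRC32 (N ≈ 4.3 x 10^9), collisions are practically guaranteed at that scale, which is why a CRC column can only ever be a first filter in front of an exact match.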
And again, keep the RFC in mind: not all 256 byte values are valid in the internet / URL world. So the probability of collisions decreases. Better for you :).
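If you do go with a hash, one concrete layout (a sketch, not something from the answer above, assuming the url table is called urls) is to store the 128-bit MD5 as 16 raw bytes rather than as 32 hex characters, and index that instead of the 255-byte url column:

ALTER TABLE urls ADD COLUMN url_md5 BINARY(16);
UPDATE urls SET url_md5 = UNHEX(MD5(url));
ALTER TABLE urls ADD UNIQUE KEY uq_url_md5 (url_md5);

SELECT id FROM urls WHERE url_md5 = UNHEX(MD5('http://example.com/a'));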

url text compression (not shortening) and storing in mysql

I have a url table in MySQL which has only two fields, id and a varchar(255) for the url. There are currently more than 50 million urls there, and my boss has just given me a hint about an expansion of our current project which will result in more urls being added to that table; the expected number is around 150 million by the middle of next year.
Currently the database size is about 6GB, so I can safely say that if things stay as they are it will cross 20GB, which is not good. So, I am thinking about a solution which can reduce the disk space needed for url storage.
I also want to make it clear that this table is not a busy table and there are not too many queries at the moment, so I am just looking to save disk space and, more importantly, to explore new ideas for short text compression and its storage in MySQL.
BUT in future that table could also be accessed heavily, so it's better to optimize the table well before that time comes.
I worked quite a bit on converting the url into numeric form and storing it in a BIGINT, but as that is limited to 64 bits it didn't work out well. The BIT data type has the same problem: it also imposes a limit of 64 bits.
My idea behind converting to numeric form is basically this: an 8-byte BIGINT holds 19 decimal digits, so if each digit pointed to a character in a character set, it could store 19 characters in 8 bytes, provided the character set had no more than 10 characters. In a real-world scenario there are 52 English letters plus 10 digits plus a few symbols, so the set is around 100 characters, and in that worst case a BIGINT can still encode about 6 characters. This isn't a final verdict; it still needs some work to decide exactly what each digit points to (a 10+ character set, 30+, 80+, ...), but you get the general idea of what I am thinking about.
One more important thing: as urls are of variable length, I am also trying to save disk space on short urls, so I don't want to use a fixed-length column type.
I have also looked into some short-text compression algorithms like smaz and Huffman coding, but I'm not really convinced, because they use some sort of dictionary of words and I am looking for a cleaner method.
And I don't want to use a binary data type, because it takes just as many bytes as varchar.
Another idea to try might be to identify common strings and represent them with a bitmap. For instance, use two bits to represent the protocol (http, https, ftp or something else), another bit to indicate whether the domain starts with "www", and two bits to indicate whether the domain ends with ".com", ".org", ".edu" or something else. You'd have to do some analysis on your data to see whether these make sense, and whether there are any other common strings you can identify.
If you have a lot of URLs to the same site, you could also consider splitting your table into two different ones, one holding the domain and the other containing the domain-relative path (and query string & fragment id, if present). You'd have a link table that had the id of the URL, the id of the domain and the id of the path, and you'd replace your original URL table with a view that joined the three tables. The domain table wouldn't have to be restricted to the domain, you could include as much of the URL as was common (e.g., 'http://stackoverflow.com/questions'). This wouldn't take too much code to implement, and has the advantage of still being readable. Your numeric encoding could be more efficient, once you get it figured out, you'll have to analyze your data to see which one makes more sense.
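A rough sketch of that split, slightly simplified here to two tables plus a view (all names are hypothetical):

CREATE TABLE domains (
  id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  prefix VARCHAR(100) NOT NULL,    -- e.g. 'http://stackoverflow.com/questions'
  UNIQUE KEY uq_prefix (prefix)
);

CREATE TABLE paths (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  domain_id INT UNSIGNED NOT NULL,
  path      VARCHAR(255) NOT NULL, -- remainder of the URL, including query string
  KEY idx_domain (domain_id)
);

-- view that reassembles the full URL so existing queries keep working
CREATE VIEW urls_v AS
SELECT p.id, CONCAT(d.prefix, p.path) AS url
FROM paths p JOIN domains d ON d.id = p.domain_id;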
If you are looking for 128-bit integers, you can use BINARY(16), where 16 is bytes. You can extend it up to 64 bytes (512 bits), so it doesn't take more space than the BIT data type. You can think of the BINARY data type as an expansion of the BIT data type, but as its string variant.
Having said that, I would suggest dictionary algorithms to compress URLs and short strings, blended with the technique used by url-shortening services: three-character combinations of A-Z, a-z and 0-9 to replace long dictionary words, which gives you 62 x 62 x 62 = 238,328 combinations, more than the number of words you'd need.
Though I am not sure what level of compression you would achieve, it's not a bad idea to implement url compression this way.

Maximum number of rows in an MS Access database engine table?

We know the MS Access database engine is 'throttled' to allow a maximum file size of 2GB (or perhaps internally wired to be limited to fewer than some power of 2 of 4KB data pages). But what does this mean in practical terms?
To help me measure this, can you tell me the maximum number of rows that can be inserted into a MS Access database engine table?
To satisfy the definition of a table, all rows must be unique, therefore a unique constraint (e.g. PRIMARY KEY, UNIQUE, CHECK, Data Macro, etc) is a requirement.
EDIT: I realize there is a theoretical limit but what I am interested in is the practical (and not necessarily practicable), real life limit.
Some comments:
Jet/ACE files are organized in data pages, which means there is a certain amount of slack space when your record boundaries are not aligned with your data pages.
Row-level locking will greatly reduce the number of possible records, since it forces one record per data page.
In Jet 4, the data page size was increased to 4KB (from 2KB in Jet 3.x). As Jet 4 was the first Jet version to support Unicode, character data is stored double-byte, which means a 2GB file holds only about 1GB's worth of text, or close to 2GB with Unicode compression turned on. So the number of records is going to be affected by whether or not you have Unicode compression on.
Since we don't know how much room in a Jet/ACE file is taken up by headers and other metadata, nor precisely how much room index storage takes, the theoretical calculation is always going to be under what is practical.
To get the most efficient storage possible, you'd want to use code to create your database rather than the Access UI, because Access creates certain properties that pure Jet does not need. Not that there are a lot of these: properties left at the Access defaults are usually not set at all (a property is created only when you change it from the default value; you can see this by cycling through a field's properties collection, where many of the properties listed for a field in the Access table designer are missing because they haven't been set). You might also want to limit yourself to Jet-specific data types (hyperlink fields are Access-only, for instance).
I just wasted an hour mucking around with this, using Rnd() to populate 4 fields defined as type Byte with a composite PK on the four fields, and it took forever to append enough records to get up to any significant portion of 2GB. At over 2 million records, the file was under 80MB. I finally quit after reaching 7 million records, with the file compacting to 184MB. The amount of time it would take to get up near 2GB is just more than I'm willing to invest!
Here's my attempt:
I created a single-column (INTEGER) table with no key:
CREATE TABLE a (a INTEGER NOT NULL);
Inserted integers in sequence starting at 1.
I stopped it (arbitrarily after many hours) when it had inserted 65,632,875 rows.
The file size was 1,029,772 KB.
I compacted the file which reduced it very slightly to 1,029,704 KB.
I added a PK:
ALTER TABLE a ADD CONSTRAINT p PRIMARY KEY (a);
which increased the file size to 1,467,708 KB.
This suggests the maximum is somewhere around the 80 million mark.
As others have stated it's combination of your schema and the number of indexes.
A friend had about 100,000,000 historical stock prices (daily closing quotes) in an MDB which approached the 2GB limit.
He pulled them down using some code found in a Microsoft Knowledge base article. I was rather surprised that whatever server he was using didn't cut him off after the first 100K records.
He could view any record in under a second.
It's been some years since I last worked with Access but larger database files always used to have more problems and be more prone to corruption than smaller files.
Unless the database file is only being accessed by one person or stored on a robust network you may find this is a problem before the 2GB database size limit is reached.
We're not necessarily talking theoretical limits here; we're talking about real-world limits of the 2GB max file size AND the database schema.
Is your db a single table or multiple?
How many columns does each table have?
What are the datatypes?
The schema is on even footing with the row count in determining how many rows you can have.
We have used Access MDBs to store exports of MS-SQL data for statistical analysis by some of our corporate users. In those cases we exported our core table structure, typically four tables with 20 to 150 columns, varying from a hundred bytes per row to upwards of 8,000 bytes per row. In these cases, only a few hundred thousand rows of data were permissible PER MDB that we would ship to them.
So, I just don't think that this question has an answer in absence of your schema.
It all depends. Theoretically, using a single column with a 4-byte data type, 2GB divided by a few bytes per row works out to hundreds of millions of rows, but there is probably a lot of overhead in the database even before you do anything. I read somewhere that you could have 1,000,000 rows, but again, it all depends...
You can also link databases together, limiting yourself only by disk space.
Practical = 'useful in practice' - so the best you're going to get is anecdotal. Everything else is just prototyping and testing results.
I agree with others - determining 'a max quantity of records' is completely dependent on schema - # tables, # fields, # indexes.
Another anecdote for you: I recently hit a 1.6GB file size with 2 primary data stores (tables) of 36 and 85 fields respectively, with some subset copies in 3 additional tables.
Who cares if data is unique or not; it's only material if the context says it is. Data is data is data, unless duplication affects handling by the indexer.
The total row count making up that 1.6GB is 1.72M.
When working with 4 large DB2 tables I not only found the limit, but it caused me to look really bad to a boss who thought that I could append all four tables (each with over 900,000 rows) into one large table. The real-life result was that, regardless of how many times I tried, the table (which had exactly 34 columns - 30 text and 3 integer) would spit out the cryptic message "Cannot open database: unrecognized format or the file may be corrupted". Bottom line: less than 1,500,000 records, and just a bit more than 1,252,000, with 34 columns.