I'm working on an application that saves phone numbers; the mask for the phone number is (99) 9999-9999.
Should I save the whole string in the database, i.e.:
(99) 9999-9999
or just the data i.e.:
9999999999
and only format it in the UI?
I'm leaning towards the second one, but I couldn't give good reasons why. My colleague's argument was that the first one (the one with the mask) would be easier, since it wouldn't be necessary to apply the mask in different UIs (reports, webpage).
Separating data from presentation logic is good practice.
I suggest you store only the number: the number is data, and the formatting is not (store only data in the database).
Second, maybe you have one format for now, but believe me, at some point you will need another format and then you will have to re-format everything (some kind of Murphy's law).
Of course, for performance reasons you can cache the visual presentation: create an additional field (or fields) for it, use it for display, and update it whenever the main "data" field is updated.
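As a concrete illustration (a minimal sketch, not taken from the question), the UI-side formatting is a one-liner, which is a good sign that the presentation layer is the right place for it:

```python
# A minimal sketch: store only the 10 digits and apply the
# (99) 9999-9999 mask at display time.
def format_phone(digits: str) -> str:
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected exactly 10 digits")
    return f"({digits[:2]}) {digits[2:6]}-{digits[6:]}"

print(format_phone("1199998888"))  # -> (11) 9999-8888
```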
Suppose one is trying to save API responses for later analytics, where a single response has about 1,000 persons.
Each object has about 26 properties.
The API query is made every 5 minutes for example.
{person1 : {propertyA:a1, propertyB:b1 ....... propertyZ:z1}
person2 : {propertyA:a2, propertyB:b2 ....... propertyZ:z2}
....
....
person999: {propertyA:a999, propertyB:b999 ....... propertyZ:z999}
person1000: {propertyA:a1000, propertyB:b1000 ....... propertyZ:z1000}}
What is the best way to store this kind of data for later analytics? What kind of database? (The simpler the better.)
Should the responses of such API calls be stored as single rows, or should each object's properties get their own columns? Or some other way, like a JSON database?
Note: the set of persons might change over time. For example, person100 might stop being updated or become inactive, so a future API response might not include person100, while a new record for person1001 might be added (unrelated to person100 becoming inactive).
Additional info:
Data would be updated every 5 minutes for, say, 5 years (to give an idea of the usage/retention of the data).
Queries would mostly be limited to how a given personX changes over a time frame likely to range from a few hours to over 6 months.
Each person is likely to have the same or a similar set of properties, although their values would obviously change over time.
the simpler the better
The simplest approach would presumably be to keep the results of each API query in a single file, though if you did so, it would probably be best to use a JSONLines format, with one line per person. In either case, I would almost certainly add an 'id' field to make it trivially easy to query for a particular person, and to migrate the data elsewhere should that become necessary.
A variant of the above would be to have one file per person, again with a JSONLines format, but with the addition of some kind of timestamp.
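As a rough sketch of either variant (only the 'id' and timestamp fields come from the suggestion above; the rest is assumed for illustration):

```python
import json, time

def append_response(response: dict, path: str = "responses.jsonl") -> None:
    """Append one API response to a JSONLines file, one line per person."""
    ts = int(time.time())
    with open(path, "a", encoding="utf-8") as f:
        for person_id, properties in response.items():
            record = {"id": person_id, "timestamp": ts, **properties}
            f.write(json.dumps(record) + "\n")

append_response({"person1": {"propertyA": "a1", "propertyB": "b1"}})
```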
Next up the ladder of complexity, you might want to consider a SQLite database. If you want to retain the JSON format, then you'd presumably want to add
indices, e.g. on the person id.
If the JSON object representation of each person is flat and the property list stable, then the conventional wisdom would be to store the data in columnar format. A reasonable compromise would be to move the properties of interest to columns, and to relegate all the other (relevant) details to JSON-valued columns.
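For example, a hybrid SQLite schema along these lines (table and column names are illustrative, not prescribed) keeps the frequently queried properties as real columns and parks the rest in a JSON-valued column:

```python
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS person_snapshot (
        ts         INTEGER NOT NULL,   -- time of the API call
        person_id  TEXT    NOT NULL,   -- e.g. 'person100'
        property_a TEXT,               -- properties you query often
        property_b TEXT,
        extra      TEXT                -- everything else, as a JSON blob
    )
""")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_person_ts ON person_snapshot (person_id, ts)"
)
conn.commit()
```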
Of course there are umpteen other database options, and you can climb the complexity ladder as high as it goes. Likewise for cost. You might like to look at TimescaleDB for starters.
Managing Scale
If the data for an individual does not change very often, there will
presumably be various ways to reduce the redundancy.
At one end of the spectrum of possibilities, you could simply discard
an entire record if the prior retained record for that person is essentially the same.
Towards the other end of the spectrum, you could recast the data as a
series of events that would be easy to store as a table:
timestamp id propertyName value
This would have the advantage of giving you flexibility w.r.t. both
the universe of persons and the set of properties of interest.
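A sketch of such an event table, again using SQLite for concreteness (all names here are assumptions):

```python
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS property_event (
        ts            INTEGER NOT NULL,  -- when the value was observed
        person_id     TEXT    NOT NULL,
        property_name TEXT    NOT NULL,
        value         TEXT
    )
""")
# Typical query: how did propertyA of person100 change over a time window?
rows = conn.execute(
    "SELECT ts, value FROM property_event"
    " WHERE person_id = ? AND property_name = ? AND ts BETWEEN ? AND ?"
    " ORDER BY ts",
    ("person100", "propertyA", 0, 2_000_000_000),
).fetchall()
```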
See also https://www.timescale.com/blog/time-series-compression-algorithms-explained/
Footnote: The PmWiki system https://en.m.wikipedia.org/wiki/PmWiki illustrates how a fairly complex “database” system can be constructed using the underlying file system.
I have a requirement to store NANP (North American Numbering Plan) numbers. This means I don't care about, and don't need to bother with, international numbers.
The numbering plan goes like this:
NPA-NXX-XXXX
I would filter and strip extra spaces or dashes (-) to get the number into the correct 10-digit format. Currently we use MySQL and CouchDB for some other stuff, but I would prefer to keep this in MySQL as the preferred storage system.
I'm looking for fast read operations to match numbers at runtime; writes can be a little slow, as most inserts/updates will happen during off hours.
Since it is given that NPA and NXX never start with 0, they could be separated and stored as integer types in case we ever want to break the number down.
For the NoSQL case, it is possible to generate a separate document for each area code and then further isolate NXX and XXXX.
For the RDBMS case, the full number can be stored as an indexed integer for fast access.
What would be the best database design to store these numbers?
Thanks in advance.
I'm looking for fast read operations to match numbers at runtime
With CouchDB you can store every number as the ID of a doc, e.g.
{
_id: "NPA-NXX-XXXX",
_rev: "1-..."
}
To match any number, you send a lightweight
HEAD path/to/CouchDB/dbname/NPA-NXX-XXXX
and it will respond with status code 200 (match) or 404 (no match).
Write operations can be done in large bulk requests (/dbname/_bulk_docs).
Because the numbers are stored as IDs, CouchDB's primary index can be used for the HEAD requests (as described above), which means that every write is immediately available for reads.
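A minimal client-side sketch of that check, using only Python's standard library (the host and the database name "numbers" are assumptions, not part of the answer):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def number_exists(number: str, base: str = "http://localhost:5984/numbers") -> bool:
    """HEAD the doc whose _id is the number; 200 means match, 404 means no match."""
    try:
        with urlopen(Request(f"{base}/{number}", method="HEAD")) as resp:
            return resp.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise

print(number_exists("212-555-0142"))
```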
I am building a complex ordering system, and I am struggling with whether I should store some of the more detailed information in a single column as JSON, or create multiple tables and the logic needed to keep JSON out of the picture.
Since each order will have multiple required dates, ship dates, parts, kits (collections of parts), and more, it just seems easier to store this as JSON in a single 'order' row.
Are there any major downsides to doing this?
JSON is geared more towards short-term storage, for sending data from one thing to another. Compared to a database, it is horribly inefficient for long-term storage, both space-wise and computationally. You also lose the ability to query the data directly without parsing it first (e.g. "select * from table where orderdate < today"). You'll also have to develop your own tools to view the data, since if you try to view it in the database directly, everything will run together.
In short, this is almost always a really bad idea.
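To make the query point concrete, here is a minimal sketch (table and column names are assumptions, not from the question) of keeping ship dates in their own rows so they remain directly queryable:

```python
import sqlite3

conn = sqlite3.connect("orders.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id INTEGER PRIMARY KEY
    );
    CREATE TABLE IF NOT EXISTS order_ship_dates (
        order_id  INTEGER NOT NULL REFERENCES orders(order_id),
        ship_date TEXT    NOT NULL    -- ISO-8601 keeps date comparisons simple
    );
""")
# Directly queryable, unlike a date buried inside a JSON blob:
overdue = conn.execute(
    "SELECT order_id FROM order_ship_dates WHERE ship_date < date('now')"
).fetchall()
```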
I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step 2, and how to retrieve the data for step 3, is what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is that my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000-record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is just a way of showing those bytes in human-readable form. If you store them properly, you're at 9.5 MB instead of 22.
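A quick sketch of the idea (the digest below is just an example value):

```python
hex_id = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"   # hex string as shown in the UI
raw_id = bytes.fromhex(hex_id)                        # 20 raw bytes for storage

assert len(raw_id) == 20
assert raw_id.hex() == hex_id   # convert back to hex only for display
```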
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id).) That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
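A sketch of the first option's table shape, shown with SQLite for brevity (the answer assumes MySQL, so types differ slightly; item_id holds the 20-byte SHA1 from the first trick):

```python
import sqlite3

conn = sqlite3.connect("searches.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS saved_searches (
        search_id INTEGER NOT NULL,   -- incrementing id per saved search
        item_id   BLOB    NOT NULL,   -- 20-byte SHA1 of a result object
        PRIMARY KEY (search_id, item_id)
    )
""")

def save_search(search_id: int, item_ids: list) -> None:
    # Inserting the keys in sorted order keeps each search clustered.
    conn.executemany(
        "INSERT INTO saved_searches (search_id, item_id) VALUES (?, ?)",
        ((search_id, i) for i in sorted(item_ids)),
    )
    conn.commit()
```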
I am working on an embed section of my site where users can embed media from various services: YouTube, MySpace Music, Vimeo, etc.
I am trying to work out the best way to store it. Users do not have to embed all of the options, and they can embed only one of each type (one video, for example).
Initially I thought I'd just have a table with a row per embedded item, like so:
embedid (auto increment primary key), userid, embedded_item_id (e.g. a youtube id)
but then I realised that some embeddable items require multiple arguments (MySpace Music, for example), so I thought I'd make a table where each user has one row:
userid, youtubeid, vimeoid, myspaceid1, myspaceid2
but it seems a bit clumsy, especially considering there will always be empty columns, as users can never have all of them. Does anyone have a better solution?
An EmbededItem table has the columns common to all items.
YouTube, Vimeo, and MySpace tables have only the columns specific to each one.
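A minimal sketch of that supertype/subtype layout (shown with SQLite for concreteness; column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect("embeds.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS EmbededItem (      -- columns common to all items
        embed_id  INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        mediatype TEXT    NOT NULL                 -- 'youtube', 'vimeo', 'myspace'
    );
    CREATE TABLE IF NOT EXISTS YouTube (
        embed_id   INTEGER PRIMARY KEY REFERENCES EmbededItem(embed_id),
        youtube_id TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS MySpace (
        embed_id    INTEGER PRIMARY KEY REFERENCES EmbededItem(embed_id),
        myspace_id1 TEXT NOT NULL,
        myspace_id2 TEXT                           -- the extra argument MySpace needs
    );
""")
```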
So, here's what I'd do in such a situation:
Set up your table with columns for your primary key and userid fields, plus anything else you may need to identify the user or application (maybe a 'mediatype' field). Put the rest into a VARCHAR field, made large enough to hold lots of data. I'm not sure how much space you would need, but I'd venture a guess of between 1K and 4K+.
The reason for a VARCHAR field: you never know what other new fields you will need in the future. Let's say next year youtube adds another parameter, or a new media format comes along. If you model your database to represent all fields individually, you will create an application that is not scalable to future or other media formats. Such modeling is great when you're describing a system on paper, but not so good when you implement code.
So, now that you have a varchar field to store all your data in, you have several options for how to store the data:
1. You can store the data as an XML document and parse it on input/output (but you will most likely need more than 4K of space), and you will incur the cost of parsing XML.
2. You can store the data in whatever format your application needs (a serialized object for Java, JSON for JavaScript, etc.). If you're serializing an object, you may also need more than 4K of space, and a VARBINARY field, not VARCHAR.
3. A comma-delimited string, although this fails if your strings contain commas. I probably would not recommend this.
4. Null-delimited key/value pair strings, with a double null at the end. You will need a VARBINARY data field for this one.
Number 4 is my favorite, and something I would recommend. I've used this pattern in an existing web project, where my strings are stored in the format:
'uid=userid/0var1=value1/0val2=value2/0url=urltosite/0/0'
Works like a charm. I use the data to build dynamic web pages for my users. (My application is C though, so it deals well with parsing a character array).
Your application could use the data from your first columns (like 'mediatype') to execute specific parsing routines if required, and use the VARCHAR/VARBINARY fields as input. Scaling to new types of embeddable media would be as simple as writing a new (or extending an existing) parser and defining a new 'mediatype' value.
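For illustration, a small parser for that null-delimited format, written in Python rather than the answerer's C (here '\0' plays the role of the '/0' shown above):

```python
def parse_pairs(blob: bytes) -> dict:
    """Parse 'key=value' fields separated by nulls, ending in a double null."""
    text = blob.decode("utf-8").rstrip("\0")        # drop the trailing double null
    pairs = (field.split("=", 1) for field in text.split("\0") if field)
    return {key: value for key, value in pairs}

blob = b"uid=userid\0var1=value1\0val2=value2\0url=urltosite\0\0"
print(parse_pairs(blob))
# {'uid': 'userid', 'var1': 'value1', 'val2': 'value2', 'url': 'urltosite'}
```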
Hope this helps.