How best to create views for temporal data in Couchbase

I am interested in building a website for news, review articles, etc.
A key piece of information I need to check is that an article is within its valid time range, i.e. between its start date and stop date, and that it is flagged as online.
How can I write a view for that? Ideally you would want a view like:
function (doc, meta) {
  // 'now' is evaluated when the view is indexed, not when it is queried
  var now = new Date();
  if (doc.online && doc.startDate <= now && now <= doc.stopDate) {
    emit(doc.headline, null);
  }
}
but that won't work if I understand views correctly, i.e. they are only updated on document insert/update, so now would be fixed at index time rather than evaluated at query time.
Do I really have to use a compound key that includes the start and stop dates (e.g. emit([doc.startDate, doc.stopDate], null)) so I can do range-type queries, and then also deal with other complexities, e.g. the type of article (news, etc.) and the tagging of articles?
Thanks


How to keep updates (diffs) of some entity in the database

What is the best way to keep updates (diffs) of some entity in the database? Here at StackOverflow we can edit questions and answers, and then we can look at any revision of the question or answer we want, for example the revisions of some random question. Maybe someone knows how this is implemented at StackOverflow?
To be clear, in my case I have some entity (article) with some fields (name, description, content). Many users can edit the same article. I want to keep a history of the article's updates (something like version control), and I want to keep only diffs, not the whole content of the updated article. By the way, I use PostgreSQL, but I can migrate to any other database.
UPD
I opened a bounty, so here are some requirements. You don't need to satisfy them fully, but if you do, it will be much better; nevertheless, any answer is much appreciated. I want the ability:
to keep only diffs, so as not to waste space for no purpose;
to fetch any revision (version) of some article, where fetching the last revision of the article must be really quick (the fetching speed of other revisions is not so important);
to fetch any diff (and list of diffs) of some article. An article can have changes in several fields: header, description, or content (like StackOverflow has changes in header and content), so this must be taken into account.
In the past, I have used diff-match-patch with excellent (and fast) results. It is available for several languages (my experience with it was in C#). I did not use it for exactly the process you are describing (we were interested in merging), but it seems to me you could:
Save the initial version of an article's text/header/whatever.
When a change is made, use diff-match-patch to compute a patch between the newly edited version and what is already in the database. To get the latest version in the database, simply apply any patches that have already been generated to the original article in order.
Save the newly generated patch.
If you wanted to speed things up even more, you could cache the latest version of the article in its own row/table/however-you-organize-things so that getting the latest version is a simple SELECT. This way, you'd have the initial version, the list of patches, and the current version, giving you some flexibility and speed.
Since you have a set of patches in sequence, fetching any version of the article would simply be a matter of applying patches up to the one desired.
You can take a look at the patch demo to see what its patches look like and get an idea of how big they are.
Like I said, I have not used it for exactly this scenario, but diff-match-patch has been designed for doing more or less exactly what you are talking about. This library is on my short list of software I can use when I have no restrictions on libraries developed out-of-house.
Update: Some example pseudocode
As an example, you could set up your tables like so (this assumes a few other tables, like Authors):
Articles
--------
id
authorId
title
content
timestamp
ArticlePatches
--------------
id
articleId
patchText
timestamp
CurrentArticleContents
----------------------
id
articleId
content
timestamp
Then some basic CRUD could look like:
Insert new article:
INSERT INTO Articles (authorId, title, content, timestamp)
VALUES (@authorId, @title, @content, GETDATE())

INSERT INTO CurrentArticleContents (articleId, content, timestamp)
VALUES (SCOPE_IDENTITY(), @content, GETDATE())
GO
Get all articles with latest content for each:
SELECT
a.id,
a.authorId,
a.title,
cac.content,
a.timestamp AS originalPubDate
FROM Articles a
INNER JOIN CurrentArticleContents cac
ON a.id = cac.articleId
Update an article's content:
//this would have to be done programmatically
currentContent =
    (SELECT content
     FROM CurrentArticleContents
     WHERE articleId = @articleId)
//using the diff-match-patch API
patches = patch_make(currentContent, newContent);
patchText = patch_toText(patches);
//setting @patchText = patchText and @newContent = newContent:
INSERT INTO ArticlePatches (articleId, patchText, timestamp)
VALUES (@articleId, @patchText, GETDATE())

--the cached current version is one row per article, so update it in place
UPDATE CurrentArticleContents
SET content = @newContent, timestamp = GETDATE()
WHERE articleId = @articleId
GO
Get the article at a particular point in time:
//again, programmatically
originalContent = (SELECT content FROM Articles WHERE articleId = @articleId)
patchTexts =
    (SELECT patchText
     FROM ArticlePatches
     WHERE articleId = @articleId
       AND timestamp <= @selectedDate
     ORDER BY timestamp ASC)
content = originalContent
foreach(patchText in patchTexts)
{
//more diff-match-patch API
patches = patch_fromText(patchText)
content = patch_apply(patches, content)[0]
}
I had a similar issue in my workplace.
I implemented an AFTER UPDATE trigger to record all the needed data in another table (where you can of course save only the changed fields). The new value lives in the real table, while the log lives in the other table.
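As a minimal sketch of that trigger approach in PostgreSQL (which the question mentions); the articles table and the article_log table here are assumptions for illustration:

CREATE TABLE article_log (
    id          serial PRIMARY KEY,
    article_id  integer NOT NULL,
    old_content text,
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION log_article_update() RETURNS trigger AS $$
BEGIN
    -- record the previous content only when it actually changed
    IF NEW.content IS DISTINCT FROM OLD.content THEN
        INSERT INTO article_log (article_id, old_content)
        VALUES (OLD.id, OLD.content);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_update_log
AFTER UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION log_article_update();
-- (EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE)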
Ok, first @PaulGriffin's answer is complete. And @VladimirBaranov's also got me thinking about the optimal way to do updates. I've got a tasty way to handle the use case of frequent updates and less frequent reads: lazy full-record updates.
For example, editing a document online, possibly from different devices. This optimizes for light network traffic and infrequent large db record updates, and assumes clients do not need to jump to specific versions (they always get the latest).
The database has a collection of deltas, indexed by version (which could be a timestamp), and a lastDocument with a lastDocUpdate (version/timestamp).
Frequent use case: Client Edits
Send only the delta to the Server & update database with delta
Use Case: Client with old document version requests updates
Send all deltas since current client version
Less Frequent use case: New device, no prior data on client
On Server, look at lastDocument, apply deltas since lastDocUpdate
Save updated lastDocument in db, and send to client
The most expensive operation is updating the full document in the database, but it is only done when necessary, i.e. when a client has no version of the document.
That rare action is what actually triggers the full document update.
This setup has no extra db writes and minimal data sent to the client (which updates its doc with the deltas), and the large text field is updated only when we are already applying the deltas on the server because a client needs the full document. A sketch of the schema is below.
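A minimal SQL sketch of this scheme; all table and column names here are illustrative assumptions:

CREATE TABLE doc_deltas (
    doc_id  integer NOT NULL,
    version integer NOT NULL,      -- could also be a timestamp
    delta   text    NOT NULL,
    PRIMARY KEY (doc_id, version)
);

CREATE TABLE documents (
    doc_id          integer PRIMARY KEY,
    last_document   text,          -- lazily updated full copy
    last_doc_update integer        -- version of last_document
);

-- Frequent case, client edit: just append the delta.
INSERT INTO doc_deltas (doc_id, version, delta)
VALUES (:doc_id, :version, :delta);

-- Client at version N asks for updates: send only the newer deltas.
SELECT version, delta
FROM doc_deltas
WHERE doc_id = :doc_id AND version > :client_version
ORDER BY version;

-- New device: read last_document, apply the deltas with
-- version > last_doc_update in application code, then write the
-- patched document and its version back to documents.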

Rails - Saving Daily Metrics

I'm currently developing a Rails application, on top of PostgreSQL, that stores daily data for our company. We run ads on Facebook, and we have a few hundred ads running at any one time. I pull metrics every day and import them into my application, which then either creates or updates records depending on whether they already exist. However, I want to be able to see daily performance over the course of, say, a week or month. What would be the easiest way to accomplish this?
My facebook_ad model has one row for each ad campaign. Each column denotes a specific metric, e.g. amount spent, clicks, etc. Should I create a new table for each date? Is there a way to timestamp every entry and include the time in my queries? I've made good progress up until here, and no amount of searching has brought me to a strategy I could use.
Side note, we are hoping to get access to their API, which would probably solve most of this. But we want to build something in the interim, so we can be as efficient as possible until then, which could be 6 months or more.
Edited:
I want to query and graph the data based on the daily data. For example, grab the metrics from 10/01/14 - 10/08/14 for one ad, and be able to see 10/01/14: MetricA = 1, MetricB = 2; 10/02/14: MetricA = 4, MetricB = 5; 10/03/14: MetricA = 6, MetricB = 3; etc. We want to be able to see trends and see how changes affect performance.
I would definitely not recommend creating a new table for each date -- that would be a data management nightmare. There shouldn't be any reason you can't have each ad campaign in the same table based on what you've said above. You could have a created and updated column in the table which defaults to now(), and if you update it for any reason, set the updated column to now() again. (I like to add those columns to just about every table I create -- it's often useful for a variety of queries).
You could then query that table based on the desired timeframe to get your performance statistics. Depending upon the exact nature of what you want to query, Window Functions may prove to be quite useful.
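For instance, a sketch in PostgreSQL (the question's database); the ad_metrics table and its columns are assumptions:

CREATE TABLE ad_metrics (
    ad_id        integer NOT NULL,
    metric_date  date    NOT NULL,
    amount_spent numeric,
    clicks       integer,
    created      timestamptz NOT NULL DEFAULT now(),
    updated      timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (ad_id, metric_date)
);

-- Daily metrics for one ad over a date range, with a running total
-- of spend as an example of a window function:
SELECT metric_date,
       clicks,
       amount_spent,
       sum(amount_spent) OVER (ORDER BY metric_date) AS spend_to_date
FROM ad_metrics
WHERE ad_id = :ad_id
  AND metric_date BETWEEN DATE '2014-10-01' AND DATE '2014-10-08'
ORDER BY metric_date;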

mysql - storing a range of values

I have a resource with an availability field that lists during which hours of the day it is available for use.
E.g. res1 is available between hours 0-8 and 19-23 on a day. The field holds comma-separated hour ranges, e.g. 0-23 for 24-hour access, 0-5,19-23, or 0-5,12-15,19-23.
What's the best way to store this? Is char a good option? When the resource is being accessed, my PHP needs to compare the current hour with the hours defined here and then decide whether to allow access or not. Can I ask MySQL to tell me whether the current hour is in the range specified here?
I'd store item availability in a separate table, where for each row I'd have (given your example):
id, startHour, endHour, resourceId
And I'd just use integers for the start and end times. You can then do queries against a join to see availability given a certain hour of the day using HOUR(NOW()) or what have you.
(On the other hand, I would've preferred a non-relational database like MongoDB for this kind of data.)
1) create a table for resource availability, normalized.
CREATE TABLE res_avail (
    ra_resource_id int,
    ra_start TIME,
    ra_end TIME
    # add appropriate keys for optimization here
);
2) populate with ($resource_id, '$start_time', '$end_time') for each range in your list (use explode())
3) then, you can query (for example, from PHP):
$sql = "SELECT ra_resource_id FROM res_avail WHERE ('$time' BETWEEN ra_start AND ra_end)";
....
I know this is an old question, but since v5.7 MySQL supports storing values in JSON format. This means you can store all ranges in one JSON field. This is great if you want to display opening times in your front-end using JavaScript, but it's not the best solution when you want to show all places that are currently open, because querying on a JSON field means a full table scan. It would be okay, though, if you only need to check one place at a time, for example when you load a page showing the details of one place and display whether it's open or closed.
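As an illustrative sketch (assuming MySQL 8.0+, which adds JSON_TABLE, and a hypothetical places table with an open_hours JSON column):

-- open_hours holds e.g. '[{"start": 0, "end": 5}, {"start": 19, "end": 23}]'
SELECT p.id
FROM places p,
     JSON_TABLE(
         p.open_hours, '$[*]'
         COLUMNS (start_h INT PATH '$.start',
                  end_h   INT PATH '$.end')
     ) AS h
WHERE p.id = ?
  AND HOUR(NOW()) BETWEEN h.start_h AND h.end_h;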

Versioned and indexed data store

I have a requirement to store all versions of an entity in an easily indexed way, and I was wondering if anyone has input on what system to use.
Without versioning the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database this would come for free, and I would be able to ask "what was the state of all people, as of yesterday at 2pm, living in Dublin and aged 30?". Unfortunately there don't seem to be any mature open source projects that can do temporal queries.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one may change per update. It is also quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron, here's the query we currently use (in MySQL). It's definitely slow on our table with >200k rows. (id = table key; person_id = id per person, duplicated if the person has many revisions.)
select name
from person p
where p.id = (select max(id)
              from person
              where person_id = p.person_id
                and timestamp <= :timestamp)
Update
It looks like the best way to do this would be with a temporal db, but given that there aren't any open source ones out there, the next best method is to store a new row per update. The only problems are the duplication of unchanged columns and the slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
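A sketch of the linked-version approach in MySQL (the asker's database); the table and column names are assumptions:

CREATE TABLE person_version (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    person_id   INT NOT NULL,   -- stable key per person
    name        VARCHAR(255),
    created     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    replaced_by INT NULL,       -- key of the next version; NULL = current
    previous    INT NULL,       -- back pointer to the previous version
    KEY (person_id)
);

-- Finding the current version is simple:
SELECT * FROM person_version
WHERE person_id = ? AND replaced_by IS NULL;

-- An update first inserts the new row, then follows the back pointer
-- to link the old row forward:
UPDATE person_version
SET replaced_by = ?   -- id of the newly inserted row
WHERE id = ?;         -- id of the previous version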
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!

MySQL - Coming up with a Unique Key for each record, not the Primary Key

Ok this is a tricky one to explain.
I am creating an app that will have PAGES; currently I'm using PageID as the key to SELECT the record.
The issue I'm having now is that I want users to be able to EDIT pages, but not lose the previous page (for history, record-keeping reasons, like a changelog or wiki page history).
This is making me think I need a new field in the PAGE table that acts as the pageID but isn't the Primary Key that is auto-incremented every time a row is added.
Google Docs has a DOCID: /Doc?docid=0Af_mFtumB56WZGM4d3Y3d2JfMTNjcDlkemRjeg
That way I can have multiple records with the same DOCID, show a history changelog based on the dateAdded field, and when a user wants to view that DOCID, simply pull the most recent one.
Thoughts? I appreciate your smart thinking to point me in the right direction!
You're on the right track. What you need is a history or revision id, and a document id. The history id would be the primary key, but you would also have a key on the document id for query purposes.
With history tracking, you add a bit more complexity to your application. You have to be careful that the main view of the document shows the current revision (i.e. the largest history id for a given document id).
As well, if you are storing large documents, every edit is essentially going to add another copy of the document to your database, and the table will quickly grow very large. You might want to consider implementing some kind of "diff" storage, where you store only the changes to the document and not the full thing, or keeping history edits in a separate table for history-searching only.
UUID() generates a 128-bit number, like
'6ccd780c-baba-1026-9564-0040f4311e29'
This number will not be repeated in a few million years. Note that most digits are based upon the timestamp and machine information, so many of the digits will be similar across repeated calls, but the value will always be unique.
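As a sketch of how that could be used here (the page table and its columns are assumptions): generate the shared id once on first save, then reuse it for every revision:

-- First save of a new document: generate the shared id.
INSERT INTO page (doc_id, title, content, date_added)
VALUES (UUID(), 'My page', '...', NOW());

-- Each later edit inserts a new row that reuses the same doc_id.
-- The current version is then just the most recent row:
SELECT *
FROM page
WHERE doc_id = ?
ORDER BY date_added DESC
LIMIT 1;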
Keep an audit table with the history of the changes. This will allow you to go back if you need to roll back changes or to view the change history, for example.
You might model it like this:
An app has multiple pages; a page has multiple versions, each with some version info (e.g., date, edit count) and a foreign key to its page
Viewing a page shows the most recent version
Saving an edit creates a new version
each document is really a revision:
doc - (doc_id)
revision - (rev_id, doc_id, version_num, name, description, content, author_id, active tinyint default 1)
then you can open any content with just the rev_id: /view?id=21981
select r.*, d.*
from revision r
join doc d on r.doc_id = d.doc_id
where r.rev_id = ?
This sounds like a good job for two tables to me. You might have one page_header table and one page_content table. The header table would hold static info like title, categorization (whatever) and the content table would hold the actual editable content. Each time the user updates the page, insert a new page_content record versus updating an existing one. When you display the page just make sure you grab the latest page_content record. This is a simple way to keep a history and roll back if needed.
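A sketch of that two-table layout (all names here are illustrative):

CREATE TABLE page_header (
    page_id INT AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(255)
);

CREATE TABLE page_content (
    content_id INT AUTO_INCREMENT PRIMARY KEY,
    page_id    INT NOT NULL,
    content    TEXT,
    created    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY (page_id)
);

-- Display the page: grab the latest content record.
SELECT c.*
FROM page_content c
WHERE c.page_id = ?
ORDER BY c.content_id DESC
LIMIT 1;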
Good luck!