Validation and Creating unique ID - mysql

Okay - for my project I was asked to identify some validation techniques for a process that we use to transform some data. Let me give you some background.
We receive data from a client: we load the file and only pull in the fields necessary for processing. A few checks are done at this stage. From here we run scripts on the data which essentially do all the heavy lifting (dropping duplicates, checking dates, etc.). Then it runs through a blackbox system and spits out the results.
We have been notified by the client that we are extremely off in our counts for a particular group: roughly $4 million for this one.
We have a process to identify a unique member: we generate a pol_ID and a Suf_ID, and together with the associated groupname these are considered unique in our system and in our processing system.
We need a process to handle the records for these unique members.
A unique member can have one to many claims associated to their name in a given time period.
When we receive claim information, it is generally handled by using the payor_field + claimno + a generated sequence number (sometimes this sequence number is the last two digits of claimno)...
Ex. Three claims come into the system, and after processing through load, we see
the client has repeated the claimno - since we are using the last two digits, they are no longer unique and two of the three records are dropped, with only the first one retained.
WKS-01100 75.02 - stays
WKS-01100 6000.56 - drops
WKS-01100 560.23 - drops
My problem comes into play because we usually assume that if we parse off the last two digits of the claimno, it is unique. In testing this case we have tried creating an explicit incremental sequence number in another column to make the record unique, which then doubles our results.
Now my questions are as follows:
Is there another way to make these claims unique? Auto-increment is not an option. Consider that the client can send duplicate claimnos, which is where our problem lies; they can potentially recycle their claimnos.
Since it's month based, maybe there could be some kind of month id on the end..?
Would any binary representation of the sequence number work? It is an INT data type..
(Also should be noted we deal with historical data that goes back 24 months, and each month we get the next consecutive months data, and we drop the first month in the set)
We are not limited on what we do to transform this claimno so I am open to suggestions...tried to keep it short but let me know if I need to add more info :) Thanks.

Do you have a timestamp saved for each claim? A possible solution is to append the timestamp to make the claim unique.
WKS-01100-1330464175
WKS-01100-1327514036
WKS-01100-1341867984
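A rough sketch of that idea in Python (the claimno / received_at names and the month-suffix variant mentioned in the question are only placeholders, not your actual schema):

import time

def make_claim_key(claimno, received_at):
    # Append the Unix receipt timestamp so recycled claim numbers stay distinct.
    return f"{claimno}-{int(received_at)}"

def make_claim_key_monthly(claimno, period):
    # Alternative from the question: append the data month (YYYYMM), which stays
    # stable across the rolling 24-month window. Note this only separates reuse
    # across months; duplicates inside one month would still need a tiebreaker.
    return f"{claimno}-{period}"

print(make_claim_key("WKS-01100", time.time()))       # e.g. WKS-01100-1330464175
print(make_claim_key_monthly("WKS-01100", "201207"))  # e.g. WKS-01100-201207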


What is the best data structure for "saved search" feature with daily email notifications?

The feature works the following way:
Website has users and users can have any number of their searches
saved (e.g. Jobs in NY, PHP jobs, etc). There are a lot of parameters
involved so this is virtually impossible to index (I am using MySQL).
Every day a number of new jobs get posted to the website
Every 24 hours we take the jobs posted within the last 24 hours and match them up against the existing job searches and then email users about matching jobs.
The problem here is that it is a high-traffic website and even for an optimistic case (few new jobs posted), it takes 10 minutes to run this search query. Are there any classical solutions for this problem? We've been using Sphinx for search-intensive places but I can't apply it here because Sphinx won't return all results, it will cut them off eventually. For now the best thing I could come up with is to have a search.matched_job_ids column and then, whenever a job is posted, match it against all existing searches and record the job id in the matched_job_ids column of matched searches. At the end of the day we will email users and truncate the column. This technically doesn't offer any performance improvement but spreads the load over time by executing many smaller search queries rather than one big query. Are there any better approaches?
Each job can be described by a number of parameters - job sphere, job name, salary and so on. Each parameter has a set of predefined values -
Job sphere - IT, medicine, industry...
Job name - programmer, tester, driver...
Salary - 10-50 thousand per month, 50-100...
Schedule - flexible time, full time, freelance...
Let's encode a saved search. The maximal number of values among all parameters (I believe it is job name) is the base of the numeral system; the number of parameters is the number of digits.
An unsigned BIGINT holds 2^64 - 1 = 18 446 744 073 709 551 615, i.e. 20 decimal digits. In an ordinary base-10 system you can describe 20 - 1 = 19 parameters (the first digit is fixed), each having 10 values. As 10 values is not enough to describe a parameter such as job name, you should use a base-30 to base-60 system. Of course, this decreases the total number of parameters, but I think it is possible to describe a job with 12-15 parameters.
Create a table savedSearches(code, mail) indexed on (code, mail). Index type: primary key.
New job posted:
1) Encode it programmatically.
2) SELECT mail FROM savedSearches WHERE code = calculatedCode. Mail is part of the covering index, so the select should be fast enough.
3) Send the new job to the selected mails.
Important note: one parameter, the host company of the posted job, can have too many values. I think you should store it separately, not in the savedSearches table, as users usually don't care about the company; they care about salary, skills, etc.
If the user wants to search on a non-fixed parameter, for instance not just the programmer position but also tester and team leader, you have to search not a single encoded number but an interval.
My idea is just an assumption, a basis for further investigation.
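To make the encoding concrete, here is a small Python sketch of the idea; the parameter names and value lists are invented for illustration, not taken from the post:

PARAMS = {
    "sphere":   ["any", "IT", "medicine", "industry"],
    "job_name": ["any", "programmer", "tester", "driver"],
    "salary":   ["any", "10-50k", "50-100k", "100k+"],
    "schedule": ["any", "flexible", "full time", "freelance"],
}
BASE = max(len(v) for v in PARAMS.values())   # base of the numeral system

def encode_search(search):
    # One "digit" per parameter; fits a BIGINT while BASE ** len(PARAMS) < 2**63.
    code = 0
    for name, values in PARAMS.items():
        code = code * BASE + values.index(search.get(name, "any"))
    return code

def decode_search(code):
    out = {}
    for name, values in reversed(list(PARAMS.items())):
        code, idx = divmod(code, BASE)
        out[name] = values[idx]
    return out

code = encode_search({"sphere": "IT", "job_name": "programmer", "salary": "50-100k"})
# A newly posted job is encoded the same way and matched with something like:
#   SELECT mail FROM savedSearches WHERE code = %s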

redis as write-back view count cache for mysql

I have a very high throughput site for which I'm trying to store "view counts" for each page in a mySQL database (for legacy reasons they must ultimately end up in mySQL).
The sheer number of views is making it impractical to do SQL "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+1" type of statements. There are millions of items but most are only viewed a small number of times, others are viewed many times.
So I'm considering using Redis to gather the view counts, with a background thread that writes the counts to mySQL. What is the recommended method for doing this? There are some issues with the approach:
how often does the background thread run?
how does it determine what to write back to mySQL?
should I store a Redis KEY for every ITEM that gets hit?
what TTL should I use?
is there already some pre-built solution or powerpoint presentation that gets me halfway there, etc.
I have seen very similar questions on StackOverflow but none with a great answer...yet! Hoping there's more Redis knowledge out there at this point.
I think you need to step back and look at some of your questions from a different angle to get to your answers.
"how often does the background thread run?"
To answer this you need to answer these questions: How much data can you lose? What is the reason for the data being in MySQL, and how often is that data accessed? For example, if the DB is only needed to be consulted once per day for a report, you might only need it to be updated once per day. On the other hand, what if the Redis instance dies? How many increments can you lose and still be "ok"? These will provide the answers to the question of how often to update your MySQL instance and aren't something we can answer for you.
I would use a very different strategy for storing this in redis. For the sake of the discussion let us assume you decide you need to "flush to db" every hour.
Store each hit in hashes with a key name structure along these lines:
interval_counter:DD:HH
interval_counter:total
Use the page id (such as MD5 sum of the URI, the URI itself, or whatever ID you currently use) as the hash key and do two increments on a page view; one for each hash. This provides you with a current total for each page and a subset of pages to be updated.
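As a minimal sketch with redis-py (the exact key names and the MD5-of-URI page id are just one possible choice, as noted above):

import hashlib
import time

import redis

r = redis.Redis()

def record_hit(uri):
    page_id = hashlib.md5(uri.encode()).hexdigest()     # or the URI / your existing ID
    hour_key = time.strftime("interval_counter:%d:%H")  # e.g. interval_counter:27:08
    pipe = r.pipeline()
    pipe.hincrby(hour_key, page_id, 1)                  # subset of pages to flush this hour
    pipe.hincrby("interval_counter:total", page_id, 1)  # running total for display
    pipe.execute()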
You would then have your cron job run a minute or so after the start of the hour to pull down all pages with updated view counts by grabbing the previous hour's hash. This provides you with a very fast means of getting the data to update the MySQL DB with, while avoiding any need to do math or play tricks with timestamps etc. By pulling data from a key which is no longer being incremented you avoid race conditions due to clock skew.
You could set an expiration on the daily key, but I'd rather use the cron job to delete it when it has successfully updated the DB. This means your data is still there if the cron job fails or fails to be executed. It also provides the front-end with a full set of known hit counter data via keys that do not change. If you wanted, you could even keep the daily data around to be able to do window views of how popular a page is. For example if you kept the daily hash around for 7 days by setting an expire via the cron job instead of a delete, you could display how much traffic each page has had per day for the last week.
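A sketch of that cron job, assuming the key layout above and a hypothetical page_views(page_id, view_count) table in MySQL (mysql-connector-python is an assumption as well):

import time

import mysql.connector
import redis

r = redis.Redis()

def flush_previous_hour():
    prev = time.localtime(time.time() - 3600)
    hour_key = time.strftime("interval_counter:%d:%H", prev)   # no longer being incremented
    counts = r.hgetall(hour_key)
    if not counts:
        return
    db = mysql.connector.connect(host="localhost", user="app", password="...", database="stats")
    cur = db.cursor()
    cur.executemany(
        "UPDATE page_views SET view_count = view_count + %s WHERE page_id = %s",
        [(int(cnt), page_id.decode()) for page_id, cnt in counts.items()],
    )
    db.commit()
    r.delete(hour_key)   # only delete once the DB update has succeeded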
Executing the two HINCRBY operations, either individually or pipelined, still performs quite well and is more efficient than doing calculations and munging data in code.
Now for the question of expiring the low-traffic pages vs memory use. First, your data set doesn't sound like one which will require huge amounts of memory. Of course, much of that depends on how you identify each page. If you have a numerical ID the memory requirements will be rather small. If you still wind up with too much memory, you can tune it via the config, and if need be you could even use a 32 bit compile of Redis for a significant memory use reduction. For example, the data I describe in this answer I used to manage for one of the ten busiest forums on the Internet, and it consumed less than 3GB of data. I also stored the counters in far more "temporal window" keys than I am describing here.
That said, in this use case Redis is the cache. If you are still using too much memory after the above options you could set an expiration on keys and add an expire command to each hit. More specifically, if you follow the above pattern you will be doing the following per hit:
hincr -> total
hincr -> daily
expire -> total
This lets you keep anything that is actively used fresh by extending its expiration every time it is accessed. Of course, to do this you'd need to wrap your display call to catch a null answer for HGET on the totals hash and populate it from the MySQL DB, then increment. You could even do both as an increment. This would preserve the above structure and would likely be the same codebase needed to update the Redis server from the MySQL DB if the Redis node needed repopulation. For that you'll need to consider and decide which data source will be considered authoritative.
You can tune the cron job's performance by modifying your interval in accordance with the data-integrity parameters you determined from the earlier questions. To get a faster-running cron job, you decrease the window. With this method, decreasing the window means you should have a smaller collection of pages to update. A big advantage here is you don't need to figure out what keys you need to update and then go fetch them: you can do an HGETALL and iterate over the hash's keys to do updates. This also saves many round trips by retrieving all the data at once. In either case, you will likely want to consider a second Redis instance, slaved to the first, to do your reads from. You would still do deletes against the master, but those operations are much quicker and less likely to introduce delays in your write-heavy instance.
If you need disk persistence of the Redis DB, then certainly put that on a slave instance. Otherwise if you do have a lot of data being changed often your RDB dumps will be constantly running.
I hope that helps. There are no "canned" answers because to use Redis properly you need to think first about how you will access the data, and that differs greatly from user to user and project to project. Here I based the route taken on this description: two consumers accessing the data, one to display only and the other to determine updates to another datasource.
Consolidation of my other answer:
Define a time interval in which the transfer from Redis to MySQL should happen, e.g. minute, hour or day. Define it in a way that an identifying key can be obtained quickly and easily. This key must be ordered, i.e. a smaller time should give a smaller key.
Let it be hourly and the key be YYYYMMDD_HH for readability.
Define a prefix like "hitcount_".
Then for every time-interval you set a hash hitcount_<timekey> in redis which contains all requested items of that interval in the form ITEM => count.
There are two parts to the solution:
The actual page that has to count:
a) get the current $timekey, e.g. via date functions
b) get the value of $ITEM
c) send the Redis command HINCRBY hitcount_$timekey $ITEM 1
A cronjob which runs in that given interval, but not too close to the boundary of the intervals (for example: not exactly at the full hour). This cronjob:
a) Extracts the current time-key (for now it would be 20130527_08)
b) Requests all matching keys from redis with KEYS hitcount_* (those should be a small number)
c) compares every such hash against the current hitcount_<timekey>
d) if that key is smaller than current key, then process it as $processing_key:
read all pairs ITEM => counter by HGETALL $processing_key as $item, $cnt
update the database with UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + $cnt WHERE ITEM = $item
delete that key from the hash by HDEL $processing_key $item
no need to del the hash itself - there are no empty hashes in redis as far as I tried
If you want a TTL involved, say because the cleanup cronjob may not be reliable (it might not run for many hours), then the cronjob could create the future hashes with an appropriate TTL; that means for now we could create a hash 20130527_09 with a TTL of 10 hours, 20130527_10 with a TTL of 11 hours, and 20130527_11 with a TTL of 12 hours. The problem is that you would need a pseudokey, because empty hashes seem to be deleted automatically.
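A compact sketch of both parts (page side and cron side) with redis-py; the MySQL update is passed in as a callable to keep it short, and all names are illustrative:

import time

import redis

r = redis.Redis()
PREFIX = "hitcount_"

def count_hit(item):
    timekey = time.strftime("%Y%m%d_%H")        # e.g. 20130527_08
    r.hincrby(PREFIX + timekey, item, 1)

def flush_old_intervals(update_mysql):
    current = PREFIX + time.strftime("%Y%m%d_%H")
    for key in r.keys(PREFIX + "*"):            # only a handful of keys should exist
        key = key.decode()
        if key < current:                       # process only finished intervals
            for item, cnt in r.hgetall(key).items():
                update_mysql(item.decode(), int(cnt))   # UPDATE ... VIEW_COUNT + %s
                r.hdel(key, item)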
See EDIT3 for the current state of the answer.
I would write a key for every ITEM. A few tens of thousands of keys are definitely no problem at all.
Do the pages change very much? I mean do you get a lot of pages that will never be called again? Otherwise I would simply:
add the value for an ITEM on page request.
every minute or 5 minutes, call a cronjob that reads the Redis keys, reads the value (say 7) and reduces it with DECRBY ITEM 7; in MySQL you then increment the value for that ITEM by 7 (a sketch of this follows below).
If you have a lot of pages/ITEMs which will never be called again, you could run a cleanup job once a day to delete keys with value 0. This should be locked against incrementing that key again from the website.
I would set no TTL at all, so the values live forever. You could check the memory usage, but with current amounts of memory (measured in GB) there is room for a lot of different pages.
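A sketch of this read-then-DECRBY write-back (key names and the MySQL call are placeholders):

import redis

r = redis.Redis()

def flush_item(item, add_to_mysql):
    cnt = int(r.get(item) or 0)
    if cnt == 0:
        return
    r.decrby(item, cnt)        # hits arriving after the GET stay counted in Redis
    add_to_mysql(item, cnt)    # e.g. UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + %s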
EDIT: incr is very nice for that, because it sets the key if not set before.
EDIT2: Given the large number of different pages, instead of the slow KEYS * command you could use hashes with HINCRBY (http://redis.io/commands/hincrby). Still, I am not sure whether HGETALL is much faster than KEYS *, and a hash does not allow a TTL for single keys.
EDIT3: Oh well, sometimes the good ideas come late. It is so simple: just prefix the key with a timeslot (say day-hour) or make a HASH with name "requests_". Then no overlapping of delete and increment can happen! Every hour you take the keys with older "day_hour_*" values, update MySQL, and delete those old keys. The only condition is that your servers' clocks are not too far apart, so use UTC and synchronized servers, and don't start the cron at x:01 but at x:20 or so.
That means: a called page converts a call of ITEM1 at 23:37, May 26 2013 to Hash 20130526_23, ITEM1. HINCRBY count_20130526_23 ITEM1 1
One hour later the list of keys count_* is checked, and all keys up to count_20130526_23 are processed (read the key-values with HGETALL, update MySQL), then deleted one by one after processing (HDEL). After finishing that, you check whether HLEN is 0 and DEL count_...
So you only have a small amount of keys (one per unprocessed hour), that makes keys count_* fast, and then process the actions of that hour. You can give a TTL of a few hours, if your cron is delayed or timejumped or down for a while or something like that.

Should id or timestamp be used to determine the creation order of rows within a database table? (given possibility of incorrectly set system clock)

A database table is used to store editing changes to a text document.
The database table has four columns: {id, timestamp, user_id, text}
A new row is added to the table each time a user edits the document. The new row has an auto-incremented id, and a timestamp matching the time the data was saved.
To determine what editing changes a user made during a particular edit, the text from the row inserted in response to his or her edit is compared to the text in the previously inserted row.
To determine which row is the previously inserted row, either the id column or the timestamp column could be used. As far as I can see, each method has advantages and disadvantages.
Determining the creation order using id
Advantage: Immune to problems resulting from incorrectly set system clock.
Disadvantage: Seems to be an abuse of the id column since it prescribes meaning other than identity to the id column. An administrator might change the values of a set of ids for whatever reason (eg. during a data migration), since it ought not matter what the values are so long as they are unique. Then the creation order of rows could no longer be determined.
Determining the creation order using timestamp
Advantage: The id column is used for identity only, and the timestamp is used for time, as it ought to be.
Disadvantage: This method is only reliable if the system clock is known to have been correctly set each time a row was inserted into the table. How could one be convinced that the system clock was correctly set for each insert? And how could the state of the table be fixed if ever it was discovered that the system clock was incorrectly set for a not precisely known period in the past?
I seek a strong argument for choosing one method over the other, or a description of another method that is better than the two I am considering.
Using the sequential id would be simpler as it's probably(?) a primary key and thus indexed and quicker to access. Given that you have user_id, you can quickly ascertain the last and prior edits.
Using the timestamp is also applicable, but it's likely to be a longer entry, we don't know if it's indexed at all, and there is the potential for collisions. You rightly point out that system clocks can change, whereas sequential ids cannot.
Given your update:
As it's difficult to see what your exact requirements are, I've included this as evidence of what a particular project required for 200K+ complex documents and millions of revisions.
From my own experience (building a fully auditable doc/profiling system) for an internal team of more than 60 full-time researchers. We ended up using both an id and a number of other fields (including timestamp) to provide audit-trailing and full versioning.
The system we built has more than 200 fields for each profile, and thus versioning a document was far more complex than just storing a block of changed text/content for each one; yet each profile could be edited, approved, rejected, rolled back, published and even exported as either a PDF or another format as ONE document.
What we ended up doing (after a lot of strategy/planning) was to store sequential versions of the profile, but they were keyed primarily on an id field.
Timestamps
Timestamps were also captured as a secondary check and we made sure of keeping system clocks accurate (amongst a cluster of servers) through the use of cron scripts that checked the time-alignment regularly and corrected them where necessary. We also used Ntpd to prevent clock-drift.
Other captured data
Other data captured for each edit also included (but not limited to):
User_id
User_group
Action
Approval_id
There were also other tables that fulfilled internal requirements (including automatically generated annotations for the documents) - as some of the profile editing was done using data from bots (built using NER/machine learning/AI), but with approval being required by one of the team before edits/updates could be published.
An action log was also kept of all user actions, so that in the event of an audit, one could look at the actions of an individual user - even when they didn't have the permissions to perform such an action, it was still logged.
With regard to migration, I don't see it as a big problem, as you can easily preserve the id sequences in moving/dumping/transferring data. Perhaps the only issue being if you needed to merge datasets. You could always write a migration script in that event - so from a personal perspective I consider that disadvantage somewhat diminished.
It might be worth looking at the Stack Overflow table structures for their data explorer (which is reasonably sophisticated). You can see the table structure here: https://data.stackexchange.com/stackoverflow/query/new, which comes from a question on meta: How does SO store revisions?
As a revision system, SO works well and the markdown/revision functionality is probably a good example to pick over.
Use Id. It's simple and works.
The only caveat is if you routinely add rows from a store-and-forward server, so that rows may be added later but should be treated as having been added earlier.
Or add another column whose sole purpose is to record the editing order. I suggest you do not use datetime for this.
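As a hedged illustration of the id approach (the revisions table name is assumed, since the question doesn't give one; mysql-connector-python likewise assumed):

import mysql.connector

db = mysql.connector.connect(host="localhost", user="app", password="...", database="docs")
cur = db.cursor()

def previous_revision(revision_id):
    # The row inserted immediately before the given one, by auto-incremented id.
    cur.execute(
        "SELECT id, user_id, text FROM revisions WHERE id < %s ORDER BY id DESC LIMIT 1",
        (revision_id,),
    )
    return cur.fetchone()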

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
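For illustration, converting between the hex form and the raw bytes in Python:

import hashlib

sha_hex = hashlib.sha1(b"example object").hexdigest()   # the textual hex form
sha_raw = bytes.fromhex(sha_hex)                        # the underlying 20 bytes

assert len(sha_raw) == 20
assert sha_raw.hex() == sha_hex   # round-trips cleanly; store sha_raw in a BINARY(20) column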
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
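A sketch of that first option; the DDL and the Python are only illustrative (assumes mysql-connector-python and raw 20-byte ids as suggested earlier):

# CREATE TABLE saved_searches (
#     search_id BIGINT UNSIGNED NOT NULL,
#     item_id   BINARY(20)      NOT NULL,
#     PRIMARY KEY (search_id, item_id)
# );
import mysql.connector

db = mysql.connector.connect(host="localhost", user="app", password="...", database="app")
cur = db.cursor()

def save_dataset(search_id, item_ids_hex):
    rows = sorted((search_id, bytes.fromhex(h)) for h in item_ids_hex)   # sorted insert helps the index
    cur.executemany("INSERT INTO saved_searches (search_id, item_id) VALUES (%s, %s)", rows)
    db.commit()

def expire_dataset(search_id):
    cur.execute("DELETE FROM saved_searches WHERE search_id = %s", (search_id,))
    db.commit()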
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

What's the best way to generate a unique number which has to follow certain rules?

Some background: In Germany (at least) invoice numbers have to follow certain rules:
They have to be ordered
They have to be continuous (may not have gaps)
For the past few months they have been allowed to contain characters. Some customers want to use that possibility; other customers don't know that or are afraid of it, and they insist on digit-only invoice numbers.
Additionally the customers don't want to start them at zero.
As I can think of many ways to generate such a number, I wonder: what's the best way to do this?
In order to avoid starting at 0 - just start at 10000. Forget about the zero-padding.
You have to consider when the number is going to be allocated.
If you allocate the number when the invoice is first opened for editing (for instance the number 10014 is allocated) and the user then cancels the invoice, you have a gap: someone else could already have begun to create an invoice with the id 10015, so you can't just roll back the number.
If you allocate the number when the invoice has been completely written and is being saved, then you avoid that scenario and you avoid having gaps, but you will not know which invoice number the invoice is going to have before it is saved.
Also, you need to make sure that it is threadsafe, so that two users can't create the same invoice number.
static object _invoiceNumberLock = new object();
public static string GetInvoiceNumber()
{
    lock (_invoiceNumberLock)
    {
        // Connect to the database and read MAX(invoicenumber) + 1
        // Increase the invoice number in the SQL database by one
        // Perhaps also add characters
        // Return the allocated number as a string
    }
}
Also consider backing up the uniqueness by having a UNIQUE INDEX on the invoicenumber column in the SQL database.
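A database-side sketch of the same idea that is also safe across several application servers, using a one-row counter table locked with SELECT ... FOR UPDATE. The table and column names (invoice_counter, invoices, invoicenumber) are assumptions, and mysql-connector-python is used for illustration:

import mysql.connector

db = mysql.connector.connect(host="localhost", user="app", password="...", database="billing")

def create_invoice(customer_id):
    cur = db.cursor()
    try:
        db.start_transaction()
        # invoice_counter holds a single row with the next free number (seeded at 10000).
        # FOR UPDATE locks it, so two writers cannot pick the same number.
        cur.execute("SELECT next_number FROM invoice_counter FOR UPDATE")
        next_number = cur.fetchone()[0]
        cur.execute("UPDATE invoice_counter SET next_number = next_number + 1")
        cur.execute(
            "INSERT INTO invoices (invoicenumber, customer_id) VALUES (%s, %s)",
            (next_number, customer_id),
        )
        db.commit()        # the number is only consumed if the insert commits
        return next_number
    except Exception:
        db.rollback()      # rolls back the counter too, so no gap is created
        raise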
My two ideas:
(without chars): a sequence, starting at 1, left-padded with zeroes (or beginning with another number and padded with zeroes) - example: 1000002554
(with chars): a sequence in hexadecimal base, padded with zeroes (possibly with a prefix) - example: AF00CAFE01
It would be cool to have invoice numbers in hex :)
For my invoices, they always consist of the last digit of the year followed by a 4-digit invoice number starting at 4096, so you would get a value like 85021. This was to easily handle billing my (up to) 15 clients twice a month.
It's irrelevant now since I raise only about 6 invoices a year (I went permanent and now only do small bits of work, mostly leaving the company for investment purposes) but this scheme always worked for me.
It stopped the clients from feeling that they were getting work done by a small company (no invoice number 1 or even 80001).
In terms of the sequential nature, the invoices were simply stored in files with the invoice number in the file name, so it was easy to create the next one. No database was required - it really depends on how big your shop is.
Some more ideas:
Pick an arbitrary sequence of bits to be the top of the number. This will work as a "magic key" so you can recognize that number out of context. For example, Newegg product IDs seem to start with "N82E" and UPS tracking numbers seem to start with "1Z".
If you can bend the "contiguous" rule, make the last bit be a parity bit, for error checking.
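A tiny sketch of the parity-bit idea, purely for illustration:

def with_parity_bit(n):
    # Append one bit so the combined number always has even bit-parity.
    parity = bin(n).count("1") & 1
    return (n << 1) | parity

def parity_ok(m):
    # A single flipped bit makes the overall parity odd.
    return bin(m).count("1") % 2 == 0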